[MarkLogic Dev General] Type safe data and referencing questions
Danny Sokolsky
dsokolsky at marklogic.com
Thu Jan 17 17:32:21 PST 2008
Hi Mark,
It is true that it would take extra time to cast one or two million
times in a query. But it will take time to do anything that many times
in a query. The trick is to write the query in such a way that it
does this fast. Range indexes are a good tool for this, in combination
with the order by optimizations. For example, suppose you want to find
the 10 latest dates from an element named stringdate, such as:
<stringdate>2008-12-02</stringdate>
then you can write a query like the following:
(for $x in //stringdate order by xs:date($x) descending return $x)[1 to 10]
Without a range index, it will need to find all of the stringdates and
cast them all to dates in the order by clause. For a ballpark estimate,
on my laptop with 1,000,000 stringdate elements, this takes about 13
seconds. Not bad considering it has to order 1 million items.
Now if I add a date range index for this element, the same query takes
about 0.3 seconds, for a speedup of about 40x. That is because the
range index optimized the sort in the order by clause, and we just
returned the first 10 of them. For details about the order by
optimizations, see the Query Performance and Tuning book (
http://developer.marklogic.com/pubs/3.2/books/performance.pdf).
Another useful tool is the profile button in cq. It shows you where
your query is spending time processing.
My recommendation is to try some tests with range indexes and order by
optimizations and see how it works. It is quite easy to generate some
dummy data for these tests.
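A minimal sketch of generating such dummy data, assuming a hypothetical
/test/ URI prefix and a hypothetical <doc> wrapper element (neither is
from the original use case). It is written against XQuery 1.0; older
dialects may spell the duration type xdt:dayTimeDuration instead:

(: Insert 1,000 documents, each holding one stringdate element.   :)
(: The dates run forward one day at a time from 2008-01-01.       :)
for $i in 1 to 1000
let $date := xs:date("2008-01-01")
             + xs:dayTimeDuration(fn:concat("P", $i, "D"))
return
  xdmp:document-insert(
    fn:concat("/test/stringdate-", $i, ".xml"),
    <doc><stringdate>{$date}</stringdate></doc>)

Widen the range or run it in several batches to reach the volume you
want, then run the order by query above with and without a date range
index on stringdate and compare the timings.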
I'm not 100% sure I answered your question, but hopefully it will lead
you in the direction of what you are trying to accomplish.
-Danny
-----Original Message-----
From: general-bounces at developer.marklogic.com
[mailto:general-bounces at developer.marklogic.com] On Behalf Of Mark
Waschkowski
Sent: Thursday, January 17, 2008 11:38 AM
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] Type safe data and referencing
questions
OK great, thanks for the information Danny.
I'm a bit concerned about the type safety issue (#1), not because I'm
worried about the data being stored correctly, but because a
conversion might have to be carried out many, many times during an
evaluation. I may be repeating the question here, but do you have any
idea of how the above use case would work with 1M+ rows of data? It
seems to me that converting some date text 2M+ times (twice per record
in this case) would have an adverse effect on a query, no? Likewise for
converting when wanting to order a larger data set by date?
Really appreciate the feedback.
Mark
On Jan 14, 2008 8:12 PM, Danny Sokolsky <dsokolsky at marklogic.com> wrote:
> Hi Mark,
>
> I will take a stab at your questions.
>
> 1) You do not need a schema to use typed data. A schema will make it so
> Mark Logic treats an element or attribute as its defined type without an
> explicit cast, but you can always add an explicit cast (like the
> use-case example) to make sure XQuery treats a value as a certain type
> (with or without a schema). The schema just makes that a little easier.
> There might be some performance advantage to using a schema, but I don't
> think it will be that big. It is worth trying though, as this might
> depend somewhat on your content. The real performance advantage will
> come from creating range indexes on elements or attributes you will use
> in comparisons. Schemas can also help you ensure that your data is in
> the correct format when you load it, as Mark Logic will throw an
> exception if it cannot cast content in an element or attribute to the
> type specified in the schema.
>
> 2) You could put the referencing information in the properties document.
> The default conversion application in CPF does this, for example, to
> keep track of the original documents and various converted documents.
>
> 3) There are no foreign key constraints built in. I think any best
> practices would depend on what you are trying to do. Two approaches
> that tend to work well are to a) put the constraining items in the same
> document and/or b) use the properties document corresponding to a
> document to store information about what is in the document.
>
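> As a minimal sketch of approach (b), assuming hypothetical URIs
> /docs/a.xml, /docs/b.xml, and /docs/c.xml and a hypothetical property
> element named references, documents A and B could each record a
> pointer to document C in their properties documents:
>
> (: Store the URI of document C as a property of A and of B. :)
> for $uri in ("/docs/a.xml", "/docs/b.xml")
> return xdmp:document-set-property($uri,
>   <references>/docs/c.xml</references>)
>
> A later, separate query could then follow the pointer. Nothing
> enforces that the target exists, so the application has to check:
>
> (: Read the stored URI back and fetch the referenced document. :)
> for $ref in xdmp:document-properties("/docs/a.xml")//references
> return fn:doc(fn:string($ref))
>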
> -Danny
>
>
> -----Original Message-----
> From: general-bounces at developer.marklogic.com
> [mailto:general-bounces at developer.marklogic.com] On Behalf Of Mark
> Waschkowski
> Sent: Monday, January 14, 2008 1:25 PM
> To: general at developer.marklogic.com
> Subject: [MarkLogic Dev General] Type safe data and referencing
> questions
>
> Hi,
>
> I have been using MarkLogic for a while now and haven't seen answers to
> the questions below yet. Does anyone know of an answer or two?
>
> 1) Type safe data - I'm concerned with retrieval of typed data,
> especially for date information. The only way to store typed data is
> through the use of a schema, right? I can't specify the type of data on
> a per element basis, correct? i.e.
> <person><birthday xs:date>01-01-1970</birthday></person>
>
> As well, I noticed the below query in the use case examples:
>
> let $item := doc("items.xml")//item_tuple
>   [end_date >= xs:date("1999-03-01")
>    and
>    end_date <= xs:date("1999-03-31")]
> return
>   <item_count>
>   {
>     count($item)
>   }
>   </item_count>
>
> Is there a schema behind the loaded data or are the examples un-type
> safe? Should I just not worry about type safety and convert the data
> values to the type I need when querying? If so, won't that be a
> performance issue?
>
> 2) Referencing - what is the best practice approach (if there is one)
> for having documents reference one another?
> i.e. Document A and Document B should both refer to Document C
>
> 3) Foreign key constraints - is this supported at all in some fashion?
> If not, any approaches to suggest?
>
> Thanks in advance for any and all suggestions!
>
> Mark
> _______________________________________________
> General mailing list
> General at developer.marklogic.com
> http://xqzone.com/mailman/listinfo/general