[MarkLogic Dev General] Type safe data and referencing questions

Mark Waschkowski mwaschkowski at gmail.com
Thu Feb 14 14:15:27 PST 2008


Hi Danny,

One quick followup: I need to do a cast in the order by, but the cast fails
if the value is missing. What's the simplest way to do what I've done below?
It doesn't look ideal to me.

for $x in collection('Contacts')/*
order by  (for $x in $x where exists($x/age) return xs:int($x/age))
return $x

Thanks!

Mark

On Thu, Jan 17, 2008 at 8:32 PM, Danny Sokolsky <dsokolsky at marklogic.com>
wrote:

> Hi Mark,
>
> It is true that it would take extra time to cast one or two million
> times in a query.  But it will take time to do anything that many times
> in a query.  The trick is to write the query in such a way that it
> does this fast.   Range indexes are a good tool for this, in combination
> with the order by optimizations.  For example, suppose you want to find
> the 10 latest dates from an element named stringdate, such as:
>
> <stringdate>2008-12-02</stringdate>
>
> then you can write a query like the following:
>
> (for $x in //stringdate order by xs:date($x) descending return $x)[1 to 10]
>
> Without a range index, it will need to find all of the stringdates and
> cast them all to dates in the order by clause.  For a ballpark estimate,
> on my laptop with 1,000,000 stringdate elements, this takes about 13
> seconds.  Not bad considering it has to order 1 million items.
>
> Now if I add a date range index for this element, the same query takes
> about 0.3 seconds, for a speedup of about 40x.  That is because the
> range index optimized the sort in the order by clause, and we just
> returned the first 10 of them.  For details about the order by
> optimizations, see the Query Performance and Tuning book (
> http://developer.marklogic.com/pubs/3.2/books/performance.pdf).
>
> Another useful tool is the profile button in cq.  It shows you where
> your query is spending time processing.
>
> My recommendation is to try some tests with range indexes and order by
> optimizations and see how it works.  It is quite easy to generate some
> dummy data for these tests.
>
> I'm not 100% sure I answered your question, but hopefully it will lead
> you in the direction of what you are trying to accomplish.
>
> -Danny
>
> -----Original Message-----
> From: general-bounces at developer.marklogic.com
> [mailto:general-bounces at developer.marklogic.com] On Behalf Of Mark
> Waschkowski
> Sent: Thursday, January 17, 2008 11:38 AM
> To: General Mark Logic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Type safe data and referencing
> questions
>
> OK great, thanks for the information Danny.
>
> I'm a bit concerned about the type safety issue (#1) not because I'm
> worried about the data being stored correctly, but because a
> conversion might have to be carried out many, many times during an
> evaluation. I may be repeating the question here, but do you have any
> idea of how the above use case would work with 1M+ rows of data? Seems
> to me that converting some date text 2M+ times (twice per record in
> this case) would have an adverse effect on a query, no? Likewise
> converting when wanting to order a larger data set by date?
>
> Really appreciate the feedback.
>
> Mark
>
> On Jan 14, 2008 8:12 PM, Danny Sokolsky <dsokolsky at marklogic.com> wrote:
> > Hi Mark,
> >
> > I will take a stab at your questions.
> >
> > 1) You do not need a schema to use typed data.  A schema will make it
> > so Mark Logic treats an element or attribute as its defined type
> > without an explicit cast, but you can always add an explicit cast
> > (like the use-case example) to make sure XQuery treats a value as a
> > certain type (with or without a schema).  The schema just makes that
> > a little easier.  There might be some performance advantage to using
> > a schema, but I don't think it will be that big.  It is worth trying
> > though, as this might depend somewhat on your content.  The real
> > performance advantage will come from creating range indexes on
> > elements or attributes you will use in comparisons.  Schemas can also
> > help you ensure that your data is in the correct format when you load
> > it, as Mark Logic will throw an exception if it cannot cast content
> > in an element or attribute to the type specified in the schema.
> >
> > 2) You could put the referencing information in the properties
> > document.  The default conversion application in CPF does this, for
> > example, to keep track of the original documents and various
> > converted documents.
> >
> > 3) There are no foreign key constraints built in.  I think any best
> > practices would depend on what you are trying to do.  Two approaches
> > that tend to work well are to a) put the constraining items in the
> > same document and/or b) use the properties document corresponding to
> > a document to store information about what is in the document.
> >
> > -Danny
> >
> >
> > -----Original Message-----
> > From: general-bounces at developer.marklogic.com
> > [mailto:general-bounces at developer.marklogic.com] On Behalf Of Mark
> > Waschkowski
> > Sent: Monday, January 14, 2008 1:25 PM
> > To: general at developer.marklogic.com
> > Subject: [MarkLogic Dev General] Type safe data and referencing
> > questions
> >
> > Hi,
> >
> > Have been using Marklogic for a while now and haven't seen answers to
> > the below questions yet, anyone know of an answer or two?
> >
> > 1) Type safe data -  I'm concerned with retrieval of typed data,
> > especially for date information. The only way to store typed data is
> > through the use of a schema right? I can't specify the type of data on
> > a per element basis, correct? ie. <person> <birthday
> > xs:date>01-01-1970</birthday></person>
> >
> > As well, I noticed the below query in the use case examples:
> >
> >  let $item := doc("items.xml")//item_tuple
> >               [end_date >= xs:date("1999-03-01")
> >                and
> >                end_date <= xs:date("1999-03-31")]
> >  return
> >  <item_count>
> >  {
> >    count($item)
> >  }
> >  </item_count>
> >
> > Is there a schema behind the loaded data or are the examples un-type
> > safe? Should I just not worry about type safety and convert the data
> > values to the type I need when querying? If so, won't that be a
> > performance issue?
> >
> > 2) Referencing - what is the (if there is one) best practice approach
> > to reference documents together?
> > ie. Document A and Document B should both refer to Document C
> >
> > 3) Foreign key constraints - is this supported at all in some fashion?
> > If not, any approaches to suggest?
> >
> > Thanks in advance for any and all suggestions!
> >
> > Mark
> > _______________________________________________
> > General mailing list
> > General at developer.marklogic.com
> > http://xqzone.com/mailman/listinfo/general
> >
>