[MarkLogic Dev General] Type safe data and referencing questions

Mark Waschkowski mwaschkowski at gmail.com
Fri Jan 18 08:12:58 PST 2008


Hi Danny,

OK great. Yes, that answers my question very well, thank you. I was
asking more about the theory, but you took the extra step to try out a
real-world scenario, which is very helpful. I had actually asked a
couple of these questions in Dec and didn't receive a response, so I was
beginning to lose faith that this board was going to be helpful, but
with feedback like yours, it definitely is.

I find it interesting that the raw type of the stored data isn't that
important when doing the optimization (i.e. with an index). I will be
reading up on the performance tuning capabilities once all my core
work is done, and I expect that will provide a lot of insight, but at
this point I needed to know a bit of the theory behind data storage
and optimization so I can start storing data in the proper format and
not worry about having to convert it all at a later date due to
optimization requirements.

Furthermore, at this point I'm just going to store data in string
format and convert it when doing reporting, which I previously wasn't
sure was going to be a good approach, but it certainly seems to work
(with proper indexing). Quite a bit different from how I would be
thinking with regard to an RDBMS.
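
A rough sketch of that approach, assuming hypothetical `order` and
`orderdate` elements (these names are made up for illustration): the
stored value stays a plain string and is cast to xs:date only when the
report filters and sorts on it.

```xquery
(: Hypothetical data: <order><orderdate>2008-01-15</orderdate></order> :)
(: The stored value is a string; xs:date() casts it at query time.     :)
for $o in //order
let $d := xs:date($o/orderdate)
where $d ge xs:date("2008-01-01") and $d le xs:date("2008-01-31")
order by $d descending
return $o
```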

Thanks again!

Mark


On Jan 17, 2008 8:32 PM, Danny Sokolsky <dsokolsky at marklogic.com> wrote:
> Hi Mark,
>
> It is true that it would take extra time to cast one or two million
> times in a query.  But it will take time to do anything that many times
> in a query.  The trick is to write the query in such a way that it
> does this fast.  Range indexes are a good tool for this, in combination
> with the order by optimizations.  For example, suppose you want to find
> the 10 latest dates from an element named stringdate, such as:
>
> <stringdate>2008-12-02</stringdate>
>
> then you can write a query like the following:
>
> (for $x in //stringdate
>  order by xs:date($x) descending
>  return $x)[1 to 10]
>
> Without a range index, it will need to find all of the stringdates and
> cast them all to dates in the order by clause.  For a ballpark estimate,
> on my laptop with 1,000,000 stringdate elements, this takes about 13
> seconds.  Not bad considering it has to order 1 million items.
>
> Now if I add a date range index for this element, the same query takes
> about 0.3 seconds, for a speedup of about 40x.  That is because the
> range index optimized the sort in the order by clause, and we just
> returned the first 10 of them.  For details about the order by
> optimizations, see the Query Performance and Tuning book (
> http://developer.marklogic.com/pubs/3.2/books/performance.pdf).
>
> Another useful tool is the profile button in cq.  It shows you where
> your query is spending time processing.
>
> My recommendation is to try some tests with range indexes and order by
> optimizations and see how it works.  It is quite easy to generate some
> dummy data for these tests.
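>
> As a rough sketch, a query along these lines could generate such test
> data.  The /test/ URIs, the count of 1000 and the 2008 dates are
> arbitrary choices for illustration; xdmp:document-insert and
> xdmp:random are MarkLogic built-ins, but check their signatures
> against the documentation for your release.
>
> ```xquery
> (: Insert 1000 one-element documents, each with a pseudo-random :)
> (: date in 2008 stored as plain text.                           :)
> for $i in 1 to 1000
> let $m := xdmp:random(11) + 1   (: month 1..12 :)
> let $d := xdmp:random(27) + 1   (: day 1..28, valid in any month :)
> let $date := concat(
>   "2008-",
>   if ($m lt 10) then "0" else "", string($m), "-",
>   if ($d lt 10) then "0" else "", string($d))
> return
>   xdmp:document-insert(
>     concat("/test/stringdate-", string($i), ".xml"),
>     <stringdate>{ $date }</stringdate>)
> ```
>
> For a larger test set, the same loop could be run several times with
> different URI prefixes rather than in one very large request.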
>
> I'm not 100% sure I answered your question, but hopefully it will lead
> you in the direction of what you are trying to accomplish.
>
> -Danny
>
> -----Original Message-----
> From: general-bounces at developer.marklogic.com
> [mailto:general-bounces at developer.marklogic.com] On Behalf Of Mark
> Waschkowski
>
> Sent: Thursday, January 17, 2008 11:38 AM
> To: General Mark Logic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Type safe data and referencing
> questions
>
> OK great, thanks for the information Danny.
>
> I'm a bit concerned about the type safety issue (#1) not because I'm
> worried about the data being stored correctly, but because a
> conversion might have to be carried out many, many times during an
> evaluation. I may be repeating the question here, but do you have any
> idea of how the above use case would work with 1M+ rows of data? Seems
> to me that converting some date text 2M+ times (twice per record in
> this case) would have an adverse effect on a query, no? Likewise
> converting when wanting to order a larger data set by date?
>
> Really appreciate the feedback.
>
> Mark
>
> On Jan 14, 2008 8:12 PM, Danny Sokolsky <dsokolsky at marklogic.com> wrote:
> > Hi Mark,
> >
> > I will take a stab at your questions.
> >
> > 1) You do not need a schema to use typed data.  A schema will make it
> > so Mark Logic treats an element or attribute as its defined type
> > without an explicit cast, but you can always add an explicit cast
> > (like the use-case example) to make sure XQuery treats a value as a
> > certain type (with or without a schema).  The schema just makes that
> > a little easier.  There might be some performance advantage to using
> > a schema, but I don't think it will be that big.  It is worth trying
> > though, as this might depend somewhat on your content.  The real
> > performance advantage will come from creating range indexes on
> > elements or attributes you will use in comparisons.  Schemas can
> > also help you ensure that your data is in the correct format when
> > you load it, as Mark Logic will throw an exception if it cannot cast
> > content in an element or attribute to the type specified in the
> > schema.
> >
> > 2) You could put the referencing information in the properties
> > document.  The default conversion application in CPF does this, for
> > example, to keep track of the original documents and various
> > converted documents.
> >
> > 3) There are no foreign key constraints built in.  I think any best
> > practices would depend on what you are trying to do.  Two approaches
> > that tend to work well are to a) put the constraining items in the
> > same document and/or b) use the properties document corresponding to
> > a document to store information about what is in the document.
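> >
> > A minimal sketch of the properties approach, assuming documents
> > /docs/a.xml and /docs/b.xml already exist and should both point to
> > /docs/c.xml.  The URIs and the refers-to element name are made up
> > for illustration; xdmp:document-add-properties and
> > xdmp:document-properties are MarkLogic built-ins, but check their
> > signatures against the documentation for your release.
> >
> > ```xquery
> > (: Record a pointer to document C in the properties of A and B. :)
> > (: Nothing enforces that /docs/c.xml exists; this is only a     :)
> > (: convention, not a foreign key constraint.                    :)
> > for $uri in ("/docs/a.xml", "/docs/b.xml")
> > return
> >   xdmp:document-add-properties(
> >     $uri,
> >     <refers-to>/docs/c.xml</refers-to>)
> >
> > (: In a later, separate query the pointer could be followed:    :)
> > (:   doc(string(                                                :)
> > (:     xdmp:document-properties("/docs/a.xml")//refers-to))     :)
> > ```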
> >
> > -Danny
> >
> >
> > -----Original Message-----
> > From: general-bounces at developer.marklogic.com
> > [mailto:general-bounces at developer.marklogic.com] On Behalf Of Mark
> > Waschkowski
> > Sent: Monday, January 14, 2008 1:25 PM
> > To: general at developer.marklogic.com
> > Subject: [MarkLogic Dev General] Type safe data and referencing
> > questions
> >
> > Hi,
> >
> > Have been using Marklogic for a while now and haven't seen answers to
> > the below questions yet, anyone know of an answer or two?
> >
> > 1) Type safe data -  I'm concerned with retrieval of typed data,
> > especially for date information. The only way to store typed data is
> > through the use of a schema, right? I can't specify the type of data
> > on a per-element basis, correct? i.e.
> > <person> <birthday xs:date>01-01-1970</birthday></person>
> >
> > As well, I noticed the below query in the use case examples:
> >
> >  let $item := doc("items.xml")//item_tuple
> >               [end_date >= xs:date("1999-03-01")
> >                and
> >                end_date <= xs:date("1999-03-31")]
> >  return
> >  <item_count>
> >  {
> >    count($item)
> >  }
> >  </item_count>
> >
> > Is there a schema behind the loaded data or are the examples un-type
> > safe? Should I just not worry about type safety and convert the data
> > values to the type I need when querying? If so, won't that be a
> > performance issue?
> >
> > 2) Referencing - what is the (if there is one) best practice approach
> > to reference documents together?
> > ie. Document A and Document B should both refer to Document C
> >
> > 3) Foreign key constraints - is this supported at all in some fashion?
> > If not, any approaches to suggest?
> >
> > Thanks in advance for any and all suggestions!
> >
> > Mark
> > _______________________________________________
> > General mailing list
> > General at developer.marklogic.com
> > http://xqzone.com/mailman/listinfo/general
