[MarkLogic Dev General] xdmp:estimate() and fn:distinct-values()

Danny Sinang d.sinang at gmail.com
Mon Aug 6 11:47:29 PDT 2012


Hi Ron,

The count I'm getting is 50% more than what I get by using the slow
fn:count() approach.

It could be because productId is used elsewhere other than in
/user/registeredBooks/registration/productId .

How do I further limit the query to return fragments containing productId
within the structure  /user/registeredBooks/registration/productId  ?

Regards,
Danny
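
One possible way to express that constraint, as an untested sketch: nest cts:element-query calls so that the productId match must occur inside registration, inside registeredBooks, inside user. The values bound to $myBooks below are hypothetical samples taken from the example document later in the thread. Note that cts:element-query expresses containment (the inner match may be any descendant), not a strict parent/child path, and that lexicon functions evaluate the query unfiltered, so the accuracy of the count depends on the index settings of the database.

```xquery
xquery version "1.0-ml";

(: Hypothetical sample values; in the thread $myBooks is supplied by the caller. :)
let $myBooks := ("ABCDEFG", "TUVWXY")

(: Require the productId match to sit inside
   user > registeredBooks > registration (containment, not an exact path). :)
let $scoped-query :=
  cts:element-query(xs:QName("user"),
    cts:element-query(xs:QName("registeredBooks"),
      cts:element-query(xs:QName("registration"),
        cts:element-value-query(xs:QName("productId"), $myBooks, "exact"))))

(: Distinct userId values from the range index, limited to fragments
   selected by the scoped query. Assumes an element range index on userId. :)
return
  fn:count(cts:element-values(xs:QName("userId"), (), (), $scoped-query))
```

If productId also appears in other document types, this scoping should exclude those fragments; whether it fully explains the 50% difference is not something the sketch can confirm.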

On Sat, Aug 4, 2012 at 2:52 PM, Ron Hitchens <ron at ronsoft.com> wrote:

>
>    Yes, I think I understood what you were trying to do, and that's
> what the query I provided does (without having tested it, anyway).
>
>    My query goes at it the other way around from how you're describing
> it.  By starting with this:
>
>      cts:element-values (xs:QName("userId"), ...
>
>    We get distinct values of userId so we know there will never be any
> duplicates.  The key thing to know is that the range index not only
> contains the unique values of userId elements, but for each value
> it also has a list of the fragments that contain that value.
>
>    The fourth argument to cts:element-values is a cts:query.  All
> cts:queries select fragments that match some set of criteria.  In
> this case it's:
>
>      cts:element-value-query (xs:QName("productId"), $myBooks, "exact")
>
>    Which matches fragments that contain a productId element with a
> value that exactly matches at least one of the values in the sequence
> $myBooks.
>
>    By passing this cts:query as an optional argument to cts:element-values,
> you're asking for the intersection of the two lists: fragments that have
> a value in the userId index, and that also have a productId that matches
> one of the values of $myBooks.  Values of userId that don't occur in a
> fragment with matching productId are not returned.
>
>    Hey presto, distinct values of userId.  Count that list and you're done.
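
Putting the pieces described above together, a minimal sketch might look like the following. It is untested, the values bound to $myBooks are hypothetical, and it assumes element range indexes exist on userId and productId as discussed in the thread.

```xquery
xquery version "1.0-ml";

(: Hypothetical sample values for illustration only. :)
let $myBooks := ("ABCDEFG", "TUVWXY")

(: The fragment-selecting query: fragments that contain a productId
   element whose value exactly matches one of $myBooks. :)
let $book-query :=
  cts:element-value-query(xs:QName("productId"), $myBooks, "exact")

(: Distinct userId values from the range index, restricted to the
   fragments selected by $book-query (passed as the fourth argument). :)
let $user-ids :=
  cts:element-values(xs:QName("userId"), (), (), $book-query)

return fn:count($user-ids)
```

Binding the cts:query to its own variable keeps the "which fragments" part separate from the "which values" part, which makes it easier to swap in a date-range or combined query later.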
>
>    Like I said, you can do the same thing with date ranges or any other
> sort of query (or queries) you need to do.  If all the values that your
> queries need to test are resolvable from indexes, then evaluation is
> very fast.
>
>    Long story short, you have the information you need to answer the
> questions you're asking in the indexes.  You just need to phrase the
> questions in a form that can make good use of those indexes.
>
>    Note that many XPath expressions can be accelerated by the presence
> of range indexes, but only if the XQuery evaluator can unambiguously
> know that it's safe to use an index.  For example, if you have a
> dateTime index for an element, but write a predicate that compares
> that element to an xs:string or xs:untypedAtomic, then the XPath evaluator
> may not use the index.  But if you cast to an xs:dateTime it might.
> I personally recently (re)discovered that integer range indexes are xs:int,
> not xs:integer.  Casting a predicate to xs:integer was slow, changing
> the cast to xs:int made it fast.  Using the cts: library makes it much
> more explicit as to how you expect the indexes to be used.
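
As an illustration of the typing point above, here is a hedged sketch that uses the cts: library so the operator and the typed comparison value are spelled out. It assumes an element range index of scalar type int on endDate (the thread does not say which type is actually configured) and uses a hypothetical epoch value; without a matching range index the range query raises an error instead of silently falling back to a slow path.

```xquery
xquery version "1.0-ml";

(: Hypothetical "current time" in epoch seconds, typed as xs:int so it
   matches the assumed int range index on endDate. :)
let $now := xs:int(1344278849)

(: Fragments whose endDate is at or after $now, resolved from the
   range index rather than by opening documents. :)
let $unexpired :=
  cts:element-range-query(xs:QName("endDate"), ">=", $now)

(: xdmp:estimate counts matching fragments using indexes only. :)
return xdmp:estimate(cts:search(fn:doc(), $unexpired))
```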
>
>    Hope that's helpful.
>
> On Aug 4, 2012, at 2:17 PM, Danny Sinang wrote:
>
> > Hi Ron,
> >
> > Thanks.
> >
> > I have element range indexes for userId and productId now, but I'm not
> > sure I explained well what I needed.
> >
> > I'm trying to get:
> >
> > 1. all the unique userId's of users who have registrations for
> >    particular books.
> > 2. all the unique userId's of users whose book registrations have not
> >    expired yet.
> >
> > Each user is represented like this :
> >
> > <user>
> >     <userId>12345</userId>
> >     ...
> >     <registeredBooks>
> >                     <registration>
> >                              <productId>ABCDEFG</productId>
> >                              <startDate></startDate>
> >                              <endDate></endDate>
> >                              ...
> >                     </registration>
> >
> >                     <registration>
> >                              <productId>TUVWXY</productId>
> >                              <startDate></startDate>
> >                              <endDate></endDate>
> >                              ...
> >                     </registration>
> >     </registeredBooks>
> > </user>
> >
> > As you can see, a user can have more than 1 book registration, and he
> > can also have more than 1 book registration for the same book (i.e. his
> > previous registration expired and he bought some more time to read it
> > again).
> >
> > So given the above business rules, my queries (to give me all users who
> > registered for specific books) can return the same userId more than once.
> > That's why I need to get the distinct values of the userId's returned.
> >
> > The queries I showed earlier (simplified versions of the actual query)
> > work fast but don't eliminate the duplicate userId's returned by the
> > queries.
> >
> > Your suggested query returns the unique userId's in the index, but not
> > the unique userId's returned by the query.
> >
> > I'm pretty new to cts stuff so I'd really appreciate all the assistance
> > I could get. First off, how do I express in cts the query
> > /user[registeredBooks/registration/productId=$myBooks]/userId ? Next, how
> > do I get the distinct userId's returned ?
> >
> > Regards,
> > Danny
> >
> > On Sat, Aug 4, 2012 at 7:46 AM, Ron Hitchens <ron at ronsoft.com> wrote:
> >
> >    Put an element range index on both userId and productId.
> > Then you can do (also untested):
> >
> >     fn:count (cts:element-values (xs:QName("userId"), (), (),
> >        cts:element-value-query (xs:QName("productId"), $myBooks, "exact")))
> >
> >    This fn:count should be fast because it will only count the
> > values in the range index (those that survive the filter that
> > selects matching productId's, which can be resolved from the
> > range index on productId).
> >
> >    The slowdown comes when a query cannot answer the question
> > you're asking from the indexes and has to look inside the documents
> > to test the values.  Range indexes store the unique values in the
> > index and correlate them back to the fragment those values occur in.
> >
> >    Just be careful that you define the proper type when creating
> > the element range indexes and that you provide the same collation
> > if the indexes are strings.
> >
> >    You may also get a boost from creating appropriate dateTime
> > range indexes and applying similar filter queries for those.
> >
> > On Aug 4, 2012, at 11:28 AM, David Lee wrote:
> >
> > > Untested Suggestion.
> > > Put userId into an element range index then use   estimate (cts:values())
> > >
> > >
> > >
> > > -----------------------------------------------------------------------------
> > > David Lee
> > > Lead Engineer
> > > MarkLogic Corporation
> > > dlee at marklogic.com
> > > Phone: +1 650-287-2531
> > > Cell:  +1 812-630-7622
> > > www.marklogic.com
> > >
> > >
> > > From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Danny Sinang
> > > Sent: Friday, August 03, 2012 10:38 PM
> > > To: general
> > > Subject: [MarkLogic Dev General] xdmp:estimate() and fn:distinct-values()
> > >
> > > Hello,
> > >
> > > The query below runs quite fast (i.e. below 1 second).
> > >
> > > let $totalCount := xdmp:estimate(/user[reg/productId=$myBooks]/userId)
> > > let $numUnexpired := xdmp:estimate(/user[reg[productId=$myBooks and
> > >     (endDate = 0 or endDate >= $current-epoch-time)]]/userId)
> > > return ($totalCount, $numUnexpired, xdmp:elapsed-time())
> > >
> > > Problem is, what I really need is to get the number of distinct values
> > > of "userId".
> > >
> > > Doing xdmp:estimate(fn:distinct-values()) results in an
> > > XDMP:UNSEARCHABLE error.
> > >
> > > Using fn:count() instead of xdmp:estimate() works, but takes so long
> > > (i.e. 30 seconds).
> > >
> > > Is there a workaround for this ?
> > >
> > > Regards,
> > > Danny
> > >
> > >
> > > _______________________________________________
> > > General mailing list
> > > General at developer.marklogic.com
> > > http://developer.marklogic.com/mailman/listinfo/general
> >
> > ---
> > Ron Hitchens {mailto:ron at ronsoft.com}   Ronsoft Technologies
> >      +44 7879 358 212 (voice)          http://www.ronsoft.com
> >      +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
> > "No amount of belief establishes any fact." -Unknown
> >
> >
> >
> >
>