[MarkLogic Dev General] xdmp:estimate() and fn:distinct-values()

Ron Hitchens ron at ronsoft.com
Sat Aug 4 11:52:34 PDT 2012


   Yes, I think I understood what you were trying to do, and that's
what the query I provided does (without having tested it, anyway).

   My query goes at it the other way around from how you're describing
it.  By starting with this:

     cts:element-values (xs:QName("userId"), ...

   We get distinct values of userId so we know there will never be any
duplicates.  The key thing to know is that the range index not only
contains the unique values of userId elements, but for each value
it also has a list of the fragments that contain that value.

   The fourth argument to cts:element-values is a cts:query.  All
cts:queries select fragments that match some set of criteria.  In
this case it's:

     cts:element-value-query (xs:QName("productId"), $myBooks, "exact")

   Which matches fragments that contain a productId element with a
value that exactly matches at least one of the values in the sequence
$myBooks.

   By passing this cts:query as an optional argument to cts:element-values,
you're asking for the intersection of the two lists: fragments that have
a value in the userId index, and that also have a productId that matches one
of the values of $myBooks.  Values of userId that don't occur in a fragment
with matching productId are not returned.

   Hey presto, distinct values of userId.  Count that list and you're done.
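
   Put together, and still untested, the whole thing might look like the
sketch below.  The values in $myBooks are just the sample productIds from
your message, and it assumes the userId range index is in place:

     xquery version "1.0-ml";

     (: sample product ids, taken from the example document :)
     let $myBooks := ("ABCDEFG", "TUVWXY")

     (: distinct userId values, restricted to fragments matching the query :)
     let $userIds :=
        cts:element-values (xs:QName("userId"), (), (),
           cts:element-value-query (xs:QName("productId"), $myBooks, "exact"))

     return (fn:count ($userIds), $userIds)

   The first item returned is the number of distinct users, followed by
the userId values themselves.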

   Like I said, you can do the same thing with date ranges or any other
sort of query (or queries) you need to do.  If all the values that your
queries need to test are resolvable from indexes, then evaluation is
very fast.
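
   For the unexpired case, a rough and untested sketch follows.  I'm
assuming here that endDate holds epoch seconds, that it has a range index
of type long (a guess on my part, use whatever type your index really
has), and that 0 means "never expires" as in your original predicate:

     xquery version "1.0-ml";

     let $myBooks := ("ABCDEFG", "TUVWXY")

     (: current time as epoch seconds, cast to the assumed index type :)
     let $now := xs:long (
        (fn:current-dateTime() - xs:dateTime ("1970-01-01T00:00:00Z"))
           div xs:dayTimeDuration ("PT1S"))

     (: endDate = 0 or endDate >= now :)
     let $unexpired := cts:or-query ((
        cts:element-range-query (xs:QName("endDate"), "=", xs:long (0)),
        cts:element-range-query (xs:QName("endDate"), ">=", $now)))

     (: ask for both conditions inside the same registration element :)
     let $query := cts:element-query (xs:QName("registration"),
        cts:and-query ((
           cts:element-value-query (xs:QName("productId"), $myBooks, "exact"),
           $unexpired)))

     return fn:count (cts:element-values (xs:QName("userId"), (), (), $query))

   One thing to check: a plain cts:and-query without the cts:element-query
wrapper matches any fragment that has a matching productId and an unexpired
endDate anywhere in it, not necessarily in the same registration.  As far
as I know the wrapper is also only as accurate as the indexes allow when
used this way, since nothing is filtered, so you may need the position
index settings enabled.  Verify the counts against a small known data set.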

   Long story short, you have the information you need to answer the
questions you're asking in the indexes.  You just need to phrase the
questions in a form that can make good use of those indexes.

   Note that many XPath expressions can be accelerated by the presence
of range indexes, but only if the XQuery evaluator can unambiguously
know that it's safe to use an index.  For example, if you have a
dateTime index for an element, but write a predicate that compares
that element to an xs:string or xs:untypedAtomic, then the XPath evaluator
may not use the index.  But if you cast to an xs:dateTime it might.
I personally recently (re)discovered that integer range indexes are xs:int,
not xs:integer.  Casting a predicate to xs:integer was slow; changing
the cast to xs:int made it fast.  Using the cts: library makes it much
more explicit as to how you expect the indexes to be used.
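
   To illustrate the difference, here is an untested sketch that assumes
userId has a range index of type int (yours may well be a string index,
in which case drop the casts):

     xquery version "1.0-ml";

     let $id := "12345"

     (: XPath form: the cast matches the assumed index type, so the
        evaluator may be able to use the range index :)
     let $viaXPath := /user[userId = xs:int ($id)]

     (: cts form: names the range index explicitly :)
     let $viaCts := cts:search (/user,
        cts:element-range-query (xs:QName("userId"), "=", xs:int ($id)))

     return (fn:count ($viaXPath), fn:count ($viaCts))

   Both counts should agree.  If you're unsure whether the XPath form is
being resolved from the index, looking at its xdmp:plan output should
show how the evaluator decided to run it.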

   Hope that's helpful.

On Aug 4, 2012, at 2:17 PM, Danny Sinang wrote:

> Hi Ron,
> 
> Thanks. 
> 
> I have element range indexes for userId and productId now, but I'm not sure I explained well what I needed.
> 
> I'm trying to get :
> 
> 1. all the unique userId's of users who have registrations for particular books. 
> 2. all the unique userId's of users whose book registrations have not expired yet.
> 
> Each user is represented like this :
> 
> <user>
>     <userId>12345</userId>
>     ...
>     <registeredBooks>
>                     <registration>
>                              <productId>ABCDEFG</productId>
>                              <startDate></startDate>
>                              <endDate></endDate>
>                              ...
>                     </registration>
> 
>                     <registration>
>                              <productId>TUVWXY</productId>
>                              <startDate></startDate>
>                              <endDate></endDate>
>                              ...
>                     </registration>
>     </registeredBooks>
> </user>
> 
> As you can see, a user can have more than 1 book registration, and he can also have more than 1 book registration for the same book (i.e. his previous registration expired and he bought some more time to read it again).
> 
> So given the above business rules, my queries (to give me all users who registered for specific books) can return the same userId more than once. That's why I need to get the distinct values of the userId's returned.
> 
> The queries I showed earlier (simplified versions of the actual query) work fast but don't eliminate the duplicate userId's returned by the queries.
> 
> Your suggested query returns the unique userId's in the index, but not the unique userId's returned by the query.
> 
> I'm pretty new to cts stuff so I'd really appreciate all the assistance I could get. First off, how do I express in cts the query /user[registeredBooks/registration/productId=$myBooks]/userId ? Next,  how do I get the distinct userId's returned ?
> 
> Regards,
> Danny
> 
> On Sat, Aug 4, 2012 at 7:46 AM, Ron Hitchens <ron at ronsoft.com> wrote:
> 
>    Put an element range index on both userId and productId.
> Then you can do (also untested):
> 
>     fn:count (cts:element-values (xs:QName("userId"), (), (),
>        cts:element-value-query (xs:QName("productId"), $myBooks, "exact")))
> 
>    This fn:count should be fast because it will only count the
> values in the range index (those that survive the filter that
> selects matching productId's, which can be resolved from the
> range index on productId).
> 
>    The slowdown comes when a query cannot answer the question
> you're asking from the indexes and has to look inside the documents
> to test the values.  Range indexes store the unique values in the
> index and correlate them back to the fragment those values occur in.
> 
>    Just be careful that you define the proper type when creating
> the element range indexes and that you provide the same collation
> if the indexes are strings.
> 
>    You may also get a boost from creating appropriate dateTime
> range indexes and applying similar filter queries for those.
> 
> On Aug 4, 2012, at 11:28 AM, David Lee wrote:
> 
> > Untested Suggestion.
> > Put userId into an element range index, then use estimate (cts:values())
> >
> >
> > -----------------------------------------------------------------------------
> > David Lee
> > Lead Engineer
> > MarkLogic Corporation
> > dlee at marklogic.com
> > Phone: +1 650-287-2531
> > Cell:  +1 812-630-7622
> > www.marklogic.com
> >
> >
> > From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Danny Sinang
> > Sent: Friday, August 03, 2012 10:38 PM
> > To: general
> > Subject: [MarkLogic Dev General] xdmp:estimate() and fn:distinct-values()
> >
> > Hello,
> >
> > The query below runs quite fast (i.e. below 1 second).
> >
> > let $totalCount := xdmp:estimate(/user[reg/productId=$myBooks]/userId)
> > let $numUnexpired := xdmp:estimate(/user[reg[productId=$myBooks and (endDate = 0 or endDate >= $current-epoch-time)]]/userId)
> > return ($totalCount, $numUnexpired, xdmp:elapsed-time())
> >
> > Problem is, what I really need is to get the number of distinct values of "userId".
> >
> > Doing xdmp:estimate(fn:distinct-values()) results in an XDMP:UNSEARCHABLE error.
> >
> > Using fn:count() instead of xdmp:estimate() works, but takes so long (i.e. 30 seconds).
> >
> > Is there a workaround for this ?
> >
> > Regards,
> > Danny
> >
> >
> > _______________________________________________
> > General mailing list
> > General at developer.marklogic.com
> > http://developer.marklogic.com/mailman/listinfo/general
> 
> ---
> Ron Hitchens {mailto:ron at ronsoft.com}   Ronsoft Technologies
>      +44 7879 358 212 (voice)          http://www.ronsoft.com
>      +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
> "No amount of belief establishes any fact." -Unknown
> 

---
Ron Hitchens {mailto:ron at ronsoft.com}   Ronsoft Technologies
     +44 7879 358 212 (voice)          http://www.ronsoft.com
     +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
"No amount of belief establishes any fact." -Unknown





