[MarkLogic Dev General] xdmp:estimate() and fn:distinct-values()

Danny Sokolsky Danny.Sokolsky at marklogic.com
Wed Aug 8 16:23:28 PDT 2012


Make sure word positions, element word positions, and element value positions are enabled in the database for Ron's element-query suggestion.  This will likely improve the index resolution accuracy of those queries.
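
If you would rather script it than click through the Admin UI, a minimal sketch using the Admin API might look like the following. The database name "Documents" is only a placeholder, and changing these settings will kick off a reindex if reindexing is enabled, so try it on a test database first.

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

(: "Documents" is a placeholder database name; substitute your own. :)
let $db := xdmp:database("Documents")
let $config := admin:get-configuration()
(: Enable the three position settings mentioned above. :)
let $config := admin:database-set-word-positions($config, $db, fn:true())
let $config := admin:database-set-element-word-positions($config, $db, fn:true())
let $config := admin:database-set-element-value-positions($config, $db, fn:true())
return admin:save-configuration($config)
```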

-Danny

-----Original Message-----
From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Ron Hitchens
Sent: Wednesday, August 08, 2012 2:12 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] xdmp:estimate() and fn:distinct-values()


   MarkLogic has been talking about it publicly for months,
but they don't make official announcements until the official
release.  I've heard it said that ML6 will be out within a few
months, but that's also unofficial.

   In the meantime, you could perhaps try a filtered search,
which should eliminate the false positives.  If the candidate
set of results derived from the indexes is relatively small,
it may still perform pretty well.  Study up on the cts: functions;
there's lots of good stuff in there.
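
   A rough sketch of what that filtered search could look like
(untested; the $myBooks values are placeholders borrowed from the
sample document further down the thread):

```xquery
xquery version "1.0-ml";

(: Placeholder product IDs; substitute the real list. :)
let $myBooks := ("ABCDEFG", "TUVWXY")
let $query :=
  cts:element-query(xs:QName("registeredBooks"),
    cts:element-value-query(xs:QName("productId"), $myBooks, "exact"))
(: "filtered" is the default for cts:search; it is spelled out here to
   show that each candidate fragment is checked against the query. :)
let $users := cts:search(/user, $query, "filtered")
return fn:count(fn:distinct-values($users/userId))
```

   The filtering step has to open every candidate document, so the
cost grows with the number of matches rather than staying index-only.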

On Aug 8, 2012, at 8:44 PM, Danny Sinang wrote:

> Hi Ron,
> 
> Yep, am getting some false positives.
> 
> Is there somewhere I can go to know what to expect in ML 6 ?
> 
> Regards,
> Danny
> 
> On Mon, Aug 6, 2012 at 4:21 PM, Ron Hitchens <ron at ronsoft.com> wrote:
> 
>    What you really want is an XPath range index, but you'll
> have to wait for MarkLogic 6 for that.
> 
>    In the meantime, assuming all the productId elements you
> care about are descendants of registeredBooks, you can place
> the element value query for productId under an element query
> on registeredBooks.
> 
>    Change the fourth argument of the cts:element-values call
> to be this (again, untested):
> 
>       cts:element-query (xs:QName("registeredBooks"),
>           cts:element-value-query (xs:QName("productId"), $myBooks, "exact"))
> 
>    There is still a chance of false positives here because the
> query passed to cts:element-values is always resolved unfiltered,
> from the indexes alone.  But I think this will work better for
> your use case.
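>
>    Put together, the whole expression would look roughly like the
> sketch below (still untested; the $myBooks values are placeholders
> taken from the sample document):
>
> ```xquery
> xquery version "1.0-ml";
>
> (: Placeholder product IDs; substitute the real list. :)
> let $myBooks := ("ABCDEFG", "TUVWXY")
> return
>   fn:count(
>     cts:element-values(xs:QName("userId"), (), (),
>       (: productId must occur somewhere inside registeredBooks :)
>       cts:element-query(xs:QName("registeredBooks"),
>         cts:element-value-query(xs:QName("productId"), $myBooks, "exact"))))
> ```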
> 
> On Aug 6, 2012, at 7:47 PM, Danny Sinang wrote:
> 
> > Hi Ron,
> >
> > The count I'm getting is 50% more than what I get by using the slow fn:count() approach.
> >
> > It could be because productId is used elsewhere other than in /user/registeredBooks/registration/productId .
> >
> > How do I further limit the query to return fragments containing productId within the structure  /user/registeredBooks/registration/productId  ?
> >
> > Regards,
> > Danny
> >
> > On Sat, Aug 4, 2012 at 2:52 PM, Ron Hitchens <ron at ronsoft.com> wrote:
> >
> >    Yes, I think I understood what you were trying to do, and that's
> > what the query I provided does (without having tested it, anyway).
> >
> >    My query goes at it the other way around from how you're describing
> > it.  By starting with this:
> >
> >      cts:element-values (xs:QName("userId"), ...
> >
> >    We get distinct values of userId so we know there will never be any
> > duplicates.  The key thing to know is that the range index not only
> > contains the unique values of userId elements, but for each value
> > it also has a list of the fragments that contain that value.
> >
> >    The fourth argument to cts:element-values is a cts:query.  All
> > cts:queries select fragments that match some set of criteria.  In
> > this case it's:
> >
> >      cts:element-value-query (xs:QName("productId"), $myBooks, "exact")
> >
> >    Which matches fragments that contain a productId element with a
> > value that exactly matches at least one of the values in the sequence
> > $myBooks.
> >
> >    By passing this cts:query as an optional argument to cts:element-values,
> > you're asking for the intersection of the two lists: fragments that have
> > a value in the userId index, and that also have a productId that matches one
> > of the values of $myBooks.  Values of userId that don't occur in a fragment
> > with matching productId are not returned.
> >
> >    Hey presto, distinct values of userId.  Count that list and you're done.
> >
> >    Like I said, you can do the same thing with date ranges or any other
> > sort of query (or queries) you need to do.  If all the values that your
> > queries need to test are resolvable from indexes, then evaluation is
> > very fast.
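> >
> >    As an untested sketch of the date case: this assumes endDate
> > holds a numeric epoch time (as in your original query) and has an
> > element range index of type long, which is my assumption, not
> > something I know about your database.  The $now value is a
> > placeholder.
> >
> > ```xquery
> > xquery version "1.0-ml";
> >
> > let $myBooks := ("ABCDEFG", "TUVWXY")
> > (: Placeholder epoch time, cast to match the assumed index type. :)
> > let $now := xs:long(1344400000)
> > (: Unexpired means endDate is 0 or endDate is not yet in the past. :)
> > let $unexpired :=
> >   cts:or-query((
> >     cts:element-range-query(xs:QName("endDate"), "=", xs:long(0)),
> >     cts:element-range-query(xs:QName("endDate"), ">=", $now)))
> > let $query :=
> >   cts:and-query((
> >     cts:element-value-query(xs:QName("productId"), $myBooks, "exact"),
> >     $unexpired))
> > return fn:count(cts:element-values(xs:QName("userId"), (), (), $query))
> > ```
> >
> >    Note that this only requires both conditions to hold somewhere
> > in the same fragment, not within the same registration element, so
> > it can over-count.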
> >
> >    Long story short, you have the information you need to answer the
> > questions you're asking in the indexes.  You just need to phrase the
> > questions in a form that can make good use of those indexes.
> >
> >    Note that many XPath expressions can be accelerated by the presence
> > of range indexes, but only if the XQuery evaluator can unambiguously
> > know that it's safe to use an index.  For example, if you have a
> > dateTime index for an element, but write a predicate that compares
> > that element to an xs:string or xs:untypedAtomic, then the XPath evaluator
> > may not use the index.  But if you cast to an xs:dateTime it might.
> > I personally recently (re)discovered that integer range indexes are xs:int,
> > not xs:integer.  Casting a predicate to xs:integer was slow, changing
> > the cast to xs:int made it fast.  Using the cts: library makes it much
> > more explicit as to how you expect the indexes to be used.
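> >
> >    One way to see the difference is to compare what xdmp:plan
> > reports for the two forms.  This sketch again assumes a range index
> > of type long on endDate, and the literal is a placeholder; what the
> > plans actually show will depend on your index settings.
> >
> > ```xquery
> > xquery version "1.0-ml";
> >
> > (
> >   (: Compared against a string: the range index may not be used. :)
> >   xdmp:plan(/user[registeredBooks/registration/endDate >= "1344400000"]),
> >   (: Cast to the type of the assumed range index. :)
> >   xdmp:plan(/user[registeredBooks/registration/endDate >= xs:long(1344400000)])
> > )
> > ```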
> >
> >    Hope that's helpful.
> >
> > On Aug 4, 2012, at 2:17 PM, Danny Sinang wrote:
> >
> > > Hi Ron,
> > >
> > > Thanks.
> > >
> > > I have element range indexes for userId and productId now, but I'm not sure I explained well what I needed.
> > >
> > > I'm trying to get :
> > >
> > > 1. all the unique userId's of users who have registrations for particular books.
> > > 2. all the unique userId's of users whose book registrations have not expired yet.
> > >
> > > Each user is represented like this :
> > >
> > > <user>
> > >     <userId>12345</userId>
> > >     ...
> > >     <registeredBooks>
> > >                     <registration>
> > >                              <productId>ABCDEFG</productId>
> > >                              <startDate></startDate>
> > >                              <endDate></endDate>
> > >                              ...
> > >                     </registration>
> > >
> > >                     <registration>
> > >                              <productId>TUVWXY</productId>
> > >                              <startDate></startDate>
> > >                              <endDate></endDate>
> > >                              ...
> > >                     </registration>
> > >     </registeredBooks>
> > > </user>
> > >
> > > As you can see, a user can have more than 1 book registration, and he can also have more than 1 book registration for the same book (i.e. his previous registration expired and he bought some more time to read it again).
> > >
> > > So given the above business rules, my queries (to give me all users who registered for specific books) can return the same userId more than once. That's why I need to get the distinct values of the userId's returned.
> > >
> > > The queries I showed earlier (simplified versions of the actual query) work fast but don't eliminate the duplicate userId's returned by the queries.
> > >
> > > Your suggested query returns the unique userId's in the index, but not the unique userId's returned by the query.
> > >
> > > I'm pretty new to cts stuff so I'd really appreciate all the assistance I could get. First off, how do I express in cts the query /user[registeredBooks/registration/productId=$myBooks]/userId ? Next,  how do I get the distinct userId's returned ?
> > >
> > > Regards,
> > > Danny
> > >
> > > On Sat, Aug 4, 2012 at 7:46 AM, Ron Hitchens <ron at ronsoft.com> wrote:
> > >
> > >    Put an element range index on both userId and productId.
> > > Then you can do (also untested):
> > >
> > >     fn:count (cts:element-values (xs:QName("userId"), (), (),
> > >        cts:element-value-query (xs:QName("productId"), $myBooks, "exact")))
> > >
> > >    This fn:count should be fast because it will only count the
> > > values in the range index (those that survive the filter that
> > > selects matching productId's, which can be resolved from the
> > > range index on productId).
> > >
> > >    The slowdown comes when a query cannot answer the question
> > > you're asking from the indexes and has to look inside the documents
> > > to test the values.  Range indexes store the unique values in the
> > > index and correlate them back to the fragment those values occur in.
> > >
> > >    Just be careful that you define the proper type when creating
> > > the element range indexes and that you provide the same collation
> > > if the indexes are strings.
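> > >
> > >    If you prefer to script the index creation, something along
> > > these lines should do it (untested).  The database name
> > > "Documents", the string type, and the root collation are all
> > > assumptions on my part; use whatever matches your data.
> > >
> > > ```xquery
> > > xquery version "1.0-ml";
> > > import module namespace admin = "http://marklogic.com/xdmp/admin"
> > >     at "/MarkLogic/admin.xqy";
> > >
> > > (: Placeholder database name. :)
> > > let $db := xdmp:database("Documents")
> > > (: Root collation; queries must use the same one. :)
> > > let $collation := "http://marklogic.com/collation/"
> > > (: Both elements are assumed to be in no namespace. :)
> > > let $indexes := (
> > >   admin:database-range-element-index(
> > >     "string", "", "userId", $collation, fn:false()),
> > >   admin:database-range-element-index(
> > >     "string", "", "productId", $collation, fn:false()))
> > > let $config := admin:get-configuration()
> > > let $config :=
> > >   admin:database-add-range-element-index($config, $db, $indexes)
> > > return admin:save-configuration($config)
> > > ```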
> > >
> > >    You may also get a boost from creating appropriate dateTime
> > > range indexes and applying similar filter queries for those.
> > >
> > > On Aug 4, 2012, at 11:28 AM, David Lee wrote:
> > >
> > > > Untested Suggestion.
> > > > Put userId into an element range index then use   estimate (cts:values())
> > > >
> > > >
> > > > -----------------------------------------------------------------------------
> > > > David Lee
> > > > Lead Engineer
> > > > MarkLogic Corporation
> > > > dlee at marklogic.com
> > > > Phone: +1 650-287-2531
> > > > Cell:  +1 812-630-7622
> > > > www.marklogic.com
> > > >
> > > >
> > > > From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Danny Sinang
> > > > Sent: Friday, August 03, 2012 10:38 PM
> > > > To: general
> > > > Subject: [MarkLogic Dev General] xdmp:estimate() and fn:distinct-values()
> > > >
> > > > Hello,
> > > >
> > > > The query below runs quite fast (i.e. below 1 second).
> > > >
> > > > let $totalCount := xdmp:estimate(/user[reg/productId=$myBooks]/userId)
> > > > let $numUnexpired := xdmp:estimate(/user[reg[productId=$myBooks and (endDate = 0 or endDate >= $current-epoch-time)]]/userId)
> > > > return ($totalCount, $numUnexpired, xdmp:elapsed-time())
> > > >
> > > > Problem is, what I really need is to get the number of distinct values of "userId".
> > > >
> > > > Doing xdmp:estimate(fn:distinct-values()) results in an XDMP-UNSEARCHABLE error.
> > > >
> > > > Using fn:count() instead of xdmp:estimate() works, but takes so long (i.e. 30 seconds).
> > > >
> > > > Is there a workaround for this ?
> > > >
> > > > Regards,
> > > > Danny
> > > >
> > > >
> > > > _______________________________________________
> > > > General mailing list
> > > > General at developer.marklogic.com
> > > > http://developer.marklogic.com/mailman/listinfo/general
> > >
> > > ---
> > > Ron Hitchens {mailto:ron at ronsoft.com}   Ronsoft Technologies
> > >      +44 7879 358 212 (voice)          http://www.ronsoft.com
> > >      +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
> > > "No amount of belief establishes any fact." -Unknown
