[MarkLogic Dev General] xdmp:estimate() and fn:distinct-values()

Danny Sinang d.sinang at gmail.com
Wed Aug 8 12:44:10 PDT 2012


Hi Ron,

Yep, am getting some false positives.

Is there somewhere I can go to find out what to expect in ML 6?

Regards,
Danny

On Mon, Aug 6, 2012 at 4:21 PM, Ron Hitchens <ron at ronsoft.com> wrote:

>
>    What you really want is an XPath range index, but you'll
> have to wait for MarkLogic 6 for that.
>
>    In the meantime, assuming all the productId elements you
> care about are descendants of registeredBooks, you can place
> the element value query for productId under an element query
> on registeredBooks.
>
>    Change the fourth argument of the cts:element-values call
> to be this (again, untested):
>
>       cts:element-query (xs:QName("registeredBooks"),
>           cts:element-value-query (xs:QName("productId"), $myBooks, "exact"))
>
>    There is still a chance of false positives here because the
> query will always be unfiltered.  But I think this will work
> better for your use case.
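>
>    Put together, the whole call might look like the sketch below
> (untested, like the rest).  It assumes the element range index on
> userId exists, and it binds $myBooks to the two sample productId
> values from the <user> example further down the thread:
>
> ```xquery
> xquery version "1.0-ml";
>
> (: sample values taken from the <user> example in this thread :)
> declare variable $myBooks as xs:string* := ("ABCDEFG", "TUVWXY");
>
> (: count the distinct userId values, restricted to fragments where
>    a matching productId occurs inside registeredBooks :)
> fn:count(
>   cts:element-values(xs:QName("userId"), (), (),
>     cts:element-query(xs:QName("registeredBooks"),
>       cts:element-value-query(xs:QName("productId"), $myBooks, "exact"))))
> ```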
>
> On Aug 6, 2012, at 7:47 PM, Danny Sinang wrote:
>
> > Hi Ron,
> >
> > The count I'm getting is 50% more than what I get by using the slow
> > fn:count() approach.
> >
> > It could be because productId is used elsewhere other than in
> > /user/registeredBooks/registration/productId.
> >
> > How do I further limit the query to return fragments containing
> > productId within the structure
> > /user/registeredBooks/registration/productId?
> >
> > Regards,
> > Danny
> >
> > On Sat, Aug 4, 2012 at 2:52 PM, Ron Hitchens <ron at ronsoft.com> wrote:
> >
> >    Yes, I think I understood what you were trying to do, and that's
> > what the query I provided does (without having tested it, anyway).
> >
> >    My query goes at it the other way around from how you're describing
> > it.  By starting with this:
> >
> >      cts:element-values (xs:QName("userId"), ...
> >
> >    We get distinct values of userId so we know there will never be any
> > duplicates.  The key thing to know is that the range index not only
> > contains the unique values of userId elements, but for each value
> > it also has a list of the fragments that contain that value.
> >
> >    The fourth argument to cts:element-values is a cts:query.  All
> > cts:queries select fragments that match some set of criteria.  In
> > this case it's:
> >
> >      cts:element-value-query (xs:QName("productId"), $myBooks, "exact")
> >
> >    Which matches fragments that contain a productId element with a
> > value that exactly matches at least one of the values in the sequence
> > $myBooks.
> >
> >    By passing this cts:query as an optional argument to cts:element-values,
> > you're asking for the intersection of the two lists: fragments that have
> > a value in the userId index, and that also have a productId that matches one
> > of the values of $myBooks.  Values of userId that don't occur in a fragment
> > with matching productId are not returned.
> >
> >    Hey presto, distinct values of userId.  Count that list and you're done.
> >
> >    Like I said, you can do the same thing with date ranges or any other
> > sort of query (or queries) you need to do.  If all the values that your
> > queries need to test are resolvable from indexes, then evaluation is
> > very fast.
> >
> >    Long story short, you have the information you need to answer the
> > questions you're asking in the indexes.  You just need to phrase the
> > questions in a form that can make good use of those indexes.
> >
> >    Note that many XPath expressions can be accelerated by the presence
> > of range indexes, but only if the XQuery evaluator can unambiguously
> > know that it's safe to use an index.  For example, if you have a
> > dateTime index for an element, but write a predicate that compares
> > that element to an xs:string or xs:untypedAtomic, then the XPath evaluator
> > may not use the index.  But if you cast to an xs:dateTime it might.
> > I personally recently (re)discovered that integer range indexes are xs:int,
> > not xs:integer.  Casting a predicate to xs:integer was slow, changing
> > the cast to xs:int made it fast.  Using the cts: library makes it much
> > more explicit as to how you expect the indexes to be used.
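> >
> >    As a rough illustration of the cast issue (untested): the sketch
> > below assumes an element range index of scalar type int on a
> > hypothetical element named pageCount inside hypothetical book
> > documents.  Neither is part of the data discussed in this thread.
> >
> > ```xquery
> > xquery version "1.0-ml";
> >
> > let $n := 300
> > return (
> >   (: xs:integer does not match the int index type, so the index may be skipped :)
> >   fn:count(//book[pageCount = xs:integer($n)]),
> >   (: xs:int matches the index type, so the index may be used :)
> >   fn:count(//book[pageCount = xs:int($n)]),
> >   (: the cts: form names the range index lookup explicitly :)
> >   xdmp:estimate(cts:search(fn:doc(),
> >     cts:element-range-query(xs:QName("pageCount"), "=", xs:int($n))))
> > )
> > ```
> >
> >    Note that the third expression estimates matching fragments rather
> > than matching book elements, so its result need not equal the first two.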
> >
> >    Hope that's helpful.
> >
> > On Aug 4, 2012, at 2:17 PM, Danny Sinang wrote:
> >
> > > Hi Ron,
> > >
> > > Thanks.
> > >
> > > I have element range indexes for userId and productId now, but I'm not
> > > sure I explained well what I needed.
> > >
> > > I'm trying to get:
> > >
> > > 1. all the unique userId's of users who have registrations for
> > > particular books.
> > > 2. all the unique userId's of users whose book registrations have not
> > > expired yet.
> > >
> > > Each user is represented like this :
> > >
> > > <user>
> > >     <userId>12345</userId>
> > >     ...
> > >     <registeredBooks>
> > >         <registration>
> > >             <productId>ABCDEFG</productId>
> > >             <startDate></startDate>
> > >             <endDate></endDate>
> > >             ...
> > >         </registration>
> > >
> > >         <registration>
> > >             <productId>TUVWXY</productId>
> > >             <startDate></startDate>
> > >             <endDate></endDate>
> > >             ...
> > >         </registration>
> > >     </registeredBooks>
> > > </user>
> > >
> > > As you can see, a user can have more than 1 book registration, and he
> > > can also have more than 1 book registration for the same book (i.e. his
> > > previous registration expired and he bought some more time to read it
> > > again).
> > >
> > > So given the above business rules, my queries (to give me all users
> > > who registered for specific books) can return the same userId more than
> > > once. That's why I need to get the distinct values of the userId's returned.
> > >
> > > The queries I showed earlier (simplified versions of the actual query)
> > > work fast but don't eliminate the duplicate userId's returned by the
> > > queries.
> > >
> > > Your suggested query returns the unique userId's in the index, but not
> > > the unique userId's returned by the query.
> > >
> > > I'm pretty new to cts stuff so I'd really appreciate all the
> > > assistance I could get. First off, how do I express in cts the query
> > > /user[registeredBooks/registration/productId=$myBooks]/userId? Next, how
> > > do I get the distinct userId's returned?
> > >
> > > Regards,
> > > Danny
> > >
> > > On Sat, Aug 4, 2012 at 7:46 AM, Ron Hitchens <ron at ronsoft.com> wrote:
> > >
> > >    Put an element range index on both userId and productId.
> > > Then you can do (also untested):
> > >
> > >     fn:count (cts:element-values (xs:QName("userId"), (), (),
> > >        cts:element-value-query (xs:QName("productId"), $myBooks, "exact")))
> > >
> > >    This fn:count should be fast because it will only count the
> > > values in the range index (those that survive the filter that
> > > selects matching productId's, which can be resolved from the
> > > range index on productId).
> > >
> > >    The slowdown comes when a query cannot answer the question
> > > you're asking from the indexes and has to look inside the documents
> > > to test the values.  Range indexes store the unique values in the
> > > index and correlate them back to the fragment those values occur in.
> > >
> > >    Just be careful that you define the proper type when creating
> > > the element range indexes and that you provide the same collation
> > > if the indexes are strings.
> > >
> > >    You may also get a boost from creating appropriate dateTime
> > > range indexes and applying similar filter queries for those.
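> > >
> > >    A similar filter for the unexpired case might be sketched as
> > > below (untested).  It assumes an element range index of scalar type
> > > long on endDate, since the original query compares endDate to an
> > > epoch time, and it uses placeholder values for $myBooks and
> > > $current-epoch-time.  The conditions are matched per fragment, not
> > > per registration, so a user with a matching productId in one
> > > registration and an unexpired endDate in another would still be
> > > counted.
> > >
> > > ```xquery
> > > xquery version "1.0-ml";
> > >
> > > (: sample values taken from the <user> example in this thread :)
> > > declare variable $myBooks as xs:string* := ("ABCDEFG", "TUVWXY");
> > > (: placeholder epoch seconds; not a value from the thread :)
> > > declare variable $current-epoch-time as xs:long := xs:long(1344000000);
> > >
> > > (: distinct userId values in fragments that have a matching productId
> > >    and an endDate that is either 0 or not yet passed :)
> > > fn:count(
> > >   cts:element-values(xs:QName("userId"), (), (),
> > >     cts:and-query((
> > >       cts:element-value-query(xs:QName("productId"), $myBooks, "exact"),
> > >       cts:or-query((
> > >         cts:element-range-query(xs:QName("endDate"), "=", xs:long(0)),
> > >         cts:element-range-query(xs:QName("endDate"), ">=", $current-epoch-time)
> > >       ))
> > >     ))))
> > > ```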
> > >
> > > On Aug 4, 2012, at 11:28 AM, David Lee wrote:
> > >
> > > > Untested suggestion.
> > > > Put userId into an element range index, then use estimate (cts:values())
> > > >
> > > > -----------------------------------------------------------------------------
> > > > David Lee
> > > > Lead Engineer
> > > > MarkLogic Corporation
> > > > dlee at marklogic.com
> > > > Phone: +1 650-287-2531
> > > > Cell:  +1 812-630-7622
> > > > www.marklogic.com
> > > >
> > > > From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Danny Sinang
> > > > Sent: Friday, August 03, 2012 10:38 PM
> > > > To: general
> > > > Subject: [MarkLogic Dev General] xdmp:estimate() and fn:distinct-values()
> > > >
> > > > Hello,
> > > >
> > > > The query below runs quite fast (i.e. below 1 second).
> > > >
> > > > let $totalCount := xdmp:estimate(/user[reg/productId=$myBooks]/userId)
> > > > let $numUnexpired := xdmp:estimate(/user[reg[productId=$myBooks and (endDate = 0 or endDate >= $current-epoch-time)]]/userId)
> > > > return ($totalCount, $numUnexpired, xdmp:elapsed-time())
> > > >
> > > > Problem is, what I really need is to get the number of distinct
> > > > values of "userId".
> > > >
> > > > Doing xdmp:estimate(fn:distinct-values()) results in an
> > > > XDMP:UNSEARCHABLE error.
> > > >
> > > > Using fn:count() instead of xdmp:estimate() works, but takes too long
> > > > (i.e. 30 seconds).
> > > >
> > > > Is there a workaround for this?
> > > >
> > > > Regards,
> > > > Danny
> > > >
> > > >
> > > > _______________________________________________
> > > > General mailing list
> > > > General at developer.marklogic.com
> > > > http://developer.marklogic.com/mailman/listinfo/general
> > >
> > > ---
> > > Ron Hitchens {mailto:ron at ronsoft.com}   Ronsoft Technologies
> > >      +44 7879 358 212 (voice)          http://www.ronsoft.com
> > >      +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
> > > "No amount of belief establishes any fact." -Unknown
> > >
> >
>