[MarkLogic Dev General] Faster way to simulate SQL Group By in ML xquery

Damon Feldman Damon.Feldman at marklogic.com
Mon Jun 25 15:04:39 PDT 2012


Danny,

I'm glad that helped.

I suspect that iterating through the $logs sequence is actually streaming them out of the DB, and you are seeing the time for document access off the disks rather than categorization, or your machine may be swapping.

If the physical quantity of log data is large you'll be limited by various I/O bottlenecks like the disks and network, and that may kick in now. I can categorize 25,000 simple, generated docs via:

let $logs :=
  for $i in 10 to 25000
  return <log-entry id="{$i}"><type>{xdmp:random(20)}</type></log-entry>

in about 5 seconds, so I think it may be the document access.

What's your actual requirement? Nobody can view 250,000 docs on the screen at once, so I assume they want rollups or averages? You can use range indexes or random sampling to accomplish that without moving so much data across the network.

Yours,
Damon


From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of Danny Sinang
Sent: Monday, June 25, 2012 1:14 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Faster way to simulate SQL Group By in ML xquery

Hi Damon,

Thanks. My script is generating group totals a lot faster now.

However, local:categorize() takes 3 seconds to load 25,000 XML docs into the map.

       let $groupMap := map:map()
       let $_ := for $res in $logs
                 return local:categorize($groupMap, $res/book/bookId, $res)

This is just 3 days' worth of user logs.

We're expecting our customer to query a month's worth of logs at a time.

Is there an even faster solution ?

Regards,
Danny

On Fri, Jun 22, 2012 at 11:45 PM, Damon Feldman <Damon.Feldman at marklogic.com<mailto:Damon.Feldman at marklogic.com>> wrote:

Danny,

It looks like this one is also suffering from use of predicates. They get evaluated for every item in the sequence, so you get O(n^2) behavior when they are in a for loop. This is one of the few places I would suggest using procedural programming by adding values to a map:map structure. Here's some sample code that runs in linear time:


declare function local:categorize($map, $key, $value) {
  let $new-seq := (map:get($map, $key), $value)
  return map:put($map, $key, $new-seq)
};

let $categorized := map:map()
let $items := for $i in 1 to 1000 return <item idx="{$i}">{if (xdmp:random(1) eq 1) then "one" else "zero"}</item>
let $add-all :=
  for $item in $items
  return local:categorize($categorized, $item/text(), $item)
return
  <groups>
    <ones>{map:get($categorized, "one")}</ones>
    <zeros>{map:get($categorized, "zero")}</zeros>
  </groups>

Use map:keys($map) to get all "groups."

Yours,
Damon

________________________________
From: general-bounces at developer.marklogic.com<mailto:general-bounces at developer.marklogic.com> [general-bounces at developer.marklogic.com<mailto:general-bounces at developer.marklogic.com>] On Behalf Of Danny Sinang [d.sinang at gmail.com<mailto:d.sinang at gmail.com>]
Sent: Friday, June 22, 2012 9:08 PM
To: general
Subject: [MarkLogic Dev General] Faster way to simulate SQL Group By in ML xquery
Hi,

I'm trying to simulate a SQL SELECT ... Group By functionality via an ML xquery script.

So I wrote a searchLogs function that uses search:search to return a list of logs that match some given filters. So far, it runs very fast.

For the Group By part, I was able to write something that works but it's turning out to be very slow. When searchLog returns 2,500 rows, it takes 1 minute or so to generate group totals.

The code is a bit complex as it handles multi-column groupBy's recursively. But the basic logic goes like this :

let $results := local:seachLogs($searchQuery)
let $userIds := fn:distinct-values($results/userId)
for $userId in $userIds
let $userLogs := $results[userId=$userId]
let $userTotal := fn:count($userLogs)
return
            <userId>{$userId}</userId>
            <userTotal>{$userTotal}</userTotal>
            <book> {
            let $bookIds := fn:distinct-values($userLogs)
            for $bookId in $bookIds
            let $bookLogs := $userLogs[bookId=$bookId]
            let $bookTotal := fn:count($bookLogs)
            return
            <bookId>{$bookId}</bookId>
            <bookTotal>{$bookTotal}</bookTotal>

            } </book>

I did some crude timings via xdmp:log() and saw that the red lines above eat up like 30 milliseconds for each user / book row. With thousands of rows to be processed, the delays all add up and become noticeable.

Can anybody here suggest a way to speed this thing up dramatically ?

If not, I'm thinking of inserting the raw results into an SQL table and letting SQL do the group totals.

Regards,
Danny


_______________________________________________
General mailing list
General at developer.marklogic.com<mailto:General at developer.marklogic.com>
http://community.marklogic.com/mailman/listinfo/general

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://community.marklogic.com/pipermail/general/attachments/20120625/ce1f5a8b/attachment.html 


More information about the General mailing list