[MarkLogic Dev General] Faster way to simulate SQL Group By in ML xquery

Damon Feldman Damon.Feldman at marklogic.com
Fri Jun 22 20:45:21 PDT 2012


Danny,

It looks like this one is also suffering from use of predicates. They get evaluated for every item in the sequence, so you get O(n^2) behavior when they are in a for loop. This is one of the few places I would suggest using procedural programming by adding values to a map:map structure. Here's some sample code that runs in linear time:


declare function local:categorize($map, $key, $value) {
  let $new-seq := (map:get($map, $key), $value)
  return map:put($map, $key, $new-seq)
};

let $categorized := map:map()
let $items := for $i in 1 to 1000 return <item idx="{$i}">{if (xdmp:random(1) eq 1) then "one" else "zero"}</item>
let $add-all :=
  for $item in $items
  return local:categorize($categorized, $item/text(), $item)
return
  <groups>
    <ones>{map:get($categorized, "one")}</ones>
    <zeros>{map:get($categorized, "zero")}</zeros>
  </groups>

Use map:keys($map) to get all "groups."

Yours,
Damon

________________________________
From: general-bounces at developer.marklogic.com [general-bounces at developer.marklogic.com] On Behalf Of Danny Sinang [d.sinang at gmail.com]
Sent: Friday, June 22, 2012 9:08 PM
To: general
Subject: [MarkLogic Dev General] Faster way to simulate SQL Group By in ML xquery

Hi,

I'm trying to simulate a SQL SELECT ... Group By functionality via an ML xquery script.

So I wrote a searchLogs function that uses search:search to return a list of logs that match some given filters. So far, it runs very fast.

For the Group By part, I was able to write something that works but it's turning out to be very slow. When searchLog returns 2,500 rows, it takes 1 minute or so to generate group totals.

The code is a bit complex as it handles multi-column groupBy's recursively. But the basic logic goes like this :

let $results := local:seachLogs($searchQuery)
let $userIds := fn:distinct-values($results/userId)
for $userId in $userIds
let $userLogs := $results[userId=$userId]
let $userTotal := fn:count($userLogs)
return
            <userId>{$userId}</userId>
            <userTotal>{$userTotal}</userTotal>
            <book> {
            let $bookIds := fn:distinct-values($userLogs)
            for $bookId in $bookIds
            let $bookLogs := $userLogs[bookId=$bookId]
            let $bookTotal := fn:count($bookLogs)
            return
            <bookId>{$bookId}</bookId>
            <bookTotal>{$bookTotal}</bookTotal>

            } </book>

I did some crude timings via xdmp:log() and saw that the red lines above eat up like 30 milliseconds for each user / book row. With thousands of rows to be processed, the delays all add up and become noticeable.

Can anybody here suggest a way to speed this thing up dramatically ?

If not, I'm thinking of inserting the raw results into an SQL table and letting SQL do the group totals.

Regards,
Danny

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://community.marklogic.com/pipermail/general/attachments/20120622/31249c73/attachment.html 


More information about the General mailing list