[MarkLogic Dev General] RE: Help with co-occurrence (?) xquery
dlee at epocrates.com
Sun Dec 13 06:03:07 PST 2009
Thanks for the suggestions. I've solved the problem by re-organizing my data.
But I think the original problem is still "academically" interesting, so
let me restate it in a simpler form.
Suppose I have a document with a list of elements like
To make sense of these, think of them as relation items.
They are saying, e.g.,
a=1 is related to both b=1 and b=2
Another way to look at this data is "property" data
for item a=2 it has 2 properties (3,4)
My query is a co-occurrence-like query.
I want to find all <a> values for which there exist 2 relations (or
properties) with specific values.
In this case both <a> and <b> refer to elements in a different document
(linked via the <a> value)
So I can't use the co-occurrence search in ML (it requires the elements to be in
the same fragment).
That leaves me with pure XQuery, where I'm forced to loop over all
values of a.
Something like this (I haven't tried it, but it's close, I think):
for $a in fn:distinct-values( //item/a/string() )
let $as := //item[a eq $a]
where exists($as[b eq $value1]) and exists($as[b eq $value2])
return $a
The problem is that this is extremely slow when I have 500,000 items.
None of the suggestions solved the basic problem: I can't find a way
to do a query which doesn't require a loop over all values of <a>.
There is no (to my finding)
cts:search( //item ,
magic search that returns matches where a is the same and b
matches both $value1 and $value2 )
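For anyone who wants to try the loop on something small, here is a self-contained sketch in plain XQuery. The sample items are reconstructed from the description above (a=1 related to b=1 and b=2, a=2 related to b=3 and b=4); the <items> wrapper and the two search values are assumptions made only for illustration.

```xquery
xquery version "1.0";

(: Sample data reconstructed from the description in the text.
   The <items> wrapper element is an assumption. :)
declare variable $doc :=
  <items>
    <item><a>1</a><b>1</b></item>
    <item><a>1</a><b>2</b></item>
    <item><a>2</a><b>3</b></item>
    <item><a>2</a><b>4</b></item>
  </items>;

(: Hypothetical values to look for. :)
declare variable $value1 := "1";
declare variable $value2 := "2";

(: Visit every distinct <a> value; each visit rescans all items,
   which is what makes this approach slow on large data. :)
for $a in fn:distinct-values($doc/item/a/string())
let $as := $doc/item[a eq $a]
where exists($as[b eq $value1]) and exists($as[b eq $value2])
return $a
```

With these sample values only a=1 has both b=1 and b=2, so the query returns the single value 1.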
The final answer?
What I should have done in the first place: DE-normalize the data.
As a second pass at data loading, I've put all of these relations
*within* the master element,
as opposed to a separate relations document. This looks something like
<value1> ... </value1>
It made the data size a bit bigger, because in reality these are
relations, not properties,
so I had to include in elements id=x all items where either a=x or b=x
so potentially it doubled the number of <item> elements overall.
But the end result is much easier to query.
Now the query is something ML can optimize (and does very well)
//element[items/item/b = $value1 and items/item/b = $value2]
Or this can be reformulated into a cts:search easily (with an and
query), which is what I did.
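A rough sketch of what the cts:search form might look like is below. It is not the exact query used here: the element names are taken from the XPath above, the two values are hypothetical, and it assumes each master element lives in its own document or fragment, since an and-query is resolved per fragment.

```xquery
xquery version "1.0-ml";

(: Hypothetical values; in the text these are $value1 and $value2. :)
let $value1 := "1"
let $value2 := "2"
return
  (: Return master elements whose fragment contains a <b> equal to
     $value1 and also a <b> equal to $value2. :)
  cts:search(
    //element,
    cts:and-query((
      cts:element-value-query(xs:QName("b"), $value1),
      cts:element-value-query(xs:QName("b"), $value2)
    ))
  )
```

Because both conditions are checked against the same fragment, the indexes can answer the query without a loop over every <a> value.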
So the conclusion?
What I'm finding (and what was suggested earlier in another question)
is that when given normalized, relational-type data, it's often best to
DE-normalize the data
before loading into ML.
The problem I have is that the raw data is pretty big and doesn't fit in memory,
so I need a database to load the raw data in order to de-normalize it.
And even when it does fit into memory, memory-based XQuery code handles
it pretty badly because
there is no indexing, so it takes forever to run the denormalization.
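To make the denormalization pass concrete, here is a minimal in-memory sketch in plain XQuery. The sample values, the <items> wrapper, and the id attribute on the master element are assumptions for illustration; the real master elements carry more than just the relations.

```xquery
xquery version "1.0";

(: Flat relation items; the values and the wrapper are assumptions. :)
declare variable $flat :=
  <items>
    <item><a>1</a><b>2</b></item>
    <item><a>1</a><b>3</b></item>
    <item><a>2</a><b>3</b></item>
  </items>;

(: Emit one master element per id, holding every relation that
   mentions that id on either side (a = id or b = id). :)
for $id in fn:distinct-values(($flat/item/a/string(), $flat/item/b/string()))
order by $id
return
  <element id="{$id}">
    <items>{ $flat/item[a eq $id or b eq $id] }</items>
  </element>
```

Each predicate here is a linear scan over all items for every id, which is why this style of code gets so slow without an index once the data is large.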
So I'm left with these 2 options ... other suggestions welcome.
1) Load the relational data into an RDBMS
E.g. load the "flat" data into something like MySQL. Then use a
programming language and SQL code to produce de-normalized XML. (I'm
thinking of extending xmlsh's xsql to handle master-detail queries to do this.)
2) Load the flat data into a separate ML database (or directory), but
probably a totally separate DB which has its settings tuned for fastest
load (i.e. all the wildcard and stemmed searching turned off). Then run
XQuery on this DB to write the de-normalized XML back to the
filesystem, and load that data into the target DB. I have found I have
to do this iteratively because the resulting XML document is too big to
fit in memory, so XCC crashes if I produce one big denormalized file.
Any other suggestions on how people de-normalize flat data to load into ML?
Thanks for any suggestions.
From: general-bounces at developer.marklogic.com
[mailto:general-bounces at developer.marklogic.com] On Behalf Of Danny
Sent: Saturday, December 12, 2009 6:55 PM
To: General Mark Logic Developer Discussion
Subject: [MarkLogic Dev General] RE: Help with co-occurance (?) xquery
I am not sure I understand what you are trying to do, but here are a few
suggestions:
* consider creating a range index on the CONCEPTID element. Then you
can use cts:element-values to get all of the unique CONCEPTID values
very quickly (and you can use the cts:query parameter to constrain it to
a cts:query--like a cts:directory-query)
* the predicate [CONCEPTID2 = ($d1,$d2)] will return true if *either*
value is there; I think (not sure) that you wanted both to be there.
* if you know the full path to your elements, that is preferable to
From: general-bounces at developer.marklogic.com
[general-bounces at developer.marklogic.com] On Behalf Of Lee, David
[dlee at epocrates.com]
Sent: Friday, December 11, 2009 6:30 PM
To: General at developer.marklogic.com
Subject: [MarkLogic Dev General] Help with co-occurance (?) xquery
If anyone has any suggestions for this I'd love to hear them.
I have a bunch of records (500k+) of elements like this:
Given 2 concept IDs, I want to query for RELATIONSHIP records where
CONCEPTID1 is the same and CONCEPTID2 matches the 2 IDs I have.
I also have a 'master' record set of <CONCEPT> which lists all the
concept ID's if that helps.
I'm trying this naive XQuery ... which hasn't completed yet:
let $d1 := 387494007,
$d2 := 387458008
for $c in xdmp:directory("/SNOMED/concepts/")//CONCEPT
let $cid := $c/CONCEPTID/string()
$cid][CONCEPTID2 = ($d1,$d2)]
I don't think it's quite what I'm looking for, but it's close.
The problem is that 10 minutes later it hasn't returned yet.
I'm sure it's not using any kind of indexing, which is going to be way too slow.
I'm looking at the cts:element-value-co-occurrences
which seems to be very close to what I want but these 2 relationship
elements don't co-occur within the same fragment.
My next thought is to regenerate the data, putting all relationships
for a given CONCEPTID1 inside the CONCEPT element which is associated with it.
That's probably a better design anyway ...
but any suggestions on how to query this in a different way would be very welcome.
David A. Lee
Senior Principal Software Engineer
dlee at epocrates.com