[MarkLogic Dev General] xpath string construction

Ian Small ian at marklogic.com
Wed Oct 15 11:17:17 PDT 2008


What Michael is saying is that the query that finds the sequence
document is of similar complexity to the query that would check for
uniqueness.  So the cost of finding the sequence document = the cost of
checking for uniqueness.  After you are done with that cost, then the
sequence approach requires reading the doc (probably already somewhere
in cache), then updating it and writing it back, updating indexes etc.

So "similar query" != "cts:similar-query()".  "similar query" = "a query
of effectively the same complexity as the other one I am talking about"

ian 

> -----Original Message-----
> From: general-bounces at developer.marklogic.com 
> [mailto:general-bounces at developer.marklogic.com] On Behalf Of 
> Eric Palmitesta
> Sent: Wednesday, October 15, 2008 11:14 AM
> To: General Mark Logic Developer Discussion
> Subject: Re: [MarkLogic Dev General] xpath string construction
> 
> It does have work to do in terms of locking and unlocking the 
> sequence document, but I don't see which similar query would 
> be performed.  It reads an int, increments it, and writes it 
> back.  At no point does it need to check an ever-increasing 
> index for a matching (duplicate) id.
> 
> There's probably something silly I'm missing here.  Can you 
> clarify why you consider reading a node from a sequence 
> document a 'similar query'?
> 
> Thanks,
> 
> Eric
> 
> Michael Blakeley wrote:
> > Yes, the check for uniqueness has a cost - but it's a small one, 
> > because this is exactly the kind of index lookup work that 
> the server 
> > is designed for. Reading the sequence document performs a similar 
> > query, and then goes on to do even more work.
> > 
> > -- Mike
> > 
> > Eric Palmitesta wrote:
> >> Hey Michael,
> >>
> >> When I read your earlier description, I got the impression 
> that you'd 
> >> have to, for each randomly generated id, check every previously 
> >> generated id.  For a large (millions of documents) data 
> store, where 
> >> unique ids are interspersed throughout your documents, won't this 
> >> start to chug?  Maybe I didn't understand your ideas, can 
> you clarify 
> >> this point for me?
> >>
> >> I see what you mean about scaling, and how 
> locking/unlocking a single 
> >> file doesn't scale at all with a growing number of threads 
> that need 
> >> unique ids.
> >>
> >> Interesting idea of unique ids on a per-document basis, 
> concatenated 
> >> with another sub-id, I hadn't thought of something like this.
> >>
> >> I appreciate the debate, it's a quick way to learn.  :)
> >>
> >> Much thanks,
> >>
> >> Eric
> >>
> >> Michael Blakeley wrote:
> >>> Eric,
> >>>
> >>> I don't understand why you say that random ids don't 
> scale well. I'd 
> >>> argue that sequential ids don't scale at all, and that 
> the recursive 
> >>> approach scales pretty well. Maybe we mean different 
> things when we 
> >>> talk about scaling?
> >>>
> >>> For small numbers of ids per transaction, where small is 
> less than a 
> >>> few hundred, the recursive random approach will perform well and 
> >>> will very rarely lock any extra documents at all. The 
> queries to see 
> >>> if a document already exists will be indexed, and will 
> never return 
> >>> more than 1 fragment. So I would expect this approach to scale at 
> >>> O(log n) with the number of documents in the database.
> >>>
> >>> The sequential approach can do better with large number 
> of ids per 
> >>> transaction. But this approach requires a lock on the sequence 
> >>> document, for every transaction, effectively serializing 
> everything.
> >>> You can mitigate this with multiple sequence documents, 
> but then you 
> >>> don't, strictly speaking, have a sequence anymore.
> >>>
> >>> I suppose it depends on the workload you need to optimize: the 
> >>> random approach is better for highly-parallel workloads, 
> while the 
> >>> sequential approach might do better for workloads that need 
> >>> thousands of ids per transaction and don't mind being 
> single-threaded.
> >>>
> >>> In today's world, I'd rather use a solution that is capable of a 
> >>> high degree of parallelism. So I rarely, if ever, use 
> sequential ids.
> >>>
> >>> To generate thousands of ids per transaction (say, 1 for each 
> >>> element in a document), I might start with a single 
> random id. Any 
> >>> check would be highly unlikely to take more than zero iterations. 
> >>> Then I'd concatenate my own, in-memory sequence onto it, for each 
> >>> descendant element.
> >>>
> >>> So the root element might have an id "123af097324-1", its first 
> >>> child would be "123af097324-2", etc (apply string-pad if you want 
> >>> the ids to have a fixed length). This approach would preserve the 
> >>> scalability of random ids, without requiring extra work 
> to check any 
> >>> but the first id.
> >>>
> >>> -- Mike
> >>>
> >>> Eric Palmitesta wrote:
> >>>> Wow, thanks for the reply, Michael.  I'll probably be using some 
> >>>> variation of one of your examples.
> >>>>
> >>>> Michael Blakeley wrote:
> >>>>> Many people ask about sequential ids. It is possible to 
> model an 
> >>>>> id sequence as a database document. But as with RDBMS 
> sequences, 
> >>>>> there are serialization penalties. I don't see the advantage of 
> >>>>> sequential ids, so I rarely, if ever, use this approach.
> >>>> Assuming the recursive check isn't feasible (it doesn't scale 
> >>>> well), the advantage of sequential ids is being able to sleep at 
> >>>> night knowing collisions are simply impossible, and are 
> not reliant 
> >>>> on a 'good-enough' random() function.  I'm nit-picking 
> of course, 
> >>>> I'm sure random() is fine.  :)
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Eric
> >>>> _______________________________________________
> >>>> General mailing list
> >>>> General at developer.marklogic.com
> >>>> http://xqzone.com/mailman/listinfo/general
> >>> _______________________________________________
> >>> General mailing list
> >>> General at developer.marklogic.com
> >>> http://xqzone.com/mailman/listinfo/general
> >> _______________________________________________
> >> General mailing list
> >> General at developer.marklogic.com
> >> http://xqzone.com/mailman/listinfo/general
> > 
> > _______________________________________________
> > General mailing list
> > General at developer.marklogic.com
> > http://xqzone.com/mailman/listinfo/general
> _______________________________________________
> General mailing list
> General at developer.marklogic.com
> http://xqzone.com/mailman/listinfo/general
> 


More information about the General mailing list