[MarkLogic Dev General] RE: Text Updates Garbage Collection? (Neil Bradley)

Kelly Stirman Kelly.Stirman at marklogic.com
Fri Oct 9 02:57:24 PDT 2009


Hi Neil,

Have you thought about using cts:highlight() to replace your string values? You construct a cts:or-query() of all the different values you'd like to replace:

let $q := cts:or-query(("Doc","ume","nt"))

Then you call cts:highlight() on the document. Normally you would use cts:highlight() to replace a matching string with some new markup for style, such as a span tag. It turns out you can use it to replace the matching string with whatever you want. Where cts:highlight() finds a match, you have some useful options. One is the $cts:queries variable, which returns the matching query for the text that is matched. You can use this with a lookup document like so:

<replace>
 <item from="Doc">DOC</item>
 <item from="ume">UME</item>
 <item from="nt">NT</item>
</replace>

For each match, you'll get back a cts:query, and you can use this to find matches in your replace node, and use the substitution string as the value for the third argument in cts:highlight():

let $doc :=
<doc>I have some text that includes the words Doc, ume, and nt.</doc>

let $replace :=
<replace>
 <item from="Doc">DOC</item>
 <item from="ume">UME</item>
 <item from="nt">NT</item>
</replace>

let $q := cts:or-query(("Doc","ume","nt"))

return
cts:highlight($doc,$q,local:replace($cts:queries,$replace))

-->

<doc>I have some text that includes the words DOC, UME, and NT.</doc>
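The message calls local:replace() without showing it. A minimal sketch of one possible definition follows. It assumes that each entry in $cts:queries is the cts:word-query built from one of the strings in the or-query, so that cts:word-query-text() can recover the original string; the function body is illustrative, not Kelly's actual code.

```xquery
xquery version "1.0-ml";

(: Hypothetical definition: local:replace() is called in the message but never shown. :)
declare function local:replace(
  $queries as cts:query*,
  $replace as element(replace)
) as xs:string
{
  (: Assumption: the first matching query is a cts:word-query built from one string. :)
  let $from := cts:word-query-text($queries[1])
  (: Returns the empty string if the lookup document has no item for this string. :)
  return fn:string(($replace/item[@from = $from])[1])
};

let $doc :=
  <doc>I have some text that includes the words Doc, ume, and nt.</doc>
let $replace :=
  <replace>
    <item from="Doc">DOC</item>
    <item from="ume">UME</item>
    <item from="nt">NT</item>
  </replace>
let $q := cts:or-query(("Doc", "ume", "nt"))
return
  cts:highlight($doc, $q, local:replace($cts:queries, $replace))
```

If the matching queries turn out not to be plain word queries in your version, looking up the item by the matched text ($cts:text) passed in as an extra argument is an alternative.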

This can be extended with cts:reverse-query() to perform custom enrichment on XML. Rather than having one large or-query() for all the strings you might want to replace, you would store a document with your query and any other useful metadata you wish to associate with the query. For example, if you wanted to do some custom enrichment on drug names, you might have a series of documents like this:

<drug>
  <name type="commercial">Tylenol</name>
  <img type="commercial">/Thumbs/generic/acetamenophin.png</img>
  <name type="generic">Acetamenophin</name>
  <img type="generic">/Thumbs/generic/acetamenophin.png</img>
  <link>http://drugdictionary.com/drugid/j674ui832190</link>
  <query>{cts:or-query((cts:word-query("Tylenol","case-insensitive"),cts:word-query("Acetamenophin","case-insensitive")))}</query>
</drug>
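As a sketch of how one of these documents might be loaded, the query can be built with an enclosed expression inside the element constructor and stored with xdmp:document-insert(). The URI "/drugs/tylenol.xml" is a made-up name for illustration, and the img elements are left out for brevity.

```xquery
xquery version "1.0-ml";

(: The document URI is a hypothetical choice; use whatever naming scheme suits you. :)
xdmp:document-insert(
  "/drugs/tylenol.xml",
  <drug>
    <name type="commercial">Tylenol</name>
    <name type="generic">Acetamenophin</name>
    <link>http://drugdictionary.com/drugid/j674ui832190</link>
    <query>{
      (: The constructed query is serialized to XML inside <query>. :)
      cts:or-query((
        cts:word-query("Tylenol", "case-insensitive"),
        cts:word-query("Acetamenophin", "case-insensitive")
      ))
    }</query>
  </drug>
)
```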

And for each document you want to enrich, you would use the reverse indexes to see which drugs are in the document. This is a much easier approach to manage than an or-query() of thousands of drug names:

cts:search(doc(),cts:reverse-query($new-document))

This would return the matching query documents, and you can then retrieve the queries from these docs and pass them to cts:highlight(). Here's how you might do that:

let $drug-groups := cts:search(doc(),cts:reverse-query($doc))
let $query := cts:or-query(for $q in $drug-groups/drug/query/* return cts:query($q))
return
  cts:highlight($doc,$query,local:drug-enrich($cts:text,$cts:queries,$drug-groups))


In this case, instead of a single replace document, the new value is built from the metadata you store with each query. You can write your own function to build elaborate replacement markup. The matched text ($cts:text) is passed in as an argument because the cts:highlight() variables are only available inside its third argument. Here's a simple example for the drugs:

declare function local:drug-enrich($match as xs:string,$queries as cts:query*,$drug-groups as node()*){
  let $this-drug := $drug-groups/drug/name[cts:contains(.,cts:or-query($queries))]
  let $this-type := fn:data($this-drug/@type)
  let $other-type := if($this-type eq "commercial") then "generic" else "commercial"
  let $img := fn:data($this-drug/../img[@type eq $this-type])
  let $link := $this-drug/../link/text()
  let $equivalent := $this-drug/../name[@type eq $other-type]/text()
  return <drug img="{$img}" link="{$link}">{$match} [{$equivalent}]</drug>
};


Kelly

Hi,

I want to check if there is likely to be any problem with memory exhaustion
in the following scenario.

I will have text documents stored in a MarkLogic database that I want to
update using a large number of consecutive search/replaces, then finally
convert to XML.

It seems obvious to me that I could easily run out of memory if I adopt this
approach (and have hundreds of replaces applied to large text documents). In
this trivial example, I am simply converting the word "Document" to
"DOCUMENT" in three steps, which I would obviously do in one for real, but
just to show the method I originally considered...

    let $Text := "...(large text document)..."
    let $NewText1 := fn:replace($Text, "Doc", "DOC")
    let $NewText2 := fn:replace($NewText1, "ume", "UME")
    let $NewText3 := fn:replace($NewText2, "nt", "NT")
    let $XML := xdmp:unquote($NewText3)
    return
      $XML

I am assuming that each variable contains a variant of the text document, so
memory will quickly become exhausted.

However, if I use xdmp:set(), would that solve the problem, because the
first variable content is being replaced, and the later variables have no
content at all?...

    let $Text := "...(large text document)..."
    let $NewText1 := fn:replace($Text, "Doc", "DOC")
    let $NewText2 := xdmp:set($NewText1, fn:replace($NewText1, "ume", "UME"))
    let $NewText3 := xdmp:set($NewText1, fn:replace($NewText1, "nt", "NT"))
    let $XML := xdmp:unquote($NewText1)
    return
      $XML

Or should I expect the old text to still be occupying memory (lack of
string garbage collection)?

Thanks,

Neil.
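
For reference, the chain of consecutive replacements in the question can also be driven from a lookup list with a small recursive function, which avoids naming one variable per step. This is only a sketch: local:replace-all() and the sample input are hypothetical, and nothing in this thread says whether it changes how much memory the server uses.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: applies each <item> of the lookup list in turn. :)
declare function local:replace-all(
  $text as xs:string,
  $items as element(item)*
) as xs:string
{
  if (fn:empty($items))
  then $text
  else
    local:replace-all(
      (: fn:replace() treats @from as a regular expression and gives
         "$" and "\" a special meaning in the replacement string. :)
      fn:replace($text, fn:string($items[1]/@from), fn:string($items[1])),
      fn:subsequence($items, 2)
    )
};

let $replace :=
  <replace>
    <item from="Doc">DOC</item>
    <item from="ume">UME</item>
    <item from="nt">NT</item>
  </replace>
let $text := "<doc>Document</doc>"  (: stands in for the large text document :)
return xdmp:unquote(local:replace-all($text, $replace/item))
```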

-----Original Message-----
From: general-bounces at developer.marklogic.com [mailto:general-bounces at developer.marklogic.com] On Behalf Of general-request at developer.marklogic.com
Sent: Friday, October 09, 2009 2:27 AM
To: general at developer.marklogic.com
Subject: General Digest, Vol 64, Issue 25

Today's Topics:

   1. Performance Meters http test configuration (Curtis Wilde)
   2. Re: Performance Meters http test configuration (Michael Blakeley)
   3. Re: Performance Meters http test  configuration (Curtis Wilde)
   4. To set threshold for search:search results (mano m)
   5. Text Updates Garbage Collection? (Neil Bradley)


----------------------------------------------------------------------

Message: 1
Date: Thu, 8 Oct 2009 16:06:24 -0600
From: Curtis Wilde <galvorn at gmail.com>
Subject: [MarkLogic Dev General] Performance Meters http test
        configuration
To: General at developer.marklogic.com
Message-ID:
        <7cd019b80910081506h7e3375c5k6feffb33596d9f1a at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

The performance meters tutorial does a good job at explaining how to execute
xcc tests with performance meters, but it is less clear how an http test
should work. I've taken a stab at a very simple http test with no success:

<h:script xmlns:h="http://marklogic.com/xdmp/harness">
    <h:test>
        <h:name>login</h:name>
        <h:set-up/>
        <h:tear-down/>
        <h:comment-expected-result><![CDATA[<response status="AUTHENTICATED"/>]]>
        </h:comment-expected-result>
        <h:query><![CDATA[login?username=foo&password=bar]]></h:query>
    </h:test>
</h:script>

The test makes a restful call (login) to a service, which should
authenticate the specified user and receive the authenticated status message
reply, but this never succeeds. In the address bar of the browser the call
looks like:

http://myTestServer:8030/login?username=foo&password=bar

properties file:
checkResults=true
host=myTestServer
port=8030
isRandomTest=false
inputPath=../tests/httptests.xml
numThreads=1
shared=false
readSize=32768
recordResults=true
#reporter=XMLReporter
#outputPath=results.xml
reporter=CSVReporter
outputPath=../reports/
reportTime=true
reportPercentileDuration=95
reportStandardDeviation=true
testTime=0
testType=HTTP
testListClass=com.marklogic.performance.XMLFileTestList

Not sure what I'm doing wrong.

------------------------------

Message: 2
Date: Thu, 08 Oct 2009 15:46:25 -0700
From: Michael Blakeley <michael.blakeley at marklogic.com>
Subject: Re: [MarkLogic Dev General] Performance Meters http test
        configuration
To: General Mark Logic Developer Discussion
        <general at developer.marklogic.com>
Message-ID: <4ACE6BC1.6090309 at marklogic.com>
Content-Type: text/plain; charset=UTF-8; format=flowed

Curtis,

Try testType=URI instead. The HTTP test type is more specialized: it
posts the <h:query> value to a special "/evaluate.xqy" service on the
target host. The idea with that test type is to evaluate arbitrary
XQuery expressions.

-- Mike




------------------------------

Message: 3
Date: Thu, 8 Oct 2009 18:01:18 -0600
From: Curtis Wilde <galvorn at gmail.com>
Subject: Re: [MarkLogic Dev General] Performance Meters http test
        configuration
To: General Mark Logic Developer Discussion
        <general at developer.marklogic.com>
Message-ID:
        <7cd019b80910081701g738dea58i12aa2d8f49426626 at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Thanks for the guidance, but changing to URI is still unsuccessful.

Manually requesting authentication with the browser should return:
<response status="AUTHENTICATED"/>
but I still receive
<response status="NOT_AUTHENTICATED"/>
(http://mytestserver:8030/login?username=foo&password=bar)

This is not a problem with the service since currently any username/password
combo will authenticate on our test system.
I'll try to monitor the actual request through a proxy or something and see
if it's getting mangled.


------------------------------

Message: 4
Date: Thu, 8 Oct 2009 23:16:05 -0700 (PDT)
From: mano m <mano07good at yahoo.co.in>
Subject: [MarkLogic Dev General] To set threshold for search:search
        results
To: general at developer.marklogic.com
Message-ID: <976626.70658.qm at web95112.mail.in2.yahoo.com>
Content-Type: text/plain; charset="iso-8859-1"

Hi,

In a search application, we are performing the following steps:

1. A constant value is set as the threshold. From the search response, we get the total number of results and compare it with the threshold.

2. If the number of search results exceeds the threshold, we display the search results.

3. Otherwise, we perform a "Did You Mean?" search (spell check and auto-correction using a dictionary) and display the result.

Is there a more efficient way to set the threshold than using a constant?

Regards,
Mano
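
The three steps above can be sketched roughly as follows, using the Search API and the spelling module. The threshold value, the dictionary URI, and the query text are all placeholders, and the sketch keeps the constant threshold that the question asks about rather than proposing an alternative.

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";
import module namespace spell = "http://marklogic.com/xdmp/spell"
  at "/MarkLogic/spell.xqy";

(: All three values below are hypothetical placeholders. :)
declare variable $threshold as xs:integer := 10;
declare variable $dictionary as xs:string := "/dictionaries/english.xml";
declare variable $qtext as xs:string := "acetamenophin";

let $response := search:search($qtext)
return
  if (xs:integer($response/@total) gt $threshold)
  then $response  (: step 2: enough results, show them :)
  else
    (: step 3: replace each word with its first spelling suggestion,
       keeping the original word when the dictionary suggests nothing :)
    let $corrected :=
      fn:string-join(
        for $word in fn:tokenize($qtext, "\s+")
        return (spell:suggest($dictionary, $word)[1], $word)[1],
        " ")
    return search:search($corrected)
```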



------------------------------

Message: 5
Date: Fri, 9 Oct 2009 11:56:36 +0100
From: "Neil Bradley" <neil at bradley.co.uk>
Subject: [MarkLogic Dev General] Text Updates Garbage Collection?
To: <general at developer.marklogic.com>
Message-ID: <00f901ca48cf$34ced200$9e6c7600$@co.uk>
Content-Type: text/plain; charset="us-ascii"


------------------------------
