[MarkLogic Dev General] building a dictionary from a word lexicon
Alan Darnell
alan.darnell at gmail.com
Tue May 1 10:32:28 PDT 2007
Thanks Danny,
This worked great, but when I tried to load the resulting file (about
400K words -- lots of specialized medical terms) I got this error:
ERROR: eval-in sp-nih at file:/opt/MarkLogic/Modules/
XDMP-FRAGTOOLARGE: Fragment of /sp-dictionary.xml too large for
in-memory storage: XDMP-INMMLISTFULL: In-memory list storage full;
list: table=89%, wordsused=67%, wordsfree=0%, overhead=33%; tree:
table=0%, wordsused=12%, wordsfree=88%, overhead=0%
Are there some admin settings I can adjust to get past this, or should
I break the dictionary file up into smaller chunks, or load the words
via XCC one at a time?
Alan
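
One way to try the smaller-chunks option, as a minimal untested sketch:
reuse the cts:words() and xdmp:save() calls from the quoted reply below
and write the lexicon out as several smaller dictionary files. The chunk
size of 50000 and the c:/tmp/dict-N.xml file names are assumptions picked
for illustration. Whether files of that size stay under the in-memory
list limit behind XDMP-FRAGTOOLARGE would need to be checked against the
database's settings.

```xquery
(: Sketch only: split the word lexicon into several dictionary files.
   $chunk-size and the output paths are illustrative assumptions. :)
let $chunk-size := 50000
let $words := cts:words()
(: number of files needed, rounding up :)
let $chunks := (fn:count($words) + $chunk-size - 1) idiv $chunk-size
for $i in 1 to $chunks
let $start := ($i - 1) * $chunk-size + 1
return
  xdmp:save(
    fn:concat("c:/tmp/dict-", fn:string($i), ".xml"),
    <dictionary>{
      (: words $start .. $start + $chunk-size - 1 of the lexicon :)
      for $w in fn:subsequence($words, $start, $chunk-size)
      return <word>{$w}</word>
    }</dictionary>)
```

Each file would then be loaded as its own dictionary document, which
assumes the spelling module can be pointed at more than one dictionary.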
On 4/30/07, Danny Sokolsky <dsokolsky at marklogic.com> wrote:
> Hi Alan,
>
> I think your approach would work.
>
> If you really want a dictionary of all of the words in the database,
> however, this might be easier:
>
> xdmp:save("c:/tmp/tmp.xml",
>   <dictionary>{"
> ",
>     for $x in cts:words()
>     return (
>       <word>{$x}</word>, "
> ")
>   }</dictionary>)
>
> The newline strings are in there so line breaks will appear between the terms.
> This includes everything in the db, not just things starting with a-z
> (not sure if that is what you want or not). I didn't try this on a
> large data set, but I think it will work because it will just stream
> everything out to the disk (assuming you don't run out of disk
> space...).
>
> Of course using the lexicon to create a dictionary means that all of the
> words (including the misspelled ones) are put in the dictionary. So
> maybe I am not reading the intent of your question correctly.
>
> -Danny
>
> -----Original Message-----
> From: general-bounces at developer.marklogic.com
> [mailto:general-bounces at developer.marklogic.com] On Behalf Of Alan
> Darnell
> Sent: Monday, April 30, 2007 3:33 PM
> To: General at developer.marklogic.com
> Subject: [MarkLogic Dev General] building a dictionary from a word
> lexicon
>
>
> I'd like to build a dictionary file for use with the spelling module and
> base that dictionary on words that appear in my word lexicon. So I want
> to dump the contents of the lexicon to a file formatted according to the
> spelling dictionary schema.
>
> To do this, I'm thinking of running through the lexicon letter by letter
> and constructing the spelling dictionary from the output.
>
> for $i in cts:word-match("a*") [1 to 2000]
> return
> <word>{$i}</word>
>
> Is this the best way to do this? I'm thinking that creating a
> dictionary out of lexicons is probably a pretty common task and that my
> approach seems cumbersome. I'm thinking also it would be great if you
> could have the dictionary automatically update itself based on the
> content of one or more word lexicons as new documents were added,
> updated, and deleted in a database or databases.
>
> Alan
>
> Alan Darnell
> University of Toronto
>