[MarkLogic Dev General] Determining stems for proper nouns?

Danny Sokolsky Danny.Sokolsky at marklogic.com
Sat Mar 17 22:15:56 PDT 2012


In 5.0, you can also add to the stemming dictionary by creating a custom dictionary, which is a good fit for this type of use case:

http://docs.marklogic.com/5.0doc/docapp.xqy#display.xqy?fname=http://pubs/5.0doc/xml/search-dev-guide/custom-dictionaries.xml
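
For illustration, here is a minimal sketch of what that might look like for the words mentioned in this thread. It assumes the custom dictionary library module and the cdict:dictionary-write function as described in the guide linked above; the two entries are only examples. As far as I know, the write replaces any existing custom dictionary for that language, so check cdict:dictionary-read first if one may already exist, and existing content will probably need reindexing before stemmed searches pick up the change.

```xquery
xquery version "1.0-ml";

(: Assumed module location and namespace, per the custom dictionaries guide. :)
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary"
  at "/MarkLogic/custom-dictionary.xqy";

(: Write an English custom dictionary that maps each plural to its singular stem.
   The entries below are examples taken from this thread, not a complete list. :)
cdict:dictionary-write("en",
  <cdict:dictionary>
    <cdict:entry>
      <cdict:word>Quakers</cdict:word>
      <cdict:stem>Quaker</cdict:stem>
    </cdict:entry>
    <cdict:entry>
      <cdict:word>Whigs</cdict:word>
      <cdict:stem>Whig</cdict:stem>
    </cdict:entry>
  </cdict:dictionary>)
```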

-Danny
________________________________________
From: general-bounces at developer.marklogic.com [general-bounces at developer.marklogic.com] On Behalf Of David Sewell [dsewell at virginia.edu]
Sent: Saturday, March 17, 2012 4:44 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Determining stems for proper nouns?

Some more experimentation does indicate that there is an internal
dictionary of proper nouns used to determine stemming. Looking for
capitalized words ending in "s" that are treated as plurals and stemmed
without terminal "s", i.e.

cts:words("a", "document")[matches(., '^[A-Z][a-z]+s$')][cts:stem(.) ne .]

yields (for my data) around 1100 entries, from "Abbotts" to "Aztecs"
through "Methodists" and "Unitarians" to "Yankees", "Youngs", and
"Zanes".

Whereas the list of capitalized words ending in "s" that stem only to
themselves,

cts:words("a", "document")[matches(., '^[A-Z][a-z]+s$')][cts:stem(.) eq .]

runs to 26,000 entries, most of them common nouns, but many unusual or
misspelled proper nouns from "Aanabaptists" to "Zweerts". The poor
Quakers were just plain overlooked, I guess. :-)
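
One possible way to narrow those 26,000 entries down to likely candidates for a custom dictionary is sketched below. It keeps only the self-stemming words whose form without the final "s" also occurs in the word lexicon. This is just a heuristic I am assuming is useful, not anything MarkLogic does internally; it will miss plurals whose singular never appears in the data, and the membership test is a plain sequence comparison, so it may be slow on a large lexicon.

```xquery
xquery version "1.0-ml";

(: All capitalized words in the document word lexicon. :)
let $caps := cts:words("a", "document")[matches(., '^[A-Z][a-z]+$')]

(: Words ending in "s" that stem only to themselves. :)
for $w in $caps[ends-with(., "s")][cts:stem(.) eq .]

(: Hypothetical singular: the same word without its final "s". :)
let $singular := substring($w, 1, string-length($w) - 1)

(: Keep the word only if that singular form is itself in the lexicon. :)
where $singular = $caps
return concat($w, " -> ", $singular)
```

Each line of output is a plural and a proposed stem, which can be reviewed by hand before anything goes into a dictionary.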

Thanks, Mike and Geert, for suggestions on refining search results.

David

On Sat, 17 Mar 2012, Mary Holstege wrote:

> On Fri, 16 Mar 2012 19:46:48 -0700, David Sewell <dsewell at virginia.edu> wrote:
>
>>
>> If I had some clue as to the set of words like "Quakers" and "Whigs"
>> that do not stem to singular nouns, I could create a custom dictionary
>> to handle such cases. Are MarkLogic's decisions here based on an
>> internal dictionary? algorithms? both?
>
> Both, but unfortunately I can't tell you what is in that dictionary,
> or the exact circumstances under which the rules get applied because
> stemming is licensed from Inxight (or one of their successors or
> assigns) and we don't have a lot in the way of details.
>
> I think your options are to either run a little test over the word
> lexicon to determine which words need special handling in a custom
> dictionary, and maybe repeat this experiment from time to time to
> see if it needs adjustment, or to accept the lack of precision
> and run case-insensitive.
>
> Sorry I can't be more help here.
>
> //Mary
>
> Mary Holstege
> Principal Engineer
> Mark Logic Corporation
> _______________________________________________
> General mailing list
> General at developer.marklogic.com
> http://developer.marklogic.com/mailman/listinfo/general
>

--
David Sewell, Editorial and Technical Manager
ROTUNDA, The University of Virginia Press
PO Box 400314, Charlottesville, VA 22904-4314 USA
Email: dsewell at virginia.edu   Tel: +1 434 924 9973
Web: http://rotunda.upress.virginia.edu/

