[MarkLogic Dev General] Determining stems for proper nouns?

David Sewell dsewell at virginia.edu
Fri Mar 16 19:46:48 PDT 2012


While debugging some case-sensitive stemmed word queries that didn't 
return the expected results, I discovered via cts:stem() that the 
handling of proper nouns (capitalized terms) is inconsistent. I'm trying 
to figure out whether there's a pattern. Here are some results:

cts:stem("Baptists"), cts:stem("Buddhists"), cts:stem("Quakers")
==> 'Baptist', 'Buddhist', 'Quakers'  [note the last one]

cts:stem("baptists"), cts:stem("buddhists"), cts:stem("quakers")
==> 'baptist', 'buddhist', 'quaker'   [note the last one]

cts:stem("Democrats"), cts:stem("Republicans"), cts:stem("Whigs")
==> 'Democrat', 'Republican', 'Whigs'

cts:stem("democrats"), cts:stem("republicans"), cts:stem("whigs")
==> 'democrat', 'republican', 'whig'

In practice, this means that a case-sensitive search on "Baptist" will 
match "Baptists", as expected, but a search on "Quaker" will not match 
"Quakers" (assuming a cts:word-query() where case-sensitivity is not 
specified, so that the capitalization of the query text is used as a 
trigger for a case-sensitive search).
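
To make the comparison concrete, here is a minimal sketch that passes the 
options explicitly instead of relying on the capitalization of the query 
text. It assumes that cts:contains() against an in-memory node applies 
the same stemming rules as a search against indexed documents; the sample 
sentences are made up for illustration.

```xquery
xquery version "1.0-ml";

(: Explicit options, so case sensitivity does not depend on the query text. :)
let $options := ("case-sensitive", "stemmed")
(: Illustrative sample text only, not taken from real content. :)
let $baptists := <p>The Baptists met on Sunday.</p>
let $quakers := <p>The Quakers met on Sunday.</p>
return (
  cts:contains($baptists, cts:word-query("Baptist", $options)),
  cts:contains($quakers, cts:word-query("Quaker", $options))
)
```

If the stems listed above carry over, the two results should differ, 
since "Quakers" stems to itself rather than to "Quaker".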

I don't want to treat all queries as case-insensitive because this loses 
important distinctions between generic "young" and "Young" as a name, 
etc.

If I had some clue as to the set of words like "Quakers" and "Whigs" 
that do not stem to singular nouns, I could create a custom dictionary 
to handle such cases. Are MarkLogic's decisions here based on an 
internal dictionary? algorithms? both?
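
As a rough sketch of the workaround, the following compares the stem of 
each capitalized word with the stem of its lowercase form, and emits a 
custom dictionary entry wherever only the lowercase form is reduced. The 
candidate list is just the words from this message, and the element names 
and namespace are my understanding of the custom dictionary format, so 
treat them as assumptions to check against the documentation. The 
resulting element would presumably be saved with cdict:dictionary-write() 
from /MarkLogic/custom-dictionary.xqy, which is not called here.

```xquery
xquery version "1.0-ml";

(: Illustrative candidates; a real list might come from a word lexicon. :)
let $candidates :=
  ("Baptists", "Buddhists", "Quakers", "Democrats", "Republicans", "Whigs")
return
  (: Assumed custom dictionary format; verify before writing it to the server. :)
  <dictionary xmlns="http://marklogic.com/xdmp/custom-dictionary">{
    for $word in $candidates
    let $lower := fn:lower-case($word)
    (: cts:stem() may return more than one stem, so keep both as sequences. :)
    let $cap-stems := cts:stem($word, "en")
    let $low-stems := cts:stem($lower, "en")
    (: Flag words that stem to themselves when capitalized
       but are reduced when lowercased. :)
    where $cap-stems = $word and fn:not($low-stems = $lower)
    return
      let $stem := $low-stems[1]
      return
        <entry>
          <word>{$word}</word>
          {
            (: Re-capitalize the first lowercase stem, e.g. quaker -> Quaker. :)
            <stem>{
              fn:concat(fn:upper-case(fn:substring($stem, 1, 1)),
                        fn:substring($stem, 2))
            }</stem>
          }
        </entry>
  }</dictionary>
```

This only finds the problem words among candidates I already suspect, 
which is why knowing the underlying rule would still be useful.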


-- 
David Sewell, Editorial and Technical Manager
ROTUNDA, The University of Virginia Press
PO Box 400314, Charlottesville, VA 22904-4314 USA
Email: dsewell at virginia.edu   Tel: +1 434 924 9973
Web: http://rotunda.upress.virginia.edu/

