[MarkLogic Dev General] Determining stems for proper nouns?

David Sewell dsewell at virginia.edu
Fri Mar 16 19:46:48 PDT 2012

While debugging some case-sensitive stemmed word queries that didn't
return the expected results, I discovered via cts:stem() that the
handling of proper nouns (capitalized terms) is inconsistent. I'm trying
to figure out whether there's a pattern. Here are some results:

cts:stem("Baptists"), cts:stem("Buddhists"), cts:stem("Quakers")
==> 'Baptist', 'Buddhist', 'Quakers'  [note the last one]

cts:stem("baptists"), cts:stem("buddhists"), cts:stem("quakers")
==> 'baptist', 'buddhist', 'quaker'   [note the last one]

cts:stem("Democrats"), cts:stem("Republicans"), cts:stem("Whigs")
==> 'Democrat', 'Republican', 'Whigs'

cts:stem("democrats"), cts:stem("republicans"), cts:stem("whigs")
==> 'democrat', 'republican', 'whig'

In practice, this means that a case-sensitive search on "Baptist" will 
match, as expected, one or more Baptists, but a search on "Quaker" will 
not (assuming a cts:word-query() where case-sensitivity is not 
specified, so that the capitalization of the query text is used as a 
trigger for a case-sensitive search).
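
To make the comparison concrete, here is a rough sketch of the two searches with the options spelled out instead of inferred from the capitalization of the query text. It assumes stemmed searches are enabled on the database, and fn:doc() is only a placeholder for the real search scope.

```xquery
xquery version "1.0-ml";

(: Sketch: the same two searches with explicit options, so the behavior
   does not depend on the capitalization of the query text.
   fn:doc() is a placeholder scope; substitute the real collection. :)
let $options := ("case-sensitive", "stemmed")
for $term in ("Baptist", "Quaker")
return fn:concat(
  $term, ": ",
  (: estimated number of matching fragments for this term :)
  xdmp:estimate(cts:search(fn:doc(), cts:word-query($term, $options)))
)
```

Given the stems listed above, the estimate for "Baptist" should include documents that contain only "Baptists", while the estimate for "Quaker" should miss documents that contain only "Quakers".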

I don't want to treat all queries as case-insensitive, because that
would lose important distinctions such as the one between the generic
adjective "young" and "Young" as a name.

If I had some clue as to the set of words like "Quakers" and "Whigs" 
that do not stem to singular nouns, I could create a custom dictionary 
to handle such cases. Are MarkLogic's decisions here based on an 
internal dictionary? algorithms? both?
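
One way to hunt for such words might be to compare the stems of a capitalized form with the stems of its lowercase form and keep only the mismatches. The sketch below does that over a hypothetical candidate list (here just the words from the results above) and wraps the mismatches in a cdict:dictionary element. The entry shape (cdict:word, cdict:stem) is my understanding of the custom dictionary format and should be checked against the documentation, and giving the stem an initial capital is an assumption about what a case-sensitive stemmed search needs.

```xquery
xquery version "1.0-ml";
declare namespace cdict = "http://marklogic.com/xdmp/custom-dictionary";

(: Hypothetical candidate list; in practice it might come from a word
   lexicon or from the indexes of the edition being searched. :)
let $candidates := ("Baptists", "Buddhists", "Quakers",
                    "Democrats", "Republicans", "Whigs")
return
  <cdict:dictionary>{
    for $word in $candidates
    (: stems of the capitalized form, lowercased for comparison :)
    let $cap-stems := for $s in cts:stem($word) return fn:lower-case($s)
    (: stems of the all-lowercase form :)
    let $low-stems := cts:stem(fn:lower-case($word))
    (: keep the word only if the two sets of stems share nothing :)
    where fn:not($cap-stems = $low-stems)
    return
      <cdict:entry>
        <cdict:word>{$word}</cdict:word>
        <cdict:stem>{
          (: assumption: reuse the first lowercase stem, recapitalized :)
          fn:concat(fn:upper-case(fn:substring($low-stems[1], 1, 1)),
                    fn:substring($low-stems[1], 2))
        }</cdict:stem>
      </cdict:entry>
  }</cdict:dictionary>
```

With the stems reported above, only "Quakers" and "Whigs" should end up as entries. The query only builds the element for inspection; installing it would be a separate step through the custom dictionary library module, which as far as I know replaces any existing custom dictionary for the language.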

David Sewell, Editorial and Technical Manager
ROTUNDA, The University of Virginia Press
PO Box 400314, Charlottesville, VA 22904-4314 USA
Email: dsewell at virginia.edu   Tel: +1 434 924 9973
Web: http://rotunda.upress.virginia.edu/
