[MarkLogic Dev General] Determining stems for proper nouns?

Mike Sokolov sokolov at ifactory.com
Sat Mar 17 04:47:09 PDT 2012

It looks as if it just doesn't "know" that there is such a thing as a 
Quaker or a Whig, and doesn't apply rule-based stemming to unknown 
capitalized words. That seems sensible: how could it know whether, for 
example, "Barsoomians" is a plural noun that could be stemmed or simply 
a name (David Barsoomians) that should not be?

Just a guess, and I have no clue what the MarkLogic word list is, but I 
suppose you could derive it from exhaustive search...
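
Something along these lines might do as a rough sketch of that search. 
It is untested. The candidate list is made up for illustration (it just 
reuses the words from your message), and in practice you would feed in 
capitalized words pulled from your own content. It flags a word when the 
capitalized form stems to itself while the lowercased form stems to 
something else:

```xquery
xquery version "1.0-ml";

(: Hypothetical candidate list, for illustration only. In practice this
   might be drawn from your own content, e.g. via a word lexicon. :)
let $candidates := ("Baptists", "Buddhists", "Quakers",
                    "Democrats", "Republicans", "Whigs")
for $word in $candidates
let $lower := fn:lower-case($word)
(: Keep words whose capitalized form is left unstemmed while the
   lowercased form is reduced to a different stem. cts:stem() can
   return more than one stem, so general comparison (=) is used. :)
where cts:stem($word) = $word
  and fn:not(cts:stem($lower) = $lower)
return $word
```

Going by the results you posted, I would expect this to pick out 
"Quakers" and "Whigs" and skip the rest, but I haven't run it.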


On 3/16/2012 10:46 PM, David Sewell wrote:
> In debugging some word queries that didn't return expected results,
> given case-sensitive stemmed searches, I discovered via cts:stem() that
> handling of proper nouns (capitalized terms) is inconsistent. I'm trying
> to figure out whether there's a pattern. Here are some results:
>
> cts:stem("Baptists"), cts:stem("Buddhists"), cts:stem("Quakers")
> ==>  'Baptist', 'Buddhist', 'Quakers'  [note the last one]
>
> cts:stem("baptists"), cts:stem("buddhists"), cts:stem("quakers")
> ==>  'baptist', 'buddhist', 'quaker'   [note the last one]
>
> cts:stem("Democrats"), cts:stem("Republicans"), cts:stem("Whigs")
> ==>  'Democrat', 'Republican', 'Whigs'
>
> cts:stem("democrats"), cts:stem("republicans"), cts:stem("whigs")
> ==>  'democrat', 'republican', 'whig'
>
> In practice, this means that a case-sensitive search on "Baptist" will
> match, as expected, one or more Baptists, but a search on "Quaker" will
> not (assuming a cts:word-query() where case-sensitivity is not
> specified, so that the capitalization of the query text is used as a
> trigger for a case-sensitive search).
>
> I don't want to treat all queries as case-insensitive because this loses
> important distinctions between generic "young" and "Young" as a name,
> etc.
>
> If I had some clue as to the set of words like "Quakers" and "Whigs"
> that do not stem to singular nouns, I could create a custom dictionary
> to handle such cases. Are MarkLogic's decisions here based on an
> internal dictionary? algorithms? both?
