[MarkLogic Dev General] Determining stems for proper nouns?
sokolov at ifactory.com
Sat Mar 17 04:47:09 PDT 2012
It looks as if the stemmer just doesn't "know" that there is such a thing
as a Quaker or a Whig, and doesn't apply rule-based stemming to unknown
capitalized words. That seems sensible: how could it know whether
Barsoomians is a plural noun that could be stemmed or simply a name
(David Barsoomians) that should not be?
Just a guess, and I have no clue what the MarkLogic word list is, but I
suppose you could derive it from exhaustive search...
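A rough sketch of what such an exhaustive search might look like, assuming
you already have a list of capitalized candidate words (the $candidates
list here is made up for illustration, and I haven't run this against a
real word list). It relies only on cts:stem() and flags words whose
capitalized form stems to itself while the lowercase form stems to
something else:

```xquery
xquery version "1.0-ml";

(: Hypothetical candidate list, for illustration only. In practice it
   would have to come from your own content or word list. :)
let $candidates := ("Baptists", "Buddhists", "Quakers",
                    "Democrats", "Republicans", "Whigs")
for $word in $candidates
let $lower := fn:lower-case($word)
(: cts:stem() may return more than one stem, so use general comparison :)
let $cap-stems := cts:stem($word)
let $lower-stems := cts:stem($lower)
(: Keep words that are left unstemmed when capitalized but are
   stemmed to a different form when lowercased :)
where $cap-stems = $word and fn:not($lower-stems = $lower)
return $word
```

Going by the results quoted below, I'd expect this to flag Quakers and
Whigs but not Baptists. The flagged words would be the candidates for a
custom dictionary.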
On 3/16/2012 10:46 PM, David Sewell wrote:
> In debugging some word queries that didn't return expected results,
> given case-sensitive stemmed searches, I discovered via cts:stem() that
> handling of proper nouns (capitalized terms) is inconsistent. I'm trying
> to figure out whether there's a pattern. Here are some results:
> cts:stem("Baptists"), cts:stem("Buddhists"), cts:stem("Quakers")
> ==> 'Baptist', 'Buddhist', 'Quakers' [note the last one]
> cts:stem("baptists"), cts:stem("buddhists"), cts:stem("quakers")
> ==> 'baptist', 'buddhist', 'quaker' [note the last one]
> cts:stem("Democrats"), cts:stem("Republicans"), cts:stem("Whigs")
> ==> 'Democrat', 'Republican', 'Whigs'
> cts:stem("democrats"), cts:stem("republicans"), cts:stem("whigs")
> ==> 'democrat', 'republican', 'whig'
> In practice, this means that a case-sensitive search on "Baptist" will
> match, as expected, one or more Baptists, but a search on "Quaker" will
> not (assuming a cts:word-query() where case-sensitivity is not
> specified, so that the capitalization of the query text is used as a
> trigger for a case-sensitive search).
> I don't want to treat all queries as case-insensitive because this loses
> important distinctions between generic "young" and "Young" as a name.
> If I had some clue as to the set of words like "Quakers" and "Whigs"
> that do not stem to singular nouns, I could create a custom dictionary
> to handle such cases. Are MarkLogic's decisions here based on an
> internal dictionary? algorithms? both?