[MarkLogic Dev General] Determining stems for proper nouns?

Mike Sokolov sokolov at ifactory.com
Sat Mar 17 08:09:24 PDT 2012


On 3/17/2012 9:02 AM, David Sewell wrote:
> On Sat, 17 Mar 2012, Mike Sokolov wrote:
>
>> It looks as if it just doesn't "know" that there is such a thing as a Quaker
>> or a Whig, and doesn't apply rule-based stemming to unknown capitalized
>> words, which is sensible, because how could it know whether (for example):
>>
>> Barsoomians is a plural noun that could be stemmed or simply a name (David
>> Barsoomians) that should not.
>>
>> Just a guess, and I have no clue what the MarkLogic word list is, but I
>> suppose you could derive it from exhaustive search...
> Right, the brute-force fallback would be processing a lexicon list of
> all the capitalized words in the database. I'm sort of hoping to avoid
> that, though.
>
Have you considered a two-pass search that widens the query by
lower-casing all terms when the first pass finds no results?  The results
wouldn't be as precise as they could be if you knew which terms were in
the stemming dictionary, but this would let you find Young as a name (or
at the start of a sentence) without matching young, and also match
Quaker->Quakers.
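
A rough Python sketch of the two-pass idea, in case it helps. This is a
toy model, not MarkLogic code: DOCS, toy_stem, search, and
two_pass_search are hypothetical names invented for illustration. It
assumes that a capitalized term matches only exactly (case-sensitive,
unstemmed), while a lower-case term matches case-insensitively with
stemming.

```python
# Toy in-memory "database"; a real system would query the search engine.
DOCS = {
    1: "Thomas Young wrote the letter",
    2: "the young man met a Quaker",
    3: "a Whig pamphlet",
}


def toy_stem(word):
    """Naive stand-in for a stemmer: strip one trailing 's' from longer words."""
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


def search(docs, term):
    """Assumed engine behavior (hypothetical):
    - capitalized term: exact, case-sensitive, unstemmed match
    - lower-case term: case-insensitive, stemmed match
    """
    hits = []
    for doc_id, text in docs.items():
        words = text.split()
        if term[:1].isupper():
            matched = term in words
        else:
            target = toy_stem(term)
            matched = any(toy_stem(w.lower()) == target for w in words)
        if matched:
            hits.append(doc_id)
    return hits


def two_pass_search(docs, term):
    """Pass 1: search the term as typed. Pass 2: only if pass 1 found
    nothing and the term was not already lower-case, retry lower-cased."""
    hits = search(docs, term)
    if hits or term == term.lower():
        return hits
    return search(docs, term.lower())
```

With this toy data, two_pass_search(DOCS, "Young") stops after the first
pass and returns only the document containing the name, while
two_pass_search(DOCS, "Quakers") finds nothing in the first pass and
falls back to the stemmed lower-case search, which matches the document
containing "Quaker". The widening only happens when the first pass comes
back empty, so a capitalized term with any exact hits never gets its
stemmed variants.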

-Mike

