[MarkLogic Dev General] Determining stems for proper nouns?

Geert Josten geert.josten at dayon.nl
Sat Mar 17 10:22:07 PDT 2012


Or, to extend Mike's idea: add two query terms, one case-sensitive and
one case-insensitive, and give the latter a lower weight.
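
To make the weighting concrete, here is a rough sketch in plain Python, not
MarkLogic code. It only models the scoring idea on a toy list of strings; the
function names, the weights 2.0 and 1.0, and the sample documents are all made
up for illustration. In MarkLogic itself I would expect this to be a
cts:or-query over two cts:word-query terms with different options and weights,
but I have not tested that here.

```python
import re


def tokens(text):
    """Split text into alphabetic words (toy tokenizer, an assumption)."""
    return re.findall(r"[A-Za-z]+", text)


def score(doc, term, exact_weight=2.0, folded_weight=1.0):
    """Combine a case-sensitive term with a lower-weighted case-insensitive one.

    The weights are illustrative defaults, not values from MarkLogic.
    """
    words = tokens(doc)
    exact = sum(1 for w in words if w == term)
    folded = sum(1 for w in words if w.lower() == term.lower())
    return exact_weight * exact + folded_weight * folded


# Hypothetical sample documents.
docs = [
    "Thomas Young described the experiment.",
    "The young man joined the meeting.",
    "No match here.",
]

# Keep only matching documents, best score first.
ranked = sorted(
    (d for d in docs if score(d, "Young") > 0),
    key=lambda d: score(d, "Young"),
    reverse=True,
)
```

The exact-case hit counts under both terms, so "Young" as a name ranks above
the adjective "young", while the adjective is still returned further down.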

Kind regards,
Geert

> -----Original Message-----
> From: general-bounces at developer.marklogic.com
> [mailto:general-bounces at developer.marklogic.com] On Behalf Of Mike Sokolov
> Sent: Saturday, 17 March 2012 16:09
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Determining stems for proper nouns?
>
> On 3/17/2012 9:02 AM, David Sewell wrote:
> > On Sat, 17 Mar 2012, Mike Sokolov wrote:
> >
> >> It looks as if it just doesn't "know" that there is such a thing as a
> >> Quaker or a Whig, and doesn't apply rule-based stemming to unknown
> >> capitalized words, which is sensible, because how could it know whether
> >> (for example):
> >>
> >> Barsoomians is a plural noun that could be stemmed or simply a name
> >> (David Barsoomians) that should not.
> >>
> >> Just a guess, and I have no clue what the MarkLogic word list is, but I
> >> suppose you could derive it from exhaustive search...
> > Right, the brute-force fallback would be processing a lexicon list of
> > all the capitalized words in the database. I'm sort of hoping to avoid
> > that, though.
> >
> Have you considered a two-pass search where you widen by lower-casing
> all terms when no results are found?  The result wouldn't be as precise
> as it could be if you knew which terms were in the stemming dict, but
> would enable you to find Young as a name (or at the start of a sentence)
> without matching young, and also match Quaker->Quakers.
>
> -Mike
> _______________________________________________
> General mailing list
> General at developer.marklogic.com
> http://developer.marklogic.com/mailman/listinfo/general
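
For reference, Mike's two-pass suggestion quoted above could be sketched like
this. Again this is plain Python over a toy document list, not MarkLogic code:
the stem() helper is a deliberately naive stand-in for real stemming (it only
strips a trailing "s"), and all names and sample data are hypothetical.

```python
import re


def words(doc):
    """Split text into alphabetic words (toy tokenizer, an assumption)."""
    return re.findall(r"[A-Za-z]+", doc)


def stem(word):
    """Naive stand-in for a stemmer: lower-case and strip a plural 's'."""
    w = word.lower()
    return w[:-1] if w.endswith("s") and len(w) > 3 else w


def two_pass_search(docs, term):
    """First pass: exact, case-sensitive match. Second pass only if the
    first finds nothing: lower-case and stem both sides to widen."""
    exact = [d for d in docs if term in words(d)]
    if exact:
        return exact
    target = stem(term)
    return [d for d in docs if any(stem(w) == target for w in words(d))]


# Hypothetical sample documents.
docs = [
    "Thomas Young wrote it.",
    "A young man.",
    "The Quakers met.",
]

young_hits = two_pass_search(docs, "Young")    # first pass succeeds
quaker_hits = two_pass_search(docs, "Quaker")  # falls through to second pass
```

With this data, "Young" matches only the name and never widens to the
adjective, while "Quaker" finds nothing exactly and is widened to match
"Quakers".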

