Recommendations about Stemming Options

by Mary Holstege

We have some internal email lists at MarkLogic and sometimes the information that pops up is too good not to share. Recently, we had this question:

Are there any recommendations regarding the stemming option to use among basic, advanced and decompounding? Would it be a good approach to always use the "advanced" option when enabling stemming for French?

The answer came from Mary Holstege, who built many of the MarkLogic search features.

In languages with a lot of inflections, you should use advanced stemming: alternative stems are fairly common because homonyms end up colliding, especially for short words. That means pretty much everything except English and Chinese. Most European languages will also see certain verb forms produce both an adjective stem and a verb stem (e.g. English "crowded" or "flying"). In English, with few inflections, this is the main case where advanced stemming buys you anything; even in the case of homonyms, the stems end up the same anyway.

Decompounding is mainly useful for Germanic languages that do a lot of noun compound formation (German, Dutch, Norwegian) and, to a lesser extent, Japanese. English would be in this camp, except that at some point in our linguistic past we decided to put spaces in our noun compounds (French influence, probably), so you don't get anything out of decompounding English text.
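For readers who want to try one of these settings, here is a minimal sketch using the Admin API. It assumes a database named "Documents" (a placeholder; substitute your own) and a user with admin privileges. Note that the stemming level is a database setting rather than a per-language one, and changing it may trigger reindexing.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: "Documents" is a placeholder database name used for illustration. :)
let $config := admin:get-configuration()
let $db := xdmp:database("Documents")

(: Accepted values: "off", "basic", "advanced", "decompounding" :)
let $config := admin:database-set-stemmed-searches($config, $db, "advanced")

return admin:save-configuration($config)
```

Swapping "advanced" for "decompounding" would be the natural choice for a database holding mostly German, Dutch, or Norwegian content.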

I would also add that if you are doing stemmed searches in languages that care about accents (like French), you'll get better results with explicitly diacritic-sensitive searches (assuming you spelled your French words with the correct accents). Likewise, for German you'll get better results if you spell your nouns with Capital Letters the German Way and use case-sensitive searches. It so happens the stemmers are sensitive to that detail.
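As a rough illustration of that advice, the two queries below pass the sensitivity options explicitly. This is only a sketch: the search terms are made-up examples, and it assumes the database has stemmed searches enabled and holds French and German content.

```xquery
xquery version "1.0-ml";

(: French: spell the word with its accents and match diacritics exactly. :)
let $french := cts:word-query("élèves",
  ("stemmed", "diacritic-sensitive", "lang=fr"))

(: German: capitalize the noun and match case exactly. :)
let $german := cts:word-query("Hunde",
  ("stemmed", "case-sensitive", "lang=de"))

return (
  cts:search(fn:doc(), $french)[1 to 10],
  cts:search(fn:doc(), $german)[1 to 10]
)
```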

Comments

  • Hi Mary, thanks for this post. I develop search applications for German-speaking people. While it is true that capitalized nouns with case-sensitive searches give better results, we have the problem that users are lazy and do not capitalize nouns in their search terms. Google, for instance, does not make a difference. The MarkLogic stemmers are sensitive to case. If the search option "case-insensitive" is specified with "stemmed", then the input is transformed to lowercase before stemming (see https://help.marklogic.com/knowledgebase/article/View/41/15/case-sensitive-search-with-stemming). This means that we have to search for both the lowercase version and the capitalized version of the user's search terms to really make use of the stemming.

    An example: the user enters "hund" (= dog), and we need to search for both "hund" and the correct spelling "Hund" to get high recall. If the user enters multiple words like "meine hunde" (= my dogs), then we have to search for all combinations of the words, where each word starts with either a capital letter or a lowercase letter. This becomes an or-query like the following:

    cts:or-query((
      cts:word-query("Meine Hunde", ("case-sensitive", "stemmed", "lang=de")),
      cts:word-query("meine Hunde", ("case-sensitive", "stemmed", "lang=de")),
      cts:word-query("Meine hunde", ("case-sensitive", "stemmed", "lang=de")),
      cts:word-query("meine hunde", ("case-sensitive", "stemmed", "lang=de"))
    ))

    Unfortunately, this does not scale well with the number of words: the number of queries is 2^n, where n is the number of words.

    Another issue with the case sensitivity of stemming is the first word of a sentence. It always starts with a capital letter, so the stemmer does not "know" it if it is normally written in lowercase. The above "trick" does not help here either, because the database documents would have to be manipulated, not only the query.

    Kind regards, Andreas
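The or-query in the comment above can be generated rather than written by hand. The sketch below is one possible way to do it, not something from the original thread: `local:case-variants` is a hypothetical helper that splits the input on whitespace and emits every lowercase/capitalized combination.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: returns every phrase obtained by writing each
   word either in lowercase or with an initial capital letter. :)
declare function local:case-variants($words as xs:string*) as xs:string*
{
  if (fn:empty($words)) then ""
  else
    let $lower := fn:lower-case($words[1])
    let $forms := fn:distinct-values((
      $lower,
      fn:concat(fn:upper-case(fn:substring($lower, 1, 1)),
                fn:substring($lower, 2))
    ))
    for $form in $forms
    for $rest in local:case-variants(fn:subsequence($words, 2))
    return fn:normalize-space(fn:concat($form, " ", $rest))
};

(: Build one case-sensitive, stemmed word query per variant. :)
let $phrases := local:case-variants(fn:tokenize("meine hunde", "\s+"))
return cts:or-query(
  for $phrase in $phrases
  return cts:word-query($phrase, ("case-sensitive", "stemmed", "lang=de"))
)
```

For "meine hunde" this produces the same four word queries listed in the comment. It only automates the construction: the number of variants still doubles with each additional word, and it does nothing for the sentence-initial capitalization problem in the stored documents.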