[MarkLogic Dev General] pre-processing and filtering out common words

Michael Blakeley michael.blakeley at marklogic.com
Wed May 21 12:18:54 PDT 2008


It's common practice to remove "stop words" 
(http://en.wikipedia.org/wiki/Stopwords) from queries, but also to 
provide some syntax for exceptions. For example, there should be a way 
to find Hamlet's soliloquy by searching for "to be or not to be". One 
technique is to remove individual query terms that are stop words, but 
to leave quoted phrases intact.
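
As a rough illustration of that technique, here is a minimal Python sketch. It assumes the query arrives as a plain string and that phrases are delimited by straight double quotes. The stop word list, the regular expression, and the function name remove_stop_words are all hypothetical; nothing here is a MarkLogic API.

```python
import re

# Hypothetical stop word list; a real application would choose its own.
STOP_WORDS = {"a", "an", "and", "i", "the", "to"}

# Match either a complete double-quoted phrase or a run of non-space characters.
TOKEN = re.compile(r'"[^"]*"|\S+')


def remove_stop_words(query):
    """Drop individual stop words, but keep quoted phrases untouched."""
    kept = []
    for token in TOKEN.findall(query):
        # A token counts as a phrase only if it is wrapped in a pair of quotes.
        is_phrase = len(token) >= 2 and token.startswith('"') and token.endswith('"')
        if is_phrase or token.lower() not in STOP_WORDS:
            kept.append(token)
    return " ".join(kept)


if __name__ == "__main__":
    print(remove_stop_words('the soliloquy "to be or not to be" and hamlet'))
```

With this list, an unquoted search for to be or not to be loses both occurrences of "to", while the quoted form passes through unchanged, which is the exception syntax described above.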

Other common practices are to lower-case individual query terms, to 
remove some or all punctuation, and to remove singular possessives 
(trailing "'s"). But not every application will implement all of these 
techniques: requirements vary.
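
A minimal Python sketch of these normalization steps might look like the following. It assumes whitespace-separated terms, ASCII punctuation, and a straight apostrophe in possessives; the function names normalize_term and normalize_query are hypothetical, and an application could reasonably apply only some of the steps.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_DELETE_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize_term(term):
    """Lower-case a term, drop a trailing 's, and remove punctuation."""
    term = term.lower()
    # Trim surrounding punctuation first, so that "hamlet's," still ends in 's.
    term = term.strip(string.punctuation)
    # Remove the singular possessive before deleting the apostrophe itself;
    # otherwise "hamlet's" would turn into "hamlets".
    if term.endswith("'s"):
        term = term[:-2]
    return term.translate(_DELETE_PUNCTUATION)


def normalize_query(query):
    """Normalize each whitespace-separated term and drop any that end up empty."""
    terms = (normalize_term(t) for t in query.split())
    return " ".join(t for t in terms if t)


if __name__ == "__main__":
    print(normalize_query("Hamlet's Soliloquy, Act III!"))
```

The order of the steps matters here: the possessive has to be handled while the apostrophe is still present.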

-- Mike

Paul M wrote:
> Do you pre-process your search queries, so that common words are removed, such as (and, the, to, I, an, a, etc...)? Does this speed search results noticeably? (Many fragments are returned when common words are used as search terms, correct?)
> 
> thank you


