Linguistically, there are derivational and inflectional stemmers. Derivational stemmers are more aggressive and can reduce “conduction” to “conduct” and “catty” to “cat”. Some even remove prefixes, so “superconductor” becomes “conduct”. They are allowed to change between parts of speech, like noun to verb. Inflectional stemmers do not change the part of speech. They might remove endings (“cars” to “car”) or make internal changes (“women” to “woman”). Some handle irregular plurals, like “people” to “person”.

In Solr, Lucene, and Elasticsearch, the KStem filter is among the least aggressive of the English stemmers.

I’m not a fan of stemming proper nouns like “Holbrooks”. “Steve Jobs” and “Bill Gates” are not a job or a gate.

Because the search includes date sorting, precision becomes really important. When sorting by relevance, it is OK to have lots of poor matches as long as they are on page 10 of the results. But when sorted by date, all those bad matches are mixed in, with no way to ignore them.
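
A minimal sketch of that effect, using a made-up result list (the titles, scores, and dates are all hypothetical, not taken from any real search). It sorts the same hits once by relevance and once by date:

```python
from datetime import date

# Hypothetical hits: (title, relevance score, publication date).
hits = [
    ("good match A", 9.1, date(2015, 3, 1)),
    ("poor match B", 0.4, date(2016, 1, 5)),
    ("good match C", 8.7, date(2014, 7, 9)),
    ("poor match D", 0.3, date(2015, 11, 20)),
]

# Relevance order: the poor matches sink to the bottom.
by_relevance = [h[0] for h in sorted(hits, key=lambda h: h[1], reverse=True)]

# Date order (newest first): the score is ignored, so poor matches can land on top.
by_date = [h[0] for h in sorted(hits, key=lambda h: h[2], reverse=True)]

print(by_relevance)
print(by_date)
```

With this toy data the two poor matches come last under relevance sorting but first under date sorting, which is why a low-precision match set hurts more once the results are sorted by date.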

Wildcards, like asterisks, usually put a huge load on the search engine. First it must scan all the terms to get matches, then search with the hundreds or thousands of terms that match.
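
Those two phases can be sketched roughly as follows. This is a toy model only: the `index` dictionary and both function names are hypothetical, and real engines use more efficient term dictionaries than a linear scan over a Python dict.

```python
from fnmatch import fnmatchcase

def expand_wildcard(pattern, index):
    """Phase 1: scan every indexed term and keep the ones matching the pattern."""
    return [term for term in index if fnmatchcase(term, pattern)]

def wildcard_search(pattern, index):
    """Phase 2: look up each expanded term and union the matching document ids."""
    doc_ids = set()
    for term in expand_wildcard(pattern, index):
        doc_ids |= index[term]
    return doc_ids

# Hypothetical toy inverted index: term -> set of document ids.
index = {
    "drivel": {1},
    "driver": {2, 3},
    "drivers": {3},
    "driving": {4},
    "holbrook": {5},
}

print(expand_wildcard("driv*", index))
print(sorted(wildcard_search("driv*", index)))
```

Here one query term turns into four lookups. With a real vocabulary the expansion can run to hundreds or thousands of terms, and a leading wildcard such as `*river*` cannot use any prefix ordering of the terms at all.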

I agree with Walter. I'd rather have it this way than the other way.

Ideal, I think, would be the third way: partial but not stemming.

I still think it should be possible to use an asterisk to indicate "contains this string." No?


I'd much rather have partial matches without needing a "syntax" element like an asterisk to mark it. That is, I'd like to have a search for "driver" match "drivers", which would be a match on part of the word, but not "driving", which would be a match based on word stems.

Partial matching would solve Nancy's case of Holbrook versus Holbrooks, and it would allow one to do "pseudo-stemming": for example, searching for "driv" to match both "drivers" and "driving". Note the "pseudo", as that would also match "drivel", which a true stemming search would not.
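
To make the distinction concrete, here is a small sketch of what is meant. It assumes "partial" means the query is a prefix of the indexed word; the `partial_match` function and the word list are hypothetical and not part of any search engine's API.

```python
def partial_match(query, words):
    """Return the words that start with the query, ignoring case (no stemming)."""
    q = query.lower()
    return [w for w in words if w.lower().startswith(q)]

# Hypothetical vocabulary for illustration.
words = ["driver", "drivers", "driving", "drivel", "Holbrook", "Holbrooks"]

print(partial_match("driver", words))    # matches "drivers" but not "driving"
print(partial_match("driv", words))      # pseudo-stemming: also picks up "drivel"
print(partial_match("Holbrook", words))  # matches both spellings of the name
```

Replacing `startswith(q)` with `q in w.lower()` would give the "contains this string" behaviour instead, at the cost of even more unintended matches.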


