New Stemmer in TIM | Knowledge for policy


Since 15 June 2022, a new stemmer has been used by default for new and existing queries.

The stemmer determines how the TIM search stems the keywords used in a search, that is, how it reduces word variants to a root form.
The new stemmer, Kstem, has several advantages compared to the previous one and will help you get more precise results for your queries (more information below).

However, this means that the results of your searches may have changed. In most cases the difference will be hardly noticeable, with only a few documents removed or added. In some cases, though, the new stemmer may give unexpected results, and the query will need to be modified to return the proper results.

In particular, queries that use terms with numbers in the middle or at the end, e.g. UF6, might cause problems. This can be solved by putting the term in quotes, like so: “UF6”.
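The quoting workaround can be applied to a keyword list before a query is assembled. The following is a minimal sketch, not part of TIM: the helper name `quote_if_numeric` is hypothetical, and it assumes that keywords are handled one at a time and that a straight double quote is the quote character.

```python
def quote_if_numeric(term: str) -> str:
    """Wrap a keyword in double quotes if it has a digit in the middle or at the end.

    Hypothetical helper for illustration; it is not a TIM function.
    """
    # Leave keywords that are already quoted untouched.
    if len(term) >= 2 and term[0] == '"' and term[-1] == '"':
        return term
    # A digit anywhere after the first character counts as
    # "in the middle or at the end".
    if any(ch.isdigit() for ch in term[1:]):
        return f'"{term}"'
    return term


keywords = ["uranium", "UF6", "CO2", '"H2O"']
print([quote_if_numeric(k) for k in keywords])
```

Keywords without digits are returned unchanged, so the helper can be run over a whole keyword list.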

You can still switch back to the old search manually for a limited period, but in that case no new data will be available for that dataset.

We are at your disposal to help you understand the impact of the new stemmer on your datasets and to modify them where needed. Don’t hesitate to contact us at JRC-TIM-SUPPORT@ec.europa.eu.

Please find below some more information on Kstem:

The primary advantage of Kstem over previous stemmers (e.g., the Porter stemmer that we were using until now) is that it returns words instead of truncated word forms.
For example,
elephants->elephant
amplification->amplify
european->Europe

Kstem reduces common endings by default, such as “-ness” and “-ly”.
There are some exceptions, because Kstem requires a word to be in the lexicon (the basic list of words that the system knows about) before it will reduce one word form to another.
For example, the lexicon needs to contain the word “factorial”; otherwise it would be reduced to the presumed root, “factory”.
In contrast, we want to allow “immunity”->“immune”, but avoid reducing “station”->“state” or “authority”->“author”. This is done by making sure that the root form (“immune”) is in the lexicon and by omitting any variant (“immunity”) that should be reduced to that root.
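The role of the lexicon can be illustrated with a toy model. This is a simplified sketch and not the actual Kstem algorithm: the lexicon, the suffix rules and the function name `toy_stem` are all made up for illustration. A word found in the lexicon is kept as is, and an ending is only reduced when the resulting root is itself in the lexicon.

```python
# Tiny, hypothetical lexicon; the real Kstem lexicon is far larger.
LEXICON = {
    "immune", "factory", "factorial", "station", "state",
    "authority", "author", "elephant",
}

# (suffix, replacement) pairs: a simplified stand-in for Kstem's ending rules.
SUFFIX_RULES = [
    ("ity", "e"),  # immunity  -> immune
    ("ity", ""),   # authority -> author
    ("ion", "e"),  # station   -> state
    ("ial", "y"),  # factorial -> factory
    ("s", ""),     # elephants -> elephant
]


def toy_stem(word, lexicon=LEXICON):
    """Reduce a word to a root only if that root is in the lexicon."""
    word = word.lower()
    # A word listed in the lexicon is protected from reduction.
    if word in lexicon:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            # Accept the reduction only when the candidate is a known word.
            if candidate in lexicon:
                return candidate
    # No rule produced a known word: leave the word unchanged.
    return word


print(toy_stem("immunity"))                            # variant omitted from the lexicon
print(toy_stem("factorial"))                           # listed, so kept as is
print(toy_stem("factorial", LEXICON - {"factorial"}))  # unlisted, so reduced
```

In this sketch, “immunity” is reduced because only its root is listed, while “station” and “authority” stay unchanged because they are listed themselves. Removing “factorial” from the lexicon lets the “-ial” rule reduce it to “factory”, which is the unwanted case described above.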

The lexicon used by Kstem is a fairly comprehensive list that solves some of the problems we had previously identified with searches in TIM.
For example, these words can be found in the lexicon:
product, production, productive, productivity
emergence, emergency, emergent
This means that these word forms are now kept “as is”, and TIM will not conflate “product” with “production” or “emergency” with “emergent”.
Moreover, the list of exceptions used by Kstem can be expanded, and keywords that are reduced incorrectly can be added.

You can find more information on Kstem here.

Originally published: 26 Apr 2022 | Last updated: 01 Jul 2022
Knowledge service: Text Mining | Metadata: TIM analytics
