Posted on 2010-11-30, 15:11. Authored by Karlis Kreslins.
The thesis covers the construction, application and evaluation of a stemming algorithm for
advanced information searching and retrieval in Latvian databases. Its aim is to examine
the following two questions:
Can a suffix removal algorithm originally designed for English be applied to
Latvian?
Can stemming in Latvian produce the same or better information retrieval results
than manual truncation?
To address these questions, the thesis characterises the role and importance of automatic
word conflation for both document indexing and information retrieval. A literature
review, which analyses and evaluates different types of stemming techniques and the
historical development of stemming algorithms, justifies applying this advanced IR
method to Latvian as well. A comparative analysis of the morphological structure of
English and Latvian led to the selection of Porter's suffix removal algorithm as the
basis for the Latvian stemmer.
An extensive list of Latvian stopwords, including conjunctions, particles and adverbs,
was designed and added to the initial stemmer in order to eliminate insignificant words
from further processing. A number of modifications specific to the Latvian language
were made to the structure and rules of the original stemming algorithm.
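
The general idea of stopword removal followed by suffix stripping can be sketched as below. This is only a minimal illustration, not the thesis's stemmer: the stopword set, the ending list and the minimum stem length are small hypothetical samples chosen for the example, and the stripping step is a plain longest-match rule rather than Porter's multi-step algorithm.

```python
# Illustrative sketch only. STOPWORDS, ENDINGS and MIN_STEM_LEN are
# hypothetical samples, not the lists or rules used in the thesis.
STOPWORDS = {"un", "bet", "vai", "arī", "jau", "ļoti"}
ENDINGS = ("iem", "ām", "as", "am", "os", "a", "s", "u", "ā")
MIN_STEM_LEN = 3  # assumed guard against over-stripping short words


def stem(word):
    """Strip the longest matching ending, keeping at least MIN_STEM_LEN characters."""
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= MIN_STEM_LEN:
            return word[: -len(ending)]
    return word


def index_terms(text):
    """Lowercase, trim punctuation, drop stopwords, then stem what remains."""
    terms = []
    for token in text.lower().split():
        token = token.strip(".,;:!?\"'()")
        if token and token not in STOPWORDS:
            terms.append(stem(token))
    return terms


print(index_terms("Grāmatas un grāmatu, arī grāmatām."))
```

In this sketch the inflected forms "grāmatas", "grāmatu" and "grāmatām" are all conflated to one stem, while the stopwords "un" and "arī" are dropped before stemming. A real Latvian stemmer would need far more rules and conditions than this.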
Analysis of word stemming based on a Latvian electronic dictionary and Latvian text
fragments confirmed that the suffix removal technique can be successfully applied to
the Latvian language. An evaluation study of user search statements revealed that the
stemming algorithm can, to a certain extent, improve the effectiveness of information
retrieval.