Automatic indexing is the automatic creation of a text surrogate, normally keywords or phrases,
to represent the original text. In current English text retrieval systems, this process of content
representation is accomplished by extracting words using spaces and punctuation as word delimiters.
The same technique cannot easily be applied to Chinese texts, which contain no obvious word boundaries;
they appear as a linear sequence of non-spaced or equally spaced ideographic characters, and the number
of characters in a word varies.
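The contrast can be illustrated with a minimal sketch. The function name `split_on_delimiters` and the sample strings are assumptions chosen for illustration, not part of the experiment; the Chinese string means roughly "automatic indexing system".

```python
import re

def split_on_delimiters(text):
    """Extract runs of word characters, treating spaces and punctuation as delimiters."""
    return re.findall(r"\w+", text)

# English: delimiters alone recover the words.
english_tokens = split_on_delimiters("automatic indexing, of text.")

# Chinese: there are no spaces inside the string, so the whole run
# comes back as a single undivided token.
chinese_tokens = split_on_delimiters("自动标引系统")
```

Delimiter-based extraction therefore yields usable index terms for English but only an unsegmented character string for Chinese.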
The solution to the problem lies in morphological and syntactic analyses of Chinese morphemes,
words and phrases. The idea is inspired by the experiments on English computational morphology and
its application to English text retrieval, mainly automatic compound and phrase indexing. These areas
are particularly germane to Chinese because typographically there are no morph and phrase boundaries
in either Chinese or English texts. The experiment is based on the hypothesis that words and phrases
exceeding two Chinese characters can be characterised by a grammar that describes the concatenation
behaviour of morphological and syntactic categories. This is examined using the following three
procedures:
(1) text segmentation - texts are divided into one and two character segments by searching a
dictionary containing over 17000 morphemes and words, which are tagged with morphological
and syntactic categories.
(2) category disambiguation - for the resulting morphemes and words tagged with more than one
category, the correct one is selected based on context.
(3) parsing - the segments are analysed using the grammar, which combines them into compound
and complex words and phrases for indexing and retrieval.
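The three procedures can be sketched as follows. This is a toy illustration under stated assumptions, not the system used in the experiment: the dictionary entries, the category tags (`ADJ`, `ADV`, `N`, `V`, `PART`), the two grammar rules, the greedy two-character preference and the neighbour-based disambiguation rule are all hypothetical stand-ins.

```python
# Hypothetical toy dictionary: entry -> candidate categories.
TOY_DICT = {
    "自动": ["ADJ", "ADV"],   # "automatic"
    "标引": ["N", "V"],       # "indexing" / "to index"
    "系统": ["N"],            # "system"
    "的": ["PART"],           # structural particle
}

# Hypothetical grammar: (left category, right category) -> category of the result.
RULES = {("ADJ", "N"): "N", ("N", "N"): "N"}

def segment(text, dictionary):
    """Step 1: split into two-character segments where the dictionary allows,
    otherwise into single characters (greedy preference is an assumption)."""
    segments, i = [], 0
    while i < len(text):
        two = text[i:i + 2]
        if len(two) == 2 and two in dictionary:
            segments.append(two)
            i += 2
        else:
            segments.append(text[i])
            i += 1
    return segments

def disambiguate(segments, dictionary):
    """Step 2: for each segment, pick the first candidate category that can
    combine with a neighbour under RULES; fall back to the first candidate."""
    cands = [dictionary.get(s, ["UNK"]) for s in segments]
    chosen = []
    for i, options in enumerate(cands):
        left = [chosen[-1]] if chosen else []
        right = cands[i + 1] if i + 1 < len(cands) else []
        pick = options[0]
        for c in options:
            if any((l, c) in RULES for l in left) or any((c, r) in RULES for r in right):
                pick = c
                break
        chosen.append(pick)
    return chosen

def combine(segments, cats):
    """Step 3: merge adjacent segments left to right whenever a rule applies."""
    out = []
    for seg, cat in zip(segments, cats):
        if out and (out[-1][1], cat) in RULES:
            prev_seg, prev_cat = out.pop()
            out.append((prev_seg + seg, RULES[(prev_cat, cat)]))
        else:
            out.append((seg, cat))
    return out

def index_terms(text):
    segs = segment(text, TOY_DICT)
    return combine(segs, disambiguate(segs, TOY_DICT))
```

With these toy rules, `index_terms("自动标引系统")` merges the three two-character segments into one compound tagged `N`, while `index_terms("自动标引的系统")` stops merging at the particle and yields three separate units.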
The utilities employed in the experiment include CCDOS, an extended version of MS-DOS
providing a Chinese I/O system, Chinese WordStar for text input, and Chinese dBASE III for dictionary
construction. The source code is written in Turbo BASIC, including its database toolbox. Thirty texts are
drawn randomly from newspapers to form the sample for the experiment. The results show that the approach
based on partial syntactic analysis can extract keywords with a good degree of accuracy.