Posted on 2013-12-09, 15:06. Authored by Ghim Hwee Ong.
The increasing use of computers for document preparation and publishing
coupled with a growth in the general information management facilities available on
computers has meant that most documents exist in computer processable form during
their lifetime. This has led to a substantial increase in the demand for data storage
facilities, which frequently seems to exceed the provision of storage facilities, despite
the advances in storage technology. Furthermore, there is growing demand to transmit
these textual documents from one user to another, rather than transferring a printed
copy between sites that must then be re-entered into a computer at the
receiving site. Transmission facilities are, however, limited and large documents can
be difficult and expensive to transmit.
Problems of storage and transmission capacity can be alleviated by compacting
the textual information beforehand, provided that there is no loss of information in
this process. Conventional compaction techniques have been designed to compact all
forms of data (binary as well as text) and have, predominantly, been based on the byte
as the unit of compression. This thesis investigates the alternative of designing a
compaction procedure for natural language texts, using the textual word as the unit of
compression.
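
To illustrate the general idea of using the word rather than the byte as the unit of compression, the following is a minimal Python sketch. It is not one of the four techniques developed in the thesis, which are not specified in this abstract. The function names (tokenize, compress, decompress) and the frequency-ranked vocabulary are assumptions made purely for illustration. The text is split into alternating word and non-word tokens, each distinct token is stored once, and the text is represented as a sequence of indices into that vocabulary, so no information is lost.

```python
import re
from collections import Counter


def tokenize(text):
    # Split into alternating runs of word and non-word characters.
    # Every character falls into exactly one token, so joining the
    # tokens reproduces the original text exactly.
    return re.findall(r"\w+|\W+", text)


def compress(text):
    # Hypothetical word-based scheme: build a vocabulary ordered by
    # descending frequency, then replace each token with its index.
    tokens = tokenize(text)
    vocab = [tok for tok, _ in Counter(tokens).most_common()]
    index = {tok: i for i, tok in enumerate(vocab)}
    codes = [index[tok] for tok in tokens]
    return vocab, codes


def decompress(vocab, codes):
    # Lossless reconstruction: look each index up in the vocabulary.
    return "".join(vocab[i] for i in codes)


sample = "the cat sat on the mat, and the cat slept."
vocab, codes = compress(sample)
restored = decompress(vocab, codes)
```

In a practical scheme the indices would then be packed into variable-length codes so that frequent words occupy fewer bits; that step is omitted here.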
Four related alternative techniques are developed and analysed in the thesis.
These are designed to be appropriate for different circumstances where either
maximum compression or maximum point-to-point transmission speed is of greatest
importance, and where the characteristics of the transmission, or storage, medium may
be oriented to a seven- or eight-bit data unit. The effectiveness of the four techniques is
investigated both theoretically and by practical comparison with a widely used
conventional alternative. It is shown that, for a wide range of textual material, the
word-based techniques yield greater compression and require substantially less
processing time than the conventional alternative.