Welcome to the homepage of my C-level term paper (i.e. the final paper required in order to acquire a Swedish Bachelor's degree) in English Linguistics. The essay was written at the English Department of the University of Stockholm under the supervision of Professor Magnus Ljung.
The paper presents the Acquaintance method for text analysis and categorization, which has been proposed by Marc Damashek, who works for the Department of Defense at Fort Meade in the United States. Acquaintance offers language-independent and database-free analysis of texts. It surveys the relative occurrences of different n-grams (i.e. sequences of n characters) in different texts and groups texts with similar n-gram distributions into so-called clusters. Texts in these clusters tend to be written in the same language. For example, texts with many 5-grams of the type "_and_" are probably written in English. If many texts written in the same language are analysed, this language cluster may be divided into topic clusters through a "zooming-in" operation.
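To make the idea of an n-gram distribution concrete, here is a minimal C++ sketch. It is not the code from the paper: the names Profile, ngramProfile and similarity are made up for this illustration, and the use of cosine similarity to compare two distributions is an assumption about how "similar" might be measured.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical type: maps each n-gram to its relative frequency in one text.
using Profile = std::map<std::string, double>;

// Count every sequence of n consecutive characters and divide by the
// total number of n-grams, giving relative occurrences.
Profile ngramProfile(const std::string& text, std::size_t n) {
    assert(n > 0);
    Profile profile;
    if (text.size() < n) return profile;  // text too short to contain any n-gram
    const std::size_t total = text.size() - n + 1;
    for (std::size_t i = 0; i < total; ++i) {
        profile[text.substr(i, n)] += 1.0;
    }
    for (auto& entry : profile) {
        entry.second /= static_cast<double>(total);
    }
    return profile;
}

// Assumed similarity measure: cosine of the angle between two profiles.
// Returns 0.0 if either profile is empty.
double similarity(const Profile& a, const Profile& b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (const auto& e : a) {
        normA += e.second * e.second;
        auto it = b.find(e.first);
        if (it != b.end()) dot += e.second * it->second;
    }
    for (const auto& e : b) {
        normB += e.second * e.second;
    }
    if (normA == 0.0 || normB == 0.0) return 0.0;
    return dot / (std::sqrt(normA) * std::sqrt(normB));
}
```

With profiles like these, two texts that share many frequent n-grams (such as " and " in English) get a similarity close to 1, while texts with no n-grams in common get 0. Grouping the texts into clusters is a separate step that this sketch does not attempt.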
The performance of the method has been studied through a C++ implementation. Unfortunately, the method is found to be rather demanding when it comes to computing power, and the results it produces are far from perfect.
The potential application of this method is automatic categorization of large text databases into clusters, which can then either be named manually according to their language or topic, or searched using a text with the content being searched for. That is, provided that you have found, for example, one magazine article that deals with a topic that interests you, you could use Acquaintance to find other articles dealing with a similar topic in a large database of articles. Since the articles only have to be analysed once, when they are added to the database, the search can be made rapidly by first comparing the goal text with the clusters and then, once the relevant cluster has been identified, with the individual texts in it.
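The two-stage search described above could be sketched roughly as follows. This is an illustration under assumptions, not the implementation from the paper: Cluster, Match and findClosest are hypothetical names, each cluster is assumed to store one summary profile (here called a centroid) next to the profiles of its member texts, and the similarity measure is passed in by the caller rather than fixed.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical types for this sketch.
using Profile = std::map<std::string, double>;
using Similarity = std::function<double(const Profile&, const Profile&)>;

struct Cluster {
    Profile centroid;            // assumed summary profile for the whole cluster
    std::vector<Profile> texts;  // member profiles, computed once when texts are added
};

struct Match {
    std::size_t cluster;  // index of the chosen cluster
    std::size_t text;     // index of the best text within that cluster
    bool found;           // false if there was nothing to search
};

Match findClosest(const Profile& goal,
                  const std::vector<Cluster>& clusters,
                  const Similarity& sim) {
    Match best{0, 0, false};

    // Stage 1: compare the goal text with one profile per cluster.
    bool haveCluster = false;
    double bestClusterScore = 0.0;
    for (std::size_t c = 0; c < clusters.size(); ++c) {
        const double score = sim(goal, clusters[c].centroid);
        if (!haveCluster || score > bestClusterScore) {
            haveCluster = true;
            bestClusterScore = score;
            best.cluster = c;
        }
    }
    if (!haveCluster) return best;

    // Stage 2: compare the goal text only with the texts in that cluster.
    const std::vector<Profile>& members = clusters[best.cluster].texts;
    double bestTextScore = 0.0;
    for (std::size_t t = 0; t < members.size(); ++t) {
        const double score = sim(goal, members[t]);
        if (!best.found || score > bestTextScore) {
            best.found = true;
            bestTextScore = score;
            best.text = t;
        }
    }
    return best;
}
```

The point of the design is that the number of comparisons is the number of clusters plus the size of one cluster, rather than the size of the whole database. The price is that a relevant text sitting in a different cluster will be missed.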
Press here to download the paper in Microsoft Word format. (119k) (Since it was written using v6.0c on a Macintosh computer, it might not be fully functional on another configuration, even though I have tested it with success on a PC.)
You may also read it in a rather dull ASCII format by clicking here. (36k)
I have also provided a rough HTML version of the paper here. (114k)
There is also a shorter introductory text, written in Microsoft Word, which you can download by clicking here. (5k)
I also have the C++ code available, should you be interested in (debugging) it. The code is available here. You should, however, be aware that there are at least French and American patents protecting the method from commercial use.
N-grams are the focus of much interest in the information retrieval research field at the moment. You may find James Mayfield's (University of Maryland Baltimore County) list of current papers on the subject here.
Another page about emerging algorithmic technologies can be found at Gnowledge.
Jonas Gustavsson