A Free Implementation of the 1911 Roget's Thesaurus
The Open Roget's Project sets out to create a fully functional lexical resource for Natural Language Processing based on Roget's Thesaurus. A Java 5.0 implementation with the 1911 data is now available. We plan to publish a series of versions with contemporary language. (A version with the 1987 Penguin data cannot be made public: the data belong to Pearson Education.)
Downloads and documentation are available at the downloads page. A comparison of the 1911 and 1987 versions of Roget's Thesaurus on the task of measuring semantic similarity between terms can be found on the Compare with 1987 page.
The 1911 Thesaurus is available from Project Gutenberg.
Version 1.3 of Open Roget's has recently been uploaded. We are pleased to announce that Open Roget's is now available under a BSD License, which will allow more projects to make use of this resource. The release also brings some improvements to the code and documentation, including a new sentence relatedness program.
Acknowledgments
The original design and implementation of Open Roget's are due to Mario Jarmasz. His Master's thesis describes it all in detail; the system worked with the 1987 material. Olena Medelyan skillfully refitted Mario's implementation with the 1911 data and placed the system and additional resources on the Web; her version, now slightly outdated, runs on JRE 1.4.