The Open Roget’s Project: A freely available NLP-friendly implementation of the 1911 Roget's Thesaurus

The Open Roget's Project provides a fully functional lexical resource for Natural Language Processing, based on Roget's Thesaurus. A Java implementation with the 1911 data now has a significantly updated lexicon. The process of updating Roget’s Thesaurus is documented in this paper:

Alistair Kennedy, Stan Szpakowicz (2014). Evaluation of Automatic Updates of Roget’s Thesaurus. Journal of Language Modelling 2(1), 1–49.

(open access; download it at the JLM site)

To get Open Roget’s, visit Alistair Kennedy's resource page, or download the tarred and gzipped thesaurus directly. It is available under the Attribution-ShareAlike 4.0 International Licence (CC BY-SA 4.0; details are in the README file inside the archive).

Project Gutenberg offers the unedited 1911 Roget's Thesaurus, which is not quite NLP-friendly.

Please direct questions and comments to Alistair Kennedy or to Stan Szpakowicz.

Thanks to Mario Jarmasz, the author of the original system, which was filled with limited-access data, and to Alyona Medelyan for retooling that system to work with the public-domain 1911 data.