About Clean Roget
Who was Roget?
Peter Mark Roget (1779 – 1869) was a British physician, scientist, and lexicographer best known for creating Roget’s Thesaurus. Rather than organizing words alphabetically, Roget organized language by ideas, concepts, and relationships between meanings.
What is Roget’s Thesaurus?
First compiled by Roget in 1805 and published in 1852, Roget’s Thesaurus of English Words and Phrases is a conceptual thesaurus: a structured map of English vocabulary organized by semantic categories. Its hierarchy includes classes, divisions, sections, subsections, semantic heads, parts of speech, and terms.
Clean Roget works from the public-domain Project Gutenberg source: Roget’s Thesaurus on Project Gutenberg.
Why this project?
It all started when I joined the Evaluating Variations in Languages Lab (informally known as EViL, haha) at Duquesne University during my sophomore year of college under Dr. Patrick Juola, who is best known for using forensic stylometry to show that J. K. Rowling wrote the novel The Cuckoo's Calling under the pseudonym "Robert Galbraith". Our lab group delved into adjectives and adverbs, on the theory that these are the most subjective parts of speech an author can use, so there might be stylistic patterns to uncover that would be ground-breaking in the field of authorship attribution and stylometric analysis. Dr. Juola introduced me to Roget's Thesaurus; if I could clean it up, it would be useful for our adjective categorizer as a guidebook for categorizing adjectives (and adverbs) semantically.
When I set out on this project in 2022, Python's BeautifulSoup library could not clean up this mess of a page. It was so messy that I was able to use it for a grad class thesis project (which you can see here). Now, in 2026, the tools have improved, but the page was still difficult to clean! I still had to write my own parser, but I was determined to finally clean up Roget's Thesaurus because I believe it can be a valuable starting point for developing more tools for stylometric analysis and authorship attribution studies.
Besides all that, I'm a nerd for literature. I enjoy reading as a pastime and I also love to analyze patterns (or lack thereof) among texts.
Clean Roget began as an attempt to turn a historically important but difficult-to-use semantic reference work into a clean, machine-readable ontology. The goal is not only to preserve Roget’s structure, but to make it usable for computational humanities, literary analysis, semantic browsing, and exploratory text comparison.
Instead of asking only what words appear in a text, this project asks what kinds of ideas, concepts, and semantic categories shape that text.
How it was built
The original source contains dense historical formatting, inconsistent structure, archaic symbols, and long semantic blocks. Clean Roget parses the source text, reconstructs Roget’s hierarchy, cleans the semantic entries, extracts term-level data, and exports the result as machine-readable JSON and CSV using Python, JavaScript, Jupyter Notebook, and HTML/CSS.
The site then uses that cleaned ontology to power browser-based tools for semantic analysis, text comparison, corpus comparison, semantic clustering, and network visualization.
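To make the export step concrete, here is a minimal sketch of flattening a nested hierarchy into term-level rows and serializing it as JSON and CSV. The `ontology` dictionary, its field names, and the `flatten` and `to_csv` helpers are hypothetical illustrations, not the project's actual schema or parser.

```python
import csv
import io
import json

# Hypothetical miniature of the hierarchy; the real export's field names may differ.
ontology = {
    "class": "Abstract Relations",
    "sections": [
        {
            "section": "Existence",
            "heads": [
                {
                    "head": "Existence",
                    "terms": {
                        "noun": ["being", "entity"],
                        "adjective": ["existing", "real"],
                    },
                }
            ],
        }
    ],
}

FIELDS = ["class", "section", "head", "pos", "term"]


def flatten(node):
    """Yield one row per term, carrying its position in the hierarchy."""
    for section in node["sections"]:
        for head in section["heads"]:
            for pos, terms in head["terms"].items():
                for term in terms:
                    yield {
                        "class": node["class"],
                        "section": section["section"],
                        "head": head["head"],
                        "pos": pos,
                        "term": term,
                    }


def to_csv(rows):
    """Serialize the term rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = list(flatten(ontology))
json_text = json.dumps(ontology, indent=2)  # nested form, suited to browsing
csv_text = to_csv(rows)                     # flat form, one term per line
```

Keeping both shapes is a design convenience: the nested JSON preserves the hierarchy for browsing, while the flat CSV is easier to load into a spreadsheet or a data frame.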
What can you do with it?
- Browse Roget’s hierarchy in a cleaner, readable format.
- Download machine-readable Roget datasets.
- Analyze a pasted text or uploaded .txt file.
- Compare two texts or authors by semantic profile.
- Compare corpora made from multiple text files.
- Explore semantic heads as an interactive network.
- Generate CSV reports of semantic category counts.
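As a rough picture of what comparing two texts by semantic profile can mean, the sketch below counts how often each semantic head is triggered in a text and then compares two such count profiles. The tiny `TERM_TO_HEADS` lookup is hypothetical, and cosine similarity is an assumed choice of measure for illustration; the site may well use a different one.

```python
import math
import re
from collections import Counter

# Hypothetical term-to-head lookup; the real dataset maps far more terms,
# and a single term can belong to several heads.
TERM_TO_HEADS = {
    "love": ["Love"],
    "hate": ["Hate"],
    "dark": ["Darkness"],
    "night": ["Darkness"],
    "light": ["Light", "Levity"],
}


def head_profile(text):
    """Count how often each semantic head is triggered by words in the text."""
    counts = Counter()
    for word in re.findall(r"[a-z']+", text.lower()):
        for head in TERM_TO_HEADS.get(word, []):
            counts[head] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two head-count profiles (0.0 if either is empty)."""
    dot = sum(a[h] * b[h] for h in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


first = head_profile("Love is light in the dark night.")
second = head_profile("Hate grows in the dark.")
similarity = cosine_similarity(first, second)
```

Note that an ambiguous word such as "light" counts toward every head it belongs to in this sketch, which is one reason broad heads can inflate similarity scores.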
Best used with Project Gutenberg texts
Clean Roget works best with plain-text literary works, essays, plays, speeches, and other long-form texts. Public-domain texts from Project Gutenberg are especially good inputs because they are free, accessible, and usually available as .txt files.
Limitations and future work
Roget’s ontology is historically rich, but it was not designed as a modern NLP model. The original source is ugly, ugly, ugly! Some semantic categories are broad, archaic, or unevenly distributed, which can make literary texts appear more similar than they intuitively feel. Some categories are so general that I don't even know what they mean (yet). Clean Roget is therefore both a tool and an ongoing experiment in ontology-based text analysis.
Future work may include semantic stop-head filtering, stronger weighting methods, dimensionality reduction, dendrograms, corpus-level IDF, and hybrid approaches that combine Roget’s ontology with modern embeddings.
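Two of these ideas, semantic stop-head filtering and corpus-level IDF, can be sketched in a few lines. This is only one possible reading of them, not a planned implementation: the `head_idf` and `reweight` helpers, the sample counts, and the choice of a smoothed IDF formula are all assumptions made for illustration.

```python
import math
from collections import Counter


def head_idf(profiles):
    """Smoothed inverse document frequency per head: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(profiles)
    doc_freq = Counter()
    for profile in profiles:
        # Count each head once per document in which it appears.
        doc_freq.update(head for head, count in profile.items() if count > 0)
    return {
        head: math.log((1 + n_docs) / (1 + df)) + 1
        for head, df in doc_freq.items()
    }


def reweight(profile, idf, stop_heads=frozenset()):
    """Scale raw counts by IDF and drop any heads listed as stop-heads."""
    return {
        head: count * idf.get(head, 1.0)
        for head, count in profile.items()
        if head not in stop_heads
    }


# Hypothetical head counts for three texts.
corpus = [
    Counter({"Existence": 12, "Love": 3}),
    Counter({"Existence": 9, "Darkness": 4}),
    Counter({"Existence": 15, "Love": 1}),
]
idf = head_idf(corpus)
weighted = reweight(corpus[0], idf, stop_heads={"Existence"})
```

In this toy corpus "Existence" appears in every text, so it receives the lowest IDF weight and is also a natural candidate for the stop-head list, while rarer heads such as "Darkness" are weighted up.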