A set of scripts and tools for converting between numbers and words encoded with the major system.
Conversion is based on phonetics rather than spelling. Results are sorted by word frequency and annotated with their part of speech.
For an explanation of the major system, see its Wikipedia page.
CLI client for looking up words for a given number.
Build with go build, then run:
./msc -d assets/major_system_lookup_250k.csv
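The lookup itself can be pictured as grouping dataset rows by their number and ordering each group by frequency. The snippet below is only a sketch of that idea, not the Go client's actual code: the row layout, the function names `build_index` and `lookup`, and the frequency counts are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (word, number, frequency count). The counts are invented.
ROWS = [
    ("motor", "314", 120),
    ("meter", "314", 300),
    ("tax", "170", 80),
]

def build_index(rows):
    """Group words by their major system number, most frequent word first."""
    groups = defaultdict(list)
    for word, number, count in rows:
        groups[number].append((count, word))
    return {
        number: [word for _, word in sorted(entries, key=lambda e: -e[0])]
        for number, entries in groups.items()
    }

def lookup(index, number):
    """Return the candidate words for a number, or an empty list if none exist."""
    return index.get(number, [])

index = build_index(ROWS)
print(lookup(index, "314"))
```

Numbers are kept as strings here so that leading zeros survive.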
Resulting words are sorted by frequency (most frequent to least frequent) and styled based on their frequency and part of speech. I’m not good at designing UI, so this could use some improvement, but here’s roughly how to read it:
Italic & Underlined means the word is within the 500 most common words.
Underlined means the word is within the 1000 most common words.
Italic means the word is within the 2500 most common words.
Dimmed colors mean the word is NOT in the 10000 most common words.
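As a rough illustration, the frequency tiers above could be expressed as a small function. This is a sketch under assumptions: the function name `styles_for_rank` is hypothetical, ranks are taken to be 1-based with inclusive thresholds, and words ranked between 2501 and 10000 are assumed to get no extra styling, which the list above does not state explicitly.

```python
def styles_for_rank(rank):
    """Map a 1-based frequency rank to a set of text styles.

    Thresholds follow the list above; treating them as inclusive is an assumption.
    """
    if rank <= 500:
        return {"italic", "underline"}
    if rank <= 1000:
        return {"underline"}
    if rank <= 2500:
        return {"italic"}
    if rank > 10000:
        return {"dim"}
    # Assumed: ranks 2501 to 10000 are shown without extra styling.
    return set()

for rank in (42, 750, 2000, 5000, 50000):
    print(rank, sorted(styles_for_rank(rank)))
```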
Part of Speech
Adjectives are blue tones, nouns are magenta, and verbs are yellow.
The most desirable words use that color as their background: these are singular nouns and the base forms of verbs.
Words that use it as the foreground color are plurals, other verb tenses, and so on.
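The color scheme above can be sketched as a mapping from part-of-speech tags to a color and a layer. This assumes Penn Treebank style tags such as NN, NNS, VB, and JJ, which is the tag set textblob produces; whether the dataset stores tags in exactly this form is an assumption. The name `style_for_tag` is hypothetical, and treating plain JJ as the "base" adjective form is a guess, since the description only mentions nouns and verbs.

```python
# Color families from the description above, keyed by Penn Treebank tag prefix.
FAMILY_BY_PREFIX = {"JJ": "blue", "NN": "magenta", "VB": "yellow"}

# Tags shown with the color as background. NN and VB follow the description;
# including JJ is an assumption.
BASE_TAGS = {"NN", "VB", "JJ"}

def style_for_tag(tag):
    """Return (color, layer) for a tag, or None for other parts of speech."""
    color = FAMILY_BY_PREFIX.get(tag[:2])
    if color is None:
        return None
    layer = "background" if tag in BASE_TAGS else "foreground"
    return (color, layer)

for tag in ("NN", "NNS", "VB", "VBD", "JJ", "RB"):
    print(tag, style_for_tag(tag))
```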
Script for creating a major system dataset. Each entry contains a word, the number that word decodes to using the major system, the word's part of speech, its individual phonemes, and its frequency information.
Takes a Wikipedia word frequency dataset as input; see IlyaSemenov/wikipedia-word-frequency.
python create_dataset.py --frequency assets/enwiki-20210820-words-frequency.txt --output assets/major_system_lookup.csv
The script uses g2p (which relies on cmudict) to get the phonemes of each word, and textblob to determine its part of speech. Both may be inaccurate in some cases.
Running the script on the whole Wikipedia dump takes about 9 hours on my machine, so you may prefer to use one of the provided datasets.
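To show why phonemes matter, here is a minimal sketch of turning a phoneme list into a number. It assumes ARPAbet symbols as used by cmudict (with optional stress digits such as OW1) and the conventional major system sound-to-digit table. The names `PHONEME_TO_DIGIT` and `encode` are hypothetical, and the handling of ER and NG in particular is a judgment call that create_dataset.py may make differently.

```python
# Conventional major system mapping over ARPAbet consonants.
# Vowels and HH, W, Y carry no digit and are skipped.
PHONEME_TO_DIGIT = {
    "S": "0", "Z": "0",
    "T": "1", "D": "1", "TH": "1", "DH": "1",
    "N": "2",
    "M": "3",
    "R": "4", "ER": "4",   # assumption: r-colored vowel counts as an r sound
    "L": "5",
    "SH": "6", "CH": "6", "JH": "6", "ZH": "6",
    "K": "7", "G": "7", "NG": "7",  # assumption: some variants encode NG as 27
    "F": "8", "V": "8",
    "P": "9", "B": "9",
}

def encode(phonemes):
    """Convert a list of ARPAbet phonemes to a major system digit string."""
    digits = []
    for phoneme in phonemes:
        base = phoneme.rstrip("012")  # drop the stress marker, e.g. OW1 becomes OW
        digit = PHONEME_TO_DIGIT.get(base)
        if digit is not None:
            digits.append(digit)
    return "".join(digits)

print(encode(["M", "OW1", "T", "ER0"]))  # phonemes for "motor"
```

Working from sounds rather than letters means, for example, that a silent letter contributes no digit and that "ph" counts as an f sound.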
Experimental Python CLI I quickly hacked together to test the dataset.
python major_system_converter.py --dataset assets/major_system_lookup_250k.csv
Contains the latest Wikipedia word frequency dataset I could find, as well as precomputed major system datasets created using create_dataset.py.