major-system-converter

A set of scripts and tools for converting between numbers and words encoded with the major system.

Conversion is based on phonemes rather than spelling. Results are sorted by word frequency and annotated with their part of speech.
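As a rough illustration of why phonetics matter, here is a minimal Python sketch of a phoneme-to-digit conversion. The mapping table and the function name `phonemes_to_number` are assumptions for illustration, not the code used in this repository. It assumes ARPAbet phonemes in the cmudict style, and how `TH`/`DH`, `NG`, and `ER` should be treated varies between major system variants, so those choices are guesses.

```python
# Hypothetical mapping from ARPAbet consonant phonemes to major system digits.
# TH/DH -> 1 is an assumption; NG and ER are deliberately left unmapped here.
PHONEME_TO_DIGIT = {
    "S": "0", "Z": "0",
    "T": "1", "D": "1", "TH": "1", "DH": "1",
    "N": "2",
    "M": "3",
    "R": "4",
    "L": "5",
    "CH": "6", "JH": "6", "SH": "6", "ZH": "6",
    "K": "7", "G": "7",
    "F": "8", "V": "8",
    "P": "9", "B": "9",
}


def phonemes_to_number(phonemes):
    """Convert a list of ARPAbet phonemes to a digit string.

    Vowels and any phoneme missing from the table are skipped.
    """
    digits = []
    for phoneme in phonemes:
        # Vowels carry a stress marker (0, 1 or 2) at the end; strip it.
        base = phoneme.rstrip("012")
        digit = PHONEME_TO_DIGIT.get(base)
        if digit is not None:
            digits.append(digit)
    return "".join(digits)


cat = phonemes_to_number(["K", "AE1", "T"])    # "cat"
knife = phonemes_to_number(["N", "AY1", "F"])  # "knife": the written k is silent
```

With this table, "knife" encodes only the sounds N and F, so the silent k does not contribute a digit, which a letter-based converter would get wrong.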

For an explanation of the major system, check out the Wikipedia page.

To learn the major system, check out my Anki deck (GitHub repository).

msc.go

CLI client for looking up words for a given number.

Compile using go build, run using ./msc.

Example:

./msc -d assets/major_system_lookup_250k.csv

Results

Resulting words are sorted by frequency (most frequent to least frequent) and styled based on their frequency and part of speech. I’m not good at designing UI, so this could use some improvement, but here’s roughly how to read it:

Frequency

Italic & Underlined means the word is within the 500 most common words.

Underlined means the word is within the 1000 most common words.

Italic means the word is within the 2500 most common words.

Dimmed colors mean the word is NOT in the 10000 most common words.
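The frequency tiers above can be read as a first-match rule. The following Python sketch shows one way to express it; `frequency_style` is a hypothetical helper, not the Go code in msc.go, and it assumes a 1-based rank where 1 is the most common word. Under this reading, words ranked between 2501 and 10000 get no frequency styling at all.

```python
def frequency_style(rank):
    """Return style names for a 1-based frequency rank (1 = most common).

    Tiers are checked from most to least common, so the first match wins.
    """
    if rank <= 500:
        return ("italic", "underline")
    if rank <= 1000:
        return ("underline",)
    if rank <= 2500:
        return ("italic",)
    if rank > 10000:
        return ("dim",)
    # Ranks 2501..10000: no frequency styling (assumed from the tiers above).
    return ()


top_word_style = frequency_style(1)
rare_word_style = frequency_style(20000)
```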

Part of Speech

Adjectives are blue tones, nouns are magenta, verbs are yellow.

The most desirable words use that color as their background. These are singular nouns and the base form of verbs.

Words that use it as the foreground color are plurals, other tenses of verbs, etc.

create_dataset.py

Script for creating a major system dataset. Each entry contains a word, the number that word decodes to using the major system, the word's part of speech, its individual phonemes, and its frequency information.

Takes a Wikipedia word frequency dataset as input; see IlyaSemenov/wikipedia-word-frequency.
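To give a feel for how such a dataset can be used for lookups, here is a small Python sketch that groups rows by number and sorts each group by frequency. The column names (`word`, `number`, `pos`, `phonemes`, `count`) and the sample rows, including their counts, are made up for illustration; check the header of the actual CSV files before relying on them.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows. The column names and their order are assumptions,
# not the real layout of the provided datasets.
SAMPLE = """word,number,pos,phonemes,count
kit,71,NN,K IH1 T,45
cat,71,NN,K AE1 T,120
knife,28,NN,N AY1 F,80
"""


def build_index(csv_file):
    """Group rows by their number, most frequent word first."""
    index = defaultdict(list)
    for row in csv.DictReader(csv_file):
        index[row["number"]].append(row)
    for rows in index.values():
        rows.sort(key=lambda r: int(r["count"]), reverse=True)
    return index


index = build_index(io.StringIO(SAMPLE))
words_for_71 = [row["word"] for row in index["71"]]
```

For a real file, `io.StringIO(SAMPLE)` would be replaced by an opened CSV file, with the column names adjusted to match its header.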

Example:

python create_dataset.py --frequency assets/enwiki-20210820-words-frequency.txt --output assets/major_system_lookup.csv

This uses g2p (which relies on cmudict) to get the phonemes for each word, and textblob to determine the part of speech. Both may be inaccurate in some cases.

Running the script on the whole Wikipedia dump takes about 9 hours on my machine, so consider using one of the provided datasets instead.