The aim of this project is to make analysing the contents
of a Japanese ebook easy and to streamline the process
for non-technical users. You can analyse an ebook and see the following:
- The length of the book in words
- The length of the book in characters
- The number of unique words used in the book
- The number of unique words that are only used once in the book
- The percentage of unique words that are only used once
- The number of unique characters used
- The number of unique characters that are only used once
- The percentage of unique characters that are only used once
- A list of all the words used in the book as well as how often they
are used
- A list of all the characters used in the book as well as how often
they are used
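As a rough illustration of how the statistics above relate to one another, here is a minimal sketch in Python. It is not taken from the project's code: the `summarise` function and the sample word list are hypothetical, and the words are assumed to be already tokenised (in the real app that step is handled by MeCab).

```python
from collections import Counter


def summarise(items):
    """Return count statistics for a sequence of words or characters."""
    counts = Counter(items)
    unique = len(counts)
    # Items that occur exactly once in the whole sequence.
    used_once = sum(1 for n in counts.values() if n == 1)
    return {
        "length": len(items),
        "unique": unique,
        "used_once": used_once,
        "used_once_percentage": 100 * used_once / unique if unique else 0.0,
        # (item, count) pairs, most frequent first.
        "frequencies": counts.most_common(),
    }


# Hypothetical pre-tokenised input; real tokens would come from MeCab.
words = ["猫", "が", "好き", "犬", "が", "好き"]
word_stats = summarise(words)
# Character statistics use the same function on the individual characters.
char_stats = summarise([ch for word in words for ch in word])
```

The same function serves both the word and the character statistics, since each is just a frequency count over a different unit.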
For text processing, we use MeCab.
Currently, the project is not deployed anywhere, so to use the service,
you will need to follow the steps below in the development section to
get the server running.
- Upload a .epub file containing Japanese text to the server
- The server will redirect you to a page showing you information about the ebook.
You can then also click the ‘See more details’ button to see all the generated
data, including a list of all the words used together with how many occurrences there
are for each word, and the same for the characters as well.
- Clone repository:
git clone https://github.com/christofferaakre/japanese-ebook-analysis.git
- Make sure you have mecab set up on your system. See
for a good guide on how to set it up.
(Only required if you will actually upload ebooks or run the
which you will not need to do to contribute to other parts of the app.)
- Install python dependencies:
pip install -r requirements.txt
- Install other dependencies (these all need to be in your system path):
- Run ./app.py to start the Flask dev server
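If the server fails to start, it may help to confirm that the external tools are actually on your system path. The sketch below is one way to check this from Python. The `missing_executables` helper is hypothetical and not part of the repo, and `mecab` is used only as an example executable name; add whichever other dependencies apply to your setup.

```python
import shutil


def missing_executables(names):
    """Return the names that cannot be found on the system path."""
    return [name for name in names if shutil.which(name) is None]


# "mecab" is an example name (assumption); extend the list as needed.
required = ["mecab"]
missing = missing_executables(required)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required executables were found.")
```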
I’m very happy to receive any contributions! Before contributing, please
have a look at
Feel free to submit your own issue or pull request about a new feature or anything
else. When submitting a pull request, don’t be afraid to modify any of the files;
I’m not very attached to the coding style used in the repo.