End-to-end text to speech system using gruut and onnx. There are 40 voices available across 8 languages.
$ docker run -it -p 5002:5002 rhasspy/larynx:en-us
Larynx's goals are:
- "Good enough" synthesis to avoid using a cloud service
- Faster than realtime performance on a Raspberry Pi 4 (with low quality vocoder)
- Broad language support (8 languages)
- Voices trained purely from public datasets
Pre-built Docker images for each language are available for the following platforms:
linux/arm64- Raspberry Pi 64-bit
linux/arm/v7- Raspberry Pi 32-bit
Run the Larynx web server with:
$ docker run -it -p 5002:5002 rhasspy/larynx:<LANG>
<LANG> is one of:
en-us- U.S. English
A larger docker image with all languages is also available as
Pre-built Debian packages are available for download.
There are three different kinds of packages, so you can install exactly what you want and no more:
- Base Larynx code and dependencies (always required)
ARCHis one of
amd64(most desktops, laptops),
armhf(32-bit Raspberry Pi),
arm64(64-bit Raspberry Pi)
- Language-specific data files (at least one required)
- See above for a list of languages
- Voice-specific model files (at least one required)
- See samples to decide which voice(s) to choose
As an example, let's say you want to use the "harvard-glow_tts" voice for English on an
amd64 laptop for Larynx version 0.4.0. You would need to download these files:
Once downloaded, you can install the packages all at once with:
sudo apt install \ ./larynx-tts_0.4.0_amd64.deb \ ./larynx-tts-lang-en-us_0.4.0_all.deb \ ./larynx-tts-voice-en-us-harvard-glow-tts_0.4.0_all.deb
From there, you may run the
larynx command or
larynx-server to start the web server.
$ pip install larynx
For Raspberry Pi (ARM), you will first need to manually install phonetisaurus.
Larynx uses gruut to transform text into phonemes. You must install the appropriate gruut language before using Larynx. U.S. English is included with gruut, but for other languages:
$ python3 -m gruut <LANGUAGE> download
Voices and vocoders are available to download from the release page. They can be extracted anywhere, and the directory simply needs to be referenced in the command-line (e,g,
You can run a local web server with:
$ python3 -m larynx.server --voices-dir /path/to/voices
The following default settings can be applied (for when they're not provided in an API call):
--quality- vocoder quality (high/medium/low, default: high)
--noise-scale- voice volatility (0-1, default: 0.333)
--length-scale- voice speed (<1 is faster, default: 1.0)
You may also set
--voices-dir to change where your voices/vocoders are stored. The directory structure should be
--help for more options.
MaryTTS Compatible API
$ docker run -it -p 59125:5002 rhasspy/larynx:<LANG>
/process HTTP endpoint should now work for voices formatted as
<LANG>/<VOICE> such as
You can specify the vocoder by adding
;<VOCODER> to the MaryTTS voice.
en-us/harvard-glow_tts;hifi_gan:vctk_small will use the lowest quality (but fastest) vocoder. This is usually necessary to get decent performance on a Raspberry Pi.
Available vocoders are:
hifi_gan:universal_large(best quality, slowest, default)
hifi_gan:vctk_small(lowest quality, fastest)
The command below synthesizes multiple sentences and saves them to a directory. The
--csv command-line flag indicates that each sentence is of the form
id will be the name of the WAV file.
$ cat << EOF | s01|The birch canoe slid on the smooth planks. s02|Glue the sheet to the dark blue background. s03|It's easy to tell the depth of a well. s04|These days a chicken leg is a rare dish. s05|Rice is often served in round bowls. s06|The juice of lemons makes fine punch. s07|The box was thrown beside the parked truck. s08|The hogs were fed chopped corn and garbage. s09|Four hours of steady work faced us. s10|Large size in stockings is hard to sell. EOF larynx \ --debug \ --csv \ --voice harvard-glow_tts \ --quality high \ --output-dir wavs \ --denoiser-strength 0.001
You can use the
--interactive flag instead of
--output-dir to type sentences and have the audio played immediately using the
play command from
The GlowTTS voices support two additional parameters:
--noise-scale- determines the speaker volatility during synthesis (0-1, default is 0.333)
--length-scale- makes the voice speaker slower (> 1) or faster (< 1)
--denoiser-strength- runs the denoiser if > 0; a small value like 0.005 is recommended.
List Voices and Vocoders
$ larynx --list