A simple project that separates a mixed audio signal (two overlapping clean voices) into two separate voices.
(Figures: the predicted spectrograms for Voice 1 and Voice 2.)
1. Quick train
Download LibriMixSmall, extract it, and move it to the root of the project.
Training takes ONLY about 2-3 HOURS on an ordinary GPU. After each epoch, a sample prediction is generated to
2. Quick inference
Run `./inference.sh`. The result will be generated to
3. More detail
Input: the complex spectrogram, computed from the raw mixed audio signal.
Output: the complex ratio mask (cRM) → complex spectrogram → separated voices.
Model: a simplified version of this implementation, as defined in the paper Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation.
Dataset: a small version of the LibriMix dataset, downloaded from LibriMixSmall.
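The input/output pipeline above can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the repo's actual code: the STFT parameters (`n_fft`, `hop`) are assumed values, and the cRM here is a dummy all-ones mask standing in for the model's prediction.

```python
# Sketch of the pipeline: raw mix -> complex spectrogram -> cRM -> separated voice.
# n_fft/hop are illustrative, not the repo's actual config.
import torch

n_fft, hop = 512, 128
window = torch.hann_window(n_fft)

# Raw mixed signal (two overlapping voices); random noise here, for shape checks only.
mix = torch.randn(1, 16000)

# Input: complex spectrogram of the mixture.
spec = torch.stft(mix, n_fft=n_fft, hop_length=hop, window=window,
                  return_complex=True)  # shape (1, n_fft//2 + 1, frames)

# The model would predict one complex ratio mask (cRM) per speaker;
# faked here with an all-ones mask (identity separation).
crm = torch.ones_like(spec)

# Output: masked complex spectrogram -> separated waveform for one speaker.
sep_spec = crm * spec  # element-wise complex multiplication
voice = torch.istft(sep_spec, n_fft=n_fft, hop_length=hop, window=window,
                    length=mix.shape[-1])
```

With a real model, `crm` would come from the network and there would be one mask per speaker; the complex multiplication and inverse STFT stay the same.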
4. Current problem
Because the dataset was kept small for fast training, the model overfits the training set somewhat. Using a bigger dataset should help overcome that. Some suggestions:
- Use the original LibriMix dataset, which is much bigger (around 60 times larger than what I trained on).
- Use this work to download a much larger in-the-wild dataset, and use `datasets/VoiceMixtureDataset.py` instead of the Libri-based loader I am using. P.S. I have trained with it and it works too.
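To show the shape of data such a loader provides, here is a minimal toy dataset in the spirit of the mixture loaders mentioned above. The class name and on-the-fly mixing are hypothetical, not the repo's actual `VoiceMixtureDataset` implementation.

```python
# Toy stand-in for a voice-mixture dataset: pairs up clean utterances and
# returns (mixture, voice1, voice2). Illustrative only, not the repo's code.
import torch
from torch.utils.data import Dataset

class ToyVoiceMixtureDataset(Dataset):
    def __init__(self, voices1, voices2):
        # Each argument: tensor of shape (num_items, num_samples).
        assert len(voices1) == len(voices2)
        self.voices1, self.voices2 = voices1, voices2

    def __len__(self):
        return len(self.voices1)

    def __getitem__(self, i):
        v1, v2 = self.voices1[i], self.voices2[i]
        # The mixture is simply the sum of the two clean signals;
        # the clean signals serve as the training targets.
        return v1 + v2, v1, v2

# Synthetic example: 4 pairs of 1-second clips at 16 kHz.
ds = ToyVoiceMixtureDataset(torch.randn(4, 16000), torch.randn(4, 16000))
mix, v1, v2 = ds[0]
```

A real loader would read audio files from disk and possibly rescale the sources before summing, but the (mixture, target1, target2) contract is the same.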