Train Opus-MT models
This package includes scripts for training NMT models with MarianNMT on OPUS data for OPUS-MT. More details are given in the Makefile, but the documentation still needs improvement. Note that the targets require a specific environment and currently work well only on the CSC HPC cluster in Finland.
Pre-trained models
The subdirectory models contains information about pre-trained models that can be downloaded from this project. They are distributed under a CC-BY 4.0 license. More pre-trained models trained with the OPUS-MT training pipeline are available from the Tatoeba translation challenge, also under a CC-BY 4.0 license.
Quickstart
Setting up:
git clone https://github.com/Helsinki-NLP/OPUS-MT-train.git
git submodule update --init --recursive --remote
make install
Training a multilingual NMT model (Finnish and Estonian to Danish, Swedish and English):
make SRCLANGS="fi et" TRGLANGS="da sv en" train
make SRCLANGS="fi et" TRGLANGS="da sv en" eval
make SRCLANGS="fi et" TRGLANGS="da sv en" release
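A model trained with several target languages needs to be told which language to produce at inference time. For OPUS-MT multilingual models this is done by prepending a target-language token (e.g. `>>dan<<`; the exact token format depends on the model's vocabulary) to each source sentence. A minimal sketch of that preprocessing step, with an assumed helper name:

```python
def mark_target_language(sentences, lang_token):
    """Prepend a target-language token to each source sentence.

    Multilingual OPUS-MT models with more than one target language expect
    such a token (e.g. ">>dan<<") so the decoder knows which language to
    generate; check the model's vocabulary for the exact token format.
    """
    return [f"{lang_token} {sentence}" for sentence in sentences]

# Finnish input, asking a fi+et -> da+sv+en model for Danish output:
src = ["Hyvää huomenta!", "Kiitos paljon."]
print(mark_target_language(src, ">>dan<<"))
```

Bilingual models (a single target language) do not need this token.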
More information is available in the documentation linked below.
Documentation
- Installation and setup
- Details about tasks and recipes
- Information about back-translation
- Information about fine-tuning models
- How to generate pivot-language-based translations
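The back-translation recipe linked above follows the usual idea: a reverse model translates monolingual target-side text back into the source language, and the resulting synthetic pairs are mixed with the genuine parallel data. A minimal sketch of that idea, where `reverse_translate` is a hypothetical stand-in for a real reverse (e.g. da->fi) Marian model, not the actual OPUS-MT pipeline:

```python
def reverse_translate(sentence):
    # Hypothetical reverse model (da -> fi); a real setup would call a
    # trained Marian model here instead of a toy lookup table.
    lookup = {"God morgen!": "Hyvää huomenta!"}
    return lookup.get(sentence, "<synthetic source>")

def augment_with_backtranslation(parallel, monolingual_target):
    """Extend genuine (source, target) pairs with synthetic pairs whose
    source side is back-translated from monolingual target-side text."""
    synthetic = [(reverse_translate(t), t) for t in monolingual_target]
    return parallel + synthetic

data = augment_with_backtranslation(
    [("Kiitos paljon.", "Mange tak.")],   # genuine fi-da pair
    ["God morgen!"],                      # monolingual Danish
)
print(data)
```

The synthetic pairs are noisier than genuine ones, so in practice they are typically tagged or down-weighted during training.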
Tutorials
References
Please cite the following paper if you use OPUS-MT software and models:
@InProceedings{TiedemannThottingal:EAMT2020,
author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
title = {{OPUS-MT} — {B}uilding open translation services for the {W}orld},
booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
year = {2020},
address = {Lisbon, Portugal}
}
Acknowledgements
None of this would be possible without all the great open-source software this project builds on, including
- GNU/Linux tools
- Marian-NMT
- eflomal
… and many other tools like terashuf, pigz, jq, Moses SMT, fast_align, sacrebleu …
We would also like to acknowledge the support of the University of Helsinki, CSC (IT Center for Science), funding from projects in the EU Horizon 2020 framework (FoTran, MeMAD), and the contributors to OPUS, the open collection of parallel corpora.