Sinkhorn Label Allocation

Self-training is a standard approach to semi-supervised learning where the learner's own predictions on unlabeled data are used as supervision during training. Sinkhorn Label Allocation (SLA) models this label assignment process as an optimal transportation problem between examples and classes, wherein the cost of assigning an example to a class is mediated by the current predictions of the classifier. By efficiently approximating the solutions to these optimization problems using the Sinkhorn-Knopp algorithm, SLA can be used in the inner loop of standard stochastic optimization algorithms such as those used to train modern deep neural network architectures.
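To make the core idea concrete, here is a minimal NumPy sketch of the Sinkhorn-Knopp scaling that underlies this kind of label allocation. This is not the repository's implementation: the function name, the regularization strength, and the fixed iteration count are illustrative choices, and the sketch omits the annealing schedule and partial-assignment machinery described in the paper.

import numpy as np

def sinkhorn_label_allocation(log_probs, class_marginals, reg=0.1, num_iters=50):
    """Soft-assign examples to classes via entropy-regularized optimal transport.

    The cost of assigning example i to class j is taken to be -log_probs[i, j],
    so the Gibbs kernel exp(-cost / reg) favors the classifier's confident
    classes. Rows of the returned plan sum to 1/n (each example carries equal
    mass); columns sum to class_marginals (the target class proportions).
    """
    n, k = log_probs.shape
    # Shift by the max before exponentiating for numerical stability; a
    # constant shift only rescales the kernel, which the scaling absorbs.
    K = np.exp((log_probs - log_probs.max()) / reg)
    r = np.full(n, 1.0 / n)          # row marginal: equal mass per example
    c = np.asarray(class_marginals)  # column marginal: class proportions
    u = np.ones(n)
    for _ in range(num_iters):
        # Alternate column and row rescaling (the Sinkhorn-Knopp iteration).
        v = c / (K.T @ u)
        u = r / (K @ v)
    return u[:, None] * K * v[None, :]

# Toy usage: 8 unlabeled examples, 3 classes, uniform target proportions.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 3))
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
plan = sinkhorn_label_allocation(log_probs, np.full(3, 1.0 / 3))
soft_labels = plan / plan.sum(axis=1, keepdims=True)  # per-example label distributions

In the full method, these allocations are recomputed as the classifier's predictions evolve over the course of training, which is what makes the cheap Sinkhorn-Knopp approximation practical inside a stochastic optimization loop.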

If you've found this repository useful in your own work, please consider citing our paper:

@article{tai2021sinkhorn,
  author = {Kai Sheng Tai and Peter Bailis and Gregory Valiant},
  title = {{Sinkhorn Label Allocation: Semi-supervised classification via annealed self-training}},
  year = {2021},
  journal = {arXiv preprint arXiv:2102.08622},
}

Environment

We recommend using conda to install dependencies:

$ conda env create -f environment.yml
$ conda activate sinkhorn-label-allocation

Usage

SLA can be run with a basic set of options using the following command:

$ python run_sla.py --dataset cifar10 --data_path /tmp/data --output_dir /tmp/sla --run_id my_sla_run --num_labeled 40 --seed 1 --num_epochs 1024 

Similarly, the FixMatch baseline can be run using run_fixmatch.py:

$ python run_fixmatch.py --dataset cifar10 --data_path /tmp/data --output_dir /tmp/sla --run_id my_fixmatch_run --num_labeled 40 --seed 1 --num_epochs 1024 

The following datasets are currently supported: cifar10, cifar100, and svhn.

For the full list of command-line options, refer to the main() functions in run_sla.py and run_fixmatch.py.
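Assuming the entry points parse their flags with a standard argument parser such as Python's argparse (an assumption; check the scripts to confirm), the supported options can also be printed directly:

$ python run_sla.py --help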
