Workflow that shows how to train neural networks on EC2 instances with GPU support and compares training times to CPUs.

Workflow that shows how to train neural networks on EC2 instances with GPU support. The goal is to present a simple and stable setup to train on GPU instances by using Docker and the NVIDIA Container Runtime nvidia-docker. A minimal example is given to train a small CNN built in Keras on MNIST. We achieve a 30-fold speedup in training time when training on GPU versus CPU.

Getting started

  1. Install Docker
  2. Install Docker Machine
  3. Install AWS Command Line Interface

Train locally on CPU

  1. Build Docker image for CPU
docker build -t docker-keras .
  1. Run training container (NB: you might have to increase the container resources [link])
docker run docker-keras

Train remote on GPU

  1. Configure your AWS CLI. Ensure that your account has limits for GPU instances [link]
aws configure
  1. Launch EC2 instance with Docker Machine. Choose an Ubuntu AMI based on your region (https://cloud-images.ubuntu.com/locator/ec2/).
    For example, to launch a p2.xlarge EC2 instance named ec2-p2 with a Tesla K80 GPU run
    (NB: change region, VPC ID and AMI ID as per your setup)
docker-machine create --driver amazonec2 \
                      --amazonec2-region eu-west-1 \
                      --amazonec2-ami ami-58d7e821 \
                      --amazonec2-instance-type p2.xlarge \
                      --amazonec2-vpc-id vpc-abc \
  1. ssh into instance
docker-machine ssh ec2-p2
  1. Update NVIDIA drivers and install nvidia-docker (see this blog post for more details)
# update NVIDIA drivers
sudo add-apt-repository ppa:graphics-drivers/ppa -y
sudo apt-get update
sudo apt-get install -y nvidia-375 nvidia-settings nvidia-modprobe

# install nvidia-docker
wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
sudo dpkg -i /tmp/nvidia-docker_1.0.1-1_amd64.deb && rm /tmp/nvidia-docker_1.0.1-1_amd64.deb
  1. Run training container on GPU instance
sudo nvidia-docker run idealo/nvidia-docker-keras

This will pull the Docker image idealo/nvidia-docker-keras from DockerHub and start the training.
The corresponding Dockerfile can be found under Dockerfile_gpu for reference.

Training time comparison

We trained MNIST for 3 epochs (~98% accuracy on validation set):

• MacBook Pro (2.8 GHz Intel Core i7, 16GB RAM): 620 seconds

• p2.xlarge (Tesla K80): 41 seconds

• p3.2xlarge (Tesla V100): 20 seconds