Over the past year I've become quite a fan of AWS Elastic Container Service (ECS). It lets you create services based on Docker containers in an easy and understandable way. I won't dive into the details of how ECS works: Amazon has excellent documentation on the topic, and there are also plenty of well-explained blog posts such as this one.

Recently I had a workload that was processed by a neural network (an OpenNMT model, to be precise). The client imposed a strict KPI on processing time. It was very hard to meet this KPI with CPU decoding, so I went with GPU decoding instead. However, this introduced a bit of extra work, since the ECS-optimized Amazon Linux images do not come with the Nvidia drivers or CUDA installed.

The solution is quite simple: launch a Linux image that already contains the Nvidia software and use the user data to run a script that configures your EC2 instance. I chose the Deep Learning AMI (Ubuntu) provided by Amazon, since it comes not only with all the necessary Nvidia drivers and CUDA, but also with plenty of machine learning frameworks such as TensorFlow, MXNet and more.
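
If you're wondering how to get the script below onto such an instance, a minimal sketch with the AWS CLI could look like this. The AMI ID, instance type, key pair, security group and script file name are placeholders, and I'm assuming the standard ecsInstanceRole instance profile so the ECS agent is allowed to register with your cluster; adapt everything to your own account and region.

# Minimal sketch: launch a GPU instance from the Deep Learning AMI (Ubuntu)
# and pass the configuration script below as user data.
# All IDs are placeholders for your own account/region.
aws ec2 run-instances \
    --image-id ami-xxxxxxxx \
    --instance-type p2.xlarge \
    --key-name my-key-pair \
    --security-group-ids sg-xxxxxxxx \
    --iam-instance-profile Name=ecsInstanceRole \
    --user-data file://configure-gpu-ecs.sh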

Our script needs to do just two things: install and configure the ECS agent, and make nvidia-docker the default runtime for Docker. This last step is crucial, since it is what allows our Docker containers to access the GPU(s) on the instance. The script below was tested on various GPU EC2 instance types running the Deep Learning AMI (Ubuntu) version 14.0. You will definitely need to make some changes if you wish to use the Deep Learning AMI based on Amazon Linux.

#!/usr/bin/env bash
set -x
CLUSTER_NAME=mycluster

# Configures an Ubuntu 16.04 instance to run the ECS agent using systemd
# https://docs.aws.amazon.com/AmazonECS/latest/developerguide/example_user_data_scripts.html
# Install docker
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -
add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
apt-get update
apt-get install -y docker-ce
# Set iptables rules so requests to the task IAM role credentials endpoint (169.254.170.2) are forwarded to the ECS agent
echo 'net.ipv4.conf.all.route_localnet = 1' >> /etc/sysctl.conf
sysctl -p /etc/sysctl.conf
iptables -t nat -A PREROUTING -p tcp -d 169.254.170.2 --dport 80 -j DNAT --to-destination 127.0.0.1:51679
iptables -t nat -A OUTPUT -d 169.254.170.2 -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 51679
# Write iptables rules to persist after reboot (assumes the iptables-persistent package is installed so they are reloaded at boot)
iptables-save > /etc/iptables/rules.v4
# Create directories for ECS agent
mkdir -p /var/log/ecs /var/lib/ecs/data /etc/ecs
# Write ECS config file
cat << EOF > /etc/ecs/ecs.config
ECS_DATADIR=/data
ECS_ENABLE_TASK_IAM_ROLE=true
ECS_ENABLE_TASK_IAM_ROLE_NETWORK_HOST=true
ECS_LOGFILE=/log/ecs-agent.log
ECS_AVAILABLE_LOGGING_DRIVERS=["json-file","awslogs"]
ECS_LOGLEVEL=info
ECS_CLUSTER=${CLUSTER_NAME}
EOF

# Write systemd unit file
cat << EOF > /etc/systemd/system/docker-container@ecs-agent.service
[Unit]
Description=Docker Container %I
Requires=docker.service
After=docker.service

[Service]
Restart=always
ExecStartPre=-/usr/bin/docker rm -f %i
ExecStart=/usr/bin/docker run --name %i \
--restart=on-failure:10 \
--volume=/var/run:/var/run \
--volume=/var/log/ecs/:/log \
--volume=/var/lib/ecs/data:/data \
--volume=/etc/ecs:/etc/ecs \
--net=host \
--env-file=/etc/ecs/ecs.config \
amazon/amazon-ecs-agent:latest
ExecStop=/usr/bin/docker stop %i

[Install]
WantedBy=default.target
EOF

# ------------------------
# Install nvidia-docker2
# ------------------------
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list
apt-get update
apt-get -qq install -y nvidia-docker2
cat <<"EOF" > /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF

# Reload the Docker daemon configuration so the nvidia runtime becomes the default
pkill -SIGHUP dockerd

# ------------------------
# Start ecs-agent at the very end
# ------------------------
# Make systemd pick up the new unit file, then enable and start the agent
systemctl daemon-reload
systemctl enable docker-container@ecs-agent.service
systemctl start docker-container@ecs-agent.service

After a few minutes your instance should appear in your ECS cluster. If it doesn't, check the /var/log/cloud-init-output.log file on your EC2 instance for errors.
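
To double-check that everything works, you can also SSH into the instance: the ECS agent exposes an introspection endpoint on port 51678, and since nvidia is now the default runtime, a plain docker run should be able to see the GPU. The nvidia/cuda:9.0-base image below is just an example; any CUDA-enabled image will do.

# Check which cluster the ECS agent registered with
curl -s http://localhost:51678/v1/metadata
# A plain docker run should now have GPU access via the nvidia runtime
docker run --rm nvidia/cuda:9.0-base nvidia-smi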

Enjoy your cluster of GPU instances, but make sure to set some billing alarms to keep your costs in check.
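
As a rough sketch, such an alarm on the estimated monthly charges can be created with the AWS CLI along these lines; the threshold and SNS topic ARN are placeholders, and note that billing metrics live in us-east-1 and must first be enabled in your billing preferences.

# Alarm when the estimated monthly charges exceed a (placeholder) threshold
aws cloudwatch put-metric-alarm \
    --region us-east-1 \
    --alarm-name monthly-billing-alarm \
    --namespace AWS/Billing \
    --metric-name EstimatedCharges \
    --dimensions Name=Currency,Value=USD \
    --statistic Maximum \
    --period 21600 \
    --evaluation-periods 1 \
    --threshold 100 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:my-billing-topic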