Docker for Data Science

  • madderle
  • Dec 12, 2017
  • 5 min read


One of the common themes you will see through many of my posts is the use of Docker (and Python). I have fallen in love with the tool, and I'm not planning on looking back. I've seen Docker used a lot in building web applications, and it definitely should be one of the tools the data scientist uses. A really good book I would recommend is Docker for Data Science by Joshua Cook.


But in a world of VMs and Python environments, why is Docker needed?


Advantages

Two words: isolated environments. Not only can you specify particular versions of Python modules (which is what you get with virtual environments), but you can also specify the version of Python itself. All of the code necessary can run inside of your little “container”. Furthermore, you can now assemble your application as a set of microservices. This also means that you can set up a development environment on a development machine and run that same setup in production. Doing this for data science means that the transition from developing models to deploying them is minimal.



What is Docker?

Simply put, Docker is a platform for managing containers. What are containers? A container is a lightweight, stand-alone, executable package of a piece of software that includes everything needed to run it. Available for Windows, Linux, and OS X, containerized software will always run the same, no matter the environment. Though containers and virtual machines have similar isolation benefits, they function differently. Containers are an abstraction layer that packages code and dependencies together. Virtual machines are an abstraction of physical hardware.




Key Docker Terms

You can find an overview of Docker here


Docker Engine: a client-server application with several components: a server (a daemon process), a REST API which programs can use to talk to the daemon, and a command-line interface.


Docker daemon:  (dockerd) listens for Docker API requests and manages images, containers, networks and volumes. 


Docker Client: (docker) is the primary way that many users interact with Docker. When you run commands such as docker run, the client sends these commands to the docker daemon which executes them. 

Docker Registries:  a registry stores Docker images. Docker Hub is a public registry that anyone can use. 


Images: An image is a template with instructions for creating a Docker container. Docker Hub is an online “store” for lots of different images.


Containers:  A container is a runnable instance of an image. You can create, start, stop, or delete a container using the command-line interface. 


Volumes:  Volumes are the preferred mechanism for persisting data generated by and used by Docker containers. 


Dockerfile: A file that contains the instructions to build an image. 


Docker Compose: a tool for defining and running multi-container Docker applications.
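
As a quick illustration of what Compose looks like, here is a hypothetical docker-compose.yml (the service names and images are my own choices, not from the book) that pairs a notebook service with a database:

```yaml
# Hypothetical two-service setup: a Jupyter notebook alongside a Postgres store
version: "3"
services:
  notebook:
    image: jupyter/base-notebook
    ports:
      - "8888:8888"        # map host port 8888 to the container's notebook port
    volumes:
      - notebook-data:/home/jovyan/work
  db:
    image: postgres:10
    environment:
      POSTGRES_PASSWORD: example
volumes:
  notebook-data:           # named volume so work persists across container restarts
```

Running docker-compose up would then start both containers together on a shared network.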



Helpful Docker Commands

  1. docker ps : list all running containers

  2. docker ps --all : list all containers, including stopped ones.

  3. docker start <container id> : start a specific container

  4. docker stop <container id> : stop a specific container

  5. docker exec -i -t <container id> bash : launch bash within a running container

  6. docker build -t <repository/image name:tag> <docker file path> : build an image from a dockerfile

  7. docker run <options> <image> : runs a container based on the image

  8. docker push <repo/image name:tag> : push image to docker hub

  9. docker rm <container id> : delete a container

  10. docker rm $(docker ps -a -q) : delete all containers

  11. docker rmi <image id> : delete image locally

  12. docker rmi $(docker images -q) : delete all images locally


Data Stacks

One of the topics covered in the book Docker for Data Science is this idea of data stacks. The point of data stacks is to provide a common starting point for all projects, depending on the technology stack used. In a future blog post I will discuss how I leveraged this to build a Twitter streaming analysis application. The book, however, leverages the Jupyter stacks found here. I decided to build my own for three reasons: (1) I wanted the experience of building my own, (2) I wanted control over the libraries installed, and (3) the Jupyter stack images were very large and I could reduce the size. You can find the stacks I pushed to Docker Hub here. If you are interested in looking at the Dockerfiles, you can find the GitHub repo here.


Another advantage to using stacks is that they can build on top of each other. This is the overall goal we are going for (note: for this blog post I only built the Basic-Stack and Data-Python-Stack):



Basic-Stack

This stack is based on the Python 3 slim image, which is about 156MB. This stack also has Jupyter notebook installed, and the purpose is to be an all-purpose notebook. Here is the Dockerfile used to build this image:
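
A minimal sketch of such a Dockerfile, following the steps described below (the exact package versions, username, and flags are assumptions on my part):

```dockerfile
# Sketch of the Basic-Stack Dockerfile; details are illustrative, not the exact original
FROM python:3.6.3-slim

# Run security updates and clean the apt cache to keep the image small
RUN apt-get update && apt-get upgrade -y && rm -rf /var/lib/apt/lists/*

# Create a non-root user to run the notebook
RUN useradd -m ds

# Install Jupyter notebook
RUN pip install --no-cache-dir jupyter

USER ds
WORKDIR /home/ds

EXPOSE 8888

# Listen on all interfaces so the host's port mapping can reach the notebook
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser"]
```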

The Dockerfile is pretty straightforward: I am using the Python 3.6.3-slim image, running security updates, creating a user ds, installing Jupyter notebook, setting variables, and then running the Jupyter notebook when a container created from this image is launched. Any IP can connect to the notebook through port mapping and using a token.


To build an image using this file, run the command: docker build -t <repository/image name:tag> <docker file path>

The -t option tags the image.


Example:

docker build -t madderle/basestack:1.0 .


Note the . at the end of the example command. This tells Docker to build the image using the Dockerfile found in the current directory (the folder where you run the command).

To build a container using the example image:

docker run -i -t -p 8888:8888 madderle/basestack:1.0


The -i -t options tell Docker to run interactively, and -p maps your computer's port 8888 to the container port exposed in the Dockerfile. The reason we need to run interactively is that when the notebook is launched, a token is generated. To access the notebook, the token will need to be in the URL.

Data-Python-Stack

This stack is based on Continuum's Miniconda image, which is about 571MB. The purpose of this stack is to contain the libraries I often use. Continuum makes Anaconda, which is a very popular distribution of Python libraries for data science. Miniconda just contains their package manager, conda (an alternative to pip). I specified the libraries I want using a requirements file:
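
The exact file isn't reproduced inline, but a conda requirements file along these lines (the specific library choices here are illustrative) gives the idea:

```text
# Illustrative requirements file for conda install; library list is an assumption
jupyter
numpy
pandas
scipy
scikit-learn
matplotlib
seaborn
sqlalchemy
psycopg2
pymongo
requests
```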

I prefer to include all of the popular libraries to interact with various data sources.

The Dockerfile is:
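
A hedged sketch of that Dockerfile, following the steps described below (paths, the username, and flags are my assumptions):

```dockerfile
# Sketch of the Data-Python-Stack Dockerfile; details are illustrative
FROM continuumio/miniconda3

# Create a user and home directory
RUN useradd -m ds

# Install the libraries from the requirements file, then clean up conda's caches
COPY requirements.txt /tmp/requirements.txt
RUN conda install --yes --file /tmp/requirements.txt && \
    conda clean --all --yes

USER ds
WORKDIR /home/ds

EXPOSE 8888

# Run the notebook when a container starts from this image
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser"]
```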

In a nutshell, I'm using the Miniconda image, creating a user and home directory, running the conda install of the requirements file, cleaning up the installation, and then specifying the command to run the Jupyter notebook when the container starts up. The image is built like I showed you earlier.


Data Stores

Containers are meant to be ephemeral. You should be able to start, stop, and delete them at will. But data must persist beyond the lifespan of your container. You can persist data by mounting a host directory as a data volume. There are two types: data volumes and data volume containers.


Data volumes can be created outside containers with docker volume create <volume name>, or when creating the container with the -v or --mount flag. You can then map to a local directory.


Data volume containers are containers whose only purpose is to store data. The advantage is that multiple containers can then reference the volume. In my next blog post I will show an example of this.
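
To make both approaches concrete, here is a sketch of the commands involved (the volume, container, and image names are placeholders of mine):

```shell
# Create a named volume and mount it into a container at /data
docker volume create mydata
docker run -v mydata:/data -i -t python:3.6.3-slim bash

# Or mount a host directory directly as the volume
docker run -v /path/on/host:/data -i -t python:3.6.3-slim bash

# Data volume container pattern: one container owns the volume...
docker create -v /data --name datastore python:3.6.3-slim
# ...and other containers reference it with --volumes-from
docker run --volumes-from datastore -i -t python:3.6.3-slim bash
```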




© 2017 by Brandyn Adderley
