
Deploying Models as an API with Docker and Django

  • madderle
  • Dec 21, 2017
  • 8 min read

Updated: Jan 8, 2018



Many posts exist around designing and building machine learning models. When I first started in data science, I went through a lot of datasets, built models that scored well, and everything was awesome. Then I wanted to apply my new skills at my job: I wanted to write a model to predict whether a project would go over its estimate. I went through the steps: gathered data, cleaned it up, generated a decent model, and then I was stuck. What do I do next? So many blog posts and online classes talk about building models, but they just stop there. They don't go into detail about what to do next. How do I actually use the model?


This is where deploying models as an API comes in. There are a few good articles on deploying models as an API; however, all of them do it using Flask. I prefer Django because it's better and I learned web development with it, so I'm totally biased. This blog post is about deploying models as APIs using Docker, Django and Gunicorn.


You can find my code here.



Scenario


For the purpose of this blog, I got the problem and dataset from Kaggle. It's a human resources analytics dataset, and the problem is that a company wants to understand why some of its best and most experienced employees are leaving. The end goal is to create a model that predicts whether an employee will leave, and then deploy that model.



Pipeline


Before we get into the heart of the post, let's discuss pipelines. Often in machine learning you need to perform a sequence of different transformations (find features, generate new features, select features, etc.) on the dataset BEFORE doing a classification. Pipelines allow you to automate those manual steps: data flows through each step, each being some transformation of the data, before reaching the classifier. This ensures your model build can be replicated. I used a pipeline for preparing the model after I figured out which steps needed to be done.
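As a generic illustration of the idea (not the post's actual pipeline, which is built later), a scikit-learn Pipeline chains transformation steps and a final classifier into one object:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)  # toy data

pipe = Pipeline([
    ("scale", StandardScaler()),    # each transform step runs in order...
    ("clf", LogisticRegression()),  # ...before the final estimator
])
pipe.fit(X, y)               # fits every step, then the classifier
print(pipe.predict(X[:5]))   # predictions flow through the same steps
```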



Process


This is the overall process:

[Image: overall process diagram]

Determine Features & Best Model: The first step is to investigate the dataset. How many rows and columns? Does it have any missing data? How clean is the data? Are there any interesting features? For this specific dataset, there were no missing values and the data was very clean. I made only two changes:

  • Dropped the sales column

  • Replaced the low/medium/high salary strings with numbers the classifier can use.

After the data was cleaned, I could evaluate different predictive models. I used Logistic Regression, Decision Tree and Random Forest, and the Random Forest performed the best.


Build Pre-Process Transformation: Since I determined the steps needed to prepare the model I can turn that into a function to be used in the pipeline object.


Build Pipeline: Create a pipeline object that combines the Transformation function and the Random Forest classifier.


Grid Search: Optimize the estimator by searching over a set of parameters for the best ones and apply them to the model. Please note that the search isn't actually done until the model is trained.


Train Model: Fit the data to the grid search estimator. The grid search is executed to find the best parameters and then the model is refitted with the best parameters.


Serialize Model: Serialize the model using pickle.


Deploy Model: Deploy the model using Django and Gunicorn.


Design Decision: I decided to keep the model development in the same repo as the infrastructure. For the purposes of this blog, the model development occurs in a Jupyter notebook, and the final model is serialized and placed in a Model Builds folder. The Django application then pulls the model from that folder and passes the prediction back to the client. In this case, there is just one Docker service. In the future, NGINX can be added.



Architecture

[Image: architecture diagram]

Currently, the project consists of a single Docker container. The container is mapped to a single volume that has three folders in it:

  1. Model Development - contains the Jupyter notebook used to develop and serialize the model.

  2. Model Builds - contains the serialized models.

  3. Model Deployment - contains the Django code that provides an API to the model.

The Jupyter notebook server is launched at port 8888 and the Django/Gunicorn server is launched at port 8000. Both servers can be used simultaneously. The development workflow is:

  1. Work is done in the Jupyter notebook to develop the model.

  2. The model is then serialized using pickle and placed in the Model Builds folder.

  3. The Django view (predict) loads the model from the Model builds folder.


What happens when a user requests a prediction?

  1. The user requests a prediction through the URL. The base path is <ipaddress>:8000/api/predict<data>, where <data> is a dataframe serialized into JSON.

  2. The request goes to either the Django development server or Gunicorn and gets passed on to Django.

  3. Django looks up the URL (from the urls file) to see which view is mapped to that URL pattern.

  4. The urls file maps the pattern to a view function called predict, and the <data> gets passed to this function.

  5. In the view function the model gets unpacked, the data payload is turned back into a dataframe, and the predict and predict_proba functions on the model are called.

  6. The predictions get passed back to the client as a simple list.



Folder Structure

[Image: project folder structure]

There are a lot of files, so let's discuss them.


README, LICENSE, .gitignore, .git: standard Git and repository files.


docker-compose.yml: I use docker-compose a lot for standing up applications. The file looks like this:

[Image: docker-compose.yml]

As I stated before, there is only one service, called "web". It uses the data-python-stack Docker image I talked about in a previous blog post. The default command of the image is overridden by the command line in the compose file. The Gunicorn web server is launched on port 8000 with 2 workers. Note that the gunicorn command needs to be launched from the same folder as the Django application (that's why I had to change directory to Model-Deployment before running the command). Additionally, the src folder is mapped to the /home/ds folder of the container and the ports are mapped.
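Since the original screenshot isn't reproduced here, this is a minimal sketch of what that compose file might look like based on the description above; the exact image name, WSGI module name and Gunicorn flags are assumptions:

```yaml
version: '3'
services:
  web:
    image: data-python-stack        # image from the earlier blog post
    command: bash -c "cd Model-Deployment && gunicorn deploy_models.wsgi -b 0.0.0.0:8000 --workers 2"
    volumes:
      - ./src:/home/ds              # src folder mapped into the container
    ports:
      - "8000:8000"                 # Gunicorn
      - "8888:8888"                 # Jupyter
```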


Makefile: I talked about this type of file in my first blog post. I really like using these in projects because they provide a condensed list of commands for running the project.

[Image: Makefile]

The file contains a lot of targets (a short sketch of a few of them follows the list):

  • create_project: if you are starting from a blank project, this executes the command to create a new Django project called "deploy_models".

  • build: builds the docker project.

  • create_app: executes the command to create a Django app.

  • migrate_data: this command executes the migrate commands to update the database to reflect the models.

  • up: starts the docker web container and runs the Gunicorn web server.

  • down: shuts down the web container.

  • start: starts up the container if previously stopped.

  • stop: stops the container from running.

  • django-dev: this command starts the Django development web server instead of the Gunicorn web server. There are some big differences between the two; namely, the development server is a single process, but it restarts and reloads as you edit Django code.

  • jupyter-web: this command starts the Jupyter server.

  • bash-web: opens up bash if the web container is already running.

  • log-web: shows the logs of the web container.
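A rough sketch of how a few of these targets might be defined, assuming they simply wrap docker-compose commands (the exact flags are assumptions, and recipe lines must be indented with tabs):

```make
up:
	docker-compose up -d           # start the web container (Gunicorn on port 8000)

down:
	docker-compose down            # shut the web container down

bash-web:
	docker-compose exec web bash   # open a shell in the running container

log-web:
	docker-compose logs web        # show the container's logs
```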


src: This folder contains the code for Model development and deployment.



Model Development


Inside the Model development folder there are two files:

  • HR_comma_sep.csv: this is the dataset.

  • Model Development.ipynb: Jupyter notebook containing code.

The notebook can be found here. It outlines each of the steps in the process I described earlier. I will skip to the second step in the process (creating the Transformation).

[Image: PreProcessing transformer code]

The class inherits from the BaseEstimator and TransformerMixin classes, and there are a few methods to override. I'm not doing anything in the __init__ and fit methods; all the work happens in the transform method. Previously I used the sales column to create new features in the dataset, but it was not a big factor in the classification score, so I dropped it. I did, however, replace the salary strings with values the classifier can use.
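The screenshot of the transformer isn't reproduced here; a minimal sketch of what it likely looks like follows. The column names come from the Kaggle HR dataset, and the exact salary mapping is an assumption:

```python
from sklearn.base import BaseEstimator, TransformerMixin

class PreProcessing(BaseEstimator, TransformerMixin):
    """Prepare the raw HR dataframe for the classifier."""

    def __init__(self):
        pass                            # nothing to configure

    def fit(self, X, y=None):
        return self                     # nothing to learn

    def transform(self, X):
        X = X.copy()
        X = X.drop('sales', axis=1)     # drop the sales column
        X['salary'] = X['salary'].map({'low': 1, 'medium': 2, 'high': 3})
        return X
```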


The next step is creating the pipe object.

[Image: pipeline creation code]
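A sketch of the pipe object, assuming scikit-learn's Pipeline and the PreProcessing transformer above (hyperparameters omitted):

```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([
    ('preprocess', PreProcessing()),    # custom transformer defined above
    ('clf', RandomForestClassifier()),  # the classifier chosen earlier
])
```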

The pipeline object is made from the PreProcessing and RandomForestClassifier objects. In the notebook I fit the pipe object to the training data. Doing this is not required at this point, since we will do it in a later phase, but I wanted to validate the pipe and talk about model accuracies. It's important to plot the model accuracy across multiple training sets.

[Image: model accuracy plot]

The next step is creating a grid search object. At this step, the parameters to search over are defined. Note that the search isn't executed at this step; it runs when the training data is fitted.

[Image: grid search setup code]

Keep in mind the parameters are specific to the Random Forest Classifier.
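A sketch of the grid search setup; the parameter grid below is illustrative, not the exact grid from the notebook:

```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'clf__n_estimators': [50, 100, 200],   # 'clf__' targets the pipeline's classifier step
    'clf__max_depth': [None, 5, 10],
    'clf__min_samples_split': [2, 5],
}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
# Nothing is searched yet -- the search runs when grid.fit() is called.
```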

Before we fit the model, we need to split the data:

[Image: train/test split code]

The next step is fitting the model with the training data over a range of parameters and then re-fitting to the best parameters.

[Image: model fitting code]
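Roughly, the split and the fit look like this ('left' is the target column in the Kaggle file; the split ratio is an assumption):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('HR_comma_sep.csv')
X = df.drop('left', axis=1)        # features (sales and salary are handled by the pipeline)
y = df['left']                     # 1 = the employee left

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

grid.fit(X_train, y_train)         # runs the search, then refits on the best parameters
print(grid.best_params_)
```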

To make a prediction, simply use the .predict or .predict_proba methods with the data you want to predict on.

[Image: prediction code]
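For example, using the held-out test data from the sketch above:

```python
print(grid.predict(X_test[:5]))        # predicted classes (1 = will leave)
print(grid.predict_proba(X_test[:5]))  # class probabilities
```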

The final step is to serialize the grid object.

[Image: model serialization code]

Note that I'm showing two different serialization techniques. I went with the dill method and saved the model to the Model-Builds folder.
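The dill version is essentially a one-liner; the filename and relative path here are assumptions:

```python
import dill

with open('../Model-Builds/model.pk', 'wb') as f:
    dill.dump(grid, f)    # serialize the fitted grid search (pipeline + best params)
```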


Model Deployment


Now it's time to deploy the model with Django! When a Django project and application are created, many files are generated. For a more in-depth introduction to Django and how to use it, I highly recommend diving into the Django documentation.


The three main files I edited are:

  1. deploy_models/urls.py

  2. api/urls.py

  3. api/views.py

The urls files are used to specify which view function is executed when a URL is requested. When a Django project is created, a urls file is created in the project folder. It is considered good practice to create a separate urls file per application in your project to keep it modular.

If launching your own Django project:

  1. Run the make create_project command

  2. Run the make create_app command

  3. Run the command docker-compose run web bash to open bash into the container

  4. Once bash is opened, execute python manage.py makemigrations

  5. Execute python manage.py migrate

  6. If you desire, execute python manage.py createsuperuser (this creates an admin user)

  7. Open the settings file and add '*' to ALLOWED_HOSTS

  8. In the deploy_models folder, edit your urls file to include the urls file you will create in the next step (a combined sketch of all three files follows step 10):

[Image: deploy_models/urls.py]

9. In the "api" app folder create a urls file:


[Image: api/urls.py]
Note: when the URL is requested, the "dataframe" payload is assumed to be a JSON-serialized dataframe, and the predict view function is called.

10. Create the view in the views file:

[Image: api/views.py]

First, import all the necessary libraries and then load the model. I discovered a loading error that occurred when running under Gunicorn but not under the Django development server; literally, to get it to work, I just needed to add a period.


Next, I define the predict function. Since I used a named parameter "dataframe" in the URL definition, that same name is passed into my predict function. I am assuming the data is well-formed; in the future I could use a try/except to check for this. I call the predict and predict_proba methods on the data and grab only the probability of the class it predicted. Finally, I pack the results into a dataframe and return them as JSON.
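Putting steps 8-10 together, here is a minimal sketch of what the three files might look like. It assumes Django 1.11-style url() routing; the regex, the relative model path (including the leading period mentioned above) and the response format are reconstructed from the description, not the post's literal code.

deploy_models/urls.py:

```python
# Route everything under /api/ to the app's own urls file
from django.conf.urls import url, include
from django.contrib import admin

urlpatterns = [
    url(r'^admin/', admin.site.urls),
    url(r'^api/', include('api.urls')),
]
```

api/urls.py:

```python
# Everything after 'predict' is captured as the named parameter
# 'dataframe' and handed to the view
from django.conf.urls import url
from . import views

urlpatterns = [
    url(r'^predict(?P<dataframe>.+)$', views.predict),
]
```

api/views.py:

```python
# Load the serialized model once at import time, then use it in the view
import dill
import pandas as pd
from django.http import HttpResponse

with open('./../Model-Builds/model.pk', 'rb') as f:   # path is an assumption
    model = dill.load(f)

def predict(request, dataframe):
    data = pd.read_json(dataframe)                    # JSON payload back into a dataframe
    prediction = model.predict(data)
    probability = model.predict_proba(data).max(axis=1)  # probability of the predicted class
    results = pd.DataFrame({'Prediction': prediction, 'Probability': probability})
    return HttpResponse(results.to_json(), content_type='application/json')
```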


Results


To use the API (a client-side sketch follows the steps below):


1. Create a test dataset:

[Image: test dataset creation]

2. Convert the data for use in a URL (escape special characters):

[Image: URL-encoding the JSON payload]

3. Assemble the URL and paste into a browser:

[Image: assembled request URL]

4. Get results:

[Image: JSON response in the browser]

The results are a dataframe with the index, Prediction and Probability:

[Image: results dataframe]
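A sketch of how a client could perform those four steps in Python; the host, port and feature values are placeholders, and the column names follow the Kaggle file:

```python
import urllib.parse
import pandas as pd
import requests

# 1. Create a test row with the same columns as the training data
row = pd.DataFrame([{
    'satisfaction_level': 0.40, 'last_evaluation': 0.55, 'number_project': 3,
    'average_montly_hours': 150, 'time_spend_company': 3, 'Work_accident': 0,
    'promotion_last_5years': 0, 'sales': 'technical', 'salary': 'low',
}])

# 2. Serialize to JSON and escape it for use in a URL
payload = urllib.parse.quote(row.to_json())

# 3. Assemble the URL and make the request
response = requests.get('http://localhost:8000/api/predict' + payload)

# 4. The response is a JSON dataframe with Prediction and Probability
print(response.json())
```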


Benchmark


I went a step further and did some benchmarking using ApacheBench. In bash, you can run the command:

ab -n 100 -c 10 <URL>

Where:

  • -n is the number of requests

  • -c is the number of concurrent requests

  • <URL> is the same URL you copy and paste into your browser.


Using just the Django development server: ~ 45 requests per second

[Image: ApacheBench output, Django development server]

Using Gunicorn: ~67 requests per second (2 Gunicorn workers)


[Image: ApacheBench output, Gunicorn]

 
 
 
