
Twitter Streaming with Docker Compose & MongoDB

  • madderle
  • Dec 14, 2017
  • 4 min read


There are lots of blog posts about sentiment analysis with Twitter. The purpose of this post is to show how to create a streaming application using the stacks I made in the previous blog post. I used the tweepy Python Twitter API client, Bounding Box, Docker (Docker Compose) and MongoDB.


This is the overall architecture of what I built:

Data Generator: This Python container is built using the base-stack image from my earlier blog post. Here you can create a Jupyter notebook that reads a Twitter stream and writes to a MongoDB collection/document.


Data Service: This is a container built using the MongoDB image. This will host the Mongo database process.


Data Container: This container is a simple lightweight container built off the alpine image. Its purpose is to house the actual database data to allow the database to be upgraded without affecting the data.


Data Analysis: This container is based off of the data-python-stack image. Its purpose is to perform whatever analysis is needed on the data stored in the database.



Project Structure

The code for this project can be found here. This is the overall folder structure:


First, I'm using Git (obviously); this is the repository you downloaded the code from.


Analysis Folder: this folder contains another folder called src, where the Jupyter notebook for analysis will be stored.


db: this folder contains the data for the Mongo database.


Streamer: this folder contains another folder called src, where the Jupyter notebook containing the code for streaming from Twitter to the database is stored. Additionally, there is a Dockerfile that creates an image specific to this application.


docker-compose.yml: this is the Docker Compose file used for configuring and starting all the services for this application.


Also note that if you look in the src folders, Jupyter Notebook adds a lot of different folders, as seen here:

The only thing you need to add to Git is the notebook.



Docker Compose

Docker Compose is a tool for defining and running multi-container Docker applications. With a single command, docker-compose up, you can start all the services in your configuration, and with another command, docker-compose down, you can bring the application back down.


Features

Compose:

  • uses a project name to isolate environments from each other.

  • preserves all volumes used. When it runs, if it finds any containers from previous runs, it copies the volumes from the old container to the new container. This ensures any data created isn't lost.

  • caches the configuration used to create a container. It reuses the existing container if it hasn't changed.

  • can be used with either a single Docker instance or a Docker Swarm cluster.

Compose File

Here is the docker-compose file I created for this project:

This application contains 4 services, and the file uses version 2 of the docker-compose format.
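
A stripped-down sketch of the file's shape, assembled from the service descriptions below (the image tags, host paths and in-container home directory may differ from what is in the repo), looks something like this:

```yaml
version: "2"

services:
  db:
    image: mongo:3.6
    container_name: mongodb_01
    restart: always
    ports:
      - "27017:27017"        # default MongoDB port
    volumes_from:
      - db-data              # data lives in the data container

  db-data:
    image: alpine:latest
    restart: always
    volumes:
      - ./db:/data/db        # map Mongo's data folder to the local db folder
    command: "true"          # no-op; the container exists only to hold the volume

  streamer:
    build: ./Streamer        # built from the Dockerfile in the Streamer folder
    container_name: streamer_01
    ports:
      - "9999:9999"
    volumes:
      - ./Streamer/src:/home/ds   # in-container home path is an assumption
    depends_on:
      - db

  analysis:
    image: data-python-stack:latest
    container_name: analysis_01
    ports:
      - "8888:8888"
    volumes:
      - ./Analysis/src:/home/ds   # in-container home path is an assumption
    depends_on:
      - db
```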


The db service:

  • will always restart if an error occurs

  • is named mongodb_01 so that other containers can reference it by that name

  • is built from the mongo 3.6 image

  • is mapped to the default MongoDB port (27017)

  • uses volumes from the db-data container

The db-data service:

  • will always restart (this is why, if you look at the logs, you see this container restarting repeatedly)

  • is built from the lightweight alpine image

  • creates a volume and maps the Mongo data folder to the local db folder

  • executes a no-op command

The streamer service:

  • will be built from the Dockerfile found in the Streamer folder

  • is named streamer_01

  • has its home directory mapped to the local Streamer/src folder

  • maps port 9999 to the port of the service (also 9999)

  • depends on the db service being up

The analysis service:

  • will be built from the data-python-stack

  • is named analysis_01

  • has its home directory mapped to the local Analysis/src folder

  • maps port 8888 to the port of the service (also 8888)

  • depends on the db service being up


For the streamer service, I used a Dockerfile because I needed to tweak the image for the container.

I'm using the latest basic-stack image, but I need to install the tweepy module to interact with the Twitter stream and pymongo to write to the database. Additionally, the default port for the Jupyter notebook is 8888. Since the analysis notebook will also use this port, there would be a conflict. To avoid this, I launch the Jupyter notebook on a different port (9999).
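
A stripped-down sketch of that Dockerfile (the base image tag and the exact Jupyter launch flags may differ from what is in the repo) looks something like this:

```dockerfile
# Start from the basic-stack image built in the earlier post (tag is an assumption)
FROM basic-stack:latest

# Libraries the streaming notebook needs
RUN pip install tweepy pymongo

# Run Jupyter on 9999 so it doesn't clash with the analysis notebook on 8888
EXPOSE 9999
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=9999", "--no-browser", "--allow-root"]
```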



Twitter Application

Twitter has an API that allows users to build applications that use Twitter data. In order for this application to work, you will have to register an application to receive 4 key pieces of information: access_token, access_token_secret, consumer_key and consumer_secret.


To register for a new application:

  1. Visit here.

  2. Sign up and click on Create New App

  3. Give the new app a name, description and website. One thing to be aware of is that the URL validation can be a bit annoying.

  4. Agree to the Developer Agreement and click on "Create your Twitter Application"

  5. Once everything is done, go to the Keys and Access Tokens tab, scroll to the bottom and click on Generate tokens.



Bounding Box

To make the application more interesting I used a tool called Bounding Box. The idea behind using this tool is that I can draw a bounding box around a particular area in order to filter for tweets from within that area.


For the filter to work correctly, you must select "CSV Raw" from the Copy & Paste drop-down box.
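
As an illustration with approximate numbers, the CSV Raw output for a box around Austin is a single comma-separated line ordered west, south, east, north, which is the same order tweepy's locations filter expects:

```python
# CSV Raw output from the Bounding Box tool (approximate, illustrative values):
#   -97.93,30.13,-97.56,30.52
# Interpreted as [west_lng, south_lat, east_lng, north_lat]
austin_box = [-97.93, 30.13, -97.56, 30.52]
```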



Streaming

Now the fun part! This is the Python script that performs the streaming:
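
The full notebook is in Streamer/src in the repo; a condensed sketch of the same flow (the credentials are placeholders, and the database/collection names are only illustrative) looks roughly like this, using the tweepy 3.x API:

```python
import json

import tweepy
from pymongo import MongoClient

# Credentials from the registered Twitter application (fill in your own)
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"

# Connect to the db service by its container name on the compose network
client = MongoClient("mongodb_01", 27017)
db = client["twitter"]          # database name is illustrative
collection = db["tweets"]       # collection name is illustrative

class MongoStreamListener(tweepy.StreamListener):
    """Writes every incoming tweet to the Mongo collection."""

    def on_data(self, data):
        tweet = json.loads(data)
        collection.insert_one(tweet)
        return True

    def on_error(self, status_code):
        # Stop streaming if Twitter starts rate limiting us
        if status_code == 420:
            return False

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# Bounding box for Austin from the Bounding Box tool (see above): west, south, east, north
austin_box = [-97.93, 30.13, -97.56, 30.52]

stream = tweepy.Stream(auth, MongoStreamListener())
stream.filter(locations=austin_box)
```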

To execute this notebook, simply run the cell and tweets will be written to the Mongo database. At a high level, this script:

  • Creates (or gets a reference to) the database and collection

  • Instantiates a stream listener object with appropriate settings

  • Starts the stream with the Austin location box.

This is the meat of the application; you can now do analysis on the data in the database.


Analysis

For the purpose of this post, I included a small analysis notebook that simply looks at the tweets to verify the stream worked correctly. You can find the notebook in the GitHub repo.
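
Something along these lines (with the database and collection names assumed to match whatever the streaming notebook used) is enough to confirm tweets are landing in Mongo:

```python
from pymongo import MongoClient

# Connect to the db service by its container name on the compose network
client = MongoClient("mongodb_01", 27017)
collection = client["twitter"]["tweets"]   # names assumed to match the streamer notebook

# Sanity checks: how many tweets arrived, and what do the latest few say?
print("Tweets stored:", collection.count_documents({}))
for tweet in collection.find().sort("_id", -1).limit(5):
    print(tweet.get("text"))
```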


