
Porting a machine learning model to production


The pipeline for building machine learning models often ends at the stage of evaluating the quality of the model: you have reached acceptable accuracy and that’s it.

Perhaps you will also produce some nice plots for your article or blog post, or for your internal documentation.
Note: of course, a machine learning engineer is not always required to bring the model to production, and even when it is necessary, this task is often delegated to a system administrator.

However, nowadays many researchers and engineers are (perhaps morally) responsible for the complete pipeline: from building the model to bringing it to production. Whether it is a university project, a personal experiment, or a commercial product, a working demonstration is a great way to interest a wide audience. Few people will put in the extra effort to explore a model or product that is difficult to reproduce.

In this article, we are going to walk through this pipeline together. It is assumed that you have already created a machine learning model using your favorite framework (scikit-learn, Keras, TensorFlow, PyTorch, etc.). Now you want to demonstrate it to the world, if only through an API.

We will look at the infrastructure based on Python frameworks and Linux servers. It will include:

  • Anaconda: for managing package installation and creating a Python 3 sandbox. 
  • Keras: a high-level neural network API that can run on top of TensorFlow, CNTK, or Theano. 
  • Flask: a minimalist Python framework for building RESTful APIs. Note: Flask’s built-in server is not suitable for production deployments, as it serves only one request at a time by default; it is intended mainly for easier debugging. 
  • nginx: a stable web server that provides functionality such as load balancing, SSL configuration, etc. 
  • uWSGI: a highly configurable WSGI (Web Server Gateway Interface) server that allows multiple workers to serve multiple requests at the same time. 
  • systemd: an init system used by several Linux distributions to manage system processes after boot. 

Nginx will be our front end to the internet, handling client requests. Nginx has built-in support for the uWSGI protocol, and the two communicate over a Unix socket. In turn, the uWSGI server invokes the callable in our Flask application directly. This is how requests get served.

A few notes at the beginning of this tutorial:

  • Most of the components listed above can easily be replaced with equivalents, with practically no changes to the rest of the pipeline. For example, Keras can easily be replaced by PyTorch, Flask by Bottle, and so on. 
  • When we talk about moving to production, we are not talking about the enterprise scale of a huge company. The goal is to do everything possible within a single server with many processor cores and plenty of RAM.

Setting up the environment

First, we need to install the systemd and nginx packages:

sudo apt-get install systemd nginx

Then we install Anaconda by following the instructions on the official website: download the installer, run it, and add Anaconda to your system’s PATH. Below we will assume that Anaconda is installed in the home directory.

Next, let’s create an Anaconda environment from an environment.yml file. This is what the file looks like (it already contains the frameworks that we will be using):

name: production_ml_env
channels:
  - conda-forge
dependencies:
  - python=3.6
  - keras
  - flask
  - uwsgi
  - numpy
  - pip
  - pip:
    - uwsgitop

To create the environment, we run the following:

conda env create --file environment.yml

To activate the resulting environment, we do:

source activate production_ml_env 

By now, we have Keras, Flask, uWSGI, uwsgitop, etc. installed, so we are ready to start.
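
As a quick optional sanity check (not part of the original walkthrough), you can confirm that the main packages import correctly inside the activated environment:

# optional sanity check: run with `python` inside the activated environment
import flask
import keras
import numpy

print('keras', keras.__version__)
print('flask', flask.__version__)
print('numpy', numpy.__version__)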

Building a Flask web application

As part of this tutorial, we will not dive deeply into how to create our own ML model. Instead, we will adapt the text classification example based on the Reuters newswire dataset that ships with Keras. This is the code to build the classifier (we will call the script build_classifier.py):

'''Trains and evaluates a simple MLP
on the Reuters newswire topic classification task.
'''
from __future__ import print_function
import os
import numpy as np
import keras
from keras.datasets import reuters
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.preprocessing.text import Tokenizer
from keras.callbacks import ModelCheckpoint

MODEL_DIR = './models'

max_words = 1000
batch_size = 32
epochs = 5

print('Loading data...')
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=max_words,
                                                         test_split=0.2)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

num_classes = np.max(y_train) + 1
print(num_classes, 'classes')

print('Vectorizing sequence data...')
tokenizer = Tokenizer(num_words=max_words)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

print('Convert class vector to binary class matrix '
      '(for use with categorical_crossentropy)')
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

print('Building model...')
model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# create the models directory if it does not exist yet
if not os.path.exists(MODEL_DIR):
    os.makedirs(MODEL_DIR)

# save the best model (by validation accuracy) to ./models/reuters_model.hdf5
mcp = ModelCheckpoint(os.path.join(MODEL_DIR, 'reuters_model.hdf5'), monitor='val_acc',
                      save_best_only=True)

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_split=0.1,
                    callbacks=[mcp])

score = model.evaluate(x_test, y_test,
                       batch_size=batch_size, verbose=1)
print('Test score:', score[0])
print('Test accuracy:', score[1])

To reproduce the setup we use here, simply run the following commands when training the model. This lets you train on the CPU, without a GPU:

export CUDA_VISIBLE_DEVICES=-1
KERAS_BACKEND=theano python build_classifier.py

This will create a serialized version of the trained model, reuters_model.hdf5, in the models folder. We are now ready to serve the model via Flask on port 4444. In the code below, we provide a single REST entry point, /predict, exposed as a GET request where the text to be classified is passed as a parameter. The returned JSON has the form {“prediction”: “N”}, where N is an integer representing the predicted class.

from flask import Flask
from flask import request
from keras.models import load_model
from keras.datasets import reuters
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from flask import jsonify
import os

MODEL_DIR = './models'

max_words = 1000

app = Flask(__name__)

print("Loading model")
model = load_model(os.path.join(MODEL_DIR, 'reuters_model.hdf5'))
# we need the word index to map words to indices
word_index = reuters.get_word_index()
tokenizer = Tokenizer(num_words=max_words)


def preprocess_text(text):
    word_sequence = text_to_word_sequence(text)
    indices_sequence = [[word_index[word] if word in word_index else 0
                         for word in word_sequence]]
    x = tokenizer.sequences_to_matrix(indices_sequence, mode='binary')
    return x


@app.route('/predict', methods=['GET'])
def predict():
    try:
        text = request.args.get('text')
        x = preprocess_text(text)
        y = model.predict(x)
        predicted_class = y[0].argmax(axis=-1)
        print(predicted_class)
        return jsonify({'prediction': str(predicted_class)})
    except Exception:
        response = jsonify({'error': 'problem predicting'})
        response.status_code = 400
        return response


if __name__ == "__main__":
    app.run(host='0.0.0.0', port=4444)

To start the Flask application server, we run:

python app.py

You can test it with any REST client (Postman, for example) or simply by opening this URL in your web browser (replace your_server_url with your server’s URL):

http://your_server_url:4444/predict?text=this is a news sample text about sports and football in specific

And you will get a response of the form:

{
   "prediction": "11"
}
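
If you prefer a scripted check, here is a minimal Python client sketch. It assumes the requests package is installed (it is not part of environment.yml) and that the server is reachable at your_server_url:4444:

# minimal client sketch for the /predict endpoint
# assumes `pip install requests`; replace your_server_url with your server
import requests

text = 'this is a news sample text about sports and football in specific'
resp = requests.get('http://your_server_url:4444/predict',
                    params={'text': text},  # requests URL-encodes the query string
                    timeout=10)
resp.raise_for_status()
print(resp.json())  # e.g. {'prediction': '11'}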

Configuring the uWSGI server

We are now ready to scale our application server, and uWSGI will be the key player here. It communicates with our Flask application by calling the app callable in the app.py file. uWSGI includes a large number of parallelization features, which we will use. Its config file, uwsgi.ini, looks like this:

[uwsgi]
# placeholders that you have to change
my_app_folder = /home/harkous/Development/production_ml
my_user = harkous

socket = %(my_app_folder)/production_ml.sock
chdir = %(my_app_folder)
file = app.py
callable = app

# environment variables
env = CUDA_VISIBLE_DEVICES=-1
env = KERAS_BACKEND=theano
env = PYTHONPATH=%(my_app_folder):$PYTHONPATH

master = true
processes = 5
# allows nginx (and all users) to read and write on this socket
chmod-socket = 666
# remove the socket when the process stops
vacuum = true

# loads your application one time per worker
# will very probably consume more memory,
# but will run in a more consistent and clean environment.
lazy-apps = true

uid = %(my_user)
gid = %(my_user)

# uWSGI will kill the process instead of reloading it
die-on-term = true
# socket file for getting stats about the workers
stats = %(my_app_folder)/stats.production_ml.sock

# scaling the server with the Cheaper subsystem

# set cheaper algorithm to use, if not set default will be used
cheaper-algo = spare
# minimum number of workers to keep at all times
cheaper = 5
# number of workers to spawn at startup
cheaper-initial = 5
# maximum number of workers that can be spawned
workers = 50
# how many workers should be spawned at a time
cheaper-step = 3

Change the my_app_folder parameter to your own app directory and the my_user parameter to your own username. Depending on your needs and file locations, you may need to change or add other options.

One of the important sections in uwsgi.ini is the part where we use the Cheaper subsystem in uWSGI, which allows us to run multiple workers in parallel to serve multiple concurrent requests. This is one of uWSGI’s interesting features: dynamic up- and down-scaling controlled by a few parameters. With the above configuration, we will have at least 5 workers at all times. As the load increases, cheaper will spawn 3 additional workers at a time until all requests find a worker, up to a maximum of 50 workers.

In your case, the best configuration values depend on the number of cores on the server, the total available memory, and the memory consumption of your application. Take a look at the uWSGI documentation for advanced deployment options.

Binding uWSGI and nginx

If we started uWSGI now (we will do that a little later), it would take care of calling the application in the app.py file, and we would get all the scaling capabilities it provides. But we also want to receive REST requests from the internet and feed them to the Flask app via uWSGI. For this we configure nginx.
Here is a simple config file for nginx. Of course, nginx can additionally be used for SSL configuration or for serving static files, but that is beyond the scope of this article.

server {
    listen 4444;
    # change this to your server name or IP
    server_name YOUR_SERVER_NAME_OR_IP;

    location / {
        include uwsgi_params;
        # change this to the location of the uWSGI socket file (set in uwsgi.ini)
        uwsgi_pass unix:/home/harkous/Development/production_ml/production_ml.sock;
    }
}

We place this file in /etc/nginx/sites-available/nginx_production_ml (you need sudo access for this). Then, to enable this nginx configuration, we symlink it into the sites-enabled directory:

sudo ln -s /etc/nginx/sites-available/nginx_production_ml /etc/nginx/sites-enabled

Then we restart nginx:

sudo service nginx restart 

Systemd service setup

Finally, we will start the previously configured uWSGI server. However, to ensure that our server survives system restarts and unexpected crashes, we will run it as a systemd service. Here is the service config file, which we place in the /etc/systemd/system directory using:

sudo vi /etc/systemd/system/production_ml.service

[Unit]
Description=uWSGI instance to serve the production_ml service

[Service]
User=harkous
Group=harkous
WorkingDirectory=/home/harkous/Development/production_ml/
ExecStart=/home/harkous/anaconda3/envs/production_ml_env/bin/uwsgi --ini /home/harkous/Development/production_ml/uwsgi.ini
Restart=on-failure

[Install]
WantedBy=multi-user.target

Then we start the service with:

sudo systemctl start production_ml.service

To enable this service to start when the device is rebooted:

sudo systemctl enable production_ml.service 

At this stage, our service should start successfully. Whenever we update any of the configs, we must restart it:

sudo systemctl restart production_ml.service 

Service monitoring

To monitor the service and see the load per worker, we can use uwsgitop . In uwsgi.ini, we have already configured the statistics socket in our application folder. To view statistics, run the following command in this folder:

uwsgitop stats.production_ml.sock 

Here is an example of workers in action, with additional workers spawned as the load grows. To simulate such a heavy load on your server, you can add time.sleep(3) to your prediction code, as in the sketch below.
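
For illustration, here is a sketch of where that artificial delay could go in the predict() function of app.py; the delay is only for load testing and should be removed afterwards:

# in app.py: add `import time` next to the other imports, then slow down predict()
@app.route('/predict', methods=['GET'])
def predict():
    try:
        text = request.args.get('text')
        time.sleep(3)  # artificial delay to simulate a slow model; remove after testing
        x = preprocess_text(text)
        y = model.predict(x)
        predicted_class = y[0].argmax(axis=-1)
        return jsonify({'prediction': str(predicted_class)})
    except Exception:
        response = jsonify({'error': 'problem predicting'})
        response.status_code = 400
        return response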

One way to send parallel requests to your server is to use curl (remember to replace YOUR_SERVER_NAME_OR_IP with your server URL or IP address):

#!/usr/bin/env bash
url="http://YOUR_SERVER_NAME_OR_IP:4444/predict?text=this%20is%20a%20news%20sample%20text%20about%20sports,%20and%20football%20in%20specific" # add more URLs here

for i in {0..10}
do
   # run the curl job in the background so we can start another job
   # and disable the progress bar (-s)
   echo "fetching $url"
   curl "$url" -s &
done
wait # wait for all background jobs to terminate

To keep track of the log of the application itself, we can use journalctl :

sudo journalctl -u production_ml.service -f 

The output will show uWSGI’s startup messages, followed by a log line for each incoming request.

Final remarks

If you have reached this milestone and your application is running successfully, then this article has achieved its goal. Some additions deserve a mention at this stage:

  • To keep this article general enough, we used the lazy-apps mode in uWSGI, which loads the application once per worker. According to the documentation, this takes O(n) time (where n is the number of workers) and will probably consume more memory, but it results in a clean environment for each worker. By default, uWSGI loads the whole application differently: it starts a single process and then forks it for the additional workers, which saves memory. However, this doesn’t work well with all ML frameworks; for example, the TensorFlow backend in Keras fails without lazy-apps mode (several such issues have been reported). Your best bet is to try first without lazy-apps = true and switch to it if you run into similar problems. 
  • Flask application parameters: since uWSGI calls app as a callable, the parameters of the application itself cannot be passed through the command line. You are better off reading them from a config file with something like configparser; a sketch follows after this list. 
  • Scaling across multiple servers: the above tutorial does not cover the multi-server case. Fortunately, this can be achieved without significant changes to our setup. Using the load-balancing feature in nginx, you can configure multiple machines, each with the uWSGI setup described above, and have nginx route requests to the different servers. nginx supports several load-balancing methods, ranging from simple round-robin to ones based on connection count or average latency. 
  • Port selection: the tutorial above uses port 4444 for illustration purposes. You can change this port to suit your setup. Make sure you open the port in your firewall, or ask your institution’s administrators to do so. 
  • Socket privileges: we grant socket access to all users (chmod-socket = 666). Feel free to tighten these privileges for your own purposes and run the server under different privilege groups, but make sure nginx and uWSGI can still communicate with each other after your changes. 
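
As promised above, here is a minimal sketch of reading application parameters from a config file instead of the command line. The file name config.ini, the section name, and the helper module are hypothetical; adapt them to your project:

# settings.py (hypothetical helper): read app parameters from config.ini
# app.py can then do `from settings import MODEL_DIR, MAX_WORDS`
import configparser

config = configparser.ConfigParser()
config.read('config.ini')  # e.g. placed next to app.py

MODEL_DIR = config.get('model', 'model_dir', fallback='./models')
MAX_WORDS = config.getint('model', 'max_words', fallback=1000)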

Links to used materials

  • https://hackernoon.com/a-guide-to-scaling-machine-learning-models-in-production-aa8831163846
