Practical Packaging For Machine Learning Solutions

November 13, 2017

About This Post

What/Why

Do you have machine learning models written in Python that you would like to share with other people (e.g. clients, co-workers, etc.)?

Sharing projects becomes much easier once we know how to create our own pip installable packages.

A few things to note

This post attempts to follow generally accepted best practices as much as possible. The challenge is that the application of Python packaging in machine learning projects is not widely discussed, and as a result there are few references explicitly defined for this use case. Therefore, I have done my best to extend the general best practices and to draw on my own professional experience in creating this post.

What our targeted outcome is

In this post we will learn how to make a simple machine learning solution, written in Python, pip installable. Thus anyone who knows how to use pip ¹ will be able to make use of our solution.

The example machine learning solution that we will create provides a Python API so that it can be used in other Python programs. It also provides scripts that will be made available when the distribution package ² is installed using pip.

We will cover how to package not only the code but also a fitted model, including common ML dependencies such as scikit-learn ³, numpy ⁴, and scipy ⁵.

Use cases

This approach is best for models that are not frequently updated and that are small enough to run on a single machine. If you are using scikit-learn your project is probably appropriate. Even if your project falls outside of this post’s assumptions, the basic packaging information is still very useful.

Prerequisite Knowledge (What I assume about you)

I assume that you know Python. I also assume that you are familiar with the package manager pip ⁶ or at least conda. ⁷

In addition, I assume that you are some kind of machine learning practitioner. However, if this is not the case, you may still find something useful for packaging other types of projects.

Requirements

You need to have pip and setuptools installed. If they are not installed, check out the Python Packaging Authority’s (PyPA) guide.

If you are on Windows or would simply prefer to use Conda, you can install it instead. Conda comes with pip, so you should still be able to follow along, but I will not be taking Conda-related issues into account, and I cannot guarantee that you will not have issues. Even with Conda, make sure that you have pip and setuptools installed.

Quick Overview Of Our Example Project

The example project (iris_classifier) provides a Python API (application programming interface) and scripts that allow users to easily train a classifier to predict the species of an Iris (e.g. Iris setosa, Iris virginica, Iris versicolor) based on measurements of the length and width of the sepals and petals. Once a model has been trained, it can be used to classify new samples of Irises.

The project uses the classic Iris flower data set. ⁸

Some of its dependencies include scikit-learn’s Support Vector Machine Classifier ⁹ and NumPy.

This is a simple example but it will suffice for demonstrating a method of packaging Python machine learning projects.

The sample project also includes a Jupyter Notebook ¹⁰ which we will use as an example of how to include notebooks with our projects as additional documentation. (The example notebook can be found here)

Requirements To Create A pip Installable Package

To create a pip installable package we must provide various forms of documentation and a setup file that will tell our packaging program how to create the kind of package we want. Some of these things include specifying the data files to include, stating which versions of Python our project can work with, listing the dependencies that our project depends upon, and additional things that we will cover below.

Overview

To help give some perspective, below is a preview of what our project’s file structure will look like.

iris_classifier/
├── setup.py
├── LICENSE
├── README.md
├── iris_classifier/
│   ├── __init__.py
│   ├── iris.py
│   └── data
│       ├── iris_model.pickle
│       ├── iris_x.csv
│       └── iris_y.csv
├── bin
│   ├── classify-iris
│   ├── mk-iris-model
│   └── score-iris-model
└── notebooks
    └── classifier_for_iris_data_set.ipynb

First steps to set up our project directory

First, we must create our project directory and fill it with a few required items. We will follow these steps to do this.

Let’s start by creating a directory for our project:
- Create a directory called iris_classifier; this will serve as the name of our project.
Next, create these files within our project directory (iris_classifier/):
- setup.py
  
  setup.py is an important configuration file and will contain a lot of important information about our project, such as its dependencies. We will cover the contents of the setup.py file in more detail below.
- README.md
  
  We can get our README file from here, and place it in our project directory (iris_classifier/).
  
  It is important to include a README file. ¹¹ It serves as the first reference someone should consult when they find our project. It provides instructions for, or at least references to, things such as how to install our project, what our project is about, how to use it, how to contribute, etc.
- LICENSE
  
  We will use the ‘BSD 3-Clause License’ ¹² for this project. It is also the one used by scikit-learn. ¹³
  
  We can get our LICENSE file from here and place it in our project directory (iris_classifier/).
  
  Anytime a project is made available online it is a very good idea to include a license. GitHub provides a guide that is helpful for picking an open-source license. Consulting a lawyer can sometimes be prudent if there are any doubts. When creating a project for an employer or a client it is recommended to consult them and/or a lawyer.

Finally, we must create a directory for our Python Package within our project directory.
- We will give it the same name as the project directory (iris_classifier/). This is common in Python projects.
- The naming convention in Python for packages is to have the name all lower case and we can optionally include underscores for readability. ¹⁴

After completing the above steps, our project directory should look like this:

iris_classifier/
├── setup.py
├── LICENSE
├── README.md
└── iris_classifier/

The modules we create will go in the package directory. ¹⁵

To complete our project directory, we will create a few additional directories and files.

First, let’s get our Python modules, __init__.py and iris.py, from here. Then place them in our package directory (iris_classifier/iris_classifier/).
In the package directory (iris_classifier/iris_classifier/), create a directory named data (iris_classifier/iris_classifier/data/).

This is where all of our data files will go.
- We can get our data files from here. Place those data files in the data/ directory we just created.
Next, we create a directory for our scripts that we want to be available to our end users.

In our project directory (iris_classifier/), create a directory named bin. This is where all of our scripts will go.
- In the bin/ directory we will place the scripts that we can get from here.
  
  We will end up with three scripts in the bin/ directory:
  1. bin/classify-iris
  2. bin/mk-iris-model
  3. bin/score-iris-model
Finally, we will create a new directory named notebooks in the project directory (iris_classifier/). If we have Jupyter notebooks that we would like to include in the project, this is a good location to place them. The notebooks are an excellent source of additional documentation for our project.

We will not include the notebooks in the distribution package that we share with our users: they would significantly increase the file size, and users are unlikely to look for them there, since notebooks are rarely shipped inside a distribution package. However, we will keep them in our version control repository (i.e. git repo), which is the appropriate and recommended place for them.
- Place the notebook found here in the notebooks/ directory.

Now our project directory should finally look like the diagram above.
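
For readers who would rather script the layout than create it by hand, here is a minimal sketch that builds the same tree with empty placeholder files. The helper name make_skeleton is my own invention for illustration, and the sketch writes into a temporary directory so that it does not touch a real project. The actual file contents still need to come from the sources linked above.

```python
import tempfile
from pathlib import Path

# Relative paths taken from the project tree shown earlier in this post.
PROJECT_FILES = [
    "setup.py",
    "LICENSE",
    "README.md",
    "iris_classifier/__init__.py",
    "iris_classifier/iris.py",
    "iris_classifier/data/iris_model.pickle",
    "iris_classifier/data/iris_x.csv",
    "iris_classifier/data/iris_y.csv",
    "bin/classify-iris",
    "bin/mk-iris-model",
    "bin/score-iris-model",
    "notebooks/classifier_for_iris_data_set.ipynb",
]


def make_skeleton(root):
    """Create empty placeholder files for the project layout under `root`.

    Hypothetical helper: it only creates empty files, not real content.
    Returns the sorted list of created files, relative to `root`.
    """
    root = Path(root)
    for relative in PROJECT_FILES:
        target = root / relative
        # Create intermediate directories (data/, bin/, notebooks/) as needed.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    )


# Demo in a throwaway directory so nothing is written to a real project.
with tempfile.TemporaryDirectory() as tmp:
    created = make_skeleton(Path(tmp) / "iris_classifier")
    print("\n".join(created))
```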

Creating The Setup File

What is the `setup.py` file for?

From the Python Packaging Authority (PyPA):

The most important file is “setup.py” which exists at the root of your project directory. For an example, see the setup.py in the PyPA sample project. “setup.py” serves two primary functions:

It’s the file where various aspects of your project are configured. The primary feature of setup.py is that it contains a global setup() function. The keyword arguments to this function are how specific details of your project are defined. The most relevant arguments are explained in the section below.

It’s the command line interface for running various commands that relate to packaging tasks. To get a listing of available commands, run python setup.py --help-commands.

Contents of `setup.py`

Below is what our setup.py file will look like when we are done creating it. This should help guide and provide us with context for the next few sections about setup.py and setup().

from setuptools import setup, find_packages

setup(
    name="iris_classifier",
    version="1.0.0",
    license='new BSD',

    description="Example showing how to package Python machine learning solutions.",

    author='Steven Cutting',
    author_email='blog@stevencutting.com',

    packages=find_packages(exclude=('bin', 'notebooks')),
    scripts=['bin/classify-iris', 'bin/mk-iris-model', 'bin/score-iris-model'],
    python_requires=">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, <4",
    install_requires=['scikit-learn>=0.18,<0.19',
                      'numpy>=1.13,<1.14',
                      'scipy>=0.19,<0.20',
                      'click>=6,<7', 
                      'setuptools>=36,<37'],
    package_data={'': ['data/iris_model.pickle']},
    )

Editing `setup.py`

Now open the file we created in the last section named setup.py.

First we need to import the functions setup and find_packages from the setuptools package (find_packages is explained below). So insert the following line toward the top of the file:
```
from setuptools import setup, find_packages
```
Next we need to make use of the setup function, so insert the following expression below our import statement:
```
setup(
)
```
Below we will cover the arguments that need to be supplied to the setup function and what they are used for.

`setup()`

The setup() function allows us to specify various configuration parameters that are used when we create our distribution package, upload it to a repository, and when our users install our package. The values of the arguments we supply will affect how the distribution package is created, what is included in it, and what the metadata says about it.

Below we are going to go over a set of keyword arguments that will allow us to create a distribution package that contains a machine learning solution. ¹⁶

Basic arguments we must provide to `setup()`

First we will set the basic arguments for setup(). We only need to provide a single string as the value for each of these arguments.

name The name of our distribution package, which will be the same as our package’s name (iris_classifier).
```
    name="iris_classifier",
```
Valid names: ¹⁷
- Consist only of ASCII letters, digits, underscores (_), hyphens (-), and/or periods (.), and
- Start & end with an ASCII letter or digit
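
As a quick sanity check, the two rules above can be expressed as a regular expression. This is only a sketch: the function name is_valid_name is made up for this post, and pip and PyPI perform their own validation.

```python
import re

# Mirrors the two rules listed above: only ASCII letters, digits, "_", "-"
# and ".", and an ASCII letter or digit at both ends.
# re.ASCII keeps the case-insensitive match from accepting non-ASCII letters.
NAME_PATTERN = re.compile(
    r"[A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9]",
    re.IGNORECASE | re.ASCII,
)


def is_valid_name(name):
    """Return True if `name` follows the naming rules listed above."""
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ("iris_classifier", "iris-classifier", "_iris", "iris."):
    print(candidate, is_valid_name(candidate))
```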

version

This is the current version of our project. Currently we have it set to 1.0.0 because our project is complete and we are ready to deploy it to production.
```
    version="1.0.0",
```
In this project we are using a semantic versioning ¹⁸ scheme. This is the preferred scheme for Python projects, per the PyPA.

The part of semantic versioning that we will be most concerned with is the three period-separated integers (e.g. 1.0.0).
- *.0.0 - The first integer represents the major version number. A change in this number represents a release of the project that has changes to the application interface that are not backwards compatible with previous versions (i.e. will break applications that are using our package).
- 1.*.0 - The second integer represents the minor version number. A minor release represents changes to the project that are backwards compatible (i.e. will not break applications that are using our package).
- 1.0.* - The third integer represents a release with backwards compatible bug fixes.
It is important to include a correct, updated version number for our project because it allows our users to determine whether or not they have the latest version, and to indicate which versions they’ve tested their own software against.
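
To make the three positions concrete, here is a small sketch that parses a MAJOR.MINOR.PATCH string and reports which kind of release separates two versions. The function names are hypothetical, and the sketch deliberately ignores pre-release and build suffixes, which full semantic versioning also allows.

```python
def parse_version(text):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of three integers."""
    parts = text.split(".")
    if len(parts) != 3:
        raise ValueError("expected MAJOR.MINOR.PATCH, got %r" % text)
    major, minor, patch = (int(part) for part in parts)
    return major, minor, patch


def release_type(old, new):
    """Classify the step from `old` to `new` as 'major', 'minor' or 'patch'."""
    old_v, new_v = parse_version(old), parse_version(new)
    # Tuples compare position by position, so (1, 10, 0) > (1, 9, 0).
    if new_v <= old_v:
        raise ValueError("new version must be greater than the old one")
    if new_v[0] != old_v[0]:
        return "major"  # not backwards compatible
    if new_v[1] != old_v[1]:
        return "minor"  # backwards compatible changes
    return "patch"      # backwards compatible bug fixes


print(release_type("1.0.0", "1.0.1"))
print(release_type("1.0.1", "1.1.0"))
print(release_type("1.1.0", "2.0.0"))
```

Comparing the versions as integer tuples rather than as strings matters: as strings, "1.10.0" would sort before "1.9.0".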

description

A brief description of the project in one or two sentences.

    description="Example showing how to package Python machine learning solutions.",

author and author_email
- author contains the author’s name.
```
   author='Steven Cutting',
```
- author_email obviously contains the author’s email address.
```
   author_email='blog@stevencutting.com',
```
license

Here we will provide the type of license we are using (new BSD). Remember that in the section First steps to set up our project directory, we chose to use the ‘new BSD’ (a.k.a ‘BSD three clause’) license.
```
    license='new BSD',
```

Additional required arguments

packages

The packages argument is used to list the Python packages in our project.

We could list them manually, but we will instead use the find_packages function from setuptools. The find_packages function will automatically find and add our project’s packages to the distribution package that we will create later on. As you can see below, we can use the exclude keyword argument to find_packages in order to list directories in our project that we do not want to include as packages in our distribution package. We will be excluding the bin and notebooks directories.
```
packages=find_packages(exclude=('bin', 'notebooks')),
```
Note that the reason bin and notebooks are being excluded here is because we are only interested in keeping directories with Python modules that our users will want to import and use within their own Python code. Remember that bin contains our scripts, but we will specify this in another argument to be provided to setup() that will handle the scripts.

python_requires

This argument states which versions of Python our package will work with. ¹⁹ It is important that we only list versions that we have tested against. Doing this will help avoid compatibility issues that our users will find very annoying.

We will set it to:
```
  python_requires=">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, <4",
```
This says that our package will work with Python version 2.7 and version 3 starting with 3.3 but will not work with version 4 or greater.
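
If the specifier string is hard to read, it may help to see the same rule spelled out in plain Python. This is only an illustration with a made-up function name; pip evaluates python_requires itself at install time, so nothing like this needs to live in our package.

```python
import sys


def python_is_supported(version):
    """Return True if `version` satisfies
    '>=2.7, !=3.0.*, !=3.1.*, !=3.2.*, <4'.

    `version` is any sequence starting with (major, minor), for example
    sys.version_info.
    """
    major_minor = tuple(version[:2])
    if major_minor < (2, 7):                     # >=2.7
        return False
    if major_minor >= (4, 0):                    # <4
        return False
    if major_minor in {(3, 0), (3, 1), (3, 2)}:  # !=3.0.*, !=3.1.*, !=3.2.*
        return False
    return True


# Check the interpreter that is running this sketch.
print(python_is_supported(sys.version_info))
```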

install_requires

Here we provide a list that contains the dependencies that our project requires. ²⁰

The dependencies list is assigned to install_requires:
```
install_requires=['scikit-learn>=0.18,<0.19', 'numpy>=1.13,<1.14', 'scipy>=0.19,<0.20',
                  'click>=6,<7', 'setuptools>=36,<37'],
```
Here we are listing version ranges for our dependencies. For example, with scikit-learn we have listed that we want version 0.18 or greater (>=0.18), but less than, and not equal to, 0.19 (<0.19). Thus, we define lower and upper bounds instead of pinning the dependency to a specific version.
- Note that we are using abstract requirements, i.e. we do not specify where the requirements come from, only that we need something with that name. For example, we specify that we require numpy, but we don’t specify where it comes from; it’s up to the user to source and install numpy.
- Pinning our dependencies to specific versions is considered to be overly restrictive and will not allow our end users to take full advantage of minor upgrades to packages. If we are too specific we also increase the likelihood of creating conflicts when installing multiple packages with the same dependencies.
- We also want to avoid listing the dependencies of our dependencies. That is not considered best practice as it is too restrictive.
- For an extended discussion of this subject refer to these two pages: caremad: setup.py vs requirements.txt and install_requires vs Requirements files.
  
  These references also help to explain why we are not using a requirements.txt and why it would be inappropriate for use with a distribution package (vs. an application).
- Despite what would normally be considered best practice in terms of dependencies, we must take into account that we are dealing with a specific case of packaging a machine learning model with its requirements and issues. In this case, we might actually want to be a little more restrictive with dependencies. In our install_requires we have specified a major and minor version (remember the explanation of versions earlier) for scikit-learn, numpy, and scipy. The reason is that changes within these packages that do not affect the API could still have an impact on the output of our models. To be more confident about what kind of range we should set, we should read up on the way that these packages handle their versioning.
  
  We are also including scipy even though it is a dependency of a dependency, for the same reasons listed above.
- Note that if you are creating a package that has the sole purpose of being placed within a single end application (maybe your package will become part of a RESTful microservice), then you may be more specific about the dependencies. Nevertheless, you still want to be as flexible as reasonably possible to allow the end application developer to pin the dependencies (with your input). This is especially applicable in scenarios where you are in direct contact with all of the consumers of your package and are able to discuss what would work best.
  
  Once again, in general you should attempt to not be unnecessarily inflexible with your package dependencies.
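
To see what a bounded range such as 'scikit-learn>=0.18,<0.19' accepts and rejects, here is a rough sketch. It assumes the simple name>=LOWER,<UPPER form used in our install_requires and plain numeric versions only. The function names are hypothetical, and pip’s real resolution logic handles many more cases (pre-releases, other operators, extras).

```python
import re


def _as_tuple(version):
    """Turn '0.18.2' into (0, 18, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def parse_requirement(requirement):
    """Split e.g. 'numpy>=1.13,<1.14' into (name, lower, upper).

    Only the 'name>=LOWER,<UPPER' form is understood.
    """
    match = re.fullmatch(r"([A-Za-z0-9._-]+)>=([0-9.]+),<([0-9.]+)", requirement)
    if match is None:
        raise ValueError("unsupported requirement: %r" % requirement)
    name, lower, upper = match.groups()
    return name, _as_tuple(lower), _as_tuple(upper)


def in_range(requirement, candidate):
    """Return True if the `candidate` version falls inside the range."""
    _, lower, upper = parse_requirement(requirement)
    return lower <= _as_tuple(candidate) < upper


for version in ("0.17.1", "0.18.0", "0.18.2", "0.19.0"):
    print(version, in_range("scikit-learn>=0.18,<0.19", version))
```

The range admits every 0.18.x bug-fix release while keeping out 0.19, which is the behaviour we want when a minor release of a dependency could change the output of our model.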

package_data

The package_data argument to setup() allows us to specify data files to be included with our package.

In our project we have three files in the directory iris_classifier/iris_classifier/data/, which include a pickled ²¹ fitted model and two .csv data files.

As the package we are creating is intended to provide a pretrained machine learning solution for our users, we only want to include the file for the fitted model. Including the data files is unnecessary and it would greatly increase the size of the package, especially in the case of very large datasets.

The value for this argument is:
```
  package_data={'': ['data/iris_model.pickle']},
```
We are only listing the pickled model, nothing else. When we create the package it will ignore the other data files not listed in this argument. ²²

To allow our code to access the data files we are listing in the value of the package_data argument, we need to use the functions resource_string and resource_filename from the pkg_resources package, which is included with setuptools. For an example of this view the code for the functions load_pkg_data and load_pkg_pre_fit_model from our example project’s module iris_classifier/iris_classifier/iris.py. ²³
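
The example project’s own loader functions are not reproduced in this post, so the following is only a sketch of the general idea, not the code in iris.py. To keep it runnable without setuptools, it swaps in the standard library’s pkgutil.get_data in place of pkg_resources, and it builds a throwaway package (demo_iris_pkg, a made-up name) containing a stand-in pickle rather than a real fitted model.

```python
import importlib
import pickle
import pkgutil
import sys
import tempfile
from pathlib import Path


def load_pkg_model(package, resource="data/iris_model.pickle"):
    """Unpickle a file shipped as package data inside `package`.

    Hypothetical helper using pkgutil.get_data (standard library) instead
    of the pkg_resources functions mentioned above.
    """
    raw = pkgutil.get_data(package, resource)
    if raw is None:
        raise FileNotFoundError("%s not found in %s" % (resource, package))
    return pickle.loads(raw)


# Demo: create a temporary package with a data/ directory, then load from it.
with tempfile.TemporaryDirectory() as tmp:
    pkg_dir = Path(tmp) / "demo_iris_pkg"
    (pkg_dir / "data").mkdir(parents=True)
    (pkg_dir / "__init__.py").write_text("")

    # A plain dict stands in for a fitted scikit-learn model.
    stand_in_model = {"classes": ["setosa", "versicolor", "virginica"]}
    (pkg_dir / "data" / "iris_model.pickle").write_bytes(
        pickle.dumps(stand_in_model)
    )

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        loaded = load_pkg_model("demo_iris_pkg")
    finally:
        sys.path.remove(tmp)

print(loaded["classes"])
```

The important point is the same with either library: the code looks the file up relative to the installed package, not relative to the current working directory, so it keeps working after a pip install.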

scripts

The value for our scripts argument is:
```
  scripts=['bin/classify-iris', 'bin/mk-iris-model', 'bin/score-iris-model'],
```
The scripts argument allows us to list scripts that we would like to include in the distribution package. When our users install the distribution package, the listed scripts are typically copied into a directory that is on their PATH environment variable, which basically means that the scripts will be available to them just like any other shell command.

Now that we have filled out enough of the arguments to setup(), below we will create a distribution package that we can share with others. ²⁴

Packing It Up

Source Distributions

What is a Source Distribution?

A source distribution is a distribution that has not undergone any build steps. With a pure Python project that means that no installation metadata has been built. The installation metadata will be built when the distribution package is installed.

Creating a Source Distribution

First, let’s change to the project directory.
Then, run python setup.py sdist to create the distribution package.
In the project directory there will now be a directory named dist.

dist/ contains the archive file iris_classifier-1.0.0.tar.gz.

Note that under Windows it will be a .zip archive. ²⁵
We have successfully created a pip installable distribution package (iris_classifier-1.0.0.tar.gz) that we can now share with others!
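
Before sharing the archive, it can be worth checking what actually ended up inside it, for example that the pickled model is present and the .csv files are not. Below is a minimal sketch using the standard library’s tarfile module. So that it can run anywhere, the demo builds a tiny stand-in archive instead of a real sdist; the helper names and the stand-in member list are assumptions for illustration. On a real project, list_sdist would simply be pointed at dist/iris_classifier-1.0.0.tar.gz.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def list_sdist(archive_path):
    """Return the sorted member names of a .tar.gz source distribution."""
    with tarfile.open(str(archive_path), "r:gz") as archive:
        return sorted(archive.getnames())


def build_demo_archive(archive_path):
    """Write a tiny stand-in archive (NOT a real sdist) for the demo."""
    prefix = "iris_classifier-1.0.0"
    members = [
        "setup.py",
        "iris_classifier/__init__.py",
        "iris_classifier/data/iris_model.pickle",
    ]
    with tarfile.open(str(archive_path), "w:gz") as archive:
        for name in members:
            info = tarfile.TarInfo(name="%s/%s" % (prefix, name))
            info.size = 0  # empty placeholder files
            archive.addfile(info, io.BytesIO(b""))


with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "iris_classifier-1.0.0.tar.gz"
    build_demo_archive(demo)
    names = list_sdist(demo)

has_model = any(name.endswith("data/iris_model.pickle") for name in names)
has_csv = any(name.endswith(".csv") for name in names)
print("model included:", has_model)
print("csv files included:", has_csv)
```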

How to install a Source Distribution

The easiest way to install a source distribution is to use pip.

In order to install the package all we need to do is run the command:

    pip install iris_classifier-1.0.0.tar.gz

This will install the package just like most other packages we might install using pip.

A Note On Python Wheels

In case you plan to share your package on PyPI ²⁶, or if you have extensions that need to be compiled, you should look into using Python Wheels ²⁷ for packaging, in addition to creating source distributions.

Distributing it

Email

The simplest way to share our project with a client is to email them the package we created in the Packing It Up section. If the project will be used by no more than a handful of users and there isn’t distribution infrastructure in place, this will get the job done. Don’t always overthink it.

PyPI

PyPI, short for Python Package Index, is a public central repository of Python packages. It is the one that is used by pip by default. If we place our project on PyPI, other Python programmers will be able to freely download and use it. ²⁸

Private Package Repository

It’s possible that we might find ourselves in a situation where our client or company does not want a particular project to be made available on PyPI, but they still want to have the benefits of having a package repository.

The client may have their own private deployment of PyPI, in which case we would talk with their engineers about the best way to upload our distribution package to their service.

There are also a number of proprietary services that allow us to host distribution packages and control who has access to them. ²⁹

Some Additional Notes On Packaging and Project Structure

In this tutorial we used a simple example in which the data and fitted model files are small enough to fit on a laptop, and all of the code (to train and use the model) is included in one package. This is a simple approach that fits many projects.

In case of a more complex project and bigger data, we might structure our project and package(s) differently. This is beyond the scope of this post and will be left for another time, but it is something that you should be aware of.

Things we haven’t covered but should be aware of

These topics aren’t directly related to packaging but they’re very important because they concern the quality of the code within the package you are creating. No one will trust your package if its contents do not meet certain expectations.

Virtualenv

Basically, virtualenv helps us to deal with multiple projects that have incompatible dependencies. For example, if we have two different projects that require different versions of the same package, virtualenv can help us manage this. ³⁰

Automated Testing

Having an automated test suite for our software is important.
PyTest ³¹ is an excellent and approachable testing framework. ³²

Version Control

Version control will help you make modifications to your project in a more organized fashion, by keeping track of changes.
A popular option is git, but there are other options such as mercurial and subversion.

A special thanks to Paula (@LadyData) for all her help editing this post.

About pip. ↩︎
PyPA definition for Distribution Package: A versioned archive file that contains Python packages, modules, and other resource files that are used to distribute a Release. The archive file is what an end-user will download from the internet and install. A distribution package is more commonly referred to with the single words “package” or “distribution”, but this guide may use the expanded term when more clarity is needed to prevent confusion with an Import Package (which is also commonly called a “package”) or another kind of distribution (e.g. a Linux distribution or the Python language distribution), which are often referred to with the single term “distribution”. ↩︎
About scikit-learn. ↩︎
About numpy. ↩︎
About scipy. ↩︎
If you are not familiar with pip (or Conda), check out the tutorial provided in the pip documentation that covers installing pip, as well as installing packages with pip. ↩︎
About conda. ↩︎
Wikipedia entry on the Iris flower data set. ↩︎
Info on scikit-learn’s Support Vector Machine Classifier. ↩︎
About Jupyter Notebook. ↩︎
Wikipedia entry on README files. ↩︎
About ‘BSD 3-Clause License’. ↩︎
scikit-learn’s license. ↩︎
For more information on naming conventions for packages and modules read the PEP-8 rules. ↩︎
For more information on the best practices for structuring a Python project and your code, check out The Hitchhiker’s Guide to Python - Structuring Your Project, as a detailed treatment of the topic is well beyond the scope of this post. ↩︎
If you plan on placing your package in a repository such as PyPI, there are additional arguments you should supply. You should go through the PyPA’s official guide. ↩︎
Valid distribution package names as defined by the PyPA. ↩︎
More info about semantic versioning. ↩︎
More info on python_requires. ↩︎
More info on install_requires. ↩︎
Python pickle ↩︎
Note: this is a nice, simple approach, but sometimes it may not be sufficient for your needs. For more information on including files in a package, refer to the “including data files” section of the setuptools documentation ↩︎
For more information on accessing included data files, check out the setuptools guide on accessing data files. ↩︎
For more setup() arguments, check out the PyPA guide. ↩︎
From the documentation on source distributions built on Windows: “The default format is a gzip’ed tar file (.tar.gz) on Unix, and ZIP file on Windows.” ↩︎
PyPI ↩︎
Python Wheels ↩︎
If you want to share your package on PyPI you will need to go through the Python Packaging Authority’s guide on packaging. There are additional setup() parameters that you will need to supply. ↩︎
I don’t want to give free advertising to these companies. You can find them by searching for private python package hosting. ↩︎
About virtualenv ↩︎
About PyTest. ↩︎
For information on how to get started using PyTest, check out their good practices guide. ↩︎