Practical Packaging For Machine Learning Solutions

About This Post

What/Why

Do you have machine learning models written in Python that you would like to share with other people (e.g. clients, co-workers, etc.)?

Sharing these projects becomes much easier once we know how to create our own pip installable packages.

A few things to note

This post attempts to follow generally accepted best practices as much as possible. The challenge is that the application of Python packaging in machine learning projects is not widely discussed, and as a result there are few references explicitly defined for this use case. Therefore, I have done my best to extend the general best practices and to draw on my own professional experience in creating this post.

What our targeted outcome is

In this post we will learn how to make a simple machine learning solution, written in Python, pip installable. Thus anyone who knows how to use pip 1 will be able to make use of our solution.

The example machine learning solution that we will create provides a Python API so that it can be used in other Python programs. It also provides scripts that will be made available when the distribution package 2 is installed using pip.

We will cover how to package not only the code but also a fitted model, including common ML dependencies such as scikit-learn 3, numpy 4, and scipy 5.

Use cases

This approach is best for models that are not frequently updated and that are small enough to run on a single machine. If you are using scikit-learn your project is probably appropriate. Even if your project falls outside of this post’s assumptions, the basic packaging information is still very useful.

Prerequisite Knowledge (What I assume about you)

I assume that you know Python. I also assume that you are familiar with the package manager pip 6 or at least conda. 7

In addition, I assume that you are some kind of machine learning practitioner. However, if this is not the case, you may still find something useful for packaging other types of projects.

Requirements

You need to have pip and setuptools installed. If they are not installed, check out the Python Packaging Authority's (PyPA) guide.

If you are on Windows or would simply prefer to use Conda, you can install it instead. Conda comes with pip, so you should still be able to follow along, but I will not be taking Conda-related issues into account, and I cannot guarantee that you will not run into problems. Even with Conda, make sure that you have pip and setuptools installed.

Quick Overview Of Our Example Project

The example project (iris_classifier) provides a Python API (application programming interface) and scripts that allow users to easily train a classifier to predict the species of an Iris (e.g. Iris setosa, Iris virginica, Iris versicolor) based on measurements of the length and width of the sepals and petals. Once a model has been trained, it can be used to classify new samples of Irises.

The project uses the classic Iris flower data set. 8

Some of its dependencies include scikit-learn’s Support Vector Machine Classifier 9 and NumPy.

This is a simple example but it will suffice for demonstrating a method of packaging Python machine learning projects.

The sample project also includes a Jupyter Notebook 10 which we will use as an example of how to include notebooks with our projects as additional documentation. (The example notebook can be found here)


Requirements To Create A pip Installable Package

To create a pip installable package we must provide various forms of documentation and a setup file that tells our packaging program how to create the kind of package we want. This includes specifying the data files to include, stating which versions of Python our project works with, listing our project's dependencies, and a few additional things that we will cover below.

Overview

To help give some perspective, below is a preview of what our project’s file structure will look like.

iris_classifier/
├── setup.py
├── LICENSE
├── README.md
├── iris_classifier/
│   ├── __init__.py
│   ├── iris.py
│   └── data
│       ├── iris_model.pickle
│       ├── iris_x.csv
│       └── iris_y.csv
├── bin
│   ├── classify-iris
│   ├── mk-iris-model
│   └── score-iris-model
└── notebooks
    └── classifier_for_iris_data_set.ipynb

First steps to set up our project directory

First, we must create our project directory and fill it with a few required items. We will follow these steps to do this.

  1. Let’s start by creating a directory for our project:

    • Create a directory called iris_classifier; this will serve as the name of our project.
  2. Next, create these files within our project directory (iris_classifier/):

    • setup.py

      setup.py is an important configuration file and will contain a lot of important information about our project, such as its dependencies. We will cover the contents of the setup.py file in more detail below.

    • README.md

      We can get our README file from here, and place it in our project directory (iris_classifier/).

      It is important to include a README file. 11 It serves as the first reference someone should consult when they find our project. It provides instructions for, or at least references to, things such as how to install our project, what our project is about, how to use it, how to contribute, etc.

    • LICENSE

      We will use the ‘BSD 3-Clause License’ 12 for this project. It is also the one used by scikit-learn. 13

      We can get our LICENSE file from here and place it in our project directory (iris_classifier/).

      Anytime a project is made available online it is a very good idea to include a license. GitHub provides a guide that is helpful for picking an open-source license. Consulting a lawyer can sometimes be prudent if there are any doubts. When creating a project for an employer or a client it is recommended to consult them and/or a lawyer.

  3. Finally, we must create a directory for our Python Package within our project directory.

    • We will give it the same name as the project directory (iris_classifier/). This is common in Python projects.
    • The naming convention in Python for packages is to have the name all lower case and we can optionally include underscores for readability. 14

After completing the above steps, our project directory should look like this:

iris_classifier/
├── setup.py
├── LICENSE
├── README.md
└── iris_classifier/

The modules we create will go in the package directory. 15
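
The naming convention mentioned above (all lower case, with optional underscores) is easy to check in code. Below is a minimal sketch; the function name is hypothetical and the pattern is my own reading of the convention, not an official validator.

```python
import keyword
import re

# Lower-case letters, digits and underscores, starting with a letter.
# This pattern is an assumption based on the convention described above.
_PACKAGE_NAME = re.compile(r"[a-z][a-z0-9_]*")


def is_conventional_package_name(name):
    """Return True if name looks like a conventional Python package name."""
    # A package name must also be importable, so reserved words are rejected.
    return bool(_PACKAGE_NAME.fullmatch(name)) and not keyword.iskeyword(name)
```

For example, `is_conventional_package_name("iris_classifier")` passes, while a name containing capital letters or a hyphen does not.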

To complete our project directory, we will create a few additional directories and files.

  1. First, let’s get our Python modules, the __init__.py and iris.py files, from here. Then place them in our package directory (iris_classifier/iris_classifier/).

  2. In the package directory (iris_classifier/iris_classifier/), create a directory named data (iris_classifier/iris_classifier/data/).

    This is where all of our data files will go.

    • We can get our data files from here. Place those data files in the data/ directory we just created.
  3. Next, we create a directory for our scripts that we want to be available to our end users.

    In our project directory (iris_classifier/), create a directory named bin. This is where all of our scripts will go.

    • In the bin/ directory we will place the scripts that we can get from here.

      We will end up with three scripts in the bin/ directory:

      1. bin/classify-iris
      2. bin/mk-iris-model
      3. bin/score-iris-model
  4. Finally, we will create a new directory named notebooks in the project directory (iris_classifier/). If we have Jupyter notebooks that we would like to include in the project, this is a good location to place them. The notebooks are an excellent source of additional documentation for our project.

    We will not include the notebooks in the distribution package that we share with our users, because they can significantly increase the file size and are unlikely to be opened there; it is unusual to ship notebooks inside a distribution package. However, we will include them in our version control repository (i.e. git repo), as this is appropriate and recommended.

    • Place in the notebooks/ directory the notebook found here.

Now our project directory should finally look like the diagram above.
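
The manual steps above can also be scripted. The following is a minimal sketch, assuming Python 3 (for pathlib). The function name scaffold_project is hypothetical and is not part of the example project. It only creates empty placeholders; the real file contents (modules, data, scripts, notebook) still come from the links given above.

```python
from pathlib import Path

# Layout taken from the directory tree shown earlier in this post.
DIRECTORIES = [
    "iris_classifier",
    "iris_classifier/data",
    "bin",
    "notebooks",
]
EMPTY_FILES = [
    "setup.py",
    "LICENSE",
    "README.md",
    "iris_classifier/__init__.py",
]


def scaffold_project(parent_dir, project_name="iris_classifier"):
    """Create the empty project skeleton under parent_dir and return its path."""
    root = Path(parent_dir) / project_name
    for relative in DIRECTORIES:
        (root / relative).mkdir(parents=True, exist_ok=True)
    for relative in EMPTY_FILES:
        # touch() creates the file if it is missing and keeps existing content.
        (root / relative).touch(exist_ok=True)
    return root
```

Calling `scaffold_project(".")` from an empty working directory would produce the skeleton, ready to be filled in.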


Creating The Setup File

What is the setup.py file for?

From the Python Packaging Authority (PyPA):

The most important file is “setup.py” which exists at the root of your project directory. For an example, see the setup.py in the PyPA sample project. “setup.py” serves two primary functions:

  1. It’s the file where various aspects of your project are configured. The primary feature of setup.py is that it contains a global setup() function. The keyword arguments to this function are how specific details of your project are defined. The most relevant arguments are explained in the section below.
  2. It’s the command line interface for running various commands that relate to packaging tasks. To get a listing of available commands, run python setup.py --help-commands.

Contents of setup.py

Below is what our setup.py file will look like when we are done creating it. This should help guide and provide us with context for the next few sections about setup.py and setup().


from setuptools import setup, find_packages

setup(
    name="iris_classifier",
    version="1.0.0",
    license='new BSD',

    description="Example for showing how to package python machine learning solutions.",

    author='Steven Cutting',
    author_email='blog@stevencutting.com',

    packages=find_packages(exclude=('bin', 'notebooks')),
    scripts=['bin/classify-iris', 'bin/mk-iris-model', 'bin/score-iris-model'],
    python_requires=">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, <4",
    install_requires=['scikit-learn>=0.18,<0.19',
                      'numpy>=1.13,<1.14',
                      'scipy>=0.19,<0.20',
                      'click>=6,<7', 
                      'setuptools>=36,<37'],
    package_data={'': ['data/iris_model.pickle']},
    )
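
The package_data entry above ships data/iris_model.pickle inside the installed package, so the code needs a way to find that file wherever pip puts it. Below is a minimal sketch of one way to do this using only the standard library (pkgutil.get_data). The function name load_packaged_model is hypothetical, and the project's own iris.py may load the model differently, for example through the setuptools facilities.

```python
import pickle
import pkgutil


def load_packaged_model(package="iris_classifier",
                        resource="data/iris_model.pickle"):
    """Load a pickled object that was shipped inside an installed package."""
    # pkgutil.get_data returns the resource's raw bytes. It returns None if
    # the package cannot be found or its loader cannot provide data.
    raw = pkgutil.get_data(package, resource)
    if raw is None:
        raise IOError("could not read %s from package %s" % (resource, package))
    # Only unpickle files you trust: unpickling can execute arbitrary code.
    return pickle.loads(raw)
```

The resource path is always written with forward slashes and is relative to the package directory, regardless of the operating system.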

Editing setup.py

Now open the file we created in the last section named setup.py.

setup()

The setup() function allows us to specify various configuration parameters that are used when we create our distribution package, upload it to a repository, and when our users install our package. The values of the arguments we supply will affect how the distribution package is created, what is included in it, and what the metadata says about it.

Below we are going to go over a set of keyword arguments that will allow us to create a distribution package that contains a machine learning solution. 16

Basic arguments we must provide to setup()

First we will set the basic arguments for setup(). We only need to provide a single string as the value for each of these arguments.

Additional required arguments
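
One of the arguments in the setup.py shown earlier is python_requires, set to ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, <4". The commas act as a logical "and": every clause must hold for pip to install the package on a given interpreter. The sketch below is a hand-written illustration of what that particular string allows. It is not how pip evaluates specifiers (pip applies the full PEP 440 rules), and the function name is hypothetical.

```python
import sys


def python_is_supported(version_info=None):
    """Mirror ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, <4" for (major, minor) pairs."""
    if version_info is None:
        version_info = sys.version_info
    major, minor = version_info[0], version_info[1]
    if (major, minor) < (2, 7):          # >=2.7
        return False
    if major == 3 and minor in (0, 1, 2):  # !=3.0.*, !=3.1.*, !=3.2.*
        return False
    if major >= 4:                       # <4
        return False
    return True
```

In other words, the package accepts Python 2.7 and any Python 3 release from 3.3 onward.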

Now that we have filled out enough of the arguments to setup(), below we will create a distribution package that we can share with others. 24


Packing It Up

Source Distributions

What is a Source Distribution?

A source distribution is a distribution that has not undergone any build steps. With a pure Python project that means that no installation metadata has been built. The installation metadata will be built when the distribution package is installed.

Creating a Source Distribution

  1. First, let’s change to the project directory.

  2. Then, run python setup.py sdist to create the distribution package.

  3. In the project directory there will now be a directory named dist.

    dist/ contains the archive file iris_classifier-1.0.0.tar.gz.

    Note that under Windows it will be a .zip archive. 25

  4. We have successfully created a pip installable distribution package (iris_classifier-1.0.0.tar.gz) that we can now share with others!
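
Before sharing the archive, it is worth confirming that the fitted model actually made it in. Below is a minimal sketch that lists the contents of the source distribution using the standard library. The helper names are hypothetical, and it assumes the .tar.gz format rather than a .zip archive.

```python
import tarfile


def sdist_members(archive_path):
    """Return the file names stored inside a .tar.gz source distribution."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()


def sdist_contains(archive_path, relative_path):
    """Return True if any archive member ends with relative_path."""
    # Members are prefixed with "<name>-<version>/", so match on the suffix.
    suffix = "/" + relative_path
    return any(name.endswith(suffix) for name in sdist_members(archive_path))
```

For our project the check would be `sdist_contains("dist/iris_classifier-1.0.0.tar.gz", "iris_classifier/data/iris_model.pickle")`. The same helper can confirm that the notebooks were left out.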

How to install a Source Distribution

The easiest way to install a source distribution is to use pip.

In order to install the package all we need to do is run the command:

    pip install iris_classifier-1.0.0.tar.gz

This will install the package just like most other packages we might install using pip.
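
To confirm that the install worked, we can ask Python which version of the distribution it sees. This is a minimal sketch, assuming Python 3.8 or newer (for importlib.metadata); the function name installed_version is hypothetical.

```python
from importlib import metadata  # available in Python 3.8+


def installed_version(distribution_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None
```

The name to pass is the one given to the name argument of setup(), so `installed_version("iris_classifier")` should return the version string from setup.py once the package is installed.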

A Note On Python Wheels

In case you plan to share your package on PyPI 26, or if you have extensions that need to be compiled, you should look into using Python Wheels 27 for packaging, in addition to creating source distributions.


Distributing it

Email

The simplest way to share our project with a client is to email them the package we created in the Packing It Up section. If the project will be used by no more than a handful of users and there isn’t distribution infrastructure in place, this will get the job done. Don’t always overthink it.

PyPI

PyPI, short for Python Package Index, is a public central repository of Python packages. It is the one that is used by pip by default. If we place our project on PyPI, other Python programmers will be able to freely download and use it. 28

Private Package Repository

It’s possible that we might find ourselves in a situation where our client or company does not want a particular project to be made available on PyPI, but they still want to have the benefits of having a package repository.

The client may have their own private deployment of PyPI, in which case we would talk with their engineers about the best way to upload our distribution package to their service.

There are also a number of proprietary services that allow us to host distribution packages and control who has access to them. 29

Some Additional Notes On Packaging and Project Structure

In this tutorial we used a simple example in which the data and fitted model files are small enough to fit on a laptop, and all of the code (to train and use the model) is included in one package. This is a simple approach that fits many projects.

In case of a more complex project and bigger data, we might structure our project and package(s) differently. This is beyond the scope of this post and will be left for another time, but it is something that you should be aware of.

Things we haven’t covered but should be aware of

These topics aren’t directly related to packaging, but they’re very important because they concern the quality of the code within the package you are creating. No one will trust your package if its contents do not meet certain expectations.

Virtualenv

Automated Testing

Version Control


A special thanks to Paula (@LadyData) for all her help editing this post.

  1. About pip↩︎

  2. PyPA definition for Distribution Package: A versioned archive file that contains Python packages, modules, and other resource files that are used to distribute a Release. The archive file is what an end-user will download from the internet and install. A distribution package is more commonly referred to with the single words “package” or “distribution”, but this guide may use the expanded term when more clarity is needed to prevent confusion with an Import Package (which is also commonly called a “package”) or another kind of distribution (e.g. a Linux distribution or the Python language distribution), which are often referred to with the single term “distribution”. ↩︎

  3. About scikit-learn↩︎

  4. About numpy↩︎

  5. About scipy↩︎

  6. If you are not familiar with pip (or Conda), check out the tutorial provided in the pip documentation that covers installing pip, as well as installing packages with pip. ↩︎

  7. About conda↩︎

  8. Wikipedia entry on the Iris flower data set↩︎

  9. Info on scikit-learn’s Support Vector Machine Classifier↩︎

  10. About Jupyter Notebook↩︎

  11. Wikipedia entry on README files↩︎

  12. About ‘BSD 3-Clause License’↩︎

  13. scikit-learn’s license↩︎

  14. For more information on naming conventions for packages and modules read the PEP-8 rules↩︎

  15. For more information on the best practices for structuring a Python project and your code, check out The Hitchhiker’s Guide to Python - Structuring Your Project, as a detailed treatment of the topic is well beyond the scope of this post. ↩︎

  16. If you plan on placing your package in a repository such as PyPI, there are additional arguments you should supply. You should go through the PyPA’s official guide↩︎

  17. Valid distribution package names as defined by the PyPA↩︎

  18. More info about semantic versioning↩︎

  19. More info on python_requires↩︎

  20. More info on install_requires↩︎

  21. Python pickle ↩︎

  22. Note: this is a nice, simple approach, but sometimes it may not be sufficient for your needs. For more information on including files in a package, refer to the “including data files” section of the setuptools documentation ↩︎

  23. For more information on accessing included data files, check out the setuptools guide on accessing data files↩︎

  24. For more setup() arguments, check out the PyPA guide↩︎

  25. From the documentation on source distributions built on Windows: “The default format is a gzip’ed tar file (.tar.gz) on Unix, and ZIP file on Windows.” ↩︎

  26. PyPI ↩︎

  27. Python Wheels ↩︎

  28. If you want to share your package on PyPI you will need to go through the Python Packaging Authority's guide on packaging. There are additional setup() parameters that you will need to supply. ↩︎

  29. I don’t want to give free advertising to these companies. You can find them by searching for private python package hosting. ↩︎

  30. About virtualenv ↩︎

  31. About PyTest↩︎

  32. For information on how to get started using PyTest, check out their good practices guide ↩︎