Hi! Hope you're enjoying this blog. I have a new home at www.goldsborough.me. Be sure to also check by there for new posts <3

Tuesday, August 11, 2015

Creating and Distributing a Python Project

I just spent the last week giving birth to ecstasy, my first full-fledged, fully-distributed, fully-fancied-up Python project. It was an amazing adventure and I would like to share all the insights and knowledge I acquired over the last few days. Moreover, I’d like to spare you all the trouble I’ve had and headaches I got and tears I cried.


It’s not that I hadn’t already written tons of stuff in Python and hadn’t already open-sourced multiple projects on Github, but this time I wanted all the fancy bells and whistles. With that I mean, and this is what I’ll show you how to do here:

I will be writing this article as if you hadn’t written anything yet, but had only an idea for your next project, which I’ll refer to as your_app. Nevertheless, it’s quite likely that you’re here right in the heat, looking for a solution to a sub-problem of this whole process and are thus not even reading this sentence but are somewhere below. So, since no-one is reading this sentence, I’ll just take this opportunity to say hi to my mom. Hi.

But on a more serious note, this article can be followed in a linear way, but needn’t be. Also note that I will be sharing valuable resources regarding certain topics that I used to acquire the knowledge I’m sharing with you and that you may also want to laugh at. I mean read. Regarding the "level", I'll try to be more explicit and explain more rather than not. You can always skip what you already know.

We’ll start with a basic directory structure and then add more and more layers of cream and cherries and chocolate, spray some sugar and love on it, spread caramel and marshmallow all over and lastly have it blessed by the almighty Octocat after which we’ll unite in a wild frenzy of excitement and exhilaration and dance around a fire. Sounds good?

Getting Started

A basic directory structure can look like follows:

├── docs
├── (examples)
├── tests
└── your_app

  • your_app/ is the top-level directory, containing your entire project.
  • your_app/your_app/ contains your application’s source code.
  • your_app/docs/ includes everything related to documentation.
  • your_app/examples/ could include small examples of how to use your application. I put this in parentheses because I haven’t seen that many projects on Github with such an examples directory. Personally I quite like the idea of it, but you could also argue that you’d rather put examples into the documentation. Your call.

We’ll want to do three things now:
  1. Make sure you have pip so we can get any and all packages we want.
  2. Initialize a local git repository.
  3. Setup a virtual environment (virtualenv).

The Cheese Shop

pip is a command-line utility, more precisely a “package-manager”, for installing Python packages from PyPI — the Python Package Index (pronounced pie-pee-ie) — also referred to as The Butchery … no wait, that doesn’t seem right. Also referred to as The Cheese Shop. Yes, better. But if you’ve ever used npm for Node.js, RubyGems for Ruby or Bower for Javascript and all things web, then you know what’s up. Either way, we will need it for many, many things so make sure you have pip holstered and ready to fire.

Installing it is easy if your Python distribution’s version ($ python -- version) is 2.7.9 or 3.4.0 (and higher): you don’t have to. For Python 2 < 2.7.9 and Python 3 < 3.4, you can use this command to get it off the command-line:

$ curl --silent --show-error https://bootstrap.pypa.io/get-pip.py | sudo python



At this point, it's a good idea to initialize a local git repository in your application's root directory (your_app/) to record whatever we are doing and save us in case we mess anything up. I won't tell you when to commit anything for the remainder of this article, that's up to you, but remember: commit early and often.

your_app$ git init

At the same time, grab a suitable .gitignore file for Python off Github:

your_app$ curl https://raw.githubusercontent.com/github/gitignore/master/Python.gitignore > .gitignore


We'll want to run everything we do inside a virtual environment. What's that? A virtual environment is an isolated container or environment that we can create with the virtualenv command, that acts just like your normal, global system environment with one important difference: whatever we install with pip will only be installed for the virtual environment (which is really a directory) and not globally -- i.e. what happens in a virtualenv stays in a virtualenv. Thereby, we can have different requirements for different projects, not pollute our global environment and also keep track of exactly what we need to execute and test our application. This last bit is important, because whatever we need to install in our virtual environment, users and contributors will also have to install. Thus, in summary, a virtualenv, in combination with pip, will greatly ease tracking our project's requirements. 

To install virtualenv, use pip:

$ sudo pip install virtualenv 

To then create a virtual environment called env for our project, use this command inside your_app's directory:

your_app$ virtualenv env

This creates a virtualenv directory env (already ignored by .gitignore). To activate the virtualenv, you would:

your_app$ source env/bin/activate

Once you're done, you can deactivate the virtual environment via:
your_app$ deactivate

And at last, a trick, coz ain't nobody got time for typing source env/bin/activate, add this to your .bash_profile or .bash_rc:

venv() { source ${1:-env}/bin/activate;  }

So that, now, you can just type venv to activate your virtualenv called "env", and else call venv <env_name> if you decide to give your virtual environment a different name.



I'll be taking the well-established approach of test-driven-development (TDD) here and first setup tests. In this section, I'll show you how to setup the tests directory, write a simple test module and test it with unittest2 (I'll explain why not the built-in unittest). We'll also use coverage to see how much of our code our tests really cover (execute). Regarding the actual source code: you have yours of course, but for the purpose of this tutorial I'll be writing an application that squares numbers. Yes I'd like 1 Nobel prize please and some fries, thanks.

Unit Tests

First of all, let's turn the tests/ directory into a package by touching an empty __init__.py file (note that we're in the virtual environment now):

(env)your_app$ touch tests/__init__.py

Then, we'll have one test module test_code.py that will test your_app's code, which we'll later on write. For this, we need unittest2. Unittest2 is a testing framework built on Python's own unittest package, but that is backported to Python 2.6. We'll use it because why not have that extra bit of support with no semantic changes (Note: there are many other testing frameworks out there, such as py.test or doctest, which are equally capable but have quite different semantics and flavors, which you can also check out). Get unittest2 with pip:

(env)your_app$ pip install unittest2

Then you can write your unit tests, for example tests/test_code.py (testing methods must start with 'test'):

import os
import sys
import unittest2

sys.path.insert(0, os.path.abspath(".."))

import your_app.code

class TestCode(unittest2.TestCase):

 def test_squares_positive_numbers_correctly(self):

  self.assertEqual(your_app.code.square(2), 4)

 def test_squares_negative_numbers_correctly(self):

  self.assertEqual(your_app.code.square(-3), 9)

Note that we have to make the parent directory along with our source package your_app visible to our test modules' path, which you can see in line 5. Now, to run our tests, we can use the unit2 command that came with unittest2 (which performs test discovery and will find our tests from the root your_app directory). The -v switch causes more verbose output and shows which tests run successfully and which fail. Note that you can also call unittest2 as an executable Python module, like you would unittest (if you've used it before), i.e.:

(env)your_app$ python -m unittest2

In any case, here's what unit2 does:

(env)your_app$ unit2 -v
ERROR: tests.test_code (unittest2.loader._FailedTest)
Traceback (most recent call last):
ImportError: Failed to import test module: tests.test_code
Traceback (most recent call last):
  File "/Users/petergoldsborough/Documents/Projects/your_app/env/lib/python2.7/site-packages/unittest2/loader.py", line 456, in _find_test_path
    module = self._get_module_from_name(name)
  File "/Users/petergoldsborough/Documents/Projects/your_app/env/lib/python2.7/site-packages/unittest2/loader.py", line 395, in _get_module_from_name
  File "/Users/petergoldsborough/Documents/Projects/your_app/tests/test_code.py", line 7, in &ltmodule>
    import your_app.code
ImportError: No module named code

Ran 1 test in 0.001s

FAILED (errors=1)

If all goes well, everything should fail. Of course: we haven't written any code yet (there isn't any module named your_app.code yet). Let's do that. After also creating an __init__.py file in your_app/your_app, the package that will contain all of our application's source code, we can write our main module your_app/code.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

def square(x):
    return x * x

Now running the testing command from above should give us happy messages and make us smile:

(env)your_app$ unit2 -v
test_squares_negative_numbers_correctly (tests.test_code.TestCode) ... ok
test_squares_positive_numbers_correctly (tests.test_code.TestCode) ... ok

Ran 2 tests in 0.000s



It's a good idea to know about our code's test coverage, i.e. how much of our project's code we're testing (and thereby avoiding bugs). For this, we use the coverage utility, which describes itself as "[...] a tool for measuring code coverage of Python programs. It monitors your program, noting which parts of the code have been executed, then analyzes the source to identify code that could have been executed but was not." Grab coverage with pip:

(env)your_app$ pip install coverage

You use coverage by simply running your Python modules like you would with the python command, but substituting python with coverage run. For our tests, that means changing

(env)your_app$ python -m unittest2


(env)your_app$ coverage run -m unittest2

This will run the same code as the previous command, but now we can call coverage report (add the -m switch for uncovered line numbers). We'll want to specify which modules we want to have included in our report, because, at least for me, coverage also reports on coverage of modules in Python's site-packages directory. For this, we add the --include="./*" option. Now, this should happen:

(env)your_app$ coverage report -m --include="./*"
Name                Stmts   Miss  Cover   Missing
tests/__init__          0      0   100%   
tests/test_code        10      0   100%   
your_app/__init__       5      0   100%   
your_app/code           3      0   100%   
TOTAL                  18      0   100%   

Looks like we did a good job writing tests! What an achievement for such a large codebase!


That's it for this section, here's what your current directory should look like if you're following this tutorial in a linear way:

├── docs
├── examples
├── temp
│   └── code.py
├── tests
│   ├── __init__.py
│   └── test_code.py
└── your_app
    ├── __init__.py
    └── code.py


Now that we have finished writing our enchanting code and also have our tests running perfectly, we can work on documentation. This section will cover setting up the docs/ directory, writing our code's docstrings ready to be parsed by the Sphinx documentation generation tool, using a special super-legible syntax enabled by the Napoleon extension to Sphinx, and lastly writing additional documentation in reStructuredText that will later go on ReadTheDocs. You can find the documentation I generated for my project here, to have a look what you'll get: http://ecstasy.readthedocs.org/


What is Sphinx? According to sphinx-doc.org: "Sphinx is a tool that makes it easy to create intelligent and beautiful documentation". Basically, Sphinx is the go-to solution for Python documentation. Anything to be parsed by Sphinx will have to be written in reStructuredText, a markup language like Markdown, but with, of course, an entirely different syntax.

Let's setup Sphinx. Download it with pip:

(env)your_app$ pip install sphinx

Then change into the docs/ directory. Fortunately, Sphinx comes with a command-line utility called 'sphinx-quickstart' which will let you setup your basic Sphinx configuration by asking you a small set of questions about your favorite food and the number of cats you have. From what I remember. In any case, invoking this command (from within your_app/docs/):

docs$ sphinx-quickstart

will trigger a setup wizard. You can configure your project's documentation to your liking or just stick with the defaults (press enter), but do answer yes ('y') when it asks about splitting your build directory into a build and source folder (question 2) and also yes when it asks about autodoc. The latter will ensure that Sphinx reads our source code's documentation (from its docstring). You'll definitely want a Makefile too, so please don't disable it (it's 'yes' by default).  Here's my setup:

docs$ sphinx-quickstart
Welcome to the Sphinx 1.3.1 quickstart utility.

Please enter values for the following settings (just press Enter to
accept a default value, if one is given in brackets).

Enter the root path for documentation.
> Root path for the documentation [.]: 

You have two options for placing the build directory for Sphinx output.
Either, you use a directory "_build" within the root path, or you separate
"source" and "build" directories within the root path.
> Separate source and build directories (y/n) [n]: y

Inside the root directory, two more directories will be created; "_templates"
for custom HTML templates and "_static" for custom stylesheets and other static
files. You can enter another prefix (such as ".") to replace the underscore.
> Name prefix for templates and static dir [_]: 

The project name will occur in several places in the built documentation.
> Project name: your_app
> Author name(s): you

Sphinx has the notion of a "version" and a "release" for the
software. Each version can have multiple releases. For example, for
Python the version is something like 2.5 or 3.0, while the release is
something like 2.5.1 or 3.0a1.  If you don't need this dual structure,
just set both to the same value.
> Project version: 0.1.0
> Project release [0.1.0]: 

If the documents are to be written in a language other than English,
you can select a language here by its language code. Sphinx will then
translate text that it generates into that language.

For a list of supported codes, see
> Project language [en]: 

The file name suffix for source files. Commonly, this is either ".txt"
or ".rst".  Only files with this suffix are considered documents.
> Source file suffix [.rst]: 

One document is special in that it is considered the top node of the
"contents tree", that is, it is the root of the hierarchical structure
of the documents. Normally, this is "index", but if your "index"
document is a custom template, you can also set this to another filename.
> Name of your master document (without suffix) [index]: 

Sphinx can also add configuration for epub output:
> Do you want to use the epub builder (y/n) [n]: 

Please indicate if you want to use one of the following Sphinx extensions:
> autodoc: automatically insert docstrings from modules (y/n) [n]: y
> doctest: automatically test code snippets in doctest blocks (y/n) [n]: n
> intersphinx: link between Sphinx documentation of different projects (y/n) [n]: n
> todo: write "todo" entries that can be shown or hidden on build (y/n) [n]: y
> coverage: checks for documentation coverage (y/n) [n]: y
> pngmath: include math, rendered as PNG images (y/n) [n]: n
> mathjax: include math, rendered in the browser by MathJax (y/n) [n]: y
> ifconfig: conditional inclusion of content based on config values (y/n) [n]: y
> viewcode: include links to the source code of documented Python objects (y/n) [n]: y

A Makefile and a Windows command file can be generated for you so that you
only have to run e.g. `make html' instead of invoking sphinx-build
> Create Makefile? (y/n) [y]: y
> Create Windows command file? (y/n) [y]: n

Now, you'll want to fully enable Sphinx's ability to grab documentation out of your source code by editing the conf.py it created. Open it in your_app/docs/source/conf.py. At the very top, after the import statements, you'll see a commented line saying "# If extensions (or modules to document with autodoc) are in another directory ...". Comment out the single line of Python code below (with sys.path) and add the path to your_app's root directory, which is two relative directories up conf.py (see below).

While we're already in conf.py, let's do two more things. First of all, we'll want a minimum Sphinx version of 1.3 (second configuration option, needs_sphinx). Then, add 'sphinx.ext.napoleon' to the list of extension you see right below. I'll explain what that is in just a bit, but first confirm your conf.py looks like this now (the extensions list may vary according to the settings you selected, but napoleon should be there):

# ...
sys.path.insert(0, os.path.abspath('../../'))

# -- General configuration ------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
needs_sphinx = '1.0'

# ...
extensions = [

Now, we can add documentation to our source code. Usually, the standard reStructuredText + Sphinx way of adding method, class, module, parameter and return value descriptions, exception specifications and other elements of documentation would look something like this (for your_app/your_app/code.py):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
Your application's badass source code.

def square(x):
 Squares a value.

 Equivalent to calling x**2.

 :param x: The value to square.
 :type x: int|float 
 :returns: The squared value.
 :rtype: int|float

 return x * x

Look at that mess in square()'s docstring. The default reStructuredText syntax for docstrings is ugly and makes me cry. That's where napoleon comes in (the extension we added above). It lets us transform the above into the below:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

Your application's badass source code.

def square(x):
    Squares a value.

    Equivalent to calling x**2.

    x (int|float): The value to square.

    The squared value.

    return x * x

Yes! Beauty! Of course, the output in Sphinx and ReadTheDocs will be the same, but if there's one thing I love then that's cats and source code documentation that looks good in the source code and in the output. You can read all about Napoleon and how to document your source code with it here. Note that this is not specifically "Napoleon Style", but actually the docstring layout recommended by the Google Style Guide for Python.

Next, we can have Sphinx generate documentation for us with the sphinx-apidoc command:

(env)docs$ sphinx-apidoc -o source ../your_app/

To the -o flag we must pass the output directory as its positional argument, in this case the 'source' directory Sphinx generated earlier. The last argument is then the package containing your source code, where Sphinx will pick up the documentation from your modules' docstrings. This will generate  one .rst file per module in source. Then, you can use the Makefile to generate the actual HTML documentation. After building, you'll find an index.html file under your_app/docs/build/html.

(env)docs$ make html
sphinx-build -b html -d build/doctrees   source build/html
Running Sphinx v1.3.1
loading pickled environment... done
building [mo]: targets for 0 po files that are out of date
building [html]: targets for 0 source files that are out of date
updating environment: 0 added, 0 changed, 0 removed
looking for now-outdated files... none found
no targets are out of date.
build succeeded.

Build finished. The HTML pages are in build/html.
(env)docs$ open build/html/index.html

Yey! Now it's really up to you regarding documentation -- the sky's the limit. I'll add some dummy text from www.lebowskiipsum.com and a picture of my cat to index.rst, just so you have an idea what could follow. In this case, I'll also delete your_app.rst and move its contents into modules.rst, because it's not worth having the extra file. Here are the now two remaining relevant files for your_app's documentation:





* `Introduction`_

* `Unbaking a cake with your_app`_

* `kbyethx`_

* `Source Code`_



Lebowski ipsum stay out of Malibu, Lebowski! Dolor sit amet, consectetur adipiscing elit praesent ac magna justo pellentesque ac lectus. Vee could do things you only dreamed of, Lebowski. Quis elit blandit fringilla a ut. Darkness warshed over the Dude— darker'n a black steer's tookus on a moonless prairie night. There was no bottom. Turpis praesent felis ligula, malesuada suscipit malesuada non, ultrices non. We want that money, Lebowski. Urna sed orci ipsum.

Unbaking a cake with your_app

#. Take cake out of oven
#. Run your_app on it (may require root access rights)
#. Extract eggs, flour, sugar etc. 
#. Cure cancer
#. Profit


.. image:: https://goo.gl/IpUmJn
 :alt: <3
 :align: center

Source Code

.. toctree::
 :maxdepth: 2





.. automodule:: your_app.code

One last thing to mention is that Sphinx supports quite a few themes. The default one, which you should be seeing if you've followed the previous steps, is called "alabaster". Others include the one from ReadTheDocs and also some using Bootstrap. They're quite easy to setup and play around with. Note also that alabaster is very, very customizable (see more on its Github page).

That's it for documentation (we'll send it to ReadTheDocs once we have a GitHub repo). In the next section, we'll setup a few more files before we go live, first on PyPI and then GitHub and everywhere else.


Additional Files

A proper repository needs a bunch of more files to be ready for distribution and publicity and fame.
These are:
  • README.rst
  • HISTORY.rst


The README should be clear. It is the main entry point to your project and should make clear what problem you are solving and how you are bettering the world; a basic usage example if your project happens to be a library; how to install your application; where to find documentation; the license you're using and whatever else you'd like and that makes sense. Have a look at these resources prior to writing a README:

Note that your README must be in reStructuredText and not in Markdown, because PyPI will only parse reStructuredText. 


A project without a license is implicitly closed-source. We don't want that and thus need a license. There are many options, though I always choose the MIT-license. It's permissive and lovely. One thing that's especially wonderful is that you can get your own, personal, free and ever-online MIT-license hosted under mit-license.org. See this repo for more information. Here's my license. Note that the license file doesn't need a file extension (common practice not to add one).


This should be your changelog file (often called CHANGELOG.rst or CHANGE.rst), where you would add a list of your most important commits for every new version release.

Getting Ready for Distribution

We are very close to being able to upload our project to PyPI and then Github. However, now, we'll need a few more more files. These are very important:
  • setup.py
  • setup.cfg
  • requirements.txt
  • Makefile
However, first, we'll have to edit another file: your_app/your_app/__init__.py. As I've come to see, it is a common idiom to store relevant project information such as the project version, the author's name and e-mail address, the project's license and more inside this file. For your_app, it should look like this:

# -*- coding: utf-8 -*-

__title__ = 'your_app'
__version__ = '0.1.0'
__author__ = '<you>'
__license__ = 'MIT'
__copyright__ = 'Copyright 2015 <you>'


setup.py is the Python module used to build and install your project. Here's the file for your_app:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

setup.py script for setuptools.

import re

from setuptools import setup, find_packages

version = ''

with open('your_app/__init__.py') as init:
    version = re.search(r'^__version__\s*=\s*[\'"]([^\'"]*)[\'"]',
       init.read(), re.MULTILINE).group(1)

with open('README.rst') as readme:
    long_description = readme.read()

requirements = [
    # Your project's requirements

test_requirements = [



    description='Your Python application.',



    'Development Status :: 4 - Beta',

    'Intended Audience :: Developers',
    'Topic :: Software Development',

    'License :: OSI Approved :: MIT License',

    'Programming Language :: Python',
    'Programming Language :: Python :: 2',
    'Programming Language :: Python :: 2.6',
    'Programming Language :: Python :: 2.7',
    'Programming Language :: Python :: 3',
    'Programming Language :: Python :: 3.2',
    'Programming Language :: Python :: 3.3',
    'Programming Language :: Python :: 3.4',



    packages=find_packages(exclude=['contrib', 'docs', 'tests*']),




  • Now, first of all, in setup.py, we must import the setup method from the setuptools module, which takes a great number of keyword arguments that enable us to configure our project's metadata. We'll also need setuptools find_packages to make it easy for us to specify which packages setup.py should handle.
  •  Then we grab our project's version string by doing a regular expression search in our __init__.py under the your_app/your_app directory (for this we need the re module). Note that we cannot directly import our module your_app and just retrieve its __version__ directly, as someone who just got your project and has not yet executed the setup procedure may not have all your project's dependencies. Therefore, certain imports may fail.
  • For PyPI, we'll want a long_description, which is simply our README.
  • Requirements/dependencies to run your project's application code successfully should be put into the requirements list. your_app doesn't have any, but I'm leaving it here for you to see. When someone then setup.py installs or pip installs your project, this will also install the dependencies.
  • Test requirements go into the appropriately named list. your_app, for example, requires unittest2, version 1.1.0. Note that each element in these lists of requirements are commands for pip install, i.e. when setting up your project, pip install <element> will be called for each element in these lists.
  • Inside setup(), you'll see a lot of meta-data arguments, which you can fill out for your persona and application
  • Regarding classifiers, you can see the list of available classifiers you can choose from here.
  • By setting include_package_data to True, we can later on specify which additional data files we want to include in our distribution. setup.py will then look for a manifest file (MANIFEST.in) which we'll write in a bit. By default, Python files are included, your README is included and a few other files.
  • For the packages we want to distribute, we let setuptools' find_packages function handle that, as can be seen above.
  • We specify install and test requirements with install_requires and test_requires.
  • By handing over the path to our test directory ("tests") to the test_suite argument, we will later on be able to run our tests with unittest2 directly from the setup file (see below).

Once we're done, you'll be able to build your distribution files with:

(env)your_app$ python setup.py build sdist bdist_wheel

which builds a source and pre-built distribution (using the Python wheel format), and/or install your project with:

(env)your_app$ python setup.py install

and/or test your project with:

(env)your_app$ python setup.py test

and/or do many other things, such as registering your application with PyPI (more on that soon), which you can see by issuing:

(env)your_app$ python setup.py --help



As quickly touched upon above, there are two main variants of distribution files for your project, the first is a source distribution (sdist) -- the standard distribution --- and the second is a wheel distribution (bdist_wheel). According to the documentation for wheels: "A wheel is a ZIP-format archive with a specially formatted filename and the .whl extension. It is designed to contain all the files for a PEP 376 compatible install in a way that is very close to the on-disk format. ". The main advantage of a wheel is that it is quicker to install than a source distribution. Note that you must have the wheel package installed to use wheels, but you should already have it if you're in a virtual environment. In any case, we can provide additional instructions on how to build wheels with a setup.cfg file. Most importantly, we'll want only one single wheel distribution for all Python versions. For this, setup.cfg should hold these two lines:



In our setup.py script, we set include_package_data to True. As a consequence, setup.py will look for a MANIFEST.in file, where we can specify additional data files such as README.rst, LICENSE and others, that we want included in our distribution. It can look like this:

include *.rst
include LICENSE
include Makefile

recursive-include tests *
recursive-exclude * __pycache__
recursive-exclude * *.py[co]

recursive-include docs *.rst conf.py Makefile



This file contains a list of pip packages needed to run your code successfully, explicitly put here alongside having them specified in setup.py. Each line in requirements.txt is a command to pip install.  For example, we'll need unittest2 version 1.1.0 to run our tests, so requirements.txt would contain at least this one line: "unittest2==1.1.0". If you then run

$ pip install -r requirements.txt

pip will execute each line in requirements.txt as if you had called pip install <line>. The best way of getting all requirements for your project is to use:

(env)your_app$ pip freeze

which prints a list of packages installed with pip, either directly by us or as dependencies of those packages. Note that the value of the virtual environment really comes to show here, as pip freeze will only show what we installed for this specific project, whereas it would show an amalgamation of packages for different projects with different requirements if we hadn't used a virtual environment (and also not for other projects). In any case, for your_app, pip freeze shows this:

(env)your_app$ pip freeze

Note that most of these packages were installed as dependencies of the few packages we installed (unittest2, virtualenv and Sphinx -- wheel is installed in a virtualenv by default). We can then generate a requirements.txt file by directing pip freeze's output into it:

(env)your_app$ pip freeze > requirements.txt



We'll want a Makefile to make life easy for us. Mainly taken from here, but modified for our setup:

.PHONY: clean-pyc clean-build docs clean

 @echo "clean - remove all build, test, coverage and Python artifacts"
 @echo "clean-build - remove build artifacts"
 @echo "clean-pyc - remove Python file artifacts"
 @echo "clean-test - remove test and coverage artifacts"
 @echo "lint - check style with flake8"
 @echo "test - run tests quickly with the default Python"
 @echo "coverage - check code coverage quickly with the default Python"
 @echo "docs - generate Sphinx HTML documentation, including API docs"
 @echo "release - package and upload a release"
 @echo "dist - package"
 @echo "install - install the package to the active Python's site-packages"

clean: clean-build clean-pyc clean-test

 rm -rf build/
 rm -rf dist/
 rm -rf .eggs/
 find . -name '*.egg-info' -exec rm -rf {} +
 find . -name '*.egg' -exec rm -f {} +

 find . -name '*.pyc' -exec rm -f {} +
 find . -name '*.pyo' -exec rm -f {} +
 find . -name '*~' -exec rm -f {} +
 find . -name '__pycache__' -exec rm -rf {} +

 rm -rf .tox/
 rm -f .coverage
 rm -rf htmlcov/

 pylint ecstasy

 python setup.py test

 coverage run --include="./*" -m unittest2
 coverage report -m
 coverage html
 open htmlcov/index.html

 $(MAKE) -C docs clean
 $(MAKE) -C docs html
 open docs/build/html/index.html

release: clean
 twine upload sdist
 twine upload bdist_wheel

dist: clean
 python setup.py sdist
 python setup.py bdist_wheel

install: clean
 python setup.py install


I'll just talk about one more thing before we send your_app online: tox. The tox automation project, according to itself, "aims to automate and standardize testing in Python". What's great about it is that it lets us test whether or not our project can be built and distributed correctly across multiple Python versions, without having to setup virtual environments ourselves (done by tox). I've waited till now for this, because we needed a setup.py script. Install tox with pip:

(env)your_app$ pip install tox

You'll then need a tox.ini configuration file, in which we can specify which python versions we'll want to test our distribution for, how to install requirements and what command to use to run tests. Note that what Python versions you can use tox with depends on which Python versions you have installed on your system, though you always grab more Python versions with pyenv or brew. tox.ini could look like this:

envlist = py26, py27, py31, py34

deps = -rrequirements.txt
commands = python setup.py test

Now you run tox to see if all is well for your application in multiple environments:

(env)your_app$ tox
_________________________ summary __________________________

  py26: commands succeeded
  py27: commands succeeded
  py31: commands succeeded
  py34: commands succeeded
  congratulations :)


We are now ready to send our application to the cheese shop (PyPI). First, we'll send it to the test (staging) server and then to the real one. Either way, we'll want to build our project. Run:

(env)your_app$ python setup.py build sdist bdist_wheel

This creates the appropriate distributions in the your_app/dist/ directory. You may want to execute

your_app$ python setup.py check

before doing anything next, to make sure your project is ready (but don't care about the url for now, we'll add it later).


Sign up for the staging server of PyPi here. Note that the staging server is cleaned periodically, so your user profile and repository may vanish from the staging server (not the real one after) at one point in the future. In any case, next, create a .pypirc file in your root directory. ~/pypirc should contain, for now (replace the username and password):

index-servers =

repository = https://testpypi.python.org/pypi
username = <your_username>
password = <your_password>

Then, you can register your application on the test server (note that this will warn about a missing url, we'll add it later):

(env)your_app$ python setup.py register -r test

Go have a look on testpypi.python.org if all looks well for your application on the testing server. If so, we can then upload the distribution files (it's only registered for now). For this, you can either run (-r stands for repository)

(env)your_app$ python setup.py sdist bdist_wheel upload -r test

or use twine. Twine is like the command you just saw, but more secure. You can even use PGP encryption. Get twine with pip

(env)your_app$ pip install twine

Then issue this command to upload your files (pass the -s flag to sign it with PGP):

twine upload -r test dist/*

Either way, you could now create a new virtual environment and test installing your package from the test server:

(env)your_app$ virtualenv test
Using real prefix '/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7'
New python executable in test/bin/python2.7
Also creating executable in test/bin/python
Installing setuptools, pip, wheel...done.
(env)your_app$ venv test
(test)your_app$ pip install -i https://testpypi.python.org/pypi your_app
Collecting your-app
  Using cached https://testpypi.python.org/packages/2.7/y/your_app/your_app-0.1.0-py2.py3-none-any.whl
Installing collected packages: your-app
Successfully installed your-app-0.1.0
(test)your_app$ deactivate 
your_app$ rm -r test
your_app$ venv env


We can now go for the the real thing. Register on PyPI, if you haven't yet (separate registration from the test server). Then add change ~/.pypirc so that it looks like this:

index-servers =

repository = https://pypi.python.org/pypi
username = <your_username>
password = <your_password>

repository = https://testpypi.python.org/pypi
username = <your_username>
password = <your_password>

And repeat the process for the live server:

your_app$ python setup.py register

Registering your_app to https://pypi.python.org/pypi
Server response (200): OK

and then either

your_app$ python setup.py sdist bdist_wheel upload


your_app$ twine upload dist/*

VoilĂ , c'est on PyPI, as my French dog would always say. If I had a French dog. Who could speak.



Register your project on Github now -- I'm sure you know how to do it. Once you have a URL, add it to your_app/your_app/__init__.py and the setup.py script. The url for this tutorial repository is https://github.com/goldsborough/your_app and I'll use it as a placeholder for your project's url from now on.


__title__ = 'your_app'
__url__ = "https://github.com/goldsborough/your_app"
__version__ = '0.1.0'





Now that we have a little bit more meta-data, we can re-register the project on PyPI. This will only re-configure the meta data:

your_app$ python setup.py register

Adding the PyPI badge

First, we can add a "PyPI version" badge to our project. For this, visit this web page. There, you can enter the name of your application and retrieve the reStructuredText code to include in your README.rst file. For your_app, that would be:

.. image:: https://badge.fury.io/py/your_app.svg
    :target: http://badge.fury.io/py/your_app

On Github, your README will then show a badge like this:

Adding your MIT License badge

If you're awesome and have your own mit-license.org subdomain (I talked about it earlier), you can also get a badge for that, by including the following code in your README.rst:

.. image:: https://img.shields.io/github/license/mashape/apistatus.svg
 :target: http://goldsborough.mit-license.org

In the target command, you would replace goldsborough with your name for mit-license.org. That gives you this badge:

You can change the color by making your own badge on http://shields.io.

Adding the Gratipay badge

Gratipay, formerly known under the, IMHO, much nicer name, Gittip, is a way for people to tip you for your great work. Sign up here: https://gratipay.com/

And either go to https://gratipay.com/~<your_user_name>/widgets/ or just change this code for you:

.. image:: http://img.shields.io/gratipay/goldsborough.svg
:target: https://gratipay.com/~goldsborough/

This will give you this badge (feel free to click and donate <3):

Setting up Landscape.io

Landscape.io is a web tool for automated linting, code correctness and style rating. Everytime you push a commit to your repo, landscape.io will run pylint on it. Register for free and add your repository, no additional configuration required. On your repository's overview page on landscape.io, you'll see the badge in the top right corner. Clicking on it redirects you to badge heaven. The badge looks like this (with your code health):

Setting up Code Climate

Code Climate is another web application that performs code rating, or more specifically "static analysis". It differs from Landscape.io in that it really analyzes your code for quality in terms of how well it is structured and how maintainable is, rather than only checking for bugs and style as Landscape.io does. For example, for my latest project, ecstasy, it's giving me a pretty low GPA (2.8) because I have "high cyclomatic complexity" in certain areas. At the same, your_app gets a 4.0 (hehe). Sign up on their website and add your repository. You can get this kind of badge:

by visiting Settings > Badges in your Code Climate dashboard and adding the reStructuredText to your README.rst.

Setting up Coveralls

At the start of this article, I talked about code coverage. We examined our code's coverage with the command-line utility of the same name to understand how much of our code is really being executed when running our tests. Coveralls.io is the web counterpart for this -- sign up there and add your repository. Once you're done, you can see the badge being offered at the top. It looks like this:

However, there's a little extra configuration needed here. First of all, download python-coveralls with pip (and do a pip freeze > requirements.txt again, and also add it to your test_requirements in setup.py):

(env)your_app$ pip install python-coveralls

This utility will let us interact with coveralls from the command-line. Then, add this .coveralls.yml file to your repository (we'll setup Travis-CI in a bit):


service_name: travis-ci

Once you've added your repository on coveralls.io, you'll see setup-instructions for coveralls and Ruby. We don't need the instructions, but you'll see your repo_token in the instructions. Copy it to your clipboard and issue this command, replacing <your_repo_token> with your actual repo token:

(env)your_app$ COVERALLS_REPO_TOKEN=<your_repo_token> coveralls

That'll and you should see results once we've setup Travis CI and pushed a commit afterwards.


Setting up Travis-CI

Travis CI (Continuous Integration) is an awesomely awesome distributed build system for continuous integration in the open-source community, used for running your tests online. It is very, very helpful when working in teams or on open-source projects with multiple contributors working on improvements and features and fixes simultaneously. Sign up on travis-ci.org, the free version for open-source projects, and not on travis-ci.com, their paid counterpart. After adding your project, you'll see the badge in your overview page. Clicking on it and switching to 'rst' yields the reStructuredText code you can add to your README.rst to get this badge:

More configuration. Add this .travis.yml file to your root directory.

language: python
  - "2.6"
  - "2.7"
  - "3.2"
  - "3.3"
  - "3.4"
  - "pip install ."
  - "pip install -r requirements.txt"
script: coverage run --include="./*" -m unittest2
    - petergoldsborough@hotmail.com
  - coveralls

Here, we specify

  • language: The language of our project ("python")
  • python: Against which Python versions to test (e.g. "3.4")
  • install: What shell commands to run to install the project
  • script: What shell commands to run to execute our tests. Notice that we use coverage here like I showed you above. This is important for the after_success step and integration with coveralls.io. If you don't want coverage, you can just execute the "unit2" (make sure unittest2 is in your dependencies).
  • notifications: How and where to notify you of failures and changes (you can change this on travis-ci.org)
  • after_success: What to do when all our tests have passed (there's also after_failure). We want to call coveralls here, which will send the report generated by the coverage report to coveralls.io (python-coveralls interacts with the coveralls.io API).

There's more you can do with this configuration file. The full life cycle of a Travis "job" is described by these YAML statements:
  1. before_install
  2. install
  3. after_install
  4. before_script
  5. script
  6. after_script
  7. after_success or after_failure

See here: http://docs.travis-ci.com/user/customizing-the-build/#The-Build-Lifecycle

Maybe you'd also like a command-line client: https://github.com/travis-ci/travis.rb

In any case, you should now have Travis-CI and coveralls.io ready, so commit and push and check everything is wonderful on travis-ci.org and on coveralls.io.


Setting up Gitter

The last thing we'll want to do for our kick-ass project is to add Gitter. Gitter is a chatroom for Github repositories where you can talk with fellow contributors about your project. Sign up on gitter.im and click on "create a room" in the bottom left corner, then on "repository" and lastly on your Github repo. This will create a room on Gitter. Say hi on your_app's: https://gitter.im/goldsborough/your_app

Now you'll want to setup Gitter and integrate other services. If you're in your room on Gitter, it'll say "Configure your integrations" on the right sidebar. Click that and then integrate GitHub, Coveralls and Travis CI. There are nice instructions for all of that. 

Once you're done you have to patience yourself a little, but Gitter will send you a Pull Request on GitHub to suggest adding its marvelous badge, which looks like so (click on it to shoot me a message in your_app's chatroom):

Setting up ReadTheDocs

The very last thing we'll do, and I almost forgot about it, is setup ReadTheDocs. Sign up on https://readthedocs.org and click on "Add Project" under your user profile. You can then select to "Import from Github" and, after possibly syncing your repos, you can select your Github repo. You'll then have to "Build" it and ReadTheDocs will automagically search your docs/ folder for everything it needs. After, you can click on "View Docs" and voilĂ , your Sphinx docs. Here are those from your_app: http://your-app.readthedocs.org/

For every <thing>.readthedocs.org there's also an alternative domain under <thing>.rtfd.org (read the fabulous docs -- right?), like your_app.rtfd.org


I hope I could share with you what I learnt while setting up my own project. Creating and distributing such a project can be quite tough and there's so, so much to do when setting up a project other than coding (unfortunately). There's most likely many things I did wrong here, too, so please feel free to offer suggestions. 

One last resource I'd like to share with you is cookiecutter, which automates much of the process I just outlined.

But otherwise, thanks for reading and farewell!

Sunday, July 19, 2015

Literals in C++

C++11 introduced the concept of user-defined literals. What does this mean? It means that this snippet of code is valid:

using namespace std::chrono_literals;
std::cout << "One day has "
          << (23h + 59min + 60s).count()
          << " seconds." << std::endl;
// Output: One day has 86400 seconds.

But first, let me outline what your grandmother's rusty C++ compiler can already do in terms of manipulating primitive types by prefixing them in a certain way. Already long before C++11, it was possible to write integers in their octal representation, simply by prefixing them with the digit 0. The compiler would then convert that value from octal to decimal. In the same way, a '0x' prefix stood and stands for a hexadecimal value:

std::cout << 10 << " "
          << 010 << " "
          << 0x10
          << std::endl;

// Output: 10 8 16

Here comes the first change C++11 brought along regarding literals: binary literals. Nowadays, you can not only write integers in their hexadecimal, decimal or octal representation, but also in binary, by prefixing the number with '0b' and writing out the individual bits:

std::cout << 10 << " "
          << 010 << " "
          << 0x10 << " "
          << 0b10
          << std::endl;

// Output: 10 8 16 2

That's cute, but not really impressive. There are, however, three more areas where the standard introduced literals to cast certain expressions to classes depending on what characters they are suffixed by, namely: strings, time durations and complex numbers.

String Literals

For strings, you can cast a character literal (i.e. a char array) to an object of type std::string simply by appending an 's', given that you use a 'using' statement for the 'std::string_literals' (or std::literals::string_literals) namespace:

using namespace std::literals::string_literals;

auto string = "Hello World!"s;

std::cout << string.size() << std::endl;
// Output: 12

While this kind of syntax is certainly interesting, it lacks clarity and I would most likely always prefer to explicitly declare the string type. Imagine someone has no previous knowledge about the new standard literals, does not see the using declaration and comes across this expression. How much confusion would the "magic s" cause? Between 4 and 5 kilograms of confusion I would estimate.

Complex Literals

Next, it is now possible to use literals for complex numbers. Where you previously would have declared a complex number with a real and imaginary part like so:

std::complex<double> z(1, 2);

You can now use the following syntax, given that you make the std::literals, std::complex_literals or the std::literals::complex_literals namespace visible with an appropriate using declaration:

using namespace std::complex_literals;
auto z = 1.0 + 2i;

Duration Literals

The next domain of the standard library where you can find literals is the std::chrono header and its handling of time-spans and durations. I already gave an example of the literals you can specify for durations at the top of this article. At this point, it may be of worth to discuss how the literal syntax is implemented. The answer is surprisingly simple, as to define a certain literal means nothing else as to write a function -- a literal operator -- taking an integer or string literal as its argument (there is a pre-defined set of possible parameter lists you can have) and returning whatever you like. Here, for example, are the definitions of the literal operators that make the "24h" syntax possible:

constexpr std::chrono::hours
operator""_h(unsigned long long hrs)
    return std::chrono::hours(hrs);

constexpr std::chrono::duration<long double, std::ratio<3600>>
operator""_h(long double hrs)
    return std::chrono::duration<long double, std::ratio<3600>>(hrs);

Where you would write operator* to overload the multiplication operator or operator== for comparison, you here write operator"" followed by the suffix you want to define, e.g. h for hours. In this case, there is one overload that takes an unsigned long long and returns a duration represented by an integer (std::chrono::seconds), while the other's parameter and return type both use floating-point types and are called appropriately during overload resolution. This same syntax and pattern is used also for minutes (min), seconds (s), milliseconds (ms), microseconds (us) and lastly nanoseconds(ns).

User-defined Literals

While the standards committee has indeed granted us the right to create and use our own literal operators, there are a few catches to be aware of which I will quickly touch upon before showing you a few examples of user-defined literal operators. The first catch is that user-defined literal suffixes must begin with an underline, as those not starting with an underline are reserved for the standard. The second catch, which I already mentioned briefly further above, is that there is a pre-defined set of parameter lists that you can have for your literal operators. This list includes:

  • ( const char * ) 
  • ( unsigned long long int )
  • ( long double )   
  • ( char )   
  • ( wchar_t )  
  • ( char16_t ) 
  • ( char32_t ) 
  • ( const char * , std::size_t ) 
  • ( const wchar_t * , std::size_t )
  • ( const char16_t * , std::size_t )  
  • ( const char32_t * , std::size_t )

This is why, as you may have noticed, the parameter list for the hour-duration literal operator took a parameter of long double for the one overload and a parameter of type unsigned long long for the other overload. What exactly is the underlying motivation of the standards committee regarding this restriction of our liberty? Well, that's just none of your business. Damn you. But on more serious note, my personal guess is that each of these primitive types (of the non-pointer types here) can hold the largest possible value in their category, meaning all values in the range of their category can be implicitly cast to that type without having to worry about overflow (resulting from the cast -- the integer could still be larger than the range of the largest data type). If there were multiple overloads possible for a value, e.g. one taking an object of type int and another taking an unsigned long long, then given just the literal value 5 the compiler would have a pretty difficult time figuring out which to pick and would likely start to cry. Thus, all values of a certain category will be cast to the type with the largest width and thereby the necessity for overloads is nullified.

Here, now, a quick example of a user-defined type and an appropriate literal operator:

class Foo
    constexpr explicit Foo(std::size_t x)
    : _x(x)
    { }
    void set(std::size_t x)
        _x = x;
    std::size_t get() const
        return _x;
    std::size_t _x;

constexpr Foo operator""_f(unsigned long long x)
    return Foo(x);

As you can see, because the Foo class deals with integers, the parameter for the literal operator is an unsigned long long. Moreover, the literal begins with an underscore to not conflict with standard literals. Now, we can use the following sassy syntax to declare and initialize an object of type Foo:

int main(int argc, char * argv [])
    auto foo = 5_f;
    std::cout << foo.get() << std::endl;
    // Output: 5

That is all neat and nice, but what about a real-world, applicable example of user-defined literals? This would clearly be any form of unit-suffix. For a C++ CSS parser, you could have _em, _px, _pt or _pct suffixes to denote length units. For physics simulations, you could very easily declare things to be of a certain unit such as Newtons (_N) for force or Joules (_J) for energy and then allow implicit conversions between certain units when physically allowed. For the purpose of this article I wrote a neatly-commented little example using literals for units of weight, which you can find as a gist here.

Separator literals

As a very last topic for this article about literals in C++11 and beyond, I just very briefly want to touch upon a little feature that may make your code a lot more readable. Previously, you had no way of separating digits for very long integer literals:

unsigned long x = 204582349058239;

Now, with the arrival of the new literal syntax, it's possible to separate digits by single quotes:

unsigned long x = 204'582'349'058'239;

Much better!

And that'll be it for this post. Hope I could help!

Thursday, May 28, 2015

Hertzsprung-Russell Diagram in LaTeX

Just in case anyone is currently in urgent need of a Hertzsprung-Russell diagram displaying the luminosity of stars in solar units versus their spectral class and temperature:

Here's an implementation in LaTeX / Tikz:

  % Luminosity axis
  \draw (0, 0) -- (0, 11.5);

  % Luminosity axis labels
  \foreach \y in {0, 1, ..., 11}
   \advance \l by -5\relax

   % Shift the ticks down a little
   \draw (0, \y-0.03333) node {---};

   % Display 10^0 as 1
   \ifnum\l = 0
    \draw (-0.6, \y) node {1};
    \draw (-0.6, \y) node {$10^{\the\l}$};

  % Luminosity arrow
  \draw [->] (-1.3, 3) -- +(0, 6);

  % Luminosity label
  \draw (-1.8, 6) node [rotate=90] {Luminosity [$L_{\odot}$]};

  % Temperature axis
  \draw (0, 0) -- (12.5, 0);

  % Spectral class and temperature labels
  \foreach \x/\s/\t in {0/O/47000, 2/B/0, 4/A/10000,
                        6/F/0, 8/G/6000, 10/K/0, 12/M/3000}
   % Tick
   \draw (\x, 0) node {$|$};

   % Draw the spectral class
   \draw (\x, -0.4) node {\s};

   % Only show the temperature if valid (not 0)
   \ifnum\t > 0
    \draw (\x, -0.9) node {\t};

  % Spectral class and temperature axis arrow
  \draw [->] (9, -1.5) -- (3, -1.5);

  % Spectral class and temperature label
  \draw (6, -2) node {Spectral class and temperature [K]};

  % Instability strip
  \draw [fill=gray, gray]
        (2.5, 1) -- (3, 1) -- (8, 9) -- (6, 9) -- (2.5, 1);

  % Instability strip label
  \draw [<-, thick]
        (4, 3.3) -- +(1, 0) node [right] {Instability Strip};

  % Main Sequence strip
  \draw [line width=0.7cm, yellow]
        (0.5, 11) .. controls (2, 4) and (10, 6) .. (12, 0.5);

  % Main sequence label
  \draw [<-, thick]
        (3.5, 6.5) -- +(0, 2) node [above] {Main Sequence};

  % Sun
  \draw [fill=red, red] (6.8, 5)
        circle [radius=3pt] node [right, black] {\, Sun};

  % Supergiants
  \draw [fill=red, red]
        (7, 10) circle [x radius=2.5cm, y radius=0.75cm];

  % Supergiants label
  \draw [<-, thick]
        (8, 10) -- +(2, 0) node [right] {\, Supergiants};

  % Giants
  \draw [fill=orange, orange] (8.5, 7) 
        circle [rotate=10, x radius=2.5cm, y radius=0.75cm];

  % Giants label
  \draw [<-, thick]
        (9.5, 7) -- +(0, -1.5) node [below] {\, Giants};

  % White dwarfs
  \draw [fill=cyan, cyan] (2.5, 2.5) 
        circle [rotate=-45, x radius=2.5cm, y radius=0.75cm];

  % White dwarfs label
  \draw [<-, thick]
        (2, 3) -- +(0, 2) node [above] {\, White Dwarfs};


Right. Back to studying physics then. Super open to any suggestions or tips on LaTeX / Tikz style though, so don't hesitate to leave a comment :)

Tuesday, May 5, 2015

Modeling the Polymerase Chain Reaction in C++

I haven't yet had much chance to wander into the realm of bioinformatics, though I can't deny that it's a pretty fascinating subject. We program computers, robots, drones and nowadays pretty much anything else you can hook up to the internet of things -- washing machines, cars and even homes. It's pretty exciting that we're also adding our own bodies to that list. So you might be fluent in C++, Python or JavaScript, but mate, do you even DNA?

Anyway, what's this post about? Well, while studying for my biology exam I came across the Polymerase Chain Reaction (PCR) and couldn't help my programmer senses from tingling. I'll first explain in a few words what the PCR method is and then how it can be implemented in code.

The Polymerase Chain Reaction, developed by Kary Mullis in the 60's, is a method of replicating a specific DNA sequence many, many (many, many, many) times. The point is, when you want to analyze a certain region of an individual's DNA, say to examine the gene responsible for eye color, you have to isolate that gene from the thousands of other genes found on your typical chromosome. So how do you go about doing that? Polymerase Chain Reaction. Let me throw in a few other necessary definitions and chunks of information about DNA:

A Nucleotide is the basic building block of the double helix that makes up our DNA. It consists of one of four organic bases (as well as a sugar and a phosphate molecule). Those four organic bases are Adenine, Cytosine, Guanine and Thymine (A, C, G, T for short). Because the DNA structure consists of two strands -- left and right -- there are always two complementary nucleotides next two each other. A always goes together with T and C always holds hands with G (aw). Therefore, when we speak of a strand of DNA, e.g. A-C-C-T, there is always a complementary strand, in this case T-G-G-A (always the complementary nucleotides). Moreover, it's important to note that DNA strands have a direction. One end of a DNA strand is called the 5' (five prime) end and the other the 3' (three prime) end (this naming stems from the chemical structure of the molecules). When new nucleotides are added to a DNA strand, this must always happen in the 5' to 3' direction. So if we wanted to add a new Guanine nucleotide to the DNA sequence from above, it would have to be like this: (5') A-C-C-T (3') + G = (5') A-C-C-T-G (3'). Note also that the left and right strand of a DNA structure run antiparallel (opposite directions), meaning the left strand has it's 5' prime end at the opposite end of the right strand's 5' end. Lastly, I need to mention primers and Polymerase. Polymerase is an enzyme that complements a strand, while primers are enzymes that tell the Polymerase at what position of a strand to start complementing.

Let's pull it all together. Have a look at this DNA strand:

(5') T-G-A-A-C-T-G-A-C-A-T-T (3')

Say we're really interested in the region between the 3rd (A) and the 7th (G) nucleotide (including those), because it tells us some information about a hereditary disease. The problem is that these nucleotides are about the size of Kim Kardashian's brain. So just a few molecules wide, around 2 nanometers actually. We can't do any meaningful research with just these four nucleotides. We need more of them. Millions and millions and millions more. This is what the Polymerase Chain Reaction does, by creating complementary strands with the Polymerase enzyme I just introduced. So let's have a further look at this problem. To copy the region between the 3rd and 7th locus (= position in a DNA segment), we first set a primer (P) to the 8th locus:

(5') T-G-A-A-C-T-G-A-C-A-T-T (3')

Then we throw some Polymerase into the equation. It'll look for the primer and synthesize a complementary strand from the primer onwards (that's why it's at locus 8, because 8 is then no longer included in the complementary strand). Note that this synthesis will go into the opposite direction of the old strand, because the two DNA strands must run antiparallel. Moreover, you'll notice that the synthesis goes all the way to the end of the old strand and doesn't just stop at locus 3, even though we'd want it to. Unfortunately, there is no "stop-primer":

(5') T-G-A-A-C-T-G-A-C-A-T-T (3')
(3') A-C-T-T-G-A-C (5')

Okey dokey, almost there. That was iteration one and we almost have our desired region between locus 3 and locus 7 isolated. However, we're still stuck with the old, super-long strand from the beginning and our new, still-too-long complementary strand. Let's ignore the old one for now and just focus on the one that's a bit shorter (the complementary strand). What we can do next is simple, we just set another primer to the new strand, this time on locus 2:

(3') A-C-T-T-G-A-C (5')

And let the Polymerase do its job:

(3') A-C-T-T-G-A-C (5')
          (5') A-A-C-T-G (3')

Yes! We have our desired strand and can now eradicate cancer. Ok maybe not just yet. But anyway, notice that from now on, every single complementary strand we make of this new strand will be exactly perfect. You could almost call it ... a chain reaction.

Additionally, don't forget that we can always re-use the longer strands to make new "perfect" / "correct" ones, which will in turn again produce new correct strands and so on. Exponential!

So how can we implement this in code? Quite simple. We need to model the DNA as a class, consisting of two strands, which in turn consist of nucleotides, which in turn consist of an organic base as well as some information about its locus. This DNA model should give us access to the two strands -- left and right -- as well as make it possible to create a complementary strand for either of the two. Moreover, the DNA class' interface should make it easy to access the 3' and 5' end of either strands and provide a means of iterating over their nucleotides in the correct direction (5' to 3'). Lastly, there should also be an easy way of initializing the DNA class' left and right strands, either from another strand or directly from a sequence of bases. This description paves the way for the following C++ class:


#include <vector>
#include <deque>

class DNA
    enum class Base { A, C, G, T };
    typedef std::size_t locus_t;
    struct Nucleotide { Base base; locus_t locus; };
    typedef std::vector<Base> sequence_t;
    typedef std::deque<Nucleotide> strand_t;
    typedef std::vector<strand_t> container_t;

    typedef strand_t::iterator left_iterator;
    typedef strand_t::const_iterator const_left_iterator;

    typedef strand_t::reverse_iterator right_iterator;
    typedef strand_t::const_reverse_iterator const_right_iterator;
    DNA(const sequence_t& left);
    DNA(const sequence_t& left,
        const sequence_t& right);
    DNA(const strand_t& left);
    DNA(const strand_t& left,
        const strand_t& right);
    void setLeftStrand(const strand_t& strand);
    void setLeftStrand(const sequence_t& bases);
    strand_t getLeftStrand();
    const strand_t getLeftStrand() const;
    left_iterator getLeftFivePrime();
    left_iterator getLeftThreePrime();
    const_left_iterator getLeftFivePrime() const;
    const_left_iterator getLeftThreePrime() const;
    strand_t getLeftComplementary(locus_t primer = 0) const;
    void setRightStrand(const strand_t& strand);
    void setRightStrand(const sequence_t& bases);
    strand_t getRightStrand();
    const strand_t getRightStrand() const;
    right_iterator getRightFivePrime();
    right_iterator getRightThreePrime();
    const_right_iterator getRightFivePrime() const;
    const_right_iterator getRightThreePrime() const;
    strand_t getRightComplementary(locus_t primer = 0) const;
    locus_t rightToLeftLocus(locus_t rightLocus) const;
    locus_t leftToRightLocus(locus_t leftLocus) const;
    strand_t left_;
    strand_t right_;

And definition:

#include "pcr.hpp"

#include <stdexcept>

DNA::DNA() = default;

DNA::DNA(const sequence_t& left)
    setRightStrand(getLeftComplementary(left.size() - 1));

DNA::DNA(const sequence_t& left,
         const sequence_t& right)

DNA::DNA(const strand_t& left)
: left_(left),
  right_(getLeftComplementary(left_.size() - 1))
{ }

DNA::DNA(const strand_t& left,
         const strand_t& right)
: left_(left),
{ }

void DNA::setLeftStrand(const strand_t& strand)
    left_ = strand;

void DNA::setLeftStrand(const std::vector& bases)
    for (locus_t i = 0, end = bases.size(); i < end; ++i)
        left_.push_back({bases[i], i});

DNA::strand_t DNA::getLeftStrand()
    return left_;

const DNA::strand_t DNA::getLeftStrand() const
    return left_;

DNA::left_iterator DNA::getLeftFivePrime()
    return left_.begin();

DNA::const_left_iterator DNA::getLeftFivePrime() const
    return left_.begin();

DNA::left_iterator DNA::getLeftThreePrime()
    return left_.end();

DNA::const_left_iterator DNA::getLeftThreePrime() const
    return left_.end();

DNA::strand_t DNA::getLeftComplementary(locus_t primer) const
    strand_t complementary;
    const_left_iterator itr = std::find_if(getLeftFivePrime(),
                                           [&] (const Nucleotide& n)
                                           { return n.locus == primer; });
    if (itr == getLeftThreePrime())
        throw std::invalid_argument("Could not find primer locus in left strand!");
    for(const_left_iterator begin = getLeftFivePrime();
        itr >= begin;
        Base base;
            case DNA::Base::A: base = DNA::Base::T; break;
            case DNA::Base::C: base = DNA::Base::G; break;
            case DNA::Base::G: base = DNA::Base::C; break;
            case DNA::Base::T: base = DNA::Base::A; break;
        complementary.push_front({base, itr->locus});
    return complementary;

void DNA::setRightStrand(const strand_t &strand)
    right_ = strand;

void DNA::setRightStrand(const std::vector& bases)
    for (locus_t i = 0, end = bases.size(); i < end; ++i)
        right_.push_back({bases[i], i});

DNA::strand_t DNA::getRightStrand()
    return right_;

const DNA::strand_t DNA::getRightStrand() const
    return right_;

DNA::right_iterator DNA::getRightFivePrime()
    return right_.rbegin();

DNA::const_right_iterator DNA::getRightFivePrime() const
    return right_.rbegin();

DNA::right_iterator DNA::getRightThreePrime()
    return right_.rend();

DNA::const_right_iterator DNA::getRightThreePrime() const
    return right_.rend();

DNA::strand_t DNA::getRightComplementary(locus_t primer) const
    strand_t complementary;
    const_right_iterator itr = std::find_if(getRightFivePrime(),
                                            [&] (const Nucleotide& n)
                                            { return n.locus == primer; });
    if (itr == getRightThreePrime())
        throw std::invalid_argument("Could not find primer locus in right strand!");
    for(const_right_iterator begin = getRightFivePrime();
        itr >= begin;
        Base base;
            case DNA::Base::A: base = DNA::Base::T; break;
            case DNA::Base::C: base = DNA::Base::G; break;
            case DNA::Base::G: base = DNA::Base::C; break;
            case DNA::Base::T: base = DNA::Base::A; break;
        complementary.push_back({base, itr->locus});
    return complementary;

DNA::locus_t DNA::leftToRightLocus(locus_t leftLocus) const
    return right_.size() - leftLocus - 1;

DNA::locus_t DNA::rightToLeftLocus(locus_t rightLocus) const
    return left_.size() - rightLocus - 1;

The interface you see here matches the description I gave above. I'll nevertheless talk about two non-obvious things about this class: first, why the Nucleotide locus is hardcoded and, secondly, how the DNA::getLeftComplementary() and DNA::getRightComplementary() methods work.

 Our DNA strands are std::deque<Nucleotide> (typedef strand_t, deque and not vector because we can push_front), where each Nucleotide consists of a member of the DNA::Nucleotide enum class, as well as a locus. The locus is hardcoded for each Nucleotide and not simply contained in the Nucleotide's index in the std::deque, because when DNA strands are complemented, they reduce in size and thus a primer locus wouldn't simultaneously be valid for the original and the complementary strand. As an example, if you set the primer locus to position 10 of a 15-nucleotide-long original strand, this would make the complementary strand 5 nucleotides long (indices 0-4) and thus locus 10 is invalid for the complementary strand. By hardcoding the locus for each nucleotide and not relying on the strand index, locus 10 would always work, because the complementary strand's nucleotides copy the original strand's hardcoded loci (original strand's hardcoded loci go from 0 to 14, complementary strand's loci then go from 10 to 14 → locus 10 valid for both).

Now to DNA::getLeftComplementary(). Its purpose is to take a locus and then return the complementary strand of the DNA object's left strand, starting from that locus (think of this locus as the primer position, just that it is actually included in the complementary strand, unlike the primer's position in real world PCR). At first, we search for the locus in the left strand using std::find_if(), to which we pass a lambda as a predicate which will make std::find_if look at the strand's nucleotide loci and return true if any locus matches the primer locus. We throw an exception if the locus isn't found. Else, we enter a for loop where we get the complementary base for each of the left strand's nucleotide bases (starting at the primer locus) and add this complementary base alongside the current locus (taken from left strand's nucleotide) as a nucleotide to the complementary strand (which is, don't forget, just a Nucleotide container). We iterate backwards from the primer locus because the complementary strand runs antiparallel, so it must move towards the 5' end of the original (left) strand.

Notice also that we call complementary.push_front and not push_back, to simulate this antiparallel behavior. You see, our right and left strand's are actually stored antiparallel in the DNA class, so we must append Nucleotides as well as iterate over them in different directions. Therefore, the reason why DNA::left_iterator is a typedef of strand_t::iterator (strand_t is the std::deque<Nucleotide>) and DNA::right_iterator is a typedef of strand_t::reverse_iterator, is that DNA::getLeftFivePrime() returns left_.begin(), so a normal forward iterator, while DNA::getRightFivePrime() returns right_.rbegin(), which is a reverse_iterator that actually points to the last valid Nucleotide in the strand (would be right_.end() - 1 as a normal iterator). We can then iterate over both strands from their respective 5' ends to their respective 3' ends in the same way.

So much to the DNA class. Let's talk about the algorithm for the actual PCR function. It'll be recursive. It'll take a DNA object and return a container of strands containing (mostly) the target sequence, which we create using the Polymerase Chain Reaction algorithm. It'll also take a "first" and a "last" locus, that tell us where the target DNA sequence starts and ends ("first" and "last" are included in the sequence). The PCR function's last parameter will be the number of iterations that should be performed for the algorithm.

Have a look a this tree diagram. It shows the development of the strands of each DNA strand pair for each iteration of the PCR function. A check mark stands for a strand that is "correct",  which we defined before as any strand that is the target sequence. Conversely, a cross denotes an "incorrect" strand, which is any strand that is not the target sequence (such as the initial DNA pair).

The initial DNA strand pair is iteration 0 of the algorithm, the second row is iteration 1, the third iteration 2, the fourth is iteration 3 and the last is iteration 4. Regarding some notation to address single strands in this tree, we'll have one number stand for the iteration (0 is the top of the tree, 4 is the bottom here), a second number to to stand for the strand pair (left to right) and "l" or "r" is then either the left or the right strand of that pair, respectively. To give some examples:

00l) is in the 0th row of this tree (= 0th iteration), the 0th pair, the left strand (so the left strand of the initial DNA pair at the top of this tree)

37r) is in iteration 3, the 7th pair, the right strand (right most strand of this iteration)

22l) is in iteration 2, the 2nd pair, the left strand

... and so on.

We can make some generalizations for this algorithm. What we can say for sure, with little thought to it, is that after n iterations this algorithm produces a total of 2^(n+1) strands (both correct and incorrect). This is clear since the number of strands doubles for each iteration. The + 1 stems from the fact that we start with 2^1 strands and not 2^0 initially.

Next up, let's find out how many correct strands there are after each iteration. One solution is to find out how many incorrect strands there are after n iterations and to subtract that number from the total.

So how many incorrect strands do we get for n iterations? 2 * (n + 1) is the answer. Why? Have a look at the tree. Notice that during each iteration, two incorrect strands are added from the far left and far right (e.g. 10r creates 21r; 23l creates 37l etc.). The reason why is that the left and right most strands are always still as long as the original pair, so when you perform the PCR on them, you get a complementary strand that goes either from the first target locus to the end of the strand, or from the last target locus to the beginning of the strand. So they're still incorrect (though they yield a correct strand in the next iteration). Just to visualize, left and right most strands look something like this below, where | denotes either the last or first locus of the target sequence we want to copy:

00l, 10l, 20l, ...) 5' <---|-----|---> 3' (left most strand)

00r, 11r, 23r, ...) 3' <---|-----|---> 5' (right most strand)

When we perform the first PCR cycle on them, you'll get complementary strands that are almost correct. So for iteration 1, the two new incorrect strands are (don't forget complementary strands run antiparallel):

10r) 3' <---|-----| 5' (complementary to left most strand)

11l) 5' |-----|---> 3' (complementary to right most strand)

These are the two incorrect strands that are added during each iteration.  In 2 * (n + 1), the 2 therefore stems from the fact that we progress in steps of 2 (since 2 incorrect strands are added for each n). The n + 1 is there because we already have 2 incorrect strands for iteration 0 (at the start).

So we know that the total number of strands, correct and incorrect, for each iteration is 2^(n+1) and that the number of incorrect strands for each iteration is 2(n + 1). Therefore, the first solution to finding the number of correct strand for each iteration is: 2^(n+1) - 2(n+1)

The second approach to finding the number of correct strands is to look directly at the correct strands produced per iteration and how they develop with time. Going back to the visualization from above, we notice that the two incorrect complementary strands will yield correct strands in the next iteration:

21r) 3' <---|-----| 5' (previously 10r)

21l) 5' |-----| 3' (complementary to 21r / 10r)

23l) 5' |-----|---> 3' (previously 11l)

23r) 3' |-----| 5' (complementary to the 23l / 11l)

We can say that starting with iteration 1 (n=1), every incorrect strand yields a correct strand, except for the the incorrect strands at the far left and far right, since those yield the two new incorrect strands we just spoke about (e.g. 10r or 11l), which keep on wandering down the tree producing those correct strands (10r -> 21r -> 32r ...). But just to clarify, those newly created incorrect strands like 10r 11l on the far left and far right mean that the number of correct strands produced by incorrect strands also increments by 2 per iteration (new incorrect strand in n means new correct strand in n+1). Therefore, we can say that for n iterations where n is greater zero, 2(n-1) correct strands are added solely by complementing incorrect ones.

So those are the correct strands produced from incorrect ones. What about correct strands produced from correct ones? Those will be correct again, of course:

21l) 5' |-----| 3'


32l) 3' |-----| 5' (complementary of 21l)
32r) 3' |-----| 5' (previously 21l)

This time, the number of correct strands doubles. For the next iteration, they will double again, thereafter they double again and so on. This is clearly a case for a power of two. So we know two that the number of newly added correct strands per iteration is 2(n-1). We also know that these 2(n-1) strands will subsequently double for each iteration. Therefore, two iterations on, there will be 2 * 2 * 2(n-1) = 2^3(n-1). We can generalize and say that after m iterations, where m equals 1 initially, there will be 2^m * (n-1) strands created from those initial 2(n-1). Notice that this also works at the very beginning when m is 1: 2^1 * (n-1) = 2(n-1).

What we can conclude from this is that to calculate the number of correct strands directly (i.e. without just subtracting the number of incorrect from the total) for a single iteration n, we have to count from n back to the first iteration and perform this operation of doubling the single, newly created correct strands as we just did above, but for each of the previous iterations. Let me put it like this: in iteration n we have 2(n-1) newly added correct strands. Those haven't doubled yet. In the previous iteration, so n-1, we had 2(n-2) newly added correct strands. Those have doubled, so in iteration n those 2(n-2) correct strands have turned into 2*2(n-2) = 2^2(n-2) strands. In iteration n-2 we had 2(n-3) correct strands, those doubled in n-1 and then doubled again for n. So in n, those 2(n-3) have turned into 2*2*2(n-3) = 2^3*(n-3) strands. If we do this for every iteration starting at n, going back to 1, we get the total number of correct strands present in iteration n:

Phew. Hope you survived that. To answer the question you're asking yourself: no, that knowledge won't really be of any use for the C++ PCR algorithm I'm about to show you. But thanks for the fun discussion!

Here's the actual PCR non-member non-friend function:

DNA::container_t pcr(const DNA& dna,
                     DNA::locus_t first,
                     DNA::locus_t last,
                     std::size_t n)
    // If we reached the maximum number of iterations,
    // return the DNA strand pair
    if (! n) return {dna.getLeftStrand(), dna.getRightStrand()};
        // Create left branch of this position in the tree
        DNA left {dna.getLeftStrand(), dna.getLeftComplementary(last)};
        // rvalue reference to capture the return value
        DNA::container_t&& leftReturned = pcr(left, first, last, n - 1);
        // Same thing for the right hand side
        DNA right {dna.getRightComplementary(first), dna.getRightStrand()};
        DNA::container_t&& rightReturned = pcr(right, first, last, n - 1);
        // Tie together right and left returned values
        leftReturned.insert(leftReturned.end(), rightReturned.begin(), rightReturned.end());
        // Then return the two
        return leftReturned;

Ok, maybe I'm actually wrong. Those last few paragraphs are of use. They should make this function really, really easy to understand. What we do is: move left down the tree whenever we can, else back and then right. Essentially, we create complementary DNA strand pairs moving down the tree as long as we can, then return all the DNA pairs from the very bottom (when the parameter n is zero).

Test run:

#include "pcr.hpp"

#include <iostream>

int main(int argc, char * argv [])
    // Our whole DNA. Note that this is
    // the left strand, the constructor
    // initializes the right as the left's
    // complementary
    DNA dna({DNA::Base::A,
    // Do the PCR using our dna, with the first
    // locus of our target sequence at 1 (= C)
    // and the last locus at 3 (= T). Perform
    // 4 iterations.
    DNA::container_t ret = pcr(dna, 1, 3, 4);
    std::size_t correct = 0, incorrect = 0;
    for (auto i : ret)
        // Suffices to do the check
        if (i.size() == 3) ++correct;
        else ++incorrect;
        for (auto j : i)
            switch (j.base)
                case DNA::Base::A: std::cout << "A\t"; break;
                case DNA::Base::C: std::cout << "C\t"; break;
                case DNA::Base::G: std::cout << "G\t"; break;
                case DNA::Base::T: std::cout << "T\t"; break;
        std::cout << std::endl;
    std::cout << std::endl << "Correct: " << correct << std::endl;
    std::cout << "Incorrect: " << incorrect << std::endl;


A C G T C 
T G C A 
C G T 
T G C A 
C G T 
G C A 
C G T 
T G C A 
C G T 
G C A 
C G T 
G C A 
C G T 
G C A 
C G T 
T G C A 
C G T C 
G C A 
C G T 
G C A 
C G T 
G C A 
C G T 
G C A 
C G T C 
G C A 
C G T 
G C A 
C G T C 
G C A 
C G T C 
T G C A G 

Correct: 22
Incorrect: 10

Note that it doesn't matter if a strand is the complementary of the target sequence, it still counts as correct because biologically speaking complementary strand and original strand are the same thing. The bases you see here are the bases of the DNA strands at the bottom of the tree from before, just turned by 270°. As discussed above, the far left and far right (here very top and very bottom) strands are the original strands (ACGTC), the other strands of those two pairs as well as a few others in the tree are the "almost correct ones" (TGCA / CGTC) and all those strands with three nucleotides are our target sequence: CGT / GCA (same thing).

What about the numbers? As we said (n is 4, don't forget):

Number of correct and incorrect strands after n iterations:

2^(n+1) = 2^(4+1) = 32 ... 22 + 10 = 32 ✓

Number of incorrect strands after n iterations:

2(n+1) = 2(4+1) = 2 * 5 = 10 ✓

Number of correct strands after n iterations:

2^(n+1) - 2(n+1) = 32 - 10 = 22 ✓


2^1*(n-1) + 2^2*(n-2) + 2^3*(n-3) = 2 * 3 + 4 * 2 + 8 * 1 = 22 ✓

We can thus conclude:

Total number of blog posts about modeling the Polymerase Chain Reaction in C++: 1

Thanks for reading, hope you learnt something :) kthxbye