Faster Python#

Interfacing Python with other languages#

Numpy arrays provide a means to speed up Python code. However, they only work well when expressions can be “vectorized”, i.e. when operations can be performed element-wise. A more flexible and general approach to speeding up Python code is to write extensions in a compiled language, which are then interfaced with the Python code.

Python + Fortran = f2py#

Fortran and Python are a good match for scientific computing:

  • numpy arrays and Fortran arrays play rather well together

  • writing Fortran kernels gives you full control and enables low-level optimization

The standard tool to interface Python and Fortran is f2py, which is installed by default along with numpy. Note that a working Fortran compiler is required for f2py to work.
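
A quick way to check from Python that the f2py command and a Fortran compiler are actually available (this simply looks up the executables on your PATH):

import shutil

print(shutil.which('f2py'))      # path to the f2py command, or None if missing
print(shutil.which('gfortran'))  # path to a Fortran compiler, or None if missing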

We start with a piece of Fortran code, saved in a file called kernels.f90, that performs the same linear combination of arrays we saw above, as well as a simple dot product

module kernels
  implicit none
contains
  subroutine daxpy(a, x, y, f)
    real(8), intent(in) :: a, x(:), y(:)
    real(8), intent(inout) :: f(:)
    f = x + a*y
  end subroutine daxpy

  subroutine dot(x, y, f)
    real(8), intent(in) :: x(:), y(:)
    real(8), intent(out) :: f
    f = dot_product(x, y)
  end subroutine dot
end module kernels

We compile and create a Python wrapper module called f90

f2py -c -m f90 kernels.f90

We can now call the routine from Python. Note that the datatypes must match on Python and Fortran sides, according to these rules

Numpy types   Fortran types
-----------   --------------------------
float64       real(8) / double precision
float32       real(4) / real
int32         integer(4) / integer
int64         integer(8)
complex64     complex(4)
complex128    complex(8)

Note that numpy uses 64-bit types (float64 and int64) for both floats and integers by default. So either on the Fortran side we declare integers as integer(8), matching arrays created on the Python side with

a = numpy.ones(10, dtype='int64')

or, if the Fortran side uses the default integer kind (4 bytes), we declare the integer arrays on the Python side with dtype='int32'.
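
You can check the defaults directly (note that on some platforms, e.g. Windows, the default integer type is int32 instead):

import numpy

print(numpy.ones(3).dtype)    # float64 -> matches real(8)
print(numpy.arange(3).dtype)  # int64 on most 64-bit platforms -> matches integer(8)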

Let’s test the code

import numpy
import f90

N = 10
x = numpy.ones(N, dtype='float64')
y = numpy.ones(N, dtype='float64')
f = numpy.ones(N, dtype='float64')
a = 2.0
f90.kernels.daxpy(a, x, y, f)
print(f)
[3. 3. 3. 3. 3. 3. 3. 3. 3. 3.]

To sum up, f2py works pretty well as long as you adhere to a few simple guidelines.

On the Fortran side:

  • declare the intent of all the arguments of your Fortran subroutines, using the intent(inout) attribute for variables you want to modify in place

  • declare your floating point variables as real(8) or double precision, unless you know how to handle the type matching yourself

  • instead of Fortran functions, use subroutines with intent(out) variables as the last arguments - they will be returned as results

f = f90.kernels.dot(x, y)
print(f)
10.0

On the Python side:

  • use dtype=numpy.int32 for integer arrays, unless you use integer(8) on the Fortran side

  • in-place modification of scalar variables requires passing a 0-dimensional array (e.g. numpy.array(1.0)) - because Python int and float objects are immutable!

  • shape multi-dimensional numpy arrays using the Fortran layout: for instance, numpy.ndarray((3, 10), order='F') will give the right memory layout (see the sketch after this list)
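
A minimal sketch of the last two points, using only numpy (no Fortran call involved):

import numpy

# A 0-dimensional array can be modified in place, unlike a Python float
s = numpy.array(1.0)
s[()] = 3.0        # assign to the single element of the 0-d array
print(float(s))    # 3.0

# Fortran (column-major) layout, e.g. 3 coordinates x 10 particles
pos = numpy.zeros((3, 10), order='F')
print(pos.flags['F_CONTIGUOUS'])  # True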

Just-in-time compilation#

From https://www.freecodecamp.org/news/just-in-time-compilation-explained/:

Static compilation converts the code into a language (typically
machine code) for a specific platform. An interpreter directly
executes the source code, typically one line at a time. Just-in-time
(JIT) compilation attempts to use the benefits of both. While the
interpreted program is being run, the JIT compiler determines the most
frequently used code and compiles it to machine code.

There are several approaches to speed up the execution of Python code beyond what numpy can offer. The general idea is to translate parts of your Python code into machine code at run time and execute the compiled version. Several popular packages achieve this (and more!); one of the most widely used is numba, which we cover below.

Alternative Python distributions like pypy also provide just-in-time compilation.

These approaches give you limited control over how the Python code is translated under the hood. If you want to have control and are fine with coding some low-level kernels in Fortran yourself, you can use f2py-jit.

The Julia programming language also takes the JIT approach as a starting point. It is a modern programming language that strikes an excellent balance between ease of use (typical of interpreted languages) and efficiency (typical of compiled languages).

Numba#

Numba is a popular package to compile Python code just-in-time. Install it in your virtual environment with

pip install numba

Python functions decorated with numba.jit will generally run faster than pure Python.

import numpy
from numba import jit

N = 5000
a = 1.0
x = numpy.array([1.0] * N)
y = numpy.array([2.0] * N)
z = numpy.array([0.0] * N)

def daxpy(a, x, y, z):
    for i in range(N):
        z[i] = a * x[i] + y[i]

def daxpy_vector(a, x, y, z):
    z[:] = x[:] + a * y[:]

daxpy_numba = jit(daxpy)
daxpy_numba.__name__ = 'daxpy_numba'

To get the timings, we can use the timeit module, which will repeat the function execution several times. I will actually use this custom decorator, which is a bit more flexible

import functools
def timeit(func, minimum_time=1.0, size=1, verbose=True, fmt='.1e', calls=0):
    """
    Decorate function to measure its execution time per unit call, or
    normalized by `(call*size)` if the argument `size` is provided
    (the computational size of the problem).

    The `minimum_time` argument is used to ensure the timing takes no
    less than that time. It defaults to 1 second.
    """
    @functools.wraps(func)
    def _func(*args, **kwargs):
        import time

        # Estimate execution time
        ti = time.time()
        res = func(*args, **kwargs)
        tf = time.time()
        # Now measure execution time
        _calls = calls
        if calls == 0:
            _calls = max(1, int(minimum_time / (tf - ti)))
        ti = time.time()
        for _ in range(_calls):
            res = func(*args, **kwargs)
        tf = time.time()
        # Return the normalized time per unit call/size
        dt = (tf-ti) / (_calls * size)
        if verbose:
            unit = 'sec/call/size' if size > 1 else 'sec/call'
            print(f'{dt:{fmt}} {unit} [{func.__name__}]')
        return dt
    return _func
# Standard numpy version
timeit(daxpy)(a, x, y, z)

# Decorated with numba
timeit(daxpy_numba)(a, x, y, z)

# Vectorized numpy
timeit(daxpy_vector)(a, x, y, z)
1.9e-03 sec/call [daxpy]
2.2e-06 sec/call [daxpy_numba]
6.8e-06 sec/call [daxpy_vector]

With the custom decorator, the numba code is much faster than the original numpy code and even faster than the vectorized numpy version. This result is a bit surprising, and in fact measuring the timings with the timeit module and with the custom decorator gives appreciably different results for numba.

from timeit import timeit as _timeit

# Standard numpy version
print('Elapsed time (numpy): {:.2f} s'.format(_timeit('daxpy(a, x, y, z)', number=10000, globals=globals())))

# Decorated with numba
daxpy_numba = jit(daxpy)
print('Elapsed time (numba): {:.2f} s'.format(_timeit('daxpy_numba(a, x, y, z)', number=10000, globals=globals())))

# Vectorized numpy
print('Elapsed time (numpy vector): {:.2f} s'.format(_timeit('daxpy_vector(a, x, y, z)', number=10000, globals=globals())))
Elapsed time (numpy): 15.83 s
Elapsed time (numba): 0.09 s
Elapsed time (numpy vector): 0.07 s

The origin of this discrepancy is not clear at the moment…

The jit() function is a decorator, which can be applied directly to the Python function using the decorator syntax

@jit(nopython=True)
def daxpy(a, x, y, z):
    for _ in range(10000):
        for i in range(N):
            z[i] = x[i] + a * y[i]

The jit() decorator offers a few parameters to play with (see the sketch after this list):

  • nopython instructs numba to raise an exception if the code cannot be compiled in nopython mode, which is faster. Otherwise, numba may fall back to object mode, which is slower. The njit() decorator defaults to nopython=True

  • cache avoids recompilation every time you execute the Python program

  • parallel enables automatic parallelization

  • inline controls function inlining

  • fastmath enables more aggressive floating-point optimizations
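
As an illustration, here is a minimal sketch combining a few of these options; prange requires parallel=True, and the actual speedup depends on your hardware:

from numba import njit, prange

# njit(...) is equivalent to jit(nopython=True, ...).
# cache=True stores the compiled code on disk across runs,
# parallel=True lets numba parallelize the prange loop,
# fastmath=True allows more aggressive floating-point optimizations.
@njit(cache=True, parallel=True, fastmath=True)
def daxpy_parallel(a, x, y, z):
    for i in prange(x.shape[0]):
        z[i] = x[i] + a * y[i]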

Check out which Python features and which numpy features are supported by numba, as well as additional performance tips.

f2py-jit#

f2py-jit builds efficient Fortran extensions for Python at execution time. It extends the machinery of f2py to provide

  • Seamless compilation of source blocks as Python strings

  • Caching of module builds across executions

  • Optional inlining of Fortran routines (experimental)

Install it in your virtual environment with

pip install f2py-jit

Since f2py-jit is built on top of f2py, interfacing Python and Fortran 90 is pretty smooth and follows the same rules. See https://coslo.frama.io/f2py-jit/tutorial/ for more details.

There are two equivalent ways to use f2py-jit. You can provide the Fortran code as a string and compile it just-in-time

from f2py_jit import jit

source = """
subroutine hello()
  print*, "Hello world!"
end subroutine hello
"""
f90 = jit(source)
f90.hello()
Hello world!

Alternatively, we can reuse the kernels.f90 file we used to test f2py. We compile and import it like this

from f2py_jit import jit

f90 = jit('kernels.f90')
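
The compiled module can then be used just like the f2py build above; as a sketch, reusing the daxpy routine from kernels.f90:

import numpy

x = numpy.ones(10, dtype='float64')
y = numpy.ones(10, dtype='float64')
f = numpy.zeros(10, dtype='float64')
# The Fortran module `kernels` is exposed as a submodule, as with f2py
f90.kernels.daxpy(2.0, x, y, f)
print(f)  # should print [3. 3. ... 3.]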

Note

When developing a package, the Fortran source file will be stored in the package directory itself. In this case, you must provide the absolute path of the source file to jit(); see the snippet below for how to do that. This is also necessary when executing a script from a directory other than the one where the script resides.

If the Fortran source file is in the same directory as the Python code, this is the safest way to call jit()

from f2py_jit import jit
import os

pwd = os.path.dirname(__file__)
jit(os.path.join(pwd, 'kernels.f90'))

Module builds are cached by default: if we call jit() and the module associated with the source file(s) has already been built, no recompilation is attempted, even across sessions. To clear the cache, use f2py_jit.clear_modules().
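
For instance, to force a full rebuild you can clear the cache before compiling (clear_modules() is the cache-clearing function mentioned above):

import f2py_jit

f2py_jit.clear_modules()           # remove all cached module builds
f90 = f2py_jit.jit('kernels.f90')  # recompiles from scratch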

You can specify optimization flags for the compiler when building the Fortran module. This allows you to fine-tune the compilation flags at run time and squeeze the most out of your code.

Here is a simple example:

from f2py_jit import jit

source = """
subroutine series(n, s)
  integer, intent(in) :: n
  real(8), intent(out) :: s
  s = 0.0
  do i = 1,n
    s = s + exp(1.0/(real(i))**2)
  end do
end subroutine
"""

# Try out a few gfortran flags combos
for flags in ['-O0', '-O1', '-O3', '-O3 -ffast-math']:
    f90 = jit(source, flags=flags)
    time = timeit(f90.series, verbose=False)(int(1e8))
    print(f'{flags:16}: {time:.4f}')
-O0             : 0.6995
-O1             : 0.5303
-O3             : 0.5251
-O3 -ffast-math : 0.2427

Notice the gap between -O0 and -O1 flags and the extra boost coming from less accurate exponential evaluation when using -ffast-math.

Performance comparison#

Summing a series#

Here we compare the performance of numba and f2py-jit on a simple example (single loop)

from numba import jit

@jit(nopython=True)
def series_numba(n):
    s = 0.0
    for i in range(1, n+1):
        s = s + 1.0 / float(i)**2
    return s

This is the equivalent code with f2py-jit

source = """
subroutine series(n, s)
  integer, intent(in) :: n
  real(8), intent(out) :: s
  s = 0.0
  do i = 1,n
    s = s + 1.0/(real(i))**2
  end do
end subroutine series
"""
from f2py_jit import jit

f90 = jit(source)
series_f90 = f90.series
series_f90.__name__ = 'series_f90'

The timings are pretty much the same in this simple example (note that the first execution of the numba function will be slower because of the compilation overhead).
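
If you time the numba function with the standard timeit module instead, you can keep the compilation overhead out of the measurement with a warm-up call (a minimal sketch; the custom decorator above already performs one untimed call, so it is not affected):

from timeit import timeit as _timeit

series_numba(10)  # warm-up call: triggers JIT compilation outside the measurement
print(_timeit('series_numba(int(1e8))', number=10, globals=globals()))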

n = int(1e8)
timeit(series_numba)(n)
timeit(series_f90)(n)
1.15e-01 sec/call [series_numba]
1.18e-01 sec/call [series_f90]

N-body problem#

Now we compare the performance of numba and f2py-jit on a minimal N-body energy calculation (double loop).

This is the numba code

import numpy
from numba import jit

@jit(nopython=True, fastmath=True)
def potential(r, sigma, epsilon, n):
    return epsilon * (sigma/r)**n

@jit(nopython=True, fastmath=True)
def interaction(pos, sigma, epsilon, n):
    N = pos.shape[0]
    U = 0.0
    for i in range(N):
        for j in range(i+1, N):
            r = pos[i, :] - pos[j, :]
            # rij = numpy.sqrt(numpy.sum(r**2))  # this is slower
            rij = numpy.sum(r**2)**0.5
            if rij > 0.6:
                U += potential(rij, sigma, epsilon, n)
    U /= N
    return U

func_numba = interaction
func_numba.__name__ = 'func_numba'

This is the equivalent Fortran code, called nbody.f90

subroutine potential(r, sigma, epsilon, n, u)
  double precision, intent(in) :: r, sigma, epsilon
  integer, intent(in) :: n
  double precision, intent(out) :: u
  u = epsilon * (sigma/r)**n
end subroutine potential

subroutine interaction(pos, sigma, epsilon, n, U)
  double precision, intent(in) :: pos(:,:), sigma, epsilon
  integer, intent(in) :: n
  double precision, intent(out) :: U
  double precision :: rij, r(size(pos,1)), uij
  integer :: i, j
  U = 0.0
  do i = 1,size(pos, 2)
    do j = i+1,size(pos, 2)
      r = pos(:, i) - pos(:, j)
      rij = sum(r**2)**0.5
      if (rij > 0.6) then
        call potential(rij, sigma, epsilon, n, uij)
        U = U + uij
      end if
    end do
  end do
  U = U / size(pos,2)
end subroutine interaction

which we now compile

from f2py_jit import jit

f90 = jit('nbody.f90', flags='-O3 -ffast-math')
func_f90 = f90.interaction
func_f90.__name__ = 'func_f90'

Let’s get the timings

import numpy

N = 2000
L = 5.0
numpy.random.seed(1)
pos = (numpy.random.random((N, 3)) - 0.5) * L / 2
timeit(func_numba, fmt='.3f')(pos, sigma=1.0, epsilon=1.0, n=12)
pos = pos.transpose()  # transpose the array for fortran
timeit(func_f90, fmt='.3f')(pos, sigma=1.0, epsilon=1.0, n=12)
0.187 sec/call [func_numba]
0.027 sec/call [func_f90]

In this specific example, f2py-jit is much faster than numba (roughly a factor of 7) despite the fact that we used the same kind of optimizations. Note that inlining the Fortran code gives an appreciable speedup here (see the sketch below).
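
A sketch of how to try the experimental inlining feature of f2py-jit mentioned earlier; the inline keyword is an assumption based on the f2py-jit documentation, so check the version you have installed:

from f2py_jit import jit

# Assumption: inlining of the potential() call inside the loop is requested via `inline=True`
f90_inline = jit('nbody.f90', flags='-O3 -ffast-math', inline=True)
func_f90_inline = f90_inline.interaction
timeit(func_f90_inline, fmt='.3f')(pos, sigma=1.0, epsilon=1.0, n=12)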

We also check that the result is the same:

N = 2000
L = 5.0
numpy.random.seed(1)
pos = (numpy.random.random((N, 3)) - 0.5) * L / 2
print(func_numba(pos, sigma=1.0, epsilon=1.0, n=12))
pos = pos.transpose()
print(func_f90(pos, sigma=1.0, epsilon=1.0, n=12))
5742.988081452041
5742.988081452041