Faster Python#
Interfacing Python with other languages#
Numpy arrays provides means to speed up a Python code. However, they only work well when expressions can be “vectorized”, i.e. operations can be performed element-wise. A more flexible and general approach to speed up Python code is to write extensions in a compiled language, which are then interfaced with the Python code.
Python + Fortran = f2py#
Fortran and Python are a good match for scientific computing:
numpy and Fortran arrays and Fortran play rather well together
writing Fortran kernels give you full control and enable low-level optimization
The standard tool to interface Python and Fortran is f2py, which is installed by default along with numpy
. Note that a working Fortran compiler is required for f2py
to work.
We start with a piece of Fortran code that performs the same linear combination of arrays we saw above as well as a simple dot product
module kernels
implicit none
contains
subroutine daxpy(a, x, y, f)
real(8), intent(in) :: a, x(:), y(:)
real(8), intent(inout) :: f(:)
f = x + a*y
end subroutine daxpy
subroutine dot(x, y, f)
real(8), intent(in) :: x(:), y(:)
real(8), intent(out) :: f
f = dot_product(x, y)
end subroutine dot
end module kernels
We compile and create a Python wrapper module called f90
f2py -c -m f90 kernels.f90
We can now call the routine from Python. Note that the datatypes must match on Python and Fortran sides, according to these rules
Numpy types |
Fortran types |
---|---|
float64 |
real(8) / double precision |
float32 |
real(4) / real |
int32 |
integer(4) / integer |
int64 |
integer(8) |
complex64 |
complex(4) |
complex128 |
complex(8) |
Note that Python uses double precision for both float and integer by default. So either on the Fortran side we use double precision integers
a = numpy.ones(10, dtype='int64')
or on the Python side declare integers as 'int32'
.
Let’s test the code
import numpy
import f90
N = 10
x = numpy.ones(N, dtype='float64')
y = numpy.ones(N, dtype='float64')
f = numpy.ones(N, dtype='float64')
a = 2.0
f90.kernels.daxpy(a, x, y, f)
print(f)
[3. 3. 3. 3. 3. 3. 3. 3. 3. 3.]
To sum up, f2py
works pretty well as long as you adhere to a few simple guidelines.
On the Fortran side:
declare
intent
’s of all the arguments of your Fortran subroutines, using theintent(inout)
attribute for variables you want to modify in-place.declare your floating point variables as
real(8)
ordouble precision
, unless you know how to handle typesinstead of Fortran functions, use subroutines with
intent(out)
variables as last arguments - they will be returned as result
f = f90.kernels.dot(x, y)
print(f)
10.0
On the Python side:
use
dtype=numpy.int32
for integer arrays, unless you useinteger(8)
on the Fortran sidein-place modification of scalar variables requires passing a 0-dimensional array (ex.
numpy.array(1.0)
) - because Pythonint
andfloat
objects are immutable!shape multi-dimensional numpy arrays using Fortran layout: for instance,
numpy.ndarray((3, 10), order=
’F’)= will give the right contiguity in memory layout
Just-in-time compilation#
From https://www.freecodecamp.org/news/just-in-time-compilation-explained/:
Static compilation converts the code into a language (typically
machine code) for a specific platform. An interpreter directly
executes the source code, typically one line at a time. Just-in-time
(JIT) compilation attempts to use the benefits of both. While the
interpreted program is being run, the JIT compiler determines the most
frequently used code and compiles it to machine code.
There are several approaches to speed up execution of Python code beyond what numpy can offer. The general idea is to translate parts of your Python code into machine language at run time, compile it and execute it. Some popular packages to achieve this (and more!) are
Alternate Python distributions like pypy also provide just-in-time compilation.
These approaches give you limited control on how the Python code is translated under the hoods. If you want to have control and are fine with coding some low-level kernels in Fortran yourself, you can use f2py-jit.
The Julia programming language also takes the JIT approach as a starting point. It is a modern programming language, which keeps an excellent balance between ease of use (typical of interpreted languages) and efficiency (typical of compiled languages).
Numba#
Numba is a popular package to compile Python code just-in-time. Install it in your virtual environment with
pip install numba
Python functions decorated with numba.jit
will generally run faster than pure Python.
import numpy
from numba import jit
N = 5000
a = 1.0
x = numpy.array([1.0] * N)
y = numpy.array([2.0] * N)
z = numpy.array([0.0] * N)
def daxpy(a, x, y, z):
for i in range(N):
z[i] = a * x[i] + y[i]
def daxpy_vector(a, x, y, z):
z[:] = x[:] + a * y[:]
daxpy_numba = jit(daxpy)
daxpy_numba.__name__ = 'daxpy_numba'
To get the timings, we can use the timeit
module, which will repeat the function execution several times. I will actually use this custom decorator, which is a bit more flexibile
import functools
def timeit(func, minimum_time=1.0, size=1, verbose=True, fmt='.1e', calls=0):
"""
Decorate function to measure its execution time per unit call, or
normalized by `(call*size)` if the argument `size` is provided
(the computational size of the problem).
The `minimum_time` argument is used to ensure the timing takes no
less than that time. It defaults to 1 second.
"""
@functools.wraps(func)
def _func(*args, **kwargs):
import time
# Estimate execution time
ti = time.time()
res = func(*args, **kwargs)
tf = time.time()
# Now measure execution time
_calls = calls
if calls == 0:
_calls = max(1, int(minimum_time / (tf - ti)))
ti = time.time()
for _ in range(_calls):
res = func(*args, **kwargs)
tf = time.time()
# Return the normalized time per unit call/size
dt = (tf-ti) / (_calls * size)
if verbose:
unit = 'sec/call/size' if size > 1 else 'sec/call'
print(f'{dt:{fmt}} {unit} [{func.__name__}]')
return dt
return _func
# Standard numpy version
timeit(daxpy)(a, x, y, z)
# Decorated with numba
timeit(daxpy_numba)(a, x, y, z)
# Vectorized numpy
timeit(daxpy_vector)(a, x, y, z)
1.9e-03 sec/call [daxpy]
2.2e-06 sec/call [daxpy_numba]
6.8e-06 sec/call [daxpy_vector]
The numba code is much faster than the original numpy code, and even slightly faster than the vectorized numpy version. This result is a bit surprising and actually measuring timings with timeit
or the custom decorator gives appreciably different results for numba.
from timeit import timeit as _timeit
# Standard numpy version
print('Elapsed time (numpy): {:.2f} s'.format(_timeit('daxpy(a, x, y, z)', number=10000, globals=globals())))
# Decorated with numba
daxpy_numba = jit(daxpy)
print('Elapsed time (numba): {:.2f} s'.format(_timeit('daxpy_numba(a, x, y, z)', number=10000, globals=globals())))
# Vectorized numpy
print('Elapsed time (numpy vector): {:.2f} s'.format(_timeit('daxpy_vector(a, x, y, z)', number=10000, globals=globals())))
Elapsed time (numpy): 15.83 s
Elapsed time (numba): 0.09 s
Elapsed time (numpy vector): 0.07 s
The origin of this discrepancy is not clear at the moment…
The jit()
function is a decorator, which can be applied directly to the Python function using the decorator syntax
@jit(nopython=True)
def daxpy(a, x, y, z):
for _ in range(10000):
for i in range(N):
z[i] = x[i] + a * y[i]
The jit()
decorator offers a few parameters to play with:
nopython
instructs numba to raise an exception if the code cannot be compiled innopython
mode, which is faster. Otherwise, numba may fallback toobject
compilation mode, which is slower. Thenjit()
decorator defaults tonopython=True
cache
avoids recompilation every time you execute the Python programparallel
enables automatic parallelizationinline
controls function inliningfast_math
provides more aggressive optimizations
Check out which Python features and which numpy features are supported by numba, as well as additional performance tips.
f2py-jit#
f2py-jit
builds efficient Fortran extensions for python at execution time. It extends the machinery of f2py to provide
Seamless compilation of source blocks as Python strings
Caching of module builds across executions
Optional inlining of Fortran routines (experimental)
Install it in your virtual environment with
pip install f2py-jit
Since f2py-jit
is built on top of f2py
, interfacing Python and Fortran 90 is pretty smooth and follows the same rules. See https://coslo.frama.io/f2py-jit/tutorial/ for more details.
There are two equivalent ways to use f2py-jit
. You can provide the Fortran code as string and compile it just-in-time
from f2py_jit import jit
source = """
subroutine hello()
print*, "Hello world!"
end subroutine hello
"""
f90 = jit(source)
f90.hello()
Hello world!
Otherwise, we use the same file we used to test f2py
. We compile it, import it and execute it like this
from f2py_jit import jit
f90 = jit('kernels.f90')
Note
When developing a package, the Fortran source file will be stored in the package directory itself. In this case, you must provide the absolute path of the source file to jit()
, see the snippet below how to do that. This is necessary also when executing a script from another directory than the one where the script resides.
If the Fortran source file is in the same directory as the Python code, this is the safest way to call jit()
from f2py_jit import jit
import os
pwd = os.path.dirname(__file__)
jit(os.path.join(pwd, 'kernels.f90'))
Module builds are cached by default: if we call jit()
and the module associated to the source or source files has been built already, no recompilation is attempted even across sessions. To clear the cache use f2py_jit.clear_modules()
.
You can specify optimization flags for the compiler when building the Fortran module. This allows you to fine tune the compilation flags at run time and squeeze the most out of your code.
Here is a simple example:
import timeit as _timeit
from f2py_jit import jit
source = """
subroutine series(n, s)
integer, intent(in) :: n
real(8), intent(out) :: s
s = 0.0
do i = 1,n
s = s + exp(1.0/(real(i))**2)
end do
end subroutine
"""
# Try out a few gfortran flags combos
for flags in ['-O0', '-O1', '-O3', '-O3 -ffast-math']:
f90 = jit(source, flags=flags)
time = timeit(f90.series, verbose=False)(int(1e8))
print(f'{flags:16}: {time:.4f}')
-O0 : 0.6995
-O1 : 0.5303
-O3 : 0.5251
-O3 -ffast-math : 0.2427
Notice the gap between -O0
and -O1
flags and the extra boost coming from less accurate exponential evaluation when using -ffast-math
.
Performance comparison#
Summing a series#
Here we compare the performance of numba
and f2py-jit
on a simple example (single loop)
from numba import jit
@jit(nopython=True)
def series_numba(n):
s = 0.0
for i in range(1, n+1):
s = s + 1.0 / float(i)**2
return s
This is the equivalent code with f2py-jit
source = """
subroutine series(n, s)
integer, intent(in) :: n
real(8), intent(out) :: s
s = 0.0
do i = 1,n
s = s + 1.0/(real(i))**2
end do
end subroutine series
"""
from f2py_jit import jit
f90 = jit(source)
series_f90 = f90.series
series_f90.__name__ = 'series_f90'
The timings are pretty much the same in this simple example (note the first exceution of numba will be slower because of caching).
n = int(1e8)
timeit(series_numba)(n)
timeit(series_f90)(n)
1.15e-01 sec/call [series_numba]
1.18e-01 sec/call [series_f90]
N-body problem#
Now we compare the performance of numba
and f2py-jit
on a minimal N-body energy calculation (double loop).
This is the numba code
from numba import jit
@jit(nopython=True, fastmath=True)
def potential(r, sigma, epsilon, n):
return epsilon * (sigma/r)**n
import math
@jit(nopython=True, fastmath=True)
def interaction(pos, sigma, epsilon, n):
N = pos.shape[0]
U = 0.0
for i in range(N):
for j in range(i+1, N):
r = pos[i, :] - pos[j, :]
# rij = numpy.sqrt(numpy.sum(r**2)) # this is slower
rij = numpy.sum(r**2)**0.5
if rij > 0.6:
U += potential(rij, sigma, epsilon, n)
U /= N
return U
func_numba = interaction
func_numba.__name__ = 'func_numba'
This is the equivalent Fortran code, called nbody.f90
subroutine potential(r, sigma, epsilon, n, u)
double precision, intent(in) :: r, sigma, epsilon
integer, intent(in) :: n
double precision, intent(out) :: u
u = epsilon * (sigma/r)**n
end subroutine potential
subroutine interaction(pos, sigma, epsilon, n, U)
double precision, intent(in) :: pos(:,:), sigma, epsilon
integer, intent(in) :: n
double precision, intent(out) :: U
double precision :: rij, r(size(pos,1)), rcut, uij
U = 0.0
do i = 1,size(pos, 2)
do j = i+1,size(pos, 2)
r = pos(:, i) - pos(:, j)
rij = sum(r**2)**0.5
if (rij > 0.6) then
call potential(rij, sigma, epsilon, n, uij)
U = U + uij
end if
end do
end do
U = U / size(pos,2)
end subroutine interaction
which we now compile
from f2py_jit import jit
f90 = jit('nbody.f90', flags='-O3 -ffast-math')
func_f90 = f90.interaction
func_f90.__name__ = 'func_f90'
Let’s get the timings
import numpy
N = 2000
L = 5.0
numpy.random.seed(1)
pos = (numpy.random.random((N, 3)) - 0.5) * L / 2
timeit(func_numba, fmt='.3f')(pos, sigma=1.0, epsilon=1.0, n=12)
pos = pos.transpose() # transpose the array for fortran
timeit(func_f90, fmt='.3f')(pos, sigma=1.0, epsilon=1.0, n=12)
0.187 sec/call [func_numba]
0.027 sec/call [func_f90]
In this specific example, f2py-jit
is much faster than numba
(essentially one order of magnitude) despite the fact that we used the same kind of optimizations. Note that inlining the code does give an appreciable speedup here.
We also check that the result is the same:
N = 2000
L = 5.0
numpy.random.seed(1)
pos = (numpy.random.random((N, 3)) - 0.5) * L / 2
print(func_numba(pos, sigma=1.0, epsilon=1.0, n=12))
pos = pos.transpose()
print(func_f90(pos, sigma=1.0, epsilon=1.0, n=12))
5742.988081452041
5742.988081452041