Python is a popular and versatile programming language known for its simplicity, readability, and ease of use. However, as with any programming language, Python code can sometimes be inefficient, resulting in slower program execution. In this article, I will show how to identify hotspots and present techniques to improve their performance.

Optimization should only be done on fully written and tested code, using representative input data.

Code that does not need optimization should stay as simple and legible as possible. This will ensure its future maintainability. That’s why only hotspots need to be optimized.

So, a typical optimization workflow is:

  1. Profile
  2. Identify hotspots
  3. Improve top hotspot
  4. Loop back to 1 until you are satisfied

You will need to re-profile because the top hotspot may change as you improve it.

Profile

Python comes with two modules (cProfile and profile) that let you get a good overview of what is happening in your code. You will find how to use them in the Python documentation.

$ python -m cProfile example1.py
   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.604    0.604 {built-in method builtins.exec}
        1    0.000    0.000    0.604    0.604 example1.py:1(<module>)
        1    0.344    0.344    0.344    0.344 example1.py:18(f2)
        1    0.260    0.260    0.260    0.260 example1.py:7(f1)
...

$ python -m cProfile -s tottime example1.py
   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.338    0.338    0.338    0.338 example1.py:18(f2)
        1    0.262    0.262    0.262    0.262 example1.py:7(f1)
        2    0.000    0.000    0.000    0.000 {built-in method builtins.print}
        1    0.000    0.000    0.600    0.600 example1.py:1(<module>)
...
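
You can also drive cProfile programmatically, for example to profile only a selected part of your program. Here is a minimal sketch, with work() standing in as a placeholder for the code you actually want to profile:

import cProfile
import pstats

def work():  # placeholder for the code you want to profile
    return sum(i * i for i in range(1_000_000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

pstats.Stats(profiler).sort_stats('tottime').print_stats(10)  # top 10 entries by internal time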

If you need a deeper understanding of what is happening on your machine, you could use Intel® VTune™ Profiler which supports Python and gives access to hardware counters.

What to do with the hotspots?

Once you have identified where your code spends its time, check whether it matches what you expect. If your algorithm is linear, does the run time increase linearly with the size of the input? A quick way to check is to time the function for increasing input sizes, as sketched below.
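
Here is a minimal sketch of such a scaling check, with process() standing in for the hotspot under test (a placeholder, not from the original code):

from timeit import timeit
import numpy as np

def process(a):  # placeholder for the function being investigated
    return a.cumsum()

for n in (10_000, 100_000, 1_000_000):
    a = np.arange(n)
    t = timeit(lambda: process(a), number=10)
    print(f'n={n:>9}: {t*1000:.2f}ms')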

To improve performance, you could:

  • Choose/write an algorithm with a lower big-O complexity.
  • Use already optimized external libraries such as NumPy (see the sketch after this list).
  • Avoid conversions, for example between NumPy and TensorFlow.
  • Use the Numba JIT to compile Python functions to optimized machine code at runtime.
  • Offload some work to the GPU using Numba CUDA.
  • Write some functions in C++ and expose them to Python using pybind11 or Boost.Python.
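
To illustrate the “optimized external libraries” point, here is a minimal sketch (not part of the original benchmark code) of the increment loop used later in this article, replaced by a vectorized NumPy operation:

import numpy as np

def go_numpy(a):
    a += 1  # vectorized in-place increment, executed in optimized C code by NumPy
    return a

a = np.arange(1000000)
go_numpy(a)

A vectorized version like this is presumably what the Numpy row in the final benchmark table corresponds to.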

Numba

Numba is a module that translates Python functions to optimized machine code at runtime using the industry-standard LLVM compiler library.

Just by adding a @jit decorator, the following simple code gets a roughly 400x speedup.

from numba import jit
import numpy as np

def go_slow(a):
    for i in range(a.shape[0]):
        a[i] += 1
    return a

@jit
def go_fast(a):
    for i in range(a.shape[0]):
        a[i] += 1
    return a

You could easily time your function using the timeit module.

from timeit import timeit

a = np.arange(1000000)

# jit warmup
go_fast(a)
go_slow(a)

ts = timeit(lambda: go_slow(a), number=10)
tf = timeit(lambda: go_fast(a), number=10)
print(f'slow: {ts*1000:.2f}ms fast: {tf*1000:.2f}ms speedup: {ts/tf:.1f}x')

# slow: 2405.46ms fast: 5.03ms speedup: 478.3x

Be aware that the function is JIT compiled, i.e. compiled when it is first called, based on the types of the input parameters. So, when timing, do a first warm-up run so that the Numba JIT compilation time is not included in the measurement.

Besides compiling Python code with @jit, Numba can also help parallelize your code:

  • Seamless integration with NumPy.
  • Simplified threading using @jit(parallel=True) (see the sketch after this list).
  • Vectorization: Numba can automatically translate some loops into vector instructions for 2-4x speed improvements.
  • GPU acceleration with Numba CUDA.
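
As a hint of what parallel=True looks like in practice, here is a minimal sketch (not from the original benchmark) using numba.prange so the loop iterations can be spread over several threads:

from numba import njit, prange
import numpy as np

@njit(parallel=True)
def parallel_sum(a):
    total = 0.0
    for i in prange(a.shape[0]):  # iterations may be distributed across threads
        total += a[i]             # Numba recognizes this as a parallel reduction
    return total

a = np.random.rand(1_000_000)
parallel_sum(a)  # first call triggers JIT compilation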

However, simply adding @jit only works for “simple” code. On a real code base, you will have to do some refactoring to take full advantage of Numba.

Numba + CUDA

If you want to go one step further and offload some work to your NVIDIA GPU, Numba allows you to access CUDA from Python code. However, CUDA is not as easy to use as the @jit decorator.

Transferring data to and from the GPU is a major bottleneck, so you have to be sure to amortize this transfer time by doing enough GPU computation on the uploaded data before downloading it back to continue processing on the CPU. You also need enough data to process to get good occupancy (i.e. to use all the GPU cores). Lastly, your processing algorithm needs a parallelism adapted to the GPU architecture.

A simple example of a CUDA kernel will look like this.

import numpy as np
from numba import cuda

@cuda.jit
def increment_by_one(a):
    pos = cuda.grid(1)
    if pos < a.size:  # Check array boundaries
        a[pos] += 1

a = np.arange(1000000)
d_a = cuda.to_device(a) # upload a to the GPU
threadsperblock = 256
blockspergrid = (d_a.size + (threadsperblock - 1)) // threadsperblock
increment_by_one[blockspergrid, threadsperblock](d_a)

However, you will need to know CUDA’s thread/block/grid architecture to understand how to lay out and access your data. Depending on the layout of your data, your access pattern and your use of global/shared/private memory, you will choose a 1D/2D/3D grid of some size.

Numba provides the cuda.grid shortcut to avoid the verbose combination of cuda.threadIdx.x, cuda.blockIdx.x and cuda.blockDim.x.
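
For a 1D grid, cuda.grid(1) is equivalent to computing the absolute thread index by hand inside the kernel:

pos = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x  # same value as cuda.grid(1)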

Usually, when not using advanced CUDA features (shared memory, Tensor Cores, …), you can expect around a 10x speedup compared to equivalent machine code (the output of C++ or Numba JIT).

from timeit import timeit

# warmup
go_fast(a)
go_slow(a)
increment_by_one[blockspergrid, threadsperblock](d_a)

tf = timeit(lambda: go_fast(a), number=10)
ts = timeit(lambda: go_slow(a), number=10)
tg = timeit(lambda: increment_by_one[blockspergrid, threadsperblock](d_a),
    number=10)

print(f'slow: {ts*1000:.2f}ms fast: {tf*1000:.2f}ms speedup: {ts/tf:.1f}x')
print(f'fast: {tf*1000:.2f}ms cuda: {tg*1000:.2f}ms speedup: {tf/tg:.1f}x')
print(f'slow: {ts*1000:.2f}ms cuda: {tg*1000:.2f}ms speedup: {ts/tg:.1f}x')

Here are the timing results.

slow: 2405.46ms fast: 5.03ms speedup: 478.3x
fast: 5.03ms cuda: 0.62ms speedup: 8.2x
slow: 2405.46ms cuda: 0.62ms speedup: 3901.0x

Unfortunately, a CUDA kernel launch is asynchronous (it returns before the work is completed), so this GPU timing is wrong. The correct way to time the kernel is to introduce timing events into the stream of CUDA commands.

start = cuda.event()
end = cuda.event()
start.record()
increment_by_one[blockspergrid, threadsperblock](d_a)
end.record()
end.synchronize()
elapsed = start.elapsed_time(end)
print(f'Event timing: {elapsed:.3f}ms')

# Event timing: 0.024ms (so 0.24ms for number=10)

This is even faster than our wrong timing because it only measures the kernel execution time, not all the mechanics necessary to launch the kernel (Numba runtime, CUDA runtime, Python runtime, …). To include that launch overhead and get numbers comparable with the CPU timings, we can instead synchronize after the kernel launch and keep using timeit.

def inc(blockspergrid, threadsperblock, a):
    increment_by_one[blockspergrid, threadsperblock](a)
    cuda.synchronize()

tg = timeit(lambda: inc(blockspergrid, threadsperblock, d_a), number=10)

# slow: 2600.47ms fast: 6.27ms speedup: 414.4x
# fast: 6.27ms cuda: 1.92ms speedup: 3.3x
# slow: 2600.47ms cuda: 1.92ms speedup: 1353.4x

For more complex computations, we could expect a more significant speedup with CUDA. In the above case, the kernel is memory bound, i.e. limited by the GPU memory read/write speed.

In addition, these timings do not take into account the time to transfer the data to and from the GPU. So, depending on your compute pipeline, it could be better to stay on the CPU when you cannot chain enough compute on the GPU to amortize this data transfer time.
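
To get a rough feel for this overhead, here is a minimal sketch (reusing increment_by_one, blockspergrid and threadsperblock from above) that times the kernel together with its host/device transfers:

def inc_with_transfers(a):
    d = cuda.to_device(a)  # host -> device copy
    increment_by_one[blockspergrid, threadsperblock](d)
    return d.copy_to_host()  # device -> host copy (synchronizes)

tt = timeit(lambda: inc_with_transfers(a), number=10)
print(f'cuda + transfers: {tt*1000:.2f}ms')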

C++ and pybind11

Another way to get fast code is to write it in C++ and expose it to Python using pybind11. However, the generated module will only work with the Python version it was compiled for and is only optimized for the architectures defined at compile time, whereas Numba is more portable since compilation is done just in time.

Here is an example of increment_by_one written in C++ and timed using std::chrono::high_resolution_clock.

#include <iostream>
#include <chrono>
using namespace std;
using namespace std::chrono;

void increment_by_one(int* a, size_t len)
{
    for (size_t i = 0; i < len; i++)
        a[i] += 1;
}

int main()
{
    const size_t kLen = 100000 * 10;
    int* a = new int[kLen];
    for (size_t i = 0; i < kLen; i++) a[i] = i;

    auto start = high_resolution_clock::now();
    for (int i = 0; i < 10; i++) increment_by_one(a, kLen);
    auto end = high_resolution_clock::now();

    auto elapsed_ns = duration_cast<nanoseconds>(end-start);

    std::cout << "Elapsed time: " << elapsed_ns.count()/1000000. << "ms\n";

    delete[] a;

    return 0;
}

This outputs a timing close to what Numba achieves:

Elapsed time: 5.08967ms

With pybind11, you can “easily” expose C++ code to Python and work with NumPy arrays.

#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
namespace py = pybind11;

void increment_by_one(py::array_t<int64_t,py::array::c_style> input) {
    py::buffer_info buf = input.request();
    int64_t* ptr = static_cast<int64_t*>(buf.ptr);
    
    for (size_t i = 0; i < buf.shape[0]; i++)
        ptr[i] += 1;
}

PYBIND11_MODULE(example, m) {
    m.doc() = "Example module";
    m.def("increment_by_one", &increment_by_one, "Increment array elements");
}

To compile the module, you could use CMake and your CMakeLists.txt would look like:

cmake_minimum_required(VERSION 3.17)
project(example)

find_package(pybind11 CONFIG)

pybind11_add_module(example example.cpp)
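
Assuming pybind11 is installed where CMake can find it (for example via pip install pybind11), configuring and building would look something like this, producing an importable example module in the build directory:

$ cmake -S . -B build
$ cmake --build build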

Using it from Python is as simple as using any other Python module:

from timeit import timeit
import numpy as np
import example

a = np.arange(1000000)
print(a[:7])
example.increment_by_one(a)
print(a[:7])

tp = timeit(lambda: example.increment_by_one(a), number=10)
print(f'pybind: {tp*1000:.2f}ms')

# [0 1 2 3 4 5 6]
# [1 2 3 4 5 6 7]
# pybind: 9.79ms

Conclusion

Numba is a simple way to get a performance boost on your Python code. However, if the function is called infrequently, the JIT compilation time could cancel out this performance boost. Numba can cache the compiled function, with some limitations.
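
Caching can be enabled directly on the decorator. A minimal sketch (see the Numba documentation for the limitations):

from numba import jit

@jit(cache=True)  # persist the compiled function in a file-based cache between runs
def go_fast(a):
    for i in range(a.shape[0]):
        a[i] += 1
    return a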

C++ and pybind11 allow you to precompile your code but are considerably more complex to put into action.

Numba CUDA gives access to all the power of your NVIDIA GPU.

In all cases, don’t forget to first profile your code and don’t hesitate to contact me to discuss your project.

              Timing     Speedup
Python        2600.47ms  1x
Numpy         22.71ms    114x
Numba         6.27ms     414x
Numba CUDA    1.92ms     1353x
C++           5.09ms     511x
C++ pybind11  9.79ms     266x

Benchmarks were done in a Docker container (nvcr.io/nvidia/pytorch:22.12-py3) on a PC (Intel® Core™ i9-7900X CPU @ 3.30GHz) with an NVIDIA RTX 3080 Ti GPU:

  • Python 3.8.10
  • Numpy 1.22.2
  • Numba 0.56.4
  • pybind11 v2.10.3
