Making Python Fast
Although flexible and easy to learn, Python suffers from speed issues. Unfortunately there's no magic bullet
which will work in all cases, but by understanding what makes Python slow, what options are available
and what tradeoffs they involve, you'll be in a better position to write high performance Python code.

Vanilla Python
Pros: Object Oriented, highly flexible, can interface with any third party Python module.
Cons: Slow.
If no other tricks will do, regular Python code is a good fallback. Arts et Metiers uses vanilla Python
for code which needs to interface with third party modules (OpenGL calls, using GLFW etc.) and for classes
(the main app, any system which needs to store state). It's also good for code which isn't performance critical,
such as loading files.

Numpy Arrays
Pros: More CPU-friendly data layout, vectorized code can be extremely fast.
Cons: Everything must be numeric, the syntax is slightly unwieldy, and more expressive code (e.g. conditionals) is still faster than vanilla Python, but not by as much.

To understand why numpy arrays are faster, it helps to understand why Python is slower. Consider the humble integer.
It'd be reasonable to expect this to be an allocation of 4 bytes, allowing direct access. However, as a concession
to allow for maximum type flexibility, Python implements everything as an object. In Python,
everything is a PyObject, and integers are a subclass of PyObject, PyLongObject. In order to run garbage collection,
all PyObjects need to be reference counted, and although PyLongObject is an opaque datatype, the documentation states
that it has variable size, which would seem to indicate that it has some sort of "value" field which can be dynamically
reallocated. The point I'm trying to illustrate is that the memory footprint of a simple integer is massive, and working
with one might involve any number of pointer dereferences.
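
We can see this overhead directly. A quick check (exact sizes vary by CPython version and platform, so treat the numbers as illustrative):

    import sys

    n = 1
    print(sys.getsizeof(n))    # e.g. 28 bytes on 64-bit CPython, vs. the 4 bytes a C int needs
    big = 2 ** 100
    print(sys.getsizeof(big))  # larger still: the value field grows with the integer's magnitude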

By contrast, numpy arrays just hold values. All elements are of the same datatype and the array elements sit contiguously in memory,
so although we're still using the Python interpreter to work with these arrays, performance is improved.
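
Here's what that layout looks like in practice (the array name is illustrative):

    import numpy as np

    a = np.arange(1000, dtype=np.float32)
    print(a.itemsize)               # 4: each element is exactly 4 bytes
    print(a.nbytes)                 # 4000: one thousand floats, packed back to back
    print(a.flags['C_CONTIGUOUS'])  # True: the data sits in one contiguous block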

Numpy has a further benefit: vectorized code. Consider the following:
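A minimal sketch of vectorized addition (the array names a, b and c are illustrative):

    import numpy as np

    a = np.arange(1_000_000, dtype=np.float32)
    b = np.arange(1_000_000, dtype=np.float32)

    # One line, no Python-level loop: numpy adds the arrays element-wise in compiled code.
    c = a + b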
You can probably imagine what's going on under the hood here: Python accesses the numpy library and tells it to add these arrays
element-wise. The numpy library then runs its own (compiled, optimized) code to do that, and returns the
result, c. That's an enormous performance improvement, and it's elegant too.

As noted in the cons section, though, this introduces some limitations on the code we can write. Arrays need to be homogeneous:
whereas structs can mix data types, we can only have a single data type per array. Furthermore, everything must be expressed
numerically. This second point shouldn't be a large constraint, since ultimately the computer is dealing with ones and zeros anyway.

It's also slightly awkward to represent arrays of structs with numpy arrays. For more details on this, see the tutorial on packed arrays.
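
To give a flavour of the idea (this is just one possible layout, not necessarily the one that tutorial uses): a particle with a position and a velocity might be flattened to six floats per row:

    import numpy as np

    # Each "struct" becomes one row: [px, py, pz, vx, vy, vz]
    particles = np.zeros((1024, 6), dtype=np.float32)

    positions = particles[:, 0:3]   # views into the packed array, no copying
    velocities = particles[:, 3:6]
    positions += velocities         # e.g. a crude position update, in place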

The big killer of numpy's performance is that not everything can be expressed as deterministic, vectorized code.
If implementing some game logic, we may need to make decisions and branch, or modify regions of an array in an interleaved manner.
While there are tricks to recast these problems into closed forms, there's another option which gives us performance close to C.
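
As a taste of that kind of recasting (a made-up example, not taken from the engine): a per-element if/else can often be rewritten with np.where, turning the branch into an array-wide mask:

    import numpy as np

    health = np.array([75.0, 0.0, 32.0, 0.0], dtype=np.float32)
    regen  = np.array([ 1.0, 1.0,  2.0, 1.0], dtype=np.float32)

    # Branchy version: for each i, if health[i] > 0: health[i] += regen[i]
    # Vectorized recast: evaluate the condition everywhere, pick per element.
    health = np.where(health > 0, health + regen, health)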

Numba JIT Compilation
Pros: Fast. Gives performance close to, if not equal to, compiled code.
Cons: Limited functionality supported, and functions can end up with a large number of inputs and outputs.
Numba is a high performance module which offers Just In Time (JIT) compilation of functions. It's incredibly easy to incorporate.
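A minimal sketch (the function and array names are illustrative):

    import numpy as np
    from numba import njit

    @njit(cache=True)
    def add_arrays(a, b, c):
        # A plain Python loop, but compiled to machine code on first call.
        for i in range(a.size):
            c[i] = a[i] + b[i]

    a = np.arange(1_000_000, dtype=np.float32)
    b = np.arange(1_000_000, dtype=np.float32)
    c = np.zeros_like(a)
    add_arrays(a, b, c)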
Problem solved! We can now write vanilla Python code and get compiled performance out of it. We can go home now, right? Well,
it's not that simple: only a limited number of Python language features can be compiled this way. Classes can't be compiled,
and neither can all numpy functions. It's very common to have to rewrite code to use different functions due to lack of njit support.
Arts et Metiers uses njit compiled functions to do the "heavy lifting" of updating positions, writing arrays of transformation
matrices and so on.
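
A sketch in that spirit (not the engine's actual code; the layout convention is an assumption), writing one translation matrix per object into a preallocated array:

    import numpy as np
    from numba import njit

    @njit(cache=True)
    def write_translation_matrices(positions, matrices):
        # positions: (n, 3) float32; matrices: (n, 4, 4) float32, preallocated
        for i in range(positions.shape[0]):
            # start from the identity matrix
            for r in range(4):
                for c in range(4):
                    matrices[i, r, c] = 1.0 if r == c else 0.0
            # translation in the bottom row (one common row-major convention)
            matrices[i, 3, 0] = positions[i, 0]
            matrices[i, 3, 1] = positions[i, 1]
            matrices[i, 3, 2] = positions[i, 2]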

Compute Shaders
Pros: Absolute powerhouse performance, can use structs so syntax gets cleaner, no need to upload data to the GPU every frame.
Cons: Can be wasteful if only a small amount of work needs to be done.
A compute shader is a mini program that runs on the GPU and can perform some arbitrary piece of work. Compute shaders will be discussed
in a future tutorial, but here's a simple example.
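A minimal sketch of a vector addition compute shader in GLSL (buffer names and binding points are illustrative):

    #version 430

    layout(local_size_x = 64) in;

    layout(std430, binding = 0) readonly buffer A { float a[]; };
    layout(std430, binding = 1) readonly buffer B { float b[]; };
    layout(std430, binding = 2) writeonly buffer C { float c[]; };

    void main() {
        // Each invocation handles exactly one element.
        uint i = gl_GlobalInvocationID.x;
        c[i] = a[i] + b[i];
    }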
Whereas the other programs were looping through the array and adding elements, the compute shader describes a single invocation of the
adding program. When work is dispatched, the same simple program is run in parallel across the GPU's available compute
units. GPUs are massively parallel machines, and so compute shaders are more resistant to large problem sizes. Furthermore, compute
shaders handle linear algebra better due to their SIMD support (when we declare a vec4, under the hood a SIMD chunk of 4 floats is
initialized). And, thanks to structs, the syntax can be much cleaner than the packed numpy array method.

We also don't need to worry so much about moving memory around. A compute shader writes data back to the same buffer that can be bound
for our rendering shaders to read, without moving anything.
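
On the Python side, dispatching such a shader with PyOpenGL looks roughly like this (a sketch: it assumes a compiled compute program, a populated storage buffer and an element count already exist; the variable names are illustrative):

    from OpenGL.GL import *

    # compute_program: an already compiled and linked compute shader program
    # ssbo: a shader storage buffer already filled with our data
    glUseProgram(compute_program)
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo)

    # One workgroup covers 64 elements (local_size_x above), so round up.
    group_count = (element_count + 63) // 64
    glDispatchCompute(group_count, 1, 1)

    # Make the writes visible before the rendering shaders read the buffer.
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT)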

There's one consideration, though: compute shaders are fairly heavy duty, and need a large enough workload per dispatch
in order to fully warm up. For its numeric methods, Arts et Metiers decides at runtime whether to dispatch a compute shader
or to simply call an njit compiled function. This decision is based on the number of invocations needed.
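
That decision might be sketched like this (the threshold and function names are hypothetical; a real cutoff would be found by profiling):

    COMPUTE_THRESHOLD = 4096  # hypothetical cutoff, tune by measurement

    def update(count, *args):
        if count >= COMPUTE_THRESHOLD:
            dispatch_update_compute_shader(*args)  # hypothetical GPU path
        else:
            update_njit(*args)                     # hypothetical njit CPU path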