SIMD Matrix math for Python

Long story short: scroll down for a downloadable DLL and python file that do matrix math using optimized SIMD functions.

Recently I was messing around with some 3D in PyOpenGL and found my most notable slowdowns occuring due to matrix math (multiplications being most common).

So I decided to try and implement some fast matrix functions and call those from python, using C98 limitations and ctypes as explained here by my friend Jan Pijpers.

I won’t go into detail about the math, you can download the project files at the end; any sources used are referenced in there.

I do have some profile results to compare! Doing 100,000 calls for each action listed, time displayed in seconds.

Pure python implementation.

identity: 0.0331956952566
rotateY: 0.0617851720355
transform: 1.70942981948
inverse: 15.095287772
multiply: 0.492130726156
vector: 0.160486968636
perspective: 0.107690428216
transpose: 0.452984656091

Note that the inverse is matrix size agnostic (and not normalized!), therefore no loop unrolling is done by the python compiler. It is not representative of a proper python matrix4x4 inverse.

Using VC++ 14.0 MSBUILD, compiling release with -O2 and running without the debugger.

identity: 0.0333827514946
rotateY: 0.0857555184901
transform: 0.251571437936
inverse: 0.0439880125093
multiply: 0.0420022367291
vector: 0.288415226444
perspective: 0.156626988673
transpose: 0.0889596428649
perspective no SIMD: 0.160488955074
Using LLVM 14.0 from visual studio (not sure which linker is used there), compiling release with -O2 and running without the debugger (-O3 doesnt change the results).

identity: 0.0323709924443
rotateY: 0.0845113462024
transform: 0.23958858222
inverse: 0.0395744785104
multiply: 0.0437013033019
vector: 0.286256299491
perspective: 0.150614703216
transpose: 0.0877707597662
perspective no SIMD: 0.156242612934

Interestingly not all operations are faster using C due to type conversions. For a simple axis aligned rotation all we need is a sin, a cos and a list. The sin/cos of python are not going to be any slower than those in C, so all we did was complicate the program.

But in a practical example represented by the transform function (which is a a separate rotateX, rotateY, rotateZ and translation matrix call, then all four of them multiplied together) we see a very worthwhile performance gain.

The math executes using SIMD instructions, all data is therefore converted to 16-bit memory aligned “__m128” structures (from “xmmintrin.h”). We need the C identity and rotate constructors to get the proper type of data, then when we actually need this data we must call storeMat44() to get an actual c_float[16] for python usage.

From my current usage 1 in 3 matrices requires a conversion back to python floats in order get passed to PyOpenGL, so here is another multiply performance test with every third multiplication stored back into a float[16]…

python multiply: 0.492130726156
MVC raw: 0.0436761417549
MVC multiply convert every third: 0.06491612928
MVC convert all: 0.0925153667527

So while our raw implementation is about 11 times faster, the fully converting implementation is only 5 times faster. 7.5 times for our real world example. That’s more than 30% lost again… still much better than pure python though!

Download the visual studio project with x86 and x64 binaries here! Tested on Win10 x64 with Python 2.7.10 x64.
Math3Dx64.Dll and are the end user files.

One important thing I wish to look into is to pass data by copy instead, currently all functions allocate a new matrix on the heap, and the user has to delete these pointers by hand from python using the deleteMat44() helper function. I do not know enough about DLLs or python’s memory allocation to know whether I can copy data from the stack instead, and if so whether that would be any faster.

I do know that __vectorcall is not compatible with __declspec(dllexport), which kind of makes sense… but more direct data passing could be nice.

Leave a Reply

Your email address will not be published. Required fields are marked *