Long story short: scroll down for a downloadable DLL and python file that do matrix math using optimized SIMD functions.

Recently I was messing around with some 3D in PyOpenGL and found my most notable slowdowns occuring due to matrix math (multiplications being most common).

So I decided to try and implement some fast matrix functions and call those from python, using C98 limitations and ctypes as explained here by my friend Jan Pijpers.

I won’t go into detail about the math, you can download the project files at the end; any sources used are referenced in there.

I do have some profile results to compare! Doing 100,000 calls for each action listed, time displayed in seconds.

identity: 0.0331956952566 rotateY: 0.0617851720355 transform: 1.70942981948 inverse: 15.095287772 multiply: 0.492130726156 vector: 0.160486968636 perspective: 0.107690428216 transpose: 0.452984656091

Note that the inverse is matrix size agnostic (and not normalized!), therefore no loop unrolling is done by the python compiler. It is not representative of a proper python matrix4x4 inverse.

identity: 0.0333827514946 rotateY: 0.0857555184901 transform: 0.251571437936 inverse: 0.0439880125093 multiply: 0.0420022367291 vector: 0.288415226444 perspective: 0.156626988673 transpose: 0.0889596428649 perspective no SIMD: 0.160488955074

identity: 0.0323709924443 rotateY: 0.0845113462024 transform: 0.23958858222 inverse: 0.0395744785104 multiply: 0.0437013033019 vector: 0.286256299491 perspective: 0.150614703216 transpose: 0.0877707597662 perspective no SIMD: 0.156242612934

Interestingly not all operations are faster using C due to type conversions. For a simple axis aligned rotation all we need is a sin, a cos and a list. The sin/cos of python are not going to be any slower than those in C, so all we did was complicate the program.

But in a practical example represented by the transform function (which is a a separate rotateX, rotateY, rotateZ and translation matrix call, then all four of them multiplied together) we see a very worthwhile performance gain.

The math executes using SIMD instructions, all data is therefore converted to 16-bit memory aligned “__m128” structures (from “xmmintrin.h”). We need the C identity and rotate constructors to get the proper type of data, then when we actually need this data we must call storeMat44() to get an actual c_float[16] for python usage.

From my current usage 1 in 3 matrices requires a conversion back to python floats in order get passed to PyOpenGL, so here is another multiply performance test with every third multiplication stored back into a float[16]…

python multiply: 0.492130726156

MVC raw: 0.0436761417549

MVC multiply convert every third: 0.06491612928

MVC convert all: 0.0925153667527

So while our raw implementation is about 11 times faster, the fully converting implementation is only 5 times faster. 7.5 times for our real world example. That’s more than 30% lost again… still much better than pure python though!

**Download the visual studio project with x86 and x64 binaries here! Tested on Win10 x64 with Python 2.7.10 x64.**

*Math3Dx64.Dll and math3d.py are the end user files.*

One important thing I wish to look into is to pass data by copy instead, currently all functions allocate a new matrix on the heap, and the user has to delete these pointers by hand from python using the deleteMat44() helper function. I do not know enough about DLLs or python’s memory allocation to know whether I can copy data from the stack instead, and if so whether that would be any faster.

I do know that __vectorcall is not compatible with __declspec(dllexport), which kind of makes sense… but more direct data passing could be nice.