Long story short: scroll down for a downloadable DLL and python file that do matrix math using optimized SIMD functions.
Recently I was messing around with some 3D in PyOpenGL and found my most notable slowdowns occuring due to matrix math (multiplications being most common).
So I decided to try and implement some fast matrix functions and call those from python, using C98 limitations and ctypes as explained here by my friend Jan Pijpers.
I won’t go into detail about the math, you can download the project files at the end; any sources used are referenced in there.
I do have some profile results to compare! Doing 100,000 calls for each action listed, time displayed in seconds.
identity: 0.0331956952566 rotateY: 0.0617851720355 transform: 1.70942981948 inverse: 15.095287772 multiply: 0.492130726156 vector: 0.160486968636 perspective: 0.107690428216 transpose: 0.452984656091
Note that the inverse is matrix size agnostic (and not normalized!), therefore no loop unrolling is done by the python compiler. It is not representative of a proper python matrix4x4 inverse.
identity: 0.0333827514946 rotateY: 0.0857555184901 transform: 0.251571437936 inverse: 0.0439880125093 multiply: 0.0420022367291 vector: 0.288415226444 perspective: 0.156626988673 transpose: 0.0889596428649 perspective no SIMD: 0.160488955074
identity: 0.0323709924443 rotateY: 0.0845113462024 transform: 0.23958858222 inverse: 0.0395744785104 multiply: 0.0437013033019 vector: 0.286256299491 perspective: 0.150614703216 transpose: 0.0877707597662 perspective no SIMD: 0.156242612934
Interestingly not all operations are faster using C due to type conversions. For a simple axis aligned rotation all we need is a sin, a cos and a list. The sin/cos of python are not going to be any slower than those in C, so all we did was complicate the program.
But in a practical example represented by the transform function (which is a a separate rotateX, rotateY, rotateZ and translation matrix call, then all four of them multiplied together) we see a very worthwhile performance gain.
The math executes using SIMD instructions, all data is therefore converted to 16-bit memory aligned “__m128” structures (from “xmmintrin.h”). We need the C identity and rotate constructors to get the proper type of data, then when we actually need this data we must call storeMat44() to get an actual c_float for python usage.
From my current usage 1 in 3 matrices requires a conversion back to python floats in order get passed to PyOpenGL, so here is another multiply performance test with every third multiplication stored back into a float…
python multiply: 0.492130726156
MVC raw: 0.0436761417549
MVC multiply convert every third: 0.06491612928
MVC convert all: 0.0925153667527
So while our raw implementation is about 11 times faster, the fully converting implementation is only 5 times faster. 7.5 times for our real world example. That’s more than 30% lost again… still much better than pure python though!
Download the visual studio project with x86 and x64 binaries here! Tested on Win10 x64 with Python 2.7.10 x64.
Math3Dx64.Dll and math3d.py are the end user files.
One important thing I wish to look into is to pass data by copy instead, currently all functions allocate a new matrix on the heap, and the user has to delete these pointers by hand from python using the deleteMat44() helper function. I do not know enough about DLLs or python’s memory allocation to know whether I can copy data from the stack instead, and if so whether that would be any faster.
I do know that __vectorcall is not compatible with __declspec(dllexport), which kind of makes sense… but more direct data passing could be nice.