Accelerating Maya -> PyOpenGL mesh IO

A few weeks back I posted about exporting a mesh from Maya and drawing it with Python.

Faster export

I recently got back to improving this a little. First I improved the Maya exporter's performance by porting it to a C++ plugin.
I won't go over all the details, because the code closely mirrors the Python version posted before, and explaining it properly would require a tutorial series on the Maya API in its own right! So here's a little dump of the Visual Studio project instead: plugin.vcxproj!

It is currently very basic and simply exports all the data it can find. I'm aware certain Maya models can crash some of the functions in Maya's MFnMesh (and related) classes: empty UV sets, UV sets with UVs for only some vertices/faces, geometry with holes crashing getTriangles, and so on. It may be good to write a Python layer that validates the mesh, as well as to add flags to explicitly export (or ignore) certain attributes and UV/color sets.

Faster import

Next I used Python's mmap (memory map) module to upload the mesh directly from disk to OpenGL without first unpacking (and therefore boxing) the raw data into Python objects. Previously I was loading the binary into Python, which requires Python to cast the bytes to a Python object, which I then wrapped in a ctypes object: allocating and copying huge chunks of memory and constructing tons of Python objects along the way. With mmap I can simply hand the mapped file memory to glBufferData as if it were a void*.

import os
import mmap
import ctypes
import contextlib


@contextlib.contextmanager
def memoryMap(fileDescriptor, sizeInBytes=0, offsetInBytes=0):
    if isinstance(fileDescriptor, basestring):
        # os.O_BINARY only exists on Windows; default to 0 elsewhere
        fd = os.open(fileDescriptor, os.O_RDWR | getattr(os, 'O_BINARY', 0))
        ownFd = True
    else:
        fd = fileDescriptor
        ownFd = False
    mfd = None
    try:
        mfd = mmap.mmap(fd, sizeInBytes, offset=offsetInBytes)
        yield MappedReader(mfd)
    finally:
        if mfd is not None:
            mfd.close()
        if ownFd:
            os.close(fd)


class MappedReader(object):
    def __init__(self, memoryMap):
        """Wrap a memory map into a stream that can stream through the file and map sections to ctypes."""
        self.__memoryMap = memoryMap
        self.__offset = 0

    def close(self):
        self.__memoryMap.close()

    def size(self):
        return self.__memoryMap.size()

    def seek(self, offset):
        assert offset >= 0 and offset < self.size(), 'Seek %s beyond file bounds [0, %s)' % (offset, self.size())
        self.__offset = offset

    def tell(self):
        return self.__offset

    def read(self, ctype):
        """
        Map a part of the file memory to a ctypes object (from_buffer, so ctype points directly to file memory).
        Object type is inferred from the given type.
        File cursor is moved to the next unread byte (seek = tell + sizeof(ctype)).
        """
        result = ctype.from_buffer(self.__memoryMap, self.__offset)
        self.__offset += ctypes.sizeof(result)
        return result

    def readValue(self, ctype):
        """
        Utility to read and directly return the data cast as a python value.
        """
        return self.read(ctype).value

The memoryMap context can take a file descriptor (acquired through os.open, which is different from the regular open) or a file path.
It then maps the entire file into memory instead of reading it. Note it opens the file read-write: ctypes' from_buffer requires a writable buffer.
Last, it yields a MappedReader object, a little wrapper around the mmap object that assists in reading chunks as a given ctype.
This way I can easily read some header data (previously I'd do this by reading n bytes and using struct.unpack) and then map the remainder (or a large chunk) of the file as a ctypes array.
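The same header-then-bulk pattern can be shown without the wrapper class. This is a minimal self-contained sketch (the file layout here, a uint32 count followed by that many floats, is hypothetical and just for the demo): the header is read with from_buffer, then the payload is mapped in place without any copy.

```python
import ctypes
import mmap
import os
import struct
import tempfile

# Write a tiny example file: a uint32 count header followed by that many floats.
path = os.path.join(tempfile.mkdtemp(), 'blob.bin')
with open(path, 'wb') as f:
    f.write(struct.pack('<I', 3))
    f.write(struct.pack('<3f', 1.0, 2.0, 3.0))

# Open read-write, because ctypes.from_buffer needs a writable buffer.
fd = os.open(path, os.O_RDWR | getattr(os, 'O_BINARY', 0))
mapped = mmap.mmap(fd, 0)

# Read the header, then map the payload in place -- no copy is made.
count = ctypes.c_uint32.from_buffer(mapped, 0).value
payload = (ctypes.c_float * count).from_buffer(mapped, ctypes.sizeof(ctypes.c_uint32))
values = list(payload)  # [1.0, 2.0, 3.0]

del payload  # release the buffer export before closing the map
mapped.close()
os.close(fd)
```

One caveat the wrapper inherits from mmap itself: a nonzero offset passed to mmap.mmap must be a multiple of mmap.ALLOCATIONGRANULARITY, so offsetInBytes can't be arbitrary.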

This code is a refactor of what I did in the tutorial mentioned at the top, but using mmap instead! It is mostly identical.

def _loadMesh_v0(stream, vao, bufs):
    vertexCount = stream.readValue(ctypes.c_uint32)
    vertexSize = stream.readValue(ctypes.c_ubyte)

    indexCount = stream.readValue(ctypes.c_uint32)
    indexSize = stream.readValue(ctypes.c_ubyte)

    assert indexSize in indexTypeFromSize, 'Unknown element data type, element size must be one of %s' % indexTypeFromSize.keys()
    indexType = indexTypeFromSize[indexSize]

    drawMode = stream.readValue(ctypes.c_uint32)
    assert drawMode in (GL_LINES, GL_TRIANGLES), 'Unknown draw mode.'  # TODO: list all render types

    # gather layout
    numAttributes = stream.readValue(ctypes.c_ubyte)

    offset = 0
    layouts = [None] * numAttributes
    for i in xrange(numAttributes):
        location = stream.readValue(ctypes.c_ubyte)
        dimensions = stream.readValue(ctypes.c_ubyte)
        assert dimensions in (1, 2, 3, 4)
        dataType = stream.readValue(ctypes.c_uint32)
        assert dataType in attributeElementTypes, 'Invalid GLenum value for attribute element type.'
        layouts[i] = AttributeLayout(location, dimensions, dataType, offset)
        offset += dimensions * sizeOfType[dataType]

    assert offset == vertexSize, 'File says each chunk of vertex data is %s bytes, but attribute layout used up %s bytes' % (vertexSize, offset)

    # apply layout
    for layout in layouts:
        glVertexAttribPointer(layout.location, layout.dimensions, layout.dataType, GL_FALSE, vertexSize, ctypes.c_void_p(layout.offset))  # total offset is now stride
        glEnableVertexAttribArray(layout.location)

    raw = stream.read(ctypes.c_ubyte * (vertexSize * vertexCount))
    glBufferData(GL_ARRAY_BUFFER, vertexSize * vertexCount, raw, GL_STATIC_DRAW)

    raw = stream.read(ctypes.c_ubyte * (indexSize * indexCount))
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, indexSize * indexCount, raw, GL_STATIC_DRAW)

    if stream.size() - stream.tell() > 0:
        raise RuntimeError('Error reading mesh file, more data in file after we were done reading.')
    
    return Mesh(vao, bufs, drawMode, indexCount, indexType)


def model(filePath):
    vao = glGenVertexArrays(1)
    glBindVertexArray(vao)
    bufs = glGenBuffers(2)
    glBindBuffer(GL_ARRAY_BUFFER, bufs[0])
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, bufs[1])
    with memoryMap(filePath) as stream:
        fileVersion = stream.readValue(ctypes.c_ubyte)
        if fileVersion == 0:
            return _loadMesh_v0(stream, vao, bufs)
        raise RuntimeError('Unknown mesh file version %s in %s' % (fileVersion, filePath))
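For completeness, here is how a matching writer could look, inferred purely from the reader above; a sketch, not the actual exporter code. GL_TRIANGLES (0x0004) and GL_FLOAT (0x1406) are hard-coded so there is no OpenGL dependency, little-endian is assumed (the ctypes reader uses native byte order), and only float attributes and uint32 indices are handled.

```python
import io
import struct

GL_TRIANGLES = 0x0004
GL_FLOAT = 0x1406

def writeMeshV0(f, vertices, indices, layouts):
    """Write the v0 mesh format the loader above expects.

    vertices: flat list of floats, indices: list of ints,
    layouts: list of (location, dimensions) tuples, float attributes only.
    """
    vertexSize = sum(dims for _, dims in layouts) * 4  # 4 bytes per float
    indexSize = 4                                      # uint32 indices
    f.write(struct.pack('<B', 0))                      # file version
    f.write(struct.pack('<IB', len(vertices) * 4 // vertexSize, vertexSize))
    f.write(struct.pack('<IB', len(indices), indexSize))
    f.write(struct.pack('<I', GL_TRIANGLES))           # draw mode
    f.write(struct.pack('<B', len(layouts)))           # attribute count
    for location, dimensions in layouts:
        f.write(struct.pack('<BBI', location, dimensions, GL_FLOAT))
    f.write(struct.pack('<%df' % len(vertices), *vertices))
    f.write(struct.pack('<%dI' % len(indices), *indices))

# One triangle with a single position attribute (location 0, 3 floats):
buf = io.BytesIO()
writeMeshV0(buf,
            [0.0, 0.0, 0.0,  1.0, 0.0, 0.0,  0.0, 1.0, 0.0],
            [0, 1, 2],
            [(0, 3)])
```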

SIMD Matrix math for Python

Long story short: scroll down for a downloadable DLL and python file that do matrix math using optimized SIMD functions.

Recently I was messing around with some 3D in PyOpenGL and found my most notable slowdowns occurring due to matrix math (multiplications being most common).

So I decided to try and implement some fast matrix functions and call those from Python, using plain C exports and ctypes as explained here by my friend Jan Pijpers.

I won’t go into detail about the math, you can download the project files at the end; any sources used are referenced in there.

I do have some profile results to compare! Each action listed was called 100,000 times; times are in seconds.

Pure python implementation.

identity: 0.0331956952566
rotateY: 0.0617851720355
transform: 1.70942981948
inverse: 15.095287772
multiply: 0.492130726156
vector: 0.160486968636
perspective: 0.107690428216
transpose: 0.452984656091

Note that this inverse implementation is matrix-size agnostic (and does no normalization!), so none of its loops are unrolled. It is not representative of a proper dedicated matrix4x4 inverse in Python.

Using VC++ 14.0 (MSBuild), compiling release with -O2 and running without the debugger.

identity: 0.0333827514946
rotateY: 0.0857555184901
transform: 0.251571437936
inverse: 0.0439880125093
multiply: 0.0420022367291
vector: 0.288415226444
perspective: 0.156626988673
transpose: 0.0889596428649
perspective no SIMD: 0.160488955074

Using LLVM 14.0 from Visual Studio (not sure which linker is used there), compiling release with -O2 and running without the debugger (-O3 doesn't change the results).

identity: 0.0323709924443
rotateY: 0.0845113462024
transform: 0.23958858222
inverse: 0.0395744785104
multiply: 0.0437013033019
vector: 0.286256299491
perspective: 0.150614703216
transpose: 0.0877707597662
perspective no SIMD: 0.156242612934

Interestingly, not all operations are faster in C, due to type conversions. For a simple axis-aligned rotation all we need is a sin, a cos and a list. Python's sin/cos are not going to be any slower than C's, so all we did was complicate the program.

But in a practical example, represented by the transform function (a separate rotateX, rotateY, rotateZ and translation matrix call, then all four of them multiplied together), we see a very worthwhile performance gain.
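For reference, the pure-Python multiply being benchmarked was presumably along these lines; a sketch of the classic triple loop, not the exact profiled code. This inner loop over k is exactly what the SIMD version collapses into a handful of vector instructions.

```python
def mat44Multiply(a, b):
    """Multiply two row-major 4x4 matrices stored as flat 16-float lists."""
    out = [0.0] * 16
    for row in range(4):
        for col in range(4):
            acc = 0.0
            for k in range(4):
                acc += a[row * 4 + k] * b[k * 4 + col]
            out[row * 4 + col] = acc
    return out
```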

The math executes using SIMD instructions, so all data is converted to 16-byte aligned "__m128" structures (from "xmmintrin.h"). We need the C identity and rotate constructors to get the proper type of data; then, when we actually need the data back, we must call storeMat44() to get an actual c_float[16] for Python usage.
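That 16-byte alignment requirement can matter on the Python side too if you ever hand memory to SIMD code directly, since ctypes only gives you the alignment of the element type. A sketch of over-allocating and offsetting to get an aligned c_float array (alignedFloatArray is my name, not part of the DLL):

```python
import ctypes

def alignedFloatArray(count, alignment=16):
    """Allocate a c_float[count] whose address is a multiple of `alignment`
    by over-allocating a raw byte buffer and offsetting into it."""
    size = count * ctypes.sizeof(ctypes.c_float)
    raw = (ctypes.c_ubyte * (size + alignment - 1))()
    offset = (alignment - ctypes.addressof(raw) % alignment) % alignment
    # from_buffer keeps `raw` alive through the exported buffer reference.
    return (ctypes.c_float * count).from_buffer(raw, offset)

mat = alignedFloatArray(16)  # safe to treat as four __m128 on the C side
```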

From my current usage, 1 in 3 matrices requires a conversion back to Python floats in order to get passed to PyOpenGL, so here is another multiply performance test with every third multiplication stored back into a float[16]…

python multiply: 0.492130726156
MSVC raw: 0.0436761417549
MSVC multiply convert every third: 0.06491612928
MSVC convert all: 0.0925153667527

So while our raw implementation is about 11 times faster, the fully converting implementation is only 5 times faster; 7.5 times for our real-world example. That's more than 30% lost again… still much better than pure Python, though!

Download the Visual Studio project with x86 and x64 binaries here! Tested on Win10 x64 with Python 2.7.10 x64.
Math3Dx64.Dll and math3d.py are the end user files.

One important thing I wish to look into is passing data by copy instead: currently all functions allocate a new matrix on the heap, and the user has to delete those pointers by hand from Python using the deleteMat44() helper function. I don't know enough about DLLs or Python's memory allocation to tell whether I can copy data from the stack instead, and if so, whether that would be any faster.
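A common alternative is caller-owned output memory: Python allocates the c_float[16] and passes it as an out-parameter, the C side writes into it in place, and no deleteMat44() is needed because Python's GC frees the buffer. A sketch of that pattern with a hypothetical multiplyInto signature (not the actual DLL API); a pure-Python stand-in plays the C side here so the snippet runs on its own:

```python
import ctypes

Mat44 = ctypes.c_float * 16

# Hypothetical C signature: void multiplyInto(const float* a, const float* b, float* out);
# A pure-Python stand-in for the C side, so the pattern is runnable as-is.
def multiplyInto(a, b, out):
    for row in range(4):
        for col in range(4):
            out[row * 4 + col] = sum(a[row * 4 + k] * b[k * 4 + col]
                                     for k in range(4))

a = Mat44(1, 0, 0, 0,  0, 1, 0, 0,  0, 0, 1, 0,  0, 0, 0, 1)  # identity
b = Mat44(*range(16))
out = Mat44()          # caller-owned storage; no manual delete required
multiplyInto(a, b, out)
```

With a real DLL the same call would be `dll.multiplyInto(a, b, out)` after setting argtypes; whether avoiding the heap allocation actually wins anything would need profiling.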

I do know that __vectorcall is not compatible with __declspec(dllexport), which kind of makes sense… but more direct data passing could be nice.