Improving a renderer

This follows up on my previous write-up about the tools we developed for our 64kb endeavours.

After creating Eidolon [Video] we were left with the feeling that the rendering could be a lot better. We had single-pass bloom, simple Lambert & Phong shading, no anti-aliasing and very poorly performing depth of field. Lastly, the performance hit for reflections was through the roof as well.

I started almost immediately on a bunch of improvements; most of this work was done within a month after Revision, which shows in our newest demo Yermom [Video]. I’ll go over the improvements in chronological order and credit any sources used (of which there were a lot), if I managed to document that right…

Something useful to mention: all my buffers are Float32 RGBA.

Low-resolution reflections:

Basically the scene is raymarched: for every pixel there is a TraceAndShade call that renders the pixel excluding fog and reflection.
From that result we do another TraceAndShade for the reflection, which makes the entire thing twice as slow when reflections are on.
Instead I early out at this point if:
if(reflectivity == 0.0 || int(gl_FragCoord.x) % 4 != 0 || int(gl_FragCoord.y) % 4 != 0) return;
That results in only 1 in 16 pixels being reflective. So instead of compositing the reflection directly I write it to a separate buffer.
Then in a future pass I composite the 2 buffers, where I just do a look up in the reflection buffer like so:
texelFetch(uImages[0], ivec2(gl_FragCoord.xy), 0) + texelFetch(uImages[1], ivec2(gl_FragCoord.xy / 4) * 4, 0)
In my real scenario I removed that * 4 and render to a 4 times smaller buffer instead, so reading it back results in free interpolation.
I still get glitches when blurring the reflections too much, and around edges in general. Definitely still room for future improvement.
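To illustrate the idea outside the shader, here is the 1-in-16 scheme sketched in plain Python (illustrative only; the real work happens in the GLSL fragment shader above):

```python
# Sketch of the 1-in-16 reflection scheme.

def writes_reflection(x, y):
    """Only pixels on the 4x4 grid survive the early-out and trace a reflection."""
    return x % 4 == 0 and y % 4 == 0

def reflection_lookup(x, y):
    """Composite pass: snap any pixel to its reflective sample.
    Mirrors texelFetch(uImages[1], ivec2(gl_FragCoord.xy / 4) * 4)."""
    return ((x // 4) * 4, (y // 4) * 4)

# Every pixel maps onto a coordinate that actually traced a reflection:
for x in range(16):
    for y in range(16):
        sx, sy = reflection_lookup(x, y)
        assert writes_reflection(sx, sy)

# Exactly 1 in 16 pixels does the extra trace:
traced = sum(writes_reflection(x, y) for x in range(16) for y in range(16))
print(traced)  # 16 of 256 pixels
```

Rendering the reflection buffer at a quarter resolution (as described above) keeps this lookup pattern but gets the interpolation for free from the sampler.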

Oren Nayar diffuse light response

The original paper, and this image especially, sold me on this shading model for diffuse objects.

So I tried to implement that, failed a few times, got pretty close, found an accurate implementation, realized it was slow, and ended on these 2 websites:

That lists a nifty trick to fake it. I took away some terms as I realized they barely contributed any visible difference, so I got something even less accurate. I already want to revisit this, but it’s one of the improvements I wanted to share nonetheless.

float orenNayarDiffuse(float satNdotV, float satNdotL, float roughness)
{
    float lambert = satNdotL;
    if (roughness == 0.0)
        return lambert;
    float softRim = saturate(1.0 - satNdotV * 0.5);

    // my magic numbers
    float fakey = pow(lambert * softRim, 0.85);
    return mix(lambert, fakey * 0.85, roughness);
}
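To sanity-check the approximation I find it useful to port such snippets to Python and poke at them numerically; here is a sketch of that (saturate() clamps to [0, 1] and mix() is a lerp, as in GLSL):

```python
# The fake Oren-Nayar response ported to Python.

def saturate(x):
    return max(0.0, min(1.0, x))

def oren_nayar_diffuse(sat_n_dot_v, sat_n_dot_l, roughness):
    lambert = sat_n_dot_l
    if roughness == 0.0:
        return lambert
    soft_rim = saturate(1.0 - sat_n_dot_v * 0.5)
    # the "magic numbers" from the shader
    fakey = (lambert * soft_rim) ** 0.85
    return lambert + (fakey * 0.85 - lambert) * roughness  # mix()

# Zero roughness degrades to plain Lambert:
assert oren_nayar_diffuse(0.3, 0.7, 0.0) == 0.7
# Rough surfaces respond less when viewed head-on than at grazing view angles,
# the flattened look Oren-Nayar is known for:
assert oren_nayar_diffuse(1.0, 0.7, 1.0) < oren_nayar_diffuse(0.0, 0.7, 1.0)
```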

GGX specular

There are various open source implementations of this. I found one here:
It talks about tricks to optimize things by precomputing a lookup texture; I didn’t go that far. There’s not much I can say about this, as I don’t fully understand the math and how it differs from the basic Phong dot(N, H).

float G1V(float dotNV, float k){return 1.0 / (dotNV * (1.0 - k)+k);}

float ggxSpecular(float NdotV, float NdotL, vec3 N, vec3 L, vec3 V, float roughness)
    float F0 = 0.5;

    vec3 H = normalize(V + L);
    float NdotH = saturate(dot(N, H));
    float LdotH = saturate(dot(L, H));
    float a2 = roughness * roughness;

    float D = a2 / (PI * sqr(sqr(NdotH) * (a2 - 1.0) + 1.0));
    float F = F0 + (1.0 - F0) * pow(1.0 - LdotH, 5.0);
    float vis = G1V(NdotL, a2 * 0.5) * G1V(NdotV, a2 * 0.5);
    return NdotL * D * F * vis;
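Even without fully understanding the math, the behaviour can be checked numerically. Here is a Python port of the function above (sqr and PI written out, vectors as plain tuples) showing the expected roughness behaviour:

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def saturate(x):
    return max(0.0, min(1.0, x))

def g1v(dot_nv, k):
    return 1.0 / (dot_nv * (1.0 - k) + k)

def ggx_specular(n, l, v, roughness, f0=0.5):
    n_dot_l = saturate(dot(n, l))
    n_dot_v = saturate(dot(n, v))
    h = normalize(tuple(a + b for a, b in zip(v, l)))
    n_dot_h = saturate(dot(n, h))
    l_dot_h = saturate(dot(l, h))
    a2 = roughness * roughness
    d = a2 / (math.pi * (n_dot_h * n_dot_h * (a2 - 1.0) + 1.0) ** 2)
    f = f0 + (1.0 - f0) * (1.0 - l_dot_h) ** 5
    vis = g1v(n_dot_l, a2 * 0.5) * g1v(n_dot_v, a2 * 0.5)
    return n_dot_l * d * f * vis

n = (0.0, 0.0, 1.0)
l = v = normalize((0.0, 0.2, 1.0))  # light and view aligned: mirror highlight
# A rougher surface spreads the energy out, so the highlight peak gets dimmer:
assert ggx_specular(n, l, v, 0.2) > ggx_specular(n, l, v, 0.8)
```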


FXAA

FXAA3 to be precise. The whitepaper is quite clear, and since it’s open source, why bother rewriting it? I can’t remember which implementation I used, but here’s a few links:
Preprocessed and minified for preset 12 it became very small in a compressed executable, so I figured I’d just share it.

#version 420
uniform vec3 uTimeResolution;uniform sampler2D uImages[1];out vec4 z;float aa(vec3 a){vec3 b=vec3(.299,.587,.114);return dot(a,b);}
#define bb(a)texture(uImages[0],a)
#define cc(a)aa(texture(uImages[0],a).rgb)
#define dd(a,b)aa(texture(uImages[0],a+(b*c)).rgb)
void main(){vec2 a=gl_FragCoord.xy/uTimeResolution.yz,c=1/uTimeResolution.yz;vec4 b=bb(a);b.y=aa(b.rgb);float d=dd(a,vec2(0,1)),e=dd(a,vec2(1,0)),f=dd(a,vec2(0,-1)),g=dd(a,vec2(-1,0)),h=max(max(f,g),max(e,max(d,b.y))),i=h-min(min(f,g),min(e,min(d,b.y)));if(i<max(.0833,h*.166)){z=bb(a);return;}h=dd(a,vec2(-1,-1));float j=dd(a,vec2( 1,1)),k=dd(a,vec2( 1,-1)),l=dd(a,vec2(-1,1)),m=f+d,n=g+e,o=k+j,p=h+l,q=c.x;
bool r=abs((-2*g)+p)+(abs((-2*b.y)+m)*2)+abs((-2*e)+o)>=abs((-2*d)+l+j)+(abs((-2*b.y)+n)*2)+abs((-2*f)+h+k);if(!r){f=g;d=e;}else q=c.y;h=f-b.y,e=d-b.y,f=f+b.y,d=d+b.y,g=max(abs(h),abs(e));i=clamp((abs((((m+n)*2+p+o)*(1./12))-b.y)/i),0,1);if(abs(e)<abs(h))q=-q;else f=d;vec2 s=a,t=vec2(!r?0:c.x,r?0:c.y);if(!r)s.x+=q*.5;else s.y+=q*.5;
vec2 u=vec2(s.x-t.x,s.y-t.y);s=vec2(s.x+t.x,s.y+t.y);j=((-2)*i)+3;d=cc(u);e=i*i;h=cc(s);g*=.25;i=b.y-f*.5;j=j*e;d-=f*.5;h-=f*.5;bool v,w,x,y=i<0;
#define ee(Q) v=abs(d)>=g;w=abs(h)>=g;if(!v)u.x-=t.x*Q;if(!v)u.y-=t.y*Q;x=(!v)||(!w);if(!w)s.x+=t.x*Q;if(!w)s.y+=t.y*Q;
#define ff if(!v)d=cc(u.xy);if(!w)h=cc(s.xy);if(!v)d=d-f*.5;if(!w)h=h-f*.5;
ee(1.5)if(x){ff ee(2.)if(x){ff ee(4.)if(x){ff ee(12.)}}}e=a.x-u.x;f=s.x-a.x;if(!r){e=a.y-u.y;f=s.y-a.y;}q*=max((e<f?(d<0)!=y:(h<0)!=y)?(min(e,f)*(-1/(f+e)))+.5:0,j*j*.75);if(!r)a.x+=q;else a.y+=q;z=bb(a);}

Multi pass bloom

The idea for this one was heavily inspired by this asset for Unity:!/content/17324

I’m quite sure the technique is not original, but that’s where I got the idea.

The idea is to downsample and blur at many resolutions and then combine the (weighted) results to get a very high quality full screen blur.
So basically downsample to half the resolution (a quarter of the pixels, factor 2 per axis) using this shader:

#version 420

uniform vec3 uTimeResolution;
#define uTime (uTimeResolution.x)
#define uResolution (uTimeResolution.yz)

uniform sampler2D uImages[1];

out vec4 outColor0;

void main()
{
    outColor0 = 0.25 * (texture(uImages[0], (gl_FragCoord.xy + vec2(-0.5)) / uResolution)
        + texture(uImages[0], (gl_FragCoord.xy + vec2(0.5, -0.5)) / uResolution)
        + texture(uImages[0], (gl_FragCoord.xy + vec2(0.5, 0.5)) / uResolution)
        + texture(uImages[0], (gl_FragCoord.xy + vec2(-0.5, 0.5)) / uResolution));
}

Then downsample that, and recurse until we reach a factor of 64.

All the downsamples fit in the backbuffer, so in theory that, together with the first blur pass, could be done in one go using the backbuffer as a sampler2D as well. But to avoid the hassle of figuring out the correct (clamped!) UV coordinates I just use a ton of passes.
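The claim that all downsamples fit in the backbuffer follows from the geometric series of buffer sizes; a quick Python sketch for a hypothetical 1920x1080 backbuffer:

```python
# The downsample chain: halve each axis per step until a factor-64 buffer.

def downsample_chain(width, height, max_factor=64):
    sizes, factor = [], 2
    while factor <= max_factor:
        sizes.append((width // factor, height // factor))
        factor *= 2
    return sizes

chain = downsample_chain(1920, 1080)
print(chain)  # [(960, 540), (480, 270), (240, 135), (120, 67), (60, 33), (30, 16)]

# All downsampled levels together need only about a third of the
# backbuffer's pixels (geometric series), so they easily fit inside it:
assert sum(w * h for w, h in chain) < 1920 * 1080
```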

Then take all these downsampled buffers and ping pong them for blur passes, so for each buffer:
HBLUR taking steps of 2 pixels, into a buffer of the same size
VBLUR, back into the initial downsampled buffer
HBLUR taking steps of 3 pixels, reuse the HBLUR buffer
VBLUR, reuse the initial downsampled buffer

The pixel step is passed in uBlurSize, the blur direction in uDirection.

#version 420

out vec4 color;

uniform vec3 uTimeResolution;
#define uTime (uTimeResolution.x)
#define uResolution (uTimeResolution.yz)

uniform sampler2D uImages[1];
uniform vec2 uDirection;
uniform float uBlurSize;

const float curve[7] = { 0.0205, 0.0855, 0.232, 0.324, 0.232, 0.0855, 0.0205 };

void main()
{
    vec2 uv = gl_FragCoord.xy / uResolution;
    vec2 netFilterWidth = uDirection / uResolution * uBlurSize;
    vec2 coords = uv - netFilterWidth * 3.0;

    color = vec4(0);
    for (int l = 0; l < 7; l++)
    {
        vec4 tap = texture(uImages[0], coords);
        color += tap * curve[l];
        coords += netFilterWidth;
    }
}
Lastly we combine the passes with lens dirt. uImages[0] is the original backbuffer, 1-6 are all the downsampled and blurred buffers, 7 is a lens dirt image.
My lens dirt texture is pretty poor: it’s just a precalculated texture with randomly scaled and colored circles and hexagons, sometimes filled and sometimes outlined.
I don’t think I ever actually used the lens dirt or bloom intensity as uniforms.

#version 420

out vec4 color;

uniform vec3 uTimeResolution;
#define uTime (uTimeResolution.x)
#define uResolution (uTimeResolution.yz)

uniform sampler2D uImages[8];
uniform float uBloom = 0.04;
uniform float uLensDirtIntensity = 0.3;

void main()
{
    vec2 coord = gl_FragCoord.xy / uResolution;
    color = texture(uImages[0], coord);

    vec3 b0 = texture(uImages[1], coord).xyz;
    vec3 b1 = texture(uImages[2], coord).xyz * 0.6; // dampen to have less banding in gamma space
    vec3 b2 = texture(uImages[3], coord).xyz * 0.3; // dampen to have less banding in gamma space
    vec3 b3 = texture(uImages[4], coord).xyz;
    vec3 b4 = texture(uImages[5], coord).xyz;
    vec3 b5 = texture(uImages[6], coord).xyz;

    vec3 bloom = b0 * 0.5
        + b1 * 0.6
        + b2 * 0.6
        + b3 * 0.45
        + b4 * 0.35
        + b5 * 0.23;

    bloom /= 2.2;
    color.xyz = mix(color.xyz, bloom, uBloom);

    vec3 lens = texture(uImages[7], coord).xyz;
    vec3 lensBloom = b0 + b1 * 0.8 + b2 * 0.6 + b3 * 0.45 + b4 * 0.35 + b5 * 0.23;
    lensBloom /= 3.2;
    color.xyz = mix(color.xyz, lensBloom, clamp(lens * uLensDirtIntensity, 0.0, 1.0));
    color.xyz = pow(color.xyz, vec3(1.0 / 2.2));
}

White lines on a cube, brightness of 10.

White lines on a cube, brightness of 300.

Sphere tracing algorithm

Instead of the rather naive sphere tracing loop I used in a lot of 4kb productions (and can write by heart by now) I went for this paper:
It is a clever technique that involves overstepping and backtracking only when necessary, as well as keeping track of pixel size in 3D to know when there is no need to compute more detail. The paper is full of code snippets and clear infographics; I don’t think I’d be capable of explaining it any clearer.

Beauty shots

Depth of field

I initially only knew how to do a good circular DoF, until this one came along:
I used it initially, but getting it to look good was really expensive because it is all single pass. Then I looked into a 3-blur-pass solution, which sort of worked, but when I went looking for more optimized versions I found this 2-pass one: It works extremely well; the only edge cases I found were when unfocusing a regular grid of bright points.

Here’s what I wrote to get it to work with a depth buffer (depth based blur):

const int NUM_SAMPLES = 16;

void main()
{
    vec2 fragCoord = gl_FragCoord.xy;

    const vec2 blurdir = vec2(0.0, 1.0);
    vec2 blurvec = blurdir / uResolution;
    vec2 uv = fragCoord / uResolution.xy;

    float z = texture(uImages[0], uv).w;
    fragColor = vec4(depthDirectionalBlur(z, CoC(z), uv, blurvec, NUM_SAMPLES), z);
}

Second pass:

const int NUM_SAMPLES = 16;

void main()
{
    vec2 uv = gl_FragCoord.xy / uResolution;

    float z = texture(uImages[0], uv).w;

    vec2 blurdir = vec2(1.0, 0.577350269189626);
    vec2 blurvec = normalize(blurdir) / uResolution;
    vec3 color0 = depthDirectionalBlur(z, CoC(z), uv, blurvec, NUM_SAMPLES);

    blurdir = vec2(-1.0, 0.577350269189626);
    blurvec = normalize(blurdir) / uResolution;
    vec3 color1 = depthDirectionalBlur(z, CoC(z), uv, blurvec, NUM_SAMPLES);

    vec3 color = min(color0, color1);
    fragColor = vec4(color, 1.0);
}

Shared header:

#version 420

// default uniforms
uniform vec3 uTimeResolution;
#define uTime (uTimeResolution.x)
#define uResolution (uTimeResolution.yz)

uniform sampler2D uImages[1];

uniform float uSharpDist = 15; // distance from camera that is 100% sharp
uniform float uSharpRange = 0; // distance from the sharp center that remains sharp
uniform float uBlurFalloff = 1000; // distance from the edge of the sharp range it takes to become 100% blurry
uniform float uMaxBlur = 16; // radius of the blur in pixels at 100% blur

float CoC(float z)
{
    return uMaxBlur * min(1.0, max(0.0, abs(z - uSharpDist) - uSharpRange) / uBlurFalloff);
}

out vec4 fragColor;

//note: uniform pdf rand [0;1)
float hash1(vec2 p)
{
    p = fract(p * vec2(5.3987, 5.4421));
    p += dot(p.yx, p.xy + vec2(21.5351, 14.3137));
    return fract(p.x * p.y * 95.4307);
}

#define USE_RANDOM

vec3 depthDirectionalBlur(float z, float coc, vec2 uv, vec2 blurvec, int numSamples)
{
    // z: z at UV
    // coc: blur radius at UV
    // uv: initial coordinate
    // blurvec: smudge direction
    // numSamples: blur taps
    vec3 sumcol = vec3(0.0);

    for (int i = 0; i < numSamples; ++i)
    {
        float r =
            #ifdef USE_RANDOM
            (i + hash1(uv + float(i + uTime)) - 0.5)
            #else
            float(i)
            #endif
            / float(numSamples - 1) - 0.5;
        vec2 p = uv + r * coc * blurvec;
        vec4 smpl = texture(uImages[0], p);
        if (smpl.w < z) // if the sample is closer, consider its CoC
        {
            p = uv + r * min(coc, CoC(smpl.w)) * blurvec;
            //p = uv + r * CoC(smpl.w) * blurvec;
            smpl = texture(uImages[0], p);
        }
        sumcol += smpl.xyz;
    }

    sumcol /= float(numSamples);
    sumcol = max(sumcol, 0.0);

    return sumcol;
}
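The CoC falloff in the shared header is simple enough to verify by hand; here is a Python port using the uniform defaults from above (uSharpDist=15, uSharpRange=0, uBlurFalloff=1000, uMaxBlur=16):

```python
# Circle-of-confusion radius as a function of depth, as in the GLSL CoC().

def coc(z, sharp_dist=15.0, sharp_range=0.0, blur_falloff=1000.0, max_blur=16.0):
    return max_blur * min(1.0, max(0.0, abs(z - sharp_dist) - sharp_range) / blur_falloff)

assert coc(15.0) == 0.0       # perfectly sharp at the focus distance
assert coc(515.0) == 8.0      # halfway along the falloff: half the max radius
assert coc(10000.0) == 16.0   # clamped at uMaxBlur far away
```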

Additional sources used for a longer time

Distance function library
A very cool site explaining all kinds of things you can do with this code. I think many of these functions were invented already, but it adds some bonuses, as well as a very clear code style and excellent documentation for full accessibility.
For an introduction to this library:

Noise functions
Hashes optimized so that only hash4() is implemented and the rest is just swizzling and redirecting, so a float-based hash is just:

float hash1(float x){return hash4(vec4(x)).x;}
vec2 hash2(float x){return hash4(vec4(x)).xy;}

And so on.

Value noise

Voronoi 2D
Voronoi is great: using the center distance we get Worley noise instead, and we can track cell indices for randomization.
This is fairly fast, but still too slow to do in realtime. So I implemented tileable 2D & 3D versions.

Layering the value noise for N iterations, scaling the UV by 2 and the weight by 0.5 in every iteration.
These could be controllable parameters for various different looks; a slower weight decrease results in a more wood-grain look, for example.

float perlin(vec2 p, int iterations)
{
    float f = 0.0;
    float amplitude = 1.0;

    for (int i = 0; i < iterations; ++i)
    {
        f += snoise(p) * amplitude;
        amplitude *= 0.5;
        p *= 2.0;
    }

    return f * 0.5;
}
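The amplitude series is easy to reason about by feeding the layering a constant "noise". A small Python sketch of the same loop:

```python
# Octave layering as in perlin() above; snoise is passed in so we can
# substitute a constant function and check the weights.

def perlin(snoise, p, iterations):
    f, amplitude = 0.0, 1.0
    for _ in range(iterations):
        f += snoise(p) * amplitude
        amplitude *= 0.5
        p = (p[0] * 2.0, p[1] * 2.0)
    return f * 0.5

const_one = lambda p: 1.0
assert perlin(const_one, (0.3, 0.7), 1) == 0.5
assert perlin(const_one, (0.3, 0.7), 3) == 0.875  # (1 + 0.5 + 0.25) * 0.5
```

The geometric weights converge towards 2, so the final * 0.5 keeps the layered result (asymptotically) in the same range as a single octave.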

Now the perlin logic can be applied to worley noise (voronoi center) to get billows. I did the same for the voronoi edges, all tileable in 2D and 3D for texture precalc. Here’s an example: the modulo in the snoise function is basically the only thing necessary to make things tileable. Perlin then just uses that and keeps track of the scale for each layer.

float snoise_tiled(vec2 p, float scale)
{
    p *= scale;
    vec2 c = floor(p);
    vec2 f = p - c;
    f = f * f * (3.0 - 2.0 * f);
    return mix(mix(hash1(mod(c + vec2(0.0, 0.0), scale) + 10.0),
        hash1(mod(c + vec2(1.0, 0.0), scale) + 10.0), f.x),
        mix(hash1(mod(c + vec2(0.0, 1.0), scale) + 10.0),
        hash1(mod(c + vec2(1.0, 1.0), scale) + 10.0), f.x), f.y);
}

float perlin_tiled(vec2 p, float scale, int iterations)
{
    float f = 0.0;
    p = mod(p, scale);
    float amplitude = 1.0;

    for (int i = 0; i < iterations; ++i)
    {
        f += snoise_tiled(p, scale) * amplitude;
        amplitude *= 0.5;
        scale *= 2.0;
    }

    return f * 0.5;
}
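The tiling claim can be demonstrated with a quick Python port. The lattice hash here is a made-up stand-in (the GLSL version hashes a vec2; this one flattens to a float), but the wrapping logic is the same: because the lattice coordinates go through mod(c, scale), shifting p by a full tile lands on exactly the same lattice values.

```python
import math

def hash1(x):
    # deterministic stand-in lattice hash (hypothetical, any would do)
    return (math.sin(x * 12.9898) * 43758.5453) % 1.0

def snoise_tiled(px, py, scale):
    px, py = px * scale, py * scale
    cx, cy = math.floor(px), math.floor(py)
    fx, fy = px - cx, py - cy
    fx, fy = fx * fx * (3 - 2 * fx), fy * fy * (3 - 2 * fy)  # smoothstep
    def h(ox, oy):
        # wrapped lattice coordinates make the noise periodic
        return hash1(((cx + ox) % scale) * 57.0 + ((cy + oy) % scale) + 10.0)
    top = h(0, 0) + (h(1, 0) - h(0, 0)) * fx
    bot = h(0, 1) + (h(1, 1) - h(0, 1)) * fx
    return top + (bot - top) * fy

a = snoise_tiled(0.37, 0.81, 8.0)
b = snoise_tiled(0.37 + 1.0, 0.81, 8.0)  # shifted by a full tile
assert abs(a - b) < 1e-9
assert 0.0 <= a < 1.0
```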

Creating a tool to make a 64k demo

In the process of picking up this webpage again, I can talk about something we did quite a while ago. Together with a team, I went through the process of making a 64 kilobyte demo, and we happened to win at one of the biggest demoscene events in Europe: Revision 2017. I still feel the afterglow of happiness from that.

If you’re not sure what that is, read on; else, scroll down! You program a piece of software that is only 64 kb in size and shows an audio-visual experience generated in realtime. To stay within such size limits you have to generate everything. We chose a rendering technique called ray marching, which allowed us to put all 3D modeling, texture generation, lighting, etc. as ASCII (GLSL sources) in the executable. On top of that we used a very minimal (yet versatile) modular synthesizer called 64klang2. Internally it stores a kind of minimal MIDI data plus the patches, and it can render amazing audio in realtime, so it doesn’t need to pre-render the song or anything. All this elementary and small-size data and code compiles to something over 200kb, which is then compressed using an executable packer like UPX or kkrunchy.

It was called Eidolon. You can watch a video:
Or stress test your GPU / leave a comment here:

The technologies used were fairly basic: very old-school Phong & Lambert shading and 2 blur passes for bloom, so all in all pretty low tech and not worth discussing. What I would like to discuss is the evolution of the tool. I’ll keep it high level this time though. Maybe in the future I can talk about specific implementations of things, but just seeing the UI will probably explain a lot of the features and the way things work.

Step 1: Don’t make a tool from scratch

Our initial idea was to leverage existing software. One of our team members, who besides modelling and eventually directing the whole creative result also managed the team, had some experience with a real-time node based software called Touch Designer. It is a tool where you can do realtime visuals, and it supports exactly what we need: rendering into a 2D texture with a fragment shader.

We wanted to have the same rendering code for all scenes, and just fill in the modeling and material code that is unique per scene. We figured out how to concatenate separate pieces of text and draw them into a buffer. Multiple buffers even. At some point I packed all code and rendering logic of a pass into 1 grouped node and we could design our render pipeline entirely node based.

Here you see the text snippets (1) merged into some buffers (2) and then post processed for the bloom (3). On the right (4) you see the first problem we hit with Touch Designer. The compiler error log is drawn inside this node. There is basically no easy way to have that error visible in the main application somewhere. So the first iteration of the renderer (and coincidentally the main character of Eidolon) looked something like this:

The renderer didn’t really change after this.

In case I sound too negative about Touch Designer in the next few paragraphs: our use case was rather special, so take this with a grain of salt!

We have a timeline control, borrowing the UI design a little from Maya, so this became the main preview window. That’s when we hit some problems though. The software has no concept of window focus, so it would constantly suffer hanging keys or respond to keys while typing in the text editor.

The last issue, which really killed it: everything has to be in 1 binary file. There is no native way to reference external text files for the shader code, or to merge node graphs. There is a really weird utility that expands the binary to ASCII, but then literally every single node is a text file, so it is just unmergeable.

Step 2: Make a tool

So then this happened:

Over a week’s time in the evenings, and then 1 long Saturday, I whipped this up using PyQt and PyOpenGL. This is the first screenshot I made; the curve editor isn’t actually an editor yet and there is no concept of camera shots (which we use to get hard cuts).

It has all the same concepts however: separate text files for the shader code, with an XML file determining which render passes use which files, what buffer they render into, and which buffers they reference in turn. With the added advantage of perfect granularity, all stored in ASCII files.

Some files are template-level, some are scene-level, so creating a new scene only copies the scene-level files, which can then be adjusted in a text editor, with a file watcher updating the picture. The CurveEditor feeds right back into the uniforms of the shader (by name), and the time slider at the bottom is the same idea as Maya’s / what you saw before.

Step 3: Make it better

Render pipeline
The concept was to set up a master render pipeline into which scenes would inject snippets of code. On disk this became a bunch of snippets, and an XML based template definition. This would be the most basic XML file:

    <pass buffer="0" outputs="1">
        <global path="header.glsl"/>
        <section path="scene.glsl"/>
        <global path="pass.glsl"/>
    </pass>
    <pass input0="0">
        <global path="present.glsl"/>
    </pass>

This will concatenate 3 files into 1 fragment shader, render into full-screen buffer “0” and then use present.glsl as another fragment shader, which in turn has the previous buffer “0” as input (forwarded to a sampler2D uniform).
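A minimal sketch of how such a template could be expanded into per-pass shader sources (the element names follow the snippet above, but the file contents and the exact expansion logic of the real tool are assumed; it also handles buffer sizes, multiple outputs, etc.):

```python
import xml.etree.ElementTree as ET

template = """
<template>
    <pass buffer="0" outputs="1">
        <global path="header.glsl"/>
        <section path="scene.glsl"/>
        <global path="pass.glsl"/>
    </pass>
    <pass input0="0">
        <global path="present.glsl"/>
    </pass>
</template>
"""

# stand-in for reading files from the template / scene folders
sources = {
    "header.glsl": "// header",
    "scene.glsl": "// scene",
    "pass.glsl": "// pass",
    "present.glsl": "// present",
}

passes = []
for pass_node in ET.fromstring(template):
    # concatenate the referenced snippets into one fragment shader source
    stitched = "\n".join(sources[snippet.get("path")] for snippet in pass_node)
    passes.append((pass_node.attrib, stitched))

assert passes[0][1] == "// header\n// scene\n// pass"
assert passes[1][0].get("input0") == "0"
```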

This branched out into making static buffers (textures), setting buffer sizes (smaller textures), multiple target buffers (render main and reflection pass at once), setting buffer size to a portion of the screen (downsampling for bloom), and 3D texture support (volumetric noise textures for clouds).

Creating a new scene just copies “scene.glsl” from the template to a new folder, where you can then fill out the necessary function(s) to get a unique scene. Here’s an example from our latest Evoke demo: 6 scenes, under which you see the “section” files for each scene.

Camera control
The second important thing I wanted to tackle was camera control. Basically the demo will control the camera based on some animation data, but it is nice to fly around freely and even use the current camera position as animation keyframe. So this was just using Qt’s event system to hook up the mouse and keyboard to the viewport.

I also created a little widget that displays where the camera is, has an “animation input or user input” toggle as well as a “snap to current animation frame” button.

Animation control
So now to animate the camera, without hard coding values! Or even typing numbers, preferably. I know a lot of people use a tracker-like tool called Rocket; I never used it, and it looks like an odd way to control animation data to me. I come from a 3D background, so I figured I’d want a curve editor like e.g. Maya has. In Touch Designer we also had a basic curve editor; conveniently you can name a channel the same as a uniform, then just have code evaluate the curve at the current time and send the result to that uniform location.
Some trickery was necessary to pack vec3s, I just look for channels that start with the same name and then end in .x, .y, .z, and possibly .w.
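That channel-packing trick could look something like this in Python (channel names here are made up for the example; the real tool's data structures are not shown):

```python
# Group curve channels sharing a base name and ending in .x/.y/.z/.w
# back into one vector uniform; scalar channels pass through untouched.

def pack_channels(channels):
    order = {"x": 0, "y": 1, "z": 2, "w": 3}
    vectors = {}
    for name, value in channels.items():
        if "." in name and name.rsplit(".", 1)[1] in order:
            base, comp = name.rsplit(".", 1)
            vectors.setdefault(base, {})[order[comp]] = value
        else:
            vectors[name] = value
    return {
        base: tuple(v[i] for i in sorted(v)) if isinstance(v, dict) else v
        for base, v in vectors.items()
    }

packed = pack_channels({"uCamPos.x": 1.0, "uCamPos.y": 2.0, "uCamPos.z": 3.0,
                        "uFogDensity": 0.1})
assert packed["uCamPos"] == (1.0, 2.0, 3.0)  # becomes a vec3 uniform
assert packed["uFogDensity"] == 0.1          # stays a float uniform
```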

Here’s an excerpt from a long camera shot with lots of movement, showing off our cool hermite splines. At the top right you can see we have several built in tangent modes, we never got around to building custom tangent editing. In the end this is more than enough however. With flat tangents we can create easing/acceleration, with spline tangents we can get continuous paths and with linear tangents we get continuous speed. Next to that are 2 cool buttons that allow us to feed the camera position to another uniform, so you can literally fly to a place where you want to put an object. It’s not as good as actual move/rotate widgets but for the limited times we need to place 3D objects it’s great.

Hard cuts
Two keys at identical times are impossible to represent in this interface, so we don’t support them. This means the camera can’t really “jump” to a new position instantly: with a tiny amount of curve between the previous and the next shot position, the time cursor can actually render 1 frame of a random camera position. So we had to solve this. I think it is one of the only big features that you won’t see in the initial screenshot above.

Introducing camera shots. A shot has its own “scene it should display” and its own set of animation data, so selecting a different shot yields different curve editor content. Shots are placed on a shared timeline, so scrolling through time will automatically show the right shot, and setting a keyframe will automatically figure out the “shot-local time” to put the key at, based on the global demo time. The curve editor has its own playhead that is directly linked to the global timeline as well, so we can adjust the time in multiple places.

When working with lots of people we had issues with people touching other people’s (work in progress) shots. Therefore we introduced “disabling” of shots. This way anyone could just prefix their shots and disable them before submitting, and we could mix and match shots from several people to get a final camera flow we all liked.

Shots are also rendered on the timeline as colored blocks. The grey block underneath those is our “range slider”. It makes the top part apply to only a subsection of the demo, so it is easy to loop a specific time range, or to zoom in far enough to change the time with the mouse at fine granularity.

The devil is in the details
Some things I overlooked in the first implementation, and some useful things I added only recently.
1. Undo/Redo of animation changes. Not unimportant, and luckily not hard to add with Qt.
2. Ctrl click timeline to immediately start animating that shot
3. Right click a shot to find the scene
4. Right click a scene to create a shot for that scene in particular
5. Current time display in minutes:seconds instead of just beats
6. BPM stored per-project instead of globally
7. Lots of hotkeys!

These things make the tool just that much faster to use.

Finally, here’s our tool today. There’s still plenty to be done, but we made 2 demos with it so far and it gets better every time!