Improving a renderer

This feeds into my previous write up on the tools developed for our 64kb endeavours.

After creating Eidolon [Video] we were left with the feeling that the rendering could be a lot better. We had a single-pass bloom, simple Lambert & Phong shading, no anti-aliasing and a very poorly performing depth of field. Lastly, the performance hit for reflections was through the roof as well.

I started almost immediately on a bunch of improvements; most of this work was done within a month after Revision, which shows in our newest demo Yermom [Video]. I’ll go over the improvements in chronological order and credit any sources used (of which there were a lot), if I managed to document that right…

Something useful to mention: all my buffers are Float32 RGBA.

Low-resolution reflections:

Basically the scene is raymarched: for every pixel there is a TraceAndShade call that renders the pixel excluding fog and reflection.
From the result we do another TraceAndShade for the reflection. This makes the entire thing twice as slow when reflections are on.
Instead I early out at this point like so:
if(reflectivity == 0.0 || int(gl_FragCoord.x) % 4 != 0 || int(gl_FragCoord.y) % 4 != 0) return;
That results in only 1 in 16 pixels being reflective. So instead of compositing the reflection directly I write it to a separate buffer.
Then in a later pass I composite the 2 buffers, where I just do a lookup in the reflection buffer like so:
texelFetch(uImages[0], ivec2(gl_FragCoord.xy), 0) + texelFetch(uImages[1], ivec2(gl_FragCoord.xy / 4) * 4, 0)
In my real scenario I removed that * 4 and render to a 4 times smaller buffer instead, so reading it back with a filtered texture() lookup results in free interpolation.
I still have glitches when blurring the reflections too much and around edges in general. There is definitely still room for future improvement.
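To make the "free interpolation" step concrete, here is a minimal sketch of what such a composite pass could look like. It assumes uImages[1] is the quarter-resolution reflection buffer and that it is bound with linear filtering; the output name and the plain additive blend are assumptions, not my exact production shader.

```glsl
#version 420

uniform vec3 uTimeResolution;          // x: time, yz: full-resolution size in pixels
#define uResolution (uTimeResolution.yz)

// Assumed bindings: [0] full-resolution shaded scene,
// [1] quarter-resolution reflection buffer sampled with GL_LINEAR.
uniform sampler2D uImages[2];

out vec4 outColor0;

void main()
{
    // Normalized coordinates are identical for both buffers, so the hardware
    // bilinear filter upsamples the small reflection buffer during the lookup.
    vec2 uv = gl_FragCoord.xy / uResolution;
    vec4 scene = texelFetch(uImages[0], ivec2(gl_FragCoord.xy), 0);
    vec4 reflection = texture(uImages[1], uv);
    outColor0 = scene + reflection;
}
```

The scene is still fetched per texel, only the reflection lookup goes through the filter.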

Oren Nayar diffuse light response

The original paper and this image especially convinced me to like this shading model for diffuse objects.

So I tried to implement that, failed a few times, got pretty close, found an accurate implementation, realized it was slow, and ended up on these 2 websites:

They list a nifty trick to fake it. I took away some terms as I realized they contributed barely any visible difference, so I got something even less accurate. I already want to revisit this, but it’s one of the improvements I wanted to share nonetheless.

// GLSL has no built-in saturate(); assumed helper
#define saturate(x) clamp(x, 0.0, 1.0)

float orenNayarDiffuse(float satNdotV, float satNdotL, float roughness)
{
    float lambert = satNdotL;
    if(roughness == 0.0)
        return lambert;
    float softRim = saturate(1.0 - satNdotV * 0.5);

    // my magic numbers
    float fakey = pow(lambert * softRim, 0.85);
    return mix(lambert, fakey * 0.85, roughness);
}
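For comparison, this is roughly what the slower, more accurate version looks like. It is a sketch of the commonly cited qualitative Oren-Nayar model, not the implementation I originally found. The function name is made up, sigma is the roughness as a slope standard deviation in radians, and I left out the albedo / PI factor so that sigma = 0 reduces to plain NdotL like the fake above.

```glsl
// Sketch of the qualitative Oren-Nayar diffuse term (hypothetical name).
// N, L, V are assumed to be normalized.
float orenNayarReference(vec3 N, vec3 L, vec3 V, float sigma)
{
    float rawNdotL = dot(N, L);
    float rawNdotV = dot(N, V);
    float NdotL = clamp(rawNdotL, 0.0, 1.0);
    float NdotV = clamp(rawNdotV, 0.0, 1.0);

    float sigma2 = sigma * sigma;
    float A = 1.0 - 0.5 * sigma2 / (sigma2 + 0.33);
    float B = 0.45 * sigma2 / (sigma2 + 0.09);

    // cos(phi_i - phi_r): angle between L and V projected onto the surface plane
    vec3 Lt = L - N * rawNdotL;
    vec3 Vt = V - N * rawNdotV;
    float cosPhi = dot(Lt, Vt) / max(length(Lt) * length(Vt), 1e-5);

    // alpha is the larger and beta the smaller of the two polar angles,
    // expressed through cosines to avoid acos()
    float cosAlpha = min(NdotL, NdotV);
    float cosBeta = max(NdotL, NdotV);
    float sinAlpha = sqrt(1.0 - cosAlpha * cosAlpha);
    float tanBeta = sqrt(1.0 - cosBeta * cosBeta) / max(cosBeta, 1e-5);

    return NdotL * (A + B * max(0.0, cosPhi) * sinAlpha * tanBeta);
}
```

The two square roots, the length() calls and the divisions are what make this noticeably heavier than a single pow() and mix().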

GGX specular

There are various open source implementations of this. I found one here:
It talks about tricks to optimize things by precomputing a lookup texture, but I didn’t go that far. There’s not much I can say about this, as I don’t fully understand the math and how it differs from the basic Phong dot(N, H).

// assumed helpers, defined elsewhere in my shader header
#define PI 3.14159265359
float sqr(float x){return x * x;}

float G1V(float dotNV, float k){return 1.0 / (dotNV * (1.0 - k)+k);}

float ggxSpecular(float NdotV, float NdotL, vec3 N, vec3 L, vec3 V, float roughness)
{
    float F0 = 0.5;

    vec3 H = normalize(V + L);
    float NdotH = saturate(dot(N, H));
    float LdotH = saturate(dot(L, H));
    float a2 = roughness * roughness;

    float D = a2 / (PI * sqr(sqr(NdotH) * (a2 - 1.0) + 1.0));
    float F = F0 + (1.0 - F0) * pow(1.0 - LdotH, 5.0);
    float vis = G1V(NdotL, a2 * 0.5) * G1V(NdotV, a2 * 0.5);
    return NdotL * D * F * vis;
}


FXAA

FXAA3 to be precise. The whitepaper is quite clear, but why bother writing it yourself if it’s open source? I can’t remember which one I used, but here are a few links:
Preprocessed and minified for preset 12, it ends up very small in a compressed executable, so I figured I’d just share it.

#version 420
uniform vec3 uTimeResolution;uniform sampler2D uImages[1];out vec4 z;float aa(vec3 a){vec3 b=vec3(.299,.587,.114);return dot(a,b);}
#define bb(a)texture(uImages[0],a)
#define cc(a)aa(texture(uImages[0],a).rgb)
#define dd(a,b)aa(texture(uImages[0],a+(b*c)).rgb)
void main(){vec2 a=gl_FragCoord.xy/uTimeResolution.yz,c=1/uTimeResolution.yz;vec4 b=bb(a);b.y=aa(b.rgb);float d=dd(a,vec2(0,1)),e=dd(a,vec2(1,0)),f=dd(a,vec2(0,-1)),g=dd(a,vec2(-1,0)),h=max(max(f,g),max(e,max(d,b.y))),i=h-min(min(f,g),min(e,min(d,b.y)));if(i<max(.0833,h*.166)){z=bb(a);return;}h=dd(a,vec2(-1,-1));float j=dd(a,vec2( 1,1)),k=dd(a,vec2( 1,-1)),l=dd(a,vec2(-1,1)),m=f+d,n=g+e,o=k+j,p=h+l,q=c.x;
bool r=abs((-2*g)+p)+(abs((-2*b.y)+m)*2)+abs((-2*e)+o)>=abs((-2*d)+l+j)+(abs((-2*b.y)+n)*2)+abs((-2*f)+h+k);if(!r){f=g;d=e;}else q=c.y;h=f-b.y,e=d-b.y,f=f+b.y,d=d+b.y,g=max(abs(h),abs(e));i=clamp((abs((((m+n)*2+p+o)*(1./12))-b.y)/i),0,1);if(abs(e)<abs(h))q=-q;else f=d;vec2 s=a,t=vec2(!r?0:c.x,r?0:c.y);if(!r)s.x+=q*.5;else s.y+=q*.5;
vec2 u=vec2(s.x-t.x,s.y-t.y);s=vec2(s.x+t.x,s.y+t.y);j=((-2)*i)+3;d=cc(u);e=i*i;h=cc(s);g*=.25;i=b.y-f*.5;j=j*e;d-=f*.5;h-=f*.5;bool v,w,x,y=i<0;
#define ee(Q) v=abs(d)>=g;w=abs(h)>=g;if(!v)u.x-=t.x*Q;if(!v)u.y-=t.y*Q;x=(!v)||(!w);if(!w)s.x+=t.x*Q;if(!w)s.y+=t.y*Q;
#define ff if(!v)d=cc(u.xy);if(!w)h=cc(s.xy);if(!v)d=d-f*.5;if(!w)h=h-f*.5;
ee(1.5)if(x){ff ee(2.)if(x){ff ee(4.)if(x){ff ee(12.)}}}e=a.x-u.x;f=s.x-a.x;if(!r){e=a.y-u.y;f=s.y-a.y;}q*=max((e<f?(d<0)!=y:(h<0)!=y)?(min(e,f)*(-1/(f+e)))+.5:0,j*j*.75);if(!r)a.x+=q;else a.y+=q;z=bb(a);}

Multi pass bloom

The idea for this one was heavily inspired by this asset for Unity:!/content/17324

I’m quite sure the technique is not original, but that’s where I got the idea.

The idea is to downsample and blur at many resolutions and then combine the (weighted) results to get a very high quality full screen blur.
So basically downsample to a quarter of the screen area (factor 2 per axis) using this shader:

#version 420

uniform vec3 uTimeResolution;
#define uTime (uTimeResolution.x)
#define uResolution (uTimeResolution.yz)

uniform sampler2D uImages[1];

out vec4 outColor0;

void main()
{
    outColor0 = 0.25 * (texture(uImages[0], (gl_FragCoord.xy + vec2(-0.5)) / uResolution)
    + texture(uImages[0], (gl_FragCoord.xy + vec2(0.5, -0.5)) / uResolution)
    + texture(uImages[0], (gl_FragCoord.xy + vec2(0.5, 0.5)) / uResolution)
    + texture(uImages[0], (gl_FragCoord.xy + vec2(-0.5, 0.5)) / uResolution));
}

Then downsample that, and repeat until we reach a factor of 64.

All the downsamples fit in the backbuffer, so in theory that together with the first blur pass can be done in 1 go using the backbuffer as sampler2D as well. But to avoid the hassle of figuring out the correct (clamped!) uv coordinates I just use a ton of passes.

Then take all these downsampled buffers and ping pong them for blur passes, so for each buffer:
HBLUR taking steps of 2 pixels, into a buffer of the same size
VBLUR, back into the initial downsampled buffer
HBLUR taking steps of 3 pixels, reuse the HBLUR buffer
VBLUR, reuse the initial downsampled buffer

The pixel step size is passed in uBlurSize, the blur direction in uDirection.

#version 420

out vec4 color;

uniform vec3 uTimeResolution;
#define uTime (uTimeResolution.x)
#define uResolution (uTimeResolution.yz)

uniform sampler2D uImages[1];
uniform vec2 uDirection;
uniform float uBlurSize;

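// symmetric 7-tap Gaussian weights summing to 1; the five inner values are assumed (common Unity blur kernel)
const float curve[7] = { 0.0205, 0.0855, 0.232, 0.324, 0.232, 0.0855, 0.0205 };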

void main()
{
    vec2 uv = gl_FragCoord.xy / uResolution;
    vec2 netFilterWidth = uDirection / uResolution * uBlurSize;
    vec2 coords = uv - netFilterWidth * 3.0;

    color = vec4(0);
    for( int l = 0; l < 7; l++ )
    {
        vec4 tap = texture(uImages[0], coords);
        color += tap * curve[l];
        coords += netFilterWidth;
    }
}

Lastly we combine the passes with lens dirt. uImages[0] is the original backbuffer, 1-6 are the downsampled and blurred buffers, 7 is a lens dirt image.
My lens dirt texture is pretty poor; it’s just a precalculated texture with randomly scaled and colored circles and hexagons, sometimes filled and sometimes outlined.
I don’t think I ever actually used the lens dirt or bloom intensity as uniforms.

#version 420

out vec4 color;

uniform vec3 uTimeResolution;
#define uTime (uTimeResolution.x)
#define uResolution (uTimeResolution.yz)

uniform sampler2D uImages[8];
uniform float uBloom = 0.04;
uniform float uLensDirtIntensity = 0.3;

void main()
{
    vec2 coord = gl_FragCoord.xy / uResolution;
    color = texture(uImages[0], coord);

    vec3 b0 = texture(uImages[1], coord).xyz;
    vec3 b1 = texture(uImages[2], coord).xyz * 0.6; // dampen to have less banding in gamma space
    vec3 b2 = texture(uImages[3], coord).xyz * 0.3; // dampen to have less banding in gamma space
    vec3 b3 = texture(uImages[4], coord).xyz;
    vec3 b4 = texture(uImages[5], coord).xyz;
    vec3 b5 = texture(uImages[6], coord).xyz;

    vec3 bloom = b0 * 0.5
        + b1 * 0.6
        + b2 * 0.6
        + b3 * 0.45
        + b4 * 0.35
        + b5 * 0.23;

    bloom /= 2.2;
    color.xyz = mix(color.xyz, bloom.xyz, uBloom);

    vec3 lens = texture(uImages[7], coord).xyz;
    vec3 lensBloom = b0 + b1 * 0.8 + b2 * 0.6 + b3 * 0.45 + b4 * 0.35 + b5 * 0.23;
    lensBloom /= 3.2;
    color.xyz = mix(color.xyz, lensBloom, (clamp(lens * uLensDirtIntensity, 0.0, 1.0)));
    color.xyz = pow(color.xyz, vec3(1.0 / 2.2));
}

White lines on a cube, brightness of 10.

White lines on a cube, brightness of 300.

Sphere tracing algorithm

Instead of the rather naive sphere tracing loop I used in a lot of 4kb productions and can write by heart, I went for this paper:
It is a clever technique that involves overstepping and backtracking only when necessary, as well as keeping track of pixel size in 3D to realize when there is no need to compute more detail. The paper is full of code snippets and clear infographics; I don’t think I could explain it any more clearly.
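To give a rough idea of the two tricks mentioned above (overstepping with backtracking, and a pixel-size based early out), here is a sketch of how such a loop can be structured. It is my own condensed illustration, not a listing from the paper. sceneDistance, traceRelaxed, the camera setup, the iteration count and the relaxation factor of 1.2 are all placeholder assumptions.

```glsl
#version 420

uniform vec3 uTimeResolution;
#define uResolution (uTimeResolution.yz)

out vec4 outColor0;

// Hypothetical scene: a unit sphere at the origin.
float sceneDistance(vec3 p) { return length(p) - 1.0; }

// Returns the hit distance along the ray, or -1.0 on a miss.
// pixelRadius is the radius of one pixel at distance 1 from the camera.
float traceRelaxed(vec3 origin, vec3 dir, float tMin, float tMax, float pixelRadius)
{
    float omega = 1.2;          // > 1.0 oversteps, 1.0 is classic sphere tracing
    float t = tMin;             // tMin must be > 0
    float previousRadius = 0.0;
    float stepLength = 0.0;
    float bestError = 1e30;
    float bestT = tMin;

    for (int i = 0; i < 64; ++i)
    {
        float radius = abs(sceneDistance(origin + dir * t));

        // The empty spheres of this and the previous sample must overlap,
        // otherwise the enlarged step may have jumped over geometry.
        bool overstepped = omega > 1.0 && (radius + previousRadius) < stepLength;
        if (overstepped)
        {
            stepLength -= omega * stepLength; // negative: walk back into the safe range
            omega = 1.0;                      // and continue with plain steps
        }
        else
        {
            stepLength = radius * omega;
        }
        previousRadius = radius;

        // Distance to the surface relative to the pixel footprint at depth t.
        float error = radius / t;
        if (!overstepped && error < bestError)
        {
            bestT = t;
            bestError = error;
        }
        if ((!overstepped && error < pixelRadius) || t > tMax)
            break;

        t += stepLength;
    }

    if (t > tMax || bestError > pixelRadius)
        return -1.0;
    return bestT;
}

void main()
{
    vec2 ndc = (gl_FragCoord.xy * 2.0 - uResolution) / uResolution.y;
    vec3 origin = vec3(0.0, 0.0, -3.0);
    vec3 dir = normalize(vec3(ndc, 1.5));
    // Approximate pixel footprint at distance 1 for this placeholder camera.
    float pixelRadius = 1.0 / (1.5 * uResolution.y);
    float t = traceRelaxed(origin, dir, 0.01, 20.0, pixelRadius);
    outColor0 = vec4(vec3(t < 0.0 ? 0.0 : 1.0 - t / 20.0), 1.0);
}
```

Because the stop condition is relative to t, distant surfaces terminate with a much coarser hit than nearby ones, which is where most of the saved iterations come from.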

Beauty shots

Depth of field

I initially only knew how to do good circular DoF, until this one came along:
I used that at first, but getting it to look good was really expensive because it is all single pass. Then I looked into a 3-blur-pass solution, which sorta worked, but when I went looking for more optimized versions I found this 2 pass one:
It works extremely well; the only edge cases I found were when unfocusing a regular grid of bright points.

Here’s what I wrote to get it to work with a depth buffer (depth based blur):

const int NUM_SAMPLES = 16;

void main()
{
    vec2 fragCoord = gl_FragCoord.xy;

    const vec2 blurdir = vec2( 0.0, 1.0 );
    vec2 blurvec = (blurdir) / uResolution;
    vec2 uv = fragCoord / uResolution.xy;

    float z = texture(uImages[0], uv).w;
    fragColor = vec4(depthDirectionalBlur(z, CoC(z), uv, blurvec, NUM_SAMPLES), z);
}

Second pass:

const int NUM_SAMPLES = 16;

void main()
{
    vec2 uv = gl_FragCoord.xy / uResolution;

    float z = texture(uImages[0], uv).w;

    vec2 blurdir = vec2(1.0, 0.577350269189626);
    vec2 blurvec = normalize(blurdir) / uResolution;
    vec3 color0 = depthDirectionalBlur(z, CoC(z), uv, blurvec, NUM_SAMPLES);

    blurdir = vec2(-1.0, 0.577350269189626);
    blurvec = normalize(blurdir) / uResolution;
    vec3 color1 = depthDirectionalBlur(z, CoC(z), uv, blurvec, NUM_SAMPLES);

    vec3 color = min(color0, color1);
    fragColor = vec4(color, 1.0);
}

Shared header:

#version 420

// default uniforms
uniform vec3 uTimeResolution;
#define uTime (uTimeResolution.x)
#define uResolution (uTimeResolution.yz)

uniform sampler2D uImages[1];

uniform float uSharpDist = 15; // distance from camera that is 100% sharp
uniform float uSharpRange = 0; // distance from the sharp center that remains sharp
uniform float uBlurFalloff = 1000; // distance from the edge of the sharp range it takes to become 100% blurry
uniform float uMaxBlur = 16; // radius of the blur in pixels at 100% blur

float CoC(float z)
{
    return uMaxBlur * min(1.0, max(0.0, abs(z - uSharpDist) - uSharpRange) / uBlurFalloff);
}

out vec4 fragColor;

//note: uniform pdf rand [0;1)
float hash1(vec2 p)
{
    p = fract(p * vec2(5.3987, 5.4421));
    p += dot(p.yx, p.xy + vec2(21.5351, 14.3137));
    return fract(p.x * p.y * 95.4307);
}

#define USE_RANDOM

vec3 depthDirectionalBlur(float z, float coc, vec2 uv, vec2 blurvec, int numSamples)
{
    // z: z at UV
    // coc: blur radius at UV
    // uv: initial coordinate
    // blurvec: smudge direction
    // numSamples: blur taps
    vec3 sumcol = vec3(0.0);

    for (int i = 0; i < numSamples; ++i)
    {
        float r =
            #ifdef USE_RANDOM
            (i + hash1(uv + float(i + uTime)) - 0.5)
            #else
            float(i) // assumed non-random fallback
            #endif
            / float(numSamples - 1) - 0.5;
        vec2 p = uv + r * coc * blurvec;
        vec4 smpl = texture(uImages[0], p);
        if(smpl.w < z) // if sample is closer consider its CoC
        {
            // alternative: p = uv + r * min(coc, CoC(smpl.w)) * blurvec;
            p = uv + r * CoC(smpl.w) * blurvec;
            smpl = texture(uImages[0], p);
        }
        sumcol += smpl.xyz;
    }

    sumcol /= float(numSamples);
    sumcol = max(sumcol, 0.0);

    return sumcol;
}

Additional sources used for a longer time

Distance function library
A very cool site explaining all kinds of things you can do with this code. I think many of these functions had been invented already, but this comes with some bonuses as well as a very clear code style and excellent documentation for full accessibility.
For an introduction to this library:

Noise functions
Hashes optimized to only implement hash4() and the rest is just swizzling and redirecting, so a float based hash is just:

float hash1(float x){return hash4(vec4(x)).x;}
vec2 hash2(float x){return hash4(vec4(x)).xy;}

And so on.
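To spell out the "and so on": below is a sketch with a stand-in hash4() and a few more wrappers. The hash4 body here is an assumption for illustration (it follows the widely used "hash without sine" pattern), not the function from my codebase. Only the wrapper pattern is the point.

```glsl
// Hypothetical stand-in for hash4(); constants are illustrative.
vec4 hash4(vec4 p)
{
    p = fract(p * vec4(0.1031, 0.1030, 0.0973, 0.1099));
    p += dot(p, p.wzxy + 33.33);
    return fract((p.xxyz + p.yzzw) * p.zywx);
}

// Every other variant pads its input to a vec4 and swizzles the result.
vec3 hash3(float x) { return hash4(vec4(x)).xyz; }
float hash1(vec2 x) { return hash4(x.xyxy).x; }
vec2 hash2(vec2 x)  { return hash4(x.xyxy).xy; }
vec3 hash3(vec3 x)  { return hash4(x.xyzx).xyz; }
```

Since GLSL allows overloading on argument type, all variants can share the hash1..hash4 names, and only one real hash body ends up in the executable.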

Value noise

Voronoi 2D
Voronoi is great, as using the center distance we get Worley noise instead, and we can track cell indices for randomization.
This is fairly fast, but still too slow to do in realtime. So I implemented tileable 2D & 3D versions.
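As an illustration of what a tileable 2D version can look like, here is a minimal sketch. The names cellPoint and voronoiTiled, the hash constants and the output layout are assumptions for this example rather than my actual code. The only important part is wrapping the cell index with mod() before hashing, which requires scale to be a whole number.

```glsl
#version 420

uniform vec3 uTimeResolution;
#define uResolution (uTimeResolution.yz)

out vec4 outColor0;

// Hypothetical hash: random feature point in [0,1)^2 for an integer cell.
vec2 cellPoint(vec2 cell)
{
    vec3 p3 = fract(vec3(cell.xyx) * vec3(0.1031, 0.1030, 0.0973));
    p3 += dot(p3, p3.yzx + 33.33);
    return fract((p3.xx + p3.yz) * p3.zy);
}

// x: distance to the nearest feature point (Worley noise),
// yz: wrapped index of the cell that owns it (usable for per-cell randomization).
vec3 voronoiTiled(vec2 uv, float scale)
{
    vec2 p = uv * scale;
    vec2 base = floor(p);
    vec2 f = p - base;

    float best = 8.0;
    vec2 bestCell = vec2(0.0);
    for (int y = -1; y <= 1; ++y)
    for (int x = -1; x <= 1; ++x)
    {
        vec2 offset = vec2(x, y);
        // Wrapping the cell index makes the left/right and top/bottom edges match.
        vec2 cell = mod(base + offset, scale);
        vec2 toPoint = offset + cellPoint(cell) - f;
        float d = dot(toPoint, toPoint);
        if (d < best)
        {
            best = d;
            bestCell = cell;
        }
    }
    return vec3(sqrt(best), bestCell);
}

void main()
{
    vec2 uv = gl_FragCoord.xy / uResolution;
    vec3 v = voronoiTiled(uv, 8.0);
    outColor0 = vec4(vec3(v.x), 1.0);
}
```

Rendering this once into a texture at load time and sampling it with repeat wrapping is what makes it affordable at runtime.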

Layering the value noise for N iterations, scaling the UV by 2 and weight by 0.5 in every iteration.
These could be controllable parameters for various different looks. A slower weight decrease results in a more wood-grain look for example.

float perlin(vec2 p, int iterations)
{
    float f = 0.0;
    float amplitude = 1.0;

    for (int i = 0; i < iterations; ++i)
    {
        f += snoise(p) * amplitude;
        amplitude *= 0.5;
        p *= 2.0;
    }

    return f * 0.5;
}
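The "controllable parameters" idea mentioned above could be sketched like this. The names layeredNoise, lacunarity and gain are assumptions for this example, valueNoise is a small stand-in so the snippet compiles on its own, and hashCell reuses the hash from the depth of field header. Unlike perlin() above, this version divides by the summed weights so the output range stays the same when gain changes.

```glsl
float hashCell(vec2 p)
{
    p = fract(p * vec2(5.3987, 5.4421));
    p += dot(p.yx, p.xy + vec2(21.5351, 14.3137));
    return fract(p.x * p.y * 95.4307);
}

// Stand-in value noise: smoothstep-interpolated hashes on an integer grid.
float valueNoise(vec2 p)
{
    vec2 c = floor(p);
    vec2 f = p - c;
    f = f * f * (3.0 - 2.0 * f);
    return mix(mix(hashCell(c), hashCell(c + vec2(1.0, 0.0)), f.x),
               mix(hashCell(c + vec2(0.0, 1.0)), hashCell(c + vec2(1.0, 1.0)), f.x), f.y);
}

// lacunarity: UV scale applied per layer (2.0 in the function above)
// gain: weight multiplier per layer (0.5 in the function above)
float layeredNoise(vec2 p, int iterations, float lacunarity, float gain)
{
    float f = 0.0;
    float amplitude = 1.0;
    float total = 0.0;

    for (int i = 0; i < iterations; ++i)
    {
        f += valueNoise(p) * amplitude;
        total += amplitude;
        amplitude *= gain;
        p *= lacunarity;
    }

    // Normalize by the summed weights instead of the fixed 0.5.
    return f / max(total, 1e-5);
}
```

A gain closer to 1.0 keeps the fine layers strong, which gives the grainier look described above.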

Now the perlin logic can be applied to worley noise (voronoi center) to get billows. I did the same for the voronoi edges, all tileable in 2D and 3D for texture precalc. Here’s an example. Basically the modulo in the snoise function is the only thing necessary to make things tileable. Perlin then just uses that and keeps track of the scale for that layer.

float snoise_tiled(vec2 p, float scale)
{
    p *= scale;
    vec2 c = floor(p);
    vec2 f = p - c;
    f = f * f * (3.0 - 2.0 * f);
    return mix(mix(hash1(mod(c + vec2(0.0, 0.0), scale) + 10.0),
    hash1(mod(c + vec2(1.0, 0.0), scale) + 10.0), f.x),
    mix(hash1(mod(c + vec2(0.0, 1.0), scale) + 10.0),
    hash1(mod(c + vec2(1.0, 1.0), scale) + 10.0), f.x), f.y);
}

float perlin_tiled(vec2 p, float scale, int iterations)
{
    float f = 0.0;
    p = mod(p, scale);
    float amplitude = 1.0;
    for (int i = 0; i < iterations; ++i)
    {
        f += snoise_tiled(p, scale) * amplitude;
        amplitude *= 0.5;
        scale *= 2.0;
    }

    return f * 0.5;
}
