This method needs documenting, as it is so awesome...

It takes advantage of:

1) Separability of the Filter

2) Hardware Bi-Linear Interpolation

3) OpenGL 3.2 -> GLSL 1.40 new texture fetching functions

Suppose we want an NxN-tap blur (e.g. N = 9 blurs 9x9 pixels together).

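1) Separability of the Filter

Due to amazing maths which I will not explain in full here, convolving the image along each of D orthogonal directions with identical 1D Gaussian functions gives exactly the same result as convolving the image with a single D-dimensional Gaussian function. In 2D this works because exp(-(x^2+y^2)/(2*sigma^2)) = exp(-x^2/(2*sigma^2)) * exp(-y^2/(2*sigma^2)).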

This means we first blur in the X direction and then blur the result of that in the Y direction.

This way we get exactly the same result as an NxN pixel blur while performing only 2xN reads per pixel instead of NxN.
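The equivalence above can be checked numerically. The following is a minimal CPU sketch, not the shader itself: the helper names (`gauss_kernel_1d`, `blur_2d`, `blur_pass`), the test image, the clamp-to-edge border handling and the values `radius=4`, `sigma=1.5` are all assumptions made for illustration.

```python
import math

def gauss_kernel_1d(radius, sigma):
    """Normalized 1D Gaussian weights for taps -radius..radius."""
    w = [math.exp(-(x * x) / (2.0 * sigma * sigma)) for x in range(-radius, radius + 1)]
    total = sum(w)
    return [v / total for v in w]

def clamp(i, n):
    """Clamp-to-edge addressing (an assumption; any border mode works)."""
    return max(0, min(n - 1, i))

def blur_2d(img, k):
    """Direct NxN convolution: N*N reads per output pixel."""
    h, w, r = len(img), len(img[0]), len(k) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = sum(
                k[dy + r] * k[dx + r] * img[clamp(y + dy, h)][clamp(x + dx, w)]
                for dy in range(-r, r + 1) for dx in range(-r, r + 1))
    return out

def blur_pass(img, k, horizontal):
    """One 1D pass: N reads per output pixel."""
    h, w, r = len(img), len(img[0]), len(k) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if horizontal:
                out[y][x] = sum(k[d + r] * img[y][clamp(x + d, w)] for d in range(-r, r + 1))
            else:
                out[y][x] = sum(k[d + r] * img[clamp(y + d, h)][x] for d in range(-r, r + 1))
    return out

def max_abs_diff(a, b):
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

# Arbitrary deterministic test image, 12 pixels wide and 10 high.
img = [[float((x * 7 + y * 13) % 11) for x in range(12)] for y in range(10)]
k = gauss_kernel_1d(radius=4, sigma=1.5)

direct = blur_2d(img, k)
separable = blur_pass(blur_pass(img, k, horizontal=True), k, horizontal=False)
print(max_abs_diff(direct, separable))  # differs only by floating-point rounding
```

The two results agree up to rounding, while the separable version touches 2x9 source pixels per output pixel instead of 9x9.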

2) Hardware Bi-Linear Interpolation

Bilinear interpolation is free on the GPU: every time you read a texture between pixel centres, the GPU interpolates between the nearest 4 pixels to give an approximate value for the in-between position.

If we read at a point on the line between 2 horizontally or vertically adjacent pixels (not diagonal), then only the values of these two are interpolated.

The interpolation is:

A*(1-c)+B*c=out

where c is the fraction of the way from A to B, which must be between 0.0 and 1.0.

Now look at how the Gaussian blur is performed. Two neighbouring pixels A and B contribute:

A*gaussWeight(positionA)+B*gaussWeight(positionB)

If positionB = positionA+1 (the next pixel from A), then c can be set so that

c = gaussWeight(positionA+1)/(gaussWeight(positionA)+gaussWeight(positionA+1))

and if we then multiply "out" by the sum of the two weights, a single read gives the same result as the two separate reads:

A*gaussWeight(positionA)+B*gaussWeight(positionB) = out*(gaussWeight(positionA)+gaussWeight(positionB))
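The identity is easy to verify with plain numbers. In this small sketch, `lerp` stands in for what the texture unit does between two neighbouring pixels, and `merge_taps` is a hypothetical helper name; the pixel values and weights are arbitrary.

```python
def lerp(a, b, c):
    """Linear interpolation between two neighbouring pixels, as the texture unit does."""
    return a * (1.0 - c) + b * c

def merge_taps(w_a, w_b):
    """Return (c, scale) such that lerp(A, B, c) * scale == A*w_a + B*w_b."""
    scale = w_a + w_b
    return w_b / scale, scale

A, B = 0.2, 0.9          # two neighbouring pixel values (arbitrary)
w_a, w_b = 0.12, 0.09    # their Gaussian weights (arbitrary)

c, scale = merge_taps(w_a, w_b)
two_reads = A * w_a + B * w_b       # the classic way: two reads
one_read = lerp(A, B, c) * scale    # one bilinear read, then one multiply
print(two_reads, one_read)
```

Since both weights are positive, c always lands between 0.0 and 1.0, so the read position is always between the two pixels.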

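Assuming bilinear is free (or almost free), this cuts our texture reads down to N+1 for an NxN filter with odd N: each pass reads the centre pixel once and every remaining pair of pixels with a single read.

With a radius of 8 pixels, i.e. a 17x17 filter, that is 18 texture reads (9 per pass) instead of 289.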
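The read count can be reproduced by building the merged tap list for one pass. This is a sketch under stated assumptions: `merged_taps` is a hypothetical helper, it expects an even radius so the taps on each side pair up cleanly, and `sigma=3.0` is an arbitrary choice that does not affect the counts.

```python
import math

def gauss_weight(x, sigma):
    return math.exp(-(x * x) / (2.0 * sigma * sigma))

def merged_taps(radius, sigma):
    """Centre tap plus pairs (1,2), (3,4), ... on each side.

    Returns one (offset_in_pixels, weight) tuple per bilinear read.
    Assumes an even radius.
    """
    norm = sum(gauss_weight(x, sigma) for x in range(-radius, radius + 1))
    taps = [(0.0, gauss_weight(0, sigma) / norm)]
    for a in range(1, radius, 2):
        w_a = gauss_weight(a, sigma) / norm
        w_b = gauss_weight(a + 1, sigma) / norm
        c = w_b / (w_a + w_b)                 # fraction of the way from pixel a to a+1
        taps.append((a + c, w_a + w_b))       # pair on the positive side
        taps.append((-(a + c), w_a + w_b))    # mirrored pair on the negative side
    return taps

radius = 8
n = 2 * radius + 1                            # 17 taps per axis
taps = merged_taps(radius, sigma=3.0)

reads_separable_bilinear = 2 * len(taps)      # X pass + Y pass
reads_bruteforce = n * n
print(len(taps), reads_separable_bilinear, reads_bruteforce)
```

The merged weights still sum to 1, so the blur does not brighten or darken the image.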

3) GLSL 1.40 fetches

We can observe that c = gaussWeight(positionA+1)/(gaussWeight(positionA)+gaussWeight(positionA+1)) is always below 0.5 when moving away from the centre (the further pixel has the smaller weight), and is usually such that 0.45<c<0.5 for most sensible filters.

Because not that much precision is involved in hardware bilinear interpolation anyway, we can actually always read at ~0.47 of a pixel away from the first pixel of each pair and not introduce too much error.

That means we will read at intervals of exactly 2 pixels.

This brings the textureOffset() reading function into play: it takes the offset between reads as a constant integer number of texels, which enables the GPU to fetch texture samples faster (optimizing cache usage and the bilinear filter).
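How much error the fixed ~0.47 fraction introduces depends on how wide the Gaussian is, so it is worth measuring for your own filter. The sketch below is one way to do that; the helper names are hypothetical, and `sigma = 8.0` is an assumption chosen deliberately wide so that the exact fractions stay near 0.47. A narrower Gaussian gives smaller exact fractions and a larger error.

```python
import math

def gauss_weight(x, sigma):
    return math.exp(-(x * x) / (2.0 * sigma * sigma))

def exact_fractions(radius, sigma):
    """Exact c for each pair (a, a+1), a = 1, 3, 5, ... (assumes even radius)."""
    out = []
    for a in range(1, radius, 2):
        w_a, w_b = gauss_weight(a, sigma), gauss_weight(a + 1, sigma)
        out.append(w_b / (w_a + w_b))
    return out

def fixed_fraction_weights(radius, sigma, c_fixed):
    """True per-pixel weights for taps 0..radius, and the weights that are
    effectively applied when every pair is read at the same fraction c_fixed."""
    norm = sum(gauss_weight(x, sigma) for x in range(-radius, radius + 1))
    true_w = [gauss_weight(x, sigma) / norm for x in range(radius + 1)]
    eff = [true_w[0]]
    for a in range(1, radius, 2):
        pair = true_w[a] + true_w[a + 1]      # the pair keeps its combined weight
        eff += [pair * (1.0 - c_fixed), pair * c_fixed]
    return true_w, eff

radius, sigma, c_fixed = 8, 8.0, 0.47

fractions = exact_fractions(radius, sigma)
true_w, eff_w = fixed_fraction_weights(radius, sigma, c_fixed)
max_weight_error = max(abs(t - e) for t, e in zip(true_w, eff_w))

# With a fixed fraction, the read positions on one side are 2 pixels apart.
offsets = [a + c_fixed for a in range(1, radius, 2)]

print(fractions)
print(max_weight_error)
print(offsets)
```

Because each pair keeps its combined weight, the approximation only shifts weight between the two pixels of a pair and the kernel stays normalized.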
