This method needs documenting, as it is so awesome...
It takes advantage of:
1) Separability of the Filter
2) Hardware Bi-Linear Interpolation
3) OpenGL 3.1 (GLSL 1.40) new texture fetching functions
Suppose we want an NxN tap blur (e.g. N=9 blends a 9x9 block of pixels together)
1: Separability of the filter
Due to amazing maths which I will not explain here, convolving the image along each orthogonal axis in turn with identically distributed 1D Gaussian functions gives exactly the same result as convolving the image once with a multi-dimensional Gaussian function
This means we first blur in the X direction and then blur the result of that in the Y direction
This way we get exactly the same result as an NxN pixel blur while performing only 2xN reads per pixel (N per pass)
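The separability claim above can be checked numerically. The following is a minimal CPU-side sketch in Python, not the shader itself: the helper names (gauss_kernel, blur_1d, blur_2d), the clamp-to-edge border handling and the test image are all assumptions made for illustration.

```python
import math

def gauss_kernel(radius, sigma):
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    w = [math.exp(-(i * i) / (2.0 * sigma * sigma))
         for i in range(-radius, radius + 1)]
    total = sum(w)
    return [x / total for x in w]

def clamp(v, lo, hi):
    return max(lo, min(hi, v))

def blur_1d(img, k, dx, dy):
    """One blur pass along direction (dx, dy); borders clamp to the edge."""
    h, w = len(img), len(img[0])
    r = len(k) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i, wt in enumerate(k):  # N reads per pixel
                sx = clamp(x + (i - r) * dx, 0, w - 1)
                sy = clamp(y + (i - r) * dy, 0, h - 1)
                acc += wt * img[sy][sx]
            out[y][x] = acc
    return out

def blur_2d(img, k):
    """Reference NxN blur: every pixel reads the full N*N neighbourhood."""
    h, w = len(img), len(img[0])
    r = len(k) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for j, wy in enumerate(k):
                sy = clamp(y + j - r, 0, h - 1)
                for i, wx in enumerate(k):
                    sx = clamp(x + i - r, 0, w - 1)
                    acc += wy * wx * img[sy][sx]
            out[y][x] = acc
    return out

def max_diff(a, b):
    """Largest absolute per-pixel difference between two images."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

if __name__ == "__main__":
    img = [[float((x * 7 + y * 13) % 11) for x in range(6)] for y in range(5)]
    k = gauss_kernel(2, 1.0)
    two_pass = blur_1d(blur_1d(img, k, 1, 0), k, 0, 1)  # X pass, then Y pass
    print(max_diff(two_pass, blur_2d(img, k)))
```

The two results should differ only by floating point rounding, while the two-pass version does 2xN reads per pixel rather than NxN.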
2: Hardware Bi-Linear Interpolation
Bilinear interpolation is (almost) free on the GPU: every time you read a texture at a position between pixel centres, the GPU interpolates between the nearest 4 pixels to get an approximate value for the in-between position.
If we read at a point on the line joining 2 horizontally or vertically adjacent pixels (not diagonal) then only the values of these two will be interpolated
The interpolation is:
A*(1-c)+B*c=out
where c is the fraction of the way from A to B, which must be between 0.0 and 1.0
Now look at how the Gaussian blur is computed for two neighbouring pixels:
A*gaussWeight(positionA)+B*gaussWeight(positionB)
If positionB = positionA+1 (the next pixel after A)
then c can be set so that c = gaussWeight(positionA+1)/(gaussWeight(positionA)+gaussWeight(positionA+1))
and if we then multiply "out" by the sum of the two weights we get the same result (up to the precision of the hardware filter)
A*gaussWeight(positionA)+B*gaussWeight(positionB) = out*(gaussWeight(positionA)+gaussWeight(positionB))
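The identity above can be verified with a few lines of arithmetic. This is a small Python sketch; gauss_weight is assumed to be an unnormalized Gaussian with a hypothetical sigma parameter, and merge_pair and lerp are illustrative names, not functions from the original shader.

```python
import math

def gauss_weight(pos, sigma):
    """Unnormalized Gaussian weight at a pixel offset (sigma is an assumption)."""
    return math.exp(-(pos * pos) / (2.0 * sigma * sigma))

def merge_pair(pos_a, sigma):
    """Return (c, scale) so that one bilinear read replaces taps pos_a and pos_a+1."""
    w_a = gauss_weight(pos_a, sigma)
    w_b = gauss_weight(pos_a + 1, sigma)
    scale = w_a + w_b      # factor to multiply the interpolated read by
    c = w_b / scale        # fraction of the way from A to B
    return c, scale

def lerp(a, b, c):
    """What the bilinear filter computes between two adjacent pixels."""
    return a * (1.0 - c) + b * c

if __name__ == "__main__":
    A, B = 0.2, 0.9        # two neighbouring pixel values
    pos_a, sigma = 1, 2.0
    c, scale = merge_pair(pos_a, sigma)
    two_reads = A * gauss_weight(pos_a, sigma) + B * gauss_weight(pos_a + 1, sigma)
    one_read = lerp(A, B, c) * scale
    print(two_reads, one_read)
```

In exact arithmetic the two values are identical; on real hardware the interpolation fraction has limited precision, so the match is only approximate.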
Assuming bilinear filtering is free (or almost free), each 1D pass needs only about half as many reads, which cuts our texture reads down to roughly N+1 for an NxN filter (odd N)
With a radius of 8 pixels, i.e. a 17x17 filter, that is 18 texture reads instead of 289
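The read counts can be tabulated with a short Python sketch. It assumes the pixels of each 1D pass are grouped into adjacent pairs, with one pixel left unpaired when N is odd; the function names are made up for this illustration.

```python
def reads_naive(n):
    """Full NxN kernel: one read per tap."""
    return n * n

def reads_separable(n):
    """Two 1D passes of N reads each."""
    return 2 * n

def reads_bilinear(n):
    """Two 1D passes, each reading pairs of pixels with one bilinear fetch."""
    per_pass = (n + 1) // 2   # ceil(n / 2): pairs plus a lone pixel if n is odd
    return 2 * per_pass

if __name__ == "__main__":
    radius = 8
    n = 2 * radius + 1        # 17x17 filter
    print(reads_naive(n), reads_separable(n), reads_bilinear(n))
```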
3: GLSL 1.40 fetches
We can observe that c=gaussWeight(positionA+1)/(gaussWeight(positionA)+gaussWeight(positionA+1))
is usually such that 0.45<c<0.5 for most sensible filters
Because bilinear interpolation in hardware does not have that much precision anyway,
we can always read at ~0.47 of a pixel away from the first pixel of each pair and not introduce too much error
That means the reads are spaced at intervals of exactly 2 pixels
This brings the textureOffset() reading function into play: it takes a constant integer texel offset, which enables the GPU to fetch texture samples faster (optimizing cache usage and the bilinear filter)
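To get a feel for how much error the fixed offset introduces, the exact c of every pair can be compared with 0.47. This Python sketch assumes the centre pixel is read on its own and the rest are paired as (1,2), (3,4), ... moving outwards; the sigma values and function names are hypothetical, since the note does not say which kernel widths were used.

```python
import math

def gauss_weight(pos, sigma):
    """Unnormalized Gaussian weight at a pixel offset (sigma is an assumption)."""
    return math.exp(-(pos * pos) / (2.0 * sigma * sigma))

def pair_fractions(radius, sigma):
    """Exact c for the pairs (1,2), (3,4), ... up to the given radius."""
    fractions = []
    for p in range(1, radius, 2):
        w_a = gauss_weight(p, sigma)
        w_b = gauss_weight(p + 1, sigma)
        fractions.append(w_b / (w_a + w_b))
    return fractions

def fixed_offset_error(radius, sigma, fixed_c=0.47):
    """Largest per-tap weight error, relative to the total kernel weight,
    when every pair is read at fixed_c instead of its exact c."""
    total = sum(gauss_weight(i, sigma) for i in range(-radius, radius + 1))
    worst = 0.0
    for p in range(1, radius, 2):
        scale = gauss_weight(p, sigma) + gauss_weight(p + 1, sigma)
        exact_c = gauss_weight(p + 1, sigma) / scale
        # Both taps of the pair are off by scale * |exact_c - fixed_c|.
        worst = max(worst, scale * abs(exact_c - fixed_c) / total)
    return worst

if __name__ == "__main__":
    for sigma in (8.0, 3.0):
        print(sigma, pair_fractions(8, sigma), fixed_offset_error(8, sigma))
```

For a wide kernel (sigma=8 at radius 8) every pair falls inside the 0.45<c<0.5 range. For a narrower one (sigma=3) the outermost pair drops to roughly 0.3, so the fixed offset looks best suited to wide kernels.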