Miss Cache: Lighting transparent surfaces with deferred shading. Part II

As promised last time, today I'm going to describe a technique for rendering transparent objects in a deferred shading pipeline. I hope it will provide an interesting read. If you don't care about my motivations and how this little project started, feel free to skip this section.

Roughly two months ago I decided to add some particle effects like smoke to my engine ("engine" is a slight exaggeration, but let's stick with it). The engine currently uses a deferred scheme, so I had to make up my mind whether I wanted particles to be lit (Yes! was the immediate answer. Unlit particles are more like 1912 than 2012), and if so, how to do that. I described a few possibilities in my last post, but all of them seemed just wrong. They were either problematic performance-wise (multipass), messy to handle and interoperate with the rest of the engine (forward), or rather special-cased (deep G-buffer). Also, it's not immediately obvious how to apply e.g. multipass or deep G-buffer to particles. In the case of smoke you could probably do something like this (accumulating the normals and alphas of all the effect's particles into an offscreen buffer and illuminating the result as a single object).

What I wanted was a versatile method, even if it would mean sacrificing some quality. I also wanted to avoid keeping all shadow maps in memory.

(Actually, the Creative Assembly method that I also mentioned last time fulfils my requirements pretty well. I've been lucky that they hadn't publicly described their method when I was starting this, because I would probably have stuck with it if I had known about it. :) )

There is another, maybe even more important reason for this endeavour. Currently, I am probably somewhere in the middle between a novice and an intermediate graphics programmer. Most of the work that I've done before was implementing well-known and well-described algorithms, only sometimes adding small things of my own here and there. I wanted to take a step in the "intermediate" direction and use the basic skills that I've been acquiring during the past ~two years to "invent" something on my own. I like to think of this work as my first "grown-up" project as a serious graphics coder.

Assumptions

low-frequency lighting will suffice for smoke
small number of particle systems to render at one instant of time (~10)
I am aiming at DX 11 level hardware, so modern GPU features can be used

Idea

Since I am still a baby in graphics world, lots of stuff that's conceptually easy is still unobvious to me. I'm getting better with this but as with everything it takes a lot of time to get used to some patterns and ways of thinking. After that time, suddenly all dots start to connect and ideas relate to each other very naturally. Before I reach that state, apart from doing project and implementation work, I need to read a lot to absorb correct ways of thinking. Idea for the method came when I was reading Aras' 2012 Theory for Forward Rendering. From this article:

All the deferred lighting/shading approaches are essentially caching schemes. We cache some amount of surface information, in screen space, in order to avoid fetching or computing the same information over and over again, while applying lights one by one in traditional forward rendering.
Now, the “cache in screenspace” leads to disadvantages like “it’s really hard to do transparencies” - since with transparencies you do not have one point in space mapping to one pixel on screen anymore. There’s no reason why caching should be done in screen space however; lighting could also just as well be computed in texture space (like some skin rendering techniques, but they do it for a different reason), world space (voxels?), etc.

Idea of storing light in world space is not difficult in itself and it was not the first time I read about it, but due to yet imperfect thinking patterns I needed such stimulus to acknowledge that it is one possible solution to the problem and may be worth exploring.

(Side note: I just realized that if I were smarter and from the two examples he mentioned (texture space and world space) followed the former instead of the latter, maybe I would come up with Creative Assembly method, since that's exactly what they're doing - storing lighting in texture space.)

(Side side note: I was obviously kidding. I have a long way to reach their level.)

I decided to try sampling and storing lighting in a uniform 3D grid. Of course a grid covering the entire scene was out of the question. But given the assumption of a small number of objects that require special treatment, maybe computing a small grid for every such object (i.e. covering its bounding box) would not be infeasible?

Even with a small number of small grids, they still need to be rather sparse. That, and the simple fact that grid points don't lie on the object surface, but rather represent lighting environments in their small neighbourhoods, means that light needs to be captured into some kind of light probes instead of surface-orientation-dependent values. First I thought of spherical harmonics, but eventually pursued a simpler approach.

I started with projecting light onto six world-space directions corresponding to coordinate axes: X+, X-, Y+, Y-, Z+, Z-. If you do a dot product of light direction vector with (1, 0, 0) vector, the result will tell you how much light has in common with positive X axis (very informally speaking). Do that for all axes, and...

For example, let's take L = (1, 2, 3) light direction vector. (What? It's not normalized? Oh, well. Just assume that its length represents light intensity. It needs to be incorporated anyway). For six axes we obtain:

dot((1, 2, 3), X+) = 1
dot((1, 2, 3), X-) = -1
dot((1, 2, 3), Y+) = 2
dot((1, 2, 3), Y-) = -2
dot((1, 2, 3), Z+) = 3
dot((1, 2, 3), Z-) = -3

These might be light contributions for six axes. But some of them are negative and it's hard to imagine that light might be "decontributed". Everything less than zero is just a no-light, so we need to clamp negative values to zero.
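The projection and clamping step can be sketched on the CPU in a few lines. This is an illustrative Python sketch rather than the shader code; the names `AXES` and `project_light` are made up for the example.

```python
# The six world-space axes, in the order X+, X-, Y+, Y-, Z+, Z-.
AXES = [
    (1, 0, 0), (-1, 0, 0),
    (0, 1, 0), (0, -1, 0),
    (0, 0, 1), (0, 0, -1),
]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_light(light_dir):
    """Return the six per-axis contributions of one light, clamped to zero."""
    return [max(0.0, float(dot(light_dir, axis))) for axis in AXES]

# The unnormalized example vector from the text: its length carries intensity.
print(project_light((1, 2, 3)))  # [1.0, 0.0, 2.0, 0.0, 3.0, 0.0]
```

Only the axes that the light "has something in common with" keep a non-zero value; the opposite axes are clamped to zero.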

So, how could we use it? Well, if we compute those six values for all grid points, then while rendering an object, for every pixel we could just pick the closest grid point and use the values to evaluate lighting. How? Simply by taking the dot product of each axis with the pixel's normal vector, multiplying the results by the values corresponding to each axis and summing them. Let's take it one step at a time:

Assume normal N = (0.8, 0.535, 0.267) (roughly normalized)

dot((0.8, 0.535, 0.267), X+) = 0.8
dot((0.8, 0.535, 0.267), X-) = -0.8
dot((0.8, 0.535, 0.267), Y+) = 0.535
dot((0.8, 0.535, 0.267), Y-) = -0.535
dot((0.8, 0.535, 0.267), Z+) = 0.267
dot((0.8, 0.535, 0.267), Z-) = -0.267

Once again, this tells us how much the normal vector has in common with each of the axes. Again, as you are going to see in a second, negative values don't make sense here either, so just clamp them to zero.

Now, as said before, multiply the results with corresponding light-strength-per-axis values and sum them (I only write out non-zero terms):

dot(N, X+) * 1 + dot(N, Y+) * 2 + dot(N, Z+) * 3 = 0.8 * 1 + 0.535 * 2 + 0.267 * 3 = 2.671

That's final light intensity on the pixel. Let's do a sanity check. Is it a reasonable value? Well, initial intensity was sqrt(3*3 + 2*2 + 1*1) = 3.74. Since both N and L point into the same octant, but they also differ quite substantially, the value might be OK. I think you can now observe how this procedure behaves for other values. If you take a normal that points directly away from the light direction, the only non-zero terms of the dot(N, AXIS) sequence will occur on axes for which dot(L, AXIS) was negative (i.e. zero after clamping). So final lighting, as expected, will be zero. It is also easy to observe that if you take N = L (normalized), you will get full light intensity.

However, if you take a normal perpendicular to the light vector, final light intensity might not necessarily be zero. L had a non-zero dot product with the X+ axis. It is easy to find a vector perpendicular to L which at the same time also has a non-zero dot product with the X+ axis, which means a pixel with such a normal vector will also receive some contribution from the light. This is where we see it is only an approximation. It's not that bad though, since accidentally we obtained something similar to the "half-Lambert" lighting model.
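The cases discussed above (the example normal, N = L, a normal facing directly away, and a perpendicular normal) can be checked numerically. The following is a small Python sketch of the evaluation step, not the actual pixel shader; `evaluate` and the perpendicular vector (2, -1, 0)/sqrt(5) are choices made for this example.

```python
import math

# X+, X-, Y+, Y-, Z+, Z-
AXES = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def clamped_projection(v):
    return [max(0.0, float(dot(v, axis))) for axis in AXES]

def evaluate(light_dir, normal):
    """Sum of (light stored on an axis) * (how much the normal faces that axis)."""
    stored = clamped_projection(light_dir)
    weights = clamped_projection(normal)
    return sum(s * w for s, w in zip(stored, weights))

LIGHT = (1, 2, 3)
length = math.sqrt(dot(LIGHT, LIGHT))                 # about 3.742
aligned = tuple(c / length for c in LIGHT)            # N = L, normalized
opposite = tuple(-c for c in aligned)                 # N = -L
perp = (2 / math.sqrt(5), -1 / math.sqrt(5), 0.0)     # dot(LIGHT, perp) == 0

print(evaluate(LIGHT, (0.8, 0.535, 0.267)))  # about 2.671
print(evaluate(LIGHT, aligned))              # about 3.742, the full intensity
print(evaluate(LIGHT, opposite))             # 0.0
print(evaluate(LIGHT, perp))                 # about 0.894 although N is perpendicular to L
```

The last line is the approximation error in action: a true Lambert term would be zero there.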

There are few more things we need to consider to make the idea usable:

Colors - up until now we only had light intensity. We want to store colors, which means that for every axis we will store R, G, B components of light color. To obtain these three values, we will simply multiply light color with the dot product of axis and light direction. We also need to incorporate intensity (affected by power of the light, attenuation etc.). As mentioned before we actually did it already - the example light direction was not normalized. So we can just premultiply light intensity into light direction or we can use normalized light direction for the dot product and then multiply it with intensity and color
Interpolation - for simplification I described pixel lighting process using closest set of axes (closest sample in 3D grid). Fortunately, process of evaluating light from axes is linear, which means we can safely interpolate between values across neighbouring samples and have smooth results
Storing multiple lights - as with usual light accumulation, it is correct to simply add values obtained for each light. So light evaluation will give six values and each will be added to corresponding axis accumulator value
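Put together, colors and accumulation of several lights might look like the sketch below. It is written in Python for illustration only and uses the second variant mentioned above (normalized direction, then multiply by intensity and color). The names `new_probe`, `bake_light` and `shade` are hypothetical, and `to_light` is assumed to follow the same convention as L in the earlier example.

```python
# X+, X-, Y+, Y-, Z+, Z-
AXES = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def new_probe():
    """Six axes, each holding an [r, g, b] accumulator."""
    return [[0.0, 0.0, 0.0] for _ in AXES]

def bake_light(probe, to_light, color, intensity):
    """Add one light to the probe. to_light is assumed to be normalized."""
    for axis_index, axis in enumerate(AXES):
        weight = max(0.0, float(dot(to_light, axis))) * intensity
        for channel in range(3):
            probe[axis_index][channel] += weight * color[channel]

def shade(probe, normal):
    """Evaluate the probe for a surface normal; returns [r, g, b]."""
    result = [0.0, 0.0, 0.0]
    for axis_index, axis in enumerate(AXES):
        weight = max(0.0, float(dot(normal, axis)))
        for channel in range(3):
            result[channel] += weight * probe[axis_index][channel]
    return result

probe = new_probe()
bake_light(probe, (1.0, 0.0, 0.0), (1.0, 0.0, 0.0), 2.0)  # red light, intensity 2
bake_light(probe, (0.0, 1.0, 0.0), (0.0, 1.0, 0.0), 1.0)  # green light, intensity 1
print(shade(probe, (1.0, 0.0, 0.0)))  # [2.0, 0.0, 0.0]
print(shade(probe, (0.0, 1.0, 0.0)))  # [0.0, 1.0, 0.0]
```

Because `shade` is a weighted sum of the stored values, blending two probes and then shading gives the same result as shading both and blending, which is why hardware texture filtering between grid samples is safe.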

Since we already know how to store multiple lights, I'd like to go back for a second to accuracy of this solution.
Representation of lighting using a finite set of axes is more accurate when the light direction is closely aligned with one of the axes. The further the light direction is from any of the axes, the less accurate it gets. Consider two lights with exactly opposite directions (1, 0, 0) and (-1, 0, 0), both with intensity equal to 1. After clamping and accumulation, the six axis values will be 1, 1, 0, 0, 0, 0. So we had lights from two directions along X axis and now we have non-zero light values only on X+ and X- axes. Reasonable.
Now, let's rotate those lights a bit. Their directions are now: (0.577, 0.577, 0.577) and (-0.577, -0.577, -0.577). After storing them into six axes, all axis values are 0.577. We have lost all directional information about lighting. That's how it (not) works.
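Both configurations are easy to reproduce. The Python sketch below (illustrative only, with a made-up `accumulate` helper) stores each pair of lights with clamping applied and prints the accumulated axis values.

```python
# X+, X-, Y+, Y-, Z+, Z-
AXES = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def accumulate(lights):
    """Sum the clamped per-axis contributions of several lights."""
    total = [0.0] * len(AXES)
    for light in lights:
        for i, axis in enumerate(AXES):
            total[i] += max(0.0, float(dot(light, axis)))
    return total

# Two opposite lights along X: the result still says "light along X only".
print(accumulate([(1, 0, 0), (-1, 0, 0)]))  # [1.0, 1.0, 0.0, 0.0, 0.0, 0.0]

# The same pair rotated onto the cube diagonal: every axis gets the same value.
d = 0.577
print(accumulate([(d, d, d), (-d, -d, -d)]))  # [0.577, 0.577, 0.577, 0.577, 0.577, 0.577]
```

In the second case any normal will be lit the same way as its mirror image, so the probe can no longer tell where the light came from.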

Implementation

The most natural storage method for a uniform 3D grid is a 3D texture and that's what I am using. Since we have six axes, each with R, G, B color components, I started with six R8G8B8A8 textures with the alpha component unused (I don't care about HDR for now, so 8 bits per component is enough). After the light textures have been filled with values, it is straightforward to write a pixel shader that samples them with interpolation, extracts the six axis values, dots them with the pixel normal vector and sums to obtain final lighting. It is the light texture creation process ("baking" lights) where problems arise.

Obviously, I wanted to harness GPU power for this process, but how? It would be great, if I could write to entire 3D texture in one go. Straightforward pixel shader would only allow me to write to one texture slice at a time (or maybe a few: as many as there are MRTs - though I'm not sure if it could work this way - but even with a single slice I still need to access six render targets (one for each axis) at the same time anyway). Or maybe I could map slices onto a 2D texture, one next to another, and write to that in pixel shader, but I think I'd need to copy slices into a 3D texture again. Or I could use geometry shader to choose a slice to write to. All these looked overcomplicated.

If there wasn't a better way I would surely get more accurate information about the above options, but fortunately there was (or at least I think it is better!) - compute shader (CS), which can be easily made to run in a 3D grid, which makes it straightforward to map threads to points in 3D space and 3D texture cells.

Fresh meat!

So what I was going to do was to get a subset of space - a subset that will accurately cover the transparent object I want to illuminate - and calculate lighting in a uniform 3D grid of positions inside this subset of space. Size of the subset along each of axes helps determine how many threads along each axis need to be dispatched.

Since I hadn't used a CS before, to be safe and not lose too much time on debugging in an environment I'm not familiar with, I started with a CPU implementation. Once I had that working, I moved on to CS. Then, some issues came out. But let's start with looking at an example spot light shader. It is a simplified version of what I currently use.

Let's take a look at more important parts:

There are five texture variables declared that constitute what I called before a "light texture". Why not six? I said before that alpha channel of texture is not used so first obvious optimization was to squeeze sixth axis into free alpha channels.
Shader calculates light influence from a single spot light without shadows at some point in space. Since I want to have light texture for some part of space only, I need to tell the shader what that part of space is. I do it by providing it with front-bottom-left corner of the cuboid and its scale along each of axes. It is enough data to transform thread index into sample position. That's what gLightVolPos and gLightVolScale are for. Actually, single scale value should be enough, since I said it is uniform grid, but I lied a little. It's not exactly uniform. The reason is this: compute shaders are called in groups, and groups have a specific size marked in shader code (in above example: 8x8x8). In single CS dispatch I can only run groups of one given size, so when I pick a group size, I need to decide how many groups to squeeze into considered area so there is enough density of sample points. What I do, is I simply assume that my chosen group size is enough to cover some experimentally found space region (let's say: 50x50x50 in world space units). If area that I want to cover is of size 150x100x50, then it is clear how many groups in each axis to use: (3, 2, 1) = (150, 100, 50) / (50, 50, 50). However, if it's for example 170x130x50, numbers of groups along X and Y axes won't be whole. I round them up to the next whole number, but then parts of the additional groups would be wasted (they would cover space outside considered region). So I squeeze them along problematic axes to fit the region and effectively there may be different density of threads along each axis - thus, three individual scale values are needed
Code that calculates lighting in sample point is straightforward. However, there is some magic going on when accessing textures. Compute shaders can read from and write to (among others) so called Unordered Access Views (UAVs), which can be used as views ("aliases") for ordinary textures. As said before, I need to accumulate lighting into textures. To perform a "+=" operation on a UAV entry, I need both read and write access at the same time. You can have read-only textures pretty much without restrictions in CS, but it turns out there is a special requirement for textures that need write access. They have to be single-component. So while I create textures as R8G8B8A8, in CS I use UAV with R32_UINT format. It means I can only access texture value as 32-bit integer, so I have to convert it to float4 on reading, add light influence, and convert the sum back to int before storing. More info in lunaproject source code comments.

Addition of vector of 0.5f in:

float3 samplePos = gLightVolPos + (float3(DispatchThreadID) +
     float3(0.5f, 0.5f, 0.5f)) * gLightVolScale;

is needed to obtain sample positions in grid cell centers instead of corners
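The group-count and per-axis scale arithmetic described above can be sketched on the CPU side as follows. This is a Python illustration under the article's example numbers (8x8x8 groups, 50 world units per group); `REGION_PER_GROUP`, `plan_dispatch` and `sample_position` are hypothetical names, and the returned scale plays the role of gLightVolScale.

```python
import math

GROUP_SIZE = 8            # threads per group along each axis (8x8x8)
REGION_PER_GROUP = 50.0   # world units one group is assumed to cover

def plan_dispatch(region_size):
    """Return (groups, scale) per axis for a box of the given world-space size."""
    # Round the group count up so the whole region is covered.
    groups = [math.ceil(extent / REGION_PER_GROUP) for extent in region_size]
    # Squeeze the rounded-up thread count so it spans exactly the region.
    scale = [extent / (count * GROUP_SIZE)
             for extent, count in zip(region_size, groups)]
    return groups, scale

def sample_position(volume_pos, scale, thread_id):
    """Cell-centre position for a thread, mirroring the shader expression."""
    return [p + (t + 0.5) * s for p, t, s in zip(volume_pos, thread_id, scale)]

groups, scale = plan_dispatch((170.0, 130.0, 50.0))
print(groups)  # [4, 3, 1]
print(scale)   # X: 170/32 = 5.3125, Y: 130/24, Z: 50/8 = 6.25

# The last thread along X (index 31) still samples inside the 170-unit region.
print(sample_position((0.0, 0.0, 0.0), scale, (31, 0, 0))[0])  # 167.34375
```

With a 150x100x50 region the division is exact, all three scales are equal (6.25), and the grid is truly uniform.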

I have similar shaders for directional light and point lights.
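The read-modify-write through a 32-bit integer view mentioned earlier boils down to unpacking four 8-bit channels, adding, and packing again. Here is a minimal Python sketch of that conversion. It assumes R sits in the lowest byte and that values are clamped and rounded to nearest; the actual shader may differ in both respects, and the function names are made up for this example.

```python
SHIFTS = (0, 8, 16, 24)  # assumed layout: R in the low byte, then G, B, A

def unpack_rgba8(packed):
    """Split a 32-bit value into four floats in [0, 1]."""
    return [((packed >> shift) & 0xFF) / 255.0 for shift in SHIFTS]

def pack_rgba8(color):
    """Clamp four floats to [0, 1] and pack them into a 32-bit value."""
    packed = 0
    for shift, value in zip(SHIFTS, color):
        byte = int(round(min(max(value, 0.0), 1.0) * 255.0))
        packed |= byte << shift
    return packed

def accumulate(packed, contribution):
    """Emulate 'texel += contribution' on a texel seen as one 32-bit integer."""
    current = unpack_rgba8(packed)
    return pack_rgba8([c + d for c, d in zip(current, contribution)])

texel = pack_rgba8([0.2, 0.0, 0.0, 0.0])
texel = accumulate(texel, [0.2, 0.0, 0.0, 1.0])
print(hex(texel))  # 0xff000066: alpha saturated at 255, red at 102 (about 0.4)
```

Note that with 8 bits per channel the sum saturates at 1.0, which is consistent with the earlier remark that HDR is not handled for now.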

As for scheduling light volume baking in rendering process: since I currently store all my shadow maps in memory anyway, I bake light textures just before using them (drawing transparent object). But if I wanted to reuse shadow map memory, light shader for a given light could be applied immediately after rendering light's shadow map - the same way it can be done in regular deferred rendering in order to avoid storing shadow maps. In this setting, however, you need to store multiple light textures for all active transparent objects. Light textures are quite low resolution and hopefully you have fewer transparent objects than lights, so it may be a win anyway.

Useful optimization described below that is based on storing all light textures in one big texture can be used with this setting and has similar storage requirements.

Optimizations

Smaller set of axes

Six axes seems to be a pretty good tradeoff between quality and performance, but there's nothing stopping you from using a different set of axes. I tried using four axes:

 const float3 axis0 = float3(0, 1, 0);
 const float3 axis1 = float3(0, -0.341987, 0.939705);
 const float3 axis2 = float3( 0.813808, -0.341987, -0.469852);
 const float3 axis3 = float3(-0.813808, -0.341987, -0.469852);

One is straight up. The second is obtained by rotating the first one around the X axis by ~110 degrees. The last two are created from the second one by rotating it 120 and 240 degrees around the Y axis. Initially I wanted to have a symmetrical set of axes (it would be similar to a set of vectors pointing to orthocenters of tetrahedron faces) but I decided that I don't care that much about lighting coming from below and adjusted second, third and fourth axes to be more horizontal.

Difference in quality was clearly visible, though I think it can still be useful for low-quality settings. As for performance, I had varying results, differing between application runs - sometimes the four axes method was ~25% faster, and sometimes only marginally faster.

I also thoroughly explored using sets of axes that don't cover entire sphere, but only some part of it pointing toward the viewer. I tried both:

using small set of axes (three) that rotate with the camera movement and always face the viewer
using bigger set (six or eight) stationary axes and choosing on a per-frame basis only three to five that were facing the camera. Light was stored only in this reduced set of axes.

However, both approaches gave poor results, because with both rotating and switching axes, ugly flickering of lighting arose.

A slight performance disadvantage of using any axes other than coordinate axes is that you actually have to perform the dot products described in the beginning. With X, Y, Z axes dot products amount to simply taking x, y or z component of considered vector, possibly with minus sign.
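The difference between the two cases can be seen side by side in the following Python sketch (illustrative names, not the shader code). The generic version needs a full dot product per axis, while the coordinate-axis version only picks components and flips signs.

```python
SIX_AXES = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_generic(v, axes):
    """Works for any axis set: one dot product and one clamp per axis."""
    return [max(0.0, dot(v, axis)) for axis in axes]

def project_coordinate_axes(v):
    """Shortcut for X+, X-, Y+, Y-, Z+, Z-: component selection only."""
    x, y, z = v
    return [max(0.0, x), max(0.0, -x),
            max(0.0, y), max(0.0, -y),
            max(0.0, z), max(0.0, -z)]

v = (0.8, -0.535, 0.267)
print(project_generic(v, SIX_AXES))   # [0.8, 0.0, 0.0, 0.535, 0.267, 0.0]
print(project_coordinate_axes(v))     # same values, no multiplications
```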

Batching lights

Baking multiple lights can obviously be packed into a single shader dispatch. Currently I bake all point lights in a single call, but use separate calls for spot lights, since they cast shadows. The problem with batching lights that cast shadows is that the shader needs to access an array of texture variables in the loop that processes all lights. Such accesses unfortunately cannot be done with a dynamic array index - it has to be known at compile time. Because of that, the loop that processes lights has to be unrolled, which is very inconvenient when you process fewer lights than a predefined maximum. I'm not sure if using actual texture arrays (not just arrays of texture variables) wouldn't solve this problem.

And again, when batching lights that cast shadows you can't reuse shadow maps.

Packing light textures

If you have a lot of transparent objects, having to bake all lights separately to all objects' textures can become prohibitive in terms of performance. One method to deal with it, is to pack all light textures into one big texture and do single light baking pass on this. It is in some ways problematic but definitely doable.

Now, instead of determining how many thread groups to run along each axis to cover object's bounding box, you have to pack those bounding boxes one next to other and find out how many threads to run to cover all consecutive regions. For simplicity, I use 3D texture extended along X axis and allocate space for objects along this axis only. Now, the number of thread groups to run along X is sum of widths of all boxes. The number of thread groups to run along Y and Z is maximums of boxes' heights and depths, which is not optimal when regions have substantially varying sizes. Better solution would be to pack those boxes into big box of smallest possible volume, but this would probably mean some NP-complete packing along all three axes (I'm just guessing, I'm not really algorithm guy. Maybe it's only N^3 or something like that :)). Approximate solutions would surely be possible, but for now I assume transparent objects of similar sizes so there is no big waste in simply putting them one next to another along single axis.
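The side-by-side layout along X can be sketched like this. It is a Python illustration with a hypothetical `pack_regions` helper; each region is described by its thread-group counts along X, Y and Z.

```python
def pack_regions(region_groups):
    """Lay regions out one after another along X.

    region_groups: list of (gx, gy, gz) thread-group counts, one per object.
    Returns (dispatch_size, x_offsets), with offsets measured in groups.
    """
    x_offsets = []
    total_x = 0
    for gx, _, _ in region_groups:
        x_offsets.append(total_x)
        total_x += gx
    max_y = max(g[1] for g in region_groups)
    max_z = max(g[2] for g in region_groups)
    return (total_x, max_y, max_z), x_offsets

def wasted_groups(region_groups):
    """Groups dispatched that fall outside every region."""
    (dx, dy, dz), _ = pack_regions(region_groups)
    used = sum(gx * gy * gz for gx, gy, gz in region_groups)
    return dx * dy * dz - used

# Three equally sized regions of 2x3x2 groups (16x24x16 threads with 8x8x8 groups).
dispatch, offsets = pack_regions([(2, 3, 2)] * 3)
print(dispatch)  # (6, 3, 2)
print(offsets)   # [0, 2, 4]
print(wasted_groups([(2, 3, 2)] * 3))          # 0
print(wasted_groups([(2, 3, 2), (1, 1, 1)]))   # 11 of 18 groups do nothing useful
```

The last line shows why this simple layout only pays off when the regions have similar sizes.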

There are few more little obstacles:

As seen before, shader needs region position and scale to determine position of sampling point. Now, there are no single values of position and scale, but instead they differ, depending on which region (which object) given thread group deals with. Array of positions and scales needs to be supplied, and it needs to be indexed by the X component of the thread group ID (SV_GroupID.x; note that SV_GroupIndex is the flattened thread index within a group and has no components). I use dynamic 1D textures for this.
In single object method, thread indices started from (0, 0, 0). Now, only the part of them that covers the first packed region starts with (0, 0, 0). Indices for subsequent regions start from (NumberOfThreadsForObjectsThusFar, 0, 0). But thread indices are needed to calculate sample positions across the region and they are assumed to start from (0, 0, 0) for a bottom-left-front region corner:
```
float3 samplePos = gLightVolPos + (float3(DispatchThreadID) +
     float3(0.5f, 0.5f, 0.5f)) * gLightVolScale;
```
Correction is needed for all thread groups but first, so another 1D texture is used that stores offsets along X axis needed to obtain correct thread indices
Slight correction has to be done to object rendering shader to account for that now it needs to access only part of light texture
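The per-group lookup and the thread index correction can be emulated on the CPU as below. This is a Python sketch under assumptions: the two lists returned by `build_group_tables` stand in for the 1D textures, regions are plain dictionaries, and all names are hypothetical.

```python
GROUP_SIZE = 8  # threads per group along X

def build_group_tables(region_groups_x):
    """For every thread group along X, record the region it serves and the
    X thread index at which that region starts."""
    region_of_group = []
    thread_offset_of_group = []
    start_group = 0
    for region_index, groups_x in enumerate(region_groups_x):
        for _ in range(groups_x):
            region_of_group.append(region_index)
            thread_offset_of_group.append(start_group * GROUP_SIZE)
        start_group += groups_x
    return region_of_group, thread_offset_of_group

def sample_position(dispatch_thread_id, regions, tables):
    """Sample position for a global thread index in the packed layout."""
    region_of_group, thread_offset_of_group = tables
    group_x = dispatch_thread_id[0] // GROUP_SIZE      # group ID along X
    region = regions[region_of_group[group_x]]
    # Shift the X index so that it starts from 0 inside its own region.
    local = (dispatch_thread_id[0] - thread_offset_of_group[group_x],
             dispatch_thread_id[1],
             dispatch_thread_id[2])
    return [p + (t + 0.5) * s
            for p, t, s in zip(region["pos"], local, region["scale"])]

regions = [
    {"pos": (0.0, 0.0, 0.0), "scale": (1.0, 1.0, 1.0)},
    {"pos": (100.0, 0.0, 0.0), "scale": (2.0, 2.0, 2.0)},
]
tables = build_group_tables([2, 2])   # both regions are two groups wide
print(tables[0])  # [0, 0, 1, 1]
print(tables[1])  # [0, 0, 16, 16]
print(sample_position((16, 0, 0), regions, tables))  # [101.0, 1.0, 1.0]
```

Global thread 16 is the first thread of the second region, so after the correction it lands half a cell away from that region's corner.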

Disadvantage of this optimization is that CPU per-object light culling is now impossible (though you could still do a "per-all-transparent-objects" light culling and only send to shader those that affect any of objects. Since we assumed there is a moderate amount of transparent objects in a scene, there should still be a lot of lights that don't affect any of them).

Varying light baking resolution

A quite obvious idea to control both performance and quality is to make baking resolution depend on how far an object is from the viewer. I haven't done it yet, and I suspect it could suffer from light shimmering.

Quality

Below you can see comparison between forward per-pixel method and 3D light texture method. Forward per-vertex method is much, much worse so I don't include it in comparison (even though instead of direct per-vertex lighting I use method of storing lighting in HL2 basis per-vertex and sample it per-pixel - as described in Practical particle lighting - so I have a rough approximation of per-pixel lighting).

In presented scenes light texture of size 16x24x16 was used.

First setting shows smoke lit by two lights - green point light very close to particles and red spot light in front of smoke casting shadow of object.

Forward rendering method

3D light texture method

Next screenshots show smoke partially lit by single directional light.

Forward rendering method

3D light texture method

Performance

Since this method has been developed as an experiment and not for particular game, it's hard to test it in a real, "game-like" scenario. I have left some modifications and optimizations for later, when I actually want to use it, so I have specific needs and frame time limit to fit in. Also, I spent much more time on optimizing light texture method than forward rendering method. Thus, following comparison should only be considered as an approximation of how methods could perform in game. Below I describe exact circumstances and simplifications concerning both algorithms, that may be very different in actual game.

Both methods:

The same set of lights - one directional light with CSM, multiple point lights (without shadows), multiple spot lights (with and without shadows)
No CPU light culling at all

Forward rendering method:

Everything is done in single pass with heavy pixel shader.
There are bounds on maximum number of point lights and spot lights without shadows (128 for both), since their data is stored in arrays in constant buffers. They are processed in loops
There is harder restriction on number of spot lights with shadows - 16. That's because loop that processes them needs to be unrolled for the reason mentioned in Batching lights section

Light texture method:

Here multiple passes are not so expensive, so I have one pass over light texture for directional light and one for every spot light. All point lights however are accumulated in a single pass.
Six coordinate axes are used
Method for packing regions into one big light texture described in Packing light textures section is used
Compute shader is dispatched with 6x3x2 = 36 groups of size 8x8x8. Each particle system gets 16x24x16 part of the light texture

Test scene contains three particle systems moderately close to the viewer. All are lit by a directional light. Additionally, each one is lit by a couple of point lights and one spot light casting shadows.

Performance test scene with light texture method

Results on Radeon HD 5670:

Forward rendering method: 9.47ms
Light texture method: 1.347ms (light baking: 0.311ms, rendering: 1.016ms)

As you can see, performance difference is huge. What is promising in light texture method, is that with increasing number of lights, only light baking time will go up. Rendering time in both methods mostly depends on how close an object is to the viewer, which determines how many pixels need to be processed.

Other considerations

Something I haven't done and what could really make the above comparison favour forward rendering method is to calculate forward lighting "per-domain" - use tessellation on particles with density dependent on particle's distance from camera and do a per-vertex lighting on those tessellated vertices - described in Practical particle lighting with great quality and performance results. Such method would still have the same disadvantages like having to store all shadow maps in memory (I feel like I'm writing about this issue for the hundredth time since previous post).

I have yet to find out how to tackle the problem of amount of lighting that object receives. I hadn't noticed it before I started messing around with smaller sets of axes. Suddenly, objects became darker. After a while I realized it is reasonable - you store lighting by projecting it on a set of axes and when rendering you just sum influence from all of them. If you use fewer axes, there will on average be less lighting in a single light probe. With six axes, amount of lighting is comparable to that with forward rendering method, but it's very possible that it's not exact either. I haven't yet wrapped my head around it completely, but I think maybe simple normalization factors dependent on the number of axes and their "density" might solve the problem. I have to do more testing and thinking. I welcome any insights from you.
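One way to see the darkening is to store a unit-intensity light and read it back with a normal that faces the light exactly, so that a perfect representation would return 1. The Python sketch below does this for the six coordinate axes and for the four-axis set listed earlier. It is only a numerical illustration with a made-up `reconstruct` helper and does not attempt to derive the normalization factors.

```python
SIX_AXES = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
FOUR_AXES = [
    (0.0, 1.0, 0.0),
    (0.0, -0.341987, 0.939705),
    (0.813808, -0.341987, -0.469852),
    (-0.813808, -0.341987, -0.469852),
]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reconstruct(axes, light_dir, normal):
    """Store light_dir on the axes, then evaluate it for the given normal."""
    return sum(max(0.0, dot(light_dir, axis)) * max(0.0, dot(normal, axis))
               for axis in axes)

for direction in [(0, 1, 0), (1, 0, 0), (0, -1, 0)]:
    six = reconstruct(SIX_AXES, direction, direction)
    four = reconstruct(FOUR_AXES, direction, direction)
    print(direction, six, round(four, 3))
# Six axes return 1 for all three directions. The four-axis set returns 1 only
# for the light from straight above, roughly 0.66 for the light along X and
# roughly 0.35 for the light from straight below.
```

So with the four-axis set the loss depends strongly on where the light comes from, which suggests a single global scale factor would at best be a compromise.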

Final thoughts

Due to my lack of experience in big game projects, I can't tell if this or similar method could possibly be used in some real-world setting. Nevertheless, I am satisfied with results - method meets the requirements. While sacrificing some quality, it allows having particle systems' lighting consistent with opaque objects and it does it fast enough.

Also, it made me learn a lot and gave a topic to start my blog with. :)

Since it is incorporated into engine, I don't have any small sample with described method in action to post so you could easily compile and run it yourself, however I'd be happy to answer questions and post more code if someone's interested.


Sunday, August 26, 2012
