Showing posts with label performance. Show all posts

Sunday, 1 March 2020

How we made particles twice as fast through cache optimisation

Cache optimisation is one of the most hardcore, low-level topics in game programming. It's also a topic that I haven't been very successful with myself: I tried optimising for cache efficiency a few times but never managed to achieve a measurable improvement in those specific cases. That's why I was quite intrigued when our colleague Jeroen D. Stout (famous as the developer of the experimental indie game Dinner Date) managed to make our particles twice as fast. What's the trick? Jeroen was kind enough to tell me and today I'll share his approach. Since it's such a hardcore topic, I'll start by giving a general overview of what cache optimisation is in the first place.



Let's have a look at computer memory, or RAM. Computers these days have gigabytes of very fast memory that's used to store whatever the game or program needs to use on the fly. But is that memory truly so fast? Physically, it's a separate part of the computer. Whenever the processor needs to get something from memory, it needs to be transported to the CPU. That only takes a tiny bit of time, but a modern processor can perform billions of operations per second. At those kinds of speeds, a tiny bit of time is actually a lot of operations that the processor could have done during that time. Instead, it's waiting for something to come out of memory.

This is where cache comes in. Cache is a small amount of even faster memory that sits directly on the processor and is thus a lot faster to access than normal memory. So whenever something is already in cache, the processor doesn't have to wait for it (or at least not as long) and can keep working. According to this article, cache can be up to 27 times faster than main memory.



As programmers we generally don't have direct control over what's in cache: the processor tries to utilise the cache as efficiently as possible of its own volition. However, depending on how we access memory, we can make it a lot easier for the processor to use cache efficiently. This is where cache optimisation comes in.

The goal of cache optimisation is to structure our code in such a way that we use cache as much as possible and need to wait for memory as little as possible. Whenever the processor needs to get something from memory and it's not already available in cache, that's called a cache miss. Our goal is to avoid cache misses as much as possible.

To be able to avoid cache misses, we need to know a little bit about how cache works. I'm no specialist in this field, but we only need a small bit of knowledge to already understand how a lot of cache optimisations work. These are the basic ideas:
  • Data is transferred from memory to cache in blocks. So whenever you're using something, the memory directly around that is likely also already available in cache.
  • Cache is often not cleared immediately. Things you've just used have a good chance of still being available in cache.
  • The processor tries to predict what you'll use next and get that into cache ahead of time. The more predictable your memory access, the better cache will work.

These three concepts lead to a general rule for writing code that's efficient for cache: memory coherence. If you write your code in such a way that it doesn't bounce all over memory all the time, then it will likely have better performance.
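To make memory coherence concrete, here's a small standalone example of my own (not from the original post): both functions compute the same sum over the same array, but the first walks memory sequentially while the second hops a whole row ahead on every step. On typical hardware the sequential walk is considerably faster, purely because each fetched cache line gets fully used before moving on.

```cpp
#include <cstddef>
#include <vector>

// Sums a rows-by-cols matrix stored in one flat array, walking memory
// sequentially: consecutive addresses, cache-friendly.
long long sumRowMajor(const std::vector<int>& data,
                      std::size_t rows, std::size_t cols)
{
   long long sum = 0;
   for (std::size_t r = 0; r < rows; ++r)
      for (std::size_t c = 0; c < cols; ++c)
         sum += data[r * cols + c]; // consecutive addresses
   return sum;
}

// Same sum, but jumping a full row ahead on every step: each access
// likely lands on a different cache line, so much of each fetched line
// goes unused before it's evicted.
long long sumColumnMajor(const std::vector<int>& data,
                         std::size_t rows, std::size_t cols)
{
   long long sum = 0;
   for (std::size_t c = 0; c < cols; ++c)
      for (std::size_t r = 0; r < rows; ++r)
         sum += data[r * cols + c]; // stride of cols * sizeof(int) bytes
   return sum;
}
```

Both return the same result; the only difference is the order in which memory is touched, which is exactly what memory coherence is about.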

Now that the basics are clear, let's have a look at the specific optimisation that Jeroen did that halved the CPU usage of our particles.



To avoid dumping huge amounts of code here, I'm going to work with a highly simplified particle system. Please ignore any other inefficiencies: I'm focusing purely on the cache misses here. Let's have a look at a very naive approach to implementing a particle system:

struct Particle
{
   Vector3D position;
};

class ParticleSystem
{
   Vector3D speed; // velocity applied to all particles in this system
   std::vector<Particle*> particles;

   void update()
   {
      for (int i = 0; i < particles.size(); ++i)
      {
         particles[i]->position += speed;
         if (shouldDestroy(particles[i]))
         {
            delete particles[i];
            particles.erase(particles.begin() + i);
            --i;
         }
      }
      while (shouldAddParticle())
      {
         particles.push_back(new Particle{});
      }
   }
};

This code is simple enough, but if we look at it from a cache perspective, it's highly inefficient. The problem is that every time we create a new particle with new, it's placed in a random free spot somewhere in memory. That means that all the particles will be spread out over memory. The line particles[i]->position += speed; is now very slow, because it will result in a cache miss most of the time. The time the processor spends waiting for the particle's position to be read from memory is much greater than the time that simple addition takes.



I knew that this would give performance problems, so I immediately built the particles in a more efficient way. If we know the maximum number of particles a specific particle system can contain, then we can reserve a block of memory for that on startup and use that instead of calling new all the time.

This results in a bit more complex code, since we now need to manage that block of memory and create objects inside it using placement new. In C++, placement new allows us to call a constructor the normal way, but use a block of memory that we already have. This is what the resulting code can look like:

struct Particle
{
   Vector3D position;
};

class ParticleSystem
{
   Vector3D speed; // velocity applied to all particles in this system
   unsigned int maxCount;
   unsigned char* particleMemory;
   std::vector<Particle*> particles;
   int numUsedParticles;

   ParticleSystem(unsigned int maxCount):
      maxCount(maxCount),
      numUsedParticles(0)
   {
      particleMemory = new unsigned char[maxCount * sizeof(Particle)];
      for (unsigned int i = 0; i < maxCount; ++i)
      {
         particles.push_back(reinterpret_cast<Particle*>(
            particleMemory + i * sizeof(Particle)));
      }
   }

   ~ParticleSystem()
   {
      delete [] particleMemory;
   }

   void update()
   {
      for (int i = 0; i < numUsedParticles; ++i)
      {
         particles[i]->position += speed;
         if (shouldDestroy(particles[i]))
         {
            //Call the destructor without releasing memory
            particles[i]->~Particle();

            // Swap to place the dead particle in the last position
            if (i < numUsedParticles - 1)
               std::swap(particles[i], particles[numUsedParticles - 1]);

            --numUsedParticles;
            --i;
         }
      }
      while (shouldAddParticle() && bufferNotFull())
      {
         //Placement new: constructor without requesting memory
         new(particles[numUsedParticles]) Particle{};
         ++numUsedParticles;
      }
   }
};

Now all the particles are close together in memory and there should be far fewer cache misses when iterating over them. Since I built it this way right away, I'm not sure how much performance I actually gained, but I assumed it would be pretty efficient this way.

Nope.

Still lots of cache misses.

In comes Jeroen D. Stout, cache optimiser extraordinaire. He wasn't impressed by my attempts at cache optimisation, since he saw in the profiler that the line particles[i]->position += speed; was still taking a disproportionate amount of time, indicating cache misses there.

It took me a while to realise why this is, but the problem is in the swap-line. Whenever a particle is removed, it's swapped with the last one to avoid moving all particles after it one forward. The result however is that as particles are removed, the order in which we go through the particles becomes very random very quickly.



The particles in our example here are really small: just a single Vector3D, so 12 bytes per particle. This means that even if we go through the particles in random order, we might occasionally still be staying in the same cache line. But a real particle in the Secret New Game (tm) that we're developing at Ronimo has much more data, like speed, scale, orientation, rotation speed, vibration, colour, and more. A real particle in our engine is around 130 bytes. Now imagine a particle system with 100 particles in it. That's a block of memory of 13,000 bytes. Hopping through that in random order is much more problematic for cache!

Thus, Jeroen set out to make it so that updating particles goes through memory linearly, not jumping around in memory at all anymore.

Jeroen's idea was to not have any pointers to the particles anymore. Instead we have that big block of memory with all the particles in it, and we store the indices of the first and last currently active particles in it. If a particle dies, we update the range of living particles and mark the particle as dead. If the particle happens to be somewhere in the middle of the list of living particles then we simply leave it there and skip it while updating.

Whenever we create a new particle, we just use the next bit of memory and increment the index of the last living particle. A tricky part here is what to do if we've reached the end of the buffer. For this we consider the memory to be a ring buffer: once we reach the end, we continue at the start, where there's room since those particles will have died by now. (Or, if there's no room there, then we don't create a particle since we don't have room for it.)

The code for this isn't that complicated, but it's a bit too much to post here. Instead, I'm just going to post this scheme:



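In code, the scheme above could look something like the following minimal sketch. To be clear: all names here (RingParticles, spawn, and so on) are invented for this illustration, not taken from the Ronitech code.

```cpp
#include <cstddef>
#include <vector>

struct Particle
{
   float life;
   bool alive;
};

class RingParticles
{
public:
   explicit RingParticles(std::size_t capacity): buffer(capacity) {}

   // Use the slot after the last living particle, or refuse if the
   // ring is full (the next slot is still occupied by 'first').
   bool spawn()
   {
      std::size_t next = (last + 1) % buffer.size();
      if (count > 0 && next == first)
         return false; // no room: skip this particle
      if (count == 0)
         first = last = 0;
      else
         last = next;
      buffer[last] = Particle{1.0f, true};
      ++count;
      return true;
   }

   // Walk linearly from 'first' to 'last', skipping dead particles.
   void update(float dt)
   {
      if (count == 0)
         return;
      for (std::size_t i = first; ; i = (i + 1) % buffer.size())
      {
         if (buffer[i].alive)
         {
            buffer[i].life -= dt;
            if (buffer[i].life <= 0.0f)
            {
               buffer[i].alive = false; // leave it in place, just mark it
               --count;
            }
         }
         if (i == last)
            break;
      }
      // Shrink the living range past dead particles at both ends.
      while (count > 0 && !buffer[first].alive)
         first = (first + 1) % buffer.size();
      while (count > 0 && !buffer[last].alive)
         last = (last + buffer.size() - 1) % buffer.size();
   }

   std::size_t aliveCount() const { return count; }

private:
   std::vector<Particle> buffer;
   std::size_t first = 0, last = 0, count = 0;
};
```

The important property is in update(): it only ever steps forward through the buffer, so memory access stays predictable and linear.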
Compared to my previous version, Jeroen managed to make our particles twice as fast with this approach. That's a pretty spectacular improvement and shows just how important cache optimisation can be to performance!

Interestingly, my first hunch was that this optimisation is a bad idea, since we need to touch all the dead particles in between the living ones just to know that they're dead. Worst case, that's a lot of dead particles. In fact, in terms of computational complexity, the worst case has gone from being linear in the number of currently living particles to being linear in the total size of the buffer, living and dead particles alike. When I studied computer science at university there was a lot of emphasis on Big O notation and complexity analysis, so to me this is a big deal.

However, as was mentioned above, accessing cache can be as much as 27 times faster than accessing main memory. Depending on how many dead particles are in the middle, touching all those dead particles can take a lot less time than getting all those cache misses.

In practice, the lifetimes of particles don't vary all that much. For example, realistic particles might randomly live between 1.2 seconds and 1.6 seconds. That's a relatively small variation. In other words: if there are dead particles in between, then the ones before them will also die soon. Thus the number of dead particles that we need to process doesn't become all that big.

The big lesson this has taught me personally is that cache optimisation can matter more for performance than Big O notation, even though my teachers at university talked about Big O all the time and rarely about cache optimisation.

EDIT: Several commenters on Twitter and Reddit suggested an even simpler and faster solution: we can swap the dead particles towards the end of the list. This removes the overhead of needing to skip the dead particles and also removes the memory that's temporarily lost because dead particles are in the middle. It also simplifies the code as we don't need to handle that ring buffer anymore.
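As a sketch of that suggestion (again my own illustration, with assumed names): store the particles by value in one contiguous array and, when one dies, copy the last living particle into its slot. Unlike the pointer-swapping version from earlier, the array itself stays tightly packed, so iteration remains a linear walk through memory even though the logical order shuffles.

```cpp
#include <cstddef>
#include <vector>

struct Particle
{
   float life;
};

// Updates all living particles; numAlive marks the end of the living
// range at the front of the array.
void updateParticles(std::vector<Particle>& particles,
                     std::size_t& numAlive, float dt)
{
   for (std::size_t i = 0; i < numAlive; )
   {
      particles[i].life -= dt;
      if (particles[i].life <= 0.0f)
      {
         // Copy the last living particle into the dead one's slot.
         particles[i] = particles[numAlive - 1];
         --numAlive;
         // Don't advance i: the particle we just moved in still needs
         // its update this frame.
      }
      else
      {
         ++i;
      }
   }
}
```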

For further reading, I would like to again recommend this article, as it explains the principles behind cache lines very well and shows some clever tricks for optimising to reduce cache misses.

As a final note, I would like to also stress the unimportance of cache optimisations. Computers these days are super fast, so for the vast majority of code, it's fine if it has very poor cache behaviour. In fact, if you're making a small indie game, then it's probably fine to ignore cache misses altogether: the computer is likely fast enough to run your game anyway, or needs only simpler optimisations. As your game becomes bigger and more complex, optimisations become more important. Cache optimisations generally make code more complex and less readable, which is something we should avoid. So, use cache optimisations only where they're actually needed.

Have you ever successfully pulled off any cache optimisations? Are there any clever tricks that you used? Please share in the comments!

Sunday, 29 January 2017

The many meanings of the confusing word 'lag'

'Lag' is one of the most confusing words used in gaming: it's used to describe wildly different problems. This makes communication difficult: when a player is complaining about 'lag', what's his problem exactly? Is he talking about his internet connection, framerate or input delay? So today I'd like to list the different meanings of the word 'lag', both to reduce confusion and to explain why all of them are important. I've grouped them into three main types.

It can be difficult to know exactly what's going on, but it would already help a lot to always at least specify which of the rough groups you mean, instead of just saying 'lag'.

Note that this post is not specifically about our own games. These problems can happen in most games, although the exact impact depends on how the game has been programmed.

Internet and connectivity issues


  • High ping: ping is the time it takes for an internet packet to travel from your computer to another. In most games the gameplay still feels smooth with high ping, but you'll often start seeing oddities like getting shot by someone who isn't in view anymore, or your shots missing despite perfect aim. The faster the game, the more problematic high ping is.
  • High jitter: jitter is an often overlooked aspect of ping. Jitter is when some packets take longer than others. Jitter often causes stuttery movement in games. You might have low average ping but still experience stutters because of high jitter. (Packet loss often looks the same as an extreme type of jitter so I'm not listing it separately here.)
  • Ping spikes: this is when the connection occasionally becomes extremely slow or even drops altogether for a few seconds. In most games ping spikes result in serious problems, like your controls not working for a while, receiving a lot of damage at once at the end of the spike, or even being disconnected.

All of these problems can have many causes, like a slow internet connection or your roommate downloading a torrent over the same connection. There's one common cause that's relatively easily solved though: WiFi. WiFi increases ping, causes jitter and packet loss and, worst of all, pingspikes. If you can, never play online games over WiFi. Connect to your router with an ethernet cable instead. Check this blogpost from the League Of Legends devs to see how bad WiFi really is.

Delay between input and seeing the result


  • Input delay: the time between you pressing a button and the game actually processing it. Games usually process input only once per frame, so the higher the framerate, the lower the input delay. Drivers and the operating system will also cause a little bit of input delay, or a lot if it's a complex controller like the Kinect camera.
  • Input jitter: this is when the input delay varies. Pressing the button just before the frame update gives a much shorter delay than pressing it just after. If the game runs at 30 frames per second, this can create a difference of up to 33 milliseconds in input delay. The higher the framerate, the lower this kind of jitter will be.
  • Rendering delay: once the gameplay has processed your input, it needs to be rendered. The GPU is usually at least one frame behind, and in some cases the game itself can be as well. For example, if you set the Multithreading Mode in Awesomenauts to Higher Framerate, then the graphics are always one additional frame behind on the gameplay, causing additional input delay. Again, the higher the framerate, the smaller this effect.
  • Monitor delay: this is usually the most noticeable cause of delay between pressing a button and seeing the result. Many televisions have all kinds of clever processing to make the screen sharper and smoother, but this can add over 100ms of lag. That's a lot and can be very noticeable. When using a television you should always switch it to Game Mode to solve this. Computer monitors generally don't have this problem.
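To get a feel for how these delays stack up, here's a rough back-of-the-envelope calculation. The numbers are illustrative assumptions, not measurements: one frame of input delay, one extra frame of rendering delay, and whatever the monitor adds on top.

```cpp
// How long one frame takes, in milliseconds.
double frameTimeMs(double fps)
{
   return 1000.0 / fps;
}

// Rough worst-case delay from button press to visible pixels, assuming
// one frame of input delay and one frame of rendering delay.
double worstCaseDelayMs(double fps, double monitorDelayMs)
{
   double inputDelay = frameTimeMs(fps);     // pressed just after the frame update
   double renderingDelay = frameTimeMs(fps); // GPU running one frame behind
   return inputDelay + renderingDelay + monitorDelayMs;
}
```

With these assumptions, 30fps plus a 100ms television adds up to roughly 167ms, while 60fps on a monitor with 5ms of delay drops to about 38ms.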

Framerate problems


Low framerate is also often called "lag". Players sometimes don't realise that there are different types of low framerate. Being precise when describing your problem helps a lot if you ever ask anyone for help with how to solve your framerate problems.

  • Constant low framerate: this usually means the videocard or processor is not fast enough to run the game.
  • Occasional long stutters: the game sometimes freezes for half a second or more. This might be caused by problems reading from the harddisk, so switching to an SSD might help. In general though, a game should be programmed in such a way as to do disk reads asynchronously and thus not freeze like this, but sometimes this is really difficult to achieve.
  • Regular short stutters: every second or so the game skips a few frames. Such stutters are really short but can be very noticeable, especially during smooth camera movements. This might be caused by running something else on your computer (like a virus scan or a lot of tabs in your internet browser). It can also be caused by the game not balancing the work well enough between frames, causing some frames to take longer.

Framerate problems and input delay are both often decreased by turning off VSync. This does come at the cost of getting screen tearing though. Which you prefer is a matter of personal taste.

Friday, 29 August 2014

A simple trick for fast high quality bokeh on lights in Cello Fortress

Bokeh is an effect that pops up in a lot of new games. It is one of those effects that makes a game instantly feel a lot more "next-gen". It is also an effect that usually eats up a lot of performance. For Cello Fortress I came up with a simplified version of bokeh that looks really high quality, is very fast to render and even very easy to implement. My implementation also has some severe limitations, but I think it can work really well for many games.

Bokeh is a part of focal blur, which is also known as depth of field blur or DoF. DoF is the effect that only objects at a certain distance are sharp, while everything closer and further away is blurred. This is an effect that every lens has, and the larger the lens is, the stronger the blur is. Focal blur has been done in many games for quite a while now, but with the normal DoF rendering techniques the bokeh effect is usually lost. Bokeh means that extremely bright spots grow with the blur and appear as clearly recognisable circles (or hexagons or other shapes, depending on the shape of the lens).


Image from HDWallSource.com.

The problem with rendering bokeh in games is that a naive implementation requires high dynamic range (HDR) rendering and an extremely large number of blur samples. HDR has become quite common in games by now, but taking enough samples to get good, sharp bokeh is impractical. For bokeh as large as in the image above, you would probably need many hundreds of samples per pixel to make it look smooth. I actually tried that exact thing in Proun two years ago, as a fun little experiment.

Proun already had really high quality DoF (as I described in this blogpost). I added bokeh by simply making some pixels extremely bright. This is a quick and dirty trick that was needed because Proun only has limited fake HDR, but the effect looks pretty convincing, as you can see below in the left image. However, if the blur is increased, the bokeh becomes extremely noisy, as you can see on the right. It looks noisy even though it already uses an insane 128 samples per pixel! You can find more images of this approach in this blogpost.



This makes this naive approach unusable if strong blur is wanted. We need something smarter. The most common approach in current games seems to be to render actual polygons with a bokeh circle texture on them. To do so we need to find all the bright pixels in the image and then generate a quad for each one.

According to this article, The Witcher 2 was ahead of its time by having bokeh all the way back in 2011. The Witcher 2 did this by generating a quad for every pixel and then discarding the ones that are not blurred enough. That's a whole lot of sprites to render! Needless to say, this only worked in real-time on the very fastest PC videocards.

Most games that have bokeh these days seem to use a smarter approach, using newer shader models: first they look for all the pixels that are so bright that they would get a clear bokeh circle, and then they generate quads for this. An example of this approach can be found here. Unlike The Witcher 2's technique this creates only the bokeh itself, not the general focal blur, so with this technique the depth of field blur needs to be done separately.

Even with this clever technique the performance hit is still heavy. It requires analysing all the pixels on the screen to find the ones that are much brighter than their neighbours. It also produces temporal artefacts: if the source of the bokeh is smaller than a pixel, it will flicker. Normally this is not much of a problem because it is only a pixel, but if a big bokeh sprite flickers on and off, it becomes much more noticeable. These temporal artefacts might already show with slight camera movements. I don't know what technique Star Citizen uses, but the kind of flickering this would result in can clearly be seen in this video in the bright spots in the background.

Now that we roughly know how bokeh is usually rendered, let's look at the simple and fast trick that I use in Cello Fortress. My idea comes from the observation that the most pronounced bokeh usually comes from actual lights, not just random bright pixels. If you are looking directly at a light, but the light is in the blurred background, then it will generate a very clear bokeh circle. In general a game engine already knows where all the lights are, so we can skip the search for bright pixels entirely and directly create one screen-space quad for every light. To do so I first render the scene, then apply normal depth of field blur, and finally render the bokeh sprites on top of that.

Not only does this method skip the expensive step of finding the bright pixels that need bokeh, it also fixes all the issues with temporal artefacts. We always know exactly where the lights are so no matter how small they are, the bokeh will never flicker when the camera moves. It just moves along with the light correctly.



To get a good-looking effect I scale the bokeh with how blurred the light should be. This can simply be calculated based on the focal settings of the camera and the distance to the camera. I also fade out the bokeh sprite a bit as it gets larger, since the larger the blur, the less bright the bokeh circle should be (unless the light is infinitely bright). Here is a video that shows the bokeh in action in Cello Fortress. The bokeh is mostly visible at the bottom of the screen.
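For illustration, the scaling and fading described above could be computed roughly like this. This is my own sketch with a simple linear blur model; the names and constants are assumptions, not the actual Cello Fortress code.

```cpp
#include <algorithm>
#include <cmath>

struct BokehSprite
{
   float size;  // radius of the bokeh circle on screen
   float alpha; // brightness of the sprite
};

BokehSprite computeBokeh(float lightDistance, float focusDistance,
                         float blurPerUnit, float baseSize, float baseAlpha)
{
   // The further the light is from the focal plane, the larger the circle.
   float blur = std::abs(lightDistance - focusDistance) * blurPerUnit;
   BokehSprite sprite;
   sprite.size = baseSize + blur;
   // Fade the sprite as it grows: the same light energy is spread over a
   // larger disc, so the brightness drops with the area.
   float areaRatio = (baseSize * baseSize) / (sprite.size * sprite.size);
   sprite.alpha = std::min(1.0f, baseAlpha * areaRatio);
   return sprite;
}
```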



An important part of bokeh is the actual shape of the bokeh effect. This shape is created by the shape of the lens. I often like hexagonal bokeh best, but I recently discovered that in photography this is generally considered ugly. While reading reviews of a new camera lens I wanted to buy, I learned that the more expensive lenses have circular bokeh while cheaper lenses have hexagons. Still, in the end the only thing that matters is what looks good aesthetically in your game. Since most bokeh rendering techniques use textured sprites, the bokeh shape can be changed simply by using a different texture.



Note that this all does not look as good as it can yet because I have so far spent too little time on the actual visual design of Cello Fortress. I mostly focussed on the gameplay and some cool shaders. Once I start working on proper graphics I should also tweak the brightness, size and colour of the bokeh to make it look better. I should probably also try adding a bit of chromatic aberration to the bokeh texture then.

My bokeh technique does not solve occlusion at the moment. If a light disappears behind an object then it should not still get a bokeh sprite. I didn't solve this yet because it does not occur in the current version of Cello Fortress. However, several solutions can be implemented easily. For example, I already have a depth texture for the depth of field blur, so I can do a look-up in that on the screen position of the light to see whether there is an object in front of it.
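A sketch of that depth look-up could be as simple as this (the names are invented for the example): sample the depth buffer at the light's screen position and skip the sprite if something is closer than the light.

```cpp
#include <vector>

// The same depth texture used for the depth of field blur.
struct DepthBuffer
{
   int width;
   std::vector<float> depths; // scene depth per pixel

   float sample(int x, int y) const
   {
      return depths[y * width + x];
   }
};

bool isLightVisible(const DepthBuffer& depth, int screenX, int screenY,
                    float lightDepth, float bias = 0.001f)
{
   // If the stored scene depth is closer than the light, the light is
   // occluded and we skip its bokeh sprite. The small bias avoids
   // precision issues at equal depths.
   return lightDepth <= depth.sample(screenX, screenY) + bias;
}
```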

The big downside of my method for rendering bokeh is that you need to know where the bokeh will appear beforehand. This means that more subtle bokeh sources like reflections and strong speculars are not practically doable. I can imagine some tricks to get for example bokeh in planar reflections working (if you know where the reflecting planes are), but beyond that it quickly becomes infeasible. The standard technique of searching for the bokeh pixels of the image handles this much better, so if you really need bokeh on speculars, then you will probably need to resort to such techniques.


Image from this article by MJP

I looked up a bunch of articles on bokeh implementations and couldn't find any mention of something similar to what I am doing. This surprises me because the idea is so simple and obvious. If anyone knows of any articles or games that already do this, then please let me know so that I can add a link.

That's it! My bokeh technique is simpler and much faster than most commonly used methods of rendering bokeh and it even fixes problems with temporal artefacts. It is also a limited technique that does not handle all bokeh effects, but in many cases it will probably be good enough. It definitely is for Cello Fortress!

Saturday, 13 April 2013

Making everything animatable

Everything in our tools is animatable. Not just position, rotation and scale, but also things like colour, skew, taper, texture scroll and the number of particles emitted. This is a feature that I know from 3D Studio MAX and that I had been admiring for ages, so as a programmer I really wanted to come up with a neat system to build this. Being able to animate everything also happens to be really useful for our artists... ;) Today, I would like to explain how we built this.

You might already have seen our animation tools in the Ronitech tools trailer I made a while ago, but that was only a very short fragment, so here is a more complete demo of the animation capabilities of the Ronitech animation editor:



Before I continue, I should note that although I designed most of this system, the real credit should go to Thijs (at the time a coding intern at Ronimo, now a gameplay programmer): he implemented it all, figured out the nitty-gritty details, and built the editor's animation UI.

With so many variables that can be animated, we need a good way on the coding side to hide the actual animation code from everything else. In fact, ideally I would like to be able to just mark a variable as being animatable, and that's it. Relatively simple object oriented design allowed us to do just that.

To make a value animatable, we write a class around it called AnimatedFloat, which handles the animation and contains the data needed for that. This class is updated every frame, and whenever the value is used, we just ask for it in that class. Using our AnimatedFloat in a gameplay object looks something like this:

class WorldObject
{
    AnimatedFloat scale;
    AnimatedFloat rotation;
    TexturedRectangle* visualObject;

    void update(float time)
    {
        scale.updateAnimation(time);
        rotation.updateAnimation(time);
        visualObject->setScale(scale.getCurrentValue());
        visualObject->setRotation(rotation.getCurrentValue());
    }
};

This works simply enough, and AnimatedFloat itself is quite straightforward to implement this way. There are a couple of things in this bit of code that we could try to improve, though. One is those calls to getCurrentValue(). This function returns the value that the scale has at the current moment in time. If we use the scale variable a lot, it might be cumbersome to call getCurrentValue() all the time. Luckily, there is a nice (and rather obscure) solution for this in C++: we can skip calling that function by providing a conversion operator in the AnimatedFloat class. Like this:

class AnimatedFloat
{
    float currentValue;

    void updateAnimation(float time)
    {
        //perform actual animation here and update currentValue
    }

    operator float() const
    {
        return currentValue;
    }
};

The operator float() there is a conversion operator and tells C++ what to do if we use an AnimatedFloat in a spot where a float was expected. It removes the need for calling the function getCurrentValue().
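For example, with the conversion operator in place, any function expecting a float accepts an AnimatedFloat directly. Here is a minimal self-contained sketch, with a dummy stand-in for the actual animation calculation:

```cpp
class AnimatedFloat
{
public:
    void updateAnimation(float time)
    {
        currentValue = time; // stand-in for the real animation calculation
    }

    operator float() const
    {
        return currentValue;
    }

private:
    float currentValue = 0.0f;
};

// Takes a plain float, but can be passed an AnimatedFloat directly
// thanks to the conversion operator.
float twice(float value)
{
    return value * 2.0f;
}
```

After calling scale.updateAnimation(3.0f), the call twice(scale) converts implicitly and returns 6, without any explicit getter.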

This is a simple and nice solution. However, in practice we ended up not using it in our code, because our AnimatedFloat also has a function getBaseValue() (for animations that perform a periodical animation around a base value), and we felt it was not clear enough which value the conversion operator would return.

A bigger problem in our current design is that it is easy to forget to update one of the AnimatedFloats in a class. If we forget to update one, everything still compiles and runs, but the object won't animate anymore. With the large number of AnimatedFloats some of our classes have, this is an easy oversight, and forgetting it might not become apparent until an artist actually tries to animate something and it doesn't work. I would prefer it if we could make it impossible for the programmer to forget updating a single variable.

I imagine there is a solution for this using macros, but in general I think macros are not very readable, so we avoid them whenever possible.

The solution we came up with is that values are updated by another class: the ObjectAnimator. An AnimatedFloat can only be created by providing an ObjectAnimator. Now the programmer only needs to update the ObjectAnimator and cannot forget about individual variables anymore. It is still possible to forget to update the ObjectAnimator, but then nothing at all will animate, so this is much less likely to be overlooked. Our design now looks like this:

class ObjectAnimator
{
    std::vector<AnimatedFloat*> animatedValues;

    void add(AnimatedFloat* value)
    {
        animatedValues.push_back(value);
    }

    void updateAnimations(float time)
    {
        for (std::size_t i = 0; i < animatedValues.size(); ++i)
        {
            animatedValues[i]->updateAnimation(time);
        }
    }
};

class AnimatedFloat
{
    //The constructor requires a pointer to an ObjectAnimator
    AnimatedFloat(ObjectAnimator* owner)
    {
        //The AnimatedFloat registers itself for updating
        owner->add(this);
    }
};

class WorldObject: ObjectAnimator
{
    AnimatedFloat scale;
    AnimatedFloat rotation;
    TexturedRectangle* visualObject;

    WorldObject():
        scale(this),
        rotation(this)
    {
    }

    void update(float time)
    {
        updateAnimations(time);
        visualObject->setScale(scale.getCurrentValue());
        visualObject->setRotation(rotation.getCurrentValue());
    }
};

The key is that AnimatedFloat's constructor requires an ObjectAnimator. Without an ObjectAnimator, it cannot be created at all, so it is impossible to forget about it in WorldObject's initialiser list.

This is a nice example of a coding principle that is extremely important to me: code needs to be designed in such a way that common mistakes are simply not possible. Of course, this principle cannot be achieved in everything, but at least it is a good goal to strive for. In this case, by changing our class design, we have avoided a future bug that was very likely to happen quite often.

A big flaw in this design is that it currently only animates floats. In reality, a lot of our values are other things, like Colour, Radian and Vector2. The solution to this is to templatise AnimatedFloat and rename it to AnimatedValue. This is quite straightforward if you are experienced with template programming, but for any programmers less familiar with this weird part of C++, I'll provide a small example of roughly how that looks. Just think of the template parameter T as a placeholder type that will be filled in later: it can be a float, or a double, or a Colour, or whatever you want. In this specific piece of code, any T that supports the required math operations will work.

template <typename T>
class AnimatedValue
{
    T currentValue;

    T getCurrentValue() const
    {
        return currentValue;
    }
};

class WorldObject: ObjectAnimator
{
    AnimatedValue<float> scale;
    AnimatedValue<Radian> rotation;
};

There are some details to this that make the real code a bit more complex, but the core idea remains the same. By using templates, we now have a single AnimatedValue class that can animate any kind of value that we can do math on.

There are several topics that I have completely ignored so far. The first is actual animation. However, I don't want to make this post too long, so I won't go into detail on that. Animations are calculated in the updateAnimation() function of AnimatedValue.

Another topic I have ignored is how to generate an interface for editing these values, how to save them, and how to even have complete undo/redo for editing animation details. That is an interesting topic by itself, so I think I might get to that in a future blogpost at some point.

Finally, there is the topic of performance. Updating all these animation objects uses actual performance, especially if our artists animate tons of values. I had to optimise our animation code quite a bit to still get a good framerate on consoles. In the end, this is a trade-off I seem to encounter all the time: in most cases, the more flexible a system is, the more performance it uses. Luckily, computers and consoles are getting faster and faster, so as this is only a small problem for us now, I have no doubt a couple of years from now the performance of systems like this will have become completely irrelevant in all but the biggest triple-A games.

The fun part of our animation system is that it is extremely flexible, but at the same time rather simple to build and use. In that sense it is rather similar to our upgrade system for gameplay, which I explained in a blogpost last year. Our artists have made tons of animations with our animation system for Awesomenauts, and it contributed greatly to the lively feel that the game has!

Saturday, 23 February 2013

Optimising special effects in Awesomenauts

The latest Awesomenauts patch increased the framerate a lot for players with older videocards, especially during fierce battles. We managed to optimise our special effects without making them look noticeably different. Today I would like to explain how we did that!

Before I dive into the details, let me first give some background. In a 2D game like Awesomenauts, most objects are just a square with a texture on it. The texture contains an image with transparent parts, so you only see that image, not the entire square. However, the videocard still renders the entire square. So from a performance perspective, it doesn't matter how much of the texture is actually visible: the entire square is rendered, and every pixel of it costs performance!



Our artists know this, so they try to crop the image to have as few transparent pixels as possible (without changing the actual looks of the end result, of course).

Since objects in Awesomenauts contain so many (partially) transparent pixels, we cannot easily detect whether an object is hidden entirely behind another object or not. So we always render every object that is on the screen, from back to front, regardless of whether an object is actually visible or not. All objects are rendered on top of each other until the image is complete.



This introduces a big performance issue: massive overdraw. Overdraw is when a single pixel needs to be drawn several times each frame. At 1920x1080, the screen contains over 2 million pixels. If we draw every pixel 10 times, that is already over 20 million pixels per frame. Now do that 60 times per second, and we are rendering a whopping 1.2 billion pixels per second. That's a mindbogglingly large number of pixels!

The fun thing is that modern videocards are totally okay with that. For slightly older ones, however, it can become a problem to pull this off fluently...

In a previous blogpost I explained how I abused my depth of field blur as an excuse to render the backgrounds in Awesomenauts on a much lower resolution. This greatly reduces the number of pixels that need to be rendered every frame, but it only helps for the background, since the rest of the scene needs to be sharp.

Since the backgrounds are optimised this way already, the biggest remaining overdraw in Awesomenauts happens during special effects, mostly those of the nauts' special skills. This is because they have a lot of overlapping particles for smoke, fire, dust and debris. Having several of those special skills on screen at the same time decreased the framerate on older and mobile videocards too much. This is extra problematic because it is during fierce battles that a high framerate is needed most!



Once I figured out that overdraw was the cause of the framedrops we were seeing on older videocards, the solution seemed simple: decrease the number of particles on screen. The catch, however, is that it is hard for artists to optimise things without being able to see what they need to optimise. A ton of large particles for thin smoke looks very subtle, but eats performance like crazy. And if an effect needs to be optimised, which part of that effect is the problem exactly?

So I made a small addition to our engine to visualise overdraw. Internally we now have a shortcut to enable this; when it is pressed, the screen shows how often each pixel is rendered. White means a pixel is rendered 85 times (!), while black means it is rendered 0 times. Technically, this is a very simple thing to make: all objects are simply rendered in a very dark colour with additive blending, so each rendered object adds a little bit of brightness.

This is not just informative, but also looks pretty awesome! Have a look at this video of going through the menus and then some gameplay with Skølldir to see how weird and cool this looks:



As you can see, the character itself is not very expensive to render, since it is only one square. The problem comes from the tons of particles.

Using the overdraw visualisation, our artists (in this case mostly Koen, Ralph and Gijs) were able to quickly identify which parts of which special effect needed to be changed. These changes are usually pretty simple: 40 smoke particles that are 10% opaque look nearly exactly the same as 20 smoke particles that are 20% opaque. So in most cases the trick was to drastically decrease the number of particles and then modify some colours to compensate.

The video above shows what the overdraw looks like after these optimisations were already done, so here are some comparison shots that show how massively the overdraw was decreased while hardly changing the looks of these effects:





We tested the results on our Mac (even fast, expensive Macs usually have mobile videocards, which are not very fast and far from ideal for gaming) and indeed the game runs a lot smoother now. Several users reported that while they could previously only run Awesomenauts smoothly on Normal graphics quality, they can now run on High!

These improvements were released on Steam in patch 1.14 two weeks ago, so I hope players with older videocards are enjoying the smoother framerate they are getting!

Friday, 21 December 2012

Snowball Earth's melting effects

The part that I find most interesting and am most proud of in Snowball Earth (our cancelled game from 2008), is the melting effects. Every piece of the world is frozen and covered in snow, and by walking around with the heating turned on, the player can melt it all. In the prototype this had very few gameplay mechanics attached to it, but just seeing and hearing everything come to life is already so much fun that it could carry half the game on its own. Today, I would like to dissect this effect and give an overview of what went into it. (Spoiler: a lot of custom art by our artists Gijs, Ralph, Olivier and Martijn, and a lot of coding by me).


The melting effects in Snowball Earth

For those who haven't tried it yet: a couple of months ago (in this blogpost) we released the entire prototype we made at the time, so you can download and play it yourself. Note that this is a Torrent file, so you will need something like BitTorrent to download the game. Here is the download link:


Download Snowball Earth prototype


To manage the melting, the game has a big 2D map of the level (a grid, really), and for each point on the map it stores how far it has melted. By walking around, the player turns parts of the map from frozen to green. This map is then used by all elements in the level to determine to what extent they should appear melted.

A rather simple effect is the smaller plants: they are just scaled to size 0 until they are unfrozen, at which point they grow to full scale with a nice popping animation. For variation, our artists made a couple of different popping animations, but that's about it.

Key to the joyful feeling while melting things is that every time such a plant pops up, a little bell is heard. These bells have a number of different tones, which form a happy harmony together. When several plants are unfrozen shortly after each other, their popping creates little melodies. Thanks to Aline Bruijns for the sound design on this.

Larger plants, with which the player can collide, are already there in frozen state. As the player melts them, their texture goes from a frozen version to an unfrozen version. Leaves and such can be made to appear during the melting by painting their alpha black (fully transparent) in the frozen texture and white (opaque) in the melted texture.



To make these plants look more lively, a vertex shader animates them with a simple waving motion. We have a couple of different shaders for different types of plants. Of course, this waving should not happen in frozen form, so this effect fades in smoothly as the plant is unfrozen.

If you have played the prototype, then you might have noticed that it takes a long time to load, and performance isn't quite that good. Current PCs can handle it fine, but in 2008 only the fastest PCs could run the prototype at a high framerate. The main reason for this is that there are a lot of plants in the level, and they are all separate objects that need to be loaded and updated. If the game had not been cancelled, getting this to work smoothly on the Wii would have been a huge challenge...

Performance wasn't only a problem in the game itself: the version of 3D Studio Max we used at the time wasn't up for the challenge either. To make sure our artists could re-use the same plant in several parts of the level without needing to copy-paste the entire plant, I had made a little plug-in script object for 3D Studio. This object would automatically handle an XRef object: a reference to a plant model in another file.

However, the combination of lots of scripted plugins and lots of XRefs totally broke 3D Studio. Near the end of the project, our artists had to restart 3D Studio every hour or so, and had to constantly monitor its memory usage. At some point the memory usage would suddenly go crazy, leaving about 20 seconds to save the work... This made me a little bit afraid to use existing tools in the future again, and I am much happier with our own in-game tools for Awesomenauts now. In our own code we can actually solve brutal bugs like this...

We also wanted grass, and because of the performance trouble, I immediately built that using something similar to a particle system. All the grass in an area is rendered as a single object, even though individual pieces of grass grow based on which part of the grass area has already been melted. I even made a nice little brush plug-in for 3D Studio so our artists could quickly paint grass where they wanted it. Programming 3D brushes is immensely fun; it is a pity I haven't needed custom 3D brushes since!



Ice walls in the Snowball Earth prototype may look like they were made using physics, but to reduce coding time, their destruction animation was animated by hand by an artist. As a cute little extra, we also have smaller blocks of ice with frozen animals in them. When they are broken, the little rabbit inside starts walking around and makes rubber ducky noises when the player stands on it.



A more subtle element is the sound effects. The game has two tracks of ambience: one with sounds of birds and jungle, the other with chilly winds. These are subtly faded in and out depending on how much of the area around the player has already been melted. Few players will have actively noticed this, but this is incredibly important for the mood of the game. As the player walks from the happy jungle in an area he just finished, to the chilly winds of a new area, he cannot help but feel even more like it is time to melt it all and bring back the happiness!

So far I haven't mentioned the thing that is most noticeable in the melting effects of Snowball Earth: the ground. This is because the ground is my pride and joy in this topic and I find it interesting enough that it deserves its own blogpost! See you next week!

Saturday, 21 July 2012

Optimisation lessons learned (part 3)

To finish my little series on CPU optimisation (here are part 1 and part 2), today I bring out the big guns: threading and timing! This is where optimisation really shines as a form of pure entertainment and delight. I can honestly hardly imagine why anyone ever does Sudokus or crosswords. Puzzling to find the best optimisations is so much more fun!



Hiccups are horrible without the right tools

Framerate hiccups are a special case: the game runs at a high enough framerate, but once in a while a single frame takes a lot longer. Hiccups can make a game running at 55fps look a lot less fluid than one running at 30fps, if every second the 55fps consists of 54 smooth frames and one really long one.

The difficulty here is that when you gather performance data, the time spent in each function is usually averaged. This means that if a function runs really fast most of the time and is really slow only once per second, it will look like a normal function in a profiler. So to fix hiccups, you need an overview of all the captured frames and the option to analyse them individually. Not all profilers are capable of this, but if you have one that is (like the excellent Playstation 3 profiler), then finding the cause of hiccups isn't all that difficult.

The important thing here is the realisation that preventing hiccups is at least as important for having a smooth feel to the game as having a high framerate, so you really need to make sure hiccups are rare enough. In the Awesomenauts Beta that is currently running, the most important hiccups are when new players join (both when they enter character select and when they enter the actual game). We are working on decreasing those hiccups, but the good thing here is that they only happen once every few minutes. Beware of hiccups that happen once every second!

CPU time versus real time

This is an oddity in some profilers that caused me to make some seriously wrong judgements on where performance was going. An extreme example is the following. I was calling SDL_GL_SwapBuffers, which finishes rendering a frame and shows it on the screen. According to my profiler, 0% of the time was spent there, so I concluded I didn't have to worry about it. This was totally wrong. It turned out that the profiler measured actual time spent using the CPU, so when a function simply waits or sleeps, then this profiler would not count that. When I discovered this, I quickly learned that SDL_GL_SwapBuffers was actually using 50% of the time in my game. This function was waiting for the videocard to finish rendering, and the solution was to massively decrease the overdraw. (Note that the videocard usually does not wait for the last frame to finish, but an earlier one, as I explained in this previous blogpost.)

So be aware that sometimes profilers show the time spent using the CPU, and sometimes they show the actual time on the clock that it took for a function to finish. Keep this in mind!

Multi-threading makes everything very complex

This also brings us to the one thing that makes optimisation truly complex: multi-threading. Again, this is best explained through an example. The Ronitech (our 2D multi-platform game engine) has two main threads running: the rendering thread and the game thread. At the end of every frame, these threads wait for each other to finish. The threads never take exactly the same amount of time, so in practice one thread waits a bit every frame. This means that on a multi-core processor, optimising the fastest thread does not increase the framerate at all.

So before optimising a multi-threaded game, you should always get a clear view of who is waiting where and for what, and then you should only optimise the things that cause the waiting.



Luckily, modern consoles are so fast that for smaller games, you often won't need a lot of multi-threading. Everything relevant in Swords & Soldiers was done in just one thread. Awesomenauts is a lot larger and more complex, though, so we now have three threads running at all times. In general, unless you are making something relatively big and are in a larger coding team (Ronimo currently has five programmers), you won't need a lot of threading and can thus skip this kind of complexity.

Which brings us to the end of this series of the most interesting lessons I learned while doing optimisation! For posts somewhere in the future I still have two more optimisation topics that I would like to discuss: one with practical examples of (simple) optimisations that worked well for me, and the other with a detailed look at our memory manager, which turned out to improve the framerate a lot without being very complex to make. For now, I hope this was an interesting wall of text about the sensual joys of optimisation!

Saturday, 14 July 2012

Optimisation lessons learned (part 2)

Last week I talked about some rather general things that I learned about CPU optimisation, when spending a lot of time improving the framerates of Awesomenauts, Swords & Soldiers, and even Proun. Today I would like to discuss some more practical examples of what kind of optimisations are to be expected.

Somehow I like talking about optimisation so much that I couldn't fit it all in today's blogpost, so topics around threading and timing will follow next week. Anyway, forward with today's "lessons learned"!

Giant improvements are possible (at first)

The key point of my previous blogpost was that you should not worry too much about optimisation early in the project. A fun side-effect of not caring about performance during 'normal' development is that you are bound to waste a lot of framerate in really obvious ways. So once you fire up the profiler for the first time on a big project, there will always be a couple of huge issues that give giant framerate improvements with very little work.

For example, adding a memory manager to Swords & Soldiers instantly brought the framerate from 10fps to 60fps. That is an extreme example of course, and it only worked this strongly on the Wii (apparently the Wii's default memory manager is really slow compared to other platforms). Still, during the first optimisation rounds, there is bound to always be some low-hanging fruit, ready to be plucked.

The real challenge starts when all the easy optimisations have been done and you still need to find serious framerate improvements. The more you have already done, the more difficult it becomes to do more.

Big improvements are in structure, not in details

Before I actually optimised anything, I thought optimisation would be about the details. Faster math using SIMD instructions, preventing L2 cache misses, reducing instruction counts by doing things slightly smarter, those kinds of things. In practice, it turns out that there is much more to win by simply restructuring code.

A nice example of this is looking for all the turrets in a list of 1,000 level objects. Originally it might make the most sense to just iterate over the entire list and check which objects are turrets. However, when this turns out to happen so often that it reduces the framerate, it is easy enough to keep an extra list with only the turrets. Sometimes I need to check all level objects, so turrets are in both lists, and when just the turrets are needed, the longer list doesn't need to be traversed any more. Optimisations like this are really simple to do and can have a massive impact on performance.

This is also a nice example of last week's rule that "Premature optimisation is the root of all evil": having the same object in two lists is easier to break, for example by forgetting to remove the turret from the second list when it is destroyed. In fact, the rare bug with purple screens that sometimes happens in Awesomenauts on console recently turned out to be caused by exactly this! (Note that the situation was extremely timing-specific: this only happened when host migration had happened just before a match was won.)

In my experience, it is quite rare to find optimisations that don't make code at least a little bit more complex and more difficult to maintain.

Platforms have wildly different performance characteristics

This is quite a funny one. I thought running the same game on different platforms would have roughly the same performance characteristics. However, this turned out to not be the case. I already mentioned that the default memory manager is way slower on the Wii than on any of the other platforms I worked with, making implementing my own memory manager more useful there than elsewhere. Similarly, the copying phase of my multi-threading structure (which I previously discussed here) takes a significant amount of time on the Playstation 3, but is hardly measurable when I run the exact same code on a PC.

So far I have seen that all the optimisations I have done have improved the framerate on different platforms with wildly differing amounts. They did always improve the performance on all platforms at least a bit, just not with the same amounts. So I think it is really important to try to always profile on the platform that actually has the worst performance problems, so that you can focus on the most important issues.

Truly low-level optimisations are horribly difficult

The final lesson that I would like to share today is actually a negative one, and cause for a little bit of shame on my side. I have tried at several occasions, but I have hardly ever been able to achieve measurable framerate improvements with low-level optimisations.

I have read a lot of articles and tutorials about this for various platforms. I tried all kinds of things. To avoid cache misses, I have tried using intrinsics to tell the CPU which memory I would need a little bit later. I have tried avoiding virtual function calls. I have tried several other similar low-level optimisations that are supposedly really useful, but somehow I have never been able to improve the framerate this way. The only measurable result I ever got this way was a 1% improvement by making a set of functions available for inlining (the Playstation 3 compiler does not have Whole Program Optimisation to do this automatically in more complex cases).

Of course, this definitely does not mean that low-level optimisations are impossible, it just means that I consider them a lot more complex to get results with. This also means that it is possible to make a larger project like Awesomenauts run well enough without any low-level optimisations.

We've got a big announcement coming up next week, and next weekend I will be back with the last part of my mini-series on optimisation. Stay tuned!

(Muhaha, are you curious what we are going to announce? Feel free to speculate in the comments!)

Friday, 6 July 2012

Optimisation lessons learned (part 1)

Optimisation is a very special art form, with lots of tricks and realisations that give it a place of its own in the programmer's toolbox. Until I had to do it for the first time on Swords & Soldiers for the Wii, I had never done any optimisation. At university I had already learned a lot about time complexity (big O notation, like O(n)), but optimising an actual game with a big codebase is a different beast entirely. I was like a virgin that had never been seduced by a handsome profiler yet.

I just started fooling around with optimisations, reading tutorials and articles on Nintendo's dev website and the internet in general, and learning as I went along. Since no pro ever taught me how to optimise, I guess I might still be unaware of some really important tricks. Yet somehow I managed to get Swords & Soldiers from an initial 10fps to 100fps on the Wii (if VSync is turned off), and Awesomenauts from 10fps to 70fps on the Playstation 3.

In the past five years I have spent some six months in total on several kinds of optimisation. Quite a lot of that was on bandwidth optimisations, but also a lot on framerate and loading-time optimisations, so I have quite a bit of experience with it by now. There is a lot of interesting stuff to say about it, so in the coming weeks I would like to share the most important lessons that I learned. Today's concepts are rather general; next week I will share some more detailed pieces of (hopefully) 'wisdom'. ^_^

Don't make any assumptions
This is by far the most important one. When I want to optimise something, I always have some ideas in my head about which parts of the game must be causing the low framerates. In practice, however, it turns out that I am hardly ever right. When I fire up a profiler and analyse where time is really spent, the big optimisations almost always turn out to be somewhere else than expected. This is still the case, even though in the past five years I have spent so much time full-time on optimisation (especially Awesomenauts was a really complex beast to tame), optimising code for Wii, PS3, Xbox360 and PC. So much experience, yet my expectations are still usually wrong.

So the rule I derive from this, is to never take any action based on assumptions. Always let go of your assumptions. Take real measurements and work from there. And once something has been changed, measure again to see whether it really works. As Dutch physics teachers often say: "Meten is weten" ("Measuring is knowing").

Profilers are awesome
Which brings me to the best part: profilers (measuring tools) are awesome! They give insane amounts of detail on all kinds of things. The core for me is the hierarchical view. This starts in your main() function and tells you it is using 100% of the time (oh really...). From there you can keep opening function calls in a tree-like form, to see in ever more detail where the time is spent. Beyond this, profilers have all kinds of nifty tools. If you ever want to do any optimisation, be sure to start by getting a proper profiler!



I have to say, though, that I have not been able to find a really good, affordable profiler for PC. Wii, PS3 and Xbox360 all have excellent profilers that only work on that platform, but on PC, most profilers I have seen are either incredibly expensive (like the one in Visual Studio Team Edition), or lack the concept of a "frame" and thus don't allow me to find framerate hiccups. The best one for PC I have seen so far is Very Sleepy, even though it completely destroys the framerate while running and doesn't know what a frame is either. But Very Sleepy is easy to use and the hierarchical tree-view is really clear. So if anyone has any recommendations for a better profiler for optimising games that are written in C++, then please let me know!

Premature optimisation is the root of all evil
I already mentioned that you should not make any assumptions, and this also leads to another important point. "Premature optimisation is the root of all evil" is a quote from the famous computer scientist Donald Knuth, and it is all too true. Most optimisations make code more complex, more difficult to maintain and evolve, and more sensitive to bugs. Also, in practice, 99% of the code is completely irrelevant to performance. A couple of hotspots take up most of the CPU's time, and the rest just doesn't occur often enough to be relevant to optimise. So my conclusion is to not really care about performance and just write the cleanest, most readable code I can. Until the profiler says otherwise.



To give a simple example from C++: sometimes using a const char* is a lot faster than an std::string. However, the latter is so much safer to use, that I never use const char* until the profiler tells me I have a problem, and then I change it in that specific spot.

Modern consoles are ridiculously fast
When you look online for optimisation tips, or when you talk to hardened industry veterans, you will learn a lot of things that I dare claim are just not relevant any more. Consoles and PCs these days are so fast, that you can just waste performance on all kinds of things and still get a good framerate. Of course, I am not talking about the big games here: if you are working on the next Uncharted and need to squeeze every last bit of power out of the Playstation 3, then it probably becomes really important to look after every little performance detail, but in general: modern consoles are so fast that you can waste most of your processor time on inefficient code and still easily get a good framerate.

To give an example: a rule that some studios have is that dynamic allocations are forbidden. So calling new or malloc during gameplay is not allowed. Everything that is going to be used, must be allocated during loading, and then kept in pools for quick usage. This helps against memory fragmentation, but is also supposedly necessary to achieve a good framerate. I personally think this limits the code design way too much. It makes code more difficult to manage and extend during development, so I never do this. It turns out that even the Wii, supposedly the slowest of the current generation, can easily run 60fps with 100 characters walking around in Swords & Soldiers.

(Note that I did end up making my own memory manager to speed up allocations on the Wii, which I will discuss in a future blog post. But calling new tons of times each frame was not a problem in the end.)

My point here is not that everything is always possible, but I do think it is important to realise that unless you are making a triple-A console game, you should not worry about performance too much, definitely not early in development.



I personally find optimisation a lot of fun to do, so there is a lot more I would like to say about it. Next week I will be a bit more specific and follow this post up with some lessons that I learned about timing, threading, hiccups, and where the biggest performance improvements can be made.

Friday, 23 March 2012

Question: how to synchronise between the game and render thread?

In today's blogpost, I'd like to do something radically different: I'm going to ask you a question. I know that quite a few of the visitors of my blog are hardcore programmers, and since I have a complex situation for which I would like to improve the performance, I figured I could ask for advice here! :)

The problem is pretty complex, though, so I figured a short blogpost would be needed to really explain it. In short, this is the question:

What is the most efficient way to synchronise renderable objects between the game thread and the render thread?

There are a couple of subtleties you need to know to answer this for our situation, though. Let me start with a scheme that shows the approach to threading in the Ronitech (our in-house multiplatform 2D engine), as we use it in Awesomenauts on the Xbox 360 and Playstation 3:



This works fine and we managed to get 60fps on consoles this way. However, the copying phase takes relatively long, so I am still wasting a lot of performance there. Since we will also use the Ronitech in all our future games, any improvement here will keep paying off.

There are a couple of specific requirements that I have to make this work with the rest of our engine:
  • The gameplay thread must be able to modify (and even delete!) all renderable objects. At the same time, the renderer must still be able to render each object how it looked during the previous frame.
  • Within a single frame, the same object might be rendered to different viewports in different ways. For example, a character might be partially transparent in one viewport and fully visible in another. The game can register itself as a PreViewportRenderListener and thus change objects in between the rendering of two viewports.
The copying phase as I have it right now works as follows: for every renderable object I keep a clone for every viewport it might be shown on. If there are lots of viewports, this can mean lots of clones. I don't have any memory problems, so I don't mind the clones. Every frame during the copying phase, for all the objects that are actually visible in a viewport, all the properties of the original object are copied to the clone. This is the step that costs the performance: in three-player splitscreen this can mean copying all the properties of 2000 objects every frame, and each object has dozens of properties! I currently call the PreViewportRenderListeners during the copying phase, in between copying the data for one viewport and copying the data for the next.
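In code, the clone-per-viewport idea boils down to something like this sketch (RenderState, Renderable and copyForViewport are names made up for illustration, not the actual Ronitech API):

```cpp
#include <cassert>
#include <vector>

// The game thread owns 'live'; the render thread only reads 'clones'.
// During the copying phase, the live state is copied into the clone of
// each viewport in which the object is visible, so the renderer can
// keep drawing last frame's state while gameplay modifies (or even
// deletes) the live object.
struct RenderState
{
    float x = 0, y = 0;
    float opacity = 1;
    // ...dozens more properties in a real engine
};

struct Renderable
{
    RenderState live;                // modified by the gameplay thread
    std::vector<RenderState> clones; // one snapshot per viewport

    explicit Renderable(int viewportCount): clones(viewportCount) {}
};

void copyForViewport(Renderable& r, int viewport, bool visibleInViewport)
{
    if (visibleInViewport)
        r.clones[viewport] = r.live; // the expensive per-frame copy
}
```

With 2000 visible objects and dozens of properties each, that one assignment line is exactly where the time goes.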

So, what alternatives are there to just copying all the relevant data for all visible objects every frame? Or should rendering in a separate thread be set up differently all together? Please comment below if you have interesting ideas for this!

Sunday, 23 October 2011

What no one told you about the videocard: lagging 1 frame behind

Our new game Awesomenauts is a lot more complex than Swords & Soldiers in every possible way, so we suddenly need to actually have a serious look at performance to get a good framerate on all platforms. Now while trying to optimise things, I found out that I had totally misunderstood an important part of how the videocard works.

I have read several books on programming real-time graphics (including the entire OpenGL Red Book), yet somehow I have never read about this. So when I found out, I spoke to a couple of other experienced graphics programmers, and it turned out most didn't know this either. So I guess quite a few of my readers will find this interesting. For those who already knew about this: why didn't anyone tell me?!?! ;)

So what am I talking about? I always thought that the timing of a frame works like this:



So as soon as you start sending render calls to the videocard, the videocard starts processing them. However, the videocard usually cannot process these calls as fast as it receives them, so when all the calls for a frame have been received, the CPU waits for the videocard to finish, and then proceeds with the next frame.

This scheme is quite nasty, since it contains two periods of waiting: both the GPU and the CPU wait for each other at some point, simply doing nothing in the meantime. This wastes performance.

So I built some timers into Awesomenauts and saw that the CPU was indeed spending a lot of time waiting for the GPU in calls like SDL_GL_SwapBuffers. I tested this in all versions of our engine, on the Playstation 3, the Xbox 360 and the PC, and it happened on each platform.

So I implemented a multi-threading scheme that does the waiting in a separate thread, allowing the next game frame to already be processed while we are still waiting for the GPU to finish the previous one (this is actually a lot more complex than it sounds, but I will leave out the details for now).

And what happened? NOTHING! PC, PS3, Xbox 360: none showed any framerate improvement! Argh! So I asked around to find out what I was doing wrong, and it turned out that the scheme above is simply wrong: it is not how things actually work. This is how the scheme really works:



So when you call D3DPresent or SDL_GL_SwapBuffers, the time spent there is not spent waiting for the current frame, but waiting for the previous frame. This is actually a really simple and smart solution to the waiting problem I mentioned above: as long as the GPU has more work to do than the CPU, the GPU never has to wait!

This explains why my optimisation didn't help: as this image shows, the GPU is constantly busy, so improving the framerate of the CPU using multi-threading is totally useless here.

An important thing to mention is that I am not talking about triple buffering here. Triple buffering is a different subject, and this scheme applies regardless of whether triple buffering is turned on or not (although for a good understanding of triple buffering, you would need to take this scheme into account as well).

Note that a side-effect of this scheme is that it introduces some extra input lag: user input (pressing a button to jump, for example) happens in the game state update, and the time between that update and the moment its results are shown on the screen increases because of this scheme. However, the framerate also increases a lot, so this is definitely a worthwhile trade-off.

Of course, if your game takes more time on the CPU than on the GPU, the GPU will still have to wait:



Now my next question was: does it always work like this on all platforms? It turns out this varies. I asked around, and this is what I learned:



While asking around, I also learned that in some cases on PC, the driver might decide not to wait at all. Someone on the Ogre forums posted here that he had really long input lag in his application. It turned out this was because the GPU was lagging a lot more than 1 frame behind, because his CPU had so little work to do. So in his case, the scheme worked like this:



However, I have never seen this happen myself, so I am not sure when this problem occurs. I have heard from a user that Proun sometimes has serious input lag when run in Wine under Linux. I have not been able to test this myself, but I suspect it is the same problem.

However, this problem is limited by the size of the GPU command buffer: when the GPU lags too far behind, the entire buffer fills up with commands, so the CPU cannot push in any new ones and is forced to wait.

This can be solved using fences (an advanced feature of OpenGL and DirectX). Fences allow you to wait until the GPU has reached a certain point. You have to implement this yourself, but it makes absolutely certain that the GPU is never lagging more than 1 frame behind.
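The bookkeeping that a fence buys you can be sketched without any graphics API at all. The code below is a pure-logic illustration with made-up names; in real OpenGL code the 'wait' would be a blocking call such as glClientWaitSync on a fence inserted after the previous frame's commands:

```cpp
#include <cassert>

// The CPU may run at most maxLag frames ahead of the GPU. When it
// tries to start another frame while that limit is reached, it first
// waits until the GPU has finished the oldest pending frame.
class FrameLagLimiter
{
public:
    explicit FrameLagLimiter(int maxLag): maxLag(maxLag) {}

    // Called by the CPU before submitting a frame's commands.
    // Returns true if the CPU had to wait for the GPU first.
    bool beginFrame()
    {
        bool waitedForGpu = false;
        if (pendingFrames >= maxLag)
        {
            --pendingFrames; // real code: block on the oldest fence here
            waitedForGpu = true;
        }
        ++pendingFrames;
        return waitedForGpu;
    }

    // Called when the GPU signals that it finished a frame.
    void gpuFinishedFrame()
    {
        if (pendingFrames > 0)
            --pendingFrames;
    }

    int framesInFlight() const { return pendingFrames; }

private:
    int maxLag;
    int pendingFrames = 0;
};
```

With maxLag set to 1 this gives exactly the behaviour described above: the CPU can prepare one frame while the GPU renders the previous one, but never runs further ahead than that.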

To conclude: keep this scheme in mind whenever you try to optimise your game, and first check whether the game is GPU bound or CPU bound. My mistake while optimising Awesomenauts was that I was trying to optimise the CPU, while the bad framerate was being caused by the GPU. Always check your bottlenecks!

PS. My previous blogpost about the sales numbers of Proun resulted in some really painful comments online. I had tried to write a positive post about how happy I was with Proun's results, but Gamasutra and several other large game sites summarised it simply as "Proun Creator Disappointed With 'Pay What You Want' Results". Some sites and commenters also rather nastily misquoted how I interpreted the sales data. Lots of people concluded that I am a whiner, and some even did some serious flaming about Proun and me. That is really painful for a game I made for the fun of making it, especially since I am genuinely happy with how Proun did. Interesting how being misquoted can make people online hate me... Anyway, I cannot reach all those people and explain how happy I am with the reviews and income I got for Proun, so I guess I can only answer by trying to make more cool games! ^_^

Wednesday, 20 October 2010

Dirty collision detection trickery in Proun

Some people seem to read Pr0n instead of Proun. I think this may be due to the dirtiness of some of the coding tricks I used in Proun.

Or maybe it is because these words are written so similarly. No, really, I don't think so.

Anyway, programmers who worked with me might agree that I have pretty radical ideas on what good programming is, especially in comparison to how large console games are usually made. I almost always choose simplicity and understandability over performance, over memory use and often even over amount of work.

Most coding interns need to get used to the idea that when they consider something done but there is still a better way to structure it, I actually make them go back and change their already working code, just to make the code itself better. Since the games at Ronimo are around 100.000 lines of code, keeping the code clean is very important: no way will we find our bugs if such an amount of code is also a mess!

An extreme example of choosing simplicity over performance can be found in the collision detection in Proun.

Collision detection in games is usually done per frame. So 60 times per second, all objects are moved a bit and then the game checks whether they collide. This means that the time between two frames is completely ignored!

This becomes a problem when fast moving objects (e.g. bullets) pass through thin obstacles: in one frame they are on one side of the obstacle, and in the next frame they already passed it.



A common solution to this problem is to cast a ray between the position in the previous frame and the current position. An alternative is to do sub-steps just for the collision detection, and only when moving really fast.
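To make the difference concrete, here is a tiny 1D sketch of the sub-step variant (made-up names, not Proun's actual code): checking only the end position tunnels straight through a thin wall, while checking many intermediate positions catches the hit.

```cpp
#include <cassert>

// A thin wall occupying the interval [minX, maxX] on a 1D track.
struct Wall
{
    float minX, maxX;
};

bool inside(float x, const Wall& w)
{
    return x >= w.minX && x <= w.maxX;
}

// Per-frame check at the end position only: a fast object can jump
// right over the wall within a single frame.
bool hitWithoutSubSteps(float start, float velocity, float dt, const Wall& w)
{
    return inside(start + velocity * dt, w);
}

// Same check, but performed at many intermediate positions. As long as
// one sub-step covers less distance than the wall is thick, the wall
// cannot be skipped.
bool hitWithSubSteps(float start, float velocity, float dt,
                     int subSteps, const Wall& w)
{
    float pos = start;
    for (int i = 0; i < subSteps; ++i)
    {
        pos += velocity * dt / subSteps;
        if (inside(pos, w))
            return true;
    }
    return false;
}
```

With a wall from 4.0 to 4.5 and an object moving 10 units in one frame, the single check misses the wall entirely, while 100 sub-steps of 0.1 units each find it.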



Both of these solutions would require additional work in the collision detection of the vehicles in Proun. With the sub-step solution, the collision detection would run at a different framerate from the rest of the game, which increases the complexity of the code. Not dramatically, but it is more complex nevertheless.

For Proun, I implemented a simpler solution: all gameplay in Proun runs at a fixed 3000 frames per second. Three thousand. Three thousand! The graphics still run at 60fps, of course (or lower if your videocard is too slow).

The result is that even when you race extremely fast (grab a turbo on Impossible difficulty to try this), you still don't move far enough per gameplay frame to pass through an object without the collision detection noticing.

Any reasonable programmer would say this is idiotic. So much performance wasted! I even update all obstacle animations at this framerate. Unnecessary! However, Proun's performance cost is almost entirely in the graphics. The gameplay is very simple and can be processed really quickly, so why would I bother implementing something more efficient but also more complex? Doing a lot of gameplay updates per frame is really simple to implement, and everything just works: I don't need any math for the time in between frames, and I don't need to think about having different framerates for collision detection and other gameplay aspects. Any other solution would have made the collision detection code more complex than it is now.
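The loop itself is the standard fixed-timestep accumulator, just with an unusually high rate. The names below are made up; this is a sketch of the idea, not Proun's actual code:

```cpp
#include <cassert>

// Gameplay runs at a fixed 3000 updates per second; each render frame
// simply runs as many fixed updates as fit in the elapsed real time.
const double GAMEPLAY_FPS = 3000.0;
const double FIXED_DT = 1.0 / GAMEPLAY_FPS;

struct Game
{
    int updatesDone = 0;
    double accumulator = 0;

    void updateGameplay() { ++updatesDone; } // one 1/3000s gameplay step

    // Called once per render frame with the real elapsed time.
    void frame(double elapsedSeconds)
    {
        accumulator += elapsedSeconds;
        while (accumulator >= FIXED_DT)
        {
            updateGameplay();
            accumulator -= FIXED_DT;
        }
    }
};
```

At 60fps that is about 50 gameplay updates per render frame, and there is no interpolation math anywhere, which is exactly the simplicity the approach buys.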

I tested the actual performance impact of this, and it turns out it only hurts the framerate in a relevant way when a really slow processor is combined with a very fast videocard. Practically no one has such a combination, so I don't consider it relevant.

So. Simple! Clear!

(And in fact not dirty at all, so maybe the words Pr0n and Proun do look pretty similar...)