Our new game Awesomenauts is a lot more complex than Swords & Soldiers in every possible way, so we suddenly need to actually have a serious look at performance to get a good framerate on all platforms. Now while trying to optimise things, I found out that I had totally misunderstood an important part of how the videocard works.
I have read several books on programming real-time graphics (including the entire OpenGL Red Book), yet somehow I have never read about this. So when I found out, I spoke to a couple of other experienced graphics programmers, and it turned out most didn't know this either. So I guess quite a few of my readers will find this interesting. For those who already knew about this: why didn't anyone tell me?!?! ;)
So what am I talking about? I always thought that the timing of a frame works like this:
So as soon as you start sending render calls to the videocard, the videocard starts processing them. However, the videocard usually cannot process these calls as fast as it receives them, so when all the calls for a frame have been received, the CPU waits for the videocard to finish, and then proceeds with the next frame.
This scheme is quite nasty, since it contains two periods of waiting: both the GPU and the CPU wait for each other at some point, simply doing nothing in the meanwhile. This wastes performance.
So I have build some timers in Awesomenauts and I saw that indeed the CPU was spending a lot of time waiting for the GPU in calls to things like SDL_GL_SwapBuffers. I tested this in all versions of our engine, so on the Playstation 3, the Xbox 360 and the PC, and this happened on each platform.
So I implemented a multi-threading scheme to do the waiting in a separate thread, so that the next game frame can already be processed while we are still waiting for the GPU to finish the previous frame (this is actually a lot more complex than it sounds, but I will leave out the details for now).
And what happened? NOTHING! PC, PS3, Xbox360: none showed any framerate improvement! Argh! So I asked around to find out what I was doing wrong, and it turned out that the above scheme is entirely wrong. It is simply not how it works. This is how the scheme really works:
So when you call D3DPresent or SDL_GL_SwapBuffers, the time spent there is not spent waiting for the current frame, but waiting for the previous frame. This is actually a really simple and smart solution to the waiting problem I mentioned above. As long as the GPU has more work to do than the CPU, it will never have to wait!
This explains why my optimisation didn't help: as this image shows, the GPU is constantly busy, so improving the framerate of the CPU using multi-threading is totally useless here.
An important thing to mention here is that I am not talking about triple buffering here. Triple buffering is a different subject and this scheme happens regardless of whether triple buffering is turned on or not (although for a good understanding of triple buffering, you would need to take this scheme into account as well).
Note that a side-effect of this scheme is that it introduces some extra input lag: user input (pressing a button to jump, for example) happens in the game state update, and the time between the game state update and the moment when it's results are shown on the screen increases because of this scheme. However, the framerate also increases a lot, so this is definitely a worthwhile trade-off.
Of course, if your game takes more time on the CPU than on the GPU, the GPU will still have to wait:
Now my next question was: does it always work like this on all platforms? It turns out this varies. I asked around, and this is what I learned:
While asking around, I also learned that in some cases on PC, the driver might decide not to wait at all. Someone on the Ogre forums posted here that he had really long input long in his application. It turned out this was because the GPU was a lot more than 1 frame behind, because his CPU had so little work to do. So in his case, the scheme worked like this:
However, I have never seen this happen myself, so I am not sure when this problem would occur. I have heard from a user that Proun sometimes has serious input lag when being run in Wine under Linux. I have not been able to test this myself, but I suspect this is the same problem.
However, this problem is limited by the size of the GPU command buffer: when the GPU is lagging too far behind, the entire buffer will be full of commands, so the CPU will not be able to push in any new ones, forcing the CPU to wait.
This can be solved using fences (an advanced feature of OpenGL and DirectX). Fences allow you to wait until the GPU has reached a certain point. You have to implement this yourself, but it makes absolutely certain that the GPU is never lagging more than 1 frame behind.
To conclude: keep this scheme in mind whenever you try to optimise your game, and be sure to first make sure whether the game is GPU or CPU bound. My fault while optimising Awesomenauts was that I was trying to optimise the CPU, while the bad framerate was being caused by the GPU. Always check your bottlenecks!
PS. My previous blogpost about the sales numbers of Proun resulted in some really painful comments online. I had tried to write a really positive blogpost about how happy I was with Proun's results, but Gamasutra and several other large game sites summarised it simply as "Proun Creator Disappointed With 'Pay What You Want' Results". Some sites and commenters also did some nasty misquotations on how I interpreted the sales data. Lots of people concluded that I am a whiner, and some even did some serious flaming about Proun and me. This is really painful for a game I made for the fun of making it, especially since I am really happy with how Proun did and tried to write a very positive blog post. Interesting how being misquoted can make people online hate me... Anyway, I cannot reach all those people and explain to them how happy I am with the reviews and income I got for Proun, so I guess I can only answer by trying to make more cool games! ^_^