Saturday, 21 July 2012

Optimisation lessons learned (part 3)

To finish my little series on CPU optimisation (here were part 1 and part 2, today I bring out the big guns: threading and timing! This is where optimisation really shines as a form of pure entertainment and delight. I can honestly hardly imagine why anyone ever does Sudokus or crosswords. Puzzling to find the best optimisations is so much more fun!



Hiccups are horrible without the right tools

Framerate hiccups are a special case. This is when the game is running at a high enough framerate, but once in a while a single frame takes a lot longer. Hiccups cause that a game running at 55fps might look a lot less fluent than one running at 30fps, if every second the 55fps consists of 54 smooth frames and one really long one.

The difficulty here is that when you gather performance data, the time spent in each function will usually be averaged. This means that if a function runs really fast most of the time and is really slow only once per second, then it will look like a normal function in a profiler. So to fix hiccups, you will need an overview of all the frames captured, and the option to analyse them individually. Not all profilers are capable of doing this, but if you have one that does (like the excellent Playstation 3 profiler), then finding the cause of hiccups isn't all that difficult.

The important thing here is the realisation that preventing hiccups is at least as important for having a smooth feel to the game as having a high framerate, so you really need to make sure hiccups are rare enough. In the Awesomenauts Beta that is currently running, the most important hiccups are when new players join (both when they enter character select and when they enter the actual game). We are working on decreasing those hiccups, but the good thing here is that they only happen once every few minutes. Beware of hiccups that happen once every second!

CPU time versus real time

This is an oddity in some profilers that caused me to make some seriously wrong judgements on where performance was going. An extreme example is the following. I was calling SDL_GL_SwapBuffers, which finishes rendering a frame and shows it on the screen. According to my profiler, 0% of the time was spent there, so I concluded I didn't have to worry about it. This was totally wrong. It turned out that the profiler measured actual time spent using the CPU, so when a function simply waits or sleeps, then this profiler would not count that. When I discovered this, I quickly learned that SDL_GL_SwapBuffers was actually using 50% of the time in my game. This function was waiting for the videocard to finish rendering, and the solution was to massively decrease the overdraw. (Note that the videocard usually does not wait for the last frame to finish, but an earlier one, as I explained in this previous blogpost.)

So be aware that sometimes profilers show the time spent using the CPU, and sometimes they show the actual time on the clock that it took for a function to finish. Keep this in mind!

Multi-threading makes everything very complex

This also brings us to the one thing that makes optimisation truly complex: multi-threading. Again, this is best explained through an example. The Ronitech (our 2D multi-platform game engine) has two main threads running: the rendering thread and the game thread. At the end of every frame, these threads wait for each other to finish. The threads never take exactly the same amount of time, so in practice one thread waits a bit every frame. This means that on a multi-core processor, optimising the fastest thread does not increase the framerate at all.

So before optimising a multi-threaded game, you should always get a clear view of who is waiting where and for what, and then you should only optimise the things that cause the waiting.



Luckily, modern consoles are so fast that for smaller games, you often won't need a lot of multi-threading. Everything relevant in Swords & Soldiers was done in just one thread. Awesomenauts is a lot larger and more complex, though, so we now have three threads running at all times. In general, unless you are making something relatively big and are in a larger coding team (Ronimo currently has five programmers), you won't need a lot of threading and can thus skip this kind of complexity.

Which brings us to the end of this series of the most interesting lessons I learned while doing optimisation! For posts somewhere in the future I still have two more optimisation topics that I would like to discuss: one with practical examples of (simple) optimisations that worked well for me, and the other with a detailed look at our memory manager, which turned out to improve the framerate a lot without being very complex to make. For now, I hope this was an interesting wall of text about the sensual joys of optimisation!

Tuesday, 17 July 2012

Awesomenauts is coming to Steam!

Yesterday we have announced that Awesomenauts is coming to Steam! Woohoo! We have been working on it for quite a while already, so it will release really soon. There will even be a closed beta, which will start within a week. Exciting times! I am kind of surprised that I managed to write those optimisation blogposts in the past weeks at all, because we have been crunching like crazy to iron out all the bugs. In particular we spent a lot of time on getting the controls just right for PC, and personally I feel it plays great, so I hope the players in the beta will agree! Here's a trailer showing Coco and Derpl on PC:

Saturday, 14 July 2012

Optimisation lessons learned (part 2)

Last week I talked about some rather general things that I learned about CPU optimisation, when spending a lot of time improving the framerates of Awesomenauts, Swords & Soldiers, and even Proun. Today I would like to discuss some more practical examples of what kind of optimisations are to be expected.

Somehow I like talking about optimisation so much that I couldn't fit it all in today's blogpost, so topics around threading and timing will follow next week. Anyway, forward with today's "lessons learned"!

Giant improvements are possible (at first)

The key point of my previous blogpost was that you should not worry too much about optimisation early in the project. A fun side-effect of not caring about performance during 'normal' development, is that you are bound to waste a lot of framerate in really obvious ways. So once you fire up the profiler for the first time on a big project, there will always be a couple of huge issues that give giant framerate improvement with very little work.

For example, adding a memory manager to Swords & Soldiers instantly brought the framerate from 10fps to 60fps. That is an extreme example of course, and it only worked this strongly on the Wii (apparently the Wii's default memory manager is really slow compared to other platforms). Still, during the first optimisation rounds, there is bound to always be some low-hanging fruit, ready to be plucked.

The real challenge starts when all the easy optimisations have been done and you still need to find serious framerate improvements. The more you have already done, the more difficult it becomes to do more.

Big improvements are in structure, not in details

Before I actually optimised anything, I thought optimisation would be about the details. Faster math using SIMD instructions, preventing L2 cache misses, reducing instruction counts by doing things slightly smarter, those kinds of things. In practice, it turns out that there is much more to win by simply restructuring code.

A nice example of this, is looking for all the turrets in a list of 1,000 level objects. Originally it might make most sense to just iterate over the entire list and check which objects are turrets. However, when this turns out to be done so often that it reduces framerate, it is easy enough to make an extra list with only the turrets. Sometimes I need to check all level objects, so turrets are now in both lists, and when just the turrets are needed, the longer list doesn't need to be traversed any more. Optimisations like this are really simple to do and can have a massive impact on performance.

This is also a nice example of last week's rule that "Premature optimisation is the root of all evil": having the same object in two lists is more easy to break, for example by forgetting to remove the turret from the other list when it is destroyed. In fact, the rare bug with purple screens that sometimes happens in Awesomenauts on console recently turned out to be caused by exactly this! (Note that the situation was extremely timing specific: this only happened when host migration had happened just before a match was won.)

In my experience, it is quite rare to find optimisations that don't make code at least a little bit more complex and more difficult to maintain.

Platforms have wildly different performance characteristics

This is quite a funny one. I thought running the same game on different platforms would have roughly the same performance characteristics. However, this turned out to not be the case. I already mentioned that the default memory manager is way slower on the Wii than on any of the other platforms I worked with, making implementing my own memory manager more useful there than elsewhere. Similarly, the copying phase of my multi-threading structure (which I previously discussed here) takes a significant amount of time on the Playstation 3, but is hardly measurable when I run the exact same code on a PC.

So far I have seen that all the optimisations I have done have improved the framerate on different platforms with wildly differing amounts. They did always improve the performance on all platforms at least a bit, just not with the same amounts. So I think it is really important to try to always profile on the platform that actually has the worst performance problems, so that you can focus on the most important issues.

Truly low-level optimisations are horribly difficult

The final lesson that I would like to share today is actually a negative one, and cause for a little bit of shame on my side. I have tried at several occasions, but I have hardly ever been able to achieve measurable framerate improvements with low-level optimisations.

I have read a lot of articles and tutorials about this for various platforms. I tried all kinds of things. To avoid cache misses, I have tried using intrinsics to tell the CPU which memory I would need a little bit later. I have tried avoiding virtual function calls. I have tried several other similar low-level optimisations that are supposedly really useful, but somehow I have never been able to improve the framerate this way. The only measurable result I ever got this way was a 1% improvement by making a set of functions available for inlining (the Playstation 3 compiler does not have Whole Program Optimisation to do this automatically in more complex cases).

Of course, this definitely does not mean that low-level optimisations are impossible, it just means that I consider them a lot more complex to get results with. This also means that it is possible to make a larger project like Awesomenauts run well enough without any low-level optimisations.

We've got a big announcement coming up next week, and next weekend I will be back with the last part of my mini-series on optimisation. Stay tuned!

(Muhaha, are you curious what we are going to announce? Feel free to speculate in the comments!)

Friday, 6 July 2012

Optimisation lessons learned (part 1)

Optimisation is a very special art form, with lots of tricks and realisations that give it a place of its own in the programmers toolbox. Until I had to do it for the first time on Swords & Soldiers for the Wii, I had never done any optimisation. At university I had already learned a lot about time complexity (big O notation, like O(n)), but optimising an actual game with a big codebase is a different beast entirely. I was like a virgin that had never been seduced by a handsome profiler yet.

I just started fooling around with optimisations, reading tutorials and articles on Nintendo's dev website and the internet in general, and learning as I went along. Since no pro ever taught me how to optimise, I guess I might still be unaware of some really important tricks. Yet somehow I managed to get Swords & Soldiers from an initial 10fps to 100fps on the Wii (if VSync is turned off), and Awesomenauts from 10fps to 70fps on the Playstation 3.

In past five years I spent some six months in total on several kinds of optimisation. Quite a lot of that was on bandwidth optimisations, but also a lot on framerate and loading-times optimisations, so I have quite a bit of experience with it by now. There is a lot of interesting stuff to say about it, so in the coming weeks I would like to share the most important lessons that I learned. Today's concepts are rather general, next week I will share some more detailed pieces of (hopefully) 'wisdom'. ^_^

Don't make any assumptions
This is by far the most important one. When I want to optimise something, I always have a some ideas in my head about which parts of the game must be causing the low framerates. In practice, however, it turns out that I am hardly ever right. When I fire up a profiler and analyse where time is really spent, the big optimisations almost always turn out to be somewhere else than expected. This is still the case, even though in the past five years I have spent so much time full-time on optimisation (especially Awesomenauts was a really complex beast to tame), optimising code for Wii, PS3, Xbox360 and PC. So much experience, yet my expectations are still usually wrong.

So the rule I derive from this, is to never take any action based on assumptions. Always let go of your assumptions. Take real measurements and work from there. And once something has been changed, measure again to see whether it really works. As Dutch physics teachers often say: "Meten is weten" ("Measuring is knowing").

Profilers are awesome
Which brings me to the best part: profilers (measuring tools) are awesome! They give insane amounts of detail on all kinds of things. The core for me is the hierarchical view. This starts in your main() function and tells you it is using 100% of the time (oh really...). From there you can keep opening function calls in a tree-like form, to see in ever more detail where the time is spent. Beyond this, profilers have all kinds of nifty tools. If you ever want to do any optimisation, be sure to start by getting a proper profiler!



I have to say, though, that I have not been able to find a really good, affordable profiler for PC. Wii, PS3 and Xbox360 all have excellent profilers that only work on that platform, but on PC, most profilers I have seen are either incredibly expensive (like the one in Visual Studio Team Edition), or lack the concept of a "frame" and thus don't allow me to find framerate hickups. The best one for PC I have seen so far is Very Sleepy, even though it completely destroys the framerate while running and doesn't know what a frame is either. But Very Sleepy is easy to use and the hierarchical tree-view is really clear. So if anyone has any recommendations for a better profiler for optimising games that are written in C++, then please let me know!

Premature optimisation is the root of all evil
I already mentioned that you should not make any assumptions, and this also leads to another important point. "Premature optimisation is the root of all evil" is a quote from the famous computer scientist Donald Knuth, and it is all too true. Most optimisations make code more complex, more difficult to maintain and evolve, and more sensitive to bugs. Also, in practice, 99% of the code is completely irrelevant to performance. A couple of hotspots take up most of the CPU's time, and the rest just doesn't occur often enough to be relevant to optimise. So my conclusion is to not really care about performance and just write the cleanest, most readable code I can. Until the profiler says otherwise.



To give a simple example from C++: sometimes using a const char* is a lot faster than an std::string. However, the latter is so much safer to use, that I never use const char* until the profiler tells me I have a problem, and then I change it in that specific spot.

Modern consoles are ridiculously fast
When you look online for optimisation tips, or when you talk to hardened industry veterans, you will learn a lot of things that I dare claim are just not relevant any more. Consoles and PCs these days are so fast, that you can just waste performance on all kinds of things and still get a good framerate. Of course, I am not talking about the big games here: if you are working on the next Uncharted and need to squeeze every last bit of power out of the Playstation 3, then it probably becomes really important to look after every little performance detail, but in general: modern consoles are so fast that you can waste most of your processor time on inefficient code and still easily get a good framerate.

To give an example: a rule that some studios have is that dynamic allocations are forbidden. So calling new or malloc during gameplay is not allowed. Everything that is going to be used, must be allocated during loading, and then kept in pools for quick usage. This helps against memory fragmentation, but is also supposedly necessary to achieve a good framerate. I personally think this limits the code design way too much. It makes code more difficult to manage and extend during development, so I never do this. It turns out that even the Wii, supposedly the slowest of the current generation, can easily run 60fps with 100 characters walking around in Swords & Soldiers.

(Note that I did end up making my own memory manager to speed up allocations on the Wii, which I will discuss in a future blog post. But calling new tons of times each frame was not a problem in the end.)

My point here is not that everything is always possible, but I do think it is important to realise that unless you are making a triple-A console game, you should not worry about performance too much, definitely not early in development.



I personally find optimisation a lot of fun to do, so there is a lot more I would like to say about it. Next week I will be a bit more specific and follow this post up with some lessons that I learned about timing, threading, hiccups, and where the biggest performance improvements can be made.