Joost's Dev Blog: Optimisation lessons learned (part 2)

Saturday, 14 July 2012

Optimisation lessons learned (part 2)

Last week I talked about some rather general things that I learned about CPU optimisation while spending a lot of time improving the framerates of Awesomenauts, Swords & Soldiers, and even Proun. Today I would like to discuss some more practical examples of the kinds of optimisations you can expect to make.

Somehow I like talking about optimisation so much that I couldn't fit it all in today's blogpost, so topics around threading and timing will follow next week. Anyway, forward with today's "lessons learned"!

Giant improvements are possible (at first)

The key point of my previous blogpost was that you should not worry too much about optimisation early in the project. A fun side-effect of not caring about performance during 'normal' development is that you are bound to waste a lot of framerate in really obvious ways. So once you fire up the profiler for the first time on a big project, there will always be a couple of huge issues that give giant framerate improvements with very little work.

For example, adding a memory manager to Swords & Soldiers instantly brought the framerate from 10fps to 60fps. That is an extreme example of course, and the effect was only this strong on the Wii (apparently the Wii's default memory manager is really slow compared to other platforms). Still, during the first optimisation rounds, there is always bound to be some low-hanging fruit, ready to be plucked.
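The post does not say how the Swords & Soldiers memory manager actually works. Purely as an illustration of the general idea, here is a minimal fixed-size pool sketch in C++. The class name `FixedPool` and its whole interface are hypothetical and not taken from the game's code. It grabs one large block up front and hands out equally sized slots from it, so the hot path never calls the platform's default allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-size pool: an illustrative sketch, not Ronimo's implementation.
class FixedPool {
public:
    FixedPool(std::size_t slotSize, std::size_t slotCount)
        : slotSize_(roundUp(slotSize)), storage_(slotSize_ * slotCount) {
        freeSlots_.reserve(slotCount);
        // Push indices in reverse so that slot 0 is handed out first.
        for (std::size_t i = slotCount; i > 0; --i) freeSlots_.push_back(i - 1);
    }

    FixedPool(const FixedPool&) = delete;
    FixedPool& operator=(const FixedPool&) = delete;

    // Returns a free slot, or nullptr when the pool is exhausted.
    void* allocate() {
        if (freeSlots_.empty()) return nullptr;
        const std::size_t index = freeSlots_.back();
        freeSlots_.pop_back();
        return storage_.data() + index * slotSize_;
    }

    // Gives a slot previously returned by allocate() back to the pool.
    void deallocate(void* p) {
        unsigned char* bytes = static_cast<unsigned char*>(p);
        assert(bytes >= storage_.data() && bytes < storage_.data() + storage_.size());
        const std::size_t offset = static_cast<std::size_t>(bytes - storage_.data());
        assert(offset % slotSize_ == 0);
        freeSlots_.push_back(offset / slotSize_);
    }

    std::size_t freeCount() const { return freeSlots_.size(); }

private:
    // Rounds the slot size up so that every slot starts on an aligned address.
    static std::size_t roundUp(std::size_t n) {
        const std::size_t alignment = alignof(std::max_align_t);
        if (n == 0) n = 1;
        return (n + alignment - 1) / alignment * alignment;
    }

    std::size_t slotSize_;
    std::vector<unsigned char> storage_;
    std::vector<std::size_t> freeSlots_;
};
```

Objects would be constructed in a slot with placement `new` and destroyed by calling the destructor explicitly before `deallocate`. A real memory manager has to handle many sizes, alignment requirements and threads, so treat this only as the smallest version of the concept.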

The real challenge starts when all the easy optimisations have been done and you still need to find serious framerate improvements. The more you have already done, the more difficult it becomes to do more.

Big improvements are in structure, not in details

Before I actually optimised anything, I thought optimisation would be about the details. Faster math using SIMD instructions, preventing L2 cache misses, reducing instruction counts by doing things slightly smarter, those kinds of things. In practice, it turns out that there is much more to win by simply restructuring code.

A nice example of this is looking for all the turrets in a list of 1,000 level objects. Originally it might make most sense to just iterate over the entire list and check which objects are turrets. However, when this turns out to be done so often that it reduces framerate, it is easy enough to make an extra list with only the turrets. Sometimes I still need to check all level objects, so turrets are now in both lists; when just the turrets are needed, the longer list no longer has to be traversed. Optimisations like this are really simple to do and can have a massive impact on performance.
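A minimal C++ sketch of the turret example might look like this. The names `LevelObject` and `Level` are made up for illustration and are not the actual Awesomenauts classes. The sketch routes every add and remove through one class, which is one possible way to keep the two lists from drifting apart.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical types for illustration only.
struct LevelObject {
    bool isTurret = false;
};

class Level {
public:
    // Single entry point for adding, so both lists are updated together.
    void add(LevelObject* object) {
        allObjects_.push_back(object);
        if (object->isTurret) turrets_.push_back(object);
    }

    // Single entry point for removal, for the same reason.
    void remove(LevelObject* object) {
        eraseFrom(allObjects_, object);
        if (object->isTurret) eraseFrom(turrets_, object);
    }

    const std::vector<LevelObject*>& allObjects() const { return allObjects_; }

    // The fast path: no scan over the long list is needed.
    const std::vector<LevelObject*>& turrets() const { return turrets_; }

    // The original approach: walk every object and filter.
    std::vector<LevelObject*> findTurretsByScanning() const {
        std::vector<LevelObject*> result;
        for (LevelObject* object : allObjects_) {
            if (object->isTurret) result.push_back(object);
        }
        return result;
    }

private:
    static void eraseFrom(std::vector<LevelObject*>& list, LevelObject* object) {
        list.erase(std::remove(list.begin(), list.end(), object), list.end());
    }

    std::vector<LevelObject*> allObjects_;
    std::vector<LevelObject*> turrets_;
};
```

With 1,000 objects of which only a handful are turrets, `turrets()` touches a handful of pointers where `findTurretsByScanning()` touches all 1,000 and allocates a result vector on every call.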

This is also a nice example of last week's rule that "Premature optimisation is the root of all evil": having the same object in two lists is easier to break, for example by forgetting to remove the turret from one of the lists when it is destroyed. In fact, the rare bug with purple screens that sometimes happens in Awesomenauts on console recently turned out to be caused by exactly this! (Note that the situation was extremely timing specific: this only happened when host migration had happened just before a match was won.)

In my experience, it is quite rare to find optimisations that don't make code at least a little bit more complex and more difficult to maintain.

Platforms have wildly different performance characteristics

This is quite a funny one. I thought running the same game on different platforms would have roughly the same performance characteristics. However, this turned out not to be the case. I already mentioned that the default memory manager is way slower on the Wii than on any of the other platforms I worked with, making implementing my own memory manager more useful there than elsewhere. Similarly, the copying phase of my multi-threading structure (which I previously discussed here) takes a significant amount of time on the Playstation 3, but is hardly measurable when I run the exact same code on a PC.

So far, every optimisation I have done has improved the framerate on different platforms by wildly differing amounts. They always improved the performance on all platforms at least a bit, just not by the same amount. So I think it is really important to always try to profile on the platform that actually has the worst performance problems, so that you can focus on the most important issues.
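When a platform's own profiler is not at hand, one simple way to compare the same section of code across platforms is to time it yourself. The sketch below assumes only standard C++; `SectionTimes` and `ScopedTimer` are hypothetical helper names, not part of any engine mentioned in this post.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper: accumulates wall-clock milliseconds per named section.
class SectionTimes {
public:
    void add(const std::string& name, double milliseconds) {
        totals_[name] += milliseconds;
    }

    // Returns the accumulated time, or 0 for a section that was never timed.
    double total(const std::string& name) const {
        const auto it = totals_.find(name);
        return it == totals_.end() ? 0.0 : it->second;
    }

private:
    std::map<std::string, double> totals_;
};

// Measures the lifetime of the enclosing scope and reports it on destruction.
class ScopedTimer {
public:
    ScopedTimer(SectionTimes& times, std::string name)
        : times_(times),
          name_(std::move(name)),
          start_(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        const std::chrono::duration<double, std::milli> elapsed =
            std::chrono::steady_clock::now() - start_;
        times_.add(name_, elapsed.count());
    }

    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    SectionTimes& times_;
    std::string name_;
    std::chrono::steady_clock::time_point start_;
};
```

Wrapping something like the copying phase in a block with a `ScopedTimer` at the top, and printing the totals once per second on each platform, would show at a glance where the same code costs the most.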

Truly low-level optimisations are horribly difficult

The final lesson that I would like to share today is actually a negative one, and cause for a little bit of shame on my side. I have tried on several occasions, but I have hardly ever been able to achieve measurable framerate improvements with low-level optimisations.

I have read a lot of articles and tutorials about this for various platforms. I tried all kinds of things. To avoid cache misses, I have tried using intrinsics to tell the CPU which memory I would need a little bit later. I have tried avoiding virtual function calls. I have tried several other similar low-level optimisations that are supposedly really useful, but somehow I have never been able to improve the framerate this way. The only measurable result I ever got this way was a 1% improvement by making a set of functions available for inlining (the Playstation 3 compiler does not have Whole Program Optimisation to do this automatically in more complex cases).
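For readers who have not seen one, this is roughly what a prefetch hint looks like in C++. It is a sketch under assumptions: `__builtin_prefetch` is the GCC/Clang builtin, other compilers fall back to doing nothing, and `Particle`, `integrate` and the look-ahead distance of 4 are made up for the example. As described above, a hint like this gave no measurable gain in Awesomenauts, so it should be measured before it is kept.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// GCC and Clang provide __builtin_prefetch; elsewhere the hint is compiled out.
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(address) __builtin_prefetch(address)
#else
#define PREFETCH(address) ((void)0)
#endif

// Hypothetical game object for illustration.
struct Particle {
    float x, y;
    float vx, vy;
};

// Moves every particle, hinting the cache about an object a few iterations ahead.
// The look-ahead distance is an arbitrary guess; a useful value depends on the hardware.
inline void integrate(const std::vector<Particle*>& particles, float dt) {
    const std::size_t kLookAhead = 4;
    for (std::size_t i = 0; i < particles.size(); ++i) {
        if (i + kLookAhead < particles.size()) {
            PREFETCH(particles[i + kLookAhead]);
        }
        Particle& p = *particles[i];
        p.x += p.vx * dt;
        p.y += p.vy * dt;
    }
}
```

The hint never changes the result of the loop, only (possibly) how long the loads take, which is part of why its effect is hard to see without a profiler.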

Of course, this definitely does not mean that low-level optimisations are impossible; it just means that I find it a lot harder to get results with them. It also means that it is possible to make a larger project like Awesomenauts run well enough without any low-level optimisations.

We've got a big announcement coming up next week, and next weekend I will be back with the last part of my mini-series on optimisation. Stay tuned!

(Muhaha, are you curious what we are going to announce? Feel free to speculate in the comments!)

14 comments:

Thamas, 14 July 2012 at 00:41
"A nice example of this, is looking for all the turrets in a list of 1,000 level objects. Originally it might make most sense to just iterate over the entire list and check which objects are turrets"

Ouch, that just made my brain hurt :(. Well, I guess that just goes to show it's a good thing I'm not in industry. Joking aside, what can we learn from this one?

There are, what, 10 turrets out of 1000 objects? That means checking the entire list has serious overhead versus being able to iterate over just the turrets. But if you do it only once per frame, then the overhead _per frame_ is maybe negligible.

You --I'm going to say 'you', but I guess that means 'somebody at Ronimo'-- ended up changing it. So at some point this got on your radar? Did you end up getting more performance? (You did get more bugs, assuming your initial implementation had no bugs.)

Was it worth it? Could you have known in advance? Should you have done the more complicated solution to begin with? Are you glad that you didn't? I ask this non-rhetorically.

Thamas, 14 July 2012 at 00:53
Could you say more about not getting performance from avoiding virtual function calls? Is there something you tried there? It seems to me like this is mostly an architectural aspect that can't easily be hacked into something.

That is, this seems like something where some initial 'optimisation' could actually be important; setting up your architecture so that it uses virtuals all over the place even though it doesn't have to would be (premature?) pessimisation :P.

This assumes virtual function calls actually impact performance. I have heard it suggested that they don't actually matter much in most reasonable cases on modern hardware. On the other hand, if I recall correctly, virtual function calls ARE super terrible on PowerPC, that is, Xbox 360. I think Olaf told me something to that effect at some point.

Unknown, 14 July 2012 at 19:29
Joost how do you look at this from a design point of view?

For example, you mention you increased fps from 10 to 60 at a certain stage. But if it was at 10 before, that would seriously hinder proper playtesting and design. When designing levels or gameplay I continuously play the game. If it doesn't run properly, that is almost impossible to do. I'd even say that if the game doesn't run at the desired final fps, it limits a designer. Sounds like whining maybe, but I truly believe that.

Also, during development of ibb and obb we have showcased the game at many different events. Often last-minute requests came in, and this sometimes led to suboptimal builds at these events. (I still need to figure out a way to deal with this for later projects.)

How do you feel about this? Do you guys build separate playtest or event builds sometimes?

pachanga, 20 July 2012 at 14:41
It would be really interesting to know how you optimized your game for PS3. Did you offload the PPU by running some logic on the SPUs? And if yes, what logic exactly? (e.g. physics, animations, etc.)