In the past half year we have seen a really rare bug occur about once per month in Awesomenauts on the Playstation 3: suddenly the game would freeze for anywhere between 10 to 100 seconds, and then continue normally as if nothing weird had happened. We always have a PC connected to collect logs from the game. However, the log printed nothing interesting and showed simply the framerate, which is printed once per second:
From this we concluded that the issue must be in some part of code that does not log anything. That leaves 99% and with a codebase of at least 150,000 lines, that is a lot!
Since this issue was so rare and we hadn't seen it in months, we had closed it. We thought it was an anomaly that we couldn't find or fix and that had somehow disappeared.
Until we were testing our latest build last Thursday. We were playing on seven Playstation 3 devkits, when suddenly 4 of them froze for about 90 seconds. Argh! We thought this really difficult bug had magically disappeared, and now it was back, with a vengeance! We cannot leave a bug in that happens so massively on all these consoles at once!
Again, no relevant logging. Yet something really interesting nevertheless: these Playstations were in different online matches, so they weren't even talking to each other! So how could they all freeze at the exact same time? We concluded that the only thing they had in common, was that they all talk to the same Sony matchmaking servers, so we started investigating all our code related to that. However, we couldn't find anything relevant.
So we added a lot more logging, and did some really advanced stuff to get more info on the stack during the freeze (which is difficult to get from an executable that has been stripped of all debugging info), and started playing again. I let the Playstations perform our automatic testing all night, and the bug didn't happen. Then we played the game for five more hours with the entire team and BAM!, it finally happened again on two consoles!
This time, we had more info, and it turned out that the spot where the game froze was different on both freezing consoles, and both did not contain any calls to Sony's matchmaking servers. In fact, it was in between two logging calls, in a spot where nothing relevant was happening. So we concluded there were only two possible causes: either other threads were hogging the entire CPU (due to how the scheduling system on the Playstation 3 works, high priority threads can do this permanently), or the logging itself was broken.
So we started experimenting around that, and now we quickly found the cause of this 'bug': when the PC that is tracking logs goes into sleep mode, the connected consoles freeze a little while later; once the PC is awoken once more, the consoles continue as well a little later. The PC that was tracking the logs automatically went into sleep mode after not touching it for 30 minutes. This only happened during extensive play-testing, because people normally actually use that PC. So it wasn't even a bug in our code! ARGH! Especially since logging is turned off in the release build...
This may all seem really obvious in hindsight, but in general when we have a bug/freeze/crash it is in our own code, not in one of the tools we use. With such a big codebase, it is easy to not even think about something else. Also, in the chaos of 14 people playing the game on 7 consoles, it is easy to overlook one specific PC going into Sleep mode a little bit before the consoles freeze... Also, I still don't know why not all consoles connected to that PC froze.
So, this hard to find and very persistent bug turned out not to even be a bug! And what do we learn from this? Two things:
1. Really nasty bugs are often in a totally different spot than where you are looking.
2. Sometimes you encounter the lamest and most frustrating stuff while programming games!