Sunday, 24 May 2020

Then the Halls Were Empty... and I Turned It On!

I've finished a new song! It's an instrumental that tells a little story. This composition is the centrepiece for my album: here the gate that it's all about is opened for the first time, into who-knows-where. After this the real drama starts: the next song is about how the entire facility needs to be destroyed in order to stop creatures from flooding in!



The first 2 minutes of this song actually started as the soundtrack for a cancelled Ronimo game, all the way back in 2012. I wrote a few short pieces for that and this one made it in. I started by building a rhythm from workshop sound effects, like a tape-measure sliding back in, and then composed the rest from there. While the atmosphere worked really well with our game prototype, the rhythmic sound effects turned out to be problematic: the game also had real sound effects and the two didn't combine well. So I simply removed the rhythmic part from my song, and then it sounded good in the game. I think this is a beautiful example of how creativity and iteration work: the things you make serve a purpose in the process, even those that get scrapped in the end.

When that game was cancelled, this track was shelved. Nevertheless it remained one of my favourite tracks and I came back to listen to it quite often. I wanted to do something with it, but didn't know what. Back then it was only 1:50 long and that felt too short to be 'finished'.

Then I heard Home By The Sea by Genesis. I'm a huge fan of Genesis so it's not rare for me to listen to their music, but this particular time I suddenly realised that I could try the same structure for my own song. Home By The Sea is basically two songs in one: it starts as a pop song, but halfway through it quiets down and then at 5:07 a big beat sets in and it becomes an amazing instrumental. I wanted a similar transformation in my own song! So I tried emulating that vibe, but I failed and it doesn't sound anything like the original. In this case however it did capture something else: I really liked the resulting sound! To me that's another great example of how creativity can work: take an inspiration and then morph it into something completely different to make it your own.

I continued from there, turning it into a piece that's 6:25 long. That actually makes it my longest composition since I was in a prog-rock band over 15 years ago.

For the title I drew some more inspiration from Genesis. The original title of my song was "And Then The Halls Were Empty", but the more active second part didn't fit that quiet title anymore, so I needed to add something. Genesis has a wonderful pair of songs called "Unquiet Slumbers for the Sleepers..." and "...In That Quiet Earth". I figured I could do something similar with the title of my song, hence the ellipsis (...) in my title. (Note that the second half of the title also sounds a whole lot like another Genesis song, but that's actually a coincidence that I didn't notice until I started writing this blogpost.)

The inspiration for the final bit (starting at 3:44) comes from the song Prelude 1 by Floex and Tom Hodge. Here the result is closer to the inspiration, but I feel my composition is different enough that it's become its own thing. I actually met Floex (Tomáš Dvořák) at a conference a few years ago and was fan-boying to the max, since his soundtrack for Machinarium is one of my favourite game soundtracks. Especially the second half of The Glasshouse With Butterfly is mind-blowingly beautiful. Turns out Tomáš Dvořák is also a very friendly guy to talk to. :)

Since I've started the tradition of creating cello sheet music for all of my songs, I've done so for this one as well. From the perspective of the cello this isn't my most interesting composition, but if you'd like to play the cello parts, you can find sheet music and an MP3 to play along to (without the cello) on music.joostvandongen.com. This website also has sheet music for all my other compositions.

This song actually completes the line-up for my album, which will be called "The Ageless Gate - A Cello Tale". All 13 songs are now complete, making for a grand total of 47:03 minutes of music. The only steps left now are mixing/mastering and making things like a cover and booklet, since I want to make a proper CD out of this one (as well as put it on streaming services like Spotify, of course).

Sunday, 1 March 2020

How we made particles twice as fast through cache optimisation

Cache optimisation is one of the most hardcore, low-level topics in game programming. It's also a topic that I haven't been very successful with myself: I tried optimising for cache efficiency a few times but never managed to achieve a measurable improvement in those specific cases. That's why I was quite intrigued when our colleague Jeroen D. Stout (famous as the developer of the experimental indie game Dinner Date) managed to make our particles twice as fast. What's the trick? Jeroen was kind enough to tell me and today I'll share his approach. Since it's such a hardcore topic, I'll start by giving a general overview of what cache optimisation is in the first place.



Let's have a look at computer memory, or RAM. Computers these days have gigabytes of very fast memory that's used to store whatever the game or program needs to use on the fly. But is that memory truly so fast? Physically, it's a separate part of the computer. Whenever the processor needs to get something from memory, it needs to be transported to the CPU. That only takes a tiny bit of time, but a modern processor can perform billions of operations per second. At those kinds of speeds, a tiny bit of time is actually a lot of operations that the processor could have done during that time. Instead, it's waiting for something to come out of memory.

This is where cache comes in. Cache is a small amount of even faster memory that's directly on the processor and is thus a lot faster to access than normal memory. So whenever something is already in cache, the processor doesn't have to wait for it (or at least not as long) and can keep working. According to this article cache can be up to 27 times faster than main memory.



As programmers we generally don't have direct control over what's in cache: the processor tries to utilise the cache as efficiently as possible of its own volition. However, depending on how we access memory, we can make it a lot easier for the processor to use cache efficiently. This is where cache optimisation comes in.

The goal of cache optimisation is to structure our code in such a way that we use cache as much as possible and need to wait for memory as little as possible. Whenever the processor needs to get something from memory and it's not already available in cache, that's called a cache miss. Our goal is to avoid cache misses as much as possible.

To be able to avoid cache misses, we need to know a little bit about how cache works. I'm no specialist in this field, but we only need a small bit of knowledge to already understand how a lot of cache optimisations work. These are the basic ideas:
  • Data is transferred from memory to cache in blocks. So whenever you're using something, the memory directly around that is likely also already available in cache.
  • Cache is often not cleared immediately. Things you've just used have a good chance of still being available in cache.
  • The processor tries to predict what you'll use next and get that into cache ahead of time. The more predictable your memory access, the better cache will work.

These three concepts lead to a general rule for writing code that's efficient for cache: memory coherence. If you write your code in such a way that it doesn't bounce all over memory all the time, then it will likely have better performance.
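To give a tiny, self-contained illustration of that rule (this is just a toy example of mine, not code from our engine): summing a grid row by row walks through memory linearly, while summing the exact same grid column by column hops through memory with a large stride. Both loops do the same additions, but on big grids the second one causes far more cache misses.

```cpp
#include <vector>

//Toy example: the same sum computed with two different access patterns.
//rowMajorSum reads consecutive addresses (cache-friendly), columnMajorSum
//jumps 'size' ints ahead on every step (cache-hostile for large grids).
int rowMajorSum(const std::vector<int>& grid, int size)
{
   int sum = 0;
   for (int row = 0; row < size; ++row)
      for (int col = 0; col < size; ++col)
         sum += grid[row * size + col]; //next element is right next door
   return sum;
}

int columnMajorSum(const std::vector<int>& grid, int size)
{
   int sum = 0;
   for (int col = 0; col < size; ++col)
      for (int row = 0; row < size; ++row)
         sum += grid[row * size + col]; //stride of 'size' ints per step
   return sum;
}
```

Both functions return the same result; timing them on a grid of a few thousand by a few thousand ints is an easy way to see the cost of cache misses for yourself.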

Now that the basics are clear, let's have a look at the specific optimisation that Jeroen did that halved the CPU usage of our particles.



To avoid dumping huge amounts of code here, I'm going to work with a highly simplified particle system. Please ignore any other inefficiencies: I'm focusing purely on the cache misses here. Let's have a look at a very naive approach to implementing a particle system:

struct Particle
{
   Vector3D position;
};

class ParticleSystem
{
   //speed, shouldDestroy and shouldAddParticle are assumed to exist
   //elsewhere and are omitted here for brevity
   std::vector<Particle*> particles;

   void update()
   {
      for (int i = 0; i < particles.size(); ++i)
      {
         particles[i]->position += speed;
         if (shouldDestroy(particles[i]))
         {
            delete particles[i];
            particles.erase(particles.begin() + i);
            --i;
         }
      }
      while (shouldAddParticle())
      {
         particles.push_back(new Particle{});
      }
   }
};

This code is simple enough, but if we look at it from a cache perspective, it's highly inefficient. The problem is that every time we create a new particle with new, it's placed in a random free spot somewhere in memory. That means that all the particles will be spread out over memory. The line particles[i]->position += speed; is now very slow, because it will result in a cache miss most of the time. The time the processor spends waiting for the particle's position to be read from memory is much greater than the time that simple addition takes.



I knew that this would give performance problems, so I immediately built the particles in a more efficient way. If we know the maximum number of particles a specific particle system can contain, then we can reserve a block of memory for that on startup and use that instead of calling new all the time.

This results in a bit more complex code, since we now need to manage that block of memory and create objects inside it using placement new. In C++, placement new allows us to call a constructor the normal way, but use a block of memory that we already have. This is what the resulting code can look like:

struct Particle
{
   Vector3D position;
};

class ParticleSystem
{
   //maxCount, speed, shouldDestroy, shouldAddParticle and bufferNotFull
   //are assumed to exist elsewhere and are omitted here for brevity
   unsigned char* particleMemory;
   std::vector<Particle*> particles;
   int numUsedParticles;

   ParticleSystem():
      numUsedParticles(0)
   {
      particleMemory = new unsigned char[maxCount * sizeof(Particle)];
      for (unsigned int i = 0; i < maxCount; ++i)
      {
         particles.push_back(reinterpret_cast<Particle*>(
            particleMemory + i * sizeof(Particle)));
      }
   }

   ~ParticleSystem()
   {
      delete [] particleMemory;
   }

   void update()
   {
      for (int i = 0; i < numUsedParticles; ++i)
      {
         particles[i]->position += speed;
         if (shouldDestroy(particles[i]))
         {
            //Call the destructor without releasing memory
            particles[i]->~Particle();

            // Swap to place the dead particle in the last position
            if (i < numUsedParticles - 1)
               std::swap(particles[i], particles[numUsedParticles - 1]);

            --numUsedParticles;
            --i;
         }
      }
      while (shouldAddParticle() && bufferNotFull())
      {
         //Placement new: constructor without requesting memory
         new(particles[numUsedParticles]) Particle{};
         ++numUsedParticles;
      }
   }
};

Now all the particles are close to each other in memory and there should be far fewer cache misses when iterating over them. Since I built it this way right away, I'm not sure how much performance I actually gained, but I assumed it would be pretty efficient this way.

Nope.

Still lots of cache misses.

In comes Jeroen D. Stout, cache optimiser extraordinaire. He wasn't impressed by my attempts at cache optimisation, since he saw in the profiler that the line particles[i]->position += speed; was still taking a disproportionate amount of time, indicating cache misses there.

It took me a while to realise why this is, but the problem is in the swap-line. Whenever a particle is removed, it's swapped with the last one to avoid moving all particles after it one forward. The result however is that as particles are removed, the order in which we go through the particles becomes very random very quickly.



The particles in our example here are really small: just a single Vector3D, so 12 bytes per particle. This means that even if we go through the particles in random order, we might occasionally still be staying in the same cache line. But a real particle in the Secret New Game (tm) that we're developing at Ronimo has much more data, like speed, scale, orientation, rotation speed, vibration, colour, and more. A real particle in our engine is around 130 bytes. Now imagine a particle system with 100 particles in it. That's a block of memory of 13,000 bytes. Hopping through that in random order is much more problematic for cache!

Thus, Jeroen set out to make it so that updating particles goes through memory linearly, not jumping around in memory at all anymore.

Jeroen's idea was to not have any pointers to the particles anymore. Instead we have that big block of memory with all the particles in it, and we store the indices of the first and last currently active particles in it. If a particle dies, we update the range of living particles and mark the particle as dead. If the particle happens to be somewhere in the middle of the list of living particles then we simply leave it there and skip it while updating.

Whenever we create a new particle, we just use the next bit of memory and increment the index of the last living particle. A tricky part here is what to do if we've reached the end of the buffer. For this we consider the memory to be a ring buffer: once we reach the end, we continue at the start, where there's room since those particles will have died by now. (Or, if there's no room there, then we don't create a particle since we don't have room for it.)

The code for this isn't that complicated, but it's a bit too much to post here. Instead, I'm just going to post this scheme:



Compared to my previous version, Jeroen managed to make our particles twice as fast with this approach. That's a pretty spectacular improvement and shows just how important cache optimisation can be to performance!
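To make that scheme a bit more concrete, here's a heavily simplified sketch of my own of the ring buffer idea (the names and the lifetime rule are made up for this example, this is not our actual engine code):

```cpp
#include <vector>

//Heavily simplified sketch of the ring buffer idea. Particles live in one
//fixed array; 'first' and 'count' mark the range that may still contain
//living particles. Dead particles inside that range are skipped, not moved.
struct Particle
{
   float position;
   bool alive;
};

class RingParticleSystem
{
public:
   explicit RingParticleSystem(int capacity):
      particles(capacity),
      first(0),
      count(0)
   {}

   bool spawn()
   {
      if (count == (int)particles.size())
         return false; //buffer full, so we simply don't create a particle
      int index = (first + count) % (int)particles.size();
      particles[index] = Particle{0, true};
      ++count;
      return true;
   }

   void update(float speed)
   {
      //Walk through memory linearly, starting at the oldest particle.
      for (int i = 0; i < count; ++i)
      {
         Particle& p = particles[(first + i) % (int)particles.size()];
         if (!p.alive)
            continue; //dead particle in the middle: skip it
         p.position += speed;
         if (shouldDestroy(p))
            p.alive = false;
      }
      //Shrink the range from the front past any dead particles.
      while (count > 0 && !particles[first].alive)
      {
         first = (first + 1) % (int)particles.size();
         --count;
      }
   }

private:
   static bool shouldDestroy(const Particle& p)
   {
      return p.position > 10; //placeholder lifetime rule for this example
   }

   std::vector<Particle> particles;
   int first; //oldest slot that may still hold a living particle
   int count; //number of slots from 'first' up to the newest particle
};
```

Note that the update loop only wraps around the end of the buffer once at most, so the memory access is still almost entirely linear.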

Interestingly, my first hunch was that this optimisation is a bad idea, since we need to touch all the dead particles in between the living ones to know that they're dead. Worst case, that's a lot of dead particles. In fact, in terms of computational complexity, the worst case has gone from being linear in the number of currently living particles to being linear in the size of the entire buffer, since in the worst case it's mostly filled with dead particles that we still need to skip. When I studied computer science at university there was a lot of emphasis on Big O Notation and complexity analysis, so to me this is a big deal.

However, as was mentioned above, accessing cache can be as much as 27 times faster than accessing main memory. Depending on how many dead particles are in the middle, touching all those dead particles can take a lot less time than getting all those cache misses.

In practice, the lifetimes of particles don't vary all that much. For example, realistic particles might randomly live between 1.2 seconds and 1.6 seconds. That's a relatively small variation. In other words: if there are dead particles in between, then the ones before them will also die soon. Thus the number of dead particles that we need to process doesn't become all that big.

The big lesson this has taught me personally is that cache optimisation can be more important for performance than Big O Notation, even though my teachers at university talked about Big O all the time and rarely about cache optimisation.

EDIT: Several commenters on Twitter and Reddit suggested an even simpler and faster solution: we can swap the dead particles towards the end of the list. This removes the overhead of needing to skip the dead particles and also removes the memory that's temporarily lost because dead particles are in the middle. It also simplifies the code as we don't need to handle that ring buffer anymore.
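For the curious, here's a simplified sketch of what that suggestion could look like (again with made-up names, not actual engine code). When a particle dies, the last living particle is copied into its slot, so the living particles always form one contiguous block at the front:

```cpp
#include <vector>

//Simplified sketch of the swap-to-end suggestion. The living particles
//always occupy indices [0, numAlive), so the update loop walks memory
//strictly linearly and never touches a dead particle.
struct Particle
{
   float position;
};

class CompactParticleSystem
{
public:
   explicit CompactParticleSystem(int capacity):
      particles(capacity),
      numAlive(0)
   {}

   bool spawn()
   {
      if (numAlive == (int)particles.size())
         return false; //buffer full
      particles[numAlive] = Particle{0};
      ++numAlive;
      return true;
   }

   void update(float speed)
   {
      for (int i = 0; i < numAlive; ++i)
      {
         particles[i].position += speed;
         if (shouldDestroy(particles[i]))
         {
            //Overwrite the dead particle with the last living one.
            particles[i] = particles[numAlive - 1];
            --numAlive;
            --i; //re-process the particle we just moved into this slot
         }
      }
   }

   int aliveCount() const { return numAlive; }

private:
   static bool shouldDestroy(const Particle& p)
   {
      return p.position > 10; //placeholder lifetime rule for this example
   }

   std::vector<Particle> particles;
   int numAlive;
};
```

The trade-off is that every death now copies an entire particle (around 130 bytes in our engine), but in exchange the update loop never has to skip dead particles and the ring buffer bookkeeping disappears.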

For further reading, I would like to again recommend this article, as it explains the principles behind cache lines very well and shows some clever tricks for optimising to reduce cache misses.

As a final note, I would like to also stress the unimportance of cache optimisations. Computers these days are super fast, so for the vast majority of code, it's fine if it has very poor cache behaviour. In fact, if you're making a small indie game, then it's probably fine to ignore cache misses altogether: the computer is likely fast enough to run your game anyway, or will be after only much simpler optimisations. As your game becomes bigger and more complex, optimisations become more important. Cache optimisations generally make code more complex and less readable, which is something we should avoid. So, use cache optimisations only where they're actually needed.

Have you ever successfully pulled off any cache optimisations? Are there any clever tricks that you used? Please share in the comments!

Sunday, 16 February 2020

New song: Approach of the Derelict Research Station

Here's my newest song: Approach Of The Derelict Research Station! This one is not just my newest song, but also one of the oldest: it's an evolution of a song that I wrote some ten years ago, then called Space War X9.

Two years ago, when I decided that the songs I was writing at that time should form an album together, I wondered what else should be on it. I've always been quite fond of Space War X9 and wanted to include it, but it was lacking an important ingredient: cello! On my album I want all the songs to feature my cello in a prominent role. However, back when I made Space War X9 I had no idea how to record my cello in a way that made me happy with the results, so I mostly composed purely electronic music.

Adding cello was an interesting challenge: the tonality of Space War X9 turned out to be rather odd and none of it had been written with an additional instrument in mind. Still, the cello somehow ended up being the lead instrument through most of the song; listening to the new version, it now feels weird that it ever lacked one.

This song is also the first one to be released with narration. On my album all the songs will tell a story together. Voice actor Chris Einspahr has graciously lent his voice and I love how he acts. His voice also blends very well with the music. Check it out:



At the end of the song there's a bit of crazy cello improvisation. I often do these kinds of intense gliding notes during jam sessions but this is the first time I dared include one in a real song. I'm quite happy with how that turned out, and it nicely captures the craziness of the fight portrayed at the end of the song.

If you'd like to play this song yourself and give your own twist to the crazy ending, you can find cello sheet music for it on music.joostvandongen.com, as well as a version of the song without the cello to play along to.

The art in the video is by Karen Papazian, who allowed me to use her existing piece "Not much left". While it wasn't originally drawn for this song, it perfectly fits what I imagined the setting to be.

Finally, I should mention the samples I used here. I found them on FreeSound.org and it's great that most audio on this website requires nothing more than attributing the original artists before you can use it. The samples used in Approach Of The Derelict Research Station were made by 211redman112, foxraid, lezaarth, TMPZ_1, Ideacraft, Deathscyp, rene___, steshystesh, jonccox and lollosound. (Gotta love the kinds of names people give themselves on the internets!)

This leaves only one song to finish for my album, almost there!

Sunday, 2 February 2020

Five important realisations about game balance

The games we've so far made at Ronimo have all featured a heavy emphasis on competitive multiplayer. Designing, testing and iterating these games, especially our biggest hit Awesomenauts, has taught us many things about balancing. Today I'd like to share some of the most important lessons we've learned along the way.

1. Overpowered is much worse than underpowered


At first glance one might think that in game balance, underpowered and overpowered are equally bad: they both mean something is badly balanced and needs to be improved. This is true, but in practice overpowered things turn out to have way more impact than underpowered ones.

The reason for this is that players tend to flock to whatever is strongest and use only that. For example, Awesomenauts has 34 characters. If 3 of those were underpowered, then most players wouldn't play them, leaving 31 viable characters. That's still plenty of choice and variation. On the other hand, if 3 characters were overpowered, then players would play only those 3 and ignore the rest. That would make the game very repetitive and turn stale quickly.

This knowledge can be used as a crude tool in cases where no better solution is available. For example, if something is overpowered only under certain circumstances, then you might choose to nerf it until it's balanced under those specific circumstances, accepting that it becomes underpowered in all other situations. This way at least it isn't dominating the game anymore.

2. Variety always adds imbalance


A game with just one weapon on just one symmetrical map will pretty much automatically be balanced, if only because all players are in exactly the same situation. Such a game would however probably not just be balanced, but also boring. So we need to add variety: more weapons, more maps, more items, more builds, more everything. Maybe even asymmetrical maps. The key thing to realise when doing this is that as the game becomes more complex, 'perfect' balance becomes ever more difficult to achieve. This quickly gets to the point where 'perfect' balance is impossible, and every bit of variation you add makes the game a bit more unbalanced.

Let's look at a really simple example: walking speed. Let's say some characters are fast and others are slow. This gives the slow characters a disadvantage that can be balanced by giving them more health and damage. However, now we also add maps of different sizes. On a bigger map, the disadvantage for the slow characters will be bigger than on a small map. This is because if the arena is small, slow characters can get to the other side fast enough anyway. No amount of health or damage tweaks will fix this, since it differs per map.



One solution to this can be found in our game Swords & Soldiers 2. There in matchmade online matches, we only use certain maps: the ones that we feel are most balanced. On the other hand, when you invite a friend to play, you can choose from all maps, including a bunch of pretty weird ones. Those might be less balanced, but they add a lot of spice and fun. Depending on how much you want to appeal to players with a competitive mindset, you can choose to include those varied but imbalanced maps, or not.

3. Competitive players often dislike randomness and luck


When Awesomenauts launched, the game had random crits, which dealt a lot of extra damage. This added surprise and suspense: every hit might be extra strong! It also means that even if the opponent is better than you, you might occasionally win because you got lucky. This makes a game a lot more friendly for beginners. An extreme example of this can be found in Mario Kart: this game includes a lot of randomness. Combined with a bunch of catch-up mechanics, Mario Kart is a game where occasionally a n00b can beat a pr0.

However, randomness also adds bad luck: sometimes you clearly outplay an opponent and still lose, because the enemy got lucky and landed several crits in a row. In a sense this might feel nice: since the opponent clearly just got lucky, you don't need to blame yourself for losing. Many competitive players however don't want luck to factor into the equation. They want a very simple thing: the best player should win. "If I practice more and get better, then I should always win." In the years after release many players in the Awesomenauts community got better and more competitive, to the point where a lot of players really wanted the random crits removed from the game. For this reason we ended up changing crits into a predictable system where simply every third hit deals more damage.



4. Balance automatically becomes worse over time


Even if you think the balance in your game is in a good place, by simply leaving it as it is for a while, it will deteriorate. The reason for this is that as time passes, players get better at the game, learn new tricks and talk to each other. This changes how the game is played, thus also changing how the balance is experienced, and usually not for the better: as time progresses without balance tweaks, the balance deteriorates.

One example we've seen with Awesomenauts was that at some point after a few months of stable balance, one specific team discovered a new tactic that was super strong. This tactic had been possible for months, but somehow no one had found it yet. This tactic was then first used in a tournament, where that team gloriously beat everyone else and won. After the tournament, news spread like wildfire and suddenly this tactic was used in almost every match. We had no choice but to quickly do a balance patch specifically to nerf this one particular tactic. (The fact that our game is deep enough that players can discover new tactics this way is one of the things I'm most proud of in all of my career as a gamedev.)

Another example of balance getting worse over time might not even be caused by something actually being way too strong. Maybe there's something that's just slightly overpowered, so mildly that it really doesn't matter. As time passes, players write guides and talk about the best tactics. They will point each other to this subtle advantage, causing more and more people to use it. Even if the advantage is really small, or, even more extreme, even if the advantage doesn't exist and players are just imagining it, this still ruins the game for the simple reason that everyone starts doing the same thing. This makes the game predictable and boring.

Sometimes you get lucky and players start responding to that imbalance. Maybe an underpowered character happens to be really strong in that particular situation. Since that situation now happens so often, that underpowered character is suddenly super strong in most matches, causing lots of players to choose him. This makes the dominant strategy shift naturally from one thing to another, adding variation and fun. I've been told some games in the Super Smash Brothers series are balanced excellently in this regard: as soon as one character becomes dominant, its counter becomes extra interesting and lots of players start playing the counter, at which point the counter of the counter becomes useful, etc. This causes the balance to slowly but constantly shift.

5. 'Perfect' balance is impossible


The final point I'd like to share today is both soothing and intensely frustrating: for any game that has significant complexity and variation, perfect balance is impossible. This is a soothing thought in the sense that it makes you realise that even if you were the best game designer in the world, your game would still not have perfect balance. It's also frustrating, because of course the goal of the game designer is to make the balance really good. Knowing that the balance will never be truly fantastic makes balancing a frustrating experience.

So, why is 'perfect' balance not possible? I've already mentioned above that as more variation is added, it becomes impossible to make all options equally strong under all circumstances. But there's more. What about players of different skills? Some characters/weapons/maps are bound to be more difficult to play than others. The result is that for beginners, the balance will be different than for pros. And what about simply different tastes? Some players prefer fun and variation, while other players prefer predictability and skill. The balance can't make both groups perfectly happy. Combined, all of these elements make it impossible to achieve 'perfect' balance.

This post has shared some of the things we've learned through the years about balance. What are your most valuable or most surprising insights regarding balance?

PS. Some of the topics in this blogpost have been discussed in more detail in previous blogposts. If you'd like to read more, have a look at the following posts: