Game Performance: It’s the Software’s Fault
August 21, 2015
When I last reported on the progress of the upcoming Sleuthhounds: The Cursed Cannon computer game, I was just embarking on improving the performance of the game for running on tablets. My own tablet was having hardware difficulties that made it hard to judge said performance. With those difficulties resolved, I’ve turned my attention to improving the game and have made some pretty decent gains in that area.
Strap in, ’cause this is gonna be another long blog post.
After addressing the hardware issues on the tablet, my game was stable but running less than optimally. The largest room in the game is the base of a giant cannon. The background for this room is so large that it doesn’t all fit on screen at the same time and can scroll both horizontally and vertically. The size of the image used for the room is 1624x1154 pixels. Additionally, the room can display up to six characters at any given time.
To put these figures into perspective, the largest room in the first game (Sleuthhounds: The Unlocked Room) was 779x626 pixels and only showed one character. A smaller room than that could show two characters.
When tackling the performance of the game on the tablet, there were two issues I was concerned with.
- Loading Time – The cannon base room with its multiple characters took about 6 seconds to load every time you entered the room. That’s long enough to make a person think the game has stopped responding. Never a good thing.
- Rendering Speed – The cannon base room was rendering at a measly 12 frames per second (FPS). For smooth animation, 24 FPS is generally considered the minimum, and 30 FPS or more is even better.
Games are very complex pieces of software. There are a lot of potential places where inefficiencies can be coded into a game, and a lot of time can be spent blindly jumping about in the code hoping to stumble across the big bottlenecks. To help my analysis, rather than using the random shotgun approach, I employed a profiler.
A profiler is an application (sometimes an integrated tool) that can be run alongside the main application to determine which parts of the code are being executed the most. There are two types of profilers:
- Instrumenting Profilers – These types of profilers require changes in the source code that indicate the entrance and exit points of each subroutine. When the application is run with these profilers they know exactly when a given piece of code executes and how long it takes. Every time a subroutine is called that call is logged. This type of profiler gives a very detailed breakdown of the execution of an application but at the cost of slowing performance down from all the logging that goes on.
- Sampling Profilers – These types of profilers don’t require changes to the code. Instead, as they run they periodically check what instruction from the code is being executed. Through memory mapping magic this can be translated back into the actual subroutines being called. The nature of these kinds of profilers is such that they don’t incur the performance penalties of instrumenting profilers. However, because they’re only sampling where the current code execution is, they don’t have the same detailed execution breakdown.
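To make the distinction concrete, the heart of an instrumenting profiler can be sketched in a few lines of Python (the game itself is Delphi, and the function names here are made up for illustration): hook every function entry and tally which routines get hit.

```python
import sys
from collections import defaultdict

call_counts = defaultdict(int)  # how many times each function was entered

def tracer(frame, event, arg):
    # sys.setprofile fires this hook on every function entry and exit --
    # exactly the instrumentation an instrumenting profiler injects.
    if event == "call":
        call_counts[frame.f_code.co_name] += 1

def hot_function():
    return sum(range(100))

def workload():
    for _ in range(50):
        hot_function()

sys.setprofile(tracer)
workload()
sys.setprofile(None)

print(call_counts["hot_function"])  # 50 entries into the hot spot
```

A sampling profiler instead wakes up on a timer, records where execution currently is, and maps that back to a routine: no hooks in your code, far less overhead, but a coarser picture.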
In my case, since I wasn’t sure where the performance hit or hits were in my code I decided to go with a sampling profiler. I wanted to keep my game running as close to its normal speed as possible and hoped to get a general idea of where the bottlenecks were, which is what sampling profilers are intended for. I developed my game using Delphi 5 (still working after all these years). I used the Delphi Sampling Profiler from DelphiTools. It takes only a couple quick easy steps to configure Delphi and the Sampling Profiler to produce usable profiling results.
Very quickly the profiler highlighted a problem routine in my game’s rendering code. A full 17% of total execution time was falling into the evil ModelGet routine!
I first started seriously working on game programming way, way back in 2006. At that time I started work on a basic game framework to handle all the common game chores of displaying images on screen, playing sounds, managing game states, dealing with different input devices, and so on.
I wasn’t too sure how to go about developing such a framework but my general experience as a programmer told me that whatever solution I came up with, it would grow and change and evolve with time.
In the case of the graphics module of the software I chose to use OpenGL over DirectGraphics as it fit my programming style more closely. I implemented a basic model structure in the framework that is used to represent and render both 3D objects and 2D sprite images. Each of these objects was assigned a unique model ID that the rest of the game framework could reference whenever the model needed to be manipulated or rendered.
This method of doing things meant that other modules in the framework only needed to know the unique model IDs and not care at all about the internals of how the models were handled. Theoretically this was a great idea. Theoretically.
Once the profiler had pinned down ModelGet as the single worst performance offender I went in to take a closer look at it. Within the graphics module of the game all models are stored in a dynamic array. Whenever a model is needed, to be manipulated or to be rendered, the game does a simple loop from the beginning of the array to the end looking for the particular model based on its model ID.
In the cannon base screen the background is a model. And every frame of each character animation is a model. And all the on screen user interface components are models. And all the dialog bubbles that appear are models. And on and on and on. When I added them all up, the cannon base screen had over 800 models. That was 800 different models that had to be searched every time a model was referenced. Just rendering the scene meant, on average, 800 x 400 = 320,000 searches. That’s not even counting moving models about as characters walk. Or changing model colors to account for different lighting effects. Or changing the size of models to make them scale properly in a scene. And all of those searches were happening for every single frame rendered by the game.
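The pattern, sketched in Python rather than the framework’s actual Delphi (class and routine names here are illustrative), looked roughly like this:

```python
# Sketch of the original ID-based lookup. Every access walks the model
# array from the beginning until the matching ID turns up.
class Model:
    def __init__(self, model_id):
        self.model_id = model_id
        self.x = 0.0
        self.y = 0.0

models = [Model(i) for i in range(800)]  # the cannon base's ~800 models

comparisons = 0  # count ID comparisons to show the cost

def model_get(model_id):
    # Linear search: O(n) per access, and it runs for EVERY manipulation
    # or render of every model, every frame.
    global comparisons
    for model in models:
        comparisons += 1
        if model.model_id == model_id:
            return model
    return None

# Rendering a single frame touches each model once...
for i in range(800):
    model_get(i)

print(comparisons)  # 800 models x ~400 average comparisons = 320,400
```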
The kicker is, I’d even left myself a comment in the code:
//RH - I know this isn’t a good way to do this, but I’m doing it for now because my current project doesn’t have a lot of models.
Between 2006 and 2015 I’d forgotten about that comment.
After reviewing ModelGet I took a couple of hours to go through my game code and convert it from using model IDs to using pointers. The benefit of passing a pointer for a model around is that no searching for that model whatsoever is needed. The drawback is that if the pointer ever points to garbage…well, garbage in, garbage out. However, I wasn’t too concerned about that since the framework has become very stable over time and the pointers were quite easy to manage.
Using pointers meant that I could eliminate ModelGet entirely. Doing so improved the tablet performance from 12 FPS to about 25 FPS. Fast enough for smooth animation, but still slow enough that the mouse felt a little sluggish on the tablet.
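The refactored approach, again as an illustrative Python sketch rather than the actual Delphi code, hands out direct references instead of IDs, so no search happens at all (a Python object reference plays the role of a Delphi pointer here):

```python
class Model:
    def __init__(self):
        self.x = 0.0
        self.y = 0.0

models = [Model() for _ in range(800)]

# Callers hold the model itself rather than an ID that must be searched
# for. In Delphi this is a pointer to the model record.
def move_model(model, dx, dy):
    # O(1): just follow the reference -- zero searching.
    model.x += dx
    model.y += dy

hero = models[42]           # obtained once, when the model is created
move_model(hero, 5.0, 0.0)
print(hero.x)  # 5.0 -- and no array walk was needed
```

The trade-off, as noted above, is that a dangling pointer gives you garbage instead of a polite “not found”, which is only acceptable once the framework is stable.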
So back to the profiler I went. Now the great thing about a sampling profiler is not only does it show how much time is being spent in your code but how much time is being spent in any libraries that you may have linked to. After eliminating the ModelGet problem there weren’t any other large bottlenecks being identified in my own code. However, the profiler identified that a huge amount of time was being spent in the OpenGL DLL of my tablet. Most of that time was the result of being called from my framework’s main rendering routine.
When I took a look at my rendering routine I found that it was already written really tightly. There wasn’t a lot in the code that could be tweaked, but one thing did nag at me.
My framework is based on the older OpenGL 1.2 specification. That was the current spec when I first started learning OpenGL. By the time I came to start work on the game framework a couple newer versions of the spec were out but hadn’t received large adoption in the hardware market. OpenGL 1.2 still seemed the way to go.
OpenGL is concerned with the rendering of 3D objects. 2D images can also be rendered by basically drawing a rectangle oriented parallel to the screen. Even though it’s represented as 3D data it looks like a 2D image. That’s the approach the Sleuthhounds games take.
Now, when it comes time to render objects to the screen with OpenGL those objects can have various attributes or states applied to them. For example, maybe one object is tinted green and one is tinted blue. Maybe one object needs to show a texture (an image of some sort) and one object needs to just be a flat color with no texture. Changing these attributes between objects causes state changes within OpenGL.
The number one way to improve OpenGL performance is to decrease the number of state changes made every frame that’s rendered.
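For example, one standard technique (not necessarily the one the Sleuthhounds framework uses) is to sort or batch draws by texture so that each texture bind is reused across several objects. A toy Python sketch of the effect, with made-up sprite names:

```python
# (texture_id, sprite_name) pairs in arbitrary draw order.
sprites = [
    (1, "hero"), (2, "cloud"), (1, "villain"), (3, "bubble"),
    (2, "cloud2"), (1, "lamp"),
]

def count_binds(draw_list):
    # Each time the texture differs from the previously drawn one, OpenGL
    # needs a state change (a glBindTexture call, in this case).
    binds = 0
    current = None
    for texture_id, _ in draw_list:
        if texture_id != current:
            binds += 1
            current = texture_id
    return binds

naive = count_binds(sprites)                                # 6 binds
batched = count_binds(sorted(sprites, key=lambda s: s[0]))  # 3 binds
print(naive, batched)
```

In a real 2D scene you can’t always reorder freely (alpha-blended sprites must draw back to front), which is why reducing the sheer number of textures, as described below, matters just as much.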
Almost all of the graphical components of the Sleuthhounds games are built from static images (kind of like the images you get when taking pictures with digital cameras or phones except the game images are drawn by hand). These images are rectangles of different sizes and orientations.
OpenGL, at least in the 1.2 era, likes to deal with images (textures) whose dimensions are powers of 2. So you can set up textures that are 32x32 pixels, or 64x64 pixels, or 512x128 pixels. But you can’t set up textures like 17x39 pixels.
Within my game framework, when an image is loaded up the framework goes through and breaks up the single “any shape” image into several images that fit the sizes that OpenGL mandates. This is all done behind the scenes so that when I’m developing an actual game with the framework I can just say “open image” and not care about what its actual dimensions are.
When I was first developing the game framework (again back in 2006) the OpenGL 1.2 specification mandated that all implementations of OpenGL had to support texture sizes of at least 64x64 pixels. So the simple-minded solution for dealing with oddly shaped images was to just go through and break them up into multiple textures each 64x64 pixels in size.
Suppose an image being used was 116x72 pixels in size. Representing this in 64x64 pixel chunks requires four textures to be created. When the image gets rendered, that requires four OpenGL state changes to occur. However, if the OpenGL implementation supports larger texture sizes, 128x128 for example, then the entire image can be represented by a single texture and so requires only one state change.
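That fixed-size splitting can be sketched as follows (a Python illustration of the scheme just described, not the actual Delphi routine):

```python
def split_into_tiles(width, height, tile=64):
    # Break an arbitrarily sized image into fixed-size tile regions.
    # Edge tiles come out smaller than tile x tile and get padded up to
    # the full power-of-two texture size when uploaded.
    tiles = []
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            w = min(tile, width - x)
            h = min(tile, height - y)
            tiles.append((x, y, w, h))
    return tiles

regions = split_into_tiles(116, 72)
print(len(regions))  # 4 textures for a 116x72 image
print(regions)
```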
The worst case room in the first Sleuthhounds game is 779x626 pixels. That requires 130 64x64 pixel textures to be created to represent it (read 130 OpenGL state changes).
The worst case room in The Cursed Cannon is 1624x1154 pixels. That requires 494 textures. Nearly four times the number of state changes!
The 64x64 texture size was the size that OpenGL implementations had to support… in 1999. Nowadays texture sizes of 512, 1024, 2048, or even higher are not uncommon. The solution was simple: reduce the number of needed textures by increasing the texture size.
Within the framework, I changed my algorithm for breaking up images so that instead of using the simple but small 64x64 textures the framework would work out what size of textures a specific OpenGL implementation supported. It would then use the largest textures possible without exceeding the size of the image. Smaller textures would then be created to get any “spillover” that didn’t fit in the larger textures.
The result on my tablet, which is several years old at this point, is that the 494 textures have now dropped to 12 textures, with the largest texture size being 512x512. With this change in place, the 25 FPS on my tablet increased to about 32 FPS. My goal had been to hit 30 FPS, so I was quite happy that with only a few changes to the framework I was able to get the rate up above my goal.
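The arithmetic behind those counts is just tiling with the largest chunk size the implementation supports (queryable in OpenGL via GL_MAX_TEXTURE_SIZE). A quick Python check of the figures from this post:

```python
import math

def texture_count(width, height, max_texture_size):
    # Tile the image in chunks of up to max_texture_size on a side; each
    # chunk lives in its own (power-of-two) texture. The "spillover"
    # strips along the right and bottom edges just become smaller chunks.
    cols = math.ceil(width / max_texture_size)
    rows = math.ceil(height / max_texture_size)
    return cols * rows

# The worst-case rooms:
print(texture_count(779, 626, 64))     # 130 textures in the first game
print(texture_count(1624, 1154, 64))   # 494 with the old 64x64 scheme
print(texture_count(1624, 1154, 512))  # 12 with 512x512 on the tablet
```

(The exact power-of-two padding of the edge chunks is an implementation detail I’m glossing over here; the counts are what matter.)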
With the framerate in reasonably good shape, I turned my attention to the loading time. That cannon base room was still taking about 6 seconds to load on the tablet. Far too long.
Here’s where one of the difficulties of a sampling profiler comes into play. The profiler reports on performance based on how long the application spends in a section of code compared to the overall execution time. While 6 seconds is a long load time, running the game for even as little as 30 seconds meant that the 6 seconds was already in the minority of the game’s overall execution time. The results were skewed such that the load performance didn’t look as bad as it actually was.
In this instance, because I knew that some part of the load code was a problem I was able to help the profiler out a bit. I basically cheated. I changed the game so that when a new room loaded it would actually run the load code 100 times in succession. This caught the profiler’s attention and it very quickly narrowed down the next culprit.
All images that the game uses, whether they’re backgrounds or sprites or UI elements, are saved as TARGA images (like JPEGs but, um, different). These images, of course, reside on the hard drive until they need to be loaded. The problem is, loading things from the hard drive is very slow.
The example my university prof once gave still sticks with me to this day. Generalizing a bit, there are three physical places that data can come from on a computer. The fastest is the cache. Recently accessed data is temporarily stored here so that if it needs to be accessed again it can be retrieved as quickly as possible. You can think of this like having a book on the desk you’re working at. You can grab that book very quickly, but your desk isn’t going to be big enough to hold very many books.
The next place that data can come from is random access memory. Accessing data here isn’t as fast as going to the cache, but it’s still pretty quick and it can hold a lot more than the cache. This is like having a wall sized bookshelf on the other side of the room from you. You have to get up and walk over to the bookshelf, grab a book, and bring it back to the desk.
The third place data can come from is the hard drive. This can hold a HUGE amount of data, but accessing it is very slow. This would be like getting up from your desk, hopping into a car, driving to the library in a neighboring city, checking out a book, driving back home, and then returning to your desk.
Reading data from the hard drive is a costly operation. So when you do have to read from the hard drive, you want to try to grab as much as you can so that you take as small a hit as possible.
Unfortunately, I hadn’t optimized my TARGA loading routine. The routine would read three or four bytes, process them, then read another three or four bytes, and so on until it had handled an entire image. Since my worst case image, the cannon base, is 5.36 MEGAbytes, reading only a handful of bytes at a time was painfully, cumbersomely slow. Following the book analogy from above, it was like going to that other city’s library and coming back with one book when I really needed a U-Haul to come back with a couple of MILLION books.
Changing up the TARGA loading to bring the whole image from the hard drive into random access memory before processing it had the effect of dropping the load time down from 6 seconds to under 1 second on the tablet. Fast enough so that the game no longer seems like it’s going unresponsive.
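A minimal sketch of the fix, in Python rather than the game’s Delphi: pull the whole file into memory with a single read, then parse from the buffer. (The TARGA header really does keep width and height as little-endian 16-bit values at byte offsets 12 and 14; the demo file name below is made up.)

```python
import os
import struct
import tempfile

def load_tga_dimensions(path):
    # Read the ENTIRE file in one trip to the disk, then parse the
    # in-memory buffer instead of issuing thousands of tiny reads.
    with open(path, "rb") as f:
        data = f.read()
    # TGA header: width and height at byte offsets 12 and 14,
    # little-endian; the header itself is 18 bytes long.
    width, height = struct.unpack_from("<HH", data, 12)
    return width, height, data[18:]  # pixel data follows the header

# Build a minimal 2x2, 24-bit uncompressed TGA on disk to demonstrate.
header = bytes([0, 0, 2]) + bytes(9) + struct.pack("<HH", 2, 2) + bytes([24, 0])
pixels = bytes(2 * 2 * 3)  # 12 bytes of black pixels
demo_path = os.path.join(tempfile.gettempdir(), "demo.tga")
with open(demo_path, "wb") as f:
    f.write(header + pixels)

w, h, pixel_data = load_tga_dimensions(demo_path)
print(w, h, len(pixel_data))  # 2 2 12
```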
There are a number of places, both in the rendering and in the loading, that can still be optimized to squeeze out more performance from the game. I may take a look at some of those before the release of the game, but for now at least the game is running comfortably fast enough for my needs.
Getting performance to acceptable levels has removed one of the two last major tasks from the To Do list for the game (the other major task being incorporating all the audio for sound effects, music, and dialog lines). The project has very much entered into the polishing phase of production. Although I don’t have a specific date just yet, it’s looking like the game is heading towards an early September release. Stay tuned to the blog for more details as the release nears.