Analytics: A Scientific Approach to Game Development

The Game Designer has the most important role on the team. A game can be ugly, broken, and slow to update, but if it is fun it can succeed in spite of those things. This isn’t a brag, it doesn’t mean other roles are worthless, and it isn’t saying designers are smarter. In fact, this is a callout.

Despite the massive amount of responsibility, time and time again I see well-crafted games faceplant because the designer just didn’t check their work. I’m here today to warn you about a type of Game Designer I’m labeling as fortune-tellers.

I’ve met many fortune-tellers. They make assumptions about the most important parts of the game, and never bother to verify them, instead waiting for the initial release to confirm their ideas. They act like they were handed a blueprint from the gods and that any fact-checking is sacrilege.

I’m calling it, this is professional negligence. It’s a strong statement, but at this point, I can see it no other way. I see it so clearly because it’s where my career as a professional developer almost ended.

This is a story about almost losing everything to overconfidence, and the tedious road every game developer needs to travel if they want to still have a job in 5 years.

The Cost of Designer Arrogance

The first step is recognizing you have a problem. Do you relate to this story?

I was giving a presentation alongside another student as a UX Design major at Purdue University. We were tasked with finding a way to improve the design of Fitbits. We gave the presentation, and ours was the only design that people couldn’t seem to pick apart during the Q&A portion. I thought we had done really well. We got a D.

Our professor pointed out that while we did “test” our concept, the actual reasoning we used for defending our design was all theoretical. We had essentially done the product testing as lip-service to the rubric, rather than as a defensible way to prove its viability.

At the time I vehemently disagreed. I hated testing designs. It was slow. It involved too many social interactions for my introverted soul. In a way I’d tested it by showing it to my peers (they all said they liked it after all). On top of that, I already thought I was right — so what was the point?

I thought that if I spent enough time in the planning process, in predicting interactions through studying successful designs, I could make up for not testing. I was so sure that all the testing was holding me back that I ended up changing majors. I was arrogant, preferring pride over proof.

This attitude towards design reached its peak during the summer of 2019. I was working hard as a Roblox Accelerator Intern on my game, Mortal Metal.

Spoiler alert: the game failed massively. And even worse, it failed because my old professor was right and testing is needed for more than just debugging code.

The failure was quite public. Everyone in the office knew. All tens of thousands of my community members knew. My team especially knew. Despite that, they never really bashed me over it. They knew the blame was mine alone, but they also knew nobody had given more to the project than I had.

I had spent all my savings, pulled dozens of all-nighters, and worked at least 11 hours every day, weekends included, for 3 months. I wrote up 100+ pages of design documentation in hopes of shoring up my chances. I was the designer, the producer, the funding, and the programmer. I gave the game my all.

I did everything I could think to do, and it all fell apart. It left me beyond despair, and with a valuable lesson. I realized you aren’t going to succeed in life simply because you work hard and really believe in yourself. This isn’t a children’s cartoon. Success is about strategy and luck. When you inevitably run out of luck, you better hope you’ve built up your strategy.

How much luck do you have left?

Becoming a Game Doctor

I think one thing most people can agree upon is that they would not like to be sick prior to modern medicine. There was a time when supposed experts prescribed unproven fixes to maladies, taking recoveries as confirmation of validity, and deaths as simply beyond their power. So what changed? Why don’t we prescribe leeches for a cold anymore?

The scientific revolution happened. As math and science evolved, the medical field was reshaped to be more and more reliant on theories emergent from data-driven analysis and testing. In many ways, doctors had to become scientists. As scientists, they were called to be more careful, they were called to operate within the realm of proven truths, rather than speculation.

As a game developer, I want you to view yourself as a game doctor.

In medicine, there are various metrics you can use to determine a person’s health. This includes things like heart rate, oxygen levels, temperature, cholesterol levels, blood sugar, sodium levels, blood pressure, etc. While these numbers are simplifications and snapshots of a complicated dynamic system, in many cases they’re good enough to make informed decisions.

For example, if a patient’s oxygen is low, you can attach them to a ventilator. If their blood pressure is low, you can give them a blood transfusion. Vitals aren’t just useful for emergencies either. At general check-ups, doctors can provide you with actionable goals and techniques, and if necessary prescribe certain treatments.

Wouldn’t it be great if we could do this for games? Well, the good news is we can! You may have heard the term before: these vitals are called Key Performance Indicators, often abbreviated as KPIs. So what types of KPIs do you need to worry about?

  • Session Duration: This is all about how players are enjoying the game. Problems with it can indicate issues with onboarding and the core loop. As you may have guessed this is typically measured as an average session duration across your audience.
  • Retention: This is all about getting players to return to your game. It’s arguably the most important KPI, having connections to almost every aspect of game performance. This is typically measured as a percentage of players who return a certain period of time after their first visit.
  • Spending: This is where game development becomes a business. Honestly, this KPI gets too much attention, often to the detriment of games. It doesn’t matter what your revenue per user is if you have 0 users. That being said, it’s half of the equation for determining whether you’re ready to advertise your game. Spending is frequently measured by dividing the amount of revenue by the number of users.
  • Discovery: This is all about how many new players try your game. This was one I underappreciated for a while, but it is as important as monetization, if not more so. This is often described as how much it costs to bring in a new player, as well as how many players naturally discover the game a day.
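As a rough illustration of how these four vitals might be computed from raw play data, here is a minimal Python sketch. The `Session` record, its field names, and the sample numbers are all hypothetical; real platforms expose this data in their own formats, and definitions of retention vary between analytics tools.

```python
from dataclasses import dataclass

@dataclass
class Session:
    player_id: str   # hypothetical unique player identifier
    day: int         # days since launch on which the session happened
    minutes: float   # how long the session lasted
    revenue: float   # money spent during the session

def average_session_minutes(sessions):
    """Session Duration: mean length of all recorded sessions."""
    return sum(s.minutes for s in sessions) / len(sessions)

def day_n_retention(sessions, n=1):
    """Retention: share of players who play again exactly n days after their first visit."""
    first_day, days_played = {}, {}
    for s in sessions:
        first_day[s.player_id] = min(s.day, first_day.get(s.player_id, s.day))
        days_played.setdefault(s.player_id, set()).add(s.day)
    returned = sum(1 for p, d0 in first_day.items() if d0 + n in days_played[p])
    return returned / len(first_day)

def revenue_per_user(sessions):
    """Spending: total revenue divided by the number of distinct players."""
    players = {s.player_id for s in sessions}
    return sum(s.revenue for s in sessions) / len(players)

def cost_per_new_player(ad_spend, new_players):
    """Discovery: what it costs to bring in one new player."""
    return ad_spend / new_players

# Made-up sample data: three players over a few days.
sessions = [
    Session("a", 0, 10.0, 0.0),
    Session("a", 1, 20.0, 2.0),
    Session("b", 0, 5.0, 0.0),
    Session("c", 1, 15.0, 1.0),
    Session("c", 3, 10.0, 0.0),
]
avg_minutes = average_session_minutes(sessions)   # 12.0
d1 = day_n_retention(sessions, n=1)               # only player "a" returned the next day
arpu = revenue_per_user(sessions)                 # 1.0
cpi = cost_per_new_player(ad_spend=30.0, new_players=3)  # 10.0
```

With only three players these numbers mean nothing on their own; the point is just that each vital reduces to a small, repeatable calculation.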

With these KPIs, you can predict the lifecycle of a game. I’ve created a spreadsheet to help you with that. All you need to do is plug in the vitals of your game, and it will tell you if your game is going to die.

But knowing vitals isn’t enough. Nobody wants to be told by a dumb spreadsheet “you’re going to fail and there’s nothing you can do about it”. Taking these vitals and transforming them into a realistic path to success is the most important part.

From Game Doctor to Game Scientist

I’m not here to teach you how to be a plague doctor, though if you wish to wear one of those cool beak hats while coding that is perfectly acceptable. My goal is to teach you to be a post-scientific revolution modern medicine doctor. To do this you need to ditch the pseudo-science and go with the real thing.

The premise of science is simple, often outlined in a philosophy referred to as “The Scientific Method”. In this method, you use data in combination with changing one variable at a time to get reproducible results. The more times you can repeat the experiment and get the same result, the more reliable your findings.

That’s a nice theory and all, but most of us likely knew that. Let’s look at the scientific method when applied to game development.

In the software development world, this is often referred to as “AB testing”, though it also goes by the names “Bucket testing” and “Split testing”. Changing only one thing at a time matters because it lets you trace your conclusion directly to a single action. This helps overcome the risk of spurious correlation skewing your understanding of your game.

You will want to know for certain that a change in A directly impacts B. This means if your game is struggling, don’t publish 10 fixes at once. Even if it does fix the game you won’t know why, and that means you could accidentally unfix it later and be back at square one.
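One common way to split players for an AB test is to assign each player to a bucket deterministically, so the same player always sees the same version of the one thing being changed. The sketch below does this by hashing the player ID together with an experiment name; the function name `assign_bucket` and the experiment label are hypothetical, and this is only one of several reasonable approaches.

```python
import hashlib

def assign_bucket(player_id, experiment, buckets=("A", "B")):
    """Deterministically place a player in one bucket for a named experiment.

    Hashing the experiment name together with the player ID means a player
    can land in different buckets for different experiments, but always
    lands in the same bucket for the same experiment.
    """
    digest = hashlib.sha256(f"{experiment}:{player_id}".encode("utf-8")).hexdigest()
    return buckets[int(digest, 16) % len(buckets)]

# Hypothetical experiment: a single change to the tutorial, nothing else.
assignments = [assign_bucket(f"player{i}", "shorter_tutorial") for i in range(1000)]
count_a = assignments.count("A")
count_b = assignments.count("B")
```

Bucket A keeps the current game, bucket B gets the single change, and any difference in the KPIs can then be tied to that one change.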

If you flip a coin twice, half of the time you’ll see the expected 50:50 heads vs tails outcome. However 25% of the time you’ll only see heads, and 25% of the time you’ll only see tails. In this specific instance, without prior information or further testing, half of the people could hypothetically walk away thinking when you flip a coin it always lands on the same face.
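The coin-flip numbers can be checked by enumerating every possible outcome. This small Python sketch does that for a fair coin and also shows how quickly the chance of seeing only one face shrinks as the number of flips grows (it assumes at least one flip).

```python
from collections import Counter
from itertools import product

def heads_distribution(flips):
    """Map number of heads -> probability for a fair coin flipped `flips` times."""
    outcomes = list(product("HT", repeat=flips))
    heads = Counter(outcome.count("H") for outcome in outcomes)
    return {k: n / len(outcomes) for k, n in sorted(heads.items())}

def chance_all_same_face(flips):
    """Probability that every flip lands on the same face (all heads or all tails)."""
    dist = heads_distribution(flips)
    return dist[0] + dist[flips]

two_flips = heads_distribution(2)          # {0: 0.25, 1: 0.5, 2: 0.25}
misled_after_2 = chance_all_same_face(2)   # 0.5
misled_after_10 = chance_all_same_face(10) # 2 / 1024, under 0.2%
```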

This is the risk you take when working with smaller sample sizes. As a game developer, this means making sure hundreds of people interact with your AB test in the right play conditions (i.e. on a full, non-broken server). If you don’t already have an audience, you should advertise your AB tests to verify them.
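To make the sample-size point concrete, here is a sketch using a two-proportion z-test, a standard statistical check that this article does not itself prescribe. It compares the same 30% versus 40% retention difference with 10 players per bucket and with 1,000 players per bucket. The player counts are made up, and the normal approximation it relies on is itself shaky at very small sample sizes.

```python
from math import erf, sqrt

def two_proportion_p_value(success_a, n_a, success_b, n_b):
    """Two-sided p-value for the difference between two proportions (z-test)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # both groups identical and degenerate: no evidence of a difference
    z = (p_b - p_a) / se
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return 2 * (1 - normal_cdf)

# Same retention rates (30% vs 40%), very different sample sizes.
p_small = two_proportion_p_value(3, 10, 4, 10)          # roughly 0.64: could easily be chance
p_large = two_proportion_p_value(300, 1000, 400, 1000)  # far below 0.05: unlikely to be chance
```

A low p-value does not prove the change caused the improvement, but a high one is a clear sign the test has not yet seen enough players.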

Having a large enough sample isn’t just important for your KPI measurements, but also for your timeframe. For example, say you compare the results of a change with how your game performed yesterday. Seems fine, right? Except what I’ve withheld is that yesterday was Christmas, skewing player behavior dramatically. Now obviously not every day will be Christmas, unless you’re like me and loop Mariah Carey’s hit single all year round.

The more common scenario is that the regular shifts in weekday audiences will cause a change in behavior that you misinterpret. Just because your numbers drop on a Monday doesn’t make you a bad dev, it just means not all your players are bots. What I recommend is to increase your timeframe from a day to at least a week, while also keeping an eye on the reported platform MAU so you know how comparable results from a year ago are to today’s.
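A small sketch of why the week-long window helps: the made-up daily player counts below grow 10% week over week, yet a naive Sunday-to-Monday comparison shows a steep drop. The function name and the numbers are illustrative only.

```python
def weekly_means(daily_values):
    """Average each complete 7-day window, ignoring any leftover partial week."""
    usable = len(daily_values) - len(daily_values) % 7
    return [sum(daily_values[i:i + 7]) / 7 for i in range(0, usable, 7)]

# Hypothetical daily players, Monday through Sunday, for two weeks.
week_1 = [80, 90, 95, 100, 110, 150, 145]
week_2 = [88, 99, 104, 110, 121, 165, 160]
daily = week_1 + week_2

before, after = weekly_means(daily)                      # 110.0 and 121.0
week_over_week = (after - before) / before               # +10%
sunday_to_monday = (week_2[0] - week_1[-1]) / week_1[-1] # about -39%, purely a weekday effect
```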

Imagine two types of players enjoy your game: one group enjoys hanging with friends, the other group enjoys killing strangers. After getting a composite readout on the behaviors of the average player, you come to the horrifying conclusion that your average player enjoys killing their friends. You then spend the next week building up a data-driven update focusing on killing your friends.

This is obviously a bit of a stretch, but not as much as you may think. When you combine a diverse player base into a single player, you lose a ton of nuance and can get conflicting insights. To avoid this it’s necessary to sort your audience into relevant testing groups to monitor how their unique circumstances impact behavior.

[Chart: new players shown in orange, returning players in blue]

For instance, a grouping I’ve found particularly useful is isolating returning players from first-time players. As you can see in the chart above created using data from a game of mine, almost a third of new players exit within the first 90 seconds of the game.
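The new-versus-returning split could be computed along these lines. The row layout (`segment`, `session_minutes`) and the values are hypothetical; the sketch only shows how a blended average can hide two very different groups.

```python
from collections import defaultdict
from statistics import mean

def mean_by_segment(rows, segment_key, metric_key):
    """Average a metric separately for each audience segment."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[segment_key]].append(row[metric_key])
    return {segment: mean(values) for segment, values in groups.items()}

# Made-up sessions: new players leave quickly, returning players stay.
rows = [
    {"segment": "new", "session_minutes": 1.0},
    {"segment": "new", "session_minutes": 1.0},
    {"segment": "new", "session_minutes": 10.0},
    {"segment": "returning", "session_minutes": 30.0},
    {"segment": "returning", "session_minutes": 50.0},
]
blended = mean(r["session_minutes"] for r in rows)                # 18.4, describes nobody
by_segment = mean_by_segment(rows, "segment", "session_minutes")  # new: 4.0, returning: 40.0
```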

When something is less easily categorized, such as FPS, you can get two distinct groups by sorting the data and then splitting it into two bins: one for the lower half, the other for the higher half. There’s no law against going above 2 bins, I’ve just found two bins a lot more helpful for detecting impactful audience sorts.
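The sort-then-split idea can be sketched in a few lines of Python. The field names and player values here are hypothetical; with an odd number of rows, this version puts the extra row in the higher bin.

```python
from statistics import mean

def median_split(rows, key):
    """Sort rows by `key`, then split them into a lower half and a higher half."""
    ordered = sorted(rows, key=lambda row: row[key])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]

# Made-up players with their average FPS and session length.
players = [
    {"fps": 15, "session_minutes": 3.0},
    {"fps": 60, "session_minutes": 12.0},
    {"fps": 30, "session_minutes": 5.0},
    {"fps": 45, "session_minutes": 10.0},
]
low_fps, high_fps = median_split(players, "fps")
low_avg = mean(p["session_minutes"] for p in low_fps)    # 4.0
high_avg = mean(p["session_minutes"] for p in high_fps)  # 11.0
```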

Here are some good areas to split and check for behavioral changes:

  • Experience Quality
  • Device
  • Retention
  • Server Fill
  • Age
  • Language
  • Player Motivation

I would argue player motivation is the most important, and unfortunately often the least well-defined. In order to properly sort your players into what they care about, you may benefit from the Triumvirate Model of Player Motivation.

The Triumvirate Model is a philosophy I personally use. The idea behind it is that all player motivations, meaning the actual emotional reasons people are engaged with the game, can be plotted somewhere on 3 overlapping circles: Survival, Success, and Social.

The three sections were inspired by viewing human emotions through the lens of how the behaviors they motivate provided evolutionary advantages.

For example, Survival-motivated games tap into the fight/flight instincts that benefit from quick reactions, aggression, and running. Success-motivated games tap into the emotions that reward setting, working towards, and achieving goals. Social-motivated games tap into the emotions released by being part of a community — even if the community members are fictional NPCs.

Games can of course mix categories, and often benefit from doing so. It’s important though to recognize that you can’t hit them all at once any more than you can feel every emotion at once. The only games which can efficiently cater to all three categories are open-world sandbox type games (Minecraft is a good example). These games succeed not because players feel every emotion at once, but because there are infinite ways to play, allowing the player to self optimize to what they’re interested in.

Now that you have a way to define motivation, you can begin to measure it through analytics. Defining these abstract concepts in math is hard, so we use secondary metrics relating to how frequently players interact with certain related mechanics. Here are some examples you may consider:

  • Survival: attack rate, proximity to threat, health
  • Social: speaking rate, average proximity to players, average server fill, average number of friends in-game
  • Success: win/lose rate, attempt rate, XP rate
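As one possible way to turn secondary metrics like these into numbers, the sketch below counts events per minute of play for each motivation. The event names and the mapping from event to motivation are hypothetical and would need to be defined per game.

```python
from collections import Counter

# Hypothetical mapping from logged game events to the motivation they hint at.
EVENT_TO_MOTIVATION = {
    "attack": "survival",
    "chat_message": "social",
    "match_attempt": "success",
}

def motivation_rates(events, minutes_played):
    """Events per minute for each motivation; unmapped events are ignored."""
    counts = Counter(EVENT_TO_MOTIVATION[e] for e in events if e in EVENT_TO_MOTIVATION)
    return {m: counts.get(m, 0) / minutes_played for m in ("survival", "social", "success")}

# Made-up event log for one player's 10-minute session.
events = ["attack"] * 30 + ["chat_message"] * 5 + ["match_attempt"] * 2 + ["jump"] * 9
rates = motivation_rates(events, minutes_played=10)
# survival: 3.0 per minute, social: 0.5 per minute, success: 0.2 per minute
```

The rates are not comparable across motivations (an attack is not worth one chat message); they are only useful for comparing the same rate between similar groups of players.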

This is a tricky area though that must be implemented very carefully. For example, proximity to threat can imply a higher-stress situation, but it assumes the player is aware they’re near a threat. A mobile player might speak in shorter sentences for speed, and thus chat more often. A person earning XP at lvl 1 might be more motivated by it than at lvl 10. The best way in my experience to minimize the chances of a faulty conclusion is to only compare player motivation against those in very similar groups, a concept I mentioned in a broader sense above.

For example, in determining how much a map engages survival-motivated players, you could filter your audience to show only players of the same platform in similarly filled servers within that map in the same version of the game with similar experience levels. Now this can drastically shrink a userbase, so if you find yourself fighting the law of large numbers (too few players left for the averages to be trustworthy), it may be best to focus on a broader problem.
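Stacking filters like this can be sketched as a function that also reports whether enough players remain. The attribute names and the tiny threshold below are hypothetical; in practice the minimum would be in the hundreds.

```python
def filter_cohort(players, min_size, **criteria):
    """Keep players matching every criterion; also report if the cohort is big enough."""
    cohort = [p for p in players if all(p.get(k) == v for k, v in criteria.items())]
    return cohort, len(cohort) >= min_size

# Made-up player records.
players = [
    {"platform": "mobile", "map": "arena", "version": "1.2", "level_band": "1-5"},
    {"platform": "mobile", "map": "arena", "version": "1.2", "level_band": "1-5"},
    {"platform": "pc", "map": "arena", "version": "1.2", "level_band": "1-5"},
    {"platform": "mobile", "map": "docks", "version": "1.2", "level_band": "6-10"},
]

# A very specific cohort: comparable players, but too few of them.
narrow, narrow_ok = filter_cohort(
    players, 3, platform="mobile", map="arena", version="1.2", level_band="1-5"
)
# A broader question keeps the sample usable.
broad, broad_ok = filter_cohort(players, 3, version="1.2")
```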

It can take a few hours to fully examine the results of the testing data. Once you’re done you may have had to filter your audience dozens of different ways. Some of the differences will be clear, some will be a bit vague but make sense once you stand back, and some might prove to be dead ends.

You can’t possibly keep all that information in your head. Later, if you’re curious about how a certain update impacted a certain aspect of your audience, you’ll be forced to do all of it again, and that’s assuming you can even remember how you originally isolated the data.

You need to record your results, along with the procedure you used to obtain them. It’s not going to be very fun, and maybe one day you can automate most of the actual data entry. No matter what though, you need to record your results. Try to keep your results standardized to easily compare how your game is changing over time. A small shift in a certain audience motivation may not be noticeable between two tests, but it may be much more detectable when shown across all tests.
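One lightweight way to keep results standardized is an append-only log with the same fields for every test. The sketch below writes one JSON object per line; the `ExperimentRecord` fields, the file name, and the sample entries are hypothetical, and a spreadsheet with fixed columns would serve the same purpose.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass

@dataclass
class ExperimentRecord:
    date: str               # when the test ran
    change: str             # the single thing that was changed
    audience_filter: dict   # how the audience was isolated
    metric: str             # which KPI was measured
    control: float          # value without the change
    variant: float          # value with the change
    procedure: str          # how the numbers were obtained

def append_record(path, record):
    """Append one test result as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record), sort_keys=True) + "\n")

def load_records(path):
    """Read back every recorded test, oldest first."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo with made-up results, written to a temporary directory.
log_path = os.path.join(tempfile.mkdtemp(), "ab_tests.jsonl")
append_record(log_path, ExperimentRecord(
    "2021-01-04", "shorter tutorial", {"segment": "new"},
    "d1_retention", 0.20, 0.24, "7-day window, new players only"))
append_record(log_path, ExperimentRecord(
    "2021-01-18", "daily reward", {"segment": "returning"},
    "d1_retention", 0.24, 0.25, "7-day window, returning players only"))
history = load_records(log_path)
```

Because every entry has the same shape, small shifts that are invisible between two tests can later be plotted across the whole history.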

Conclusion

If you take a look at the cognitive bias codex, you’ll see a list of all the ways we trick ourselves. Human judgment just isn’t reliable enough for important decisions. With game development, you will be sinking hundreds of hours and sometimes thousands of dollars into it. The stakes are quite high.

I’ve come to realize that all the procedures, all the paperwork, and all of the other things that made me so miserable as a student were in fact the most important part. These are tools we have to overcome our biases, our miscalculations, and our bad ideas. We need to use them.

In the past year, I’ve returned to work on a few of my old games, and already they’re doing better. In Super Hero Life III, after a single analytics-driven update, I’ve been able to increase D1 retention by 20%. After being brought into a second project, Slider Infinity, I’ve managed to boost play session duration by over 100%, as well as bring a 72% like ratio up to 80% in only a week. These techniques get results, and that’s their entire point.

Testing your design might not be the most glamorous area of game development, but it is worth it. If you can master this technique, you won’t just end up with a successful game.

You’ll end up with a successful career.

Founder of Nightcycle Studios. He/Him. Slight Workaholic.