May 17, 2024

14 Min Read

Minimax Exploiter: A Data Efficient Approach for Competitive Self Play

Background & Preliminaries

Reinforcement Learning (RL) has demonstrated its utility in reliably solving static environments like Mountain Car and Cart Pole, where the problem and environment are fixed. In competitive environments such as Tic Tac Toe or chess, however, a new problem arises: choosing an adequate opponent. An opponent that is too challenging may severely limit an agent's ability to learn, while one that is too easy risks producing no significant learning whatsoever.

While early prototypes of RL in competitive environments attempted to hand-craft opponents that would suit the learning agent -- a valid approach, albeit one that is both inefficient and time consuming -- recent advances have converged on an elegant solution to this dilemma.

Indeed, reflection on potential solutions is an interesting thought experiment for those of you reading this blog. Imagine, for example, wanting to learn how to play chess: can you find an opponent that is guaranteed to have the same skill level as yourself, without knowing precisely what your skill level is? Additionally, you would want such an opponent to improve alongside you, otherwise you would eventually find yourself playing against too easy a challenger.

While solving this question may initially seem out of reach, a very simple and elegant answer is staring right at us: your ideal opponent is, in fact, yourself! You are mathematically guaranteed to have the same strength as yourself, without having to know precisely (or at all!) what that strength level is! While this sounds like a theoretically foolproof idea, realizing this strategy in practice requires a concept referred to as League Training. With League Training, you play not only against your current self, but also against past versions of yourself, using a ranking system similar to what sports leagues use to rank teams throughout a season. You could imagine ranking previous versions of yourself based on how they perform against your current self, and choosing the most suitable opponent from this league to help accelerate your learning.

This more specific training paradigm is referred to as Competitive Self Play (CSP); a complete description can be found in the foundational AlphaStar work, which applies CSP to StarCraft and ultimately achieves super-human performance. It is important to understand this paradigm, as the concepts of Main Agents and Exploiters lie at the heart of our work. Put briefly, the Main Agent tries to learn strategies, while Exploiters are special agents that try to learn counter-strategies.

Motivation

With the setting laid out above, our goal is to improve the data efficiency of CSP. Since we are training against ourselves, we have access to our opponent's neural network -- the brain of the agent we are playing against -- because it is simply a copy of our own. In traditional CSP, the opponent is treated as a black box: we only observe their actions in response to our own. We instead propose to open that black box up, actively digging into the logic that led to those actions, so as to better determine whether our own decisions were useful. In RL terms, traditional CSP focuses only on the opponent's policy, whereas we propose to analyze both their policy and value networks.

Implementing our idea is straightforward: since we can access our opponent's value or Q-networks, we can evaluate our action from their perspective. Let's return to our simple chess scenario. Say you made a move, and now it is your opponent's turn to play: since we have access to their value network, we can see how they evaluate their current position before making their move, and since that position is a direct result of our move on the previous turn, we obtain a proxy with which we can evaluate the efficacy of our action from their perspective. Concretely, imagine a chess board that is currently in an even position, but you made a crucial mistake that isn't obvious to you and that will result in a loss five moves down the line. Your opponent will capitalize on this mistake the moment you make it, and will most likely evaluate their position as "winning", which will be reflected in their value network. With traditional RL, your crucial mistake would only result in a negative reward when the game ends, and it is then up to the RL algorithm to propagate that information back from the end of the game to that one crucial point in time. Alternatively, rather than waiting for the game to end before the agent can improve, we simply peek at our opponent's valuation of the board and directly infer that it was not a good move.

In essence, we provide a reward to our agent based on the evaluation of the opponent, which we call the Minimax Reward, as it is inspired by the minimax algorithm. The minimax algorithm is one of the most popular techniques in early artificial intelligence for solving two-player zero-sum games. As a refresher, a two-player zero-sum game is any two-player game where the outcomes always sum to zero, meaning the game either ends with one player losing (-1) and the other winning (+1), or ends in a draw (0); Tic Tac Toe, chess, and Go are classic examples.

The minimax algorithm analyzes all possible actions we can take at a given state, followed by all possible actions our opponent can respond with at each resulting state, repeating this pattern until the resulting states are terminal, i.e., until we reach the end of the game. At each step, when it is our turn, the algorithm maximizes the outcome (i.e., it chooses the actions that lead to a win), and when it is the opponent's turn, it minimizes the outcome (i.e., it assumes the opponent chooses the actions that are worst for us): hence the term minimax.
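To make the recursion concrete, here is a minimal Python sketch of plain minimax, assuming a hypothetical game-state interface with `is_terminal()`, `outcome()`, `legal_moves()`, and `apply(move)` methods; it is an illustration, not the code used in our experiments.

```python
def minimax(state, maximizing=True):
    """Exhaustively search the game tree and return the best achievable
    outcome (+1 win, 0 draw, -1 loss) from the maximizing player's point
    of view, assuming both players play optimally."""
    if state.is_terminal():
        return state.outcome()  # +1, 0, or -1 for the maximizing player

    child_values = [minimax(state.apply(move), not maximizing)
                    for move in state.legal_moves()]

    # On our turn we take the best outcome for us; on the opponent's turn
    # we assume they pick the outcome that is worst for us.
    return max(child_values) if maximizing else min(child_values)
```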


The main issue with the minimax algorithm is that its time complexity grows exponentially with the search depth, with the base given by the number of actions to consider at each state, formally known as the branching factor. While it works well for relatively simple environments like Tic Tac Toe, the algorithm quickly becomes intractable for more complex problems like chess, where there are about 30 to 40 legal moves to consider per turn on average.

The Minimax Reward

Since the minimax algorithm cannot be applied to extremely high-dimensional environments, and since in competitive environments trained with CSP we have access to the networks controlling our opponent (the opponent being a copy of ourselves), we propose the minimax reward, which has the following formulation:

$R_i^{\text{minimax}}(s_t, a_t) = R_i(s_t, a_t) - (1 - d)\,\lambda \max_a Q_j(s_{t+1}, a)$

Here, $R_i$ is the environment reward observed by our agent, $Q_j$ is the opponent's Q-function, $\lambda \in [0, 1]$ is a coefficient modulating the opponent's signal, and $d \in \{0, 1\}$ is the done signal. Simply put, this reward mixes the environment reward with the opponent's Q-function. The idea is to penalize our agent at every step by our opponent's evaluation at the subsequent step. What we are essentially telling our agent is: find the action at step t that maximizes the environment reward while minimizing the opponent's evaluation of the resulting state at step t+1.
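As a rough illustration, the reward computation could look like the following sketch; `opponent_q` (a callable returning Q-values over all actions at a state), `lam`, and the default coefficient value are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def minimax_reward(env_reward, next_state, done, opponent_q, lam=0.5):
    """Mix the environment reward with the opponent's evaluation of the next
    state: R_i(s_t, a_t) - (1 - d) * lambda * max_a Q_j(s_{t+1}, a)."""
    if done:  # terminal step: the (1 - d) factor removes the opponent term
        return env_reward
    opponent_value = float(np.max(opponent_q(next_state)))  # opponent's best-case evaluation
    return env_reward - lam * opponent_value
```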

A Note on Reward Shaping

The idea of giving our agent an extra source of reward to speed up training is not a new concept; it is usually referred to as reward shaping or reward engineering, where you, as the designer, alter the agent's reward function to make the learning process more manageable. Going back to the chess example, the simplest reward function you can give your agent is +1 for winning, -1 for losing, and 0 for everything else. This is referred to as a sparse reward function since, as the name states, the agent only observes a signal when the game ends; all intermediary actions receive no reward, and it is up to the learning algorithm to correctly value them. While this works in theory, sparse reward functions require a lot of experience to converge to the correct solution.

An alternative is to break the problem up, leaving breadcrumbs for the agent to learn the task, known as a dense reward function. For example, every time your chess agent captures a pawn it gains +1, if one of its pawns gets captured it receives -1, if it captures a knight it gains +3, and so on. This breaks the daunting chess problem up into much smaller, more easily attainable goals. The downside is that you, as the designer, are now injecting your own bias into the agent's evaluation. Why is a pawn worth +1? Why is a knight worth +3? Is valuing each piece individually, rather than the board as a whole, even the correct way to learn chess?
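For illustration only, a hand-crafted dense reward based on material swings might look like the sketch below; the piece values are the conventional ones mentioned above, and the `captured`/`lost` fields of the hypothetical move record are assumptions.

```python
# Conventional chess piece values, used purely for illustration.
PIECE_VALUES = {"pawn": 1, "knight": 3, "bishop": 3, "rook": 5, "queen": 9}

def dense_material_reward(move_record):
    """Hypothetical shaped reward: +value for every piece captured on this
    move, -value for every piece lost, injecting the designer's bias that
    material is what matters."""
    gained = sum(PIECE_VALUES.get(piece, 0) for piece in move_record.get("captured", []))
    lost = sum(PIECE_VALUES.get(piece, 0) for piece in move_record.get("lost", []))
    return gained - lost
```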

The key difference between traditional reward shaping and the Minimax reward is that, while both end up densifying the reward signal, we do so without injecting any human bias into the reward function. The bias injected by reward shaping can be quite beneficial for simpler tasks, but for more complicated tasks, coming up with an optimal dense reward is far from trivial. Furthermore, for some environments it is almost impossible to come up with a sensible dense reward: if you are training an agent to play Tic Tac Toe or Connect4, for example, how would you hand-craft a dense reward function?

Experiments & Results

We run the Minimax Exploiter on four different environments: Tic Tac Toe, Connect4, Atari Boxing, and finally For Honor!

Tic Tac Toe & Connect4


The first two environments we run are simple turn-based board games, Tic Tac Toe and Connect4. We train our agents against the minimax algorithm, and we use the evaluation of the minimax algorithm as a proxy for the opponent's Q-function. For Tic Tac Toe, we run the minimax algorithm to completion, while for Connect4, we limit its search depth to only 3 moves. This setup is done purposefully to simulate two scenarios: one with a perfect opponent policy, and one with an imperfect opponent policy. We compare the training of traditional DQN agents with and without the minimax reward. We also train an additional agent called the gamma-0 agent, in which we set the gamma of the TD learning update rule to 0, effectively cloning the negated minimax evaluation of the opponent as its Q-function. As a reminder, the TD update rule is: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R(s_t, a_t) + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$

By setting gamma to 0, the update rule is based solely on the immediate reward, which, in the case of the gamma-0 agent, is driven by the opponent's Q-function. The gamma-0 agent allows us to estimate how much information is contained in the opponent's Q-function alone, and whether, if this Q-function is inaccurate, we can still extract information from it to speed up training.
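The distinction is easiest to see in the one-step TD target itself; the sketch below assumes a hypothetical `q_net` callable returning Q-values for all actions, and is only meant to illustrate the gamma-0 special case.

```python
import numpy as np

def td_target(reward, next_state, done, q_net, gamma):
    """One-step TD target: r + gamma * max_a Q(s_{t+1}, a) for non-terminal steps.
    With gamma = 0 (the gamma-0 agent) the target collapses to the reward alone,
    which under the minimax reward is essentially the opponent's negated evaluation."""
    if done or gamma == 0.0:
        return reward
    return reward + gamma * float(np.max(q_net(next_state)))
```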

[Figure: training curves on Tic Tac Toe and Connect4]

We can see that the Minimax Exploiter trains faster than the standard DQN agent, even when the Q-function is not expressive enough, as in the case of Connect4. Since the Minimax Exploiter can be effective even when the opponent's Q-function is not expressive on its own, the next two experiments explore how the Minimax reward fares against standard reward shaping.

Atari Boxing


For Atari Boxing, we train three agents as before: a regular DQN agent that observes a sparse reward of +1 for winning, -1 for losing, and 0 everywhere else, which we call DQN-Sparse; a DQN that observes a dense reward of +1 every time it successfully hits its opponent, which we call DQN-Dense; and finally, the Minimax Exploiter.

[Figure: training curves on Atari Boxing]

From the training graphs, we can see that in this case simple reward engineering is more effective than the Minimax reward when training against the same fixed opponent. This is expected, however: in Atari Boxing there is no downside to getting hit, so this is a special case where simple reward engineering leads directly to an optimal solution.

However, as we will see in the final For Honor experiment, simple reward engineering only takes you so far. 

For Honor


The final experiment, on For Honor, is our most extensive one: we train an entire CSP League Training framework for 24 hours and compare the runs on two metrics. The first is how many converged Exploiters are generated by the end of training. For the second, we take the final Main Agents from each run and pair them against each other to see which one is the most robust. We train four different types of Exploiters: the Vanilla Exploiter, which observes a sparse reward of +10 for winning and -10 for losing; the Minimax Exploiter, which observes the Minimax Reward; the Defensive Exploiter, which observes +10 for winning, -10 for losing, and a penalty for every hit taken; and finally the Aggressive Exploiter, which observes +10 for winning, -10 for losing, and a reward for every successful attack on the opponent.
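To summarize the four reward schemes side by side, here is a purely illustrative configuration sketch; the +/-10 win/loss values come from the text, while the per-hit magnitude and all field names are placeholders, not the actual training configuration.

```python
PER_HIT = 1.0  # assumed magnitude of the per-hit shaping term (not specified here)

# Illustrative reward configurations for the four exploiter types.
EXPLOITER_REWARDS = {
    "vanilla":    {"win": +10, "loss": -10},                              # sparse reward only
    "minimax":    {"win": +10, "loss": -10, "use_minimax_reward": True},  # opponent's Q mixed in
    "defensive":  {"win": +10, "loss": -10, "hit_taken": -PER_HIT},       # penalized for taking hits
    "aggressive": {"win": +10, "loss": -10, "hit_landed": +PER_HIT},      # rewarded for landing hits
}
```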

[Figure: number of converged Exploiters generated per 24-hour run, over three seeds]

From the graph above, we can see that the Minimax Exploiter consistently generates the largest number of fully converged Exploiters after 24 hours of training, across three seeds. We then paired the Main Agents from the best seed of each variant against each other and compared their win-rates over 1000 duels.

[Figure: win-rates of the final Main Agents paired against each other over 1000 duels]

We can see that the Minimax agent is the most robust of all the agents, with a win-rate above 66% against each of the other agents.

Conclusion

The Minimax Exploiter introduces a data-efficient approach to Competitive Self Play by leveraging the opponent's Q-function. This approach provides a novel way to assess your actions in real time, allowing you to learn from your opponent's perspective. The experiments and results presented here demonstrate the effectiveness of the Minimax Exploiter in various gaming environments, highlighting its potential to accelerate training and improve agent performance. This approach brings us one step closer to creating intelligent and adaptive learning agents in AAA video games.