Simulating Pente in Python (Data Generation and Labeling)

The goal here is not to try to solve a traditional supervised training neural network task, like image classification or house price regression, so the data collection methods must be different. Instead of "collecting" data, a game simulation must be used to simulate the game logic behind Pente, and to enable running thousands of games of Pente in only a few minutes. This has several implications for data cleaning and preparation... mainly, these steps are integrated into the data generation step so they don't need to be done separately later.

For those interested, the entire code for the simulator can be interacted with here on Google Colab.

Simulating the Game

Pente is a fairly easy game to simulate because the win conditions and rules are both simple. Legal moves are simply determined by whether a board position is occupied or not, and finding unoccupied positions on the board is not a computationally or conceptually difficult process. Then, out of the available legal moves, one move can be chosen per game played simultaneously, either at random or using upper confidence bound for trees (UCB) scores (see Figure 1) from Monte Carlo tree search (MCTS).

When using UCB scores on a graph of Pente game states, the maximum score (U) over all legal moves from the current node is chosen, given W (the win count for the current node), Ni (the number of visits to the current node), Np (the number of visits to the parent node), and c (some constant to help balance between exploration and exploitation). This post that discusses UCB and how it is used to explore game states in AlphaZero is particularly insightful on this front.

Figure 1: Upper Confidence Bound for Trees

Because of the massive number of legal game boards that exist in Pente, the entire graph cannot feasibly be stored in any computer. This is where UCB score comes in. The UCB score is used to build out a subset of the total graph by simulating games. The first part of the UCB score equation (W/N) represents the average outcome experienced by visiting a particular node. The second part of the UCB score equation is very large if a node hasn't been well explored, and very small if a node is very well explored. These two parts of the equation help balance between exploiting moves that are known to be good, and exploring new moves to ensure that other better unexplored moves do not remain unexplored.

Then, after a move is chosen, the simulator checks to see if any of the other player's pairs were captured by taking that move, or if any win condition is met (player achieved five in a row, or captured five pairs of the opponent's pieces). If a win condition is not met, this process continues until a win condition is met.

Decreasing Space and Time Complexity of the Simulations

All boards in the current simulation run simultaneously (e.g. there are 10000 game states for 10000 games stored simultaneously in memory as a single NumPy array). This allows the code to run around 100X faster than if each new game were instantiated in a loop after completing the previous game. This is a practical example of how code vectorization can provide tremendous performance improvements.

Additionally, when storing the generated games, storing 1000 games of Pente in a networkx DiGraph took over 1 GB of storage. After using bz2 compression, though, that same graph could be stored in 90 MB! This discovery was exciting, because one of the biggest challenges in replicating a version of AlphaZero is having limited compute power, so much of the work put into data generation revolves around optimizing simulation code and data storage.

Storing the Simulated Data

Aside from the general game logic and simulation optimization, the main thing left to note is that every board game state (including current piece positions, captured pair count by each player, and current player number) is stored in a graph with edges connecting adjacent game states. See figure 2 for an illustration of the data stored in one node of our graph.

Figure 2: Summary of Data Stored in Each Graph Node

This approach can easily update the visit count (Ni and Np) and outcome sum (W) (+1 if player 1 wins, +0 if player 0 wins, and +0.5 if there is a draw) for each game state while exploring, which then allows the calculation of the UCB score discussed earlier. Again, the UCB score will then help balance between choosing high performing game states (that are more likely to lead to game wins) and unexplored (or poorly explored) game states.

Just for the sake of completeness, here is a zoomed out plot of what a complete graph created from playing several random games might look like (figure 3).

Figure 3: Directed Graph that Contains Data from 10 Random Games

Because the graph will be different every time it is generated due to the randomness inherent in MCTS, it doesn't make much sense to link directly to the dataset, but a sample of the kind of graph generated by the game simulator can be found here. Instead, if you're interested in playing with the data, simply run the simulator code linked to in Google Colab above. This will allow you to generate your own data using the same methodology used here.

The visit count and outcome sum can also be used to train a neural network to approximate the value for every board state (value = 0 implies player 0 always won, value = 0.5 implies it was split 50/50, etc.).
So, as things stand currently, there exists a data generation method (a Pente game simulator coded in Python), and a data labeling method (UCB score and average outcome W/N).

Access code here

Search This Blog

Pente AI