Training a Neural Network to Estimate Win Probability

Data Prep Visualization

Visualizing the entire dataset used for a training iteration is difficult because it is a very large graph.  So instead of visualizing what an actual train/test split looks like, Figure 1 shows the split on a small example graph.

Figure 1: Example Illustration of Train/Test Split

The goal of this figure is to show how the test dataset is built: roughly 10% of the nodes are randomly selected from the graph created using MCTS, and the remaining nodes become the train dataset.
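To make that concrete, here is a minimal sketch of a node-level split.  It assumes the graph's node data lives in a dict keyed by node id, which is a hypothetical structure chosen for illustration rather than how the project actually stores its graph.

```python
# Minimal sketch of the ~90/10 node-level train/test split described above.
# Assumes graph_nodes is a dict mapping node id -> node data (hypothetical).
import random

def split_nodes(graph_nodes, test_fraction=0.1, seed=0):
    """Randomly hold out roughly test_fraction of the nodes as the test set."""
    node_ids = list(graph_nodes)
    random.Random(seed).shuffle(node_ids)
    n_test = int(len(node_ids) * test_fraction)
    test = {nid: graph_nodes[nid] for nid in node_ids[:n_test]}
    train = {nid: graph_nodes[nid] for nid in node_ids[n_test:]}
    return train, test
```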

As a reminder, the data contained in each node is the following:

Figure 2: Reminder of how our data is stored in nodes

For reference, an example graph with 10,000+ nodes can be found here, or you can generate your own dataset using the code here.

The large number of features in the dataset (upwards of 700) and the number of nodes in even a small graph (10,000+) make the dataset particularly difficult to visualize well.  Figure 1 and Figure 2 are an attempt to do so in spite of these difficulties.

Neural Network Architecture

The original AlphaZero implementation from DeepMind used ResNets to allow for the effective use of very deep convolutional networks.  Additionally, the original implementation of AlphaZero included a policy head (which output a probability for each legal move from the current state) and a value head (which estimated the probability of winning from the current game state).

The model architecture used in this project was a network with two input heads: one for the board part of the game state, and a second for the number of pairs captured by each player at that state.  Both of these inputs are necessary because one of the win conditions involves capturing five pairs of the opponent's pieces, so the number of pairs already captured at any given state may dictate how aggressively a player tries to capture more.

The input for each board state was fed into a five-layer convolutional network, regularized by adding Gaussian noise to the weights of each layer.  This representation was then flattened, concatenated with the number of captured pairs for each player, passed into a dense network with three layers, and finally reshaped and passed to a set of deconvolution layers that force the output shape to match the board shape (19x19, in the case of Pente).  There is also one skip (or residual) connection between an early convolutional layer and a late deconvolutional layer.

The output head of this network corresponds to the "policy head" from the original implementation.  For our model, a sigmoid activation was applied to all 361 output neurons to constrain each to a value between 0 and 1, because all estimated outcomes pulled from the graph data fall between 0 and 1.  Every other layer used ReLU because it is non-saturating and helps mitigate the vanishing gradient problem.  The final neural network in this project did not use a value head because it did not seem to improve overall model performance at this scale.
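To make the architecture above concrete, here is a minimal Keras-style sketch.  The filter counts, layer widths, noise level, and board encoding (two 19x19 planes, one per player) are illustrative assumptions rather than the project's actual hyperparameters, and the GaussianNoise layers shown here perturb activations rather than weights, so they only approximate the regularization described above.

```python
from tensorflow.keras import layers, Model

BOARD_SIZE = 19

# Input 1: the board state, assumed here to be two 19x19 planes (one per player).
board_in = layers.Input(shape=(BOARD_SIZE, BOARD_SIZE, 2), name="board")
# Input 2: the number of pairs captured by each player at this state.
pairs_in = layers.Input(shape=(2,), name="captured_pairs")

# Five convolutional layers; GaussianNoise adds noise during training only.
x = board_in
skip = None
for i in range(5):
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = layers.GaussianNoise(0.1)(x)
    if i == 0:
        skip = x  # saved for the skip (residual) connection to a late layer

# Flatten, concatenate the capture counts, and run through three dense layers.
x = layers.Flatten()(x)
x = layers.Concatenate()([x, pairs_in])
for units in (512, 512, BOARD_SIZE * BOARD_SIZE):
    x = layers.Dense(units, activation="relu")(x)

# Reshape and deconvolve back up to the 19x19 board shape.
x = layers.Reshape((BOARD_SIZE, BOARD_SIZE, 1))(x)
x = layers.Conv2DTranspose(32, 3, padding="same", activation="relu")(x)
x = layers.Add()([x, skip])  # skip connection from the early conv layer
out = layers.Conv2DTranspose(1, 3, padding="same", activation="sigmoid")(x)
out = layers.Reshape((BOARD_SIZE, BOARD_SIZE), name="policy")(out)

model = Model(inputs=[board_in, pairs_in], outputs=out)
# Loss and optimizer as described in the training section below.
model.compile(optimizer="adam", loss="mse")
```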

Neural Network Training and Results

At training time, all data is augmented using rotations, mirroring, and swapping player 1 and player 2.  This turns every 1000 datapoints generated into 16000 augmented datapoints, which makes RAM limitations during training a very real problem, so batch size needs to be managed carefully.  This data is then fed into a network that uses mean squared error as the loss function and Adam as the optimizer.  For every training iteration, 256 games are played out (i.e. added to the graph) using MCTS, then points are sampled from the graph and used to train the network.  Then more games are simulated, the network is trained again, and so on.  The original AlphaZero trained for 1 or 2 GPU-decades.  In other words, this loop of generating data and then training the model could likely be run for 10 or 20 years on the single laptop available for this project, and at the end of that time we could still be seeing improvements.
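For reference, below is a sketch of that 16x augmentation.  It assumes the board is a NumPy array of shape (19, 19, 2) with one channel per player and the per-move target is a (19, 19) array; both encodings are assumptions for illustration, and whether the targets also need to be transformed when the players are swapped depends on the labeling convention.

```python
# Sketch of the 16x augmentation: 4 rotations x mirroring x player swap.
# Assumes board has shape (19, 19, 2) and target has shape (19, 19);
# these encodings are assumptions made for illustration.
import numpy as np

def augment(board, target):
    """Yield 16 augmented (board, target) pairs for one datapoint."""
    for k in range(4):                          # the 4 rotations
        b, t = np.rot90(board, k), np.rot90(target, k)
        for flip in (False, True):              # with and without mirroring
            b2 = np.flip(b, axis=1) if flip else b
            t2 = np.flip(t, axis=1) if flip else t
            yield b2, t2                        # players in original order
            # Player swap: channels reversed.  Depending on how outcomes are
            # labeled, the target may also need to be transformed here.
            yield b2[..., ::-1], t2
```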

To evaluate the network, traditional methods like training history plots aren't very helpful.  Instead, as the model is trained, many checkpoints are saved and played against each other.  The average win rate of each model against each other model is then used to calculate Elo ratings.  The results of that testing process are shown in Figure 3.
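As an illustration, one simple way to turn those pairwise win rates into ratings is an iterative Elo fit like the one sketched below; this is an assumed approach, not necessarily the exact method used to produce Figure 3.

```python
# A rough iterative Elo fit from pairwise win rates (an assumed approach,
# not necessarily the one used to produce Figure 3).
def expected_score(r_a, r_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def fit_elo(win_rates, n_iters=2000, k=4.0):
    """win_rates[(a, b)] is the observed fraction of games checkpoint a won
    against checkpoint b; returns an Elo rating for each checkpoint."""
    ratings = {m: 1000.0 for pair in win_rates for m in pair}
    for _ in range(n_iters):
        for (a, b), w in win_rates.items():
            delta = k * (w - expected_score(ratings[a], ratings[b]))
            ratings[a] += delta
            ratings[b] -= delta
    return ratings
```

With a 400-point rating gap, expected_score comes out to roughly 0.91, i.e. about 10-to-1 odds, which is the rule of thumb referenced in the discussion of Figure 3.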



Figure 3: Elo Rating vs Training Iteration

Figure 3 clearly shows a promising increase in Elo of over 500 points from the untrained model to the first model checkpoint.  To put that in perspective, an Elo difference of 400 points corresponds to roughly 10 to 1 odds that the higher-rated agent beats the lower-rated one.  The Elo rating then slowly begins decreasing.  This is likely because the model is only trained on a randomly selected subset of the graph at each iteration, so each iteration the model may forget or replace some of the strategies learned in the previous iteration with different, potentially less helpful strategies.  With more computing power, this theory could be tested by training on the entire graph at every iteration.
