Building a Decision Tree to Play Tic Tac Toe

Oct 8, 2019 19:09 · 1646 words · 8 minutes read

A little while ago I started working on the fast.ai introductory machine learning course. The first lesson is a general overview of the tools and a discussion of random forests. With a little research I learned that random forests are basically just groups of averaged decision trees. My next question was: “What is a decision tree?”

A decision tree boils down to a chain of

if (some metric):
    # one output
else:
    # some thing else

The trick is taking a bunch of columnar data and generating this branching structure automatically. I was able to find a couple of good blog posts that describe decision trees but this one by Dr. Brownlee was by far the most helpful. In it, Dr. Brownlee guides the reader through the construction of a decision tree from scratch which was exactly what I needed to both: 1) Wrap my head around what a decision tree is and how one is made; and 2) solve a practical problem with decision trees.

The problem

In order to better understand decision trees I wanted to see how you might apply one to play tictactoe as a fixed player (always playing O for instance). There are a few problems to solve here:

1) Generating a large enough data set to create a tree 2) Getting the data in the right format to produce a useful decision tree 3) Hooking up the decision tree to be able to play the game

Generating a large enough data set

I initially considered that I could “play” tictactoe against myself and record the results so that I could have a mix of games played perfectly and games played poorly. However this would take way too much time and would also be incredibly boring. Luckily, there are no shortage of AI tictactoe projects on github.

This one by pybites is very well constructed and was easy to modify to get the computer to play itself on a loop. Since I wanted to train the decision tree to always play O, I modified the AI so that when it played for the X player it would throw in a small number of random moves instead of always playing the best. This way the data set would contain some perfect games, mostly games that were played almost perfectly by the X player, and almost no games where X just completely threw the game.

I generated 5000 games for training data.

Getting the games in the right format

The games in the data set had 11 columns each:

Turn|s0|s1|s2|s3|s4|s5|s6|s7|s8|s9|move played

And the board is structured like so:

s7 | s8 | s9
------------
s4 | s5 | s6
------------
s1 | s2 | s3

For each one of the “squares” (columns that begin with s), the data is either a O, X, or an underscore (indicating a blank square).

The turn column varies between O and X depending on whose turn it is in the game (O always starts). Finally, the move played is a little interesting. We need to pack two different facts into this column: 1) what square was played on and 2) what piece was played on that square?

We can represent both pieces of data if we think a little creatively. Suppose I say that we will have a two decimal number in this column. The most significant digit will always be the square played. The least significant digit will always be the piece played. Therefore if I want to translate “play an X on square 9”, I would write 91 to that column (9 for the 9th square and 1 for X).

Using this strategy, we can easily translate the other 10 columns into numerics as well. Each “square” column will contain either a 0, 1, or 2. If the column contains a 0 it means no piece has been placed there yet. If a 1, it means an X has been placed. Finally if a 2 it means an O has been placed. The turn column will either contain a 1 or a 2 depending on if X or O is to play respectively.

Once the game state is in purely numeric data, we can generate our decision tree.

Generating the tree

My first crack at generating the tree yielded this result:

[X6 < 2.000]
 [52]
 [X11 < 21.000]
  [X11 < 12.000]
   [11]
   [12]
  [X11 < 91.000]
   [X11 < 22.000]
    [21]
    [X11 < 41.000]
     [X11 < 32.000]
      [X4 < 1.000]
       [31]
       [22]
      [32]
     [X11 < 42.000]
      [41]
      [X11 < 72.000]
       [X11 < 71.000]
        [X11 < 62.000]
         [X11 < 61.000]
          [42]
          [61]
         [62]
        [71]
       [X11 < 81.000]
        [72]
        [X11 < 82.000]
         [81]
         [82]
   [X4 < 2.000]
    [91]
    [92]

This graphical representation of the tree shows the condition at each node. If the condition is true, you take the left (first) path and continue until you hit a terminal node. If the condition is false, you take the right (second) path and continue in the same manner.

Thus for evaluating the initial board state we look at column X6 (square 5 in the tictactoe board) and see if it contains a value less than 2. In an empty board, square 5 would contain the value 0 (no piece yet) and so you would play 52 (put an O in square 5).

You can quickly see that the first run is a bad tree because it always tests on column 11 which is actually the column we are trying to predict!

Generating the tree: Round two

After modifying the tree building code, I generated the following tree:

[X6 < 2.000]
 [52]
 [X1 < 2.000]
  [X10 < 1.000]
   [X2 < 2.000]
    [X5 < 2.000]
     [X2 < 1.000]
      [X1 < 1.000]
       [81]
       [81]
      [71]
     [61]
    [91]
   [X3 < 1.000]
    [X4 < 2.000]
     [X9 < 2.000]
      [X5 < 1.000]
       [X8 < 2.000]
        [X4 < 1.000]
         [X1 < 1.000]
          [21]
          [21]
         [41]
        [41]
       [X1 < 1.000]
        [81]
        [81]
      [21]
     [X2 < 2.000]
      [X1 < 1.000]
       [21]
       [21]
      [21]
    [X8 < 1.000]
     [71]
     [X1 < 1.000]
      [41]
      [41]
  [X2 < 1.000]
   [12]
   [X4 < 1.000]
    [X9 < 1.000]
     [X8 < 1.000]
      [32]
      [82]
     [X8 < 1.000]
      [72]
      [32]
    [X8 < 1.000]
     [X5 < 1.000]
      [X3 < 1.000]
       [62]
       [72]
      [72]
     [X5 < 1.000]
      [X7 < 1.000]
       [X2 < 2.000]
        [42]
        [82]
       [42]
      [X7 < 1.000]
       [62]
       [X9 < 1.000]
        [X3 < 1.000]
         [92]
         [82]
        [22]

This tree looks much better, notably:

  • It starts with the optimal move, you always want to play square 5 if you have the opportunity to
  • It then examines whose turn it is (X1) and plays accordingly

Unfortunately this tree is also incorrect. If you take some time to trace through it and play a few games based on its recommendations you will find not only incorrect suggestions resulting in a loss for O, but also illegal suggestions.

What I’ve learned

So in the end this experiment was a failure, but it was not a waste of time. I’ve learned at least a few valuable things about decision trees and I’d wager to say ML in general.

First, the quality of your data matters. In this example if I run across a game I have never seen before, the decision tree will have no possible response since it can’t predict values for games it has never been trained on. Furthermore if some board positions were only a small number of the total board positions, they may not make it into the tree at all.

This is also a general property of decision trees. Because decision trees are essentially just building a logical path through the data they are provided, they will by nature overfit to the data. Random forests improve on this by averaging the results of several decision trees created from different combinations of different metrics. This is why random forests may still show over fitting behavior. If your data is homogeneous to the point that splits will always occur using the same set of columns and resulting in the same conditions, you will have an overfit tree. Diverse data will be less prone to over fitting.

Second, training these simple trees using just Python with no C optimizations and using no hardware accelerations from graphics cards or parallelization took a long time. Even generating these simple trees took upwards of 10 hours. There are simply a lot of computations to do and the tool I selected was poorly suited to it.

Third, feature engineering is required for even simple data sets. The requirement to make all columns numeric in this example required engineering the last prediction column. I get the impression that it’s important to be thoughtful about how you engineer features since any translation of data needs to have a perfect isomorphism to the original data. Otherwise you would end up with incorrect results.

Finally, some AI/ML techniques are not suited to all problems. As the first video in the fast.ai lecture series shows, you can get great results on Kaggle for lots of competitions by using the standard Random Forest algorithm built into scikit learn. This is a good fit because if you’re predicting numeric data you will be some % off which is maybe not a big deal. For example if you predict the real value of a bulldozer was 2500 and it was actually 2475 then your algorithm is probably good enough.

When you’re predicting something like game state for the purposes of making a move on the other hand, “good enough” is a lot more difficult. You need to predict a legal move, a non-losing move, and preferably
a winning move. Deep learning is a great approach as evidenced by AlphaGo, etc. Using a random forest is a bad approach as evidenced by this experiment.

Even within ML there are lots of tools and techniques to reach for. Using the right ones on the right problems is important to get the best result.