
Monte Carlo Tree Search

Complete Step-by-Step Visual Guide

Adapted for the Simple Pair Game

The Simple Pair Game

Game Rules

  • Board: 4 positions [0, 1, 2, 3]
  • Players: Player 1 (Red), Player 2 (Blue)
  • Goal: Mark 2 adjacent positions first
  • Winning pairs: [0,1], [1,2], or [2,3]
  • Draw: Board full, no winner
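The rules above are small enough to capture directly in code. The following is a minimal sketch, assuming a board represented as a list of four cells with 0 = empty, 1 = Player 1, 2 = Player 2; the helper names `winner` and `legal_moves` are illustrative, not from the original.

```python
# Minimal sketch of the Simple Pair Game rules.
# Board: list of 4 cells, 0 = empty, 1 = Player 1, 2 = Player 2.
WIN_PAIRS = [(0, 1), (1, 2), (2, 3)]

def winner(board):
    """Return 1 or 2 if that player holds an adjacent pair,
    0 for a draw (board full, no pair), or None if still in progress."""
    for a, b in WIN_PAIRS:
        if board[a] != 0 and board[a] == board[b]:
            return board[a]
    return 0 if 0 not in board else None

def legal_moves(board):
    """Empty positions are the legal moves for whoever is to play."""
    return [i for i, cell in enumerate(board) if cell == 0]
```

For example, `winner([1, 1, 2, 0])` reports a P1 win via pair [0,1], while a full board like `[2, 1, 2, 1]` is a draw.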

Win/Draw Examples

P1 wins with pair [0,1]:
Board [1,1,2,0]: positions 0 and 1 both marked by P1
Draw (board full, no pair):
Board [2,1,0,1]: if P2 plays pos 2 → [2,1,2,1], board full, no adjacent pair → draw

MCTS Algorithm - The Four Phases

1️⃣ Selection
Start at root, use UCB1 to select best child repeatedly until reaching a node that's not fully expanded
2️⃣ Expansion
Add one new child node for an untried action from the selected node
3️⃣ Simulation
From the new node, play randomly until the game ends. Record the result
4️⃣ Backpropagation
Update visit counts (v) and win counts (w) from the new node back up to the root
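The four phases above can be sketched as one self-contained loop on the pair game. This is a minimal illustrative implementation, assuming the 0/1/2 board encoding from the rules; `Node`, `mcts_iteration`, and `best_move` are names chosen here, not from the original.

```python
import math
import random

WIN_PAIRS = [(0, 1), (1, 2), (2, 3)]

def winner(board):
    for a, b in WIN_PAIRS:
        if board[a] != 0 and board[a] == board[b]:
            return board[a]
    return 0 if 0 not in board else None  # 0 = draw, None = ongoing

class Node:
    def __init__(self, board, player, parent=None):
        self.board, self.player = board, player   # player = whoever moves next
        self.parent, self.children = parent, {}
        self.untried = [i for i, c in enumerate(board) if c == 0]
        self.v, self.w = 0, 0.0                   # visits, wins

def ucb1(node, c):
    return node.w / node.v + c * math.sqrt(math.log(node.parent.v) / node.v)

def mcts_iteration(root, c=math.sqrt(2)):
    node = root
    # 1️⃣ Selection: descend by UCB1 while fully expanded and non-terminal
    while not node.untried and node.children and winner(node.board) is None:
        node = max(node.children.values(), key=lambda ch: ucb1(ch, c))
    # 2️⃣ Expansion: add one child for an untried move
    if node.untried and winner(node.board) is None:
        move = node.untried.pop(random.randrange(len(node.untried)))
        board = node.board[:]
        board[move] = node.player
        child = Node(board, 3 - node.player, parent=node)
        node.children[move] = child
        node = child
    # 3️⃣ Simulation: random playout to the end of the game
    board, player = node.board[:], node.player
    while winner(board) is None:
        move = random.choice([i for i, cell in enumerate(board) if cell == 0])
        board[move] = player
        player = 3 - player
    outcome = winner(board)          # 1, 2, or 0 (draw)
    mover = 3 - node.player          # the player who moved INTO this node
    result = 0.5 if outcome == 0 else float(outcome == mover)
    # 4️⃣ Backpropagation: flip the result (1 - result) at each level
    while node is not None:
        node.v += 1
        node.w += result
        result = 1.0 - result
        node = node.parent

def best_move(root, iterations):
    for _ in range(iterations):
        mcts_iteration(root)
    return max(root.children, key=lambda m: root.children[m].v)
```

Running `best_move(Node([0, 0, 0, 0], 1), 1000)` concentrates visits on the central positions 1 or 2, matching the walkthrough below.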

🔑 Critical Concept: Alternating Perspectives

Each node's win count (w) represents wins for the player who made the move to reach that node.

  • P1 node: Created by P1's move. If P1 wins the simulation, w increases.
  • P2 node: Created by P2's move. If P2 wins the simulation, w increases.
  • During backprop: we flip the result (1 - result) at each level going up.
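The flip can be shown in isolation. In this sketch, `backpropagate` is an illustrative helper that applies a result to a leaf-to-root list of w-counts, flipping perspective at each level.

```python
def backpropagate(path_ws, result):
    """path_ws: w-counts ordered from the new leaf up to the root.
    result: simulation outcome from the leaf mover's perspective
    (1.0 win, 0.5 draw, 0.0 loss). Returns the updated counts."""
    updated = []
    for w in path_ws:
        updated.append(w + result)
        result = 1.0 - result  # flip perspective for the level above
    return updated
```

A win at the leaf (`result = 1.0`) alternates 1.0, 0.0, 1.0 up the path, while a draw (`result = 0.5`) adds 0.5 at every level, since 1 - 0.5 = 0.5.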

📊 Complete MCTS Walkthrough - Iteration by Iteration

Initial State

Current Board (Root): [0,0,0,0] (all four positions empty)

Player 1's turn. 4 possible moves: positions 0, 1, 2, or 3

Root: N = 0, board [0,0,0,0], P1's turn
4 unvisited children (all UCB1 = ∞):
  P1→0: board [1,0,0,0], v=0, w=0
  P1→1: board [0,1,0,0], v=0, w=0
  P1→2: board [0,0,1,0], v=0, w=0
  P1→3: board [0,0,0,1], v=0, w=0

🔄 Iteration 1

1️⃣ Selection: All children have UCB1 = ∞. Select first unvisited: position 0
2️⃣ Expansion: Add child node for "P1 plays position 0"
Resulting board state: [1,0,0,0]
Now P2's turn at this node (3 positions available)
3️⃣ Simulation: From [1,0,0,0], play randomly:
  • P2 plays position 2 → [1,0,2,0]
  • P1 plays position 1 → [1,1,2,0] → P1 WINS! (pair [0,1])
4️⃣ Backpropagation: P1 won (result = 1.0)
  • Node "P1→0": v=1, w=1 (P1 node, P1 won → w increases)
  • Root: N=1
Tree after Iteration 1:
Root (N=1)
  P1→0: v=1, w=1 (100%)
  P1→1: v=0, w=0
  P1→2: v=0, w=0
  P1→3: v=0, w=0
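The simulation step in this iteration is just a random playout to a terminal state. A minimal sketch, assuming the 0/1/2 board encoding; `rollout` and `winner` are illustrative names:

```python
import random

WIN_PAIRS = [(0, 1), (1, 2), (2, 3)]

def winner(board):
    for a, b in WIN_PAIRS:
        if board[a] != 0 and board[a] == board[b]:
            return board[a]
    return 0 if 0 not in board else None  # 0 = draw, None = ongoing

def rollout(board, player, rng=random):
    """Play uniformly random moves from `board` (with `player` to move)
    until the game ends; return 1, 2, or 0 for a draw."""
    board = board[:]
    while winner(board) is None:
        move = rng.choice([i for i, cell in enumerate(board) if cell == 0])
        board[move] = player
        player = 3 - player
    return winner(board)

result = rollout([1, 0, 0, 0], player=2)  # the playout from Iteration 1
```

From [1,0,0,0] the outcome depends on the random moves; from a nearly full board like [1,2,1,0] with P2 to move, the only continuation is a draw.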

🔄 Iteration 2

1️⃣ Selection: Nodes 1, 2, 3 still have UCB1 = ∞. Select position 1
2️⃣ Expansion: Add child node for "P1 plays position 1"
Resulting board state: [0,1,0,0]
3️⃣ Simulation: From [0,1,0,0]:
  • P2 plays position 0 → [2,1,0,0]
  • P1 plays position 3 → [2,1,0,1]
  • P2 plays position 2 → [2,1,2,1] → Draw
4️⃣ Backpropagation: Draw (result = 0.5 for P1)
  • Node "P1→1": v=1, w=0.5 (P1 node, draw → w increases by 0.5)
  • Root: N=2
Tree after Iteration 2:
Root (N=2)
  P1→0: v=1, w=1 (100%)
  P1→1: v=1, w=0.5 (50%)
  P1→2: v=0, w=0
  P1→3: v=0, w=0

🔄 Iterations 3-4 (Summary)

Iteration 3: Expand P1→2, simulate → P1 wins → v=1, w=1
Iteration 4: Expand P1→3, simulate → P1 loses → v=1, w=0
Tree after Iteration 4 (all root children visited once):
Root (N=4)
  P1→0: v=1, w=1 (100%)
  P1→1: v=1, w=0.5 (50%)
  P1→2: v=1, w=1 (100%)
  P1→3: v=1, w=0 (0%)

🔄 Iteration 5 - First Real UCB1 Selection

Calculate UCB1 for each root child (c = √2 ≈ 1.414):
UCB1(P1→0) = 1/1 + √2 × √(ln 4/1) = 1.000 + 1.665 = 2.665
UCB1(P1→1) = 0.5/1 + √2 × √(ln 4/1) = 0.500 + 1.665 = 2.165
UCB1(P1→2) = 1/1 + √2 × √(ln 4/1) = 1.000 + 1.665 = 2.665
UCB1(P1→3) = 0/1 + √2 × √(ln 4/1) = 0.000 + 1.665 = 1.665
Tie between P1→0 and P1→2. Randomly select P1→0
1️⃣ Selection: Select P1→0 (UCB1 = 2.665). This node is NOT fully expanded yet (it has unvisited children).
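These values can be recomputed directly (the exploration term √2 × √(ln 4) rounds to 1.665); `ucb1` here is an illustrative helper name:

```python
import math

# UCB1 for each root child after iteration 4: parent N = 4, each child
# visited once, win counts w = 1, 0.5, 1, 0 for positions 0..3.
def ucb1(w, n, parent_n, c=math.sqrt(2)):
    return w / n + c * math.sqrt(math.log(parent_n) / n)

values = [round(ucb1(w, 1, 4), 3) for w in (1, 0.5, 1, 0)]
# values: [2.665, 2.165, 2.665, 1.665]
```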
2️⃣ Expansion: Expand one of P2's possible moves from [1,0,0,0].
P2 can play positions 1, 2, or 3. Select position 1.
New board state: [1,2,0,0]
Now P1's turn (2 positions available: 2 and 3)
3️⃣ Simulation: From [1,2,0,0]:
  • P1 plays position 3 → [1,2,0,1]
  • P2 plays position 2 → [1,2,2,1] → P2 WINS! (pair [1,2])
4️⃣ Backpropagation: P2 won (result = 1.0 for P2)
  • New node "P2→1" (from [1,0,0,0]): v=1, w=1 (P2 node, P2 won → w=1)
  • Parent "P1→0": v=2, w=1 (flip the result: 1-1=0, so w is unchanged)
  • Root: N=5
Tree after Iteration 5 (now with depth 2):
Root (N=5)
  P1→0: v=2, w=1 (50%)
    P2→1: v=1, w=1
  P1→1: v=1, w=0.5 (50%)
  P1→2: v=1, w=1 (100%)
  P1→3: v=1, w=0 (0%)
Tree is now growing deeper on promising branches!

🔄 Iterations 6-50 (Continued Growth)

As iterations continue, MCTS will:

  • Explore promising branches
  • Occasionally revisit bad moves due to the exploration term
  • Refine statistics as visit counts increase
Final Tree after 50 iterations (example):
Root (N=50)
  P1→0: v=10, w=5.5 (55%)
  P1→1: v=17, w=15 (88%)
  P1→2: v=16, w=13.5 (84%)
  P1→3: v=7, w=2.5 (36%)
🏆 Best Move (most visits): Position 1 (17 visits, 88% win rate)
Player 1 plays position 1: [0,1,0,0]

The UCB1 Formula

UCB1(i) = wᵢ/nᵢ + c × √(ln nₚ / nᵢ)

wᵢ: wins recorded at node i (from its player's perspective)
nᵢ: visits to node i
nₚ: total visits to the parent of node i
c: exploration constant (typically √2)

Exploitation Term

wᵢ/nᵢ

Win rate. Favors moves that have performed well in past simulations.

Exploration Term

c × √(ln nₚ / nᵢ)

Exploration bonus. Grows as the parent's visit count nₚ increases while the child's nᵢ stays low, encouraging exploration of rarely-visited nodes.
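This trade-off is easy to see numerically. In this sketch (the helper name `exploration` is illustrative), the bonus for a fixed parent shrinks as the child accumulates visits, so well-visited children must keep earning selections through their win rate:

```python
import math

def exploration(parent_n, n, c=math.sqrt(2)):
    """The exploration term c * sqrt(ln(n_p) / n_i) of UCB1."""
    return c * math.sqrt(math.log(parent_n) / n)

# Fixed parent with 100 visits; child visited 1, 5, then 25 times.
bonuses = [exploration(100, n) for n in (1, 5, 25)]
```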

Final Move Selection Strategies

✅ Most Visits (Recommended)

Best = argmaxᵢ vᵢ

Select the child with the highest visit count.

Why it works: Visit counts are stable, and UCB1 naturally directs visits to promising moves. A highly-visited node has been thoroughly explored.

⚠️ Highest Win Rate

Best = argmaxᵢ (wᵢ/vᵢ)

Select the child with the highest win rate.

Caution: Can be unreliable for low-visit nodes. A node with v=1, w=1 has 100% win rate but very limited information.
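Both strategies can be applied to the 50-iteration root statistics from the walkthrough, stored here as move → (visits, wins). In this particular tree they agree on position 1; the caution above applies when a barely-visited child happens to have a perfect record.

```python
# Root children after 50 iterations: position -> (v, w).
stats = {0: (10, 5.5), 1: (17, 15.0), 2: (16, 13.5), 3: (7, 2.5)}

by_visits = max(stats, key=lambda m: stats[m][0])            # most visits
by_win_rate = max(stats, key=lambda m: stats[m][1] / stats[m][0])  # best w/v
```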

🎯 Key Insights

Why MCTS is Powerful:
  • No evaluation function needed: Random playouts estimate position value
  • Adaptive: Focuses computation on promising branches automatically
  • Anytime: Can return best move found so far at any point
  • Asymmetric growth: Good moves get deep trees, bad moves stay shallow
Understanding the Tree:
  • Early iterations (1-20): Exploring all options, unstable statistics
  • Middle iterations (20-60): Clear favorites emerge, deeper exploration
  • Late iterations (60+): Focus on 1-2 best moves, very deep search
Perspective Flipping is Critical:

Each node tracks wins from its player's perspective. During backpropagation, we flip the result (1-result) at each level. This implements minimax thinking: what's good for P1 is bad for P2!