---
license: mit
library_name: pytorch
tags:
- reinforcement-learning
- game-ai
- tic-tac-toe
- pytorch
---
# Ultimate Tic Tac Toe Deep Learning Bot

A PyTorch model for playing Ultimate Tic Tac Toe, trained through self-play guided by MCTS.

## Available checkpoints

- `checkpoints/medium.pth`: medium-difficulty checkpoint.
- `checkpoints/hard.pth`: hard-difficulty checkpoint.
## Usage

Run `python run.py --help` for help.

**Shared flags**

- `--device`: Torch device string. If omitted, `"cuda"` is picked automatically when available, otherwise `"cpu"`.
- `--checkpoint`: Path to the model checkpoint file. Default is `latest.pth`. It is loaded for eval, play, and checkpoint-based arena, and used as the save path for accepted training checkpoints.
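
The `--device` fallback described above could look roughly like the sketch below. `pick_device` is a hypothetical helper name, not taken from `run.py`; the sketch only assumes that an omitted flag arrives as `None` and relies on `torch.cuda.is_available()` for detection.

```python
def pick_device(requested=None):
    """Return the device string to use (hypothetical helper, not from run.py)."""
    if requested:
        # An explicit --device value always wins.
        return requested
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())       # "cuda" when PyTorch sees a GPU, otherwise "cpu"
print(pick_device("cpu"))  # an explicit choice is returned unchanged
```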

**Training parameters**

- `--resume`: Loads model and optimizer state from `--checkpoint` before continuing training.
- `--num-simulations` (default `100`): MCTS simulations per move during self-play. Higher values give stronger but slower play.
- `--num-iters` (default `50`): Number of outer training iterations. Each iteration generates new self-play games, trains, then arena-tests the new model.
- `--num-eps` (default `20`): Self-play games per iteration.
- `--epochs` (default `5`): Passes over the current replay-buffer training set per iteration.
- `--batch-size` (default `64`): Mini-batch size for gradient updates.
- `--lr` (default `5e-4`): Adam learning rate.
- `--weight-decay` (default `1e-4`): Adam weight decay (L2-style regularization).
- `--replay-buffer-size` (default `50000`): Maximum number of training examples retained across iterations. Older examples are dropped.
- `--value-loss-weight` (default `1.0`): Multiplier on the value-head loss in the total training loss, which is `policy_KL + value_loss_weight * value_loss`.
- `--grad-clip-norm` (default `5.0`): Global gradient-norm clipping threshold applied before the optimizer step.
- `--temperature-threshold` (default `10`): In self-play, moves before this step are sampled stochastically from MCTS visit counts; later moves use greedy selection.
- `--root-dirichlet-alpha` (default `0.3`): Alpha parameter of the Dirichlet noise added to root priors during self-play MCTS to force exploration.
- `--root-exploration-fraction` (default `0.25`): Fraction of the root prior that is replaced by Dirichlet noise.
- `--arena-compare-games` (default `6`): Number of head-to-head games between the candidate and the previous model after each iteration. If <= 0, every candidate is accepted.
- `--arena-accept-threshold` (default `0.55`): Minimum average points the candidate needs in the arena to be kept. Win = 1, draw = 0.5.
- `--arena-compare-simulations` (default `8`): MCTS simulations per move during those arena comparison games. Separate from the self-play `--num-simulations`.
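
To illustrate how several of these settings interact, here is a minimal standard-library sketch of root-noise mixing, temperature-threshold move selection, the weighted total loss, and arena acceptance. All function names are hypothetical and the repository's actual code may differ. In particular, the blend `(1 - fraction) * prior + fraction * noise`, the `step < threshold` test, and the `>=` comparison against the accept threshold are assumptions inferred from the flag descriptions.

```python
import random


def mix_root_noise(priors, alpha=0.3, fraction=0.25, rng=random):
    """Blend root priors with Dirichlet(alpha) noise (assumed formula)."""
    # A Dirichlet sample is a set of Gamma(alpha, 1) draws normalized to sum to 1.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in priors]
    total = sum(gammas)
    noise = [g / total for g in gammas]
    return [(1 - fraction) * p + fraction * n for p, n in zip(priors, noise)]


def select_move(visit_counts, step, temperature_threshold=10, rng=random):
    """Sample by visit count before the threshold step, then play greedily."""
    moves = list(visit_counts)
    if step < temperature_threshold:
        weights = [visit_counts[m] for m in moves]
        return rng.choices(moves, weights=weights, k=1)[0]
    return max(moves, key=lambda m: visit_counts[m])


def total_loss(policy_kl, value_loss, value_loss_weight=1.0):
    """Total loss as stated above: policy_KL + value_loss_weight * value_loss."""
    return policy_kl + value_loss_weight * value_loss


def arena_accepts(wins, draws, games, threshold=0.55):
    """Keep the candidate when its average points reach the threshold."""
    if games <= 0:
        # --arena-compare-games <= 0 means every candidate is accepted.
        return True
    return (wins + 0.5 * draws) / games >= threshold


print(mix_root_noise([0.5, 0.3, 0.2]))              # still sums to ~1.0
print(select_move({3: 1, 7: 9, 8: 2}, step=12))     # 7 (most visited)
print(arena_accepts(wins=3, draws=1, games=6))      # True  (3.5 / 6 is about 0.58)
print(arena_accepts(wins=3, draws=0, games=6))      # False (3 / 6 = 0.50)
```

With the default six comparison games, a candidate therefore needs at least 3.5 points (for example three wins and a draw) to clear the 0.55 threshold under these assumptions.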

**Evaluation parameters**

- `--moves` (default `""`): Comma-separated move list to reach a position from the starting board, e.g. `0,10,4`.
- `--top-k` (default `10`): How many of the highest-probability legal moves to print from the model policy.
- `--with-mcts`: Also run MCTS on that position and print its best move, rather than only the raw network policy and value.
- `--num-simulations` (default `100`): Only matters with `--with-mcts`; sets the number of MCTS simulations run for that evaluation.
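
The `--moves` and `--top-k` options can be pictured with the small sketch below. Both helper names are hypothetical, and the move encoding is an assumption: the sketch treats a move as an integer index into a flat 81-entry policy (one entry per cell of the 9x9 board), which the source does not confirm.

```python
def parse_moves(text):
    """Turn a --moves string such as "0,10,4" into a list of ints ("" gives [])."""
    return [int(part) for part in text.split(",") if part.strip()]


def top_k_legal(policy, legal_moves, k=10):
    """Return up to k (move, probability) pairs over legal moves, best first."""
    ranked = sorted(legal_moves, key=lambda m: policy[m], reverse=True)
    return [(m, policy[m]) for m in ranked[:k]]


# Assumed layout: one probability per board cell, indexed 0..80.
policy = [0.0] * 81
policy[4], policy[10], policy[40] = 0.2, 0.3, 0.5

print(parse_moves("0,10,4"))                               # [0, 10, 4]
print(top_k_legal(policy, legal_moves=[4, 10, 40], k=2))   # [(40, 0.5), (10, 0.3)]
```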

**Play parameters**

- `--human-player` (default `1`): Which side you control. `1` means X, `-1` means O.
- `--num-simulations` (default `100`): MCTS simulations the AI uses for each move.

**Arena parameters**

- `--games` (default `20`): Number of games to run.
- `--num-simulations` (default `100`): MCTS simulations per move for checkpoint-based players.
- `--x-player` / `--o-player`: Either `checkpoint` or `random`. Chooses the agent type for each side.
- `--x-checkpoint` / `--o-checkpoint`: Checkpoint path for that side when its player type is `checkpoint`. Ignored for `random`.

A few practical examples:

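```bash
# Resume training from the default checkpoint: 100 iterations of 50 self-play games
python run.py train --num-iters 100 --num-eps 50 --resume
# Evaluate the position after moves 0,10,4, including a 200-simulation MCTS search
python run.py eval --checkpoint latest.pth --moves 0,10,4 --with-mcts --num-simulations 200
# Play as O against the AI at 300 simulations per move
python run.py play --human-player -1 --num-simulations 300
# Run 50 games with a checkpoint player as X against a random player as O
python run.py arena --games 50 --x-player checkpoint --o-player random
```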