An investigation into how a honey bee-sized neural network learns to play chess through next-token prediction


TL;DR

I trained a 900K parameter transformer to play chess using only next-token prediction - no explicit rules, no search algorithms, just pattern recognition. The model achieves an 85.9% legal move rate on the first try, rising to 99.0% with retries. Through deep analysis, I discovered:

  • The model composes moves from 5 tokens (separator + color + piece type + from_square + to_square)
  • Token 3 (piece type) is where real chess decisions happen (~57% confidence)
  • The model uses shortcut learning for simple patterns (e.g., alternating colors)
  • It excels at pattern matching but struggles with deep tactical calculations

This post documents the nitty-gritty details of what the model learned and where it falls short.


1. The Challenge: Chess as a Language Problem

Context: This project was part of the 2026 LLM Course taught by Nathanaël Fijalkow and Marc Lelarge. Thank you for this fun project!

Why <1M Parameters?

The challenge was simple: train the best chess-playing language model possible with fewer than 1,000,000 parameters - roughly the neuron count of a honey bee 🐝. This constraint forces you to make every parameter count.

The Approach: Chess is Just Text

Instead of treating chess as a game with explicit rules and search trees (like AlphaZero), I treated it as a language modeling problem:

Input:  [BOS] WPe2e4 BPe7e5 WNg1f3 ...
Output: BNb8c6  ← model predicts next move

Each move is encoded as character-level tokens: WPe2e4 = White Pawn from e2 to e4. The model learns by predicting the next token, just like GPT learns to predict the next word.
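As a minimal illustration of this format (a sketch, not the project's actual encoder - `encode_move` is a hypothetical helper), composing a move string from its parts looks like:

```python
def encode_move(color: str, piece: str, from_sq: str, to_sq: str) -> str:
    """Compose a move in the post's format, e.g. ('W', 'P', 'e2', 'e4') -> 'WPe2e4'."""
    return f"{color}{piece}{from_sq}{to_sq}"

# The opening 1. e4 e5 2. Nf3 in this encoding:
opening = " ".join(encode_move(*m) for m in [
    ("W", "P", "e2", "e4"),
    ("B", "P", "e7", "e5"),
    ("W", "N", "g1", "f3"),
])
print(opening)  # WPe2e4 BPe7e5 WNg1f3
```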


2. Model Architecture: A Minimal GPT

The Final Configuration

After experimenting with different architectures, here's the configuration I used to stay under the 1M parameter budget:

ChessForCausalLM(
    vocab_size=113,          # Custom tokenizer vocabulary
    n_embd=128,              # Embedding dimension
    n_layer=5,               # Transformer layers
    n_head=4,                # Attention heads (32 dims each)
    n_ctx=256,               # Context length (tokens)
    n_inner=384,             # FFN inner dim (3× instead of 4×)
    tie_weights=True,        # Weight tying saves ~14K params (113 × 128)
)

ChessForCausalLM(
  (wte): Embedding(113, 128)
  (wpe): Embedding(256, 128)
  (drop): Dropout(p=0.1, inplace=False)
  (h): ModuleList(
    (0-4): 5 x TransformerBlock(
      (ln_1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
      (attn): MultiHeadAttention(
        (c_attn): Linear(in_features=128, out_features=384, bias=True)
        (c_proj): Linear(in_features=128, out_features=128, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
      (mlp): FeedForward(
        (c_fc): Linear(in_features=128, out_features=384, bias=True)
        (c_proj): Linear(in_features=384, out_features=128, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (ln_f): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
  (lm_head): Linear(in_features=128, out_features=113, bias=False)
)

Total Parameters: ~900,000
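The printed modules let us sanity-check the budget. A back-of-the-envelope tally (assuming biases exactly as printed and a tied lm_head; the framework's exact count may differ slightly):

```python
# Back-of-the-envelope parameter count from the printed modules (5 layers).
def linear(n_in, n_out, bias=True):
    return n_in * n_out + (n_out if bias else 0)

n_embd, n_inner, vocab, n_ctx, n_layer = 128, 384, 113, 256, 5
ln = 2 * n_embd                                # LayerNorm: weight + bias
block = (ln + linear(n_embd, 3 * n_embd)       # ln_1 + c_attn (fused QKV)
         + linear(n_embd, n_embd)              # attn c_proj
         + ln + linear(n_embd, n_inner)        # ln_2 + mlp c_fc
         + linear(n_inner, n_embd))            # mlp c_proj
total = vocab * n_embd + n_ctx * n_embd + n_layer * block + ln  # wte + wpe + blocks + ln_f
print(total)  # 874368 -- the tied lm_head adds nothing
```

At ~874K, the model fits the budget; untying the output head would add only 113 × 128 ≈ 14K more.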

Custom Tokenizer Design

I designed a custom tokenizer specifically for chess moves with a vocabulary of 113 tokens:

  • Special tokens: [PAD], [BOS], [EOS], [UNK], [SEP]
  • Colors: W (White), B (Black)
  • Pieces: P, N, B, R, Q, K
  • Squares: All 64 squares (a1-h8) as single tokens
  • Modifiers: x (capture), + (check), * (checkmate), o/O (castling)

This tokenization makes each move exactly 5 tokens: separator → color → piece → from_square → to_square. The compact vocabulary keeps the embedding layer small while making moves semantically structured.
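A greedy longest-match tokenizer is one simple way to realize this scheme (an illustrative sketch - the real tokenizer presumably has more entries and edge-case handling, and the vocabulary below doesn't reproduce all 113 tokens):

```python
# Squares are two characters, so try them before single-character tokens.
SQUARES = {f + r for f in "abcdefgh" for r in "12345678"}   # 64 square tokens
SINGLES = set("WB") | set("PNBRQK") | set("x+*oO") | {" "}  # colors, pieces, modifiers, separator

def tokenize(text: str) -> list:
    tokens, i = [], 0
    while i < len(text):
        if text[i:i + 2] in SQUARES:      # longest match first
            tokens.append(text[i:i + 2])
            i += 2
        elif text[i] in SINGLES:
            tokens.append(text[i])
            i += 1
        else:
            tokens.append("[UNK]")
            i += 1
    return tokens

print(tokenize("WPe2e4 BNb8c6"))
# ['W', 'P', 'e2', 'e4', ' ', 'B', 'N', 'b8', 'c6']
```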


3. Training Setup

Dataset: 1M Lichess Games by David Louapre

Dataset on Huggingface

dataset: dlouapre/lichess_2025-01_1M
format:  WPe2e4 BPe7e5 WNg1f3 BNb8c6 WBf1c4 ...
split:   1,000,000 games

Each game is one long sequence of space-separated moves. The model learns by predicting each token given all previous tokens (causal language modeling).
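The causal objective can be sketched in two lines - the targets are just the inputs shifted by one position (illustrative only, not the actual training code):

```python
# Causal language modeling: at every position, predict the following token.
tokens = ["[BOS]", "W", "P", "e2", "e4", " ", "B", "P", "e7", "e5"]
inputs, targets = tokens[:-1], tokens[1:]   # shift-by-one supervision
print(list(zip(inputs, targets))[:3])
# [('[BOS]', 'W'), ('W', 'P'), ('P', 'e2')]
```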

Training Configuration

python -m src.train \
    --output_dir ./my_model \
    --num_train_epochs 3 \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-4 \
    --warmup_steps 500 \
    --max_length 256

Hardware: MacBook Pro M1 (32GB RAM, MPS acceleration)
Training Time: ~2 hours for 3 epochs


4. Evaluation: How Well Does It Play?

The primary test: Can the model generate legal chess moves?

# Evaluate on 500 random positions
evaluator.evaluate_legal_moves(n_positions=500, temperature=0.8, seed=42)

Results:

Positions tested:     500
Legal (1st try):      429 (85.9%)
Legal (with retry):   495 (99.0%)
Always illegal:       5 (1.0%)

Key Finding: The model is remarkably good at generating legal moves! With a simple retry mechanism (sample 3 times), it achieves 99% success.
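The retry mechanism is nothing fancy. A library-free sketch (assuming some `sample_fn` that draws one candidate move string, and a precomputed set of legal moves for the position - both hypothetical stand-ins here):

```python
def sample_with_retry(sample_fn, legal_moves, max_tries=3):
    """Draw up to max_tries candidates; return (move, retries_used) or (None, max_tries)."""
    for attempt in range(max_tries):
        candidate = sample_fn()
        if candidate in legal_moves:
            return candidate, attempt   # attempt == 0 means legal on the first try
    return None, max_tries

# Toy usage: first sample is illegal, second is legal -> one retry needed.
candidates = iter(["e2e5", "e2e4"])
move, retries = sample_with_retry(lambda: next(candidates), {"e2e4", "d2d4"})
print(move, retries)  # e2e4 1
```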

Consistency Across Game Phases

I analyzed performance by game length:

Game Phase   Moves   Success Rate   Positions
Early game   ≤15     85.2%          189
Mid game     16-30   86.8%          201
Late game    >30     85.7%          110

Insight: No significant degradation in late game! The model maintains consistency even with longer contexts.

Distribution of Retries

Figure: Distribution of retries needed for legal moves across 500 test positions


5. Deep Dive: Multi-Token Move Architecture

Moves Are Composed of 5 Tokens

This is where things get interesting. Each move isn’t a single token - it’s built from 5 tokens:

Example: Black knight from b8 to c6

Token 1: " " (separator)        → ~100% confidence
Token 2: "B" (Black to play)    → ~100% confidence
Token 3: "N" (knight piece)     → ~57% confidence
Token 4: "b8" (from square)     → varies
Token 5: "c6" (to square)       → varies

Note: The vocabulary includes all 64 squares (a1-h8) as single tokens, so squares aren’t split into separate characters.

Token 3 is Where the Piece Selection Happens

I analyzed the confidence at each token position:

def get_token_probs_at_positions(board, moves_str, num_tokens=5):
    """Get probabilities for all 5 tokens of next move."""
    # ... generate tokens one by one, measure confidence

Findings:

  • Token 1 (separator): ~100% - trivially predictable
  • Token 2 (color W/B): ~100% - deterministic (whose turn)
  • Token 3 (piece type): ~57% confidence - THIS is the chess decision!
  • Token 4 (from square): Varies based on piece chosen
  • Token 5 (to square): Varies based on position and tactics

Model Confidence vs Context Length

Figure: Model’s confidence on piece type prediction (Token 3) remains stable across different game lengths

Why ~57% is Good News

The ~57% confidence on piece selection shows the model is genuinely uncertain about which piece to move. This is desirable! It means:

  • The model considers multiple valid options (pawn vs knight vs bishop)
  • It’s not overconfident
  • The probability distribution reflects real strategic choices

Example: What the Model Considers

Position after e2e4 e7e5 g1f3 b8c6 (White to play):

Model's piece type probabilities (Token 3):

  1.  P        41.2%  ✓  ← Pawn move (most common)
  2.  N        32.1%  ✓  ← Knight move  
  3.  B        18.5%  ✓  ← Bishop move
  4.  Q         4.8%  ✓  ← Queen move (rare this early)
  5.  K         2.1%  ✓  ← King move (castling)
  6.  R         1.3%  ✓  ← Rook move (hard to develop)

Interpretation: Moderate uncertainty (good!)

The model has learned that in the opening, pawns and knights are most likely to move, while queens and rooks are rarer. This matches chess principles!
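One way to quantify that "moderate uncertainty" is the entropy of the distribution above (a quick arithmetic check on the printed numbers):

```python
import math

# Token 3 piece-type probabilities from the example position.
piece_probs = {"P": 0.412, "N": 0.321, "B": 0.185, "Q": 0.048, "K": 0.021, "R": 0.013}
entropy = -sum(p * math.log2(p) for p in piece_probs.values())
print(f"{entropy:.2f} bits (uniform over 6 pieces would be {math.log2(6):.2f})")
# ~1.91 bits vs 2.58: genuinely uncertain, but far from a uniform guess
```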


6. The Color Token Mystery: Shortcut Learning

100% Accuracy on Predicting Color

Token 2 (W or B) is predicted with near-perfect accuracy. But how does the model do this?

Investigation: Four Tests

I designed tests to probe how the model predicts color:

Test 1: Normal Case (Move 4)

board = chess.Board()
board.push_uci("e2e4")  # White
board.push_uci("e7e5")  # Black
board.push_uci("g1f3")  # White
# Model should predict: Black

Result: Predicted 'B' with 99.8% confidence 

Test 2: Edge Case (Move 1)

board = chess.Board()  # Empty board
# Model should predict: White

Result: Predicted 'W' with 99.9% confidence 

Insight: Model learned that [BOS] → separator → 'W' is the starting pattern!

Test 3: Adversarial (Corrupt Last Color, White to Move)

# Change last move from "BN..." to "WN..." (wrong color)

# Original:  ...WPe2e4 BPe7e5 WNg1f3 BNb8c6
# Corrupted: ...WPe2e4 BPe7e5 WNg1f3 WNb8c6
# (Changed last move's color to confuse the model)

# Position: 4 moves
# Actual turn: White
# Expected token: W

# Top 2 Token 2 predictions:
#    1. 'W' 100.00% ✓✓✓
#    2. 'B'  0.00% 

Result: Predicted 'W' - Model seems unbothered by the corruption

Test 4: Adversarial (Corrupt Last Color, Black to Move)


# Original:  ...WPe2e4 BPe7e5 WPd2d4
# Corrupted: ...WPe2e4 BPe7e5 BPd2d4
# (Changed last move's color to confuse the model)

# Position: 3 moves
# Actual turn: Black
# Expected token: B

# Top 2 Token 2 predictions:
#    1. 'B' 100.00% ✓✓✓
#    2. 'W'  0.00% 

Result: Predicted 'B' - the correct turn again, despite the corrupted color
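The corruption in Tests 3 and 4 is just a string edit applied before re-encoding (a sketch of the procedure; `corrupt_last_color` is my name for it):

```python
def corrupt_last_color(moves: str) -> str:
    """Flip the color prefix ('W' <-> 'B') of the final move in a game string."""
    *head, last = moves.split(" ")
    flipped = ("B" if last[0] == "W" else "W") + last[1:]
    return " ".join(head + [flipped])

print(corrupt_last_color("WPe2e4 BPe7e5 WPd2d4"))  # WPe2e4 BPe7e5 BPd2d4
```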

Conclusion: The model kept predicting the correct turn even with corrupted colors. Pinning down exactly which cues it relies on would require a deeper interpretability dive into the model.


7. Visualization: Seeing What the Model Sees

Example 1: Successful First-Try Move

Position: After WNg1h3 BPd7d6 WPf2f3 BBc8g4 WRh1g1 BPh7h5 WRg1h1 BPa7a6 WPd2d4 BNg8h6 WPd4d5 BBg4f5, White to play

Successful Bishop Development

Figure: Model successfully predicts Bishop development (green arrow)

Model's prediction: c2c4 (Pawn move)
Legal: ✓ Yes
Retries needed: 0

Analysis: The model found a legal, natural move on its first attempt, reinforcing the advanced d5 pawn.

Example 2: Failed First Attempt (Red + Green Arrows)

Position: Complex middlegame after 18 moves

Failed First Attempt

Figure: Red arrow shows illegal first attempt, green arrow shows successful retry

First attempt: Ne8f6 (illegal - Black's king is in check from a White bishop, and this knight move doesn't address the check)
Final move:    Kd6c7 (Black's king steps out of check)
Retries needed: 1

Analysis: The model initially failed to register that its own king was in check

8. Tactical Awareness: Mate-in-1 Tests

Can the model spot checkmate in one move?

I tested the model on classic mate-in-1 positions:

Missed Mate Position

Missed WQf7+*

Missed Mate Position

Missed BQh4+*

Missed Mate Position

Missed WRd8+*

Missed Mate Position

Missed BQh1+*

Insight: The model missed all of these mate-in-1 situations. Training was not goal-oriented (pure next-token prediction on human games), so this was expected.


9. Opening Repertoire Analysis

White’s First Move (100 trials)

from collections import Counter

# Sample White's first move 100 times and tally the distribution.
first_moves = Counter()
for trial in range(100):
    board = chess.Board()
    move = model.generate_move(board, temperature=0.8)
    first_moves[move] += 1

Results:

e2e4  46%  ← King's Pawn Opening (most popular)
d2d4  28%  ← Queen's Pawn Opening
g1f3  12%  ← Réti Opening
c2c4   8%  ← English Opening
e2e3   6%  ← Van 't Kruijs Opening (uncommon)

White's Opening Repertoire

Figure: Top 5 first moves for White, shown as colored arrows on the starting position

Analysis: The model learned the frequency distribution of openings in the training data! It plays e2e4 most often because that’s what humans play most often.

Black’s Response to 1.e4 (100 trials)

e7e5  52%  ← King's Pawn Game (symmetrical)
c7c5  31%  ← Sicilian Defense (sharp!)
e7e6   9%  ← French Defense
c7c6   5%  ← Caro-Kann Defense
d7d5   3%  ← Scandinavian Defense

Black's Response to e4

Figure: Black’s most common responses to 1.e4, with White’s first move shown in gray

Insight: The model has a reasonable opening repertoire dominated by popular defenses. The Sicilian Defense (c7c5) at 31% shows it learned aggressive play!


10. Where the Model Fails: Limitations

Common Failure Modes

Before discussing the model's broader limitations, let's see some actual examples of where it makes mistakes:

Example 1: Trying to Jump Over Pawns

Illegal knight move

The model sometimes forgets knights are the only pieces that can jump over others

Example 2: Moving Pawns Backwards

Illegal pawn move

Pawns can only move forward - but the model occasionally tries backward moves

Example 3: Missing Opponent Checks

Missed check

The model doesn’t always track which squares are under attack

These real examples show the model’s pattern-matching approach breaks down when positions deviate from common patterns.


1. No Deep Calculation

The model can’t think ahead. It predicts the next token based on patterns, not by searching future positions. This means:

  • ❌ Can’t reliably find forced mates beyond simple patterns
  • ❌ Misses long tactical sequences
  • ❌ No understanding of threats or defensive moves

2. Limited Board State Tracking

With 256 token context and ~900K parameters, the model has limited “memory”:

  • ❌ May forget piece positions in complex games
  • ❌ Struggles with rare piece configurations
  • ❌ Better at openings (memorized) than endgames (novel)

3. Shortcut Learning

The model finds simple heuristics when possible:

  • ✓ Color prediction: read turn parity straight from the sequence (100% accuracy without needing to understand the board)
  • ✓ Piece frequency: Pawns and knights move more than queens
  • ✓ Opening patterns: Memorize popular sequences

This is efficient but limits generalization to novel positions.

4. No Strategic Understanding

The model generates legal moves but doesn’t understand:

  • Position evaluation (is this move good or bad?)
  • Plans (pawn storm, minority attack, etc.)
  • Compensation (sacrificing material for tempo/position)

It’s pattern matching, not “thinking” about chess.


11. Key Takeaways: What Did We Learn?

1. Language Modeling Principles Work for Games

Next-token prediction naturally captures sequential game logic:

  • ✓ The model learned chess rules implicitly from data
  • ✓ No explicit rule engine needed
  • ✓ Attention mechanism handles move dependencies

2. Small Models Can Be Surprisingly Capable

With <1M parameters (honey bee-sized brain):

  • ✓ 85.9% legal move accuracy
  • ✓ Reasonable opening repertoire

This suggests efficient compression of chess knowledge is possible.

3. Multi-Token Architecture Provides Interpretability

Breaking moves into 5 tokens reveals:

  • Where decisions happen: Token 3 (piece type)
  • What the model considers: Probability distributions over pieces
  • Confidence levels: ~57% means genuine uncertainty

We can see the model’s thought process at each token!

4. Pattern Recognition ≠ Understanding

High legal move rate ≠ good chess play:

  • Model excels at learned patterns
  • Struggles with novel positions requiring calculation
  • Uses shortcuts (color flipping) instead of deep reasoning

This is the difference between System 1 (fast, intuitive) and System 2 (slow, deliberate) thinking.


12. Future Directions

Immediate Improvements

1. Scale Up

  • Relax the 1M parameter budget (e.g. 10M parameters)
  • Train on more games with a longer context window

2. Curriculum Learning

  • Start with simple puzzles (mate-in-1)
  • Gradually increase difficulty
  • Fine-tune on tactical pattern datasets

3. Hybrid Approaches

  • Combine LM generation with shallow search (depth 1-2)
  • Use model probabilities to guide MCTS
  • Best of both worlds: pattern recognition + calculation

Research Questions

1. Scaling Laws for Chess

  • 10M parameters → 2000 ELO?
  • What’s the minimal architecture for 90%+ legal rate?

2. Comparison to Value-Function Approaches

  • AlphaZero style: value network + search
  • Pure LM approach: next-token prediction
  • Which is more parameter-efficient?

3. Transfer Learning

  • Pre-train on chess puzzles
  • Fine-tune on full games
  • Does tactical training improve strategic play?

13. Explore the Jupyter Notebook

jupyter notebook play.ipynb

The notebook contains all the analysis from this post: token probability analysis, attention visualization, mate-in-1 tests, and more!


14. Conclusion: Can a Bee Play Chess?

Yes - if by “play chess” you mean “generate legal moves 86% of the time.”

No - if you expect strategic understanding or deep tactical calculation.

The sub-1M parameter transformer learned impressive chess patterns through pure next-token prediction:

  • ✅ Implicit rule learning from data
  • ✅ Compositional move generation
  • ✅ Reasonable opening knowledge
  • ✅ Simple tactical recognition

But it lacks:

  • ❌ Deep calculation abilities
  • ❌ Strategic planning
  • ❌ True understanding of positions

For a model with fewer parameters than a honey bee’s brain, playing legal chess 86% of the time is remarkable. The question is: can we teach this bee to sting? 🐝♟️

Published: February 2026