Algorithms
TwisteRL currently supports two reinforcement learning algorithms.
PPO (Proximal Policy Optimization)
PPO is a policy gradient method that strikes a balance between ease of implementation, sample complexity, and wall-clock time.
Key Features
Stable Training: Uses a clipped objective to prevent destructively large policy updates
Sample Efficient: Reuses collected experience across multiple training epochs per update
General Purpose: Works well across a wide variety of environments
Configuration
PPO is configured through a JSON config file. Here’s an example with the actual parameter names:
```json
{
  "algorithm_cls": "twisterl.rl.PPO",
  "algorithm": {
    "collecting": {
      "num_cores": 32,
      "num_episodes": 1024,
      "lambda": 0.995,
      "gamma": 0.995
    },
    "training": {
      "num_epochs": 10,
      "vf_coef": 0.8,
      "ent_coef": 0.01,
      "clip_ratio": 0.1,
      "normalize_advantage": true
    },
    "optimizer": {
      "lr": 0.00015
    },
    "learning": {
      "diff_threshold": 0.85,
      "diff_max": 32,
      "diff_metric": "ppo_1"
    }
  }
}
```
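Because the config is plain JSON, you can also adjust it programmatically before launching a run. The snippet below is a minimal sketch using only the standard library: it loads the example config referenced in the training command further down, overrides a couple of values, and writes a new file (the output filename is arbitrary) to pass to the trainer.

```python
import json

# Load the example PPO config shipped with the repo
with open("examples/ppo_puzzle8_v1.json") as f:
    config = json.load(f)

# Override a couple of hyperparameters for this experiment
config["algorithm"]["optimizer"]["lr"] = 3e-4
config["algorithm"]["training"]["num_epochs"] = 15

# Write the modified config (arbitrary filename) and point the trainer at it
with open("ppo_puzzle8_custom.json", "w") as f:
    json.dump(config, f, indent=2)

# Then run: python -m twisterl.train --config ppo_puzzle8_custom.json
```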
Parameters
Collecting Parameters:
num_cores: Number of parallel workers for data collection
num_episodes: Number of episodes to collect per iteration
lambda: GAE lambda parameter for advantage estimation
gamma: Discount factor
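The lambda and gamma values feed into Generalized Advantage Estimation (GAE). The following is a generic, self-contained sketch of that computation (the helper name is ours, not TwisteRL's internal code) showing how the two parameters interact:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.995, lam=0.995):
    """Generalized Advantage Estimation for a single finished episode.

    `values` holds the critic's estimate for each state plus a final bootstrap
    value (0.0 for a terminal state), so len(values) == len(rewards) + 1.
    """
    advantages = np.zeros(len(rewards), dtype=np.float64)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of residuals, controlled by lambda
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Example: a 3-step episode with a reward only at the end
print(gae_advantages([0.0, 0.0, 1.0], [0.2, 0.4, 0.7, 0.0]))
```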
Training Parameters:
num_epochs: Number of training epochs per update
vf_coef: Coefficient for value function loss
ent_coef: Coefficient for entropy bonus
clip_ratio: PPO clipping parameter (epsilon)
normalize_advantage: Whether to normalize advantages
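To see how these coefficients fit together, here is a generic sketch of the standard PPO objective (clipped policy term plus value loss plus entropy bonus) using the parameter names above. It illustrates the roles of clip_ratio, vf_coef, ent_coef and normalize_advantage rather than reproducing TwisteRL's exact implementation.

```python
import torch

def ppo_loss(new_logp, old_logp, advantages, values, returns, entropy,
             clip_ratio=0.1, vf_coef=0.8, ent_coef=0.01, normalize_advantage=True):
    """Generic PPO objective: clipped policy loss + value loss - entropy bonus."""
    if normalize_advantage:
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # Probability ratio pi_new(a|s) / pi_old(a|s)
    ratio = torch.exp(new_logp - old_logp)

    # Clipped surrogate objective: take the pessimistic minimum of the two terms
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    value_loss = (values - returns).pow(2).mean()
    entropy_bonus = entropy.mean()

    return policy_loss + vf_coef * value_loss - ent_coef * entropy_bonus
```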
Optimizer Parameters:
lr: Learning rate for Adam optimizer
Learning Parameters:
diff_threshold: Success rate threshold for increasing difficulty
diff_max: Maximum difficulty level
diff_metric: Which evaluation metric to use for difficulty progression
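These parameters drive a difficulty curriculum: once the chosen evaluation metric clears diff_threshold, the difficulty is raised until it reaches diff_max. The sketch below is illustrative only, an assumption about how such a schedule typically works, not TwisteRL's actual learning loop.

```python
def next_difficulty(current_difficulty, eval_metrics,
                    diff_metric="ppo_1", diff_threshold=0.85, diff_max=32):
    """Hypothetical helper: bump the curriculum difficulty by one level when the
    tracked evaluation metric exceeds the threshold, capped at diff_max."""
    if eval_metrics.get(diff_metric, 0.0) >= diff_threshold and current_difficulty < diff_max:
        return current_difficulty + 1
    return current_difficulty

# Example: the tracked metric just crossed the threshold
print(next_difficulty(5, {"ppo_1": 0.9}))  # -> 6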
Example
Train PPO on the 8-puzzle:
python -m twisterl.train --config examples/ppo_puzzle8_v1.json
AlphaZero (AZ)
AlphaZero combines Monte Carlo Tree Search (MCTS) with deep neural networks for planning-based learning.
Key Features
Tree Search: Employs MCTS for look-ahead planning
Self-Play: Learns through self-play without human knowledge
Value + Policy Learning: Jointly learns value and policy functions
Configuration
AlphaZero is configured similarly to PPO:
```json
{
  "algorithm_cls": "twisterl.rl.AZ",
  "algorithm": {
    "collecting": {
      "num_cores": 32,
      "num_episodes": 512,
      "num_mcts_searches": 1000,
      "C": 1.41,
      "max_expand_depth": 1,
      "seed": 123
    },
    "training": {
      "num_epochs": 10
    },
    "optimizer": {
      "lr": 0.0003
    }
  }
}
```
Parameters
Collecting Parameters:
num_mcts_searches: Number of MCTS simulations per move
C: Exploration constant (UCB formula)
max_expand_depth: Maximum tree expansion depth
seed: Random seed for reproducibility
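The exploration constant C appears in the selection rule used during tree descent. Below is a generic PUCT-style scoring function (the AlphaZero variant of UCB); treat it as a sketch of the idea rather than TwisteRL's MCTS code, which may differ in detail.

```python
import math

def puct_score(child_value, child_prior, child_visits, parent_visits, C=1.41):
    """PUCT score: exploit the current value estimate while exploring children
    with high prior probability and few visits."""
    exploration = C * child_prior * math.sqrt(parent_visits) / (1 + child_visits)
    return child_value + exploration

def select_child(children, parent_visits, C=1.41):
    """Pick the child with the highest PUCT score during tree descent.
    `children` is assumed to be a list of dicts with value/prior/visits keys."""
    return max(
        children,
        key=lambda ch: puct_score(ch["value"], ch["prior"], ch["visits"], parent_visits, C),
    )
```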
Algorithm Comparison
| Algorithm | Type | Sample Eff. | Compute Cost | Use Case |
|---|---|---|---|---|
| PPO | On-Policy | Medium | Low | General RL |
| AlphaZero | Planning | High | High | Perfect Info |
When to Use Each Algorithm
Use PPO when:
You want a general-purpose, fast-to-train algorithm
Computational resources are limited
You need stable, reliable training
Use AlphaZero when:
The environment has perfect information (you know the transition model)
You can afford higher computational cost for MCTS
Look-ahead planning is beneficial
Hyperparameter Tuning
General Guidelines
Start with defaults: See src/twisterl/defaults.py for sensible default parameters
Adjust learning rate first: This usually has the biggest impact
Monitor training curves: Use TensorBoard to track progress (logs saved to runs/ by default)
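For example, once a run has written logs to the default directory, the standard TensorBoard invocation is:
tensorboard --logdir runs/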
PPO Tuning Tips
Increase num_epochs if training is stable but slow
Decrease clip_ratio if policy updates are too aggressive
Increase ent_coef if the policy becomes too deterministic too quickly
Adjust gamma and lambda based on episode length
AlphaZero Tuning Tips
Increase num_mcts_searches for better play quality (but slower training)
Adjust C to balance exploration vs exploitation in MCTS