Algorithms
TwisteRL currently supports two reinforcement learning algorithms.
PPO (Proximal Policy Optimization)
PPO is a policy gradient method that strikes a balance between ease of implementation, sample complexity, and wall-clock time.
Key Features
Stable Training: Uses a clipped objective to prevent destructively large policy updates
Sample Efficient: Reuses collected experience across multiple training epochs per update
General Purpose: Works well across a wide variety of environments
Configuration
PPO is configured through a JSON config file. Here’s an example with the actual parameter names:
```json
{
  "algorithm_cls": "twisterl.rl.PPO",
  "algorithm": {
    "collecting": {
      "num_cores": 32,
      "num_episodes": 1024,
      "lambda": 0.995,
      "gamma": 0.995
    },
    "training": {
      "num_epochs": 10,
      "vf_coef": 0.8,
      "ent_coef": 0.01,
      "clip_ratio": 0.1,
      "normalize_advantage": true
    },
    "optimizer": {
      "lr": 0.00015
    },
    "learning": {
      "diff_threshold": 0.85,
      "diff_max": 32,
      "diff_metric": "ppo_1"
    }
  }
}
```
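Because the config is plain JSON, you can also adjust it programmatically before launching a run. The snippet below is a minimal sketch using only the standard library: it loads the example config referenced in the training command further down, overrides a couple of values, and writes a new file (the output filename is arbitrary) to pass to the trainer.

```python
import json

# Load the example PPO config shipped with the repo
with open("examples/ppo_puzzle8_v1.json") as f:
    config = json.load(f)

# Override a couple of hyperparameters for this experiment
config["algorithm"]["optimizer"]["lr"] = 3e-4
config["algorithm"]["training"]["num_epochs"] = 15

# Write the modified config (arbitrary filename) and point the trainer at it
with open("ppo_puzzle8_custom.json", "w") as f:
    json.dump(config, f, indent=2)

# Then run: python -m twisterl.train --config ppo_puzzle8_custom.json
```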
Parameters
Collecting Parameters:
num_cores: Number of parallel workers for data collection
num_episodes: Number of episodes to collect per iteration
lambda: GAE lambda parameter for advantage estimation
gamma: Discount factor
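The lambda and gamma values feed into Generalized Advantage Estimation (GAE). The following is a generic, self-contained sketch of that computation (the helper name is ours, not TwisteRL's internal code) showing how the two parameters interact:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.995, lam=0.995):
    """Generalized Advantage Estimation for a single finished episode.

    `values` holds the critic's estimate for each state plus a final bootstrap
    value (0.0 for a terminal state), so len(values) == len(rewards) + 1.
    """
    advantages = np.zeros(len(rewards), dtype=np.float64)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of residuals, controlled by lambda
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Example: a 3-step episode with a reward only at the end
print(gae_advantages([0.0, 0.0, 1.0], [0.2, 0.4, 0.7, 0.0]))
```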
Training Parameters:
num_epochs: Number of training epochs per update
vf_coef: Coefficient for value function loss
ent_coef: Coefficient for entropy bonus
clip_ratio: PPO clipping parameter (epsilon)
normalize_advantage: Whether to normalize advantages
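To see how these coefficients fit together, here is a generic sketch of the standard PPO objective (clipped policy term plus value loss plus entropy bonus) using the parameter names above. It illustrates the roles of clip_ratio, vf_coef, ent_coef and normalize_advantage rather than reproducing TwisteRL's exact implementation.

```python
import torch

def ppo_loss(new_logp, old_logp, advantages, values, returns, entropy,
             clip_ratio=0.1, vf_coef=0.8, ent_coef=0.01, normalize_advantage=True):
    """Generic PPO objective: clipped policy loss + value loss - entropy bonus."""
    if normalize_advantage:
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # Probability ratio pi_new(a|s) / pi_old(a|s)
    ratio = torch.exp(new_logp - old_logp)

    # Clipped surrogate objective: take the pessimistic minimum of the two terms
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    value_loss = (values - returns).pow(2).mean()
    entropy_bonus = entropy.mean()

    return policy_loss + vf_coef * value_loss - ent_coef * entropy_bonus
```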
Optimizer Parameters:
lr: Learning rate for Adam optimizer
Learning Parameters:
diff_threshold: Success rate threshold for increasing difficulty
diff_max: Maximum difficulty level
diff_metric: Which evaluation metric to use for difficulty progression
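These parameters drive a difficulty curriculum: once the chosen evaluation metric clears diff_threshold, the difficulty is raised until it reaches diff_max. The sketch below is illustrative only, an assumption about how such a schedule typically works, not TwisteRL's actual learning loop.

```python
def next_difficulty(current_difficulty, eval_metrics,
                    diff_metric="ppo_1", diff_threshold=0.85, diff_max=32):
    """Hypothetical helper: bump the curriculum difficulty by one level when the
    tracked evaluation metric exceeds the threshold, capped at diff_max."""
    if eval_metrics.get(diff_metric, 0.0) >= diff_threshold and current_difficulty < diff_max:
        return current_difficulty + 1
    return current_difficulty

# Example: the tracked metric just crossed the threshold
print(next_difficulty(5, {"ppo_1": 0.9}))  # -> 6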
Example
Train PPO on the 8-puzzle:
python -m twisterl.train --config examples/ppo_puzzle8_v1.json
AlphaZero (AZ)
AlphaZero combines Monte Carlo Tree Search (MCTS) with deep neural networks for planning-based learning.
Key Features
Tree Search: Employs MCTS for look-ahead planning
Self-Play: Learns through self-play without human knowledge
Value + Policy Learning: Jointly learns value and policy functions
Configuration
AlphaZero is configured similarly to PPO:
```json
{
  "algorithm_cls": "twisterl.rl.AZ",
  "algorithm": {
    "collecting": {
      "num_cores": 32,
      "num_episodes": 512,
      "num_mcts_searches": 1000,
      "C": 1.41,
      "max_expand_depth": 1,
      "seed": 123
    },
    "training": {
      "num_epochs": 10
    },
    "optimizer": {
      "lr": 0.0003
    }
  }
}
```
Parameters
Collecting Parameters:
num_mcts_searches: Number of MCTS simulations per move
C: Exploration constant (UCB formula)
max_expand_depth: Maximum tree expansion depth
seed: Random seed for reproducibility
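The exploration constant C appears in the selection rule used during tree descent. Below is a generic PUCT-style scoring function (the AlphaZero variant of UCB); treat it as a sketch of the idea rather than TwisteRL's MCTS code, which may differ in detail.

```python
import math

def puct_score(child_value, child_prior, child_visits, parent_visits, C=1.41):
    """PUCT score: exploit the current value estimate while exploring children
    with high prior probability and few visits."""
    exploration = C * child_prior * math.sqrt(parent_visits) / (1 + child_visits)
    return child_value + exploration

def select_child(children, parent_visits, C=1.41):
    """Pick the child with the highest PUCT score during tree descent.
    `children` is assumed to be a list of dicts with value/prior/visits keys."""
    return max(
        children,
        key=lambda ch: puct_score(ch["value"], ch["prior"], ch["visits"], parent_visits, C),
    )
```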
Algorithm Comparison
| Algorithm | Type | Sample Eff. | Compute Cost | Use Case |
|---|---|---|---|---|
| PPO | On-Policy | Medium | Low | General RL |
| AlphaZero | Planning | High | High | Perfect Info |
When to Use Each Algorithm
Use PPO when:
You want a general-purpose, fast-to-train algorithm
Computational resources are limited
You need stable, reliable training
Use AlphaZero when:
The environment has perfect information (you know the transition model)
You can afford higher computational cost for MCTS
Look-ahead planning is beneficial
Hyperparameter Tuning
General Guidelines
Start with defaults: See src/twisterl/defaults.py for sensible default parameters
Adjust learning rate first: This usually has the biggest impact
Monitor training curves: Use TensorBoard to track progress (logs saved to runs/ by default)
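For example, once a run has written logs to the default directory, the standard TensorBoard invocation is:
tensorboard --logdir runs/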
PPO Tuning Tips
Increase num_epochs if training is stable but slow
Decrease clip_ratio if policy updates are too aggressive
Increase ent_coef if the policy becomes too deterministic too quickly
Adjust gamma and lambda based on episode length
AlphaZero Tuning Tips
Increase num_mcts_searches for better play quality (but slower training)
Adjust C to balance exploration vs exploitation in MCTS