Algorithms
==========

TwisteRL currently supports two reinforcement learning algorithms.

PPO (Proximal Policy Optimization)
----------------------------------

PPO is a policy gradient method that strikes a balance between ease of implementation, sample complexity, and wall-clock time.

Key Features
~~~~~~~~~~~~

- **Stable Training**: Uses a clipped objective to prevent destructively large policy updates
- **Sample Efficient**: Reuses collected experience across multiple training epochs
- **General Purpose**: Works well across a wide variety of environments

Configuration
~~~~~~~~~~~~~

PPO is configured through a JSON config file. Here's an example with the actual parameter names:

.. code-block:: json

    {
        "algorithm_cls": "twisterl.rl.PPO",
        "algorithm": {
            "collecting": {
                "num_cores": 32,
                "num_episodes": 1024,
                "lambda": 0.995,
                "gamma": 0.995
            },
            "training": {
                "num_epochs": 10,
                "vf_coef": 0.8,
                "ent_coef": 0.01,
                "clip_ratio": 0.1,
                "normalize_advantage": true
            },
            "optimizer": {
                "lr": 0.00015
            },
            "learning": {
                "diff_threshold": 0.85,
                "diff_max": 32,
                "diff_metric": "ppo_1"
            }
        }
    }

Parameters
~~~~~~~~~~

**Collecting Parameters:**

- **num_cores**: Number of parallel workers for data collection
- **num_episodes**: Number of episodes to collect per iteration
- **lambda**: GAE lambda parameter for advantage estimation
- **gamma**: Discount factor

**Training Parameters:**

- **num_epochs**: Number of training epochs per update
- **vf_coef**: Coefficient for the value function loss
- **ent_coef**: Coefficient for the entropy bonus
- **clip_ratio**: PPO clipping parameter (epsilon)
- **normalize_advantage**: Whether to normalize advantages

**Optimizer Parameters:**

- **lr**: Learning rate for the Adam optimizer

**Learning Parameters:**

- **diff_threshold**: Success-rate threshold for increasing difficulty
- **diff_max**: Maximum difficulty level
- **diff_metric**: Which evaluation metric to use for difficulty progression

Example
~~~~~~~

Train PPO on the 8-puzzle:

.. code-block:: bash

    python -m twisterl.train --config examples/ppo_puzzle8_v1.json

AlphaZero (AZ)
--------------

AlphaZero combines Monte Carlo Tree Search (MCTS) with deep neural networks for planning-based learning.

Key Features
~~~~~~~~~~~~

- **Tree Search**: Employs MCTS for look-ahead planning
- **Self-Play**: Learns through self-play without human knowledge
- **Value + Policy Learning**: Jointly learns value and policy functions

Configuration
~~~~~~~~~~~~~

AlphaZero is configured similarly to PPO:

.. code-block:: json

    {
        "algorithm_cls": "twisterl.rl.AZ",
        "algorithm": {
            "collecting": {
                "num_cores": 32,
                "num_episodes": 512,
                "num_mcts_searches": 1000,
                "C": 1.41,
                "max_expand_depth": 1,
                "seed": 123
            },
            "training": {
                "num_epochs": 10
            },
            "optimizer": {
                "lr": 0.0003
            }
        }
    }

Parameters
~~~~~~~~~~

**Collecting Parameters:**

- **num_mcts_searches**: Number of MCTS simulations per move
- **C**: Exploration constant in the UCB selection formula (illustrated in the sketch below)
- **max_expand_depth**: Maximum tree expansion depth
- **seed**: Random seed for reproducibility

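The exploration constant ``C`` appears in the UCB/PUCT child-selection rule that MCTS applies while descending the tree. The following is a minimal sketch of a PUCT-style score in the spirit of AlphaZero, meant only to illustrate what ``C`` controls; the function and variable names are hypothetical and this is not necessarily TwisteRL's exact formula.

.. code-block:: python

    import math

    def puct_score(child_value_sum, child_visits, parent_visits, prior, C=1.41):
        # Exploitation: mean value backed up through this child so far.
        q = child_value_sum / child_visits if child_visits > 0 else 0.0
        # Exploration: scaled by the policy prior and the constant C;
        # grows with the parent's visit count and shrinks as this child is revisited.
        u = C * prior * math.sqrt(parent_visits) / (1 + child_visits)
        return q + u

    # During selection, MCTS descends into the child with the highest score, e.g.:
    #   best = max(children, key=lambda ch: puct_score(ch.value_sum, ch.visits,
    #                                                  node.visits, ch.prior))

A larger ``C`` weights the exploration term more heavily, so less-visited children are tried more often; a smaller ``C`` concentrates the search on the children that currently look best.
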
Algorithm Comparison
--------------------

+------------+-----------+--------------+--------------+--------------+
| Algorithm  | Type      | Sample Eff.  | Compute Cost | Use Case     |
+============+===========+==============+==============+==============+
| PPO        | On-Policy | Medium       | Low          | General RL   |
+------------+-----------+--------------+--------------+--------------+
| AlphaZero  | Planning  | High         | High         | Perfect Info |
+------------+-----------+--------------+--------------+--------------+

When to Use Each Algorithm
--------------------------

**Use PPO when:**

- You want a general-purpose, fast-to-train algorithm
- Computational resources are limited
- You need stable, reliable training

**Use AlphaZero when:**

- The environment has perfect information (you know the transition model)
- You can afford the higher computational cost of MCTS
- Look-ahead planning is beneficial

Hyperparameter Tuning
---------------------

General Guidelines
~~~~~~~~~~~~~~~~~~

1. **Start with defaults**: See ``src/twisterl/defaults.py`` for sensible default parameters
2. **Adjust the learning rate first**: It usually has the biggest impact (a config-sweep sketch appears at the end of this page)
3. **Monitor training curves**: Use TensorBoard to track progress (logs are saved to ``runs/`` by default)

PPO Tuning Tips
~~~~~~~~~~~~~~~

- Increase ``num_epochs`` if training is stable but slow
- Decrease ``clip_ratio`` if policy updates are too aggressive
- Increase ``ent_coef`` if the policy becomes too deterministic too quickly
- Adjust ``gamma`` and ``lambda`` based on episode length

AlphaZero Tuning Tips
~~~~~~~~~~~~~~~~~~~~~

- Increase ``num_mcts_searches`` for better play quality (at the cost of slower training)
- Adjust ``C`` to balance exploration and exploitation in MCTS

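A convenient way to act on the guidelines above is to generate config variants from a base JSON file and launch one training run per variant through the documented CLI. The sketch below assumes the PPO example config shipped in ``examples/ppo_puzzle8_v1.json``; the output directory and the sweep values are made up for illustration.

.. code-block:: python

    import json
    import subprocess
    from pathlib import Path

    BASE_CONFIG = Path("examples/ppo_puzzle8_v1.json")  # PPO example config from the repo

    # Hypothetical learning-rate sweep; pick values that fit your compute budget.
    for lr in (3e-4, 1.5e-4, 5e-5):
        config = json.loads(BASE_CONFIG.read_text())
        config["algorithm"]["optimizer"]["lr"] = lr  # key names match the PPO config above

        variant = Path(f"sweep_configs/ppo_puzzle8_lr{lr:g}.json")
        variant.parent.mkdir(parents=True, exist_ok=True)
        variant.write_text(json.dumps(config, indent=2))

        # Launch training exactly as in the PPO example above.
        subprocess.run(["python", "-m", "twisterl.train", "--config", str(variant)],
                       check=True)

The same pattern applies to any of the other parameters listed on this page, such as ``clip_ratio`` or ``num_mcts_searches``; compare the resulting runs with TensorBoard.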