Neural Networks

TwisteRL provides neural network architectures for reinforcement learning policies.

NN Module

Policy Networks

BasicPolicy

The BasicPolicy is the main policy network, used by both the PPO and AlphaZero algorithms. It is an actor-critic architecture in which the embedding and common layers are shared between the policy and value heads.

Configuration:

{
    "policy_cls": "twisterl.nn.BasicPolicy",
    "policy": {
        "embedding_size": 512,
        "common_layers": [256],
        "policy_layers": [],
        "value_layers": []
    }
}

Parameters:

  • embedding_size: Size of the embedding layer

  • common_layers: Hidden layer sizes for shared network

  • policy_layers: Additional layers for policy head (after common layers)

  • value_layers: Additional layers for value head (after common layers)

Architecture (sketched in code below):

  1. Embedding layer: obs_size -> embedding_size (Linear + ReLU)

  2. Common layers: Shared MLP

  3. Policy head: Outputs action logits

  4. Value head: Outputs state value
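
As a rough illustration of this stack, here is a minimal sketch built from plain PyTorch modules (not the exact TwisteRL implementation), using obs_size=9, embedding_size=512, common_layers=[256], and empty policy/value head layers:

import torch
import torch.nn as nn

embedding = nn.Sequential(nn.Linear(9, 512), nn.ReLU())   # obs_size -> embedding_size
common = nn.Sequential(nn.Linear(512, 256), nn.ReLU())    # shared MLP
policy_head = nn.Linear(256, 4)                           # -> action logits
value_head = nn.Linear(256, 1)                            # -> state value

features = common(embedding(torch.randn(32, 9)))
logits, value = policy_head(features), value_head(features)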

Usage:

from twisterl.nn import BasicPolicy

policy = BasicPolicy(
    obs_shape=[9],           # flattened 3x3 puzzle -> 9 features
    num_actions=4,           # 4 possible moves
    embedding_size=512,
    common_layers=(256,),
    policy_layers=(),
    value_layers=(),
    obs_perms=(),            # Observation permutations (twists)
    act_perms=()             # Action permutations (twists)
)

# Forward pass (returns logits, not probabilities)
import torch
obs = torch.randn(32, 9)
logits, values = policy(obs)

# Predict with a NumPy input (returns action probabilities and a value)
import numpy as np
obs_numpy = np.random.randn(9).astype(np.float32)  # example observation; the exact expected shape may differ
action_probs, value = policy.predict(obs_numpy)
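
Because the forward pass returns raw logits, turning them into actions uses standard PyTorch (continuing the example above; this is not a TwisteRL-specific API):

dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()              # one action per batch element, shape (32,)
log_probs = dist.log_prob(actions)   # log-probabilities used by PPO-style updates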

Conv1dPolicy

A variant of BasicPolicy that uses 1D convolutions for the embedding layer. Useful for environments with structured 2D observations.

Parameters:

  • conv_dim: Which dimension to convolve over (0 or 1)
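
A minimal instantiation sketch, assuming Conv1dPolicy lives in twisterl.nn and accepts the same arguments as BasicPolicy plus conv_dim (the actual signature may differ):

from twisterl.nn import Conv1dPolicy

policy = Conv1dPolicy(
    obs_shape=[3, 3],        # structured 2D observation (assumed shape)
    num_actions=4,
    embedding_size=512,
    common_layers=(256,),
    policy_layers=(),
    value_layers=(),
    obs_perms=(),
    act_perms=(),
    conv_dim=0,              # convolve over the first observation dimension
)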

Permutation Support (Twists)

Both policy classes support permutation symmetries (“twists”) for symmetry-aware training:

# Get twists from environment
obs_perms, act_perms = env.twists()

# Create policy with permutation support
policy = BasicPolicy(
    obs_shape=env.obs_shape(),
    num_actions=env.num_actions(),
    obs_perms=obs_perms,
    act_perms=act_perms,
    ...
)

When permutations are provided, the policy can:

  • Apply random permutations during training for data augmentation (sketched below)

  • Handle permutation indices passed during forward pass
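
A self-contained illustration of permutation-based augmentation (assumed semantics only, not TwisteRL's internal implementation):

import numpy as np

# Hypothetical twists for a 3-feature observation and 3 actions
obs_perms = [[2, 1, 0]]   # mirror the observation features
act_perms = [[1, 0, 2]]   # swap the two mirrored actions accordingly

obs = np.array([0.1, 0.5, 0.9], dtype=np.float32)
action = 0

i = np.random.randint(len(obs_perms))   # pick a random twist
obs_aug = obs[obs_perms[i]]             # permuted observation
action_aug = act_perms[i][action]       # remapped action index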

Rust Conversion

Policies can be converted to Rust for fast inference:

# Convert PyTorch policy to Rust
rust_policy = policy.to_rust()

This is used internally during training for fast data collection.

Network Utilities

Key utility functions:

  • make_sequential(in_size, layer_sizes, final_relu=True): Create a sequential MLP (usage sketched below)

  • sequential_to_rust(module): Convert PyTorch Sequential to Rust

  • embeddingbag_to_rust(module, shape, dim): Convert embedding layer to Rust
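
A small usage sketch of make_sequential; the import path is an assumption and may differ:

import torch
from twisterl.nn import make_sequential

# MLP: 9 -> 256 -> 128, with a ReLU after every layer (final_relu=True)
mlp = make_sequential(9, [256, 128], final_relu=True)
out = mlp(torch.randn(4, 9))   # -> shape (4, 128)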

Device Management

Policies can be placed on a specific device, either manually or through the training configuration:

# Move the policy to the GPU and keep its device attribute in sync
policy = policy.to("cuda")
policy.device = "cuda"

# Or use config-based device selection
# (handled automatically by Algorithm class)