Neural Networks

TwisteRL provides neural network architectures for reinforcement learning policies.

NN Module

Policy Networks

BasicPolicy

The BasicPolicy is the main policy network, used by both the PPO and AlphaZero algorithms. It is an actor-critic architecture in which the embedding and common layers are shared between the policy and value heads.

Configuration:

{
    "policy_cls": "twisterl.nn.BasicPolicy",
    "policy": {
        "embedding_size": 512,
        "common_layers": [256],
        "policy_layers": [],
        "value_layers": []
    }
}

Parameters:

  • embedding_size: Size of the embedding layer

  • common_layers: Hidden layer sizes for shared network

  • policy_layers: Additional layers for policy head (after common layers)

  • value_layers: Additional layers for value head (after common layers)

Architecture (sketched in code below):

  1. Embedding layer: obs_size -> embedding_size (Linear + ReLU)

  2. Common layers: Shared MLP

  3. Policy head: Outputs action logits

  4. Value head: Outputs state value
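
As a rough illustration of this stack, here is a minimal sketch built from plain PyTorch modules (not the exact TwisteRL implementation), using obs_size=9, embedding_size=512, common_layers=[256], and empty policy/value head layers:

import torch
import torch.nn as nn

embedding = nn.Sequential(nn.Linear(9, 512), nn.ReLU())   # obs_size -> embedding_size
common = nn.Sequential(nn.Linear(512, 256), nn.ReLU())    # shared MLP
policy_head = nn.Linear(256, 4)                           # -> action logits
value_head = nn.Linear(256, 1)                            # -> state value

features = common(embedding(torch.randn(32, 9)))
logits, value = policy_head(features), value_head(features)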

Usage:

from twisterl.nn import BasicPolicy

policy = BasicPolicy(
    obs_shape=[9],           # flattened 3x3 puzzle -> 9 features
    num_actions=4,           # 4 possible moves
    embedding_size=512,
    common_layers=(256,),
    policy_layers=(),
    value_layers=(),
    obs_perms=(),            # Observation permutations (twists)
    act_perms=()             # Action permutations (twists)
)

# Forward pass (returns logits, not probabilities)
import torch
obs = torch.randn(32, 9)
logits, values = policy(obs)

# Predict with a NumPy input (returns action probabilities and a value)
import numpy as np
obs_numpy = np.random.randn(9).astype(np.float32)  # example observation; the exact expected shape may differ
action_probs, value = policy.predict(obs_numpy)
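
Because the forward pass returns raw logits, turning them into actions uses standard PyTorch (continuing the example above; this is not a TwisteRL-specific API):

dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()              # one action per batch element, shape (32,)
log_probs = dist.log_prob(actions)   # log-probabilities used by PPO-style updates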

Conv1dPolicy

A variant of BasicPolicy that uses 1D convolutions for the embedding layer. Useful for environments with structured 2D observations.

Parameters:

  • conv_dim: Which dimension to convolve over (0 or 1)
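
A minimal instantiation sketch, assuming Conv1dPolicy lives in twisterl.nn and accepts the same arguments as BasicPolicy plus conv_dim (the actual signature may differ):

from twisterl.nn import Conv1dPolicy

policy = Conv1dPolicy(
    obs_shape=[3, 3],        # structured 2D observation (assumed shape)
    num_actions=4,
    embedding_size=512,
    common_layers=(256,),
    policy_layers=(),
    value_layers=(),
    obs_perms=(),
    act_perms=(),
    conv_dim=0,              # convolve over the first observation dimension
)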

Permutation Support (Twists)

Both policy classes support permutation symmetries (“twists”) for symmetry-aware training:

# Get twists from environment
obs_perms, act_perms = env.twists()

# Create policy with permutation support
policy = BasicPolicy(
    obs_shape=env.obs_shape(),
    num_actions=env.num_actions(),
    obs_perms=obs_perms,
    act_perms=act_perms,
    ...
)

When permutations are provided, the policy can:

  • Apply random permutations during training for data augmentation (sketched below)

  • Handle permutation indices passed during forward pass
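
A self-contained illustration of permutation-based augmentation (assumed semantics only, not TwisteRL's internal implementation):

import numpy as np

# Hypothetical twists for a 3-feature observation and 3 actions
obs_perms = [[2, 1, 0]]   # mirror the observation features
act_perms = [[1, 0, 2]]   # swap the two mirrored actions accordingly

obs = np.array([0.1, 0.5, 0.9], dtype=np.float32)
action = 0

i = np.random.randint(len(obs_perms))   # pick a random twist
obs_aug = obs[obs_perms[i]]             # permuted observation
action_aug = act_perms[i][action]       # remapped action index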

Rust Conversion

Policies can be converted to Rust for fast inference:

# Convert PyTorch policy to Rust
rust_policy = policy.to_rust()

This is used internally during training for fast data collection.

Network Utilities

Key utility functions:

  • make_sequential(in_size, layer_sizes, final_relu=True): Create a sequential MLP (usage sketched below)

  • sequential_to_rust(module): Convert PyTorch Sequential to Rust

  • embeddingbag_to_rust(module, shape, dim): Convert embedding layer to Rust
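
A small usage sketch of make_sequential; the import path is an assumption and may differ:

import torch
from twisterl.nn import make_sequential

# MLP: 9 -> 256 -> 128, with a ReLU after every layer (final_relu=True)
mlp = make_sequential(9, [256, 128], final_relu=True)
out = mlp(torch.randn(4, 9))   # -> shape (4, 128)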

Device Management

Policies can be placed on a specific device, either manually or through the training configuration:

# Move the policy to the GPU and keep its device attribute in sync
policy = policy.to("cuda")
policy.device = "cuda"

# Or use config-based device selection
# (handled automatically by Algorithm class)