Neural Networks
===============

TwisteRL provides neural network architectures for reinforcement learning policies.

NN Module
---------

.. automodule:: twisterl.nn
   :members:
   :undoc-members:
   :show-inheritance:

Policy Networks
---------------

BasicPolicy
~~~~~~~~~~~

.. autoclass:: twisterl.nn.BasicPolicy
   :members:
   :undoc-members:
   :show-inheritance:

The ``BasicPolicy`` is the main policy network used for both PPO and AlphaZero. It is an actor-critic network in which the embedding and common layers are shared between the policy and value heads.

**Configuration:**

.. code-block:: json

   {
     "policy_cls": "twisterl.nn.BasicPolicy",
     "policy": {
       "embedding_size": 512,
       "common_layers": [256],
       "policy_layers": [],
       "value_layers": []
     }
   }

**Parameters:**

- **embedding_size**: Size of the embedding layer
- **common_layers**: Hidden layer sizes for the shared network
- **policy_layers**: Additional layers for the policy head (after the common layers)
- **value_layers**: Additional layers for the value head (after the common layers)

**Architecture:**

1. Embedding layer: ``obs_size -> embedding_size`` (Linear + ReLU)
2. Common layers: shared MLP
3. Policy head: outputs action logits
4. Value head: outputs the state value

**Usage:**

.. code-block:: python

   import numpy as np
   import torch

   from twisterl.nn import BasicPolicy

   policy = BasicPolicy(
       obs_shape=[9],      # 3x3 puzzle = 9 observations
       num_actions=4,      # 4 possible moves
       embedding_size=512,
       common_layers=(256,),
       policy_layers=(),
       value_layers=(),
       obs_perms=(),       # Observation permutations (twists)
       act_perms=()        # Action permutations (twists)
   )

   # Forward pass (returns logits, not probabilities)
   obs = torch.randn(32, 9)
   logits, values = policy(obs)

   # Predict with numpy input (returns action probabilities and value)
   obs_numpy = np.zeros(9, dtype=np.float32)  # a single flattened observation
   action_probs, value = policy.predict(obs_numpy)

Conv1dPolicy
~~~~~~~~~~~~

.. autoclass:: twisterl.nn.Conv1dPolicy
   :members:
   :undoc-members:
   :show-inheritance:

A variant of ``BasicPolicy`` that uses 1D convolutions for the embedding layer. It is useful for environments whose observations have a structured 2D shape.

**Parameters:**

- **conv_dim**: Which dimension to convolve over (0 or 1)

Permutation Support (Twists)
----------------------------

Both policy classes support permutation symmetries ("twists") for symmetry-aware training:

.. code-block:: python

   # Get twists from the environment
   obs_perms, act_perms = env.twists()

   # Create a policy with permutation support
   policy = BasicPolicy(
       obs_shape=env.obs_shape(),
       num_actions=env.num_actions(),
       obs_perms=obs_perms,
       act_perms=act_perms,
       ...
   )

When permutations are provided, the policy can:

- Apply random permutations during training for data augmentation
- Handle permutation indices passed to the forward pass

Rust Conversion
---------------

Policies can be converted to Rust for fast inference:

.. code-block:: python

   # Convert the PyTorch policy to Rust
   rust_policy = policy.to_rust()

This is used internally during training for fast data collection.

Network Utilities
-----------------

.. automodule:: twisterl.nn.utils
   :members:
   :undoc-members:
   :show-inheritance:

Key utility functions:

- ``make_sequential(in_size, layer_sizes, final_relu=True)``: Create a sequential MLP
- ``sequential_to_rust(module)``: Convert a PyTorch ``Sequential`` to Rust
- ``embeddingbag_to_rust(module, shape, dim)``: Convert an embedding layer to Rust

Device Management
-----------------

Policies automatically handle device placement:

.. code-block:: python

   # Move to GPU if available
   policy = policy.to("cuda")
   policy.device = "cuda"

   # Or use config-based device selection
   # (handled automatically by the Algorithm class)
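
Putting It Together
-------------------

The snippets above combine into a short end-to-end setup. The sketch below is illustrative rather than canonical: it hard-codes the 3x3 puzzle shapes used earlier and leaves the permutation tuples empty so that it runs without an environment; in a real run the shapes would come from ``env.obs_shape()`` and ``env.num_actions()`` and the permutations from ``env.twists()``.

.. code-block:: python

   import torch

   from twisterl.nn import BasicPolicy

   # In a real run these come from the environment:
   #   obs_perms, act_perms = env.twists()
   # Empty tuples are used here so the example is self-contained.
   obs_perms, act_perms = (), ()

   policy = BasicPolicy(
       obs_shape=[9],          # 3x3 puzzle flattened to 9 observations
       num_actions=4,
       embedding_size=512,
       common_layers=(256,),
       policy_layers=(),
       value_layers=(),
       obs_perms=obs_perms,
       act_perms=act_perms,
   )

   # Convert to Rust for fast inference during data collection
   rust_policy = policy.to_rust()

   # Move the PyTorch policy to the GPU for training when one is available
   # (see Device Management above)
   if torch.cuda.is_available():
       policy = policy.to("cuda")
       policy.device = "cuda"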
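
If a custom MLP stack is needed outside ``BasicPolicy``, ``make_sequential`` from ``twisterl.nn.utils`` can be called directly. A minimal sketch, assuming ``layer_sizes`` accepts a plain Python list of hidden sizes as the signature above suggests:

.. code-block:: python

   from twisterl.nn.utils import make_sequential

   # A 512 -> 256 -> 128 MLP; final_relu=False is assumed to leave the last
   # layer without an activation, as a value head typically requires.
   head = make_sequential(512, [256, 128], final_relu=False)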