Letter World: The Counting Reward Machine

Overview

The Counting Reward Machine is the core component that defines what constitutes success in a task. Unlike traditional reward functions that depend only on the current state, CRMs can provide rewards based on the history of events, making them powerful for specifying complex behaviors.

The CRM Concept

A Counting Reward Machine consists of:
  • States: Different phases of the task
  • Counters: Variables that track occurrences of specific events
  • Transitions: Rules for moving between states based on events and counter values
  • Rewards: Values assigned during transitions
For the Letter World environment, we define a CRM that rewards the agent for visiting letters in a specific sequence, with counters tracking how many times certain letters have been seen.

Implementation Details

The Letter World CRM is implemented as a subclass of the CountingRewardMachine base class:
The examples are not distributed in the PyPI package currently. Please see the Installation Guide for information on how to setup a development build.
from crm.automaton import CountingRewardMachine
from examples.letter.core.label import Symbol


class LetterWorldCountingRewardMachine(CountingRewardMachine):
    """Counting reward machine for the Letter World environment."""

    def __init__(self) -> None:
        """Initialise the counting reward machine."""
        super().__init__(env_prop_enum=Symbol)

Initial Configuration

The CRM defines its initial state and counter values:
@property
def u_0(self) -> int:
    """Return the initial state of the machine."""
    return 0

@property
def c_0(self) -> tuple[int, ...]:
    """Return the initial counter configuration of the machine."""
    return (0,)

@property
def encoded_configuration_size(self) -> int:
    """Return the size of the encoded counter configuration."""
    return 2
Here:
  • u_0: The initial state is 0
  • c_0: The initial counter configuration is (0,), meaning we start with a single counter set to 0
  • encoded_configuration_size: Specifies how many bits are needed to encode the counter values

State Transition Function

The state transition function defines how the CRM moves between states based on events and counter conditions:
def _get_state_transition_function(self) -> dict:
    """Return the state transition function."""
    return {
        0: {
            "A / (-)": 0,
            "B / (-)": 1,
            "C / (-)": 0,
            "/ (-)": 0,
        },
        1: {
            "A / (-)": 1,
            "B / (-)": 1,
            "C / (NZ)": 1,
            "C / (Z)": -1,
            "/ (-)": 1,
        },
    }
The notation used in this function:
  • A / (-): When event A occurs, with any counter value
  • C / (NZ): When event C occurs and the counter is Non-Zero
  • C / (Z): When event C occurs and the counter is Zero
  • / (-): Default transition when no event occurs
The state transitions show that:
  1. From state 0, seeing B moves the CRM to state 1
  2. In state 1, seeing C with a zero counter (C / (Z)) leads to a terminal state (-1)

Counter Transition Function

The counter transition function defines how counter values change during transitions:
def _get_counter_transition_function(self) -> dict:
    """Return the counter transition function."""
    return {
        0: {
            "A / (-)": (1,),
            "B / (-)": (0,),
            "C / (-)": (0,),
            "/ (-)": (0,),
        },
        1: {
            "A / (-)": (0,),
            "B / (-)": (0,),
            "C / (NZ)": (-1,),
            "C / (Z)": (0,),
            "/ (-)": (0,),
        },
    }
This function shows:
  1. In state 0, seeing A increments the counter by 1
  2. In state 1, seeing C with a non-zero counter decrements the counter by 1

Reward Transition Function

The reward transition function defines what reward is given during each transition:
def _get_reward_transition_function(self) -> dict:
    """Return the reward transition function."""
    return {
        0: {
            "A / (-)": -0.1,
            "B / (-)": -0.1,
            "C / (-)": -0.1,
            "/ (-)": -0.1,
        },
        1: {
            "A / (-)": -0.1,
            "B / (-)": -0.1,
            "C / (NZ)": -0.1,
            "C / (Z)": 1,
            "/ (-)": -0.1,
        },
    }
The rewards show:
  1. Small negative rewards (-0.1) for most transitions (encouraging efficiency)
  2. A positive reward (1.0) when C is seen with a zero counter in state 1
  3. A larger negative reward (-1.0) for doing nothing in state 1

Understanding the Reward Structure

The reward structure in this CRM is carefully designed to guide the agent toward the desired behavior:
  • Small negative step costs (-0.1): Most actions have a small negative reward, creating a time pressure that encourages the agent to complete the task efficiently rather than wandering aimlessly.
  • Goal reward (+1.0): A significant positive reward is given only when the agent successfully completes the task—visiting C in state 1 with a counter value of 0. This is the only way to receive positive reward.
The reward design creates several important learning signals:
  1. Delayed gratification: The agent must endure short-term “costs” (negative rewards) to achieve a larger future reward.
  2. Temporal credit assignment: The agent must learn which sequence of actions leads to the positive reward, even though the reward comes only at the end.
  3. Balance requirement: The reward structure enforces the need to visit exactly as many C’s as A’s, as any imbalance will prevent reaching the terminal state with a positive reward.
This reward structure makes the CRM task challenging for reinforcement learning algorithms, as they must learn to plan far into the future and maintain a count of past events to maximize total reward.

Counter Configuration

The CRM also defines the range of possible counter values:
def _get_possible_counter_configurations(self) -> list[tuple[int]]:
    """Return the possible counter configurations."""
    return [(i,) for i in range(6)]

def sample_counter_configurations(self) -> list[tuple[int]]:
    """Return a sample counter configuration."""
    return self._get_possible_counter_configurations()
This limits the counter to values from 0 to 5, preventing unbounded counting.

The Task Defined by the CRM

Based on the implementation, the Letter World CRM defines a task where:
  1. The agent starts in state 0 with counter set to 0
  2. Seeing letter A in state 0 increments the counter
  3. Seeing letter B transitions to state 1
  4. In state 1, seeing letter C with a counter value of 0 gives a reward of 1 and terminates
  5. In state 1, seeing letter C with a non-zero counter decrements the counter
This creates a task where the agent must:
  1. Visit letter A some number of times to increment the counter
  2. Visit letter B to transition to state 1
  3. Visit letter C exactly the same number of times as A was visited
  4. When the counter reaches 0 by visiting C, the agent receives a reward and the task is complete
The CRM essentially defines a “balanced parentheses” task: each A must be balanced by a corresponding C, with B separating the two phases.

Visualization of the CRM

Here’s a simplified representation of the Letter World CRM:
Alternatively, here’s a simpler text-based representation:
State 0:
  - On A: Stay in state 0, increment counter (+1)
  - On B: Move to state 1
  - On C: Stay in state 0

State 1:
  - On A: Stay in state 1
  - On B: Stay in state 1
  - On C (counter = 0): Move to End state, give reward
  - On C (counter > 0): Stay in state 1, decrement counter (-1)

Using the CRM

The CRM is used by integrating it with the ground environment and labelling function:
# Initialize components
ground_env = LetterWorld()
lf = LetterWorldLabellingFunction()
crm = LetterWorldCountingRewardMachine()

# Create the cross-product environment
env = LetterWorldCrossProduct(
    ground_env=ground_env,
    crm=crm,
    lf=lf,
    max_steps=500,
)

# The cross-product handles the interaction between
# the ground environment and the CRM
obs, _ = env.reset()
action = env.action_space.sample()
next_obs, reward, terminated, truncated, info = env.step(action)

Key Points

  • The CRM defines rewards based on sequences of events and counter values
  • It enables specification of complex tasks that depend on history
  • For Letter World, it defines a “balanced parentheses” task
  • The agent must visit A n times, then B, then C n times
  • The CRM has states, counters, transitions, and rewards
  • Terminal states (-1) indicate task completion

Next Steps