Letter World Cross-Product Example

This example demonstrates how the cross-product MDP behaves like a standard Gymnasium environment while adding the power of Counting Reward Machines.

The Letter World Environment

The Letter World is a simple grid environment where an agent navigates to find specific letters:
  • Letter ‘A’ has a 50% chance of turning into letter ‘B’ when visited
  • Letter ‘C’ gives a reward when visited after seeing letter ‘B’
  • The agent must learn to visit ‘A’, hope it turns into ‘B’, and then visit ‘C’
Here’s what the environment looks like:
+-------------+
|. . . . . . .|
|A . x . . C .|
|. . . . . . .|
+-------------+
Where:
  • A represents letter ‘A’ (or ‘B’ after transformation)
  • C represents letter ‘C’
  • x represents the agent

Components

To create our cross-product environment, we need several components:
  1. Ground Environment: The basic grid world (LetterWorld)
  2. Labelling Function: Maps transitions to symbols (LetterWorldLabellingFunction)
  3. Counting Reward Machine: Defines rewards based on symbol history (LetterWorldCountingRewardMachine)
  4. Cross-Product: Combines all the above (LetterWorldCrossProduct)
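Of these, the labelling function is the glue between the ground environment and the machine: it watches ground transitions and emits the symbols that the reward machine reacts to. The standalone sketch below illustrates the idea only; the actual LetterWorldLabellingFunction lives in examples/letter/core/label.py, and its base class, method signature, and position encoding may differ:
from enum import IntEnum

class Symbol(IntEnum):
    A = 0
    B = 1
    C = 2

# Illustrative cell coordinates read off the grid drawing above.
A_CELL = (1, 0)
C_CELL = (1, 5)

def label_transition(obs, action, next_obs) -> set[Symbol]:
    """Map a ground transition to the symbols it triggers (illustrative only).

    Assumes the ground observation is (symbol_seen, agent_row, agent_col) and
    that symbol_seen flags whether 'A' has already transformed into 'B'.
    """
    symbol_seen, row, col = next_obs
    symbols: set[Symbol] = set()
    if (row, col) == A_CELL:
        symbols.add(Symbol.B if symbol_seen else Symbol.A)
    elif (row, col) == C_CELL:
        symbols.add(Symbol.C)
    return symbols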

Setting Up the Environment

The examples are not currently distributed in the PyPI package. Please see the Installation Guide for information on how to set up a development build.
First, let’s import the necessary components and create our environment:
from examples.letter.core.crossproduct import LetterWorldCrossProduct
from examples.letter.core.ground import LetterWorld
from examples.letter.core.label import LetterWorldLabellingFunction
from examples.letter.core.machine import LetterWorldCountingRewardMachine

# Create the ground environment
ground_env = LetterWorld()

# Create labelling function and counting reward machine
lf = LetterWorldLabellingFunction()
crm = LetterWorldCountingRewardMachine()

# Create the cross-product environment
cross_product = LetterWorldCrossProduct(
    ground_env=ground_env,
    crm=crm,
    lf=lf,
    max_steps=100,  # Set maximum number of steps
)
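Because the cross-product is itself a Gymnasium environment, you can inspect its combined observation and action spaces in the usual way (their exact shapes depend on the ground environment and the machine):
print(cross_product.observation_space)
print(cross_product.action_space)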

Using the Environment Like a Standard Gymnasium Environment

The cross-product environment works just like any other Gymnasium environment:
# Reset the environment
obs, info = cross_product.reset(seed=42)
print(f"Initial observation: {obs}")

# Sample a random action
action = cross_product.action_space.sample()

# Take a step in the environment
next_obs, reward, terminated, truncated, info = cross_product.step(action)
print(f"Action: {action}")
print(f"Observation: {next_obs}")
print(f"Reward: {reward}")
Output:
Initial observation: [0 1 3 0 0]
Action: 3
Observation: [0 2 3 0 0]
Reward: -0.1
The observation is structured as:
  • First part: Ground observation (symbol_seen, agent_row, agent_col)
  • Last part: Machine state and counter values
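If you want to work with the two parts separately, you can slice the observation. The indices below assume the layout just described, three ground values followed by the machine state and a single counter:
import numpy as np

obs = np.array([0, 1, 3, 0, 0])  # the initial observation printed above
ground_obs = obs[:3]             # (symbol_seen, agent_row, agent_col)
machine_state = obs[3]           # current state of the counting reward machine
counters = obs[4:]               # counter values (a single counter here)

print(ground_obs, machine_state, counters)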

Running an Episode

Let’s run a full episode with the cross-product environment:
# Reset and run an episode
obs, info = cross_product.reset(seed=0)
total_reward = 0
step_count = 0

print("Initial environment state:")
ground_env.render()

# Run for several steps
for _ in range(10):
    action = cross_product.action_space.sample()
    next_obs, reward, terminated, truncated, info = cross_product.step(action)
    
    total_reward += reward
    step_count += 1
    
    print(f"\nStep {step_count}:")
    print(f"  Action: {action}")
    print(f"  Observation: {next_obs}")
    print(f"  Reward: {reward}")
    
    # Render the environment
    ground_env.render()
    
    if terminated or truncated:
        print(f"Episode ended after {step_count} steps")
        break
Sample output:
Initial environment state:
+-------------+
|. . . . . . .|
|A . x . . C .|
|. . . . . . .|
+-------------+

Step 1:
  Action: 0
  Observation: [0 1 4 0 0]
  Reward: -0.1

+-------------+
|. . . . . . .|
|A . . x . C .|
|. . . . . . .|
+-------------+

Step 2:
  Action: 3
  Observation: [0 2 4 0 0]
  Reward: -0.1

+-------------+
|. . . . . . .|
|A . . . . C .|
|. . . x . . .|
+-------------+
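To run an episode through to completion instead of a fixed number of steps, the same loop can be driven by the termination flags. This is a minimal variation of the code above; with max_steps=100, truncation guarantees the loop ends:
obs, info = cross_product.reset(seed=0)
total_reward = 0.0
terminated = truncated = False

while not (terminated or truncated):
    action = cross_product.action_space.sample()
    obs, reward, terminated, truncated, info = cross_product.step(action)
    total_reward += reward

print(f"Episode return: {total_reward}")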

Using a Specific Action Sequence

You can also execute a specific sequence of actions:
# Reset the environment
obs, info = cross_product.reset(seed=0)

# Define a specific action sequence to test
# (0=RIGHT, 1=LEFT, 2=UP, 3=DOWN)
actions = [1, 1, 1, 2]  # Move to letter A

for i, action in enumerate(actions):
    next_obs, reward, terminated, truncated, info = cross_product.step(action)
    print(f"\nStep {i+1} with action {action}:")
    print(f"  Observation: {next_obs}")
    print(f"  Reward: {reward}")
    
    ground_env.render()
Sample output for this sequence:
Step 1 with action 1:
  Observation: [0 1 2 0 0]
  Reward: -0.1

+-------------+
|. . . . . . .|
|A x . . . C .|
|. . . . . . .|
+-------------+

Step 2 with action 1:
  Observation: [0 1 1 0 0]
  Reward: -0.1

+-------------+
|. . . . . . .|
|x A . . . C .|
|. . . . . . .|
+-------------+

Step 3 with action 1:
  Observation: [0 1 0 0 0]
  Reward: -0.1

+-------------+
|x . . . . . .|
|A . . . . C .|
|. . . . . . .|
+-------------+

Step 4 with action 2:
  Observation: [0 0 0 0 0]
  Reward: -0.1

+-------------+
|x . . . . . .|
|A . . . . C .|
|. . . . . . .|
+-------------+
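If you find yourself replaying many hand-crafted sequences, a small helper keeps the boilerplate out of the way. This is a convenience function written for this tutorial, not part of the library:
def run_action_sequence(env, actions, seed=0):
    """Reset the environment and execute a fixed action sequence.

    Returns a list of (observation, reward) pairs, stopping early if the
    episode terminates or is truncated.
    """
    obs, _ = env.reset(seed=seed)
    trajectory = []
    for action in actions:
        obs, reward, terminated, truncated, _ = env.step(action)
        trajectory.append((obs, reward))
        if terminated or truncated:
            break
    return trajectory

# For example, replay the sequence used above:
trajectory = run_action_sequence(cross_product, [1, 1, 1, 2])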

What Makes It Special?

The cross-product environment extends a standard Gymnasium environment with:
  1. Symbol Tracking: It tracks which symbols have been seen
  2. Counter Values: It maintains counters as defined by the reward machine
  3. State Memory: The reward can depend on the history of previously seen symbols
  4. Reward Shaping: Complex reward signals based on achieving specific goals
For example, in our Letter World:
  • Visiting letter ‘A’ increments a counter
  • Visiting letter ‘B’ resets the counter but changes the machine state
  • Visiting letter ‘C’ after ‘B’ with a zero counter gives a large positive reward
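Because the machine state and counter are exposed in the last two values of the cross-product observation (as described earlier), you can watch them change during a rollout. The snippet below is a small sketch that assumes that layout:
obs, _ = cross_product.reset(seed=0)
prev_state, prev_counter = obs[3], obs[4]

for _ in range(200):
    obs, reward, terminated, truncated, _ = cross_product.step(
        cross_product.action_space.sample()
    )
    state, counter = obs[3], obs[4]
    if (state, counter) != (prev_state, prev_counter):
        print(f"Machine state {prev_state} -> {state}, counter {prev_counter} -> {counter}")
        prev_state, prev_counter = state, counter
    if terminated or truncated:
        break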

The Counting Reward Machine

Let’s look at how the counting reward machine is structured:
def _get_state_transition_function(self) -> dict:
    """Return the state transition function."""
    return {
        0: {
            "A / (-)": 0,
            "B / (-)": 1,
            "C / (-)": 0,
            "/ (-)": 0,
        },
        1: {
            "A / (-)": 1,
            "B / (-)": 1,
            "C / (NZ)": 1,
            "C / (Z)": -1,
            "/ (-)": 1,
        },
    }
This defines how the machine state changes based on observed symbols and counter conditions. The notation:
  • “A / (-)” means “observe symbol A with any counter value”
  • “C / (Z)” means “observe symbol C with zero counter value”
  • “C / (NZ)” means “observe symbol C with non-zero counter value”
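The machine also defines how rewards are emitted, using the same notation. The sketch below is purely illustrative and not the library's actual code (the real definitions live in examples/letter/core/machine.py, and the method name here is an assumption); it simply mirrors the behaviour seen above: a -0.1 penalty per step and a large positive reward on "C / (Z)" in state 1:
def _get_reward_transition_function(self) -> dict:
    """Illustrative reward transition function (not the library's actual code)."""
    return {
        0: {
            "A / (-)": -0.1,
            "B / (-)": -0.1,
            "C / (-)": -0.1,
            "/ (-)": -0.1,
        },
        1: {
            "A / (-)": -0.1,
            "B / (-)": -0.1,
            "C / (NZ)": -0.1,
            "C / (Z)": 1.0,  # large positive reward for completing the task
            "/ (-)": -0.1,
        },
    }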

Conclusion

The cross-product environment combines the simplicity of standard Gymnasium environments with the power of Counting Reward Machines. This allows you to:
  1. Use it with any RL algorithm designed for Gymnasium environments
  2. Define complex reward structures based on symbol history
  3. Track progress toward multi-step goals
  4. Shape rewards to guide exploration and learning
  5. Benefit from the sample efficiency of counterfactual experiences
This example demonstrates that using Counting Reward Machines doesn’t require changing your existing RL algorithms; it simply gives you more expressive power in defining rewards!

Next Steps