Quick Start Guide

This guide will help you get up and running with Reward Machines (RMs) and Counting Reward Machines (CRMs) in just a few minutes.

Basic Example

We’ll use the Letter World environment, where an agent must visit letters (specific goal locations) in a specific order.
from examples.introduction.core.ground import LetterWorld
from examples.introduction.core.label import LetterWorldLabellingFunction
from examples.introduction.core.machine import LetterWorldRewardMachine
from examples.introduction.core.crossproduct import LetterWorldCrossProduct

# 1. Create the ground environment
ground_env = LetterWorld()

# 2. Create the labelling function
lf = LetterWorldLabellingFunction()

# 3. Create the Reward Machine
rm = LetterWorldRewardMachine()

# 4. Create the cross-product MDP
env = LetterWorldCrossProduct(
    ground_env=ground_env,
    machine=rm,
    lf=lf,
    max_steps=100,
)

# Use like a standard Gymnasium environment
obs, _ = env.reset()
action = env.action_space.sample()
next_obs, reward, terminated, truncated, info = env.step(action)
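
Because the cross-product exposes the usual reset/step interface, a full episode can be rolled out with a generic helper. The sketch below is illustrative only: `run_random_episode` and its `max_steps` safety cap are hypothetical names, not part of the library, and the function only assumes that `env` follows the Gymnasium calling convention shown above.

```python
def run_random_episode(env, max_steps=1_000):
    """Roll out one episode with uniformly random actions.

    Hypothetical helper: works with any object that follows the
    Gymnasium reset()/step() convention. Returns the total reward.
    """
    obs, info = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):  # safety cap in case the env never ends
        action = env.action_space.sample()
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward
```

Calling `run_random_episode(env)` on the environment built above should return the reward collected by a random policy before the episode terminates or is truncated.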

What’s Happening?

  1. Ground Environment (LetterWorld): a simple grid world that subclasses gymnasium.Env.
  2. Labelling Function (LetterWorldLabellingFunction): maps low-level environment transitions to high-level events (propositions).
  3. Reward Machine (RM) (LetterWorldRewardMachine): specifies rewards based on event sequences.
  4. Cross-Product MDP (LetterWorldCrossProduct): combines the environment, labelling function, and RM into a single Gymnasium-compatible environment.
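
To make the roles of the labelling function and the Reward Machine concrete, here is a toy, library-independent sketch of the task "visit A, then visit B". Everything in it (the `LETTERS` grid positions, the state names `u0`/`u1`/`done`, and the helpers `label` and `rm_step`) is made up for illustration and is not the library's API.

```python
# Toy reward machine for "visit A, then visit B" (not the library's API).
# Maps (machine state, event) -> (next machine state, reward).
TRANSITIONS = {
    ("u0", "A"): ("u1", 0.0),
    ("u1", "B"): ("done", 1.0),
}

# Hypothetical grid cells that hold a letter.
LETTERS = {(0, 2): "A", (3, 1): "B"}


def label(position):
    """Toy labelling function: the set of events that hold at a grid cell."""
    letter = LETTERS.get(position)
    return {letter} if letter else set()


def rm_step(state, events):
    """Advance on the first matching event; otherwise stay put with zero reward."""
    for event in sorted(events):
        if (state, event) in TRANSITIONS:
            return TRANSITIONS[(state, event)]
    return state, 0.0


# Feed a short path of agent positions through the labeller and the machine.
state = "u0"
trace = []
for position in [(1, 1), (0, 2), (2, 2), (3, 1)]:
    state, reward = rm_step(state, label(position))
    trace.append((state, reward))
```

The machine state is the extra memory the agent needs: the same grid cell can call for different behaviour depending on whether A has already been visited.
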
To model tasks requiring counting or extended memory, swap in a CountingRewardMachine instead of a standard RM. The workflow is identical.
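
One way to picture why counting needs more than a fixed set of machine states is the task "see A some number of times, then see B the same number of times". The sketch below is a toy with a single integer counter. The function `crm_step`, the state names, and the event letters are all hypothetical and do not reflect the library's CountingRewardMachine interface.

```python
def crm_step(state, counter, event):
    """Toy counting machine step: returns (next_state, next_counter, reward)."""
    if state == "counting" and event == "A":
        return "counting", counter + 1, 0.0  # remember one more A
    if state in ("counting", "matching") and event == "B" and counter > 0:
        counter -= 1  # one B pays off one remembered A
        if counter == 0:
            return "done", 0, 1.0  # every A has been matched
        return "matching", counter, 0.0
    return state, counter, 0.0  # irrelevant event: nothing changes


state, counter, total = "counting", 0, 0.0
for event in ["A", "A", "A", "B", "B", "B"]:
    state, counter, reward = crm_step(state, counter, event)
    total += reward
```

A plain finite-state machine would need a separate state for every possible count, whereas the counter handles any number of A events with the same three states.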

Training a Simple Agent

Here’s a basic tabular Q-learning loop:
import numpy as np
from collections import defaultdict

# Q-values per observation; assumes observations are hashable (e.g. tuples or ints)
q_table = defaultdict(lambda: np.zeros(env.action_space.n))

for episode in range(100):
    obs, _ = env.reset()
    done = False
    
    while not done:
        # Epsilon-greedy action selection
        if np.random.rand() < 0.1:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[obs])
            
        next_obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        
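        # Q-learning update (learning rate 0.1, discount 0.99);
        # do not bootstrap from next_obs when the episode has terminated
        target = reward if terminated else reward + 0.99 * np.max(q_table[next_obs])
        q_table[obs][action] += 0.1 * (target - q_table[obs][action])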
        
        obs = next_obs
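
After training, it can be useful to check how the learned values perform without exploration. The helper below is a minimal sketch, not part of the library: `evaluate_greedy` is a hypothetical name, and it assumes the same hashable observations and Gymnasium-style `env` as the loop above. It relies on the environment eventually terminating or truncating (for example via `max_steps`).

```python
import numpy as np


def evaluate_greedy(env, q_table, episodes=10):
    """Average undiscounted return of the greedy policy (hypothetical helper)."""
    returns = []
    for _ in range(episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            # Always pick the highest-valued action: no epsilon exploration
            action = int(np.argmax(q_table[obs]))
            obs, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
        returns.append(total)
    return float(np.mean(returns))
```

For example, `evaluate_greedy(env, q_table)` would report the mean return over ten greedy episodes using the table learned above.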

Next Steps

Worked Examples

Core Concepts