Reinforcement Learning Agents

Introduction

The crm.agents module provides reinforcement learning algorithms that integrate with Counting Reward Machines to efficiently learn task policies. These agents are designed to take advantage of the counterfactual experience generation capabilities provided by the CRM framework. The framework includes two main types of agent implementations:
  1. Tabular agents for discrete state and action spaces
  2. Deep RL agents based on Stable Baselines 3 for continuous domains

Tabular Agents

Tabular agents are suitable for environments with discrete state and action spaces. The framework provides:

Q-Learning (QL)

The standard Q-Learning algorithm is implemented in crm.agents.tabular.ql. This provides a baseline implementation that uses the standard Q-learning update rule:
Q(s,a) ← Q(s,a) + α[r + γ·max_{a'} Q(s',a') - Q(s,a)]
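As a point of reference, the update can be written in a few lines of standalone Python. The array-based table and variable names below are purely illustrative and are not part of the crm.agents.tabular.ql API:

import numpy as np

# Illustrative tabular Q-learning update (not the crm.agents.tabular.ql API).
n_states, n_actions = 100, 4
Q = np.zeros((n_states, n_actions))

alpha = 0.01   # learning rate
gamma = 0.99   # discount factor

def q_update(s, a, r, s_next):
    """Apply the standard Q-learning update for one transition (s, a, r, s')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])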

Counterfactual Q-Learning (CQL)

The crm.agents.tabular.cql module implements Counterfactual Q-Learning, which extends standard Q-Learning to take advantage of the counterfactual experience generation capabilities of Counting Reward Machines.
from crm.agents.tabular.cql import CounterfactualQLearningAgent

# Create the agent
agent = CounterfactualQLearningAgent(
    env=cross_product_env,  # Must be a CrossProduct environment
    epsilon=0.1,            # Exploration rate
    learning_rate=0.01,     # Learning rate
    discount_factor=0.99    # Discount factor
)

# Train the agent
returns = agent.learn(total_episodes=1000)
The key enhancement is in the learning process, which:
  1. Takes a real step in the environment
  2. Generates counterfactual experiences using the CrossProduct environment
  3. Updates Q-values for all valid counterfactual experiences
Because every environment step yields updates for many possible reward machine configurations, the agent effectively “imagines” how the reward machine would behave in different states, which significantly accelerates learning compared to standard Q-Learning.
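Conceptually, one learning step looks like the sketch below. The generate_counterfactual_experiences call is a hypothetical placeholder for however the CrossProduct environment exposes counterfactual transitions; consult the CounterfactualQLearningAgent source for the actual interface.

def counterfactual_q_step(env, obs, select_action, q_update):
    """One CQL-style step: update on the real transition and on every
    counterfactual transition derived from it (conceptual sketch only)."""
    action = select_action(obs)
    next_obs, reward, terminated, truncated, _ = env.step(action)

    # Update on the real experience.
    q_update(obs, action, reward, next_obs)

    # Hypothetical method name: update on each counterfactual experience
    # the cross-product environment can derive from the same ground step.
    for c_obs, c_act, c_rew, c_next in env.generate_counterfactual_experiences(obs, action):
        q_update(c_obs, c_act, c_rew, c_next)

    return next_obs, terminated or truncated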

Deep RL Agents

For environments with continuous state or action spaces, the framework provides integrations with Stable Baselines 3.

Counterfactual SAC (C-SAC)

The crm.agents.sb3.sac.csac module implements Counterfactual Soft Actor-Critic (C-SAC), extending the SAC algorithm from Stable Baselines 3 to learn from counterfactual experiences.
from crm.agents.sb3.sac import CounterfactualSAC

# Create the agent
agent = CounterfactualSAC(
    policy="MlpPolicy",
    env=cross_product_env,  # Must be a CrossProduct environment
    learning_rate=3e-4,
    buffer_size=1_000_000,
    batch_size=256
)

# Train the agent
agent.learn(total_timesteps=100_000)
C-SAC enhances the standard SAC algorithm by:
  1. Collecting transitions from the environment
  2. Generating counterfactual experiences for each transition
  3. Adding these experiences to the replay buffer
  4. Training the policy network using both real and counterfactual experiences
This approach is particularly effective for complex continuous control tasks and environments with sparse rewards.
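The sketch below illustrates the idea of populating the replay buffer with both real and counterfactual transitions. It is a conceptual outline, not the actual CounterfactualSAC implementation; in particular, generate_counterfactual_experiences is a hypothetical placeholder for the CrossProduct environment's interface.

def store_transition_with_counterfactuals(replay_buffer, env, obs, action,
                                          next_obs, reward, done, infos):
    """Add a real transition and its counterfactual variants to the replay
    buffer (conceptual sketch; not the actual CounterfactualSAC internals)."""
    # Real experience collected from the environment.
    replay_buffer.add(obs, next_obs, action, reward, done, infos)

    # Hypothetical call: counterfactual experiences derived from the same
    # ground-environment transition are stored alongside the real one, so
    # the policy and critic updates sample from both.
    for c_obs, c_act, c_rew, c_next, c_done in env.generate_counterfactual_experiences(obs, action):
        replay_buffer.add(c_obs, c_next, c_act, c_rew, c_done, infos)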

Vectorized Environment Support

The framework also provides specialized support for vectorized environments through the crm.agents.sb3.vec module, which includes:
  • DispatchSubprocVecEnv: An extension of Stable Baselines 3’s SubprocVecEnv that enables efficient parallel generation of counterfactual experiences
This implementation is designed to maintain performance when working with multiple parallel environments:
from crm.agents.sb3.vec import DispatchSubprocVecEnv

# Create vectorized environment
envs = DispatchSubprocVecEnv([
    lambda: create_cross_product_env() for _ in range(8)
])

# Create C-SAC agent with vectorized environment
agent = CounterfactualSAC(
    policy="MlpPolicy",
    env=envs,
    verbose=1
)

Performance Benefits

Agents that leverage counterfactual experiences show several advantages:
  1. Faster Convergence: Learning from counterfactual experiences often reduces the number of episodes needed to learn optimal policies by orders of magnitude.
  2. Better Sample Efficiency: By extracting more information from each environment interaction, these agents make better use of collected experiences.
  3. More Robust Policies: Since the agent explores the reward machine state space more completely, the resulting policies tend to be more robust.

Requesting Custom Agent Implementations

The CRM framework is designed to be extensible, and we’re committed to supporting a wide range of reinforcement learning algorithms. If you require an implementation not currently available, such as:
  • Integration with additional Stable Baselines 3 algorithms (PPO, TD3, etc.)
  • Support for other deep RL frameworks (RLlib, PyTorch-based libraries, etc.)
  • Custom agent architectures or learning algorithms
  • Specialized handling for your environment type
Please open an issue on our GitHub repository with the following information:
  1. The algorithm or implementation you need
  2. Your use case or environment
  3. Any specific requirements or constraints
We actively monitor issues and will prioritize implementing requested features based on community needs. We believe in making the CRM framework as versatile and useful as possible for all users.

Example Usage

Here’s a complete example showing how to use the Counterfactual Q-Learning agent with a Letter World environment:
from examples.letter.core.crossproduct import LetterWorldCrossProduct
from examples.letter.core.ground import LetterWorld
from examples.letter.core.label import LetterWorldLabellingFunction
from examples.letter.core.machine import LetterWorldCountingRewardMachine
from crm.agents.tabular.cql import CounterfactualQLearningAgent

# Create environment components
ground_env = LetterWorld()
lf = LetterWorldLabellingFunction()
crm = LetterWorldCountingRewardMachine()

# Create cross-product environment
cross_product = LetterWorldCrossProduct(
    ground_env=ground_env,
    crm=crm,
    lf=lf,
    max_steps=500
)

# Create and train the agent
agent = CounterfactualQLearningAgent(
    env=cross_product,
    epsilon=0.1,            # Exploration rate
    learning_rate=0.01,     # Learning rate
    discount_factor=0.99    # Discount factor
)

# Train the agent
returns = agent.learn(total_episodes=1000)

# Evaluate the learned policy
obs, _ = cross_product.reset()
done = False
total_reward = 0

while not done:
    action = agent.get_action(obs)
    obs, reward, terminated, truncated, _ = cross_product.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Evaluation reward: {total_reward}")

Summary

The reinforcement learning agents in the CRM framework are specifically designed to take advantage of the counterfactual experience generation capabilities of Counting Reward Machines. This approach significantly improves learning efficiency and policy quality compared to standard reinforcement learning algorithms. By providing both tabular and deep RL implementations, the framework supports a wide range of environments and task specifications, from simple discrete environments to complex continuous control problems.