The Q-learning algorithm maintains a table of state-action values and updates them based on the rewards received:
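$$Q(s,a) \;\leftarrow\; Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$

Here α is the learning rate and γ is the discount factor. The training loop below implements this update together with ε-greedy action selection: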
```python
import time
from collections import defaultdict

import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt

# Initialize the Q-table as a defaultdict to store state-action values
q_table = defaultdict(
    lambda: np.zeros(cross_product.action_space.n)
)

# Q-learning hyperparameters
EPISODES = 5000         # Total training episodes
LEARNING_RATE = 0.1     # Alpha: how quickly we update Q-values with new information
DISCOUNT_FACTOR = 0.99  # Gamma: importance of future rewards vs immediate rewards
EPSILON = 0.1           # Exploration rate: probability of taking a random action

# Track total rewards for visualization
all_returns = []

# Train the agent over multiple episodes with a progress bar
pbar = tqdm(range(EPISODES))
for episode in pbar:
    obs, _ = cross_product.reset()
    episode_return = 0
    done = False

    # Run a single episode
    while not done:
        # Epsilon-greedy action selection
        if np.random.random() < EPSILON or np.all(q_table[tuple(obs)] == 0):
            action = np.random.randint(cross_product.action_space.n)
        else:
            action = int(np.argmax(q_table[tuple(obs)]))

        # Take action and observe result
        next_obs, reward, terminated, truncated, _ = cross_product.step(action)
        episode_return += reward
        done = terminated or truncated

        # Q-learning update
        # Q(s,a) = Q(s,a) + α * [r + γ * max_a' Q(s',a') - Q(s,a)]
        if done:
            td_target = reward  # No future rewards if episode is done
        else:
            td_target = reward + DISCOUNT_FACTOR * np.max(q_table[tuple(next_obs)])

        td_error = td_target - q_table[tuple(obs)][action]
        q_table[tuple(obs)][action] += LEARNING_RATE * td_error

        # Move to next state
        obs = next_obs

    # Record episode return
    all_returns.append(episode_return)

    # Update progress bar with recent average return
    if episode % 10 == 0:
        pbar.set_description(
            f"Episode {episode} | Ave Return: {np.mean(all_returns[-10:]):.2f}"
        )
```
After training, we can visualize the agent’s learning progress:
```python
# Visualize training progress
plt.figure(figsize=(10, 6))
all_returns = np.array(all_returns)

# Apply moving average for smoothing (window size = 50)
smoothed_returns = np.convolve(all_returns, np.ones((50,)) / 50, mode="valid")

plt.plot(smoothed_returns)
plt.title('Q-Learning Performance in Letter World Environment')
plt.xlabel('Episode')
plt.ylabel('Average Return (50-episode moving average)')
plt.grid(True, linestyle='--', alpha=0.7)
plt.savefig('q-learning.png')  # Save figure before showing
plt.show()
```
The resulting learning curve looks like this:
The graph shows how the agent’s performance improves over time. Initially, returns are very negative as the agent explores randomly, but they gradually improve as the agent learns the optimal policy.
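Beyond inspecting the learning curve, you can sanity-check the final behaviour by running the learned policy greedily (no exploration) for a few episodes. A minimal sketch that reuses the `cross_product` environment and `q_table` from the training code above:

```python
# Evaluate the greedy policy learned by Q-learning (no exploration).
# Reuses `cross_product`, `q_table`, and numpy (np) from the training code above.
eval_returns = []
for _ in range(10):
    obs, _ = cross_product.reset()
    episode_return = 0
    done = False
    while not done:
        # Always take the highest-valued action for the current state
        action = int(np.argmax(q_table[tuple(obs)]))
        obs, reward, terminated, truncated, _ = cross_product.step(action)
        episode_return += reward
        done = terminated or truncated
    eval_returns.append(episode_return)

print(f"Mean greedy return over 10 episodes: {np.mean(eval_returns):.2f}")
```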
This example demonstrates how Q-learning can be used with Counting Reward Machines (CRMs) to train an agent on tasks with sequential structure. The Letter World environment illustrates how CRMs can model tasks that require remembering past events.

For more complex environments, you may need to adjust the hyperparameters or use more sophisticated reinforcement learning algorithms, but the same CRM framework applies.
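One common first adjustment, for example, is to decay the exploration rate ε over training instead of keeping it fixed at `EPSILON`. A minimal, illustrative sketch (the schedule values and helper below are assumptions, not part of the original example):

```python
# Linearly decay epsilon from a high initial value to a small final value.
# The constants here are illustrative choices, not values from the original example.
EPSILON_START = 1.0
EPSILON_END = 0.05
EPSILON_DECAY_EPISODES = 4000


def epsilon_for_episode(episode: int) -> float:
    """Return the exploration rate to use for the given episode."""
    fraction = min(episode / EPSILON_DECAY_EPISODES, 1.0)
    return EPSILON_START + fraction * (EPSILON_END - EPSILON_START)


# Inside the training loop, replace the fixed EPSILON with:
#     epsilon = epsilon_for_episode(episode)
#     if np.random.random() < epsilon: ...
```

Early episodes then explore almost uniformly at random, while later episodes increasingly exploit the learned Q-values.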