Bears, I don't know if anyone attempted the reinforcement learning (RL) tutorial from Hugging Face that I mentioned last week. I have attempted Unit 1. Here are some of my thoughts on what's there so far.
The Tutorial Is in Low-Maintenance Mode (as of February 2026)
The tutorial's home page clearly states that it is in low-maintenance mode and that some parts of it no longer function. This was reflected in the first notebook. When I installed the dependencies following the instructions, the install failed because the pinned Pygame version was still in the 2.1.x era and could no longer be found in Colab. The environment name (in the RL sense) has also changed from LunarLander-v2 to LunarLander-v3. And in a later notebook, the packages were installed with Anaconda, which now requires a separate command to accept the terms of service of some repositories.
Knowing Your Tools Helps With Troubleshooting Problems
The tutorial explicitly suggests following the notebooks in Colab rather than on your own computer, so that you are not distracted by the setup process. But again, because the tutorial is in low-maintenance mode, some things might not work. This is the time to read through the scripts in the notebook, understand which tools were used and why, how the packages were installed, and which versions were pinned, then check the latest documentation to find out what has changed. An understanding of those tools was clearly not the tutorial's focus, but it has its place.
Training Process Has Been Packaged Really Well
How many lines of code do you think you need to train an RL model? When I picked out the code I used in Unit 1, it was really just these lines that were doing RL-related work.
```python
import gymnasium as gym
from stable_baselines3 import PPO

# Create the environment and a PPO agent with a multi-layer-perceptron policy.
env = gym.make('LunarLander-v3')
model = PPO('MlpPolicy', env, verbose=1)

# Train for 1 million timesteps, then save the weights.
model.learn(total_timesteps=int(1e6))
model_name = "ppo-LunarLander-v3"
model.save(model_name)
```
I thought the training was just that (specify an environment and a policy) until a later sample code snippet passed a lot of additional keyword arguments. Those keyword arguments, the tutorial said, were there to speed up the training process.
```python
model = PPO(
    policy='MlpPolicy',
    env=env,
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
    verbose=1,
)
```
I had mixed feelings about this. The training process is indeed packaged really well: I didn't need to set much to start training an agent. On the other hand, there is still the expected score of 200 to pass the unit, calculated as the mean episode reward minus the standard deviation of the episode rewards. What if the default model didn't reach 200? At that moment, I had no idea at all. That fear lingered throughout the notebook.
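The scoring rule (mean reward minus its standard deviation) is simple enough to sketch directly. In the notebook, the per-episode rewards come from Stable Baselines3's evaluation helper; the numbers below are made up purely to illustrate the calculation:

```python
from statistics import mean, stdev

def unit_score(episode_rewards):
    """Score used by the unit: mean reward minus its standard deviation.

    Subtracting the standard deviation favours agents that land
    consistently, not just ones with a high average.
    """
    return mean(episode_rewards) - stdev(episode_rewards)

# Hypothetical rewards from 10 evaluation episodes.
rewards = [260, 245, 270, 230, 255, 265, 240, 250, 235, 260]
print(round(unit_score(rewards), 2))
```

An agent with a slightly lower average but much steadier landings can therefore outscore a flashier but erratic one.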
Training Takes a Long Time, and the Agent's First Attempts Tend to Be Miserable
Google Colab now offers a free GPU runtime (a T4). But even with that, training took well over 15 minutes for the Unit 1 notebook and about 45 minutes for Bonus Unit 1. The Stable Baselines3 documentation states that the algorithms it implements sometimes require millions of samples to learn anything useful. The Unit 1 notebook trained for 1 million timesteps and Bonus Unit 1 for 2 million.
The training function was configured to print statistics every few tens of thousands of timesteps. Initially, the output looks daunting. The mean reward the agent receives can increase a few times, then suddenly drop back to where it was before the agent had learned much. If you stare at those stats, with their per-period mean and standard deviation, you might wonder whether the reward will ever increase steadily (i.e., whether the training will converge) at all.
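One simple way to see past that noise is to smooth the logged rewards with a moving average before judging the trend. A minimal sketch, with made-up reward numbers standing in for the values printed during training:

```python
def moving_average(values, window):
    """Average each value with up to window-1 predecessors to smooth noise."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1) : i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Hypothetical mean rewards logged every few rollouts: noisy, with sudden
# dips, but trending upward overall (like early LunarLander training).
raw = [-180, -120, -150, -60, -90, 10, -40, 80, 50, 130, 110, 180]
smoothed = moving_average(raw, window=4)
print([round(v, 1) for v in smoothed])
```

The smoothed curve rises steadily even though the raw one dips repeatedly, which is often all the reassurance you need to let training keep running.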
The training in the lecture notebooks did converge, but when we leave the tutorial and go on our own adventures, it might not. We will need to learn from the experts how to handle that.
Understanding RL Concepts Is Paramount
I mentioned earlier that the training process has been packaged really well. I didn't need to write a lot of code; just some understanding of RL concepts can get me pretty far.
The question is: do I understand RL concepts?
Borrowing an idea from vocal training, an understanding of something comes in two dimensions:
- 正確性(seikaku-sei, correctness): Am I updating the states, rewards and other data structures correctly? Can I explain things accurately?
- 自由度(jiyuu-do, degree of freedom): Can I accept a variety of challenges, dealing with all kinds of situations in RL training?
The simplicity of the code often masks our weaknesses, so let us not be fooled. In fact, I hadn't done anything that required understanding RL concepts yet; that changed when I entered Unit 2 this morning. There, I moved from accepting the default models to understanding the maths and building one from scratch.
In the RL tutorial so far, the idea is often simple, but you need to be able to recognise it when it's written in maths. For example, if I tell you that $R_{t+1}$ represents the reward the agent receives at the current timestep (i.e., the immediate reward), and $\gamma$ represents a discount factor, so that the immediate reward is more important, can you tell what this expression means?

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$$
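Once you recognise it, the discounted sum of rewards is a one-liner. A sketch (the reward list and gamma value are made up for illustration):

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, scaling the reward k steps ahead by gamma**k,
    so earlier (more immediate) rewards count for more."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# With gamma = 0.5, each step's reward counts half as much as the previous:
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

Setting gamma close to 1 (as in the PPO snippet's 0.999) means the agent still cares almost as much about distant rewards as immediate ones.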
Understanding RL concepts also allows you to structure your game so that it can be used to train an RL agent, i.e., it can serve as an environment, like LunarLander-v3 above. There is one set of requirements to build your game, and another to make it an environment for training RL agents. The most important requirement in the latter is that it has to comply with the Gymnasium library's interface, which, as you might have guessed, is all about the definitions in RL.
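To make that interface concrete, here is a toy environment that mirrors the shape of Gymnasium's Env API: reset returns an observation and an info dict, and step returns observation, reward, terminated, truncated, and info. This is a standalone sketch with illustrative names only; a real environment would subclass gymnasium.Env and declare observation_space and action_space:

```python
import random

class CoinFlipEnv:
    """Toy environment mirroring the Gymnasium Env interface:
    guess the next coin flip; reward +1 if correct, -1 otherwise.
    (A real environment would subclass gymnasium.Env and define
    observation_space and action_space.)"""

    def __init__(self, max_steps=10):
        self.max_steps = max_steps
        self._steps = 0
        self._coin = 0

    def reset(self, seed=None):
        if seed is not None:
            random.seed(seed)
        self._steps = 0
        self._coin = random.randint(0, 1)
        return self._coin, {}  # observation, info

    def step(self, action):
        reward = 1.0 if action == self._coin else -1.0
        self._steps += 1
        self._coin = random.randint(0, 1)
        terminated = False                          # the task never "ends" on its own
        truncated = self._steps >= self.max_steps   # time limit reached
        return self._coin, reward, terminated, truncated, {}

# Run one episode with a naive policy: always guess the last observation.
env = CoinFlipEnv()
obs, info = env.reset(seed=0)
done, total = False, 0.0
while not done:
    obs, reward, terminated, truncated, info = env.step(obs)
    total += reward
    done = terminated or truncated
print(total)
```

The reward definition is where the RL concepts bite: what you hand back from step is the only signal the agent ever learns from, so it has to encode what "playing well" means in your game.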
More on that topic on another day.
Gabe the Bear