Structure Your Game for Agents

Implementing a custom environment reminds us of how we should code a game

24Feb26

Roar. Last time I mentioned how understanding RL concepts can allow you to structure your game so that it can be played by an agent. Although you might not want your game to be used for reinforcement learning (RL), I think there are important lessons in structuring ("architecting") a game even for other purposes, which you will see near the end.

RL Formalism

Recall that in RL, an agent takes actions in an environment, gets a reward and makes an observation of the environment, then uses a policy \pi\left(a_i \mid s_i\right) to decide the next action to take. The goal of RL is to learn an optimal policy, one that maximizes the cumulative reward (i.e., the total reward the agent has accumulated by the end).

When creating an environment in Gymnasium, the closer we can associate our game with RL concepts, the easier it would be. So it's important that we understand how things like actions, rewards and observations look in Gymnasium.

Observation and Action Spaces

The sets of all possible observations and actions are called the observation space of the environment and the action space of the agent, respectively. Both are modelled as Space in Gymnasium.

Gymnasium supports these kinds of spaces out of the box:

  • MultiBinary: for example, {Switch 1 On/Off, Switch 2 On/Off, ...}
  • Discrete: for example, {Move Up, Move Down, Do Nothing}
  • MultiDiscrete: for example, {Move Up/Down/Left/Right/None, Thrust On/Off, ...}
  • Box: for example, {Rotor 1 n_1 degrees, Rotor 2 n_2 degrees, Rotor 3 n_3 degrees, ...}
  • Text: for example, {"A", "B", "C", ...}

Therefore, if you structure your game state according to these shapes, it would be easier to integrate with Gymnasium.

The term "box" might sound strange, but think of it this way. Consider this system of inequalities.

\left\{ \begin{array}{l} -1 \leq x \leq 1 \\ -1 \leq y \leq 1 \end{array} \right.

Now, if you draw the lines x = 1 and x = -1, then any point between the two lines satisfies -1 ≤ x ≤ 1. Similarly, any point between the lines y = 1 and y = -1 satisfies -1 ≤ y ≤ 1. Therefore, the points within the square formed by these lines satisfy both. In 2D it is a square; in 3D it would be a box; and in higher dimensions we call it a multi-dimensional box.

One use of Text is to generate action spaces like a chessboard's, combined with the composite space Tuple (from the Gymnasium library, not the Python tuple).

from gymnasium.spaces import Tuple, Text

row_space = Text(min_length=1, max_length=1, charset="ABCDEFGH")
col_space = Text(min_length=1, max_length=1, charset="12345678")
action_space = Tuple((row_space, col_space))
action_space.sample()   # ('D', '8')
action_space.sample()   # ('H', '4')
action_space.sample()   # ('F', '3')

Rewards and Termination

Reward function design is the heart of an RL exercise. It encodes what the desired end outcome is. The function must return a number, but it can take many forms, and the outcome can be catastrophic if you design the wrong one. One example that repeatedly appears in texts is the myth of King Midas, whose "reward function" only counted the amount of gold and not his true happiness.
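As a toy illustration of how much the choice matters (all names and weights here are hypothetical, not from any real environment):

```python
# Hypothetical reward functions for a Midas-like resource game.
# The state is a dict of quantities the game tracks each step.

def midas_reward(state):
    # Counts gold alone -- the agent will sacrifice everything else for it.
    return state["gold"]

def balanced_reward(state):
    # Still rewards gold, but also values happiness and penalizes damage.
    # The weights are arbitrary and would need tuning in practice.
    return state["gold"] + 10.0 * state["happiness"] - 5.0 * state["damage"]

state = {"gold": 3.0, "happiness": 1.0, "damage": 2.0}
print(midas_reward(state))     # 3.0
print(balanced_reward(state))  # 3.0 + 10.0 - 10.0 = 3.0
```

Both functions return a plain number, as Gymnasium expects; the difference is entirely in which behaviours they end up encouraging.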

In the Gymnasium docs, there are lots of sample environments. Each environment is documented in a similar structure: action space, observation space, rewards, starting state, episode end, etc. You can check there for inspiration when designing your reward function. (Spoiler: many of them are simulated robots.)

Oh my bear, does that mean all the robot dogs we see were trained using RL 🤯?

Gymnasium API

To create a custom Gymnasium environment, you subclass gymnasium.Env. The most important methods to override are:

  • step(action: ActType) -> tuple[ObsType, SupportsFloat, bool, bool, dict[str, Any]]: this is where the agent takes an action, gets a reward and makes an observation. The training programme decides which action to take, then calls this method to apply it and receive the reward, which it may use to choose the next action. It should return: the next observation, the immediate reward, whether the episode terminated (i.e., the agent won or died), whether it was truncated (i.e., it ended prematurely without reaching a proper end, such as hitting a time limit), and an info dict.
  • reset(*, seed: int | None = None, options: dict[str, Any] | None = None) -> tuple[ObsType, dict[str, Any]]: this resets the state of the game. Every time the game (called an "episode") ends, you must reset the state to start another round. It should return an observation of the initial state and an info dict.
  • render() -> RenderFrame | list[RenderFrame] | None: this is where the current frame is drawn, if required by the user of your environment.
  • close() -> None: this is the place for stuff like pygame.quit() which you have to call to release system resources.

Pygame is a popular choice for rendering environments, by the way.

There are a few important attributes you need to consider:

  • action_space and observation_space, as stated above
  • spec, which contains the information used by gymnasium.make to create the environment
  • metadata, which for now mainly lists the supported render modes, plus other things libraries may add
  • np_random, the random number generator used; bears pass a seed to fix the random number generator so that it produces the same random sequence every time the environment is reset. This helps with troubleshooting and reproducing problems.

Lessons for Our Game

Even if you're not thinking of letting an RL agent play your game, there are lessons that we can learn from this exercise.

  • Separate data from drawing. Note how, in Gymnasium API, the render function is separate from the step method.
  • Separate action from input method. Note also that the actions are modelled as "move up", "move down" and not "up arrow pressed", "down arrow pressed". Some environments do (like the Atari ones) model actions as input methods, but very deliberately (to provide a uniform action space for all environments based on Atari console games).
  • Define beginning and end conditions, and the objectives of the agent. These are the core elements of a game, but because we're a coding club, too often we don't think enough about these and start coding straight away. RL forces us to consider these carefully so that we can design the reward function and the termination conditions.
  • Don't forget the function to reset the game. We've seen multiple bugs in our games due to forgetting to reset things.