Reinforcement Learning with Flappy Bird
Survival of the Flappest! 💪😁
Everyone knows Flappy Bird! In case you do not, just take a look at the image here.
So, the bird flaps in a Mario-like world. But instead of entering or exiting the pipes, it should avoid them. All of them, including the ground and the top limit of the screen. The problem lies in the gravity. Without flapping, the bird goes down vertically, while the pipes keep moving from the right side of the screen to the left. A flap makes it go up, but not for long.
The game is commonly played on a touch-screen device, where a tap makes the bird flap. This time, the action is chosen by a Reinforcement Learning (RL) mechanism with two possible outputs: flap or no flap. The input states are composed by observing the screen, along with an internal state of the bird's current vertical position.
For the RL algorithm, an instance of XCS-RC is used, which can be downloaded as a Python package or from GitHub.
Thanks to the Flappy Bird game built on pygame, we can focus on the learning part without spending much effort on managing the design.
As usual, the very first step is getting the environmental states. At first, we will do that using the getGameState() function, since it is the easiest way. Later, some non-learning image processing will be added to make it more fun. 😉
This earliest step requires at least two parts: initializing the game environment (FlappyBird) and the learning agent. Below is the code for the former.
```python
# game_initialization
from flappybird import FlappyBird
game = FlappyBird()

from ple import PLE
p = PLE(game, display_screen=True)
p.init()

actions = p.getActionSet()
print("Available actions:", actions)
```
A successful initialization is indicated by two outputs: a pygame window, and a console output similar to below, where the pygame version may differ:
```
pygame 1.9.6
Hello from the pygame community. https://www.pygame.org/contribute.html
Available actions: [None, 119]
```
Here, None means no flap, while 119 is the code for do flap.
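For context, 119 is the ASCII code of the lowercase letter w, which is also the value of pygame's K_w key constant, so the second action amounts to a simulated key press. A minimal sketch of decoding the action set, where describe_action is a hypothetical helper and not part of PLE or pygame:

```python
# Action set as printed by the game: None = no flap, 119 = key code for a flap.
actions = [None, 119]


def describe_action(action):
    """Return a readable label for one entry of the action set (hypothetical helper)."""
    if action is None:
        return "no flap"
    # chr() maps the key code back to its character
    return f"do flap (key '{chr(action)}')"


for index, action in enumerate(actions):
    print(index, describe_action(action))
```

The list index (0 or 1) is what the learning agent works with later on; the game itself only sees None or the key code.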
Then, we initialize the learning agent as an XCS-RC Agent instance and set a few of its parameters.
```python
# xcs_initialization
import xcs_rc

agent = xcs_rc.Agent()
agent.maxpopsize = 200
agent.tcomb = 100
agent.predtol = 5.0
agent.prederrtol = 0.0
```
Using a one-liner like agent = xcs_rc.Agent(maxpopsize=200, ...) is also possible, depending on your preference.
The current states of the screen can be obtained with the function getScreenRGB():

```python
# get_screen_states
screen = game.getScreenRGB()
```

The window size for the Flappy Bird game is 288x512 pixels. With three color channels per pixel, that makes 442368 values overall, all within the range [0..255]. Further processes will be provided below.
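The count can be checked with a quick calculation. The sketch below assumes that the returned frame holds three color channels (red, green, blue) per pixel; the constant names are chosen only for illustration.

```python
# Frame dimensions of the Flappy Bird window, as stated in the text.
WIDTH, HEIGHT = 288, 512
CHANNELS = 3  # assumption: one value each for red, green, and blue

pixels = WIDTH * HEIGHT
values = pixels * CHANNELS

print("pixels per frame:", pixels)
print("values per frame:", values)
```

That many raw inputs per tick is a lot for any learner, which is why the internal game states are used first.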
This is how we get the game states the easy way. The variable pipe_y is increased by 50 because the gap between a pair of pipes is 100 pixels, so pipe_y marks the vertical center of the gap.
```python
# get_game_states
game_state = p.getGameState()
bird_y = int(game_state["player_y"])
bird_v = int(game_state["player_vel"])
pipe_x = int(game_state["next_pipe_dist_to_player"])
pipe_y = int(game_state["next_pipe_top_y"] + 50)
```
And the observation part is actually done. The values of these variables will be fed to the bird, as it should make decisions to execute do flap or no flap.
Now let's get the bird's action by feeding the states to the learning system, and then execute it in the environment. The variable state is a list consisting of three values:
- the next pipe's absolute distance, from the left-hand side of the screen
- the next pipe's relative vertical distance to the bird, and
- the bird's current vertical speed
```python
# compose_state
state = [pipe_x, (bird_y + 12) - pipe_y, bird_v]
```
I have tried letting the bird learn with only the first two values, but the result was not quite satisfying. Keeping the number of states small is decisive for reaching an optimal learning speed. And since three is not that many, I think we're good to go.
By the way, why is there a 12 added to bird_y? Well, it is actually optional. The idea is to store the vertical middle of the bird, whose height is 24. So, we do not use the top position of the image, but the middle instead.
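To see the two offsets (12 and 50) at work, here is a small self-contained sketch that composes the state from a plain dictionary instead of a running game. The function name compose_state, the two constants, and the sample values are made up for illustration; only the dictionary keys come from the code above.

```python
BIRD_HALF_HEIGHT = 12  # half of the bird's 24-pixel height
HALF_PIPE_GAP = 50     # half of the 100-pixel gap between a pair of pipes


def compose_state(game_state):
    """Build the three-value state list from a game-state dictionary."""
    bird_y = int(game_state["player_y"])
    bird_v = int(game_state["player_vel"])
    pipe_x = int(game_state["next_pipe_dist_to_player"])
    pipe_y = int(game_state["next_pipe_top_y"] + HALF_PIPE_GAP)  # gap center
    # Screen y grows downward, so a positive middle value means
    # the bird's middle is below the center of the gap.
    return [pipe_x, (bird_y + BIRD_HALF_HEIGHT) - pipe_y, bird_v]


# Hand-written sample values (not taken from a real game run).
sample = {
    "player_y": 200,
    "player_vel": -3,
    "next_pipe_dist_to_player": 120,
    "next_pipe_top_y": 150,
}
print(compose_state(sample))
```

With these sample numbers, the gap center sits at 200 and the bird's middle at 212, so the middle state value is 12.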
And now, let's get to the action, and execute it.
```python
# get_agent_action, 0 = no flap; 1 = do flap
action = agent.next_action(state)

# execute_action
p.act(actions[action])
```
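To try this control flow without installing the game or XCS-RC, the agent call can be mimicked with a stand-in. In the sketch below, CoinFlipAgent is a hypothetical placeholder that only copies the method name next_action used above; it does not learn and is not the XCS-RC implementation.

```python
import random


class CoinFlipAgent:
    """Hypothetical stand-in for the learning agent: picks 0 or 1 at random."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)

    def next_action(self, state):
        # 0 = no flap; 1 = do flap (the state is ignored by this placeholder)
        return self._rng.randrange(2)


actions = [None, 119]      # action set as printed by the game
agent = CoinFlipAgent(seed=0)

state = [120, 12, -3]      # example state: [pipe_x, relative y, bird_v]
action = agent.next_action(state)
chosen = actions[action]   # the value that would be passed to p.act(...)
print("index:", action, "-> game action:", chosen)
```

The point of the indirection is that the agent only ever deals with the indices 0 and 1, while the list lookup translates them into what the game expects.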
Want to see what a few lines of code can give? You got it. This is the du... not learning version of our Flappy Bird.
```python
# game_initialization here
# xcs_initialization here

from time import sleep

NUM_RUNS = 10
while game.player.runs < NUM_RUNS:
    game_state = p.getGameState()
    bird_y = int(game_state["player_y"])
    bird_v = int(game_state["player_vel"])
    pipe_x = int(game_state["next_pipe_dist_to_player"])
    pipe_y = int(game_state["next_pipe_top_y"] + 50)
    state = [pipe_x, (bird_y + 12) - pipe_y, bird_v]

    action = agent.next_action(state)
    p.act(actions[action])

    # reward assignment will be placed here

    # to prevent it becoming a Flashy bird 😵
    sleep(0.01)

    if game.game_over():
        game.player.runs += 1
        game.reset()
```

And here's the result:
This is because the chances of do flap and no flap are equal at 50:50. However, the difference is that once it chooses do flap, performing no flap has no effect for the next 8 ticks, since the bird will still go up anyway. On the other hand, picking do flap again brings the flappy bird even closer to hitting the ceiling.
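The upward drift described above can be reproduced with a toy model of the vertical motion. This is only a sketch: the three constants below are assumptions picked so that one flap keeps the bird rising for 8 further ticks, and they are not read from the game's source code.

```python
# Toy vertical physics. All three constants are assumptions for illustration.
GRAVITY = 1.0          # velocity gained per tick without a flap
FLAP_VELOCITY = -9.0   # velocity right after a flap (negative = upward)
MAX_DROP_SPEED = 10.0  # falling speed limit


def step(y, v, flap):
    """Advance the bird by one tick and return its new position and velocity."""
    if flap:
        v = FLAP_VELOCITY
    else:
        v = min(v + GRAVITY, MAX_DROP_SPEED)
    return y + v, v


def rising_ticks_after_flap():
    """Count the no-flap ticks during which the bird still moves upward."""
    y, v = step(256.0, 0.0, flap=True)
    ticks = 0
    while True:
        y, v = step(y, v, flap=False)
        if v >= 0:
            return ticks
        ticks += 1


def final_height(flaps, y=256.0, v=0.0):
    """Apply a sequence of flap decisions and return the resulting y position."""
    for flap in flaps:
        y, v = step(y, v, flap)
    return y


print("rising ticks after one flap:", rising_ticks_after_flap())
# Alternating flap / no flap is a crude stand-in for a 50:50 policy.
print("y after 20 alternating ticks:", final_height([True, False] * 10))
```

In this model, a no-flap tick right after a flap barely slows the climb, so a policy that flaps about half of the time keeps moving toward smaller y values, that is, toward the top of the screen.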
Composing the reward map is commonly quite difficult, but it is also a very important part of Reinforcement Learning as a whole, especially when facing dynamic environments. For this one, we will keep it as simple as possible while still getting an outstanding result.
So, the feedback for each action is determined by two factors:
- y = the bird's vertical position relative to the nearest pipe
- diff = the difference from the previous y
```python
if 9 <= y <= 35:
    reward = maxreward
```
The first formula is pretty straightforward: set the full reward value if y is greater than or equal to nine, with a maximum at 35. The idea of these values is to keep the bird in the lower half of the pipe gap as often as possible. By doing that, we minimize the risk of hitting the upper pipe when flapping. The second and third formulas have something to do with diff, that is, with whether the bird is moving toward or away from the center of that range:
```python
else:
    # calculate the vertical difference from center
    center = 22
    from_center = abs(y - center) - abs(y - diff - center)

    # if the bird is getting closer to center
    if abs(y - center) < abs(y - diff - center):
        reward = (0.5 + abs(from_center) * 0.05) * maxreward

    # otherwise, if it is going away
    elif abs(y - center) > abs(y - diff - center):
        reward = (0.5 - abs(from_center) * 0.05) * maxreward
```
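Put together, the three rules fit into one small function. The sketch below is a self-contained restatement, not the author's exact code: the name compute_reward and the default maxreward of 1000.0 are assumptions, and so is the last return line, because the rules above do not say what happens when the distance from the center stays the same.

```python
def compute_reward(y, diff, maxreward=1000.0):
    """Reward for one tick.

    y    -- the bird's vertical position relative to the gap center
    diff -- change of y since the previous tick (the previous y is y - diff)
    """
    if 9 <= y <= 35:
        return maxreward

    center = 22
    now = abs(y - center)            # current distance from the target center
    before = abs(y - diff - center)  # distance from the center one tick earlier
    from_center = now - before

    if now < before:   # getting closer to the center
        return (0.5 + abs(from_center) * 0.05) * maxreward
    if now > before:   # going away from the center
        return (0.5 - abs(from_center) * 0.05) * maxreward
    # Assumption: the unchanged-distance case gets the neutral half reward.
    return 0.5 * maxreward


# diff is obtained by remembering the previous y between ticks.
previous_y = None
for y in [48, 43, 40, 30, 38]:
    diff = 0 if previous_y is None else y - previous_y
    print(y, diff, compute_reward(y, diff))
    previous_y = y
```

Note that the "going away" branch drops below zero once the bird moves more than 10 pixels further from the center in a single tick, so large wrong moves are effectively punished.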
And that's it! Based on only these lines for determining the reward, the flappy bird will be able to learn flapping through thousands of pairs of pipes. In the video, it's only around 1200 points max. After some algorithmic improvements, more than 10000 pairs of pipes can be passed, still in under 100 episodes.
However, and this is very important, never forget to assign the reward and update the executed rules with a single line of code:

```python
agent.apply_reward(reward)
```

Now, if you are looking for a simple way to beat Flappy Bird, it is done. But as I said, using the provided internal states is too easy. Image processing would be the next challenge. And to keep it simple, no learning algorithms will be used for that. Just a bunch of simple math.
Keep checking this page out, they might be here sooner than you thought. 😉