OpenAI Benchmark Problems
CartPole, Taxi, etc. | still in progress
In Reinforcement Learning (RL), OpenAI Gym is known as one of the standards for comparing algorithms. This page is therefore dedicated to solving its cases one by one: starting with the Cart-Pole environment, which represents a simple control theory problem, and then Taxi, which is an instance of hierarchical learning. We will go even further and see how other available scenarios can be solved using XCS-RC. All code will be presented as simply as possible, without leaving out the important factors.
Run `pip install xcs-rc` or install it via GitHub to get the required library.
I remember the first time I studied this inverted pendulum problem during my bachelor studies, which is not surprising since I come from a Control Engineering background. CartPole has similar characteristics, and what we need as a start is a set of environmental states.
`xcs_input` is created for that purpose. See the corresponding notes on the right column for some detailed explanations.

```python
state = env.reset()  # executed once when an episode begins

for t in range(200):  # episode starts here
    xcs_input = [state[1], state[2]]
    action = agent.next_action(xcs_input, 2)  # explore_it
```
The call to `next_action()` returns an `action` as a response to the given state. Now we will drive the cart to act accordingly.

```python
    state, feedback, done, info = env.step(action)
    env.render()  # draw image
```

The value of `state` has a similar structure to that returned by `env.reset()` above. Other than that, we will only use `done`, which indicates the end of the current episode.
The reward is calculated from two factors:

- Current pole angle (straight up, 0°, as the reference)
- The difference to the previous value

```python
    reward = 0.0                          # initialize reward
    if state[2] * (action - 0.5) >= 0.0:  # if getting closer to the reference
        reward = 0.8                      # set a good reward
        if abs(state[2]) < 0.05:          # if very close to the reference
            reward += 0.2                 # make it higher
    agent.apply_reward(reward * agent.maxreward)
```

Of course, there are many ways to improve the reward calculation, like taking the cart's momentum into account. However, here we want to keep it simple, and XCS can still learn well.
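For illustration only, a variant that also takes the cart's movement into account might look like the sketch below; the extra velocity check and its threshold (0.5) are my own assumptions, not part of the original example.

```python
    # Sketch of a momentum-aware variant (assumed, not the tutorial's code):
    # also ask the cart to keep its velocity small while balancing the pole.
    reward = 0.0
    if state[2] * (action - 0.5) >= 0.0:  # pole moving toward the reference
        reward = 0.6
        if abs(state[2]) < 0.05:          # pole very close to upright
            reward += 0.2
        if abs(state[1]) < 0.5:           # cart velocity kept small (threshold assumed)
            reward += 0.2
    agent.apply_reward(reward * agent.maxreward)
```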
The episode ends when `done == True`.

```python
    if done:
        t += 1  # +1 since it starts from 0
        break   # break the loop
# episode ends here

print("Finished after {} timesteps.".format(t))
```
So, that's it?
Yes, basically that is it. The cart will learn to keep the pole in a straight-up position using RL. Of course, one episode is insufficient. For a complete version quite similar to this one, you can check out my github page. It's way easier than you thought! 😉
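Since one episode is not enough, a minimal sketch of a multi-episode run could look like this; it simply wraps the snippets above in an outer loop, and the episode count of 200 is an arbitrary choice.

```python
# Minimal sketch: the snippets above wrapped in an outer loop.
# The number of episodes (200) is an arbitrary choice for this example.
for episode in range(200):
    state = env.reset()
    for t in range(200):
        xcs_input = [state[1], state[2]]
        action = agent.next_action(xcs_input, 2)      # explore_it
        state, feedback, done, info = env.step(action)

        reward = 0.0
        if state[2] * (action - 0.5) >= 0.0:          # getting closer to reference
            reward = 0.8
            if abs(state[2]) < 0.05:                  # very close to reference
                reward += 0.2
        agent.apply_reward(reward * agent.maxreward)

        if done:
            print("Episode {} lasted {} timesteps.".format(episode, t + 1))
            break
```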
XCS-RC and the environment have to be initialized. This is how we do it.

```python
import xcs_rc
import gym

agent = xcs_rc.Agent(tcomb=10, prederrtol=0.0)
agent.reward_map(max=1000.0, projected=800.0)

env = gym.make('CartPole-v0')
```
An environmental state is a list (array) consisting of four elements:
- Cart: position
- Cart: velocity
- Pole: angle
- Pole: velocity at tip

Here, we only use Cart: velocity and Pole: angle to solve the problem.
This function finds all rules in the population of knowledge that match `state` and returns the selected action using a method determined by the second parameter. The available options are:

- `explore`: pick inexperienced rules as a priority
- `exploit`: always choose the ones with high predicted reward
- `explore_it`: a combination of both above; do `explore` when `exploit` predicts a reward less than projected (determined in `agent.reward_map()`)
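The idea behind `explore_it` can be pictured roughly as follows; this is only an illustrative sketch with made-up names (`predicted`, `projected`), not the library's internal code.

```python
# Rough illustration of the explore_it idea (made-up names, not library code)
def explore_it_mode(predicted, projected):
    if predicted >= projected:  # exploitation already promises enough reward
        return "exploit"
    return "explore"            # otherwise, give inexperienced rules a chance
```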
Carelessly using the term reward in the CartPole environment might cause some confusion. Why? Because there are two kinds of rewards: from the environment to the agent, and from the agent to the rule. They are two entirely separate things. To prevent that, I will consistently use the word feedback for the former and reward for the latter.
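In code, the distinction corresponds to these two lines, both taken from the snippets above:

```python
# feedback: environment -> agent (not used further in this tutorial)
state, feedback, done, info = env.step(action)

# reward: agent -> its rules (the value we compute ourselves)
agent.apply_reward(reward * agent.maxreward)
```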
The result is already available on YouTube for everyone to watch! Have a nice balance! 😎
Now we tackle a problem involving hierarchical RL, or, in XCS terms, a multistep problem. As the title suggests, an agent should pick up a passenger at a certain point and then drop them off at another. In this environment, four locations are available for picking up and dropping off.

The "world" has 25 coordinates available, along with six possible actions: four for the directions and two passenger-related (SNEW/PD, i.e., South/North/East/West plus Pick-up/Drop-off). An environmental state consists of `[taxi_row, taxi_col, passenger_location, destination]` with 5 x 5 x 5 x 4 = 500 possibilities. Put together with the actions, we get 3000 available state-action pairs.
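As a quick sanity check of those counts:

```python
# Quick sanity check of the numbers mentioned above
n_states = 5 * 5 * 5 * 4      # taxi_row x taxi_col x passenger_location x destination
n_actions = 6                 # south, north, east, west, pick-up, drop-off
print(n_states)               # 500
print(n_states * n_actions)   # 3000
```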
Unlike in CartPole, composing an input structure for Taxi takes quite some time (actually not that long; I solved this case in less than a workday in total). The trick is to keep the number of elements as small as possible, while still representing the whole problem. Before doing anything, don't forget to initialize.

```python
def get_input(obs):
    target = obs[2] if obs[2] < 4 else obs[3]  # get current target
    if obs[:2] == loc[target]:      # obs[:2] is the taxi's position
        return [-1, -1] + [obs[2]]  # return special state
    else:
        return obs[:2] + [target]   # return normal state
```

What can we conclude from the code?
Well, basically, the function `get_input(obs)` determines where the taxi should go and what it should do. The value `obs[2] < 4` means that a passenger is still waiting, so we set the target to the passenger's location. Otherwise, when the passenger is already in the taxi (`obs[2] == 4`), the target is the drop-off location.

What is that special state? It is a trick to make XCS learn quicker. By giving a pair of negatives as the taxi's coordinates, XCS can quickly get the idea that this is the correct place to pick up or drop off the passenger. 😉
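Running the function on a few hand-picked observations shows the two cases; the concrete values are just illustrations, using the `loc` list from the initialization note below.

```python
# Illustrative calls, using loc = [[0, 0], [0, 4], [4, 0], [4, 3]]
print(get_input([2, 1, 3, 0]))  # [2, 1, 3]   -> head toward the passenger waiting at loc[3]
print(get_input([4, 3, 3, 0]))  # [-1, -1, 3] -> taxi is at loc[3]: time to pick up
print(get_input([0, 4, 4, 1]))  # [-1, -1, 4] -> passenger on board, taxi at loc[1]: time to drop off
```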
Afterwards, we take and decode the first observation, and start the episode.

```python
obs = list(env.decode(env.reset()))  # decode first observation

# episode starts here
for t in range(200):  # 200 steps before giving up (important!)
    state = get_input(obs)
```
Unlike Cart-Pole, which allowed a straightforward action and reward, here comes the tricky part of multistep learning. It is called delayed reward, and it is very effective. Basically it says, "if the agent has not reached the target, give a discounted reward out of the highest possible afterwards". The corresponding note on the right column might help. 🙂

```python
    agent.build_matchset(state)           # collect matching rules
    action = agent.pick_action_winner(2)  # pick a winner with explore_it

    # delayed reward function here

    agent.build_actionset(action)         # collect winning rules
```

A small number of lines is reserved for the delayed reward. For now, let's just focus on the conventional steps: get the matching rules, pick a winning action, and collect the corresponding rules.
```python
    prev = state                 # keep state for a later comparison
    obs, _, done, _ = env.step(action)
    obs = list(env.decode(obs))  # decode given observation
    env.render()                 # draw image, optional
```

The new observation has a similar structure to the one returned by `env.reset()` above, so it is decoded the same way. Other than that, we will only use `done`, which indicates a successful drop-off.
Taxi employs two types of reward function:

- Normal, when the agent reaches the target
- Delayed, otherwise

This is how we judge the normal reward.

```python
    if (prev[2] < 4 and state[2] == 4) or done:
        agent.apply_reward(agent.maxreward)
    elif prev == state:
        agent.apply_reward(0.0)
```

The code basically tells us that a maximal reward is obtained when the passenger is picked up or dropped off properly. Contrary to that, the lowest gain is awarded when no progress occurred, indicated by an unchanged state (e.g., caused by hitting a wall).
And, what about the delayed reward?
Simply put this code at the corresponding comment, before building the action set, and it's done.

```python
    # this is the delayed reward
    if agent.aset.size() > 0:  # if the action set is not empty
        agent.apply_reward(agent.gamma * max(agent.pa))
```

If no normal reward is given, then assign a delayed one. To fully understand these tricks, you probably need to learn the XCS-RC algorithm. Some notes on the right column might help a little, but we'll see. For now, I suppose there is only one thing left to get the code up and running: handling `done == True`.

```python
    if done:
        t += 1  # +1 since it starts from 0
        break   # break the loop
# episode ends here

print("Finished after {} timesteps.".format(t))
```
Getting a little tough?
Don't worry about it. A little reading and some coding exercise will help you understand it better and improve the algorithms. Remember to let the agent learn for hundreds of episodes or more, instead of only one like in the example here. Then, it might become a smarter taxi learner! 😎
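For reference, a rough sketch of such a multi-episode run is shown below. It simply stitches the pieces above together; the episode count is arbitrary, and the re-evaluation of the state right after the step is my own reading of how the normal-reward comparison is meant to work (assuming, as the text implies, that applying a reward clears the current action set).

```python
# Rough sketch of a multi-episode Taxi run, stitched from the pieces above.
# Assumptions: 500 episodes is arbitrary; applying a reward clears the action
# set; the state is re-evaluated after the step for the normal-reward check.
for episode in range(500):
    obs = list(env.decode(env.reset()))
    for t in range(200):
        state = get_input(obs)

        agent.build_matchset(state)
        action = agent.pick_action_winner(2)       # explore_it
        if agent.aset.size() > 0:                  # delayed reward, if any is pending
            agent.apply_reward(agent.gamma * max(agent.pa))
        agent.build_actionset(action)

        prev = state
        obs, _, done, _ = env.step(action)
        obs = list(env.decode(obs))
        state = get_input(obs)                     # re-evaluate after the step (assumed)

        if (prev[2] < 4 and state[2] == 4) or done:  # pick-up done or passenger delivered
            agent.apply_reward(agent.maxreward)
        elif prev == state:                          # no progress, e.g., hit a wall
            agent.apply_reward(0.0)

        if done:
            print("Episode {} finished after {} timesteps.".format(episode, t + 1))
            break
```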
XCS-RC and the environment have to be initialized. Here, the locations of the passenger and the target are also initialized.

```python
import xcs_rc
import gym

agent = xcs_rc.Agent(num_actions=6, predtol=5.0)
agent.reward_map(max=100.0, projected=81.0)
agent.beta = 0.7   # learning rate
agent.gamma = 0.8  # discount factor for multistep

env = gym.make('Taxi-v2')
loc = [[0, 0], [0, 4], [4, 0], [4, 3]]
```
An environmental state here consists of only a number in [0..499] and therefore should be decoded using `env.decode(state)`. The result is an array of four elements:

- Taxi: row
- Taxi: column
- Passenger: location
- Drop-off: location

The passenger's location is initially in [0..3]. Once the passenger enters the taxi, the value becomes 4.
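As a small round-trip illustration (assuming the environment also exposes `env.encode()`, the counterpart of `env.decode()` in Gym's Taxi implementation):

```python
# Encode a known situation, then decode it back into its four elements
n = env.encode(3, 1, 2, 0)    # taxi at row 3, col 1; passenger at 2; destination 0
print(n)                      # a single number in [0..499]
print(list(env.decode(n)))    # [3, 1, 2, 0]
```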
It is a common technique in RL to guide a learning agent through a set of expected steps. For instance, if reaching the finish earns a maximal reward of 100.0, then moving to a cell one step away from the finish would be rewarded with `agent.gamma` x 100.0, which in this case is 80.0. Calculations for the other cells follow the same pattern, as illustrated below. In other words, the current reward is assigned after calculating the highest next reward possible for the new cell, discounted by `agent.gamma`.
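A small numeric sketch of that propagation, using the gamma of 0.8 and the 100.0 maximum from the initialization above:

```python
# How the back-propagated reward shrinks with the distance to the target
gamma, max_reward = 0.8, 100.0
for steps_away in range(5):
    print(steps_away, round(gamma ** steps_away * max_reward, 2))
# 0 100.0
# 1 80.0
# 2 64.0
# 3 51.2
# 4 40.96
```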
Since the provided display is a text world instead of graphics, it might not be worth recording a video. Instead, this figure of the number of inefficient steps might give you an impression of how the taxi behaves.