Reinforcement Learning: Q-Learning

Date: 2022-01-01 04:28:38

A First Look at Reinforcement Learning, Part 1

This post collects my notes from reading two blog posts by Travis DeWolf (a postdoctoral researcher in Systems Design Engineering at the University of Waterloo, Canada): REINFORCEMENT LEARNING PART 1: Q-LEARNING AND EXPLORATION and REINFORCEMENT LEARNING PART 2: SARSA VS Q-LEARNING.

1. Q-Learning and exploration

1.1 Q-learning review

    Representation:

        s  -  environment states

        a  -  possible actions in those states

        Q  -  state-action values: the value/quality of each of those actions in each of those states

    In Q-learning, you start by setting all state-action values (Q values) to 0 (in the simplest implementation), and then go around and explore the state-action space.
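As a concrete illustration of that starting point, a tabular Q store that defaults every state-action value to 0 might look like the minimal sketch below; the class and method names are my own assumptions, not code from DeWolf's posts.

from collections import defaultdict

class QTable:
    """Minimal tabular Q store: every unseen (state, action) pair starts at 0."""
    def __init__(self, actions):
        self.actions = actions
        self.q = defaultdict(float)   # maps (state, action) -> value, 0.0 by default

    def getQ(self, state, action):
        return self.q[(state, action)]

    def setQ(self, state, action, value):
        self.q[(state, action)] = value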

    After you try an action in a state, you evaluate the state that it has led to. If it has led to an undesirable outcome, you reduce the Q value (or weight) of that action from that state, so that other actions will have a greater value and be chosen instead the next time you’re in that state. Similarly, if you’re rewarded for taking a particular action, the weight of that action for that state is increased, so you’re more likely to choose it again the next time you’re in that state. Importantly, when you update Q, you’re updating it for the previous state-action combination. You can only update Q after you’ve seen the result.

Example: cat, mouse, cheese

    state (a cat is in front of you), action (go forward) -> reduce the Q value (the weight of that action for that state)

    So that the next time the cat is in front of the mouse, the mouse won’t choose to go forward and might choose to go to the side or away from the cat instead (you are a mouse with respawning powers). Note that this doesn’t reduce the value of moving forward when there is no cat in front of you, the value for ‘move forward’ is only reduced in the situation that there’s a cat in front of you. 

    In the opposite case, s (the cheese is in front of you), a (move forward) -> you get rewarded, so the Q value is increased

   So now the next time you’re in the situation (state) that there’s cheese in front of you, the action ‘move forward’ is more likely to be chosen, because last time you chose it you were rewarded.

    The agent has no foresight further than one time step; the update only incorporates a one-step look-ahead value.

    When we’re updating a given Q value for the state-action combination we just experienced, we do a search over all the Q values for the state that we ended up in. We find the maximum state-action value in this state, and incorporate that into our update of the Q value representing the state-action combination we just experienced.

a. [figure]

b. [figure]

c. [figure]

d. Now when we are updating the Q value for the previous state-action combination we look at all the Q values for the state ‘cheese is one step ahead’. We see that one of these values is high (this being for the action ‘move forward’) and this value is incorporated in the update of the previous state-action combination.

[figure]

    Specifically we’re updating the previous state-action value using the equation:

    Q(s, a) += alpha * (reward(s, a) + max(Q(s')) - Q(s, a))

    where s is the previous state, a is the previous action, s' is the current state, and alpha is the learning rate (set to .5 here).

    Intuitively, the change in the Q-value for performing action a in state s is the difference between the actual reward (reward(s,a) + max(Q(s'))) and the expected reward (Q(s,a)) multiplied by a learning rate, alpha. You can think of this as a kind of PD control, driving your system to the target, which is in this case the correct Q-value.

    Here, we evaluate the reward of moving ahead when the cheese is two steps ahead as the reward for moving into that state (0), plus the reward of the best action from that state (moving into the cheese +50), minus the expected value of that state (0), multiplied by our learning rate (.5) = +25
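To make that arithmetic explicit, here is a tiny worked sketch of the same update; the +50 for reaching the cheese and the 0.5 learning rate come from the example above, while the variable names are just illustrative:

alpha = 0.5        # learning rate from the example
reward = 0         # reward for stepping into the empty square ahead
max_q_next = 50    # value of the best action from the new state: moving onto the cheese
q_current = 0      # current estimate of Q(state, 'move forward')

q_current += alpha * (reward + max_q_next - q_current)
print(q_current)   # 25.0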

1.2 Exploration

    In the most straightforward implementation of Q-learning, state-action values are stored in a look-up table. So we have a giant table of size N x M, where N is the number of different possible states and M is the number of different possible actions. So then at decision time we simply go to that table, look up the corresponding action values for that state, and choose the maximum. In code:

def choose_action(self, state):
    q = [self.getQ(state, a) for a in self.actions]
    maxQ = max(q)
    i = q.index(maxQ)         # index of the highest-valued action
    action = self.actions[i]
    return action

    Almost. There are a couple of additional things we need to add. First, we need to cover the case where several actions all share the same maximum value; when that happens, we randomly choose one of them.

def choose_action(self, state):
    q = [self.getQ(state, a) for a in self.actions]
    maxQ = max(q)
    count = q.count(maxQ)
    if count > 1:
        # several actions share the maximum value; pick one of them at random
        best = [i for i in range(len(self.actions)) if q[i] == maxQ]
        i = random.choice(best)
    else:
        i = q.index(maxQ)

    action = self.actions[i]
    return action

    This helps us out of that situation, but now if we ever happen upon a decent option, we’ll always choose that one in the future, even if there is a way better option available. To overcome this, we’re going to need to introduce exploration. The standard way to get exploration is to introduce an additional term, epsilon. We then randomly generate a value, and if that value is less than epsilon, a random action is chosen, instead of following our normal tactic of choosing the max Q value.

def choose_action(self, state):
    if random.random() < self.epsilon:
        # exploration: ignore the Q values and pick a completely random action
        action = random.choice(self.actions)
    else:
        q = [self.getQ(state, a) for a in self.actions]
        maxQ = max(q)
        count = q.count(maxQ)
        if count > 1:
            best = [i for i in range(len(self.actions)) if q[i] == maxQ]
            i = random.choice(best)
        else:
            i = q.index(maxQ)

        action = self.actions[i]
    return action

    The problem now is that even after we’ve explored all the options and know for sure which one is best, we still sometimes choose a random action; the exploration doesn’t turn off. There are a number of ways to overcome this, most involving manually adjusting epsilon as the mouse learns, so that it explores less and less as time passes and it has presumably learned the best actions for the majority of situations.
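One common way to do that manual adjustment is to multiply epsilon by a decay factor after every episode. The numbers below are arbitrary and the snippet is only a sketch of the idea, not something from DeWolf's posts:

epsilon = 0.1         # initial exploration rate
epsilon_min = 0.01    # keep a small amount of exploration forever
decay = 0.999         # multiplicative decay applied after each episode

for episode in range(10000):
    # ... run one episode of Q-learning with the current epsilon ...
    epsilon = max(epsilon_min, epsilon * decay)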

    Travis is not a big fan of this, so instead he has implemented exploration the following way: if the randomly generated value is less than epsilon, then randomly add values to the Q values for this state, scaled by the maximum Q value of this state. In this way exploration is added, but you’re still using your learned Q values as a basis for choosing your action, rather than just choosing an action completely at random.

def choose_action(self, state):
    q = [self.getQ(state, a) for a in self.actions]
    maxQ = max(q)

    if random.random() < self.epsilon:
        # exploration: add random values to the Q values for this state,
        # scaled by the magnitude of the Q values, then re-take the max
        minQ = min(q)
        mag = max(abs(minQ), abs(maxQ))
        q = [q[i] + random.random() * mag - 0.5 * mag
             for i in range(len(self.actions))]
        maxQ = max(q)

    count = q.count(maxQ)
    if count > 1:
        best = [i for i in range(len(self.actions)) if q[i] == maxQ]
        i = random.choice(best)
    else:
        i = q.index(maxQ)

    action = self.actions[i]

    return action

    Very pleasingly, you get results comparable to the case where you perform lots of learning and then set epsilon = 0 to turn off random movements.

2. SARSA vs Q-learning

2.1 SARSA

    SARSA stands for State-Action-Reward-State-Action. In SARSA, the agent starts in state 1, performs action 1, and gets a reward (reward 1). Now, it’s in state 2 and performs another action (action 2) and gets the reward from this state (reward 2) before it goes back and updates the value of action 1 performed in state 1. 

    In contrast, in Q-learning the agent starts in state 1, performs action 1 and gets a reward (reward 1), and then looks and sees what the maximum possible reward for an action in state 2 is, and uses that to update the action value of performing action 1 in state 1. So the difference is in the way the future reward is found. In Q-learning it’s simply the value of the best action that can be taken from state 2, and in SARSA it’s the value of the action that was actually taken.

    This means that SARSA takes into account the control policy by which the agent is moving, and incorporates that into its update of action values, whereas Q-learning simply assumes that an optimal policy is being followed. This difference can be a little difficult to tease out conceptually at first, but hopefully the sketch and the example below will make it clear.
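As a rough sketch with made-up numbers (none of these values or names come from the original posts), the two future-reward terms can be contrasted like this:

# Toy values purely to illustrate the two update targets.
gamma = 0.9
reward = 0.0
actions = ['left', 'right', 'forward']
state2 = 's2'
action2 = 'left'   # the action the agent actually took from state2
Q = {('s2', 'left'): 1.0, ('s2', 'right'): 5.0, ('s2', 'forward'): 2.0}

# Q-learning: assume the best available action will be taken from state2
q_learning_target = reward + gamma * max(Q[(state2, a)] for a in actions)   # 4.5

# SARSA: use the action that actually was taken from state2
sarsa_target = reward + gamma * Q[(state2, action2)]                        # 0.9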

2.2 Example: Mouse vs cliff

    Let’s look at a simple scenario: a mouse is trying to get to a piece of cheese (the green block). Additionally, there is a cliff in the map that must be avoided, or the mouse falls, gets a negative reward, and has to start back at the beginning. The simulation looks like this:

[figure: the simulation map]
    where the black is the edge of the map (walls), the red is the cliff area, the blue is the mouse and the green is the cheese. 

    Now, as we all remember, under the basic Q-learning control policy the action chosen is the one with the highest action value. However, there is also a chance that some random action will be chosen; this is the agent’s built-in exploration mechanism. This means that even if we see this scenario:

[figure: the mouse at the edge of the cliff]
    There is a chance that the mouse is going to say ‘yes I see the best move, but…the hell with it’ and jump over the edge! All in the name of exploration. This becomes a problem, because if the mouse were following an optimal control strategy, it would simply run right along the edge of the cliff all the way over to the cheese and grab it. Q-learning assumes that the mouse is following the optimal control strategy, so the action values will converge such that the best path is along the cliff. Here’s an animation of the result of running the Q-learning code for a long time:

[animation: the Q-learning result]
    The solution that the mouse ends up with is running along the edge of the cliff and occasionally jumping off and plummeting to its death.

    However, if the agent’s actual control strategy is taken into account when learning, something very different happens. Here is the result of the mouse learning to find its way to the cheese using SARSA:

[animation: the SARSA result]
    Why, that’s much better! The mouse has learned that from time to time it does really foolish things, so the best path is not to run along the edge of the cliff straight to the cheese but to get far away from the cliff and then work its way over safely. As you can see, even if a random action is chosen there is little chance of it resulting in death.

2.3 Learning action values with SARSA

    So now we know how SARSA determines its updates to the action values. It’s a very minor difference between the SARSA and Q-learning implementations, but it causes a profound effect.

    Here is the Q-learning learn method:

def learn(self, state1, action1, reward, state2):
    maxqnew = max([self.getQ(state2, a) for a in self.actions])
    self.learnQ(state1, action1,
                reward, reward + self.gamma * maxqnew)

    And here is the SARSA learn method:

def learn(self, state1, action1, reward, state2, action2):
    qnext = self.getQ(state2, action2)
    self.learnQ(state1, action1,
                reward, reward + self.gamma * qnext)

    As we can see, the SARSA method takes another parameter, action2, which is the action that was taken by the agent from the second state. This allows the agent to explicitly find the future reward value, qnext, that followed, rather than assuming that the optimal action will be taken and that the largest reward, maxqnew, resulted.

    Written out, the Q-learning update is Q(s, a) = reward(s) + gamma * max(Q(s')), and the SARSA update is Q(s, a) = reward(s) + gamma * Q(s', a'). This is how SARSA is able to take into account the control policy of the agent during learning. It means that information needs to be stored longer before the action values can be updated, but it also means that our mouse is going to jump off a cliff much less frequently, which we can probably all agree is a good thing.
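To see where that extra piece of information comes from in practice, here is a minimal sketch of a SARSA-style interaction loop. The environment interface (env.reset, and env.step returning a reward, the next state, and a done flag) and the agent method names are assumptions layered on top of the learn method above, not code from the original posts:

def run_sarsa_episode(agent, env):
    """Run one SARSA episode: the next action is chosen before the update so it can be used in it."""
    state1 = env.reset()
    action1 = agent.choose_action(state1)
    done = False
    while not done:
        reward, state2, done = env.step(action1)       # assumed environment interface
        action2 = agent.choose_action(state2)          # pick action2 *before* learning
        agent.learn(state1, action1, reward, state2, action2)
        state1, action1 = state2, action2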