Markov Decision Process: how does value iteration work?

Time: 2020-12-23 11:43:42

I've been reading a lot about Markov Decision Processes (using value iteration) lately, but I simply can't get my head around them. I've found a lot of resources on the Internet / in books, but they all use mathematical formulas that are way too complex for my current abilities.

Since this is my first year at college, I've found that the explanations and formulas provided on the web use notions / terms that are way too complicated for me and they assume that the reader knows certain things that I've simply never heard of.

I want to use it on a 2D grid (filled with walls (impassable), coins (desirable) and enemies that move (which must be avoided at all costs)). The whole goal is to collect all the coins without touching the enemies, and I want to create an AI for the main player using a Markov Decision Process (MDP). Here is what it partially looks like (note that the game-related aspect is not so much of a concern here; I just really want to understand MDPs in general):

(screenshot of the 2D grid game)

From what I understand, a crude simplification of MDPs is that they can create a grid which holds the direction we need to go in (kind of a grid of "arrows" pointing where we need to go, starting at a certain position on the grid) to get to certain goals and avoid certain obstacles. Specific to my situation, that would mean that it allows the player to know in which direction to go to collect the coins and avoid the enemies.

Now, using MDP terms, it would mean that it creates a collection of states (the grid) which holds certain policies (the action to take: up, down, right, left) for a certain state (a position on the grid). The policies are determined by the "utility" value of each state, which itself is calculated by evaluating how beneficial getting there would be in the short and long term.

Is this correct? Or am I completely on the wrong track?

I'd at least like to know what the variables from the following equation represent in my situation:

U(s) = R(s) + gamma * max_a Σ_s' T(s, a, s') * U(s')

(taken from the book "Artificial Intelligence: A Modern Approach" by Russell & Norvig)

I know that s would be a list of all the squares from the grid, a would be a specific action (up / down / right / left), but what about the rest?

How would the reward and utility functions be implemented?

It would be really great if someone knew of a simple link which shows pseudo-code for implementing a basic version similar to my situation, explained very slowly, because I don't even know where to start here.

Thank you for your precious time.

(Note: feel free to add / remove tags or tell me in the comments if I should give more details about something or anything like that.)

4 Answers

#1


35  

Yes, the mathematical notation can make it seem much more complicated than it is. Really, it is a very simple idea. I have implemented a value iteration demo applet that you can play with to get a better idea.

Basically, let's say you have a 2D grid with a robot in it. The robot can try to move North, South, East, West (those are the actions a) but, because its left wheel is slippery, when it tries to move North there is only a .9 probability that it will end up at the square North of it, while there is a .1 probability that it will end up at the square West of it (similarly for the other 3 actions). These probabilities are captured by the T() function. Namely, T(s,A,s') will look like:

s    A      s'     T    //x=0,y=0 is at the top-left of the screen
x,y  North  x,y+1  .9   //we do move north
x,y  North  x-1,y  .1   //wheels slipped, so we move West
x,y  East   x+1,y  .9
x,y  East   x,y-1  .1
x,y  South  x,y+1  .9
x,y  South  x-1,y  .1 
x,y  West   x-1,y  .9
x,y  West   x,y+1  .1 
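
As a rough illustration (not part of the original answer), that transition model could be written as a small helper that returns the possible next states together with their probabilities. The GridPos, Action and Transition names below are made up for this sketch:

import java.util.List;

// Hypothetical types, used only for this sketch.
record GridPos(int x, int y) {}
enum Action { NORTH, SOUTH, EAST, WEST }
record Transition(GridPos next, double prob) {}

final class TransitionModel {
    // T(s, a, .): with probability .9 the intended move happens, with
    // probability .1 the slippery wheel shifts the robot sideways,
    // following the same pattern as the rows of the table above.
    static List<Transition> T(GridPos s, Action a) {
        switch (a) {
            case NORTH: return List.of(new Transition(new GridPos(s.x(), s.y() + 1), 0.9),
                                       new Transition(new GridPos(s.x() - 1, s.y()), 0.1));
            case EAST:  return List.of(new Transition(new GridPos(s.x() + 1, s.y()), 0.9),
                                       new Transition(new GridPos(s.x(), s.y() - 1), 0.1));
            // SOUTH and WEST are filled in from the remaining rows of the table in the same way.
            default:    throw new UnsupportedOperationException("fill in from the table above");
        }
    }
}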

You then set the Reward to be 0 for all states, but 100 for the goal state, that is, the location you want the robot to get to.

What value-iteration does is it starts by giving a Utility of 100 to the goal state and 0 to all the other states. Then on the first iteration this 100 of utility gets distributed back 1 step from the goal, so all states that can get to the goal state in 1 step (all 4 squares right next to it) will get some utility. Namely, they will get a Utility equal to the probability that from that state we can get to the goal state. We then continue iterating; at each step we move the utility back 1 more step away from the goal.

In the example above, say you start with R(5,5)= 100 and R(.) = 0 for all other states. So the goal is to get to 5,5.

On the first iteration we set

R(5,6) = gamma * (.9 * 100) + gamma * (.1 * 100)

because on 5,6 if you go North there is a .9 probability of ending up at 5,5, while if you go West there is a .1 probability of ending up at 5,5.

Similarly for (5,4), (4,5), (6,5).

All other states remain with U = 0 after the first iteration of value iteration.

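If it helps, here is a minimal value-iteration sketch in Java in the spirit of this answer. The grid size, the discount factor gamma = 0.9, the number of sweeps and the handling of off-grid moves are assumptions made only for this illustration:

public class ValueIterationSketch {
    static final int N = 10;          // grid size (assumption for this sketch)
    static final double GAMMA = 0.9;  // discount factor (assumption)
    // the four compass moves and, for each, the "slip" outcome of the bad wheel
    static final int[][] MOVE = { {0, 1}, {1, 0}, {0, -1}, {-1, 0} };
    static final int[][] SLIP = { {-1, 0}, {0, 1}, {1, 0}, {0, -1} };

    public static void main(String[] args) {
        double[][] R = new double[N][N];
        R[5][5] = 100;                    // reward only at the goal, as in the example above
        double[][] U = new double[N][N];  // utilities start at 0

        for (int sweep = 0; sweep < 100; sweep++) {
            double[][] next = new double[N][N];
            for (int x = 0; x < N; x++) {
                for (int y = 0; y < N; y++) {
                    double best = Double.NEGATIVE_INFINITY;
                    for (int a = 0; a < 4; a++) {
                        // expected utility of action a: .9 * intended cell + .1 * slip cell
                        double expected = 0.9 * u(U, x + MOVE[a][0], y + MOVE[a][1])
                                        + 0.1 * u(U, x + SLIP[a][0], y + SLIP[a][1]);
                        best = Math.max(best, expected);
                    }
                    // Bellman update: U(s) = R(s) + gamma * max_a sum_s' T(s,a,s') * U(s')
                    next[x][y] = R[x][y] + GAMMA * best;
                }
            }
            U = next;
        }
        System.out.println("U(5,5) = " + U[5][5] + ", U(5,6) = " + U[5][6]);
    }

    // utility of a neighbouring cell; off-grid cells simply contribute 0 here (assumption)
    static double u(double[][] U, int x, int y) {
        if (x < 0 || y < 0 || x >= N || y >= N) return 0;
        return U[x][y];
    }
}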

#2


5  

I would recommend using Q-learning for your implementation.

Maybe you can use this post I wrote as an inspiration. This is a Q-learning demo with Java source code. This demo is a map with 6 fields and the AI learns where it should go from every state to get to the reward.

Q-learning is a technique for letting the AI learn by itself by giving it reward or punishment.

This example shows the Q-learning used for path finding. A robot learns where it should go from any state.

The robot starts at a random place. It keeps a memory of the score while it explores the area, and whenever it reaches the goal, we repeat with a new random start. After enough repetitions the score values will be stationary (convergence).

In this example the action outcome is deterministic (the transition probability is 1) and the action selection is random. The score values are calculated by the Q-learning algorithm Q(s,a).
The image below shows the states (A,B,C,D,E,F), the possible actions from each state and the reward given.

(image: the states A-F, the possible actions between them, and the rewards given)

Result Q*(s,a)
(image: the resulting Q*(s,a) values)

Policy Π*(s)
(image: the resulting policy Π*(s))

Qlearning.java

import java.text.DecimalFormat;
import java.util.Random;

/**
 * @author Kunuk Nykjaer
 */
public class Qlearning {
    final DecimalFormat df = new DecimalFormat("#.##");

    // path finding
    final double alpha = 0.1;
    final double gamma = 0.9;


// states A,B,C,D,E,F
// e.g. from A we can go to B or D
// from C we can only go to C
// C is goal state, reward 100 when B->C or F->C
//
// _______
// |A|B|C|
// |_____|
// |D|E|F|
// |_____|
//

    final int stateA = 0;
    final int stateB = 1;
    final int stateC = 2;
    final int stateD = 3;
    final int stateE = 4;
    final int stateF = 5;

    final int statesCount = 6;
    final int[] states = new int[]{stateA,stateB,stateC,stateD,stateE,stateF};

    // http://en.wikipedia.org/wiki/Q-learning
    // http://people.revoledu.com/kardi/tutorial/ReinforcementLearning/Q-Learning.htm

    // Q(s,a)= Q(s,a) + alpha * (R(s,a) + gamma * Max(next state, all actions) - Q(s,a))

    int[][] R = new int[statesCount][statesCount]; // reward lookup
    double[][] Q = new double[statesCount][statesCount]; // Q learning

    int[] actionsFromA = new int[] { stateB, stateD };
    int[] actionsFromB = new int[] { stateA, stateC, stateE };
    int[] actionsFromC = new int[] { stateC };
    int[] actionsFromD = new int[] { stateA, stateE };
    int[] actionsFromE = new int[] { stateB, stateD, stateF };
    int[] actionsFromF = new int[] { stateC, stateE };
    int[][] actions = new int[][] { actionsFromA, actionsFromB, actionsFromC,
            actionsFromD, actionsFromE, actionsFromF };

    String[] stateNames = new String[] { "A", "B", "C", "D", "E", "F" };

    public Qlearning() {
        init();
    }

    public void init() {       
        R[stateB][stateC] = 100; // from b to c
        R[stateF][stateC] = 100; // from f to c    
    }

    public static void main(String[] args) {
        long BEGIN = System.currentTimeMillis();

        Qlearning obj = new Qlearning();

        obj.run();
        obj.printResult();
        obj.showPolicy();

        long END = System.currentTimeMillis();
        System.out.println("Time: " + (END - BEGIN) / 1000.0 + " sec.");
    }

    void run() {
        /*
         1. Set parameters (alpha, gamma) and the environment reward matrix R
         2. Initialize matrix Q as a zero matrix
         3. For each episode: select a random initial state
            Do while the goal state has not been reached:
                Select one among all possible actions for the current state
                Using this possible action, consider going to the next state
                Get the maximum Q value of this next state over all its possible actions
                Compute Q(s,a) = Q(s,a) + alpha * (R(s,a) + gamma * maxQ(nextState) - Q(s,a))
                Set the next state as the current state
         */

        // For each episode
        Random rand = new Random();
        for (int i = 0; i < 1000; i++) { // train episodes
            // Select random initial state
            int state = rand.nextInt(statesCount);
            while (state != stateC) // goal state
            {
                // Select one among all possible actions for the current state
                int[] actionsFromState = actions[state];

                // Selection strategy is random in this example
                int index = rand.nextInt(actionsFromState.length);
                int action = actionsFromState[index];

                // Action outcome is set to deterministic in this example
                // Transition probability is 1
                int nextState = action; // data structure

                // Using this possible action, consider to go to the next state
                double q = Q(state, action);
                double maxQ = maxQ(nextState);
                int r = R(state, action);

                double value = q + alpha * (r + gamma * maxQ - q);
                setQ(state, action, value);

                // Set the next state as the current state
                state = nextState;
            }
        }
    }

    double maxQ(int s) {
        int[] actionsFromState = actions[s];
        double maxValue = Double.NEGATIVE_INFINITY; // MIN_VALUE is the smallest positive double, not the lowest value
        for (int i = 0; i < actionsFromState.length; i++) {
            int nextState = actionsFromState[i];
            double value = Q[s][nextState];

            if (value > maxValue)
                maxValue = value;
        }
        return maxValue;
    }

    // get policy from state
    int policy(int state) {
        int[] actionsFromState = actions[state];
        double maxValue = Double.MIN_VALUE;
        int policyGotoState = state; // default goto self if not found
        for (int i = 0; i < actionsFromState.length; i++) {
            int nextState = actionsFromState[i];
            double value = Q[state][nextState];

            if (value > maxValue) {
                maxValue = value;
                policyGotoState = nextState;
            }
        }
        return policyGotoState;
    }

    double Q(int s, int a) {
        return Q[s][a];
    }

    void setQ(int s, int a, double value) {
        Q[s][a] = value;
    }

    int R(int s, int a) {
        return R[s][a];
    }

    void printResult() {
        System.out.println("Print result");
        for (int i = 0; i < Q.length; i++) {
            System.out.print("out from " + stateNames[i] + ":  ");
            for (int j = 0; j < Q[i].length; j++) {
                System.out.print(df.format(Q[i][j]) + " ");
            }
            System.out.println();
        }
    }

    // policy is maxQ(states)
    void showPolicy() {
        System.out.println("\nshowPolicy");
        for (int i = 0; i < states.length; i++) {
            int from = states[i];
            int to =  policy(from);
            System.out.println("from "+stateNames[from]+" goto "+stateNames[to]);
        }          
    }
}

Print result

out from A:  0 90 0 72,9 0 0
out from B:  81 0 100 0 81 0
out from C:  0 0 0 0 0 0
out from D:  81 0 0 0 81 0
out from E:  0 90 0 72,9 0 90
out from F:  0 0 100 0 81 0

showPolicy
from a goto B
from b goto C
from c goto C
from d goto A
from e goto B
from f goto C
Time: 0.025 sec.
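
As a quick sanity check (not part of the original post): at convergence the update in run() settles at Q(s,a) = R(s,a) + gamma * maxQ(s'), which matches the printed table:

Q(B,C) = 100 + 0.9 * maxQ(C) = 100 + 0.9 * 0   = 100
Q(A,B) =   0 + 0.9 * maxQ(B) =   0 + 0.9 * 100 = 90
Q(A,D) =   0 + 0.9 * maxQ(D) =   0 + 0.9 * 81  = 72.9

(The output shows 72,9 instead of 72.9 only because DecimalFormat formats numbers with the default locale's decimal separator.)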

#3


4  

Not a complete answer, but a clarifying remark.

The state is not a single cell. The state contains, all at once, the information about what is in every relevant cell. This means one state element encodes which cells are solid and which are empty, which ones contain monsters, where the coins are, and where the player is.

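To make that concrete, here is one possible, purely illustrative shape for such a state in Java (the names are made up, not taken from this answer):

import java.util.Set;

// One MDP state bundles everything relevant, all at once: where the player is,
// where the moving enemies are, which coins are still uncollected, and
// (if you choose to include them) the static walls.
record Cell(int x, int y) {}
record GameState(Cell player,
                 Set<Cell> enemies,
                 Set<Cell> remainingCoins,
                 Set<Cell> walls) {}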

Maybe you could use a map from each cell to its content as state. This does ignore the movement of monsters and player, which are probably very important, too.

The details depend on how you want to model your problem (deciding what belongs to the state and in which form).

Then a policy maps each state to an action like left, right, jump, etc.

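In code, such a policy is just a function from state to action; a minimal illustrative shape could be:

// A policy picks one action for every state it is given.
enum Move { UP, DOWN, LEFT, RIGHT }

interface Policy<S> {
    Move actionFor(S state);
}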

First you must understand the problem that is expressed by an MDP before thinking about how algorithms like value iteration work.

#4


2  

I know this is a fairly old post, but I came across it when looking for MDP-related questions, and I did want to add (for folks coming in here) a few more comments about what you stated "s" and "a" to be.

I think for a you are absolutely correct: it's your list of [up, down, left, right].

However, for s it's really a location in the grid, and s' is a location you can go to. What that means is that you pick a state, then you pick a particular s' and go through all the actions that can take you to that s-prime, which you use to figure out those values (pick a max out of those). Finally you go on to the next s' and do the same thing; when you've exhausted all the s' values, you find the max of what you just finished searching over.

Suppose you picked a grid cell in the corner; you'd only have 2 states you could possibly move to (assuming the bottom-left corner). Depending on how you choose to "name" your states, we could in this case assume a state is an x,y coordinate, so your current state s is 1,1 and your s' (or s-prime) list is x+1,y and x,y+1 (no diagonals in this example). (The summation part goes over all s'.)

Also, you don't have it listed in your equation, but the max is over a, i.e. the action that gives you the max; so first you pick the s' that gives you the max, and then within that you pick the action (at least this is my understanding of the algorithm).

So if you had

x,y+1 left = 10
x,y+1 right = 5

x+1,y left = 3
x+1,y right = 2

You'll pick x,y+1 as your s', but then you'll need to pick the action that is maximized, which in this case is left for x,y+1. I'm not sure if there is a subtle difference between just finding the maximum number and finding the state and then the maximum number, though, so maybe someone can clarify that for me someday.

If your movements are deterministic (meaning if you say go forward, you go forward with 100% certainty), then it's pretty easy: you have one action. However, if they are non-deterministic, say you have 80% certainty, then you should consider the other actions which could get you there. This is the context of the slippery wheel that Jose mentioned above.

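A tiny numeric sketch of that point, with made-up utilities for the four neighbouring cells and a common 80/10/10 split (80% chance of the intended move, 10% for each perpendicular slip; none of these numbers come from the thread):

public class ExpectedActionValue {
    public static void main(String[] args) {
        // made-up utilities of the four neighbouring cells
        double up = 10, left = 3, right = 5, down = 2;

        // non-deterministic moves: 80% intended direction, 10% each perpendicular slip
        double qUp    = 0.8 * up    + 0.1 * left + 0.1 * right;
        double qLeft  = 0.8 * left  + 0.1 * up   + 0.1 * down;
        double qRight = 0.8 * right + 0.1 * up   + 0.1 * down;
        double qDown  = 0.8 * down  + 0.1 * left + 0.1 * right;

        // value iteration keeps the max over actions; the policy keeps the best action itself
        double best = Math.max(Math.max(qUp, qDown), Math.max(qLeft, qRight));
        System.out.println("best expected utility = " + best); // about 8.8, from going up
    }
}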

I don't want to detract from what others have said, but just to give some additional information.
