How do rewards in those two RL techniques work? I mean, they both improve the policy and its evaluation, but not the rewards. Do I need to guess them from the beginning?
2 Answers
#1
0
You don't need to guess the rewards. A reward is feedback from the environment, and the rewards are parameters of the environment. The algorithm works under the assumption that the agent can observe only this feedback, the state space, and the action space.
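To make that concrete, here is a minimal sketch (in Python) of the interaction loop; the `agent` and `env` objects and their method names are hypothetical, and the point is only that the reward arrives as a return value from the environment rather than something the agent has to define:

```python
# Hypothetical interaction loop: the agent sees states, picks actions, and
# receives the reward as feedback from the environment. The reward function
# itself stays hidden inside env.
def run_episode(env, agent, max_steps=100):
    state = env.reset()                               # agent observes a state
    for _ in range(max_steps):
        action = agent.act(state)                     # agent picks an action
        next_state, reward, done = env.step(action)   # reward comes back as feedback
        agent.learn(state, action, reward, next_state)
        state = next_state
        if done:
            break
```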
The key idea of Q-learning and TD is asynchronous stochastic approximation, where we approximate the fixed point of the Bellman operator using noisy evaluations of the expected long-term reward.
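As a rough illustration, assuming a tabular Q-table stored as a dict of dicts (state -> {action: value}), a single Q-learning update nudges the current estimate toward one noisy sample of the Bellman target:

```python
# One tabular Q-learning update: r + gamma * max_a' Q(s', a') is a noisy
# sample of the Bellman target, and the estimate is moved toward it with
# step size alpha (stochastic approximation).
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
```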
For example, if we want to estimate the expectation of a Gaussian distribution, we can sample from it and average the samples.
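In the same spirit, the sample-and-average analogy in code (using numpy):

```python
import numpy as np

# Estimate the mean of a Gaussian purely from noisy samples, the same way
# Q-learning averages noisy estimates of the long-term reward.
samples = np.random.normal(loc=3.0, scale=1.0, size=10_000)
print(samples.mean())   # close to the true mean 3.0
```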
#2
0
Reinforcement Learning is for problems where the AI agent has no information about the world it is operating in. So Reinforcement Learning algorithms not only give you a policy/optimal action at each state, but also navigate a completely foreign environment (with no knowledge of which action will lead to which resulting state) and learn the parameters of this new environment. These are model-based Reinforcement Learning algorithms.
Now, Q-Learning and Temporal Difference Learning are model-free reinforcement learning algorithms. Meaning, the AI agent does the same thing as in a model-based algorithm, but it does not have to learn the model (things like transition probabilities) of the world it is operating in. Through many iterations it comes up with a mapping from each state to the optimal action to be performed in that state.
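For instance, once the values have been learned, that state-to-action mapping is just a greedy read-off of the learned Q-table; a sketch, assuming the same dict-of-dicts layout as above:

```python
# Greedy policy extracted from a learned Q-table:
# for each state, pick the action with the highest Q-value.
def greedy_policy(Q):
    return {s: max(actions, key=actions.get) for s, actions in Q.items()}
```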
Now coming to your question: you do not have to guess the rewards at different states. Initially, when the agent is new to the environment, it just chooses a random action to perform from the state it is in and gives it to the simulator. The simulator, based on its transition functions, returns the resulting state of that state-action pair and also returns the reward for being in that state.
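A toy simulator might look like the following sketch (the environment itself is made up): the transition rules and reward values live inside it, and the agent only ever receives what step() returns.

```python
import random

# Hypothetical toy simulator: a small chain of states with a goal at the
# far right. The agent never sees these rules or rewards directly; it only
# receives the (next_state, reward, done) tuple returned by step().
class ToySimulator:
    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = move left, 1 = move right, with a little transition noise
        move = 1 if action == 1 else -1
        if random.random() < 0.1:
            move = -move
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0     # reward only at the goal state
        return self.state, reward, done
```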
The simulator is analogous to Nature in the real world. For example, if you find something unfamiliar in the world and perform some action on it, like touching it, and the thing turns out to be a hot object, Nature gives you a reward in the form of pain, so that the next time you know what happens when you try that action. While programming this, it is important to note that the inner workings of the simulator are not visible to the AI agent that is trying to learn the environment.
Now, depending on the reward that the agent senses, it backs up its Q-value (in the case of Q-Learning) or utility value (in the case of TD-Learning). Over many iterations these Q-values converge, and you are able to choose an optimal action for every state based on the Q-values of the state-action pairs.
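Putting the pieces together, a complete tabular Q-learning loop over a simulator like the toy one above might look like this sketch (epsilon-greedy exploration; the hyperparameter values are illustrative, not prescriptive):

```python
import random
from collections import defaultdict

def q_learning(env, n_actions=2, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    # Q-table: state -> {action: value}, initialised to zero.
    Q = defaultdict(lambda: {a: 0.0 for a in range(n_actions)})
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy: mostly exploit the current Q-values, sometimes explore.
            if random.random() < eps:
                action = random.randrange(n_actions)
            else:
                action = max(Q[state], key=Q[state].get)
            next_state, reward, done = env.step(action)
            # Back up the Q-value toward the sampled Bellman target.
            target = reward + (0.0 if done else gamma * max(Q[next_state].values()))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q

# Usage: learn on the toy simulator, then read off the greedy policy.
# Q = q_learning(ToySimulator())
# policy = {s: max(actions, key=actions.get) for s, actions in Q.items()}
```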