The Joys of Conjugate Priors

Date: 2023-02-17 17:51:44


(Warning: this post is a bit technical.)



Suppose you are a Bayesian reasoning agent.  While going about your daily activities, you observe an event of type x.  Because you're a good Bayesian, you have some internal parameter θ which represents your belief that x will occur.



Now, you're familiar with the Ways of Bayes, and therefore you know that your beliefs must be updated with every new datapoint you perceive.  Your observation of x is a datapoint, and thus you'll want to modify θ.  But how much should this datapoint influence θ?  Well, that will depend on how sure you are of θ in the first place.  If you calculated θ based on a careful experiment involving hundreds of thousands of observations, then you're probably pretty confident in its value, and this single observation of x shouldn't have much impact.  But if your estimate of θ is just a wild guess based on something your unreliable friend told you, then this datapoint is important and should be weighted much more heavily in your reestimation of θ.



Of course, when you reestimate θ, you'll also have to reestimate how confident you are in its value.  Or, to put it a different way, you'll want to compute a new probability distribution over possible values of θ.  This new distribution will be P(θ | x), and it can be computed using Bayes' rule:



P(θ | x) = P(x | θ) P(θ) / ∫ P(x | θ′) P(θ′) dθ′



Here, since θ is a parameter used to specify the distribution from which x is drawn, it can be assumed that computing P(x | θ) is straightforward.  P(θ) is your old distribution over θ, which you already have; it says how accurate you think different settings of the parameters are, and allows you to compute your confidence in any given value of θ.  So the numerator should be straightforward to compute; it's the denominator which might give you trouble, since for an arbitrary distribution, computing the integral is likely to be intractable.
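
To make the update concrete, here is a minimal sketch of Bayes' rule for a one-dimensional parameter on [0, 1], approximating the troublesome integral in the denominator with a sum over a grid.  The uniform prior and the simple likelihood are illustrative assumptions, not part of the post's argument:

```python
# Numerical Bayes update on a grid, for a one-dimensional parameter
# theta in [0, 1]. The prior and likelihood below are illustrative
# assumptions, not prescribed by the post.

def bayes_update(prior, likelihood, thetas):
    """Return the normalized posterior P(theta | x) on a grid of thetas."""
    numerator = [likelihood(t) * p for t, p in zip(thetas, prior)]
    z = sum(numerator)  # grid approximation of the integral in the denominator
    return [v / z for v in numerator]

n = 1001
thetas = [i / (n - 1) for i in range(n)]
prior = [1.0 / n] * n                # uniform prior over the grid
likelihood = lambda t: t             # P(x = 1 | theta) = theta: one observed "heads"
posterior = bayes_update(prior, likelihood, thetas)

# The grid point where the posterior is largest is the single best
# parameter setting (the maximum a posteriori estimate).
theta_map = max(zip(posterior, thetas))[1]
```

On a grid this is tractable; the trouble described above arises for higher-dimensional parameters, where the grid (and hence the integral) blows up.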



But you're probably not really looking for a distribution over different parameter settings; you're looking for a single best setting of the parameters that you can use for making predictions.  If this is your goal, then once you've computed the distribution P(θ | x), you can pick the value of θ that maximizes it.  This will be your new parameter, and because you have the formula for P(θ | x), you'll know exactly how confident you are in this parameter.



In practice, picking the value of θ which maximizes P(θ | x) is usually pretty difficult, thanks to the presence of local optima, as well as the general difficulty of optimization problems.  For simple enough distributions, you can use the EM algorithm, which is guaranteed to converge to a local optimum.  But for more complicated distributions, even this method is intractable, and approximate algorithms must be used.  Because of this concern, it's important to keep the distributions P(x | θ) and P(θ) simple.  Choosing the distribution P(x | θ) is a matter of model selection; more complicated models can capture deeper patterns in data, but will take more time and space to compute with.



It is assumed that the type of model is chosen before deciding on the form of the distribution P(θ).  So how do you choose a good distribution for P(θ)?  Notice that every time you see a new datapoint, you'll have to do the computation in the equation above.  Thus, in the course of observing data, you'll be multiplying lots of different probability distributions together.  If these distributions are chosen poorly, P(θ | x) could get quite messy very quickly.



If you're a smart Bayesian agent, then, you'll pick P(θ) to be a conjugate prior to the distribution P(x | θ).  The distribution P(θ) is conjugate to P(x | θ) if multiplying these two distributions together and normalizing results in another distribution of the same form as P(θ).



Let's consider a concrete example: flipping a biased coin.  Suppose you use the Bernoulli distribution to model your coin.  Then it has a parameter θ which represents the probability of getting heads.  Assume that the value 1 corresponds to heads, and the value 0 corresponds to tails.  Then the distribution of the outcome x of the coin flip looks like this:



P(x | θ) = θ^x (1 − θ)^(1−x)



It turns out that the conjugate prior for the Bernoulli distribution is something called the beta distribution.  It has two parameters, α and β, which we call hyperparameters because they are parameters for a distribution over our parameters.  (Eek!)



The beta distribution looks like this:



P(θ; α, β) = θ^(α−1) (1 − θ)^(β−1) / ∫₀¹ θ′^(α−1) (1 − θ′)^(β−1) dθ′



Since θ represents the probability of getting heads, it can take on any value between 0 and 1, and thus this function is normalized properly.
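
As a quick sanity check, the density can be evaluated and integrated numerically over [0, 1].  This sketch uses the gamma-function form of the normalizer (via Python's standard math.gamma); the specific hyperparameter values are just an illustration:

```python
import math

def beta_pdf(theta, alpha, beta):
    """Beta density, using the gamma-function form of the normalizing constant."""
    norm = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    return norm * theta ** (alpha - 1) * (1 - theta) ** (beta - 1)

# Sanity check: the density should integrate to 1 over [0, 1].
# Midpoint rule with a fine grid, for the arbitrary choice Beta(2, 3).
n = 100_000
total = sum(beta_pdf((i + 0.5) / n, 2.0, 3.0) for i in range(n)) / n
```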



Suppose you observe a single coin flip x and want to update your beliefs regarding θ.  Since the denominator of the beta distribution in the equation above is just a normalizing constant, you can ignore it for the moment while computing P(θ | x), as long as you promise to normalize after completing the computation:



P(θ | x) ∝ P(x | θ) P(θ; α, β) ∝ θ^x (1 − θ)^(1−x) θ^(α−1) (1 − θ)^(β−1) = θ^(x+α−1) (1 − θ)^((1−x)+β−1)



Normalizing this equation will, of course, give another beta distribution, confirming that this is indeed a conjugate prior for the bernoulli distribution.  Super cool, right?
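
In code, the conjugate update is pure bookkeeping: observing x simply turns the hyperparameters (α, β) into (α + x, β + (1 − x)).  A minimal sketch (the function name and the starting prior are illustrative choices):

```python
def update_beta(alpha, beta, x):
    """Posterior hyperparameters after one Bernoulli observation x in {0, 1}."""
    return alpha + x, beta + (1 - x)

# Start from a uniform prior Beta(1, 1) and observe heads, tails, heads.
alpha, beta = 1.0, 1.0
for flip in [1, 0, 1]:
    alpha, beta = update_beta(alpha, beta, flip)
# alpha, beta is now (3.0, 2.0): Beta(3, 2)
```

No integral ever has to be computed; the normalizing constant of the new beta distribution is known in closed form.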



If you are familiar with the binomial distribution, you should see that the numerator of the beta distribution in the equation for P(θ; α, β) looks remarkably similar to the non-factorial part of the binomial distribution.  This suggests a form for the normalization constant:



P(θ; α, β) = (Γ(α + β) / (Γ(α) Γ(β))) θ^(α−1) (1 − θ)^(β−1)



The beta and binomial distributions are almost identical.  The biggest difference between them is that the beta distribution is a function of θ, with α and β as prespecified parameters, while the binomial distribution is a function of the number of heads k, with the number of flips n and the bias θ as prespecified parameters.  It should be clear that the beta distribution is also conjugate to the binomial distribution, making it just that much awesomer.



Another difference between the two distributions is that the beta distribution uses gammas where the binomial distribution uses factorials.  Recall that the gamma function is just a generalization of the factorial to the reals; thus, the beta distribution allows α and β to be any positive real number, while the binomial distribution is only defined for integers.  As a final note on the beta distribution, the −1 in the exponents is not philosophically significant; I think it is mostly there so that the gamma functions will not contain +1s.  For more information about the mathematics behind the gamma function and the beta distribution, I recommend checking out this pdf: http://www.mhtl.uwaterloo.ca/courses/me755/web_chap1.pdf  It gives an actual derivation which shows that the first equation for P(θ; α, β) is equivalent to the second equation for P(θ; α, β), which is nice if you don't find the argument by analogy to the binomial distribution convincing.



So, what is the philosophical significance of the conjugate prior?  Is it just a pretty piece of mathematics that makes the computation work out the way we'd like it to?  No; there is deep
philosophical significance to the form of the beta distribution. 



Recall the intuition from above: if you've seen a lot of data already, then one more datapoint shouldn't change your understanding of the world too drastically.  If, on the other hand, you've seen relatively little data, then a single datapoint could influence your beliefs significantly.  This intuition is captured by the form of the conjugate prior.  α and β can be viewed as keeping track of how many heads and tails you've seen, respectively.  So if you've already done some experiments with this coin, you can store that data in a beta distribution and use that as your conjugate prior.  The beta distribution captures the difference between claiming that the coin has a 30% chance of coming up heads after seeing 3 heads and 7 tails, and claiming that the coin has a 30% chance of coming up heads after seeing 3000 heads and 7000 tails.
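
That difference shows up in the spread of the beta distribution.  A sketch using the standard mean and variance formulas for Beta(α, β), treating the raw heads/tails counts directly as hyperparameters for illustration:

```python
def beta_mean_sd(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return mean, var ** 0.5

# 3 heads and 7 tails vs. 3000 heads and 7000 tails.
m_small, sd_small = beta_mean_sd(3, 7)
m_large, sd_large = beta_mean_sd(3000, 7000)
# Both means are 0.3, but the large-sample distribution is far narrower,
# so a new flip barely moves it.
```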



Suppose you haven't observed any coin flips yet, but you have some intuition about what the distribution should be.  Then you can choose values for α and β that represent your prior understanding of the coin.  Higher values of α + β indicate more confidence in your intuition; thus, choosing the appropriate hyperparameters is a method of quantifying your prior understanding so that it can be used in computation.  α and β will act like "imaginary data"; when you update your distribution over θ after observing a coin flip x, it will be like you already saw α heads and β tails before that coin flip.

 

If you want to express that you have no prior knowledge about the system, you can do so by setting α and β to 1.  This will turn the beta distribution into a uniform distribution.  You can also use the beta distribution to do add-N smoothing, by setting α and β to both be N+1.  Setting the hyperparameters to a value lower than 1 causes them to act like "negative data", which helps avoid overfitting θ to noise in the actual data.
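
As an illustration, the mode of the Beta(α + h, β + t) posterior after h real heads and t real tails has the closed form (h + α − 1) / (h + t + α + β − 2), so imaginary data and add-N smoothing drop straight into the MAP estimate.  A sketch (the function name is hypothetical):

```python
def map_estimate(heads, tails, alpha, beta):
    """MAP estimate of theta: the mode of the Beta posterior after the given counts."""
    return (heads + alpha - 1) / (heads + tails + alpha + beta - 2)

# Add-N smoothing with N = 1 means alpha = beta = 2: one imaginary head
# and one imaginary tail before any real flips.
est = map_estimate(0, 3, 2, 2)   # three tails observed, no heads
# est == 0.2 rather than 0.0: smoothing keeps heads possible
```

With the uniform prior α = β = 1, the formula reduces to the raw frequency heads / (heads + tails), as you'd expect from "no prior knowledge".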



In conclusion, the beta distribution, which is a conjugate prior to the Bernoulli and binomial distributions, is super awesome.  It makes it possible to do Bayesian reasoning in a computationally efficient manner, as well as having the philosophically satisfying interpretation of representing real or imaginary prior data.  Other conjugate priors, such as the Dirichlet prior for the multinomial distribution, are similarly cool.
