For those wishing to follow along with the R-based demo in class, click here for the companion R script for this lecture.

In the previous two lectures, we have discussed **model-based
inference** using likelihood functions in a *frequentist*
framework. Specifically, we have constructed likelihood functions with
one or more free parameters and then used numerical optimization to find
the maximum-likelihood estimate (MLE). Parameter uncertainty was
accommodated by assessing the curvature of the likelihood surface in the
vicinity of the MLE (e.g., using the ‘rule of 2’ and the concept of the
*likelihood profile*).

Today we will start discussing **Bayesian inference**,
which is less of a departure from these previous topics than you might
think.

Bayesian inference also requires us to develop likelihood functions that describe our hypotheses about how our data were generated.

Bayesian methods also use the curvature of the likelihood surface to make inference about parameter uncertainty.

The difference is that we are no longer interested in the maximum
likelihood estimate (MLE) and the properties of maximum likelihood
estimators. Instead, we’re interested in using the likelihood function
to update our **degree of belief** about all possible
parameter values. Effectively, we want to obtain a *joint probability
distribution* (called the **joint posterior
distribution**) that reflects the probability of any given set of
parameters being the true parameter value.

Remember, Bayesian inference allows us to represent parameter uncertainty as probability distributions, so our the parameters of interest are no longer fixed but unknown values (points), they are sets of potential values, each with an associated probability!

Estimating the probability of an event using a binomial distribution is one of the simplest likelihood-based problems, so it provides a good starting point!

Let’s imagine we know *N* (*N*, the number of
independent trials, is fixed), but we want to estimate *p* (the
probability of an event occurring). For now, we will assume we have no
prior information about *p*, such that any value of *p* is
equally likely.

To set the prior, let’s assume *p* follows a uniform
distribution between 0 and 1:

```
# Bayesian analysis example using Binomial distribution -----------------------
# first visualize the "prior" for the probability *p* as a uniform distribution
curve(dunif(x),ylim=c(0,2),col="red", ylab = "probability", xlab="parameter \"p\"")
```

`#hist(runif(10000),freq=F,ylim=c(0,2),col="red")`

An alternative way to specify this uniform (flat) prior is to use the beta distribution, with both parameters of the distribution (shape1 and shape2) set to 1

```
# Alternative prior: beta distribution (conjugate prior)
curve(dbeta(x,1,1),ylim=c(0,2),col="red",ylab = "probability", xlab="parameter \"p\"")
#hist(rbeta(10000,1,1),freq=F,ylim=c(0,2),col="red") # histogram of random numbers from a flat beta distribution
```

Why choose the beta distribution here? The answer is that the beta
distribution is the **conjugate prior** for the *p*
parameter in the binomial distribution. This makes Bayesian estimation
easy and straightforward, as we will see!

A conjugate prior is a distribution for a parameter that matches the data-generating model: that is, the prior has the same general form as the likelihood function. An important consequence of this is that prior and the posterior distribution can be represented with the same probability distribution (e.g., gamma, beta, normal). We will come back to this!

Let’s work through an example. Take the same frog-call survey we have imagined before (a simple problem with binomial data). Recall that a particular site is occupied. After visiting the site 10 times, we detected the frog (heard its call) 3 times out of 10 total visits. We are interested in determining the detection probability.

We know the likelihood of the data across parameter space. So now we have both a prior distribution for our parameter of interest, as well as a likelihood surface.

```
# frog call example: --------------
# imagine we detected the frog in 3 of 10 visits to a known-occupied wetland
# visualize the data likelihood alongside the prior probability
# recall that the likelihood surface is not a probability distribution (thus, the 2 y axes)
data = 3
param.space <- seq(0,1,by=0.001)
likelihood <- dbinom(data,size=10,prob=param.space)
par(mai=c(1,1,0,1))
curve(dbeta(x,1,1),ylim=c(0,2),col="blue",lty=2,ylab="Probability density (prior)",xlab="parameter (p)")
points(param.space,likelihood,type="l",col="red",lwd=2,lty=2)
axis(4,at=seq(0,.3,length=4),labels = seq(0,0.3,length=4))
mtext("Likelihood (relative plausibility)", side=4, col="red",line=3)
```

Recall that the likelihood curve is NOT a probability distribution. It does not sum to 1! In Bayesian analyses, we translate the likelihood to a probability distribution using Bayes rule!!

\(Prob(Model|Data) = \frac{Prob(Data|Model)\cdot Prob(Model))}{Prob(Data)}\)

The \(Prob(Data|Model)\) term is the likelihood curve…

The \(Prob(Model)\) term is the prior distribution

BUT… what is the unconditional probability of the data (\(Prob(Data)\))?

Well, it’s the likelihood of the data, summed across all possible parameter values, weighted by the prior. This evaluates to a single number (also known as the “evidence”).

In practical terms, the denominator, \(Prob(Data)\), can be seen as a normalizing constant that effectively converts the numerator \(Prob(Data|Model)\cdot Prob(Model))\) into a true probability distribution.

As always (in this class), let’s do it first by brute force!!

```
# Brute-force Bayes ----------------------------
####
# prior across parameter space
prior <- dbeta(param.space,shape1=1,shape2=1) # flat prior
#prior
####
## Numerator for Bayes rule: weight the data likelihood by the prior
weighted.likelihood <- likelihood*prior # Numerator for Bayes rule
####
## Denominator for Bayes rule: compute normalization constant
normalization.constant <- sum(weighted.likelihood) # "evidence" is a single constant number: the sum of the weighted likelihoods!
####
## Posterior!! (numerator/denominator)
posterior <- weighted.likelihood/normalization.constant # this is Bayes' rule!
#####
## Plot it out!
par(mai=c(1,1,0,1))
plot(param.space,prior,ylim=c(0,5),type="l",lwd=1,lty=2,col="blue",ylab="Probability Density",xlab="param.space")
points(param.space,posterior*length(param.space),type="l",col="blue",lwd=2,lty=1) # convert posterior probability to probability density
points(param.space,likelihood,type="l",col="red",lwd=1,lty=2)
axis(4,at=seq(0,.3,length=4),labels = seq(0,0.3,length=4))
mtext("Likelihood", side=4, col="red",line=3)
```

Notice that the shape of the posterior distribution looks a lot like
the shape of the likelihood surface. What this says to us is that the
prior information has been *overwhelmed* by the information
content of the data (as summarized by the likelihood surface).

Again: recall that the likelihood surface is NOT a probability
distribution- it is a way to assess the relative plausibility of the
data across parameter space. The prior and the posterior are *true
probability distributions*, and necessarily integrate to 1.

What if we have a more informative prior?

```
# Try an informative prior! -------------
prior <- dbeta(param.space,shape1=15,shape2=5)
#prior
## weight the data likelihood by the prior
weighted.likelihood <- likelihood*prior
## compute normalization constant
normalization.constant <- sum(weighted.likelihood)
## Posterior!!
posterior <- weighted.likelihood/normalization.constant
## Plot it out!
par(mai=c(1,1,0,1))
plot(param.space,prior,ylim=c(0,5),type="l",lwd=1,lty=2,col="blue",ylab="Probability Density",xlab="param.space")
points(param.space,posterior*length(param.space),type="l",col="blue",lwd=2,lty=1)
points(param.space,likelihood,type="l",col="red",lwd=1,lty=2)
axis(4,at=seq(0,.3,length=4),labels = seq(0,0.3,length=4))
mtext("Likelihood", side=4, col="red",line=3)
```

What does this tell us?

What about if we have more data? Let’s imagine we visited 10 occupied ponds and observed the following data:

3, 1, 6, 2, 3, 2, 6, 1, 3, 3

```
####
# Collect more data and try again...
moredata <- c(3, 1, 6, 2, 3, 2, 6, 1, 3, 3)
## prior
prior <- dbeta(param.space,shape1=15,shape2=5)
## likelihood
likelihood <- sapply(param.space,function(t) prod(dbinom(moredata,size=10,prob=t)))
## weight the data likelihood by the prior
weighted.likelihood <- likelihood*prior
## compute normalization constant
normalization.constant <- sum(weighted.likelihood)
## Posterior!!
posterior <- weighted.likelihood/normalization.constant
## Plot it out!
par(mai=c(1,1,0,1))
plot(param.space,prior,ylim=c(0,10),type="l",lwd=1,lty=2,col="blue",ylab="Probability Density",xlab="param.space")
points(param.space,posterior*length(param.space),type="l",col="blue",lwd=2,lty=1)
points(param.space,likelihood*1e9,type="l",col="red",lwd=1,lty=2)
axis(4,at=seq(0,6,by=1),labels = seq(0,6e-9,by=1e-9))
mtext("Likelihood", side=4, col="red",line=3)
```

What about a super informative prior??

```
####
# Try a very informative prior!
likelihood <- dbinom(data,size=10,prob=param.space)
```

```
prior <- dbeta(param.space,shape1=150,shape2=50)
#prior
## weight the data likelihood by the prior
weighted.likelihood <- likelihood*prior
## compute normalization constant
normalization.constant <- sum(weighted.likelihood)
## Posterior!!
posterior <- weighted.likelihood/normalization.constant
## Plot it out!
par(mai=c(1,1,0,1))
plot(param.space,prior,ylim=c(0,15),type="l",lwd=1,lty=2,col="blue",ylab="Probability Density",xlab="param.space")
points(param.space,posterior*length(param.space),type="l",col="blue",lwd=2,lty=1)
points(param.space,likelihood*5,type="l",col="red",lwd=1,lty=2)
axis(4,at=seq(0,2,by=0.4),labels = seq(0,0.5,by=.1))
mtext("Likelihood", side=4, col="red",line=3)
```

Okay, now let’s do it the more mathematically elegant way! When we
work with a conjugate prior, the updating process is easy. The posterior
distribution for the *p* term in the above example can be
computed by:

\(Beta(shape1=prior+k,shape2=prior+(N-k))\)

Let’s do the same thing, now using the conjugate prior method…

```
## Conjugate priors ---------------------------
# Do it again- this time with conjugate priors...
### PRIOR
prior_beta <- c(shape1=1,shape2=1)
curve(dbeta(x,prior_beta['shape1'],prior_beta['shape2']),ylim=c(0,5),ylab="Prob Density",col="blue",lwd=1,lty=2,xlab="param.space")
### POSTERIOR
curve(dbeta(x,prior_beta['shape1']+data,prior_beta['shape2']+(10-data)),ylim=c(0,4),ylab="Prob Density",col="blue",lwd=2,lty=1,xlab="param.space",add=T)
```

And again, this time with an informative prior!

```
####
# With informative prior...
### PRIOR
prior_beta <- c(shape1=15,shape2=5)
curve(dbeta(x,prior_beta['shape1'],prior_beta['shape2']),ylim=c(0,5),ylab="Prob Density",col="blue",lwd=1,lty=2,xlab="param.space")
### POSTERIOR
curve(dbeta(x,prior_beta['shape1']+data,prior_beta['shape2']+(10-data)),ylim=c(0,4),ylab="Prob Density",col="blue",lwd=2,lty=1,xlab="param.space",add=T)
```

`graphics.off()`

And the super informative prior??

```
####
# And with super informative prior...
### PRIOR
prior_beta <- c(shape1=150,shape2=50)
curve(dbeta(x,prior_beta['shape1'],prior_beta['shape2']),ylim=c(0,15),ylab="Prob Density",col="blue",lwd=1,lty=2,xlab="param.space")
### POSTERIOR
curve(dbeta(x,prior_beta['shape1']+data,prior_beta['shape2']+(10-data)),ylim=c(0,15),ylab="Prob Density",col="blue",lwd=2,lty=1,xlab="param.space",add=T)
```

One of the differences between the MLE and the Bayesian paradigm
(although both use likelihood as a way to summarize the information
content of the data) is that the point estimate is not usually the
maximum (mode) of the posterior distribution (in MLE, we by definition
try to find the parameter value that maximizes the likelihood function)
but the **mean** (expected value) or the
**median** of the posterior distribution.

Imagine you had a skewed posterior distribution that looked something like this (this is actually a lognormal distribution):

```
## Bayesian point estimate -----------------------
# Example: bayesian point estimates can differ markedly from MLE
curve(dlnorm(x,4,1),from=0.001,to=200,ylab="prob density") # use a lognormal distribution for example of skewed posterior dist...
```

Where is the mode? Where is the mean?

```
####
# Compute and plot the mean and the mode of the distribution
param.space2 <- seq(0.001,200,length=10000)
skewed.posterior <- dlnorm(param.space2,4,1)
mean <- mean(rlnorm(10000,4,1))
mode <- param.space2[which.max(skewed.posterior)]
plot(param.space2,skewed.posterior,type="l",ylab="prob density")
abline(v=c(mean,mode),col=gray(c(0.5,0.2)),lwd=3,lty=2) # add to plot
```

What about for the first worked example? Is there a big difference between the mean and the mode?

```
####
# Do the same for the frog detection example from above
#graphics.off()
### POSTERIOR
posterior <- dbeta(param.space,1+data,1+(10-data))
mean <- mean(rbeta(10000,1+data,1+(10-data)))
mode <- param.space[which.max(posterior)]
plot(param.space,posterior,type="l",col="blue",lwd=2)
abline(v=c(mean,mode),col=gray(c(0.5,0.2)),lwd=3,lty=2) # add to plot
```

In MLE, we find the parameter value that maximizes the likelihood function. Why don’t we use the mean of the likelihood function in Frequentist statistics? In Bayesian analysis, why don’t we use the mode of the posterior distribution?

We often call Bayesian confidence intervals *credible
intervals* to distinguish from their frequentist analog. Bayesian
(e.g., 95%) credible intervals can be interpreted in the way you
probably have always wanted to interpret frequentist (95%) confidence
intervals. It is way more satisfying for many people!

You are 95% sure that the parameter value is between the lower and upper bound!!!

**Q:** Why can’t we say this in a frequentist
framework?

Let’s try this:

```
####
# A Bayesian confidence interval (95% credible interval) -------------------
### POSTERIOR
curve(dbeta(x,1+data,1+(10-data)),ylim=c(0,4),ylab="Prob Density",col="blue",lwd=2,lty=1,xlab="param.space")
### CREDIBLE INTERVAL
credible.interval <- qbeta(c(0.025,0.975),1+data,1+(10-data)) # get the credible interval using the quantile method
abline(v=credible.interval,col=gray(0.5),lwd=3,lty=2) # add to plot
```

There are many possible Bayesian credible intervals, even assuming we want a 95% CI. That is, there are many intervals that enclose 95% of the posterior distribution. The Highest Posterior Density interval is the one that is the shortest out of all possible intervals. This is the natural choice for a Bayesian CI since it represents the collection of the most likely values for the parameters - every point within the interval is more plausible than any point outside the interval.

NOTE: the HPD can be discontinuous for multimodal posterior distributions!

One of the reasons Bayesian analysis was less common historically was
that there were no mathematically straightforward ways to do the
analysis. *There still are not* BUT we have fast computers and
ingenious computational algorithms. Basically, we can use various forms
of more-or-less brute force computation, along with some mathematical
tricks, to successfully conduct model-based inference in a Bayesian
framework.

Let’s start with the brutest of brute-force methods (not possible in high-dimensional systems):

For continuity we will continue the myxomatosis dataset. Recall that we are estimating the shape and scale (or alternatively, the rate) parameters of the Gamma distribution that describes the virus titer in Australian rabbits for the highest-grade infections.

```
#### Bayesian analysis without a conjugate prior
# Revisit the Myxomatosis example --------------------------
library(emdbook)
MyxDat <- MyxoTiter_sum
Myx <- subset(MyxDat,grade==1)
head(Myx)
```

```
## grade day titer
## 1 1 2 5.207
## 2 1 2 5.734
## 3 1 2 6.613
## 4 1 3 5.997
## 5 1 3 6.612
## 6 1 3 6.810
```

Recall the histogram looks like this:

`hist(Myx$titer,freq=FALSE) # visualize the data, again!`

And we are trying to fit a Gamma distribution to these data:

```
### Error is modeled as gamma distributed
hist(Myx$titer,freq=FALSE)
curve(dgamma(x,shape=40,scale=0.15),add=T,col="red") # note here we are using scale and not rate...
```

We already have a likelihood function for this problem! Note that we are now looking at raw likelihoods and not log likelihoods!

```
####
# recall our likelihood function for these data (not on log scale this time!)
GammaLikelihoodFunction <- function(params){
prod(dgamma(Myx$titer,shape=params['shape'],scale=params['scale']))
}
params <- c(40,0.15)
names(params) <- c("shape","scale")
params
```

```
## shape scale
## 40.00 0.15
```

`GammaLikelihoodFunction(params)`

`## [1] 2.906766e-22`

And we recall that the 2D likelihood surface looks something like this (looks a bit different without the log transform though!):

```
# define 2-D parameter space (in real probability scale)!
shapevec <- seq(10,100,by=0.1)
scalevec <- seq(0.01,0.3,by=0.001)
# define the likelihood surface across this grid within parameter space
likelihood2D <- matrix(nrow=length(shapevec),ncol=length(scalevec)) # initialize storage variable
newparams <- params
for(i in 1:length(shapevec)){
newparams['shape'] <- shapevec[i]
for(j in 1:length(scalevec)){
newparams['scale'] <- scalevec[j]
likelihood2D[i,j] <- GammaLikelihoodFunction(newparams)
}
}
# Visualize the likelihood surface
image(x=shapevec,y=scalevec,z=likelihood2D,zlim=c(1e-70,1e-17),col=topo.colors(12))
contour(x=shapevec,y=scalevec,z=likelihood2D,levels=c(1e-18,1e-17),add=T)
```

Now let’s use this 2-D likelihood surface as a jumping-off point for a brute-force Bayesian solution to this problem!

First we need to set priors for the shape and scale parameters… For example, we could set uniform distributions for these parameters. For this example, let’s imagine a prior in which all pixels in the above image are equally likely and all pixels outside this range have posterior probability of zero:

```
# compute the area of each pixel (for probability density computation)
pixelArea <- 0.0001 # for determining probability densities
# define the prior probability surface across this grid within parameter space
prior2D <- matrix(1, nrow=length(shapevec),ncol=length(scalevec)) # initialize prior
prior2D <- prior2D/length(prior2D)
# Visualize the 2-D prior distribution
image(x=shapevec,y=scalevec,z=prior2D,zlim=c(0,0.001),col=rainbow(10))
```

Okay, not very interesting!

But now we have the raw information we need to apply Bayes’ rule!

```
# Apply Bayes Rule! -------------------
weighted.likelihood <- prior2D * likelihood2D # numerator of Bayes rule
normalization.constant <- sum(weighted.likelihood) # denominator of Bayes rule
posterior2D <- weighted.likelihood/normalization.constant
# Visualize the 2-D posterior distribution
image(x=shapevec,y=scalevec,z=(posterior2D/pixelArea),zlim=c(0,5),col=topo.colors(12))
contour(x=shapevec,y=scalevec,z=(posterior2D/pixelArea),levels=c(1:4),add=T,drawlabels=FALSE)
```

Now we can use this posterior distribution to get our point estimates and parameter uncertainty estimates. We could just take the 2d posterior distribution surface and draw the contour containing 95% of the probability density:

First let’s find the contour line below which only 5% of the probability density is contained (this represents an HPD credible interval):

```
# try to find the contour that contains 95% of our degree of belief! ----------------
possible.contours <- data.frame(contour = seq(0.13e-4,1e-4,length=100), quantile = NA)
i=1
for(i in 1:nrow(possible.contours)){
ndx <- which(posterior2D<possible.contours$contour[i],arr.ind = T)
possible.contours$quantile[i] <- sum(posterior2D[ndx])
}
head(possible.contours,10)
```

```
## contour quantile
## 1 1.300000e-05 0.03484278
## 2 1.387879e-05 0.03803099
## 3 1.475758e-05 0.04110702
## 4 1.563636e-05 0.04417352
## 5 1.651515e-05 0.04751733
## 6 1.739394e-05 0.05046922
## 7 1.827273e-05 0.05369474
## 8 1.915152e-05 0.05674380
## 9 2.003030e-05 0.05980036
## 10 2.090909e-05 0.06315809
```

From here we can see that the contour line at posterior probability 1.739394e-05 encloses ~95% of the probability density

```
# Visualize the 2D credible region (HPD)
q95 <- 1.739394e-05
image(x=shapevec,y=scalevec,z=posterior2D,zlim=c(0.5e-11,5e-4),col=topo.colors(12))
contour(x=shapevec,y=scalevec,z=posterior2D,levels=q95,add=T,lwd=3,col="red",drawlabels=FALSE)
```

Where is our point estimate? Let’s find the posterior mean and mode!

```
# Visualize a point estimate
image(x=shapevec,y=scalevec,z=posterior2D,zlim=c(0.5e-11,5e-4),col=topo.colors(12))
contour(x=shapevec,y=scalevec,z=posterior2D,levels=q95,add=T,lwd=3,col="red",drawlabels=FALSE)
meanshape <- sum(shapevec*posterior2D)
meanscale <- sum(scalevec*posterior2D)
points(meanshape,meanscale,pch=20,cex=2,col="red") # posterior mean
mode <- which(posterior2D==max(posterior2D),arr.ind = T)
points(shapevec[mode[1]],scalevec[mode[2]], pch="X",col="black",cex=1.5) # posterior mode
```

We can also plot out the *marginal posterior distributions*
for the parameters separately. For example, the probability that the
‘shape’ parameter falls within a given range, regardless of the other
variable:

```
# Plot out posterior distributions separately for each parameter -------------
marginal.dist.shape <- apply(posterior2D,1,mean) # factor out the scale param- focus only on the shape param.
plot(shapevec,(marginal.dist.shape/sum(marginal.dist.shape))/0.1,type="l",lwd=2,col="blue",ylab="probability density",main="Posterior probability")
abline(v=meanshape)
```

```
marginal.dist.scale <- apply(posterior2D,2,mean)
plot(scalevec,(marginal.dist.scale/sum(marginal.dist.scale))/0.001,type="l",lwd=2,col="blue",ylab="probability density",main="Posterior probability")
abline(v=meanscale)
```

`meanshape`

`## [1] 48.08924`

`meanscale`

`## [1] 0.1549478`

And from here we can compute the posterior mean and Bayesian credible intervals.

How do our Bayesian estimates compare with the maximum likelihood estimates?? (shape = 49.3666607 scale = 0.1402629)

We can also *sample directly from this 2-D posterior
distribution*:

```
# Sample parameters from the joint posterior
SampleFromPosterior <- function(n){
shape <- rep(shapevec,times=length(scalevec))
scale <- rep(scalevec,each=length(shapevec))
jointparams <- data.frame(shape=shape,scale=scale)
probs <- as.vector(posterior2D)
samples <- sample(c(1:length(probs)),size=n,replace=TRUE,prob=probs)
jointparams[samples,]
}
samples<-SampleFromPosterior(n=10000)
par(mfrow=c(3,2))
plot(samples,col=1:10000)
plot(samples,type="l")
plot(ts(samples[,1]),xlab="sample",ylab="shape")
plot(ts(samples[,2]),xlab="sample",ylab="scale")
hist(samples[,1],40,xlab="shape",main="histogram of shape param")
hist(samples[,2],40,xlab="scale",main="histogram of scale param")
```

`par(mfrow=c(1,1))`

Now in 99% of analysis we will undertake in the real world, we simply
won’t have the computational power to partition our parameter space into
tiny discrete pixels and completely evaluate the posterior probability
for all pixels in that *n*-dimensional space. In these cases, we
tend to harness algorithms known as Markov-Chain Monte Carlo (MCMC).
This is the focus of the next lecture.