So, Gibbs sampling is one way (among many) to estimate a Bayesian model. It’s also one of the many estimating techniques contained under the hood of MLWiN. If you’re really interested, I’d highly recommend Marin & Robert’s Bayesian Essentials with R (I have a copy I don’t use if the library doesn’t have it). It’s a great, simple introduction to the relevant concepts (and importantly, makes you actually apply them).

You’re probably familiar with things like maximum likelihood estimation, of which ordinary least squares estimators are a type. Usually, these methods focus on the likelihood function, which is an inversion of the probability problem of concluding the probability of your theory $$\theta$$ (usually some parameter of interest about a process) from some data $$y$$.

$p(\theta | y) \propto \mathcal{L}(y | \theta)$ Here, $$\mathcal{L}$$ is a likelihood function, and $$a \propto b$$ is read as “a is proportional to b”. Likelihoods are not probabilities, but they’re related to them. They express how likely the data ($$y$$) is, given an assumption about what the underlying parameters are ($$\theta$$) that govern these outcomes. In maximum likelihood approaches, we just look for the value of $$\theta$$ that maximizes $$\mathcal{L}$$, which is a pretty straightforward mathematical challenge. This gives us the parameter estimates that are most likely to have generated our data.1

## The Bayesian Interpretation

Usually, Bayesians interpret this statement as saying the probability of your belief given the data should be proportional to the probability of the data given your belief.

$p(\text{belief | data}) \propto \mathcal{L}(\text{data | belief})$

But, this is actually missing one part!

Formally, this likelihood relationship actually comes from Bayes’ Theorem. Since we observe $$y$$, the data, and want to get the probability our theory $$\theta$$ is true, we use Bayes’ Theorem: $p(\theta | y) = \frac{p(y | \theta) p(\theta)}{p(y)}$ Generally, we ignore $$p(y)$$, the probability of the data, raw, over all possible theories. We do this for a few reasons. First, $$p(y)$$ is pretty conceptually hard to grasp without becoming an extra-dimensional being… it has no easy way to specify. Second, it’s easy to remove from the expression (mathematically) because it’s constant with respect to $$\theta$$: we take our data $$y$$ as “known,” so probability we would observe it over infinite universes that may operate under different rules is fixed.

This means we’re left with

$p(\theta | y) \propto p(y | \theta) p(\theta)$

Or, in the plain language above, the probability of your belief given the data should be proportional to the probability of the data given your belief, adjusted by the probability of your beliefs are true:

$p(\text{belief | data}) \propto \mathcal{L}(\text{data | belief}) p(\text{belief before observing data})$

We call $$p(\theta)$$ a “prior probability,” and you might’ve worked with these already in other packages.

# A Bayesian Regression

So, say we have a simple linear regression: \begin{align} y &= X\beta + \epsilon \\ \epsilon &\sim \mathcal{N}(0,\sigma^2) \end{align}

For us, we need to specify some kind of $$p(\theta)$$, the probability of our theory before observing data. This is the prior, and it means we need to pick some statistical distribution from which we think $$\beta,\sigma$$ are generated.

For some estimation methods (like the simple Gibbs sampling discussed here), we need to pick special priors to make the problem easy. We’ll let $$\beta \sim \mathcal{N}(\mu, \tau^2)$$, where $$\mu$$ is our “prior belief” about the average $$\beta$$ effect we might observe, and $$\tau^2$$ is how intrinsically noisy we think the effect is. For $$\sigma^2$$, our prior parameters $$(a,b)$$ are for an “inverse gamma” distribution. We use this distribution in part because of its mathematical structure. But, it also gives $$a$$ and $$b$$ really natural interpretations: $$a$$ will correspond to how many “observations” we think our prior belief about the model variance should be worth relative to our real observations in $$y$$, and $$b$$ is can then our uncertainty about the variation in the data. If $$a$$ is small, we don’t give our prior much “weight” in terms of our observations. If $$b$$ is big, then our variance could be really small or really big.

Together, this makes a full statement of the Bayesian regression, including priors: \begin{align} y &= X\beta + \epsilon \\ \epsilon &\sim \mathcal{N}(0,\sigma^2) \\ \beta &\sim \mathcal{N}(\mu, \sigma^2) \\ \sigma^2 &\sim \text{Inverse Gamma}(a,b) \end{align} It’s important to remember that this is only one possible specification, and many others are used in practice. Some people even put a prior on $$\sigma$$, rather than $$\sigma^2$$. Others model $$\sigma^2_i$$ as unique variances for each observation, accounting for possible heteroskedasticity. Others may model the full covariance of $$y$$ directly with an $$N \times N$$ matrix $$\Sigma$$. For us, though, these distributions are classically used because they make Gibbs sampling, discussed in the next section, easier.

Regardless, since $$\epsilon$$ is normally distributed, our model of $$y$$ conditional on our unknown parameters is the typical normal distribution used in regression: $y | (\beta, \sigma) \sim \mathcal{N}(X\beta, \sigma^2)$ This works because, if $$\beta$$ was known, the mean of $$y$$ is simply $$X\beta$$. Our error is modeled as normally distributed with mean 0 and variance $$\sigma^2$$ by the Gauss-Markov theorem. We’re just saying that $$\beta$$ and $$\sigma$$ themselves may have some uncertainty we need to factor in.

Pulling this all back together with the information about $$p(\theta)$$, our priors, we can use that statement about belief & data we made before to structure our belief after having observed data (sometimes called the “posterior belief” because it’s post-observation): $p(\text{belief | data}) \propto \mathcal{L}(\text{data | belief}) p(\text{belief before observing data})$

is re-spelled:

$p(\beta, \sigma^2 | y) \propto p(y | \beta, \sigma^2) p(\beta) p(\sigma^2)$ This is pretty complex, and is a function of two parameters, $$\beta, \sigma$$. But, at least we now have specified the distributional forms for each of the components on the right hand side: $p(\beta, \sigma^2 | y) \propto \mathcal{N}(X\beta, \sigma^2) \mathcal{N}(\mu, \tau^2) \text{IG}(a,b)$ This is pretty complex as it stands, so we’re going to try to simplify it a little more so we can actually estimate $$\beta$$ and $$\sigma^2$$.

## Gibbs sampling

Gibbs sampling reduces this statement to a sequence of distributions that are a function of a single parameter. Put more simply, it means we can treat the complicated function $$p(\beta, \sigma^2|y)$$, which has two unknowns, as two much more simple functions: $$p(\beta | \sigma^2, y)$$ and $$p(\sigma^2 | \beta, y)$$.

We do this by asking: what if we just pretended we knew what $$\sigma^2$$ was? Well… the disribution $$p(\beta | y, \sigma^2)$$ would be much simpler: $p(\beta | y, \sigma^2) \propto \mathcal{N}(X\beta, \sigma^2) \mathcal{N}(\mu, \tau^2)$ If you do some algebra, you can work out the fact that this turns into: $\beta | y, \sigma^2 \sim \mathcal{N}(m, s)$ where $$m$$ and $$s$$ are now mixtures of the prior information and information in the data. In fact, when you do the algebra, it’ll make clear that $$m$$ and $$s$$ have some familiar forms: $m = (X'X)^{-1}X'y \ \ \ \ \ \ \ \ s = \sigma^2(X'X)^{-1}$ That is, our prior for $$\beta$$ is normal, and our “conditional posterior” for $$\beta$$ is normal. This is called “conditional conjugacy,” when the distribution of one of our parameters (assuming everything else is known) is in the same form as our prior belief about the parameter. If we did know $$\sigma^2$$, we could just draw from this using rnorm in R!

This realization is what powers Gibbs sampling.2

Second, the parameters of $$\beta$$ should be recognizable from a linear regression equation. Specifically, $$m$$ is the estimator for $$\hat{\beta}$$ in a typical linear regression! But, since we’ve now incorporated some uncertainty into $$\beta$$ using on our prior belief, $$\beta$$ is considered as randomly distributed around that value. Regardless of this uncertainty, on average, $$\beta$$ will be its least squares estimate. Its randomness around the least squares estimate depends on the *empirical covariance, $$(X'X)^{-1}$$, and whatever our assumed-known value of $$\sigma^2$$ is.

For $$\sigma^2$$, we now will pretend we know $$\beta$$. This also simplifies the distribution of $$\sigma^2$$ alone: $\sigma^2 | y, \beta \sim \mathcal{N}(X\beta, \sigma^2) \text{IG}(a,b)$ This is the distribution of $$y$$ times our prior belief about $$\sigma^2$$, as was the case for $$\beta$$.

Again, if you do the algebra, this turns out to be conditionally conjugate as well: $\sigma^2 | y, \beta \sim \text{IG}(a_n,b_n)$ where $a_n = \frac{n}{2} + a \ \ \ \ \ \ \ \ b_n = (y - X\beta)'(y - X\beta) / 2$ If we only knew $$\beta$$, this could be sampled using rinvgamma from MCMCpack.

## This is (not) a Trick

This is a good trick, but you can’t just assume you know these things… can you?

Well… what if we started at some good-enough value of $$\beta$$ or $$\sigma^2$$, draw the other one , and then use value to draw a better value of the first parameter we assumed? Does that give us the right answer in the long run?

To walk you through how this works, I’ll be using a subscript to denote that we’re in some “iteration” of the sampler. So, $$\beta_k$$ is the $$\beta$$ value we’ve constructed at the $$k$$th iteration.

So, for example, we can just arbitrarily state a first guess for $$\beta$$ is zero (so $$\beta_1 = 0$$). Then, we’d draw sigma randomly from: $\sigma^2_1 | y, \beta \sim \text{IG}(\frac{N}{2},(y - X \bullet 0)'(y - X\bullet 0)/2)$ Then, we could use this $$\sigma^2_1$$ we’ve just drawn as if it’s “known” and draw our next iteration’s $$\beta$$, called $$\beta_2$$, from: $\beta_2 | y, \sigma^2 \sim \mathcal{N}((X'X)^{-1}X'y, \sigma^2_1 (X'X)^{-1})$ Theorems proven by Geman & Geman (1984) demonstrate that, if we do this back & forth for a long time, the draws of $$\beta$$ and $$\sigma^2$$ will get arbitrarily close to the “correct” distributions for $$\beta$$, $$\sigma^2$$.

The course of parameter draws taken over iterations is what you’d see as a “traceplot” in many MCMC packages. Usually, we only analyze the end of the trace, since that’s assumed to be drawn from the “correct” distribution. But, implementing this in R is pretty straightforward.

## Hand-crafted artisinal Gibbs sampling

Just for an example, let’s use the R trees dataset, and do a regression on tree volume using Girth and Height.

data(trees)
attach(trees)
library(MCMCpack)

You can see the relationships between the two variables below:

plot(Height, Volume, data=trees)
## Warning in plot.window(...): "data" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "data" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not a
## graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not a
## graphical parameter
## Warning in box(...): "data" is not a graphical parameter
## Warning in title(...): "data" is not a graphical parameter

plot(Girth, Volume, data=trees)
## Warning in plot.window(...): "data" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "data" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not a
## graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not a
## graphical parameter
## Warning in box(...): "data" is not a graphical parameter
## Warning in title(...): "data" is not a graphical parameter