<!DOCTYPE html>
<html>
<head>
<title>Introduction to Bayesian Inference in R</title>
<meta charset="utf-8">
<style>
@import url(https://fonts.googleapis.com/css?family=Yanone+Kaffeesatz);
@import url(https://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic);
@import url(https://fonts.googleapis.com/css?family=Ubuntu+Mono:400,700,400italic);
body { font-family: 'Droid Serif'; }
h1, h2, h3 {
font-family: 'Yanone Kaffeesatz';
font-weight: normal;
}
.remark-code, .remark-inline-code { font-family: 'Ubuntu Mono'; }
</style>
</head>
<body>
<textarea id="source">
class: center, middle
# Introduction to Bayesian Inference in R
## Robert Dodier
This document and the associated material are on GitHub:
[robert-dodier/bayesian-inference-r-2016](https://github.com/robert-dodier/bayesian-inference-r-2016)
---
# What is Bayesian inference?
* Bayesian inference = laws of probability applied to any proposition
* Proposition = statement about uncertain variables
* Variables = observable quantities, model parameters, latent (hidden) variables, hypotheses, etc
* Fundamental operation is to compute conditional probabilities
* i.e. probability of an interesting proposition ...
* given (conditional on) some data (i.e. whatever is fixed, either by observation or assumption)
* Bayesian inference is a framework within which we construct a way to handle any problem involving uncertain propositions
---
# What to do about uninteresting variables
* What about variables which are neither interesting nor given??
* Laws of probability dictate the Right Thing To Do:
* namely, integrate joint probability of "interesting" and "not interesting" over "not interesting"
* formally,
`\(P(\mathrm{interesting} | \mathrm{given}) = \int P(\mathrm{interesting}, \mathrm{uninteresting} | \mathrm{given}) d\mathrm{uninteresting}\)`
(where the integral becomes a summation if the uninteresting variable is discrete; see the small example below)
* This is the motivation for the computational stuff we'll see later
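
A minimal discrete illustration of the summation case (the joint probabilities here are made-up numbers):
```r
# joint distribution of a binary "interesting" variable (rows) and a binary
# "uninteresting" variable (columns)
joint <- matrix (c(0.10, 0.20,
                   0.15, 0.55), nrow=2, byrow=TRUE)
# marginalize: sum the joint over the uninteresting variable
p.interesting <- rowSums (joint)   # 0.30 0.70
```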
---
# Some problems involving uncertain propositions
* Missing data
* Forecasts
* Diagnosis
* Model selection
* Model parameters
* "What if" scenarios
---
# Reasonable, non-Bayesian approaches
* Missing data: plug in estimate of missing variable
* Forecasts: output forecast = best guess
* Diagnosis: output diagnosis = best guess
* Model selection: choose "the best" model
* Model parameters: choose best parameters
* "What if" scenarios: plug in example inputs into model and get output
---
# Approaches as enriched by a Bayesian perspective
* Missing data: consider all possible values of missing variable
* Forecasts: consider all possible output values
* Diagnosis: consider all possible diagnoses
* Model selection: consider all possible models
* Model parameters: consider all possible parameters
* "What if" scenarios: consider all possible results
---
# A closer look at the model parameters problem
* Goal is to construct posterior distribution of parameters given data
* In some cases the posterior distribution of the parameters can be constructed explicitly, either exactly or approximately
* In other cases, can't construct it explicitly but can construct an algorithm to sample from posterior
* ... then to produce e.g. forecasts, sample from posterior and use parameters to generate output
* ... and the result is a posterior distribution over the output
---
# Linear regression example
* We have a small data sample
* We'll assume the data are described by a linear model `\(y = a_0 + a_1 x + \mathrm{noise}\)`
* We'd like to compute a forecast at a given point
* ... taking uncertainty in the model parameters into account
---
![Example data for linear regression problem with no title](data.svg)
---
![Example data with drought title](data-title-drought.svg)
---
![Example data with surgery title](data-title-surgery-cost.svg)
---
![Example data with electrical demand title](data-title-peak-demand.svg)
---
There are many possible lines, some more plausible, some less
![Example data with possible lines](data+random-lines.svg)
---
Let's set our goal to compute the posterior distribution `\(p(y | x, \mathrm{data})\)`
* For that, we need to average over the variables we don't know, namely `\(a_0\)` and `\(a_1\)`.
* Specifically we need to compute
`\(p(y | x, \mathrm{data}) = \int \int p(y | x, a_0, a_1) \, p(a_0, a_1 | \mathrm{data}) \, da_0 \, da_1\)`
* where the posterior `\(p(a_0, a_1 | \mathrm{data})\)` is proportional to the likelihood times the prior, `\(p(\mathrm{data} | a_0, a_1) \, p(a_0, a_1)\)`
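
The code on the next few slides also refers to the observed data and to a density function `py.given.x.a0.a1.sigma.e` that aren't shown on the slides; here is a minimal sketch of what they might look like, with the data values taken from the `stan.data` list near the end of this deck.
```r
# observed data (the same three points as in the Stan example later on)
data.x <- c(0.1, 0.5, 0.6)
data.y <- c(0.35, 0.9, 1.1)

# p(y | x, a_0, a_1, sigma_e): Gaussian noise around the regression line
py.given.x.a0.a1.sigma.e <- function (y, x, a.0, a.1, sigma.e) {
  dnorm (y, mean=a.0 + a.1*x, sd=sigma.e)
}
```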
---
# Computing integrals -- (1) symbolic
* Stuff like: `\(\int_0^x \alpha e^{-\alpha t} \beta e^{-\beta (x - t)} dt = \frac{\alpha \beta}{\beta - \alpha} (e^{-\alpha x} - e^{-\beta x})\)`
* Gives insight if assumptions are satisfied
* Limited applicability
* In R, packages such as rSymPy and Ryacas provide glue to computer algebra systems
---
# Computing integrals -- (2) quadrature formulas
* `stats::integrate` performs adaptive subdivision, using quadrature rules that are exact for polynomials
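
For example, a quick check: the standard normal density integrated over the real line should come out very close to 1.
```r
# adaptive quadrature of the standard normal density over the whole real line
integrate (dnorm, -Inf, Inf)
```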
---
# Linear regression example using quadrature
`py.given.x` computes a value proportional to `\(p(y|x, \mathrm{data})\)` (it is normalized on the next slide)
```r
py.given.x <-
  function (y, x) {
    integrate (
      function (a.1) {
        sapply (a.1,
                function (a.1) {
                  integrate (function (a.0) {
                               py.given.x.a0.a1.sigma.e (y, x, a.0, a.1, 0.25) *
                                 P2 (a.0, a.1)},
                             -Inf, Inf)$value})},
      0, Inf)}
```
Integrand is `\(p(y | x, a_0, a_1)\)` times `P2 (a.0, a.1)`, a function proportional to the posterior `\(p(a_0, a_1 | \mathrm{data})\)` (likelihood times prior); the result is normalized numerically on the next slide
---
# Calculating a posterior distribution via quadrature
```r
yy <- seq (from=0.4, to=2.1, by=0.01)
sapply (yy, function (y) {py.given.x (y, 0.75) $ value}) -> pp
I <- sum(pp)*0.01   # approximate normalizing constant (grid spacing 0.01)
pp <- pp/I
plot (x=yy, y=pp, type="l")
```
---
![Posterior distribution of y given 0.75 via quadrature](y-given-x-via-quadrature.svg)
---
# About quadrature formulas
* Works well in 1 dimension, tolerable in a few more
* Unworkable in many dimensions (a common application scenario)
---
# Computing integrals -- (3) simple Monte Carlo
* runif, rnorm, rgamma, etc generate pseudorandom numbers
* Approximate integral as average over samples: `\(\int f(x) p(x) dx \approx \frac{1}{n} \sum_i f(x_i)\)`, where the `\(x_i\)` are drawn from `\(p(x)\)`
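
A tiny example: estimating `\(E[x^2] = \int x^2 p(x) dx\)` with `\(p(x)\)` a standard normal, whose exact value is 1.
```r
x <- rnorm (10000)   # draw samples x_i from p(x), here a standard normal
mean (x^2)           # Monte Carlo estimate of E[x^2]; should be close to 1
```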
---
# Linear regression example using simple Monte Carlo
```r
py.given.x.via.monte.carlo <-
  function (y, x) {
    # average the integrand over 1000 draws of (a.0, a.1) from their priors
    mean (mapply (function (a.0, a.1) {
                    py.given.x.a0.a1.sigma.e (y, x, a.0, a.1, 0.25) *
                      L (a.0, a.1, 0.25) *
                      psigma.e (0.25)},           # prior weight for the fixed noise sd; a constant factor
                  rnorm (1000, mean=0, sd=1),         # a.0 drawn from its prior pa.0 (standard normal)
                  rgamma (1000, shape=2, scale=1)))}  # a.1 drawn from its prior pa.1 (Gamma(2,1))
```
---
# Calculating a posterior distribution via simple Monte Carlo
```r
pp <- sapply (yy, function (y) {py.given.x.via.monte.carlo (y, 0.75)})
I <- sum(pp)*0.01   # approximate normalizing constant (grid spacing 0.01)
pp <- pp/I
```
---
![Posterior distribution of y given x=0.75 via simple Monte Carlo](y-given-x-via-monte-carlo.svg)
---
# About simple Monte Carlo
* Easy to apply, may be inefficient
* Limited applicability (difficulty of constructing sampling distribution)
* ... this is the motivation for Markov chain Monte Carlo
---
# Computing integrals -- (4) Markov chain Monte Carlo
* More general than ordinary Monte Carlo
* If one can't sample from the desired distribution directly,
construct a Markov chain whose stationary distribution is the desired distribution
* For such a chain, averages over the samples converge to averages over the desired distribution
* From the current sample, propose a jump somewhere else and accept or reject it. Repeat many times.
---
# Markov chain Monte Carlo example
We'll work with a function proportional to the distribution we want to sample;
namely the product of the likelihood function and the prior for the parameters.
Here is the likelihood function for the linear regression example:
```r
L <- function (a.0, a.1, sigma.e) {
py.given.x.a0.a1.sigma.e (data.y[1], data.x[1], a.0, a.1, sigma.e) *
py.given.x.a0.a1.sigma.e (data.y[2], data.x[2], a.0, a.1, sigma.e) *
py.given.x.a0.a1.sigma.e (data.y[3], data.x[3], a.0, a.1, sigma.e)}
```
and here is the prior:
```r
pa.0 <- function (a.0) {exp(-0.5*a.0^2)/sqrt(2*pi)}  # standard normal density
pa.1 <- function (a.1) {a.1*exp(-a.1)}               # Gamma(shape=2, scale=1) density
```
---
![Function proportional to target distribution](proportional_target.svg)
---
Proposal distribution which is just a Gaussian bump:
```r
Q <- function (a.0, a.1) {rnorm (2, mean=c(a.0, a.1), sd=0.05)}  # symmetric Gaussian jump centered at the current point
```
Generate a new proposed point:
```r
p1 <- Q (p0[1], p0[2])
```
Compute acceptance ratio:
```r
# ratio of the target (likelihood times prior) at the proposed vs current point
r <- (L (p1[1], p1[2], 0.25) * pa.0 (p1[1]) * pa.1 (p1[2])) /
     (L (p0[1], p0[2], 0.25) * pa.0 (p0[1]) * pa.1 (p0[2]))
```
Accept or reject proposal:
```r
if (r > 1 || r > runif(1)) {p0 <- p1}   # accept with probability min(1, r)
```
---
Generate a sequence of samples given a starting point `p0`:
```r
mcmc.sequence <- function (n, p0) {
  a.0 <- vector (length=n)
  a.1 <- vector (length=n)
  a.0[1] <- p0[1]
  a.1[1] <- p0[2]
  for (i in 2:n) {
    p1 <- Q (p0[1], p0[2])
    r <- (L (p1[1], p1[2], 0.25) * pa.0 (p1[1]) * pa.1 (p1[2])) /
         (L (p0[1], p0[2], 0.25) * pa.0 (p0[1]) * pa.1 (p0[2]))
    if (r > 1 || r > runif(1)) {p0 <- p1}   # accept with probability min(1, r)
    a.0[i] <- p0[1]
    a.1[i] <- p0[2]
  }
  list (x=a.0, y=a.1)
}
```
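
A usage sketch (the chain length, starting point, and burn-in length here are arbitrary choices):
```r
chain <- mcmc.sequence (5000, c(0, 1))
# discard the first 1000 samples as "burn-in" (see next slide)
a.0.samples <- chain$x[-(1:1000)]
a.1.samples <- chain$y[-(1:1000)]
plot (a.0.samples, a.1.samples, pch=".")
```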
---
"Burn-in" sequence -- initial samples not representative of distribution
!["Burn-in" sequence](target+initial-sequence.svg)
---
Running the chain longer yields more representative samples
![Burn-in plus additional samples](target+initial+subsequent-sequence.svg)
---
![Posterior distribution of y given x=0.75 via Markov chain Monte Carlo](y-given-x-via-mcmc.svg)
---
# About Markov chain Monte Carlo
* MCMC slower than ordinary MC due to autocorrelation, but more general
* Writing a sampler is a distraction that obscures the model
---
# Stan, a package for Bayesian inference
* Stan has its own modeling language, primarily declarative
* User states how variables are related, Stan figures out calculations
* Inference via the No-U-Turn Sampler (NUTS), an adaptive form of MCMC
* Also implements variational inference and optimization functions
---
# Linear regression model in Stan
```
data {
  vector[3] x;             // observed inputs
  vector[3] y;             // observed outputs
  real<lower=0> sigma_e;   // noise standard deviation (fixed)
  real new_x;              // input at which to forecast
}
parameters {
  vector[2] alpha;         // intercept and slope
}
model {
  real mu;
  for (i in 1:3) {
    mu = alpha[1] + alpha[2] * x[i];
    y[i] ~ normal(mu, sigma_e);
  }
}
generated quantities {
  real y_pred;
  y_pred = normal_rng(alpha[1] + alpha[2] * new_x, sigma_e);
}
```
---
# Making inferences via RStan
```r
library (rstan)
stan.data <- list (x=c(0.1, 0.5, 0.6), y=c(0.35, 0.9, 1.1), sigma_e=0.25, new_x=0.75)
# fit the model: 4 chains, 2000 iterations each (the first half of each chain is warmup by default)
my.stan.model <- stan (file="model.stan", data=stan.data, iter=2000, chains=4)
traceplot (my.stan.model, pars = c("alpha"), inc_warmup = T)
# extract posterior predictive samples of y at new_x and plot their density
y_pred_struct <- extract (my.stan.model, 'y_pred')
y_pred <- unlist (y_pred_struct, use.names=F)
hist (y_pred, freq=F, main="Density of y|x via Stan")
```
---
Samples from the Markov chain for the parameters
![Samples from the Markov chain for the parameters](stan-model-posterior-samples.svg)
---
![Posterior distribution of y given x=0.75](y-given-x-via-stan.svg)
---
# What use is a probability?
* Probability calculations are part of a bigger picture of decision analysis
* Rational approach is to make decision on basis of expected utility
* Utility = quantify value or preference
* Probability = quantify belief
* Expected utility = utility integrated wrt probability distribution
* Instead of using a probability by itself, consider expected gain/loss
* e.g. P(customer changes service | data) = 0.2; what should the service provider do?
* the answer has to depend on what the service provider stands to gain or lose
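
A toy expected-utility calculation with hypothetical payoffs: suppose a retention offer costs 10, and losing the customer costs 100.
```r
p.change <- 0.2                # P(customer changes service | data)
u.offer <- -10                 # make the offer: pay 10 and (assume) the customer stays
u.nothing <- p.change * -100   # do nothing: lose 100 with probability 0.2
c (offer=u.offer, do.nothing=u.nothing)   # -10 vs -20, so the offer has higher expected utility
```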
</textarea>
<script src="https://gnab.github.io/remark/downloads/remark-latest.min.js"></script>
<script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML&delayStartupUntil=configured" type="text/javascript"></script>
<script type="text/javascript">
var slideshow = remark.create();
// Setup MathJax
MathJax.Hub.Config({
tex2jax: {
skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
}
});
MathJax.Hub.Queue(function() {
$(MathJax.Hub.getAllJax()).map(function(index, elem) {
return(elem.SourceElement());
}).parent().addClass('has-jax');
});
MathJax.Hub.Configured();
</script>
</body>
</html>