all_code.Rmd

---
title: "Intro2R"
author: "Jonathan Rosenblatt"
date: "March 18, 2015"
output: html_document

---
# R Basics

Tips for this introduction:
- If you are working alone, consider starting with "An Introduction to R" here:
http://cran.r-project.org/manuals.html 
- Make sure you use RStudio.
- ctrl+return to run lines from editor.
- alt+shift+k for RStudio keyboard shortcuts.
- ctrl+alt+j to navigate between sections
- tab for autocompletion
- ctrl+1 to skip to editor. 
- ctrl+2 to skip to console.
- ctrl+8 to skip to the environment list.
- Folding:
  - alt+l collapse chunk.
  - alt+shift+l unfold chunk.
  - alt+o collapse all.
  - alt+shift+o unfold all.
  

## Simple calculator
```{r example}
10+5
70*81
2**4
2^4
log(10)       					
log(16, 2)    					
log(1000, 10)   				
```


## Controlling output format:
```{r}
round(log(10))  # typically most useful

signif(log(10))		
prettyNum(log(10), digits=5)
format(log(10), digits=4, scientific=T, justify='right')
```


## Probability calculator 
Wish you had known this when you took your Intro to Probability class?
```{r}
dbinom(x=3, size=10, prob=0.5) 	# For X~B(n=10, p=0.5) returns P(X=3)
dbinom(3, 10, 0.5)

pbinom(q=3, size=10, prob=0.5) # For X~B(n=10, p=0.5) returns P(X<=3) 	
dbinom(x=0, size=10, prob=0.5)+dbinom(x=1, size=10, prob=0.5)+dbinom(x=2, size=10, prob=0.5)+dbinom(x=3, size=10, prob=0.5) # Same as previous

qbinom(p=0.1718, size=10, prob=0.5) # For X~B(n=10, p=0.5) returns the smallest k such that P(X<=k)>=0.1718

rbinom(n=1, size=10, prob=0.5) 	
rbinom(n=10, size=10, prob=0.5)
rbinom(n=100, size=10, prob=0.5)
```
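
The four prefixes fit together: `d` gives the probability mass, `p` accumulates it, `q` inverts `p`, and `r` draws samples. A minimal sketch that checks this numerically; the names `size`, `prob`, `k`, `pmf` and `cdf` are only illustrative:

```{r}
size <- 10; prob <- 0.5
k <- 0:size                            # all possible values of X
pmf <- dbinom(k, size, prob)           # P(X=k) for every k
cdf <- cumsum(pmf)                     # running sum of the mass function
all.equal(cdf, pbinom(k, size, prob))  # pbinom is the cumulative sum of dbinom
sum(pmf)                               # the probabilities sum to 1
qbinom(0.17, size, prob)               # smallest k with P(X<=k) >= 0.17
```

The same `d`/`p`/`q`/`r` pattern applies to the other distributions in base R, such as `norm`, `pois` and `exp`.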


## Getting help
Get help for a particular function.
```{r, eval=FALSE}
?dbinom 
help(dbinom)
```

Search local help files for a particular string.
```{r, eval=FALSE}
??binomial
help.search('dbinom') 
```

Load a menu with several important manuals:
```{r, eval=FALSE}
help.start() 
```


## Variable assignment:
Assignment into a variable named "x":
```{r}
x = rbinom(n=1000, size=10, prob=0.5) # Works. Bad style.
x <- rbinom(n=1000, size=10, prob=0.5) # Assignment into a variable named "x"
```
More on style: http://adv-r.had.co.nz/Style.html


Print contents:
```{r}
x
print(x)  
(x <- rbinom(n=1000, size=10, prob=0.5))  # Assign and print.
```


Operate on the object
```{r}
mean(x)  
var(x)  
hist(x)  
rm(x) # remove variable
```


For more information on distributions see http://cran.r-project.org/web/views/Distributions.html


## Piping for better style and readability
```{r}
# install.packages('magrittr')
library(magrittr)
```

```{r}
x <- rbinom(n=1000, size=10, prob=0.5)

x %>% var() # Instead of var(x)
x %>% hist()  # Instead of hist(x)
x %>% mean() %>% round(2) %>% add(10) 
```

This example clearly demonstrates the benefits (from http://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html)
```{r}
# Functional (onion) style
car_data <- 
  transform(aggregate(. ~ cyl, 
                      data = subset(mtcars, hp > 100), 
                      FUN = function(x) round(mean(x), 2)), 
            kpl = mpg*0.4251)


# magrittr style
car_data <- 
  mtcars %>%
  subset(hp > 100) %>%
  aggregate(. ~ cyl, data = ., FUN = . %>% mean %>% round(2)) %>%
  transform(kpl = mpg %>% multiply_by(0.4251)) %>%
  print
```
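
What the pipe does can be stated in one line: `lhs %>% f(args)` is evaluated as `f(lhs, args)`, unless a `.` placeholder appears among the arguments, in which case the left-hand side goes there instead. A small sketch with an illustrative vector `v`:

```{r}
library(magrittr)

v <- c(1.234, 5.678)
identical(v %>% round(1), round(v, 1))   # the lhs becomes the first argument

# With a placeholder the lhs is routed to the named argument instead:
2 %>% seq(from = 10, to = 20, by = .)    # same as seq(from = 10, to = 20, by = 2)
```

This is why `aggregate(. ~ cyl, data = ., ...)` works in the example above: the data frame is sent to `data=` rather than to the first argument.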


## Vector creation and manipulation 
```{r}
c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21)
10:21 							
seq(from=10, to=21, by=1) 							
seq(from=10, to=21, by=2) 								
x <- c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21) 	
x
```


You can assign AFTER the computation is finished:
```{r}
c(1,2,3)
y<- .Last.value 
y
```


Operations usually work element-wise:
```{r}
x+2
x*2    
x^2    
sqrt(x)  
log(x)   
```
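
"Element-wise" also covers vectors of different lengths: R recycles the shorter vector until it matches the longer one. A short sketch with two illustrative vectors, `u` and `w`:

```{r}
u <- 1:6
w <- c(10, 20)
u + w          # w is recycled: 11 22 13 24 15 26
u * c(1, 0)    # keeps every other element, zeroes the rest
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.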


## Simple plotting 
```{r}
x<- 1:100; y<- 3+sin(x) # Create arbitrary data
plot(x = x, y = y) # x,y syntax  						
plot(y ~ x) # y~x syntax (I like better)
```

Control plot appearance:   
```{r}
plot(y~x, type='l', main='Plotting a connected line')
plot(y~x, type='h', main='Sticks plot', xlab='Insert x axis label', ylab='Insert y axis label')
plot(y~x, pch=5)
plot(y~x, pch=10, type='p', col='blue', cex=4)
abline(3, 0.002)
```

Available plotting options
```{r, eval=FALSE}
example(plot)
example(points)
?plot
help(package='graphics')
```

When your plotting gets serious, move to ggplot2 and ggvis as soon as possible.


___


## Data frame Manipulation 
Data frames generalize matrices: they bind together vectors of several classes, provided all have the same length.
```{r}
x<- 1:100; y<- 3 + sin(x) 
class(x) # R (high) level representation of an object.

# mode(x) 
# typeof(x) 
```


Create and check out your first data frame:
```{r}
frame1 <- data.frame(x=x, sin=y)	
frame1
head(frame1)
frame1 %>% head() # just print the beginning
frame1 %>% View() # Excel-like view (never edit!)

class(frame1) # the object is of type data.frame
dim(frame1)  							
dim(x)
length(frame1)
length(x)

str(frame1) # the inner structure of an object
attributes(frame1) # get the object's meta data
```

### Extraction
Single element:
```{r}
frame1[1, 2]    						
frame1[2, 1]     						
```

Extract _column_ by index:
```{r}
frame1[, 1]      						
frame1[,1] %>% t
frame1[,1] %>% t %>% dim
```

Extract column by name:
```{r}
names(frame1)   						
frame1[, 'sin']
dim(frame1[, 'sin'])  # extract as a vector. no dim attribute.
frame1['sin'] 
dim(frame1['sin']) # extract as a data.frame. has dim attribute.
frame1[,1:2] %>% class
frame1[2] %>% class
frame1[2, ] # extract a row

frame1$sin %>% class
```

`subset()` does the same
```{r}
subset(frame1, select=sin) 
subset(frame1, select=2)
subset(frame1, select= c(2,0))
```


Sanity conservation notice!
Always think if you want to extract a vector or a frame:
- Note the difference between `[]` and `[[]]` extraction!
- Note the difference between `frame[,1]` and `frame[1]`.
```{r}
a <- frame1[1]
b <- frame1[[1]]
a==b # Seems identical. But not really:
class(a)
class(b)
# Causes different behaviour:
a[1]
b[1]
```
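
The four common ways to pull out one column can be compared side by side. This is a sketch on a small stand-in data frame (`toy` is an illustrative name, not part of the class data); the `drop = FALSE` argument of `[` keeps the result a data frame:

```{r}
toy <- data.frame(x = 1:3, sin = sin(1:3))

class(toy[1])                    # data.frame: list-style [ keeps the frame
class(toy[[1]])                  # integer: [[ extracts the column itself
class(toy[, 1])                  # integer: matrix-style [ drops to a vector
class(toy[, 1, drop = FALSE])    # data.frame: dropping is switched off
```

Using `drop = FALSE` is a defensive habit in functions that must return a data frame whatever the number of selected columns.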

More about extraction: http://adv-r.had.co.nz/Subsetting.html

### dplyr package
`dplyr` makes the manipulation of data.frames a breeze.
It is very fast and straightforward.

Install the package:
```{r}
# install.packages('dplyr')
```

The following examples are taken from:
https://github.com/justmarkham/dplyr-tutorial/blob/master/dplyr-tutorial.Rmd
```{r}
# install.packages('nycflights13')
library(nycflights13)
dim(flights)
View(flights)
names(flights)
class(flights) # a tbl_df is an extension of the data.frame class
library(dplyr) # calling dplyr

filter(flights, month == 1, day == 1) #dplyr style
flights[flights$month == 1 & flights$day == 1, ] # old style
flights %>% filter(month == 1, day == 1) # dplyr with magrittr style (yes!)

filter(flights, month == 1 | month == 2)
slice(flights, 1:10) # selects rows

arrange(flights, year, month, day) # sort
arrange(flights, desc(arr_delay)) # sort descending

select(flights, year, month, day) # select columns
select(flights, year:day) # select column range
select(flights, -(year:day)) # drop columns
rename(flights, tail_num = tailnum) # rename variables
# add a new computed column
mutate(flights,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60) 
# you can refer to columns just created!
mutate(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)
# keep only new variables
transmute(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)
# simple statistics
summarise(flights,
  delay = mean(dep_delay, na.rm = TRUE)
  )

sample_n(flights, 10) # random subsample
sample_frac(flights, 0.01) # random subsample
```

Subgroup operations
```{r}
by_tailnum <- group_by(flights, tailnum)
by_tailnum %>% class # a grouping object
delay <- summarise(by_tailnum,
  count = n(),
  avg.dist = mean(distance, na.rm = TRUE),
  avg.delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, avg.dist < 2000)
View(delay)

destinations <- group_by(flights, dest)
summarise(destinations,
  planes = n_distinct(tailnum),
  flights = n()
)

# Grouping works in a hierarchy. summarise() peels outer layer.
daily <- group_by(flights, year, month, day)
(per_day   <- summarise(daily, flights = n()))
(per_month <- summarise(per_day, flights = sum(flights)))
(per_year  <- summarise(per_month, flights = sum(flights)))
```
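
The group, summarise and filter steps can also be chained with the pipe, which avoids the intermediate objects. A minimal sketch, using the built-in `mtcars` data instead of `flights` so that it runs without `nycflights13`; the summary column names are illustrative:

```{r}
library(dplyr)

cyl.summary <- mtcars %>%
  group_by(cyl) %>%                      # one group per cylinder count
  summarise(count = n(),
            avg.mpg = mean(mpg),
            avg.hp = mean(hp)) %>%
  filter(count > 5) %>%                  # keep only the larger groups
  arrange(desc(avg.mpg))                 # most economical group first
cyl.summary
```

Reading top to bottom, each line is one step of the analysis, in the order it is performed.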


Two table operations
```{r}
airlines %>% View
flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)

flights2 %>% left_join(airlines) # join on left table with automatic matching.

flights2 %>% left_join(weather)

flights2 %>% left_join(planes, by = "tailnum") # with named matching

flights2 %>% left_join(airports, c("dest" = "faa"))

flights2 %>% left_join(airports, c("origin" = "faa"))
```

Types of join
```{r}
(df1 <- data_frame(x = c(1, 2), y = 2:1))
(df2 <- data_frame(x = c(1, 3), a = 10, b = "a"))

df1 %>% inner_join(df2) # SELECT * FROM x JOIN y ON x.a = y.a

df1 %>% left_join(df2) # SELECT * FROM x LEFT JOIN y ON x.a = y.a

df1 %>% right_join(df2) # SELECT * FROM x RIGHT JOIN y ON x.a = y.a
df2 %>% left_join(df1) 

df1 %>% full_join(df2) # SELECT * FROM x FULL JOIN y ON x.a = y.a

# return only unmatched cases
flights %>%
  anti_join(planes, by = "tailnum") %>% 
  count(tailnum, sort = TRUE) 
# SELECT * FROM x WHERE NOT EXISTS (SELECT 1 FROM y WHERE x.a = y.a)

df1 %>% semi_join(df2, by = "x")  # SELECT * FROM x WHERE EXISTS (SELECT 1 FROM y WHERE x.a = y.a)
```
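
A quick way to see how the join types differ is to count the rows each one returns. The sketch below rebuilds the two toy tables with plain `data.frame()` (the names `left.tbl` and `right.tbl` are illustrative) and states the key explicitly with `by`:

```{r}
library(dplyr)

left.tbl  <- data.frame(x = c(1, 2), y = 2:1)
right.tbl <- data.frame(x = c(1, 3), a = 10, b = "a")

join.rows <- sapply(list(
  inner = inner_join(left.tbl, right.tbl, by = "x"),  # keys present in both
  left  = left_join(left.tbl, right.tbl, by = "x"),   # all keys of the left table
  right = right_join(left.tbl, right.tbl, by = "x"),  # all keys of the right table
  full  = full_join(left.tbl, right.tbl, by = "x"),   # keys of either table
  semi  = semi_join(left.tbl, right.tbl, by = "x"),   # left rows that have a match
  anti  = anti_join(left.tbl, right.tbl, by = "x")    # left rows without a match
), nrow)
join.rows
```

Only `x = 1` appears in both tables, so the inner and semi joins return one row, while the full join returns all three distinct keys.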

Set operations
```{r}
(df1 <- data_frame(x = 1:2, y = c(1L, 1L)))
(df2 <- data_frame(x = 1:2, y = 1:2))

intersect(df1, df2) # SELECT * FROM x INTERSECT SELECT * FROM y

union(df1, df2) # SELECT * FROM x UNION SELECT * FROM y

setdiff(df1, df2) # SELECT * FROM x EXCEPT SELECT * FROM y

setdiff(df2, df1)
```

Leaving dplyr for now...

### Arrays 
Arrays generalize matrices to higher dimensions:
```{r}
x<- array(1:24, dim=c(6,4) )
x[6,4]
x<- array(1:24, dim=c(6,2,2) )
x[6,2,2] 
x<- array(1:24, dim=c(2,3,2,2) )
x[2,3,2,2] 
```
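
Arrays are filled in column-major order: the first index runs fastest, then the second, and so on. The sketch below, with illustrative names `arr`, `row`, `col` and `slab`, recovers the position of an element from its indices and back:

```{r}
arr <- array(1:24, dim = c(6, 2, 2))

row <- 3; col <- 2; slab <- 1
arr[row, col, slab]                             # element at (3, 2, 1)
row + (col - 1) * 6 + (slab - 1) * 6 * 2        # its position in the underlying vector

which(arr == 17, arr.ind = TRUE)                # indices of the element holding 17
```

Because the array was filled with `1:24`, each value equals its own linear position, which makes the mapping easy to verify by eye.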


## Data Import and export 
For a complete review see:
http://cran.r-project.org/doc/manuals/R-data.html
also in  help.start() -> "Import and Export Manual" 

### Import from WEB 
`read.table()` is the main importing workhorse.
```{r}
URL <- 'http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/bone.data'
tirgul1 <- read.table(URL)
```

Always look at the imported result!
```{r}
View(tirgul1)
# hmmm... header interpreted as data. Fix with header=TRUE:
tirgul1 <- read.table(URL, header = TRUE) 
View(tirgul1)
```

### Import .csv files
Let's write a simple file so that we have something to import:
```{r}
View(airquality) #  examine the data to export
(temp.file.name <- tempfile()) # get an arbitrary file name
write.csv(x = airquality, file = temp.file.name) #export
```

Now let's import:
```{r}
# my.data<- read.csv(file='/home/jonathan/Projects/...')
my.data<- read.csv(file=temp.file.name)
View(my.data)
```
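
One thing to look for in the imported result: `write.csv()` writes the row names by default, so the file read back has one extra column, which `read.csv()` names `X`. A sketch of the round trip with and without row names (`tmp`, `with.rownames` and `clean` are illustrative names):

```{r}
tmp <- tempfile(fileext = '.csv')

write.csv(airquality, file = tmp)                     # row names are written too
with.rownames <- read.csv(tmp)
names(with.rownames)                                  # note the extra first column "X"

write.csv(airquality, file = tmp, row.names = FALSE)  # suppress the row names
clean <- read.csv(tmp)
identical(names(clean), names(airquality))            # same columns as the original

file.remove(tmp)
```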

__Note__: Under MS Windows you might need `\\` instead of `/` in file paths.

### Import .txt files 
`read.table()` splits fields on any whitespace by default:
```{r, eval=FALSE}
my.data<- read.table(file='C:\\Documents and Settings\\Jonathan\\My Documents\\...') #
```
`read.delim()` presets the separator to a tab:
```{r, eval=FALSE}
my.data<- read.delim(file='C:\\Documents and Settings\\Jonathan\\My Documents\\...') 
```
If you care about your sanity, see ?read.table before starting imports.

### Writing Data to files

Get and set the current directory:
```{r, eval=FALSE}
getwd() #What is the working directory?
setwd() #Set the working directory (pass the target path as the argument)
```

```{r}
write.csv(x=tirgul1, file='/tmp/tirgul1.csv') #
```

See ?write.table for details.

### .XLS files 
Strongly recommended to convert to .csv.
If you still insist see:
http://cran.r-project.org/doc/manuals/R-data.html#Reading-Excel-spreadsheets

### Massive files 
Better store as matrices and not data.frames.
`scan()` is faster than `read.table()` but less convenient:

Create the example data:
```{r}
cols<- 1e3
# Note: On Windows you might need to change /tmp/A.txt to /temp/A.txt 
rnorm(cols^2) %>%
  matrix(ncol=cols) %>%
  write.table(file='/tmp/A.txt', col.names= F, row.names= F)
# Measure speed of import:
system.time(A<- read.table('/tmp/A.txt', header=F))
system.time(A <- scan(file='/tmp/A.txt', n = cols^2) %>%
              matrix(ncol=cols, byrow = TRUE))

file.remove('/tmp/A.txt') 
```

This matter will be revisited in the last class.

### Databases:
Start [here](https://rforanalytics.wordpress.com/useful-links-for-r/odbc-databases-for-r/)

### Hands on example (from the WEB)
```{r}
URL <- 'http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/bone.data'
tirgul1 <- read.table(URL, header = TRUE)

names(tirgul1)
tirgul1 %>% head
tirgul1 %>% tail
View(tirgul1)
dim(tirgul1)
length(tirgul1)
```

R can be object oriented (read about S3 and S4 if interested).
See how `summary()` behaves differently on different object classes:
```{r}
class(tirgul1[, 1]); class(tirgul1[, 2]); class(tirgul1[, 3]); class(tirgul1[, 4])
summary(tirgul1)
```


Matrices are more efficient than data frames, but can store only a single class of values.
```{r}
tirgul.matrix <- as.matrix(tirgul1) 
tirgul.matrix
class(tirgul.matrix)
# notice everything has been cast to the most general class.
class(tirgul.matrix[, 1]); class(tirgul.matrix[, 2]); class(tirgul.matrix[, 3]); class(tirgul.matrix[, 4])
summary(tirgul.matrix)
```

Note: if re-writing an expression bothers you (as it should!), here are some solutions:
```{r}
# The apply family of functions:
apply(tirgul.matrix, 2, class) # apply class() to each column (margin 2)

# looping
for(j in 1:ncol(tirgul.matrix)) print(class(tirgul.matrix[,j]))
```

Make sure you read `?sapply`. 
LISP fans might also like to read `?Map`.
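
As a starting point for that reading, here is a small sketch of the two functions on built-in data sets. `sapply()` treats a data frame as a list of columns, and `Map()` walks over several vectors in parallel; `sums` is an illustrative name:

```{r}
sapply(iris, class)        # one class per column of a data frame
sapply(mtcars[1:3], mean)  # one mean per column, returned as a named vector

# Map() applies a function to the first elements, then the second, and so on:
sums <- Map(function(a, b) a + b, 1:3, 4:6)
unlist(sums)
```

Note that `sapply()` over a matrix visits every cell rather than every column, which is why `apply()` with a margin is used for matrices.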


Operations _within_ data objects:
```{r}
plot(tirgul1$gender)
tirgul1$gender %>% plot() # 
with(tirgul1, plot(gender) ) # Same operation. Different syntax.

mean(tirgul1$age)
tirgul1$age %>% mean() # 
with(tirgul1, mean(age) ) # Same operation. Different syntax.
```


```{r}
# tirgul1$age <- tirgul1$age * 365  # Same conversion in base syntax; run only one of the two
tirgul1<- transform(tirgul1, age=age*365 )  #Age in days
with(tirgul1, mean(age) )
tirgul1<- transform(tirgul1, age=age/365 )  #Does this revert back to years?
with(tirgul1, mean(age) )
```

Then again, many of these functions are replaced by friendlier functions in the dplyr package (see above).


## Sorting 
```{r}
(x<- c(20, 11, 13, 23, 7, 4))
(y<- sort(x))
(ord<- order(x))
x[ord] # Extracting along the order is the same as sorting.
ranks<- rank(x)
identical(y[ranks] , x) # Compares two objects

(z<- c('b','a','c','d','e','z'))
xz<- data.frame(x,z)
sort(xz) # Will not work: sort() does not handle data frames
xz[ord,] # Sorting a data frame using one column
```
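
`order()` and `rank()` are inverse permutations of each other when there are no ties, and `order()` accepts several keys, which is how a data frame is sorted by more than one column. A sketch with illustrative objects `v` and `scores`:

```{r}
v <- c(20, 11, 13, 23, 7, 4)
identical(order(order(v)), as.integer(rank(v)))  # ordering the order gives the ranks

sort(v, decreasing = TRUE)
v[order(v, decreasing = TRUE)]                   # same result through order()

# Sort by group (ascending), then by score (descending) within each group:
scores <- data.frame(group = c('b', 'a', 'b', 'a'), score = c(3, 1, 2, 4))
sorted.scores <- scores[order(scores$group, -scores$score), ]
sorted.scores
```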


## Looping 
For a crash course in R programming (not only data analysis) try:   
http://adv-r.had.co.nz/                            
The usual `for()`, `while()`, and `repeat` constructs are available:
```{r}
for (i in 1:100){
    print(i)
    }
```


```{r}
for (helloeveryone in seq(10, 100, by=2) ){
    print(helloeveryone)
    }
```
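
The other two looping constructs follow the same pattern. A minimal sketch with illustrative counters `count.w` and `count.r`: `while` tests its condition before every pass, whereas `repeat` runs until an explicit `break`:

```{r}
count.w <- 1
while (count.w <= 3) {       # condition checked before each iteration
    print(count.w)
    count.w <- count.w + 1
}

count.r <- 1
repeat {                     # no condition: must be left with break
    print(count.r)
    count.r <- count.r + 1
    if (count.r > 3) break
}
```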


## Recursion 
Typically very slow due to memory management issues.

```{r}
fib<-function(n) {
    if (n < 2) fn<-1 
    else fn<-Recall(n - 1) + Recall(n - 2) 
    return(fn)
} 
fib(30)
```
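
For comparison, here is a loop-based version that keeps the same convention as `fib()` above (the values at 0 and 1 are both 1). It is a sketch, and `fib.iter` is an illustrative name; it does one addition per step instead of re-computing the same sub-problems:

```{r}
fib.iter <- function(n) {
    if (n < 2) return(1)
    prev <- 1; curr <- 1
    for (i in 2:n) {
        nxt <- prev + curr   # next term is the sum of the last two
        prev <- curr
        curr <- nxt
    }
    curr
}

fib.iter(30)
system.time(fib.iter(30))    # compare with system.time(fib(30))
```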


## Finding your objects 
```{r}
ls() #Lists all available objects
ls(pattern='x')

ls(pattern='[0-9]') # Search using regular expressions
ls(pattern='[A-Z]')
```

Or press ctrl+8 in RStudio to jump to the environment list.

### What are the available environments?
```{r}
search() # This is the search hierarchy of called objects.
```
When you start serious programming in R, read this:
http://adv-r.had.co.nz/Environments.html


# Univariate Exploratory Statistics


##  Exploring Categorical Variables  
```{r}
gender <- c(rep('Boy', 10), rep('Girl', 12))
drink <- c(rep('Coke', 5), rep('Sprite', 3), rep('Coffee', 6), rep('Tea', 7), rep('Water', 1))  
class(gender);class(drink)

cbind(gender, drink)
table1 <- table(gender, drink) 
table1										
```


Margins
```{r}
table(gender) 
table(drink)
dotchart(as.matrix(table(gender)))
dotchart(as.matrix(table(drink)))

barplot(table1, legend.text=T)    			
barplot(t(table1), legend.text=T)    		

plot(table1, main="Frequency Bar Chart", sub="Notice columns width is also proportional to counts!")
plot(t(table1))

data1<-data.frame(gender, drink)
plot(data1) 
plot(gender~drink) #Will not work
plot(gender, drink) #Will not work

gender.n<-apply(table1, 1, sum) 		
gender.n
drink.n<-apply(table1, 2, sum)     		
drink.n

apply(table1, 2, '/', gender.n)   		
apply(table1, 1, '/', drink.n)     		

apropos('table')        
margin.table(table1, 2)  
prop.table(table1, 1)    
prop.table(table1, 2)     

par(mfrow=c(1, 2))
pie(prop.table(table1, 1)['Boy', ], main='Drinks given Boys')   
pie(prop.table(table1, 1)['Girl', ], main='Drinks given Girls') 
barplot(prop.table(table1, 1)['Boy', ], main='Boys');barplot(prop.table(table1, 1)['Girl', ], main='Girls') 
par(mfrow=c(2, 3)) 
pie(prop.table(table1, 1)[, 'Coffee'], main='Coffee');pie(prop.table(table1, 1)[, 'Coke'], main='Coke');  
pie(prop.table(table1, 1)[, 'Sprite'], main='Sprite');pie(prop.table(table1, 1)[, 'Tea'], main='Tea');    
pie(prop.table(table1, 1)[, 'Water'], main='Water');                                     

par(mfrow=c(1, 1))   


barplot(table1)
barplot(prop.table(table1, 1))
barplot(prop.table(table1, 2))
barplot(t(prop.table(table1, 1)), legend.text=T)

```
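
The conditional tables produced by `prop.table()` are just the counts divided by the matching margin. A sketch that verifies this on a copy of the same data (the names `kid`, `beverage`, `counts` and `by.hand` are illustrative, chosen so as not to overwrite the objects above):

```{r}
kid <- c(rep('Boy', 10), rep('Girl', 12))
beverage <- c(rep('Coke', 5), rep('Sprite', 3), rep('Coffee', 6), rep('Tea', 7), rep('Water', 1))
counts <- table(kid, beverage)

by.hand <- counts / rowSums(counts)      # divide every row by its total
all.equal(as.vector(by.hand), as.vector(prop.table(counts, 1)))

rowSums(prop.table(counts, 1))           # margin 1: each row sums to 1
colSums(prop.table(counts, 2))           # margin 2: each column sums to 1
```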

              
Using the ggplot2 package
```{r}
library(ggplot2)
qplot(gender, data=data1, geom='bar', fill=drink )
qplot(gender, data=data1, geom='bar' ) + facet_grid(~drink)
qplot(drink, data=data1, geom='bar' ) + facet_grid(~gender)

gender<-factor(gender);drink=factor(drink) 
```


##  Exploring Continuous Variables 

Manual Histogram
```{r}
x <- c(-2.44, -1.70,  -1.45,  -1.27,  -1.25,  -1.12,  -1.10,  -1.05,  -1.01,  -0.50,  -0.33,  -0.12,  -0.01,   0.24,   0.51,   0.80,   1.04,   1.15,   1.28,   1.77)
stripchart(x)
x<- c(rnorm(500),rnorm(300,3))
stripchart(x)
hist(x, prob=T,main='')	## Disjoint window histogram 
rug(x)
lines(density(x, kernel='rectangular', bw=0.1), main='') ## Simple moving average with width 1 
title(expression(W(t)==ifelse(abs(t)<=0.5, 1, 0))) 
```


Generating and exploring normal data
```{r}
sample1<-rnorm(100) 							
table(sample1) 									
barplot(table(sample1)) 						
stem(sample1) 
stem(sample1, scale=2) 
stem(sample1, scale=0.5)
hist(sample1, freq=T, main='Counts')      	
hist(sample1, freq=F, main='Densities') 	
lines(density(sample1))                  	
rug(sample1)
```


## The Boxplot 
```{r}
boxplot(sample1)	
text(x=1.3, y=quantile(sample1, c(0.25, 0.5, 0.75)), labels=c('Quartile 1', 'Median', 'Quartile 3'))
abline(h=qnorm(0.25), col='red') 
abline(h=qnorm(0.5), col='blue') 
abline(h=qnorm(0.75), col='red') 

# Adjusting the Boxplot fences for the size of the data:
giantSample<- rnorm(100000)
par(mfrow=c(2,2))
boxplot(giantSample, range=1) # Too many outliers :-(
boxplot(giantSample, range=2)
boxplot(giantSample, range=3) # Too few outliers :-(
boxplot(giantSample, range=2.5) # Good distance of fences :-)
par(mfrow=c(1,1))
```
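
Why the default fences flag so many points in a giant sample can be worked out directly. Under a standard normal population, the share of observations beyond fences placed `range` inter-quartile ranges from the quartiles is a simple tail probability. The helper `outlier.frac()` below is a hypothetical name introduced for this sketch:

```{r}
outlier.frac <- function(range) {
    q1 <- qnorm(0.25); q3 <- qnorm(0.75)
    upper.fence <- q3 + range * (q3 - q1)        # the lower fence is symmetric
    2 * pnorm(upper.fence, lower.tail = FALSE)   # mass beyond both fences
}

sapply(c(1, 1.5, 2, 2.5, 3), outlier.frac)
1e5 * outlier.frac(1.5)   # expected number of flagged points among 100000
```

With the default `range=1.5` roughly 0.7% of normal observations fall outside the fences, which amounts to several hundred points in a sample of 100000.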


Several different visualisations:
```{r}
sample2<-rnorm(1000)     
stem(sample2)          
hist(sample2)          
plot(density(sample2))  
rug(sample2)
```


Real data 
```{r}
URL <- 'http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/bone.data'
bone <- read.table(URL, header = TRUE)
names(bone)
summary(bone) 			
stripchart(bone['age'])
stem(bone[, 'age']) 									
hist(bone[, 'age'], prob=T) 							
lines(density(bone[, 'age'])) 
with(bone, rug(age))

ind<-bone[, 'gender']=='male'

par(mfrow=c(2, 1))
plot(density(bone[ind, 'age']), main='Male')
rug(bone[ind,'age'])
plot(density(bone[!ind, 'age']), main='Female')
rug(bone[!ind,'age'])

plot(density(bone[ind, 'age']), main='Male', xlim=c(5, 30)) # Adjusting x axis to fit both genders
plot(density(bone[!ind, 'age']), main='Female', xlim=c(5, 30)) # Adjusting x axis to fit both genders

boxplot(bone[ind, 'age'], main='Male')
boxplot(bone[!ind, 'age'], main='Female')

par(mfrow=c(1, 1))
boxplot(bone$age~bone$gender)
with(bone, boxplot(spnbmd~gender))
```


## Graphical parameters 
```{r}
attach(bone) 
stripchart(age)
stripchart(age~gender)
stripchart(age~gender, v=T)

boxplot(age~gender)
boxplot(age~gender, horizontal=T, col=c('pink','lightblue') )
title(main='Amazing Boxplots!')
title(sub="Well actually.. I've seen better Boxplots")

plot(density(age),main='')
plot(density(age),main='',type='h')
plot(density(age),main='',type='o')
plot(density(age),main='',type='p')
plot(density(age),main='',type='l')

?plot.default

plot(density(age),main='')
rug(age)
boxplot(age, add=T, horizontal=T, at=0.02, boxwex=0.05, col='grey')
title(expression(alpha==f[i] (beta)))
example(plotmath)

par(mfrow=c(2,1))
(males<- gender=='male')
plot(density(age[males]), main='Male') ; rug(age[males])
plot(density(age[!males]), main='Female') ; rug(age[!males])

range(age)
plot(density(age[males]), main='Male', xlim=c(9,26)) ; rug(age[males])
plot(density(age[!males]), main='Female', xlim=c(9,26)) ; rug(age[!males])
par(mfrow=c(1,2))
plot(density(age[males]), main='Male', xlim=c(9,26)) ; rug(age[males])
plot(density(age[!males]), main='Female', xlim=c(9,26)) ; rug(age[!males])

par(mfrow=c(1,1),ask=T)
plot(density(age[males]), main='Male', xlim=c(9,26)) ; rug(age[males])
plot(density(age[!males]), main='Female', xlim=c(9,26)) ; rug(age[!males])

plot(density(age[males]), main='Male', xlim=c(9,26),ylim=c(0,0.08)) 
par(mfrow=c(1,1),ask=F, new=T)
plot(density(age[!males]), main='Female', xlim=c(9,26),ylim=c(0,0.08)) 

plot(density(age[males]), main='Male', xlim=c(9,26)) 
lines(density(age[!males]), main='Female', xlim=c(9,26)) 

plot(density(age[males]), main='Male', xlim=c(9,26)) 
lines(density(age[!males]), main='Female', xlim=c(9,26),lwd=2) 

plot(density(age[males]), xlim=c(9,26), main='') 
lines(density(age[!males]), xlim=c(9,26),lty=2) 
legend(locator(1), legend=c("Male","Female"), lty=c(1,2))

plot(density(age[males]), xlim=c(9,26), main='',col='blue', lwd=2) 
lines(density(age[!males]), xlim=c(9,26),lty=2, col='red',lwd=2) 
legend(locator(1), legend=c("Male","Female"), lty=c(1,2), col=c('blue','red'))

plot(density(age[males]), main='Male', xlim=c(9,26)) 
points(density(age[!males]), main='Female', xlim=c(9,26), bg='red') 
points(locator(3),pch="+")
points(locator(3),pch=10, cex=4)

plot(density(age[males]), main='Male', xlim=c(9,26)) 
points(density(age[!males]), main='Female', xlim=c(9,26), bg='red') 
points(locator(6),pch=c('a','b','c'))
```


## Integer data 
Integer data will most certainly produce overlaps if plotted. Either add jitter, or treat as discrete.
```{r}
r.age<-round(age)
plot(density(r.age))
rug(r.age)
plot(density(r.age, from=9))
rug(jitter(r.age))
hist(r.age)
rug(jitter(r.age))
```


## Plotting

### Preparing data for plotting
2D data can be in either 'wide' or 'long' format. 
Most R functions are designed for long formats. 
Let's start by trying to plot in the wide format.
Notice each dosage is plotted separately (yes, I could have looped).
```{r}
wide.data<-data.frame(id=1:4, age=c(40,50,60,50), dose1=c(1,2,1,2),dose2=c(2,1,2,1), dose4=c(3,3,3,3))
wide.data

plot(dose1~age, data=wide.data, ylim=range(c(dose1,dose2,dose4)), ylab='')
points(dose2~age, data=wide.data, pch=2)
points(dose4~age, data=wide.data, pch=3)
```


Plotting in long format is much easier. 
I will first convert the data manually.
```{r}
(dose.type<-c(
		rep('dose1', length(wide.data$dose1)),
		rep('dose2', length(wide.data$dose2)),
		rep('dose4', length(wide.data$dose4))))
(dose<- c(wide.data$dose1,wide.data$dose2,wide.data$dose4))
(long.id<- rep(wide.data$id,3))
(long.age<- rep(wide.data$age,3))

long.data <- data.frame(long.id, long.age, dose.type, dose)
View(long.data)

plot(dose~long.age, data=long.data, pch=as.numeric(factor(dose.type)))
```
I will now try to avoid this manual reshaping.

### Reshaping data

#### base package
```{r}
stack(data.frame(wide.data$dose1,wide.data$dose2,wide.data$dose4))

reshape(wide.data, varying=list(c("dose1","dose2","dose4")), direction="long", idvar=c("id","age"), v.names="dose")
reshape(wide.data, varying=list(c("dose1","dose2","dose4")), direction="long", idvar="id", timevar="age", v.names="dose")
```

#### reshape package
```{r}
# melt() is much more comfortable than reshape()
library(reshape)
melted.data<- melt(data=wide.data, id.vars=c("id","age") )
cast(melted.data, age+id~variable)

cast(melted.data)
```


#### tidyr package
This is the package I recommend if you cannot reshape manually.
Example from [here](http://blog.rstudio.org/2014/07/22/introducing-tidyr/)
```{r}
library(tidyr)
library(dplyr)

# Data in wide format:
messy <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  a = c(67, 80, 64),
  b = c(56, 90, 50)
)
messy

# Convert to long format:
messy %>% gather(drug, heartrate, a:b)
```

```{r}
# Another example- from wide to long:
set.seed(10)
messy <- data.frame(
  id = 1:4,
  trt = sample(rep(c('control', 'treatment'), each = 2)),
  work.T1 = runif(4),
  home.T1 = runif(4),
  work.T2 = runif(4),
  home.T2 = runif(4)
)
messy %>% head
tidier <- messy %>%  gather(key, time, -id, -trt)
tidier %>% head(8)

# Split the key column into location and time period
tidy <- tidier %>%
  separate(key, into = c("location", "time"), sep = "\\.") 
tidy %>% head(8)
```
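
The reverse step, from long back to wide, uses `spread()`, the counterpart of `gather()`. A sketch on a copy of the first example (the names `wide`, `long` and `back` are illustrative). Newer versions of tidyr offer `pivot_longer()` and `pivot_wider()` for the same job; `gather()` and `spread()` are kept here to match the code above:

```{r}
library(tidyr)

wide <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  a = c(67, 80, 64),
  b = c(56, 90, 50)
)

long <- gather(wide, drug, heartrate, a:b)   # one row per (name, drug) pair
long
back <- spread(long, drug, heartrate)        # one column per drug again
back
```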

### Fancy Plotting 
```{r}
library(ggplot2)
URL <- 'http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/bone.data'
bone <- read.table(URL, header = TRUE)
qplot(spnbmd, data=bone)
qplot(x=gender, y=spnbmd, data=bone, geom='boxplot')
qplot(spnbmd, data=bone, geom='histogram')+ facet_wrap(~gender)
qplot(spnbmd, data=bone, geom='density')+ facet_wrap(~gender)
qplot(spnbmd, data=bone)+ geom_density(col='red', size=1)+ facet_wrap(~gender)
qplot(spnbmd, data=bone, fill=gender, geom='density', alpha=1)
```

Diamonds example (Taken from Wickham's web site: http://had.co.nz/stat405/)
```{r}
?diamonds
dim(diamonds)
head(diamonds)
```

```{r}
qplot(carat, data = diamonds)
qplot(carat, data = diamonds, binwidth = 1)
qplot(carat, data = diamonds, binwidth = 0.1)
qplot(carat, data = diamonds, binwidth = 0.01)
resolution(diamonds$carat)
last_plot() + xlim(0, 3)

qplot(depth, data = diamonds, binwidth = 0.2)
qplot(depth, data = diamonds, binwidth = 0.2,fill = cut) + xlim(55, 70)
qplot(depth, data = diamonds, binwidth = 0.562) +xlim(55, 70) + facet_wrap(~ cut)

qplot(table, price, data = diamonds)
qplot(table, price, data = diamonds, geom = "boxplot")
qplot(table, price, data = diamonds, geom="boxplot",group = round(table))

qplot(carat, price, data = diamonds)
qplot(carat, price, data = diamonds, alpha = I(1/10))

qplot(carat, price, data = diamonds, geom = "bin2d", main='Count Heatmap')
qplot(carat, price, data = diamonds, geom = "hex")
qplot(carat, price, data = diamonds) + geom_smooth()
```


For more information on ggplot2 see http://had.co.nz/ggplot2 

### Interactive Plotting 
```{r}
# install.packages("iplots")
library(iplots)
data(iris)
attach(iris) 
```


The following plots are interconnected: selecting points in one of them selects the same points in all the others.
```{r}
iplot(Sepal.Width, Petal.Width)
iplot(Sepal.Width/Sepal.Length, Species)
ihist(iris$Sepal.Width)
ibar(Species)
```

Note: For more powerful interactive plotting try:
1. [Mondrian](http://en.wikipedia.org/wiki/Mondrian_%28software%29) (open source)
2. [JMP](http://www.jmp.com/en_us/home.html) (expensive)
3. Ggobi and [Rggobi](http://cran.r-project.org/web/packages/rggobi/index.html) (open source)


## Location Summaries
Examine the robustness of several location summaries
```{r}
# Generate data:
x<- rnorm(11)
last.observations<- seq(1, 1000, by=100)

# Compute location for the many different samples
means<- numeric(length=0)
medians<- numeric(length=0)
mean05<- numeric(length=0)
for (y in last.observations){
	all.data<- c(x,y)
	means<- c(means, mean(all.data))
	medians<- c(medians, median(all.data))
	mean05<- c(mean05, mean(all.data, trim=0.1))	
}

expanded.data<- expand.grid(x=x, y=last.observations)
with(expanded.data, boxplot(x~y, xlab='Last Observation'))
lines(means~last.observations, lty=2, lwd=3, col='red')
lines(medians~last.observations, lty=3, lwd=3, col='brown')
lines(mean05~last.observations, lty=4, lwd=3, col='blue')
```
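
The trimmed mean used above can be computed by hand, which makes its robustness easy to see: sort, drop a fraction `trim` of the observations from each end, and average the rest. A sketch on a tiny illustrative sample `small` with one wild observation:

```{r}
small <- c(1:9, 1000)

mean(small)                  # dragged far up by the single outlier
median(small)
mean(small, trim = 0.1)      # drops the smallest and the largest of the 10 values

# The same by hand: remove floor(n * trim) values from each end
n.drop <- floor(length(small) * 0.1)
kept <- sort(small)[(n.drop + 1):(length(small) - n.drop)]
mean(kept)
```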


## Spread & Symmetry Summaries
```{r}
x1<- rnorm(1000, 100, 50)
x2<- rexp(1000, 1/100)
x3<- rgamma(1000, 3, 1/30)
x4<- rf(1000, 100, 1)/100

plot(density(x1));#abline(v=100, lty=2);abline(v=0, lty=2)
plot(density(x2));#abline(v=100, lty=2);abline(v=0, lty=2)
plot(density(x3));#abline(v=100, lty=2);abline(v=0, lty=2)
plot(density(x4));#abline(v=100, lty=2);abline(v=0, lty=2)

boxplot(x1, x2, x3, x4, ylim=c(-100, 400))
```

###  Measures of spread
```{r}
cat('SD=', sqrt(var(x1)), 'MAD=', mad(x1), 'IQR=', IQR(x1), '\n') 
cat('SD=', sqrt(var(x2)), 'MAD=', mad(x2), 'IQR=', IQR(x2), '\n') 
cat('SD=', sqrt(var(x3)), 'MAD=', mad(x3), 'IQR=', IQR(x3), '\n') 
cat('SD=', sqrt(var(x4)), 'MAD=', mad(x4), 'IQR=', IQR(x4), '\n') 
```
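
The three measures are on comparable scales only after calibration. `mad()` already multiplies by a constant (1.4826 by default) so that it estimates the standard deviation of normal data; the IQR can be put on the same scale by dividing by `2*qnorm(0.75)`. A sketch on simulated normal data, where `z` and the seed are arbitrary choices:

```{r}
set.seed(1)
z <- rnorm(1e5, mean = 100, sd = 50)

spreads <- c(SD = sd(z),
             MAD = mad(z),                            # includes the 1.4826 constant
             scaled.IQR = IQR(z) / (2 * qnorm(0.75))) # IQR of a normal is about 1.349 sd
spreads
```

For the skewed samples `x2`, `x3` and `x4` the three numbers disagree, which is the point of comparing them.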

### Measures of Skewness 

Yule
```{r}
boxplot(x1, x2, x3)
boxplot(x1, x2, x3, x4)

yule<-function(x) { 
  (mean(c(quantile(x, 0.25), quantile(x, 0.75))-median(x))) / ( IQR(x)/2 ) 
  } 

yule(x1);yule(x2);yule(x3);yule(x4)
```

Pearson
```{r}
pearson<- function(x) {
	m<-mean(x)
	nom<-sum((x-m)^3)
	denom<- sum( (x-m)^2)^(3/2) # sums, not means: the result is the moment skewness divided by sqrt(n)
	return(nom/denom)	
}

pearson(x1)
pearson(x2)
pearson(x3)
pearson(x4)
```

The sensitivity function captures the sensitivity of a summary to the value of a single observation ("hypothesis stability" in machine learning).
```{r}
(new.obs<- -1000:1000)
mean.sensitivity<- rep(0,length(new.obs))
median.sensitivity<- rep(0,length(new.obs))
alpha.t.sensitivity<- rep(0,length(new.obs))
sd.sensitivity<- rep(0,length(new.obs))
mad.sensitivity<- rep(0,length(new.obs))
iqr.sensitivity<- rep(0,length(new.obs))
pearson.sensitivity<- rep(0,length(new.obs))
yule.sensitivity<- rep(0,length(new.obs))

for (i in seq_along(new.obs)){
	temp.data<- c(x1,new.obs[i])

	mean.sensitivity[i]<- mean(temp.data)
	median.sensitivity[i]<- median(temp.data)
	alpha.t.sensitivity[i]<- mean(temp.data, trim=0.1)
	pearson.sensitivity[i]<- pearson(temp.data)
	yule.sensitivity[i]<- yule(temp.data)
	sd.sensitivity[i]<- sd(temp.data)
	mad.sensitivity[i]<- mad(temp.data)
	iqr.sensitivity[i]<- IQR(temp.data)
}

plot(mean.sensitivity~new.obs, type='l')
lines(median.sensitivity~new.obs, lty=2)
lines(alpha.t.sensitivity~new.obs, lty=3)
legend(locator(1), legend=c('Mean','Median','Alpha Trimmed'), lty=c(1,2,3))
abline(v=median(x1),lty=5)
abline(v=quantile(x1,0.1),lty=6)
abline(v=quantile(x1,0.9),lty=6)


plot(sd.sensitivity~new.obs,type='l',ylim=c(40,80))
lines(mad.sensitivity~new.obs,lty=2)
lines(iqr.sensitivity~new.obs,type='l',lty=3)
legend(locator(1), legend=c('SD','MAD','IQR'), lty=c(1,2,3))

r<-0.05
plot(pearson.sensitivity~new.obs,ylim=c(-r,r),type='l')
lines(yule.sensitivity~new.obs, lty=2)
```


## Univariate transformations
```{r}
(periods <- c(rnorm(500, 15),rnorm(100, 10)))
(cells<-10*2^periods)

hist(cells); rug(cells)
hist(log(cells)); rug(log(cells))
```


# The Normal Distribution
```{r}
# The Standard normal distribution (a.k.a. Gaussian) PDF
mu<-0;sd<-1;
curve(dnorm(x, mean=mu, sd=sd), -5, 5, ylim=c(0, 1), col='red');abline(v=mu)  


#Non standard normal distribution
mu<-0; sd<-0.5; curve(dnorm(x, mean=mu, sd=sd), add=T);abline(v=mu)
mu<-1; sd<-2; curve(dnorm(x, mean=mu, sd=sd), add=T)
mu<--2; sd<-.2; curve(dnorm(x, mean=mu, sd=sd), add=T)

# CDFs of scaled Gaussians
curve(pnorm(x, 0, 1), -4, 4, main='Cumulative Standard Normal');abline(v=0);abline(0.5, 0, lty=2)
curve(pnorm(x, 0, 1), -10, 10, main='Cumulative Standard Normal'); abline(v=0);abline(0.5, 0, lty=2)
curve(pnorm(x, 0, 3), -10, 10, add=T, col='red')
curve(pnorm(x, 0, 5), -10, 10, add=T, col='blue')

# CDF of translated Gaussians
curve(pnorm(x, 0, 1), -10, 10, main='Cumulative Standard Normal'); abline(v=0); abline(0.5, 0, lty=2)
curve(pnorm(x, -1, 1), -10, 10, add=T, col='red')
curve(pnorm(x, 1, 1), -10, 10, add=T, col='blue')
legend(x=-7, y=0.8, legend=c(expression(mu==0), expression(mu==1), expression(mu==-1)), lty=1, col=c('black', 'blue', 'red'))

## Cumulative distribution function (CDF)
pnorm(0, mean=0, sd=1) 
pnorm(1.5, mean=0, sd=1)
pnorm(510, mean=500, sd=11)

## Inverse cumulative distribution function, a.k.a. quantile function
qnorm(0.2, mean=0, sd=1)  
qnorm(0.5, mean=0, sd=1)
qnorm(0.975, mean=10, sd=7)

## Probability density function
dnorm(0, mean=0, sd=1) 

```
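
The three function families above are tied together: `qnorm()` inverts `pnorm()`, and any Gaussian probability reduces to a standard normal one after standardizing. A small check using the same numbers as the chunk above (the `.demo` names are mine, chosen so that nothing else in this document is overwritten):

```{r}
# qnorm() is the inverse of pnorm()
p.demo <- pnorm(1.5, mean=0, sd=1)
q.demo <- qnorm(p.demo, mean=0, sd=1)   # recovers 1.5

# Standardizing: P(X <= 510) for X~N(500, 11^2) equals P(Z <= (510-500)/11)
direct.demo <- pnorm(510, mean=500, sd=11)
standardized.demo <- pnorm((510-500)/11)

# The standard normal density at 0 is 1/sqrt(2*pi)
peak.demo <- dnorm(0)
c(q.demo, direct.demo, standardized.demo, peak.demo)
```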

Demonstrating the CLT
```{r}
# The normal approximation of the binomial distribution
n<-100;p<-0.5
curve(dbinom(x, n, p), 0, n, type='h', ylim=c(0, 1)) 
curve(pbinom(x, n, p), 0, n, type='S', ylim=c(0, 1))# Binomial CDF
curve(pnorm(x, mean=n*p, sd=sqrt(n*p*(1-p))), add=T, col='red') # Gaussian approximation

n<-1000;p<-0.5
curve(pbinom(x, n, p), 0, n, type='s', ylim=c(0, 1)) # The cumulative binomial probability function
curve(pnorm(x, mean=n*p, sd=sqrt(n*p*(1-p))), add=T, col='red')

n<-10;p<-0.5
curve(pbinom(x, n, p), 0, n, type='S', ylim=c(0, 1)) # The cumulative binomial probability function
curve(pnorm(x, mean=n*p, sd=sqrt(n*p*(1-p))), add=T, col='red')

n<-10;p<-0.05
curve(pbinom(x, n, p), 0, n, type='S', ylim=c(0, 1)) # The cumulative binomial probability function
curve(pnorm(x, mean=n*p, sd=sqrt(n*p*(1-p))), add=T, col='red')
curve(dbinom(x, n, p), 0, n, type='h', ylim=c(0, 1))


# Now with a geometric population
rgeom(100, 0.3) 
hist(rgeom(1000, 0.3))
generating <- matrix(0, ncol=100, nrow=1000)

for (i in 1:1000) generating[i, ] <- rgeom(100, 0.3)
image(generating)

sums<-apply(generating, 1, sum);sums 
length(sums)

par(mfrow=c(1, 2)); hist(rgeom(1000, 0.3), main='Original Geometric Distribution'); hist(sums, main='Distribution of sum'); par(mfrow=c(1, 1))
```
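
The plots above judge the Gaussian approximation by eye. A rough numeric companion is the largest gap between the two CDFs over all integer points. The helper `approx.error()` below is my own illustration, not part of the original material; its `correction` argument optionally applies the usual continuity correction of 0.5.

```{r}
# Largest gap between the binomial CDF and its Gaussian approximation.
# 'correction' shifts the evaluation point (0.5 = continuity correction).
approx.error <- function(n, p, correction=0) {
  k <- 0:n
  exact <- pbinom(k, n, p)
  gauss <- pnorm(k + correction, mean=n*p, sd=sqrt(n*p*(1-p)))
  max(abs(exact - gauss))
}

approx.error(100, 0.5)        # plain approximation
approx.error(100, 0.5, 0.5)   # with continuity correction
approx.error(10, 0.05, 0.5)   # small n and skewed p: a poor fit
```

The third call mirrors the `n<-10;p<-0.05` case plotted above, where the approximation visibly struggles.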


## The QQ plot 
A simple and efficient tool for comparing distributions.
```{r}
mystery.2<-function(y) {
  n<-length(y)
  y<-sort(y)
  i<-1:n
  q<-(i-0.5)/n
  x<-qnorm(q, mean(y), sqrt(var(y)))
  plot(y~x, xlab='Theoretical Quantiles', ylab='Empirical Quantiles')
}

normals.1<-rnorm(100, 0, 1); hist(normals.1)
mystery.2(normals.1); abline(0, 1)

normals.2<-rnorm(100, 0, 10); hist(normals.2)
mystery.2(normals.2); abline(0, 1)

## No need to write the function every time...
qqnorm(normals.1)   
qqnorm(normals.2)   

## How would non-normal observations look? ##
non.normals.1<-runif(100); hist(non.normals.1)
mystery.2(non.normals.1); abline(0, 1)

non.normals.2<-rexp(100, 1); hist(non.normals.2)
mystery.2(non.normals.2); abline(0, 1)

non.normals.3<-rgeom(100, 0.5); hist(non.normals.3)
mystery.2(non.normals.3); abline(0, 1)

## Adapting for a non-normal distribution: ##
qq.uniform<-function(y) {
  n<-length(y);    y<-sort(y);    i<-1:n;    q<-(i-0.5)/n
  x<-qunif(q, min=min(y), max=max(y)) # each distribution will require its own parameters!
  plot(y~x, xlab='Theoretical Quantiles', ylab='Empirical Quantiles')
}
qq.uniform(non.normals.1);abline(0, 1)
qq.uniform(non.normals.2);abline(0, 1)
qq.uniform(normals.2);abline(0, 1)


# Checking the normal approximation of the binomial distribution with a qqplot.
n<-10;p<-0.2; binom.data=rbinom(100, n, p); mystery.2(binom.data);abline(0, 1)
n<-10;p<-0.5;binom.data=rbinom(100, n, p); mystery.2(binom.data);abline(0, 1)
n<-100;p<-0.5;binom.data<-rbinom(100, n, p); mystery.2(binom.data);abline(0, 1)
n<-1000;p<-0.5;binom.data<-rbinom(100, n, p); mystery.2(binom.data);abline(0, 1)


# Theoretical example
n<-100; p<-0.5
curve(dbinom(x, n, p), 0, n, type='h', ylim=c(0, 1))
curve(pbinom(x, n, p), 0, n, type='S', ylim=c(0, 1)) # The cumulative binomial probability function
curve(pnorm(x, mean=n*p, sd=sqrt(n*p*(1-p))), add=T, col='red')

n<-1000;p<-0.5
curve(pbinom(x, n, p), 0, n, type='s', ylim=c(0, 1)) # The cumulative binomial probability function
curve(pnorm(x, mean=n*p, sd=sqrt(n*p*(1-p))), add=T, col='red')

n<-10;p<-0.5
curve(pbinom(x, n, p), 0, n, type='S', ylim=c(0, 1)) # The cumulative binomial probability function
curve(pnorm(x, mean=n*p, sd=sqrt(n*p*(1-p))), add=T, col='red')

n<-10;p<-0.05
curve(pbinom(x, n, p), 0, n, type='S', ylim=c(0, 1)) # The cumulative binomial probability function
curve(pnorm(x, mean=n*p, sd=sqrt(n*p*(1-p))), add=T, col='red')
curve(dbinom(x, n, p), 0, n, type='h', ylim=c(0, 1))
```
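
`mystery.2()` is a hand-rolled normal QQ plot. For more than 10 observations, `qqnorm()` uses the same plotting positions $(i-0.5)/n$ (via `ppoints()`), but pairs them with *standard* normal quantiles rather than quantiles matched to the sample mean and SD. A quick check, with variable names of my own choosing:

```{r}
set.seed(1)
y.demo <- rnorm(100, 0, 10)
n.demo <- length(y.demo)

# Plotting positions used in mystery.2() and by ppoints() (for n > 10)
q.manual <- (1:n.demo - 0.5) / n.demo
q.ppoints <- ppoints(n.demo)
all.equal(q.manual, q.ppoints)

# qqnorm() returns its coordinates when asked not to plot
qq.demo <- qqnorm(y.demo, plot.it=FALSE)
all.equal(sort(qq.demo$x), qnorm(q.manual))
```

This is why the points in `qqnorm(normals.2)` fall along a line with slope near 10 rather than along the identity line.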


# Multiple data vectors 
We now leave the single-vector world and move to the analysis of dependencies between several vectors. 

## Scatter plots
```{r}
# Sine function
x<-seq(-pi, pi, 0.01)
y<-sin(x)
plot(y~x)

#Exponent function
x<-seq(-pi, pi, 0.01)
y<-exp(x)
plot(y~x)

# Sinc function
x<-seq(-10*pi, 10*pi, 0.01)
y<-sin(x)/x
plot(y~x)

# Fancy function
x<-seq(-pi, pi, 0.01)
y<-sin(exp(x))+cos(2*x)
plot(y~x)
plot(y~x, type='l')
plot(y~x, type='o')

## Some real life data
URL <- 'http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/ozone.data'
ozone <- read.table(URL, header=T)
names(ozone)
plot(ozone)
```


## 3D plotting  
```{r}
# install.packages('rgl')
library(rgl)
plot3d(ozone[, 1:3]) 
```


## Plotting a surface 
```{r}
x<-seq(0, 1, 0.01)
y<-seq(0, 1, 0.01)
xy.grid<-expand.grid(x, y)
func1<-function(mesh) exp(mesh[, 1]+mesh[, 2])
z<-func1(xy.grid)
xyz<-data.frame(xy.grid, z)
plot3d(xyz, xlab='x', ylab='y')  
```


##  Fitting linear lines and surfaces  
We will now try and fit linear surfaces to our data.

### Well behaved data 
```{r}
x <- 1:100
a <- 2
b <- 3.5
sigma <- 10
y <- a+b*x+rnorm(100, 0, sigma)
plot(y~x)
```

### Ordinary Least Squares 
```{r}
ols.line<-function(x, y){
    sxy<-sum( (x-mean(x) ) * (y-mean(y) ) )    
    sxx<-sum( (x-mean(x)) ^ 2 )
    b1<-sxy / sxx
    a1<-mean(y) - b1 * mean(x)
    return(list(slope=b1, intercept=a1))
}

ols<-ols.line(x, y) ; ols
plot(y~x) # Redraw the scatter plot so the line has something to be added to
abline(ols$intercept, ols$slope, lty=2, lwd=3)
predictions <-  ols$intercept + ols$slope * x
residuals<- y - predictions
plot(residuals) ; abline(h=0)
```
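
As a sanity check, the closed-form slope `sxy/sxx` is the same as `cov(x,y)/var(x)`, and both should agree with R's built-in `lm()`. The sketch below regenerates similar data under its own names (the seed and the `.demo` names are my assumptions) so that `x` and `y` above are left untouched.

```{r}
set.seed(1)
x.demo <- 1:100
y.demo <- 2 + 3.5*x.demo + rnorm(100, 0, 10)

# Closed form: slope = Sxy/Sxx = cov(x,y)/var(x)
slope.demo <- cov(x.demo, y.demo) / var(x.demo)
intercept.demo <- mean(y.demo) - slope.demo * mean(x.demo)

# Built-in fit
lm.demo <- lm(y.demo ~ x.demo)
rbind(manual=c(intercept.demo, slope.demo), lm=coef(lm.demo))
```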

### Robust regression
```{r}
rline  <- function(x, y)
{
  ind<- is.na(x) | is.na(y)
	x<- x[!ind]
	y<- y[!ind]
	o <- order(x)  #The permutation that will sort x into ascending order. 
	x <- sort(x)   #The sorted x vector
	y <- y[o]      #The sorted y vector
	n <- length(x) #The length of the dataset
	n3 <- round(n/3)
	xb <- median(x[c(1:n3)]) #The median of the x of the lower part
	yb <- median(y[c(1:n3)]) #The median of the y of the lower part
	xt <- median(x[c((n - n3 + 1):n)]) #The median of the x of the upper part
	yt <- median(y[c((n - n3 + 1):n)]) #The median of the y of the upper part
	b1 <- (yt-yb)/(xt-xb) # Calculating the slope
	b0 <- mean(c(yt, yb))-b1*mean(c(xt, xb)) # Calculating the (temporary) intercept
	pred <- b0 + b1 * x    
	mr <- median(y - pred) #The adjustment to the median of the initial residuals
	b0 <- b0 + mr          #The "adjusted" intercept
	pred <- b0 + b1 * x    #The final predictions
	resid <- y - pred      #The final errors
	return(list(intercept=b0,  slope=b1,  pred=pred,  resid=resid,  x=x,  xb=xb,  yb=yb,  xt=xt,  yt=yt))
}

# Robust regression with well behaved data
robust <- rline(x, y)
robust$intercept ; robust$slope
plot(y~x)
abline(robust$intercept, robust$slope, col='red', lwd=3, lty=2)

# Robust regression with ill behaved data
y <- a + b*x+rbinom(100, 1, 0.1)*(rbinom(100, 1, 0.5)-0.5)*10000
plot(y~x, ylim=c(-1, 1)*5000) ; abline(a, b, col='blue', lwd=3, lty=5)

plot(y~x, ylim=c(-1, 1)*500) ; abline(a, b, col='blue', lwd=3, lty=5)
ols.ill <- ols.line(x, y); ols.ill
abline(ols.ill$intercept, ols.ill$slope, col='brown')
robust.ill<-rline(x, y)
abline(robust.ill$intercept, robust.ill$slope, col='red')
legend(x=60,y=-200, lty=1, legend=c('Real', 'OLS line', 'Robust Line'), col=c('blue', 'brown', 'red'), lwd=c(3, 1, 1))
```
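
`rline()` resembles Tukey's resistant line, which base R provides as `line()`. The sketch below fits both `line()` and `lm()` to contaminated data similar to the above. The seed and the `.demo` names are mine, and `line()` differs from `rline()` in its details, so the two robust fits need not coincide exactly.

```{r}
set.seed(1)
x.demo <- 1:100
# About 10% of the points are pushed 5000 up or down
outliers.demo <- rbinom(100, 1, 0.1) * (rbinom(100, 1, 0.5) - 0.5) * 10000
y.demo <- 2 + 3.5*x.demo + outliers.demo

tukey.demo <- line(x.demo, y.demo)   # resistant line, base R
ols.demo <- lm(y.demo ~ x.demo)      # least squares
rbind(tukey=coef(tukey.demo), ols=coef(ols.demo))
```

The true intercept and slope are 2 and 3.5; the resistant fit should stay close to them while least squares is free to wander.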

### How are the intercept and slope related? 
This little simulation demonstrates that if the data is not centered, then a correlation is to be expected between the slope and intercept estimates.
```{r}
x<-1:100
a<-2
b<-3.5
sigma<-10
iterations<-100
retries <- matrix(0, nrow=iterations, ncol=2) # One row per iteration
for (i in 1:iterations){
    y<-a+b*x+rnorm(100, 0, sigma)
    temp.robust<-rline(x, y)  
    retries[i, 1]<-temp.robust$slope-b        
    retries[i, 2]<-temp.robust$intercept-a
}

ind<- as.numeric((retries[, 1]>0) & (retries[, 2]>0))+1
plot(retries, xlab='Deviation from real slope', ylab='Deviation from real intercept', pch=ind)     
abline(v=0);abline(h=0)
```
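
To see that centering is what matters, one can repeat the experiment with and without centering `x`. The sketch below swaps in `lm()` for `rline()` to keep it short (my substitution, not the method used above); for OLS the covariance between the two estimates is proportional to $-\bar{x}$, so it vanishes when $\bar{x}=0$.

```{r}
set.seed(1)
x.demo <- 1:100

# Simulate one data set on the given x and return c(intercept, slope)
fit.once <- function(x) {
  y <- 2 + 3.5*x + rnorm(length(x), 0, 10)
  coef(lm(y ~ x))
}

raw.demo <- t(replicate(200, fit.once(x.demo)))
centered.demo <- t(replicate(200, fit.once(x.demo - mean(x.demo))))

cor(raw.demo[, 1], raw.demo[, 2])            # strongly negative
cor(centered.demo[, 1], centered.demo[, 2])  # near zero
```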

### What is a high correlation?
```{r}
samples<- 100
x <- 1:samples
a <- 2
b <- 3.5
sigma <- 500
y <- a+b*x+rnorm(samples, 0, sigma)
(initial.correlation<- cor(y,x))

# Permute the data and compute the correlation:
iterations<-1000
correlations<- rep(NA, iterations)
for (i in 1:iterations){
  ind<- sample(length(y))
  correlations[i] <- cor(y[ind], x)
}
hist(correlations, xlim=c(-1,1))
rug(correlations)
abline(v=initial.correlation)
```
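
The histogram answers "what is high?" visually. The same permutations also give a number: the share of permuted correlations at least as extreme as the observed one, i.e. a permutation p-value. A self-contained sketch under the same data-generating setup (seed and `.demo` names are my own):

```{r}
set.seed(1)
x.demo <- 1:100
y.demo <- 2 + 3.5*x.demo + rnorm(100, 0, 500)
observed.demo <- cor(y.demo, x.demo)

# Correlations after breaking the x-y pairing
permuted.demo <- replicate(1000, cor(sample(y.demo), x.demo))

# Share of permutations at least as extreme as the observed correlation
p.value.demo <- mean(abs(permuted.demo) >= abs(observed.demo))
p.value.demo
```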

### Dangers of Extrapolation  
```{r}
x<-runif(1000)*5
y<-exp(x)+rnorm(1000)
plot(y~x, main='Whole relation')

rect(xleft=0, ybottom=-5, xright=2, ytop=10)

plot(y~x, main='Local relation', cex=0.5, xlim=c(0, 2), ylim=c(-5, 10));abline(v=2, lty=3)

ind<-x<=2;ind
ols.interpolating<-ols.line(x[ind], y[ind]);ols.interpolating
abline(ols.interpolating$intercept ,  ols.interpolating$slope, col='red')
text(x=0.5, y=6, labels='Interpolates Nicely', cex=2)

plot(y~x, main='Whole relation')
abline(ols.interpolating$intercept ,  ols.interpolating$slope, col='red')
abline(v=2, lty=3)
text(x=2, y=121, labels='Extrapolates Terribly!', cex=2)

# Non-linearity might be fixed with a transformation:
# Which of the following looks better (more linear)? 
plot(y~exp(x))
plot(log(y)~x)
plot(log(y)~log(x))
```

### Intuition underlying Pearson's correlation 
It is actually the *OLS* slope if dealing with *standardized* variables.
```{r}
x<-1:100;cat('Old X average=', mean(x), '. Old X Variance=', var(x), '\n')
a<-2 ; b<-3.5 ; sigma<-10
y<-a+b*x+rnorm(100, 0, sigma);cat('Old Y average=', mean(y), '. Old Y Variance=', var(y), '\n')
plot(y~x)

x.stan <- (x-mean(x))/sqrt(var(x));cat('New X average=', mean(x.stan), '. New X variance=', var(x.stan), '\n')
y.stan<-(y-mean(y))/sqrt(var(y));cat('New Y average=', round(mean(y.stan), 2), '. New Y variance=', var(y.stan), '\n')
plot(y~x, cex=0.5);points(y.stan~x.stan, col='red')

ols.line(x.stan, y.stan) # Calculating the slope in Z-score scale
cor(x.stan, y.stan) # Calculating the correlation coef of the Z-scores  #MatLab: corrcoef
cor(x, y) # The cor coef in the original scale is the same as the slope in Z-score scale!
```
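
Why the numbers agree: writing $s_x$ and $s_y$ for the sample standard deviations and $r_{xy}$ for Pearson's correlation, the OLS slope from `ols.line()` can be rewritten as

$$
b_1 = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2}
    = r_{xy} \, \frac{s_y}{s_x},
\qquad
r_{xy} = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{(n-1)\, s_x s_y}.
$$

After standardizing, $s_x = s_y = 1$, so $b_1 = r_{xy}$. Since the correlation is unchanged by shifting and (positively) rescaling either variable, `cor(x.stan, y.stan)` also equals `cor(x, y)`.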

### Multivariate linear regression 
```{r}
# install.packages('rgl')
library(rgl)

xy.grid <- data.frame(x1=runif(10000), x2=runif(10000))

func1<-function(mesh, a0, a1, a2, sigma) {
    n<-nrow(mesh)
    a0 + a1 * mesh[, 1] + a2 * mesh[, 2] + rnorm(n, 0, sigma)
    }
    
# More noise hides the structure in the data:
z<-func1(xy.grid, a0=5, a1=1, a2=3, .0); z; xyz=data.frame(xy.grid, z); plot3d(xyz, xlab='x1', ylab='x2')
z<-func1(xy.grid, a0=5, a1=1, a2=3, .4); xyz=data.frame(xy.grid, z); plot3d(xyz, xlab='x1', ylab='x2')
z<-func1(xy.grid, a0=5, a1=1, a2=3, 11); xyz=data.frame(xy.grid, z); plot3d(xyz, xlab='x1', ylab='x2')

z<-func1(xy.grid, a0=5, a1=1, a2=3, .4); xyz=data.frame(xy.grid, z); plot3d(xyz, xlab='x1', ylab='x2')
```

`lm()` is the major workhorse for OLS. It returns the coefficient estimates $\hat{\beta} = (X'X)^{-1} X'y$.
```{r}
lm(z~., xyz) # Estimates the coefficients: beta.hat = (X'X)^-1 X'y
```
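
The formula can be checked directly by building the design matrix and solving the normal equations. This is only a sketch on freshly simulated data (the seed and the `.demo` names are my assumptions); `lm()` itself works through a QR decomposition rather than inverting $X'X$, but the two routes agree up to rounding.

```{r}
set.seed(1)
xyz.demo <- data.frame(x1=runif(1000), x2=runif(1000))
xyz.demo$z <- 5 + 1*xyz.demo$x1 + 3*xyz.demo$x2 + rnorm(1000, 0, 0.4)

# Design matrix with an intercept column
X.demo <- model.matrix(z ~ ., xyz.demo)
# Solve (X'X) beta = X'y without forming the inverse explicitly
beta.demo <- solve(crossprod(X.demo), crossprod(X.demo, xyz.demo$z))

cbind(normal.equations=drop(beta.demo), lm=coef(lm(z ~ ., xyz.demo)))
```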


### Linearizing Transformations and Examining Residuals
See also http://www.gapminder.org/
```{r}
## Log example #1:
x<-runif(1000)*5
y<-exp(x)+rnorm(1000)
plot(y~x, main='Whole relation')
rect(xleft=0, ybottom=-5, xright=2, ytop=10)
plot(y~x, main='Local relation', cex=0.5, xlim=c(0, 2), ylim=c(-5, 10));abline(v=2, lty=3)
ind<-x<=2;ind
ols.interpolating<-ols.line(x[ind], y[ind]);ols.interpolating
abline(ols.interpolating$intercept ,  ols.interpolating$slope, col='red')
plot(y~x, main='Whole relation')
abline(ols.interpolating$intercept ,  ols.interpolating$slope, col='red')
abline(v=2, lty=3)

yResiduals<- y - (ols.interpolating$intercept+ols.interpolating$slope * x)
par(mfrow=c(1,1))
plot(y~x, main='Local relation; scatter', cex=0.5, xlim=c(0, 2), ylim=c(-5, 10)); abline(v=2, lty=3);
abline(ols.interpolating$intercept ,  ols.interpolating$slope, col='red')
plot(yResiduals ~x, main='Local relation; residuals', cex=0.5, xlim=c(0, 2), ylim=c(-5, 10)); abline(0,0) 
# Note: Non-linearity is easier to spot when inspecting residuals!

par(mfrow=c(1,1))
new.line<- ols.line(exp(x[ind]), y[ind]) # transforming x:
predicted.y<- new.line$intercept+new.line$slope * exp(x)
yResiduals2<- y - (predicted.y)
plot(yResiduals2 ~x, main='Local relation', cex=0.5, xlim=c(0, 2), ylim=c(-5, 10));abline(0,0) # Residuals look much better!
plot(y~x)
points(predicted.y~x, col='red', cex=0.5)


# Now inspect normality of residuals:
qqnorm(yResiduals); qqline(yResiduals)
qqnorm(yResiduals2); qqline(yResiduals2)
par(mfrow=c(1,1))

# Has R^2 improved?
compute.R2<- function(x,y){
	new.line<- ols.line(x, y)
	Residuals<- y - (new.line$intercept+new.line$slope * x)
	SSR<- sum(Residuals^2)
	SST<- sum((y-mean(y))^2)
	R2<- 1-SSR/SST
	return(R2)
}
compute.R2(x,y);cor(x,y)^2
compute.R2(exp(x),y);cor(exp(x),y)^2
```

### Distribution of y|x:
Judging by the QQplot: $Y|x \sim \mathcal{N}(a + b \exp(x), \sigma_e^2)$
and $\sigma_e^2 = (1-R^2) \sigma_y^2$.
What is your guess of the next value of y?
What is your guess of the range of 68% of values of y?  
 
```{r}
plot(y~x, cex=0.2, xlim=c(0,3), ylim=c(0,10))
points(predicted.y~x, col='red', cex=0.2)
R2<- compute.R2(exp(x),y)
sigma.epsilon<- sqrt( (1-R2)* var(y))
segments(x, predicted.y-sigma.epsilon, x,  predicted.y+sigma.epsilon, col='lightgrey', cex=0.2)
```
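
One way to answer the two questions: the best single guess at a given `x` is the fitted value, and under the Gaussian model roughly 68% of the values fall within one $\sigma_e$ of it. The sketch below uses `lm()` on freshly simulated data (the seed, the `.demo` names and the choice `x = 1` are mine) and also confirms the identity $\sigma_e^2 = (1-R^2)\sigma_y^2$ against the residual variance.

```{r}
set.seed(1)
x.demo <- runif(1000)*5
y.demo <- exp(x.demo) + rnorm(1000)

fit.demo <- lm(y.demo ~ exp(x.demo))
R2.demo <- summary(fit.demo)$r.squared

# Residual SD from the identity sigma_e^2 = (1-R^2) * sigma_y^2
sigma.e.demo <- sqrt((1 - R2.demo) * var(y.demo))
all.equal(sigma.e.demo^2, var(resid(fit.demo)))

# Point guess and a rough 68% range for y at x = 1
new.x <- 1
guess.demo <- sum(coef(fit.demo) * c(1, exp(new.x)))
c(lower=guess.demo - sigma.e.demo, guess=guess.demo, upper=guess.demo + sigma.e.demo)
```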

### Log Linear Model 
```{r}
x<- runif(100)
a<- 2
b<- 3.5
y<- exp(a+b*x+rnorm(100,sd=0.2))
plot(y~x)

line.1<- ols.line(y=y,x=exp(x))
preds.1<- line.1$intercept + line.1$slope * exp(x)
resids.1<- y-preds.1
plot(y~x);points(preds.1 ~ x, col='red')
plot(resids.1~x);abline(0,0)

line.2<- ols.line(y=log(y),x=x)
preds.2<- exp(line.2$intercept + line.2$slope * x)
plot(y~x);points(preds.2 ~ x, col='red')
resids.2<- y - preds.2
plot(resids.2~x);abline(0,0)
resids.2.2<- log(y) - log(preds.2)
plot(resids.2.2~x);abline(0,0)

```
Q: What is this good for?!?
A: Prediction intervals! (amongst others)


In the log scale, where the model is linear:
```{r}
R2<- compute.R2(x=x,y=log(y))
sigma.epsilon<- sqrt( (1-R2)* var(log(y)))
plot(log(y)~x);points(log(preds.2) ~ x, col='red')
segments(x, log(preds.2)-sigma.epsilon, x,  log(preds.2)+sigma.epsilon, col='darkgreen', cex=0.2)
# In exp() scale:
plot(y~x); points(preds.2 ~ x, col='red')
segments(x, exp(log(preds.2)-sigma.epsilon), x,  exp(log(preds.2)+sigma.epsilon), col='darkgreen', cex=0.2)
```

### Log-Log Model
```{r}
x<- runif(100)
a<- 2
b<- 3.5
y<- exp(a + b * log(x)+rnorm(100, sd=1))
plot(y~x)

ols.line.1<- ols.line(x=x, y=y)
predict.1<- ols.line.1$intercept + ols.line.1$slope * x
plot(y~x); points(predict.1~x, col='red')

ols.line.2<- ols.line(x=exp(x), y=y)
predict.2<- ols.line.2$intercept + ols.line.2$slope * exp(x)
plot(y~x); points(predict.2~x, col='red')

ols.line.3<- ols.line(x=x, y=log(y))
predict.3<- exp(ols.line.3$intercept + ols.line.3$slope * x)
plot(y~x); points(predict.3~x, col='red')
plot(log(y)~x);points(log(predict.3) ~ x, col='red')


ols.line.4<- ols.line(x=log(x), y=log(y))
predict.4<- exp(ols.line.4$intercept + ols.line.4$slope * log(x))
plot(y~x); points(predict.4~x, col='red')
plot(log(y)~x);points(log(predict.4) ~ x, col='red')
plot(log(y)~log(x))

(R2<- compute.R2(x=log(x),y=log(y)))
sigma.epsilon<- sqrt( (1-R2)* var(log(y)))
plot(y~x); points(predict.4 ~ x, col='red')
segments(x, exp(log(predict.4)-sigma.epsilon), x,  exp(log(predict.4)+sigma.epsilon), col='darkgreen', cex=0.2)
```


Interpreting a log-log model:
b is the *percent* change in y for a one-percent change in x!
Why? 
```{r}
e.y<- expression(exp(a)*x**b)
D(e.y,'x')
```
So $\Delta y/y \approx b \, \Delta x/x$. Hurray!
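
Spelling out the step from the derivative to that statement (ignoring the noise term, so that $y = e^{a} x^{b}$):

$$
\frac{dy}{dx} = b \, e^{a} x^{b-1} = b \, \frac{y}{x}
\quad \Longrightarrow \quad
\frac{dy}{y} = b \, \frac{dx}{x}.
$$

For small changes, the relative change in $y$ is therefore about $b$ times the relative change in $x$.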

### Varying Variance (heteroskedasticity) 
```{r}
x<- runif(100, max=2*pi)
a<- 2
b<- 3.5
sds<- 1+sin(x)
plot(sds~x)
y<- a + b * x + rnorm(100, sd=sds)
plot(y~x)

ols.line.1<- ols.line(x=x, y=y)
predict.1<- ols.line.1$intercept + ols.line.1$slope * x
plot(y~x); points(predict.1~x, col='red')
(R2<- compute.R2(x,y))
sigma.epsilon<- sqrt( (1-R2)* var(y))
segments(x, predict.1-sigma.epsilon, x,  predict.1+sigma.epsilon, col='darkgreen', cex=0.2)

```


# Random methods

## Bootstrapping 
```{r}
heights<-read.table('heights.txt')[, 1]

heights.sample<-sample(heights, 100, replace=T)
```

Estimating the variance and MSE of different estimators of the expectation.
True variance= 225
True variance of the mean
```{r}
B<-10000
means=NULL
medians=NULL
alpha.trims<-NULL
for (i in 1:B) {
    boot.sample<-sample(heights.sample, replace=T)
    means<-c(means, mean(boot.sample))
    medians<-c(medians, median(boot.sample))
    alpha.trims<-c(alpha.trims, mean(boot.sample, trim=0.1))    
}

n<-length(heights.sample)
population.variance<-sum( (heights.sample-mean(heights.sample))^2  ) / (n-1)
```

The variance of the mean using theoretical considerations
```{r}
population.variance/n 
```

The variance of the mean estimated using the BootStrap
```{r}
boot.mean.variance<-sum( (means-mean(means))^2  ) / (B-1);boot.mean.variance

# The MSE of the mean estimated using the BootStrap
boot.mean.MSE<-sum( (means-mean(heights.sample))^2  ) / (B-1) 
boot.mean.MSE

cat('True variance of the mean=2.55 Unbiased estimator=', population.variance/n, '\n', 'BootStrap estimation=', boot.mean.variance, '\n')
```


The variance of the median estimated using the BootStrap
```{r}
boot.median.variance<-sum( (medians-mean(medians))^2  ) / (B-1) ;boot.median.variance
boot.median.bias<-mean(medians)-median(heights.sample);boot.median.bias
boot.median.variance + boot.median.bias^2
# The MSE of the median estimated using the BootStrap
boot.median.MSE<-sum( (medians-median(heights.sample))^2  ) / (B-1);boot.median.MSE 
```

```{r}
# The variance of the trimmed mean estimated using the BootStrap
boot.trim.variance<-sum( (alpha.trims-mean(alpha.trims))^2  ) / (B-1);boot.trim.variance
boot.trim.bias<-mean(alpha.trims)-mean(heights.sample, trim=0.1);boot.trim.bias
boot.trim.variance + boot.trim.bias^2
# The MSE of the trimmed mean estimated using the BootStrap
boot.trim.MSE<-sum( (alpha.trims-mean(heights.sample, trim=0.1))^2  ) / (B-1);boot.trim.MSE 

cat('BootStrap mean MSE=', boot.mean.MSE,
	'\n BootStrap median MSE=', boot.median.MSE, 
	'\n BootStrap trimmed mean MSE=', boot.trim.MSE, 
	'\n')
```


Is the bootstrap a good estimator of the MSE? Let's try it on the median:
```{r}
B<-10000
means<-NULL
medians<-NULL
alpha.trims<-NULL

for (i in 1:B) {
    # This time I sample from the population and not the sample!
    new.sample<-sample(heights, 100, replace=T) # Named so that it does not mask sample()
    means<-c(means, mean(new.sample))
    medians<-c(medians, median(new.sample))
    alpha.trims<-c(alpha.trims, mean(new.sample, trim=0.1))    
}

median.variance<-sum( (medians-mean(medians))^2  ) / (B-1) ;median.variance
median.bias<-mean(medians)-175;median.bias
median.variance+median.bias^2
median.MSE<-sum( (medians-175)^2  ) / (B-1);median.MSE 

cat('BootStrap median MSE estimation=', boot.median.MSE, 
'\n Simulation MSE estimation=', median.MSE, '\n')
```
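
For readers without `heights.txt`, the core bootstrap loop can be tried on simulated data. The sketch below assumes a Gaussian stand-in with mean 175 and SD 15 (my assumption, chosen to match the variance of 225 quoted above) and uses `replicate()` instead of growing a vector inside a `for` loop.

```{r}
set.seed(1)
# Stand-in for heights.sample: 100 draws from N(175, 15^2)  (an assumption)
sample.demo <- rnorm(100, mean=175, sd=15)
n.demo <- length(sample.demo)

# Bootstrap: resample with replacement, recompute the statistic
boot.means.demo <- replicate(2000, mean(sample(sample.demo, replace=TRUE)))

# Bootstrap variance of the mean vs. the textbook estimate s^2/n
c(bootstrap=var(boot.means.demo), theory=var(sample.demo)/n.demo)
```

The two numbers should be of similar size; they will not match exactly, since the bootstrap value carries Monte Carlo noise.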


## CI Simulations 
This code creates samples, computes the CI of the expectation assuming unknown variance, and checks how many CIs have captured the true expectation.
```{r}

reps<-100
samples<-matrix(rnorm(30*reps, 170, 10), ncol=30) # creating samples

my.ci<-function(sample, alpha) {
    n<-length(sample)
    y<-mean(sample)
    s<-sqrt( sum( (sample-y)^2  ) / (n-1)    )
    t.quantile<-qt(1-alpha/2, n-1) # Named so that it does not mask c()
    return( c( y-s/sqrt(n)*t.quantile, y+s/sqrt(n)*t.quantile  ))
} # defining the expectancy CI assuming unknown variance

n<-dim(samples)[1]
CIs<-matrix(0, ncol=2, nrow=n);CIs # Just preparing the array to hold the data.
alpha<-0.1
for (i in 1:n) CIs[i, ]<-my.ci(samples[i, ], alpha) #Calculating CIs for each sample
CIs;CIs<-cbind(CIs, rep(NA, n))

plot(NULL, xlim=c(min(CIs[, 1:2]), max(CIs[, 1:2])), ylim=c(1, n), xlab='Interval', ylab='Sample')
lines(x=as.vector(t(CIs)), y=rep(seq(1:n), each=3));abline(v=170)

(CIs[, 1]<170)*(CIs[, 2]>170) # Which CIs actually contain the true expectation?
table((CIs[, 1]<170)*(CIs[, 2]>170)) # How many of them?

oks<-factor((CIs[, 1]<170)*(CIs[, 2]>170), levels=c(0, 1), labels=c('Missed', 'Got it!'));table(oks)
```
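
The hand-written `my.ci()` computes the classical t-interval, which is also what `t.test()` reports. A short check on one simulated sample (the seed and the `.demo` names are mine):

```{r}
set.seed(1)
sample.demo <- rnorm(30, 170, 10)
alpha.demo <- 0.1
n.demo <- length(sample.demo)

# Manual interval: mean +/- t quantile * standard error
half.width <- qt(1 - alpha.demo/2, df=n.demo-1) * sd(sample.demo)/sqrt(n.demo)
manual.ci <- mean(sample.demo) + c(-1, 1) * half.width

# Built-in interval
builtin.ci <- as.numeric(t.test(sample.demo, conf.level=1-alpha.demo)$conf.int)
rbind(manual.ci, builtin.ci)
```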


# String handling
```{r}
print("Hello\n") 	# Wrong!
show("Hello\n") 	# Wrong!
cat("Hello\n")		# Right!

# Windows directories need double escapes:
print("C:\\Program Files\\") 
cat("C:\\Program Files\\", sep="\n")

# String concatenation:
paste("Hello", "World", "!")
paste("Hello", "World", "!", sep="")
paste("Hello", " World", "!", sep="")

x <- 5
paste("x=", x)
paste("x=", x, sep="")

cat("x=", x, "\n") #Too many spaces :-(
cat("x=", x, "\n", sep="")

# Collapsing strings:
s <- c("Hello", " ", "World", "!")
paste(s)
paste(s, sep="")
paste(s, collapse="")
paste(s, collapse=" 1")


s <- c("Hello", "World!")
paste(1:3, "Hello World!")
paste(1:3, "Hello World!", sep=":")
paste(1:3, "Hello World!", sep=":", collapse="\n")
cat(paste(1:3, "Hello World!", sep=":", collapse="\n"), "\n") # cat() does not collapse :-(


# Substrings:
s <- "Hello World"
substring(s, start=4, stop=6)

# Splits:
s <- "foo, bar, baz"
strsplit(s, ", ")

s <- "foo-->bar-->baz"
strsplit(s, "-->")

# Using regular expressions (see ?regexp):
s <- "foo, bar, baz"
strsplit(s, ", *")
strsplit(s, "")

# Looking in *vectors* of strings:
(s <- apply(matrix(LETTERS[1:24], nrow=4), 2, paste, collapse=""))

grep("O", s) # Returns location
grep("O", s, value=T) # Returns value


regexpr(pattern="o", text="Hello")
regexpr(pattern="o", text=c("Hello", "World!"))

s <- c("Hello", "World!")
regexpr("o", s)
s <- c("Helll ooo", "Wrld!")
regexpr("o", s)

# Fuzzy (approximate) matches:
grep ("abc", c("abbc", "jdfja", "cba")) 	# No match :-(
agrep ("abc", c("abbc", "jdfja", "cba")) 	# Match! :-)

## Note: agrep() is the function used in help.search()
s <- "foo bar baz"
gsub(pattern=" ", replacement="", s)   # Remove all the spaces
s <- "foo  bar   baz"
gsub("  ", " ", s)
gsub(" +", "", s) # Using regular expression
gsub(" +", " ", s)  # Remove multiple spaces and replace them by single spaces

s <- "foo bar baz"
sub(pattern=" ", replacement="", s) # sub() only replaces the first occurrence.
gsub("  ", " ", s)
```


If you use strings often, try the stringr package.
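
For a taste, here are `stringr` counterparts of a few base calls from the chunk above. This assumes the package is installed; the example strings are the ones used earlier, and `s.demo` is a name of my own.

```{r}
library(stringr)   # assumed to be installed

s.demo <- "foo  bar   baz"
str_replace_all(s.demo, " +", " ")     # like gsub(): squeeze runs of spaces
str_split(s.demo, " +")                # like strsplit(): split on a regular expression
str_sub("Hello World", 4, 6)           # like substring(): extract by position
str_detect(c("Hello", "World!"), "o")  # like grepl(): which strings contain "o"?
str_c("x=", 5)                         # like paste(..., sep="")
```

The main draw is consistency: every function starts with `str_` and takes the string as its first argument.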


# Beating memory constraints
```{r setup, include=FALSE}
library(knitr)
opts_chunk$set(cache=TRUE)
```


## Diagnostics


## Tips and Tricks

1. For *batch* algorithms, memory usage should not exceed 30%.
  
2. Swap files:
  - NEVER use a swap file.
  - No matter what the monitors say, if it takes a long time to quit a job, you are facing a memory constraint.

3. R releases memory only when needed, not when possible ("lazy" release).

4. Don't count on R returning RAM to the operating system. Restart R if FACEBOOK slows down. 
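
A small base-R illustration of the bookkeeping behind these tips: how much an object occupies, and how to ask R for a garbage collection once it is removed. Whether the freed memory actually goes back to the operating system depends on the platform, so treat this as a sketch; the `.demo` names are mine.

```{r}
# Allocate 1e6 doubles: about 8 bytes each
big.demo <- numeric(1e6)
size.demo <- object.size(big.demo)
print(size.demo, units="MB")

# Removing the object only makes its memory reclaimable;
# gc() triggers a collection and reports R's current usage.
rm(big.demo)
invisible(gc())
```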


## Bla bla... Let's see some code!
Inspired by [this post](http://www.r-bloggers.com/bigglm-on-your-big-data-set-in-open-source-r-it-just-works-similar-as-in-sas/).


Download a fat data file:
```{r download_data}
# download.file("http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/BSAPUFS/Downloads/2010_Carrier_PUF.zip", "2010_Carrier_PUF.zip")
# unzip(zipfile="2010_Carrier_PUF.zip")
```


`data.table` package for fast imports.
```{r import_data}
library(data.table)
data <- fread(input = "2010_BSA_Carrier_PUF.csv", 
                   sep = ',',
                   header=TRUE)

library(magrittr)
data %>% 
  setnames(c("sex", "age", "diagnose", "healthcare.procedure", "typeofservice", "service.count", "provider.type", "servicesprocessed", "place.served", "payment", "carrierline.count"))


object.size(data)
pryr::object_size(data)
```


Is a copy of the object created? Nope.
```{r tracemem}
tracemem(data)
.test <- glm(payment ~ sex + age + place.served, data = data[1:1e2,], family=poisson) 
```


What is the change in memory allocation?
```{r RAM_change}
library(pryr)
mem_change(
  glm(payment ~ sex + age + place.served, data = data[1:1e2,], family=poisson) 
    )
```
Yeah, but what about all those matrix multiplications in the process?!?
Let's do a line-by-line analysis.

```{r lineprof}
# devtools::install_github("hadley/lineprof")

prof <- lineprof::lineprof(
  glm(payment ~ sex + age + place.served, data = data)
  )
lineprof::shine(prof)
```


But actually, I just like to have my Task-Manager constantly open:
```{r inspect_RAM}
# Run and inspect RAM/CPU
glm(payment ~ sex + age + place.served, data = data, family=poisson)
```


Now let's artificially scale the problem.
Note: `copies` is small so that fitting can be done in real-time.
To demonstrate the problem, I would have set `copies <- 10`.
```{r artificial_scale}
copies <- 2
data.2 <- do.call(rbind, lapply(1:copies, function(x) data) )
data.2 %>% dim
data %>% object_size
data.2 %>% object_size
```


When you run the following code at home, it will *not* show memory exhaustion, but will take a long time to run and to release when stopped.
It is thus a *memory* constraint.
```{r}
## Don't run:
## glm.2 <-glm(payment ~ sex + age + place.served, data = data.2, family=poisson)
```
Since the data easily fits in RAM, it can be fixed simply by a *streaming* algorithm. 


The following object can't even be stored in memory. 
Streaming *from RAM* will not solve the problem. 
We will get back to this...
```{r}
## Don't run:
## copies <- 1e2
## data.3 <- do.call(rbind, lapply(1:copies, function(x) data) )
```


## Streaming regression from RAM 
Using the `biglm` package:
```{r biglm}
library(biglm)
mymodel <- bigglm(payment ~ sex + age + place.served, 
                  data = data.2, 
                  family = poisson(), 
                  maxit=1e3)

# Too long! Quit the job and time the release.

# For demonstration: OLS example with original data.
mymodel <- bigglm(payment ~ sex + age + place.served, data =data )
mymodel <- data %>% bigglm(payment ~ sex + age + place.served, data =. )
```
Remarks:
- R is immediately(!) available after quitting the job.
- `bigglm` objects behave (almost) like `glm` objects w.r.t. `coef`, `summary`,...
- `bigglm` is aimed at *memory* constraints. Not speed.
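
To make the *streaming* idea concrete, here is a base-R sketch for plain least squares: only the small matrices $X'X$ and $X'y$ are kept, and they are updated one chunk of rows at a time. This is an illustration of the principle, not `biglm`'s implementation (as far as I know, `biglm` uses an incremental QR update, which is numerically safer). The data, chunk size and names are my assumptions.

```{r}
set.seed(1)
n.demo <- 1000
X.demo <- cbind(1, runif(n.demo), runif(n.demo))
y.demo <- drop(X.demo %*% c(5, 1, 3)) + rnorm(n.demo)

# Accumulate X'X and X'y over chunks of 100 rows
XtX <- matrix(0, 3, 3)
Xty <- numeric(3)
for (start in seq(1, n.demo, by=100)) {
  rows <- start:(start + 99)
  XtX <- XtX + crossprod(X.demo[rows, ])
  Xty <- Xty + drop(crossprod(X.demo[rows, ], y.demo[rows]))
}

beta.stream <- solve(XtX, Xty)                      # chunked estimate
beta.full <- unname(coef(lm(y.demo ~ X.demo - 1)))  # all data at once
cbind(beta.stream, beta.full)
```

Memory use depends on the number of columns and the chunk size, not on the number of rows, which is what lets the approach scale.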


## Exploit sparsity
Very relevant to factors with many levels.
```{r}
reps <- 1e6
y<-rnorm(reps)
x<- letters %>% 
  sample(reps, replace=TRUE) %>% 
  factor

X.1 <- model.matrix(~x-1) # Make dummy variable matrix

library(MatrixModels)
X.2<-as(x,"sparseMatrix") %>% t # Makes sparse dummy matrix

dim(X.1)
dim(X.2)

object_size(X.1)
object_size(X.2)
```


```{r}
system.time(lm.1 <- lm(y ~ X.1))
system.time(lm.1 <- lm.fit(y=y, x=X.1))
system.time(lm.2 <- MatrixModels:::lm.fit.sparse(X.2,y))

all.equal(lm.2, unname(lm.1$coefficients), tolerance = 1e-12)
```
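
The saving is easy to estimate by hand: a dense dummy matrix stores `rows * levels` doubles, almost all of them zero, while a sparse one stores roughly one entry per row. The sketch below uses `sparse.model.matrix()` from the `Matrix` package (a different route from the `as(x, "sparseMatrix")` call above) on a smaller example; the `.demo` names are mine.

```{r}
library(Matrix)   # ships with standard R installations

set.seed(1)
f.demo <- factor(sample(letters, 1e5, replace=TRUE))

dense.demo <- model.matrix(~ f.demo - 1)           # stores every zero
sparse.demo <- sparse.model.matrix(~ f.demo - 1)   # stores non-zeros only

# Dense storage: rows * columns * 8 bytes for the numbers alone
1e5 * 26 * 8
c(dense=as.numeric(object.size(dense.demo)),
  sparse=as.numeric(object.size(sparse.demo)))
```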


## Streaming classification from RAM
`LiblineaR` and `RSofia` will stream your data from RAM for classification problems, mainly SVM.


## Out of RAM
What if it is not the *algorithm* that causes the problem, but merely importing my objects?

### ff
The `ff` package replaces R's in-RAM storage mechanism with on-disk (efficient) storage.
```{r}
library(LaF)

# Open connection to file:
.dat <- laf_open_csv(filename = "2010_BSA_Carrier_PUF.csv",
                    column_types = c("integer", "integer", "categorical", "categorical", "categorical", "integer", "integer", "categorical", "integer", "integer", "integer"), 
                    column_names = c("sex", "age", "diagnose", "healthcare.procedure", "typeofservice", "service.count", "provider.type", "servicesprocessed", "place.served", "payment", "carrierline.count"), 
                    skip = 1)

# Write data as ff object
library(ffbase)
data.ffdf <- laf_to_ffdf(laf = .dat)

object_size(data)
object_size(data.ffdf)
```


Caution: `base` functions are unaware of `ff`.
Adapted algorithms are required...
```{r}
data$age %>% table
data.ffdf$age %>% table.ff
```


Luckily, `bigglm` has its `ff` version:
```{r biglm_regression}
mymodel.ffdf.2 <- bigglm.ffdf(payment ~ sex + age + place.served, 
                              data = data.ffdf, 
                              family = poisson(), 
                              maxit=1e3)

# Again, too slow. Stop and run:
mymodel.ffdf.2 <- bigglm.ffdf(payment ~ sex + age + place.served, 
                              data = data.ffdf)
```
The previous call can scale to any file I can store on disk (but might take a while).


I will now inflate the data to a size that would not fit in RAM.
```{r}
copies <- 2e1
data.2.ffdf <- do.call(rbind, lapply(1:copies, function(x) data.ffdf) )

# Actual size:
(sum(.rambytes[vmode(data.2.ffdf)]) * (nrow(data.2.ffdf) * 9.31322575 * 10^(-10))) %>%
  round(4)  %>%
  cat('Size in GB ',.)
# In memory:
object_size(data.2.ffdf)
```


And now I can run this MASSIVE regression:
```{r biglm_ffdf_regression}
## Do not run:

#  mymodel.ffdf.2 <- bigglm.ffdf(payment ~ sex + age + place.served,                            
#                                data = data.2.ffdf, 
#                                family = poisson(), 
#                                maxit=1e3)
```
Notes:
- Notice again the quick release. 

- Solving RAM constraints does not guarantee speed. 
This particular problem is worth parallelizing.

- SAS, SPSS, Revolution R,... all rely on similar ideas. 
Currently, their "closed" versions are typically faster than the open `ff`. 
Keep an eye on those benchmark reports.

- Clearly, with so few variables I would be better off *subsampling*.

### Out of RAM Classification
I do not know if there are `ff` versions of `LiblineaR` or `RSofia`.
If you find out, let me know.