Wine QualityReds.Rmd

---
output:
  html_document: default
  pdf_document: default
---
```{r}
#install.packages("knitr")
#hide code as a default setting globaly
knitr::opts_chunk$set(echo = FALSE, include=TRUE, results="hide")
knitr::opts_chunk$set(fig.width=9,fig.height=5,fig.path='Figs/',
                      fig.align='center',tidy=TRUE,
                      echo=FALSE,warning=FALSE,message=FALSE)
```

# Explore and Summarize Data by Oloeriu Lacramioara

In this project, we will explore a data set on red wines quality. Our main objective is to explore the chemical properties influences the quality of red wines. This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine


```{r}
#install.packages("ggplot2")
#install.packages("gridBase")
#install.packages("gridExtra")
#install.packages("GGally")
#install.packages("dplyr")
#install.packages("tidyr")

library(stats)
library(dplyr)
library(ggplot2)
library(grid)
library(gridExtra)
library(GGally)
library(tidyr)


```
# Load the Data

```{r}
#getwd()
wine <- read.csv("wineQualityReds.csv",row.names=NULL)
wine$X <- NULL
```

#Univariate Plots Section

There are 1,599 observation with total 13 variables.
X is an unique identifier.
All the variables in the dataset have numeric values except X and quality which are integer.
```{r}
str(wine)

wine$quality.level <- ifelse(wine$quality < 5, "low", 
                             ifelse(wine$quality < 7, "average", "high"))
wine$quality.level <- factor(wine$quality.level, 
                             levels=c("high", "average", "low"), ordered=TRUE)
attach(wine)
```

Our dataset consists of 12 variables, with 1599 observations. Quality variable is discrete and the others are continuous.

we need to analyze:

fixed.acidity

volatile.acidity

sulfur.dioxide
sulphated and alcohol 

density and pH

residual.sugar and chlorides 

citric.acid 

```{r}
qplot(factor(quality), data=wine, geom="bar", xlab="Quality") + theme_bw()
summary(wine$quality)
```
Red wine quality is normally distributed and concentrated around 5 and 6.

```{r}
uni_qplot <- function(x, dat=NULL, xlab, binwidth=0.01) {
  if(missing(dat)) {
    qplot(x, data=wine, xlab=xlab, binwidth=binwidth) + theme_bw()
  }
  else {
    qplot(x, data=dat, xlab=xlab, binwidth=binwidth) + theme_bw()
  }
}
uni_qplot(x=fixed.acidity, xlab="Fixed acidity (g/dm^3)", binwidth=0.1)

```

The distribution of fixed acidity is right skewed, and concentrated around 7.9.
The graph shows a maximum value of 7.9, being distributed somewhere between 3 and 12

```{r}
uni_qplot(x=volatile.acidity, xlab="Volatile acidity (g/dm^3)")
summary(wine$volatile.acidity)

```
The distribution of volatile acidity seem to be unclear whether it is bimodal or unimodel, right skewed or normal.
```{r}
uni_qplot(citric.acid, xlab="Citric acid (g/dm^3)")
summary(wine$citric.acid)
```
The distribution of citric acid is not normal.The value is concentrated somewhere between 0 and 0.75. In the normal node, the value must be greater than zero to fit into normal.

```{r}
uni_qplot(residual.sugar, xlab="Residual sugar (g/dm^3)", binwidth=0.5)
summary(wine$residual.sugar)

```

The distribution of residual sugar is right skewed, and concentrated around 2. There are a few outliers in the plot.

```{r}
uni_qplot(x=chlorides, xlab="Chlorides (g/dm^3)")
summary(wine$chlorides)
```

The distribution of chlorides is normal, and concentrated around 0.08. The plot has some outliers.

```{r}
uni_qplot(free.sulfur.dioxide, xlab="Free sulfur dioxide (mg/dm^3)", binwidth=0.5)
summary(wine$free.sulfur.dioxide)

```

The distribution of free sulfur dioxide is right skewed and concentrated around 14.The chart has a net downward value with small increases.

```{r}
uni_qplot(total.sulfur.dioxide, xlab="Total sulfur dioxide (mg/dm^3)", binwidth=5)
summary(wine$total.sulfur.dioxide)

```

The distribution of total sulfur dioxide is right skewed and concentrated around 38. There are a few outliers in the plot.


```{r}
uni_qplot(density, xlab="Density (g/cm^3)", binwidth=0.001)
summary(wine$density)

```

The distribution of density is normal and concentrated around 0.9967.We notice it spread with a slight asymmetry to the left.

```{r}
uni_qplot(wine$pH, xlab="pH")
summary(wine$pH)
```

The distribution of pH is normal and concentrated around 3.310.the graph is symmetrically distributed between 0 and 4 with a maximum value somewhere at 3.310.

```{r}
uni_qplot(sulphates, xlab="Sulphates (g/dm^3)")
summary(wine$sulphates)

```

The distribution of sulphates is right skewed and concentrated around 0.6581. The plot has some outliers.
The graph looks symmetrically distributed a little to the right with certain asymmetric values after the value 1.

```{r}
uni_qplot(alcohol, xlab="Alcohol (%)", binwidth=0.4)
summary(wine$alcohol)
```

The distribution of alcohol is right skewed and concentrated around 10.20.the asymmetric histogram distributed to the right.

```{r}
quality78 <- subset(wine, quality == 8 | quality == 7)
quality34 <- subset(wine, quality == 3 | quality == 4)
volatile78 <- uni_qplot(quality78$volatile.acidity, dat=quality78, 
                        xlab="Volatile acidity (g/dm^3), quality 7 & 8", 
                        binwidth=0.1)
volatile34 <- uni_qplot(quality34$volatile.acidity, dat=quality34, 
                        xlab="Volatile acidity (g/dm^3), quality 3 & 4", 
                        binwidth=0.1)

density78 <- uni_qplot(quality78$density, dat=quality78, 
                       xlab="Density (g/cm^3), quality 7 & 8", binwidth=0.001)
density34 <- uni_qplot(quality34$density, dat=quality34, 
                       xlab="Density (g/cm^3), quality 3 & 4", binwidth=0.001)

citric78 <- uni_qplot(quality78$citric.acid, dat=quality78, 
                      xlab="Citric acid (g/dm^3), quality 7 & 8")
citric34 <- uni_qplot(quality34$citric.acid, dat=quality34, 
                      xlab="Citric acid (g/dm^3), quality 3 & 4")

alcohol78 <- uni_qplot(quality78$alcohol, dat=quality78, 
                       xlab="Alcohol (%), quality 7 & 8", binwidth=0.1)
alcohol34 <- uni_qplot(quality34$alcohol, dat=quality34, 
                       xlab="Alcohol (%), quality 3 & 4", binwidth=0.1)

grid.arrange(volatile34, volatile78, density34, density78, 
             citric34, citric78, alcohol34, alcohol78, ncol=2)

```

#Univariate Analysis

What is the structure of your dataset?
There are 1,599 red wines in the dataset with 11 features on the chemical properties of the wine. ( fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, and quality).

#Other observations:

The median quality is 6. Most wines have a pH of 3.4 or higher. About 75% of wine have quality that is lower than 6. The median percent alcohol content is 10.20 and the max percent alcohol content is 14.90.

What is/are the main feature(s) of interest in your dataset?
The main features in the data set are pH and quality. I'd like to determine which features are best for predicting the quality of a wine. I suspect pH and some combination of the other variables can be used to build a predictive model to grade the quality of wines.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
Volatile acidity, citric acid, and alcohol likely contribute to the quality of a wine. I think volatile acidity (the amount of acetic acid in wine) and alcohol (the percent alcohol content of the wine) probably contribute most to the quality after researching information on wine quality.

Did you create any new variables from existing variables in the dataset?
I created a new variable called "quality.level" which is categorically divided into "low", "average", and "high". This grouping method will help us detect the difference among each group more easily.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
Having visualized acitric acid and volatile acidity data, I observed some unusual distributions, so I guess this fact may have some correlation with the quality of red wine. Since the data is clean, I did not perform any cleaning process or modification of the data.

#Bivariate Plots Section

```{r}
bi_qplot <- function(x, y, z="jitter") {
  if(z=="boxplot") {
    qplot(x=x, y=y, data=wine, geom="jitter", alpha=0.01) + 
      geom_boxplot() +
      guides(alpha="none") +
      theme_bw()
  }
  else {
    qplot(x=x, y=y, data=wine, geom=z, alpha=0.01) + 
      guides(alpha="none") +
      theme_bw()
  }
}

bi_qplot(quality.level, volatile.acidity, "boxplot") +
  xlab("Quality level") +
  ylab("Volatile acidity (g/dm^3)")

```

The graph shows a very clear trend; the lower volatile acidity is, the higher the quality becomes. The correlation coefficient between quality and volatile acidity is -0.39. This can be explained by the fact that volatile acidity at too high of levels can lead to an unpleasant, vinegar taste.

```{r}
bi_qplot(quality.level, citric.acid, "boxplot") +
  xlab("Quality level") +
  ylab("Citric acid")
grp <- group_by(wine, quality.level)
cnts <- summarize(grp, count=n(), 
                  median=median(citric.acid), 
                  mean=mean(citric.acid), 
                  variance=var(citric.acid), 
                  Q1=quantile(citric.acid, 0.25), 
                  Q3=quantile(citric.acid, 0.75))
print(cnts)

```

The correlation coefficient is 0.226; the graph shows a weak positive relationship between quality level and citric acid.


```{r}
bi_qplot(quality.level, alcohol) +
  xlab("Quality level") +
  ylab("Alcohol")
cor.test(wine$quality, wine$alcohol)
```

The correlation coefficient is 0.476m. The graph shows a positive relationship between alcohol and quality. The medium quality and the low quality wine have an alcohol content somewhere around 10, on the other hand the top quality wine contains an alcohol percentage of 12.

```{r}
bi_qplot(alcohol, volatile.acidity) +
  xlab("Alcohol (%)") +
  ylab("Volatile acidity (g/dm^3)")
```

A weak negative correlation of -0.2 exists between percent alcohol content and volatile acidity.

```{r}
bi_qplot(residual.sugar, alcohol) +
  xlab("Residual sugar (g/dm^3)") +
  ylab("Alcohol (%)")

```

There is almost no relationship between the residual sugar and the alcohol content. There is a relationship between the wine made from baked fruits and ribs. To keep the wine sweeter, fermentation should be prolonged until more sugar is consumed and the effect leads to wine with a higher alcohol concentration.


```{r}
bi_qplot(citric.acid, volatile.acidity) +
  xlab("Citric acid (g/dm^3)") +
  ylab("Volatile acidity (g/dm^3)")
cor.test(wine$citric.acid, wine$volatile.acidity)

```

There is a negative correlation between citric acid and volatile acidity.

```{r}
bi_qplot(alcohol, density) + 
  xlab("Alcohol (%)") + 
  ylab("Density (g/cm^3)")
```

The correlation coefficient is -0.5, so the relationship is quite clear. As percent alcohol content increases, the density decreases. The reason is simple: the density of wine is lower than the density of pure water.

```{r}
addFeatures <- wine[,!colnames(wine) %in% c("volatile.acidity", 
                                            "quality", "quality.level")]
ggpairs(addFeatures, 
        columnLabels=c("f.aci", "ci.aci", "res.sug", "chlo", "fr.su.dio", 
                       "to.su.dio", "dens", "pH", "sulph", "alco"), 
        lower = list(continuous = wrap("points", size=1, shape = I('.'))),
        upper = list(combo = wrap("box", outlier.shape = I('.')))) + 
  theme(axis.ticks=element_blank(),
        axis.line=element_blank(), 
        axis.text=element_blank(), 
        panel.grid.major= element_blank())

```

This graph shows positive relationship between density and fixed acidity, positive relationship between fixed acidity and citric acid, negative relationship between pH and acidity.

#Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
I observed a negative relationships between quality level and volatile acidity, and positive correlation between quality level and alcohol. I am not suprised at this result, because men tend to grade stronger wines as high quality, whereas wines with low percent alcohol are often not graded as such. High volatile acidity is also perceived to be undesirable because it impacts the taste of wines. Alcohol and volatile acidity don't have any clear relationship between each other.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
Yes, I observed positive relationship between density and fixed acidity, positive relationship between fixed acidity and citric acid, and negative relationship between pH and fixed acidity. Other variables either show very weak relationship or do not show any relationship.

What was the strongest relationship you found?
Quality is positively and strongly correlated with alcohol, and it is also negatively correlated with volatile acidity. Alcohol and volatile acidity could be used in a model to predict the quality of wine.

#Multivariate Plots Section

```{r}
multi_qplot <- function(x, y, z, alpha=0.4) {
  qplot(x, y, data=wine, color=z) +
    geom_point(position=position_jitter(w = 0.025, h = 0.025), alpha=alpha) +
    guides(alpha=FALSE)+
  scale_color_brewer(type = "div", palette = "RdYlBu", name="Quality", direction=-1)
}
multi_qplot(density, volatile.acidity, quality.level) +
  xlab("Density (g/cm^3)") +
  ylab("Volatile acidity (g/cm^3)") +
  labs(color="Quality level")
```

The densities of high quality wines are concentrated between 0.994 and 0.998, and the lower part of volatile acidity (y axis)

```{r}
multi_qplot(volatile.acidity, alcohol, quality.level)  +
  xlab("Volatile acidity (g/dm^3)") +
  ylab("Alcohol (%)") + 
  labs(color="Quality level", size="Citric acid")
print("Percent alcohol contents by quality level:")
wine %>% 
  group_by(quality.level) %>% 
  summarize(mean=mean(alcohol),sd=sd(alcohol))
print("Volatile acidities by quality level:")
wine %>% 
  group_by(quality.level) %>% 
  summarize(mean=mean(volatile.acidity),sd=sd(volatile.acidity))

```


High quality feature seems to be associated with alcohol ranging from 11 to 13, volatile acidity from 0.2 to 0.5, and citric acid from 0.25 to 0.75

```{r}
multi_qplot(fixed.acidity, volatile.acidity, quality.level) + 
  aes(size=pH) +
  xlab("Fixed acidity (g/dm^3)") +
  ylab("Volatile acidity (g/dm^3)") +
  labs(color="Quality level")

multi_qplot(residual.sugar, alcohol, quality.level) + 
  xlab("Residual sugar (g/dm^3)") +
  ylab("Alcohol (%)") +
  labs(color="Quality level")

multi_qplot(fixed.acidity, alcohol, quality.level) + 
  aes(size=citric.acid) +
  xlab("Fixed acidity (g/dm^3)") +
  ylab("Alcohol (%)") + 
  labs(color="Quality level", size="Citric acid")

den_qplot <- function(x, color, xlab) {
  ggplot(data=wine, aes(x, colour=color)) + 
    geom_density() + 
    xlab(xlab) + 
    labs(colour="Quality level") +
    theme_bw()
}
den_qplot(fixed.acidity, quality.level, "Fixed acidity (g/dm^3)")

```

The distribution of low and average quality wines seem to be concentrated at fixed acidity values that are between 6 and 10. pH increases as fixed acidity decreases, and citric acid increases as fixed acidity increases.

```{r}
alcoQuaDensity <- den_qplot(alcohol, quality.level, "Alcohol (%)")
print(alcoQuaDensity)

alcohol_lm <- lm(data=wine, quality~alcohol)
summary(alcohol_lm)


```

High quality wine density line is distinct from the others, and mostly distributed between 11 and 12.

```{r}
volaQuaDensity <- den_qplot(volatile.acidity, quality.level, 
                            "Volatile acidity (g/dm^3)")
print(volaQuaDensity)

volacid_lm <- lm(data=wine, quality~volatile.acidity)
summary(volacid_lm)

```

This chart shows a very clear trend; as volatile acidity decreases, the quality of wine increases. Wines with volatile acidity exceeding 1 are almost rated as low quality. The linear model of volatile acidity has an R-squared of 0.152 which means this feature alone does not explain much of the variability of red wine quality.

```{r}
feaInterest_lm <- lm(data=wine, quality~volatile.acidity + alcohol)
summary(feaInterest_lm)

```

R-squared increases by two times after adding alcohol to the linear model.

#Multivariate Analysis

Quality has a weak positive relationship with alcohol, and weak negative relationship with volatile acid. The R squared values are low but p-values are significant; this result indicates that the regression models have significant variable but explains little of the variability. The quality of wine does not solely depends on volatile acidity and alcohol but also other features. Therefore, it is hard to build a predictive model that can accurately predict the quality of wine.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
When looking at wine quality level, we see a positive relationship between fixed acidity and citric acid

Were there any interesting or surprising interactions between features?
Residual sugar, supposed to play an important part in wine taste, actually has very little impact on wine quality.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.
Yes, I created 3 models. Their p-values are significant; however, the R squared values are under 0.4, so they do not provide us with enough explanation about the variability of the response data around their means.

#Final Plots and Summary

#Plot One

```{r}
ggplot(data=wine, aes(factor(quality), fill=quality.level)) + 
  geom_bar() + 
  xlab("Quality") + 
  ylab("Number of wines")
```

#Description One

The distribution of red wine quality appears to be normal. 82.5% of wines are rated 5 and 6 (average quality). Although the rating scale is between 0 and 10, there exists no wine that is rated 1, 2, 9 or 10.

#Plot Two

```{r}
bi_qplot(quality.level, citric.acid, "boxplot") +
  xlab("Quality level") +
  ylab("Citric acid (g/dm^3)")

```

#Description Two

While citric citric do not have a strong correlation with quality, it is an important component in the quality of wine. Because citric acid is an organic acid that contributes to the total acidity of a wine, it is crucial to have a righ amount of citric acid in wine. Adding citric acid will give the wine "freshness" otherwise not present and will effectively make a wine more acidic. Wines with citric acid exceeding 0.75 are hardly rated as high quality. 50% of high quality wines have a relatively high citric acid that ranges between 0.3 and 0.49, whereas average and low quality wines have lower amount of citric acid.

#Plot Three

```{r}
vol.alco <- multi_qplot(volatile.acidity, alcohol, quality.level) +
geom_point(size=4, shape=2, colour="steelblue", alpha=0.002) +
xlab(expression(Volatile~Acidity~(g/dm^{3}))) +
ylab("Alcohol (% by Volume)") +
labs(color="Quality level") +
theme_bw()

# Move to a new page
grid.newpage()
# Create layout : nrow = 2, ncol = 2
pushViewport(viewport(layout = grid.layout(2, 2)))
# A helper function to define a region on the layout
define_region <- function(row, col){
  viewport(layout.pos.row = row, layout.pos.col = col)
} 
# Arrange the plots
print(vol.alco, vp=define_region(1, 1:2))
print(volaQuaDensity, vp = define_region(2, 1))
print(alcoQuaDensity, vp = define_region(2, 2))

```

#Description Three

As we can see, when volatile acidity is greater than 1, the probability of the wine being excellent is zero. When volatile acidity is either 0 or 0.3, there is roughly a 40% probability that the wine is excellent. However, when volatile acidity is between 1 and 1.2 there is an 80% chance that the wine is bad. Moreover, any wine with a volatile acidity greater than 1.4 has a 100% chance of being bad. Therefore, volatile acidity is a good predictor for bad wines.

#Reflection

The wines data set contains information on 1599 wines across twelve variables from around 2009. I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the quality of wines across many variables and tried creating 3 linear models to predict red wine quality.

There was a trend between the volatile acidity of a wine and its quality. There was also a trend between the alcohol and its quality. For the linear model, all wines were included since information on quality, volatile acidity and alcohol were available for all the wines. The third linear model with 2 variables were able to account for 31.6% of the variance in the dataset.

There are very few wines that are rated as low or high quality. We could improve the quality of our analysis by collecting more data, and creating more variables that may contribute to the quality of wine. This will certainly improve the accuracy of the prediction models. Having said that, we have successfully identified features that impact the quality of red wine, visualized their relationships and summarized their statistics.