Wine QualityReds.Rmd
---
output:
  html_document: default
  pdf_document: default
---
```{r}
#install.packages("knitr")
# Hide code by default (global chunk settings)
knitr::opts_chunk$set(echo = FALSE, include=TRUE, results="hide")
knitr::opts_chunk$set(fig.width=9,fig.height=5,fig.path='Figs/',
fig.align='center',tidy=TRUE,
echo=FALSE,warning=FALSE,message=FALSE)
```
# Explore and Summarize Data by Oloeriu Lacramioara
In this project, we explore a data set on red wine quality. Our main objective is to explore how chemical properties influence the quality of red wines. This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine.
```{r}
#install.packages("ggplot2")
#install.packages("gridBase")
#install.packages("gridExtra")
#install.packages("GGally")
#install.packages("dplyr")
#install.packages("tidyr")
library(stats)
library(dplyr)
library(ggplot2)
library(grid)
library(gridExtra)
library(GGally)
library(tidyr)
```
# Load the Data
```{r}
#getwd()
wine <- read.csv("wineQualityReds.csv",row.names=NULL)
wine$X <- NULL
```
# Univariate Plots Section
The raw file contains 1,599 observations of 13 variables.
X is a unique identifier for each row, so it was dropped when the data was loaded.
All the variables in the dataset have numeric values except X and quality, which are integers.
```{r}
str(wine)
wine$quality.level <- ifelse(wine$quality < 5, "low",
ifelse(wine$quality < 7, "average", "high"))
wine$quality.level <- factor(wine$quality.level,
levels=c("high", "average", "low"), ordered=TRUE)
attach(wine)
```
Excluding X and the derived quality.level, our dataset consists of 12 variables with 1,599 observations. The quality variable is discrete and the others are continuous.
We need to analyze:
fixed.acidity
volatile.acidity
free.sulfur.dioxide and total.sulfur.dioxide
sulphates and alcohol
density and pH
residual.sugar and chlorides
citric.acid
```{r}
qplot(factor(quality), data=wine, geom="bar", xlab="Quality") + theme_bw()
summary(wine$quality)
```
Red wine quality is roughly bell-shaped, with most wines rated 5 or 6.
```{r}
uni_qplot <- function(x, dat=NULL, xlab, binwidth=0.01) {
if(missing(dat)) {
qplot(x, data=wine, xlab=xlab, binwidth=binwidth) + theme_bw()
}
else {
qplot(x, data=dat, xlab=xlab, binwidth=binwidth) + theme_bw()
}
}
uni_qplot(x=fixed.acidity, xlab="Fixed acidity (g/dm^3)", binwidth=0.1)
```
The distribution of fixed acidity is right skewed and concentrated around 7.9.
The histogram peaks near 7.9, with most values lying roughly between 3 and 12.
```{r}
uni_qplot(x=volatile.acidity, xlab="Volatile acidity (g/dm^3)")
summary(wine$volatile.acidity)
```
It is unclear from the histogram whether the distribution of volatile acidity is bimodal or unimodal, right skewed or normal.
```{r}
uni_qplot(citric.acid, xlab="Citric acid (g/dm^3)")
summary(wine$citric.acid)
```
The distribution of citric acid is not normal. The values are concentrated between 0 and 0.75. The pile-up of values at or near zero is one reason the histogram does not fit a normal shape.
```{r}
uni_qplot(residual.sugar, xlab="Residual sugar (g/dm^3)", binwidth=0.5)
summary(wine$residual.sugar)
```
The distribution of residual sugar is right skewed, and concentrated around 2. There are a few outliers in the plot.
```{r}
uni_qplot(x=chlorides, xlab="Chlorides (g/dm^3)")
summary(wine$chlorides)
```
The distribution of chlorides is normal, and concentrated around 0.08. The plot has some outliers.
```{r}
uni_qplot(free.sulfur.dioxide, xlab="Free sulfur dioxide (mg/dm^3)", binwidth=0.5)
summary(wine$free.sulfur.dioxide)
```
The distribution of free sulfur dioxide is right skewed and concentrated around 14. Counts generally decline as the value increases, with a few small local peaks.
```{r}
uni_qplot(total.sulfur.dioxide, xlab="Total sulfur dioxide (mg/dm^3)", binwidth=5)
summary(wine$total.sulfur.dioxide)
```
The distribution of total sulfur dioxide is right skewed and concentrated around 38. There are a few outliers in the plot.
```{r}
uni_qplot(density, xlab="Density (g/cm^3)", binwidth=0.001)
summary(wine$density)
```
The distribution of density is normal and concentrated around 0.9967. The spread shows a slight asymmetry to the left.
```{r}
uni_qplot(wine$pH, xlab="pH")
summary(wine$pH)
```
The distribution of pH is normal and concentrated around 3.310. The histogram is roughly symmetric, peaking near 3.310, with nearly all values below 4.
```{r}
uni_qplot(sulphates, xlab="Sulphates (g/dm^3)")
summary(wine$sulphates)
```
The distribution of sulphates is right skewed and concentrated around 0.6581. The plot has some outliers.
The bulk of the histogram looks roughly symmetric, with a thin tail of values above 1.
```{r}
uni_qplot(alcohol, xlab="Alcohol (%)", binwidth=0.4)
summary(wine$alcohol)
```
The distribution of alcohol is right skewed and concentrated around 10.20; the tail of the histogram extends to the right.
```{r}
quality78 <- subset(wine, quality == 8 | quality == 7)
quality34 <- subset(wine, quality == 3 | quality == 4)
volatile78 <- uni_qplot(quality78$volatile.acidity, dat=quality78,
xlab="Volatile acidity (g/dm^3), quality 7 & 8",
binwidth=0.1)
volatile34 <- uni_qplot(quality34$volatile.acidity, dat=quality34,
xlab="Volatile acidity (g/dm^3), quality 3 & 4",
binwidth=0.1)
density78 <- uni_qplot(quality78$density, dat=quality78,
xlab="Density (g/cm^3), quality 7 & 8", binwidth=0.001)
density34 <- uni_qplot(quality34$density, dat=quality34,
xlab="Density (g/cm^3), quality 3 & 4", binwidth=0.001)
citric78 <- uni_qplot(quality78$citric.acid, dat=quality78,
xlab="Citric acid (g/dm^3), quality 7 & 8")
citric34 <- uni_qplot(quality34$citric.acid, dat=quality34,
xlab="Citric acid (g/dm^3), quality 3 & 4")
alcohol78 <- uni_qplot(quality78$alcohol, dat=quality78,
xlab="Alcohol (%), quality 7 & 8", binwidth=0.1)
alcohol34 <- uni_qplot(quality34$alcohol, dat=quality34,
xlab="Alcohol (%), quality 3 & 4", binwidth=0.1)
grid.arrange(volatile34, volatile78, density34, density78,
citric34, citric78, alcohol34, alcohol78, ncol=2)
```
# Univariate Analysis
What is the structure of your dataset?
There are 1,599 red wines in the dataset, with 11 features on the chemical properties of the wine (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, and alcohol) plus a quality rating.
# Other observations:
The median quality is 6. Most wines have a pH of 3.4 or lower. About 75% of wines have a quality of 6 or lower. The median percent alcohol content is 10.20 and the maximum is 14.90.
What is/are the main feature(s) of interest in your dataset?
The main features in the data set are pH and quality. I'd like to determine which features are best for predicting the quality of a wine. I suspect pH and some combination of the other variables can be used to build a predictive model to grade the quality of wines.
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
Volatile acidity, citric acid, and alcohol likely contribute to the quality of a wine. I think volatile acidity (the amount of acetic acid in wine) and alcohol (the percent alcohol content of the wine) probably contribute most to the quality after researching information on wine quality.
Did you create any new variables from existing variables in the dataset?
I created a new variable called "quality.level" which is categorically divided into "low", "average", and "high". This grouping method will help us detect the difference among each group more easily.
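The same grouping can also be written with `cut()`. The sketch below is one possible alternative, not the code used in this report. `toy_quality` and `toy_level` are hypothetical names, and the resulting factor is ordered from low to high, unlike the high-to-low ordering used above.

```r
# Hypothetical quality scores, one per rating that occurs in the data (3 to 8).
toy_quality <- c(3, 4, 5, 6, 7, 8)

# Bin the scores: up to 4 is "low", 5 to 6 is "average", 7 and above is "high".
# cut() uses right-closed intervals by default: (-Inf, 4], (4, 6], (6, Inf].
toy_level <- cut(toy_quality,
                 breaks = c(-Inf, 4, 6, Inf),
                 labels = c("low", "average", "high"),
                 ordered_result = TRUE)

# Count how many scores fall into each level.
print(table(toy_level))
```

Keeping the break points in a single vector makes it easier to see, and later change, where one level ends and the next begins.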
Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
Having visualized the citric acid and volatile acidity data, I observed some unusual distributions, so I suspect these features may be related to the quality of red wine. Since the data is clean, I did not perform any cleaning or modification of the data.
# Bivariate Plots Section
```{r}
bi_qplot <- function(x, y, z="jitter") {
if(z=="boxplot") {
qplot(x=x, y=y, data=wine, geom="jitter", alpha=0.01) +
geom_boxplot() +
guides(alpha="none") +
theme_bw()
}
else {
qplot(x=x, y=y, data=wine, geom=z, alpha=0.01) +
guides(alpha="none") +
theme_bw()
}
}
bi_qplot(quality.level, volatile.acidity, "boxplot") +
xlab("Quality level") +
ylab("Volatile acidity (g/dm^3)")
```
The graph shows a very clear trend: the lower the volatile acidity, the higher the quality. The correlation coefficient between quality and volatile acidity is -0.39. This can be explained by the fact that too much volatile acidity can lead to an unpleasant, vinegar-like taste.
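A correlation coefficient like the one quoted above can be computed with `cor.test()`. The sketch below is only an illustration under stated assumptions: `quality_cor()` and `toy_cor_data` are hypothetical names, and the six rows are made-up values, not taken from the wine data.

```r
# Hypothetical helper: Pearson correlation between one feature and quality.
quality_cor <- function(data, feature, target = "quality") {
  test <- cor.test(data[[feature]], data[[target]], method = "pearson")
  c(r = unname(test$estimate), p.value = test$p.value)
}

# Made-up rows in which quality rises as volatile acidity falls.
toy_cor_data <- data.frame(
  volatile.acidity = c(0.90, 0.70, 0.60, 0.50, 0.40, 0.30),
  quality          = c(3, 4, 5, 6, 7, 8)
)

toy_cor_result <- quality_cor(toy_cor_data, "volatile.acidity")
print(round(toy_cor_result, 3))
```

On the real data the call would presumably be `quality_cor(wine, "volatile.acidity")`, assuming `wine` is loaded as in the chunks above.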
```{r}
bi_qplot(quality.level, citric.acid, "boxplot") +
xlab("Quality level") +
ylab("Citric acid")
grp <- group_by(wine, quality.level)
cnts <- summarize(grp, count=n(),
median=median(citric.acid),
mean=mean(citric.acid),
variance=var(citric.acid),
Q1=quantile(citric.acid, 0.25),
Q3=quantile(citric.acid, 0.75))
print(cnts)
```
The correlation coefficient is 0.226; the graph shows a weak positive relationship between quality level and citric acid.
```{r}
bi_qplot(quality.level, alcohol) +
xlab("Quality level") +
ylab("Alcohol")
cor.test(wine$quality, wine$alcohol)
```
The correlation coefficient is 0.476. The graph shows a positive relationship between alcohol and quality. Average- and low-quality wines have an alcohol content of around 10%, whereas high-quality wines are closer to 12%.
```{r}
bi_qplot(alcohol, volatile.acidity) +
xlab("Alcohol (%)") +
ylab("Volatile acidity (g/dm^3)")
```
A weak negative correlation of -0.2 exists between percent alcohol content and volatile acidity.
```{r}
bi_qplot(residual.sugar, alcohol) +
xlab("Residual sugar (g/dm^3)") +
ylab("Alcohol (%)")
```
There is almost no relationship between residual sugar and alcohol content in this data. One might expect a link, because yeast converts sugar into alcohol during fermentation: the longer fermentation runs, the more sugar is consumed and the higher the alcohol concentration, while a sweeter wine results when fermentation stops earlier.
```{r}
bi_qplot(citric.acid, volatile.acidity) +
xlab("Citric acid (g/dm^3)") +
ylab("Volatile acidity (g/dm^3)")
cor.test(wine$citric.acid, wine$volatile.acidity)
```
There is a negative correlation between citric acid and volatile acidity.
```{r}
bi_qplot(alcohol, density) +
xlab("Alcohol (%)") +
ylab("Density (g/cm^3)")
```
The correlation coefficient is -0.5, so the relationship is quite clear: as percent alcohol content increases, the density decreases. The reason is simple: alcohol is less dense than water, so wines with more alcohol have a lower density.
```{r}
addFeatures <- wine[,!colnames(wine) %in% c("volatile.acidity",
"quality", "quality.level")]
ggpairs(addFeatures,
columnLabels=c("f.aci", "ci.aci", "res.sug", "chlo", "fr.su.dio",
"to.su.dio", "dens", "pH", "sulph", "alco"),
lower = list(continuous = wrap("points", size=1, shape = I('.'))),
upper = list(combo = wrap("box", outlier.shape = I('.')))) +
theme(axis.ticks=element_blank(),
axis.line=element_blank(),
axis.text=element_blank(),
panel.grid.major= element_blank())
```
This graph shows a positive relationship between density and fixed acidity, a positive relationship between fixed acidity and citric acid, and a negative relationship between pH and fixed acidity.
# Bivariate Analysis
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
I observed a negative relationship between quality level and volatile acidity, and a positive correlation between quality level and alcohol. I am not surprised by this result, because people tend to grade stronger wines as high quality, whereas wines with a low percent alcohol are often not graded as such. High volatile acidity is also perceived as undesirable because it affects the taste of wines. Alcohol and volatile acidity do not have any clear relationship with each other.
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
Yes, I observed a positive relationship between density and fixed acidity, a positive relationship between fixed acidity and citric acid, and a negative relationship between pH and fixed acidity. The other variables show either a very weak relationship or none at all.
What was the strongest relationship you found?
Quality is most strongly correlated with alcohol (0.476), and it is also negatively correlated with volatile acidity (-0.39). Alcohol and volatile acidity could be used in a model to predict the quality of wine.
# Multivariate Plots Section
```{r}
multi_qplot <- function(x, y, z, alpha=0.4) {
qplot(x, y, data=wine, color=z) +
geom_point(position=position_jitter(w = 0.025, h = 0.025), alpha=alpha) +
guides(alpha="none")+
scale_color_brewer(type = "div", palette = "RdYlBu", name="Quality", direction=-1)
}
multi_qplot(density, volatile.acidity, quality.level) +
xlab("Density (g/cm^3)") +
ylab("Volatile acidity (g/dm^3)") +
labs(color="Quality level")
```
High-quality wines are concentrated at densities between 0.994 and 0.998 and in the lower range of volatile acidity (y axis).
```{r}
multi_qplot(volatile.acidity, alcohol, quality.level) +
xlab("Volatile acidity (g/dm^3)") +
ylab("Alcohol (%)") +
labs(color="Quality level", size="Citric acid")
print("Percent alcohol contents by quality level:")
wine %>%
group_by(quality.level) %>%
summarize(mean=mean(alcohol),sd=sd(alcohol))
print("Volatile acidities by quality level:")
wine %>%
group_by(quality.level) %>%
summarize(mean=mean(volatile.acidity),sd=sd(volatile.acidity))
```
High quality seems to be associated with alcohol ranging from 11 to 13%, volatile acidity from 0.2 to 0.5, and citric acid from 0.25 to 0.75.
```{r}
multi_qplot(fixed.acidity, volatile.acidity, quality.level) +
aes(size=pH) +
xlab("Fixed acidity (g/dm^3)") +
ylab("Volatile acidity (g/dm^3)") +
labs(color="Quality level")
multi_qplot(residual.sugar, alcohol, quality.level) +
xlab("Residual sugar (g/dm^3)") +
ylab("Alcohol (%)") +
labs(color="Quality level")
multi_qplot(fixed.acidity, alcohol, quality.level) +
aes(size=citric.acid) +
xlab("Fixed acidity (g/dm^3)") +
ylab("Alcohol (%)") +
labs(color="Quality level", size="Citric acid")
den_qplot <- function(x, color, xlab) {
ggplot(data=wine, aes(x, colour=color)) +
geom_density() +
xlab(xlab) +
labs(colour="Quality level") +
theme_bw()
}
den_qplot(fixed.acidity, quality.level, "Fixed acidity (g/dm^3)")
```
The distributions of low- and average-quality wines seem to be concentrated at fixed acidity values between 6 and 10. pH increases as fixed acidity decreases, and citric acid increases as fixed acidity increases.
```{r}
alcoQuaDensity <- den_qplot(alcohol, quality.level, "Alcohol (%)")
print(alcoQuaDensity)
alcohol_lm <- lm(data=wine, quality~alcohol)
summary(alcohol_lm)
```
The density curve for high-quality wines is distinct from the others, with most of its mass between 11 and 12% alcohol.
```{r}
volaQuaDensity <- den_qplot(volatile.acidity, quality.level,
"Volatile acidity (g/dm^3)")
print(volaQuaDensity)
volacid_lm <- lm(data=wine, quality~volatile.acidity)
summary(volacid_lm)
```
This chart shows a very clear trend: as volatile acidity decreases, the quality of wine increases. Wines with volatile acidity exceeding 1 are almost always rated as low quality. The linear model of volatile acidity has an R-squared of 0.152, which means this feature alone explains little of the variability in red wine quality.
```{r}
feaInterest_lm <- lm(data=wine, quality~volatile.acidity + alcohol)
summary(feaInterest_lm)
```
R-squared roughly doubles, from 0.152 to 0.316, after adding alcohol to the linear model.
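The comparison between a one-predictor and a two-predictor model can be sketched as follows. This is a minimal illustration on simulated data, not a re-run of the models above: `toy_lm_data`, `toy_fit_one`, `toy_fit_two`, and every coefficient in the simulation are assumptions chosen only so that the example runs on its own.

```r
# Simulate a hypothetical data set with the same column names as the wine data.
set.seed(42)
toy_n <- 200
toy_lm_data <- data.frame(
  alcohol          = rnorm(toy_n, mean = 10.4, sd = 1),
  volatile.acidity = rnorm(toy_n, mean = 0.53, sd = 0.18)
)
toy_lm_data$quality <- 5.6 +
  0.35 * (toy_lm_data$alcohol - 10.4) -
  1.50 * (toy_lm_data$volatile.acidity - 0.53) +
  rnorm(toy_n, sd = 0.65)

# Fit the nested models: volatile acidity alone, then with alcohol added.
toy_fit_one <- lm(quality ~ volatile.acidity, data = toy_lm_data)
toy_fit_two <- lm(quality ~ volatile.acidity + alcohol, data = toy_lm_data)

# Collect both R-squared values side by side.
toy_r2 <- c(one_predictor  = summary(toy_fit_one)$r.squared,
            two_predictors = summary(toy_fit_two)$r.squared)
print(round(toy_r2, 3))

# F-test for whether the added predictor improves the fit.
print(anova(toy_fit_one, toy_fit_two))
```

R-squared can never decrease when a predictor is added to a least-squares model, so the `anova()` comparison (or adjusted R-squared) is a fairer check of whether the extra variable is worth keeping.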
# Multivariate Analysis
Quality has a weak positive relationship with alcohol and a weak negative relationship with volatile acidity. The R-squared values are low but the p-values are significant; this indicates that the regression models contain significant variables but explain little of the variability. The quality of wine does not depend solely on volatile acidity and alcohol but also on other features. Therefore, it is hard to build a model that accurately predicts the quality of wine.
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
When looking at wine quality level, we see a positive relationship between fixed acidity and citric acid.
Were there any interesting or surprising interactions between features?
Residual sugar, supposed to play an important part in wine taste, actually has very little impact on wine quality.
OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.
Yes, I created 3 models. Their p-values are significant; however, the R squared values are under 0.4, so they do not provide us with enough explanation about the variability of the response data around their means.
# Final Plots and Summary
# Plot One
```{r}
ggplot(data=wine, aes(factor(quality), fill=quality.level)) +
geom_bar() +
xlab("Quality") +
ylab("Number of wines")
```
# Description One
The distribution of red wine quality appears roughly normal. 82.5% of wines are rated 5 or 6 (average quality). Although the rating scale runs from 0 to 10, no wine is rated below 3 or above 8.
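A share such as the one quoted above can be obtained from a frequency table. The sketch below uses a short made-up vector, `toy_ratings` (a hypothetical name with hypothetical values), so the numbers it prints are not those of the wine data.

```r
# Ten made-up quality ratings.
toy_ratings <- c(5, 5, 6, 7, 4, 6, 5, 6, 3, 8)

# Relative frequency of each rating.
toy_rating_share <- prop.table(table(toy_ratings))
print(round(100 * toy_rating_share, 1))

# Combined share of the "average" ratings 5 and 6.
toy_share_average <- sum(toy_rating_share[c("5", "6")])
print(toy_share_average)
```

On the real data, `prop.table(table(wine$quality))` would presumably give the corresponding shares, assuming `wine` is loaded as above.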
# Plot Two
```{r}
bi_qplot(quality.level, citric.acid, "boxplot") +
xlab("Quality level") +
ylab("Citric acid (g/dm^3)")
```
# Description Two
While citric acid does not have a strong correlation with quality, it is an important component in the quality of wine. Because citric acid is an organic acid that contributes to the total acidity of a wine, it is crucial to have the right amount of it. Adding citric acid gives the wine a "freshness" it would otherwise lack and makes the wine more acidic. Wines with citric acid exceeding 0.75 are rarely rated as high quality. 50% of high-quality wines have a relatively high citric acid content, between 0.3 and 0.49, whereas average- and low-quality wines have lower amounts of citric acid.
# Plot Three
```{r}
vol.alco <- multi_qplot(volatile.acidity, alcohol, quality.level) +
geom_point(size=4, shape=2, colour="steelblue", alpha=0.002) +
xlab(expression(Volatile~Acidity~(g/dm^{3}))) +
ylab("Alcohol (% by Volume)") +
labs(color="Quality level") +
theme_bw()
# Move to a new page
grid.newpage()
# Create layout : nrow = 2, ncol = 2
pushViewport(viewport(layout = grid.layout(2, 2)))
# A helper function to define a region on the layout
define_region <- function(row, col){
viewport(layout.pos.row = row, layout.pos.col = col)
}
# Arrange the plots
print(vol.alco, vp=define_region(1, 1:2))
print(volaQuaDensity, vp = define_region(2, 1))
print(alcoQuaDensity, vp = define_region(2, 2))
```
# Description Three
As we can see, in this sample no wine with a volatile acidity greater than 1 is rated excellent. When volatile acidity is between 0 and 0.3, roughly 40% of the wines are excellent. However, when volatile acidity is between 1 and 1.2, about 80% of the wines are bad. Moreover, every wine with a volatile acidity greater than 1.4 is rated bad. Therefore, volatile acidity is a good predictor of bad wines.
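Shares of this kind can be read from a table of quality level by volatile acidity bin. The sketch below shows one way to build such a table. The data frame `toy_va`, the bin edges, and the bin labels are all assumptions made for illustration; the values are made up and do not reproduce the percentages quoted above.

```r
# Made-up wines with a volatile acidity value and a quality level.
toy_va <- data.frame(
  volatile.acidity = c(0.20, 0.25, 0.30, 0.50, 0.60, 0.70, 1.10, 1.20, 1.50),
  quality.level    = c("high", "high", "average", "average", "average",
                       "high", "low", "low", "low")
)

# Assign each wine to a volatile acidity bin (bin edges are illustrative).
toy_va$va_bin <- cut(toy_va$volatile.acidity,
                     breaks = c(0, 0.4, 1, Inf),
                     labels = c("<=0.4", "0.4-1", ">1"))

# Row-wise proportions: the share of each quality level within a bin.
toy_va_share <- prop.table(table(toy_va$va_bin, toy_va$quality.level),
                           margin = 1)
print(round(toy_va_share, 2))
```

Each row of the result sums to 1, so a cell can be read as the observed share of that quality level among wines in the bin. These are sample proportions, not probabilities estimated by a model.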
# Reflection
The wines data set contains information on 1,599 wines across twelve variables from around 2009. I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the quality of wines across many variables and created 3 linear models to predict red wine quality.
There was a trend between the volatile acidity of a wine and its quality. There was also a trend between alcohol and quality. For the linear models, all wines were included, since information on quality, volatile acidity and alcohol was available for all of them. The third linear model, with 2 variables, was able to account for 31.6% of the variance in quality.
There are very few wines that are rated as low or high quality. We could improve the quality of our analysis by collecting more data and by creating more variables that may contribute to the quality of wine. This would likely improve the accuracy of the prediction models. Having said that, we have identified features that affect the quality of red wine, visualized their relationships and summarized their statistics.