Analysis of Quality of Red Wines by Ajay Das

Introduction

The following study is a project submission that applies exploratory data analysis techniques using R. In this study I analyze a publicly available dataset of observations on red wines. The dataset has a set of objective attributes for each wine (e.g. pH) and a subjective quality rating based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). The objective of the study is to answer the question “Which chemical properties influence the quality of red wines?”

Univariate Plots Section

## [1] 1599   12

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

The dataset has 1599 observations of 12 variables. None of the variables is stored as categorical: the quality rating is imported as an integer, but it can be treated as categorical for analysis purposes.

The quality ratings of the red wines in the sample range from 3 to 8 with a median of 6, which is average on a scale from 0 (very bad) to 10 (very excellent).
Density varies little (75% of the wines fall between 0.9901 and 0.9978).
The median alcohol content is 10.2%.
None of the wines is sweet: 75% have residual sugar of 2.6 or less and the maximum is 15.5 (wines with greater than 45 grams/liter are considered sweet).

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

From the plot I can see that most wines (1319 out of 1599) have a rating of 5 or 6, which is average quality on a scale from 0 (very bad) to 10 (very excellent). The quality levels are therefore not equally represented in the dataset. This supports the intuition that most wines on the market are of average quality. Throughout this study I will try to find out which attributes make for a good quality wine.
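The counts quoted above can be rechecked with a little arithmetic. The following is a minimal sketch that rebuilds the ratings from the frequency table printed above instead of loading the dataset; the names `quality_counts` and `ratings` are mine, not the original analysis code.

```r
# Frequency table of quality ratings, copied from the output above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Rebuild the individual ratings so summary statistics can be recomputed.
ratings <- rep(as.integer(names(quality_counts)), times = quality_counts)

n_total   <- length(ratings)                      # number of wines
n_average <- sum(quality_counts[c("5", "6")])     # wines rated 5 or 6
share_avg <- round(100 * n_average / n_total, 1)  # percentage rated 5 or 6

cat("wines:", n_total, " rated 5 or 6:", n_average, " share (%):", share_avg, "\n")
cat("median:", median(ratings), " mean:", round(mean(ratings), 3), "\n")
```

The recomputed median and mean should agree with the `summary()` output for quality shown earlier.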

Fixed acidity (g / dm^3) is the presence of tartaric acid in the wines. Most acids involved with wine are fixed or nonvolatile (they do not evaporate readily). The distribution is roughly normal with a median of 7.9 and some outliers greater than 14.

Volatile acidity (g / dm^3) is the presence of acetic acid in the wines, which at too high a level can lead to an unpleasant vinegar taste. Volatile acidity has a roughly normal distribution with a median of 0.52. This can be a good candidate for evaluating wine quality.

## 
## FALSE  TRUE 
##  1467   132

Citric acid (g / dm^3) is found in small quantities and can add ‘freshness’ and flavor to wines. This may have some effect on quality. The distribution is flat, with 132 wines that don’t have any citric acid.
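A FALSE/TRUE count like the one above is the kind of output `table()` gives for a logical test. As a hedged sketch, assuming the test was `citric.acid == 0`, here it is applied to only the first ten citric acid values shown in the `str()` output, so the counts differ from the full dataset:

```r
# First ten citric.acid values, copied from the str() output above.
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# TRUE counts the wines with no citric acid at all.
zero_table <- table(citric_acid == 0)
print(zero_table)

# Share of zero-citric-acid wines in the full dataset, from the counts above.
full_share <- round(100 * 132 / 1599, 1)
cat("share without citric acid in the full dataset (%):", full_share, "\n")
```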

Residual sugar (g / dm^3) is the amount of sugar remaining after fermentation stops. It’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet. The dataset doesn’t have any sweet wines: residual sugar ranges from 0.9 to 15.5, with the majority under 3. After removing outliers the distribution looks normal with a median of 2.2.

Chlorides (sodium chloride - g / dm^3) is the amount of salt in the wine. 50% of the wines have chlorides between 0.07 and 0.09. Chlorides do not look likely to have any effect on wine quality, but I will do some correlation analysis later to find out if there is a relation. After removing outliers the distribution is normal with a median of 0.08.

Free sulfur dioxide (mg / dm^3) is the free form of SO2, which exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. This may help keep wines from degrading and so influence the quality rating. The distribution is skewed, with most values less than 30 and some outliers greater than 60.

Total sulfur dioxide (mg / dm^3) is the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine. The median total SO2 is 38. Because the 50 ppm threshold applies to free SO2, whose third quartile is only 21 and whose maximum is 72, SO2 is probably evident in the nose and taste of only a small share of the wines, where it may influence the perception of quality.

The density of wine (g / cm^3) is close to that of water (0.9982), depending on the percent alcohol and sugar content. It does not seem to be a strong candidate for predicting wine quality, as it depends on two other variables and has a narrow range.

pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale. In our sample dataset pH has a normal distribution, with most wines between 3 and 3.6 and a median of 3.31.

Sulphates (potassium sulphate - g / dm^3) is a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant. This may help keep wine fresh for longer. The distribution is close to normal with a few outliers greater than 1.5.

Alcohol measures the percentage of alcohol by volume in the wine. The distribution is right-skewed: the median is 10.2%, three quarters of the wines are at or below 11.1%, and the maximum is 14.9%. This can be a possible candidate for predicting wine quality.

Univariate Analysis

What is the structure of your dataset?

The dataset consists of 1599 observations of red wines, each characterized by 11 chemical property attributes (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol) and one qualitative attribute (quality). All of the variables are stored as numbers and none as categorical (although the quality rating, being qualitative, may be treated as categorical).

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is the quality rating of the red wines. This is a qualitative attribute based on sensory data from wine experts. Through the exploratory analysis I will try to find out which attributes contribute most to a good quality wine. In the sample dataset, the quality rating ranges from 3 to 8 with a median of 6. A table of the quality ratings shows that most wines are rated 5 or 6 (which is average on a scale from 0 to 10).

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Variables that can support my investigation are free sulfur dioxide, sulphates and citric acid, which may help keep wine fresh and improve its taste. I can also look at alcohol content, which can influence the taste of wines. Volatile acidity can help in the investigation as well, since increasing levels can degrade the taste of wines.

Did you create any new variables from existing variables in the dataset?

I created a categorical variable quality.factor which is quality variable converted to a factor. This may help in plots and analysis.
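A conversion like this takes a single line. The sketch below is illustrative only: it uses just the first ten quality ratings shown in the `str()` output, and making the factor ordered is my assumption, since the text does not say how the factor was built.

```r
# First ten quality ratings, copied from the str() output above.
reds <- data.frame(quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L))

# Convert the integer rating to a factor. ordered = TRUE is an assumption,
# chosen because the ratings have a natural order.
reds$quality.factor <- factor(reds$quality, ordered = TRUE)

str(reds)
print(levels(reds$quality.factor))
```

With a factor, plotting functions treat each rating as its own group, which is what makes box plots by quality straightforward.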

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

There were some outliers in the dataset, but none extreme enough to affect the study. They can easily be excluded from plots by limiting the axes. I did not find anything unusual enough to require a data transformation.

Bivariate Plots Section

The pairwise plot reveals potential relationships between some variables and the quality rating, which I will explore further by calculating correlations.

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000

Wine quality has a moderate negative correlation with volatile acidity (-0.39) and a moderate positive correlation with alcohol (0.48). I plotted the matrix using corrplot to visualize the correlations with a color code. The correlation plot shows the same correlations of quality with alcohol and volatile acidity. There are a few other correlations worth noting:

Alcohol is negatively correlated with density (-0.5).
pH is negatively correlated with citric acid (-0.54) and fixed acidity (-0.68). This is expected, as higher acidity leads to a lower pH value.
Density is positively correlated with fixed acidity (0.67).
Free sulfur dioxide is positively correlated with total sulfur dioxide (0.67), which is expected because total sulfur dioxide includes the free form.
Citric acid is negatively correlated with volatile acidity (-0.55) and positively correlated with fixed acidity (0.67).
There is a minor positive correlation of quality rating with sulphates (0.25), which act as an antimicrobial and antioxidant, and with citric acid (0.23), which adds freshness to wines. It looks like the freshness of wine makes some contribution to the quality rating.

I will investigate further the relationship of quality with alcohol and volatile acidity.
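To make the ranking explicit, the quality column of the correlation matrix above can be sorted by absolute value. This is a small sketch using the printed coefficients; the names `quality_cor` and `ranked` are mine.

```r
# Correlation of each attribute with quality, copied from the matrix above.
quality_cor <- c(
  fixed.acidity        =  0.12405165,
  volatile.acidity     = -0.39055778,
  citric.acid          =  0.22637251,
  residual.sugar       =  0.01373164,
  chlorides            = -0.12890656,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.18510029,
  density              = -0.17491923,
  pH                   = -0.05773139,
  sulphates            =  0.25139708,
  alcohol              =  0.47616632
)

# Sort by absolute value so the strongest relationships come first.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(round(ranked, 2))
```

With the full data frame loaded, the same vector could presumably be taken from `cor(reds)[, "quality"]` after dropping the quality entry itself.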

## reds$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.580  11.000 
## -------------------------------------------------------- 
## reds$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## reds$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## reds$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## reds$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## reds$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

The median alcohol content of the higher quality wines (6, 7 and 8) is higher than that of the lower quality wines, which supports the theory.
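Grouped summaries in the format above are what `by()` prints when `summary` is applied within each quality level. As a hedged illustration, the sketch below runs it on only the first ten rows shown in the `str()` output, so the numbers will not match the full-data output.

```r
# First ten rows of alcohol and quality, copied from the str() output above.
reds <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Six-number summary of alcohol within each quality rating.
print(by(reds$alcohol, reds$quality, summary))

# Median alcohol per rating, as a compact named vector.
median_alcohol <- tapply(reds$alcohol, reds$quality, median)
print(median_alcohol)
```

`tapply()` with `median` is a convenient way to compare the groups on one statistic without reading through every summary block.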

## reds$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## reds$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## reds$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## reds$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## reds$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## reds$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

From these box plots of volatile acidity by quality I can see that higher quality wines tend to have lower volatile acidity.

I will also plot the other two attributes, sulphates and citric acid, to see if there is any relationship.

## reds$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.6150  0.8600 
## -------------------------------------------------------- 
## reds$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5964  0.6000  2.0000 
## -------------------------------------------------------- 
## reds$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## -------------------------------------------------------- 
## reds$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## -------------------------------------------------------- 
## reds$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7413  0.8300  1.3600 
## -------------------------------------------------------- 
## reds$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7678  0.8200  1.1000

After eliminating outliers greater than 1.1, it looks like median content of sulphates is a little higher in wines of good quality.

## reds$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0050  0.0350  0.1710  0.3275  0.6600 
## -------------------------------------------------------- 
## reds$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0300  0.0900  0.1742  0.2700  1.0000 
## -------------------------------------------------------- 
## reds$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900 
## -------------------------------------------------------- 
## reds$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800 
## -------------------------------------------------------- 
## reds$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3050  0.4000  0.3752  0.4900  0.7600 
## -------------------------------------------------------- 
## reds$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.3025  0.4200  0.3911  0.5300  0.7200

After eliminating values higher than 0.8 as outliers, I notice that there is a steady increase in the median citric acid content in wines as the quality goes up.

This plot explores the relationship between fixed acidity and density, which have a positive correlation of 0.67. The relationship is not intuitive, and it will not contribute to the main analysis.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The first step in the bivariate analysis was to draw a pairwise plot of all the attributes against the quality rating and to calculate the correlations between them. To identify the attributes that contribute to a good quality wine, I focused on the relationship of quality with the chemical attributes of wine. From the correlation matrix I found that higher quality wines tend to have higher alcohol content and lower volatile acidity. After removing some outliers I also noticed that higher sulphates and citric acid content may contribute to higher wine quality, possibly by helping to keep the wine fresh.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Other interesting relationships between other features are:

Alcohol is negatively correlated with density (-0.5).
Density is positively correlated with fixed acidity (0.67).
Citric acid is negatively correlated with volatile acidity (-0.55) and positively correlated with fixed acidity (0.67).

What was the strongest relationship you found?

The strongest relationship that is not obvious by definition is the one between density and fixed acidity (correlation = 0.67).

Multivariate Plots Section

The color separation in the plot points supports my observation that a good quality wine has a higher alcohol content and low volatile acidity.

The boxplot is created for (alcohol-volatile acidity) to compound the effect of both alcohol and volatile acidity. This again helps visualize that high quality wines have high alcohol and low volatile acidity.

After eliminating outliers I can see that there is no strong separation of the colors but the colors do indicate that most high quality wines tend to have higher content of sulphates and citric acid.

The boxplot is created for (citric acid + sulphates) to compound the effect of both citric acid and sulphates. This again helps visualize that high quality wines have high amount of citric acid and sulphates.
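The two compound variables described above (alcohol minus volatile acidity, and citric acid plus sulphates) are simple column arithmetic. The sketch below is only an illustration on the first ten rows shown in the `str()` output; the column names `alcohol.minus.va` and `citric.plus.sulphates` are hypothetical, not taken from the original code.

```r
# First ten rows of the relevant columns, copied from the str() output above.
reds <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Compound variables (hypothetical names): a high value of either one
# corresponds to the direction associated with better quality.
reds$alcohol.minus.va      <- reds$alcohol - reds$volatile.acidity
reds$citric.plus.sulphates <- reds$citric.acid + reds$sulphates

# Median of each compound variable per quality rating.
compound_medians <- aggregate(
  cbind(alcohol.minus.va, citric.plus.sulphates) ~ quality,
  data = reds, FUN = median
)
print(compound_medians)
```

Note that alcohol is a percentage while volatile acidity is in g / dm^3, so the difference is a plotting heuristic rather than a physically meaningful quantity.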

This plot brings all four variables together and shows that the good quality wines populate the top right corner of the scatter plot.

This density plot shows a clear separation around 11 between higher quality wines and lower quality ones.

I will look at a few more relationships that may not affect the main analysis but are probably interesting:

Alcohol is correlated with density at -0.5, which led me to explore the distribution of quality given a scatter plot of alcohol and density. Although good quality wines are found more at the top left area of the chart, there are some good quality wines that are found in the right side of the plot as well. The orange dashed line indicates the density of water at 20 degrees C. Alcohol is less dense than water, so keeping other factors constant higher alcohol would usually mean lower density. There may be other factors involved which pushed higher density for some wines with high alcohol.

In wine making, sugar is converted to alcohol by the fermentation process, so more alcohol could be expected to mean less residual sugar. I wanted to look at how the interaction between these two attributes contributes to quality. After eliminating the top 1% of residual sugar values as outliers, I noticed from the plot that most of the wines have low residual sugar regardless of the alcohol content. This may mean that wine makers use most of the sugar in the grapes to increase the alcohol content, and that the proportion of alcohol to residual sugar depends on the initial sugar content of the grapes.

Citric acid and volatile acidity have a correlation coefficient of -0.55. From the scatter plot I can see that the good quality wines cluster around the bottom right corner.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = reds)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = reds)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates, 
##     data = reds)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid, data = reds)
## 
## ================================================================
##                        m1         m2         m3         m4      
## ----------------------------------------------------------------
##   (Intercept)        1.875***   3.095***   2.611***   2.646***  
##                     (0.175)    (0.184)    (0.196)    (0.201)    
##   alcohol            0.361***   0.314***   0.309***   0.309***  
##                     (0.017)    (0.016)    (0.016)    (0.016)    
##   volatile.acidity             -1.384***  -1.221***  -1.265***  
##                                (0.095)    (0.097)    (0.113)    
##   sulphates                                0.679***   0.696***  
##                                           (0.101)    (0.103)    
##   citric.acid                                        -0.079     
##                                                      (0.104)    
## ----------------------------------------------------------------
##   R-squared             0.227      0.317      0.336      0.336  
##   adj. R-squared        0.226      0.316      0.335      0.334  
##   sigma                 0.710      0.668      0.659      0.659  
##   F                   468.267    370.379    268.912    201.777  
##   p                     0.000      0.000      0.000      0.000  
##   Log-likelihood    -1721.057  -1621.814  -1599.384  -1599.093  
##   Deviance            805.870    711.796    692.105    691.852  
##   AIC                3448.114   3251.628   3208.768   3210.186  
##   BIC                3464.245   3273.136   3235.654   3242.448  
##   N                  1599       1599       1599       1599      
## ================================================================

The selected variables in the model explain 33.6% of the variance in quality. This is a modest share, but it is good that some of the variance can be explained.
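To see what the fitted coefficients mean in practice, the m3 column of the table above can be applied by hand to a single wine. This is a worked sketch, not the original code: it uses the rounded coefficients from the table and the first wine from the `str()` output, whose recorded quality is 5.

```r
# Coefficients of model m3, copied (rounded) from the regression table above.
m3_coef <- c("(Intercept)" = 2.611, alcohol = 0.309,
             volatile.acidity = -1.221, sulphates = 0.679)

# First wine in the dataset, values copied from the str() output above.
wine <- c(alcohol = 9.4, volatile.acidity = 0.7, sulphates = 0.56)

# Linear prediction: intercept plus the sum of coefficient * value.
predicted <- m3_coef[["(Intercept)"]] + sum(m3_coef[names(wine)] * wine)
cat("predicted quality:", round(predicted, 2), "\n")
```

With the fitted model object available, `predict()` would give the same kind of result without the rounding of the coefficients.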

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

In the Multivariate analysis, I used quality as a color for scatter plot points to identify if the good quality wines (of similar color) cluster together in certain areas of the plot. The first plot revealed that good quality wines do indeed cluster around high alcohol and low volatile acidity region. In the scatter plot for sulphates and citric acid the quality color codes are more dispersed but there is some indication of improving quality with increasing sulphates and citric acid. Combining all the variables together provides a good separation between high quality and low quality wines.

Were there any interesting or surprising interactions between features?

Alcohol and volatile acidity together account for most of the variation in wine quality that the models explain. I also noted that sulphates and citric acid seem to work together to bring out the freshness of wines and push up the quality a bit. I looked at the interaction of alcohol and residual sugar as well, but there was nothing notable there. Citric acid and volatile acidity also showed some correlation.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I created 4 linear models to predict wine quality, adding variables in sequence: alcohol first, then volatile acidity, sulphates and citric acid. Alcohol explains 22.7% of the variance by itself, and adding volatile acidity raises this to 31.7%. With sulphates the model explains 33.6% of the variance. Adding citric acid does not improve the model: R-squared stays at 0.336 and the citric acid coefficient (-0.079, standard error 0.104) is not significant. A likely reason is that citric acid is correlated with volatile acidity (-0.55), so it adds little information that volatile acidity does not already carry.
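The comparison of the four models can be summarized from the fit statistics in the regression table. The sketch below only re-reads the printed numbers (it does not refit anything); the data frame name `fit` and its column names are mine.

```r
# Fit statistics for m1..m4, copied from the regression table above.
fit <- data.frame(
  model     = c("m1", "m2", "m3", "m4"),
  r.squared = c(0.227, 0.317, 0.336, 0.336),
  loglik    = c(-1721.057, -1621.814, -1599.384, -1599.093),
  aic       = c(3448.114, 3251.628, 3208.768, 3210.186),
  bic       = c(3464.245, 3273.136, 3235.654, 3242.448),
  stringsAsFactors = FALSE
)

# Gain in R-squared from each added predictor (NA for the first model).
fit$r.squared.gain <- c(NA, diff(fit$r.squared))

# Lower AIC and BIC indicate a better trade-off between fit and complexity.
best_aic <- fit$model[which.min(fit$aic)]
best_bic <- fit$model[which.min(fit$bic)]

print(fit)
cat("lowest AIC:", best_aic, " lowest BIC:", best_bic, "\n")
```

Both criteria favor the three-variable model, which matches the observation that citric acid adds nothing once volatile acidity and sulphates are included.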

The model is useful for indicating which attributes may contribute to a good quality wine, and it supports some of the initial theory about those attributes. However, an R-squared value of 0.336 indicates that it is not a very good model for predicting wine quality. Since the quality rating is a qualitative measure that depends on the subjective judgment of wine experts, it is reasonable that most of the variance is not explained by the model.

Final Plots and Summary

Plot One

##   3   4   5   6   7   8 
##  10  53 681 638 199  18

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Description One

This plot was selected to highlight the distribution of quality ratings across all the wine samples available in the dataset and its possible impact on the study. The histogram of wine quality shows that most of the wines have a rating of 5 (681) or 6 (638). Together they account for 1319 out of 1599 (82.5%) of the wines. This uneven distribution can limit how well the study can identify which of the 11 chemical attributes contribute to the quality ratings.

Plot Two

## reds$quality.factor: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.580  11.000 
## -------------------------------------------------------- 
## reds$quality.factor: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## reds$quality.factor: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## reds$quality.factor: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## reds$quality.factor: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## reds$quality.factor: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

## reds$quality.factor: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## reds$quality.factor: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## reds$quality.factor: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## reds$quality.factor: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## reds$quality.factor: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## reds$quality.factor: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

## 
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity, data = reds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.59342 -0.40416 -0.07426  0.46539  2.25809 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.09547    0.18450   16.78   <2e-16 ***
## alcohol           0.31381    0.01601   19.60   <2e-16 ***
## volatile.acidity -1.38364    0.09527  -14.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6678 on 1596 degrees of freedom
## Multiple R-squared:  0.317,  Adjusted R-squared:  0.3161 
## F-statistic: 370.4 on 2 and 1596 DF,  p-value: < 2.2e-16

Description Two

The two variables most strongly related to wine quality are alcohol percentage and volatile acidity. Plotting both variables separately by wine quality shows that, on average, wines of higher quality have a higher alcohol percentage and lower volatile acidity. The summaries show the same pattern: alcohol percentage increases with quality, while median volatile acidity falls from 0.845 at quality 3 to 0.37 at qualities 7 and 8. The linear model estimates that quality rises by about 0.31 for each additional percentage point of alcohol and falls by about 1.38 for each unit increase in volatile acidity, and together the two variables explain 31.7% of the variance in wine quality. Both coefficients are statistically significant, but the R-squared value is modest, so the model is better read as a guideline to which attributes make for a good quality wine than as a precise predictor.
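The grouped summaries and the two-variable model discussed above can be reproduced along the following lines. This is a minimal sketch, not the original analysis code: the column names and the model formula are taken from the output shown earlier, but the data frame here is a small simulated stand-in (the scoring rule and value ranges are made up), so the printed numbers will not match the report.

```r
# Simulated stand-in for the wine data, for illustration only.
# The real `reds` data frame has 1599 rows and is not reproduced here.
set.seed(42)
n <- 200
reds <- data.frame(
  alcohol          = runif(n, 8.5, 14),
  volatile.acidity = runif(n, 0.12, 1.2)
)
# Hypothetical scoring rule, used only to give the toy data some structure.
score <- 3 + 0.3 * reds$alcohol - 1.4 * reds$volatile.acidity + rnorm(n, sd = 0.7)
reds$quality <- as.integer(pmin(8, pmax(3, round(score))))
reds$quality.factor <- factor(reds$quality)

# Six-number summaries of each variable within each quality level.
alcohol_by_quality <- by(reds$alcohol, reds$quality.factor, summary)
acidity_by_quality <- by(reds$volatile.acidity, reds$quality.factor, summary)
print(alcohol_by_quality)
print(acidity_by_quality)

# Two-predictor linear model with the same formula as in the output above.
fit <- lm(quality ~ alcohol + volatile.acidity, data = reds)
print(summary(fit))

# Share of variance explained (the "Multiple R-squared" line of the summary).
r_squared <- summary(fit)$r.squared
```

With the real data in place of the simulated frame, `r_squared` corresponds to the 0.317 reported in the model summary.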

Plot Three

Description Three

This final plot brings out the interaction between alcohol and volatile acidity through a scatter plot, and coloring the points by quality helps show whether higher quality wines cluster in a different part of the plot than lower quality wines. The plot shows a reasonable separation between the colors corresponding to high and low quality: higher quality wines cluster toward the top left of the plot, whereas lower quality wines cluster toward the bottom right.
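A plot of this kind could be drawn roughly as follows. The sketch makes several assumptions: volatile acidity is placed on the x-axis and alcohol on the y-axis (inferred from the description of high quality wines sitting at the top left), base graphics are used only to keep the example free of package dependencies, the color palette is arbitrary, and the data frame is a simulated stand-in rather than the real measurements.

```r
# Simulated stand-in data, for illustration only (not the real wine measurements).
set.seed(7)
n <- 300
reds <- data.frame(
  alcohol          = runif(n, 8.5, 14),
  volatile.acidity = runif(n, 0.12, 1.2)
)
# Hypothetical rule that gives the toy quality scores some structure.
score <- 3 + 0.3 * reds$alcohol - 1.4 * reds$volatile.acidity + rnorm(n, sd = 0.7)
reds$quality.factor <- factor(pmin(8, pmax(3, round(score))))

# One color per quality level, running from red (low) to green (high).
pal <- colorRampPalette(c("firebrick", "khaki", "darkgreen"))(nlevels(reds$quality.factor))
point_col <- pal[as.integer(reds$quality.factor)]

# Scatter plot of alcohol against volatile acidity, colored by quality.
plot(reds$volatile.acidity, reds$alcohol,
     col = point_col, pch = 19,
     xlab = "Volatile acidity", ylab = "Alcohol (%)",
     main = "Alcohol vs. volatile acidity by quality")
legend("topright", legend = levels(reds$quality.factor),
       col = pal, pch = 19, title = "Quality")
```

Mapping quality to an ordered color ramp, rather than to unrelated colors, makes it easier to see whether high and low ratings occupy different regions of the plot.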

Reflection

The dataset is publicly available for research from Cortez et al., 2009 [1]. This particular dataset contains 1599 observations of red wines of various quality ratings based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). The dataset also contains 11 chemical attributes of each wine (such as pH and alcohol percentage).

In the study I wanted to find out whether there is any relationship between wine quality, which is a subjective opinion of wine experts, and the objective measurements. Plotting a histogram of wine quality revealed that the quality distribution is not balanced: many wines have an average rating (5 or 6), while no wines are rated very bad (0 to 2) or very excellent (9 or 10). This could have had some effect on the study.

Evaluating the correlation coefficients, I found that wine quality had a moderate correlation with alcohol percentage (0.48) and volatile acidity (-0.39), and a minor correlation with sulphates (0.25) and citric acid (0.23). Exploring these relationships further using boxplots and scatterplots, I found that most high quality wines do have a higher alcohol percentage and lower volatile acidity. High alcohol content could be a consequence of the wine making process used for high quality wines. Volatile acidity, which is the amount of acetic acid in wine, can lead to an unpleasant vinegar taste at high levels and hence to a low quality rating. The other two attributes could have some impact on the freshness and taste of wines. Citric acid, which is found in small quantities, can add 'freshness' and flavor to wines. Sulphates are a wine additive that acts as an antimicrobial and antioxidant, helping to prevent the degradation of wines.

Together these four attributes accounted for 33.6% of the variance in wine quality. That is too little to predict wine quality reliably, but it can serve as a guideline for controlling these four attributes to improve wine quality. It is very difficult to model and predict subjective variables such as wine quality, so this study has its limitations. The study could also have been improved by using a more balanced dataset with more wines from the lower quality (0 to 4) and higher quality (7 to 10) groups.
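The correlation screening and the four-attribute model mentioned in this reflection could be computed along these lines. This is a sketch under stated assumptions: Pearson correlation (the default of `cor`) and a plain additive formula are assumed, since the text does not state either, and the data frame is a simulated stand-in with arbitrary value ranges and a made-up scoring rule, so the results will not match the 0.48, -0.39, 0.25, 0.23 and 33.6% reported above.

```r
# Simulated stand-in data, for illustration only (not the real wine measurements).
set.seed(1)
n <- 200
reds <- data.frame(
  alcohol          = runif(n, 8.5, 14),
  volatile.acidity = runif(n, 0.12, 1.2),
  sulphates        = runif(n, 0.3, 1.2),
  citric.acid      = runif(n, 0, 0.8)
)
# Hypothetical scoring rule, used only to give the toy data some structure.
score <- 2.5 + 0.3 * reds$alcohol - 1.2 * reds$volatile.acidity +
  0.7 * reds$sulphates + 0.3 * reds$citric.acid + rnorm(n, sd = 0.7)
reds$quality <- as.integer(pmin(8, pmax(3, round(score))))

# Pearson correlation of each attribute with quality, strongest first.
predictors <- c("alcohol", "volatile.acidity", "sulphates", "citric.acid")
cors <- sapply(reds[predictors], cor, y = reds$quality)
cors <- cors[order(abs(cors), decreasing = TRUE)]
print(round(cors, 2))

# Additive linear model on the four attributes (assumed formula).
fit4 <- lm(quality ~ alcohol + volatile.acidity + sulphates + citric.acid,
           data = reds)
r_squared4 <- summary(fit4)$r.squared
print(r_squared4)
```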

[1] Data Source: The dataset is publicly available for research. The details are described in [Cortez et al., 2009] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at: [Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib