Libraries

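library(tidyverse)   # ggplot2 helpers such as ggtitle() and theme_bw()
library(factoextra)  # fviz_nbclust() and fviz_cluster()
library(caret)
library(cluster)
library(dbscan)      # kNNdistplot()
library(fpc)         # dbscan(), called below as fpc::dbscan()
library(flexclust)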

Load Data and Explore It

wine_data <- read.csv("/Users/dylandoyle-rowan/Desktop/Business Analytics/wine-clustering.csv")

head(wine_data)
summary(wine_data)
##     Alcohol        Malic_Acid         Ash         Ash_Alcanity  
##  Min.   :11.03   Min.   :0.740   Min.   :1.360   Min.   :10.60  
##  1st Qu.:12.36   1st Qu.:1.603   1st Qu.:2.210   1st Qu.:17.20  
##  Median :13.05   Median :1.865   Median :2.360   Median :19.50  
##  Mean   :13.00   Mean   :2.336   Mean   :2.367   Mean   :19.49  
##  3rd Qu.:13.68   3rd Qu.:3.083   3rd Qu.:2.558   3rd Qu.:21.50  
##  Max.   :14.83   Max.   :5.800   Max.   :3.230   Max.   :30.00  
##    Magnesium      Total_Phenols     Flavanoids    Nonflavanoid_Phenols
##  Min.   : 70.00   Min.   :0.980   Min.   :0.340   Min.   :0.1300      
##  1st Qu.: 88.00   1st Qu.:1.742   1st Qu.:1.205   1st Qu.:0.2700      
##  Median : 98.00   Median :2.355   Median :2.135   Median :0.3400      
##  Mean   : 99.74   Mean   :2.295   Mean   :2.029   Mean   :0.3619      
##  3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.875   3rd Qu.:0.4375      
##  Max.   :162.00   Max.   :3.880   Max.   :5.080   Max.   :0.6600      
##  Proanthocyanins Color_Intensity       Hue             OD280      
##  Min.   :0.410   Min.   : 1.280   Min.   :0.4800   Min.   :1.270  
##  1st Qu.:1.250   1st Qu.: 3.220   1st Qu.:0.7825   1st Qu.:1.938  
##  Median :1.555   Median : 4.690   Median :0.9650   Median :2.780  
##  Mean   :1.591   Mean   : 5.058   Mean   :0.9574   Mean   :2.612  
##  3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:1.1200   3rd Qu.:3.170  
##  Max.   :3.580   Max.   :13.000   Max.   :1.7100   Max.   :4.000  
##     Proline      
##  Min.   : 278.0  
##  1st Qu.: 500.5  
##  Median : 673.5  
##  Mean   : 746.9  
##  3rd Qu.: 985.0  
##  Max.   :1680.0

Question 1

Q1-A

Normalization changes the distances between observations. The distance measures used in clustering depend on the scale of each variable, so a variable with a large range (Proline runs from 278 to 1680) would dominate one with a small range (Hue runs from 0.48 to 1.71). Normalizing puts every variable on the same scale, so each one contributes comparably to the distance and the clusters reflect all attributes.
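As a minimal sketch of this effect, the example below uses four made-up wines (the values and the row names A to D are invented for illustration, not taken from the dataset) and checks which wine is closest to wine A before and after scaling.

```r
# Hypothetical measurements for four wines (values invented for illustration).
toy <- data.frame(
  Hue     = c(0.50, 0.55, 1.70, 1.60),
  Proline = c(500, 600, 510, 1000),
  row.names = c("A", "B", "C", "D")
)

# Raw Euclidean distances are dominated by Proline, the variable with the larger range.
raw_dist <- as.matrix(dist(toy))

# After z-scoring with scale(), each variable has mean 0 and standard deviation 1.
scaled_dist <- as.matrix(dist(scale(toy)))

# Nearest neighbour of wine A under each distance.
others <- c("B", "C", "D")
nearest_raw    <- names(which.min(raw_dist["A", others]))
nearest_scaled <- names(which.min(scaled_dist["A", others]))
print(c(raw = nearest_raw, scaled = nearest_scaled))
```

On the raw data, wine A is matched with wine C, which has a similar Proline value but a very different Hue. After scaling, it is matched with wine B, which is similar on both variables.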

The number of clusters also shapes the outcome. With too few clusters (underfitting), the groups are overly broad and merge observations that should be kept apart. With too many clusters (overfitting), the algorithm draws distinctions that do not generalize well.

Q1-B

Normalizing Data

wines.norm <- scale(wine_data)

Determining Optimal K

set.seed(123)
fviz_nbclust(wines.norm, kmeans, method = "wss") +
  ggtitle("Optimal Number of Clusters - Elbow Method")

Choosing k = 3 because that is the elbow of the plot: beyond it, the total within-cluster sum of squares on the y-axis drops by only a small amount with each added cluster.
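The quantity behind the elbow plot can be computed by hand. This is a rough sketch on synthetic data (three made-up 2-D blobs, not the wine data), assuming the same nstart = 25 used later in this report.

```r
set.seed(123)  # k-means uses random starts, so fix the seed for repeatability

# Stand-in data: three well-separated 2-D blobs (synthetic, not the wine data).
blobs <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares (WSS) for k = 1..8.
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(blobs, centers = k, nstart = 25)$tot.withinss
})

# Drop in WSS gained by each extra cluster; the elbow is where this flattens out.
wss_drop <- -diff(wss)
print(round(data.frame(k = k_values[-1], wss_drop = wss_drop), 2))
```

WSS keeps shrinking as k grows, so the smallest value cannot be used to pick k. The elbow is the point after which the drops become small.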

fviz_nbclust(wines.norm, kmeans, method = "silhouette") +
  ggtitle("Optimal Number of Clusters - Silhouette Method")

The silhouette method also points to k = 3, the value at which the average silhouette width is highest.
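A minimal sketch of the silhouette calculation, again on synthetic blobs rather than the wine data, using silhouette() from the cluster package:

```r
library(cluster)  # provides silhouette()

set.seed(123)

# Stand-in data: three synthetic 2-D blobs (not the wine data).
blobs <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 2)
)
d <- dist(blobs)

# Average silhouette width for k = 2..8 (the silhouette is undefined for k = 1).
k_values <- 2:8
avg_sil <- sapply(k_values, function(k) {
  fit <- kmeans(blobs, centers = k, nstart = 25)
  mean(silhouette(fit$cluster, d)[, "sil_width"])
})

# The suggested k is the one with the highest average silhouette width.
best_k <- k_values[which.max(avg_sil)]
print(best_k)
```

Unlike WSS, the average silhouette width does not keep improving as k grows, so its maximum can be read off directly.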

K-Means Clustering Based on k = 3

optimal_k <- 3
kmeans.wines <- kmeans(wines.norm, centers = optimal_k, nstart = 25)

Clustering Results

print(kmeans.wines$size)
## [1] 65 62 51
print(kmeans.wines$withinss)
## [1] 558.6971 385.6983 326.3537
print(kmeans.wines$centers)
##      Alcohol Malic_Acid        Ash Ash_Alcanity   Magnesium Total_Phenols
## 1 -0.9234669 -0.3929331 -0.4931257    0.1701220 -0.49032869   -0.07576891
## 2  0.8328826 -0.3029551  0.3636801   -0.6084749  0.57596208    0.88274724
## 3  0.1644436  0.8690954  0.1863726    0.5228924 -0.07526047   -0.97657548
##    Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity        Hue
## 1  0.02075402          -0.03343924      0.05810161      -0.8993770  0.4605046
## 2  0.97506900          -0.56050853      0.57865427       0.1705823  0.4726504
## 3 -1.21182921           0.72402116     -0.77751312       0.9388902 -1.1615122
##        OD280    Proline
## 1  0.2700025 -0.7517257
## 2  0.7770551  1.1220202
## 3 -1.2887761 -0.4059428

Visualize

fviz_cluster(kmeans.wines, data = wines.norm,
             palette = c("red", "blue", "green"),
             geom = "point",
             ellipse.type = "convex",
             ggtheme = theme_bw()) +
  ggtitle("K-Means Clustering of Wine Dataset")

Interpreting Clusters

Cluster 1 (Red) - Close to average on most attributes but low in Alcohol, Color Intensity, Proline, and Ash. These could potentially be simple white wines.

Cluster 2 (Blue) - High in Alcohol, Proline, Flavanoids, Total Phenols. Low in Ash Alkalinity, NonFlavanoid Phenols. These are potentially higher-quality wines with a high alcohol percentage.

Cluster 3 (Green) - High in Malic Acid and Color Intensity. Low in Flavanoids, Total Phenols, Hue, OD280. These are potentially dark wines.
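The centers printed above are z-scores, which are hard to read as wine measurements. One possible way to put them back in the original units is sketched below. It uses the built-in iris data as a stand-in (an assumption for illustration only), and the same steps should apply to wine_data and wines.norm.

```r
set.seed(123)

# Stand-in data: the four numeric columns of the built-in iris dataset.
raw <- iris[, 1:4]
scaled <- scale(raw)
fit <- kmeans(scaled, centers = 3, nstart = 25)

# scale() stores the column means and standard deviations as attributes.
col_means <- attr(scaled, "scaled:center")
col_sds   <- attr(scaled, "scaled:scale")

# Undo the z-score column by column: original = z * sd + mean.
centers_original <- sweep(fit$centers, 2, col_sds, `*`)
centers_original <- sweep(centers_original, 2, col_means, `+`)
print(round(centers_original, 2))

# Cross-check: the same numbers are the per-cluster means of the raw data.
cluster_means <- aggregate(raw, by = list(cluster = fit$cluster), FUN = mean)
print(cluster_means)
```

Reading the centers in original units makes it easier to describe a cluster, for example by its typical alcohol percentage.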

Question 2

Q2-A

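DBSCAN and K-Means differ in several ways. K-Means has a single hyperparameter, the number of clusters k, while DBSCAN has two: epsilon (the neighborhood radius) and minPts (the minimum number of points needed to form a dense region).

K-Means is sensitive to outliers and noisy data because every point pulls on a centroid, while DBSCAN is robust to both.

K-Means forces every point into a cluster, while DBSCAN labels points in low-density regions as noise.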
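A small sketch of the K-Means side of this comparison, using synthetic points (two made-up groups plus one far-away point, not the wine data):

```r
set.seed(123)

# Synthetic data: two tight 2-D groups plus one far-away point (row 41).
group_a <- matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2)
group_b <- matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2)
outlier <- matrix(c(8, 8), ncol = 2)
pts <- rbind(group_a, group_b, outlier)

fit <- kmeans(pts, centers = 2, nstart = 25)

# Every row gets a label in 1..k; k-means has no "noise" label.
print(table(fit$cluster))

# Cluster label given to the far-away point, and the fitted centroids.
print(fit$cluster[41])
print(round(fit$centers, 2))
```

The far-away point is absorbed into the nearer group and shifts that group's centroid toward itself. DBSCAN would instead be free to give such a point the noise label 0.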

Q2-B - DBSCAN Clustering

dbscan::kNNdistplot(wines.norm, k = 4)
abline(h = 2.0, lty = 2, col = "red")
title("k-NN Distance Plot for Epsilon Selection")

DBscan.wine <- fpc::dbscan(wines.norm, eps = 2.0, MinPts = 5)
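The eps value used above is read off the k-NN distance plot. As a rough sketch of what that plot shows, the distances can be computed by hand in base R. The data here are synthetic, and k = 4 is assumed to correspond to MinPts - 1 as in the call above.

```r
set.seed(123)

# Stand-in data: 60 synthetic 2-D points (not the wine data).
pts <- matrix(rnorm(120), ncol = 2)

k <- 4  # assumption: k = MinPts - 1, matching k = 4 and MinPts = 5 in this report

# Full pairwise distance matrix; the diagonal holds each point's zero self-distance.
d <- as.matrix(dist(pts))

# For each point, sort its distances and take entry k + 1 (entry 1 is the self-distance).
knn_dist <- apply(d, 1, function(row) sort(row)[k + 1])

# Sorted ascending, these values form the curve drawn by a k-NN distance plot.
knn_curve <- sort(knn_dist)
print(summary(knn_curve))
```

Calling plot(knn_curve, type = "l") draws the curve. A common heuristic is to pick eps near the knee, where the distances start rising sharply.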

DBSCAN Results

DBscan.wine
## dbscan Pts=178 MinPts=5 eps=2
##         0  1 2 3 4
## border 85 32 4 2 7
## seed    0 42 1 3 2
## total  85 74 5 5 9

Plot

fviz_cluster(DBscan.wine,
             data = wines.norm,
             geom = "point",
             stand = FALSE,
             ellipse = FALSE,
             pointsize = 2,
             palette = c("red", "blue", "green", "gold"),
             ggtheme = theme_bw()) +
  ggtitle("DBSCAN Clustering of Wine Dataset")

Analyze Clusters

table(DBscan.wine$cluster)
## 
##  0  1  2  3  4 
## 85 74  5  5  9

Noise Points (Cluster 0)

sum(DBscan.wine$cluster == 0)
## [1] 85

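The parameters epsilon (ε) and minPts both influence the clustering results. With eps = 2.0 and MinPts = 5, DBSCAN found 4 clusters and labeled 85 of the 178 wines (about 48%) as noise.

Epsilon - A higher epsilon enlarges each point's neighborhood, so more points become core points and fewer are labeled noise. If epsilon is set too high, separate clusters merge.

MinPts - A higher minPts requires a denser region to form a cluster, so more points are labeled noise.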
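Both effects can be checked directly from the definitions of core and noise points. The sketch below counts them in base R on synthetic data. The helper count_points() is hypothetical, and it assumes the neighborhood count includes the point itself.

```r
set.seed(123)

# Synthetic data: one dense blob plus scattered background points (not the wine data).
dense  <- matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2)
sparse <- matrix(runif(40, min = -4, max = 4), ncol = 2)
pts <- rbind(dense, sparse)
d <- as.matrix(dist(pts))

# Count core and noise points for one (eps, min_pts) pair.
# Assumption: the neighbourhood count includes the point itself.
count_points <- function(eps, min_pts) {
  within_eps <- d <= eps
  core <- rowSums(within_eps) >= min_pts
  # Noise: not a core point and not within eps of any core point.
  near_core <- rowSums(within_eps[, core, drop = FALSE]) > 0
  c(eps = eps, min_pts = min_pts, core = sum(core), noise = sum(!core & !near_core))
}

results <- rbind(
  count_points(eps = 0.3, min_pts = 5),
  count_points(eps = 0.6, min_pts = 5),   # larger eps
  count_points(eps = 0.3, min_pts = 10)   # larger minPts
)
print(results)
```

Raising eps can only add core points and remove noise points, while raising minPts can only do the opposite.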

Question 3

Pros and Cons of Each Method

K-Means

Pros

  • Easy to implement and fast to run
  • Produces compact clusters
  • Only requires 1 hyperparameter (k)

Cons

  • The number of clusters must be chosen in advance and is hard to predict
  • Sensitive to outliers, noise, and initial centroids
  • Assumes roughly spherical clusters

DBSCAN

Pros

  • Doesn’t need the number of clusters to be specified
  • Good with noisy data
  • Even identifies noise points

Cons

  • Requires two hyperparameters (epsilon and minPts)
  • Optimizing epsilon is difficult
  • Struggles with clusters of varying density