wine_data <- read.csv("/Users/dylandoyle-rowan/Desktop/Business Analytics/wine-clustering.csv")
head(wine_data)
summary(wine_data)
## Alcohol Malic_Acid Ash Ash_Alcanity
## Min. :11.03 Min. :0.740 Min. :1.360 Min. :10.60
## 1st Qu.:12.36 1st Qu.:1.603 1st Qu.:2.210 1st Qu.:17.20
## Median :13.05 Median :1.865 Median :2.360 Median :19.50
## Mean :13.00 Mean :2.336 Mean :2.367 Mean :19.49
## 3rd Qu.:13.68 3rd Qu.:3.083 3rd Qu.:2.558 3rd Qu.:21.50
## Max. :14.83 Max. :5.800 Max. :3.230 Max. :30.00
## Magnesium Total_Phenols Flavanoids Nonflavanoid_Phenols
## Min. : 70.00 Min. :0.980 Min. :0.340 Min. :0.1300
## 1st Qu.: 88.00 1st Qu.:1.742 1st Qu.:1.205 1st Qu.:0.2700
## Median : 98.00 Median :2.355 Median :2.135 Median :0.3400
## Mean : 99.74 Mean :2.295 Mean :2.029 Mean :0.3619
## 3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.875 3rd Qu.:0.4375
## Max. :162.00 Max. :3.880 Max. :5.080 Max. :0.6600
## Proanthocyanins Color_Intensity Hue OD280
## Min. :0.410 Min. : 1.280 Min. :0.4800 Min. :1.270
## 1st Qu.:1.250 1st Qu.: 3.220 1st Qu.:0.7825 1st Qu.:1.938
## Median :1.555 Median : 4.690 Median :0.9650 Median :2.780
## Mean :1.591 Mean : 5.058 Mean :0.9574 Mean :2.612
## 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:1.1200 3rd Qu.:3.170
## Max. :3.580 Max. :13.000 Max. :1.7100 Max. :4.000
## Proline
## Min. : 278.0
## 1st Qu.: 500.5
## Median : 673.5
## Mean : 746.9
## 3rd Qu.: 985.0
## Max. :1680.0
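Normalization changes the distances between observations. The distance measures used in clustering are driven by the scale of each variable, so a variable with a wide range (such as Proline, 278 to 1680) would dominate one with a narrow range (such as Hue, 0.48 to 1.71). Normalizing puts all variables on a comparable scale, so each one contributes similarly to the distances and the resulting clusters are more meaningful.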
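A small sketch of this effect, using toy values loosely based on the minimum, median, and maximum of Alcohol and Proline in the summary above. The data frame `toy` and its values are illustrative only, and `scale()` is assumed here as the normalization step; the code that created `wines.norm` is not shown in this report.

```r
# Toy data: two variables on very different scales (illustrative values only)
toy <- data.frame(
  Alcohol = c(11.0, 13.0, 14.8),
  Proline = c(278, 673, 1680)
)

# Raw Euclidean distances are driven almost entirely by Proline
dist(toy)

# z-score standardization: each column gets mean 0 and standard deviation 1
toy.norm <- scale(toy)

# After scaling, both variables contribute on a comparable footing
dist(toy.norm)
```

If the wine data were standardized the same way, the call would presumably look like `wines.norm <- scale(wine_data)`, but that line is an assumption.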
The number of clusters affects the clustering outcome in two main ways. With too few clusters (underfitting), the groups become overly broad and merge observations that probably should be kept apart. With too many clusters (overfitting), the algorithm draws distinctions that do not generalize well.
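One way to see this trade-off is to compute the total within-cluster sum of squares for a range of k values, which is the quantity an elbow plot displays. The sketch below runs on the built-in `iris` measurements as a stand-in, because the wine file is not available here; `demo.norm`, `wss`, and `nstart = 25` are assumptions rather than the original settings.

```r
set.seed(123)

# Stand-in for the scaled wine data (assumption: replace with wines.norm)
demo.norm <- scale(iris[, 1:4])

# Total within-cluster sum of squares for k = 1, ..., 8
wss <- sapply(1:8, function(k) {
  kmeans(demo.norm, centers = k, nstart = 25)$tot.withinss
})

# One value per k; look for the k after which the drop flattens out
print(round(wss, 1))
```

The values generally shrink as k grows, so the aim is not the smallest value but the point where adding another cluster stops reducing it by much.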
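library(factoextra)  # provides fviz_nbclust(); also attaches ggplot2 for ggtitle()
set.seed(123)
fviz_nbclust(wines.norm, kmeans, method = "wss") +
ggtitle("Optimal Number of Clusters - Elbow Method")
Choosing 3 because that is where the drop in the total within-cluster sum of squares (the Y axis) starts to level off.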
fviz_nbclust(wines.norm, kmeans, method = "silhouette") +
ggtitle("Optimal Number of Clusters - Silhouette Method")
Choosing 3 because the silhouette method also gives 3 as the result.
## [1] 65 62 51
## [1] 558.6971 385.6983 326.3537
## Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols
## 1 -0.9234669 -0.3929331 -0.4931257 0.1701220 -0.49032869 -0.07576891
## 2 0.8328826 -0.3029551 0.3636801 -0.6084749 0.57596208 0.88274724
## 3 0.1644436 0.8690954 0.1863726 0.5228924 -0.07526047 -0.97657548
## Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity Hue
## 1 0.02075402 -0.03343924 0.05810161 -0.8993770 0.4605046
## 2 0.97506900 -0.56050853 0.57865427 0.1705823 0.4726504
## 3 -1.21182921 0.72402116 -0.77751312 0.9388902 -1.1615122
## OD280 Proline
## 1 0.2700025 -0.7517257
## 2 0.7770551 1.1220202
## 3 -1.2887761 -0.4059428
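The cluster sizes, within-cluster sums of squares, and centers printed above are the kind of output a fitted `kmeans` object exposes. The call that produced `kmeans.wines` is not shown, so the following is only a guess at its shape, run on the built-in `iris` measurements as a stand-in; `demo.norm`, `kmeans.demo`, and `nstart = 25` are hypothetical.

```r
set.seed(123)

# Stand-in for the scaled wine data (assumption: replace with wines.norm)
demo.norm <- scale(iris[, 1:4])

# Fit k-means with 3 clusters; nstart = 25 tries 25 random starts (assumed value)
kmeans.demo <- kmeans(demo.norm, centers = 3, nstart = 25)

kmeans.demo$size      # number of observations in each cluster
kmeans.demo$withinss  # within-cluster sum of squares, one value per cluster
kmeans.demo$centers   # cluster means on the standardized scale
```

Because the centers are on the standardized scale, a positive value means the cluster is above the overall mean for that variable and a negative value means it is below.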
fviz_cluster(kmeans.wines, data = wines.norm,
palette = c("red", "blue", "green"),
geom = "point",
ellipse.type = "convex",
ggtheme = theme_bw()) +
ggtitle("K-Means Clustering of Wine Dataset")
Cluster 1 (Red) - Close to average on most attributes, but low in Alcohol, Color Intensity, Proline, and Ash. These could potentially be light, simple wines.
Cluster 2 (Blue) - High in Alcohol, Proline, Flavanoids, and Total Phenols. Low in Ash Alkalinity and Nonflavanoid Phenols. These could be good-quality wines with a high alcohol percentage.
Cluster 3 (Green) - High in Color Intensity, Malic Acid, and Nonflavanoid Phenols. Low in OD280, Flavanoids, Hue, and Total Phenols. These are potentially dark wines.
DBSCAN and K-Means differ in several ways. K-Means has one main hyperparameter (k, the number of clusters), while DBSCAN has two (epsilon and minPts).
K-Means is sensitive to outliers and noisy data because every point pulls on a cluster center, while DBSCAN is robust to both.
K-Means forces every point into a cluster, while DBSCAN labels points in low-density regions as noise.
dbscan::kNNdistplot(wines.norm, k = 4)
abline(h = 2.0, lty = 2, col = "red")
title("k-NN Distance Plot for Epsilon Selection")
## dbscan Pts=178 MinPts=5 eps=2
## 0 1 2 3 4
## border 85 32 4 2 7
## seed 0 42 1 3 2
## total 85 74 5 5 9
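The call that created `DBscan.wine` is not shown. The sketch below illustrates a DBSCAN fit and a noise count using `dbscan::dbscan`, from the package already used for the k-NN distance plot, on the built-in `iris` measurements as a stand-in. The names `demo.norm`, `db.demo`, and `noise_count` are hypothetical, and `eps = 0.5` was picked for the stand-in data, not for the wine data. The layout of the table above resembles the print method of `fpc::dbscan`, so the original may have used that package instead.

```r
library(dbscan)

# Stand-in for the scaled wine data (assumption: replace with wines.norm)
demo.norm <- scale(iris[, 1:4])

# Fit DBSCAN; eps is chosen for the stand-in data only
db.demo <- dbscan(demo.norm, eps = 0.5, minPts = 5)

# Cluster label 0 marks noise points; other labels are clusters
table(db.demo$cluster)

# Number of points labeled as noise
noise_count <- sum(db.demo$cluster == 0)
noise_count
```

In the wine output above, the column labeled 0 plays the same role: it holds the points that were not assigned to any cluster.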
fviz_cluster(DBscan.wine,
data = wines.norm,
geom = "point",
stand = FALSE,
ellipse = FALSE,  # recent factoextra versions use `ellipse`; `frame` is deprecated
pointsize = 2,
palette = c("red", "blue", "green", "gold"),
ggtheme = theme_bw()) +
ggtitle("DBSCAN Clustering of Wine Dataset")
## [1] 85
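The parameters epsilon (ε) and minPts both influence the clustering results. With eps = 2 and minPts = 5, 85 of the 178 points were labeled as noise.
Epsilon - A higher epsilon enlarges each point's neighborhood, so more points become core points and fewer are labeled as noise.
MinPts - A higher minPts requires denser regions to form a cluster, so more points are labeled as noise.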
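Both effects can be checked directly by counting core points, that is, points with at least minPts observations (counting the point itself, as in the usual DBSCAN definition) within distance epsilon. The helper `count_core` below is a hypothetical base R function written for illustration, run on the built-in `iris` measurements as a stand-in for the wine data.

```r
# Count DBSCAN core points for a given eps and minPts (illustrative helper)
count_core <- function(x, eps, minPts) {
  d <- as.matrix(dist(x))          # pairwise Euclidean distances
  neighbours <- rowSums(d <= eps)  # neighborhood size, including the point itself
  sum(neighbours >= minPts)        # points dense enough to be core points
}

# Stand-in for the scaled wine data (assumption: replace with wines.norm)
demo.norm <- scale(iris[, 1:4])

base_core   <- count_core(demo.norm, eps = 0.5, minPts = 5)
bigger_eps  <- count_core(demo.norm, eps = 1.0, minPts = 5)
bigger_minp <- count_core(demo.norm, eps = 0.5, minPts = 10)

# Raising eps can only add core points; raising minPts can only remove them
c(base = base_core, bigger_eps = bigger_eps, bigger_minPts = bigger_minp)
```

Fewer core points means fewer dense regions to attach to, which is why a small epsilon or a large minPts tends to push more observations into the noise group.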
Pros
Cons
Pros
Cons