R / RStudio

Clustering Wines with K-Means and DBSCAN

Coursework · Solo · 2025

R · K-Means · DBSCAN · Clustering · Unsupervised Learning · UCI Wine Dataset

Question

Can clustering tell us anything meaningful about wine attributes, and how do K-Means and DBSCAN compare on the same dataset?

Hypothesis

I expected K-Means to give cleaner clusters since the wine dataset doesn't have a ton of noise, but I wanted to see how DBSCAN would handle it as a comparison.

Approach

Started by normalizing the wine data so that features measured on larger scales would not dominate the distance calculations. Used both the elbow method and the silhouette method to choose the number of clusters, and both pointed to k=3. Ran K-Means with k=3, then ran DBSCAN on the same data to compare how each algorithm handled the structure.
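A minimal sketch of that K-Means workflow in R. It runs on simulated stand-in data, not the wine data, and the variable names, the k range, and `nstart = 25` are assumptions for illustration rather than the code from the project.

```r
library(cluster)  # for silhouette()

# Simulated stand-in data (NOT the UCI wine data): three blobs, four features.
set.seed(42)
centers <- rbind(c(0, 0, 0, 0), c(4, 4, 0, 0), c(0, 4, 4, 4))
raw <- do.call(rbind, lapply(1:3, function(i) {
  matrix(rnorm(60 * 4, sd = 0.7), ncol = 4) +
    matrix(centers[i, ], nrow = 60, ncol = 4, byrow = TRUE)
}))
raw[, 4] <- raw[, 4] * 100  # put one feature on a much larger scale

# Standardize each column (mean 0, sd 1) so no feature dominates the distances.
x <- scale(raw)

# Elbow method: total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# Silhouette method: average silhouette width for k = 2..6.
d <- dist(x)
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
best_k <- (2:6)[which.max(avg_sil)]

# Final fit with the k that the silhouette method picked.
fit <- kmeans(x, centers = best_k, nstart = 25)
print(table(fit$cluster))
```

Plotting `wss` against k gives the elbow curve, and `avg_sil` gives the silhouette curve. On real data the two do not always agree, so it helps to check both.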

Result

The three K-Means clusters lined up with recognizable wine profiles. One looked like dark wines with high color intensity and low flavanoids, one looked like high-quality wines with high alcohol and phenols, and one looked like simple white wines. DBSCAN gave a different view by flagging noise points instead of forcing every wine into a cluster. The biggest takeaway was that K-Means is fast and clean but assumes roughly spherical clusters, while DBSCAN handles noise better but is harder to tune because its two parameters (eps and minPts) have to be set together.
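The noise-handling difference can be sketched like this. The data is simulated, not the wine data, and the DBSCAN step assumes the `dbscan` package is installed. The `eps` and `minPts` values are illustrative guesses, not tuned settings from the project.

```r
# Simulated stand-in data (NOT the wine data): two dense blobs plus scattered strays.
set.seed(1)
blob1 <- matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2)
blob2 <- matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2)
stray <- matrix(runif(20, min = -6, max = 9), ncol = 2)
x <- scale(rbind(blob1, blob2, stray))

# K-Means must assign every row, strays included, to one of the k clusters.
km <- kmeans(x, centers = 2, nstart = 25)
print(table(km$cluster))

# DBSCAN step (assumes the 'dbscan' package); eps and minPts are guesses.
if (requireNamespace("dbscan", quietly = TRUE)) {
  db <- dbscan::dbscan(x, eps = 0.3, minPts = 5)
  # Cluster label 0 marks the points DBSCAN treats as noise.
  print(table(db$cluster))
}
```

A common way to pick `eps` is to plot the sorted k-nearest-neighbor distances and look for the knee, which is part of why DBSCAN takes more tuning than choosing a single k.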

Hierarchical Clustering with AGNES and DIANA

Coursework · Solo · 2025

R · AGNES · DIANA · Hierarchical Clustering · Unsupervised Learning · UCI Wine Dataset

Question

Which hierarchical clustering method gives the cleanest structure on the wine dataset, and which linkage strategy works best?

Hypothesis

I figured one of the AGNES linkage methods would come out on top since they each handle distance differently, but I wanted to see how the divisive method DIANA would stack up too.

Approach

Scaled the wine data first so the distances would be comparable across features. Ran AGNES with four linkage methods (single, complete, average, and Ward) and compared their agglomerative coefficients. Then ran DIANA on the same data to see how a divisive approach would handle it. Cut both trees at k=3 to compare cluster sizes side by side.
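A rough sketch of that comparison using the `cluster` package. The data is simulated, not the wine data, and the object names are made up for illustration.

```r
library(cluster)  # agnes() and diana()

# Simulated stand-in data (NOT the wine data): three blobs, three features.
set.seed(7)
raw <- rbind(
  matrix(rnorm(150, mean = 0), ncol = 3),
  matrix(rnorm(150, mean = 4), ncol = 3),
  matrix(rnorm(150, mean = 8), ncol = 3)
)
x <- scale(raw)

# AGNES with four linkage methods; $ac is the agglomerative coefficient.
linkages <- c("single", "complete", "average", "ward")
ac <- sapply(linkages, function(m) agnes(x, method = m)$ac)
print(round(ac, 3))

# DIANA (divisive); $dc is the divisive coefficient.
dia <- diana(x)
print(round(dia$dc, 3))

# Cut both trees into three clusters and compare the sizes side by side.
ag_ward <- agnes(x, method = "ward")
ward_sizes  <- table(cutree(as.hclust(ag_ward), k = 3))
diana_sizes <- table(cutree(as.hclust(dia), k = 3))
print(ward_sizes)
print(diana_sizes)
```

Both coefficients fall between 0 and 1, with values closer to 1 suggesting stronger clustering structure. They tend to grow with the number of observations, so they are best compared on the same dataset.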

Result

AGNES with Ward linkage had the highest agglomerative coefficient, meaning it found the strongest clustering structure on this dataset. It also gave more balanced clusters (69, 58, and 51 wines) than DIANA (91, 38, and 49). I don't think K-Means or DBSCAN would be a better fit here either, since K-Means is sensitive to outliers and DBSCAN is built for finding dense regions rather than a hierarchy. For this dataset, AGNES with Ward linkage was the right call.

Predicting Heart Disease with Naive Bayes

Coursework · Solo · 2025

R · Naive Bayes · Classification · Healthcare · UCI Heart Disease Dataset

Question

Can a Naive Bayes model predict heart disease using blood pressure and chest pain type?

Hypothesis

I figured the model would find a real pattern since blood pressure and chest pain are both well-known risk factors for heart disease.

Approach

I worked through Bayes' theorem by hand first to make sure I understood the math, calculating the conditional probabilities for every combination of blood pressure and chest pain type. Then I built the same model in R using the naiveBayes function on a 60/40 train/validation split of the UCI Heart Disease dataset (303 patients).
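The by-hand step can be sketched in base R. The eight records below are hypothetical toy rows, not the UCI data, and the category labels and function name are made up for illustration.

```r
# Hypothetical toy records (NOT the UCI data), just to show the arithmetic.
toy <- data.frame(
  bp      = c("high", "high", "normal", "normal", "high", "normal", "high", "normal"),
  cp      = c("typical", "atypical", "typical", "atypical",
              "typical", "typical", "atypical", "atypical"),
  disease = c("Yes", "Yes", "No", "No", "Yes", "No", "No", "No")
)

# Prior P(disease) and per-class conditionals P(bp | disease), P(cp | disease).
prior <- prop.table(table(toy$disease))
p_bp  <- prop.table(table(toy$disease, toy$bp), margin = 1)
p_cp  <- prop.table(table(toy$disease, toy$cp), margin = 1)

# Naive Bayes score for one patient: prior * P(bp | class) * P(cp | class),
# normalized so the class scores sum to 1.
posterior <- function(bp, cp) {
  score <- as.numeric(prior) * p_bp[, bp] * p_cp[, cp]
  score / sum(score)
}

print(round(posterior("high", "typical"), 3))
```

For the toy rows, the "Yes" score is 3/8 × 1 × 2/3 = 0.25 and the "No" score is 5/8 × 1/5 × 2/5 = 0.05, so the posterior for "Yes" is 0.25 / 0.30, or about 0.833. A combination that never appears in one class gets a conditional probability of zero, which is the case Laplace smoothing is meant to handle.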

Result

The model hit 78.7% accuracy on validation, which sounded good at first. But the confusion matrix showed that the model was actually predicting "No" for every single patient. The 78.7% was just because 78.7% of the data was already "No." Specificity was 0% (with "No" treated as the positive class), meaning the model did not catch a single "Yes" case. That taught me accuracy on its own is a bad metric when the classes are imbalanced. With more time I would try oversampling the minority class or weighting the model so it actually learns to spot the "Yes" cases.
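The accuracy trap is easy to reproduce. The counts below are illustrative only: 96 "No" and 26 "Yes" were chosen to give a 78.7% "No" share and are not the project's actual validation set.

```r
# Illustrative counts only (chosen to give a 78.7% "No" share), not project data.
actual    <- factor(c(rep("No", 96), rep("Yes", 26)), levels = c("No", "Yes"))
# A model that predicts "No" for everyone.
predicted <- factor(rep("No", length(actual)), levels = c("No", "Yes"))

cm <- table(Predicted = predicted, Actual = actual)
print(cm)

accuracy <- sum(diag(cm)) / sum(cm)

# Treating "Yes" (disease) as the positive class:
sensitivity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])  # share of real "Yes" cases caught
specificity <- cm["No", "No"] / sum(cm[, "No"])     # share of real "No" cases caught
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced), 3))
```

With "Yes" as the positive class, the always-"No" model scores 0.787 accuracy but 0 sensitivity and 0.5 balanced accuracy. A tool that treats "No" as the positive class swaps the two labels and reports the same failure as 0% specificity.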