library(tidyverse)

In this lab, we discuss two simple machine learning algorithms, one for unsupervised learning and one for supervised learning: k-means clustering and k-nearest neighbors. Both rely on a similarity (or distance) metric, such as Euclidean distance, so we first discuss similarity.

Similarity

There are various metrics for measuring the similarity between two objects; each is a real-valued function of a pair of vectors.

Different similarity measures

Euclidean distance

One of the most commonly used measures is Euclidean distance: given two vectors \(X=(x_1, x_2, ..., x_p)\) and \(Y=(y_1, y_2, ..., y_p)\), the Euclidean distance between \(X\) and \(Y\) is \[||X-Y||_2=\sqrt{(x_1-y_1)^2+(x_2-y_2)^2+...+(x_p-y_p)^2}\] The notation \(||\cdot||_2\) is called the \(L_2\) norm.

In R, this distance is easy to compute with the function dist(). For example,

x<- c(2, -2, 3, 5)
y<- c(1, 3, 0, 2)
z<- c(0, -1, 2, 2)
dist(rbind(x, y, z))
##          x        y
## y 6.633250         
## z 3.872983 4.582576

Note that the result is not a matrix object; it is a "dist" object that only lists the distances between all pairs. To get a matrix, convert it with as.matrix().

as.matrix(dist(rbind(x, y, z)))
##          x        y        z
## x 0.000000 6.633250 3.872983
## y 6.633250 0.000000 4.582576
## z 3.872983 4.582576 0.000000
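As a quick sanity check, the x-y entry can be reproduced directly from the formula with base R:

# Euclidean distance between x and y computed from the definition
sqrt(sum((x - y)^2))   # 6.63325, matching the dist() output above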

Manhattan distance

Manhattan distance is defined as \[||X-Y||_1=|x_1-y_1|+|x_2-y_2|+...+|x_p-y_p|\] The notation \(||\cdot||_1\) is called the \(L_1\) norm.

as.matrix(dist(rbind(x, y, z), method = "manhattan"))
##    x  y z
## x  0 12 7
## y 12  0 7
## z  7  7 0
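Again, the x-y entry agrees with the definition:

# Manhattan distance between x and y computed from the definition
sum(abs(x - y))   # 12, matching the table above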

Cosine similarity

Cosine similarity is the cosine of the angle between two vectors, defined as \[\cos(\theta)=\frac{X\cdot Y}{||X||_2\times ||Y||_2}\]

library(proxy)
as.matrix(dist(rbind(x, y, z), method = "cosine"))
##           x         y         z
## x 0.0000000 0.7525642 0.0741799
## y 0.7525642 0.0000000 0.9109129
## z 0.0741799 0.9109129 0.0000000
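Note that dist() from the proxy package reports the cosine distance, \(1-\cos(\theta)\), rather than the cosine itself. We can verify this from the definition with base R:

# cosine similarity between x and y
cos_xy <- sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
cos_xy        # about 0.247
1 - cos_xy    # about 0.753, matching the x-y entry above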

Other similarity measures

There are many other similarity measures, including correlation, Jaccard, and binary distance. Different applications call for different measures; for example, this article discusses Euclidean distance vs. cosine similarity.

Exercise

  1. How would you compute these similarity measures between two vectors without using dist()?
  2. How would you compute them between three vectors without using dist()?
  3. How would you compute them between n vectors without using dist()? Try to write a function that behaves just like dist().



K-means clustering

We use the seeds data set to demonstrate clustering analysis in R. The data comprise kernels from three different varieties of wheat: Kama, Rosa and Canadian, with 70 kernels of each variety. A description of the data set is available at https://archive.ics.uci.edu/ml/datasets/seeds.

seed = read.table('http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt', header=F)
seed = seed[,1:7]
colnames(seed) = c("area", "perimeter","campactness", "length", "width", "asymmetry", "groovelength")

Scale data to have zero mean and unit variance for each column

This is especially important when the variables are on different scales: variables with large magnitudes may dominate the others when calculating the Euclidean distance.

seed <- scale(seed) 
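A quick check confirms that each column now has (approximately) zero mean and unit standard deviation:

# column means are essentially 0 and standard deviations are 1 after scaling
round(colMeans(seed), 10)
apply(seed, 2, sd)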

Use the k-means method for clustering and plot the results.

# K-Means Cluster Analysis
fit <- kmeans(seed, 5) 

Plot the clusters found by k-means

install.packages("fpc")
library(fpc)
plotcluster(seed, fit$cluster)
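plotcluster() displays the clusters in a low-dimensional projection of the 7-dimensional data. If you just want a rough base-graphics view, you can plot any two of the scaled variables coloured by cluster label (a minimal sketch that ignores the other dimensions):

# scatter plot of two scaled variables, coloured by cluster assignment
plot(seed[, c("area", "perimeter")], col = fit$cluster)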

We can get the number of observations in each cluster

# Show the size of each cluster
fit$size
## [1] 35 49 28 48 50
# equivalently
table(fit$cluster)
## 
##  1  2  3  4  5 
## 35 49 28 48 50

We can also get the center of each group

fit$centers
##          area  perimeter campactness     length      width  asymmetry
## 1 -0.85088154 -0.8953466  -0.1856645 -0.9269660 -0.7670320 -0.3280779
## 2 -1.07640452 -1.0286446  -1.2199685 -0.8713974 -1.1839002  1.0631272
## 3  0.54533533  0.6114829   0.1726661  0.6120589  0.5041665  0.1721008
## 4  1.47902488  1.4624359   0.6809737  1.4452124  1.3616973 -0.1594780
## 5 -0.07475817 -0.1115546   0.5751064 -0.2273113  0.1075819 -0.7554877
##   groovelength
## 1   -0.9069100
## 2   -0.5327303
## 3    0.6627207
## 4    1.4632901
## 5   -0.6189695

We can extract all observations that are in the 1st cluster

seed[fit$cluster==1,]
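As a check, the column means of this subset reproduce the first row of fit$centers, since a k-means center is simply the mean of the observations assigned to that cluster:

# the mean of cluster 1's observations equals fit$centers[1, ]
colMeans(seed[fit$cluster == 1, ])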

Exercise

  1. Can you get the center of each group by using group_by() and summarise()?
  2. Can you get the sum of squares of each group?
  3. Can you get the sum of squares of the entire sample?

Total, within-group, and between-group sums of squared errors

fit$totss
## [1] 1463
fit$withinss
## [1] 52.68816 80.73984 36.34953 82.28720 74.55275
fit$tot.withinss
## [1] 326.6175
fit$betweenss
## [1] 1136.383

Exercise

  1. Can you tell the relationship between all these sums of squared errors?
  2. Run the lines above again (starting from kmeans()). Did you get the same result? Why?
  3. ** Write your own k-means algorithm based on Euclidean distance by using function dist().

Determine the number of clusters

Here a simple within-group sum of squares (wss) method is used: we calculate the total within-group sum of squares for different numbers of clusters.

wss <- rep(0, 12)
for (i in 1:12){
  wss[i] <- kmeans(seed, centers=i)$tot.withinss
} 
plot(1:12, wss, type="b", xlab="Number of Clusters",ylab="Total within-group SS")

A general rule of thumb is the elbow rule: choose k such that the decrease in wss from k to k+1 clusters is much smaller than the decrease from k-1 to k.
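One way to look at this numerically is to examine the successive decreases in the wss vector computed above:

# drop in total within-group SS when moving from k to k+1 clusters
-diff(wss)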

Further readings

The fpc package has a cluster.stats() function that can calculate other cluster validity measures, such as the average silhouette coefficient (between -1 and 1, the higher the better) or the Dunn index (between 0 and infinity, the higher the better).
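For example, reusing the k-means fit from the seeds example above, a minimal sketch (cluster.stats() takes a distance matrix and a vector of cluster labels; avg.silwidth and dunn are components of the returned list):

library(fpc)
cs <- cluster.stats(d = dist(seed), clustering = fit$cluster)
cs$avg.silwidth   # average silhouette width
cs$dunn           # Dunn index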

The package NbClust provides 30 indexes for determining the optimal number of clusters in a data set.

For more sophisticated methods, see, for example, this blog or these course notes.

This article on Cross Validated provides a great illustration of the situations when k-means would fail.



K-nearest neighbor (KNN)

We demonstrate this simple machine learning algorithm using the iris dataset, a famous dataset used in almost all machine learning courses.

Load and prepare the data

data("iris")
as_tibble(iris)
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ℹ 140 more rows
# standardize values for features
iris1 <- iris
iris1[,1:4]<- scale(iris[,1:4])

Randomly split the data into training (60%) and testing (40%) sets

# random row indices
index<- sample(nrow(iris1), 0.6*nrow(iris1))
# split data to training and testing
iris.train<- iris1[index,]
iris.test<- iris1[-index,]

Train the model

In R, the knn() function performs k-nearest neighbor classification. It is in the package "class".

install.packages("class")

The function knn() takes four main inputs: train (the X variables in the training sample), test (the X variables in the testing sample), cl (the Y variable in the training sample), and k (the number of neighbors). It returns a single object: the predictions for the testing sample.

library(class)
knn.iris<- knn(train = iris.train[, -5], test = iris.test[, -5], cl=iris.train[,5], k=5)
knn.iris
##  [1] setosa     setosa     setosa     setosa     setosa     setosa    
##  [7] setosa     setosa     setosa     setosa     setosa     setosa    
## [13] setosa     setosa     setosa     setosa     setosa     setosa    
## [19] setosa     setosa     setosa     versicolor versicolor versicolor
## [25] versicolor versicolor versicolor versicolor versicolor versicolor
## [31] versicolor versicolor versicolor virginica  versicolor versicolor
## [37] virginica  versicolor versicolor virginica  versicolor versicolor
## [43] versicolor versicolor virginica  virginica  virginica  virginica 
## [49] virginica  virginica  virginica  virginica  virginica  virginica 
## [55] virginica  versicolor virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica

There are two other optional arguments, which you can check by running ?knn in the console.

Prediction accuracy

table(iris.test[,5], knn.iris, dnn = c("True", "Predicted"))
##             Predicted
## True         setosa versicolor virginica
##   setosa         21          0         0
##   versicolor      0         20         3
##   virginica       0          1        15
sum(iris.test[,5] != knn.iris)
## [1] 4
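Dividing the number of misclassified observations by the size of the test set gives the misclassification rate; mean() does this in one step:

# proportion of test observations misclassified (4 out of 60 in the run above)
mean(iris.test[, 5] != knn.iris)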

Exercise

  1. Try different values of k and see how the error rate changes.
  2. Write a loop to conduct k-fold cross-validation for knn, then reinvestigate question 1.
  3. ** Write your own KNN algorithm based on Euclidean distance by using function dist().

Use caret package

The caret package (short for Classification And REgression Training) contains functions to streamline the model training process for complex regression and classification problems. In particular, it incorporates cross-validation and model tuning procedures.

install.packages("caret")
library(caret)
set.seed(123)
ctrl <- trainControl(method="cv", number=5) 
knn.fit <- train(Species ~ ., data = iris1, method = "knn", trControl = ctrl, tuneLength = 20)
knn.fit
## k-Nearest Neighbors 
## 
## 150 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 120, 120, 120, 120, 120 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa
##    5  0.9533333  0.93 
##    7  0.9533333  0.93 
##    9  0.9533333  0.93 
##   11  0.9666667  0.95 
##   13  0.9600000  0.94 
##   15  0.9466667  0.92 
##   17  0.9400000  0.91 
##   19  0.9400000  0.91 
##   21  0.9400000  0.91 
##   23  0.9333333  0.90 
##   25  0.9266667  0.89 
##   27  0.9266667  0.89 
##   29  0.9133333  0.87 
##   31  0.9133333  0.87 
##   33  0.8933333  0.84 
##   35  0.8933333  0.84 
##   37  0.8933333  0.84 
##   39  0.8933333  0.84 
##   41  0.8800000  0.82 
##   43  0.8600000  0.79 
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 11.

We can create a plot to show the impact of k.

plot(knn.fit$results$k, knn.fit$results$Accuracy, type = 'b', xlab = "k")
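caret also provides a plot method for train objects, so the same picture can be obtained directly:

# caret's built-in plot of accuracy against the tuning parameter k
plot(knn.fit)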
