library(tidyverse)
In this lab, we discuss two simple machine learning algorithms, one for unsupervised learning and one for supervised learning: k-means clustering and k-nearest neighbor. Both are based on a similarity metric, such as Euclidean distance, so we first discuss similarity measures.
There are different metrics to measure similarity between two objects. They are real-valued functions of a pair of vectors.
One of the most commonly used measures is Euclidean distance. That is, given two vectors \(X=(x_1, x_2, ..., x_p)\), \(Y=(y_1, y_2, ..., y_p)\), the Euclidean distance between \(X\) and \(Y\) is \[||X-Y||_2=\sqrt{(x_1-y_1)^2+(x_2-y_2)^2+...+(x_p-y_p)^2}\] The notation \(||\cdot||_2\) is also called \(L_2\) norm.
In R, it is easy to compute this distance using the function dist(). For example,
x<- c(2, -2, 3, 5)
y<- c(1, 3, 0, 2)
z<- c(0, -1, 2, 2)
dist(rbind(x, y, z))
## x y
## y 6.633250
## z 3.872983 4.582576
Note that the result is not a matrix object; it is a "dist" object that only stores the lower triangle of pairwise distances. To get a matrix object, convert it with as.matrix().
as.matrix(dist(rbind(x, y, z)))
## x y z
## x 0.000000 6.633250 3.872983
## y 6.633250 0.000000 4.582576
## z 3.872983 4.582576 0.000000
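As a sanity check, we can compute the Euclidean distance between x and y directly from the formula:
# should match the (x, y) entry above, 6.633250
sqrt(sum((x - y)^2))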
Manhattan distance is defined as \[||X-Y||_1=|x_1-y_1|+|x_2-y_2|+...+|x_p-y_p|\] The notation \(||\cdot||_1\) is also called \(L_1\) norm.
as.matrix(dist(rbind(x, y, z), method = "manhattan"))
## x y z
## x 0 12 7
## y 12 0 7
## z 7 7 0
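Again, this can be verified directly from the definition:
# Manhattan distance between x and y; should match the value 12 above
sum(abs(x - y))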
Another commonly used measure is the cosine of the angle between two vectors, defined as \[\cos(\theta)=\frac{X\cdot Y}{||X||_2\times ||Y||_2}\] Loading the proxy package makes method = "cosine" available in dist(). Note from the output below (zeros on the diagonal) that the value reported is the cosine distance, \(1-\cos(\theta)\), rather than the cosine similarity itself.
library(proxy)
as.matrix(dist(rbind(x, y, z), method = "cosine"))
## x y z
## x 0.0000000 0.7525642 0.0741799
## y 0.7525642 0.0000000 0.9109129
## z 0.0741799 0.9109129 0.0000000
There are many other similarity measures, including correlation, Jaccard, and binary measures. Different applications call for different measures; a common point of discussion, for example, is Euclidean distance vs. cosine similarity.
Exercise: Run ?dist in the console to see the other options of dist(). Try to write a function just like dist().
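A minimal sketch of such a function, handling only the Euclidean case (the name my_dist is made up for illustration):
# my_dist: return the full matrix of pairwise Euclidean distances
# between the rows of a numeric matrix (a simplified dist())
my_dist <- function(mat) {
  n <- nrow(mat)
  d <- matrix(0, n, n, dimnames = list(rownames(mat), rownames(mat)))
  for (i in 1:n) {
    for (j in 1:n) {
      d[i, j] <- sqrt(sum((mat[i, ] - mat[j, ])^2))
    }
  }
  d
}
my_dist(rbind(x, y, z))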
We use the seeds data set to demonstrate cluster analysis in R. The data comprise kernels belonging to three different varieties of wheat (Kama, Rosa, and Canadian), with 70 elements each. A description of the data set can be viewed at https://archive.ics.uci.edu/ml/datasets/seeds.
seed = read.table('http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt', header=F)
seed = seed[,1:7]
colnames(seed) = c("area", "perimeter","campactness", "length", "width", "asymmetry", "groovelength")
Before clustering, we standardize the variables. This is especially important when the variables are on different scales, because variables with large magnitudes can dominate the others when calculating Euclidean distances.
seed <- scale(seed)
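As a quick check (a small sketch), every column of the scaled data should now have standard deviation 1:
# after scale(), each column has mean 0 and standard deviation 1
apply(seed, 2, sd)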
# K-Means Cluster Analysis
fit <- kmeans(seed, 5)
install.packages("fpc")
library(fpc)
plotcluster(seed, fit$cluster)
We can get the number of observations in each cluster
# Show the size of each cluster
fit$size
## [1] 35 49 28 48 50
# equivalently
table(fit$cluster)
##
## 1 2 3 4 5
## 35 49 28 48 50
We can also get the center of each group
fit$centers
## area perimeter campactness length width asymmetry
## 1 -0.85088154 -0.8953466 -0.1856645 -0.9269660 -0.7670320 -0.3280779
## 2 -1.07640452 -1.0286446 -1.2199685 -0.8713974 -1.1839002 1.0631272
## 3 0.54533533 0.6114829 0.1726661 0.6120589 0.5041665 0.1721008
## 4 1.47902488 1.4624359 0.6809737 1.4452124 1.3616973 -0.1594780
## 5 -0.07475817 -0.1115546 0.5751064 -0.2273113 0.1075819 -0.7554877
## groovelength
## 1 -0.9069100
## 2 -0.5327303
## 3 0.6627207
## 4 1.4632901
## 5 -0.6189695
We can extract all observations that are in the 1st cluster:
seed[fit$cluster==1,]
Exercise: Can you compute the mean of each variable within each cluster using group_by() and summarise()?
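One possible sketch using dplyr (assuming dplyr 1.0+ for across()); this should roughly reproduce fit$centers:
# attach the cluster labels to the scaled data, then average within clusters
as.data.frame(seed) %>%
  mutate(cluster = fit$cluster) %>%
  group_by(cluster) %>%
  summarise(across(everything(), mean))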
The kmeans object also reports a sum-of-squares decomposition: the total sum of squares equals the between-cluster sum of squares plus the total within-cluster sum of squares.
fit$totss
## [1] 1463
fit$withinss
## [1] 52.68816 80.73984 36.34953 82.28720 74.55275
fit$tot.withinss
## [1] 326.6175
fit$betweenss
## [1] 1136.383
Exercise: Rerun kmeans() on the same data (without fixing the random seed). Did you get the same result? Why?
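A quick way to see this (a sketch; because kmeans() starts from randomly chosen centers, the two totals below may differ):
# two independent runs of k-means can converge to different local optima
kmeans(seed, 5)$tot.withinss
kmeans(seed, 5)$tot.withinss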
Here a simple within-group sum of squares (wss) method is used to choose the number of clusters. We can calculate the total wss for different numbers of clusters.
wss <- rep(0, 12)
for (i in 1:12){
wss[i] <- kmeans(seed, centers=i)$tot.withinss
}
plot(1:12, wss, type="b", xlab="Number of Clusters",ylab="Total within-group SS")
A general rule of thumb is the elbow rule: choose k such that the decrease in wss from k to k+1 is much smaller than the decrease from k-1 to k.
The fpc package has a cluster.stats() function that can calculate other cluster validity measures, such as the average silhouette coefficient (between -1 and 1, the higher the better) or the Dunn index (between 0 and infinity, the higher the better):
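For example, a sketch of how these measures could be obtained for the 5-cluster solution above (assuming fpc is loaded as before; avg.silwidth and dunn are components of the returned list):
# validity measures computed from the distance matrix and the cluster labels
cs <- cluster.stats(dist(seed), fit$cluster)
cs$avg.silwidth   # average silhouette width
cs$dunn           # Dunn index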
The NbClust package provides 30 indexes for determining the optimal number of clusters in a data set.
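A sketch of how it might be used here (NbClust() prints a summary of how many indexes support each candidate k; check ?NbClust before relying on the arguments below):
# let many validity indexes vote on the number of clusters
library(NbClust)
nb <- NbClust(seed, min.nc = 2, max.nc = 10, method = "kmeans")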
For more sophisticated methods, see, for example, the linked blog post or course notes. There is also a Cross Validated discussion that provides a great illustration of the situations in which k-means fails.
We demonstrate this simple supervised learning algorithm, k-nearest neighbor, using the iris data set, a famous data set used in almost all machine learning courses.
data("iris")
as_tibble(iris)
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ℹ 140 more rows
# standardize values for features
iris1 <- iris
iris1[,1:4]<- scale(iris[,1:4])
Randomly split the data into a training set (60%) and a testing set (40%).
# random row indices
index<- sample(nrow(iris1), 0.6*nrow(iris1))
# split data to training and testing
iris.train<- iris1[index,]
iris.test<- iris1[-index,]
In R, the knn() function is designed to perform k-nearest neighbor classification. It is in the package "class".
install.packages("class")
The function knn() takes four main inputs: train= (the X variables in the training sample), test= (the X variables in the testing sample), cl= (the Y variable in the training sample), and k= (the number of neighbors). There are a few other optional inputs, which you can check by running ?knn in the console. The function outputs a single object, which is the prediction for the testing sample.
library(class)
knn.iris<- knn(train = iris.train[, -5], test = iris.test[, -5], cl=iris.train[,5], k=5)
knn.iris
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] setosa setosa setosa setosa setosa setosa
## [19] setosa setosa setosa versicolor versicolor versicolor
## [25] versicolor versicolor versicolor versicolor versicolor versicolor
## [31] versicolor versicolor versicolor virginica versicolor versicolor
## [37] virginica versicolor versicolor virginica versicolor versicolor
## [43] versicolor versicolor virginica virginica virginica virginica
## [49] virginica virginica virginica virginica virginica virginica
## [55] virginica versicolor virginica virginica virginica virginica
## Levels: setosa versicolor virginica
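# confusion matrix: rows are the true species, columns are the kNN predictions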
table(iris.test[,5], knn.iris, dnn = c("True", "Predicted"))
## Predicted
## True setosa versicolor virginica
## setosa 21 0 0
## versicolor 0 20 3
## virginica 0 1 15
sum(iris.test[,5] != knn.iris)
## [1] 4
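Dividing by the number of test observations gives the misclassification rate (a one-line sketch):
# proportion of test observations that were misclassified
mean(iris.test[,5] != knn.iris)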
The caret package (short for Classification And REgression Training) contains functions to streamline the model training process for complex regression and classification problems. In general, it incorporates cross-validation and model tuning procedures.
install.packages("caret")
library(caret)
set.seed(123)
ctrl <- trainControl(method="cv", number=5)
knn.fit <- train(Species ~ ., data = iris1, method = "knn", trControl = ctrl, tuneLength = 20)
knn.fit
## k-Nearest Neighbors
##
## 150 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 120, 120, 120, 120, 120
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9533333 0.93
## 7 0.9533333 0.93
## 9 0.9533333 0.93
## 11 0.9666667 0.95
## 13 0.9600000 0.94
## 15 0.9466667 0.92
## 17 0.9400000 0.91
## 19 0.9400000 0.91
## 21 0.9400000 0.91
## 23 0.9333333 0.90
## 25 0.9266667 0.89
## 27 0.9266667 0.89
## 29 0.9133333 0.87
## 31 0.9133333 0.87
## 33 0.8933333 0.84
## 35 0.8933333 0.84
## 37 0.8933333 0.84
## 39 0.8933333 0.84
## 41 0.8800000 0.82
## 43 0.8600000 0.79
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 11.
We can create a plot to show the impact of k on the cross-validated accuracy.
plot(knn.fit$results$k, knn.fit$results$Accuracy, type = "b", xlab = "k", ylab = "Accuracy")
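Finally, the tuned model can be used for prediction with caret's predict() method. A short sketch (here we simply predict back on iris1, since no separate test set was held out for the caret fit):
# predictions from the k = 11 model selected by cross-validation
pred <- predict(knn.fit, newdata = iris1)
table(iris1$Species, pred)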