Set the path

1 2	setwd("D:/git-ware/machine-learning-R-examples/01_KNN") getwd()

Get the data from the Internet and do the exploration

Download the data from Internet

Data: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data

Description: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names

# download the data from the Internet
library(RCurl)
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"
name_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names"
download.file(data_url, "wdbc.data", method="libcurl")
download.file(name_url, "wdbc.names", method = "libcurl")

# loading dataset
wdbc = read.csv("wdbc.data", header = FALSE)

# set the column names  
wdbc.names=c("Radius","Texture","Perimeter","Area","Smoothness","Compactness","Concavity","Concave points","Symmetry","Fractal dimension")
wdbc.names=c(wdbc.names,paste(wdbc.names,"_mean",sep=""),paste(wdbc.names,"_worst",sep=""))

names(wdbc)=c("id","diagnosis",wdbc.names)
str(wdbc)
dim(wdbc)

So from the output above, we can see that the dataset includes 569 samples and 32 features.
From the description file, we know that the first feature(column) is id, and the second feature(column) is the diagnosis, it has two labels: “B” is benign cancer, and the “M” means malignant cancer. And the rest are the features which include the means, standard error and maximum, etc.
We explore the ratio of two labels
1
table(wdbc$diagnosis)
We see that the dataset has 357 benign cancer and 212 malignant cancer.

Since the id column has no significance to the model, so we delete the it

1
2
3

wdbc$diagnosis=factor(wdbc$diagnosis,levels=c("B","M"),labels=c("Benign","Malignant"))
round(prop.table(table(wdbc$diagnosis)) * 100, digits = 1)
wdbc=wdbc[-1]

1	summary(wdbc[c("Radius_mean", "Area_mean", "Smoothness_mean")])

We can see that the scale between features are far too much, so we have to do the normalization.

Here is the function of what define.

1
2
3

normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

1 2	wdbc_n <- as.data.frame(lapply(wdbc[2:31], normalize)) summary(wdbc_n[c("Radius_mean", "Area_mean", "Smoothness_mean")])

After the normalization, we can see the scale of the features are on the same level.

Split the data and build the model

Split the data into two parts: training data and testing data

divide the data into two parts with the interval directly

wdbc_train = wdbc_n[1:469, ]
wdbc_test = wdbc_n[470:569, ]
wdbc_train_label = wdbc[1:469, 1]
wdbc_test_label = wdbc[470:569, 1]
mal_rate=table(wdbc_train_label)
round(mal_rate[2]/sum(mal_rate), digits = 2)

There is a problem that if the first 469 cloumn includes most of the benign cancer data, that is not good for the partition. So we have to sample the data.

set.seed(2017)
inTrain = sample(1:dim(wdbc_n)[1],469,replace=F)
wdbc_train=wdbc_n[inTrain,]
wdbc_test=wdbc_n[-inTrain,]
wdbc_train_label=wdbc[inTrain,1]
wdbc_test_label=wdbc[-inTrain,1]
mal_rate=table(wdbc_train_label)
round(mal_rate[2]/sum(mal_rate), digits = 2)

otherwise, we can use the createDataPartition function of the caret package to do the job.

require(caret)
set.seed(2017)
inTrain=createDataPartition(y=wdbc$diagnosis,p=0.8,list=FALSE)
wdbc_train=wdbc_n[inTrain,]
wdbc_test=wdbc_n[-inTrain,]
wdbc_train_label=wdbc[inTrain,1]
wdbc_test_label=wdbc[-inTrain,1]
mal_rate=table(wdbc_train_label)
round(mal_rate[2]/sum(mal_rate), digits = 2)

Build the KNN model

1 2	require(class) wdbc_test_pred <- knn(train = wdbc_train, test = wdbc_test,cl = wdbc_train_label, k=21)

K=21 here is based on the sqrt of the length(wdbc_train_label).
the parameters of the knn funciton
- train: training data
- test: testing data
- cl: the labels of training data

Model validation and improvement

Model validation

This is a classification problem, so we use the crossTable to validate the model

1 2	require(gmodels) CrossTable(x = wdbc_test_label, y = wdbc_test_pred,prop.chisq=FALSE)

From the cross table, we can get that: TN = 69,TP = 39, FN=3,FP=2
So
- accuracy = (TN+TP)/113=97.345%
- sensitivity=TP/(TP+FN)= 92.86%
- Specificity=TN/(TN+FP)= 100%
And the sensitivity is how much we predict the benign breast cancer correctly, and the specificity is how much we predict the malignant breast cancer correctly.

Model improvement

Since before we have used the max-min normalization, it maybe reduce the influence of the minimum. So we now try to use the z-score normalization

wdbc_z=as.data.frame(scale(wdbc[-1]))
summary(wdbc_z$Area_mean)
set.seed(2017)
inTrain=createDataPartition(y=wdbc$diagnosis,p=0.8,list=FALSE)
wdbc_train=wdbc_z[inTrain,]
wdbc_test=wdbc_z[-inTrain,]
wdbc_train_label=wdbc[inTrain,1]
wdbc_test_label=wdbc[-inTrain,1]
wdbc_test_pred = knn(train = wdbc_train, test = wdbc_test,cl = wdbc_train_label, k=21)
CrossTable(x = wdbc_test_label, y = wdbc_test_pred,prop.chisq=FALSE)

From the cross table, we can get: TN = 71,TP = 37, FN=5,FP=0
so we can calculate:
- accuracy = (TN+TP)/113=96.46%
- sensitivity=TP/(TP+FN)= 90.47%
- Specificity=TN/(TN+FP)= 100%
But it’s out of our expect, it gets worse now.
So we can try to tune the k value too.