Set the path
1 | setwd("D:/git-ware/machine-learning-R-examples/01_KNN") |
Get the data from the Internet and do the exploration
Download the data from Internet
Data: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data
Description: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17# download the data from the Internet
library(RCurl)
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"
name_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names"
download.file(data_url, "wdbc.data", method="libcurl")
download.file(name_url, "wdbc.names", method = "libcurl")
# loading dataset
wdbc = read.csv("wdbc.data", header = FALSE)
# set the column names
wdbc.names=c("Radius","Texture","Perimeter","Area","Smoothness","Compactness","Concavity","Concave points","Symmetry","Fractal dimension")
wdbc.names=c(wdbc.names,paste(wdbc.names,"_mean",sep=""),paste(wdbc.names,"_worst",sep=""))
names(wdbc)=c("id","diagnosis",wdbc.names)
str(wdbc)
dim(wdbc)So from the output above, we can see that the dataset includes 569 samples and 32 features.
From the description file, we know that the first feature(column) is id, and the second feature(column) is the diagnosis, it has two labels: “B” is benign cancer, and the “M” means malignant cancer. And the rest are the features which include the means, standard error and maximum, etc.
We explore the ratio of two labels
1
table(wdbc$diagnosis)
We see that the dataset has 357 benign cancer and 212 malignant cancer.
Since the id column has no significance to the model, so we delete the it
1
2
3wdbc$diagnosis=factor(wdbc$diagnosis,levels=c("B","M"),labels=c("Benign","Malignant"))
round(prop.table(table(wdbc$diagnosis)) * 100, digits = 1)
wdbc=wdbc[-1]1
summary(wdbc[c("Radius_mean", "Area_mean", "Smoothness_mean")])
We can see that the scale between features are far too much, so we have to do the normalization.
Here is the function of what define.
1
2
3normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
1 | wdbc_n <- as.data.frame(lapply(wdbc[2:31], normalize)) |
- After the normalization, we can see the scale of the features are on the same level.
Split the data and build the model
Split the data into two parts: training data and testing data
- divide the data into two parts with the interval directly
1
2
3
4
5
6wdbc_train = wdbc_n[1:469, ]
wdbc_test = wdbc_n[470:569, ]
wdbc_train_label = wdbc[1:469, 1]
wdbc_test_label = wdbc[470:569, 1]
mal_rate=table(wdbc_train_label)
round(mal_rate[2]/sum(mal_rate), digits = 2) - There is a problem that if the first 469 cloumn includes most of the benign cancer data, that is not good for the partition. So we have to sample the data.
1 | set.seed(2017) |
- otherwise, we can use the createDataPartition function of the caret package to do the job.
1
2
3
4
5
6
7
8
9require(caret)
set.seed(2017)
inTrain=createDataPartition(y=wdbc$diagnosis,p=0.8,list=FALSE)
wdbc_train=wdbc_n[inTrain,]
wdbc_test=wdbc_n[-inTrain,]
wdbc_train_label=wdbc[inTrain,1]
wdbc_test_label=wdbc[-inTrain,1]
mal_rate=table(wdbc_train_label)
round(mal_rate[2]/sum(mal_rate), digits = 2)
Build the KNN model
1 | require(class) |
- K=21 here is based on the sqrt of the length(wdbc_train_label).
- the parameters of the knn funciton
- train: training data
- test: testing data
- cl: the labels of training data
Model validation and improvement
Model validation
- This is a classification problem, so we use the crossTable to validate the model
1
2require(gmodels)
CrossTable(x = wdbc_test_label, y = wdbc_test_pred,prop.chisq=FALSE) - From the cross table, we can get that: TN = 69,TP = 39, FN=3,FP=2
- So
- accuracy = (TN+TP)/113=97.345%
- sensitivity=TP/(TP+FN)= 92.86%
- Specificity=TN/(TN+FP)= 100%
- And the sensitivity is how much we predict the benign breast cancer correctly, and the specificity is how much we predict the malignant breast cancer correctly.
Model improvement
- Since before we have used the max-min normalization, it maybe reduce the influence of the minimum. So we now try to use the z-score normalization
1
2
3
4
5
6
7
8
9
10wdbc_z=as.data.frame(scale(wdbc[-1]))
summary(wdbc_z$Area_mean)
set.seed(2017)
inTrain=createDataPartition(y=wdbc$diagnosis,p=0.8,list=FALSE)
wdbc_train=wdbc_z[inTrain,]
wdbc_test=wdbc_z[-inTrain,]
wdbc_train_label=wdbc[inTrain,1]
wdbc_test_label=wdbc[-inTrain,1]
wdbc_test_pred = knn(train = wdbc_train, test = wdbc_test,cl = wdbc_train_label, k=21)
CrossTable(x = wdbc_test_label, y = wdbc_test_pred,prop.chisq=FALSE) - From the cross table, we can get: TN = 71,TP = 37, FN=5,FP=0
- so we can calculate:
- accuracy = (TN+TP)/113=96.46%
- sensitivity=TP/(TP+FN)= 90.47%
- Specificity=TN/(TN+FP)= 100%
- But it’s out of our expect, it gets worse now.
- So we can try to tune the k value too.
- 本文标题:KNN Example 01 To predict the cancer of breast cancer
- 创建时间:2017-02-15 13:44:32
- 本文链接:2017/02/15/技能-修行-进步-R语言/KNN-example01-breast-cancer-predict/
- 版权声明:本博客所有文章除特别声明外,均采用 BY-NC-SA 许可协议。转载请注明出处!