KNN Example 01 To predict the cancer of breast cancer
Tim Chen(motion$) Lv5

Set the path

1
2
setwd("D:/git-ware/machine-learning-R-examples/01_KNN")
getwd()

Get the data from the Internet and do the exploration

Download the data from Internet

  • Data: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data

  • Description: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    # download the data from the Internet
    library(RCurl)
    data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"
    name_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names"
    download.file(data_url, "wdbc.data", method="libcurl")
    download.file(name_url, "wdbc.names", method = "libcurl")

    # loading dataset
    wdbc = read.csv("wdbc.data", header = FALSE)

    # set the column names
    wdbc.names=c("Radius","Texture","Perimeter","Area","Smoothness","Compactness","Concavity","Concave points","Symmetry","Fractal dimension")
    wdbc.names=c(wdbc.names,paste(wdbc.names,"_mean",sep=""),paste(wdbc.names,"_worst",sep=""))

    names(wdbc)=c("id","diagnosis",wdbc.names)
    str(wdbc)
    dim(wdbc)
  • So from the output above, we can see that the dataset includes 569 samples and 32 features.

  • From the description file, we know that the first feature(column) is id, and the second feature(column) is the diagnosis, it has two labels: “B” is benign cancer, and the “M” means malignant cancer. And the rest are the features which include the means, standard error and maximum, etc.

  • We explore the ratio of two labels

    1
    table(wdbc$diagnosis)
  • We see that the dataset has 357 benign cancer and 212 malignant cancer.

  • Since the id column has no significance to the model, so we delete the it

    1
    2
    3
    wdbc$diagnosis=factor(wdbc$diagnosis,levels=c("B","M"),labels=c("Benign","Malignant"))
    round(prop.table(table(wdbc$diagnosis)) * 100, digits = 1)
    wdbc=wdbc[-1]
    1
    summary(wdbc[c("Radius_mean", "Area_mean", "Smoothness_mean")])
  • We can see that the scale between features are far too much, so we have to do the normalization.

  • Here is the function of what define.

    1
    2
    3
    normalize <- function(x) {
    return ((x - min(x)) / (max(x) - min(x)))
    }
1
2
wdbc_n <- as.data.frame(lapply(wdbc[2:31], normalize))
summary(wdbc_n[c("Radius_mean", "Area_mean", "Smoothness_mean")])
  • After the normalization, we can see the scale of the features are on the same level.

Split the data and build the model

Split the data into two parts: training data and testing data

  • divide the data into two parts with the interval directly
    1
    2
    3
    4
    5
    6
    wdbc_train = wdbc_n[1:469, ]
    wdbc_test = wdbc_n[470:569, ]
    wdbc_train_label = wdbc[1:469, 1]
    wdbc_test_label = wdbc[470:569, 1]
    mal_rate=table(wdbc_train_label)
    round(mal_rate[2]/sum(mal_rate), digits = 2)
  • There is a problem that if the first 469 cloumn includes most of the benign cancer data, that is not good for the partition. So we have to sample the data.
1
2
3
4
5
6
7
8
set.seed(2017)
inTrain = sample(1:dim(wdbc_n)[1],469,replace=F)
wdbc_train=wdbc_n[inTrain,]
wdbc_test=wdbc_n[-inTrain,]
wdbc_train_label=wdbc[inTrain,1]
wdbc_test_label=wdbc[-inTrain,1]
mal_rate=table(wdbc_train_label)
round(mal_rate[2]/sum(mal_rate), digits = 2)
  • otherwise, we can use the createDataPartition function of the caret package to do the job.
    1
    2
    3
    4
    5
    6
    7
    8
    9
    require(caret)
    set.seed(2017)
    inTrain=createDataPartition(y=wdbc$diagnosis,p=0.8,list=FALSE)
    wdbc_train=wdbc_n[inTrain,]
    wdbc_test=wdbc_n[-inTrain,]
    wdbc_train_label=wdbc[inTrain,1]
    wdbc_test_label=wdbc[-inTrain,1]
    mal_rate=table(wdbc_train_label)
    round(mal_rate[2]/sum(mal_rate), digits = 2)

Build the KNN model

1
2
require(class)
wdbc_test_pred <- knn(train = wdbc_train, test = wdbc_test,cl = wdbc_train_label, k=21)
  • K=21 here is based on the sqrt of the length(wdbc_train_label).
  • the parameters of the knn funciton
    • train: training data
    • test: testing data
    • cl: the labels of training data

Model validation and improvement

Model validation

  • This is a classification problem, so we use the crossTable to validate the model
    1
    2
    require(gmodels)
    CrossTable(x = wdbc_test_label, y = wdbc_test_pred,prop.chisq=FALSE)
  • From the cross table, we can get that: TN = 69,TP = 39, FN=3,FP=2
  • So
    • accuracy = (TN+TP)/113=97.345%
    • sensitivity=TP/(TP+FN)= 92.86%
    • Specificity=TN/(TN+FP)= 100%
  • And the sensitivity is how much we predict the benign breast cancer correctly, and the specificity is how much we predict the malignant breast cancer correctly.

Model improvement

  • Since before we have used the max-min normalization, it maybe reduce the influence of the minimum. So we now try to use the z-score normalization
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    wdbc_z=as.data.frame(scale(wdbc[-1]))
    summary(wdbc_z$Area_mean)
    set.seed(2017)
    inTrain=createDataPartition(y=wdbc$diagnosis,p=0.8,list=FALSE)
    wdbc_train=wdbc_z[inTrain,]
    wdbc_test=wdbc_z[-inTrain,]
    wdbc_train_label=wdbc[inTrain,1]
    wdbc_test_label=wdbc[-inTrain,1]
    wdbc_test_pred = knn(train = wdbc_train, test = wdbc_test,cl = wdbc_train_label, k=21)
    CrossTable(x = wdbc_test_label, y = wdbc_test_pred,prop.chisq=FALSE)
  • From the cross table, we can get: TN = 71,TP = 37, FN=5,FP=0
  • so we can calculate:
    • accuracy = (TN+TP)/113=96.46%
    • sensitivity=TP/(TP+FN)= 90.47%
    • Specificity=TN/(TN+FP)= 100%
  • But it’s out of our expect, it gets worse now.
  • So we can try to tune the k value too.
 评论