A Comparison of Supervised Learning Algorithms (Part I)

Which supervised learning algorithm is the best? This question comes up for almost everyone starting out on their machine learning journey.

To answer it, we used four data sets (one for a regression problem, the other three for binary classification problems) to test the following supervised learning algorithms: SVMs (linear and kernel), neural networks, logistic regression, gradient boosting, random forests, decision trees, bagged trees, boosted trees, and linear ridge regression. We also applied model averaging to improve the models.

Table 1. Description of Data Sets

| Type | Data Set | # Predictor Attributes | Train Size | Test Size | Class Distribution |
| --- | --- | --- | --- | --- | --- |
| Regression | Boston Housing | 13 | 253 | 253 | N/A |
| Binary classification | Wdbc | 30 | 300 | 269 | Malignant: 212, Benign: 357 |
| Binary classification | Hypothyroid (has missing values) | 34 | 1956 | 1134 | Hypothyroid: 149, Negative: 2941 |
| Binary classification | Ionosphere (first two predictors deleted) | 34 -> 32 | 220 | 131 | Good: 225, Bad: 126 |

For some of the methods, Boston Housing is trained twice, once with the original attributes and once with scaled data. Also, to test the performance of logistic regression, Boston Housing was converted to a binary problem by treating values of "MEDV" larger than the median as positive and the rest as negative.
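For illustration, the conversion takes only a couple of lines (a minimal sketch; the Boston data shipped with the MASS package uses the lowercase column name "medv"):

library(MASS)
data(Boston)
# Label records above the median "MEDV" as positive, the rest as negative.
Boston$class <- factor(ifelse(Boston$medv > median(Boston$medv),
                              "positive", "negative"))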

Data Cleaning

Hypothyroid has many missing values, and 18 of its 34 predictors are binary. The number of missing values and the number of unique values for each variable are shown below (training data only):

sapply(hyp,function(x) sum(is.na(x)))
  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26
  0 277 44 0 0 0 0 0 0 0 0 0 0 0 0 278 0 429 0 152 0 151 0 150 0 1840
sapply(hyp, function(x) length(unique(x)))
 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26
  2 89 3 2 2 2 2 2 2 2 2 2 2 2 2 202 2 67 2 245 2 151 2 244 2 44

A visual take on the missing values in this data set might be helpful:

[Figure: missingness map of the hypothyroid training data]
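The post does not show the plotting code; one way to produce such a plot is the missmap() function from the Amelia package (an assumption, not necessarily the tool used):

library(Amelia)  # provides missmap() for plotting missingness patterns
missmap(hyp, main = "Missing values in the hypothyroid training data")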

We replaced missing values in the continuous variables with the mean. The binary variable "Sex" has only a few missing values, and replacing them with the variable's median or mean would introduce bias, so we deleted those 73 records (training: 44, test: 29).
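A hedged sketch of these cleaning steps (the actual code is not shown; the data frame name hyp comes from the output above, while the column name "Sex" is an assumption since the raw columns are labeled V1, V2, ...):

# Mean-impute the continuous (numeric) predictors.
num_cols <- which(sapply(hyp, is.numeric))
for (j in num_cols) {
  hyp[[j]][is.na(hyp[[j]])] <- mean(hyp[[j]], na.rm = TRUE)
}
# Drop the 73 records whose binary "Sex" value is missing.
hyp <- hyp[!is.na(hyp$Sex), ]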

The ionosphere data set contains 34 predictors. We removed the first two: the first takes the same value within one of the classes, and the second equals 0 for all observations.

Parameter Tuning Results by Data Set

We tuned parameters using 70/30 splits of the training data and compared the models' cross-validation errors. We also used the caret package in R to tune the gradient boosting and random forest models. This section summarizes the tuning procedure and results for each algorithm; based on these results, we choose appropriate parameters for each one.
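The tuning.* helpers used below are not shown in the post; a minimal sketch of the repeated 70/30 split error estimate they report might look like this (function and argument names are assumptions):

# Repeated 70/30 split estimate of test MSE for a generic fitting function.
cv_error <- function(data, response, fit_fun, reps = 20, frac = 0.7) {
  errs <- replicate(reps, {
    idx  <- sample(nrow(data), floor(frac * nrow(data)))
    fit  <- fit_fun(data[idx, ])
    pred <- predict(fit, newdata = data[-idx, ])
    mean((pred - data[[response]][-idx])^2)   # MSE on the 30% holdout
  })
  c(mean = mean(errs), sd = sd(errs))         # reported as "error= m +- s"
}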

Regression Problem (Boston Housing)

1. Gradient Boosting

The tuning results show that scaling the data does not change the conclusion for gradient boosting (as expected, since tree-based methods are invariant to monotone transformations of the predictors): in both cases the cross-validation error is smallest at tree depth 5.


> depth=c(1,2,3,4,5)
> tuning.gbm(trn,depth,20,0.7)
[1] "=== cross validation error estimation ==="
[1] "depth= 1 : error= 12.2088907005888 +- 2.63624188615258"
[1] "depth= 2 : error= 9.99951434560433 +- 2.30181005481302"
[1] "depth= 3 : error= 9.39671136126591 +- 2.19944247921552"
[1] "depth= 4 : error= 9.14194019449996 +- 2.16337261929302"
[1] "depth= 5 : error= 8.97911772123462 +- 2.18464631653844"
[1] 5
> #normalized data
> tuning.gbm(ntrn,depth,20,0.7)
[1] "=== cross validation error estimation ==="
[1] "depth= 1 : error= 13.9370040081716 +- 3.04440240601424"
[1] "depth= 2 : error= 11.4634148778012 +- 2.57605950695619"
[1] "depth= 3 : error= 10.6968093959549 +- 2.5842720460932"
[1] "depth= 4 : error= 10.4965351284161 +- 2.70512669539836"
[1] "depth= 5 : error= 10.2043560393421 +- 2.72245106119603"
[1] 5

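For reference, a hedged sketch of fitting the chosen configuration with the gbm package (the response name medv, the test-set name tst, and the exact call are assumptions):

library(gbm)
# trn / tst: training and held-out test sets, following the post's naming.
gbm_fit <- gbm(medv ~ ., data = trn, distribution = "gaussian",
               n.trees = 300, interaction.depth = 5,
               shrinkage = 0.01, n.minobsinnode = 10)
pred <- predict(gbm_fit, newdata = tst, n.trees = 300)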

Using the caret package, we confirmed that the best combination is shrinkage = 0.01, n.trees = 300, n.minobsinnode = 10, and interaction.depth = 5. Part of the output is shown below:

gbmFit1
Stochastic Gradient Boosting
253 samples
 13 predictor
No pre-processing
Resampling: Repeated Train/Test Splits Estimated (25 reps, 70%)
Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
Resampling results across tuning parameters:
  shrinkage interaction.depth n.minobsinnode n.trees RMSE Rsquared RMSE SD Rsquared SD
  0.001 1 10 15 8.337091357 0.6640256183 0.7941402762 0.09248626694
  0.001 1 10 30 8.265578098 0.6722929163 0.7915189328 0.08870257422
  0.001 1 10 45 8.195669943 0.6778771101 0.7895308048 0.08435275865
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were n.trees = 300, interaction.depth = 5, shrinkage = 0.01 and n.minobsinnode = 10.
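A hedged reconstruction of a caret call that would produce output like the above (the tuning grid beyond the values shown is an assumption):

library(caret)
ctrl <- trainControl(method = "LGOCV", p = 0.7, number = 25)  # 25 reps of 70/30 splits
grid <- expand.grid(shrinkage = c(0.001, 0.01),
                    interaction.depth = 1:5,
                    n.minobsinnode = 10,
                    n.trees = seq(15, 300, by = 15))
gbmFit1 <- train(medv ~ ., data = trn, method = "gbm",
                 trControl = ctrl, tuneGrid = grid, verbose = FALSE)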

2. Random Forest

The tuning results show the cross-validation error for different numbers of predictors sampled as split candidates in each tree (mtry). The cross-validation error is smallest at mtry = 6.


> tuning.rf(trn,mtry,20,0.7)
[1] "=== cross validation error estimation ==="
[1] "mtry= 2 : error= 9.89178214242082 +- 4.06175518360904"
[1] "mtry= 4 : error= 7.7269900435814 +- 2.53016370451501"
[1] "mtry= 6 : error= 7.48214012711454 +- 1.98854233231896"
[1] "mtry= 8 : error= 7.53741415199526 +- 1.78742904316897"
[1] "mtry= 10 : error= 7.92473002002763 +- 1.88953063618415"
[1] "mtry= 12 : error= 8.37506100241071 +- 1.9922333938174"
[1] "mtry= 14 : error= 8.61657448209067 +- 2.04644195003867"
[1] "mtry= 16 : error= 8.59454798339517 +- 2.05803385159497"
[1] "mtry= 18 : error= 8.61899651896161 +- 2.1064976963059"
[1] "mtry= 20 : error= 8.57905798710291 +- 2.08969847600883"
[1] 6


The caret results confirm this choice:

rfFit1
Random Forest

253 samples
13 predictor

No pre-processing
Resampling: Repeated Train/Test Splits Estimated (25 reps, 70%)
Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
Resampling results across tuning parameters:

mtry RMSE Rsquared RMSE SD Rsquared SD
2 3.290959961 0.8765870076 0.5079010701 0.01964500327
4 2.914160845 0.8946978414 0.3797723734 0.02067581818
6 2.826270709 0.8963652917 0.3505325664 0.02565374865
8 2.828542294 0.8932305921 0.3384923951 0.02829620363
10 2.856976760 0.8893013207 0.3504001443 0.02988188614
12 2.909261177 0.8842750778 0.3543389148 0.03184843971
14 2.964247468 0.8795746376 0.3713993793 0.03205890656
16 2.957650138 0.8801113129 0.3717238723 0.03240662814
18 2.955660322 0.8802977349 0.3745751188 0.03249307098
20 2.955327699 0.8802633254 0.3646805880 0.03206477136

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 6.
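A hedged sketch of the final fit with the randomForest package (the call details are assumptions):

library(randomForest)
rf_fit <- randomForest(medv ~ ., data = trn, mtry = 6, ntree = 500)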

 

3. Neural Networks

In this section, we trained on both the original and the scaled data to compare them. The scaled data's cross-validation error is much smaller, so we chose a neural network with resilient backpropagation trained on the scaled data; based on the output below, the hidden layer should have 8 neurons.


> tuning.nn(trn,neuron,10,0.7)
[1] "=== cross validation error estimation ==="
[1] "depth= 1 : error= 70.9794189514517 +- 15.904800477383"
[1] "depth= 2 : error= 70.9722170232778 +- 15.8954005190479"
[1] "depth= 3 : error= 71.0124803004767 +- 15.8597654828364"
[1] "depth= 4 : error= 70.5867870706421 +- 15.818947243105"
[1] "depth= 5 : error= 70.7432130613939 +- 15.6752259506747"
[1] "depth= 6 : error= 71.2585730501765 +- 16.1677503694949"
[1] "depth= 7 : error= 71.0870255049561 +- 16.0324643862916"
[1] "depth= 8 : error= 71.0129340109873 +- 15.9439431180934"
[1] "depth= 9 : error= 69.2765991412325 +- 17.4238413145427"
[1] "depth= 10 : error= 68.0890805039481 +- 18.795955239098"
[1] 10
> #normalized data
> tuning.nn(ntrn,neuron,10,0.7)
[1] "=== cross validation error estimation ==="
[1] "depth= 1 : error= 8.36827085978132 +- 1.38729760554322"
[1] "depth= 2 : error= 10.1439153325811 +- 2.97320904104239"
[1] "depth= 3 : error= 10.5033178163002 +- 3.07957861823334"
[1] "depth= 4 : error= 9.326750833886 +- 2.41268456494504"
[1] "depth= 5 : error= 8.30320248745101 +- 2.31702615480129"
[1] "depth= 6 : error= 8.28338957708407 +- 1.5092471033646"
[1] "depth= 7 : error= 7.55265022184259 +- 2.1759082753909"
[1] "depth= 8 : error= 7.30288737880777 +- 1.94906339595741"
[1] "depth= 9 : error= 8.06655827016435 +- 1.5545036672556"
[1] "depth= 10 : error= 8.67308490729445 +- 2.44640219350546"
[1] 8

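One R package that implements resilient backpropagation is neuralnet; a hedged sketch of the chosen model (the post does not name the package, so this is an assumption):

library(neuralnet)
# neuralnet requires an explicit formula; "medv" as the response is an assumption.
preds <- setdiff(names(ntrn), "medv")
f <- as.formula(paste("medv ~", paste(preds, collapse = " + ")))
nn_fit <- neuralnet(f, data = ntrn, hidden = 8,
                    algorithm = "rprop+",     # resilient backpropagation
                    linear.output = TRUE)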

4. SVM (Linear and Kernel)

Since kernel methods are based on distances, we need to scale the data before fitting an SVM. The tuning results show the cross-validation error for different kernels. Although the RBF kernel with parameter 5 reaches a lower error here, SVMs can easily overfit the training data, so we select the linear SVM with cost = 3.2.


> tuning.svm("normalized data",ntrn,cvec,10,0.7)
[1] "normalized data"
[1] "#########linear SVM#################"
[1] "parameter= 1 the best cost= 3.2 the error = 13.0911820319358"
[1] "########polynomial kernel SVM##########"
[1] "parameter= 2 the best cost= 1 the error = 36.0239737549915"
[1] "parameter= 3 the best cost= 1 the error = 15.5345486240454"
[1] "parameter= 4 the best cost= 0.32 the error = 33.5509708662857"
[1] "parameter= 5 the best cost= 0.32 the error = 41.9638713171909"
[1] "parameter= 6 the best cost= 0.032 the error = 45.8911349089365"
[1] "parameter= 7 the best cost= 0.01 the error = 52.4204054549299"
[1] "########RBF kernel SVM###############"
[1] "parameter= 0.05 the best cost= 100 the error = 73.8399571148679"
[1] "parameter= 0.1 the best cost= 100 the error = 76.136578442901"
[1] "parameter= 0.2 the best cost= 3.2 the error = 67.1695151573611"
[1] "parameter= 0.3 the best cost= 100 the error = 58.8905466491342"
[1] "parameter= 0.4 the best cost= 100 the error = 49.190563732546"
[1] "parameter= 0.5 the best cost= 100 the error = 39.3667236728242"
[1] "parameter= 0.8 the best cost= 3.2 the error = 32.1457803322543"
[1] "parameter= 5 the best cost= 32 the error = 6.90665923311557"

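A hedged sketch of the selected model using e1071 (the tuning.svm helper is not shown, so the underlying call is an assumption):

library(e1071)
svm_fit <- svm(medv ~ ., data = ntrn, kernel = "linear", cost = 3.2)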

5. Linear Ridge Regression

For linear ridge regression, the results show that we should choose lambda = 0.01. We can reduce this error by selecting features, so we reran the same tuning process using features [6, 11, 8, 10, 13].


> tuning.rg(ntrn,lambda,10,0.7)
[1] "=== cross validation error estimation ==="
[1] "lambda= 0.0001 : error= 11.0574771599181 +- 1.5911393576281"
[1] "lambda= 0.001 : error= 11.0554573634712 +- 1.59159985887222"
[1] "lambda= 0.01 : error= 11.0377334394576 +- 1.59698319488323"
[1] "lambda= 0.1 : error= 11.0537037442444 +- 1.70124419652984"
[1] "lambda= 1 : error= 14.7438810915394 +- 2.85190527401637"
[1] "lambda= 10 : error= 24.6020816723033 +- 5.47557805542126"
[1] "lambda= 100 : error= 40.0103673810083 +- 9.25347219989225"
[1] "lambda= 1000 : error= 56.4994813733851 +- 9.68175560688852"
[1] "lambda= 10000 : error= 61.7886340465619 +- 9.57833571234911"
[1] "lambda= 100000 : error= 62.4693812043716 +- 9.56137138319671"
[1] "lambda= 1000000 : error= 62.539471175857 +- 9.55958407895257"
[1] "lambda= 10000000 : error= 62.5465009852284 +- 9.55940440620494"
[1] "lambda= 100000000 : error= 62.547204174969 +- 9.55938642947216"
[1] 0.01

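A hedged sketch with MASS::lm.ridge, including the feature-selected rerun described above (the tuning.rg helper is not shown; the indices are assumed to refer to predictor column positions):

library(MASS)
ridge_all <- lm.ridge(medv ~ ., data = ntrn, lambda = 0.01)
# Rerun with the selected features [6, 11, 8, 10, 13].
sel <- c(6, 11, 8, 10, 13)
keep <- c(sel, which(names(ntrn) == "medv"))   # predictors plus the response
ridge_sel <- lm.ridge(medv ~ ., data = ntrn[, keep], lambda = 0.01)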
