Which Machine Learning Classifier Is the Best on 18 UCI Datasets?

We all know that deep neural networks (DNNs) are great for image recognition and speech processing. But what about good ol' structured/numerical datasets? I compared DNNs to other standard ML algorithms on 18 public classification datasets from the UCI ML repository, and here are the results. I divided each dataset into training and test sets (an 80-20 split) and report test accuracies, averaged over 3 random train-test splits, below. The specific classifiers that I tried were:

  • Support Vector
  • Naive Bayes
  • Decision Tree
  • Random Forests
  • 1-layer NN with 100 hidden units
  • 2-layer NN with 100 hidden units in each layer
  • 3-layer NN with 100 hidden units in each layer

All techniques use the standard implementations in the sklearn library, with no hyper-parameter tuning. The goal was to see which model is best if you want good off-the-shelf performance. A short description of each dataset (taken from the UCI website) is included. Conclusions are listed at the end.
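The helpers `compute_test_accuracies` and friends live in `uci_utils` and are not shown in this notebook, so here is a minimal sketch of the evaluation protocol described above, assuming nothing beyond scikit-learn: default-hyperparameter classifiers, an 80-20 split, and test accuracy averaged over 3 random splits. The function name matches the one used below, but the body is a plausible reconstruction, not the actual helper.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

names = ['Support Vector', 'Naive Bayes', 'Decision Tree', 'Random Forests',
         '1-layer NN', '2-layer NN', '3-layer NN']

def make_classifiers():
    # All classifiers use sklearn defaults; only the NN depth varies.
    return [SVC(), GaussianNB(), DecisionTreeClassifier(),
            RandomForestClassifier(),
            MLPClassifier(hidden_layer_sizes=(100,)),
            MLPClassifier(hidden_layer_sizes=(100, 100)),
            MLPClassifier(hidden_layer_sizes=(100, 100, 100))]

def compute_test_accuracies(X, y, n_splits=3):
    accs = np.zeros(len(names))
    for seed in range(n_splits):
        # 80-20 train-test split, re-drawn for each of the 3 repetitions
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        for i, clf in enumerate(make_classifiers()):
            accs[i] += clf.fit(X_tr, y_tr).score(X_te, y_te)
    return accs / n_splits  # mean test accuracy per classifier
```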

In [1]:
import pandas as pd, numpy as np
from utils import Timer
from uci_utils import *

%load_ext autoreload
%autoreload 2

timer = Timer()
pd.set_option('precision', 3)

Adult Income

Predict whether income exceeds $50K/yr based on census data.

In [2]:
timer.start()
X, y = UCI_Dataset_Loader.adult()
scores = compute_test_accuracies(X,y)
timer.end_and_md_print()
print_best(scores)
pd.DataFrame(scores, columns=names, index=['Test Accuracy']).style.apply(highlight_max,axis=1)

Training set size: (26048, 108), Test set size: (6513, 108), # of classes: 2

Time needed to run experiment: 1630.932 s

Best classifier: Random Forests

Out[2]:
Support Vector Naive Bayes Decision Tree Random Forests 1-layer NN 2-layer NN 3-layer NN
Test Accuracy 0.753 0.799 0.817 0.847 0.605 0.599 0.784

Car Evaluation

The database evaluates cars according to the following concepts: car acceptability, overall price, buying price, price of the maintenance, number of doors, capacity in terms of persons to carry, the size of luggage boot, and estimated safety of the car.

In [3]:
timer.start()
X, y = UCI_Dataset_Loader.car()
scores = compute_test_accuracies(X,y, verbose=1)
timer.end_and_md_print()
print_best(scores)
pd.DataFrame(scores, columns=names, index=['Test Accuracy']).style.apply(highlight_max,axis=1)

Training set size: (1382, 21), Test set size: (346, 21), # of classes: 4

Time needed to run experiment: 12.962 s

Best classifier: 2-layer NN

Out[3]:
Support Vector Naive Bayes Decision Tree Random Forests 1-layer NN 2-layer NN 3-layer NN
Test Accuracy 0.906 0.799 0.962 0.935 0.993 0.999 0.997

Credit Default

This research is aimed at the case of customers' default payments in Taiwan and compares the predictive accuracy of probability of default among six data mining methods.

In [4]:
timer.start()
X, y = UCI_Dataset_Loader.credit_default()
scores = compute_test_accuracies(X,y, verbose=1)
timer.end_and_md_print()
print_best(scores)
pd.DataFrame(scores, columns=names, index=['Test Accuracy']).style.apply(highlight_max,axis=1)

Training set size: (24000, 24), Test set size: (6000, 24), # of classes: 2

Time needed to run experiment: 697.384 s

Best classifier: Random Forests

Out[4]:
Support Vector Naive Bayes Decision Tree Random Forests 1-layer NN 2-layer NN 3-layer NN
Test Accuracy 0.774 0.372 0.723 0.804 0.664 0.662 0.714

Dermatology

The differential diagnosis of erythemato-squamous diseases is a real problem in dermatology. They all share the clinical features of erythema and scaling, with very few differences. Patients were first evaluated clinically with 12 features. Afterwards, skin samples were taken for the evaluation of 22 histopathological features. The values of the histopathological features are determined by an analysis of the samples under a microscope.

In [5]:
timer.start()
X, y = UCI_Dataset_Loader.dermatology() # <-- change this line for different experiments
scores = compute_test_accuracies(X,y, verbose=1)
timer.end_and_md_print()
print_best(scores)
pd.DataFrame(scores, columns=names, index=['Test Accuracy']).style.apply(highlight_max,axis=1)

Training set size: (292, 94), Test set size: (74, 94), # of classes: 4

Time needed to run experiment: 8.086 s

Best classifier: Support Vector

Out[5]:
Support Vector Naive Bayes Decision Tree Random Forests 1-layer NN 2-layer NN 3-layer NN
Test Accuracy 0.622 0.189 0.559 0.527 0.482 0.486 0.491

Diabetic Retinopathy

This dataset contains features extracted from the Messidor image set to predict whether an image contains signs of diabetic retinopathy or not. All features represent either a detected lesion, a descriptive feature of an anatomical part, or an image-level descriptor.

In [6]:
timer.start()
X, y = UCI_Dataset_Loader.diabetic_retinopathy() # <-- change this line for different experiments
scores = compute_test_accuracies(X,y, verbose=1)
timer.end_and_md_print()
print_best(scores)
pd.DataFrame(scores, columns=names, index=['Test Accuracy']).style.apply(highlight_max,axis=1)

Training set size: (920, 19), Test set size: (231, 19), # of classes: 2

Time needed to run experiment: 3.1 s

Best classifier: 1-layer NN

Out[6]:
Support Vector Naive Bayes Decision Tree Random Forests 1-layer NN 2-layer NN 3-layer NN
Test Accuracy 0.561 0.613 0.625 0.635 0.723 0.701 0.683

E. coli

Predict which kind of E. coli you have based on a number of cellular features.

In [7]:
timer.start()
X, y = UCI_Dataset_Loader.ecoli() # <-- change this line for different experiments
scores = compute_test_accuracies(X,y, verbose=1)
timer.end_and_md_print()
print_best(scores)
pd.DataFrame(scores, columns=names, index=['Test Accuracy']).style.apply(highlight_max,axis=1)

Training set size: (268, 7), Test set size: (68, 7), # of classes: 8

Time needed to run experiment: 4.664 s

Best classifier: 3-layer NN

Out[7]:
Support Vector Naive Bayes Decision Tree Random Forests 1-layer NN 2-layer NN 3-layer NN
Test Accuracy 0.721 0.76 0.814 0.843 0.76 0.907 0.912

EEG to Predict Eye State

All data is from one continuous EEG measurement with the Emotiv EEG Neuroheadset. The duration of the measurement was 117 seconds. The eye state was detected via a camera during the EEG measurement and added later manually to the file after analysing the video frames. '1' indicates the eye-closed and '0' the eye-open state. All values are in chronological order with the first measured value at the top of the data.

In [8]:
timer.start()
X, y = UCI_Dataset_Loader.eeg_eyes() # <-- change this line for different experiments
scores = compute_test_accuracies(X,y, verbose=1)
timer.end_and_md_print()
print_best(scores)
pd.DataFrame(scores, columns=names, index=['Test Accuracy']).style.apply(highlight_max,axis=1)

Training set size: (11984, 14), Test set size: (2996, 14), # of classes: 2

Time needed to run experiment: 89.296 s

Best classifier: Random Forests

Out[8]:
Support Vector Naive Bayes Decision Tree Random Forests 1-layer NN 2-layer NN 3-layer NN
Test Accuracy 0.561 0.486 0.832 0.898 0.504 0.445 0.491

Haberman's Breast Cancer Survival

The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

In [9]:
timer.start()
X, y = UCI_Dataset_Loader.haberman() # <-- change this line for different experiments
scores = compute_test_accuracies(X,y, verbose=1)
timer.end_and_md_print()
print_best(scores)
pd.DataFrame(scores, columns=names, index=['Test Accuracy']).style.apply(highlight_max,axis=1)

Training set size: (244, 3), Test set size: (62, 3), # of classes: 2

Time needed to run experiment: 0.577 s

Best classifier: Naive Bayes, 2-layer NN

Out[9]:
Support Vector Naive Bayes Decision Tree Random Forests 1-layer NN 2-layer NN 3-layer NN
Test Accuracy 0.699 0.747 0.634 0.661 0.586 0.747 0.737

Ionosphere Radar Returns

Classification of radar returns from the ionosphere.

In [10]:
timer.start()
X, y = UCI_Dataset_Loader.ionosphere() # <-- change this line for different experiments
scores = compute_test_accuracies(X,y, verbose=1)
timer.end_and_md_print()
print_best(scores)
pd.DataFrame(scores, columns=names, index=['Test Accuracy']).style.apply(highlight_max,axis=1)

Training set size: (280, 34), Test set size: (71, 34), # of classes: 2

Time needed to run experiment: 4.164 s

Best classifier: 3-layer NN

Out[10]:
Support Vector Naive Bayes Decision Tree Random Forests 1-layer NN 2-layer NN 3-layer NN
Test Accuracy 0.897 0.864 0.869 0.906 0.897 0.911 0.925

Mice Protein Expression

The data set consists of the expression levels of 77 proteins/protein modifications that produced detectable signals in the nuclear fraction of cortex. There are 38 control mice and 34 trisomic mice (Down syndrome), for a total of 72 mice.

In [11]:
timer.start()
X, y = UCI_Dataset_Loader.mice_protein() # <-- change this line for different experiments
scores = compute_test_accuracies(X,y, verbose=1)
timer.end_and_md_print()
print_best(scores)
pd.DataFrame(scores, columns=names, index=['Test Accuracy']).style.apply(highlight_max,axis=1)

Training set size: (864, 77), Test set size: (216, 77), # of classes: 8

Time needed to run experiment: 15.947 s

Best classifier: 3-layer NN

Out[11]:
Support Vector Naive Bayes Decision Tree Random Forests 1-layer NN 2-layer NN 3-layer NN
Test Accuracy 0.696 0.79 0.818 0.937 0.963 0.952 0.968

Nursery Admittance

Nursery Database was derived from a hierarchical decision model originally developed to rank applications for nursery schools. It was used during several years in the 1980s, when there was excessive enrollment in these schools in Ljubljana, Slovenia, and the rejected applications frequently needed an objective explanation. The final decision depended on three subproblems: occupation of parents and child's nursery, family structure and financial standing, and social and health picture of the family.

In [12]:
timer.start()
X, y = UCI_Dataset_Loader.nursery() # <-- change this line for different experiments
scores = compute_test_accuracies(X,y, verbose=1)
timer.end_and_md_print()
print_best(scores)
pd.DataFrame(scores, columns=names, index=['Test Accuracy']).style.apply(highlight_max,axis=1)

Training set size: (10367, 27), Test set size: (2592, 27), # of classes: 5

Time needed to run experiment: 49.664 s

Best classifier: 1-layer NN, 2-layer NN, 3-layer NN

Out[12]:
Support Vector Naive Bayes Decision Tree Random Forests 1-layer NN 2-layer NN 3-layer NN
Test Accuracy 0.972 0.836 0.996 0.983 1 1 1

Seed Classification

The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for the experiment. High quality visualization of the internal kernel structure was detected using a soft X-ray technique. It is non-destructive and considerably cheaper than other more sophisticated imaging techniques like scanning microscopy or laser technology. The images were recorded on 13x18 cm X-ray KODAK plates. Studies were conducted using combine harvested wheat grain originating from experimental fields, explored at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin.

In [13]:
timer.start()
X, y = UCI_Dataset_Loader.seeds() # <-- change this line for different experiments
scores = compute_test_accuracies(X,y, verbose=1)
timer.end_and_md_print()
print_best(scores)
pd.DataFrame(scores, columns=names, index=['Test Accuracy']).style.apply(highlight_max,axis=1)

Training set size: (167, 7), Test set size: (42, 7), # of classes: 3

Time needed to run experiment: 0.853 s

Best classifier: Support Vector, Naive Bayes, Random Forests

Out[13]:
Support Vector Naive Bayes Decision Tree Random Forests 1-layer NN 2-layer NN 3-layer NN
Test Accuracy 0.929 0.929 0.905 0.929 0.421 0.278 0.603

Seismic Mining

Mining activity was and is always connected with the occurrence of dangers which are commonly called mining hazards. In this case, the goal is to predict hazardous seismic activity.

In [14]:
timer.start()
X, y = UCI_Dataset_Loader.seismic() # <-- change this line for different experiments
scores = compute_test_accuracies(X,y, verbose=1)
timer.end_and_md_print()
print_best(scores)
pd.DataFrame(scores, columns=names, index=['Test Accuracy']).style.apply(highlight_max,axis=1)

Training set size: (2066, 24), Test set size: (517, 24), # of classes: 2

Time needed to run experiment: 3.042 s

Best classifier: Support Vector

Out[14]:
Support Vector Naive Bayes Decision Tree Random Forests 1-layer NN 2-layer NN 3-layer NN
Test Accuracy 0.942 0.911 0.885 0.937 0.663 0.734 0.471

Soybean

"Learning by being told and learning from examples: an experimental comparison of the two methods of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis"

In [15]:
timer.start()
X, y = UCI_Dataset_Loader.soybean() # <-- change this line for different experiments
scores = compute_test_accuracies(X,y, verbose=1)
timer.end_and_md_print()
print_best(scores)
pd.DataFrame(scores, columns=names, index=['Test Accuracy']).style.apply(highlight_max,axis=1)

Training set size: (36, 35), Test set size: (10, 35), # of classes: 4

Time needed to run experiment: 1.271 s

Best classifier: Support Vector, Naive Bayes, Decision Tree, Random Forests, 1-layer NN, 2-layer NN, 3-layer NN

Out[15]:
Support Vector Naive Bayes Decision Tree Random Forests 1-layer NN 2-layer NN 3-layer NN
Test Accuracy 1 1 1 1 1 1 1

Teaching Assistant Evaluation

The data consist of evaluations of teaching performance over three regular semesters and two summer semesters of 151 teaching assistant (TA) assignments at the Statistics Department of the University of Wisconsin-Madison. The scores were divided into 3 roughly equal-sized categories ("low", "medium", and "high") to form the class variable.

In [16]:
timer.start()
X, y = UCI_Dataset_Loader.teaching_assistant() # <-- change this line for different experiments
scores = compute_test_accuracies(X,y, verbose=1)
timer.end_and_md_print()
print_best(scores)
pd.DataFrame(scores, columns=names, index=['Test Accuracy']).style.apply(highlight_max,axis=1)

Training set size: (120, 5), Test set size: (30, 5), # of classes: 3

Time needed to run experiment: 0.47 s

Best classifier: Decision Tree

Out[16]:
Support Vector Naive Bayes Decision Tree Random Forests 1-layer NN 2-layer NN 3-layer NN
Test Accuracy 0.422 0.533 0.556 0.533 0.344 0.4 0.3

Tic Tac Toe Endgame

This database encodes the complete set of possible board configurations at the end of tic-tac-toe games, where "x" is assumed to have played first. The target concept is "win for x" (i.e., true when "x" has one of 8 possible ways to create a "three-in-a-row").

In [17]:
timer.start()
X, y = UCI_Dataset_Loader.tic_tac_toe() # <-- change this line for different experiments
scores = compute_test_accuracies(X,y, verbose=1)
timer.end_and_md_print()
print_best(scores)
pd.DataFrame(scores, columns=names, index=['Test Accuracy']).style.apply(highlight_max,axis=1)

Training set size: (765, 27), Test set size: (192, 27), # of classes: 2

Time needed to run experiment: 7.827 s

Best classifier: 2-layer NN

Out[17]:
Support Vector Naive Bayes Decision Tree Random Forests 1-layer NN 2-layer NN 3-layer NN
Test Accuracy 0.852 0.639 0.938 0.951 0.976 0.979 0.967

Website Phishing

We have identified different features related to legitimate and phishy websites and collected 1353 different websites from different sources. Phishing websites were collected from the Phishtank data archive, which is a free community site where users can submit, verify, track and share phishing data.

In [18]:
timer.start()
X, y = UCI_Dataset_Loader.website_phishing() # <-- change this line for different experiments
scores = compute_test_accuracies(X,y, verbose=1)
timer.end_and_md_print()
print_best(scores)
pd.DataFrame(scores, columns=names, index=['Test Accuracy']).style.apply(highlight_max,axis=1)

Training set size: (1082, 9), Test set size: (271, 9), # of classes: 3

Time needed to run experiment: 11.575 s

Best classifier: 2-layer NN

Out[18]:
Support Vector Naive Bayes Decision Tree Random Forests 1-layer NN 2-layer NN 3-layer NN
Test Accuracy 0.866 0.838 0.893 0.895 0.889 0.904 0.893

Wholesale Customer Region

The data set refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories. Can we predict what region/country a customer is from based on how they spend?

In [19]:
timer.start()
X, y = UCI_Dataset_Loader.wholesale_customers() # <-- change this line for different experiments
scores = compute_test_accuracies(X,y, verbose=1)
timer.end_and_md_print()
print_best(scores)
pd.DataFrame(scores, columns=names, index=['Test Accuracy']).style.apply(highlight_max,axis=1)

Training set size: (352, 6), Test set size: (88, 6), # of classes: 3

Time needed to run experiment: 0.527 s

Best classifier: Support Vector

Out[19]:
Support Vector Naive Bayes Decision Tree Random Forests 1-layer NN 2-layer NN 3-layer NN
Test Accuracy 0.758 0.485 0.542 0.682 0.5 0.655 0.61

Conclusions

Aggregated Results

In [27]:
all_data_df = pd.DataFrame(np.array(all_data).squeeze())
all_data_df.columns = ['# of Points','Dimensionality','# of Classes'] + names
all_data_df.style.apply(highlight_max_excluding_first_three,axis=1)
Out[27]:
# of Points Dimensionality # of Classes Support Vector Naive Bayes Decision Tree Random Forests 1-layer NN 2-layer NN 3-layer NN
0 3.26e+04 108 2 0.753 0.799 0.817 0.847 0.605 0.599 0.784
1 1.73e+03 21 4 0.906 0.799 0.962 0.935 0.993 0.999 0.997
2 3e+04 24 2 0.774 0.372 0.723 0.804 0.664 0.662 0.714
3 366 94 4 0.622 0.189 0.559 0.527 0.482 0.486 0.491
4 1.15e+03 19 2 0.561 0.613 0.625 0.635 0.723 0.701 0.683
5 336 7 8 0.721 0.76 0.814 0.843 0.76 0.907 0.912
6 1.5e+04 14 2 0.561 0.486 0.832 0.898 0.504 0.445 0.491
7 306 3 2 0.699 0.747 0.634 0.661 0.586 0.747 0.737
8 351 34 2 0.897 0.864 0.869 0.906 0.897 0.911 0.925
9 1.08e+03 77 8 0.696 0.79 0.818 0.937 0.963 0.952 0.968
10 1.3e+04 27 5 0.972 0.836 0.996 0.983 1 1 1
11 209 7 3 0.929 0.929 0.905 0.929 0.421 0.278 0.603
12 2.58e+03 24 2 0.942 0.911 0.885 0.937 0.663 0.734 0.471
13 46 35 4 1 1 1 1 1 1 1
14 150 5 3 0.422 0.533 0.556 0.533 0.344 0.4 0.3
15 957 27 2 0.852 0.639 0.938 0.951 0.976 0.979 0.967
16 1.35e+03 9 3 0.866 0.838 0.893 0.895 0.889 0.904 0.893
17 440 6 3 0.758 0.485 0.542 0.682 0.5 0.655 0.61

Mean Accuracy and Rank

In [21]:
df_mean = all_data_df.mean(axis=0)[3:].to_frame()
df_mean.columns = ['Average Accuracy']

df_rank = all_data_df.iloc[:,3:].rank(method="min", axis=1, ascending=True).mean(axis=0).to_frame()
df_rank.columns = ['Average Rank (higher is better)']

df = pd.concat([df_mean, df_rank], axis=1)
df.style.apply(highlight_max)
Out[21]:
Average Accuracy Average Rank (higher is better)
Support Vector 0.774 3.5
Naive Bayes 0.699 2.39
Decision Tree 0.798 3.94
Random Forests 0.828 4.78
1-layer NN 0.721 3.22
2-layer NN 0.742 4.22
3-layer NN 0.752 4.28
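To make the rank column concrete: with `ascending=True`, pandas assigns rank 1 to the lowest accuracy in each row, so the best classifier on a dataset receives the highest rank (7 in the table above), which is why a higher average rank is better. A toy illustration with three hypothetical classifiers A, B, C on two datasets:

```python
import pandas as pd

# Two rows = two datasets; columns = test accuracies of three classifiers.
toy = pd.DataFrame({'A': [0.9, 0.9], 'B': [0.5, 0.7], 'C': [0.7, 0.5]})

# Within each row, the lowest accuracy gets rank 1, the highest gets rank 3.
ranks = toy.rank(method="min", axis=1, ascending=True)
print(ranks.mean(axis=0))  # A: 3.0, B: 1.5, C: 1.5 -- A is consistently best
```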

Overall Recommendation

Based on these results, if you'd like to use an off-the-shelf machine learning classifier on your structured/numerical dataset, a good choice is a:

Random Forest, as it seems to work quite well on datasets of many different sizes and dimensionalities. On several datasets, 2- or 3-layer neural networks (NNs) offer some improvement over Random Forests, but on some datasets (particularly those with a small number of data points), NNs perform a lot worse, dragging down their average accuracy. Perhaps NNs could be regularized to improve performance, but with off-the-shelf hyper-parameters, Random Forests are your best bet.
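In practice, the recommendation above amounts to a two-liner. Here is a minimal example using scikit-learn's default hyper-parameters; the dataset is a built-in sklearn toy set chosen for illustration, not one of the 18 UCI benchmarks:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0)  # no hyper-parameter tuning
clf.fit(X_tr, y_tr)
print(f"Test accuracy: {clf.score(X_te, y_te):.3f}")
```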

In a similar manner, SVMs achieved the top accuracy on only a few datasets, but had a very respectable average accuracy, in part because they did not fail terribly on any dataset. Compare this to, for example, Naive Bayes, which had the top performance on a couple of small datasets but the worst average performance overall, in part because it performed very poorly on some datasets. Naive Bayes does not seem to be a good method to use in general. Of course, all of these conclusions are based only on test accuracy. If you have other considerations besides test accuracy (e.g. interpretability), then some of the simpler methods may make sense. In particular, Decision Trees (though rarely a top performer) had quite a good average accuracy and may be a good choice if interpretability is a key concern.

If this notebook was helpful in your research, please cite it using the BibTeX entry below:

@misc{abid2018atomic,
  title={Which Machine Learning Classifier Is the Best on 18 UCI Datasets?},
  author={Abid, Abubakar},
  year={2018},
  publisher={GitHub},
  howpublished={\url{https://github.com/abidlabs/AtomsOfDeepLearning}},
}