DEEP LEARNING OF PREPROCESSED HANDWRITTEN DIGITS

Ruthger Righart

Blogs, tutorials & videos: https://rrighart.github.io

E-Mail: rrighart@googlemail.com


Introduction

Deep Learning (DL) algorithms demonstrate excellent performance in the field of image analysis, such as object detection, crop classification, or lesion segmentation in medical images, to name a few. DL builds different levels of representations, from simple to more complex, by using functions in different layers [1].

Handwritten digits have been investigated extensively in the domain of Deep Learning (DL) [2]. MNIST is a large database of handwritten digits that has frequently been used for this purpose, and it represents a favourite test case for examining algorithms in Data Science [3]. MNIST is freely available at different websites, and one reason this database is so attractive is that several preprocessing steps have already been performed, meaning that the data are ready for use once downloaded [2,4]. In the current blog, some additional preprocessing will be performed, and the results show that this improves DL performance.

The current blog illustrates how to implement Deep Learning with 5-fold cross-validation (CV), using the H2O platform and R programming. CV permits estimating performance on the training set while reducing the risk of overfitting.

We will directly use the additionally preprocessed digits; the interested reader is referred to the following blog for how to conduct these additional preprocessing steps [5].

Load the preprocessed data

We first need to load the data. It is best to download the original data from Kaggle or the site of Yann LeCun [2,4] and then use the read.csv function to load the train and test data. The current blog applies DL to the preprocessed version of these data.

There are 42000 images in the training set and 28000 in the test set.

# pa holds the path to the directory containing the preprocessed csv files
train<-read.csv(paste(pa, "trainset.csv", sep=""))
test<-read.csv(paste(pa, "testset.csv", sep=""))
# drop the first column of the train data, so that the label becomes the first column
train[,1]<-NULL
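As a quick check, we can verify that both sets have the expected number of rows (a minimal sketch; the expected counts are those mentioned above):

dim(train)   # should report 42000 rows
dim(test)    # should report 28000 rows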

The first column in the train data now contains the labels. We set this column as a factor.

train[,1] = as.factor(as.character(train[,1]))
names(train)[1]<-c("label")
train[c(1:6), c(1:6)]
##   label X1 X2 X3 X4 X5
## 1     1  0  0  0  0  0
## 2     0  0  0  0  0  0
## 3     1  0  0  0  0  0
## 4     4  0  0  0  0  0
## 5     0  0  0  0  0  0
## 6     0  0  0  0  0  0

We are going to inspect whether the occurrence of the labels is balanced. For this purpose, we use the table function.

table(train[,1])
## 
##    0    1    2    3    4    5    6    7    8    9 
## 4132 4684 4177 4351 4072 3795 4137 4401 4063 4188

Let us display some example digits from the train set (here rows 51 to 60), using the following code:

# display 10 digits in a 2 x 5 grid
par(mfrow=c(2,5),
    oma = c(3,3,3,3) + 0.1,
    mar = c(0,0.1,0,0.1) + 0.1)
for(i in 51:60){
  # reshape the 576 pixel values into a 24 x 24 matrix and rotate it for display
  m = t(apply(matrix(unlist(train[i,-1]), nrow=24, byrow=TRUE), 2, rev))
  image(m, col=grey.colors(255), axes=FALSE, asp=1)
  Sys.sleep(2)
}

H2O

Deep learning can be performed using the H2O platform. For this we start a local H2O cluster.

When you use h2o.init, you may need to change max_mem_size, depending on your machine. More information about the H2O platform can be found at the H2O.ai webpage [6].

library(h2o)
h2o.stopLogging()
## Logging stopped
localH2O = h2o.init(max_mem_size = '6g', 
                    nthreads = -1)
## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     /tmp/RtmplC2Ql0/h2o_mark_started_from_r.out
##     /tmp/RtmplC2Ql0/h2o_mark_started_from_r.err
## 
## 
## Starting H2O JVM and connecting: ... Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 seconds 504 milliseconds 
##     H2O cluster version:        3.10.0.8 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   5.33 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     R Version:                  R version 3.3.2 (2016-10-31)

5-fold cross validation

At this point I need to mention that cross validation can be run in h2o.deeplearning itself, using the nfolds option [6]. The solution provided in this blog contains a bit more code but gives more flexibility.
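As a minimal sketch of that built-in option (assuming the train data have already been converted to an H2O frame, here called dtrain_h2o, with the label in column 1), it could look like this; the exact parameter values are illustrative:

# sketch: built-in 5-fold cross validation in h2o.deeplearning
model_cv <- h2o.deeplearning(x = 2:577,              # pixel columns
                             y = 1,                  # label column
                             training_frame = dtrain_h2o,
                             nfolds = 5,             # 5-fold cross validation
                             hidden = c(100,100),
                             epochs = 15)
h2o.performance(model_cv, xval = TRUE)               # cross-validated performance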

In the following, I have chosen k = 5, an 80/20 split, but one could easily adapt this to, for example, k = 10. Using this setup, other ML algorithms can easily be implemented within the same coding framework.

Data structure

First, we need to create 5 randomly selected but equally sized folds from the train data.

For that purpose, we use set.seed, so that the randomization can be reproduced the next time we run the same script.

A vector randind is created that contains the indices of the train data in random order. Then, the vector is split into 5 different vectors, each of length 8400, resulting in the object d1. These vectors are going to be used to build the 5 folds from the train data.

set.seed(123)
randind<-sample(seq(1:dim(train)[1]), dim(train)[1])
d1<-split(randind, ceiling(seq_along(randind)/(dim(train)[1]/5)))
length(d1[[1]])
## [1] 8400

After that, a data.frame x is created that determines the order of the folds, so that at each iteration of the loop the folds can be assigned to the training set (80%) or the validation set (20%). That means that with each iteration, we use 4 folds for training and 1 fold for validation.

x<-data.frame(c1=c(1,2,3,4,5), c2=c(2,3,4,5,1), c3=c(3,4,5,1,2), c4=c(4,5,1,2,3))
x
##   c1 c2 c3 c4
## 1  1  2  3  4
## 2  2  3  4  5
## 3  3  4  5  1
## 4  4  5  1  2
## 5  5  1  2  3

Distribution of labels for each set

For every iteration of the loop, a different combination of the 5 CV folds is used. To check the class distribution, we use the table function to count the number of digits per class across the four training folds of each iteration (80%).

ctab<-NULL
for(i in 1:5){
h<-c(d1[[x[i,1]]],d1[[x[i,2]]],d1[[x[i,3]]],d1[[x[i,4]]])
ctab<-rbind(ctab, table(train[h,1]))
}

Validation of the distribution of categories in each fold

To inspect whether the distribution of the 5 CV sets is comparable to the overall distribution, we divide the count of each digit in every CV set by its total count in the train data.

xtab<-data.frame(ctab)
names(xtab)<-c("digit-0","digit-1", "digit-2", "digit-3", "digit-4", "digit-5", "digit-6", "digit-7", "digit-8", "digit-9")
vec<-as.numeric(table(train[,1]))
pctab<-t(apply(xtab, 1, function(x){x/vec}))*100
The counts of each digit (columns) in the training folds of each CV iteration (rows):

      digit-0 digit-1 digit-2 digit-3 digit-4 digit-5 digit-6 digit-7 digit-8 digit-9
CV1      3267    3739    3355    3490    3238    3035    3266    3564    3279    3367
CV2      3370    3702    3325    3449    3254    3083    3298    3516    3234    3369
CV3      3297    3740    3328    3534    3232    2997    3368    3540    3233    3331
CV4      3279    3793    3385    3471    3272    3041    3291    3473    3249    3346
CV5      3315    3762    3315    3460    3292    3024    3325    3511    3257    3339

One would expect the proportion of digits to be very close to 80% for every CV set, since we used an 80/20 split. And this is the case. The table showing the percentage of each digit (columns) in each CV set (rows) is displayed here:

      digit-0 digit-1 digit-2 digit-3 digit-4 digit-5 digit-6 digit-7 digit-8 digit-9
CV1     79.07   79.82   80.32   80.21   79.52   79.97   78.95   80.98   80.70   80.40
CV2     81.56   79.04   79.60   79.27   79.91   81.24   79.72   79.89   79.60   80.44
CV3     79.79   79.85   79.67   81.22   79.37   78.97   81.41   80.44   79.57   79.54
CV4     79.36   80.98   81.04   79.77   80.35   80.13   79.55   78.91   79.97   79.89
CV5     80.23   80.32   79.36   79.52   80.84   79.68   80.37   79.78   80.16   79.73

Run the iterations in a loop

So this was a bit of preparation and validation work. We are now going to use a loop to iterate through the 5 CV sets. The data are converted to the H2O data class for both the train set (in the loop) and the test set (directly below). Note that I have turned off the progress bar, simply to avoid output that we do not need at the moment.

h2o.no_progress()
dtest_h2o<-as.h2o(test)

Several parameters need to be set for Deep Learning, such as dropout, which is a good way to deal with overfitting, and the number of hidden layers and neurons. In-depth discussions of these parameters can be found elsewhere [6].

h2o.no_progress()
for(i in 1:5)
  {
  # indices of the 4 folds used for training in this iteration
  h<-c(d1[[x[i,1]]],d1[[x[i,2]]],d1[[x[i,3]]],d1[[x[i,4]]])
  dtrain_h2o<-as.h2o(train[h,])
  # the remaining fold serves as validation set
  label<-train[-h,1]
  dval_h2o<-as.h2o(train[-h,-1])
  model = h2o.deeplearning(x = 2:577,                  # pixel columns (24 x 24 = 576)
                           y = 1,                      # label column
                           training_frame = dtrain_h2o,
                           activation = "RectifierWithDropout",
                           input_dropout_ratio = 0.2,
                           hidden_dropout_ratios = c(0.5,0.5),
                           balance_classes = TRUE,
                           hidden = c(100,100),        # two hidden layers of 100 neurons
                           momentum_stable = 0.99,
                           nesterov_accelerated_gradient = T,
                           epochs = 15)
  # predict on the validation fold and compute the accuracy
  h2o_y_dval <- h2o.predict(model, dval_h2o)
  df_y_dval <- as.data.frame(h2o_y_dval)
  df_y_dval <- data.frame(ImageId = seq(1,length(df_y_dval$predict)), predict = df_y_dval$predict)
  valtab<-table(df_y_dval$predict, label)
  valacc<-sum(diag(valtab)) / sum(valtab)
  assign(paste("accuracy", i, sep=""), valacc)
  assign(paste("valtab", i, sep=""), valtab)

  # predict on the test set and store the predictions of this CV iteration
  h2o_y_dtest <- h2o.predict(model, dtest_h2o)
  df_y_dtest <- as.data.frame(h2o_y_dtest)
  df_y_dtest <- data.frame(ImageId = seq(1,length(df_y_dtest$predict)), predict = df_y_dtest$predict)
  assign(paste("predict_test", i, sep=""), df_y_dtest)
  }

Cross validation results

We are almost there. We now have results for the 5 holdout (validation) sets. The next table displays the accuracy after additional preprocessing.

CV1 CV2 CV3 CV4 CV5
0.952 0.954 0.953 0.956 0.954
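As a side note, the table above can be obtained by collecting the per-fold accuracies that were assigned in the loop (accuracy1 to accuracy5); a minimal sketch:

# collect the validation accuracies of the 5 CV iterations into one row
cvacc <- data.frame(CV1 = accuracy1, CV2 = accuracy2, CV3 = accuracy3,
                    CV4 = accuracy4, CV5 = accuracy5)
round(cvacc, 3)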

Without additional preprocessing - using the data as provided at Kaggle - the accuracies are lower. The results are provided here.

CV1 CV2 CV3 CV4 CV5
0.943 0.947 0.946 0.949 0.946

Testset

For the test set, we now have 5 predictions per image. We can use a majority vote to determine which class occurs most frequently and is most likely the right category. For this purpose, we use a Mode function [7]. After submitting the result at Kaggle, we get another impression of the performance (next to the CV).

# returns the most frequent value in x (ties are resolved by first occurrence)
Mode <- function(x, na.rm = FALSE) {
  if(na.rm){
    x = x[!is.na(x)]
  }
  ux <- unique(x)
  return(ux[which.max(tabulate(match(x, ux)))])
}
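A quick illustration of how this function behaves (the input values are just an example):

Mode(c(3, 7, 3, 1, 3))   # returns 3, the most frequent value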

We now put all the predictions in a data.frame and apply the Mode function row-wise to calculate the majority vote for each test case. Using the table function, we can first inspect whether the distribution of the predicted classes looks plausible.

pred<-data.frame(CV1=as.numeric(as.character(predict_test1$predict)),
                 CV2=as.numeric(as.character(predict_test2$predict)),
                 CV3=as.numeric(as.character(predict_test3$predict)),
                 CV4=as.numeric(as.character(predict_test4$predict)),
                 CV5=as.numeric(as.character(predict_test5$predict)))
Majvote<-apply(pred[,c(1:5)],1, Mode)
table(Majvote)
## Majvote
##    0    1    2    3    4    5    6    7    8    9 
## 2745 3184 2852 2793 2827 2543 2782 2982 2577 2715

You may like to write out the results and try them at Kaggle, in the “Digit Recognizer” competition.

The following code can be used to create a comma-separated values (csv) file that can be submitted at Kaggle [4]. Set pa to the directory where you would like to save the file.

The resulting file can be uploaded at Kaggle to check the performance of your algorithm. The result I got was 0.947, quite in the range of the 5-fold CV results.

Kaggle = data.frame(ImageId = seq(1,length(Majvote)), Label = Majvote)
write.csv(Kaggle, file = paste(pa, "Submission-Kaggle-AfterPreproc.csv", sep=""), row.names=F)

About the results

The data show that the performance of H2O Deep Learning improved after additional preprocessing. From these results it is, however, not clear which preprocessing step (i.e., scaling, thresholding, centralizing, or framing) contributed most to this improvement. It suggests at least that preprocessing can facilitate Deep Learning.

It is comforting to see that both the CV and the Kaggle test results show a solid and quite similar improvement. The performance at Kaggle rose to 0.960 (from 0.947, an increase of 0.013). In the CV sets, average performance rose to 0.956 (from 0.946, an increase of 0.010). This indicates that the results at Kaggle parallel those from the CV sets.

It should be mentioned here that a number of factors could still improve performance: (1) adding other preprocessing steps that remove noise and facilitate learning; (2) tuning parameters - at this level, standard settings were chosen, and care was taken to use identical settings for unprocessed and preprocessed data in order to produce a fair comparison (a minimal tuning sketch is given below); (3) different algorithms and frameworks, such as TensorFlow, may augment learning; (4) data augmentation (extending the train set by adding noise or applying transformations to existing data) and other ways of increasing the amount of training data may improve learning as well.
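As an illustration of point (2), a minimal sketch of hyperparameter tuning with h2o.grid is shown below. The grid id and the parameter values are illustrative assumptions, not the settings used in this blog:

# sketch: tune the hidden-layer sizes and the input dropout ratio with a grid search
hyper_params <- list(hidden = list(c(100,100), c(200,200)),
                     input_dropout_ratio = c(0.1, 0.2))
grid <- h2o.grid("deeplearning",
                 grid_id = "dl_grid",
                 x = 2:577, y = 1,
                 training_frame = dtrain_h2o,
                 nfolds = 5,
                 activation = "RectifierWithDropout",
                 epochs = 15,
                 hyper_params = hyper_params)
# rank the tuned models by cross-validated logloss (lower is better)
h2o.getGrid("dl_grid", sort_by = "logloss", decreasing = FALSE)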

Conclusions

The present blog showed how to implement 5-fold cross-validation and Deep Learning in H2O. The script can very easily be transferred to other ML algorithms.