A GUIDE FOR TESTING YOUR OWN HANDWRITTEN DIGITS USING R
Ruthger Righart
Blogs, tutorials & videos: https://rrighart.github.io
E-Mail: rrighart@googlemail.com
INTRODUCTION
The handwritten digits set MNIST has been investigated extensively [1]. A variety of machine learning (ML) methods have achieved over 95% classification accuracy on this set.
In this blog post, I walk through preprocessing steps that can be used for testing your own handwritten digits. The steps covered are not exhaustive, and others may be added.
Why another blog post about this topic? The famous MNIST set is used in many research projects, whereas self-made sets are not commonplace. In my experience, using your own homemade set is great fun, and an exciting way to learn about digit recognition and the preprocessing problems that come with it. R code for this kind of preprocessing is hard to find, and for this reason I propose a pipeline that can serve as a starting point for building your own from scratch. The resulting digits are put in a format so they can be tested with an appropriate machine learning algorithm.
LOADING DATA
As a first step, I filled an A4 sheet with my handwritten digits and scanned it with my home scanner. The resulting JPG file can be read from my GitHub repository. The library “jpeg” can be installed using install.packages("jpeg").
library(jpeg)
myurl <- "https://raw.githubusercontent.com/RRighart/Digits/master/HandwrittenDigits.JPG"
z <- tempfile()
download.file(myurl,z,mode="wb")
img <- readJPEG(z)
file.remove(z)
## [1] TRUE
We use the function image to display the scanned A4 sheet:
par(mfrow=c(1,1),
oma = c(0.5,0.5,0.5,0.5) + 0.1,
mar = c(0,0,0,0) + 0.1)
image(t(apply(img[c(1:dim(img)[1]), c(1:dim(img)[2]), 1], 2, rev)), col=grey.colors(255), axes=F, asp=1)
mtext("Whole image of handwritten digits", cex=0.6, col="red")
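The expression t(apply(x, 2, rev)) in the call above deserves a word. image() puts the row index of a matrix on the x-axis and the column index on the y-axis, starting at the bottom left, so a matrix passed in directly appears rotated 90 degrees counter-clockwise. Reversing each column and then transposing rotates the matrix 90 degrees clockwise, which cancels this out. A minimal sketch on a toy matrix (the values are made up for illustration):

```r
# Toy 2 x 3 "image": row 1 is the top line, column 1 the left edge.
m <- matrix(1:6, nrow = 2)

# Reverse each column, then transpose: a 90-degree clockwise rotation.
rot <- t(apply(m, 2, rev))

print(m)
print(rot)
```

The top-left value of m (the 1) ends up in the first row, last column of rot, which is exactly the cell that image() draws at the top left.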
RESIZING DATA
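Squared paper with 5x5 mm squares was used to make the handwriting easier. In the original file each line has 42 digits, each digit taking one 5x5 mm square. Like the MNIST data, each digit should be reframed to 28x28 pixels. That means that the whole image should be resized to a width of 42x28=1176 pixels. We will do this using the resize function from the EBImage package. Note that EBImage treats the first matrix dimension as the image width, which is why the argument h sets the number of columns here. EBImage is distributed through Bioconductor rather than CRAN, so if it is not on your machine, use BiocManager::install("EBImage") (after install.packages("BiocManager") if needed).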
library(EBImage)
ximg<-img[c(1:dim(img)[1]), c(1:dim(img)[2]), 1]
nhsq=42
pix=28
nimg <- resize(ximg, h = nhsq*pix)
dim(nimg)
## [1] 1663 1176
A small but important detail: if we look carefully at the whole image of handwritten digits, we can see that the last rows are empty. This happened because the A4 sheet did not entirely fill the glass plate of the scanner, so we need to remove them. It turns out that 56 lines of squares were used, which corresponds to 28*56=1568 pixel rows:
nimg<-nimg[1:1568, ]
dim(nimg)
## [1] 1568 1176
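The numbers used so far all follow from three quantities, so it is easy to recompute them for a sheet with a different layout. A small sketch of the arithmetic; the name nvsq is my own addition here and does not appear in the code above:

```r
pix  <- 28   # pixels per square side, as in MNIST
nhsq <- 42   # squares per line on the sheet
nvsq <- 56   # lines of squares actually filled (hypothetical name)

width   <- nhsq * pix                      # pixel columns after resizing
height  <- nvsq * pix                      # pixel rows after cropping
n_tiles <- (height / pix) * (width / pix)  # number of 28 x 28 squares

cat(width, height, n_tiles, "\n")
```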
SEGMENTING DATA
Segmenting digits by manually typing in row and column numbers is laborious. It would be nice to have a function that does all the work for us. The function matsplitter [2] does this. In addition, the resulting array format is slightly easier to handle.
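matsplitter<-function(M, r, c) {
  # block-row and block-column index of every cell
  rg <- (row(M)-1)%/%r+1
  cg <- (col(M)-1)%/%c+1
  # running block number, counted line by line
  rci <- (rg-1)*max(cg) + cg
  N <- prod(dim(M))/r/c
  # collect the cells of each block and stack the blocks in an r x c x N array
  cv <- unlist(lapply(1:N, function(x) M[rci==x]))
  dim(cv)<-c(r,c,N)
  cv
}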
Note that matsplitter needs to know the size of the submatrices, which is 28 by 28:
nimg<-nimg[c(1:dim(nimg)[1]), ]
dat<-matsplitter(nimg, 28, 28)
class(dat)
## [1] "array"
dim(dat)
## [1] 28 28 2352
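To see how matsplitter numbers the tiles, the index arithmetic can be replayed on a toy matrix. This is only a sketch: the 4x6 matrix and the names br and bc are assumptions for illustration, while the three index lines are taken from the function itself.

```r
M  <- matrix(1:24, nrow = 4)   # toy 4 x 6 matrix
br <- 2                        # block height (hypothetical name)
bc <- 2                        # block width (hypothetical name)

rg  <- (row(M) - 1) %/% br + 1   # block-row index of every cell
cg  <- (col(M) - 1) %/% bc + 1   # block-column index of every cell
rci <- (rg - 1) * max(cg) + cg   # running block number
print(rci)

# Cells of block 2, reshaped back into a br x bc matrix.
block2 <- matrix(M[rci == 2], nrow = br)
print(block2)
```

The tiles are numbered along the first line of squares, then the second, and so on. This is the order that the labels in the next section have to follow.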
ADDING LABELS
As I prepared the digits myself, preparing the corresponding labels is not too much of a hassle; they can be summarized in a vector called labels. Each line on the sheet holds an empty square, the digits 0 to 9 repeated four times, and another empty square. As a check, the table function shows, as expected, that each category has 224 digits (4 per line x 56 lines).
labels=rep(c(NA, rep(seq.int(0, 9, by=1),4), NA), 56)
table(labels)
## labels
## 0 1 2 3 4 5 6 7 8 9
## 224 224 224 224 224 224 224 224 224 224
The value NA (missing value) indicates the “empty” images. These are the borders that did not contain any digit but were nevertheless included while scanning. These are removed by using !is.na.
ndat<-dat[,,which(!is.na(labels))]
nlabels<-labels[which(!is.na(labels))]
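As a quick sanity check, the label layout can be rebuilt for a single line of the sheet and compared with the number of squares. The sketch below uses the hypothetical names one_row and labels_toy so that it does not overwrite anything from the pipeline:

```r
# One line of the sheet: empty square, 0-9 four times, empty square.
one_row <- c(NA, rep(0:9, 4), NA)
cat("squares per line:", length(one_row), "\n")

# All 56 lines, in the same line-by-line order as the tiles.
labels_toy <- rep(one_row, 56)
n_kept <- sum(!is.na(labels_toy))
cat("labelled digits:", n_kept, "of", length(labels_toy), "tiles\n")
```

If the length of the label vector did not match the third dimension of dat, the subsetting with !is.na would silently pick the wrong tiles, so this is worth checking before moving on.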
Let us display some example digits, with the labels in the header:
par(mfrow=c(2,5),
oma = c(3,3,3,3) + 0.1,
mar = c(0,0.1,0,0.1) + 0.1)
for(i in 1:10){
image(t(apply(ndat[, ,i], 2, rev)), col=grey.colors(255), axes=F, asp=1); mtext(nlabels[i], cex=0.8, col="red", side=3, line=-1)
}
CREATING NEGATIVE IMAGES
The scanned handwritten digits are dark (low intensity) on a light (high intensity) background, whereas MNIST is the opposite. To get digits that better resemble those of MNIST, we need to invert the image intensities, which we can do by subtracting each pixel's intensity from the maximum value of its image.
neg <- function(M,i){
  # subtract image i from its own maximum intensity
  max(M[,,i])-M[,,i]
}
mmat<-array(0,dim=dim(ndat))
for(i in 1:dim(ndat)[3]){
  mmat[,,i]<-neg(ndat,i)
}
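What the inversion does is easiest to see on a tiny example. A minimal sketch with two made-up 2x2 "images" (the values and the names toy and inverted are assumptions for illustration):

```r
# Two toy 2 x 2 images stacked in an array (hypothetical values).
toy <- array(c(0.9, 0.2, 0.8, 0.9,    # image 1: bright paper, one dark pixel
               0.7, 0.7, 0.1, 0.6),   # image 2
             dim = c(2, 2, 2))

inverted <- array(0, dim = dim(toy))
for (i in seq_len(dim(toy)[3])) {
  # Subtract every pixel from the maximum of its own image.
  inverted[, , i] <- max(toy[, , i]) - toy[, , i]
}

print(toy[, , 1])
print(inverted[, , 1])
```

The brightest pixel of each image becomes 0 (the new background) and the darkest pixel becomes the largest value, so the ink now stands out against a dark background.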
Let’s have a look at the result:
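A minimal sketch of how such a panel could be drawn, reusing the plotting idiom from the labelled examples earlier. The helper show_digits and the random stand-in array are assumptions for illustration; in the pipeline one would pass mmat and nlabels instead.

```r
# Hypothetical helper: draw the first n images of a 3-D array in a grid.
show_digits <- function(arr, labs = NULL, n = 10, per_row = 5) {
  n <- min(n, dim(arr)[3])
  old <- par(mfrow = c(ceiling(n / per_row), per_row),
             oma = c(3, 3, 3, 3) + 0.1,
             mar = c(0, 0.1, 0, 0.1) + 0.1)
  on.exit(par(old))   # restore the previous graphics settings
  for (i in seq_len(n)) {
    image(t(apply(arr[, , i], 2, rev)), col = grey.colors(255),
          axes = FALSE, asp = 1)
    if (!is.null(labs)) {
      mtext(labs[i], cex = 0.8, col = "red", side = 3, line = -1)
    }
  }
  invisible(n)        # number of images drawn
}

# Stand-in data so the sketch runs on its own; replace with mmat and nlabels.
set.seed(1)
demo <- array(runif(28 * 28 * 10), dim = c(28, 28, 10))
shown <- show_digits(demo, labs = 0:9)
```

The same helper could be called on the scaled, thresholded, centered and cropped arrays further down to compare the stages side by side.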
AVERAGE AND STANDARD DEVIATION (SD) OF DIGIT INTENSITIES (AT THIS POINT…)
Displaying the average image for each category (i.e., 10 digit categories of 224 images each) gives us an impression of the amount of noise. Without further preprocessing, these images are quite blurry.
par(mfrow=c(2,5),
oma = c(3,3,3,3) + 0.1,
mar = c(0,0.1,0,0.1) + 0.1)
for(i in 0:9){
tm<-apply(mmat[,,which(nlabels==i)], c(1,2), mean)
image(t(apply(tm, 2, rev)), col=grey.colors(255), axes=F, asp=1); mtext(i, cex=0.8, col="red", side=3, line=-1)
}
Another insightful measure of variation is the pixelwise standard deviation (SD), which also indicates where noise is present. For example, for digit 7 we can see some high-intensity pixels in the upper left corner that should not be there.
par(mfrow=c(2,5),
oma = c(3,3,3,3) + 0.1,
mar = c(0,0.1,0,0.1) + 0.1)
for(i in 0:9){
tm<-apply(mmat[,,which(nlabels==i)], c(1,2), sd)
image(t(apply(tm, 2, rev)), col=grey.colors(255), axes=F, asp=1); mtext(i, cex=0.8, col="red", side=3, line=-1)
}
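Why is the SD image useful next to the mean image? A pixel that is bright in every image has a high mean but a low SD, whereas a speck that shows up in only some images has a modest mean but a high SD. A small simulated sketch of this, where the array stack and all its values are made up for illustration:

```r
set.seed(42)
# Hypothetical stack of 50 noisy 4 x 4 images.
stack <- array(rnorm(4 * 4 * 50, mean = 0.1, sd = 0.02), dim = c(4, 4, 50))
stack[2, 3, ]     <- stack[2, 3, ] + 0.8      # "stroke" pixel, in all images
stack[1, 1, 1:10] <- stack[1, 1, 1:10] + 0.8  # speck, in only 10 images

avg <- apply(stack, c(1, 2), mean)   # pixelwise mean over the third dimension
sdv <- apply(stack, c(1, 2), sd)     # pixelwise standard deviation

print(which(avg == max(avg), arr.ind = TRUE))   # the consistent stroke pixel
print(which(sdv == max(sdv), arr.ind = TRUE))   # the intermittent speck
```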
HISTOGRAMS OF IMAGE INTENSITIES
To see how image intensities are distributed, it helps to look at histograms. Here we take a look at the histograms of the mean intensity values. As expected, most values are in the lower tail, particularly for digits “1” and “7”, where most of the image consists of “black” background pixels.
par(mfrow=c(2,5),
oma = c(2,2,2,2) + 0.1,
mar = c(2,2,2,2) + 0.1)
for(i in 0:9){
tm<-apply(mmat[,,which(nlabels==i)], c(1,2), mean)
hist(tm, labels=FALSE, axes=TRUE, freq=FALSE, col="black", xlim=c(0,1), ylim=c(0,16), main=i)
}
SCALE IMAGES
As intensities differ considerably across images, it would be good to scale the intensities of each image to the range 0-1. For this purpose, we apply the function range01.
range01 <- function(M){(M-min(M))/(max(M)-min(M))}
scmat<-array(0,dim=dim(mmat))
for(i in 1:dim(mmat)[3]){
scmat[,,i]<-range01(mmat[,,i])
}
Let’s see what that looks like for a couple of digits:
At first glance, this looks quite similar. To check if the scaling worked, we can compute some simple summary statistics. The minimum and maximum values in each image matrix should be 0 and 1. Here I just display the first 10 values.
apply(scmat[,,c(1:10)], 3, min)
## [1] 0 0 0 0 0 0 0 0 0 0
apply(scmat[,,c(1:10)], 3, max)
## [1] 1 1 1 1 1 1 1 1 1 1
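One caveat with this kind of min-max scaling: an image in which all pixels have the same value (for example a completely blank square) gives 0/0. In my set this did not happen, but a guarded variant is easy to write. The function range01_safe below is a hypothetical addition, not part of the pipeline above:

```r
range01 <- function(M) { (M - min(M)) / (max(M) - min(M)) }

# Hypothetical guarded variant: a constant image is mapped to all zeros
# instead of NaN.
range01_safe <- function(M) {
  span <- max(M) - min(M)
  if (span == 0) return(M * 0)
  (M - min(M)) / span
}

flat <- matrix(0.5, nrow = 2, ncol = 2)           # constant image
ramp <- matrix(c(0.1, 0.3, 0.5, 0.9), nrow = 2)   # ordinary image

print(range(range01_safe(ramp)))         # 0 1
print(any(is.nan(range01(flat))))        # TRUE, because of 0 / 0
print(any(is.nan(range01_safe(flat))))   # FALSE
```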
THRESHOLD IMAGES
A way to remove background noise is to use a threshold. One could try different values, and for example when using 0.2 most of the background is removed.
thresh <- function(M){ifelse(M<0.2, 0, M)}
thmat<-thresh(scmat)
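The value 0.2 is a judgment call, so it is worth trying a few cut-offs before settling on one. A small sketch on made-up values; thresh_at is a hypothetical variant of thresh with the cut-off exposed as an argument:

```r
# Hypothetical variant of thresh(): the cut-off is an argument.
thresh_at <- function(M, cutoff = 0.2) {
  M[M < cutoff] <- 0   # values below the cut-off become background
  M
}

# Made-up scaled intensities.
scaled <- matrix(c(0.05, 0.15, 0.25, 1.00,
                   0.10, 0.19, 0.20, 0.80), nrow = 2, byrow = TRUE)

for (cutoff in c(0.1, 0.2, 0.3)) {
  kept <- mean(thresh_at(scaled, cutoff) > 0)
  cat("cutoff", cutoff, "keeps", kept, "of the pixels\n")
}
```

Values equal to the cut-off are kept, since only values strictly below it are set to 0. Pixels above the cut-off keep their original intensity, so the digits are not binarized.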
The resulting images look as follows:
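CENTER IMAGES
Many digits are actually a bit off center, so it would be good to center them. We can do this using a bounding box. Images whose box is smaller than 4 pixels or larger than 26 pixels in either direction are skipped and stay empty.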
bmat<-array(0,dim=dim(thmat))
for(i in 1:dim(thmat)[3]){
  temp<-thmat[,,i]
  # keep only the rows and columns that contain non-zero pixels
  w<-temp[apply(temp,1,mean)>0,apply(temp,2,mean)>0]
  # skip images whose box is degenerate, too small or too large
  if(is.null(dim(w))) next
  if(dim(w)[1]<4) next
  if(dim(w)[2]<4) next
  if(dim(w)[1]>26) next
  if(dim(w)[2]>26) next
  # paste the box into the middle of an empty 28x28 frame
  bim<-matrix(rep(0,28*28),nrow=28)
  ly=floor(((dim(bim)[1]-dim(w)[1])/2)+0.5)
  uy=ly+dim(w)[1]-1
  lx=floor(((dim(bim)[2]-dim(w)[2])/2)+0.5)
  ux=lx+dim(w)[2]-1
  bim[c(ly:uy),c(lx:ux)]<-w
  bmat[,,i]<-bim
}
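The offset arithmetic can be checked on a toy image with a known blob. The helper center_digit below is a hypothetical rewrite for a single image, not the loop above: it uses the same offset rule and size limits, but takes a true bounding box (everything between the first and last non-zero row and column) instead of dropping every empty row and column.

```r
# Hypothetical helper: center the non-zero region of one image.
center_digit <- function(temp, size = 28) {
  blank <- matrix(0, size, size)
  rows <- which(rowSums(temp) > 0)
  cols <- which(colSums(temp) > 0)
  if (length(rows) == 0 || length(cols) == 0) return(blank)
  # Bounding box between the first and last non-zero row / column.
  w <- temp[min(rows):max(rows), min(cols):max(cols), drop = FALSE]
  # Same size limits as the loop: skip boxes below 4 or above size - 2.
  if (min(dim(w)) < 4 || max(dim(w)) > size - 2) return(blank)
  # Same offset rule as the loop.
  ly <- floor((size - nrow(w)) / 2 + 0.5)
  lx <- floor((size - ncol(w)) / 2 + 0.5)
  blank[ly:(ly + nrow(w) - 1), lx:(lx + ncol(w) - 1)] <- w
  blank
}

toy <- matrix(0, 28, 28)
toy[2:9, 3:8] <- 1                 # an 8 x 6 blob near the upper-left corner
centered <- center_digit(toy)

print(range(which(rowSums(centered) > 0)))   # 10 17
print(range(which(colSums(centered) > 0)))   # 11 16
```

For this blob the rule leaves 9 empty rows above and 11 below, so when the margin is even the box sits one pixel toward the top left. An offset of floor((size - h) / 2) + 1 would be a symmetric alternative, if that matters for your classifier.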
Now we will display the result:
SELECT A FRAME OF 24X24 PIXELS
We currently have a frame of 28x28 pixels, but there are actually a lot of “empty” pixels around the digit. We could reduce the frame to 24x24 pixels, effectively stripping 2 rows and 2 columns from each side. This is very easy in the array format:
sfr<-bmat[c(3:26), c(3:26), ]
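Since the centering step accepts boxes of up to 26 pixels, it may be worth checking that the crop does not cut into any digit. A minimal sketch of such a check, assuming a 28x28xN array; the helper border_ink and the two demo images are made up for illustration:

```r
# Hypothetical check: total intensity lost per image when keeping 3:26.
border_ink <- function(arr) {
  inner <- arr[3:26, 3:26, , drop = FALSE]
  apply(arr, 3, sum) - apply(inner, 3, sum)
}

demo <- array(0, dim = c(28, 28, 2))
demo[5:24, 8:21, 1] <- 1   # fits inside the 24 x 24 frame
demo[2:27, 8:21, 2] <- 1   # 26 rows tall: reaches into the border

lost <- border_ink(demo)
print(lost)                # intensity removed by the crop, per image
print(which(lost > 0))     # images that would be truncated
```

In the pipeline one could call border_ink(bmat) and inspect the flagged images before deciding on the frame size.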
After cutting the borders, the images are as follows:
DISPLAY AVERAGE DIGITS
Now that we have finished all preprocessing steps, it would be good to display the average intensity for each digit again, in order to check for irregularities, outliers or trends. The images look much cleaner than the average images that we had before preprocessing.
par(mfrow=c(2,5),
oma = c(3,3,3,3) + 0.1,
mar = c(0,0.1,0,0.1) + 0.1)
for(i in 0:9){
tm<-apply(sfr[,,which(nlabels==i)], c(1,2), mean)
image(t(apply(tm, 2, rev)), col=grey.colors(255), axes=F, asp=1); mtext(i, cex=0.8, col="red", side=3, line=-1)
}
BRING ARRAY TO MATRIX
Now we are going to bring the data into a data frame format that is commonly used for training and testing Machine Learning algorithms. Each row holds one image, and the 24x24=576 pixel values form the columns:
ownset<-aperm(sfr, c(3,2,1))
dim(ownset)<-c(dim(sfr)[3],576)
ownset<-data.frame(ownset)
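The order in which the pixels end up in the columns is not obvious from the aperm call, so here is a toy version of the same three lines. The two 2x3 images are made up, with values numbered line by line so that the order is easy to read off:

```r
# Two toy 2 x 3 images (hypothetical values), numbered line by line.
imgs <- array(0, dim = c(2, 3, 2))
imgs[, , 1] <- matrix(11:16, nrow = 2, byrow = TRUE)
imgs[, , 2] <- matrix(21:26, nrow = 2, byrow = TRUE)

flat <- aperm(imgs, c(3, 2, 1))   # dimensions are now: image, column, row
dim(flat) <- c(dim(imgs)[3], prod(dim(imgs)[1:2]))
flat <- data.frame(flat)
print(flat)
```

Each image becomes one row of the data frame, with its pixels read line by line from the top left. When comparing with an MNIST file, make sure its pixels are stored in the same order.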
CONCLUSIONS
Using several preprocessing steps in R, we have seen that a nice clean set of digits can be obtained.
An important question is whether the MNIST training data can be used to classify digits of our own set. As this post is an introduction to preprocessing such digits in R, I will not go into detail about classification.
However, out of curiosity I ran a Deep Learning (DL) algorithm with basic parameter settings at this stage. Similar preprocessing was used for the MNIST training set. The DL model reached around 66% accuracy (random guessing would give 10%).
This is quite low when compared with performance on the MNIST test set, which is around 95%. A number of reasons may explain this:
1). The preprocessing is incomplete and was missing some essential steps, such as deskewing and controlling the aspect ratio [3].
2). My distinctive handwriting style and the quality of the image files (e.g., scanner noise) may also influence performance.
3). A new training set of my own handwritten digits would be needed.
Of course, it can be a mix of these factors (and others not mentioned here).
Please do not hesitate to tell me your ideas. Hope you enjoyed this journey in starting your own digit set in R!
REFERENCES
[1] The MNIST training and test sets are available at http://yann.lecun.com/exdb/mnist/ and http://www.kaggle.com.
[2] The matsplitter function was obtained from StackOverflow, at http://stackoverflow.com/questions/24299171/function-to-split-a-matrix-into-sub-matrices-in-r.
[3] VenkateswaraRao et al. 2011. An efficient feature extraction and classification of handwritten digits using neural networks. International Journal of Computer Science, Engineering and Applications (IJCSEA), Vol. 1, No. 5, 47-56.