df<-data.frame(subject=as.factor(c("Pierre","Anne","Joyce","Peter","Alan","Camille")), age=as.numeric(c(20, 16, 19, 99, 23, 18)), sex=as.factor(c("M","F","F","M",NA,"F")), height=as.numeric(c(172, 181, 165, 168, 177, 178)), speed=as.numeric(c(11.20,3.00,11.50,10.35,10.98,13.05)))
Check the data.frame df by simply typing in:
df
## subject age sex height speed
## 1 Pierre 20 M 172 11.20
## 2 Anne 16 F 181 3.00
## 3 Joyce 19 F 165 11.50
## 4 Peter 99 M 168 10.35
## 5 Alan 23 <NA> 177 10.98
## 6 Camille 18 F 178 13.05
The data has 6 rows and 5 columns:
dim(df)
## [1] 6 5
Because the data are not that big you are able to view them entirely in your R studio. However, if you have very big data you may want to view only the first lines:
head(df, 4)
## subject age sex height speed
## 1 Pierre 20 M 172 11.20
## 2 Anne 16 F 181 3.00
## 3 Joyce 19 F 165 11.50
## 4 Peter 99 M 168 10.35
To inspect the structure of the data:
str(df)
## 'data.frame': 6 obs. of 5 variables:
## $ subject: Factor w/ 6 levels "Alan","Anne",..: 6 2 4 5 1 3
## $ age : num 20 16 19 99 23 18
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 NA 1
## $ height : num 172 181 165 168 177 178
## $ speed : num 11.2 3 11.5 10.3 11 ...
Some basic statistics can be given by the summary command:
summary(df)
## subject age sex height speed
## Alan :1 Min. :16.00 F :3 Min. :165.0 Min. : 3.00
## Anne :1 1st Qu.:18.25 M :2 1st Qu.:169.0 1st Qu.:10.51
## Camille:1 Median :19.50 NA's:1 Median :174.5 Median :11.09
## Joyce :1 Mean :32.50 Mean :173.5 Mean :10.01
## Peter :1 3rd Qu.:22.25 3rd Qu.:177.8 3rd Qu.:11.43
## Pierre :1 Max. :99.00 Max. :181.0 Max. :13.05
To examine if missing values (NA) are in the data:
is.na(df)
## subject age sex height speed
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE TRUE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE
If you want the row and column index where NAs occur:
which(is.na(df), arr.ind=T)
## row col
## [1,] 5 3
We can use boxplots to see if there are impossible extreme values:
par(mfrow=c(1,3))
boxplot(df$age, main="Age (yrs)", cex.lab=2.0, cex.axis=2.0, cex.main=1.6, cex=2.0, col="yellow")
boxplot(df$height, main="Height (cm)", cex.lab=2.0, cex.axis=2.0, cex.main=1.6, cex=2.0, col="red")
boxplot(df$speed, main="Speed (ms)", cex.lab=2.0, cex.axis=2.0, cex.main=1.6, cex=2.0, col="orange")
Compute the mean before the removal of outliers:
mean(df$age)
## [1] 32.5
Check if there are cases that are older than 40 years:
df$age>40
## [1] FALSE FALSE FALSE TRUE FALSE FALSE
Replace the case(s) older than 40 with a missing value (NA).
df$age[df$age>40]<-NA
Compute again the mean age, allowing to remove missing values (NAs):
mean(df$age, na.rm=TRUE)
## [1] 19.2
Mean age for male and female participants.
aggregate(age ~ sex, data=df, FUN=mean, na.rm=TRUE)
## sex age
## 1 F 17.66667
## 2 M 20.00000
Use a scatterplot to display the relation between age and speed:
par(mfrow=c(1,1))
plot(speed ~ age, data=df, col="blue", pch=1, cex=1.2)
Replace the outlier of 3.00 ms in the variable df$speed with a NA.
Answer There are multiple solutions possible:
df$speed[df$speed==3.00]<-NA
Another solution:
df[2,5]<-NA
To verify that the value was indeed replaced by a NA use the command is.na
:
is.na(df$speed)
## [1] FALSE TRUE FALSE FALSE FALSE FALSE
A dataset could erroneously have double records (duplicates). This is bad and should be removed.
Seeing the current dataset, what would be a way to discover duplicates in the variable df$subject?
Check your solution for the following data.frame that contains double data:
df<-data.frame(subject=as.factor(c("Pierre","Anne","Joyce","Peter","Alan","Camille", "Pierre")), age=as.numeric(c(20, 16, 19, 99, 23, 18, 20)), sex=as.factor(c("M","F","F","M",NA,"F", "M")), height=as.numeric(c(172, 181, 165, 168, 177, 178, 172)), speed=as.numeric(c(11.20,3.00,11.50,10.35,10.98,13.05, 11.20)))
Answer
A way to inspect for double records is to use the table
function.
table(df$subject)
##
## Alan Anne Camille Joyce Peter Pierre
## 1 1 1 1 1 2
But still simpler, using the summary
function would also display this:
summary(df)
## subject age sex height speed
## Alan :1 Min. :16.00 F :3 Min. :165.0 Min. : 3.00
## Anne :1 1st Qu.:18.50 M :3 1st Qu.:170.0 1st Qu.:10.66
## Camille:1 Median :20.00 NA's:1 Median :172.0 Median :11.20
## Joyce :1 Mean :30.71 Mean :173.3 Mean :10.18
## Peter :1 3rd Qu.:21.50 3rd Qu.:177.5 3rd Qu.:11.35
## Pierre :2 Max. :99.00 Max. :181.0 Max. :13.05