This script shows how to run exploratory analyses in Python. A YouTube video, a script in the R programming language, and additional materials can be found on my GitHub page.
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = {'subject': ['Pierre','Anne','Joyce','Peter','Alan','Camille'], 'age': [20, 16, 19, 99, 23, 18], 'sex': ['M','F','F','M', np.nan, 'F'], 'height': [172, 181, 165, 168, 177, 178], 'speed': [11.20, 3.00, 11.50, 10.35, 10.98, 13.05]}
df = pd.DataFrame(data, columns = ['subject','age','sex','height','speed'])
df
df.shape
df.head(4)
df.dtypes
The describe method returns basic summary statistics for the numeric columns. Use np.round(df.describe(), decimals=2) to round the output to two decimals.
df.describe()
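As a small sketch of the options mentioned above: the DataFrame method round(2) is an alternative to np.round, and include='all' (an addition not used in the original script) extends the summary to the non-numeric columns. The data are the same as defined earlier.

```python
import numpy as np
import pandas as pd

# Same example data as defined earlier in this script
data = {'subject': ['Pierre', 'Anne', 'Joyce', 'Peter', 'Alan', 'Camille'],
        'age': [20, 16, 19, 99, 23, 18],
        'sex': ['M', 'F', 'F', 'M', np.nan, 'F'],
        'height': [172, 181, 165, 168, 177, 178],
        'speed': [11.20, 3.00, 11.50, 10.35, 10.98, 13.05]}
df = pd.DataFrame(data)

# Numeric columns only, rounded to two decimals
summary = df.describe().round(2)
print(summary)

# All columns: adds unique/top/freq for the text columns
summary_all = df.describe(include='all')
print(summary_all)
```

Note that the count for sex is lower than for the other columns, which is a first hint that this column contains a missing value.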
Use pd.isnull to inspect whether, and where, there are missing values (NaN) in the data. The command df.isnull() gives the same result.
pd.isnull(df)
To get the row and column positions where NaNs occur, combine pd.isnull with np.where. Remember that Python indexing starts at zero, so the first row and the first column have index 0.
np.where(pd.isnull(df))
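The positional output of np.where can be hard to read on larger data. The following sketch, which is one possible approach and not part of the original script, counts the missing values per column and translates the positions into a subject name and a column name. The variable names missing_per_column and locations are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Same example data as defined earlier in this script
data = {'subject': ['Pierre', 'Anne', 'Joyce', 'Peter', 'Alan', 'Camille'],
        'age': [20, 16, 19, 99, 23, 18],
        'sex': ['M', 'F', 'F', 'M', np.nan, 'F'],
        'height': [172, 181, 165, 168, 177, 178],
        'speed': [11.20, 3.00, 11.50, 10.35, 10.98, 13.05]}
df = pd.DataFrame(data)

# Number of missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Row and column positions of the NaNs, translated into readable labels
rows, cols = np.where(df.isnull())
locations = [(df['subject'].iloc[r], df.columns[c]) for r, c in zip(rows, cols)]
print(locations)
```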
We can use boxplots to see if there are impossible extreme values in the data.
Normally, selecting all three variables in a single boxplot, using df[['age', 'height', 'speed']].boxplot() followed by plt.show(), would be a good solution. The problem here is that the three measures are on different scales. For this reason, it is better to make separate boxplots and show them side by side.
The add_subplot(x,y,z) command positions the figures: the first number is the number of rows (here one), the second is the number of columns (here three), and the third is the position of the current plot. The command fig.tight_layout() adjusts the spacing between the figures.
fig = plt.figure()
fig.add_subplot(1,3,1)
df[['age']].boxplot(sym='.')
fig.add_subplot(1,3,2)
df[['height']].boxplot(sym='.')
fig.add_subplot(1,3,3)
df[['speed']].boxplot(sym='.')
fig.tight_layout()
plt.show()
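The points drawn beyond the whiskers can also be listed numerically. Matplotlib boxplots by default draw whiskers up to 1.5 times the interquartile range (IQR) beyond the quartiles, and the sketch below applies that same rule to each variable. The helper iqr_outliers is a hypothetical function written for this illustration, not part of pandas.

```python
import numpy as np
import pandas as pd

# Same example data as defined earlier in this script
data = {'subject': ['Pierre', 'Anne', 'Joyce', 'Peter', 'Alan', 'Camille'],
        'age': [20, 16, 19, 99, 23, 18],
        'sex': ['M', 'F', 'F', 'M', np.nan, 'F'],
        'height': [172, 181, 165, 168, 177, 178],
        'speed': [11.20, 3.00, 11.50, 10.35, 10.98, 13.05]}
df = pd.DataFrame(data)

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# List the flagged values for each numeric variable
for column in ['age', 'height', 'speed']:
    print(column, iqr_outliers(df[column]).tolist())
```

A value flagged by this rule is only a candidate for inspection. Whether it is truly impossible, such as an age of 99 in this sample, remains a judgment about the data.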
Compute the mean before the removal of outliers:
df['age'].mean()
Check if there are cases that are older than 40 years:
df['age']>40
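Replace the case(s) older than 40 with a missing value (NaN). The mask method is used here because chained indexing such as df['age'][...] = ... may fail to modify the DataFrame.
df['age'] = df['age'].mask(df['age'] > 40)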
df
df['age'].mean()
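The mean changes because pandas skips NaN values by default when computing it. The sketch below illustrates this on the age values alone. It uses Series.mask to set the outlier to NaN, and the variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

age = pd.Series([20, 16, 19, 99, 23, 18], name='age')

# Mean including the outlier of 99
mean_before = age.mean()                       # 195 / 6 = 32.5

# Replace values above 40 with NaN; the result is stored as floats
age_clean = age.mask(age > 40)

# By default NaN is skipped, so the mean is over the five remaining values
mean_after = age_clean.mean()                  # 96 / 5 = 19.2

# With skipna=False a single NaN makes the result NaN
mean_strict = age_clean.mean(skipna=False)

print(mean_before, mean_after, mean_strict)
```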
Compute the mean age separately for male and female participants:
grpsex = df.groupby('sex')
grpsex['age'].mean()
Again, use np.round(grpsex['age'].mean(), decimals=2) for rounding to two decimals.
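One detail to be aware of: groupby leaves out rows whose grouping key is NaN, so the participant with missing sex does not appear in the output. The sketch below assumes pandas 1.1 or later, where the dropna=False option is available, and uses the data after the age outlier was replaced by NaN.

```python
import numpy as np
import pandas as pd

# Age and sex after the age outlier of 99 was replaced by NaN
df = pd.DataFrame({'age': [20, 16, 19, np.nan, 23, 18],
                   'sex': ['M', 'F', 'F', 'M', np.nan, 'F']})

# Default behaviour: the row with missing sex is left out
mean_by_sex = df.groupby('sex')['age'].mean().round(2)
print(mean_by_sex)

# dropna=False keeps the missing key as a group of its own
mean_incl_missing = df.groupby('sex', dropna=False)['age'].mean().round(2)
print(mean_incl_missing)
```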
Use a scatterplot to display the relation between age and speed:
df.iloc[1, 4] = np.nan  # replace the speed outlier of 3.00 ms (row 1, column 4) with a missing value
plt.scatter(df['age'], df['speed'], facecolors='none', edgecolors='b')
plt.xlim(15.5,23.5)
plt.ylim(10.0, 13.5)
plt.xlabel('Age')
plt.ylabel('Speed')
plt.show()
Note that the figure above shows the data after the two outliers were replaced by NaN (as discussed in the YouTube video), which is why only four data points are left. So far, only the replacement of the age outlier has been explained; the speed outlier was set to NaN by the df.iloc command just before the plot. See the quiz question below for replacing the outlier in the variable speed with NaN.
Replace the outlier of 3.00 ms in the variable df['speed'] with NaN.
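Answer: There are multiple possible solutions. The first line restores the original value, because it was already set to NaN before the scatterplot; the second line selects the outlier by a condition.
df.iloc[1, 4] = 3.00
df.loc[df['speed'] == 3.00, 'speed'] = np.nan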
Another solution:
df.iloc[1,4]=np.nan
To verify that the value was indeed replaced by NaN, use the command pd.isnull:
pd.isnull(df['speed'])
A dataset can erroneously contain double records (duplicates), and these should be removed. Looking at the current dataset, what would be a way to discover duplicates in the variable df['subject']? Check your solution on the following DataFrame, which contains a duplicated record:
data = {'subject': ['Pierre','Anne','Joyce','Peter','Alan','Camille', 'Pierre'], 'age': [20, 16, 19, 99, 23, 18, 20], 'sex': ['M','F','F','M', np.nan, 'F', 'M'], 'height': [172, 181, 165, 168, 177, 178, 172], 'speed': [11.20, 3.00, 11.50, 10.35, 10.98, 13.05, 11.20]}
df = pd.DataFrame(data, columns = ['subject','age','sex','height','speed'])
Answer
A way to inspect for double records is to use the value_counts() function. Selecting a single column with df['subject'] already returns a pandas Series, so value_counts() can be called on it directly.
df['subject'].value_counts()
Even simpler, the describe function also shows that Pierre has two records:
df['subject'].describe()
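Once duplicates are detected they still need to be removed. The sketch below, an addition to the original answer, uses the pandas methods duplicated and drop_duplicates; the variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Same data with the duplicated record for Pierre
data = {'subject': ['Pierre', 'Anne', 'Joyce', 'Peter', 'Alan', 'Camille', 'Pierre'],
        'age': [20, 16, 19, 99, 23, 18, 20],
        'sex': ['M', 'F', 'F', 'M', np.nan, 'F', 'M'],
        'height': [172, 181, 165, 168, 177, 178, 172],
        'speed': [11.20, 3.00, 11.50, 10.35, 10.98, 13.05, 11.20]}
df = pd.DataFrame(data)

# Show every row whose subject name occurs more than once
dup_mask = df.duplicated(subset='subject', keep=False)
duplicated_rows = df[dup_mask]
print(duplicated_rows)

# Drop rows that are identical in all columns, keeping the first occurrence
df_unique = df.drop_duplicates().reset_index(drop=True)
print(df_unique)
```

Comparing all columns, as drop_duplicates does by default, is the more cautious choice here: two different participants could share a first name, and only fully identical rows are removed.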
If you have any questions, please do not hesitate to contact me: rrighart@googlemail.com