- Introduction
- Jupyter notebook and loading packages
- Create a DataFrame manually [pd.DataFrame, head, tail]
- Selecting rows and columns [loc, iloc]
- Data types [type, dtypes, astype]
- Basic properties [columns, index, values, shape, len]
- Load another DataFrame [pd.read_csv, value_counts]
- Append two DataFrames [append, pd.concat]
- Missing values [pd.isnull, count, dropna, describe]
- Sorting the DataFrame [sort_index, sort_values]
- Data selection by position [iloc]
- Conditional data selection [>,==,!=,&,^,|]
- Selection based on string [str.contains]
- Adding single rows [loc, pd.Series, append]
- Exploratory analyses using boxplots [describe, Matplotlib boxplot]
- Barplots [Matplotlib]
- Pie Charts [Matplotlib, Plotly]
- Scatterplots [Matplotlib]
- Replacing values [replace]
- Data categorization [pd.cut, pd.value_counts, Matplotlib histogram, groupby]
- Data conversions [loc, def]
- Merging data [pd.merge]
- More scatterplots [Matplotlib, Seaborn]
- Geographical mapping [np.where, basemap]
- Lineplots [sort_values, Matplotlib]
- Remove rows and columns [del, drop]
- Save DataFrame [to_csv]

Cities are interesting. Some people enjoy living in cities; others avoid them and prefer the countryside. Whatever your preference, one thing is true: cities generate a huge amount of data that is a good source for analyses.

The following is a brief tutorial offering quick insight into data analysis and visualization with Python. It is meant as an introduction for beginners^{1}. Much of this is already documented on the web, sometimes in more detail. So why another guide?

Many guides treat every topic separately and introduce a new dataset every time a new data problem is presented. There are, however, advantages to using the same dataset for learning a large variety of analyses. One advantage is time-related: all attention can be focused on coding, without losing "switching" time on understanding new datasets. More importantly, working on a single dataset lets you experience the whole evolution from preprocessing to visualization.

The goal of this tutorial is not to solve an important data science problem; for that, the presented data is in any case too small. The goal is to use an intuitive dataset to learn Python, intuitive in the sense that it contains data that everyone can understand without prior domain knowledge. It should therefore be suitable for someone who is just starting out in data science.

We start this tutorial by creating data, then move on to loading data and data munging, and end with several visualizations. Do not hesitate to contact me if you have any questions or suggestions: rrighart@googlemail.com

^{2}.

The current tutorial covers a great variety of data types, various preprocessing steps (for example selecting, merging, and replacing data, and finding missing values), and data visualization (for example boxplots, barplots, pie charts, and geographic mapping), using different packages (for example Matplotlib, Seaborn, and Plotly).

In Python, several abbreviations are conventionally used, such as pd for pandas and np for NumPy. The abbreviations you use for imported packages are arbitrary, but in this tutorial I used those that are most common on platforms such as StackOverflow. The following packages are needed throughout the tutorial.

In [1]:

```
import os
import csv
import pandas as pd
import numpy as np
from datetime import datetime
```

The packages here below are needed for visualizations, such as Matplotlib, Seaborn, and Plotly:

In [2]:

```
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib import cm
import seaborn as sns
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
```

The following packages will be used for geographic mapping:

In [3]:

```
from mpl_toolkits.basemap import Basemap as Basemap
from matplotlib.colors import rgb2hex
from matplotlib.patches import Polygon
from geopy.geocoders import Nominatim
import math
```

Skipping any of the visualizations or the geographic mapping poses no problem for continuity.

We are going to use data from European cities, consisting of categorical and numerical data. The data were obtained from Wikipedia. *Temperatures* are daily means in Celsius. *Population* is urban population. *Altitude* is the highest level reported.

We are going to create a DataFrame manually; later we will see how to load a DataFrame from a file. After running the following code, your DataFrame df1 is created.

In [4]:

```
data = {'Place': ['Amsterdam', 'Barcelona', 'Paris', 'Geneva', 'Munich', 'Athens', 'Vienna'],
        'Temp_Jan': [3.4, 11.8, 5.0, 1.5, 0.3, 9.9, 1.2],
        'Temp_Jul': [17.6, 25.7, 20.6, 20.2, 19.4, 28.5, 19.1],
        'Language': ['Dutch', 'Catalan and Spanish', 'French', 'French', 'German', 'Greek', 'German'],
        'Altitude': [-2, 12, np.nan, 375, 520, 338, 542],
        'MeasureDate': [np.nan, np.nan, '01-01-2013', '01-12-2015', '31-12-2015', np.nan, '01-01-2017'],
        'Population': [1351587, 4740000, 10601122, 198072, 1450381, 3090508, 2600000]}
df1 = pd.DataFrame(data, columns = ['Place', 'Temp_Jan', 'Temp_Jul', 'Language', 'Altitude', 'MeasureDate', 'Population'])
```

In [5]:

```
df1.head(4)
```

Out[5]:

To select only one column or variable, use the following:

In [6]:

```
df1['Temp_Jan'].head(3)
```

Out[6]:

Another notation that gives the same result:

In [7]:

```
df1.Temp_Jan.head(3)
```

Out[7]:

If you want to select multiple columns, do the following:

In [8]:

```
df1.loc[:,['Temp_Jan','Temp_Jul']].head(3)
```

Out[8]:

To select the first row:

In [9]:

```
df1.iloc[0]   # .ix was removed in modern pandas; use .iloc for positional access
```

Out[9]:

To select the first column:

In [10]:

```
df1.iloc[:, [0]].head(3)   # positional selection; df1[[0]] no longer works in modern pandas
```

Out[10]:

To select multiple rows:

In [11]:

```
df1[0:2]
```

Out[11]:

Or use the following code:

In [12]:

```
df1.iloc[[0,1]]   # .ix was removed in modern pandas
```

Out[12]:

To select multiple columns:

In [13]:

```
df1.iloc[:, [0,1]].head(3)   # select the first two columns by position
```

Out[13]:

To check the data type of df1:

In [14]:

```
type(df1)
```

Out[14]:

Or to check the variable types:

In [15]:

```
df1.dtypes
```

Out[15]:

Convert the variable *Language* into a categorical variable.

In [16]:

```
df1['Language'] = df1['Language'].astype('category')
```

After this, we verify that we have obtained the right type.

In [17]:

```
df1['Language'].dtypes
```

Out[17]:

Now we change the column *MeasureDate* to a date variable. Its current type is:

In [18]:

```
df1['MeasureDate'].dtypes
```

Out[18]:

In [19]:

```
df1['MeasureDate']=pd.to_datetime(df1['MeasureDate'])
```

In [20]:

```
df1['MeasureDate'].dtypes
```

Out[20]:

In [21]:

```
df1.dtypes
```

Out[21]:

The DataFrame can be dissected into its column names, its row index, and the values inside:

In [22]:

```
df1.columns
```

Out[22]:

In [23]:

```
df1.index
```

Out[23]:

In [24]:

```
df1.values
```

Out[24]:

And if we wanted to select the first column name, the following can be done:

In [25]:

```
df1.columns[0]
```

Out[25]:

The size of the DataFrame can be obtained with the shape attribute:

In [26]:

```
df1.shape
```

Out[26]:

And the length of the DataFrame, in other words the number of rows, with the len function:

In [27]:

```
len(df1)
```

Out[27]:

Next, we are going to load a DataFrame from a file in which the *comma separated values* format is expected.
An important remark: when loading a file, one needs to check carefully whether decimals are marked with dots or commas (this can differ between countries and systems).^{3}
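To see the effect of the decimal parameter before loading the real file, here is a minimal sketch using a made-up inline CSV snippet instead of the tutorial's dataset:

```python
import pandas as pd
from io import StringIO

# A small made-up CSV snippet that uses commas as decimal marks (semicolon-delimited)
raw = "Place;Temp_Jan\nAmsterdam;3,4\nBarcelona;11,8\n"

# Without decimal=",", Temp_Jan would come out as strings; with it, as floats
sketch = pd.read_csv(StringIO(raw), delimiter=";", decimal=",")
print(sketch["Temp_Jan"].dtype)  # float64
```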

In [28]:

```
url = 'https://raw.githubusercontent.com/RRighart/City/master/df2.csv'
df2 = pd.read_csv(url, parse_dates=True, delimiter=",", decimal=",")
```

So this should give the following data of cities:

In [29]:

```
df2.head(3)
```

Out[29]:

*Language* and *MeasureDate* are not of the right type, as the dtypes show. So it would be best to change this directly.

In [30]:

```
df2.dtypes
```

Out[30]:

Language should be of type "category".

In [31]:

```
df2['Language'] = df2['Language'].astype('category')
```

MeasureDate should be of type "datetime".

In [32]:

```
df2['MeasureDate']=pd.to_datetime(df2['MeasureDate'])
```

In [33]:

```
df2.dtypes
```

Out[33]:

In [34]:

```
df2.Language.value_counts()
```

Out[34]:

In [35]:

```
os.getcwd()
```

Out[35]:

In [36]:

```
# Note: DataFrame.append was removed in pandas 2.0;
# there, use: df = pd.concat([df1, df2], ignore_index=True)
df = df1.append(df2, ignore_index=True)
```

Displaying the whole DataFrame df, we can now see that there are 23 rows.

In [37]:

```
df
```

Out[37]:

Another way this can be done is by the pd.concat function^{4}:

In [38]:

```
pd.concat([df1, df2], ignore_index=True).head(4)
```

Out[38]:

Real-world data often contain missing values^{5}. First let's see how to detect where missing values occur.

The following command will indicate those cells with "True" where there is a missing value.

In [39]:

```
pd.isnull(df).head(3)
```

Out[39]:

The output shows missing values in *Altitude* and *MeasureDate*.
It is also possible to inspect specific variables, to avoid printing the whole datasheet.

In [40]:

```
pd.isnull(df['Altitude']).head(5)
```

Out[40]:

We can also count the *number* of missing values per variable, using the isnull function.

In [41]:

```
df.isnull().sum()
```

Out[41]:

The inverse is also possible, that is, counting the number of non-missing values.

In [42]:

```
df.count()
```

Out[42]:

The following drops all rows that contain *any* missing value, using df.dropna(how='any'). With how='all', it only drops rows in which all values are missing. Note that dropna returns a new DataFrame; df itself only changes when you assign the result back, as in df = df.dropna(how='any').

In [43]:

```
df.dropna(how='any')
```

Out[43]:
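The difference between how='any' and how='all' can be checked on a small throwaway DataFrame (made up here purely for illustration):

```python
import numpy as np
import pandas as pd

# Made-up frame: row 0 complete, row 1 partially missing, row 2 fully missing
toy = pd.DataFrame({'a': [1.0, np.nan, np.nan], 'b': [2.0, 3.0, np.nan]})

print(len(toy.dropna(how='any')))  # 1: only the complete row survives
print(len(toy.dropna(how='all')))  # 2: only the fully-missing row is dropped
```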

Missing values are automatically omitted from summary statistics, such as describe().

In [44]:

```
df['Altitude'].describe()
```

Out[44]:

Suppose we want to set *Altitude* to missing wherever it is larger than 350 m. It is advised to refrain from chained indexing, such as df['Altitude'][df['Altitude']>350]=np.nan^{6}, and to use a single .loc call instead:

In [45]:

```
df.loc[df['Altitude']>350, 'Altitude'] = np.nan
```

In [46]:

```
df['Altitude']
```

Out[46]:
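To make the warning about chained indexing concrete, here is a sketch on a made-up DataFrame; a single .loc call with a boolean mask modifies the frame in place, whereas the chained form may silently operate on a temporary copy:

```python
import numpy as np
import pandas as pd

# Made-up altitudes for illustration
toy = pd.DataFrame({'Altitude': [12.0, 375.0, 520.0]})

# Recommended: one .loc call with a boolean mask and a column label.
# The chained form toy['Altitude'][toy['Altitude'] > 350] = np.nan
# may write to a temporary copy and leave toy unchanged.
toy.loc[toy['Altitude'] > 350, 'Altitude'] = np.nan
print(toy['Altitude'].isnull().sum())  # 2
```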

In [47]:

```
df.sort_index(axis=1, ascending=True).head(3)
```

Out[47]:

Next, the DataFrame is sorted by *Place*, and the result is written back to df.

In [48]:

```
df = df.sort_values(by='Place')
```

Displaying the first five rows, we can confirm that the sorting worked.

In [49]:

```
df.head(5)
```

Out[49]:

In [50]:

```
df.iloc[0,0]
```

Out[50]:

If you want to have multiple rows and columns, the following can be done:

In [51]:

```
df.iloc[:3,:2]
```

Out[51]:

Select a range by position, for example the second and third rows and the first four columns.

In [52]:

```
df.iloc[1:3,0:4]
```

Out[52]:

To select certain columns while switching the positions of the second and third:

In [53]:

```
df.iloc[:,[1,3,2]].head(5)
```

Out[53]:

In [54]:

```
df[df.Temp_Jan > 9]
```

Out[54]:

Select cases where the average July temperature was equal to 28.5.

In [55]:

```
df[df.Temp_Jul == 28.5]
```

Out[55]:

Select cases where the July temperature was unequal to 28.5.

In [56]:

```
df[df.Temp_Jul != 28.5].head(3)
```

Out[56]:

It is possible to combine *multiple* conditions, for example cities where the January temperature is larger than 5 *and* the July temperature is smaller than 17 degrees Celsius.

In [57]:

```
df[(df.Temp_Jan > 5) & (df.Temp_Jul < 17)]
```

Out[57]:

The ^ operator (exclusive or) selects cases where *either* the January temperature is larger than 5 *or* the July temperature is smaller than 17, but not both. Note that Cork is not selected in this case, since it fulfills both conditions.

In [58]:

```
df[(df.Temp_Jan > 5) ^ (df.Temp_Jul < 17)]
```

Out[58]:

The | operator selects cases where the January temperature is larger than 5 *and/or* the July temperature is smaller than 17. That means it also includes those cases that fulfill both conditions.

In [59]:

```
df[(df.Temp_Jan > 5) | (df.Temp_Jul < 17)]
```

Out[59]:
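The behavior of the three operators can be summarized on two small made-up boolean masks:

```python
import pandas as pd

# Two made-up boolean masks
a = pd.Series([True, True, False, False])
b = pd.Series([True, False, True, False])

print((a & b).tolist())  # [True, False, False, False]  -> both conditions hold
print((a | b).tolist())  # [True, True, True, False]    -> at least one holds
print((a ^ b).tolist())  # [False, True, True, False]   -> exactly one holds
```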

Selection based on strings also works on a *categorical* variable. So if we want to find the "Paris" data, do the following:

In [60]:

```
df[df['Place'].str.contains('Paris')]
```

Out[60]:

And it also works if you use only part of a string, for example "Stock":

In [61]:

```
df[df['Place'].str.contains('Stock')]
```

Out[61]:

Another example, searching for "Spanish" gives the following hits:

In [62]:

```
df[df['Language'].str.contains('Spanish')]
```

Out[62]:

Data can be added manually as well. Adding a row can be done in at least two ways:

In [63]:

```
df.loc[len(df), :] = ['Rome', 7.5, 24.1, 'Italian', 21, np.nan, 4353775]  # len(df) is the next free label in a default RangeIndex; len(df)+1 would leave a gap
```

As we can see, the new entry "Rome" is added in the last row.

In [64]:

```
df.tail(3)
```

Out[64]:

A second way to do this is by using append. We first make newrow:

In [65]:

```
newrow = pd.Series(['Madrid', 6.3, 25.6, 'Spanish', 667, np.nan, 6240000], index= ['Place', 'Temp_Jan', 'Temp_Jul', 'Language', 'Altitude', 'MeasureDate', 'Population'])
```

In [66]:

```
type(newrow)
```

Out[66]:

In [67]:

```
newrow
```

Out[67]:

Once the row is created, we use append.

In [68]:

```
# Note: DataFrame.append was removed in pandas 2.0;
# there, use: df = pd.concat([df, newrow.to_frame().T], ignore_index=True)
df = df.append(newrow, ignore_index=True)
```

In [69]:

```
df.tail(3)
```

Out[69]:

In the describe() output, the median corresponds to the *50%* indicator.
It is also possible to use, for example, min(), mean(), median(), and other statistics separately. From these statistics it is directly clear that the temperature in July is higher than in January (not unexpected for European cities).

In [70]:

```
df['Temp_Jan'].describe()
```

Out[70]:

In [71]:

```
df['Temp_Jul'].describe()
```

Out[71]:

Before plotting, we reset the plotting style to the Matplotlib defaults^{7}:

In [72]:

```
sns.reset_orig()
```

In fig.add_subplot(a, b, c), *a* is the number of rows, *b* is the number of columns, and *c* is the position, which increments row-wise.

In [73]:

```
fig = plt.figure()
fig.add_subplot(1,2,1)
df[['Temp_Jan']].boxplot(sym='.')
fig.add_subplot(1,2,2)
df[['Temp_Jul']].boxplot(sym='.')
plt.show()
```

First, we are going to select the two columns and make a new temporary DataFrame called temp.

In [74]:

```
temp = df.iloc[:,[1,2]]
```

Second, we'd like to adapt our column names slightly, and the following code does that:

In [75]:

```
temp.columns = ['January','July']
```

In [76]:

```
temp.boxplot(sym='.')
plt.ylim(-10, 30)
plt.show()
```

In [77]:

```
fig = plt.figure()
flierprops = dict(marker='o', markerfacecolor='blue', markersize=9)  # define before plotting so it takes effect
df[['Altitude']].boxplot(widths = 0.4, patch_artist = True, vert=0, flierprops=flierprops)
plt.show()
```

Next, a horizontal barplot shows the January and July temperatures per city^{8}.
The data are ordered on January temperature, starting with the city with the lowest temperatures, which is Lulea, situated in the very north of Sweden.

In [78]:

```
plt.figure(figsize = (7, 10))
barwd = 0.2
r1=range(len(df))
r2=[y+ barwd for y in r1]
plt.barh(r1, df.sort_values(by='Temp_Jan', ascending=False).Temp_Jan, height = barwd, color = 'blue')
plt.barh(r2, df.sort_values(by='Temp_Jan', ascending=False).Temp_Jul, height = barwd, color = 'lightblue')
plt.yticks([s + barwd for s in range(len(df))], df.sort_values(by='Temp_Jan', ascending=False).Place,
rotation = 0)
lightblue_patch = mpatches.Patch(color='lightblue', label='July temperatures')
blue_patch = mpatches.Patch(color='blue', label='January temperatures')
plt.legend(handles=[lightblue_patch, blue_patch], bbox_to_anchor=(0., 1.02, 1., .102), loc=3, mode="expand", borderaxespad=0)
plt.show()
```

A pie chart can show each city's share of the total population^{9}. The parameter cs is used for color scaling.
The resulting chart shows that Istanbul and Paris are clearly among the largest cities in our dataset.

In [79]:

```
cs = cm.Set1(np.arange(40)/40.)
mpl.rcParams['font.size'] = 6.0
plt.figure(figsize = (10, 10))
x = df.Population
plt.pie(x, labels = df.Place, labeldistance = 1.2, colors = cs)
plt.show()
```

In [80]:

```
mpl.rcParams['font.size'] = 6.0
plt.figure(figsize = (10, 10))
x = df.Population
plt.pie(x, colors = cs)
plt.legend(labels = df.Place, bbox_to_anchor=(1.1, 1.05))
plt.show()
```

To use Plotly online, you need an account and an API key^{10}.
You will then need to fill in your *username* and *API key* when you run Plotly code:

In [81]:

```
plotly.tools.set_credentials_file(username='rrighart', api_key='xxx')
```

In [82]:

```
fig = {
    'data': [{'labels': df.Place,
              'values': df.Population,
              'type': 'pie'}],
    'layout': {'title': 'City populations as percentage of the total'}
}
py.iplot(fig)
```

Out[82]:

In [83]:

```
sns.reset_orig()
mpl.rcParams['font.size'] = 10.0
plt.figure(figsize = (8, 8))
plt.scatter(df.Temp_Jan, df.Temp_Jul, c = 'yellow', s=50)
plt.title('Relation between winter and summer temperature')
plt.xlabel('Temperature in January')
plt.ylabel('Temperature in July')
plt.show()
```

We could display a third variable in this scatterplot. For example, it may be informative to make the dot diameters proportional to the population size. To do this, we could incorporate the variable df.Population into the scatterplot. Using the parameters facecolors and edgecolors, empty circles are used in order to display any overlapping data.

In these data, there seems to be a slight tendency for the relatively smaller places to lie in lower-temperature areas. For the present purpose the data were not sampled very rigorously and may therefore be biased. Most important here is to see how such a visualization gives insight into three dimensions at once.

In [84]:

```
popsize = df.Population/5000
mpl.rcParams['font.size'] = 10.0
plt.figure(figsize = (8, 8))
plt.scatter(df.Temp_Jan, df.Temp_Jul, facecolors='none', edgecolors='red', linewidth=1.2, s=popsize)
plt.title('Relation between winter and summer temperature')
plt.xlabel('Temperature in January')
plt.ylabel('Temperature in July')
for label, x, y in zip(df.Place, df.Temp_Jan, df.Temp_Jul):
    plt.annotate(
        label,
        xy=(x,y), xytext=(-10,10), textcoords='offset points', ha='center', va='bottom',
    )
plt.show()
```

In [85]:

```
df=df.replace(['Catalan and Spanish'], ['Catalan / Spanish'])
```

In [86]:

```
df[df['Language'].str.contains('Cat')]
```

Out[86]:

In [87]:

```
tempbins = [-10, -5, 0, 5, 10, 15, 20, 25, 30]
templabels = ['Very low', 'Low', 'Reasonably Low', 'Average', 'Reasonably High', 'High', 'Very High', 'Extremely High']
```

The following will create the categorized variables:

In [88]:

```
df['Temp_Jan_cat'] = pd.cut(df['Temp_Jan'], tempbins, labels = templabels)
df['Temp_Jul_cat'] = pd.cut(df['Temp_Jul'], tempbins, labels = templabels)
```

In [89]:

```
df.head(5)
```

Out[89]:

We can then count the number of observations for each category:

In [90]:

```
pd.value_counts(df['Temp_Jan_cat'])
```

Out[90]:

In [91]:

```
pd.value_counts(df['Temp_Jul_cat'])
```

Out[91]:

As we can see, the pd.cut function created a categorical variable.

In [92]:

```
df.dtypes
```

Out[92]:

In [93]:

```
plt.hist(df['Temp_Jan'], alpha= 0.5, bins = tempbins, label='Temperature January')
plt.hist(df['Temp_Jul'], alpha= 0.5, bins = tempbins, label='Temperature July')
plt.ylim(0, 14)
plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=3, mode="expand", borderaxespad=0)
plt.show()
```

The groupby function splits the DataFrame by the levels of a categorical variable^{11}. First let us see the categories:

In [94]:

```
df.groupby(['Temp_Jan_cat']).groups.keys()
```

Out[94]:

We can then inspect the first entry of every category:

In [95]:

```
df.groupby(['Temp_Jan_cat']).first()
```

Out[95]:

And now let's compute mean temperature in July as a function of the binned temperature in January:

In [96]:

```
df.groupby(['Temp_Jan_cat']).Temp_Jul.mean()
```

Out[96]:

If you actually want to view every variable as a function of Temp_Jan_cat:

In [97]:

```
df.groupby(['Temp_Jan_cat']).mean()
```

Out[97]:

In [98]:

```
df.loc[:,"Tempdiff"]= df.loc[:,"Temp_Jan"]-df.loc[:,"Temp_Jul"]
```

Always make sure to inspect the result:

In [99]:

```
df[['Place','Temp_Jan', 'Temp_Jul', 'Tempdiff']].head(3)
```

Out[99]:

Suppose we want to convert temperatures from Celsius to Fahrenheit^{12}:
*T(°F) = T(°C) × 9/5 + 32*.
If you only need this conversion once, you would probably just write the following, which gives the new temperatures in Fahrenheit.

In [100]:

```
(df.Temp_Jan*9/5+32).head(3)
```

Out[100]:

For repeated use it is more convenient to write a function^{13}. Briefly, you use def to define the function name including its arguments, not forgetting the colon (:). On the next line, after an indentation, you define the desired transformation.

In [101]:

```
def Tconv(x):
    y = x*9/5 + 32
    return y
```

In [102]:

```
Tconv(32)
```

Out[102]:

So now it is time to convert a whole column to Fahrenheit:

In [103]:

```
df['Temp_JanF'] = Tconv(df['Temp_Jan'])
df['Temp_JulF'] = Tconv(df['Temp_Jul'])
```

In [104]:

```
df.head(3)
```

Out[104]:

In [105]:

```
data = {'Place': ['Amsterdam', 'Barcelona', 'Paris', 'Geneva', 'Munich', 'Athens', 'Vienna', 'Stockholm', 'Oslo','Helsinki', 'Lulea', 'Edinburgh', 'Birmingham', 'Cork', 'Antwerp', 'Lille', 'Bordeaux', 'Malaga', 'Porto', 'Venice', 'Ajaccio', 'Rijeka', 'Istanbul', 'Rome'],
'Prec_Jan': [66.6,43.7,53.7,76.0,48.0,56.9,21.3,39,54.9,52,34,67.5,73.2,131.4,69.3,60.5,87.3,69.0,147.1,47.0,56.7,128.7,105.0,66.9],
}
df3 = pd.DataFrame(data, columns = ['Place', 'Prec_Jan'])
```

In [106]:

```
df3.head(3)
```

Out[106]:

In [107]:

```
df = pd.merge(df, df3, how='left', on='Place')
```

In [108]:

```
df.head(3)
```

Out[108]:
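The how='left' argument keeps every row of the left DataFrame and fills NaN where no match exists in the right one; here is a minimal sketch with two made-up frames:

```python
import pandas as pd

# Made-up left and right frames; only Amsterdam has a precipitation entry
left = pd.DataFrame({'Place': ['Amsterdam', 'Rome'], 'Temp_Jan': [3.4, 7.5]})
right = pd.DataFrame({'Place': ['Amsterdam'], 'Prec_Jan': [66.6]})

# how='left' keeps every row of the left frame; unmatched rows get NaN
merged = pd.merge(left, right, how='left', on='Place')
print(merged['Prec_Jan'].isnull().sum())  # 1 (Rome has no match)
```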

In [109]:

```
mpl.rcParams['font.size'] = 10.0
plt.figure(figsize = (8, 8))
plt.scatter(df.Temp_Jan, df.Prec_Jan, c = 'yellow', s=50)
plt.title('Relation between winter rain and temperature')
plt.xlabel('Temperature in January (Celsius)')
plt.ylabel('Precipitation in January (mm)')
plt.show()
```