Ruthger Righart
Email: rrighart@googlemail.com
Website: www.rrighart.com
Data is the new oil of the 21st century [1]. A great deal of these data comes from the web [2]. Acquiring these web data and structuring them is an essential skill that is called webscraping.
In this blog, I will briefly discuss how to webscrape data. However, it does not stop there: analyses beyond webscraping are often needed. Some additional steps are shown as well, such as capturing hidden characters, merging different data sources, summary statistics and visualization. Luckily, Python can be used for the whole analysis pipeline.
Health data are used for the current purpose, but it should be mentioned that many other data can be webscraped in a similar way, sometimes with a few adaptations in the code.
Important note: Almost nothing is more dynamic than web data. Please report to me if the code is not working; this may be due to changes in the scraped web tables.
The current blog was updated on 21-9-2018. A previous version was published in June 2017 at https://rrighart.github.io/. The code is based on Python 2.7; some minor changes may be needed for Python 3.4 and later versions.
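For readers who work in Python 3, the most important differences in the code below are the urllib2 module (which was reorganized into urllib.request), the print statement, and the fact that cell text no longer needs to be encoded to utf8. A minimal sketch of the substitutions, under the assumption of Python 3.4 or later, could look like this:

from urllib.request import urlopen   # replaces: import urllib2

url = 'https://en.wikipedia.org/wiki/Healthcare_in_Europe'
htmlpage = urlopen(url)   # replaces: urllib2.urlopen(url)
for line in htmlpage:
    # lines are bytes in Python 3 and need to be decoded before matching with a string pattern
    line = line.decode('utf-8').rstrip()
print('done scanning', url)   # print is a function in Python 3 (print x becomes print(x))
# cell values can be kept as text, so val.text can be used instead of val.text.encode('utf8')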
import warnings
warnings.filterwarnings("ignore")
Beautiful Soup is a Python package that is used for webscraping [3]. In this blog we are going to scrape tables from webpages.
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
import csv
import re
import urllib2
from datetime import datetime
import os
import sys
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
First, you need to determine the directory where you save your data.
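For example, a minimal sketch using the os module could look as follows; the path is of course hypothetical and should be replaced by your own folder:

import os

datadir = '/home/user/webscraping'   # hypothetical folder; replace by your own directory
os.chdir(datadir)
print(os.getcwd())   # verify the current working directory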
We are going to scrape data from Wikipedia [4]. The data indicate rankings on different health indices, such as patient rights and information, accessibility (waiting time for treatment), outcomes, range and reach of services provided, prevention, and pharmaceuticals. The data are from the Euro Health Consumer Index. In the following code, we fetch the page and use BeautifulSoup to convert it into a so-called bs4.BeautifulSoup object. Several visualizations of these data using Tableau can be found on my blog page [5].
url = 'https://en.wikipedia.org/wiki/Healthcare_in_Europe'
r = requests.get(url)
HCE = BeautifulSoup(r.content)
type(HCE)
We could display the data using print(HCE.prettify()). We will not do that here because it will produce a great amount of text (or soup).
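If you would nevertheless like a quick impression of the HTML, one option is to print only the first part of the prettified output, for example the first 500 characters:

print(HCE.prettify()[0:500])   # print only the beginning to avoid flooding the screen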
We first need to select the table that we would like to scrape. As many webpages contain multiple tables, it is useful to read the specific table class names from the HTML into a list, which we will call lst [6]:
htmlpage = urllib2.urlopen(url)
lst = []
for line in htmlpage:
    line = line.rstrip()
    if re.search('table class', line):
        lst.append(line)
This list lst has a length of 3:
len(lst)
Now let us display lst:
lst
We will scrape the first table, and therefore use index 0 in lst to capture the first table name. Now it is time to read in the table using Beautiful Soup's find function. A simple option is to type in the class name directly. You simply take the name from lst, which in this case is "wikitable sortable":
table = HCE.find('table', {'class': 'wikitable sortable'})
Alternatively, this step can be automated by taking the first element of the list and extracting the quoted class name from it:
x = lst[0]
extr = re.findall('"([^"]*)"', x)   # extract the quoted class name, e.g. 'wikitable sortable'
table = HCE.find('table', {'class': extr[0]})
type(table)
Now it would be good to read in the headers and rows separately, so that we can later easily build a DataFrame.
headers= [header.text for header in table.find_all('th')]
headers
rows = []
for row in table.find_all('tr'):
    rows.append([val.text.encode('utf8') for val in row.find_all('td')])
Now all elements, rows as well as headers, are available to build the DataFrame, which we will call df1.
df1 = pd.DataFrame(rows, columns=headers)
df1.head(7)
This table still needs a good amount of preprocessing, to which we will return later.
Of course, other data sources can be scraped as well. So let us load data about health expenditure [7]. These are data per capita, which means that expenditure is corrected for the number of inhabitants in a country. To save time, we will now put the whole script in one code block. This should lead to a DataFrame df2:
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita'
r = requests.get(url)
HEE = BeautifulSoup(r.content)
htmlpage = urllib2.urlopen(url)
lst = []
for line in htmlpage:
    line = line.rstrip()
    if re.search('table class', line):
        lst.append(line)
x = lst[1]
print(x)
extr = re.findall('"([^"]*)"', x)
table = HEE.find('table', {'class': 'wikitable sortable'})
headers= [header.text for header in table.find_all('th')]
rows = []
for row in table.find_all('tr'):
    rows.append([val.text.encode('utf8') for val in row.find_all('td')])
headers = [i.replace("\n", "") for i in headers]
df2 = pd.DataFrame(rows, columns=headers)
df2.head()
If we look at the DataFrames, we see that there are still some issues that prevent numeric computations. There are undesired characters ('\n'), the numbers contain commas that should be removed, there are cells with non-numeric characters ('x') that should become NaN, and several columns should be numeric instead of object type. To resolve this, we will write a preprocessing function called preproc, which accepts as input any DataFrame (in our case df1 and df2). Please note that each web table may need its own collection of preprocessing steps.
def preproc(dat):
    # Drop rows that are completely empty
    dat.dropna(axis=0, how='all', inplace=True)
    # Remove newline characters from the column names and the cells
    dat.columns = dat.columns.str.replace("\n", "")
    dat.replace(["\n"], [""], regex=True, inplace=True)
    # Remove commas from the numbers
    dat.replace([","], [""], regex=True, inplace=True)
    # Cells consisting of a single letter (such as 'x') become NaN
    dat.replace(r"\b[a-zA-Z]\b", np.nan, regex=True, inplace=True)
    # Convert columns to numeric where possible
    dat = dat.apply(pd.to_numeric, errors='ignore')
    return(dat)
df1 = preproc(df1)
df2 = preproc(df2)
Apparently, after this preprocessing there are some NaNs:
print(df1.isnull().sum().sum())
print(df2.isnull().sum().sum())
Now we display where the NaNs occur. In fact, when we check the original table ( https://en.wikipedia.org/wiki/Healthcare_in_Europe ), we can see that Cyprus has values "x", which were changed to NaNs by our preproc function.
df1[df1.isnull().any(axis=1)]
Now we remove the rows containing NaNs:
df1.dropna(axis=0, how='any', inplace=True)
At this point we inspect the data types:
df1.dtypes
df2.dtypes
The column names are a bit long, so it would be good to use shorter names.
df1.columns = ['Country', 'Ranking', 'totalscore', 'Patientrights', 'Accessibility', 'Outcomes', 'Range', 'Prevention', 'Pharmaceuticals']
df2.columns = ['Country', 'y2013', 'y2014', 'y2015', 'y2016']
It should be clear from this example that webscraping can be important to quickly acquire data. Webscraping may be particularly useful when you need to automate data processing, for example when:
Web data change regularly and need to be stored repeatedly (a small sketch of such a routine follows below).
A large number of data sources, for example tables, need to be loaded and merged.
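To illustrate the first point, here is a minimal sketch of a routine that re-scrapes a table and saves it with a timestamp in the filename, so that it can be run at regular intervals (for example with a scheduler). The function name and the filename prefix are hypothetical, and the sketch assumes that the page contains a table with class "wikitable sortable" whose header and body cells line up:

from datetime import datetime
import pandas as pd
import requests
from bs4 import BeautifulSoup

def scrape_and_save(url, prefix='healthtable'):
    """Scrape the first 'wikitable sortable' table on a page and save it with a timestamp."""
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    table = soup.find('table', {'class': 'wikitable sortable'})
    headers = [th.text.strip() for th in table.find_all('th')]
    rows = [[td.text.strip() for td in tr.find_all('td')] for tr in table.find_all('tr')]
    rows = [row for row in rows if row]                 # drop the (empty) header row
    df = pd.DataFrame(rows, columns=headers)            # assumes headers and cells match in number
    stamp = datetime.now().strftime('%Y%m%d-%H%M%S')    # e.g. 20180921-101500
    df.to_csv('{0}-{1}.csv'.format(prefix, stamp), sep=',', index=False, encoding='utf-8')
    return df

# Example usage:
# scrape_and_save('https://en.wikipedia.org/wiki/Healthcare_in_Europe')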
Let us elaborate on the last point a bit more. If the two tables that we just scraped need to be merged, this can be done in Python. For example, if we want to merge on the column "Country", we would use the following code (we use the .head() function to limit the output).
pd.merge(df1, df2, how='left', on='Country').head()
The health expenditure data give unexpected NaNs (missing values). Something must be wrong here.
So we need to inspect this in more detail. Normally, taking the set intersection of the countries in the two DataFrames should return the countries they have in common. But here it is empty. This suggests that the tables cannot be merged because the country names are not identical.
set(df1['Country']) & set(df2['Country'])
Let's inspect these tables more closely. Something must be wrong, because several countries occur in both tables and there are no spelling errors. If we write the tables to csv files to check the raw text, we may discover hidden symbols.
df1.to_csv('df1example.csv', sep=",")
df2.to_csv('df2example.csv', sep=",")
If you now open the csv files in a spreadsheet or a plain text editor, you will discover that there are some hidden characters in the df1example.csv file.
Using the repr function we can reveal the hidden characters [8]:
repr(df1['Country'])
repr(df2['Country'])
Luckily, the unwanted hidden characters are identical across the countries: they are "\xc2\xa0" (a UTF-8 encoded non-breaking space) and "\n". Using the replace function we can remove them.
df1.head()
df1.replace(["\n"],[""], regex=True, inplace=True)
df1.replace(["\xc2\xa0"], [""], regex=True, inplace=True)
repr(df1['Country'])
df1.head()
Alright. To check whether it worked, we can use the repr function again; the undesired characters should now be gone. Using the set intersection again, we see that quite a few countries now overlap:
set(df1['Country']) & set(df2['Country'])
Now it is time to merge the two tables. A few countries have NaNs, but in these cases there were simply no values in the expenditure data. We make a new DataFrame df3 from the merged data. After that, we drop any rows that contain NaNs.
df3 = pd.merge(df1,df2, how='left', on='Country')
df3.dropna(how='any', inplace=True)
df3.shape
We managed to bring the data together, and the data are ready for analyses and visualization. We can visually inspect the relation between different variables using a scatterplot. The package adjustText, which can be installed via pip, takes care that the country labels do not overlap [9].
# Python 2 workaround to avoid unicode errors when plotting the country labels
reload(sys)
sys.setdefaultencoding('utf8')
For the visualization, we will compute the average across years.
df3['Avg'] = df3.loc[:, ['y2013', 'y2014', 'y2015', 'y2016'] ].mean(axis=1)
We tune the axes and prepare the minimum and maximum values on the X- and Y-axis:
xmin=df3.Patientrights.min()-10
xmax=df3.Patientrights.max()+10
ymin=df3.Outcomes.min()-10
ymax=df3.Outcomes.max()+10
import matplotlib as mpl
import matplotlib.pyplot as plt
from adjustText import adjust_text
def plot_df3(adjust=True):
    mpl.rcParams['font.size'] = 12.0
    plt.figure(figsize=(14, 14))
    # The circle size is scaled by the average expenditure per country
    plt.scatter(df3.Patientrights, df3.Outcomes, facecolors='none', edgecolors='red', linewidth=1.2, s=1*df3.Avg)
    texts = []
    plt.title('Relation between different health parameters')
    plt.xlabel('Patient rights')
    plt.ylabel('Outcomes')
    plt.xlim(xmin, xmax)
    plt.ylim(ymin, ymax)
    for x, y, s in zip(df3['Patientrights'], df3['Outcomes'], df3['Country']):
        texts.append(plt.text(x, y, s, size=12))
    if adjust:
        # Shift the country labels so that they do not overlap
        adjust_text(texts, arrowprops=dict(arrowstyle="-", color='black', lw=0.5))
_ = plot_df3()
plt.show()
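If you also want to keep a copy of the figure on disk, matplotlib's savefig function can be used; it is best called before plt.show(), because some backends clear the figure after it has been shown. The filename below is hypothetical:

plt.savefig('health-scatterplot.png', dpi=150, bbox_inches='tight')   # save the scatterplot to disk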
The visualization suggests a relation between patient rights and outcomes. The diameter of the circles reflects the average expenditure. The higher-expenditure countries tend to have better patient rights and outcomes (they are mostly in the top right corner).
The current blog showed how web tables can be scraped using Python. A certain amount of preprocessing is necessary before such data can be visualized. Note that the world of webscraping is dynamic: the websites and tables used here may be updated, and the Python code may therefore require regular updates as well.
Any questions or comments?
Please feel free to contact me:
Ruthger Righart
E: rrighart@googlemail.com
[1]. Data is the new oil. https://www.changethislimited.co.uk/2017/01/data-is-the-new-oil/
[2]. Data on internet. https://www.livescience.com/54094-how-big-is-the-internet.html
[3]. Beautiful Soup. https://www.crummy.com/software/BeautifulSoup/bs4/
[4]. Healthcare Europe. https://en.wikipedia.org/wiki/Healthcare_in_Europe
[5]. Visualizing European healthcare using Tableau. https://rrighart.github.io/HE-Tableau/
[6]. Scraping tables. https://stackoverflow.com/questions/17196018/extracting-table-contents-from-html-with-python-and-beautifulsoup
[7]. Health expenditure. https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita
[8]. Hidden characters. https://stackoverflow.com/questions/31341351/how-can-i-identify-invisible-characters-in-python-strings
[9]. Adjust text package. https://github.com/Phlya/adjustText/blob/master/examples/Examples.ipynb
(c) 2018 . https://www.rrighart.com | https://rrighart.github.io