Automating web analytics through Python

Ruthger Righart

Email: rrighart@googlemail.com

Website: https://rrighart.github.io

Python modules

In the current blog, Python 2 was used; please note that the code may be slightly different for Python 3. The following Python modules were used:

In [1]:
import pandas as pd
import numpy as np
from google2pandas import *
import matplotlib as mpl
import matplotlib.pyplot as plt
import sys
from geonamescache import GeonamesCache
from geonamescache.mappers import country
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
from mpl_toolkits.basemap import Basemap
from scipy import stats
import matplotlib.patches as mpatches
import plotly
import plotly.plotly as py
from IPython.display import Image
from plotly import tools
from plotly.graph_objs import *

1. Introduction

Web analytics is a fascinating domain and an important target of data science. Seeing where people come from (geographical information), what they do (behavioral analyses), how they visit (device: mobile, tablet, or desktop), and when they visit your website (time-related info, frequency, etc.) are all different metrics of webtraffic that have potential business value. Google Analytics is one of the freely available tools that is widely used and well-documented.

The current blog deals with how to implement web analytics in Python. I am enthusiastic about the options that are available inside Google Analytics: it has a rich variety of metrics and dimensions, good visualizations, and an intuitive Graphical User Interface (GUI). However, in certain situations it makes sense to automate web analytics and add advanced statistics and visualizations. In the current blog, I will show how to do that using Python.

As an example, I will present the traffic analyses of my own website, for one of my blogs (https://rrighart.github.io/Webscraping/). Note, however, that many of the implementation steps are quite similar for conventional (non-GitHub) websites. So please do stay here and do not despair if GitHub is not your cup of tea.

2. The end in mind

As the options are quite extensive, it is best to start with the end in mind. In other words, for what purpose do we use webtraffic analyses?[1]

  1. What is the effect of a campaign on webtraffic? Concretely, does publishing a link to my blog "Webscraping and beyond" on a reputable site for fellow developers have a substantial impact on the number of visitors? To reach this goal, much like an experiment, I performed a single intervention: publishing a link at http://www.datatau.com .
  2. Where do the visitors come from: a large range of countries across different continents, or rather one or a few countries?
  3. What kind of devices do my visitors use (mobile, desktop, or tablet)?

To realise these goals, what follows are the analytical steps needed from data acquisition to visualization. If at this moment you are not able to run Google Analytics for your own website but nevertheless want to reproduce the data analyses in Python (starting below at section 8), I advise loading the DataFrames (df1, df2, df3, df3a, df3b) from my GitHub site using the following code (change the "df"-filename accordingly):

In [2]:
# Load a prepared DataFrame directly from GitHub
url = 'https://raw.githubusercontent.com/RRighart/GA/master/df1.csv'
df1 = pd.read_csv(url, parse_dates=True, delimiter=",", decimal=",")
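If you want all five prepared DataFrames at once, the same pattern can be wrapped in a small helper. This is just a convenience sketch; `load_df`, `BASE_URL`, and `FILENAMES` are illustrative names, and the read_csv arguments are the same as above:

```python
import pandas as pd

BASE_URL = 'https://raw.githubusercontent.com/RRighart/GA/master/'
FILENAMES = ['df1', 'df2', 'df3', 'df3a', 'df3b']

def load_df(name):
    """Load one of the prepared DataFrames from the GitHub repository."""
    return pd.read_csv(BASE_URL + name + '.csv', parse_dates=True,
                       delimiter=",", decimal=",")

# Load everything into a dictionary keyed by name, e.g. dfs['df1']:
# dfs = {name: load_df(name) for name in FILENAMES}
```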

3. Add a website to your Google Analytics account

You first need to subscribe to Google Analytics and add your website[2]:

  • Select Admin
  • In the dropdown menu Property, select Create new property. You need to give the name of your website. The URL has the following format for GitHub sites: https://yourname.github.io/projectname/ (in my case, for example, https://rrighart.github.io/Webscraping/).
  • Do not forget to set the reporting timezone correctly. This is very important if you want to research the time of day that your customers visit your website.
  • After confirming the other steps, you will receive a Universal Analytics (UA) tracking-ID, which has the following format: UA-xxxxxxxx-x, where each x is a digit.
In [3]:
Image("Fig1.png")
Out[3]:
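As a quick sanity check, the UA-xxxxxxxx-x format can be verified with a short regular expression. This is only a rough sketch: the exact digit counts vary per account, so the pattern below is deliberately loose, and the example IDs are made up:

```python
import re

# UA tracking-IDs look like UA-xxxxxxxx-x: a property number followed
# by a sequence number. The digit counts vary per account.
UA_PATTERN = re.compile(r'^UA-\d{4,10}-\d{1,4}$')

def is_valid_ua_id(tracking_id):
    """Return True if the string looks like a UA tracking-ID."""
    return bool(UA_PATTERN.match(tracking_id))

print(is_valid_ua_id('UA-12345678-1'))  # True
print(is_valid_ua_id('GA-12345678-1'))  # False
```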

4. Tracking-ID and code

If you need to find the tracking-ID again later, it can be found under Tracking Info and then Tracking Code. Under the header Website Tracking, Google Analytics gives a script that needs to be pasted into your website. It is essential to set the code right, to prevent, for example, half or double counting[3].

5. Check the connection

Now that you have the tracking code pasted into your website, Google Analytics is able to collect traffic data. The "official" way to inspect whether there is a connection in Google Analytics is to select Tracking Info, Tracking Code, and, under Status, push the button Send test traffic. This will open up your website.

However, a more real-life way to do this is to visit your website yourself, using for example your mobile phone. In Google Analytics, select Home, Real-time, and Overview. If you just visited your website of interest, you should see under pageviews that there is "Right now 1 active users on site" (of course this could be >1 if there were other visitors at the same moment). Additionally, you may want to check the geographical map and see if your location is highlighted. If you leave your website, the active users count should return to zero (or go down by one). If this works, you are ready to start webtraffic analyses as soon as your first visitors drop in.

In [4]:
Image("Fig2.png")
Out[4]:

6. Query Explorer

So how to start webtraffic analyses? One option is to visualize traffic in Google Analytics itself. Another option is Query Explorer. Query Explorer is a GUI tool that gives a very quick impression of your data, combining different metrics and dimensions at once. It is also very helpful for preparing the Python code needed for data extraction (more about this later). Follow these steps:

  • Log in with your Google account at: https://ga-dev-tools.appspot.com/query-explorer/
  • Under property, select the webpage that you want to check (in my case "Webscraping").
  • Select view to choose between extracting all data, desktop, or mobile.
  • Select ids: this is the "ga:" code that corresponds with your property.
  • Fill in a start-date. Here we select '2017-07-20' (this is one day before I started campaigning at www.datatau.com ).
  • Fill in an end-date: '2017-08-07'.
  • Metrics: select 'ga:sessions'.
  • Dimensions: select 'ga:date'.

Note that the number of sessions is different from the number of visitors. The difference is that the same visitor may return to the same website several times, resulting in a higher number of sessions. For the time being, leave all the other fields empty. When you hit the button Run Query, this should return a spreadsheet with the number of sessions for each day in your time window.

In [5]:
Image("Fig3.png")
Out[5]:
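The settings above map directly onto a query dictionary for the google2pandas module imported earlier. Below is a sketch: the ids value and the credential file names are placeholders for your own, and `GoogleAnalyticsQuery`/`execute_query` are the names used by google2pandas at the time of writing, so check your installed version:

```python
# The same query as in Query Explorer: sessions per day,
# from one day before the campaign until 2017-08-07.
query = {
    'ids':        'ga:12345678',   # placeholder: your own ga: view ID
    'metrics':    'sessions',
    'dimensions': 'date',
    'start_date': '2017-07-20',
    'end_date':   '2017-08-07',
}

# Running it requires the credentials discussed later in this blog:
# from google2pandas import GoogleAnalyticsQuery
# conn = GoogleAnalyticsQuery(secrets='client_secrets.json',
#                             token_file_name='analytics.dat')
# df, metadata = conn.execute_query(**query)
```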