Scraping Hong Kong pollution data - Part 2
Pollution is a major concern in Hong Kong. How bad is it? Compared to, say, 5 years ago? 10 years ago? It is surprisingly difficult to find simple long-term charts, even though pollution data is seemingly readily available in a few places on the web.
In Part 1 we collected some data from a HK Government website.
In this part we’ll calculate the AQI using the US EPA standard.
In the next post we’ll explore the components of the AQI to see where the pollution in HK comes from, whether there is any seasonality in the data, whether holidays have any impact, and what times of the day are the worst.
The data was downloaded from the HK EPD (see Part 1); in this post we again use the Causeway Bay station.
For the AQI calculation we use the implementation in python-aqi.
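python-aqi maps each pollutant concentration to an individual index by linear interpolation between the EPA breakpoints and takes the overall AQI as the maximum of those indices. A minimal sketch of the calls we’ll rely on (the concentrations below are illustrative values, not readings from our data set):

import aqi

# individual AQI for a single pollutant (python-aqi accepts the concentration as a string)
aqi.to_iaqi(aqi.POLLUTANT_PM10, '107', algo=aqi.ALGO_EPA)   # -> 77

# overall AQI = max of the individual indices of the pollutants passed in
aqi.to_aqi([(aqi.POLLUTANT_PM10, '107'),    # µg/m³
            (aqi.POLLUTANT_NO2_1H, '60'),   # ppb
            (aqi.POLLUTANT_SO2_1H, '20')],  # ppb
           algo=aqi.ALGO_EPA)               # -> 77, driven by PM10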
Remarks:
- All pollutant units are in µg/m³, except CO which is in 10 µg/m³
- N.A. = data not available
- CO = Carbon Monoxide
- FSP = Fine Suspended Particulates
- NO2 = Nitrogen Dioxide
- NOX = Nitrogen Oxides
- O3 = Ozone
- RSP = Respirable Suspended Particulates
- SO2 = Sulphur Dioxide
import pandas as pd
import aqi
import os, sys, glob
from tqdm import tnrange, tqdm_notebook

# load the cached AQI file if it exists, otherwise the raw EPD hourly files
folder = 'cwb'
if os.path.isfile(os.path.join(os.getcwd(), folder, 'aqi_hourly_full.csv')):
    data_ready = True
    df = pd.read_csv(os.path.join(os.getcwd(), folder, 'aqi_hourly_full.csv'))
    df.drop(df.columns[[0]], axis=1, inplace=True)  # drop the saved index column
else:
    data_ready = False
    dfs = (pd.read_csv(f, skiprows=11)
           for f in sorted(glob.glob(os.path.join(os.getcwd(), folder, 'air_hourly*.csv'))))
    df = pd.concat(dfs, ignore_index=True)
    del df['NOX']      # no AQI is defined for NOX
    del df['STATION']  # single station (Causeway Bay), so the column is redundant

# the cached file and the raw EPD files use different date formats
if data_ready:
    df.DATE = pd.to_datetime(df.DATE, format="%Y-%m-%d")
else:
    df.DATE = pd.to_datetime(df.DATE, format="%d/%m/%Y")

df.iloc[:, 2:] = df.iloc[:, 2:].apply(pd.to_numeric, errors='coerce')  # 'N.A.' -> NaN
df.head()
|   | DATE | HOUR | CO | FSP | NO2 | O3 | RSP | SO2 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1998-01-01 | 1 | 203.0 | NaN | 116.0 | NaN | 107.0 | 19.0 |
| 1 | 1998-01-01 | 2 | 157.0 | NaN | 119.0 | NaN | 99.0 | 15.0 |
| 2 | 1998-01-01 | 3 | 156.0 | NaN | 127.0 | NaN | 112.0 | 14.0 |
| 3 | 1998-01-01 | 4 | 179.0 | NaN | 114.0 | NaN | 103.0 | 17.0 |
| 4 | 1998-01-01 | 5 | 156.0 | NaN | 130.0 | NaN | 112.0 | 58.0 |
The EPA AQI calculation expects specific units for each pollutant: pm10 (µg/m³), pm25 (µg/m³), o3_8h (ppm), o3_1h (ppm), co_8h (ppm), no2_1h (ppb) and so2_1h (ppb). Our data is in µg/m³ (with CO in 10 µg/m³), so we first convert it using the conversion factors below.
if not data_ready:
    # CONVERT CO from 10 µg/m³ to ppm
    # 1 ppm = 1.1642 mg/m³ -> 1 ppm = (1.1642 * 1000) µg/m³, and the data is in units of 10 µg/m³
    df.CO = df.CO / (1.1642 * 100)
    # CONVERT O3 from µg/m³ to ppm
    # 1 ppb = 1.9957 µg/m³ -> 1 ppm = 1000 ppb = (1.9957 * 1000) µg/m³
    df['O3'] = df.O3 / (1000 * 1.9957)
    # CONVERT NO2 from µg/m³ to ppb
    # 1 ppb = 1.9125 µg/m³
    df['NO2'] = df.NO2 / 1.9125
    # CONVERT SO2 from µg/m³ to ppb
    # 1 ppb = 2.6609 µg/m³
    df['SO2'] = df.SO2 / 2.6609
    # CALCULATE 8h moving averages for CO and O3
    df['CO_8h'] = df.CO.rolling(window=8).mean()
    df['O3_8h'] = df.O3.rolling(window=8).mean()
    # drop the 1h CO series, only the 8h average has an AQI
    del df['CO']
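As a sanity check on these constants: a µg/m³-per-ppb factor is just the pollutant’s molar mass divided by the molar volume of an ideal gas. The sketch below assumes a reference temperature of roughly 20 °C at 1 atm (molar volume ≈ 24.06 L/mol), which is an assumption on our part; it reproduces the factors used above to about three significant figures.

MOLAR_VOLUME = 24.06  # L/mol, ideal gas at ~20 °C and 1 atm (assumed reference conditions)
molar_mass = {'CO': 28.01, 'NO2': 46.01, 'O3': 48.00, 'SO2': 64.07}  # g/mol
for gas, m in molar_mass.items():
    # 1 ppb corresponds to (M / Vm) µg/m³, hence 1 ppm to (M / Vm) * 1000 µg/m³
    print(gas, round(m / MOLAR_VOLUME, 4), 'µg/m³ per ppb')
# CO 1.1642, NO2 1.9123, O3 1.995, SO2 2.6629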
Let’s define two helper structures to match our data set with the constants from python-aqi: index maps our column names to the python-aqi pollutant constants, and caps holds, for each pollutant, the concentration at the top of the AQI scale so that off-scale readings can be clipped instead of making the conversion fail.
index = {'CO_8h': aqi.POLLUTANT_CO_8H,
'FSP' : aqi.POLLUTANT_PM25,
'NO2' : aqi.POLLUTANT_NO2_1H,
'RSP' : aqi.POLLUTANT_PM10,
'SO2' : aqi.POLLUTANT_SO2_1H,
'O3' : aqi.POLLUTANT_O3_1H,
'O3_8h' : aqi.POLLUTANT_O3_8H}
caps = {'CO_8h': 50.4,
'FSP' : 500.4,
'NO2' : 2049,
'RSP' : 604,
'SO2' : 1004,
'O3' : 0.604,
'O3_8h' : 0.374}
Now we can apply the AQI calculation to our DataFrame.
def aqi_atom(cc, cap, p):
    """Individual AQI for one concentration reading cc of pollutant p, clipped at cap."""
    try:
        if pd.isnull(cc):
            return None
        else:
            return float(aqi.to_aqi([(p, min(cap, cc))]))
    except (ZeroDivisionError, IndexError) as e:
        return 0
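# Quick sanity check on two hypothetical single readings (not taken from the data set):
# a PM10 value of 107 µg/m³ maps to an individual AQI of 77, as in the table below,
# while a missing reading stays None.
aqi_atom(107.0, caps['RSP'], index['RSP'])         # -> 77.0
aqi_atom(float('nan'), caps['RSP'], index['RSP'])  # -> None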
aqidf = df.copy()
if not data_ready:
    # convert each pollutant column into its individual AQI
    for i in tnrange(2, df.shape[1]):
        p = index[df.columns[i]]
        cap = caps[df.columns[i]]
        aqidf.iloc[:, i] = df.iloc[:, i].apply(aqi_atom, args=(cap, p))
    # the overall AQI is the maximum of the individual pollutant AQIs
    aqi_series = aqidf.iloc[:, 2:].max(numeric_only=True, axis=1)
    aqi_series.name = "AQI"
    aqidf = aqidf.join(aqi_series)
# daily AQI = median of the hourly AQI values (column 9) for each date
dailyaqi = aqidf.iloc[:, [0, 9]].groupby(['DATE']).median().copy().T.squeeze()
aqidf.head()
|   | DATE | HOUR | FSP | NO2 | O3 | RSP | SO2 | CO_8h | O3_8h | AQI |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1998-01-01 | 1 | NaN | 57.0 | NaN | 77.0 | 31.0 | NaN | NaN | 77.0 |
| 1 | 1998-01-01 | 2 | NaN | 60.0 | NaN | 73.0 | 33.0 | NaN | NaN | 73.0 |
| 2 | 1998-01-01 | 3 | NaN | 64.0 | NaN | 79.0 | 34.0 | NaN | NaN | 79.0 |
| 3 | 1998-01-01 | 4 | NaN | 56.0 | NaN | 75.0 | 31.0 | NaN | NaN | 75.0 |
| 4 | 1998-01-01 | 5 | NaN | 65.0 | NaN | 79.0 | 36.0 | NaN | NaN | 79.0 |
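A remark on the daily series computed above: dailyaqi is the median of the hourly AQI values for each date. The positional iloc[:, [0, 9]] works because AQI ends up as the tenth column; selecting the columns by name, as sketched below, is equivalent and a bit more robust if the column order ever changes.

dailyaqi = aqidf[['DATE', 'AQI']].groupby('DATE').median().squeeze()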
And we can save the resulting time series as CSV files for later use.
if not data_ready:
    dailyaqi.to_csv(folder + '/aqi_daily.csv')
    aqidf.to_csv(folder + '/aqi_hourly_full.csv')