Open Data, or Opacity through Transparency
Published on June 20, 2018

In recent years we have seen governments and other public bodies join the Open Data movement: releasing part of their collected data under an open license so that third parties can use it for any purpose. Most of these licenses have very permissive terms, allowing redistribution and not restricting commercial use. The motivation to release the data can come from different sources: some of it can be useful for organizations or individual users, or it may simply form a nice dataset that there is no point in jealously hiding (for instance, Open Data Madrid has a complete dataset of benches across the city, including coordinates and model). However, when a government or other public institution is involved, the motivation is usually the same: the data represents something that belongs to or heavily influences the general good, and its release as Open Data is most of the time considered an exercise of transparency, allowing everyone to freely explore it and third parties to audit it.

This can be seen as utopian or even fallacious: even though most people nowadays own some kind of computing device, not everyone is a data analyst. Not everyone is able to read, process and draw conclusions from a raw data file like a CSV. For that reason, visualizations of the data such as maps or other animations are usually created for informative purposes. However, this work is rarely motivated by anything other than volunteering: students or people working with data usually do it in their free time, trying to improve their craft or to share knowledge in an easier-to-digest format.

This is closely related to the main point of this post: contributors using Open Data are (most likely) not being paid to do so, and if they are, their work is probably going to end up in a commercial product rather than be shared under an open license. People trying to practice and improve are most likely just looking for data to work on. If the data is malformed, uncomfortable to use, or obfuscated, the odds are the contributor is not going to be stubborn about it: they will ignore the dataset and search for a new one.

This is the central point of this article: Open Data should be handled with special care, so that people can easily work with it. Fortunately, to support this claim, I have an entertaining case study - entertaining for you, because you are going to see me suffer.

The Story: Air Quality in Madrid

In my company in general, and in my unit in particular, we are constantly trying new techniques for data processing, prescription and prediction. To do so, we may build proofs of concept, but as with any work related to data science or analysis, we need actual data. Sometimes we already have it, and sometimes we rely on open datasets like the ones on Kaggle. A week ago, however, we decided to search for a good Madrid Open Data dataset that aligned with our needs. Open Data would allow us to experiment knowing that the resulting proof of concept could then be shared: as a blog post, in a meetup, etc. Also, for further reference in this post, my language of choice for this task is Python 3.

Soon enough, we hit the jackpot: one of the datasets featured hourly levels of pollution in Madrid over the last 18 years. Reading the description, we can expect to find not just one hourly, 18-year-long time series, but several: it explicitly states that the measurements of multiple stations are collected. A related dataset features information about the stations themselves to enrich the analysis, and if some location is provided, the possibilities are great. At a first glance at the website we can see that the file is Excel's very own .xls. It makes my eyes roll every time, but pandas can deal with it, so let's leave it for now.

There is a description of the file available, but I like to take a look at the data myself before reading the column descriptions and such. There are 18 zip files available (one per year), each containing 12 files (one per month). The extension of the files themselves is .txt, but it is probably some kind of tabular format. Let's open a random one (the ellipses mean that the lines are truncated; they are long):

$ head Abr_mo01.txt
2807900101380201040119.09V19.35V12.02V8.870V7.710V7.340V6.420V6.140V6.590...
280790010138020104028.680V12.98V14.84V9.830V7.160V8.720V14.39V18.28V25.84...
280790010138020104039.140V8.590V6.600V5.150V5.000V5.200V5.740V8.610V11.24...
2807900101380201040410.58V8.360V7.380V6.050V5.460V5.140V7.270V10.99V9.720...
280790010138020104054.300V3.160V2.410V1.860V1.480V1.540V3.000V4.650V7.440...
280790010138020104068.400V8.040V6.310V7.750V10.74V12.76V13.09V16.52V24.65...
2807900101380201040710.65V9.210V8.260V9.400V7.880V7.690V6.710V6.940V6.720...
280790010138020104088.480V8.810V8.370V7.280V6.880V5.710V5.510V5.390V5.560...
280790010138020104097.340V6.930V5.560V5.390V4.750V5.070V9.980V19.53V37.39...
2807900101380201041029.94V31.06V29.91V15.02V20.47V12.10V15.89V29.39V31.41...

Ok, so there are no headers to name the columns, but we can see that it is a dot-separated text containing… Wait, what is 09V19? Those are not regular numbers. Is V the separator? That is not something I have seen before, but maybe the dot is actually a decimal point (which would make sense). But then the first field, which looks like an ID of some sort, suddenly turns into a floating-point value… It is time to open the data description.

In the PDF we find that it is actually fixed-width text, which explains why all those V characters are vertically aligned. In Spanish, it says that:

  • The first 8 digits are the station ID,
  • the next 2 digits are the "parameter" (the document invites us to look at Annex 2),
  • the next 2 digits are the measuring technique,
  • the next 2 digits are the analysis period (02 meaning hourly; since we have downloaded hourly data, this column will always be 02),
  • the next 6 digits are the date in YYMMDD format,
  • and then 5 characters for the level plus 1 validation character per measurement, repeated 24 times.

So, contrary to what we might have expected, each row is a day and has 24 columns, one per hour. That is a… weird way to present the data (anyone who needs that layout probably knows how to use a pivot table), and inconvenient for us. Funnily enough, the daily data is even weirder following the same pattern: each row is a month and each column is a day, which means that not all rows have the same number of columns. We also have a technique column and a "parameter" column, which turned out to be the gas that row is measuring. Taking a look at the annex, one of the columns seems redundant, since each gas is only measured using a single technique, so we can discard it. All those V characters scattered through the text turned out to be "validation characters" indicating that the measurement is valid. It explicitly states that only V-marked values are valid (so why include the rest at all?), so we will get rid of the rest.
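Once level and flag live in separate columns, keeping only the valid measurements is trivial; a toy example (column names are hypothetical, not the ones from the real files):

```python
import pandas as pd

# Keep only the rows whose validation flag is 'V'
df = pd.DataFrame({'level': [19.09, 8.68, 9.14],
                   'valid': ['V', 'N', 'V']})
valid_only = df[df['valid'] == 'V'].drop(columns='valid')
```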

All this processing involves some functions that are far from common in Python, and specifically in pandas: this was the first time I heard of pandas.read_fwf() to read fixed-width text files, and also my first time using pandas.melt(), which basically "unpivots" columns and which I need to get hours as rows. After doing so, I needed to reconstruct the dates to generate correct timestamps, and I decided to pivot the measured gases into columns (so that each column is a gas), which makes it easier to get sequential series by just selecting a column instead of querying for rows. It also allows the columns to have meaningful names, extracted from the abbreviations in the annex.
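A rough sketch of that pipeline follows. The column names are my own, and the synthetic one-row sample stands in for a real .txt file (with the actual data you would pass the file path to read_fwf instead of a StringIO):

```python
import io

import pandas as pd

# Column spec following the PDF: 8-digit station, 2-digit parameter,
# 2-digit technique, 2-digit period, YYMMDD date, then 24 blocks of
# a 5-character level plus a 1-character validation flag.
colspecs = [(0, 8), (8, 10), (10, 12), (12, 14), (14, 20)]
names = ['station', 'parameter', 'technique', 'period', 'date']
for hour in range(24):
    start = 20 + hour * 6
    colspecs += [(start, start + 5), (start + 5, start + 6)]
    names += [f'value_{hour:02d}', f'valid_{hour:02d}']

# A synthetic line with the same shape as the real files
line = '28079001' + '01' + '38' + '02' + '010401' + '19.09V' * 24
raw = pd.read_fwf(io.StringIO(line), colspecs=colspecs, names=names,
                  dtype=str)

# Unpivot the 24 hourly value columns into rows
long = raw.melt(id_vars=['station', 'parameter', 'date'],
                value_vars=[f'value_{h:02d}' for h in range(24)],
                var_name='hour', value_name='level')
long['level'] = pd.to_numeric(long['level'], errors='coerce')
# Rebuild full timestamps from the YYMMDD date plus the hour suffix
long['ts'] = pd.to_datetime('20' + long['date'] + long['hour'].str[-2:],
                            format='%Y%m%d%H')
# Pivot so that each measured gas ("parameter") becomes its own column
wide = long.pivot_table(index=['ts', 'station'], columns='parameter',
                        values='level')
```

The real script additionally maps the parameter codes from Annex 2 to readable gas names and handles the validation flags; this sketch only shows the reshaping steps.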

This worked on a single file, but I needed to reconstruct the whole dataset, formed by 208 files. The names do not follow any convention at all, and the files have to be carefully selected when reading, because some folders include XML or CSV files that can break the script if read. Two different parsing methods have to be implemented, because the last years are actually comma-separated and the columns are divided differently (the station ID is split into 3 different fields). Some fields are malformed, so a ValueError exception has to be expected and ignored. While doing this, measurements of a gas that is not mentioned at all in the description (58) appear in certain years, so this also has to be dropped.
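That collection logic can be sketched as follows; the helper names are hypothetical, and parse_file stands in for whichever of the two parsers (fixed-width or comma-separated) applies to a given year:

```python
import pandas as pd

def safe_level(raw_value):
    """Parse a measurement, returning None for malformed fields."""
    try:
        return float(raw_value)
    except ValueError:
        return None

def collect(paths, parse_file):
    """Concatenate every .txt file, dropping the undocumented gas 58."""
    frames = []
    for path in paths:
        if not path.endswith('.txt'):  # skip stray XML/CSV files
            continue
        df = parse_file(path)
        frames.append(df[df['parameter'] != '58'])
    return pd.concat(frames, ignore_index=True)
```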

Once this is achieved, the resulting DataFrame is half a gigabyte. For convenience, I decided to store it in an HDF5 file. This format is hierarchical, compressed and absurdly fast when accessing sequential rows (which is more than likely when working with time series). I decided to split the data by station, giving each station its very own dataset inside the file. This allows the user to select all the sequential data of one station, and allows us to discard all entries from periods of inactivity and all columns related to gases that the station does not measure (and which are therefore completely empty). This little bit of reflection and savoir faire results in a single, hierarchical and convenient file of 70MB (7 times smaller than the CSV equivalent, 3 times smaller than the original data).
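The per-station split can be sketched like this (column names are hypothetical; the actual HDF5 write requires PyTables, so it is shown commented out):

```python
import pandas as pd

def split_by_station(df):
    """One frame per station, dropping gas columns that are fully empty."""
    return {station: group.drop(columns='station').dropna(axis=1, how='all')
            for station, group in df.groupby('station')}

# Writing each station to its own key in the file (requires PyTables):
# with pd.HDFStore('madrid.h5', mode='w', complevel=9, complib='blosc') as s:
#     for station, frame in split_by_station(df).items():
#         s[station] = frame
```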

And we still have to check the stations' Excel file. Once we open it, we get what we expect: a heavily formatted XLS file, which we basically need to copy-paste without formatting into a new sheet before reading. However, some underlying problems appear after a closer look: the coordinates are provided in degrees-minutes-seconds format instead of decimal (which makes them unusable as-is and requires parsing and conversion), and the IDs are different. Using a bit of imagination and detective skills, it is easy to infer that these IDs are just the last two digits of the other set of IDs (the rest are constant codes for the region). But here lies the real problem: not even half of the stations referenced in the data have this extended information. This is because some of them have "ceased activity", and no coordinates are provided for them.
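The coordinate conversion is mechanical once the format is known; a sketch assuming a degrees-minutes-seconds string shaped like the ones in the sheet (the exact punctuation in the XLS may differ):

```python
import re

def dms_to_decimal(dms):
    """Convert e.g. "40º 25' 18'' N" to signed decimal degrees."""
    deg, minutes, seconds, hemi = re.match(
        r"(\d+)[º°]\s*(\d+)'\s*([\d.]+)'*\s*([NSEW])", dms).groups()
    sign = -1 if hemi in ('S', 'W') else 1
    return sign * (int(deg) + int(minutes) / 60 + float(seconds) / 3600)
```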

The Open Secret

This took me a whole day at work to clean and understand, a task that most people would have given up on. Even after all the hard work, there are some disappointing gotchas, like the lack of information about the old stations.

However, it is not as if the city council is unaware of this; it is probably quite the contrary. On the last page of the aforementioned data description document, we can find a comprehensive explanation of how to actually open the files with Excel. Spoiler alert: it takes 16 steps. This is probably a fair indicator that they do not use this data themselves, or at least that they don't work with it the way they expect you to. The current city council is probably not the (only) one to blame: all this inconvenience and accidental obfuscation probably has a reason to exist.

A short data exploration reveals some interesting details about the stations. We can check for null values over the whole period and see the stations side by side, to get a better perspective on the activity of each one. Using the wonderful library missingno, we get a fast and easy-to-understand visualization of when data is present in each station.

import pandas as pd
import numpy as np
import missingno as msno

# Read the file
with pd.HDFStore('../input/madrid.h5') as data:
    stations = {k[1:]: data[k] for k in data.keys() if k != '/master'}

# Prepare the full period for representation
missing = pd.DataFrame(
    index=pd.date_range(start='2001-01-01', end='2018-05-01', freq='H')
)

# Collapse the data and store it in the DataFrame
for station, df in stations.items():
    missing[station] = df.isnull().all(axis=1)

# Prepare the data for msno representation:
# np.nan marks missing data, 0 marks present data
missing = (missing.fillna(True)
                  .replace(True, np.nan)
                  .replace(False, 0))

msno.matrix(missing, freq='Y')

The resulting image represents data with grey blocks and missing data as gaps, and an obvious pattern emerges. It seems that in 2009 almost half of the stations ceased all activity, and a good portion of them started operating. Only 6 stations span the whole period, but those were manually merged (their IDs were renamed and appear as separate IDs in the data, but the explanation explicitly states that they are the same stations).

(Image: missing-min.png - missingno matrix of data availability per station)

This may seem familiar if you, dear reader, live in Madrid. In 2009 Madrid was breaking all pollution records and endured fierce criticism for its inaction: no regulation of the most polluting vehicles in the city center and a lack of green spaces across it were casting doubt on the city's sustainability. That same year, the EU limits went from a friendly recommendation to being compulsory, and most of them were not being met.

The city council's reaction was interesting, to say the least: instead of taking action against the pollution, their efforts focused on masking the results. In late 2009 most of the stations were moved to different locations or simply closed. One of the closed stations had registered 74 µg/m3 of NO2 when the limit recommended by the EU is 42 µg/m3; 18 out of 24 stations registered levels higher than this limit. The stations were moved to more propitious locations: stations located at crossings and roundabouts were moved to greener areas, which register lower measurements when the levels are peaking. Even most of the surviving stations changed which substances they measured, something that can be seen in the data. Are particles smaller than 10µm (PM10) a problem?1 Well, then stop measuring them altogether.

(Image: pza-espana-min.png)

The Lesson: Exercise Skepticism

In 2010, Madrid lowered its measured pollution levels by 26% from the previous year. That change is clearly visible in the data. The problem is that someone not paying enough attention could attribute it to the 8% drop in fuel consumption (a consequence of the financial crisis, not of any official measures) or to the especially unstable weather that year (26% more rain than average). Only a closer look at the data, the kind of closer look that no one following 16 steps in Excel would probably take, reveals the truth. And the truth is that the pollution levels probably remained intact; only the way they were measured changed. Although a change can be seen in the last year (retired measurements are being added back to the stations), this dataset is proof of the deceitful tactics that were used to game the data and the EU measurements.

And this is an important lesson to take into account: open data, just like open source, relies on contributors to audit and verify it. Interaction and exploration are, however, easier to obfuscate: when dealing with large amounts of data, it is easy to hide flaws and irregularities just by making the files inconvenient to work with. For that reason, it is important not only to demand open data in sensitive matters that affect the common good: the quality and accessibility of the data are just as important. Open data should be straightforward, and should not rely on almost-steganographic designs to hide flaws in plain sight.

Also, as a final note, the processed data is available on Kaggle under the name Air Quality in Madrid (2001-2018) for further exploration, in an attempt to win back all those potential contributors. Even though the change in trend is dishonest, the data is still interesting and can be useful for all kinds of time series analysis or prediction.

Footnotes:

1. It is rhetorical, but in case you are left wondering: yes, they are. They give people cancer.