UK General Election 2015 Special

_74880665_leadercompThose clever folks over at Nate Silver’s FiveThirtyEight have been analysing the polling figures for the forthcoming UK General Election. More than any in the UK’s electoral history, this is set to be the most unpredictable ever, with a first-past-the-post electoral system unfit for a multi-party democracy likely to throw up all kinds of unexpected outcomes in constituencies across the country. In an attempt to make sense of all the polling history, both at the national level and – more recently, thanks to Lord Ashcroft Polls – the constituency level, Silver’s brainiacs have offered what they believe is the best possible prediction for every seat in England, Scotland and Wales (Northern Ireland is a special case and, alas, does not have comparably rich polling data available).

I’m not a statistician, but I do love a good data set. I thought it would be fun to take the data which Chris Hanretty, Ben Lauderdale and Nick Vivyan have made available on their site and convert it into a classic star schema format. I did this largely for my own entertainment (I am an unashamed political junkie) but as much as anything else the exercise demonstrates the power of a well formed data model, which is why I thought I’d share it online.

I started off by defining my dimensions. Given the data available, the most obvious were:

  • Party (e.g. Conservative, Labour etc.)
  • Seat (e.g. Surrey East, Sheffield Hallam etc.)

Party is a flat dimension, with a short and long form description, as well as a simple flag to indicate whether the party is national (i.e. stands in England, Scotland and Wales) or not (as in the case of the SNP and Plaid Cymru who only contest seats in their native Scotland and Wales, respectively). Seat, however, offers a limited degree of hierarchy, since each seat falls under a particular region.

For my fact tables, I felt it would be a good idea to split out the concrete data of the 2010 election results (i.e. who won which seat) from the more uncertain data of the 2015 ‘poll of polls’ predictions. As such, I created two different facts:

  • Election 2010 Results
  • Election 2015 Predictions

The 2010 fact table is quite basic, as it offers only one piece of information: which party won which seat. It is also a good example of a ‘factless fact’, because it doesn’t have any measures. (With a bit of extra work I could have gone and found another data source to provide me with the vote share for each contested seat in 2010, but I was working with just a single data set in this instance.) However, not having any measures does not mean a fact table is without its uses, as this one offers the informative cross-section between seat and victor.

The 2015 fact table has multiple measures: the overall predicted vote share per party per seat, the upper and lower thresholds of that predicted vote share (with a 5% margin of error on either side of those thresholds) and the probability of a party winning a particular seat.

I chose to use MySQL for this exercise, largely because it’s open source and so anyone else reading this could easily take my work and reconstruct it for themselves, if they so desired. The first step was to create DDL scripts to generate each of my four entities.

CREATE TABLE dim_party
(party_id             INT AUTO_INCREMENT PRIMARY KEY NOT NULL,
 party_code           VARCHAR(5),
 party_name           VARCHAR(100),
 party_region         VARCHAR(100),
 is_national          BIT);

CREATE TABLE dim_seat
(seat_id              INT AUTO_INCREMENT PRIMARY KEY NOT NULL,
 seat_region          VARCHAR(255),
 seat_name            VARCHAR(255));

CREATE TABLE fct_2010_result
(f_2010_result_id     INT AUTO_INCREMENT PRIMARY KEY NOT NULL,
 seat_id              INT,
 party_id             INT);

CREATE TABLE fct_2015_prediction
(f_2015_prediction_id INT AUTO_INCREMENT PRIMARY KEY NOT NULL,
 seat_id              INT,
 party_id             INT,
 vote_share_pct       NUMERIC(18,2),
 lower_bound_vote_pct NUMERIC(18,2),
 upper_bound_vote_pct NUMERIC(18,2),
 probability_pct      NUMERIC(18,2));

In the case of my party and seat dimensions, I made use of the AUTO_INCREMENT feature to quickly generate surrogate keys. The only slight snag with this approach is that my incoming fact data sets wouldn’t know what the equivalent surrogate key was for each party and seat. To overcome this challenge I defined a couple of simple functions that carry out a lookup on the dimension table, taking in the natural key (how a party or seat is identified in the real world) and returning the surrogate key (how the dimension table identifies a party or seat).

CREATE FUNCTION func_getSeatID(p_seatName VARCHAR(255)) RETURNS INTEGER
DETERMINISTIC
BEGIN
 DECLARE v_seatID INTEGER;
 SELECT seat_id INTO v_seatID
 FROM   dim_seat
 WHERE  seat_name = p_seatName;
 RETURN(v_seatID);
END;

CREATE FUNCTION func_getPartyID(p_partyCode VARCHAR(255)) RETURNS INTEGER
DETERMINISTIC
BEGIN
 DECLARE v_partyID INTEGER;
 SELECT party_id INTO v_partyID
 FROM   dim_party
 WHERE  party_code = p_partyCode;
 RETURN(v_partyID);
END;

Finally, with a bit of Python wizardry, I converted the HTML data on Election Forecast into a set of INSERT scripts. There are more efficient ways of loading data into a database, but if you’re using MySQL with a small data set and you haven’t got access to more advanced ETL tools or database administration software, a set of INSERT scripts can often prove to be a data architect’s best friend. Since they’re much too large to cut and paste, you can download them here.

Having created my tables and populated them with the data, I defined my first analytical query to interrogate the data now at my disposal. To start off, I thought it would be interesting to aggregate the predicted vote share by party and region, with a total national vote share for each party. As I constructed a simple hierarchy in my seat dimension table, this proved to be a simple exercise: I can easily sum up the vote share by region (rather than by seat), divide it by the number of seats in that region and group the result by party. With the addition of a UNION ALL, it is possible to repeat the query at a higher level of aggregation and output a ‘total’ line at the end of the SQL report.

SELECT
seat_region AS "Region",
seat_count AS "Total Seats",
FORMAT(ROUND(SUM(CASE WHEN party_code = 'CON' THEN sum_share ELSE 0 END) / seat_count,2),2) AS "Conservative Vote Share",
FORMAT(ROUND(SUM(CASE WHEN party_code = 'LAB' THEN sum_share ELSE 0 END) / seat_count,2),2) AS "Labour Vote Share",
FORMAT(ROUND(SUM(CASE WHEN party_code = 'LIBDEM' THEN sum_share ELSE 0 END) / seat_count,2),2) AS "Liberal Democrat Vote Share",
FORMAT(ROUND(SUM(CASE WHEN party_code = 'UKIP' THEN sum_share ELSE 0 END) / seat_count,2),2) AS "UKIP Vote Share",
FORMAT(ROUND(SUM(CASE WHEN party_code = 'GREEN' THEN sum_share ELSE 0 END) / seat_count,2),2) AS "Green Vote Share",
FORMAT(ROUND(SUM(CASE WHEN party_code = 'SNP' THEN sum_share ELSE 0 END) / seat_count,2),2) AS "SNP Vote Share",
FORMAT(ROUND(SUM(CASE WHEN party_code = 'PLAID' THEN sum_share ELSE 0 END) / seat_count,2),2) AS "Plaid Cymru Vote Share",
FORMAT(ROUND(SUM(CASE WHEN party_code = 'OTHER' THEN sum_share ELSE 0 END) / seat_count,2),2) AS "Other Vote Share"
FROM
(SELECT
SUM(f_15.vote_share_pct) AS sum_share,
COUNT(DISTINCT d_st.seat_name) AS seat_count,
d_st.seat_region,
d_py.party_code
FROM fct_2015_prediction f_15
JOIN dim_party d_py ON d_py.party_id = f_15.party_id
JOIN dim_seat d_st ON d_st.seat_id = f_15.seat_id
GROUP BY d_st.seat_region, d_py.party_code) sum_share
GROUP BY seat_region
UNION ALL
SELECT
'TOTAL' AS "Region",
COUNT(DISTINCT seat_name) AS "Total Seats",
ROUND(SUM(CASE WHEN party_code = 'CON' THEN vote_share_pct ELSE 0 END) / COUNT(DISTINCT d_st.seat_name) ,2) AS "Conservative Vote Share",
ROUND(SUM(CASE WHEN party_code = 'LAB' THEN vote_share_pct ELSE 0 END) / COUNT(DISTINCT d_st.seat_name) ,2) AS "Labour Vote Share",
ROUND(SUM(CASE WHEN party_code = 'LIBDEM' THEN vote_share_pct ELSE 0 END) / COUNT(DISTINCT d_st.seat_name) ,2) AS "Liberal Democrat Vote Share",
ROUND(SUM(CASE WHEN party_code = 'UKIP' THEN vote_share_pct ELSE 0 END) / COUNT(DISTINCT d_st.seat_name) ,2) AS "UKIP Vote Share",
ROUND(SUM(CASE WHEN party_code = 'GREEN' THEN vote_share_pct ELSE 0 END) / COUNT(DISTINCT d_st.seat_name) ,2) AS "Green Vote Share",
ROUND(SUM(CASE WHEN party_code = 'SNP' THEN vote_share_pct ELSE 0 END) / COUNT(DISTINCT d_st.seat_name) ,2) AS "SNP Vote Share",
ROUND(SUM(CASE WHEN party_code = 'PLAID' THEN vote_share_pct ELSE 0 END) / COUNT(DISTINCT d_st.seat_name) ,2) AS "Plaid Cymru Vote Share",
ROUND(SUM(CASE WHEN party_code = 'OTHER' THEN vote_share_pct ELSE 0 END) / COUNT(DISTINCT d_st.seat_name) ,2) AS "Other Vote Share"
FROM fct_2015_prediction f_15
JOIN dim_party d_py ON d_py.party_id = f_15.party_id
JOIN dim_seat d_st ON d_st.seat_id = f_15.seat_id
ORDER BY 2 ASC;

And this is just the first of many possible queries you could make using this flexible data model. Have fun!

Thanks again to Chris HanrettyBen Lauderdale and Nick Vivyan for their generosity in sharing their fascinating data with the world.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: