I am currently involved with a very exciting machine learning project being run within the Bath Machine Learning Meetup group (I’m sure I’ll write a future post with more details on this project, but it is currently at a very early stage!). In collaboration with Bath: Hacked we are using this set of open data, which contains information about the occupancy of car parks in the Bath & North East Somerset (B&NES) area from the last two years or so.
While looking for something to listen to whilst making my daily bus commute to and from the university, I recently discovered the marvellous Partially Derivative podcast. In one of the early episodes (S1E2, to be precise) one topic of conversation particularly caught my attention: Paul Downey’s blog post “One CSV, thirty stories”. The concept is simple: take a single dataset, and produce a different visualization of the data each day for 30 days.
Seeing as we will be working with the car parking dataset a great deal in coming months, and inspired by Downey’s post, I thought I would try something similar. So, without further ado…
Day 00 (11/11/16): Sneak preview
I thought it would be a good idea to have a quick look at the dataset before starting out. I will, of course, be using R (to start with at least!).
I downloaded the dataset as a CSV file from the Bath: Hacked datastore (the link is above) - the file on the website is updated continuously with live data from sensors in the car parks. The version I downloaded (at 22:11 on 11/11/16) weighs in at a hefty 316MB, and contains a very large number of records:
To avoid converting my Pentium-core laptop into a puddle of molten plastic, for now I’ll only try to deal with about 10% of these records. Looking at the dataset’s documentation we see that “scripts are set up to query the B&NES car park database every 5 minutes, the data is then pushed to the Bath: Hacked data store to… append to a historical set”. Therefore it shouldn’t be a problem to read in only the first 150000 rows of the CSV file, since these rows should contain an equal spread of data from each of the distinct car parks.
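A partial read of this kind can be sketched with the `nrows` argument of `read.csv`. The snippet below is only an illustration: it uses a tiny temporary file as a stand-in for the real download, and the column names are assumptions rather than the dataset's actual headers.

```r
# Stand-in for the real download: a small CSV written to a temporary file.
# Column names and values are invented for illustration.
csv_path <- tempfile(fileext = ".csv")
toy <- data.frame(
  Name      = rep(c("Car park A", "Car park B"), times = 10),
  Occupancy = seq_len(20)
)
write.csv(toy, csv_path, row.names = FALSE)

# Read only the first n_rows records (the post uses 150000 on the full file)
n_rows <- 6
df <- read.csv(csv_path, nrows = n_rows, stringsAsFactors = FALSE)

nrow(df)        # number of records actually read
table(df$Name)  # how those records are spread across car parks
```

The final `table()` call is a cheap way to confirm that the rows read in really are spread evenly across the car parks, rather than taking that on trust.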
Let’s have a quick first look at our data! Note: a few records are empty in certain fields (future investigation may be necessary…) - I’ll fill these fields with NAs for now to make it more obvious when data is missing.
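One way to get empty fields treated as missing is the `na.strings` argument of `read.csv`. This is a minimal sketch on an invented three-row file; the column names are assumptions.

```r
# Invented file with two empty fields (one numeric, one text)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Name,Capacity,Occupancy",
  "Car park A,100,40",
  "Car park B,,75",
  ",200,10"
), csv_path)

# Treat empty strings (as well as the literal "NA") as missing values
df <- read.csv(csv_path, na.strings = c("", "NA"), stringsAsFactors = FALSE)

str(df)             # quick structural overview
colSums(is.na(df))  # number of missing values per column
```

Empty numeric fields become `NA` by default; adding `""` to `na.strings` makes empty text fields behave the same way.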
Looking interesting… but I won’t look too closely for now. The fun starts tomorrow!
Day 01 (12/11/16): Car parks? What car parks?
A logical first step would be to work out where all the information we have is coming from!
A week or two ago I attended a learning night run by Bath: Hacked, which was an introduction to using the mapping tool Carto (formerly CartoDB) for visualizing geographic data. Conveniently, our dataset includes some location information…
So let’s see where these car parks are!
Day 02 (13/11/16): Maximum capacities
Making use of the small dataframe I created yesterday (which has one entry per car park), and for now sticking to R’s base graphics:
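A base-graphics bar chart of this sort might look roughly like the sketch below. The data frame is a hypothetical stand-in with one row per car park; the names and capacities are invented.

```r
# Hypothetical one-row-per-car-park data frame; names and numbers are invented
car_parks <- data.frame(
  Name     = c("Car park A", "Car park B", "Car park C"),
  Capacity = c(500, 150, 820)
)

# Sort ascending: with horiz = TRUE the first bar is drawn at the bottom,
# so the largest car park ends up at the top
car_parks <- car_parks[order(car_parks$Capacity), ]

par(mar = c(5, 8, 4, 2))  # widen the left margin to fit the labels
barplot(
  car_parks$Capacity,
  names.arg = car_parks$Name,
  horiz = TRUE,
  las = 1,                # horizontal axis labels
  xlab = "Maximum capacity (spaces)",
  main = "Maximum capacity by car park"
)
```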
Day 03 (14/11/16): Capacities over time (Day 02 Reprise)
It crossed my mind earlier today that I had made a couple of assumptions about the data I’m working with - primarily, I assumed that for a given car park, each record would have the same values for certain columns. Before doing anything else, I thought I should check that this was actually the case.
The columns in question are: Location, Easting, Northing (which are all related, of course) and Capacities.
The first three columns only have one value per car park - great, our car parks aren’t moving!
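A check like this can be sketched by counting the distinct values of each column within each car park. The records below are invented (with one capacity deliberately made to vary), and the column names are assumptions.

```r
# Invented records; car park "A" has a capacity that changes
records <- data.frame(
  Name     = c("A", "A", "A", "B", "B", "B"),
  Easting  = c(1, 1, 1, 2, 2, 2),
  Northing = c(5, 5, 5, 6, 6, 6),
  Capacity = c(100, 100, 120, 50, 50, 50)
)

# Number of distinct values per column, per car park
n_distinct <- aggregate(
  cbind(Easting, Northing, Capacity) ~ Name,
  data = records,
  FUN = function(v) length(unique(v))
)
n_distinct
```

Any count greater than 1 flags a column that is not constant for that car park.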
However… it does seem that some capacities are not constant. Let’s try to see how they change over time.
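Here is a rough base-graphics sketch of how capacities could be plotted against time, one step line per car park. The data is invented to show the two patterns of interest (a lasting stepwise increase and a short-lived dip), and the column names are assumptions.

```r
# Invented records: one car park grows in steps, the other dips briefly
dates <- as.Date(c("2015-01-01", "2015-07-01", "2016-01-01", "2016-07-01"))
records <- data.frame(
  Name       = rep(c("Car park A", "Car park B"), each = 4),
  LastUpdate = rep(dates, times = 2),
  Capacity   = c(100, 150, 150, 200,   # stepwise increase
                 300, 300, 250, 300)   # short-lived dip
)

by_park <- split(records, records$Name)

# Empty plot spanning the full range, then one step line per car park
plot(range(records$LastUpdate), range(records$Capacity), type = "n",
     xlab = "Date", ylab = "Capacity")
for (i in seq_along(by_park)) {
  lines(by_park[[i]]$LastUpdate, by_park[[i]]$Capacity, type = "s", col = i)
}
legend("topleft", legend = names(by_park), col = seq_along(by_park), lty = 1)
```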
The plot reveals some interesting information about the “changes” in capacities. For Newbridge P+R, the capacity increases over time, and remains level after each increase. It seems likely, then, that this is a genuine increase in capacity over time due to extensions to the car park.
For the other car parks, any changes are suspiciously short-lived and the capacities return to their previous values soon after any change. For this reason I think it is likely that these are mostly misrecorded values - although of course it is possible that there were short-term, temporary extensions/closures of the car parks.
The records for Avon Street CP also end abruptly, in mid-summer of this year.
By pure chance, my visualization from yesterday (of maximum capacities) was using records from February 2016 - a time of relative stability, and when Newbridge P+R had reached its current maximum capacity - and so it is still, I would say, a valid representation of the data. But having had a much better look at the data today it is obvious I got a bit lucky!
Day 04 (15/11/16): Percentage occupancy by day (with strangely-ordered days…)
It’s a long title today. This is reflective of the long time I have spent trying to re-order the panels in today’s visualization, without success… more details shortly.
First, let’s get the data we’re interested in today.
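One plausible way to build such a summary is to average percentage occupancy by car park, day of the week and 10-minute interval. The sketch below uses invented records and assumed column names, and is not necessarily how `df5` was actually produced.

```r
# Invented records: two car parks sampled every 5 minutes for one week (UTC),
# starting on Monday 1 February 2016
stamps <- seq(as.POSIXct("2016-02-01 00:00:00", tz = "UTC"),
              by = "5 min", length.out = 12 * 24 * 7)
records <- data.frame(
  Name       = rep(c("Car park A", "Car park B"), each = length(stamps)),
  LastUpdate = rep(stamps, times = 2),
  Percentage = rep(c(40, 60), each = length(stamps))
)

lt <- as.POSIXlt(records$LastUpdate, tz = "UTC")

# Fixed English day names, so the result does not depend on the locale
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
records$Day <- day_names[lt$wday + 1]

# Round the time of day down to the start of its 10-minute interval
records$Time <- sprintf("%02d:%02d", lt$hour, (lt$min %/% 10L) * 10L)

# Mean percentage per car park, day and interval; aggregate() names the
# value column "x" when called this way
df5 <- aggregate(records$Percentage,
                 by = list(Name = records$Name, Day = records$Day,
                           Time = records$Time),
                 FUN = mean)

nrow(df5)  # 2 car parks * 7 days * 144 intervals
```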
Let’s have a quick look at what we’ve achieved:
Reassuringly, df5 contains 8064 entries:
6 (10-minute intervals per hour) × 24 (hours) × 7 (days) × 8 (car parks) = 8064
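The same count can be checked in R, either by multiplying the factors or by building the full grid of combinations and counting its rows. The variable names below are purely illustrative.

```r
intervals_per_hour <- 6
hours_per_day      <- 24
days_per_week      <- 7
n_car_parks        <- 8

# Direct multiplication
expected_rows <- intervals_per_hour * hours_per_day * days_per_week * n_car_parks
expected_rows  # 8064

# Equivalent check: one row per (interval, day, car park) combination
grid <- expand.grid(
  interval = seq_len(intervals_per_hour * hours_per_day),
  day      = seq_len(days_per_week),
  car_park = seq_len(n_car_parks)
)
nrow(grid) == expected_rows
```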
Now let’s create the plot!
Looking good - except for two things:
x-axis labels are not nicely formatted
Panels are not sensibly ordered
I have tried at some length to solve each of these issues - particularly the second, which I thought I could solve by reordering the underlying factor:
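Reordering a factor's levels is usually done by rebuilding it with an explicit `levels` argument. The sketch below uses a small invented vector of day names, with `Day1` as the reordered copy.

```r
days <- c("Wednesday", "Monday", "Sunday", "Friday")

# By default, levels are sorted alphabetically
default_order <- factor(days)
levels(default_order)

# Rebuild the factor with the levels in calendar order
week_order <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
Day1 <- factor(days, levels = week_order)

levels(Day1)        # calendar order
as.character(Day1)  # the underlying values are unchanged
```

Note that this differs from assigning to `levels(f) <- ...`, which renames the existing levels in place and so changes which label each record carries.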
But plotting again faceting by Day1 instead of Day only reorders the labels, and does not reorder the plots themselves - i.e. Wednesday’s data is now labelled as Sunday. This is obviously worse than what I currently have (it’s plain wrong, rather than just “not pretty”) so I haven’t included it here.
I am going to try to stick to the plan of one visualization per day - but along with my already busy schedule and regular maths workload, I am obviously putting myself under a great deal of time pressure! At best I have a couple of hours to dedicate to this project on any given day, and this isn’t very long considering I am learning pretty much from scratch as I go along!
I’m hoping, then, that this won’t become a common refrain, but… I’ve run out of time today! I’ll revisit this visualization at some point soon if I figure out what I was doing wrong.
Day 05 (16/11/16): Occupancy per weekday, by car park
Before doing anything else, I’m going to remove the redundant columns from the dataframe of all records, and save the new smaller dataframe to a CSV file - which should then be quicker to load and easier to work with.
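Trimming and saving a data frame in this way can be sketched as follows. The columns shown, and the choice of which one counts as redundant, are assumptions made for illustration.

```r
# Invented records; "Description" plays the role of a redundant column
records <- data.frame(
  Name        = c("Car park A", "Car park B"),
  LastUpdate  = c("2016-11-11 22:05:00", "2016-11-11 22:05:00"),
  Capacity    = c(100, 200),
  Occupancy   = c(40, 150),
  Description = c("A / open", "B / open"),
  stringsAsFactors = FALSE
)

# Keep only the columns needed for later analysis
keep <- c("Name", "LastUpdate", "Capacity", "Occupancy")
slim <- records[, keep]

# Save without row names so the file reloads with the same columns
out_path <- tempfile(fileext = ".csv")
write.csv(slim, out_path, row.names = FALSE)

reloaded <- read.csv(out_path, stringsAsFactors = FALSE)
identical(names(reloaded), keep)
```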
Following another moment of mild panic this morning, I’m also going to check that the times being recorded are adjusted for GMT/BST each spring and autumn. If they are then I don’t need to worry - if not, I’ll have to do some time-correction work…
Suspiciously, there ARE records between 01:00 and 02:00 - and roughly 1/8 as many as for each other hour. The plot thickens when we remember that we have 8 car parks. Is one car park guilty?
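A check of this kind can be sketched by tabulating records per hour on a day when the clocks go forward, and then splitting the counts by car park. The sketch assumes the check is made on 27 March 2016 (when UK clocks moved from 01:00 GMT to 02:00 BST, so local times between 01:00 and 02:00 should not exist), that timestamps are stored as `YYYY-MM-DD HH:MM:SS` strings, and that the column names are as shown; the records themselves are invented.

```r
# Invented timestamps: only "Car park B" records during the skipped hour
records <- data.frame(
  Name = c("Car park A", "Car park B", "Car park A", "Car park B",
           "Car park A", "Car park B", "Car park B"),
  LastUpdate = c("2016-03-27 00:30:00", "2016-03-27 00:30:00",
                 "2016-03-27 02:30:00", "2016-03-27 02:30:00",
                 "2016-03-27 03:30:00", "2016-03-27 03:30:00",
                 "2016-03-27 01:30:00"),
  stringsAsFactors = FALSE
)

# Work on the raw strings so that no time zone conversion is applied
on_day <- records[substr(records$LastUpdate, 1, 10) == "2016-03-27", ]
on_day$Hour <- substr(on_day$LastUpdate, 12, 13)

by_hour <- table(on_day$Hour)                    # records per hour
by_park_hour <- table(on_day$Name, on_day$Hour)  # split by car park

by_hour
by_park_hour
```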
Guilty as charged… I shall, for now, ignore any potential timeshift, and I will investigate this further tomorrow.
Right, on to today’s visualization!
I’m going to have a go at a somewhat similar plot to yesterday - but faceting by car park, rather than by day.
And regarding yesterday’s plot… good news! I am indebted to Paul Rougieux for pointing out that I was using y = df5$x in my ggplot code yesterday, where I ought to have been using y = x - changing this solves the problem! So here is yesterday’s visualization again, but ordered more sensibly.
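To illustrate the fix, here is a minimal ggplot2 sketch on invented data (the real `df5` has more columns and rows). Referring to the column by its bare name inside `aes()` keeps the values attached to their rows when the data is split into panels, whereas `df5$x` hands ggplot2 a vector detached from the data frame.

```r
library(ggplot2)

week_order <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# Invented hourly values for each day of the week
df5 <- expand.grid(Time = 0:23, Day = week_order, stringsAsFactors = FALSE)
df5$x <- 50 + 10 * sin(df5$Time / 24 * 2 * pi)

# Put the panels in calendar order
df5$Day <- factor(df5$Day, levels = week_order)

# Note y = x, not y = df5$x
p <- ggplot(df5, aes(x = Time, y = x)) +
  geom_line() +
  facet_wrap(~ Day)
print(p)
```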