As promised yesterday, I’m going to investigate the potential timeshift error in the data relating to one particular car park, caused by the changeover from GMT to BST.
Hmmm… let’s quickly try to see the effect of this error from one week to the next by comparing a day in a GMT week with the same weekday in a BST week. I’m going to use a couple of packages which are new to me - dplyr and lubridate - to manipulate the data.
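A minimal sketch of this kind of comparison, using made-up records rather than the real data. The column names `Name` and `Percentage`, the car park name and the two dates are all assumptions for illustration; only `LastUpdate` is a column named in these posts.

```r
library(dplyr)
library(lubridate)

# Made-up records: the same weekday (a Tuesday) in a GMT week and in a BST week
toy <- data.frame(
  Name = "Example CP",
  LastUpdate = c("2015-03-24 08:00:00", "2015-03-24 12:00:00",
                 "2015-03-31 08:00:00", "2015-03-31 12:00:00"),
  Percentage = c(40, 85, 42, 83)
)

comparison <- toy %>%
  mutate(LastUpdate = ymd_hms(LastUpdate),   # parse the timestamps (as UTC)
         Day = as_date(LastUpdate),          # calendar date of each record
         Hour = hour(LastUpdate)) %>%        # hour of day, 0-23
  group_by(Day, Hour) %>%
  summarise(MeanPercentage = mean(Percentage), .groups = "drop")

comparison
```

Plotting `MeanPercentage` against `Hour` with one line per `Day` would show a one-hour offset between the two curves if the shift were present.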
Oh… those curves look quite similar. There certainly isn’t an obvious one-hour shift from one week to the next.
Let’s have another look at that vector from earlier of the number of records from each hour:
I was focusing so much on the suspicious 2nd element that I didn’t spot the also-somewhat-suspicious 3rd element - there are noticeably fewer records from 02:00 to 02:59 than from the other hours. This would be explained if Podium CP changes its clocks at 02:00, rather than 01:00... let’s see.
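One way to run this check, sketched on made-up records. The `Name` column, the "Example CP" car park and the date are assumptions; the toy data is built so that Podium CP has no records in the 02:00 hour.

```r
library(dplyr)
library(lubridate)

# Made-up records around a clock change, not the real data
toy <- data.frame(
  Name = c("Podium CP", "Example CP", "Podium CP", "Example CP", "Example CP"),
  LastUpdate = ymd_hms(c("2015-03-29 00:30:00", "2015-03-29 00:35:00",
                         "2015-03-29 03:10:00", "2015-03-29 02:20:00",
                         "2015-03-29 03:15:00"))
)

# Count, per car park, the records stamped between 02:00 and 02:59
in_gap <- toy %>%
  filter(hour(LastUpdate) == 2) %>%
  count(Name)

in_gap
```

A car park that is missing from `in_gap` contributed no records in that hour.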
So none of the records taken between 02:00 and 03:00 were from Podium CP - confirming my theory.
There is, then, as I suspected, some timeshift error in the data from Podium CP.
However, the error is present for a grand total of 2 hours each year (one hour on a spring day, one on an autumn day), and records from these hours are between 01:00 and 03:00, when there is virtually no change in the occupancy of the car park. Therefore I think it is reasonable not to worry about it too much!
Day 07 (18/11/16): Status by weekday
I seem to have been converting the LastUpdate column to various date formats fairly regularly. Right now, I can’t think of any good reason why I need to leave it in character format, so I’ll save myself a step in future and re-write the column in my CSV.
On to today's work.
There’s one column I haven’t really looked at so far: the Status column, which describes the change in occupancy of the car park compared to the previous record. Let’s have a look:
There are a number of records without entries in the Status column - but remembering a discovery from a previous day, I have an idea of who the culprit might be…
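A rough sketch of how to tally the Status values and trace the empty ones, on made-up records. The `Name` column, the Status labels and the use of an empty string for a missing entry are assumptions (the real file may hold `NA` instead).

```r
# Made-up records; "" stands in for a missing Status entry
toy <- data.frame(
  Name = c("Example CP", "Example CP", "test car park", "test car park"),
  Status = c("Filling", "Static", "", ""),
  stringsAsFactors = FALSE
)

# How many records have each Status value?
table(toy$Status)

# Which car parks do the empty entries come from?
empty_by_name <- table(toy$Name[toy$Status == ""])
empty_by_name
```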
That accounts for the 1333 records with no Status entry, then.
Onwards! We’ll remove the “test car park” records, and then for each car park and each weekday we’ll calculate the percentage of records that have each possible Status.
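The calculation described above might look something like this. It is a sketch on made-up records: the `Name` column and the three Status labels are assumptions, and the weekday labels produced by `wday()` depend on the locale.

```r
library(dplyr)
library(lubridate)

# Made-up records for a single Monday, plus one test record to be dropped
toy <- data.frame(
  Name = c("Example CP", "Example CP", "Example CP", "Example CP", "test car park"),
  LastUpdate = ymd_hms(c("2015-06-01 09:00:00", "2015-06-01 12:00:00",
                         "2015-06-01 15:00:00", "2015-06-01 18:00:00",
                         "2015-06-01 09:00:00")),
  Status = c("Filling", "Filling", "Static", "Emptying", "")
)

status_pct <- toy %>%
  filter(Name != "test car park") %>%                  # drop the test records
  mutate(Weekday = wday(LastUpdate, label = TRUE)) %>% # weekday of each record
  group_by(Name, Weekday) %>%
  summarise(Filling  = 100 * mean(Status == "Filling"),
            Static   = 100 * mean(Status == "Static"),
            Emptying = 100 * mean(Status == "Emptying"),
            .groups = "drop")

status_pct
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, which is why `100 * mean(...)` yields a percentage.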
I’m aiming for a stacked-bar-type chart today, and it seems the most effective way to do this is to have the data in so-called ‘long’ format - so I’ll use another package I haven’t used much before, reshape2, to reformat the data:
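To illustrate what 'long' format means here, a small sketch with reshape2's `melt()` on a made-up 'wide' table. All column names and values are assumptions for illustration.

```r
library(reshape2)

# Made-up 'wide' table: one column per Status
wide <- data.frame(
  Name = c("Example CP", "Example CP"),
  Weekday = c("Mon", "Tue"),
  Filling = c(50, 40),
  Static = c(25, 30),
  Emptying = c(25, 30)
)

# 'Long' format: one row per car park, weekday and Status
long <- melt(wide, id.vars = c("Name", "Weekday"),
             variable.name = "Status", value.name = "Percent")

long
```

Each of the two wide rows becomes three long rows, one per Status, which is the shape a stacked bar chart needs: one column to map to the fill colour and one to the bar height.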
Right, let’s plot!
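A stacked bar chart of this kind could be drawn roughly as follows. This sketch assumes ggplot2 (the plotting package is not named in these posts) and uses made-up long-format data with hypothetical column names.

```r
library(ggplot2)

# Made-up long-format data: one row per weekday and Status
long <- data.frame(
  Name = "Example CP",
  Weekday = rep(c("Mon", "Tue"), each = 3),
  Status = rep(c("Filling", "Static", "Emptying"), times = 2),
  Percent = c(50, 25, 25, 40, 30, 30)
)

p <- ggplot(long, aes(x = Weekday, y = Percent, fill = Status)) +
  geom_col() +          # bars with the same x value are stacked by default
  facet_wrap(~ Name)    # one panel per car park

p
```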
Immediately we notice that our week has been extended with an eighth day - NAday.
This extra day seems to have been provided by Podium CP, which is by this point getting a reputation as a bit of a rebel… and as well as giving us NAday, it seems that Podium CP never fills or empties either.
Maybe every car that enters perfectly synchronizes with an exiting car. Maybe no-one ever enters or exits. We may never know.
Let’s remove the NAday records, just to tidy up the plot a little, and then re-plot.
This plot gives us some idea of the ‘turnaround’ of the different car parks. Looking, for example, at the two SouthGate car parks, we can see that SG General is rarely static, suggesting a near-constant flow of cars entering and exiting. In contrast, SG Rail is static more than half the time.
This corresponds to the role of the two car parks: SG General is in the town centre near the main shopping area, and so is used by a lot of people for relatively short periods of time. On the other hand SG Rail is primarily used by rail commuters, so during the day there isn’t much change in its occupancy - it fills and empties rapidly each morning and evening. This behaviour can actually be seen on the plot from Day 04, where SG Rail’s occupancy is much ‘squarer’ than the other car parks, particularly during the working week.
Day 08 (19/11/16): More dodgy data…
I just noticed a couple of things about my dataframe, df.
Firstly, because I’ve re-written it to CSV a couple of times and forgotten to turn off row.names, it’s picked up a couple of extra unnecessary columns.
Secondly, LastUpdate still isn’t in POSIXct form.
I’m going to go back to the original huge dataframe and sort out these issues for hopefully the last time.
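A sketch of this clean-up on a made-up frame. The stray column names `X` and `X.1` are an assumption about what the repeated writes left behind, and the `UTC` time zone and `Name` column are likewise assumed.

```r
library(dplyr)

# Made-up frame imitating stray row-name columns from repeated write.csv() calls
toy <- data.frame(
  X.1 = 1:2,
  X = 1:2,
  Name = c("Example CP", "Example CP"),
  LastUpdate = c("2015-06-01 09:00:00", "2015-06-01 09:05:00"),
  stringsAsFactors = FALSE
)

clean <- toy %>%
  select(-c(X, X.1)) %>%                                   # drop the stray columns
  mutate(LastUpdate = as.POSIXct(LastUpdate, tz = "UTC"))  # character -> POSIXct

str(clean)
```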
That’s more like it. Let’s write that to a file.
Now let’s load it in again, just to make sure:
Oh. It seems that we lose the POSIXct class when we write to and read from a CSV file: a CSV only stores plain text, so the column comes back as character. I suppose I will just have to convert it whenever I load the file.
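The round trip can be reproduced on a tiny made-up frame. The car park name, timestamp and `UTC` time zone below are assumptions for illustration.

```r
# Made-up single-row frame with a POSIXct column
toy <- data.frame(
  Name = "Example CP",
  LastUpdate = as.POSIXct("2015-06-01 09:00:00", tz = "UTC")
)

path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)   # row.names = FALSE avoids an extra column

reloaded <- read.csv(path, stringsAsFactors = FALSE)
class(reloaded$LastUpdate)                # no longer POSIXct after the round trip

# Convert again after loading
reloaded$LastUpdate <- as.POSIXct(reloaded$LastUpdate, tz = "UTC")
class(reloaded$LastUpdate)
```

If keeping the column classes matters more than having a human-readable file, `saveRDS()` and `readRDS()` store the R object itself, classes included.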
OK, on to today. The Christmas lights in Bath were switched on a couple of days ago in readiness for the Christmas market, which starts next week. Knowing that Bath gets particularly busy during this period, I thought I’d have a look at the mean occupancy of each car park per week, and see if there is some sort of increase in late November and early December - particularly in the park and ride (P+R) car parks, which the council advises visitors to use during the market period.
Let’s set up a dataframe and use dplyr and lubridate functions to calculate the mean occupancy per week, by car park.
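A minimal sketch of such a pipeline on made-up records. The `Name` and `Percentage` columns are assumptions, and lubridate's `week()` is only one possible definition of a week number (`isoweek()` is another).

```r
library(dplyr)
library(lubridate)

# Made-up records spanning two weeks
toy <- data.frame(
  Name = c("Example CP", "Example CP", "Example CP", "Example CP"),
  LastUpdate = ymd_hms(c("2015-01-05 09:00:00", "2015-01-06 09:00:00",
                         "2015-01-12 09:00:00", "2015-01-13 09:00:00")),
  Percentage = c(40, 60, 70, 90)
)

weekly <- toy %>%
  mutate(Week = week(LastUpdate)) %>%   # 7-day blocks counted from 1 January
  group_by(Name, Week) %>%
  summarise(MeanPercentage = mean(Percentage), .groups = "drop")

weekly
```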
Hang on. At some point, Newbridge CP was apparently, on average, well above 100% full over the course of a week.
Let’s see how this has happened.
Well, it’s immediately obvious where the problem is! That point at above 300% is most plausibly explained by a faulty sensor, such as a broken exit sensor.
Tomorrow I’ll sort out this problem and carry on with my initial aim.
Day 09 (20/11/16): Mean occupancy per week
Continuing from yesterday’s work, I’ll add a couple of steps to the pipeline to cut out any weeks with a mean occupancy greater than 100%, and then to average by week.
Now let’s have another look at the maximum values per car park:
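The filtering step and the check on the maxima could be sketched like this, on a made-up table of weekly means. The column names are hypothetical, and the filter follows the literal description of dropping weeks whose mean exceeds 100%.

```r
library(dplyr)

# Made-up weekly means, including one implausible week
weekly <- data.frame(
  Name = c("Example CP A", "Example CP A", "Example CP A",
           "Example CP B", "Example CP B"),
  Week = c(1, 2, 3, 1, 2),
  MeanPercentage = c(55, 130, 60, 20, 35)
)

# Drop weeks whose mean occupancy is above 100%
weekly_ok <- weekly %>%
  filter(MeanPercentage <= 100)

# Largest remaining weekly mean for each car park
max_by_cp <- weekly_ok %>%
  group_by(Name) %>%
  summarise(MaxPercentage = max(MeanPercentage), .groups = "drop")

max_by_cp
```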
That looks better!
Let’s plot the data then. I’ll add a label with the week number for the week with the maximum mean percentage occupancy.
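One way to label each car park's peak week, sketched with made-up weekly means. ggplot2 and all column names are assumptions; `slice_max()` needs dplyr 1.0.0 or later.

```r
library(dplyr)
library(ggplot2)

# Made-up weekly means for two hypothetical car parks
weekly <- data.frame(
  Name = rep(c("Example CP A", "Example CP B"), each = 3),
  Week = rep(c(47, 48, 49), times = 2),
  MeanPercentage = c(50, 65, 60, 30, 35, 45)
)

# One row per car park: the week with the highest mean occupancy
peaks <- weekly %>%
  group_by(Name) %>%
  slice_max(MeanPercentage, n = 1, with_ties = FALSE) %>%
  ungroup()

p <- ggplot(weekly, aes(x = Week, y = MeanPercentage)) +
  geom_line() +
  geom_text(data = peaks, aes(label = Week), vjust = -0.5) +  # week number at the peak
  facet_wrap(~ Name)

p
```

Passing the separate `peaks` frame to `geom_text()` means only the maximum point in each panel gets a label.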
Other than Avon Street CP, which seems to fluctuate wildly over the course of the year, and Charlotte Street CP, which interestingly seems to drop slightly around November, the maximum values are - as I had expected - generally towards the end of the year.
SouthGate General CP actually has a noticeable peak in Week 51 - potentially due to last-minute Christmas shopping/Boxing Day sales.
And two of the P+Rs have their max average occupancy in Week 49 - approximately the first week of December, right in the middle of the market period (my prediction was spot on!).
Also, for all car parks there is a noticeable tail-off in the last couple of weeks of the year, presumably due to most people preferring to be at home over the Christmas period rather than out and about in the town.
Day 10 (21/11/16): Strange occupancies
Today I received not one, but two pieces of coursework for two separate modules of my uni course - and as such I am going to be under serious time pressure for the next couple of weeks. So I will do my best to keep up with these posts, but bear with me if they become a little sloppier or less detailed.
I am going to start investigating some of the quirks of the dataset - to be honest, I probably should have done this earlier, before some of my other analyses, but I’ll see how serious any issues are and then I can always revisit and correct previous visualizations.
Let’s see some of the stranger records in the dataset - the records where a car park has zero occupancy (possible), >100% occupancy (possible if the car park is full and more cars are circulating waiting for spaces, but probably uncommon), and negative occupancy (not possible and definitely due to dodgy sensors!).
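A sketch of how the three kinds of strange record could be picked out, on made-up values. The `Name` and `Percentage` columns are assumptions.

```r
library(dplyr)

# Made-up records covering each case
toy <- data.frame(
  Name = c("Example CP", "Example CP", "Example CP", "Example CP"),
  Percentage = c(0, 45, 120, -10)
)

strange <- toy %>%
  mutate(Issue = case_when(
    Percentage < 0   ~ "negative",    # not physically possible
    Percentage == 0  ~ "zero",        # possible: an empty car park
    Percentage > 100 ~ "over 100%",   # possible, but probably uncommon
    TRUE             ~ "ok"
  )) %>%
  filter(Issue != "ok")

strange
```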
Of particular note are the times when Newbridge P+R contained 4 times as many cars as its maximum capacity, and when SouthGate General CP contained about -20 times its maximum capacity (i.e. about -14400 cars).