Perhaps the DateUploaded column can give us some help in working out why we have duplicate records.
Let’s do a similar test to the one from a couple of days ago, to see if the upload time is different for records with the same update time.
Ah-ha! An empty dataframe! No two records with the same update time share an upload time, so it looks like the duplicate records are caused by the same record being uploaded at multiple different times.
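For anyone following along, a check of this kind can be sketched roughly as below. This is a minimal sketch rather than the exact code behind the result above: the rows are invented toy data, and `LastUpdate` is an assumed name for the update-time column.

```python
import pandas as pd

# Toy data, invented for illustration; LastUpdate is an assumed column name.
df = pd.DataFrame({
    "Name": ["Avon Street CP", "Avon Street CP", "Podium CP"],
    "LastUpdate": pd.to_datetime(["2016-11-01 10:00:00"] * 3),
    "DateUploaded": pd.to_datetime([
        "2016-11-01 10:05:00",
        "2016-11-01 10:10:00",
        "2016-11-01 10:05:00",
    ]),
})

# Records that share a car park and update time with at least one other record.
dupes = df[df.duplicated(subset=["Name", "LastUpdate"], keep=False)]

# Of those, the ones that ALSO share an upload time with another record.
same_upload = dupes[
    dupes.duplicated(subset=["Name", "LastUpdate", "DateUploaded"], keep=False)
]

# An empty result means every duplicate was uploaded at a different time.
print(same_upload.empty)
```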
Let’s take the first upload only, and create a plot similar to yesterday’s.
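One way to keep only the first upload of each record is to sort by upload time and then drop the later duplicates. Again this is only a sketch on made-up rows, with `LastUpdate` as an assumed column name and `Delay` as a hypothetical name for the gap between update and upload.

```python
import pandas as pd

# Toy data, invented for illustration; LastUpdate is an assumed column name.
df = pd.DataFrame({
    "Name": ["Avon Street CP", "Avon Street CP", "Podium CP"],
    "LastUpdate": pd.to_datetime(["2016-11-01 10:00:00"] * 3),
    "DateUploaded": pd.to_datetime([
        "2016-11-01 10:10:00",
        "2016-11-01 10:05:00",
        "2016-11-01 10:05:00",
    ]),
})

# Sort so the earliest upload comes first, then keep one row per
# (car park, update time) pair.
first_uploads = (
    df.sort_values("DateUploaded")
      .drop_duplicates(subset=["Name", "LastUpdate"], keep="first")
)

# Hypothetical Delay column: time between the update and its first upload.
first_uploads = first_uploads.assign(
    Delay=first_uploads["DateUploaded"] - first_uploads["LastUpdate"]
)
print(first_uploads)
```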
Note that the counts are shown on a log scale. Wherever the histogram displays a bar below the axis, there are zero records in that bin (the value shown is log10(0) = -Inf); and by similar logic, wherever the histogram shows 0 there is exactly 1 record in that bin, since log10(1) = 0 (for example, at the extreme right of the Avon Street CP plot).
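The arithmetic behind that note is quick to confirm. The bin counts below are made up purely to show what a log10 transform does to counts of 0 and 1.

```python
import numpy as np

# Made-up bin counts: an empty bin, a single record, and two larger bins.
counts = np.array([0, 1, 10, 100])

# log10(0) triggers a divide-by-zero warning, so silence it for this demo.
with np.errstate(divide="ignore"):
    log_counts = np.log10(counts)

# The empty bin maps to -inf (drawn below the axis);
# the single-record bin maps to 0 (a bar of zero height).
print(log_counts)
```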
We can see that for most of the car parks, any extreme value is a single record (e.g. that 1 record from Avon Street - the largest delay by a huge margin). However, there are multiple dodgy records at Podium CP and the SouthGate CPs.
Day 22 (03/12/16): Minute and second of upload
Let’s have a look at when records are uploaded to the online database.
I thought it was time to try out another new package! (Well, new to me…)
We can see that records are uploaded promptly every 5 minutes or so, as claimed by the documentation of the database; and that records tend to be uploaded ‘on the minute’.
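If you just want the numbers without the plot, the minute and second of each upload can be pulled straight out of the timestamps. This is a sketch using pandas on invented timestamps, not the new package mentioned above.

```python
import pandas as pd

# Invented upload times, for illustration only.
uploads = pd.Series(pd.to_datetime([
    "2016-12-03 10:00:00",
    "2016-12-03 10:05:00",
    "2016-12-03 10:10:01",
    "2016-12-03 10:15:00",
]))

# The .dt accessor exposes the components of each timestamp.
minute = uploads.dt.minute
second = uploads.dt.second

# How often does each second-of-the-minute occur?
# A spike at 0 means uploads tend to land 'on the minute'.
second_counts = second.value_counts().sort_index()

# Gaps between consecutive uploads, to check the 'every 5 minutes or so' claim.
gaps = uploads.diff().dropna()

print(second_counts)
print(gaps)
```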
Day 23 (04/12/16): Upload batch sizes and proportions
We saw yesterday that records are uploaded in batches roughly every 5 minutes. But how many records are usually uploaded in one of these batches? And which car parks, if any, “skip” updates?
So we can see that most batches are of size 8, as expected (these presumably contain one record per car park for each of the 8 car parks), but there are also many smaller batches - some as small as 4 records.
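Counting batch sizes boils down to grouping records by upload time. Here is a rough sketch on toy data with only 3 car parks rather than the real 8; "Example CP" is a made-up name.

```python
import pandas as pd

# Three invented upload times; the middle batch is missing one car park.
t1, t2, t3 = pd.to_datetime([
    "2016-12-04 10:00", "2016-12-04 10:05", "2016-12-04 10:10",
])
df = pd.DataFrame({
    "DateUploaded": [t1, t1, t1, t2, t2, t3, t3, t3],
    "Name": [
        "Avon Street CP", "Podium CP", "Example CP",
        "Avon Street CP", "Example CP",
        "Avon Street CP", "Podium CP", "Example CP",
    ],
})

# Number of records uploaded at each distinct upload time.
batch_sizes = df.groupby("DateUploaded").size()

# How many batches there are of each size.
size_counts = batch_sizes.value_counts().sort_index()
print(size_counts)
```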
We can also see which car parks are contributing to these smaller batches, and therefore work out which ones aren’t (i.e. the ones skipping updates).
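One possible way to find the skippers is a batch-by-car-park table of counts, where a zero marks a car park missing from that batch. The data is a toy again (3 car parks, "Example CP" made up), so take it as a sketch of the idea only.

```python
import pandas as pd

# Three invented upload times; Podium CP is missing from the middle batch.
t1, t2, t3 = pd.to_datetime([
    "2016-12-04 10:00", "2016-12-04 10:05", "2016-12-04 10:10",
])
df = pd.DataFrame({
    "DateUploaded": [t1, t1, t1, t2, t2, t3, t3, t3],
    "Name": [
        "Avon Street CP", "Podium CP", "Example CP",
        "Avon Street CP", "Example CP",
        "Avon Street CP", "Podium CP", "Example CP",
    ],
})

# Rows are batches, columns are car parks, cells are record counts.
presence = pd.crosstab(df["DateUploaded"], df["Name"])

# Number of batches each car park is absent from.
skipped = (presence == 0).sum()
print(skipped)
```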
Day 24 (05/12/16): Name == “test car park”
With a coursework deadline looming, I am struggling both for time and ideas - bear with me, the next few days may be a little rough…
I’ve been filtering out the records from “test car park” for most of the last month. I think it’s high time we had a look at them.
Admittedly not particularly interesting or informative, but probably the cleanest and most error-free data we've seen so far.
Day 25: Calculation of Percentage
There is one more thing I can check about the data: how is the Percentage column calculated? Given that it contains integer values, there must be some rounding involved - perhaps we can see whether values are rounded up or down, or to the nearest integer.
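Although the picture is not entirely clear-cut, the bulk of the observations lie to the right of zero, i.e. the exact value is usually a little larger than the recorded Percentage. It seems, then, that the result of the (100 × Occupancy/Capacity) calculation is rounded down to give the Percentage column.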
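A more direct version of the same check is to compare the recorded Percentage against each candidate rounding rule and see which one matches most often. In the sketch below the toy Percentage values were deliberately built by rounding down, so it demonstrates the method only and says nothing new about the real data.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; Percentage here is floor(exact) by construction.
df = pd.DataFrame({
    "Occupancy": [123, 250, 399],
    "Capacity": [500, 600, 720],
    "Percentage": [24, 41, 55],
})

# Exact (unrounded) percentage.
exact = 100 * df["Occupancy"] / df["Capacity"]

# If values are rounded down, this difference always lies in [0, 1).
diff = exact - df["Percentage"]

# Fraction of records reproduced by each candidate rounding rule.
matches = {
    "floor": (np.floor(exact) == df["Percentage"]).mean(),
    "round": (np.round(exact) == df["Percentage"]).mean(),
    "ceil": (np.ceil(exact) == df["Percentage"]).mean(),
}
print(matches)
```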