30 Days, 30 Visualizations, 1 Dataset: Days 21-25

2 December 2016

Day 21 (02/12/16): Delay to first upload

Perhaps the DateUploaded column can give us some help in working out why we have duplicate records.

Let’s do a similar test to the one from a couple of days ago, to see if the upload time is different for records with the same update time.

df3 <- filter(df0, Name != "test car park") %>%
    group_by(Name, LastUpdate, DateUploaded) %>%
    filter(n() > 1)

df3
## Source: local data frame [0 x 12]
## Groups: Name, LastUpdate, DateUploaded [0]
## 
## # ... with 12 variables: ID <fctr>, LastUpdate <fctr>, Name <fctr>,
## #   Description <fctr>, Capacity <int>, Status <fctr>, Occupancy <int>,
## #   Percentage <int>, Easting <int>, Northing <int>, DateUploaded <fctr>,
## #   Location <fctr>

Ah-ha! An empty dataframe! No two records share the same Name, LastUpdate and DateUploaded, so it looks like the duplicate records are caused by the same record being uploaded at multiple different times.
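
To see how many times each record was uploaded, something like the following could work. This is only a rough sketch on made-up data (the `toy` data frame and the `n_uploads` column are mine, not part of the dataset), and it uses base R so that it runs without any extra packages:

```r
# Toy data (made up for illustration): car park "A" has one record that
# was uploaded twice, car park "B" has one record uploaded once.
toy <- data.frame(
    Name         = c("A", "A", "B"),
    LastUpdate   = c("01/12/2016 10:00:00 AM", "01/12/2016 10:00:00 AM",
                     "01/12/2016 10:00:00 AM"),
    DateUploaded = c("01/12/2016 10:00:05 AM", "01/12/2016 10:05:05 AM",
                     "01/12/2016 10:00:05 AM"),
    stringsAsFactors = FALSE
)

# Number of distinct upload times for each (Name, LastUpdate) pair
uploads_per_record <- aggregate(DateUploaded ~ Name + LastUpdate, data = toy,
                                FUN = function(x) length(unique(x)))
names(uploads_per_record)[3] <- "n_uploads"

# Rows with n_uploads > 1 are records that were uploaded more than once
uploads_per_record[uploads_per_record$n_uploads > 1, ]
```

On the real data, the same idea applied to df0 should list exactly the duplicated records.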

Let’s take the first upload only, and create a plot similar to yesterday’s.

df4 <- df0 %>%
    select(Name, LastUpdate, DateUploaded) %>%
    filter(Name != "test car park") %>%
    mutate(LastUpdate = as.POSIXct(LastUpdate, tz = "UTC",
                                   format = "%d/%m/%Y %I:%M:%S %p"),
           DateUploaded = as.POSIXct(DateUploaded, tz = "UTC",
                                     format = "%d/%m/%Y %I:%M:%S %p")) %>%
    group_by(Name, LastUpdate) %>%
    summarize(FirstUpload = min(DateUploaded)) %>%
    # Fix the units explicitly: difftime otherwise picks them automatically
    mutate(Delay = as.numeric(difftime(FirstUpload, LastUpdate,
                                       units = "secs")))

p <- ggplot(df4, aes(x = Delay)) + geom_histogram(colour = "black") +
    facet_wrap(~ Name, nrow = 2, scales = "free") +
    ggtitle("Delay between update and first upload") +
    xlab("Seconds") + ylab("Number of records") +
    theme(plot.title = element_text(size = rel(1.5))) +
    scale_y_log10()

p

Note that because we are using a log scale, a bin containing zero records is drawn as a bar extending below the axis (since log10(0) = -Inf), and a bin containing exactly 1 record is drawn with zero height (since log10(1) = 0). The extreme right of the Avon Street CP plot is an example of the latter.
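
The arithmetic behind this is easy to check directly. A tiny sketch, where the `counts` vector is just a set of hypothetical bin counts:

```r
# Hypothetical bin counts: an empty bin, a bin with one record, and so on
counts <- c(0, 1, 10, 100)

# Where each bar ends up on a log10 axis
bar_heights <- log10(counts)
bar_heights
```

The empty bin maps to -Inf and the single-record bin maps to 0, which is why the two are easy to confuse on the plot.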

We can see that for most of the car parks, there is a single record at any extreme values (e.g. that 1 record from Avon Street - the largest delay by a huge margin). However, there are multiple dodgy records at Podium CP and the SouthGate CPs.


Day 22 (03/12/16): Minute and second of upload

Let’s have a look at when records are uploaded to the online database.

df3 <- select(df0, Name, DateUploaded) %>%
    mutate(DateUploaded = as.POSIXct(DateUploaded, tz = "UTC",
                                     format = "%d/%m/%Y %I:%M:%S %p")) %>%
    mutate(Minute = minute(DateUploaded), Second = second(DateUploaded))

p1 <- ggplot(df3, aes(x = Minute)) + geom_histogram(binwidth = 1)
p2 <- ggplot(df3, aes(x = Second)) + geom_histogram(binwidth = 1)

I thought it was time to try out another new package! (Well, new to me…)

library(grid)

grid.newpage()
pushViewport(viewport(layout = grid.layout(2, 2, heights = unit(c(0.5, 5),
                                                                "null"))))

grid.text("Minute and second of upload",
          vp = viewport(layout.pos.row = 1, layout.pos.col = 1:2),
          gp = gpar(fontsize = 22, fontface = 2))

print(p1, vp = viewport(layout.pos.row = 2, layout.pos.col = 1))
print(p2, vp = viewport(layout.pos.row = 2, layout.pos.col = 2))

We can see that records are uploaded promptly every 5 minutes or so, as claimed by the documentation of the database; and that records tend to be uploaded ‘on the minute’.
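
To put a rough number on "every 5 minutes or so", one option is to look at the upload minute modulo 5. Here is a minimal sketch with made-up timestamps (the `uploaded` vector and the `offset` name are mine, not columns in the dataset), using base R rather than lubridate:

```r
# Made-up upload times: three on a 5-minute mark, one straggler
uploaded <- as.POSIXct(c("2016-12-01 10:00:03", "2016-12-01 10:05:02",
                         "2016-12-01 10:10:04", "2016-12-01 10:17:30"),
                       tz = "UTC")

# Minute of the hour, then its offset from the previous 5-minute mark
minute_of_hour <- as.POSIXlt(uploaded)$min
offset <- minute_of_hour %% 5

# Uploads that stick to the 5-minute schedule have an offset of 0
table(offset)
```

If the schedule really is every 5 minutes, almost all of the real records should land in the 0 column.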


Day 23 (04/12/16): Upload batch sizes and proportions

We saw yesterday that records are uploaded in batches roughly every 5 minutes. But how many records are usually uploaded in one of these batches? And which car parks, if any, “skip” updates?

df2 <- select(df0, Name, DateUploaded) %>%
    filter(Name != "test car park") %>%
    group_by(DateUploaded) %>%
    mutate(batch_size = n())

p <- ggplot(df2, aes(x = batch_size))

# Each row of df2 is a record, so these bars count records, not batches
p1 <- p + geom_bar() + xlab("Batch size") + ylab("Number of records")
p2 <- p + geom_bar(aes(fill = Name), position = "fill") +
    xlab("Batch size") + ylab("Proportion of records")

library(grid)
grid.newpage()
pushViewport(viewport(layout = grid.layout(2, 2,
                                           heights = unit(c(0.5, 5),"null"),
                                           widths = unit(c(1, 2), "null"))))
grid.text("Upload batch sizes and proportions",
          vp = viewport(layout.pos.row = 1, layout.pos.col = 1:2),
          gp = gpar(fontsize = 25, fontface = 2))
print(p1, vp = viewport(layout.pos.row = 2, layout.pos.col = 1))
print(p2, vp = viewport(layout.pos.row = 2, layout.pos.col = 2))

So we can see that most batches are of size 8, as expected (these presumably contain one record per car park for each of the 8 car parks), but there are also many smaller batches - some as small as 4 records.

We can also see which car parks are contributing to these smaller batches, and therefore work out which ones aren’t (i.e. the ones skipping updates).
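
For a more direct answer, we could count each batch once and list the car parks missing from each one. This is a rough sketch on made-up data (the `toy` data frame, its shortened upload times and the helper names are all mine), in base R:

```r
# Toy data: three car parks, three batches, "C" skips the 10:05 batch
toy <- data.frame(
    Name         = c("A", "B", "C", "A", "B", "A", "B", "C"),
    DateUploaded = c(rep("10:00", 3), rep("10:05", 2), rep("10:10", 3)),
    stringsAsFactors = FALSE
)

# One entry per batch: how many records share each upload time
batch_sizes <- table(toy$DateUploaded)

# Number of batches of each size, with each batch counted once
batches_per_size <- table(batch_sizes)
batches_per_size

# For each batch, which car parks are missing?
all_names <- sort(unique(toy$Name))
missing_by_batch <- lapply(split(toy$Name, toy$DateUploaded),
                           function(present) setdiff(all_names, present))

# Keep only the batches where at least one car park skipped the update
missing_by_batch[lengths(missing_by_batch) > 0]
```

Counting batches rather than records matters because a batch of size 8 contributes 8 rows to the data, so it would otherwise be over-represented.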


Day 24 (05/12/16): Name == “test car park”

With a coursework deadline looming, I am struggling both for time and ideas - bear with me, the next few days may be a little rough…

I’ve been filtering out the records from “test car park” for most of the last month. I think it’s high time we had a look at them.

df2 <- filter(df, Name == "test car park")

library(scales)

p <- ggplot(df2, aes(x = LastUpdate)) +
    geom_line(aes(y = Occupancy, colour = "Occupancy")) +
    geom_line(aes(y = Capacity, colour = "Capacity")) +
    ggtitle("Test car park records") + xlab("Time of update") + ylab("Total") +
    scale_x_datetime(labels = date_format("%d/%m/%y")) +
    scale_colour_manual(name = "", values = c("black", "red"))

p

Admittedly not particularly interesting or informative, but probably the cleanest and most error-free data we've seen so far.


Day 25 (06/12/16): Calculation of Percentage

There is one more thing I can check about the data: how is the Percentage column calculated? Given that it contains integer values, there must be some rounding involved - perhaps we can see whether values are rounded up or down, or to the nearest integer.

df4 <- select(df0, LastUpdate, Capacity, Occupancy, Percentage) %>%
    mutate(newPercentage = (Occupancy / Capacity),
           Difference = (newPercentage - (Percentage/100)))

    
p <- ggplot(df4, aes(x = Difference)) +
    geom_histogram(colour = "black", bins = 30) +
    ggtitle("Difference between (Occupancy/Capacity) and Percentage")

p

Although not entirely clear, the bulk of the observations lie to the right of zero, meaning that (Occupancy/Capacity) is usually slightly larger than the reported Percentage. It seems, then, that the result of the (Occupancy/Capacity) calculation is rounded down to give the Percentage column.
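
A more direct test would be to try each rounding rule and see how often it reproduces the Percentage column. This is only a sketch on made-up numbers (the `toy` data frame is mine, and its Percentage values are deliberately constructed by rounding down), in base R:

```r
# Toy data: Percentage here was built by rounding down, for illustration
toy <- data.frame(Capacity   = c(300, 300, 140),
                  Occupancy  = c(100, 200, 139),
                  Percentage = c(33, 66, 99))

# Exact percentage before any rounding
exact <- 100 * toy$Occupancy / toy$Capacity

# The three candidate rounding rules
candidates <- data.frame(floor   = floor(exact),
                         nearest = round(exact),
                         ceiling = ceiling(exact))

# Share of rows where each rule reproduces the reported Percentage
match_rate <- colMeans(candidates == toy$Percentage)
match_rate
```

On the real data, whichever rule has a match rate closest to 1 is the most likely candidate. Rows with a Capacity of 0 would need to be dropped first.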

