Data Utilities
mqor contains a number of ‘data utilities’ for working with mqor-ready data. This page summarises them in no particular order.
Many of these functions are fairly lightweight and exist mainly to support the calculation of model quality objectives. For more complete and flexible handling of atmospheric composition data, consider the openair package. Functions like openair::timeAverage() and openair::rollingMean() are more flexible and performant than their mqor equivalents.
The 90% principle
mqo_percentile() calculates the 90th percentile using the methodology outlined in the model quality objectives working document. The percentile of a variable is computed as set out in that document, using the length of the variable (its number of values). While mqo_percentile() defaults to a quantile of 0.9, you can supply any number between 0 and 1.
mqo_percentile(demo_longterm$obs, quantile = 0.9, na.rm = TRUE)
#> [1] 46.3
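To make the idea of a length-based percentile concrete, here is a minimal base R sketch. It assumes a rank-based scheme that interpolates linearly between the two sorted values around position `n * quantile`; that scheme is an assumption for illustration and is not taken from mqor's source. `percentile_sketch()` is a hypothetical helper, not an mqor function.

```r
# Hypothetical helper, not part of mqor. Assumes linear interpolation between
# the two sorted values either side of position n * quantile.
percentile_sketch <- function(x, quantile = 0.9, na.rm = TRUE) {
  if (!na.rm && anyNA(x)) return(NA_real_)
  x <- sort(x[!is.na(x)])             # ascending order, missing values dropped
  n <- length(x)                      # the length of the variable
  position <- n * quantile            # fractional rank of the percentile
  lower <- floor(position)            # rank of the value just below it
  dist <- position - lower            # fractional distance to the next rank
  if (lower < 1) return(x[1])         # below the first rank: smallest value
  if (lower >= n) return(x[n])        # at or beyond the last rank: largest value
  x[lower] + (x[lower + 1] - x[lower]) * dist
}

values <- c(10, 20, 30, 40, 50)
percentile_sketch(values)
#> [1] 45
```

For this input the position is 5 × 0.9 = 4.5, so the result lies halfway between the fourth and fifth sorted values. Under the stated assumption the sketch matches `stats::quantile(values, 0.9, type = 4)`; mqo_percentile() itself may differ in its details.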
Temporal filtering
There are many filtering functions, all named filter_*(), which conveniently filter an mqor-ready dataframe based on its “date” column. filter_month(), filter_wday() and filter_hour() all operate on short-term data, and filter_year() operates on short- and long-term data.
demo_shortterm |> dplyr::pull(date)
#> [1] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [6] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [11] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [16] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [21] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [26] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [31] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [36] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [41] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [46] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [51] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [56] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [61] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [66] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [71] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
demo_shortterm |> filter_wday(1) |> dplyr::pull(date)
#> [1] "2025-01-05" "2025-01-05" "2025-01-05" "2025-01-05" "2025-01-05"
#> [6] "2025-01-05" "2025-01-05" "2025-01-05" "2025-01-05" "2025-01-05"
#> [11] "2025-01-05" "2025-01-05" "2025-01-05" "2025-01-05" "2025-01-05"
These functions can be combined to filter your data flexibly. Weekdays are numbered from 1 (Sunday), as the output above shows: filter_wday(1) returns only 2025-01-05, a Sunday. For example, the code below keeps only morning hours on working days (Monday to Friday) in winter months.
your_short_term_data |>
filter_month(months = c(1, 2, 12)) |>
filter_wday(wdays = 2:6) |>
filter_hour(hours = 0:12)
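For readers who want to see what such a combined filter amounts to, the following base R sketch applies the same three conditions to a made-up hourly series. It does not use mqor; the `dat` data frame is hypothetical, and the weekday numbering (1 = Sunday) is inferred from the filter_wday(1) output shown earlier.

```r
# Hypothetical hourly data standing in for an mqor-ready "date" column:
# 14 days starting Wednesday 2025-01-01.
dat <- data.frame(
  date = seq(
    as.POSIXct("2025-01-01 00:00", tz = "UTC"),
    by = "hour",
    length.out = 14 * 24
  )
)

# POSIXlt stores months as 0-11 and weekdays as 0-6 (0 = Sunday),
# so add 1 to get months 1-12 and weekdays 1-7 (1 = Sunday).
parts <- as.POSIXlt(dat$date, tz = "UTC")
keep <-
  (parts$mon + 1) %in% c(1, 2, 12) &   # winter months
  (parts$wday + 1) %in% 2:6 &          # Monday to Friday
  parts$hour %in% 0:12                 # hours 00 to 12 inclusive

filtered <- dat[keep, , drop = FALSE]
nrow(filtered)
#> [1] 130
```

The fortnight contains 10 working days and 13 retained hours per day, giving 130 rows.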
Data Aggregations
There are many situations in which you may wish to aggregate short-term hourly data to short-term daily or long-term annual data. The summarise_*() functions achieve both. Besides aggregating the data, they take arguments for the statistic to use (mean, minimum, or maximum) and for a minimum data coverage threshold.
summarise_annual(demo_shortterm, statistic = "mean", min_coverage = 0.75)
#> # A tibble: 15 × 6
#> date site type param obs mod
#> <dbl> <chr> <chr> <chr> <dbl> <dbl>
#> 1 2025 S1 fixed PM10 34.4 42.6
#> 2 2025 S2 fixed PM10 44.4 48.6
#> 3 2025 S3 fixed PM10 53.6 59.6
#> 4 2025 S4 fixed PM10 44.8 44.8
#> 5 2025 S5 fixed PM10 16.2 13.2
#> 6 2025 S6 fixed PM10 40.2 36.2
#> 7 2025 S7 fixed PM10 37.8 43.4
#> 8 2025 S8 fixed PM10 40.6 38.8
#> 9 2025 S9 fixed PM10 27.4 28.8
#> 10 2025 S10 fixed PM10 21.6 19.8
#> 11 2025 S11 fixed PM10 15 24.6
#> 12 2025 S12 fixed PM10 47.8 37.4
#> 13 2025 S13 fixed PM10 42.2 50.8
#> 14 2025 S14 fixed PM10 36.4 45.6
#> 15 2025 S15 fixed PM10 31 35
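The interaction between the statistic and the coverage threshold can be sketched in base R as follows. This assumes coverage is the share of non-missing values within each group, which may not be how mqor defines it; `summarise_sketch()` and the example vectors are hypothetical.

```r
# Hypothetical helper, not part of mqor. Assumes coverage is the fraction of
# non-missing values within each group.
summarise_sketch <- function(x, group, statistic = mean, min_coverage = 0.75) {
  vapply(
    split(x, group),
    function(values) {
      coverage <- mean(!is.na(values))
      # Groups below the threshold are reported as missing.
      if (coverage < min_coverage) NA_real_ else statistic(values, na.rm = TRUE)
    },
    numeric(1)
  )
}

obs <- c(10, 20, NA, 30, NA, NA, NA, 40)
site <- rep(c("S1", "S2"), each = 4)
result <- summarise_sketch(obs, site)
result
#> S1 S2 
#> 20 NA
```

Site S1 has three of four values present (coverage 0.75), so its mean is returned; S2 has only one of four, so it is set to NA.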
To calculate a maximum daily 8-hour rolling mean (e.g., for ozone), you can combine summarise_daily() with mutate_rolling_mean(). The latter replaces your obs and mod columns with the rolling average values.
your_short_term_data |>
mutate_rolling_mean(window_size = 8L, min_coverage = 0.75) |>
summarise_daily(statistic = "max")
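The two steps in this pipeline can be illustrated with a small base R sketch. It assumes a right-aligned (trailing) window and a coverage measured against the full window size; both are assumptions for illustration, and `rolling_mean_sketch()` is a hypothetical helper rather than an mqor function.

```r
# Hypothetical helper, not part of mqor. Assumes a trailing window ending at
# each hour, with coverage measured against the full window size.
rolling_mean_sketch <- function(x, window_size = 8L, min_coverage = 0.75) {
  vapply(
    seq_along(x),
    function(i) {
      window <- x[max(1L, i - window_size + 1L):i]
      n_valid <- sum(!is.na(window))
      if (n_valid / window_size < min_coverage) {
        NA_real_
      } else {
        mean(window, na.rm = TRUE)
      }
    },
    numeric(1)
  )
}

# Two days of made-up hourly values.
date <- seq(as.POSIXct("2025-01-01 00:00", tz = "UTC"), by = "hour", length.out = 48)
obs <- as.numeric(1:48)

rolled <- rolling_mean_sketch(obs)

# Daily maximum of the rolling means, ignoring hours without enough data.
daily_max <- tapply(
  rolled,
  as.Date(date, tz = "UTC"),
  function(v) if (all(is.na(v))) NA_real_ else max(v, na.rm = TRUE)
)
daily_max
#> 2025-01-01 2025-01-02 
#>       20.5       44.5
```

The first five hours have fewer than six values in their window, so they fall below the 0.75 coverage threshold and are set to NA before the daily maximum is taken.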
Ensure Paired Modelled/Observed Data
Part of the methodology requires that each monitoring data point has an equivalent modelled data point, and vice versa. This is particularly relevant when aggregating to a daily or annual mean, or calculating rolling values. validate_mod_obs_pairs() acts on mqor-ready data and sets both “obs” and “mod” to NA wherever either one is missing.
dat <- head(demo_longterm, n = 5)
dat$obs[5] <- NA
dat$mod[4] <- NA
dat
#> # A tibble: 5 × 6
#> site type param date obs mod
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 S1 fixed PM10 2025 34.4 42.6
#> 2 S2 fixed PM10 2025 44.4 48.6
#> 3 S3 fixed PM10 2025 53.6 59.6
#> 4 S4 fixed PM10 2025 44.8 NA
#> 5 S5 fixed PM10 2025 NA 13.2
validate_mod_obs_pairs(dat)
#> # A tibble: 5 × 6
#> site type param date obs mod
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 S1 fixed PM10 2025 34.4 42.6
#> 2 S2 fixed PM10 2025 44.4 48.6
#> 3 S3 fixed PM10 2025 53.6 59.6
#> 4 S4 fixed PM10 2025 NA NA
#> 5 S5 fixed PM10 2025 NA NA
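The pairing rule shown in this output is simple enough to sketch in base R. The snippet below is an illustration of the behaviour only, not mqor's implementation, and `pairs` is a hypothetical data frame.

```r
# Hypothetical data with one missing modelled value and one missing observation.
pairs <- data.frame(
  site = c("S1", "S4", "S5"),
  obs = c(34.4, 44.8, NA),
  mod = c(42.6, NA, 13.2)
)

# A row is unpaired when either value is missing; blank out both columns there.
unpaired <- is.na(pairs$obs) | is.na(pairs$mod)
pairs$obs[unpaired] <- NA
pairs$mod[unpaired] <- NA
pairs
```

After this step the two columns are missing in exactly the same rows, so any statistic computed on one column uses the same set of data points as the other.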