Data Utilities • mqor

library(mqor)

There are numerous ‘data utilities’ contained within mqor that are useful for working with mqor-ready data. This page summarises these in no particular order.

Many of these functions are fairly lightweight and exist mainly to support the calculation of model quality objectives. For more complete and flexible tools for handling atmospheric composition data, consider the openair package. Functions like openair::timeAverage() and openair::rollingMean() are more flexible and performant than their mqor equivalents.

The 90% principle

mqo_percentile() calculates the 90th percentile using the methodology outlined in the model quality objectives working document. The 90th percentile value of a variable $X$, with its values ranked in ascending order, is calculated as outlined below, where $N_{s}$ is the length of $X$ and $X(i)$ is the $i$-th ranked value.

$N_{90th} = \lfloor N_{s} \times 0.9 \rfloor$

$D = N_{s} \times 0.9 - N_{90th}$

$X_{90th} = X(N_{90th}) + (X(N_{90th} + 1) - X(N_{90th})) \times D$

While mqo_percentile() defaults to a quantile of 0.9, you can set the quantile argument to any value between 0 and 1.

mqo_percentile(demo_longterm$obs, quantile = 0.9, na.rm = TRUE)
#> [1] 46.3
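
The formula above can be sketched in a few lines of base R. This is a minimal illustration rather than mqor's actual implementation: the helper name percentile_sketch() is hypothetical, the sorting step and the clamping of indices at the ends of the vector are assumptions, and the input values are the 15 observed annual means printed in the summarise_annual() output further down this page.

```r
# Hypothetical helper: NOT part of mqor, just a sketch of the formula above.
percentile_sketch <- function(x, quantile = 0.9, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  x <- sort(x)                   # assumption: X is ranked in ascending order
  n_s <- length(x)               # N_s
  n_q <- floor(n_s * quantile)   # N_90th
  d <- n_s * quantile - n_q      # D
  # Assumption: clamp indices so short vectors do not index out of range
  lower <- x[max(n_q, 1)]
  upper <- x[min(n_q + 1, n_s)]
  lower + (upper - lower) * d
}

# Observed annual means copied from the summarise_annual() output on this page
obs <- c(34.4, 44.4, 53.6, 44.8, 16.2, 40.2, 37.8, 40.6,
         27.4, 21.6, 15, 47.8, 42.2, 36.4, 31)

percentile_sketch(obs)
#> [1] 46.3
```

Here $N_{s} = 15$, so $N_{90th} = 13$ and $D = 0.5$, and the result lies halfway between the 13th and 14th ranked values (44.8 and 47.8). This is not the same rule as R's default stats::quantile(), which interpolates differently and gives 46.6 for the same input.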

Temporal filtering

There are several filtering functions, all named filter_*(), which conveniently filter an mqor-ready dataframe based on its “date” column. filter_month(), filter_wday() and filter_hour() all operate on short-term data, and filter_year() operates on both short- and long-term data.

demo_shortterm |> dplyr::pull(date)
#>  [1] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#>  [6] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [11] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [16] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [21] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [26] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [31] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [36] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [41] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [46] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [51] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [56] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [61] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [66] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [71] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"

demo_shortterm |> filter_wday(1) |> dplyr::pull(date)
#>  [1] "2025-01-05" "2025-01-05" "2025-01-05" "2025-01-05" "2025-01-05"
#>  [6] "2025-01-05" "2025-01-05" "2025-01-05" "2025-01-05" "2025-01-05"
#> [11] "2025-01-05" "2025-01-05" "2025-01-05" "2025-01-05" "2025-01-05"

These functions can be combined to filter your data flexibly. For example, the code below filters short-term data to return only morning hours, during the working week, in winter months.

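your_short_term_data |>
  filter_month(months = c(1, 2, 12)) |> # winter: January, February, December
  filter_wday(wdays = 2:6) |>           # Monday to Friday (1 = Sunday, as in the example above)
  filter_hour(hours = 0:12)             # hours 0 to 12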
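
To make the selection logic concrete, the same combination of conditions can be sketched in base R on a toy hourly series. This is only an illustration of the idea, not how mqor implements its filters: the toy data frame is hypothetical, and the weekday numbering (1 = Sunday) is inferred from the filter_wday(1) example above, which returned only 2025-01-05, a Sunday.

```r
# Hypothetical hourly timestamps for one full year, in UTC
toy <- data.frame(
  date = seq(
    as.POSIXct("2025-01-01 00:00", tz = "UTC"),
    as.POSIXct("2025-12-31 23:00", tz = "UTC"),
    by = "hour"
  )
)

# Break each timestamp into its calendar components
lt <- as.POSIXlt(toy$date, tz = "UTC")

keep <-
  (lt$mon + 1) %in% c(1, 2, 12) & # POSIXlt months run 0-11, so add 1
  (lt$wday + 1) %in% 2:6 &        # POSIXlt weekdays run 0-6 from Sunday, so add 1
  lt$hour %in% 0:12               # hours 0 to 12

filtered <- toy[keep, , drop = FALSE]
```

Because the three conditions are combined with a logical AND, the order in which the equivalent filter_*() calls are piped should not change which rows are kept.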

Data aggregations

There are many situations in which you may wish to aggregate short-term hourly data to short-term daily or long-term annual data. The summarise_*() functions, summarise_daily() and summarise_annual(), cover these two cases. As well as aggregating data, they take arguments for the statistic to use (mean, minimum, or maximum) and for a minimum data coverage threshold.

summarise_annual(demo_shortterm, statistic = "mean", min_coverage = 0.75)
#> # A tibble: 15 × 6
#>     date site  type  param   obs   mod
#>    <dbl> <chr> <chr> <chr> <dbl> <dbl>
#>  1  2025 S1    fixed PM10   34.4  42.6
#>  2  2025 S2    fixed PM10   44.4  48.6
#>  3  2025 S3    fixed PM10   53.6  59.6
#>  4  2025 S4    fixed PM10   44.8  44.8
#>  5  2025 S5    fixed PM10   16.2  13.2
#>  6  2025 S6    fixed PM10   40.2  36.2
#>  7  2025 S7    fixed PM10   37.8  43.4
#>  8  2025 S8    fixed PM10   40.6  38.8
#>  9  2025 S9    fixed PM10   27.4  28.8
#> 10  2025 S10   fixed PM10   21.6  19.8
#> 11  2025 S11   fixed PM10   15    24.6
#> 12  2025 S12   fixed PM10   47.8  37.4
#> 13  2025 S13   fixed PM10   42.2  50.8
#> 14  2025 S14   fixed PM10   36.4  45.6
#> 15  2025 S15   fixed PM10   31    35
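
The idea behind a minimum data coverage threshold can be sketched as follows. This is a simplified illustration, not mqor's implementation: stat_with_coverage() is a hypothetical helper, and it assumes that coverage is the fraction of non-missing values in the vector it is given, whereas a real implementation may measure coverage against the number of time steps expected in the period.

```r
# Hypothetical helper: NOT part of mqor.
# Returns NA when too few values are present, otherwise the chosen statistic.
stat_with_coverage <- function(x, statistic = mean, min_coverage = 0.75) {
  # Assumption: coverage = share of non-missing values in x
  if (length(x) == 0 || mean(!is.na(x)) < min_coverage) {
    return(NA_real_)
  }
  statistic(x, na.rm = TRUE)
}

stat_with_coverage(c(10, 20, NA, 30))                  # coverage 0.75: kept
#> [1] 20
stat_with_coverage(c(10, NA, NA, 30))                  # coverage 0.50: rejected
#> [1] NA
stat_with_coverage(c(10, 20, NA, 30), statistic = max)
#> [1] 30
```

Applying such a helper per site and per day (or per year) gives the general shape of a coverage-aware aggregation.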

To calculate a maximum daily 8-hour rolling mean (e.g., for ozone) you can combine summarise_daily() with mutate_rolling_mean(), the latter of which replaces your obs and mod columns with the rolling average values.

your_short_term_data |>
  mutate_rolling_mean(window_size = 8L, min_coverage = 0.75) |>
  summarise_daily(statistic = "max")
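
The two steps in this pipeline can be illustrated with a base R sketch on toy data. This is not mqor's implementation: rolling_mean_sketch() is a hypothetical helper, the window is assumed to be right-aligned (each value averages the current hour and the seven before it), and coverage is assumed to be the share of non-missing values relative to the full window size.

```r
# Hypothetical helper: NOT part of mqor.
rolling_mean_sketch <- function(x, window_size = 8L, min_coverage = 0.75) {
  vapply(seq_along(x), function(i) {
    # Assumption: right-aligned window ending at position i
    window <- x[max(1L, i - window_size + 1L):i]
    if (sum(!is.na(window)) / window_size < min_coverage) {
      return(NA_real_)
    }
    mean(window, na.rm = TRUE)
  }, numeric(1))
}

# Toy hourly series: two days, constant 10 on day 1 and constant 30 on day 2
x <- c(rep(10, 24), rep(30, 24))
day <- rep(c("2025-01-01", "2025-01-02"), each = 24)

roll <- rolling_mean_sketch(x)

# Step 2: daily maximum of the rolling means
tapply(roll, day, max, na.rm = TRUE)
#> 2025-01-01 2025-01-02 
#>         10         30
```

The first five rolling values are NA because fewer than 6 of the 8 hours in the window are available, which is what a coverage of 0.75 requires under these assumptions.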

Ensure paired modelled/observed data

Part of the methodology requires that each monitoring data point has an equivalent modelled data point, and vice versa. This is particularly relevant when aggregating to a daily or annual mean, or when calculating rolling values. validate_mod_obs_pairs() acts on mqor-ready data and sets both “obs” and “mod” to NA in any row where either one is missing.

dat <- head(demo_longterm, n = 5)

dat$obs[5] <- NA
dat$mod[4] <- NA

dat
#> # A tibble: 5 × 6
#>   site  type  param  date   obs   mod
#>   <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 S1    fixed PM10   2025  34.4  42.6
#> 2 S2    fixed PM10   2025  44.4  48.6
#> 3 S3    fixed PM10   2025  53.6  59.6
#> 4 S4    fixed PM10   2025  44.8  NA  
#> 5 S5    fixed PM10   2025  NA    13.2

validate_mod_obs_pairs(dat)
#> # A tibble: 5 × 6
#>   site  type  param  date   obs   mod
#>   <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 S1    fixed PM10   2025  34.4  42.6
#> 2 S2    fixed PM10   2025  44.4  48.6
#> 3 S3    fixed PM10   2025  53.6  59.6
#> 4 S4    fixed PM10   2025  NA    NA  
#> 5 S5    fixed PM10   2025  NA    NA
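
The pairing rule shown above is simple enough to sketch in base R. This is an illustration of the behaviour seen in the output, not mqor's source code: pair_sketch() is a hypothetical helper, and the toy data frame reuses the obs and mod values from the example.

```r
# Hypothetical helper: NOT part of mqor.
pair_sketch <- function(data) {
  # A row is unpaired when either value is missing
  unpaired <- is.na(data$obs) | is.na(data$mod)
  data$obs[unpaired] <- NA
  data$mod[unpaired] <- NA
  data
}

toy <- data.frame(
  site = c("S1", "S2", "S3", "S4", "S5"),
  obs = c(34.4, 44.4, 53.6, 44.8, NA),
  mod = c(42.6, 48.6, 59.6, NA, 13.2)
)

paired <- pair_sketch(toy)

# After pairing, both means are computed over the same three sites (S1 to S3)
mean(paired$obs, na.rm = TRUE)
mean(paired$mod, na.rm = TRUE)
```

Without the pairing step, the observed mean would include S4 and the modelled mean would include S5, so the two statistics would describe different sets of sites.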