
mqor contains numerous ‘data utilities’ that are useful for working with mqor-ready data. This page summarises them in no particular order.

Many of these functions are fairly lightweight, existing chiefly to support the calculation of model quality objectives. For more complete and flexible handling of atmospheric composition data, you may wish to consider the openair package; functions like openair::timeAverage() and openair::rollingMean() are more flexible and performant than their mqor equivalents.

The 90% principle

mqo_percentile() calculates the 90th percentile using the methodology outlined in the model quality objectives working document. The 90th percentile of a variable $X$ is calculated as outlined below, where $N_s$ is the length of $X$.

$$N_{90th} = \operatorname{floor}(N_s \times 0.9)$$

$$D = N_s \times 0.9 - N_{90th}$$

$$X_{90th} = X(N_{90th}) + \left( X(N_{90th} + 1) - X(N_{90th}) \right) \times D$$

While mqo_percentile() defaults to a quantile of 0.9, any quantile between 0 and 1 can be supplied.

mqo_percentile(demo_longterm$obs, quantile = 0.9, na.rm = TRUE)
#> [1] 46.3
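For reference, the working-document formula can be sketched in base R. This is an illustration of the method rather than mqor's implementation; the helper name is hypothetical, and it assumes $X$ is sorted ascending before indexing:

```r
# Illustrative base-R sketch of the working-document percentile method.
# Assumption: X is sorted ascending before indexing.
wd_percentile <- function(x, quantile = 0.9, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  x <- sort(x)                   # X(1) <= X(2) <= ... <= X(Ns)
  n_s <- length(x)
  n_q <- floor(n_s * quantile)   # N_90th when quantile = 0.9
  d <- n_s * quantile - n_q      # fractional remainder D
  x[n_q] + (x[n_q + 1] - x[n_q]) * d
}

wd_percentile(1:10, quantile = 0.9)  # 9: D = 0, so X(9) is returned exactly
wd_percentile(1:5, quantile = 0.9)   # 4.5: interpolates between X(4) and X(5)
```

Note that this differs from the default behaviour of stats::quantile(), which uses a different interpolation scheme (type 7).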

Temporal filtering

There are many filtering functions, all named filter_*(), which conveniently filter an mqor-ready dataframe based on its “date” column. filter_month(), filter_wday() and filter_hour() all operate on short-term data, and filter_year() operates on short- and long-term data.

demo_shortterm |> dplyr::pull(date)
#>  [1] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#>  [6] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [11] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [16] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [21] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [26] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [31] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [36] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [41] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [46] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [51] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [56] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [61] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [66] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"
#> [71] "2025-01-01" "2025-01-02" "2025-01-03" "2025-01-04" "2025-01-05"

demo_shortterm |> filter_wday(1) |> dplyr::pull(date)
#>  [1] "2025-01-05" "2025-01-05" "2025-01-05" "2025-01-05" "2025-01-05"
#>  [6] "2025-01-05" "2025-01-05" "2025-01-05" "2025-01-05" "2025-01-05"
#> [11] "2025-01-05" "2025-01-05" "2025-01-05" "2025-01-05" "2025-01-05"

Combinations of these functions can be used to flexibly filter your data. For example, the code below filters short-term data to morning hours on weekdays during the winter months.

your_short_term_data |>
  filter_month(months = c(1, 2, 12)) |>
  filter_wday(wdays = 2:6) |>
  filter_hour(hours = 0:12)
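The essence of this style of filtering can be sketched in base R. The helper name below is hypothetical, and it assumes the “date” column is a POSIXct date-time:

```r
# Illustrative sketch of filter_hour()-style filtering (hypothetical helper).
# Assumption: the "date" column is a POSIXct date-time.
filter_hour_sketch <- function(data, hours) {
  data[as.POSIXlt(data$date)$hour %in% hours, , drop = FALSE]
}

dat <- data.frame(
  date = as.POSIXct(c("2025-01-06 03:00:00", "2025-01-06 15:00:00"), tz = "UTC")
)
filter_hour_sketch(dat, hours = 0:12)  # keeps only the 03:00 row
```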

Data Aggregations

There are many situations in which you may wish to aggregate short-term hourly data to short-term daily or long-term annual data; the summarise_*() functions achieve both. As well as aggregating data, they take a statistic argument controlling the summary statistic used (mean, minimum, or maximum).

summarise_annual(demo_shortterm, statistic = "mean")
#> # A tibble: 15 × 6
#>     date site  type  param   obs   mod
#>    <dbl> <chr> <chr> <chr> <dbl> <dbl>
#>  1  2025 S1    fixed PM10   34.4  42.6
#>  2  2025 S2    fixed PM10   44.4  48.6
#>  3  2025 S3    fixed PM10   53.6  59.6
#>  4  2025 S4    fixed PM10   44.8  44.8
#>  5  2025 S5    fixed PM10   16.2  13.2
#>  6  2025 S6    fixed PM10   40.2  36.2
#>  7  2025 S7    fixed PM10   37.8  43.4
#>  8  2025 S8    fixed PM10   40.6  38.8
#>  9  2025 S9    fixed PM10   27.4  28.8
#> 10  2025 S10   fixed PM10   21.6  19.8
#> 11  2025 S11   fixed PM10   15    24.6
#> 12  2025 S12   fixed PM10   47.8  37.4
#> 13  2025 S13   fixed PM10   42.2  50.8
#> 14  2025 S14   fixed PM10   36.4  45.6
#> 15  2025 S15   fixed PM10   31    35
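The shape of this operation can be sketched with base R's aggregate(). This is an illustrative stand-in, not mqor's implementation; the helper name is hypothetical, and it assumes grouping by site, type, and param plus the year of the “date” column:

```r
# Illustrative stand-in for summarise_annual() (hypothetical helper).
# Assumption: group by site, type, and param plus the year of "date",
# then apply the chosen statistic to obs and mod.
summarise_annual_sketch <- function(data, statistic = mean) {
  data$date <- as.integer(format(as.Date(data$date), "%Y"))
  aggregate(cbind(obs, mod) ~ date + site + type + param,
            data = data, FUN = statistic)
}
```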

To calculate a maximum daily 8-hour rolling mean (e.g., for ozone) you can combine summarise_daily() with mutate_rolling_mean(), the latter of which replaces your obs and mod columns with rolling average values. Its min_coverage argument ensures that each window has sufficient data.

your_short_term_data |>
  mutate_rolling_mean(window_size = 8L, min_coverage = 0.75) |>
  summarise_daily(statistic = "max")
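The idea behind a coverage-aware rolling mean can be sketched as below. This assumes a trailing (right-aligned) window; whether mutate_rolling_mean() aligns windows this way is not stated here, so treat it as an illustration only:

```r
# Illustrative trailing rolling mean with a minimum-coverage rule (hypothetical).
# Windows with fewer than min_coverage * window_size non-missing values give NA.
rolling_mean_sketch <- function(x, window_size = 8L, min_coverage = 0.75) {
  out <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    window <- x[max(1L, i - window_size + 1L):i]
    if (sum(!is.na(window)) >= min_coverage * window_size) {
      out[i] <- mean(window, na.rm = TRUE)
    }
  }
  out
}

rolling_mean_sketch(1:10, window_size = 4L)  # first two values NA: too few points
```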

Ensure Paired Modelled/Observed Data

Part of the methodology requires that each monitored data point has an equivalent modelled data point, and vice versa. This is particularly relevant when aggregating to a daily or annual mean, or when calculating rolling values. validate_mod_obs_pairs() acts on mqor-ready data, setting “obs” to NA where “mod” is missing (and vice versa) so that the two columns are always missing in tandem.

dat <- head(demo_longterm, n = 5)

dat$obs[5] <- NA
dat$mod[4] <- NA

dat
#> # A tibble: 5 × 6
#>   site  type  param  date   obs   mod
#>   <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 S1    fixed PM10   2025  34.4  42.6
#> 2 S2    fixed PM10   2025  44.4  48.6
#> 3 S3    fixed PM10   2025  53.6  59.6
#> 4 S4    fixed PM10   2025  44.8  NA  
#> 5 S5    fixed PM10   2025  NA    13.2

validate_mod_obs_pairs(dat)
#> # A tibble: 5 × 6
#>   site  type  param  date   obs   mod
#>   <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 S1    fixed PM10   2025  34.4  42.6
#> 2 S2    fixed PM10   2025  44.4  48.6
#> 3 S3    fixed PM10   2025  53.6  59.6
#> 4 S4    fixed PM10   2025  NA    NA  
#> 5 S5    fixed PM10   2025  NA    NA
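The pairing rule amounts to blanking both values whenever either is missing, which can be sketched as follows (pair_na() is a hypothetical helper, not part of mqor):

```r
# Sketch of the pairing rule: blank both obs and mod when either is missing.
# (pair_na() is a hypothetical helper, not part of mqor.)
pair_na <- function(data) {
  unpaired <- is.na(data$obs) | is.na(data$mod)
  data$obs[unpaired] <- NA
  data$mod[unpaired] <- NA
  data
}
```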

Ensure Data Capture Thresholds are met

The validate_coverage() function takes short-term data, a resolution (either hourly or daily), and a minimum data coverage percentage. Depending on the mode argument, it will either remove site-pollutant-year combinations with insufficient data capture or warn if any exist. In either mode, a data_coverage column is appended to the data.

This function warns with the demo dataset, as it contains only 5 observations, well under the default 75% data capture for daily data (5 observations out of 365 days is ~1.4%).

validate_coverage(demo_shortterm, resolution = "daily", mode = "warn")
#> Warning: The following years, sites, and pollutants have insufficient data capture:
#> S1-PM10-2025, S10-PM10-2025, S11-PM10-2025, S12-PM10-2025, S13-PM10-2025,
#> S14-PM10-2025, S15-PM10-2025, S2-PM10-2025, S3-PM10-2025, S4-PM10-2025,
#> S5-PM10-2025, S6-PM10-2025, S7-PM10-2025, S8-PM10-2025, and S9-PM10-2025.
#> # A tibble: 5,475 × 7
#>    site  type  param date                  obs   mod data_coverage
#>    <chr> <chr> <chr> <dttm>              <dbl> <dbl>         <dbl>
#>  1 S1    fixed PM10  2025-01-01 00:00:00    24    48        0.0137
#>  2 S1    fixed PM10  2025-01-02 00:00:00    35    47        0.0137
#>  3 S1    fixed PM10  2025-01-03 00:00:00    44    39        0.0137
#>  4 S1    fixed PM10  2025-01-04 00:00:00    38    37        0.0137
#>  5 S1    fixed PM10  2025-01-05 00:00:00    31    42        0.0137
#>  6 S1    fixed PM10  2025-01-06 00:00:00    NA    NA        0.0137
#>  7 S1    fixed PM10  2025-01-07 00:00:00    NA    NA        0.0137
#>  8 S1    fixed PM10  2025-01-08 00:00:00    NA    NA        0.0137
#>  9 S1    fixed PM10  2025-01-09 00:00:00    NA    NA        0.0137
#> 10 S1    fixed PM10  2025-01-10 00:00:00    NA    NA        0.0137
#> # ℹ 5,465 more rows
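The data_coverage values above are consistent with the number of valid days divided by the number of days in the year. A sketch of that arithmetic follows; the exact definition used by validate_coverage() is an assumption here, not taken from mqor's source:

```r
# Assumed definition of data_coverage for daily data: valid days / days in year.
daily_coverage <- function(n_valid, year = 2025) {
  days <- if ((year %% 4 == 0 && year %% 100 != 0) || year %% 400 == 0) 366 else 365
  n_valid / days
}

daily_coverage(5)  # ~0.0137, consistent with the data_coverage column above
```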