# Modelling approaches for monitoring data

Most statistical analyses are concerned with inference (the methods we use to infer something about some characteristic of a population of values) based on the limited information contained in a sample drawn from that population.

Here we briefly explain different modelling philosophies used when analysing data in water quality monitoring programs. The distinction between model-based and probability-based analyses is outlined and followed by a discussion on Bayesian and frequentist methods. Then we explain Bayesian inference concepts and how they link to hierarchical models and expert opinion.

## Model-based versus probability-based analysis

When it comes to spatial monitoring design, 2 popular statistical philosophies for choosing a sampling strategy are model-based design and probability-based design.

Other statistically valid methods for choosing sample sites that exist in broader environmental or ecological contexts may be relevant for sampling aquatic resources, including geometric approaches (e.g. Muller 2000) and hybrid approaches (e.g. Cressie et al. 2009, Brus & De Gruijter 2012).

With regard to analysis of water quality data, choosing an appropriate approach will primarily be determined by the objectives of the study. The adopted sampling strategy can also help guide analysis options.

If a model-based design is used to select sample sites, then the ensuing data can only be analysed using a model-based approach.

If probability-based designs are used for site selection, then the data can be analysed using readily available survey sample methods or a statistical model.

### Probability-based analysis

Probability-based designs allow inference about an attribute of the population (such as a mean, total, variance, proportion or distribution function) to extend from the observed sample to the population.

For example, repeated random subsampling from a grab sample yields data that could be used to determine average levels of a water quality indicator of interest for that sample. Sampling at sites on a stream network will yield data that can be used to determine the length of the network that is of a particular condition.

The central feature is that a carefully designed probability-based sample is naturally representative of the population. Inferences then often follow directly from simple statistics.
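The stream network example can be sketched as a simple design-based calculation. This is a minimal sketch that assumes sites were selected by simple random sampling with equal inclusion probability. The site data, the 250 km network length and the function name `estimate_condition_length` are hypothetical and serve only to illustrate the arithmetic.

```python
import math


def estimate_condition_length(in_condition, network_length_km):
    """Design-based estimate of the network length in a given condition.

    Assumes simple random sampling of sites (equal inclusion probability),
    so the sample proportion estimates the proportion of the network
    that is in the condition of interest.
    """
    n = len(in_condition)
    p_hat = sum(in_condition) / n
    # Standard error of a proportion under simple random sampling
    se_p = math.sqrt(p_hat * (1 - p_hat) / (n - 1))
    return {
        "proportion": p_hat,
        "length_km": p_hat * network_length_km,
        "se_length_km": se_p * network_length_km,
    }


# Hypothetical data: 1 = site meets the condition, 0 = it does not
sites = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
result = estimate_condition_length(sites, network_length_km=250.0)
print(result)
```

No model of the stream system is needed here: the estimate and its standard error follow from the sampling design alone. Designs with unequal inclusion probabilities would need weighted estimators instead.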

### Model-based analysis

Model-based analyses rely heavily on the form of the model, its ability to capture the potential complexity in the system of interest and its generality for making valid, reliable and precise predictions and inferences.

Model-based inference can be very general and precise from a limited number of sample observations.

Reliability of the inference, on the downside, depends on:

Discussion about the contrasts between the 2 approaches is well documented in the literature. Refer to Sarndal (1978) and Hanson et al. (1983) for general discussion. De Gruijter & Ter Braak (1990) and Brus & De Gruijter (1993, 1997) focused on the differences in spatial inference in the context of soil sampling, and Theobald et al. (2007) outlined the differences in relation to environmental monitoring of natural resources generally.

Here we focus on model-based analyses because they are appropriate for either probability-based or model-based study designs.

## Bayesian versus frequentist approaches

Bayesian methods have taken a prominent role in modelling complex environmental processes in recent times (Kang & Cressie 2013, Parslow et al. 2013, Berrocal et al. 2014, Clifford et al. 2014, Pagendam et al. 2014, Zammit-Mangion et al. 2014). Their flexible nature, inherent in the hierarchical set-up, provides an attractive framework for modelling that can accommodate information sources at different spatial and temporal scales. Incorporating expert opinion when other quantitative information is lacking can support prediction and assessment that might otherwise have been abandoned.

Compared with frequentist approaches, Bayesian methods offer flexibility for modelling, estimation and assessment but at the expense of the additional computational capability required for implementation.

Frequentist approaches are philosophically distinct from Bayesian methods in many — sometimes subtle — ways. This is why one approach may be chosen over another when defining a framework for modelling.

As Casella (2008) pointed out, frequentists hold an orthodox view of statistical inference whereby ‘sampling is regarded as infinite and decisions are sharp’. This implies that the data arise from a repeatable sample in which the underlying parameters are constant and remain fixed.

Bayesians treat unknown quantities as random variables with assigned probability distributions that can be updated when new information becomes available. Under this statistical philosophy, the data are considered fixed (Gelman et al. 2004). The difference between confidence intervals and credible intervals can seem subtle, but the 2 have quite different interpretations.

A frequentist 95% confidence interval is interpreted in terms of repeated experiments. If the experiment were repeated a very large number of times and a 95% confidence interval for the quantity of interest (e.g. the mean) were calculated each time, then 95% of those intervals would contain the true value.

A 95% Bayesian credible interval can be interpreted directly: given the data, there is a 95% probability that the mean lies within the interval.
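The repeated-sampling interpretation of a confidence interval can be checked by simulation. The sketch below assumes normally distributed data, and its population mean, standard deviation, sample size and number of repetitions are hypothetical values chosen for illustration. It repeats an experiment many times, builds a 95% t-interval for the mean each time and records how often the interval contains the true mean.

```python
import random
import statistics

random.seed(1)

TRUE_MEAN = 10.0  # hypothetical population mean (fixed, unknown in practice)
TRUE_SD = 2.0     # hypothetical population standard deviation
N = 10            # observations per simulated experiment
T_CRIT = 2.262    # 97.5th percentile of Student's t with N - 1 = 9 degrees of freedom
REPS = 5000       # number of repeated experiments

covered = 0
for _ in range(REPS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / N ** 0.5
    lower, upper = mean - T_CRIT * se, mean + T_CRIT * se
    # The parameter is fixed; it is the interval that varies between experiments
    if lower <= TRUE_MEAN <= upper:
        covered += 1

coverage = covered / REPS
print(f"Proportion of intervals containing the true mean: {coverage:.3f}")
```

The proportion should be close to 0.95. The probability statement is about the procedure that generates the intervals, not about any single interval, which is the key contrast with a credible interval.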

To contrast the 2 approaches, we can analyse a very simple dataset that has been presented many times in the literature as the ‘line example’.

The line example consists of 5 pairs of points, {x,y} = {(1,1), (2,3), (3,3), (4,3), (5,5)}, that roughly lie along a straight line, as illustrated in Figure 1.

Figure 1 Points from the line example with a least squares estimate showing the line of best fit through the points

The model formulation in this simple case assumes:

Y ~ N(α + β(x − x̄), 1/τ)

where x̄ is the mean of the x values and τ is the precision parameter (the reciprocal of the residual variance). Centring x at its mean and treating τ as the precision is the reading that matches the estimates reported in Table 5.

This model is easily fitted in a frequentist framework using the method of least squares. We also performed a Bayesian analysis and contrast the results in Table 5. The parameter estimates from the 2 analyses are very similar for this example, but not identical. The standard deviation and confidence interval for τ are not given for the least squares analysis because they are not routinely provided as part of the model output.

Table 5 Summaries from a Bayesian analysis and least squares analysis performed for the line example

| Analysis | Summary | α | β | τ |
|---|---|---|---|---|
| Bayesian | Estimate | 3.001 | 0.800 | 1.894 |
| Bayesian | Standard deviation | 0.55 | 0.38 | 1.53 |
| Bayesian | 95% credible interval | [2.00, 4.07] | [0.09, 1.53] | [0.14, 5.89] |
| Least squares | Estimate | 3.000 | 0.800 | 1.875 |
| Least squares | Standard deviation | 0.33 | 0.23 | |
| Least squares | 95% confidence interval | [1.95, 4.05] | [0.07, 1.53] | |
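The least squares summaries in Table 5 can be reproduced by hand from the 5 points. The sketch below rests on 2 assumptions inferred from the reported numbers rather than taken from a stated method: x is centred at its mean (which gives α = 3.000) and τ is the reciprocal of the residual variance (which gives 1.875). The Bayesian summaries are not reproduced because the priors used are not stated here.

```python
import math

# The 5 points of the line example
x = [1, 2, 3, 4, 5]
y = [1, 3, 3, 3, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares with x centred at its mean: y = alpha + beta * (x - x_bar)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
beta = sxy / sxx
alpha = y_bar  # with a centred covariate the intercept equals the mean of y

# Residual variance on n - 2 degrees of freedom, and its reciprocal
residuals = [yi - (alpha + beta * (xi - x_bar)) for xi, yi in zip(x, y)]
resid_var = sum(r ** 2 for r in residuals) / (n - 2)
tau = 1 / resid_var  # assumption: tau read as 1 / residual variance

# Standard errors of the two regression coefficients
se_alpha = math.sqrt(resid_var / n)
se_beta = math.sqrt(resid_var / sxx)

T_CRIT = 3.182  # 97.5th percentile of Student's t with n - 2 = 3 degrees of freedom
ci_alpha = (alpha - T_CRIT * se_alpha, alpha + T_CRIT * se_alpha)
ci_beta = (beta - T_CRIT * se_beta, beta + T_CRIT * se_beta)

print(f"alpha = {alpha:.3f}, se = {se_alpha:.2f}, 95% CI = [{ci_alpha[0]:.2f}, {ci_alpha[1]:.2f}]")
print(f"beta  = {beta:.3f}, se = {se_beta:.2f}, 95% CI = [{ci_beta[0]:.2f}, {ci_beta[1]:.2f}]")
print(f"tau   = {tau:.3f}")
```

Under these assumptions the estimates (3.000, 0.800, 1.875) and standard errors (0.33, 0.23) match the least squares rows of Table 5, and the intervals agree with the tabulated ones to within a few hundredths.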

Within the context of the Water Quality Management Framework, while Bayesian and frequentist approaches to the analysis and the incorporation of updates or existing knowledge may differ, the overarching focus needs to be on the quality of the inferences and the level of support for the underlying assumptions.

Ask yourself, ‘What inferences do the analyses enable us to make that are relevant to the water quality objectives?’

## Bayesian inference, hierarchical modelling and expert opinion

We outline the general concepts of Bayesian inference because it has taken a prominent role in the analysis of environmental and ecological data (Kuhnert et al. 2005, Martin et al. 2005, Griffiths et al. 2007, Fox 2010, Pagendam et al. 2014). You can read more comprehensive references on the topic (Robert 2001, Gelman et al. 2004).

Let θ represent a vector of unobservable quantities, y the observed data and x a vector of explanatory variables. The joint distribution of θ and y can be written as:

p(θ,y) = p(θ)p(y|θ)

where p(θ,y) represents the joint distribution of θ and y, p(θ) represents the prior probability distribution, which expresses our initial belief about θ, and p(y|θ) represents the sampling distribution (also known as the ‘data distribution’ or likelihood) that characterises the data. If the data, y, are taken as known or fixed (conditioned on), the posterior distribution p(θ|y) is obtained through Bayes’ theorem and is expressed in Equation 1.