This stage stipulates the type of change to look for in the indicators. It’s important to know what management actions will be taken if an effect is detected at Step 5 of the framework (refer to Chapter 10 of Downes et al. 2002). The management goals will provide much of this context.
For example, if a slightly to moderately disturbed ecosystem is being managed to conserve the population of a recreationally important fish, and the threat is a persistent contaminant that could reduce fish fecundity, then the decision criteria for the indicators need to be set at values that are smaller than those that begin to affect the reproduction of the fish. This will allow sufficient lead time for managers to act before the population of fish is affected.
This process requires a significant level of scientific judgement. The actual values used as decision criteria will depend on a number of modifying factors, such as chemical speciation of toxicants, or knowledge of the sensitivity of the fish species from laboratory testing or field studies.
Defining an acceptable level of change
When setting levels of protection for high conservation or ecological value ecosystems or slightly to moderately disturbed ecosystems, we recommend a criterion of ‘no change beyond natural variability’ for ecosystem receptors.
Operationally, you will still need to:
- stipulate how much change can be expected under ‘natural’ conditions (natural variation constitutes an acceptable level of change in the ecosystem)
- decide on an effect size
- ensure that sampling is intensive enough (to detect effects larger than the acceptable natural changes in the chosen indicators, and avoid Type II errors).
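The first of these points can be made concrete with a short sketch. Assuming baseline samples of the indicator are available, one simple (and assumption-laden) way to characterise 'no change beyond natural variability' is a band of k standard deviations around the baseline mean. The data and the choice of k below are invented for illustration; the band itself would be a matter for scientific judgement and stakeholder negotiation.

```python
import statistics

def natural_variation_band(baseline, k=2.0):
    """Characterise 'no change beyond natural variability' as a band of
    k standard deviations around the baseline mean of an indicator.

    This assumes roughly symmetric variation; quantiles of the baseline
    distribution (e.g. the 2.5th and 97.5th percentiles) are a common
    alternative for skewed indicators.
    """
    m = statistics.mean(baseline)
    s = statistics.stdev(baseline)
    return (m - k * s, m + k * s)

# Hypothetical baseline measurements of an indicator (e.g. mg/L)
baseline = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3]
lo, hi = natural_variation_band(baseline)  # band is (4.6, 5.4)
```

Observed indicator values falling outside such a band would then be candidates for 'change beyond natural variability', subject to the sampling and error-rate considerations discussed below.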
The determination of the acceptable level of change may have both scientific and social or cultural elements.
If you are new to environmental assessment, defining an acceptable level of change may seem a weak compromise, especially when some stakeholders insist there must be no change in the indicator.
You cannot set a criterion of ‘no change’ because change is an essential component of all functioning ecosystems. Proof of no change would require you to prove a null hypothesis, which is operationally impossible.
It is possible to state some level of change in an indicator below which it is not important to reject the null hypothesis of no change. This requires stakeholders to be explicit about what level of change in the indicator is regarded as harmless or acceptable. In formal terms, this process involves specifying an effect size in Stage 2.
This stage translates the identified changes into quantifiable effect sizes.
Indicator values vary naturally in space and time, and estimates of their true values can only be made via samples. Some observed changes in the indicators are likely to be ecologically trivial, or even normal components of ecosystem function.
If the water quality monitoring objective is to detect non-trivial changes or trends in the indicators soon enough to allow management to act, then the monitoring program needs to be sensitive enough to detect modest rather than large changes in those indicators.
Selecting widely used default monitoring protocols — out of habit or convenience — is not sufficient. The selected monitoring programs must have the sensitivity required to assess achievement or otherwise of the management goal.
You need to identify the maximum amount of change in the indicator that is tolerable before you can conclude that there has been important or unacceptable change that requires a management response. This level of ecologically important change is sometimes called the critical effect size, or ‘effect size’ for short.
An effect size comprises 2 components: its form and its magnitude (Toft & Shea 1983, Mapstone 1995).
The form of an effect is the statistical measure (e.g. mean, variance) that is expected to differ between reference and disturbed conditions, and the pattern of differences or trends that needs to be detected (Stewart-Oaten et al. 1986, Green 1989, Underwood 1991, 1992).
The magnitude of an effect is the size of those differences or trends that would be considered ecologically important to detect.
Setting an effect size for ecosystem receptors can be difficult because there is often little information about the relationships between contaminants and biological indicators in field conditions, especially in Australia and New Zealand.
We make some suggestions about how to proceed, although this is not an exhaustive list. Other strategies may arise as you gain more experience in planning monitoring programs.
If an indicator has intrinsic socioeconomic value (e.g. a commercially or recreationally important species), then effect sizes can be set to ensure sustainable use of that indicator. However, many biological indicators have been selected because they are:
- more sensitive than commercial species, or
- thought to be ecologically important rather than of economic value.
For example, seagrass is not used directly by humans in Australia and New Zealand but is an important indicator because of the habitat it provides for many other species.
Existing research on similar stressors, preferably in comparable regions, can provide information about the relationship between the indicator and size of the potential effect, especially if existing effects can be found on a gradient from mild to extreme.
For example, a variety of wastewater treatment plants may be present in a river basin with differing degrees of sewage treatment. Pilot data relating indicator levels and type of treatment could be used in stakeholder consultations to correlate stakeholders’ expectations of acceptable sewage treatment with change in the indicator. In some cases, statistical or simulation models can use these data to estimate how much an indicator might change under different scenarios (Paul et al. 2016).
For many indicators in Australian and New Zealand ecosystems, however, such background data are unlikely to be available. Setting an effect size will then inevitably involve some judgement by the people planning the monitoring program, and an arbitrary but conservative effect size will need to be specified (e.g. Humphrey et al. 1995). This should be done explicitly, at the beginning of the program (as a feedback between Steps 2 and 3 of the Water Quality Monitoring Framework).
Sometimes the desired effect size requires impractically large sample sizes, and the trade-offs between error rates need to be discussed iteratively with stakeholders (refer to Stage 3). Any change to the effect size later in the program must be openly and explicitly negotiated and fully justified on scientific grounds (Step 9 of the Water Quality Monitoring Framework).
Recognising that some procedures have implicit effect sizes
For some of the procedures in the Water Quality Guidelines, the effect size and error rates are implicit in the methods, making them less amenable to the procedure of using scalable decision criteria that we describe here.
For example, in the AUSRIVAS procedure for rivers, notions of effect size and error rates are inherent in the way the summary indexes are compared with the reference conditions.
This stage assesses the risk of making a Type I error (false alarm) and a Type II error (false sense of security) in the light of the consequences or costs of making either of those errors.
Having stipulated an effect size, the stakeholders then need to minimise the risk of 2 potential statistical errors — the Type I and Type II errors. The challenge is to collect sufficient data to detect the change stipulated in the effect size without expending too many resources on sample sizes that will detect ecologically trivial changes in the indicator. Inevitably, resources are scarce, so all monitoring programs will need to balance these 2 errors.
The first potential error (Type I) is to declare that an effect has occurred (the effect size has been exceeded) when there has actually been no ecologically important change. The second (Type II) is to declare that no important change has occurred when the effect size has in fact been exceeded.
Conventionally, the Type I error rate, α, has been set at 0.05 or smaller (Toft & Shea 1983, Fairweather 1991, Mapstone 1995). Some have argued for conservative or default values for α and the Type II error rate, β, but ideally these rates should be negotiated rather than accepted uncritically (Greenland et al. 2016, Wasserstein & Lazar 2016). The most important part of this negotiation is to ensure that the balance between these 2 types of errors is acceptable to your involved stakeholders, which could occur at Steps 2, 4 or 5 of the Water Quality Management Framework.
To this end, the Water Quality Guidelines recommend Mapstone’s (1995) proposal that the ratio of Type I and Type II errors is negotiated when refining the design of the monitoring program. This process requires iteration between stakeholders, but it should be transparent, accountable and, above all, take place before the final monitoring or assessment program is implemented (Mapstone 1995).
Choosing error rates α and β involves 4 steps:
- Establish the relative importance or cost of the consequences of each type of error.
- Set the ratio of the critical Type I and Type II errors relative to their costs. If there is insufficient information to estimate the costs of the errors, then Mapstone (1995) suggested they should be weighted equally.
- Negotiate with the stakeholders the desired values of α and β with reference to the ratio established in the previous step.
- Design a sampling program to meet the desirable Type II error rate, β (established in the previous step), given the effect size specified earlier. This allows the sample size and details of the design to be finalised.
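As a minimal numerical sketch of the final step, the standard normal approximation for a two-sided, two-sample comparison of means shows how the negotiated effect size, α and β translate into sampling effort. This simplified setting (known common standard deviation, independent samples) is an assumption made here for illustration; real monitoring designs are rarely this simple and warrant professional statistical advice.

```python
import math
from statistics import NormalDist

def samples_per_group(effect_size, sd, alpha, beta):
    """Normal-approximation sample size per group for a two-sided,
    two-sample comparison of means.

    effect_size: smallest difference in indicator means regarded as
                 ecologically important (the negotiated effect size)
    sd:          common standard deviation of the indicator, e.g. from
                 pilot or baseline data
    alpha, beta: negotiated Type I and Type II error rates
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(1 - beta)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / effect_size) ** 2)

# Equal weighting of the 2 errors (Mapstone's suggestion when their
# costs cannot be estimated), for an effect size of half a standard
# deviation:
n_strict = samples_per_group(0.5, 1.0, alpha=0.05, beta=0.05)   # 104 per group
n_relaxed = samples_per_group(0.5, 1.0, alpha=0.10, beta=0.10)  # 69 per group
```

Note how relaxing both error rates from 0.05 to 0.10, while maintaining their 1:1 ratio, substantially reduces the required sampling effort — the trade-off that stakeholders negotiate in the steps above.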
Mapstone (1995) detailed 2 alternative decision procedures that can be followed once data have been collected and analysed.
Ideally, these negotiations should consider a number of potential indicators simultaneously. In the process of balancing Type I and Type II errors, some indicators will inevitably prove much more costly than others if the 2 error rates are to be kept low. In such cases, stakeholders are faced with a choice:
- replace the costly indicators with others that will detect the stipulated effect size more cheaply, or
- increase the sizes of both errors while maintaining the ratio between the errors (if the costly indicators must be included in the program for some reason).
The only way to reduce the sizes of these errors is to increase the sampling intensity. Maintaining the ratio between the errors ensures that Type I errors are not minimised at the expense of increasing Type II errors. The monitoring program should not lose power to detect an important change at the expense of being conservative about the probability of incorrectly declaring that an important change has occurred.
The condition of an ecosystem may influence decisions about monitoring study design. We define 3 categories of ecosystem condition in the Water Quality Guidelines:
For high conservation or ecological value systems, the study design will need to set effect sizes and the ratio between α and β so as to be as precautionary as possible. Often there are scant baseline data for ecosystem receptors in such systems. The study design should maximise opportunities to improve baseline knowledge so that natural variation is sufficiently well characterised to allow effect sizes to be set (Mapstone 1995).
Humphrey et al. (1999) criticised aspects of the environmental impact assessment (EIA) process in Australia, saying that too often developments proceeded without adequate baseline data being gathered to detect and assess potential disturbances.
We strongly recommend that parties adopt a precautionary approach and respond wisely and in a timely manner to data gathered for ‘early detection’ indicators.
Slightly to moderately disturbed systems should be treated like high conservation or ecological value systems, acknowledging that there may be negotiated deviations from the default guideline values (DGVs) prescribed for high conservation or ecological value systems. Nevertheless, any decisions on effect size should be based on sound ecological principles of sustainability, not on arbitrary relaxation of the guideline values determined for high conservation or ecological value systems or on resource constraints.
For highly disturbed systems, our philosophy is that, at worst, water quality is maintained so that it can support the values identified by stakeholders (Step 2 of the Water Quality Management Framework). Ideally, the longer-term aim is towards improved water quality, in which case design considerations for remediation become relevant.
Your objective may be to provide evidence of environmental change related to water quality, or to identify the source of contaminants or stressors.
Providing evidence of environmental change using field-based lines of evidence (e.g. biodiversity, in situ toxicity testing, some biomarkers) will primarily rely on designs such as the multiple control before–after control–impact paired-differences (M)BACI(P) design class, which captures temporal and spatial variation but usually at smaller spatial scales.
More flexible and appropriate statistical tools are available to analyse data from these designs, and you should seek professional statistical advice during planning to ensure that sample sizes and other design elements are sufficient to permit the planned analysis.
Temporal baselines before the change are frequently insufficient for robust applications of these designs, so we outline some strategies to mitigate these data limitations.
Study designs that rely on patterns or trends over time (including those in the (M)BACI(P) design class) need to carefully consider the temporal scope of the monitoring program and any natural cyclic behaviour (e.g. daily or seasonal cycles) in the ecosystem receptor.
Distinguishing ‘signal’ from ‘noise’ in time series data can require intensive sampling programs, and these demands need to be weighed carefully when selecting indicators at Step 3 of the Water Quality Management Framework (refer to temporal analysis approaches).
Indicators based on laboratory and mesocosm data, such as the toxicity and biomarker lines of evidence, will mostly be served by experimental designs prescribed by relevant protocols or procedures (Hook et al. 2014, Kroon et al. 2017).
Assessing change where there are no baseline data is commonly required when investigating an unexpected event, such as a fish kill or an accidental spill.
Where there is little baseline information, liberal use of spatial controls — if available — greatly improves your ability to infer water quality-related change in ecosystems.
Whether or not spatial controls are available, it is crucial that additional lines of evidence are used and combined in a weight-of-evidence approach. The process is based on risk assessment principles derived from logic tables (e.g. Chapman 1990) or epidemiological precepts (Beyers 1998) adapted to environmental applications, with examples described by Suter & Cormier (2011) and Linkov et al. (2009) amongst others.
If ecosystem receptors with poor baseline data are used, then your monitoring team should enhance its set of monitoring techniques with additional lines of evidence, such as chemical monitoring, toxicological and other experimental data.
Reporting on environmental conditions across a broad spatial scale is often associated with State of the Environment reporting and similar ‘audits’, but broadscale assessments are also used for several other purposes.
Traditionally, ecosystem receptors used in broadscale monitoring programs have been multivariate responses, consisting of the abundances or presences and absences of plants (including algae), macroinvertebrates (e.g. AUSRIVAS) or fish at sampling locations across the landscape.
Similar information is generated by ‘new generation’ genomic techniques using eukaryotic, prokaryotic and functional markers of ecosystem receptors, which use similar statistical techniques for their analysis (Chariton et al. 2010, Bourlat et al. 2013, Saxena et al. 2015).
For larger organisms, patterns of presence or absence of one or a small number of species across a large number of sites can be used in broadscale assessments. The problem with such surveys is that the target organisms can only be detected imperfectly, but modern broadscale survey designs can incorporate elements that estimate detectability. Gwinn et al. (2016) reviewed the opportunities for such designs for fish.
Vertebrate ecologists have benefitted from rapid improvements in sampling technology and reductions in its cost (e.g. camera traps, automated audio recordings). These innovations have been accompanied by sampling designs aimed at estimating occupancy of sites across the landscape for a wide variety of taxa (MacKenzie & Royle 2005).
Occupancy incorporates information on the detectability of the organisms derived from several repeat surveys over the short term (e.g. several consecutive nights of camera trap data) (MacKenzie et al. 2006). Occupancy estimation and modelling have been applied to aquatic organisms, such as amphibians (Price et al. 2015) and snails (Tyre et al. 2003), while new genomic techniques will likely make multispecies occupancy modelling at broad scales more feasible (Noon et al. 2012).
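To make the occupancy idea concrete, the basic single-season model can be fitted to site-by-survey detection histories. The sketch below is illustrative only: the data are invented, and it uses a crude grid search rather than the purpose-built software (such as the R package unmarked) a real study would use.

```python
import math
from itertools import product

def occupancy_loglik(psi, p, histories):
    """Log-likelihood of the basic single-season occupancy model:
    psi = probability a site is occupied, p = per-survey detection
    probability at an occupied site. histories holds one tuple of
    0/1 detections per site."""
    ll = 0.0
    for h in histories:
        k, d = len(h), sum(h)
        if d > 0:   # detected at least once, so certainly occupied
            ll += math.log(psi) + d * math.log(p) + (k - d) * math.log(1 - p)
        else:       # never detected: occupied-but-missed, or unoccupied
            ll += math.log(psi * (1 - p) ** k + (1 - psi))
    return ll

def fit_occupancy(histories, grid=200):
    """Crude grid-search maximum likelihood estimate for (psi, p)."""
    best = (None, None, -math.inf)
    for i, j in product(range(1, grid), repeat=2):
        psi, p = i / grid, j / grid
        ll = occupancy_loglik(psi, p, histories)
        if ll > best[2]:
            best = (psi, p, ll)
    return best[0], best[1]

# Invented detection histories: 3 repeat surveys at each of 10 sites
histories = [(1, 0, 1)] * 6 + [(0, 0, 0)] * 4
psi_hat, p_hat = fit_occupancy(histories)
```

The fitted ψ exceeds the naive occupancy of 0.6 (the proportion of sites with at least one detection) because the model allows for occupied sites where the species was never detected.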
We provide advice on spatial aspects of study design.
For auditing applications and broadscale remediation studies, trends over time may be the focus of the study. Take care to ensure both spatial and temporal components of the design match the aims of the study.
Ecosystem recovery, including rehabilitation, remediation and restoration, presents a special challenge to implementing decision criteria because the monitoring program will ultimately seek to show that the values of the indicators are similar to those defined by the reference or control conditions, or are tracking closely to modelled trajectories and final ecosystem states, which may not reflect a priori natural reference conditions.
Long timescales (10 years or more)
For recovery over very long timescales, the study design will probably be framed around estimating trends or rates over time, often against modelled temporal trajectories or those based on a conceptual model. The Water Quality Guidelines provide further information on these approaches.
Shorter timescales (less than 10 years)
For shorter time frames, recovery is likely to be assessed in terms of returning to reference or control conditions, or an agreed ecosystem condition where modelling or other assessment indicates return to original condition is not possible. Under conventional statistical paradigms, you will be trying to prove the null hypothesis (no ecologically important change from reference or agreed ecosystem conditions), which is operationally impossible.
Bioequivalence tests used in medicine and toxicology (Chow & Liu 1992) provide a framework for conventional statistical procedures, and have been applied to environmental restoration for field surveys (McDonald & Erickson 1998) and in laboratory contexts (Erickson & McDonald 1995).
Tests of bioequivalence recast the question so that the undesirable outcome — that the remediated site differs substantially from the reference, control or other agreed condition (sites are not bioequivalent) — becomes the null hypothesis, and evidence is sought to reject this in favour of the alternative, that the remediated site is similar to the desired ecosystem state (sites are bioequivalent).
Formally, the hypotheses are framed in terms of the ratio of the values of the indicator in the remediated site and the (agreed) reference condition. If recovery has been achieved, then the ratio should be sufficiently close to 1, and there should be strong evidence against the null hypothesis, which is then rejected in favour of the alternative after conducting the appropriate statistical analysis.
Under bioequivalence testing:
- a Type I error results in incorrectly deciding that the sites are bioequivalent when they still differ by an important amount (inadequate recovery, false sense of security)
- a Type II error results in deciding that the sites still differ when in fact they are similar (adequate recovery, false alarm).
Stakeholders must still negotiate an effect size with this technique: you need to stipulate how different the sites can be before they are declared ‘non-bioequivalent’. In formal terms, this means stipulating a critical value of the ratio of the indicator between the sites.
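Assuming roughly lognormal indicator values and large samples, the ratio-based hypotheses can be sketched as a two one-sided tests (TOST) procedure on the log scale. The equivalence bounds of 0.8 and 1.25 below are borrowed from pharmaceutical convention purely for illustration (in practice the negotiated critical ratio would be used), and the data are invented.

```python
import math
from statistics import NormalDist, mean, stdev

def tost_ratio(remediated, reference, lower=0.8, upper=1.25):
    """Two one-sided tests (TOST) for bioequivalence of indicator means
    at a remediated site versus a reference condition, framed on the
    ratio of means via a log transform (large-sample normal
    approximation).

    Null hypothesis: the true ratio lies OUTSIDE [lower, upper]
    (sites not bioequivalent). A small returned p value is evidence
    that the sites ARE bioequivalent.
    """
    x = [math.log(v) for v in remediated]
    y = [math.log(v) for v in reference]
    diff = mean(x) - mean(y)  # estimate of log(ratio)
    se = math.sqrt(stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y))
    nd = NormalDist()
    p_low = 1 - nd.cdf((diff - math.log(lower)) / se)   # H0: ratio <= lower
    p_high = nd.cdf((diff - math.log(upper)) / se)      # H0: ratio >= upper
    return max(p_low, p_high)

# Invented indicator values for illustration
reference = [10.0, 9.9, 10.1, 9.95, 10.05, 10.2, 9.8, 10.0, 9.9, 10.1]
recovered = [9.8, 10.1, 9.9, 10.2, 10.0, 9.95, 10.05, 9.85, 10.15, 9.9]
degraded = [5.0, 5.2, 4.9, 5.1, 4.8, 5.05, 5.15, 4.95, 5.1, 4.9]

p_recovered = tost_ratio(recovered, reference)  # small: bioequivalent
p_degraded = tost_ratio(degraded, reference)    # large: not bioequivalent
```

A real assessment would use t distributions, check distributional assumptions and reflect the survey design in the standard error; refer to McDonald & Erickson (1998) and McBride (1999) for fuller treatments.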
Subsequent contributions note the potential utility of these tests, and also describe some of the difficulties in their implementation, for estimating differences between means or quantiles (McBride 1999, Fox 2001, Parkhurst 2001, Ngatia et al. 2010) and trends (slopes) (Dixon & Pechmann 2005, 2008, Camp et al. 2008).
Ryan (2013) addressed issues of statistical power, and McBride et al. (2014) provided a lucid treatment oriented towards water quality applications. Fox (2001) and Fox et al. (2007) extended ideas about power, design and equivalence testing to situations where traditional statistical techniques do not apply or where the distributions of the variables are uncertain.
As with environmental change related to water quality, site-specific studies will be best served by the (M)BACI(P) design class, unless pre-remediation baselines are inadequate, in which case spatial and temporal aspects of the design will need to play a bigger role, with inferences further strengthened by other lines of evidence.
Remediation may sometimes need long time frames (decadal or longer) so the design will need to pay particular attention to sampling frequency and pattern to detect the trends most relevant to managing the remediation.
Beyers DW 1998, Causal inference in environmental impact studies, Journal of the North American Benthological Society 17(3): 367–373.
Bourlat SJ, Borja A, Gilbert J, Taylor MI, Davies N, Weisberg SB, et al. 2013, Genomics in marine monitoring: new opportunities for assessing marine health status, Marine Pollution Bulletin 74: 19–31.
Camp RJ, Seavy NE, Gorresen PM & Reynolds MH 2008, A statistical test to show negligible trend: comment, Ecology 89: 1469–1472.
Chapman PM 1990, The sediment quality triad approach to determining pollution-induced degradation, Science of the Total Environment 97–98: 815–825.
Chariton AA, Court LN, Hartley DM, Colloff MJ & Hardy CM 2010, Ecological assessment of estuarine sediments by pyrosequencing eukaryotic ribosomal DNA, Frontiers in Ecology and the Environment 8: 233–238.
Chow S-C & Liu J-P 1992, Design and Analysis of Bioavailability and Bioequivalence Studies, Marcel Dekker, New York.
Dixon PM & Pechmann JHK 2005, A statistical test to show negligible trend, Ecology 86: 1751–1756.
Dixon PM & Pechmann JHK 2008, A statistical test to show negligible trend: reply, Ecology 89: 1473.
Downes BJ, Barmuta LA, Fairweather PG, Faith DP, Keough MJ, Lake PS, et al. 2002, Monitoring Ecological Impacts: Concepts and practice in flowing waters, Cambridge University Press, Cambridge.
Erickson WP & McDonald LL 1995, Tests for bioequivalence of control media and test media in studies of toxicity, Environmental Toxicology & Chemistry 14: 1247–1256.
Fairweather PG 1991, Statistical power and design requirements for environmental monitoring, Australian Journal of Marine and Freshwater Research 42: 555–567.
Fox DR 2001, Environmental power analysis — a new perspective, Environmetrics 12: 437–449.
Fox DR, Ben-Haim Y, Hayes KR, McCarthy MA, Wintle B & Dunstan P 2007, An info-gap approach to power and sample size calculations, Environmetrics 18: 189–203.
Green RH 1989, Power analysis and practical strategies for environmental monitoring, Environmental Research 50: 195–205.
Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN et al. 2016, Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations, European Journal of Epidemiology 31(4): 337–350.
Gwinn DC, Beesley LS, Close P, Gawne B & Davies PM 2016, Imperfect detection and the determination of environmental flows for fish: challenges, implications and solutions, Freshwater Biology 61: 172–180.
Hook SE, Gallagher EP & Batley GE 2014, The role of biomarkers in the assessment of aquatic ecosystem health, Integrated Environmental Assessment and Management 10(3): 327–341.
Humphrey CL, Faith DP & Dostine PL 1995, Baseline requirements for assessment of mining impact using biological monitoring, Australian Journal of Ecology 20(1): 150–166.
Humphrey C, Thurtell L, Pidgeon RWJ, van Dam R & Finlayson M 1999, A model for assessing the health of Kakadu's streams, Australian Biologist 12: 33–42.
Kroon F, Streten C & Harries S 2017, A protocol for identifying suitable biomarkers to assess fish health: a systematic review, PLoS ONE 12: e0174762.
Linkov I, Loney D, Cormier S, Satterstrom FK & Bridges T 2009, Weight-of-evidence evaluation in environmental assessment: review of qualitative and quantitative approaches, Science of the Total Environment 407: 5199–5205.
MacKenzie DI, Nichols JD, Royle JA, Pollock KH, Bailey LL & Hines JE 2006, Occupancy Estimation and Modeling: Inferring patterns and dynamics of species occurrence, 1st Edition, Elsevier, Burlington.
MacKenzie DI & Royle JA 2005, Designing occupancy studies: general advice and allocating survey effort, Journal of Applied Ecology 42: 1105–1114.
Mapstone BD 1995, Scalable decision rules for environmental impact studies: effect size, Type I, and Type II errors, Ecological Applications 5(2): 401–410.
McBride G, Cole RG, Westbrooke I & Jowett I 2014, Assessing environmentally significant effects: a better strength-of-evidence than a single P value? Environmental Monitoring and Assessment 186: 2729–2740.
McBride GB 1999, Applications: equivalence tests can enhance environmental science and management, Australian & New Zealand Journal of Statistics 41(1): 19–29.
McDonald LL & Erickson WP 1998, Testing for bioequivalence in field studies: has a disturbed site been adequately reclaimed? in: Fletcher DJ & Manly BFJ (eds), Statistics in Ecology and Environmental Monitoring 2, University of Otago Press, Dunedin, pp. 183–197.
Ngatia M, Gonzalez D, Julian SS & Conner A 2010, Equivalence versus classical statistical tests in water quality assessments, Journal of Environmental Monitoring 12: 172–177.
Noon BR, Bailey LL, Sisk TD & McKelvey KS 2012, Efficient species-level monitoring at the landscape scale, Conservation Biology 26: 432–441.
Parkhurst DF 2001, Statistical significance tests: equivalence and reverse tests should reduce misinterpretation, Bioscience 51: 1051–1057.
Paul WL, Rokahr PA, Webb JM, Rees GN & Clune TS 2016, Causal modelling applied to the risk assessment of a wastewater discharge, Environmental Monitoring and Assessment 188: 131.
Price SJ, Muncy BL, Bonner SJ, Drayer AN & Barton CD 2015, Effects of mountaintop removal mining and valley filling on the occupancy and abundance of stream salamanders, Journal of Applied Ecology 53: 459–468.
Ryan TP Jr 2013, Sample Size Determination and Power, John Wiley & Sons, Hoboken.
Sahu SK & Smith TMF 2006, A Bayesian method of sample size determination with practical applications, Journal of the Royal Statistical Society: Series A (Statistics in Society) 169: 235–253.
Saxena G, Marzinelli EM, Naing NN, He Z, Liang Y, Tom L et al. 2015, Ecogenomics reveals metals and land-use pressures on microbial communities in the waterways of a megacity, Environmental Science & Technology 49: 1462–1471.
Stewart-Oaten A, Murdoch WW & Parker KR 1986, Environmental impact assessment: “pseudoreplication” in time? Ecology 67: 929–940.
Suter GW & Cormier SM 2011, Why and how to combine evidence in environmental assessments: weighing evidence and building cases, Science of the Total Environment 409: 1406–1417.
Toft CA & Shea PJ 1983, Detecting community-wide patterns: estimating power strengthens statistical inference, American Naturalist 122: 618–625.
Tyre AJ, Tenhumberg B, Field SA, Niejalke D, Parris K & Possingham HP 2003, Improving precision and reducing bias in biological surveys: estimating false-negative error rates, Ecological Applications 13: 1790–1801.
Underwood AJ 1991, Beyond BACI: experimental designs for detecting human environmental impacts on temporal variation in natural populations, Australian Journal of Marine and Freshwater Research 42(5): 569–587.
Underwood AJ 1992, Beyond BACI: the detection of environmental impact on populations in the real, but variable, world, Journal of Experimental Marine Biology and Ecology 161(2): 145–178.
Wasserstein RL & Lazar NA 2016, The ASA’s statement on p-values: context, process, and purpose, The American Statistician 70: 129–133.