Correlation between variables

Before reading this page, you should read:

In water quality monitoring studies, we are often interested in quantifying the relationship between multiple water quality indicators and other environmental drivers.

You can conveniently explore statistical relationships between pairs of water quality variables, or between water quality and environmental variables, using standard statistical techniques:

correlation analysis
multivariate and high-dimensional data analysis
regression analysis.

Correlation analysis

A useful adjunct to the scatterplot matrix presented in Figure 2 of Data preparation is a summary of the correlations between pairs of variables.

The Pearson correlation coefficient is a numerical measure of the degree of linearity between 2 variables. Given 2 variables X and Y (where Y is notionally the dependent variable and X the independent variable), the sample correlation coefficient (r) is computed using Equation 1.

r equals sigma-summation Underscript i equals 1 Overscript n Endscripts left-parenthesis x Subscript i minus x overbar right-parenthesis left-parenthesis y Subscript i minus y overbar right-parenthesis Over StartRoot sigma-summation Underscript i equals 1 Overscript n Endscripts left-parenthesis x Subscript i minus x overbar right-parenthesis squared sigma-summation Underscript i equals 1 Overscript n left-parenthesis y Subscript i minus y overbar right-parenthesis squared EndRoot — **Equation 1**

The correlation coefficient is constrained to lie in the interval –1 ≤ r ≤ +1. Broad interpretations of r are provided in Table 1.

**Table 1 Broad interpretations of Pearson’s correlation coefficient**
Value of \|r\|	Interpretation
0.7 to 1.0	Strong linear association
0.5 to 0.7	Moderate linear association
0.3 to 0.5	Weak linear association
0 to 0.3	Little or no linear association

(A related and important statistic used to assess the adequacy of a fitted regression model is the coefficient of determination, R². In simple terms, R² is the proportion of total variation in some dependent variable that can be explained or accounted for by a regression model.)

A correlation analysis is generally done before a more comprehensive regression analysis, which models relationships between the variables and makes inferences about the underlying parameter values and the expected response.

Multivariate and high-dimensional data

Water quality monitoring programs often generate high-dimensional data that pose considerable challenges.

The data can be the collection of multiple water quality variables at the same site or of the same water quality variable from different sites, or a combination of both. Responses to the multiple water quality data can also be multivariate if the impact study is of interest.

Statistical analogues of many of the univariate techniques are available for high-dimensional data.

Many classic techniques were developed to deal with specific features of high-dimensional data under the statistical umbrella of multivariate analysis with strong distributional assumptions on the data (often under the assumption of normality). These techniques are underpinned by advanced statistical concepts and linear algebra.

Most statistical software have multivariate statistical analysis capabilities. Before applying these more advanced statistical tools, you should:

visually explore the data
attempt to identify relationships between pairs and groups of variables
determine the nature of these relationships, which may indicate the need for suitable data transformations.

Many multivariate software provide these visualisations, such as draftsman plots and scatterplot matrixes.

Distance matrixes have a crucial role in many multivariate methods and are often used as a way to characterise similarity (or dissimilarity) across many variables in both the analysis and visualisation.

The choice of similarity or dissimilarity can be important and should be selected according to the differences of concern (e.g. Faith et al. 1987, Clarke 1993).

Visualisation of data is necessarily limited to 3 dimensions, although often as a 2-dimensional projection. To overcome the curse of dimensionality, a number of statistical procedures have been devised in an attempt to reduce the number of dimensions with minimal loss of information contained in the original set of variables.

The methods discussed here are available in published literature (e.g. Davis 2002, Manly 2004). For easy-to-read descriptions from the aspects of ecology and biology, Legendre & Legendre (2012) and Quinn & Keough (2003) have very useful descriptions of multivariate techniques.

Figure 1 summarises the broad types of multivariate methods and provides examples of common techniques within those types and some examples of typical analyses for water quality data. The distinction is made between descriptive and inferential methods. Regression analysis is included, and while it can be useful for high-dimensional data, it is not usually considered a multivariate method. Many alternative representations and typologies of multivariate methods are possible, and we encourage you to consult some of the published literature.

**Figure 1 Broad typology and summary of multivariate methods**

Multivariate techniques are often based on geometric intuition. Although often not explicitly stated, their performances are affected by the underlying distribution of the data and they work best for normal data. Transformations of non-normal data are sometimes used to produce normal data but the results need to be interpreted carefully due to the transformation. Instead of data transformation, nonparametric methods, such as non-metric multidimensional scaling, ANOSIM and PERMANOVA, have been developed and often appeal to simulation or randomisation-based solutions that are possible with modern computers.

One technique that is potentially useful in water quality studies is principal component analysis (PCA) (Pearson 1901). PCA constructs linear combinations of the original variables such that the resulting combinations account for the maximal amount of variation in the original data using considerably fewer constructed variables.

The drawback of PCA is that the constructed variables (‘components’) generally have no real or physical meaning and have no recognisable units of measurement. As an exploratory analysis technique and dimension-reduction device, PCA has a potentially valuable role to play. You should consult any of the texts available on multivariate statistics to learn more about PCA.

If the data are multivariate Gaussian, then PCA has the beneficial property that the distribution is centred on the principal component axes and is therefore ‘self-consistent’ (Hastie & Stuetzle 1989).

Take care when the dimension of data is much larger than the number of observations because the results may not be consistent (Johnstone & Lu 2009). Several regularisations and algorithms were proposed to address this issue, such as lasso type (Joliffe et al. 2003), penalised method (Witten et al. 2009) and semiparametric method (Han & Liu 2012).

Multidimensional scaling (MDS) (Torgerson 1952) is a statistical technique that attempts to reveal relationships or patterns among a large number of variables by reconstructing ‘similarities’ between pairs of these variables in a ‘reduced’ space (typically 2 or 3 dimensions). MDS is usually introduced by analogy with our example situation.

Example of performing multidimensional scaling

Given a map showing the locations of major cities, it is a straightforward task to construct a table of distances between the cities. The reverse task is considerably more difficult. Given a table of intercity distances, how is the map constructed? MDS attempts to resolve that problem.

In the natural environment, the difficulty is compounded because our input data (‘distance’ matrix) are subject to considerable uncertainty. Furthermore, the number of dimensions required to adequately represent the observed similarities or distances is not always evident.

Biologists and ecologists have found MDS analysis particularly useful in teasing out complex space–time interactions among biological variables.

Like PCA, MDS has some limitations, including the presentation of results in an abstract space, uncertainty surrounding the appropriate dimension of the reduced space, and a multitude of similarity measures on which to base the MDS. The calculations underpinning MDS are highly complex and are iterative. A terminal solution is generally found only after a number of alternatives have been explored.

The lack of agreement between the similarities or distances in the final MDS representation and the original input data are measured by the so-called ‘stress statistic’. The derivation of fitted distances also depends on the type of MDS that is performed.

Many MDS procedures exist but the main dichotomy differentiates between metric MDS and nonmetric MDS, depending on the measurement level of the data. Metric MDS assumes the input data are quantitative (measured on an interval or ratio scale). Nonmetric MDS assumes the input data are qualitative (measured on a nominal or ordinal scale).

There are different views about the extent to which MDS can be used as an inferential tool. Computationally intensive methods are available that enable you to conduct formal tests of significance, although the strength of MDS for water quality (including biological) studies is more likely to reside in its ability to let the analyst discern patterns and trends visually. Refer to Borg & Groenen (2005) for comprehensive discussion and applications.

Cluster analysis or clustering (Driver & Kroeber 1932) is a statistical tool that groups a set of observations in such a way that the observations in the same cluster are more similar to each other than to those in other clusters. The first step in cluster analysis is to decide the characteristics or features that are used to group the observations. Then you need to decide on the procedure to form the clusters of observations.

Many different approaches exist to form clusters, including:

hierarchical clustering methods, which recursively aggregate or divide according to some metric of similarity or dissimilarity among the observations
k-means or fuzzy k-means clustering, which seeks to find the k clusters with the greatest separation in multivariate feature space
other model-based clustering approaches, which seek to define clusters so that observations in the same cluster most likely belong to the same distribution.

There is no simple guidance to select a clustering method. It often makes sense to try multiple methods. If clusters persist across different methods, then it is more likely that those clusters are unique in some way.

Analysis of similarities (ANOSIM), a modification and generalisation of the Mantel test (Mantel 1967) for 2 groups with real values of distance between observations, is a distribution-free method that directly tests whether there is a significant difference (either in location or dispersion) between 2 or more sampled groups by using the ranks of distance.

The test statistic established is the difference between the mean between-groups and within-groups rank dissimilarities, which is similar to a correlation coefficient.

However, distances are not independent of each other. Changing the position of one observation will change its distances to all the other observations. So you cannot simply use the correlation to determine its statistical significance. Instead, the significance of the statistic is assessed by permuting the group vector to obtain its empirical distribution under the null hypothesis model and comparing the calculated value of the test statistic with the empirical distribution.

Like other distance-based methods, ANOSIM is affected by the similarity index used. For ecological application, discussions can be found in Clark et al. (2006a, 2006b).

Sometimes it is necessary to discriminate variables between 2 groups and identify the variables that explain the similarity or dissimilarity.

Similarity percentage analysis (SIMPER) gives the percentage of the contribution of individual variables to that similarity and the cumulative percentage of contribution so that you know how many variables explain a given level of similarity (Clarke 1993). The analysis of differences between groups by SIMPER can be sensitive to species that are highly variable (Warton et al. 2012).

Multivariate analysis of variance (MANOVA), a multivariate version of the univariate analysis of variance (ANOVA), tests the null hypothesis of no difference in the multivariate centroids among the groups. A number of statistical tests have been designed, such as Wilks’ lambda, the Lawley–Hotelling trace, the Pillai–Bartlett trace and Roy’s largest root. You can obtain the theory and application from a wide range of textbooks. Unfortunately, the validity of MANOVA relies on strong assumptions, including normality of errors, linearity among all dependent variables and homogeneity of variance–covariance matrixes among groups, and also suffers from outliers in the data.

Permutational multivariate analysis of variance (PERMANOVA) is designed to derive a more robust test when the underlying assumptions of MANOVA become invalid, by using permutation to make it distribution-free (Anderson 2001). PERMANOVA works with any distance measure that is appropriate to the data. The permutation idea can be used to include most of the options you would expect from modern ANOVA or MANOVA implementation. Anderson & Walsh (2013) demonstrated its robustness against different assumptions that underpin MANOVA in comparison with ANOSIM and Mantel tests.

It is also of interest to link multivariate biotic patterns to environmental variables (Clarke & Ainsworth 1993). Clarke et al. (2008) tested if the biotic patterns had significant correlation with its environment. A Bray–Curtis dissimilarity matrix among biotic variables is calculated once but all possible subsets of environmental variables are examined using Euclidean distance to examine the extent to which the particular subset of environmental variable has captured the sampled biotic pattern by correlating the matching entries of the 2 resemblance matrixes. As there is almost always a positive rank correlation, there is a strong selection bias if the null hypothesis of zero correlation is carried out, and a permutation needs to be implemented to rectify this selection bias.

There are many more multivariate techniques across the statistical, ecological and biological sciences than presented here. For instance, ordination methods are particularly popular in quantitative ecology.

More model-based approaches are now available for common multivariate methods, such as model-based approaches to clustering and ordination, or species distribution modelling approaches (e.g. Warton et al. 2015, Baeten et al. 2014, Hui et al. 2015). These are increasingly popular because they enable more rigorous model selection and checking, strengthen the basis for statistical inference, assist interpretation and offer the ability for extensions to address particular features in the data or problem.

Additional approaches centred more on statistical or machine learning, such as classification and regression tree (CART) analysis (e.g. Breiman et al. 1984) or random forests for classification (e.g. Breiman 2001), are also popular.

You can explore the rich suite of potential multivariate techniques by consulting many of the standard texts previously mentioned (e.g. Legendre & Legendre 2012).

Regression analysis

Elsewhere we discussed regression methods of a single water quality variable measured at one site over time or at multiple sites for a particular point in time. These methods still apply when identifying and quantifying the relationship between multiple water quality indicators, or between water quality indicators and environmental drivers.

An example is a model developed by Kuhnert et al. (2012) for total suspended solids concentration that includes multiple terms for the effect of flow history on concentration, a term that captures sediment trapping and spatial sources of flow, and a term for catchment vegetation cover.

If, through exploratory data analyses, plots of the dependent variable indicate that it may follow a particular statistical distribution, then parametric regression methods, such as generalised linear modelling (McCullagh & Nelder 1989) and generalised additive modelling (Hastie & Tibshirani 1990, Wood 2006) may be appropriate. Otherwise, nonparametric regression methods are a popular approach when such indications or assumptions are not valid.

Nonparametric regression methods, such as classification and regression trees (CARTs) (Breiman et al. 1984), have played a prominent role in the environmental space due to their ability to model complex relationships and interactions from a potentially large set of predictor variables and to present the findings in a graphical form.

As the name suggests, a tree-based model can be constructed on data to inform a classification (classification tree), a continuous response or count data (regression tree). The methodology comprises 3 components, the first of which aims to partition the data into homogeneous (or like) groups until a large tree is produced. Each partitioning is formed by splitting the data using a variable and split location that is identified as important by a split criterion.

In classification problems, the most popular criterion used is the gini index. For regression, the mean square error is adopted. Breiman et al. (1984) provided a discussion around these different measures and how they are used to partition the data. In addition to identifying a primary split to partition the data, surrogate splits are also identified.

The algorithm used to partition the data and form a tree with many splits is considered a ‘greedy’ approach because once a split is performed, it is never revisited. While this approach is computationally efficient and fast, it can lead to models that underperform or are considered unstable.

As highlighted by Breiman (2001), tree-based models, while visually interpretable, can produce slightly different trees and predictions if a resample of the data is taken and a new model is fitted. This by-product of the tree-based algorithms resulted in a body of research that investigated ensemble approaches, whereby bootstrap samples of the data were used to develop a large number of trees. Predictions for each tree model could then be averaged to result in more accurate and stable predictions (Breiman 2001).

Bagging (Breiman 1996), boosting (Breiman 1998, Freund & Schapire 1996) and random forests (Breiman 2001) are some of the primary ensemble approaches that have resulted in better predictive models, particularly in situations where a problem has many inputs that do not predict well individually.

With random forests, trees are developed on bootstrap samples of the data with a random feature selection. It is this ‘randomness’ of samples and features taken that leads to an increase in the performance of the method over other ensemble methods and tree-based models.

While the interpretability of the model is somewhat lost through random forests due to the averaging across multiple models, partial dependence plots can be explored to examine each variable’s marginal contribution. These plots can be likened to the smooth marginal plots from a generalised additive model (GAM) but are less smooth due to how the trees are constructed. Like decision trees, a variable importance ranking can be produced to examine important variables.

Extensions to decision trees for multivariate and higher dimensional data are multivariate regression trees (MRT) (De'ath 2002, Larsen & Speckman 2004). Kuhnert et al. (2012) developed a methodology that builds on the classification tree framework of Breiman et al. (1984) by considering a rearrangement of the data from a matrix to a vector.

References

Anderson MJ 2001, A new method for non-parametric multivariate analysis of variance, Austral Ecology 26(1): 32–46.

Anderson MJ & Walsh CI 2013, PERMANOVA, ANOSIM and the Mantel test in the face of heterogeneous: What null hypothesis are you testing? Ecological Monographs 83(4): 557–574.

Baeten L, Warton DI, Van Calster H, De Frenne P, Verstraeten G, Bonte D, Bernhardt-Römermann M, Cornelis J, Decocq G, Eriksson O, Hédl R, Heinken T, Hermy M, Hommel P, Kirby K, Naaf T, Petřík P, Walther GR, Wulf M & Verheyen K 2014, A model-based approach to studying changes in compositional heterogeneity, Methods in Ecology and Evolution 5: 156–164.

Borg L & Groenen PJF 2005, Modern Multidimensional Scaling: Theory and applications, Springer, New York.

Breiman L 1996, Bagging predictors, Machine Learning 24: 123–140.

Breiman L 1998, Arcing classifiers (with discussion and a rejoinder by the author), The Annals of Statistics 26(3): 801–824.

Breiman L 2001, Random forests, Machine Learning 45: 5–32.

Breiman L, Friedman J, Stone CJ & Olshen RA 1984, Classification and Regression Trees, Chapman and Hall, USA.

Clarke KR, Chapman MG, Somerfield PJ & Needham HR 2006a, Dispersion-based weighting of species counts in assemblage analyses, Marine Ecology Progress Series 320: 11–27

Clarke KR, Somerfield PJ & Chapman MG 2006b, On resemblance measures for ecological studies, including taxonomic dissimilarities and a zero-adjusted Bray–Curtis coefficient for denuded assemblages, Journal of Experimental Marine Biology and Ecology 330: 55–80.

Clarke KR 1993, Non-parametric multivariate analyses of changes in community structure, Australian Journal of Ecology (Austral Ecology) 18: 117–143.

Clarke KR & Ainsworth M 1993, A method of linking multivariate community structure to environmental variables, Marine Ecology Progress Series 92(3): 205–219.

Clarke KR, Somerfield PJ & Gorley RN 2008, Testing of null hypotheses in exploratory community analyses: Similarity profiles and biota-environment linage, Journal of Experimental Marine Biology and Ecology 366: 56–69.

Davis JC 2002, Statistics and Data Analysis in Geology, 3rd Edition, John Wiley and Sons, New York.

De'ath G 2002, Multivariate regression trees: a new technique for modeling species-environment relationships, Ecology 83: 1105–1117.

Driver HE & Kroeber AL 1932, Quantitative expression of cultural relationships, The University of California Publications in American Archaeology and Ethnology 31: 211–256.

Faith DP, Minchin RM & Belbin L 1987, Compositional dissimilarity as a robust measure of ecological distance, Plant Ecology 69: 57–68

Freund Y & Schapire RE 1996, Experiments with a new boosting algorithm, in: Saitta L & Kaufmann, M (eds), Machine Learning: Proceedings of the Thirteenth International Conference, Bari, Italy.

Han F & Liu H 2012, Transelliptical component analysis, Advances in Neural Information Processing Systems 25: 171–179.

Hastie TJ & Stuetzle W 1989, Principal curves, Journal of the American Statistical Association 84(406): 502–516.

Hastie TJ & Tibshirani RJ 1990, Generalized Additive Models, Chapman and Hall/CRC, London.

Hui FKC, Taskinen S, Pledger S, Foster SD & Warton DI 2015, Model-based approaches to unconstrained ordination, Methods in Ecology and Evolution 6: 399–411.

Johnstone IM & Lu AY 2009, On consistency and sparsity for principal components analysis in high dimensions, Journal of the American Statistical Association 104(486): 682–693.

Joliffe NT, Trendafilov NT & Uddin M 2003, A modified principal component technique based on the lasso, Journal of Computational and Graphical Statistics 12(3): 531–547.

Kuhnert PM, Henderson B, Lewis SE, Bainbridge ZT, Wilkinson S & Brodie JE 2012, Quantifying total suspended sediment export from the Burdekin River catchment using the loads regression estimator tool, Water Resources Research 48(4): 1–18.

Larsen DR & Speckman PL 2004, Multivariate regression trees for analysis of abundance data, Biometrics 60: 543–549.

Legendre P & Legendre L 2012, Numerical Ecology, 3rd Edition, Elsevier, Amsterdam.

Manly BFJ 2004, Multivariate Statistical Methods: A primer, 3rd Edition, Chapman and Hall/CRC.

Mantel N 1967, The detection of disease clustering and a generalized regression approach, Cancer Research 27(2): 209–220.

McCullagh P & Nelder JA 1989, Generalized Linear Models, 2nd Edition, Chapman and Hall/CRC.

Pearson K 1901, LIII. On lines and planes of closest fit to systems of points in space, Philosophical Magazine 2(11): 559–572.

Quinn GP & Keough MJ 2003, Experimental Design and Data Analysis for Biologists, Cambridge University Press, Cambridge.

Torgerson WS 1952, Multidimensional scaling: I. Theory and method, Psychometrika 17: 401–419.

Warton DI, Wright TW & Wang Y 2012, Distance-based multivariate analyses confound location and dispersion effect, Methods in Ecology and Evolution 3: 89–101.

Warton DI, Foster SD, De’ath G, Stoklosa J & Dunstan PK 2015, Model-based thinking in ecology, Plant Ecology 216(5): 669–682.

Witten DM, Tibshirani R & Hastie TJ 2009, A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis, Biostatistics 10(3): 515–534.

Wood S 2017, Generalized Additive Models: An introduction with R, Chapman and Hall/CRC, Florida.

Correlation between water quality variables

Correlation analysis

Multivariate and high-dimensional data

Regression analysis

References