Thursday, April 30, 2015

Function for maximising rank correlation between environmental and compositional differences

As a follow-on from the post on scaling environmental variables before calculating distances: the vegan package has a function, 'bioenv', that finds the subset of environmental variables whose (scaled) environmental distances have the highest rank correlation with compositional dissimilarity. This is useful if, for example, you want to know which environmental gradients matter most for species turnover in a system.
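A minimal sketch of a call to bioenv, using the varespec and varechem example datasets that ship with vegan rather than any data from this blog. The choice of six soil variables is arbitrary and only there to keep the number of subsets small. The method and index arguments are written out but are the function's defaults.

```r
library(vegan)

data(varespec)  # vegetation cover: 24 sites x 44 species
data(varechem)  # soil chemistry measured at the same 24 sites

# Illustrative subset of variables: bioenv evaluates every possible
# subset (2^p - 1 of them), so the search grows quickly with p
env <- varechem[, c("N", "P", "K", "Ca", "pH", "Al")]

# Spearman rank correlation between Bray-Curtis community dissimilarities
# and Euclidean distances on the scaled environmental variables
fit <- bioenv(varespec, env, method = "spearman", index = "bray")

fit           # best subset overall and its correlation
summary(fit)  # best subset of each size
```

Printing the fitted object reports the best subset overall, and summary() lists the best subset for each number of variables, so you can see how much the correlation improves as variables are added.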

Friday, April 10, 2015

Scaling environmental variables

Multivariate analysis in community ecology often involves comparing samples (e.g. field plots) on a measure of species composition, function or habitat structure, as well as on environmental variables (such as rainfall, soil nutrient levels or depth).

A basic data processing step is therefore to calculate pairwise distances between sample plots from the environmental variables, for example as input to a partial Mantel test. For simple continuous environmental variables, Euclidean distance is generally appropriate, and it can be calculated on all variables at once or on a subset. It is good practice to standardise the variables first, so that those measured on larger scales or with larger variances do not have a disproportionate influence on the distances.
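As a rough sketch of what standardising means here, assuming z-score scaling (subtract the mean, divide by the standard deviation), which is what base R's scale() does by default. The data frame env and its column names are made up for illustration.

```r
# Hypothetical environmental data: three variables on different scales
set.seed(1)
env <- data.frame(rainfall = rnorm(20, mean = 500, sd = 10),
                  soil_n   = rnorm(20, mean = 13,  sd = 5),
                  depth    = rnorm(20, mean = 30,  sd = 5))

# Manual z-scores: centre each column on its mean, divide by its sd
z_manual <- sapply(env, function(x) (x - mean(x)) / sd(x))

# scale() does the same by default (center = TRUE, scale = TRUE)
z_scale <- scale(env)

# Check that the two agree, ignoring the extra attributes scale() attaches
all.equal(z_manual, z_scale, check.attributes = FALSE)

# Euclidean distances on all scaled variables, or on a subset of them
d_all    <- dist(z_scale)
d_subset <- dist(z_scale[, c("rainfall", "depth")])
```

After scaling, every column has mean 0 and standard deviation 1, so each variable contributes on the same footing to the Euclidean distance.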

Here is an example using well-behaved, normally distributed dummy environmental variables. Scaling the variables (which may be on different scales or measured in different units) makes them comparable and gives them equal weight in the distances. The scatterplot at the bottom shows that distances calculated from the unscaled and scaled variables are not equivalent.



> variables <- data.frame(var1 = rnorm(100, mean=500, sd=10), var2 = rnorm(100, 13, 5), var3 = rnorm(100, 30, 5))
> variables
        var1      var2     var3
1   511.0022  6.714630 35.99468
2   497.8158 17.111932 23.84066
3   486.5962  8.608682 35.21800
4   496.9038 10.772592 27.12847
5   503.3185  8.448805 28.20107
6   504.2738 10.866616 31.44947
7   493.8976 18.157606 26.38141
8   492.1818 19.294316 26.23843
9   489.9431  7.349317 29.21912
10  478.5376 15.526236 21.32091

. . .

> par(mfrow=c(2,3)); hist(variables$var1); hist(variables$var2); hist(variables$var3)
> distances_raw <- dist(variables)
> variables <- as.data.frame(scale(variables))
> variables
             var1         var2          var3
1    0.9818746685 -1.288906408  1.3751342411
2   -0.2532038824  0.865837810 -1.2146544102
3   -1.3040573556 -0.896381771  1.2096385673
4   -0.3386237022 -0.447931619 -0.5140854569
5    0.2621974712 -0.929514889 -0.2855340385
6    0.3516740650 -0.428445964  0.4066368357
7   -0.6201900599  1.082543939 -0.6732696538
8   -0.7808971996  1.318116576 -0.7037355547
9   -0.9905750496 -1.157373534 -0.0686077652
10  -2.0588471840  0.537216980 -1.7515649547

. . .

> hist(variables$var1); hist(variables$var2); hist(variables$var3)
> distances_scaled <- dist(variables)

Histograms of var1, var2 and var3: the first row shows the raw variables, the second row shows the same variables after scaling.



> plot(distances_raw ~ distances_scaled, cex=0.7, col="red", pch=20, main="Euclidean distances from raw versus scaled variables")
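To put a number on the mismatch, one option is to compare both sets of distances with distances based on var1 alone, the variable with the largest variance. This is a sketch rather than part of the original analysis: the seed and the object distances_var1 are additions, and the dummy data are regenerated so the block runs on its own.

```r
# Regenerate dummy variables like those above (seed chosen arbitrarily)
set.seed(42)
variables <- data.frame(var1 = rnorm(100, mean = 500, sd = 10),
                        var2 = rnorm(100, 13, 5),
                        var3 = rnorm(100, 30, 5))

distances_raw    <- dist(variables)
distances_scaled <- dist(scale(variables))

# Distances based only on the highest-variance variable
distances_var1 <- dist(variables$var1)

# Spearman rank correlations between the sets of pairwise distances
r_raw_scaled  <- cor(as.vector(distances_raw), as.vector(distances_scaled),
                     method = "spearman")
r_raw_var1    <- cor(as.vector(distances_raw), as.vector(distances_var1),
                     method = "spearman")
r_scaled_var1 <- cor(as.vector(distances_scaled), as.vector(distances_var1),
                     method = "spearman")

c(raw_vs_scaled = r_raw_scaled,
  raw_vs_var1 = r_raw_var1,
  scaled_vs_var1 = r_scaled_var1)
```

Because var1 has the largest variance, the raw distances should track var1 more closely than the scaled distances do, which is the disproportionate influence that scaling is meant to remove.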