Cluster and Indicator Species｜Japan Sea National Fisheries Research Institute, FRA

Top page
Analyses on Ecological Assemblages
Cluster and Indicator Species

Analyses on Ecological Assemblages

Cluster and Indicator Species

Here, we learn methods to construct clusters of assemblages according to their similarity and to recognize thier indicator species.

Dissimilarity of assemblges and hierarchical cluster

Load packages by require(vegan) and require(cluster).

Read a assemblage data matrix from a csv file. An argument header=TRUE sets species names as column names, and row.names=1 sets site names as row names.

dat <- read.csv("spcdat.csv", header = TRUE, row.names=1)

To evaluate differencies between assemblages, a distance matrix is calculated by a function vegdist() in the package vegan. Morisita-Horn dissimilarity can be obtained by an argument method="horn". If you would like to get the Bray-Curtis dissimilarity, set method="bray".

dis.mh <- vegdist(dat, method="horn")
dis.bc <- vegdist(dat, method="bray")

Using the distance matrix, a function hclust() in the pacage cluster produces a hierarchical cluster. There are several methods to make clusters. The following example show one of the popular methods: the Ward method method="ward.D2".

resmh.ward <- hclust(dis.mh, method="ward.D2")
plot(resmh.ward)

If we would like to use the group-average method, another popular method, set method="average". In the example above, the site Ako1 is an outlier but this is included in a group by a superior grouping power of the Ward method.

After inspecting this dendrogram, we may think that the whole sits should be grouped into three groups. Then, use a function cutree() and set an argument k=3.

resmh.ward.g <- cutree(resmh.ward, k=3)
resmh.ward.g

##    Ako1    Ako2    Ako3  Nisio1  Nisio2  Nisio3 Nakatu1 Nakatu2 Nakatu3 
##       1       2       2       1       1       2       3       1       1 
##  Banzu1  Banzu2  Banzu3    Uto1    Uto2    Uto3 
##       2       2       2       3       3       1

We can decide the number of groups more objctivly by using a silhouette value.

asil <- numeric(nrow(dat))
for (i in 2:(nrow(dat)-1)){
  sil <- silhouette(cutree(resmh.ward, k=i), dis.mh)
  asil[i] <- mean(sil[,3])
}
asil

##  [1] 0.00000000 0.27436092 0.37874725 0.41602592 0.43325536 0.41073129
##  [7] 0.34779031 0.32437756 0.28972742 0.23109473 0.24977641 0.22718512
## [13] 0.09599773 0.06826518 0.00000000

Larger silhouette value indicates better grouping numbers. In this example, five groups gives the maximum silhouette 0.4332554. Let's see the 5 grouping and its silhouette plot.

resmh.ward.g <- cutree(resmh.ward, k=which.max(asil))
resmh.ward.g

##    Ako1    Ako2    Ako3  Nisio1  Nisio2  Nisio3 Nakatu1 Nakatu2 Nakatu3 
##       1       2       2       3       3       2       4       5       5 
##  Banzu1  Banzu2  Banzu3    Uto1    Uto2    Uto3 
##       2       2       2       4       4       3

sil <- silhouette(resmh.ward.g, dis.mh)
rownames(sil) <- row.names(dat)
plot(sil)

Non-hierarchical cluster: K-medoids

Depending on characteristics of data and/or purpose of study, non-hierarchical clustur is a better option to group assemblages.

K-means method is one of the best known methods for hierarchical clusters, that has an assumption of Euclidean distance. But most of the Ecologically useful dissimilarities, including Bray-Curtis and Morisita-Horn, do not fit the Euclidean distance. The following example shows K-medoids method, by using a function pam(), that is known to be more robust to violation of the Euclidean distance.

asilmk <- numeric(nrow(dat))
for (i in 2:(nrow(dat)-1)){
  asilmk[i] <- pam(dis.mh, k=i, diss=TRUE)$silinfo$avg.width
}
asilmk

##  [1] 0.00000000 0.21662533 0.32562256 0.36650728 0.42029671 0.38719369
##  [7] 0.38006387 0.34541372 0.28972742 0.23109473 0.21786278 0.08558420
## [13] 0.09599773 0.06826518 0.00000000

Again, five groups gives the maximum silhouette 0.4202967. Let's see the 5 grouping and its silhouette plot.

resmh.pam <- pam(dis.mh, k=which.max(asilmk), diss=TRUE)
summary(resmh.pam)

## Medoids:
##      ID            
## [1,] "1"  "Ako1"   
## [2,] "10" "Banzu1" 
## [3,] "15" "Uto3"   
## [4,] "9"  "Nakatu3"
## [5,] "13" "Uto1"   
## Clustering vector:
##    Ako1    Ako2    Ako3  Nisio1  Nisio2  Nisio3 Nakatu1 Nakatu2 Nakatu3 
##       1       2       2       3       4       2       5       4       4 
##  Banzu1  Banzu2  Banzu3    Uto1    Uto2    Uto3 
##       2       2       2       5       5       3 
## Objective function:
##     build      swap 
## 0.2251717 0.2217902 
## 
## Numerical information per cluster:
##      size  max_diss    av_diss  diameter separation
## [1,]    1 0.0000000 0.00000000 0.0000000  0.7825552
## [2,]    6 0.4932491 0.21581136 0.7984151  0.5436638
## [3,]    2 0.6921108 0.34605540 0.6921108  0.6004724
## [4,]    3 0.5783309 0.36657508 0.9096986  0.5536461
## [5,]    3 0.1268453 0.08004966 0.1640254  0.5436638
## 
## Isolated clusters:
##  L-clusters: character(0)
##  L*-clusters: [1] 5
## 
## Silhouette plot information:
##         cluster neighbor   sil_width
## Ako1          1        5  0.00000000
## Banzu1        2        5  0.67233690
## Banzu3        2        5  0.62128995
## Ako2          2        5  0.56283193
## Banzu2        2        5  0.52444944
## Nisio3        2        4  0.39476346
## Ako3          2        3  0.35069105
## Nisio1        3        4  0.14466868
## Uto3          3        5  0.07093218
## Nakatu3       4        3  0.33196280
## Nakatu2       4        5  0.20581221
## Nisio2        4        3 -0.04255231
## Uto1          5        3  0.86179224
## Uto2          5        1  0.81415322
## Nakatu1       5        2  0.79131885
## Average silhouette width per cluster:
## [1] 0.0000000 0.5210605 0.1078004 0.1650742 0.8224214
## Average silhouette width of total data set:
## [1] 0.4202967
## 
## Available components:
## [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
## [6] "clusinfo"   "silinfo"    "diss"       "call"

plot(resmh.pam)

Indicator species of clusters: IndVal method

According to the clusters of the assemblages, we would like to know characteristic species of each cluster. We can select indicator species by IndVal method.

Here, we use the result of K-medoids clusters. First, load labdsv package require(labdsv) to use a function indval().

iva <- indval(dat, resmh.pam$clustering, numitr=5000)
gr <- iva$maxcls[iva$pval<=0.05]
iv <- iva$indcls[iva$pval<=0.05]
pv <- iva$pval[iva$pval<=0.05]
fr <- apply(dat>0,2,sum)[iva$pval<=0.05]
fdig <- data.frame(group=gr, indval=iv, pvalue=pv, freq=fr)
(fdig <- fdig[order(fdig$group, -fdig$indval),])

##                        group    indval pvalue freq
## Musculista_senhousia       1 0.9926342 0.0292    9
## Nectoneanthes_latipoda     1 0.9795918 0.0360    2
## Aonides_oxycephala         1 0.9354839 0.0370    6
## Spio_sp                    3 0.8571429 0.0158    5
## Synidotea_sp               4 1.0000000 0.0040    3
## Mactra_veneriformis        5 0.8463950 0.0026    9

Now, we have a list of indicator species that shows significantly (p<0.05) large value of indval. Group 2 does not have any indicator species.

References

Borcard D, Gille F, Legendre P (2011) Numerical Ecology with R. Springer
Legendre P, Legendre L (2012) Numerical Ecology. Elsevier

Back to "Analyses on Ecological Assemblages by R"

JSNFRI, Japan Fisheries Research and Education Agency

Analyses on Ecological Assemblages