Shrinkage improves estimation of microbial associations under different normalization methods

Shrinkage improves estimation of microbial associations under

di↵erent normalization metho ds

Michelle Badri

, Zachary D. Kurtz

, Richard Bonneau

1,3,4,⇤

, and Christian L. M

uller

5,6,7,⇤

Departme nt of Biology, New York Universi ty, New York, 10012 NY, USA

Lodo Therapeutics, New York, 10016 NY, USA

Center for Computational Biology, Flatiron Institute, Simon s Foundation, New York, 10010 NY, USA

Computer Science Department, Courant Institute, New York, 10012 NY, USA

Center for Computational Mathematics, Flatiron Ins ti tute , S i mon s Foundation, New York, 10010 NY , US A

Institute of Computational Biology, Helmholtz Zentrum M¨unchen, Neuherberg, Germany

Departme nt of Statistics, Ludwig-Maximilians-Universit¨at M¨unchen, Mun i ch , Ger ma ny

⇤

correspond en c e to: [email protected], cmueller@ﬂatironinstitute.org

ABSTRACT

Consistent estimation of associations in m ic r obi a l genomic

survey count data is fundamental to microbiome research.

Technical limitations, including compositionality, low sample

sizes, and technical variabi l i ty, obstruct standard app l i ca ti on

of association measures and require data normalization

prior to estimating associations. Here , we investigate

the interplay between data normali z a ti on and microbial

association estimation by a comprehensive analysis of

statistical consistency. Leveraging th e large sample size

of the American Gut Project (AGP), we asse ss the

consistency of the two prominent linear association

estimator s, correl a ti on and proportional i ty, u n de r di↵erent

sample scenarios and data normalization schemes, including

RNA-seq analysis work ﬂows and log-ratio transformations.

We show that shrinkage estimation, a standard technique

in high-dimensional statistics, can univer sa ll y improve the

quality of association estimates for microbiome data. We

ﬁnd that large-scale association pattern s in the AGP

data can be grouped into ﬁve normalization-dependent

classes. Using microbial association network construction

and clustering as examples of exploratory data analysis,

we show that variance-stabilizing an d log-ratio approaches

provide for the most consistent estimation of taxonomic and

structural coherence. Taken together, the ﬁndings from our

reproduc i bl e analysis workﬂow have important implications

for microbiome studies in multiple stages of analysis,

partic ul arly when only small sample sizes are available.

INTRODUCTION

Recent advances in microbial amplicon and metagenomic

sequencing as well as large-scale data collection efforts

provide samples across different microbial habitats that are

amenable to quantitative analysis. Following the organization

of sequence data into Operational Taxonomic Units (OTUs)

or Amplicon Sequence Variants (ASVs), via pipelines such

as qiime (1), mothur (2), or dada2 (3), the resulting

count data are then available in tabular format for

statistical analysis. Downstream analysis tasks include

assessing community diversity (4), differential abundance

analysis, associating bacterial compositions to system-speciﬁc

ecological and biomedical covariates, and learning microbe-

microbe associations.

However, technical artifacts inherent in microbial

abundance data preclude the application of such analysis

tasks directly on the measured counts. The data typically

comprise a high proportion of zeros and carry only relative

information about species abundance. The total number of

read counts for any given observation is limited by the total

amount of sequencing, quality of DNA preparations, and

other technical factors and does not represent the community

abundance or total species abundance in the sample or

ecosystem. For example, unequal amplicon library sizes

can bias sequencing reads to taxa from the larger sample,

regardless of true abundance proﬁles. Although some recent

studies have used controlled communities, spike-in controls

and other innovations to obtain total community size (5–8), in

the majority of experimental designs, the community size is

unknown, and, thus, our data is best thought of as containing

relative or compositional information (each OTU fraction

of total counts, total community size unknown) (9, 10).

Additionally, technical variation due to sequencing such as

differences in ampliﬁcation biases and batch effects due to

multiple sequencing runs can hamper proper quantiﬁcation of

microbial compositions (11).

.CC-BY-NC 4.0 International licenseunder a

certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which was notthis version posted April 4, 2020. ; https://doi.org/10.1101/406264doi: bioRxiv preprint

To ameliorate these biases, general data normalization

methods have been proposed to correct for sampling bias,

library size, and technical variability, including workﬂows

from RNA-seq pre-processing and compositional data

analysis (12–15). Dedicated normalization and modeling

strategies are also available for speciﬁc analysis tasks, most

prominently, for differential abundance testing (16–19).

Here, we examine data normalization schemes in

connection with a fundamental multivariate statistical

estimation task: inferring pairwise linear associations from

microbial count data. Two common strategies that have been

adopted for microbial relative abundance data are Pearson

or rank-based correlations after data normalization (20, 21)

and proportionality (22, 23) as association measure for

compositional data. Consistent correlation or proportionality

estimation is of paramount importance for a host of

downstream analysis tasks, including state-of-the-art diversity

estimation that takes the connectivity of the community into

account (4), direct microbial association network inference

(12), discriminant analysis, and microbial community

clustering.

While previous work (20) has assessed the precision

of correlation detection strategies on synthetic microbial

sequencing count data, we took a different approach and

investigated the behavior of linear association estimation

on the largest-to-date citizen-science sample collection, the

American Gut Project (AGP) (24). The large available sample

size n>9000 allows us, for the ﬁrst time, to critically

measure the asymptotic consistency of combinations of data

normalization and association estimation techniques. More

speciﬁcally, given the lack of “gold standard” microbial

associations in gut microbial communities, we evaluated

different estimation strategies on subsamples of the AGP data

of increasing size. We asked the question whether and how

association patterns inferred from small but realistic sample

sizes of tens to a few hundreds of samples resemble those

inferred using the entire data set.

Using a comprehensive set of evaluation criteria and

summary statistics, we ﬁrst show that, independent of

any speciﬁc data normalization scheme, standard linear

association measures are unreliable in the small sample

regime. We propose the concept of shrinkage estimation (25)

as an effective strategy for consistent association estimation

in the small sample regime and quantify the effects of sample

size on data normalization and association estimates on several

down-stream analysis tasks, including microbial association

(or relevance) network inference and clustering. Figure 1

shows the proposed analysis framework used in this study.

Our analysis revealed that all normalization-dependent

association estimates in the AGP data can be broadly grouped

into ﬁve categories and that variance-stabilization and log-

ratio approaches provide the most consistent estimation in

terms of taxonomic and community structure coherence. Our

ﬁndings, available in a fully reproducible statistical analysis

R workﬂow at Synapse ID: syn21654780, have important

implications for microbiome studies in multiple stages of

analysis, most prominently in the presence of small sample

sizes. In particular, we believe that our developed shrinkage

estimation framework will improve the consistency of future

microbiome data analysis studies at almost no additional

computational cost.

METHODS

To examine the interplay of data normalization and association

estimation methods, we ﬁrst describe the four essential

ingredients of our analysis: the processed AGP 16S rRNA

data set, the comprehensive list of data normalization methods,

statistical estimation of linear associations, and downstream

statistical evaluation and analysis tools.

American Gut Project sample collection

We obtained Operational Taxonomic Units (OTU) count tables

and mapping ﬁles for unrareﬁed AGP samples (24) from the

project website ftp://ftp.microbio.me/AmericanGut/ag-2017-

12-04/. We ﬁltered the dataset to contain only fecal samples

whose sequencing depths fall above the 10th percentile and

removed taxa that were present in fewer than 30% of all

samples. This resulted in a data matrix comprised of p =531

taxa and n =9631 samples.

To investigate the sample size dependence of data

normalization and association estimation on this dataset,

we generated collections of random subsamples of varying

sample sizes, ranging from 25n  9000. Large-sample

reference association estimates were calculated using a subset

of samples at n =9000. To simulate reference data under

null correlation or proportionality, we randomly shufﬂed OTU

count data across samples prior to normalization.

Normalization methods

All normalization methods require as input OTU counts,

collected over n samples and stored in a matrix W 2

n⇥p

. Each row is a p-dimensional vector w

(j)

,. . . ,w

(j)

], where j =1,...,n is the sample index,

(j)

is the read count of OTU i in sample j, and N

is the

set of natural numbers {0,1,2,...}. Let the total OTU count

for sample j be m

(j)

i=1

(j)

. Several methods require

the application of a log transformation, thus requiring non-

negative input data. For consistency, we include a pseudocount

of 1 to all OTU input data if zero counts are not explicitly

handled by the respective normalization scheme. We consider

the following data normalization or transformation schemes.

Total Sum Scaling A standard approach for normalizing

count data is to divide individual counts by the total OTU

counts in a sample, thus scaling the count vector such that the

total sum is ﬁxed to 1. This normalization is known as total

sum scaling (tss) or total sum normalization. It reads

tss(w

(j)

,...,

(j)

The resulting sample space of the data is thus the (p

1)dimensional simplex.

Cumulative Sum Scaling The tss approach may place

unwanted inﬂuence on taxa that are highly sampled due to

sequencing biases by over-representing it in the scaling factor

(j)

(11). To reduce the inﬂuence of these highly abundant

taxa for sparse data, cumulative sum scaling (css) has been

proposed in (15) and implemented in the metagenomeSeq

.CC-BY-NC 4.0 International licenseunder a

certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which was notthis version posted April 4, 2020. ; https://doi.org/10.1101/406264doi: bioRxiv preprint

Figure 1. Framework for examining the effects of normalization methods on linear association estimation with increasing sample size. Comparative summary

statistics of the resulting association matrices include histogram-based analysis, distance-based matrix comparison, hierarchical clustering, and association

network analysis.

R package. Rather than normalizing by the total sum, css

selects a scaling factor that is a ﬁxed quantile of OTUs counts.

Formally,

css(w

(j)

N,...,

(j)

0

where the scaling factor for sample j is m

(j)

i|w

q

(j)

. The quantity q

(j)

is the sum of read

counts up to and including the l

quantile. N is a pre-

speciﬁed constant, e.g., N , 1000, chosen such that the

resulting data vector resembles the units of the original

counts. The sample space of css-transformed data is that of

non-negative real numbers R

0

Let

l be the index of the q

(j)

, the lth quantile for sample j,

¯q

=med

(j)

the median lth quantile across all samples,

and let µ

(j)

be the mean lth quantile. css requires the median

absolute deviation of sample quantiles to be empirically stable

via the quantity 

=med

|µ

(j)

 ¯q

|. A common choice is

to set

=min{

l+1



0.1

:1 l<n} (15). The scaling

factor is then deﬁned by summing all the counts up to the

smallest value of l that is stable, on average, across all samples

that is greater than or equal to the median. The default choice

for the median is the lth quantile.

Common Sum Scaling Common sum scaling (com), as

introduced in (11), is an alternative to rarefying OTU counts.

Counts are scaled to the minimum depth of each sample via

com(w

(j)

(min)

(j)

,...,

(j)

(min)

(j)

where m

(min)

=inf{m

(1)

(2)

,...,m

(n)

}. The operator b·c

rounds real proportions to the nearest integer.

Relative Log Expression The relative log expression (rle),

introduced for gene expression data and available in the

DESeq/edgeR package (14). The rle method is deﬁned

as follows. Let g(x)=(

i=1

)

1/m

be the geometric

mean of an m-dimensional vector x, and let w

(1)

,...,w

(n)

be the vector of counts of OTU i over n

samples (a transposed column vector of count matrix W ). The

numeric scaling factor for sample j is

(j)

=med

(j)

g(w

)

,...,

(j)

g(w

)

¯s

(j)

g(s

(·)

)

where med[x] denotes the median of vector x and s

(·)

(1)

,...,s

(n)

is a collection of the sample scaling factors.

Let the global scaling factor ˆs

j=1

¯s

(j)

be the arithmetic

mean of all normalized scaling factors. The rle is then deﬁned

rle(w

(j)

ˆs

(j)

¯s

(j)

,..., ˆs

(j)

¯s

(j)

0

In summary, the rle estimates a median library from the

geometric mean over all samples. The median ratio of each

sample to the median library is then taken as the scale factor.

Inverse Hyperbolic Sine A standard variance-reducing

transformations, often applied to ﬂow and mass cytometry

data (26, 27), is the inverse hyperbolic sine function, deﬁned

asinh(w

(j)

)=log(w

(j)

)

+1),

applied element-wise over the sample vector. The resulting

data matrix is then mean-centered prior to association

estimation.

.CC-BY-NC 4.0 International licenseunder a

certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which was notthis version posted April 4, 2020. ; https://doi.org/10.1101/406264doi: bioRxiv preprint

Wrench The Wrench procedure, introduced in (16), estimates

compositional correction factors in the presence of zero-

inﬂation. Wrench is deﬁned as

wren(w

(j)

⌘

1

,...,

(j)

⌘

1

The quantity ⌘

1

represents a compositional scale factor

where ⌘

1

(j)

++i

. Here, y

is the proportion of

each feature i in sample j, and

++i

is the average proportion

of each feature i across all samples. The weight e

(j)

estimated using the ”W

” scheme, the default choice in the

Wrench R package (see also (16) for further details). While

Wrench is capable of incorporating information about sample

grouping, e.g., for differential abundance testing, we consider

all samples to be in a single group.

Variance Stabilizing Transform The goal of variance

stabilizing transformations (vst) is to factor out the

dependence of the variance on the mean (over-

dispersion) (14). Consider the mean-dispersion relation

v(µ)=

n1

j=1

✓

(j)

ˆs

(j)

 ˆµ

(j)

◆

. Here, the “size factors” are

ˆs

=med

✓

(j)

g(w

(j)

)

◆

and ˆµ

(j)

j=1

(j)

ˆs

is the sample

mean of counts to size factor ratios of sample j. The vst is

then the integral quantity deﬁned as

vst(w

(j)

dµ

v(µ)

The function v(µ) is approximated by a spline function

and evaluated for each count value in the column. The vst

normalization is available in the DESeq package where the

numerical ﬁtting is achieved using local regression on the

graph (ˆµ

(j)

,v(µ)). A smooth function v(µ) is estimated using

an estimate of raw variance: ˆv(µ) = v(ˆµ

(j)

)z

where z

(j)

j=1

ˆs

The local regression is parameterized such that

large counts are scaled to be asymptotically equal to the

logarithm base 2 of normalized counts. When we examined

the per-OTU standard deviation (taken across all (p =531)

OTUs) plotted against the rank of the average OTU count it

can be seen that vst produces similar counts to both clr and a

logarithm base 2 transform (Supplementary Fig 1).

Centered Log-Ratio transformation Log-ratio

transformations, introduced in (9), transform positive

compositional data from the simplex to Euclidean space

(9, 12). The centered log-ratio (clr) transform is deﬁned as

clr(w

(j)

log

(j)

g(w

(j)

)

,...,log

(j)

g(w

(j)

)

where the ratio is taken with respect to the geometric mean of

the composition. The resulting data lies in a p1 hyperplane

of p-dimensional Euclidean space.

Estimation of linear associations

Following a transformation of count data under some function

f : N

, we consider several estimation methods for

linear associations among the p OTUs.

Covariance and Correlation estimation The standard way

of estimating linear associations is the empirical (sample)

covariances in the sample space X

which forms the basis

for many downstream multivariate data analysis techniques,

including principal component analysis (PCA), discriminant

analysis, metric learning, and network inference.

Formally, column-centering the transformed data results in

a n⇥p data matrix X = f(W )(I



) where I

is the p-

dimensional identity matrix and 1

is unit (all-ones) matrix.

In matrix notation, the sample covariance matrix (cov) is

S =

X 

n1

, where  indicates element-wise multiplication

of two equal size matrices.

The estimate

S is a symmetric p⇥p matrix with the sample

variances along the diagonal and can be normalized to obtain

a matrix containing Pearson correlation coefﬁcients. Let

D =

diag[

S] be a diagonal matrix with the p post-transformed OTU

variances on the diagonal and zero elsewhere. The Pearson

correlation matrix is then

R =



. The matrix

is a symmetric p⇥p matrix where each entry ˆr

R[i,k]

corresponds to the Pearson correlation between OTU i and

OTU k under the data transformation.

The magnitude and sign of the values in

R are

often interpreted as the association strength and direction,

respectively. The sample correlation/ covariance matrices are,

however, inadmissible in the p n setting, i.e., when fewer

samples than OTUs are available. For example, type I errors

may be grossly inﬂated, since the parameters under estimation

are underdetermined. Standard operations for solving systems

of linear equations such as PCA are then ill-posed.

Shrinkage estimation To alleviate this shortcoming of the

standard sample covariance estimator, regularized covariance

and correlation estimation have been developed high-

dimensional statistics. By imposing structural assumptions

about the underlying population covariance, estimators with

stronger statistical properties can be derived in the p>n

setting. One prominent structural assumption is sparsity, i.e.,

only a few strong pairwise correlations are assumed to be

present in the dataset. An effective data-driven approach to

realizing structural sparsity is shrinkage estimation. While

several estimators have been proposed in the literature (28–

30), we focus here on Sch

afer-Strimmer shrinkage estimation

(31), as implemented in the R package corpcor. The principle

idea of shrinkage estimation is to shrink small sample

correlation toward zero where the shrinkage intensities can

be efﬁciently estimated from data (28). More speciﬁcally, we

estimate individual entries s

⇤

of the shrinkage covariance S

⇤

and r

⇤

in the shrinkage correlation R

⇤

as follows. For all

off-diagonal elements in S

⇤

, we compute

⇤

=ˆr

⇤

ˆs

8i6=j,

.CC-BY-NC 4.0 International licenseunder a

certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which was notthis version posted April 4, 2020. ; https://doi.org/10.1101/406264doi: bioRxiv preprint

where the shrunk correlation estimates are ˆr

⇤

=(1



⇤

)ˆr

The variances (var) s

⇤

are shrunk in a separate procedure

toward the median v =med[ˆs

,..., ˆs

] via s

⇤



⇤

v +(1



⇤

)ˆs

The shrinkage intensities



⇤

and



⇤

are determined

empirically by estimating the variance within the sample

covariance matrix (see Supplementary Methods).

Proportionality estimation Covariance and correlation

estimation on compositional data has long been criticized due

to the necessary presence of negative bias, scale dependence,

and subcompositional incoherence in the estimates (9, 32).

Association measures based on the concept of proportionality

have thus been put forward as an alternative to correlation

(22). Here, we consider the symmetric proportionality ⇢

(23, 33) which by default operates on clr transformed data

clr

, (W ). The measure is deﬁned as:

⇢

⇣

clr

⌘

=1

var(X

clr

X

clr

)

var(X

clr

)+var(X

clr

)

where X

clr

and X

clr

are the columns of the matrix

corresponding to OTU i and k, respectively. ⇢

a proportionality measure because differences of clr-

transformed components are equivalent to log-ratios of

compositions. As shown in (23, 33), the measure can be

equivalently expressed in terms of covariances and variances

as follows:

⇢

⇣

clr

⌘

2⇥cov

⇣

clr

⌘

var



clr



+var



clr



⌘

2ˆs

ˆs

+ˆs

where ˆs

are elements of the covariance estimates

S =(X

clr

)

clr



n1

on clr-transformed data. This

formulation also clariﬁes the link between the standard

correlation matrix on clr-transformed data and ⇢

: the former

uses the geometric mean of ˆs

and ˆs

in the denominator

whereas the latter uses the arithmetic mean. Since ⇢

completely determined by sample covariances and variances,

we expect the measure to have the same drawbacks in the

small sample setting as the sample covariance estimators.

We thus propose the novel shrinkage-based proportionality

estimator ⇢

⇤

(rhoshrink) as:

⇢

⇤

⇣

clr

⌘

⇤

where s

⇤

are elements of the Sch

afer-Strimmer shrinkage

covariance estimator S

⇤

Comparing Association Patterns

To quantify similarities of the estimated association patterns

across different data normalization method, association

measures, and sample sizes, we considered three different

distance measures. These distances are then used for

comparative low-dimensional embeddings of the different

estimates as well as for measuring convergence of the

estimators with sample size.

Frobenius Distance Given a pair of p⇥p-dimensional

association matrices

R and

, the Frobenius distance

measures the sum of squared differences between the

corresponding entries and is deﬁned as

⇣

⌘

⇣



⌘







Spectral Distance Given a square, symmetric matrix A, let

A=U ⌃U

be its singular value decomposition where ⌃ is

a diagonal matrix with singular values along the diagonal

entries, i.e., ⌃

=

. Let 

max

(A) be the largest singular

value of A. The spectral distance is

⇣

⌘



max

(

R

)⌘







The spectral distance is invariant to unitary transformations of

the association matrices.

Correlation Matrix Distance The Correlation Matrix

Distance (CMD) (34) measures the orthogonality of two

correlation matrices and is deﬁned as

cmd

⇣

⌘

=1

⇣

⌘



2[0,1],

where the trace operator tr(A)=

is the sum of the

diagonal entries of a square matrix.

Downstream analysis

We considered two downstream exploratory data analysis

tasks that require the estimation of microbial associations as

input: (i) taxa clustering and (ii) microbial association network

construction and community analysis.

Clustering Unsupervised clustering of microbial taxa can

help identify microbial sub-communities that may jointly

affect host phenotype or reveal experimental and batch

effects (24). We considered two popular clustering techniques:

spectral and hierarchical clustering.

Spectral clustering requires the construction of an “afﬁnity

matrix” from estimated associations. Here, we transformed

associations into dissimilarity scores A

=1

1

and

constructed a k-nearest neighbor graph (k =2) to obtain a

sparse and symmetric afﬁnity matrix A. Identiﬁcation of

taxa clusters is based on k-means clustering of the ﬁrst m

components of the eigendecomposition of of the normalized

Laplacian L =D

(DA)D

, where D is the diagonal

degree matrix with entries containing the row or column sums

of A (35). We chose the target cluster size to be number of

connected components of the associated afﬁnity graph (35). To

assess the taxonomic consistency of a particular clustering, we

evaluated the homogeneity of each cluster with respect to the

.CC-BY-NC 4.0 International licenseunder a

certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which was notthis version posted April 4, 2020. ; https://doi.org/10.1101/406264doi: bioRxiv preprint

taxonomic families of the underlying OTUs. As a quantitative

measure, we computed the ratio of the effective family number

(exponential of the Shannon entropy of family counts) to the

total number of families detected per cluster.

For hierarchical clustering, we converted association

matrices to dissimilarity measures using A

1R

Clustering was then performed using Ward’s method from the

hclust package in R. Circular dendrograms were cut using

the cuttree method where k = 10 was chosen to represent the

number of class annotations.

Relevance networks and community analysis Relevance

networks (36) are a popular way of visualizing and

analyzing the overall structure of the microbial ecosystem.

Relevance network construction ranks all pairwise correlation

or proportionality values between OTUs by absolute value,

selects a certain percentage of highest ranked pairs, and

visualizes the resulting set of pairs as edges between OTUs

in an association network.

Multiple studies have found a higher prevalence of positive

associations between taxonomically related taxa in human

gut datasets (12, 37–40). We thus use taxonomic coherence,

measured by assortativity (41), as independent summary

statistic for relevance networks. When categorical variables

are available for each node, the assortativity coefﬁcient of a

network takes values in [1,1] and measures the tendency of

adjacent nodes to belong to the same category. In the context

of microbial networks, the nodes are OTUs and the associated

categories are their inferred taxonomic rank at the Genus level.

We also examined the presence of community structure

in the relevance networks using the concept of modularity

(42). Similar to clustering analysis, modularity analysis of

a network enables the partitioning of nodes into tightly

connected sub-communities. Modularity was computed using

the fast-greedy algorithm, described in (43) and implemented

in the igraph package in R (44).

RESULTS

Our comprehensive computation and analysis workﬂow

produced several key results which are summarized below.

We highlighted statistical properties of association estimation,

followed by a comparison of downstream analysis results.

Shrinkage universally improves consistency of association

estimation We ﬁrst analyzed the inﬂuence of shrinkage on the

estimation of associations under different data normalization

and sample sizes. We show the convergence properties of

association estimation, as measured by Frobenius distance

with respect to the large sample limit, with increasing

sample size in Figure 2a. Shrinkage universally improves

estimates in the low sample regime compared to its sample

estimation counterparts. Even when the sample size n exceeds

the number of OTUs p, most shrinkage estimates remain

more consistent with their respective large-sample. This

behavior is also reﬂected in the distribution of association

estimates at low (n=50) and large (n =9000) sample sizes,

as highlighted in Figure 3 for the proportionality measures

⇢

(rhoprop) and ⇢

⇤

(rhoshrink). In the small sample limit,

rhoprop produces too extreme estimates compared to the large

sample limit (third row in Figure 3). The distributions of

rhoshrink estimates in the small and large sample limit were

considerably more consistent. This phenomenon was observed

for all combinations of data normalization and association

estimation (Supplementary Figure 4a, Supplementary Figure

5a). As expected, the inﬂuence of shrinkage vanished in

the large sample limit, as reﬂected in decreasing shrinkage

intensities with increasing sample size (Supplementary Figure

2).

Figure 2. Frobenius distance between estimates of association. a) Frobenius distance between sub-samples of different sizes. Dashed lines represent the

mean distance between normalized matrices after Pearson correlation. The solid lines represent the mean distance between normalized matrices where

correlation/proportionality estimation with shrinkage was performed. The dot-dash line represents RHO, a proportionality metric. Vertical lines represent standard

deviation from the mean for each corresponding method. b) Multidimensional scaling representation of Frobenius distance between correlation structures of

varying size estimated from different normalization methods. The most opaque points represent the mean of 5 subsamples of the same size. (color scheme as in

.CC-BY-NC 4.0 International licenseunder a

certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which was notthis version posted April 4, 2020. ; https://doi.org/10.1101/406264doi: bioRxiv preprint

Normalization methods induce distinct association patterns

We next analyzed the similarity among the different

association estimates with increasing sample size using

multi-dimensional scaling (MDS). Figure 2b shows a 2D

MDS embedding of all shrinkage association estimates

using the Frobenius distance. We identiﬁed ﬁve distinct

classes. Association estimates following a variance-

reducing/stabilizing transformations (clr, vst, asinh, rhoprop,

rhoshrink) form a distinct linear trace in the embedding,

ordered along sample size. Correlation estimates on raw count

data form another distinct group. Correlation estimates after

css and wren normalization form two distinct traces in the

embedding. Finally, correlation estimates following the com,

rle, and tss normalization form the ﬁfth class of association

patterns. For small sample sizes, association patterns are

similar independent of the normalization methods. As each of

the ﬁve classes form a distinct linear trace in the embedding,

we used the distances between estimates of different sample

sizes to evaluate the rate at which normalization methods

arrived at stable patterns of association. In agreement with

Figure 2a, we observed that wren, vst, and com arrived at

consistent association estimates with the fewest samples,

followed closely by rle and tss normalization methods (Figure

2b, Supplementary Figure 3). The observed grouping pattern

and convergence behavior is largely invariant to the distance

measure used (Supplementary Figure 4b and Supplementary

Figure 5b for spectral distance and CMD, respectively).

Association estimates are positively skewed We next analyzed

the shapes of empirical distributions of shrinkage association

estimates for all normalization schemes in three different

sample scenarios: Small sample regime (n = 50), randomly

shufﬂed data in the small sample regime (n = 50), and

large sample regime (n=9000). Figure 3 shows clr, tss,

rhoprop and rhoshrink distributions across these scenarios (see

Supplementary Figure 3 for all others). We observed that

variance-reducing/stabilizing transformations (clr, vst, asinh,

rhoprop, rhoshrink) induce association distribution that are

close to normal with moderate positive skewness. Estimates

without shrinkage are considerably wider in the low sample

regime (as exempliﬁed for standard proportionality rhoprop

vs. rhoshrink in Figure 3). All other normalization schemes

induce asymmetric correlation distributions with large positive

skewness, resembling correlation distributions on raw count

data (Supplementary Figure 3). Positive skewness also persists

for association estimates on shufﬂed data. Although the shapes

of shrinkage association distributions are visually similar in

the small and large sample limit, we universally observed an

increase in skewness and kurtosis with larger sample sizes

independent of the normalization scheme.

Clustering methods are sensitive to normalization and

shrinkage estimation We next focused on analyzing the

inﬂuence of normalization and association estimation on

downstream data analysis tasks. We ﬁrst considered clustering

of OTUs using a large sample limit of n = 9000 samples

from the AGP dataset. For spectral clustering, we asked

the question whether and how normalization and shrinkage

inﬂuences (i) the standard selection of the number of cluster

and (ii) the taxa composition of the resulting clusters.

One common strategy for model selection in spectral

clustering is the “spectral gap” criterion. The number of

selected clusters is considerably larger (k  11) for the

variance-reducing/stabilizing transformations (clr, vst, asinh,

rhoprop, rhoshrink) than for other normalization methods

(k  8)(Supplementary Figure 6). Despite the large sample

mean = 0.003

variance = 0.001

skewness = 1.784

kurtosis = 4.401

mean = 0.003

variance = 0.026

skewness = 0.408

kurtosis = 0.608

mean = 0.001

variance = 0.003

skewness = 0.396

kurtosis = 0.525

mean = 4.6e-04

variance = 0.004

skewness = 0.329

kurtosis = 0.277

mean = 8.95e-04

variance = 4.9e-05

skewness = 1.891

kurtosis = 4.944

mean = -1.30e-04

variance = 0.016

skewness = 0.179

kurtosis = 0.073

mean = -3.5e-05

variance = 1.2e-05

skewness = 0.178

kurtosis = 0.018

mean = -4.1e-05

variance = 1.4e-05

skewness = 0.169

kurtosis = -0.117

mean = 0.015

variance = 0.004

skewness = 3.587

kurtosis = 24.375

mean = 0.003

variance = 0.01

skewness = 1.435

kurtosis = 5.802

mean = 0.003

variance = 0.01

skewness = 1.435

kurtosis = 5.801

mean = 0.001

variance = 0.012

skewness = 1.278

kurtosis = 4.974

Figure 3. Frequencies of association after transformation and shrinkage. Density plots of association value frequency of matrices after application of different

normalization methods at different sample sizes. Mean, variance, skewness and kurtosis are shown for each distribution.

.CC-BY-NC 4.0 International licenseunder a

certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which was notthis version posted April 4, 2020. ; https://doi.org/10.1101/406264doi: bioRxiv preprint

size, the spectral gap of rhoprop - and rhoshrink-based

spectral clustering resulted is still different, resulting in k =

11 and k = 12 clusters, respectively. The different number of

clusters also contributed to marked differences in terms of

the homogeneity of taxa compositions, as shown in Figure

4. Variance-reducing/stabilizing transformations produced

taxonomically more homogeneous groups at the Family level.

rhoshrink-based clustering produced the highest mean cluster

purity, indicating strong agreement between estimated OTU

associations and taxonomic identity (as shown for Family

level in Figure 4 and Supplementary Figure 5). rhoprop-

and rhoshrink-based clustering formed very similar but not

identical clusters in terms of composition and cluster purity. A

larger number of taxa of Family Ruminococcaceae and Class

Bacteriodia cluster together in clr-based clustering compared

to tss-based clustering. Taxa clusters derived from css, rle, and

wren normalization resulted in no distinct taxonomic grouping

(see Supplementary Figure 5).

Hierarchical clustering largely conﬁrmed the previous

observations. For ease of comparison, we set the number of

clusters to k =10 for inference workﬂows. Figure 5 shows

the dendrograms for clr-, tss-, rhoprop-, and rhoshrink-based

clustering. While some distinct and homogeneous clusters can

be found in the tss case, the majority of taxa has been grouped

into a single cluster comprising many families and classes

of taxonomically unrelated bacteria. However, taxonomic

grouping is well represented by hierarchical clustering of

rho- and rhoshrink-based estimates (Figure 5). Similarly, vst

and asinh have recovered large groups of the most prevalent

family annotation: Ruminococcaceae, Lachnospiraceae, and

Bacteroidaceae (see Supplementary Figure 8).

Normalization induces relevance networks with different

community structures We next considered the downstream

statistical task of learning microbial relevance networks from

AGP data. We estimated associations in the large sample

limit n= 9000 and selected the top 2000 associations for

network construction in every data normalization/association

estimation workﬂow. Figure 6 shows network visualizations

for clr-, tss-, rhoprop-, and rhoshrink-based relevance

networks (Supplementary Figure 9 for the other instances).

We identiﬁed sub-communities of highly connected taxa

using modularity maximization. The number of identiﬁed

modules ranged between 20 (using Wrench) and 38 (using vst

normalization). Relevance networks derived from variance-

reducing/stabilizing transformations (clr, vst, asinh, rhoprop,

and rhoshrink) were partitioned into 35-38 modules and

achieved a maximum modularity score of ⇡0.8 (compared

to modularity scores of < 0.6 for all other networks).

Visual inspection of these networks revealed that members

of the Bacteroidetes phylum (represented by square nodes

in Figure 6) formed tightly connected modules with few

edges connecting to other phyla. Firmicutes (represented

by circular nodes) in networks were divided into a

higher number of modules comprising distinct families,

including Lachnospiraceae (represented by orange circles)

and Ruminococcaceae (teal circles, Figure 6, Supplementary

Figure 9). This striking modularity is less pronounced in the

tss-based relevance networks (Figure 6b).

Similar to the clustering analysis, we next evaluated

the taxonomic coherence of the different networks. Using

assortativity on the genus level as a quantitative measure,

we found that relevance networks derived from variance-

reducing/stabilizing transformations showed the highest

overall assortativity in the large sample limit (⇡ 0.35).

We next asked the question whether high-level network

properties such as assortativity and modularity were

Figure 4. Horizontal stacked bar plot of OTU groups resulting from spectral clustering. a-d) The stacked bar plots represent the composition of OTUs in each

cluster at the Family level. Clusters are vertically ordered from highest percentage of the most abundant Family: Ruminococcaceae. Horizontally the order

represents the highest percentage of Family in each cluster. Numbers on the left axis represent the number of OTUs in each cluster. Values stated next to method

name represent cluster purity.

.CC-BY-NC 4.0 International licenseunder a

certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which was notthis version posted April 4, 2020. ; https://doi.org/10.1101/406264doi: bioRxiv preprint

Figure 5. Circular dendrograms of hierarchical clustering patterns amongst taxa Family and Class. Each point surrounding the circular dendrogram represents

one of the 531 OTUs in our data set. The color represents Family annotation and shape represents Class annotation. Each dendrogram has been cut hierarchically

into 10 trees (representing the 10 Orders to which each Family maps). The grey and black shading is used to highlight different clusters which are numbered.

consistent independent of the sample size used to estimate

the association networks. We thus repeated the previous

analysis for different sample sizes, ranging from n =

25 to n = 9000. Figure 7 shows the estimated network

assortativity and maximum modularity score estimates

vs. sample size. We found that for relevance networks

derived from variance-reducing/stabilizing transformations,

both assortativity and modularity monotonically increase with

sample size. Both estimates stabilize around samples sizes

n⇡p. For the remaining relevance networks, assortativity

estimates monotonically increase with sample size while

modularity tends to decrease with sample size. In summary,

this analysis implies that estimates of high-level network

summary statistics such as assortativity and modularity are

inconsistent compared to their large-sample limit.

We next examined the edge overlap from correlation-

based relevance networks (clr and tss transformations) and

proportionality metrics (rhoprop and rhoshrink). We found a

common core of 1086 edges between 349 OTUs that were

present in all relevance networks. This consensus network

also contained several tightly connected network modules

with highly assortative inter-family associations (Figure 8a).

Overall, we found that clr-, rhoprop, and rhoshrink-based

networks shared the majority of common edges with rhoprop-

and rhoshrink-based edge set differing only by a single edge.

The tss-based relevance network comprised 779 unique edges

not shared by any of the other networks (Figure 8b).

Additionally, we found that clr-, vst-, and asinh-based

correlation networks also shared a large common consensus

core (Supplement Figure 10a). Similarly, com-, rle-, and tss-

based networks showed a large edge set overlap (Supplement

Figure 10b). These observations conﬁrmed the distinct

groupings observed in the MDS embedding of Frobenius

distances (Figure 2b).

DISCUSSION

Data normalization and inference of taxon-taxon associations

from microbial genomic survey count data are two of the most

basic statistical analysis tasks in modern microbiome research.

To help the practitioner of microbial data analysis make

informed choices about the different available normalization

and association inference schemes, we have taken a closer

look at the impact of data normalization on association

estimation and several downstream exploratory data analysis

tasks. Rather than asking what is the best method available

for different analysis steps, we have leveraged the large

available sample size in the American Gut Project (AGP)

data set and assessed the consistency of two ubiquitous linear

association estimators for microbiome data, correlation and

proportionality, under a wide range of realistic sample size

scenarios, data normalization schemes, and downstream data

analysis tasks.

.CC-BY-NC 4.0 International licenseunder a

certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which was notthis version posted April 4, 2020. ; https://doi.org/10.1101/406264doi: bioRxiv preprint

clr: 35

s: 3

rhoprop:

rhoshrink: 37

Figure 6. Relevance network visualization displaying modularity. a-d) For networks on the left of each panel every node represents an OTU labeled with module

annotation as predicted by the Fast-Greedy modularity algorithm. The networks on the right represent the corresponding taxonomic annotation of the OTU at

the family level. Values stated next to method name represent the number of modules in the network. Layout using the force-directed Fruchterman-Reingold

algorithm was conserved for both networks in each panel.

Our analysis revealed several important observations that

have direct implications for best practice in microbiome

data analysis workﬂows. Firstly, we have conﬁrmed that

correlation and proportionality estimates are inconsistent

in the low sample regime n<p when compared to large

sample counterparts, both in terms of general large-scale

association patterns (Figure 2) and downstream network

summary statistics, including assortativity and modularity

(Figure 7). While this phenomenon has been long appreciated

in the statistical literature, we have established that shrinkage

estimation, a popular statistical regularization scheme used in

ﬁnance (28) and genomics research (31), can also improve

association estimation for microbiome data, independent

of the employed normalization method. Leveraging the

close mathematical relationship between variance-covariance

estimation and the concept of proportionality, we have

also introduced a novel shrinkage proportionality estimator,

rhoshrink, that is easy to compute and may prove useful in

other scientiﬁc areas where compositional data are available.

On the AGP data, we have been able to categorize ten

data normalization/association estimation workﬂows into ﬁve

coherent groups that show strong agreement across all sample

size scenarios (Figure 2b). Most prominently, we have found

that variance-reducing/stabilizing transformations lead to a

high agreement of correlation or proportionality estimates.

This was also conﬁrmed in the downstream microbial

relevance network comparison where clr-based correlation

networks and proportionality association networks showed

high agreement among the inferred edge sets (Figure 8).

This implies that, in the presence of large samples sizes

and large number of OTUs, differences between correlation

and proportionality estimates are less pronounced than

previously expected. An important observation on the AGP

dataset was that the empirical distributions of association

estimates were universally right-skewed even in the randomly

shufﬂed data scenario. This implies that no matter the

data normalization/association inference workﬂow, one will

observe a higher prevalence of positive associations. This

phenomenon has been previously described in the context

of microbial association inference across many different

microbial habitats (12, 45). While it is tempting to interpret

these results as ecological features of the underlying microbial

community in terms of higher prevalence of commensual

rather than competitive microbial interactions, the positive

skewness may also be due to technical limitations in the

data generation process and statistical shortcomings. More

speciﬁcally, truncation to zero effects for low sequencing

read counts likely obstruct unbiased estimation of negative

correlations, and in turn, proportionality. A possible remedy

for this data-induced artifact is the application of more

advanced semi-parametric correlation estimators that infer

latent correlations under data truncation assumptions (21,

46). A detailed investigation of semi-parametric and other

estimators may provide a promising avenue for future

research.

Despite the universal presence of positive skewness

in association estimates for the AGP data, we have

observed that variance-reducing/stabilizing transformations

could reduce positive skewness in shrunk association

estimates (Figure 3). Moreover, our results on microbial

association network construction and clustering as typical

downstream exploratory data analysis examples also revealed

that variance-reducing/stabilizing approaches provided the

most consistent estimation in terms of taxonomic and

structural coherence, as measured by taxonomic cluster purity

in spectral and hierarchical clustering (Figures 4 and 5) and

network assortativity (Figures 6 and 7). Taken together, we can

recommend any variance-reducing/stabilizing transformations

followed by shrinkage estimation for association inference.

However, transformations such as asinh and clr may

be preferred since they are faster to compute than vst,

while providing similar statistical properties. The resulting

shrinkage correlation estimates can then also serve as input for

more involved direct microbial network inference workﬂows

.CC-BY-NC 4.0 International licenseunder a

certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which was notthis version posted April 4, 2020. ; https://doi.org/10.1101/406264doi: bioRxiv preprint

Figure 7. Community analysis of relevance network structure with increasing sample size. a) Assortativity coefﬁcient across sample size of genus annotation. b)

Maximum modularity score across sample size at 2000 edges. For all plots lines represent mean and grey ribbon represent standard deviation from the mean.

that account for transitive correlations, adjust for additional

covariates, or model latent effects (12, 37, 47, 48).

For relevance network estimation, consensus network

construction, as put forward here for the AGP data (Figure

8), is a straightforward strategy to relax the inﬂuence of data

normalization. For our AGP consensus network, we found

that more than half of the top 2000 edges in the tss-, clr-,

rhoprop-, and rhoshrink-based relevance networks were in full

agreement, connecting a subset of 346 taxa. The inferred AGP

consensus network comprised a majority of positive edges

and showed high assortativity at the genus level (0.39) and

a maximum modularity of 0.8.

Assortativity increased in the consensus network compared

to individual relevance networks. Notably, many taxa in the

consensus network were frequently identiﬁed as key targets

for microbiome therapeutics, such as prebiotic treatment

and fecal microbiota transplants, including Akkermansia

muciniphila, Prevotella copri, Ruminococcus bromii, and

Faecalibacterium prausnitzii (49, 50).

Our computational data analysis workﬂow, available on

GitHub and as synapse project (see Data Availability), is

fully reproducible, provides all novel shrinkage estimators

introduced here, and allows easy extension and comparison

to additional data normalization, estimation, and downstream

analysis tasks. For instance, future work could include the

integration of more advanced zero-replacement strategies

(51, 52), application of popular data normalization schemes

from single-cell data analysis (53) or the application of other

correlation (21, 46) or proportionality estimators, including

those available in the propr package (23). Here, rather than

using universal thresholding for sparsifying associations, more

advanced selection strategies that control false discovery rates

(as available in the propr package (23)) may improve the

consistency of the microbial association inference workﬂows.

Figure 8. A) Edges in common between relevance networks. This consensus network contains 1086 edges between 346 taxa. Node color represents Family

annotation and node shape represents phylum. Network layout was generated using the force-directed Fruchterman-Reingold algorithm. B) Venn Diagram of

edges in common from the top 2000 edges between four highlighted normalization methods.

.CC-BY-NC 4.0 International licenseunder a

certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which was notthis version posted April 4, 2020. ; https://doi.org/10.1101/406264doi: bioRxiv preprint

Going forward, we believe that large-scale reproducible

computational analysis workﬂows that focus on sample-

size dependent consistency of statistical estimates are of

paramount importance for deriving stable testable hypotheses

about the complex interplay between host phenotype and the

microbiome from large-scale microbial genomic survey data.

DATA AV AILABILITY

The code and data used is available as a github repository at

https://github.com/MichelleBadri/NormCorr

manuscript and

Synapse project syn21654780. Data used for this study was

accessed from ftp://ftp.microbio.me/AmericanGut/ag-2017-

12-04/. The latest complete American Gut Project dataset can

be accessed on Qiita using study ID 10317 (24).

REFERENCES

1. Caporaso, J. G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman,

F. D., Costello, E. K., Fierer, N., Pea, A. G., Goodrich, J. K., Gordon,

J. I., Huttley, G. A., Kelley, S. T., Knights, D., Koenig, J. E., Ley,

R. E., Lozupone, C. A., McDonald, D., Muegge, B. D., Pirrung, M.,

Reeder, J., Sevinsky, J. R., Turnbaugh, P. J., Walters, W. A., Widmann,

J., Yatsunenko, T., Zaneveld, J., and Knight, R. (May, 2010) QIIME

allows analysis of high-throughput community sequencing data. Nature

Methods, 7(5), 335–336.

2. Schloss, P. D., Westcott, S. L., Ryabin, T., Hall, J. R., Hartmann,

M., Hollister, E. B., Lesniewski, R. A., Oakley, B. B., Parks, D. H.,

Robinson, C. J., Sahl, J. W., Stres, B., Thallinger, G. G., Van Horn, D. J.,

and Weber, C. F. (December, 2009) Introducing mothur: open-source,

platform-independent, community-supported software for describing

and comparing microbial communities. Applied and Environmental

Microbiology, 75(23), 7537–7541.

3. Callahan, B. J., McMurdie, P. J., Rosen, M. J., Han, A. W., Johnson, A.

J. A., and Holmes, S. P. DADA2: High-resolution sample inference from

Illumina amplicon data. 13(7), 581–583.

4. Willis, A. D. and Martin, B. D. (2018) DivNet: Estimating diversity in

networked communities. bioRxiv,.

5. Bucci, V., Tzen, B., Li, N., Simmons, M., Tanoue, T., Bogart, E., Deng,

L., Yeliseyev, V., Delaney, M. L., Liu, Q., Olle, B., Stein, R. R., Honda,

K., Bry, L., and Gerber, G. K. (2016) MDSINE: Microbial Dynamical

Systems INference Engine for microbiome time-series analyses. Genome

Biology, 17(1), 121.

6. Zwiener, I., Frisch, B., and Binder, H. (January, 2014) Transforming

RNA-Seq Data to Improve the Performance of Prognostic Gene

Signatures. PLOS ONE, 9(1), e85150.

7. Mark Welch, J. L., Hasegawa, Y., McNulty, N. P., Gordon, J. I., and

Borisy, G. G. (October, 2017) Spatial organization of a model 15-member

human gut microbiota established in gnotobiotic mice. Proceedings of the

National Academy of Sciences, 114(43), E9105.

8. Stmmler, F., Glsner, J., Hiergeist, A., Holler, E., Weber, D., Oefner, P. J.,

Gessner, A., and Spang, R. (June, 2016) Adjusting microbiome proﬁles

for differences in microbial load by spike-in bacteria. Microbiome, 4.

9. Aitchison, J. (1986) The Statistical Analysis of Compositional Data,

Chapman & Hall, Ltd., London, UK, UK.

10. Gloor, G. B., Macklaim, J. M., Pawlowsky-Glahn, V., and Egozcue,

J. J. (2017) Microbiome Datasets Are Compositional: And This Is Not

Optional. Frontiers in Microbiology, 8.

11. McMurdie, P. J. and Holmes, S. (April, 2014) Waste Not, Want Not:

Why Rarefying Microbiome Data Is Inadmissible. PLOS Computational

Biology, 10(4), e1003531.

12. Kurtz, Z. D., Muller, C. L., Miraldi, E. R., Littman, D. R., Blaser, M. J.,

and Bonneau, R. A. (2015) Sparse and compositionally robust inference

of microbial ecological networks. PLoS Comput Biol, 11.

13. Love, M. I., Huber, W., and Anders, S. (2014) Moderated estimation of

fold change and dispersion for RNA-seq data with DESeq2. Genome

Biology, 15, 550.

14. Anders, S. and Huber, W. (2010) Differential expression analysis for

sequence count data. Genome Biology, 11, R106.

15. Paulson, J. N., Stine, O. C., Bravo, H. C., and Pop, M. (December,

2013) Robust methods for differential abundance analysis in marker gene

surveys. Nature methods, 10(12), 1200–1202.

16. Kumar, M. S., Slud, E. V., Okrah, K., Hicks, S. C., Hannenhalli, S., and

Corrada Bravo, H. (2018) Analysis and correction of compositional bias

in sparse sequencing count data. BMC genomics, 19(1), 799.

17. Weiss, S., Xu, Z. Z., Peddada, S., Amir, A., Bittinger, K., Gonzalez,

A., Lozupone, C., Zaneveld, J. R., Vzquez-Baeza, Y., Birmingham, A.,

Hyde, E. R., and Knight, R. (March, 2017) Normalization and microbial

differential abundance strategies depend upon data characteristics.

Microbiome, 5, 27.

18. Martin, B. D., Witten, D., and Willis, A. D. (2019) Modeling microbial

abundances and dysbiosis with beta-binomial regression. pp. 1–27.

19. Brill, B., Amir, A., and Heller, R. Testing for differential abundance in

compositional counts data, with application to microbiome studies.

20. Weiss, S., Treuren, W. V., Lozupone, C., Faust, K., Friedman, J., Deng,

Y., Xia, L. C., Xu, Z. Z., Ursell, L., Alm, E. J., Birmingham, A., Cram,

J. A., Fuhrman, J. A., Raes, J., Sun, F., Zhou, J., and Knight, R. (July,

2016) Correlation detection strategies in microbial data sets vary widely

in sensitivity and precision. The ISME Journal, 10(7), 1669–1681.

21. Yoon, G., Gaynanova, I., and Mller, C. L. Microbial Networks in SPRING

- Semi-parametric Rank-Based Correlation and Partial Correlation

Estimation for Quantitative Microbiome Data. 10.

22. Lovell, D., Pawlowsky-Glahn, V., Egozcue, J. J., Marguerat, S., and Bhler,

J. Proportionality: A Valid Alternative to Correlation for Relative Data.

11(3), e1004075.

23. Quinn, T. P., Richardson, M. F., Lovell, D., and Crowley, T. M. propr:

An R-package for Identifying Proportionally Abundant Features Using

Compositional Data Analysis. 7(1), 1–9.

24. McDonald, D., Hyde, E., Debelius, J. W., Morton, J. T., Gonzalez, A.,

Ackermann, G., Aksenov, A. A., Behsaz, B., Brennan, C., Chen, Y.,

Goldasich, L. D., Dorrestein, P. C., Dunn, R. R., Fahimipour, A. K.,

Gaffney, J., Gilbert, J. A., Gogul, G., Green, J. L., Hugenholtz, P.,

Humphrey, G., Huttenhower, C., Jackson, M. A., Janssen, S., Jeste, D. V.,

Jiang, L., Kelley, S. T., Knights, D., Kosciolek, T., Ladau, J., Leach, J.,

Marotz, C., Meleshko, D., Melnik, A. V., Metcalf, J. L., Mohimani, H.,

Montassier, E., Navas-Molina, J., Nguyen, T. T., Peddada, S., Pevzner,

P., Pollard, K. S., Rahnavard, G., Robbins-Pianka, A., Sangwan, N.,

Shorenstein, J., Smarr, L., Song, S. J., Spector, T., Swafford, A. D.,

Thackray, V. G., Thompson, L. R., Tripathi, A., Vzquez-Baeza, Y.,

Vrbanac, A., Wischmeyer, P., Wolfe, E., Zhu, Q., Consortium, T. A. G.,

and Knight, R. (June, 2018) American Gut: an Open Platform for Citizen

Science Microbiome Research. mSystems, 3(3), e00031–18.

25. Stein, C. M. (1981) Estimation of the mean of a multivariate normal

distribution. Ann. Statist., 9(6), 1135–1151.

26. Holmes, S. and Huber, W. (2019) Modern statistics for modern biology,

Cambridge University Press, Cambridge, United Kingdom New York,

NY.

27. Burbidge, J. B., Magee, L., and Robb, A. L. Alternative Transformations

to Handle Extreme Values of the Dependent Variable. 83(401), 123–127.

28. Ledoit, O. and Wolf, M. (2004) Honey, I Shrunk the Sample Covariance

Matrix. The Journal of Portfolio Management, 30(4), 110–119.

29. Rothman, A. J., Bickel, P. J., Levina, E., and Zhu, J. (2008) Sparse

permutation invariant covariance estimation. Electronic Journal of

Statistics, 2(January), 494–515.

30. Bien, J. and Tibshirani, R. J. (dec, 2011) Sparse estimation of a covariance

matrix.. Biometrika, 98(4), 807–820.

31. Schfer, J. and Strimmer, K. (2005) A shrinkage approach to large-scale

covariance matrix estimation and implications for functional genomics.

Statistical Applications in Genetics and Molecular Biology, 4, Article32.

32. Pearson, K. (1897) Mathematical contributions to the theory of

evolution.On a form of spurious correlation which may arise when indices

are used in the measurement of organs. Proceedings of the Royal Society

of London,.

33. Erb, I. and Notredame, C. How should we measure proportionality on

relative gene expression data?. 135(1), 21–36.

34. Herdin, M., Czink, N.,

Ozcelik, H., and Bonek, E. (2005) Correlation

matrix distance, a meaningful measure for evaluation of non-stationary

MIMO channels. In IEEE Vehicular Technology Conference.

35. Ng, A. Y., Jordan, M. I., and Weiss, Y. (2001) On Spectral Clustering:

Analysis and an Algorithm. In Proceedings of the 14th International

Conference on Neural Information Processing Systems: Natural and

Synthetic Cambridge, MA, USA: MIT Press NIPS’01 pp. 849–856.

36. Butte, A. J., Tamayo, P., Slonim, D., Golub, T. R., and Kohane, I. S.

.CC-BY-NC 4.0 International licenseunder a

certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which was notthis version posted April 4, 2020. ; https://doi.org/10.1101/406264doi: bioRxiv preprint

(2000) Discovering functional relationships between RNA expression and

chemotherapeutic susceptibility using relevance networks. Proceedings of

the National Academy of Sciences,.

37. Claussen, J. C., Skieceviien, J., Wang, J., Rausch, P., Karlsen, T. H.,

Lieb, W., Baines, J. F., Franke, A., and Htt, M.-T. (June, 2017) Boolean

analysis reveals systematic interactions among low-abundance species

in the human gut microbiome. PLOS Computational Biology, 13(6),

e1005361.

38. Friedman, J. and Alm, E. J. (September, 2012) Inferring Correlation

Networks from Genomic Survey Data. PLOS Computational Biology,

8(9), e1002687.

39. Faust, K., Sathirapongsasuti, J. F., Izard, J., Segata, N., Gevers, D.,

Raes, J., and Huttenhower, C. (July, 2012) Microbial Co-occurrence

Relationships in the Human Microbiome. PLoS Computational Biology,

8(7).

40. Lozupone, C., Faust, K., Raes, J., Faith, J. J., Frank, D. N., Zaneveld, J.,

Gordon, J. I., and Knight, R. (October, 2012) Identifying genomic and

metabolic features that can underlie early successional and opportunistic

lifestyles of human gut symbionts. Genome Research, 22(10), 1974–

1984.

41. Newman, M. E. J. (February, 2003) Mixing patterns in networks. Physical

Review E, 67(2) arXiv: cond-mat/0209450.

42. Newman, M. E. (2006) Modularity and community structure in networks.

Proceedings of the National Academy of Sciences of the United States of

America, 103(23), 8577–8582.

43. Clauset, A., Newman, M. E. J., and Moore, C. (December, 2004) Finding

community structure in very large networks. Physical Review E, 70(6)

arXiv: cond-mat/0408187.

44. Csardi, G. and Nepusz, T. (2006) The igraph software package for

complex network research. InterJournal, Complex Systems, 1695.

45. Faust, K., Lima-Mendez, G., Lerat, J.-S., Sathirapongsasuti, J. F., Knight,

R., Huttenhower, C., Lenaerts, T., and Raes, J. (2015) Cross-biome

comparison of microbial association networks. Frontiers in Microbiology,

6(October), 1–13.

46. Yoon, G., Carroll, R. J., and Gaynanova, I. (2018) Sparse semiparametric

canonical correlation analysis for data of mixed types. pp. 1–33.

47. Biswas, S., Mcdonald, M., Lundberg, D. S., Dangl, J. L., and Jojic, V.

(2016) Learning microbial interaction networks from metagenomic count

data. Journal of Computational Biology, 23(6), 526–535.

48. Kurtz, Z. D., Bonneau, R., and M

uller, C. L. (2019) Disentangling

microbial associations from hidden environmental and technical factors

via latent graphical models. bioRxiv, p. 2019.12.21.885889.

50. Scher, J. U., Sczesnak, A., Longman, R. S., Segata, N., Ubeda, C.,

Bielski, C., Rostron, T., Cerundolo, V., Pamer, E. G., Abramson, S. B.,

Huttenhower, C., and Littman, D. R. Expansion of intestinal Prevotella

copri correlates with enhanced susceptibility to arthritis. 2.

49. Scott, K. P., Jean-Michel, A., Midtvedt, T., and Hemert, S. v.

Manipulating the gut microbiota to maintain health and treat disease.

26(1), 25877.

51. Palarea-Albaladejo, J. and Martin-Fernandez, J. A. (2015)

zCompositionsR package for multivariate imputation of left-censored

data under a compositional approach. Chemometrics and Intelligent

Laboratory Systems, 143, 85–96.

52. Cao, Y., Zhang, A., and Li, H. (2020) Multisample estimation of bacterial

composition matrices in metagenomics data. Biometrika, 107(1), 75–92.

53. Vallejos, C. A., Risso, D., Scialdone, A., Dudoit, S., and Marioni, J. C.

(2017) Normalizing single-cell RNA sequencing data : challenges and

opportunities. 14(6).

54. Strimmer, K. Comments on: Augmenting the bootstrap to analyze high

dimensional genomic data. 17(1), 25–27.

Supplementary Figures

d e

g h

Figure 9. Supplement 1: Standard deviation per-OTU (taken across all 531

OTUs) plotted against the rank of the average count.

a b

Figure 10. Supplement 2: Lambda values for shrinkage estimation. a)

Lambda values selected for shrinkage estimation of correlation (



⇤

). b)

Lambda of correlation (



⇤

) and Lambda of variance shrinkage (



⇤

) used

jointly to estimate shrinkage of covariance for rhoshrink )

.CC-BY-NC 4.0 International licenseunder a

certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which was notthis version posted April 4, 2020. ; https://doi.org/10.1101/406264doi: bioRxiv preprint

mean = 0.0015

variance = 7e-04

skewness = 2.03

kurtosis = 5.407

mean = 0.002

variance = 7e-04

skewness = 1.857

kurtosis = 4.507

mean = 0.0102

variance = 0.0014

skewness = 1.566

kurtosis = 2.813

mean = 4e-04

variance = 6e-04

skewness = 2.107

kurtosis = 5.858

mean = 0.0058

variance = 0.001

skewness = 1.682

kurtosis = 3.46

mean = 0.0022

variance = 0.0039

skewness = 0.273

kurtosis = 0.217

mean = 7e-04

variance = 0.0041

skewness = 0.339

kurtosis = 0.276

mean = 0.000313

variance = 2e-05

skewness = 2.064

kurtosis = 5.732

mean = -2e-05

variance = 1.1e-05

skewness = 1.967

kurtosis = 5.654

mean = -1e-05

variance = 1.1e-05

skewness = 1.966

kurtosis = 5.644

mean = -1.4e-05

variance = 1e-05

skewness = 1.972

kurtosis = 5.666

mean = 2.4e-05

variance = 1.1e-05

skewness = 1.98

kurtosis = 5.708

mean = -3.1e-05

variance = 1.4e-05

skewness = 0.086

kurtosis = -0.135

mean = -3.9e-05

variance = 1.4e-05

skewness = 0.201

kurtosis = -0.101

mean = 0.0103

variance = 0.0033

skewness = 3.999

kurtosis = 30.16

mean = 0.0201

variance = 0.0037

skewness = 3.957

kurtosis = 26.996

mean = 0.0606

variance = 0.0063

skewness = 2.416

kurtosis = 10.315

b c

mean = 0.0035

variance = 0.0019

skewness = 5.559

kurtosis = 52.581

mean = 0.0319

variance = 0.0042

skewness = 3.446

kurtosis = 20.847

mean = 0.0118

variance = 0.0114

skewness = 1.206

kurtosis = 4.547

mean = 0.0019

variance = 0.0118

skewness = 1.253

kurtosis = 4.838

Figure 11. Supplement 3: Frequencies of correlation after transformation.

Density plots of correlation frequency of matrices after application of different

normalization methods. Mean, variance, skewness, and kurtosis are shown for

each distribution.

Figure 12. Supplement 4: Spectral distance between estimates of association

a) Spectral distance between sub-samples of different sizes. Lines represent

mean and error lines represent standard deviation from the mean. Lines

represent normalized matrices where correlation/proportionality estimation

with shrinkage was performed. b) Multidimensional scaling representation

of Spectral distance between correlation structures of varying size estimated

from different normalization methods. The most opaque points represent the

mean of 5 subsamples of the same size. (color scheme as in A)

Figure 13. Supplement 5: CMD distance between estimates of association

a) CMD distance between sub-samples of different sizes. Lines represent

mean and error lines represent standard deviation from the mean. Dashed

lines represent normalized matrices after Pearson correlation. The solid lines

represent normalized matrices where correlation/proportionality estimation

with shrinkage was performed. The dot-dash line represents rho, a

proportionality metric. b) Multidimensional scaling representation of CMD

distance between correlation structures of varying size estimated from

different normalization methods. The most opaque points represent the mean

of 5 subsamples of the same size. (color scheme as in A)

Figure 14. Supplement 6: Numbers of clusters as selected by the number of

connected components. Vertical lines represent the ﬁrst non-zero eigenvalue

which is selected as the number of clusters.

.CC-BY-NC 4.0 International licenseunder a

certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which was notthis version posted April 4, 2020. ; https://doi.org/10.1101/406264doi: bioRxiv preprint

d e

Figure 15. Supplement 7: Horizontal stacked bar plot of OTU groups

resulting from spectral clustering. The stacked bar plots represent the

composition of OTUs in each cluster at the Family level. Clusters are

vertically ordered from highest percentage of the most abundant Family:

Ruminococcaceae. Horizontally the order represents the highest percentage

of Family in each cluster. Numbers on the left axis represent the number of

OTUs in each cluster. Values stated next to method name represent cluster

purity.

d e

Figure 16. Supplement 8: Circular dendrograms of hierarchical clustering

patterns amongst taxa Family and Class. Each point surrounding the circular

dendrogram represents one of the 531 OTUs in our data set. The color

represents Family annotation and shape represents Class annotation. Each

dendrogram has been cut hierarchically into 10 trees (representing the 10

Orders to which each Family maps). The grey and black shading is used to

highlight different clusters which are numbered.

vst

raw

: 33

en:

css:

com:

Figure 17. Supplement 9: Relevance network visualization displaying

modularity. For networks on the left of each panel every node represents

an OTU labelled with module annotation as predicted by the Fast-Greedy

modularity algorithm. The networks on the right represent the corresponding

phylogenetic annotation of the OTU at the family level. Values stated next to

method name represent the number of modules in the network. Layout using

the force-directed Fruchterman-Reingold algorithm was conserved for both

networks in each panel for comparison

Figure 18. Supplement 10: Edges in common between relevance networks.

Venn Diagram of edges in common from the top 2000 edges between a) Log-

based normalization methods clr, vst, asinh and rhoshrink b) rle, com, and

tss.

.CC-BY-NC 4.0 International licenseunder a

certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which was notthis version posted April 4, 2020. ; https://doi.org/10.1101/406264doi: bioRxiv preprint

Supplementary Methods

SHRINKAGE ESTIMATION



⇤

=min

i6=j

Var



ˆr



i6=j

⇣

ˆr

⌘



⇤

=min

i=1

Var(ˆs

)

i=1

(ˆs

v)

The detailed computation of unbiased estimation of

variance (

Var) for



⇤

and



⇤

can be found in Scha

fer and

Strimmer (31, 54).

.CC-BY-NC 4.0 International licenseunder a

certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available

The copyright holder for this preprint (which was notthis version posted April 4, 2020. ; https://doi.org/10.1101/406264doi: bioRxiv preprint