Mikrobiomo 16S Analysis Example Report

Content

Ⅰ. General Vision
Ⅱ. Workflow
1 Preparation and Sequencing
2 Data Analysis and Procedures
Ⅲ. Results
1 Sequencing Data Processing
2 OTU analysis and species annotation
     2.1 OTU identification and annotation
     2.2 Species Composition Analysis
3 Alpha Diversity Analysis
     3.1 Alpha Diversity Indexes
     3.2 Species Diversity Curves
     3.3 Flor and Venn diagram
4 Process Data
Ⅳ. Methods
Ⅴ. References
Ⅵ. Appendix

I. General Vision

The 16S ribosomal RNA (rRNA) sequence is composed of nine hypervariable regions interspersed with conserved regions.The bacterial 16S gene contains nine hypervariable regions (V1-V9) ranging from about 30 to 100 base pairs in length that are involved in the secondary structure of the small ribosomal subunit.The degree of conservation varies widely between hypervariable regions, with more conserved regions correlated with higher-ranking taxonomy and less-conserved regions with lower ranks, such as genus and species.For taxonomic classification, it is sufficient to sequence individual hypervariable regions rather than the entire gene.Furthermore, the 16S gene contains highly conserved sequences between hypervariable regions, allowing the design of universal primers.With the development of high-throughput sequencing platforms, sequence variation in the 16S gene is widely used to characterize diverse microbial communities.[1,2,3].

Based on end-of-pair algorithms, amplicon sequencing is performed on the Illumina platform.

II. Workflow

1 Preparation and Sequencing

From raw DNA samples to final data, each step, including sample testing, PCR, library preparation and sequencing, influences the quality of the data, and the Data quality directly impacts the results of the analysis.To maintain data reliability, quality control (QC) is performed at each step of the procedure.The workflow is as follows:

2 Data Analysis and Procedures

For accuracy of sequencing data analysis, the raw data would be merged and filtered for clean data.The effective data is used to perform OTU group and species annotations for the respective sequence of each OTU.Therefore, relative species distribution, evenness, and abundance can be analyzed with alpha diversity, beta diversity, Venn, or Flower Graph et al.Furthermore, OTU selection, taxonomic assignment construction of phylogenetic trees through post statistical analysis explain community construction differences between samples or between groups via PCoA, PCA and NMDS.Statistical methods such as T-test, MetaStat, LEfSe, Anosim, and MRPP could test the importance of community composition and structure differences between groups.Meanwhile, the analysis result combines CCA/RDA to explore the main environmental factors.The data analysis workflow is shown below:

Notes: If the number of samples is less than 3, Beta diversity, community significance test, the build difference between groups, and environmental association analysis cannot continue.Significance test of the difference in community construction between groups and environmental association the analysis cannot proceed without grouping, or when the repetition is less than 3.Environmental Association. The analysis needs the data of the environmental factor offered by the clients.

III. Results

1 Sequencing Data Processing

Amplicon was sequencing on the Illumina paired-end platform to generate a 250 bp paired-end raw read (RAW PE), and then assembled and pretreated to obtain clean tags. Chimeric sequences in the clean tags were detected and removed to obtain finally the effective labels. The data output has been shown in table 1.

Table 1. QC Stadistics

Notes: Raw PE means PE readings; Raw tags means tags merged from PE reads; Clean tags means tags after filtering; Effective Labels means labels after filtering the chimera; Base means the base number of Effective labels; AvgLen means average length of effective labels; Q20 and Q30 mean the base percentage whose quality is greater than 20 and 30; GC (%) means GC content in effective tags; Effective (%) means the percentage of effective tags in Raw PE.

2 OTU analysis and species annotation

To analyze species diversity in each sample, all effective tags were grouped by 97% DNA sequence similarity in OTUs (Operating Taxonomic Units)

2.1 OTU identification and annotation

2.1.1 Statistic analysis of annotation

During the construction of the OTUs, basic information was collected from different samples, such as effective tag data, low-frequency tag data, and tag annotation data. The statistical data set is shown below in figure 2.1.1 (To see the full size image, click on here).

Sample_Tags-OTUs_dis.png

Figure 2.1.1 Statistical analysis of the labels and number of OTUs of each sample

Notes (from left to right): The Y1 axis titled "Number of Labels" means the number of labels; "Total tags" (red bars) means the number of effective tags; "Taxon tags" (blue bars) means the number of tags annotated; "Unclassified tags" (green bars) means the number of unranked tags; "Unique labels" (orange bars) means the number of labels with a frequency of 1 and only occurs in one sample. The Y2 axis titled "Numbers of OTUs" means the number of OTUs shown as "OTUs" (purple bars) in the image above to identify the number of OTUs in different samples.

2.1.2 Interactive view of species annotation

The heat map displays an interactive view of the species composition and abundance between different samples on the live web page Click here . An example image shows the following:

Figure 2.1.2 An example heatmap of the OTU table, showing the taxonomy mapping for each OTU.

Notes: Counts are colored according to the percentage contribution of each OTU to the total OTU count in a sample (blue: contributes low percentage of UTO to the sample; red: contributes high percentage of UTO). Keeping the filter value unchanged , and click the "Sample ID" button, then a graph will be generated as the example figure above. Click here for details..

2.1.3 Taxonomy Tree:

Specific species (showing the top 10 genera in high relative abundance by default) were selected to make the taxonomic tree[4] by independent R&D software. The taxonomy tree in a single sample shown in figure 2.1.3-1(To see the full size image, please click).

Figure 2.1.3-1 Taxonomic Tree in Single Sample.

Notes: The different colors represent different taxonomic ranges. The size of the circles represents the relative abundance of species. The first number below the taxonomic name represents the percentage in the entire taxon, while the second number represents the percentage in the selected taxon.

2.1.4 KRONA Display

KRONA [5] visually displays the result of the species annotation parsing .Circles from inside to outside represent different taxonomic ranges, and sector area means the respective proportion of different OTU annotation results.More details please click here. An example image is shown below:

Figure 2.1.4 KRONA Display

2.2 Species Composition Analysis

2.2.1 Taxons relative abundance

The top 10 taxa in the different taxonomic ranges were selected to form the relative abundance distribution histogram. The distribution in phylum was shown as follows (To see the full size image, please click here):

Figure 2.2.1 Relative abundance of phylum species

Notes: Plotted by "Relative Abundance" on the Y-axis and "Sample Names" on the X-axis. "Other" represents a total relative abundance of all other phyla in addition to the top 10 phyla.

2.2.2 The Evolutionary Tree in Gender.

The top 100 genera were selected and the evolutionary tree drawn using the aligned representative sequences. The relative abundance of each genus is also shown across the genus in Figure 2.2.2. (To view the full image please do click):

Genus 100 Tree

Figure 2.2.2 The Evolutionary Tree in Gender.

Notes: The different colors of the branches represent different cutting edges. The relative abundance of each genus in each group was shown outside the circle and the different colors represent different groups.

3 Alpha Diversity Analysis

Alpha diversity is widely used to assess microbial diversity within the community[6], including species accumulation boxplots, species diversity curves, and statistical analysis indices.

3.1 Alpha Diversity Indexes

Generally speaking, OTUs generated with 97% sequence identity are considered homologous in species. The statistical indices of alpha diversity when the clustering threshold is 97% are summarized below: (Number of reads chosen for normalization: Cutoff = 88996. The meaning of each alpha diversity index is listed in Method-Data Analysis- 3 Alpha diversity analysis).

Table 3.1 Alpha Diversity Indices Statistics

3.2 Species Diversity Curve

Rarefaction curves and rank abundance curves are widely used to indicate the biodiversity of samples. The rarefaction curve is created by randomly selecting a certain amount of sequencing data from the samples and then counting the number of species they represent. If the curve is steep, many species remain to be discovered. If the curve becomes flatter, a credible number of samples have been taken, meaning that only rare species remain to be sampled.

The rank-of-abundance curve is used to show the relative abundance of species. It can also be used to visualize species richness and uniformity. It overcomes the deficiencies of the biodiversity indices that cannot present the role that the variables played in their evaluation[7].

Species Diversity Curve (To see the full size image please click):

      Figure 3.2-1 Rarefaction Curves                        Figure 3.2-2 Curvas de Abundancia de Rango

Notes: For rarefaction curves, each curve represents a sample and can be colored and shaped by each sample name provided in the mapping file. The number of sequences is on the x-axis and the number of observed OTUs is on the y-axis. For the rank-abundance curves, each curve represents a single sample, plotted by the relative abundance of OTUs on the y-axis and the rank of OTU abundance on the X-axis, which can be colored and shaped by each sample name provided in the mapping file.

4 Process Data

16S rDNA amplicon sequencing (18S rDNA, ITS) is widely used for comparison of microbial communities between samples from various natural or endozoic environments. such as soil, water, host gut, etc. To achieve these goals, several important outcomes needed to be very concerned.

First, the results of the OTU group and species annotation are summarized in result/02.OTUanalysis/. Labels are pooled at 97% identity, all represented labels for each OTU are listed in result/02.OTUanalysis/OTUs.fasta. These OTUs are then noted and collected in result/02.OTUanalysis/OTUs.tax_assignments.txt. Species abundance is displayed in two important directories: Absolute/ (containing the absolute composition of species in different taxonomic ranges), Evenabs/ (containing the absolute composition of species after normalization), and Relative/ (containing the abundance of each sample after normalization, which are mainly summarized for subsequent analysis of alpha diversity and beta diversity). For example, the Relative/ directory contains the relative abundance of species from each sample in different taxonomic ranges (kingdom, phylum, class, order, family, genus, species). From these results, we can visualize the species composition of various samples and focus on some species in question or very different species between samples (or groups) to correlate with our certain research objectives.

The distribution of dominant species among the samples is visualized in the directory result/02.OTUanalysis/top10 (with bar graph and profile table in p, c, o, f, g) so that we can locate the notable predominant species in a more accurate way. quickly and conveniently, and then proceed to abundance analysis and difference tests.

Results on sample complexity are mainly included in the directory result/03.AlphaDiversity/ with six different alpha diversity indices (Observed Species, Asset Coverage, Chao1, ACE, Shannon, Simpson).

Regarding the comparison of differences in microbial communities between samples, the results are shown in the directory result/04.BetaDiversity. First, the Unifrac distance between pairwise samples is displayed as a heatmap to measure and view the degree of dissimilarity, the result is represented in result/04.BetaDiversity/beta_div_heatmap. Dissimilarity is then calculated with gradient analysis and displayed with rank plots (PCA, PCoA, etc.), samples with similar microbial community structure tend to be collected, and vice versa. UPGMA could then group the samples based on the acquired distance matrix and display them in result/04.BetaDiversity/Tree/. From these results, we can determine the differences in complexity between samples and explain the differences between samples (or groups) by combining with specific underlying biological problems. For example, we can explain the UPGMA sample cluster results by considering high-abundance taxa to achieve the underlying driving factors.

When there are more than two groups, more advanced analysis could be performed.

For species differences, we can use Metastats to obtain the importance of all species between groups and select obvious different species between groups in various taxonomic ranges (p, c, o, f, g, s) for further analysis, or choose LefSE analysis to discover statistically significant biomarkers between groups.

Anosim and MRPP analysis could be used to determine whether community structure differs significantly between groups, or to compare between-group and within-group differences.

NMDS analysis could be selected as a complementary method with unexpected results through PCA and PCoA, because it is based on a non-linear model (PCA and PCoA are based on a linear model), and it can offer a better explanation of the non-linear structure in ecological data sets.

For an exploratory analysis, if multiple environmental factors are involved, we might select CCA or RDA analysis to extract environmental gradients from ecological data sets, and find environmental drivers that influence the development of certain microbial communities.

Ⅳ. Methods

Preparation of Sequencing

1 Extraction of DNA from the genome

The DNA of the total genome of the samples was extracted by the CTAB/SDS method. DNA concentration and purity were monitored on 1% agarose gels. According to the concentration, the DNA was diluted to 1 ng/μL using sterile water.

2 Amplicon Generation

16S rRNA/18SrRNA/ITS genes from different regions (16SV4/16SV3/16SV3-V4/16SV4-V5, 18S V4/18S V9, ITS1/ITS2, Arc V4) were amplified with a specific primer used (for example, 16S V4: 515F-806R, 18S V4: 528F-706R, 18S V9: 1380F-1510R, et al) with the barcode. All PCR reactions were performed with Phusion® High Fidelity PCR Master Mix (New England Biolabs).

3 Quantification and qualification of PCR products

Mix the same volume of 1X loading buffer (containing SYB green) with PCR products and run 2% agarose gel electrophoresis for detection. Samples with bright main band between 400-450 bp were chosen for further experiments.

4 Mixing and Purification of PCR Products

The PCR products were mixed in equidensity proportions. The PCR product mixture was then purified with the Qiagen gel extraction kit (Qiagen, Germany).

Libraries generated with the NEBNext® UltraTM DNA Library Preparation Kit for Illumina and quantified via Qubit and Q-PCR would be analyzed by the Illumina platform.

Analysis of the Information

1 Sequencing Data Processing

Paired reads were assigned to samples based on their unique barcode and truncated by cutting the barcode and primer sequence. The matched reads were merged using FLASH (V1.2.7) [8], a very fast and accurate parsing tool, which was designed to merge matched-end reads when at least some of the reads overlap to the readout generated from the opposite end of the same DNA fragment, and the splicing sequences were referred to as raw tags. Quality filtering on raw tags was performed under specific filtering conditions to obtain high quality clean tags [9] according to Qiime (V1.7.0) [10] quality control process. The tags were compared to the reference database (Gold database) using the UCHIME algorithm (UCHIME Algorithm)[11] to detect chimera sequences AND then chimera sequences were removed [12]. So finally the Effective Tags were obtained.


2 OTU Cluster and Species Annotation

Sequence analysis was performed using the Uparse software (Uparse v7.0.1001)[13] using all effective tags. Sequences with ≥97% similarity were assigned to the same OTUs. The representative sequence of each OTU was examined for further annotation. For each representative sequence, the Mothur software was performed against the SSUrRNA database of SILVA database[14] for the annotation of species in each taxonomic range (Threshold: 0.8~1).[15] (kingdom, phylum, class, order, family, genus, species). To obtain the phylogenetic relationship of all representative OTU sequences, MUSCLE [16](Version 3.8.31) can quickly compare multiple sequences. OTU abundance information was normalized using a sequence number standard corresponding to the sample with the fewest number of sequences. Subsequent analyzes of alpha diversity and beta diversity were all performed based on this normalized data output.


3 Alpha Diversity

Alpha diversity is applied in the analysis of the complexity of species diversity for a sample through 6 indices, including Observed Species, Chao1, Shannon, Simpson, ACE, Good Coverage. All these indices in our samples were calculated with QIIME (Version 1.7.0) and displayed with R software (Version 2.15.3).

Alpha Diversity Indexes

Community Wealth Indices:

Chao - The Chao1 estimator

ACE - The ACE estimator

Community Diversity Indices:

Shannon - The Shannon Index

Simpson - The Simpson Index

The Sequencing Depth Index:

Coverage - The Good’s Coverage

The Phylogenetic Diversity Index:

PD_whole_tree - PD_whole_tree Index


4 Beta Diversity

Beta diversity analysis was used to assess sample differences in species complexity. Weighted and unweighted unifrac beta diversity was calculated using QIIME software (Version 1.7.0). Cluster analysis was preceded by principal component analysis (PCA), which was applied to reduce the dimension of the original variables using the FactoMineR package and the ggplot2 package in R software (Version 2.15.3). Principal coordinate analysis (PCoA) was performed to obtain principal coordinates and visualize complex multidimensional data. A weighted or unweighted unifrac distance matrix between the samples obtained above was transformed into a new set of orthogonal axes, whereby the maximum variation factor is shown by the first principal coordinate, the second maximum by the second principal coordinate, and so on. PCoA analysis was shown using the WGCNA package, stat packages, and ggplot2 package in R software (version 2.15.3). Unweighted Pairwise Group Method with Arithmetic Means (UPGMA) Clustering was performed as a type of hierarchical clustering method to interpret the distance matrix using the average link and was performed by QIIME software (Version 1.7.0).

LEfSe analysis was performed with LEfSe software. Metastats were calculated using R software. The P value was calculated using the permutation test method, while the q value was calculated using the Benjamini and Hochberg False Discovery Rate method[17] . Anosim, MRPP and Adonis were performed with R software (vegan package: anosim function, mrpp function and adonis function). AMOVA was calculated by mothur using the amova function. T_test and drawing were performed by R software.

Ⅴ. References

[1] Caporaso, J. Gregory, et al. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proceedings of the National Academy of Sciences 108.Supplement 1 (2011): 4516-4522.
[2] Youssef, Noha, et al. Comparison of species richness estimates obtained using nearly complete fragments and simulated pyrosequencing-generated fragments in 16S rRNA gene-based environmental surveys. Applied and environmental microbiology 75.16 (2009): 5227-5236.
[3] Hess, Matthias, et al. Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science 331.6016 (2011): 463-467.
[4] DeSantis, T. Z., et al. NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes. Nucleic acids research 34.suppl 2 (2006): W394-W399.
[5] Ondov, Brian D., Nicholas H. Bergman, and Adam M. Phillippy. Interactive metagenomic visualization in a Web browser. BMC bioinformatics 12.1 (2011): 385.
[6] Li, Bing, et al. Characterization of tetracycline resistant bacterial community in saline activated sludge using batch stress incubation with high-throughput sequencing analysis. Water research 47.13 (2013): 4207-4216.
[7] Lundberg, Derek S., et al. Practical innovations for high-throughput amplicon sequencing.Nature methods 10.10 (2013): 999-1002.
[8] Magoč, Tanja, and Steven L. Salzberg. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics 27.21 (2011): 2957-2963.
[9] Bokulich, Nicholas A., et al. Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nature methods 10.1 (2013): 57-59.
[10] Caporaso, J. Gregory, et al. QIIME allows analysis of high-throughput community sequencing data. Nature methods 7.5 (2010): 335-336.
[11] Edgar, Robert C., et al. UCHIME improves sensitivity and speed of chimera detection. Bioinformatics 27.16 (2011): 2194-2200.
[12] Haas, Brian J., et al. Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome research 21.3 (2011): 494-504.
[13] Edgar, Robert C. UPARSE: highly accurate OTU sequences from microbial amplicon reads. Nature methods 10.10 (2013): 996-998.
[14] Wang, Qiong, et al. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and environmental microbiology 73.16 (2007): 5261-5267.
[15] Quast C, Pruesse E, et al.The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucl. Acids Res. (2013) : D590-D596.
[16] MUSCLE: multiple sequence alignment with high accuracy and high throughputEdgar, 2004
[17] White, James Robert, Niranjan Nagarajan, and Mihai Pop. Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS computational biology 5.4 (2009): e1000352.

Ⅵ. Appendix

Methods: Report_Methods_MIKROBIOMO.pdf.

Notes: It is highly recommended to scan this report in the Firefox box, as various browsers, such as IE, may disable the display of tables. More vectograms and statistical data are provided in the delivery directory.