MetaGenomics — Shotgun Metagenomics & 16S rRNA Analysis Engine

01 — Modules

Six Analysis Engines, One Script

MetaGenomics bundles six published statistical methods into a single pure-Python pipeline. Each module is implemented from first principles — no external bioinformatics frameworks required.

Module 1

Taxonomic Profiling

Rarefaction (subsampling to uniform depth), relative abundance normalization, taxonomic roll-up from OTU to phylum level, and stacked composition barplots.

rarefaction CLR transform barplots 500-taxon DB

Module 2

Alpha Diversity

Shannon entropy, Simpson index, Chao1 richness estimator, Pielou's evenness, Faith's phylogenetic diversity, and rarefaction curves for species discovery assessment.

Shannon H′ Chao1 Pielou J Simpson D

Module 3

Beta Diversity & Ordination

Bray-Curtis dissimilarity, Aitchison distance (CLR + Euclidean), Jaccard presence/absence, PCoA eigendecomposition, PERMANOVA, and ANOSIM for group separation testing.

Bray-Curtis PCoA PERMANOVA Enterotypes (DMM)

Module 4

Differential Abundance

LEfSe (Kruskal-Wallis → Wilcoxon → LDA effect size), ALDEx2 (Monte Carlo Dirichlet + CLR + Wilcoxon), DESeq2-inspired (negative binomial), and ANCOM-BC bias-corrected analysis.

LEfSe ALDEx2 ANCOM-BC BH FDR

Module 5

Functional Profiling

COG category mapping (25 functional categories), KEGG pathway hypergeometric enrichment, metabolic guild classification, and antibiotic resistance gene (ARG) detection across 20 gene classes.

COG KEGG ARG metabolic guilds
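
The hypergeometric enrichment test named above can be sketched with the standard library alone. This is a minimal illustration, not the tool's actual code: the function name `enrichment_pvalue` and the example numbers are hypothetical.

```python
from math import comb

def enrichment_pvalue(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k).

    N: genes in the background, K: background genes in the pathway,
    n: genes in the query set, k: query genes that fall in the pathway.
    """
    upper = min(n, K)
    # Sum the probability mass for k, k+1, ..., upper pathway hits
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

# Hypothetical example: 8 of 20 query genes hit a 40-gene pathway
# drawn from a background of 1000 genes.
p = enrichment_pvalue(k=8, n=20, K=40, N=1000)
print(f"enrichment p-value: {p:.3g}")
```

A small p-value suggests the pathway is over-represented in the query set; when many pathways are tested, the p-values would still need multiple-testing correction.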

Module 6

Co-occurrence Networks

SparCC-inspired correlation estimation for compositional data, node hub scoring, edge filtering by correlation threshold, and microbial interaction inference.

SparCC Spearman ρ hub taxa interaction
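
As a rough sketch of the edge-filtering and hub-scoring steps, the following computes Spearman correlations between taxa, keeps edges above a threshold, and uses node degree as the hub score. The taxon labels, the input values (assumed to be already CLR-transformed), and the choice of degree as the hub score are illustrative assumptions.

```python
import math
from itertools import combinations

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    # Spearman's rho is Pearson's r computed on ranks
    return pearson(ranks(x), ranks(y))

def build_network(values_by_taxon, threshold=0.5):
    """Return (edges, degree): edges with |rho| > threshold, degree per taxon."""
    edges = []
    degree = {taxon: 0 for taxon in values_by_taxon}
    for a, b in combinations(values_by_taxon, 2):
        rho = spearman(values_by_taxon[a], values_by_taxon[b])
        if abs(rho) > threshold:
            edges.append((a, b, rho))
            degree[a] += 1
            degree[b] += 1
    return edges, degree

# Hypothetical CLR values for four taxa across five samples
clr_values = {
    "A": [0.1, 0.5, 0.9, 1.4, 2.0],
    "B": [0.2, 0.4, 1.1, 1.3, 1.9],
    "C": [2.0, 1.5, 1.0, 0.4, 0.1],
    "D": [0.3, 1.2, 0.1, 0.9, 0.5],
}
edges, degree = build_network(clr_values, threshold=0.5)
print(edges)
print(degree)
```

Taxa A, B, and C are mutually monotone, so they form a triangle, while D correlates only weakly with the others and stays unconnected.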

02 — Diversity Metrics

Mathematically Rigorous, Compositional-aware

MetaGenomics implements published statistical methods that account for the compositional nature of microbiome data — the mathematical reason many naive analyses produce spurious results.

Alpha — Shannon H′

−Σ pᵢ log(pᵢ)

Information entropy; range [0, log(S)]
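
A minimal sketch of the formula above, together with Pielou's evenness J = H′ / log(S). It assumes natural logarithms and plain count vectors; the function names are illustrative.

```python
import math

def shannon(counts):
    """Shannon entropy H' (natural log) of a vector of taxon counts."""
    total = sum(counts)
    # Zero counts are skipped because p * log(p) tends to 0 as p tends to 0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def pielou(counts):
    """Pielou's evenness J = H' / log(S), with S the number of observed taxa."""
    s = sum(1 for c in counts if c > 0)
    return shannon(counts) / math.log(s) if s > 1 else 0.0

even = [25, 25, 25, 25]     # perfectly even community
skewed = [97, 1, 1, 1]      # one dominant taxon
print(shannon(even), pielou(even))
print(shannon(skewed), pielou(skewed))
```

The even community reaches the upper bound log(S) = log(4), while the dominated community scores much lower.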

Alpha — Chao1

S_obs + f₁² / (2·f₂)

Richness estimator accounting for unseen species
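
The estimator can be sketched as follows, where f₁ and f₂ are the numbers of singleton and doubleton taxa. The fallback for f₂ = 0 uses the common bias-corrected form; whether the tool handles that case the same way is an assumption.

```python
def chao1(counts):
    """Chao1 richness estimate from a vector of taxon counts."""
    s_obs = sum(1 for c in counts if c > 0)   # observed taxa
    f1 = sum(1 for c in counts if c == 1)     # singletons
    f2 = sum(1 for c in counts if c == 2)     # doubletons
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    # Bias-corrected form, used here to avoid dividing by zero doubletons
    return s_obs + f1 * (f1 - 1) / 2

sample = [10, 5, 1, 1, 1, 2, 2, 0]
print(chao1(sample))  # 7 observed taxa + 3**2 / (2 * 2) = 9.25
```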

Beta — Bray-Curtis

Σ|aᵢ−bᵢ| / Σ(aᵢ+bᵢ)

Abundance-weighted dissimilarity; range [0,1]
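
A short sketch of the dissimilarity above for two count vectors defined over the same taxa. Returning 0 for two empty samples is a convention chosen for this example.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den if den > 0 else 0.0

print(bray_curtis([10, 0, 5], [5, 5, 5]))   # 10 / 30
print(bray_curtis([3, 0], [0, 7]))          # no shared taxa: 1.0
print(bray_curtis([4, 2], [4, 2]))          # identical samples: 0.0
```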

Beta — Aitchison

‖CLR(a)−CLR(b)‖₂

CLR + Euclidean; composition-aware distance

Differential — CLR

log(xᵢ) − log GM(x)

Centered log-ratio; projects proportions to ℝⁿ
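
The CLR transform and the Aitchison distance built on it can be sketched together. The pseudocount of 0.5 used to handle zero counts is an assumption for this example; other zero-replacement strategies exist.

```python
import math

def clr(counts, pseudocount=0.5):
    """Centered log-ratio: log(x_i) minus the log of the geometric mean."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)   # equals log GM(x)
    return [v - mean_log for v in logs]

def aitchison(a, b, pseudocount=0.5):
    """Euclidean distance between the CLR-transformed vectors."""
    ca, cb = clr(a, pseudocount), clr(b, pseudocount)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))

x = [1, 2, 4]
y = [10, 20, 40]   # same composition as x at ten times the depth
print(clr(x, pseudocount=0))
print(aitchison(x, y, pseudocount=0))   # close to 0: CLR ignores total depth
```

CLR values always sum to zero, and two samples with identical proportions have an Aitchison distance of zero regardless of sequencing depth (exactly so when no pseudocount is added).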

Network — SparCC

ρ(Spearman)[CLR]

Spearman on CLR values; a SparCC-inspired approximation that reduces compositional bias

    Why CLR transform? Microbiome data is compositional — reads are proportions constrained to sum to a constant. Standard Pearson correlations on raw proportions create spurious negatives (if one taxon rises, others must fall mathematically). The centered log-ratio (CLR) projects to unconstrained Euclidean space where standard statistics are valid. This is the mathematical foundation of both ALDEx2 and Aitchison distance.
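
The closure effect is easiest to see with two taxa. In this sketch (hypothetical abundances, illustrative function name), the absolute abundances are weakly positively correlated, yet once each sample is converted to proportions the two taxa become perfectly anticorrelated, because one proportion is always one minus the other.

```python
import math

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical absolute abundances of two taxa in five samples
taxon_a = [10, 20, 30, 40, 50]
taxon_b = [30, 10, 50, 20, 40]

# Closure: convert each sample to proportions that sum to 1
totals = [a + b for a, b in zip(taxon_a, taxon_b)]
prop_a = [a / t for a, t in zip(taxon_a, totals)]
prop_b = [b / t for b, t in zip(taxon_b, totals)]

r_abs = pearson(taxon_a, taxon_b)
r_prop = pearson(prop_a, prop_b)
print(f"absolute: {r_abs:.2f}, proportions: {r_prop:.2f}")
# absolute: 0.30, proportions: -1.00
```

With more than two taxa the induced correlation is weaker than -1, but the negative bias remains.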
  

03 — Biomarker Discovery

LEfSe & ALDEx2 for Differential Abundance

LEfSe (Segata et al. 2011, 6000+ citations) is a field-standard method for microbiome biomarker discovery. ALDEx2 models compositional uncertainty by drawing Monte Carlo samples from a Dirichlet distribution, then applying the CLR transform and a Wilcoxon test to each draw.

LEfSe

Kruskal-Wallis → LDA

Biomarker discovery

ALDEx2

MC Dirichlet + CLR + Wilcoxon

Effect size + FDR

ANCOM-BC

Bias-corrected ANCOM

Compositionality-aware

DESeq2

Neg. binomial + size factors

Metagenomic count data

IBD

Inflammatory bowel disease (Crohn's / UC) vs. healthy

↓ F. prausnitzii, E. rectale, A. muciniphila
↑ R. gnavus, E. coli, B. fragilis

CRC

Colorectal cancer vs. healthy

↓ F. prausnitzii, Coprococcus, Bifidobacterium
↑ B. fragilis, E. coli, R. gnavus

Obesity

Obesity / metabolic syndrome vs. lean

↓ A. muciniphila, F. prausnitzii, B. uniformis
↑ M. smithii, Lachnospiraceae NK4A136

T2D

Type 2 diabetes vs. healthy

↓ A. muciniphila, Lactobacillus, Coprococcus
↑ B. fragilis, E. coli

04 — Pipeline

End-to-End in One Command

The full pipeline runs from OTU table → rarefaction → alpha/beta diversity → biomarker discovery → network inference → interactive dashboard in a single Python script.

Step 01

Data Input

Provide an OTU/ASV count table (taxa × samples) or generate a synthetic cohort for any of the four built-in conditions (IBD, CRC, obesity, T2D).

# From OTU table
python metagenomics.py --otu otu_table.csv --condition IBD

# Synthetic demo
python metagenomics.py IBD

Built-in taxon DB: 500 gut microbiome species with known disease associations

Step 02

Rarefaction + Alpha Diversity

Subsample all samples to minimum sequencing depth for fair comparison. Compute Shannon, Simpson, Chao1, Pielou's evenness, and dominance for each sample.

# Rarefaction: uniform depth sampling without replacement
depth = min(sample_totals)
# Alpha: Shannon, Simpson, Chao1, Pielou J, dominance

Outputs: otu_rarefied.csv, alpha_diversity.csv
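
The rarefaction step might look like the sketch below, which subsamples every sample to the smallest total without replacement. The sample names, counts, and the `rarefy` helper are hypothetical; expanding counts into a read pool is simple but memory-hungry for deep samples.

```python
import random

def rarefy(counts, depth, rng):
    """Subsample a count vector to `depth` reads without replacement."""
    # One pool entry per read, labelled with its taxon index
    pool = [i for i, c in enumerate(counts) for _ in range(c)]
    if depth > len(pool):
        raise ValueError("depth exceeds the sample total")
    out = [0] * len(counts)
    for i in rng.sample(pool, depth):
        out[i] += 1
    return out

# Hypothetical count vectors over the same four taxa
samples = {
    "S1": [120, 60, 20, 0],
    "S2": [900, 50, 40, 10],
}
rng = random.Random(42)   # fixed seed for reproducibility
depth = min(sum(counts) for counts in samples.values())
rarefied = {name: rarefy(counts, depth, rng) for name, counts in samples.items()}
print(depth, rarefied)
```

The shallowest sample is returned unchanged, since all of its reads are drawn; deeper samples lose reads and may lose rare taxa.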

Step 03

Beta Diversity + PERMANOVA

Compute pairwise Bray-Curtis / Aitchison / Jaccard distance matrix. Run PCoA ordination and PERMANOVA to test whether group membership explains community composition differences.

# Bray-Curtis: Σ|a-b| / Σ(a+b)
# PCoA: double-center D², eigendecompose
# PERMANOVA: pseudo-F via group permutation (999x)

Expected: PERMANOVA p < 0.05 for IBD vs. control cohorts
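
The PERMANOVA step can be sketched from a distance matrix and group labels, following the pseudo-F construction of Anderson (2001). The function names, labels, and toy distances are illustrative assumptions; PCoA is omitted because it requires an eigendecomposition routine.

```python
import random
from itertools import combinations

def pseudo_f(dist, labels):
    """Pseudo-F statistic from a square distance matrix and group labels."""
    n = len(labels)
    groups = sorted(set(labels))
    # Total sum of squares from all pairwise squared distances
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    ss_within = 0.0
    for g in groups:
        idx = [i for i, lab in enumerate(labels) if lab == g]
        ss_within += sum(dist[i][j] ** 2 for i, j in combinations(idx, 2)) / len(idx)
    ss_among = ss_total - ss_within
    a = len(groups)
    return (ss_among / (a - 1)) / (ss_within / (n - a))

def permanova(dist, labels, n_perm=999, seed=0):
    """Return (observed pseudo-F, permutation p-value)."""
    rng = random.Random(seed)
    f_obs = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)   # permute group membership
        if pseudo_f(dist, shuffled) >= f_obs:
            hits += 1
    return f_obs, (hits + 1) / (n_perm + 1)

# Toy example: ten samples on a line, two well-separated groups
points = [0, 1, 2, 3, 4, 10, 11, 12, 13, 14]
labels = ["control"] * 5 + ["IBD"] * 5
dist = [[abs(p - q) for q in points] for p in points]
f_obs, p_value = permanova(dist, labels)
print(f_obs, p_value)
```

The p-value counts how often a random relabelling produces a pseudo-F at least as large as the observed one, with one added to numerator and denominator so it can never be exactly zero.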

Step 04

LEfSe Biomarker Discovery

Kruskal-Wallis test → pairwise Wilcoxon consistency check → LDA score estimation. Report taxa with |LDA| > 2.0 as condition-specific biomarkers. Also run ALDEx2 with 128 MC Dirichlet samples for FDR-controlled results.

# LEfSe: KW p<0.05 → pairwise Wilcoxon → LDA
# ALDEx2: 128 MC Dirichlet + CLR + Wilcoxon + BH FDR

Outputs: lefse_biomarkers.csv, cooccurrence_network.csv
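
The BH FDR step applied to per-taxon p-values can be sketched as below. This is a generic Benjamini-Hochberg adjustment, not necessarily the tool's implementation, and the example p-values are made up.

```python
def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.20]   # hypothetical per-taxon p-values
qvals = benjamini_hochberg(pvals)
print(qvals)
```

Taxa whose adjusted value falls below the chosen FDR level (for example 0.05) would be reported as differentially abundant.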

Step 05

Visualization + Network

Generate a 6-panel interactive Plotly dashboard: taxonomic barplot, alpha diversity boxplots, PCoA ordination, LEfSe biomarker histogram, SparCC co-occurrence network, and summary table. ARG profile and functional COG summary also exported.

metagenomics_output/
├── metagenomics.html        # 6-panel interactive dashboard
├── alpha_diversity.csv
├── lefse_biomarkers.csv
├── cooccurrence_network.csv
├── arg_profile.csv
└── summary.json

Compositionality warning

Never compute Pearson correlations on raw relative abundances: the constant-sum constraint biases them toward spurious negatives. CLR-transform first, then compute correlations in CLR space (the approach taken by the SparCC-inspired network module).

Rarefaction is a necessary correction

Unequal sequencing depth confounds richness and diversity estimates. Rarefaction (subsampling without replacement) is a widely used correction for this bias. Alternatively, use CLR or CSS normalization for rarefaction-free comparisons.

05 — Live Demo

Interactive Dashboard

Synthetic IBD cohort (15 cases, 15 controls, 80 taxa) generated by the built-in simulator. Shows all six analysis modules producing results in a single run.

Synthetic demo parameters: 30 samples (15 IBD / 15 healthy), 80 taxa, rarefaction depth of 47,216 reads, Bray-Curtis + PERMANOVA (999 permutations), LEfSe (|LDA| > 2.0), SparCC network (|ρ| > 0.5). IBD-associated taxa: Faecalibacterium prausnitzii (depleted), Ruminococcus gnavus (enriched), Escherichia coli (enriched).

06 — The Skill

Full SKILL.md Content

The complete executable skill file used by AI agents. Reproduces the full analysis pipeline from data generation to interactive dashboard.

MetaGenomics: Shotgun Metagenomics & 16S rRNA Analysis Engine

Trigger

Use this skill when the user wants to:

Profile the taxonomic composition of a microbial community from 16S rRNA or WGS data
Compute alpha diversity (Shannon, Simpson, Chao1, Faith's PD) and beta diversity
Perform differential abundance analysis between sample groups (case vs. control)
Identify biomarker taxa associated with a condition (LEfSe-like analysis)
Annotate metagenomic reads with functional categories (COG, KEGG pathways)
Detect antibiotic resistance genes (ARGs) in a metagenomic sample
Analyze enterotypes and gut microbiome community structure

Quick Start

pip install numpy scipy pandas scikit-learn plotly matplotlib requests --break-system-packages -q

python metagenomics.py IBD
# Open metagenomics_output/metagenomics.html

Demo Conditions

Condition	Description	Key Biomarkers
IBD	Inflammatory bowel disease	F. prausnitzii ↓, R. gnavus ↑
CRC	Colorectal cancer	Coprococcus ↓, B. fragilis ↑
obesity	Metabolic syndrome	A. muciniphila ↓, M. smithii ↑
T2D	Type 2 diabetes	Lactobacillus ↓, E. coli ↑

Output Files

File	Description
metagenomics.html	6-panel interactive Plotly dashboard
otu_rarefied.csv	Rarefied OTU count table (taxa × samples)
alpha_diversity.csv	Shannon, Simpson, Chao1, evenness per sample
lefse_biomarkers.csv	Differentially abundant taxa + LDA scores
cooccurrence_network.csv	SparCC pairwise microbial correlations
arg_profile.csv	Antibiotic resistance gene abundances
summary.json	Machine-readable analysis summary

Dependencies

numpy>=1.24, scipy>=1.10, pandas>=1.5, scikit-learn>=1.3, plotly>=5.15

Python 3.9+. CPU only. No QIIME2, mothur, HUMAnN3, or R required.

Scientific References

Anderson, M.J. (2001). PERMANOVA. Austral Ecology
Segata, N. et al. (2011). LEfSe. Genome Biology
Fernandes, A.D. et al. (2014). ALDEx2. Microbiome
Friedman, J. & Alm, E.J. (2012). SparCC. PLoS Comput Biol
Gloor, G.B. et al. (2017). Microbiome datasets are compositional. Front. Microbiol

07 — Reproduce

Clone and Run

Full reproducibility in a few commands. The skill handles everything from data generation to the interactive dashboard.

# Clone the repository
git clone https://github.com/junior1p/MetaGenomics.git
cd MetaGenomics

# Install dependencies
pip install numpy scipy pandas scikit-learn plotly matplotlib requests \
    --break-system-packages -q

# Run IBD demo (15 cases + 15 controls, 80 taxa)
python metagenomics.py IBD

# Open the interactive 6-panel dashboard
open metagenomics_output/metagenomics.html    # macOS; use xdg-open on Linux

# Try other conditions
python metagenomics.py CRC        # colorectal cancer
python metagenomics.py obesity    # metabolic syndrome
python metagenomics.py T2D        # type 2 diabetes

Max

Shotgun metagenomics analysis, differential abundance, and microbial ecology
