Executable Agent Skill

Shotgun Metagenomics & 16S rRNA Analysis

Pure NumPy/SciPy/scikit-learn microbial community analysis — rarefaction, alpha/beta diversity, LEfSe biomarker discovery, ALDEx2 differential abundance, SparCC co-occurrence networks, and ARG detection. No QIIME2, no mothur, no HUMAnN3.

6 Analysis Modules · 100% Pure Python · 30+ Diversity Metrics · 20 ARG Classes
01 — Modules

Six Analysis Engines, One Script

MetaGenomics bundles six published statistical methods into a single pure-Python pipeline. Each module is implemented from first principles — no external bioinformatics frameworks required.

Module 1

Taxonomic Profiling

Rarefaction (subsampling to uniform depth), relative abundance normalization, taxonomic roll-up from OTU to phylum level, and stacked composition barplots.

rarefaction · CLR transform · barplots · 500-taxon DB
Module 2

Alpha Diversity

Shannon entropy, Simpson index, Chao1 richness estimator, Pielou's evenness, Faith's phylogenetic diversity, and rarefaction curves for species discovery assessment.

Shannon H′ · Chao1 · Pielou J · Simpson D
Module 3

Beta Diversity & Ordination

Bray-Curtis dissimilarity, Aitchison distance (CLR + Euclidean), Jaccard presence/absence, PCoA eigendecomposition, PERMANOVA, and ANOSIM for group separation testing.

Bray-Curtis · PCoA · PERMANOVA · Enterotypes (DMM)
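The ordination step described above can be sketched as follows. This is a minimal illustration, not the skill's actual code: the function names `bray_curtis` and `pcoa` and the toy count matrix are assumptions made for this example.

```python
import numpy as np

def bray_curtis(counts):
    """Pairwise Bray-Curtis dissimilarity for a samples x taxa count matrix."""
    counts = np.asarray(counts, dtype=float)
    n = counts.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            num = np.abs(counts[i] - counts[j]).sum()
            den = (counts[i] + counts[j]).sum()
            dist[i, j] = dist[j, i] = num / den if den > 0 else 0.0
    return dist

def pcoa(dist, n_axes=2):
    """Classical PCoA: double-center -0.5 * D^2, then eigendecompose."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Non-Euclidean distances can give small negative eigenvalues; clip them.
    positive = np.clip(eigvals[:n_axes], 0.0, None)
    coords = eigvecs[:, :n_axes] * np.sqrt(positive)
    explained = positive / eigvals[eigvals > 0].sum()
    return coords, explained

# Hypothetical toy table: 4 samples x 3 taxa
toy = np.array([[10, 0, 5], [8, 2, 5], [0, 12, 3], [1, 10, 4]])
dist = bray_curtis(toy)
coords, explained = pcoa(dist)
print(np.round(dist, 3))
print(np.round(explained, 3))
```

Samples 0 and 1 share most of their counts, so they should land close together on the first axis, away from samples 2 and 3.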
Module 4

Differential Abundance

LEfSe (Kruskal-Wallis → Wilcoxon → LDA effect size), ALDEx2 (Monte Carlo Dirichlet + CLR + Wilcoxon), DESeq2-inspired (negative binomial), and ANCOM-BC bias-corrected analysis.

LEfSe · ALDEx2 · ANCOM-BC · BH FDR
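As a rough sketch of the first LEfSe stage and the BH FDR correction named above: a per-taxon Kruskal-Wallis screen followed by Benjamini-Hochberg adjustment. The helper names and the synthetic data are assumptions for illustration; the pairwise Wilcoxon and LDA stages are not shown.

```python
import numpy as np
from scipy.stats import kruskal

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(scaled, 0.0, 1.0)
    return q

def kruskal_screen(abundance, groups):
    """Kruskal-Wallis p-value per taxon (columns) across sample groups (rows)."""
    groups = np.asarray(groups)
    labels = np.unique(groups)
    pvals = []
    for j in range(abundance.shape[1]):
        per_group = [abundance[groups == g, j] for g in labels]
        pvals.append(kruskal(*per_group).pvalue)
    return np.array(pvals)

# Hypothetical data: 20 samples x 5 taxa, taxon 0 shifted upward in cases
rng = np.random.default_rng(0)
abundance = rng.random((20, 5))
abundance[:10, 0] += 2.0
groups = np.array(["case"] * 10 + ["control"] * 10)

pvals = kruskal_screen(abundance, groups)
qvals = benjamini_hochberg(pvals)
print(np.round(qvals, 4))
```

In a LEfSe-like workflow, only taxa passing this screen would move on to the pairwise consistency check and effect-size estimation.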
Module 5

Functional Profiling

COG category mapping (25 functional categories), KEGG pathway hypergeometric enrichment, metabolic guild classification, and antibiotic resistance gene (ARG) detection across 20 gene classes.

COG · KEGG · ARG · metabolic guilds
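The hypergeometric enrichment test mentioned above can be illustrated with a short sketch. The function name and all gene counts below are hypothetical; only the use of `scipy.stats.hypergeom` is standard.

```python
from scipy.stats import hypergeom

def pathway_enrichment_p(n_background, n_in_pathway, n_selected, n_overlap):
    """Upper-tail hypergeometric p-value: P(overlap >= n_overlap).

    n_background : genes in the annotated background
    n_in_pathway : background genes annotated to the pathway
    n_selected   : genes in the list being tested
    n_overlap    : selected genes that fall in the pathway
    """
    # sf(k) = P(X > k), so pass n_overlap - 1 to include n_overlap itself.
    return hypergeom.sf(n_overlap - 1, n_background, n_in_pathway, n_selected)

# Hypothetical numbers: 1000 background genes, 50 in the pathway,
# 40 selected genes, 8 of which hit the pathway (about 2 expected by chance).
p_value = pathway_enrichment_p(1000, 50, 40, 8)
print(p_value)
```

When many pathways are tested, the resulting p-values would normally be corrected for multiple testing (for example with BH FDR).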
Module 6

Co-occurrence Networks

SparCC-inspired correlation estimation for compositional data, node hub scoring, edge filtering by correlation threshold, and microbial interaction inference.

SparCC · Spearman ρ · hub taxa · interaction
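A minimal sketch of the network step as described here: Spearman correlation on CLR-transformed counts, edge filtering by threshold, and degree-based hub scoring. The function names, the pseudocount of 0.5, and the use of degree as the hub score are assumptions for illustration. Note that this is a simplification; the original SparCC algorithm works from log-ratio variances rather than CLR values.

```python
import numpy as np
from scipy.stats import spearmanr

def clr(counts, pseudocount=0.5):
    """Centered log-ratio per sample (rows); the pseudocount handles zeros."""
    logged = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return logged - logged.mean(axis=1, keepdims=True)

def cooccurrence_edges(counts, taxa, threshold=0.5):
    """Taxon pairs whose Spearman rho on CLR values exceeds the threshold.

    Expects at least three taxa, so that spearmanr returns a matrix.
    """
    rho, _ = spearmanr(clr(counts))            # columns are treated as variables
    edges = []
    for i in range(len(taxa)):
        for j in range(i + 1, len(taxa)):
            if abs(rho[i, j]) > threshold:
                edges.append((taxa[i], taxa[j], float(rho[i, j])))
    return edges

def hub_scores(edges):
    """Degree of each taxon in the filtered network (a simple hub score)."""
    degree = {}
    for a, b, _ in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return degree

# Hypothetical counts: 25 samples x 5 taxa
rng = np.random.default_rng(7)
counts = rng.poisson(50, size=(25, 5))
taxa = ["taxon_1", "taxon_2", "taxon_3", "taxon_4", "taxon_5"]
edges = cooccurrence_edges(counts, taxa, threshold=0.5)
print(edges, hub_scores(edges))
```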
02 — Diversity Metrics

Mathematically Rigorous, Compositional-aware

MetaGenomics implements published statistical methods that account for the compositional nature of microbiome data — the mathematical reason many naive analyses produce spurious results.

Alpha — Shannon H′
−Σ pᵢ log(pᵢ)
Information entropy; range [0, log(S)]
Alpha — Chao1
S + f₁² / (2f₂)
Richness estimator accounting for unseen species; f₁ and f₂ are the numbers of singleton and doubleton taxa
Beta — Bray-Curtis
Σ|aᵢ−bᵢ| / Σ(aᵢ+bᵢ)
Abundance-weighted dissimilarity; range [0,1]
Beta — Aitchison
‖CLR(a)−CLR(b)‖₂
CLR + Euclidean; composition-aware distance
Differential — CLR
log(xᵢ) − log GM(x)
Centered log-ratio; projects proportions to ℝⁿ
Network — SparCC
Spearman ρ on CLR(x)
Spearman on CLR values; a SparCC-inspired approximation that reduces compositional bias (the original SparCC works from log-ratio variances)
Why the CLR transform? Microbiome data is compositional: read counts carry only relative information, because each sample's proportions are constrained to sum to a constant. Pearson correlations on raw proportions are therefore biased toward spurious negatives (if one taxon rises, the others must fall). The centered log-ratio (CLR) maps each composition into real-valued log-ratio coordinates, where Euclidean distances and standard statistics are better behaved. This is the mathematical foundation of both ALDEx2 and the Aitchison distance.
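A small sketch of the CLR transform and the Aitchison distance from the formulas above. The function names and example vectors are assumptions; the point it illustrates is that the distance depends only on composition, not on sequencing depth.

```python
import numpy as np

def clr(x, pseudocount=0.0):
    """Centered log-ratio: log(x_i) - mean(log(x)) along the last axis.

    Zeros need a positive pseudocount, otherwise log(0) gives -inf.
    """
    logged = np.log(np.asarray(x, dtype=float) + pseudocount)
    return logged - logged.mean(axis=-1, keepdims=True)

def aitchison_distance(a, b, pseudocount=0.0):
    """Euclidean distance between CLR-transformed compositions."""
    return float(np.linalg.norm(clr(a, pseudocount) - clr(b, pseudocount)))

shallow = np.array([10, 20, 70])     # hypothetical sample
deep = shallow * 10                  # same composition, 10x the reads
shifted = np.array([10, 70, 20])     # same depth, different composition

print(aitchison_distance(shallow, deep))      # close to 0: depth does not matter
print(aitchison_distance(shallow, shifted))   # clearly above 0
```

Each CLR vector sums to zero, so the transformed data lie on a hyperplane; this is why methods that need full-rank covariance sometimes use other log-ratio transforms instead.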
03 — Biomarker Discovery

LEfSe & ALDEx2 for Differential Abundance

LEfSe (Segata et al. 2011, 6000+ citations) is a widely used method for microbiome biomarker discovery. ALDEx2 models compositional and sampling uncertainty by Monte Carlo sampling from a Dirichlet distribution, then tests the CLR-transformed draws.

LEfSe
Kruskal-Wallis → LDA
Biomarker discovery
ALDEx2
MC Dirichlet + CLR + Wilcoxon
Effect size + FDR
ANCOM-BC
Bias-corrected ANCOM
Compositionality-aware
DESeq2
Neg. binomial + size factors
Metagenomic count data
IBD
Inflammatory bowel disease (Crohn's / UC) vs. healthy
↓ F. prausnitzii, E. rectale, A. muciniphila
↑ R. gnavus, E. coli, B. fragilis
CRC
Colorectal cancer vs. healthy
↓ F. prausnitzii, Coprococcus, Bifidobacterium
↑ B. fragilis, E. coli, R. gnavus
Obesity
Obesity / metabolic syndrome vs. lean
↓ A. muciniphila, F. prausnitzii, B. uniformis
↑ M. smithii, Lachnospiraceae NK4A136
T2D
Type 2 diabetes vs. healthy
↓ A. muciniphila, Lactobacillus, Coprococcus
↑ B. fragilis, E. coli
04 — Pipeline

End-to-End in One Command

The full pipeline runs from OTU table → rarefaction → alpha/beta diversity → biomarker discovery → network inference → interactive dashboard in a single Python script.

Step 01
Data Input
Provide an OTU/ASV count table (taxa × samples) or generate a synthetic cohort for any of the four built-in conditions (IBD, CRC, obesity, T2D).
# From OTU table
python metagenomics.py --otu otu_table.csv --condition IBD

# Synthetic demo
python metagenomics.py IBD
Built-in taxon DB: 500 gut microbiome species with known disease associations
Step 02
Rarefaction + Alpha Diversity
Subsample all samples to minimum sequencing depth for fair comparison. Compute Shannon, Simpson, Chao1, Pielou's evenness, and dominance for each sample.
# Rarefaction: subsample every sample to a uniform depth, without replacement
depth = min(sample_totals)  # sample_totals: total read count per sample
# Alpha: Shannon, Simpson, Chao1, Pielou J, dominance
Outputs: otu_rarefied.csv, alpha_diversity.csv
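The rarefaction and alpha diversity step could look roughly like the sketch below. It is an illustration under stated assumptions, not the skill's implementation: the function names and toy table are invented here, Simpson is taken in its Gini-Simpson form (1 − Σpᵢ²), and Chao1 falls back to the bias-corrected form when there are no doubletons.

```python
import numpy as np

def rarefy(counts, depth, rng):
    """Subsample one sample's counts to `depth` reads without replacement."""
    counts = np.asarray(counts, dtype=np.int64)
    return rng.multivariate_hypergeometric(counts, depth)

def alpha_diversity(counts):
    """Shannon, Gini-Simpson, Chao1, and Pielou's evenness for one sample."""
    counts = np.asarray(counts, dtype=float)
    observed = counts[counts > 0]
    p = observed / observed.sum()
    richness = observed.size
    shannon = float(-(p * np.log(p)).sum())
    simpson = float(1.0 - (p ** 2).sum())
    f1 = int((counts == 1).sum())      # singleton taxa
    f2 = int((counts == 2).sum())      # doubleton taxa
    if f2 > 0:
        chao1 = richness + f1 ** 2 / (2 * f2)
    else:
        # Bias-corrected form, which stays defined when f2 == 0.
        chao1 = richness + f1 * (f1 - 1) / 2
    pielou = shannon / np.log(richness) if richness > 1 else 0.0
    return {"shannon": shannon, "simpson": simpson,
            "chao1": float(chao1), "pielou": float(pielou)}

# Hypothetical table: 3 samples x 5 taxa
table = np.array([[120, 30, 0, 5, 1],
                  [60, 60, 10, 2, 2],
                  [300, 10, 1, 1, 0]])
rng = np.random.default_rng(0)
depth = int(table.sum(axis=1).min())           # shallowest sample sets the depth
rarefied = np.array([rarefy(row, depth, rng) for row in table])
for row in rarefied:
    print(alpha_diversity(row))
```

After rarefaction every row sums to the same depth, so richness-type metrics are compared at equal sampling effort.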
Step 03
Beta Diversity + PERMANOVA
Compute pairwise Bray-Curtis / Aitchison / Jaccard distance matrix. Run PCoA ordination and PERMANOVA to test whether group membership explains community composition differences.
# Bray-Curtis: Σ|a-b| / Σ(a+b), summed over taxa
# PCoA: double-center -½·D², eigendecompose
# PERMANOVA: pseudo-F via group-label permutation (999x)
Expected: PERMANOVA p < 0.05 for IBD vs. control cohorts
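The PERMANOVA test in this step can be sketched as below, following the distance-based sums of squares from Anderson (2001). The function name, the seed, and the toy data are assumptions; the example uses Euclidean distances between simulated points as a stand-in for a Bray-Curtis matrix.

```python
import numpy as np

def permanova(dist, groups, n_perm=999, seed=0):
    """One-way PERMANOVA: pseudo-F and permutation p-value."""
    dist = np.asarray(dist, dtype=float)
    groups = np.asarray(groups)
    n = groups.size
    labels = np.unique(groups)
    a = labels.size
    sq = dist ** 2
    ss_total = sq[np.triu_indices(n, k=1)].sum() / n

    def pseudo_f(assignment):
        ss_within = 0.0
        for lab in labels:
            idx = np.flatnonzero(assignment == lab)
            sub = sq[np.ix_(idx, idx)]
            ss_within += sub[np.triu_indices(idx.size, k=1)].sum() / idx.size
        ss_between = ss_total - ss_within
        return (ss_between / (a - 1)) / (ss_within / (n - a))

    rng = np.random.default_rng(seed)
    observed = pseudo_f(groups)
    permuted = np.array([pseudo_f(rng.permutation(groups)) for _ in range(n_perm)])
    # Add 1 to numerator and denominator so the p-value is never exactly 0.
    p_value = (np.sum(permuted >= observed) + 1) / (n_perm + 1)
    return observed, p_value

# Hypothetical data: two well-separated groups of 8 samples each
rng = np.random.default_rng(42)
points = np.vstack([rng.normal(0, 1, (8, 2)), rng.normal(10, 1, (8, 2))])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
groups = np.array(["case"] * 8 + ["control"] * 8)

f_stat, p_value = permanova(dist, groups)
print(f_stat, p_value)
```

With 999 permutations the smallest attainable p-value is 0.001. A significant result can also reflect differences in within-group dispersion, so it is worth checking dispersion separately.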
Step 04
LEfSe Biomarker Discovery
Kruskal-Wallis test → pairwise Wilcoxon consistency check → LDA score estimation. Report taxa with |LDA| > 2.0 as condition-specific biomarkers. Also run ALDEx2 with 128 MC Dirichlet samples for FDR-controlled results.
# LEfSe: KW p<0.05 → pairwise Wilcoxon → LDA
# ALDEx2: 128 MC Dirichlet + CLR + Wilcoxon + BH FDR
Outputs: lefse_biomarkers.csv, cooccurrence_network.csv
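The ALDEx2-style step above (Monte Carlo Dirichlet draws, CLR, Wilcoxon rank-sum) can be approximated with the sketch below. It is a simplified illustration, not ALDEx2 itself: the function name, the uniform prior of 0.5, and the use of the mean CLR difference as the effect measure are assumptions, and the BH FDR correction is left as a follow-up step.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def aldex_like(counts, is_case, n_mc=128, prior=0.5, seed=0):
    """Expected rank-sum p-value and mean CLR difference per taxon.

    counts  : samples x taxa count matrix
    is_case : boolean mask over samples
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    n_taxa = counts.shape[1]
    pvals = np.zeros((n_mc, n_taxa))
    effects = np.zeros((n_mc, n_taxa))
    for m in range(n_mc):
        # One Dirichlet draw per sample: plausible proportions given the counts.
        props = np.vstack([rng.dirichlet(row + prior) for row in counts])
        logp = np.log(props)
        clr = logp - logp.mean(axis=1, keepdims=True)
        for j in range(n_taxa):
            case, ctrl = clr[is_case, j], clr[~is_case, j]
            pvals[m, j] = mannwhitneyu(case, ctrl, alternative="two-sided").pvalue
            effects[m, j] = case.mean() - ctrl.mean()
    # Average over Monte Carlo instances.
    return pvals.mean(axis=0), effects.mean(axis=0)

# Hypothetical cohort: 10 cases + 10 controls, 6 taxa, taxon 0 enriched in cases
rng = np.random.default_rng(1)
counts = rng.poisson(100, size=(20, 6))
is_case = np.array([True] * 10 + [False] * 10)
counts[is_case, 0] = rng.poisson(500, size=10)

expected_p, effect = aldex_like(counts, is_case)
print(np.round(expected_p, 4), np.round(effect, 2))
```

Because the CLR is relative to each sample's geometric mean, a strong increase in one taxon can shift the CLR values of the others; the expected p-values for the remaining taxa should be read with that in mind.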
Step 05
Visualization + Network
Generate a 6-panel interactive Plotly dashboard: taxonomic barplot, alpha diversity boxplots, PCoA ordination, LEfSe biomarker histogram, SparCC co-occurrence network, and summary table. ARG profile and functional COG summary also exported.
metagenomics_output/
├── metagenomics.html        # 6-panel interactive dashboard
├── alpha_diversity.csv
├── lefse_biomarkers.csv
├── cooccurrence_network.csv
├── arg_profile.csv
└── summary.json
Compositionality warning
Avoid Pearson correlations on raw relative abundances: the constant-sum constraint biases them toward spurious negative values. CLR-transform first, then compute correlations in CLR space, as the SparCC-inspired module here does.
Rarefaction corrects for unequal depth
Unequal sequencing depth confounds richness and most diversity metrics. Rarefaction (subsampling without replacement to a common depth) is a widely used correction for this bias, at the cost of discarding reads. Alternatively, use CLR or CSS normalization for rarefaction-free comparisons.
05 — Live Demo

Interactive Dashboard

Synthetic IBD cohort (15 cases, 15 controls, 80 taxa) generated by the built-in simulator. Shows all six analysis modules producing results in a single run.

Synthetic demo parameters: 30 samples (15 IBD / 15 healthy), 80 taxa, 47,216 rarefaction depth, Bray-Curtis + PERMANOVA (999 permutations), LEfSe (LDA > 2.0), SparCC network (|ρ| > 0.5). IBD-associated taxa: Faecalibacterium prausnitzii (depleted), Ruminococcus gnavus (enriched), Escherichia coli (enriched).

Full SKILL.md Content

The complete executable skill file used by AI agents. Reproduces the full analysis pipeline from data generation to interactive dashboard.

MetaGenomics: Shotgun Metagenomics & 16S rRNA Analysis Engine

Trigger

Use this skill when the user wants to:

  • Profile the taxonomic composition of a microbial community from 16S rRNA or WGS data
  • Compute alpha diversity (Shannon, Simpson, Chao1, Faith's PD) and beta diversity
  • Perform differential abundance analysis between sample groups (case vs. control)
  • Identify biomarker taxa associated with a condition (LEfSe-like analysis)
  • Annotate metagenomic reads with functional categories (COG, KEGG pathways)
  • Detect antibiotic resistance genes (ARGs) in a metagenomic sample
  • Analyze enterotypes and gut microbiome community structure

Quick Start

pip install numpy scipy pandas scikit-learn plotly matplotlib requests --break-system-packages -q

python metagenomics.py IBD
# Open metagenomics_output/metagenomics.html

Demo Conditions

| Condition | Description | Key Biomarkers |
|---|---|---|
| IBD | Inflammatory bowel disease | F. prausnitzii ↓, R. gnavus ↑ |
| CRC | Colorectal cancer | Coprococcus ↓, B. fragilis ↑ |
| obesity | Metabolic syndrome | A. muciniphila ↓, M. smithii ↑ |
| T2D | Type 2 diabetes | Lactobacillus ↓, E. coli ↑ |

Output Files

| File | Description |
|---|---|
| metagenomics.html | 6-panel interactive Plotly dashboard |
| otu_rarefied.csv | Rarefied OTU count table (taxa × samples) |
| alpha_diversity.csv | Shannon, Simpson, Chao1, evenness per sample |
| lefse_biomarkers.csv | Differentially abundant taxa + LDA scores |
| cooccurrence_network.csv | SparCC pairwise microbial correlations |
| arg_profile.csv | Antibiotic resistance gene abundances |
| summary.json | Machine-readable analysis summary |

Dependencies

numpy>=1.24, scipy>=1.10, pandas>=1.5, scikit-learn>=1.3, plotly>=5.15

Python 3.9+. CPU only. No QIIME2, mothur, HUMAnN3, or R required.

Scientific References

  • Anderson, M.J. (2001). PERMANOVA. Austral Ecology
  • Segata, N. et al. (2011). LEfSe. Genome Biology
  • Fernandes, A.D. et al. (2014). ALDEx2. Microbiome
  • Friedman, J. & Alm, E.J. (2012). SparCC. PLoS Comput Biol
  • Gloor, G.B. et al. (2017). Microbiome datasets are compositional. Front. Microbiol
07 — Reproduce

Clone and Run

Full reproducibility in two commands. The skill handles everything from data generation to the interactive dashboard.

# Clone the repository
git clone https://github.com/junior1p/MetaGenomics.git
cd MetaGenomics

# Install dependencies
pip install numpy scipy pandas scikit-learn plotly matplotlib requests \
    --break-system-packages -q

# Run IBD demo (15 cases + 15 controls, 80 taxa)
python metagenomics.py IBD

# Open the interactive 6-panel dashboard (macOS; use xdg-open on Linux)
open metagenomics_output/metagenomics.html

# Try other conditions
python metagenomics.py CRC        # colorectal cancer
python metagenomics.py obesity    # metabolic syndrome
python metagenomics.py T2D        # type 2 diabetes
Max

Shotgun metagenomics analysis, differential abundance, and microbial ecology