Pure NumPy/SciPy/scikit-learn microbial community analysis — rarefaction, alpha/beta diversity, LEfSe biomarker discovery, ALDEx2 differential abundance, SparCC co-occurrence networks, and ARG detection. No QIIME2, no mothur, no HUMAnN3.
MetaGenomics bundles six analysis modules, each built on published statistical methods, into a single pure-Python pipeline. Every module is implemented from first principles, with no external bioinformatics frameworks required.
Rarefaction (subsampling to uniform depth), relative abundance normalization, taxonomic roll-up from OTU to phylum level, and stacked composition barplots.
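As an illustration of the rarefaction step, here is a minimal sketch of subsampling without replacement. The function name `rarefy` and the taxa × samples orientation are assumptions made for this example, not necessarily what `metagenomics.py` does internally.

```python
import numpy as np

def rarefy(counts, depth=None, seed=0):
    """Subsample every sample (column) to a common depth without replacement.

    counts: 2-D integer array, taxa x samples (orientation assumed here).
    depth:  target reads per sample; defaults to the smallest sample total.
    """
    counts = np.asarray(counts, dtype=np.int64)
    totals = counts.sum(axis=0)
    if depth is None:
        depth = int(totals.min())
    if depth > totals.min():
        raise ValueError("depth exceeds the smallest sample total")
    rng = np.random.default_rng(seed)
    rarefied = np.empty_like(counts)
    for j in range(counts.shape[1]):
        # Drawing `depth` reads without replacement from one sample is a
        # multivariate hypergeometric draw over that sample's taxon counts.
        column = np.ascontiguousarray(counts[:, j])
        rarefied[:, j] = rng.multivariate_hypergeometric(column, depth)
    return rarefied

otu = np.array([[50, 5],
                [30, 10],
                [20, 5]])
rare = rarefy(otu)
print(rare.sum(axis=0))  # every column now sums to the common depth
```

Samples whose total falls below a user-chosen `depth` would have to be dropped before calling a function like this; the sketch simply raises an error in that case.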
Shannon entropy, Simpson index, Chao1 richness estimator, Pielou's evenness, Faith's phylogenetic diversity, and rarefaction curves for species discovery assessment.
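The count-based indices above can be sketched in a few lines of NumPy. The function name `alpha_diversity` is hypothetical, the Chao1 variant shown is the bias-corrected form, and Faith's phylogenetic diversity is left out because it needs a phylogenetic tree.

```python
import numpy as np

def alpha_diversity(counts):
    """Alpha diversity indices for one sample's taxon counts (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]            # only observed taxa contribute
    p = counts / counts.sum()              # relative abundances
    s_obs = counts.size                    # observed richness
    shannon = -np.sum(p * np.log(p))       # natural-log Shannon entropy
    simpson = 1.0 - np.sum(p ** 2)         # Gini-Simpson index
    pielou = shannon / np.log(s_obs) if s_obs > 1 else 0.0
    f1 = np.sum(counts == 1)               # singletons
    f2 = np.sum(counts == 2)               # doubletons
    chao1 = s_obs + f1 * (f1 - 1) / (2.0 * (f2 + 1))  # bias-corrected Chao1
    return {"shannon": shannon, "simpson": simpson,
            "pielou": pielou, "chao1": chao1}

even = alpha_diversity([10, 10, 10, 10])
sparse = alpha_diversity([5, 1, 1, 2])
print(even)
print(sparse)
```

For a perfectly even sample of four taxa, Shannon entropy equals ln 4 and Pielou's evenness equals 1, which makes a convenient sanity check.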
Bray-Curtis dissimilarity, Aitchison distance (CLR + Euclidean), Jaccard presence/absence, PCoA eigendecomposition, PERMANOVA, and ANOSIM for group separation testing.
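A minimal sketch of the Bray-Curtis and PCoA steps follows. The helper names `bray_curtis` and `pcoa` and the samples × taxa orientation are assumptions for illustration. Negative eigenvalues, which can occur with non-Euclidean dissimilarities such as Bray-Curtis, are clipped to zero here; that is one common simplification, not the only option.

```python
import numpy as np

def bray_curtis(X):
    """Pairwise Bray-Curtis dissimilarity; X is samples x taxa (assumed)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            denom = (X[i] + X[j]).sum()
            d = np.abs(X[i] - X[j]).sum() / denom if denom > 0 else 0.0
            D[i, j] = D[j, i] = d
    return D

def pcoa(D, n_axes=2):
    """Classical PCoA: double-center the squared distances, then eigendecompose."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # Gower-centered matrix
    eigvals, eigvecs = np.linalg.eigh(B)       # B is symmetric
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    kept = np.clip(eigvals[:n_axes], 0.0, None)
    coords = eigvecs[:, :n_axes] * np.sqrt(kept)
    explained = kept / eigvals[eigvals > 0].sum()
    return coords, explained

X = np.array([[10, 0], [0, 10], [5, 5]])
D = bray_curtis(X)
coords, explained = pcoa(D)
print(D)
print(coords[:, 0], explained)
```

In this toy example the three samples lie on a line (dissimilarities 1, 0.5, 0.5), so the first axis carries essentially all of the variation.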
LEfSe (Kruskal-Wallis → Wilcoxon → LDA effect size), ALDEx2 (Monte Carlo Dirichlet + CLR + Wilcoxon), DESeq2-inspired (negative binomial), and ANCOM-BC bias-corrected analysis.
COG category mapping (25 functional categories), KEGG pathway hypergeometric enrichment, metabolic guild classification, and antibiotic resistance gene (ARG) detection across 20 gene classes.
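The hypergeometric enrichment test can be sketched with `scipy.stats.hypergeom`. The function name `pathway_enrichment_p` and the gene counts in the usage example are made up for illustration; how the pipeline actually defines its background set is not specified here.

```python
from scipy.stats import hypergeom

def pathway_enrichment_p(n_background, n_in_pathway, n_selected, n_overlap):
    """One-sided over-representation p-value, P(X >= n_overlap).

    n_background: genes in the background set
    n_in_pathway: background genes annotated to the pathway
    n_selected:   genes in the list being tested
    n_overlap:    selected genes that fall in the pathway
    """
    # sf(k - 1) gives P(X > k - 1) = P(X >= k) for a discrete distribution.
    return float(hypergeom.sf(n_overlap - 1, n_background, n_in_pathway, n_selected))

p_strong = pathway_enrichment_p(1000, 50, 40, 10)  # expected overlap is about 2
p_weak = pathway_enrichment_p(1000, 50, 40, 2)
p_none = pathway_enrichment_p(1000, 50, 40, 0)
print(p_strong, p_weak, p_none)
```

When many pathways are tested at once, the resulting p-values would normally be corrected for multiple testing before interpretation.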
SparCC-inspired correlation estimation for compositional data, node hub scoring, edge filtering by correlation threshold, and microbial interaction inference.
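To illustrate the network-building idea, the sketch below uses a simpler stand-in: Pearson correlation on CLR-transformed counts, followed by threshold filtering and a degree-based hub score. This is not the SparCC algorithm itself (which iteratively estimates correlations from log-ratio variances), and the names `clr`, `cooccurrence_edges`, the pseudocount, and the 0.6 threshold are all assumptions.

```python
import numpy as np

def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform; counts is samples x taxa (assumed)."""
    logx = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return logx - logx.mean(axis=1, keepdims=True)   # center within each sample

def cooccurrence_edges(counts, taxa, threshold=0.6):
    """Return (edges, degree): edges with |r| >= threshold and per-taxon degree."""
    r = np.corrcoef(clr(counts), rowvar=False)       # taxa x taxa correlations
    edges = []
    degree = {t: 0 for t in taxa}                    # simple hub score
    for i in range(len(taxa)):
        for j in range(i + 1, len(taxa)):
            if abs(r[i, j]) >= threshold:
                edges.append((taxa[i], taxa[j], float(r[i, j])))
                degree[taxa[i]] += 1
                degree[taxa[j]] += 1
    return edges, degree

# Toy data: A and B rise together, C falls as they rise, D stays flat.
counts = np.array([[10, 20, 100, 50],
                   [20, 40, 50, 50],
                   [40, 80, 25, 50],
                   [80, 160, 12, 50],
                   [160, 320, 6, 50]])
edges, degree = cooccurrence_edges(counts, ["A", "B", "C", "D"])
for a, b, r in edges:
    print(f"{a} -- {b}: r = {r:+.2f}")
```

Note that taxon D, whose raw counts never change, still picks up correlations after the CLR transform. That residual compositional effect is one reason methods such as SparCC exist.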
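MetaGenomics implements published statistical methods that account for the compositional nature of microbiome data: sequencing counts carry only relative information, so an increase in one taxon forces apparent decreases in the others. This is the mathematical reason many naive analyses produce spurious results.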
LEfSe (Segata et al. 2011, 6000+ citations) is the field standard for microbiome biomarker discovery. ALDEx2 models compositional uncertainty by drawing Monte Carlo samples from a Dirichlet distribution for each sample, applying the CLR transform to every draw, and summarizing the test results across draws.
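The Monte Carlo Dirichlet idea can be sketched as follows. This is a simplified illustration in the spirit of ALDEx2, not the package's implementation: the function names, the uniform prior of 0.5, the toy data, and the choice to average p-values across draws are assumptions of this example.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]   # enforce monotonicity
    out = np.empty(n)
    out[order] = np.clip(ranked, 0.0, 1.0)
    return out

def aldex_like(counts, is_case, n_mc=128, prior=0.5, seed=0):
    """counts: samples x taxa; is_case: boolean mask over samples (both assumed)."""
    counts = np.asarray(counts, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    rng = np.random.default_rng(seed)
    p_sum = np.zeros(counts.shape[1])
    q_sum = np.zeros(counts.shape[1])
    for _ in range(n_mc):
        # One Dirichlet draw per sample gives a plausible set of proportions.
        props = np.vstack([rng.dirichlet(row + prior) for row in counts])
        z = np.log(props)
        z -= z.mean(axis=1, keepdims=True)               # CLR per sample
        p = mannwhitneyu(z[is_case], z[~is_case], axis=0,
                         alternative="two-sided").pvalue
        p_sum += p
        q_sum += bh_adjust(p)
    return p_sum / n_mc, q_sum / n_mc                    # expected p and BH values

# Toy data: 8 cases and 8 controls, 10 taxa, taxon 0 elevated in cases.
rng = np.random.default_rng(1)
lam = np.full((16, 10), 100.0)
lam[:8, 0] = 1000.0
toy_counts = rng.poisson(lam)
is_case = np.arange(16) < 8
p_mean, q_mean = aldex_like(toy_counts, is_case)
print(np.round(q_mean, 4))
```

Because the CLR reference shifts when one taxon changes strongly, the other taxa in this toy example may also show small adjusted p-values; effect sizes are usually inspected alongside them.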
The full pipeline runs from OTU table → rarefaction → alpha/beta diversity → biomarker discovery → network inference → interactive dashboard in a single Python script.
# From OTU table
python metagenomics.py --otu otu_table.csv --condition IBD
# Synthetic demo
python metagenomics.py IBD
# Rarefaction: uniform depth sampling without replacement
depth = min(sample_totals)
# Alpha: Shannon, Simpson, Chao1, Pielou J, dominance
# Bray-Curtis: sum|a-b| / sum(a+b), summed over taxa
# PCoA: double-center D², eigendecompose
# PERMANOVA: pseudo-F via group permutation (999x)
# LEfSe: KW p<0.05 → pairwise Wilcoxon → LDA
# ALDEx2: 128 MC Dirichlet + CLR + Wilcoxon + BH FDR
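The PERMANOVA step listed above can be sketched as a pseudo-F statistic computed from squared distances plus a label-permutation test. The function names `pseudo_f` and `permanova` are hypothetical, and the toy distance matrix is made up for illustration.

```python
import numpy as np

def pseudo_f(D, labels):
    """PERMANOVA pseudo-F from a square distance matrix and group labels."""
    D2 = np.asarray(D, dtype=float) ** 2
    labels = np.asarray(labels)
    n = labels.size
    groups = np.unique(labels)
    # Total sum of squares from all pairwise squared distances.
    ss_total = D2[np.triu_indices(n, k=1)].sum() / n
    ss_within = 0.0
    for g in groups:
        idx = np.flatnonzero(labels == g)
        sub = D2[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(idx.size, k=1)].sum() / idx.size
    ss_among = ss_total - ss_within
    a = groups.size
    return (ss_among / (a - 1)) / (ss_within / (n - a))

def permanova(D, labels, n_perm=999, seed=0):
    """Return (observed pseudo-F, permutation p-value)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    f_obs = pseudo_f(D, labels)
    hits = 0
    for _ in range(n_perm):
        if pseudo_f(D, rng.permutation(labels)) >= f_obs:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return f_obs, (hits + 1) / (n_perm + 1)

# Toy example: two well-separated groups on a line.
points = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
D = np.abs(points[:, None] - points[None, :])
labels = np.array(["case"] * 3 + ["control"] * 3)
f_obs, p_value = permanova(D, labels)
print(f_obs, p_value)
```

With only six samples there are just 20 distinct ways to split them into two groups of three, so the smallest achievable p-value is about 0.1 no matter how well the groups separate. Real cohorts, such as the 15 versus 15 demo, do not hit this floor.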
metagenomics_output/
├── metagenomics.html # 6-panel interactive dashboard
├── alpha_diversity.csv
├── lefse_biomarkers.csv
├── cooccurrence_network.csv
├── arg_profile.csv
└── summary.json
Synthetic IBD cohort (15 cases, 15 controls, 80 taxa) generated by the built-in simulator. Shows all six analysis modules producing results in a single run.
The complete executable skill file used by AI agents. Reproduces the full analysis pipeline from data generation to interactive dashboard.
Use this skill when the user wants to:
pip install numpy scipy pandas scikit-learn plotly matplotlib requests --break-system-packages -q
python metagenomics.py IBD
# Open metagenomics_output/metagenomics.html
| Condition | Description | Key Biomarkers |
|---|---|---|
| IBD | Inflammatory bowel disease | F. prausnitzii ↓, R. gnavus ↑ |
| CRC | Colorectal cancer | Coprococcus ↓, B. fragilis ↑ |
| obesity | Metabolic syndrome | A. muciniphila ↓, M. smithii ↑ |
| T2D | Type 2 diabetes | Lactobacillus ↓, E. coli ↑ |
| File | Description |
|---|---|
| metagenomics.html | 6-panel interactive Plotly dashboard |
| otu_rarefied.csv | Rarefied OTU count table (taxa × samples) |
| alpha_diversity.csv | Shannon, Simpson, Chao1, evenness per sample |
| lefse_biomarkers.csv | Differentially abundant taxa + LDA scores |
| cooccurrence_network.csv | SparCC pairwise microbial correlations |
| arg_profile.csv | Antibiotic resistance gene abundances |
| summary.json | Machine-readable analysis summary |
numpy>=1.24, scipy>=1.10, pandas>=1.5, scikit-learn>=1.3, plotly>=5.15
Python 3.9+. CPU only. No QIIME2, mothur, HUMAnN3, or R required.
Full reproducibility in two commands. The skill handles everything from data generation to the interactive dashboard.
# Clone the repository
git clone https://github.com/junior1p/MetaGenomics.git
cd MetaGenomics
# Install dependencies
pip install numpy scipy pandas scikit-learn plotly matplotlib requests \
--break-system-packages -q
# Run IBD demo (15 cases + 15 controls, 80 taxa)
python metagenomics.py IBD
# Open the interactive 6-panel dashboard
open metagenomics_output/metagenomics.html # macOS; on Linux use xdg-open
# Try other conditions
python metagenomics.py CRC # colorectal cancer
python metagenomics.py obesity # metabolic syndrome
python metagenomics.py T2D # type 2 diabetes