[Forge] Immune repertoire pipeline - TCR/BCR FASTQ to MiXCR clones to epitope-link artifact open

MiXCR clonotyping + diversity metrics + IEDB epitope-matching into a clonotype-to-epitope linkage artifact.

Effort: thorough

Goal

Build an immune-receptor repertoire pipeline for the immunology
vertical: ingest TCR/BCR sequencing FASTQ (or a precomputed AIRR
TSV), call clonotypes with MiXCR, compute repertoire diversity
(Shannon, Gini, Hill), match clonotypes against IEDB epitopes via the
new iedb_epitopes tool, and emit a clonotype-to-epitope linkage
artifact a debate can cite. Closes the immunology-vertical's biggest
data gap: SciDEX has no way to argue from actual repertoire data today.

Why this matters

Immunology hypotheses ("expanded autoreactive TCRs drive RA flares",
"hospital-acquired SARS-CoV-2 strains evade convalescent BCR
responses") need real repertoire evidence to be debate-grade. Without
a pipeline, the immunology Theorist has nothing to ground claims on
beyond text reviews. This pipeline absorbs publicly available AIRR
datasets (10X Genomics, ImmuneSpace, AIRR-DB) and turns them into
SciDEX artifacts the persona pack can argue from.

Acceptance Criteria

☐ New module scidex/forge/immune_repertoire.py (≤700 LoC):
- ingest(source) — accepts FASTQ paths, an AIRR-format TSV,
or a 10X cellranger-vdj output dir.
- call_clonotypes_mixcr(fastqs) — invokes MiXCR via subprocess;
returns AIRR-format clones table.
- diversity_metrics(clones) — computes Shannon entropy, Gini,
Hill numbers (q=1, q=2), Chao1 estimator.
- link_to_epitopes(clones) — calls
tools.iedb_epitopes per clonotype CDR3; returns matches
scored by Levenshtein edit distance (≤ 2 = match,
3 to 4 = candidate).
- pipeline(source, chain='TRB') — composes; commits artifact
under data/scidex-artifacts/immune_repertoire/<run_id>/
with the clones table, diversity JSON, and linkage CSV.
☐ Migration repertoire_run(run_id PRIMARY KEY, source_kind,
source_spec_json, chain TEXT CHECK (chain IN ('TRA','TRB','IGH','IGK','IGL')),
n_clonotypes, shannon, gini, n_epitope_matches, mixcr_version,
pipeline_version, started_at, finished_at, artifact_id).
☐ tools.py registers immune_repertoire_pipeline(source,
chain) with @log_tool_call.
☐ /artifacts/<id> renders a clonotype-frequency rank plot
(Pareto), a diversity-metric panel, and a clonotype-to-epitope
table linking out to the matched IEDB record.
☐ Immunology persona pack
(q-vert-vertical-personas-pack) consumes a
repertoire_block when a debate's hypothesis names a disease
with a recent run.
☐ Acceptance: small public AIRR dataset (e.g., a 10X demo) runs
end-to-end in <20 min, produces ≥1 epitope match, artifact
registered.
☐ Tests: tests/test_immune_repertoire.py — synthetic AIRR table
→ diversity metrics in expected ranges, mock IEDB linkage
returns expected matches.
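
As an illustration of what diversity_metrics(clones) might compute, here is a minimal sketch. It is not the module's implementation. It takes a plain list of clone counts rather than an AIRR table, and it uses only the standard library so it runs without dependencies (the Approach section calls for NumPy, and the formulas port directly to array operations). It also assumes natural-log Shannon entropy and the bias-corrected form of Chao1, neither of which the spec pins down.

```python
import math
from typing import Dict, Sequence


def diversity_metrics(counts: Sequence[int]) -> Dict[str, float]:
    """Diversity summary for one repertoire, given per-clonotype read/cell counts."""
    counts = [c for c in counts if c > 0]
    if not counts:
        raise ValueError("no clonotypes with positive counts")

    n = len(counts)                      # observed richness
    total = sum(counts)
    freqs = [c / total for c in counts]  # clonotype frequencies, sum to 1

    # Shannon entropy (natural log); Hill number q=1 is its exponential.
    shannon = -sum(p * math.log(p) for p in freqs)
    hill_q1 = math.exp(shannon)

    # Hill number q=2 is the inverse Simpson concentration.
    hill_q2 = 1.0 / sum(p * p for p in freqs)

    # Gini coefficient from ascending-sorted counts:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with i = 1..n.
    ranked = sorted(counts)
    weighted = sum((i + 1) * c for i, c in enumerate(ranked))
    gini = 2.0 * weighted / (n * total) - (n + 1) / n

    # Bias-corrected Chao1: S_obs + f1 * (f1 - 1) / (2 * (f2 + 1)),
    # where f1 and f2 are the numbers of singleton and doubleton clonotypes.
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    chao1 = n + f1 * (f1 - 1) / (2.0 * (f2 + 1))

    return {
        "shannon": shannon,
        "hill_q1": hill_q1,
        "hill_q2": hill_q2,
        "gini": gini,
        "chao1": chao1,
    }
```

For four equally abundant clonotypes this gives Shannon = ln 4, both Hill numbers equal to 4, and Gini = 0, which is the kind of expected-range check tests/test_immune_repertoire.py could assert on a synthetic table.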

Approach

  • MiXCR has a free academic license; ship install instructions in
    docs/setup/mixcr.md. Subprocess wrapper handles installed +
    missing cases gracefully.
  • Diversity formulas implemented once in pure NumPy.
  • Levenshtein matching uses python-Levenshtein (lightweight).
  • Cache IEDB lookups by CDR3 hash to avoid repeat calls.
  • Persona injection mirrors the prior pattern.
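
The matching and caching notes above can be sketched as follows. Several things here are assumptions for illustration only. The return shape of iedb_epitopes is not given in this spec, so the sketch takes a hypothetical lookup callable that returns a list of dicts with cdr3 and epitope_id keys. The edit distance is a pure-Python stand-in for python-Levenshtein so the example runs without that package. The cache is an in-memory dict keyed by a SHA-256 hash of the CDR3 string; the real cache backend is unspecified.

```python
import hashlib
from typing import Callable, Dict, List, Optional

MATCH_MAX_DISTANCE = 2      # distance <= 2 counts as a match
CANDIDATE_MAX_DISTANCE = 4  # distance 3 or 4 counts as a candidate


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def classify(distance: int) -> Optional[str]:
    """Map an edit distance onto the spec's match / candidate tiers."""
    if distance <= MATCH_MAX_DISTANCE:
        return "match"
    if distance <= CANDIDATE_MAX_DISTANCE:
        return "candidate"
    return None


def link_to_epitopes(
    cdr3s: List[str],
    lookup: Callable[[str], List[Dict[str, str]]],  # hypothetical iedb_epitopes wrapper
    cache: Optional[Dict[str, List[Dict[str, str]]]] = None,
) -> List[Dict[str, object]]:
    """Return one linkage row per (clonotype CDR3, reference record) within threshold."""
    cache = {} if cache is None else cache
    links: List[Dict[str, object]] = []
    for cdr3 in cdr3s:
        key = hashlib.sha256(cdr3.encode("ascii")).hexdigest()
        if key not in cache:  # only call the lookup once per distinct CDR3
            cache[key] = lookup(cdr3)
        for record in cache[key]:
            distance = edit_distance(cdr3, record["cdr3"])
            label = classify(distance)
            if label is not None:
                links.append({"cdr3": cdr3,
                              "epitope_id": record["epitope_id"],
                              "distance": distance,
                              "label": label})
    return links
```

Passing the same cache dict across runs would let repeated CDR3s skip the lookup entirely; whether the cache should persist between pipeline runs is a design choice the spec leaves open.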

Dependencies

  • MiXCR (subprocess); iedb_epitopes from
    q-vert-vertical-evidence-providers.
  • q-vert-vertical-personas-pack: immunology-expert consumer.
  • data/scidex-artifacts/ submodule.
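
Because MiXCR is an external dependency reached through subprocess, the wrapper needs a defined behaviour when the binary is absent. The following is one possible sketch, under stated assumptions: the executable is named mixcr and is discoverable on PATH, the source_kind values "fastq" and "airr_tsv" are hypothetical labels, and a precomputed AIRR TSV is treated as not needing MiXCR at all. The sketch deliberately stops before building any MiXCR command line, since the exact invocation is outside this spec.

```python
import shutil
from typing import Optional


class MixcrNotInstalled(RuntimeError):
    """Raised when a run needs MiXCR but no executable can be found."""


def find_mixcr(executable: str = "mixcr") -> Optional[str]:
    """Return the absolute path of the MiXCR executable, or None if it is not on PATH."""
    return shutil.which(executable)


def require_mixcr(source_kind: str, executable: str = "mixcr") -> Optional[str]:
    """Resolve MiXCR only when the source actually needs clonotype calling.

    Returns None for a precomputed AIRR TSV (hypothetical kind "airr_tsv"),
    the executable path for sources that need MiXCR, and raises otherwise.
    """
    if source_kind == "airr_tsv":
        return None  # clones are already called; MiXCR is not required
    path = find_mixcr(executable)
    if path is None:
        raise MixcrNotInstalled(
            "MiXCR was not found on PATH; see docs/setup/mixcr.md, "
            "or supply a precomputed AIRR TSV instead of FASTQ"
        )
    return path
```

Raising a dedicated exception lets pipeline() report a missing install as a clear setup problem rather than as a failed subprocess call.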

Work Log
