Quest: Competitive Biotools — Compete, Learn, Co-Adapt with Biomni + K-Dense
Layer: Cross-cutting
Priority: P94
Status: active
Tracked competitors
- Biomni (Stanford / Phylo, Apache 2.0, $13.5M seed, 7,000+ labs) — agent
runs end-to-end biomedical analyses; 150 tools, Biomni-R0 RL model,
Biomni-Eval1 benchmark. Profile:
docs/bio_competitive/biomni_profile.md.
- K-Dense (Biostate AI / K-Dense AI, Accel + Dario Amodei, 29.2% BixBench)
— hierarchical dual-loop planner+executor; 133 open-source Agent Skills,
250+ databases, 500K+ Python packages. Profile:
docs/bio_competitive/k_dense_profile.md.
- Amass Tech — enterprise scientific-intelligence SaaS (40M+155M+235M
corpus, GEMA citation-backed Q&A). Profile:
docs/bio_competitive/amass_profile.md.
- Amazon Bio Discovery (AWS, launched April 2026) — enterprise agentic AI
for drug development. Tracked inside
amass_profile.md.
- Alpha1 Science (late-2025 launch) — biomedical-specific Rigor Check
agent: 2 independent AI evaluators, 8 methodological-rigor dimensions
grounded in NIH / MDAR / ARRIVE 2.0 / CONSORT / EQUATOR, every rating
carries an evidence citation from the paper's own text. Profile:
docs/bio_competitive/alpha1_science_profile.md.
- OpenAI PRISM (launched 2026-01-27) — free LaTeX-native AI workspace for
scientists, powered by GPT-5.2; Paper Review feature added April 2026 but
with no biomedical-guideline basis. Adjacent product category — tracked
for positioning, not absorption. Profile:
docs/bio_competitive/openai_prism_profile.md.
Vision
Biomni (Stanford / Phylo, Apache 2.0, $13.5M seed, 7,000+ labs) and K-Dense
(Biostate AI / K-Dense AI, backed by Accel and Dario Amodei, 29.2% BixBench vs
GPT-4's 22.9%) own the mindshare for "agent runs the biomedical analysis
end-to-end." Both are strong, well-funded, and moving fast — Biomni with 150
tools / 59 databases / a Biomni-R0 RL reasoning model / a 433-instance
Biomni-Eval1 benchmark, K-Dense with 133 open-source Agent Skills / 250+
databases / 500K+ packages / a hierarchical dual-loop planner+executor. SciDEX
cannot ignore either of them, and cannot blindly copy either of them.
Our differentiation is not "agent runs the analysis." It is **world model +
debate + market + resource awareness**: agents generate / debate / score /
price hypotheses against a living knowledge graph, with every contribution
credited back through the token economy. Biomni and K-Dense run the analysis;
SciDEX runs the analysis and ingests its result as a hypothesis-anchored,
debated, market-priced contribution to the world model. This quest invests in
bringing our analysis execution to parity with Biomni — while keeping every
analysis wrapped in our epistemic / market / credit layer — and treats Biomni
and K-Dense as both competitors (whose mindshare we must answer) and upstream
tooling (whose open-source skills we can absorb).
The bet is that the epistemic layer is the durable moat. Anyone can fund more
tools; fewer can build a self-auditing market for scientific claims.
Non-goals
- Building a generic bioinformatics IDE. We are not trying to replace Biomni
Lab or K-Dense Analyst for users who just want to run a Scanpy pipeline.
- Registering every tool under the sun. The tool-growth freeze still applies;
K-Dense skills adoption (WS3) is a one-time structured absorption, not an
invitation to broad tool sprawl.
- Re-implementing Biomni/K-Dense infrastructure internals (datalake mirror,
PDF reports, Gradio UI). We reuse what's useful through API calls or the
skills registry — we do not fork.
- "Beating" Biomni on Biomni-Eval1 or K-Dense on BixBench in the abstract.
Benchmarks are a signal, not the product. We aim to be competitive, not
supreme, on pure analysis execution.
Principles
Every sophisticated analysis feeds the epistemic stack. Running a
survival analysis or a scRNA pipeline is not the deliverable. The
deliverable is a hypothesis-anchored artifact that enters the knowledge
graph, triggers a debate, moves a market price, and credits the sponsoring
agent. An analysis that doesn't close that loop is a stub.
Call upstream when upstream is better. When Biomni or K-Dense has a
well-tested recipe for a subtask (e.g. CRISPR primer design, ligand-receptor
inference), we call their API / invoke their skill rather than rebuild.
The SciDEX-unique value is the debate + market wrap, not the subroutine.
No stubs. Each showcase analysis in WS2 must produce publication-grade
artifacts ≥50KB, cite real datasets (SEA-AD / ABC Atlas / CZ Cellxgene / etc.,
per quest_real_data_pipeline_spec.md), and be consumable by the debate
pipeline without synthetic fallbacks.
Absorb best features, keep differentiation. Adopt K-Dense's skills
repo pattern, Biomni's GPU-as-a-tool, K-Dense's dual-loop plan+validate
— but keep WM / debate / markets / resource tracking as the frame. Do not
strip our differentiators to look more like Biomni.
Attribution is non-negotiable. Every skill call, every Biomni API
invocation, every ported analysis cites its upstream origin in the
artifact metadata and the Atlas wiki page. No laundering of others' work.
Workstreams
WS1: Competitive intelligence driver
A recurring agent scans the Biomni and K-Dense surface area and feeds an
internal digest. Sources: github.com/snap-stanford/Biomni and
github.com/K-Dense-AI/claude-scientific-skills (commits, releases, issues,
wiki changes), blog posts (biomni.stanford.edu, k-dense.ai), published papers
(bioRxiv 2025.05.30.656746, arXiv 2508.07043 and their citation graph), and
public social signal (HN, Twitter/X, LinkedIn where accessible). Aggregates
into a weekly markdown report at docs/bio_competitive/weekly/YYYY-MM-DD.md
with: new tools/skills released, new papers citing them, new funding /
customers / benchmark claims, deltas vs our capability map.
Delivers: task-id-pending_biotools_competitive_intel_spec.md (recurring,
weekly).
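The digest layout described above can be sketched as a small renderer. `build_weekly_digest` and the section titles are hypothetical names, assuming the scan step has already collected commit subjects, citing papers, and capability deltas:

```python
from datetime import date

def build_weekly_digest(report_date: date, commits: dict[str, list[str]],
                        papers: list[str], deltas: list[str]) -> str:
    """Render one weekly intel digest as markdown.

    `commits` maps a tracked repo (e.g. "snap-stanford/Biomni") to the
    commit subjects surfaced in the trailing week.
    """
    lines = [f"# Competitive intel: week of {report_date.isoformat()}", ""]
    lines.append("## New tools / skills (commits)")
    for repo, subjects in commits.items():
        lines.append(f"### {repo}")
        lines.extend(f"- {s}" for s in subjects)
    lines += ["", "## New papers citing tracked competitors"]
    lines.extend(f"- {p}" for p in papers)
    lines += ["", "## Deltas vs our capability map"]
    lines.extend(f"- {d}" for d in deltas)
    return "\n".join(lines) + "\n"

# Illustrative invocation with placeholder content.
digest = build_weekly_digest(
    date(2026, 5, 4),
    {"snap-stanford/Biomni": ["Add variant-annotation recipe"]},
    ["New bioRxiv preprint citing Biomni-Eval1"],
    ["K-Dense added a fine-mapping skill; our WS2 slot still open"],
)
```

The ≥8KB report-size criterion in Success criteria can then be checked against `len(digest.encode())` before the file lands.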
WS2: Analysis parity — Biomni's 15 showcase use cases
Port each of Biomni's 15 showcased use cases into SciDEX as a
hypothesis-anchored showcase analysis. The 15: spatial transcriptomics,
binder design, biomarker panel design, clinical trial landscaping, survival
analysis, scRNA-seq processing & annotation, cell-cell communication, novel
Cas13 primer design, proteomics differential expression, gene regulatory
network inference, gene co-expression networks, microbiome analysis,
polygenic risk scores, variant annotation, fine-mapping. For each, the
delivered artifact must carry:
(i) A hypothesis or knowledge gap from hypotheses or knowledge_gaps
that motivated running the analysis. If none exists, one must be generated
and debated before the analysis runs.
(ii) Artifacts ≥50KB — code, data outputs, figures, write-up — stored
under artifacts/ with a wiki entry cross-linking the dataset, the
hypothesis, and the upstream Biomni/K-Dense recipe we adapted.
(iii) Debate trace — a debate_sessions row where at least Theorist and
Skeptic weigh in on the analysis conclusion, with quality_score ≥ 0.6.
(iv) Market price update — a price_history row on the sponsoring
hypothesis with event_source pointing at the analysis artifact.
Delivers: task-id-pending_biomni_analysis_parity_spec.md (quest-coordinator;
spawns 5 parallel sub-agents, 3 analyses each).
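A coordinator-side acceptance gate for requirements (i)–(iv) could look like the sketch below. The record fields (`hypothesis_id`, `artifact_bytes`, `debate`, `price_history_id`) are assumed names for illustration, not the actual schema:

```python
MIN_ARTIFACT_BYTES = 50 * 1024  # the >=50KB gate from requirement (ii)

def passes_ws2_gate(record: dict) -> tuple[bool, list[str]]:
    """Check one ported analysis against requirements (i)-(iv)."""
    failures = []
    if not record.get("hypothesis_id"):                       # (i)
        failures.append("no sponsoring hypothesis")
    if record.get("artifact_bytes", 0) < MIN_ARTIFACT_BYTES:  # (ii)
        failures.append("artifact under 50KB")
    debate = record.get("debate") or {}
    has_core_roles = {"Theorist", "Skeptic"} <= set(debate.get("participants", []))
    if debate.get("quality_score", 0.0) < 0.6 or not has_core_roles:  # (iii)
        failures.append("debate missing, low-quality, or missing core roles")
    if not record.get("price_history_id"):                    # (iv)
        failures.append("no market price update")
    return (not failures, failures)
```

Returning the failure reasons, rather than a bare boolean, lets the coordinator file a concrete rejection ticket per analysis.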
WS3: K-Dense skills adoption
K-Dense's claude-scientific-skills repo (Apache 2.0, 133 skills) is directly
compatible with Claude-based agents — which is us. Run
npx skills add K-Dense-AI/claude-scientific-skills once inside the Forge
toolchain; wire each imported skill into our Forge tool registry as a
first-class tool so that skill invocations flow through the existing
@log_tool_call instrumentation, get priced through the resource intelligence
scorer, and credit the invoking agent through agent_contributions. Prefer
adoption over re-implementation: if K-Dense already wraps BioPython / pysam /
Scanpy / RDKit / DeepChem / ESM / OpenMM, we use their wrapper rather than
adding another entry to tools.py. A recurring sub-task checks for skills
repo updates and re-syncs.
Delivers: task-id-pending_kdense_skills_adoption_spec.md (one-shot install
+ registry wire-up, plus monthly refresh task).
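A minimal sketch of the registry wire-up, assuming hypothetical `register_skill` / `TOOL_REGISTRY` names and a stand-in for the real @log_tool_call decorator; the `gc_content` skill is illustrative, not an actual K-Dense skill:

```python
import functools

TOOL_REGISTRY: dict = {}
CALL_LOG: list = []

def log_tool_call(upstream: str):
    """Stand-in for the Forge @log_tool_call decorator: records every
    invocation together with its upstream attribution."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            CALL_LOG.append({"tool": fn.__name__, "upstream": upstream})
            return fn(*args, **kwargs)
        return wrapper
    return deco

def register_skill(name: str, fn, upstream_repo: str) -> None:
    """Wire an imported skill into the registry as a first-class tool,
    already wrapped in call logging."""
    TOOL_REGISTRY[name] = log_tool_call(upstream_repo)(fn)

# Illustrative skill body (not an actual K-Dense skill).
def gc_content(seq: str) -> float:
    return (seq.count("G") + seq.count("C")) / len(seq)

register_skill("gc_content", gc_content, "K-Dense-AI/claude-scientific-skills")
result = TOOL_REGISTRY["gc_content"]("GATTACA")
```

Because attribution is injected at registration time, no individual skill can be invoked without its upstream origin landing in the call log.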
WS4: Sandboxed GPU execution
Biomni's GPU-as-a-tool lets its agents fine-tune Borzoi / scGPT / ESM2 /
UniRef / ADMET models inside a sandbox. To port the 15 analyses honestly we
need at least one working end-to-end fine-tune. This workstream pilots
one model — scGPT preferred because it feeds directly into WS2's
scRNA-seq analyses — inside a bwrap sandbox with a GPU launcher that:
- Reserves the GPU via resource_tracker.
- Launches the fine-tune inside scripts/sandbox/run_gpu.sh with a network
  allow-list limited to model-weight CDNs and the dataset registry.
- Caps wall-time and VRAM; kills + cleans on overrun.
- Captures training logs, final weights, and validation metrics as artifacts.
- Credits the sponsoring agent and debits the pool via the resource
  allocation system (quest_economics_spec.md).
Success is one end-to-end scGPT fine-tune on an SEA-AD subset, artifacts
landed, debate triggered on the fine-tune's utility. Scope does not extend
to multi-model support in this quest.
Delivers: task-id-pending_gpu_sandbox_pilot_spec.md (one-shot).
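The wall-time cap can be sketched in Python as below. This is illustrative only: the real launcher wraps bwrap, reserves the GPU first, and enforces the network allow-list and VRAM cap, none of which a plain process timeout covers. It also assumes a POSIX host:

```python
import shlex
import subprocess

def run_sandboxed(cmd: str, wall_time_s: int) -> dict:
    """Run a job with a hard wall-time cap; kill and report on overrun.

    Sketch only: the real launcher wraps bwrap, reserves the GPU via
    resource_tracker first, and enforces the network allow-list and
    VRAM cap as well.
    """
    try:
        proc = subprocess.run(shlex.split(cmd), capture_output=True,
                              text=True, timeout=wall_time_s)
        return {"status": "ok", "returncode": proc.returncode,
                "stdout": proc.stdout}
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising on timeout
        return {"status": "killed", "reason": f"wall-time > {wall_time_s}s"}
```

The returned dict is what the artifact-capture step would serialize alongside the training logs.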
WS5: Epistemic layer wraps
For every analysis produced by WS2 (and going forward, every new analysis of
comparable scope), auto-trigger: (a) a multi-agent debate seeded with the
analysis conclusion; (b) a price update on the hypothesis the analysis
informs; (c) a resource-cost ledger entry debited from the sponsoring
agent's wallet via cost_ledger; (d) a follow-up gap if the analysis
exposed a new knowledge gap. This is how our differentiation from Biomni /
K-Dense becomes systemic rather than per-analysis.
Delivers: task-id-pending_analysis_debate_wrapper_spec.md
(recurring, every-6h).
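The fan-out of the four wrap actions (a)–(d) can be sketched as a pure function over one landed analysis; the record fields and action shapes here are assumptions for illustration, not the actual queue schema:

```python
def wrap_analysis(analysis: dict) -> list:
    """Return the follow-up actions the every-6h wrapper would enqueue
    for one landed analysis (field names are illustrative)."""
    actions = [
        {"kind": "debate", "seed": analysis["conclusion"]},          # (a)
        {"kind": "price_update",                                     # (b)
         "hypothesis_id": analysis["hypothesis_id"],
         "event_source": analysis["artifact_path"]},
        {"kind": "cost_ledger",                                      # (c)
         "agent": analysis["sponsor"],
         "amount": -analysis["compute_cost"]},
    ]
    if analysis.get("new_gap"):                                      # (d)
        actions.append({"kind": "knowledge_gap",
                        "text": analysis["new_gap"]})
    return actions

# Illustrative record for one WS2 analysis.
actions = wrap_analysis({
    "conclusion": "Survival differs by APOE status in SEA-AD subset",
    "hypothesis_id": "hyp-017",
    "artifact_path": "artifacts/ws2/survival_analysis.md",
    "sponsor": "agent-theorist-3",
    "compute_cost": 2.5,
    "new_gap": "no covariate data for vascular comorbidity",
})
```

Keeping the wrapper a pure function of the analysis record makes the "no orphaned analyses" success criterion checkable by replay.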
Success criteria
☐ WS1: Weekly competitive intel report landing every Monday; ≥90% of
Biomni/K-Dense commits in the trailing week surfaced within 7 days;
report file size ≥8KB; cited in at least one Senate decision within
30 days.
☐ WS2: 15/15 Biomni showcase analyses ported, each with hypothesis +
    artifacts ≥50KB + debate quality_score ≥ 0.6 + price update. Zero
    synthetic-data fallbacks.
☐ WS3: 133 K-Dense skills ingested into Forge registry; ≥30 skills
invoked by an agent in the first 60 days; logged through
@log_tool_call; monthly refresh runs without manual intervention.
☐ WS4: One scGPT fine-tune run end-to-end inside the sandbox, artifacts
    stored, resource cost reconciled against resource_allocations.
☐ WS5: 100% of analyses ≥50KB in the last 30 days have an associated
debate + price update + cost ledger entry. No orphaned analyses.
☐ Benchmark check-in: within 6 months, SciDEX scores a published
number on BixBench comparable to K-Dense Analyst (within 5 pts).
This is an informational check, not a pass/fail gate.
☐ Debate quality metric on wrapped analyses: mean quality_score on
    WS2-generated debates ≥ 0.65, 20% higher than the current all-analysis
    baseline (measurable via backfill_debate_quality.py).
Quality requirements
- Reference quest_quality_standards_spec.md and
  quest_real_data_pipeline_spec.md. No stubs: no empty notebooks, no <50KB
  artifacts, no 0-edge analyses, no "generic" debates.
- Parallel agents mandatory for batches ≥10 items. WS2 in particular runs as
5 parallel sub-agents covering 3 analyses each.
- All wrapped analyses must cite real datasets (SEA-AD / ABC Atlas /
Cellxgene / ClinicalTrials.gov / OpenTargets / etc.) per
quest_real_data_pipeline_spec.md. No simulated inputs.
- All skills / Biomni API calls logged through @log_tool_call with upstream
  attribution in the artifact metadata.
- Every new quest commit uses [Cross-cutting] or a layer-specific prefix
  with the task ID.
Parallel agent execution
- WS2 is explicitly parallel. 5 agents × 3 analyses = 15 showcase
analyses. Agents run concurrently, each responsible for a disjoint
3-analysis slice, coordinated by the WS2 quest-coordinator task. Sub-agent
  outputs merge through Orchestra sync push onto the coordinator's branch;
the coordinator runs integration tests + debate-wrap checks before
promoting.
- WS1 is single-agent recurring (weekly cadence, small batch, no parallelism
needed).
- WS3 is single-agent for the initial install, parallel only if the
post-install registry wire-up touches ≥10 skills per batch (which it will
— expect 133 skills split across 3–5 agents for the first pass).
- WS4 is single-agent (one pilot model, no parallelism useful).
- WS5 is single-agent recurring (every-6h, wraps whatever analyses landed
since last run).
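The disjoint-slice split for WS2 (and any future ≥10-item batch) can be sketched as a round-robin partition; `slice_batches` is a hypothetical helper name:

```python
def slice_batches(items: list, n_agents: int) -> list:
    """Partition a batch into disjoint, near-equal slices, one per
    agent, by round-robin assignment."""
    return [items[i::n_agents] for i in range(n_agents)]

# WS2 shape: 15 showcase analyses across 5 parallel sub-agents.
analyses = [f"analysis-{i:02d}" for i in range(15)]
slices = slice_batches(analyses, 5)
```

Round-robin keeps slices balanced even when the batch size is not an exact multiple of the agent count.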
Risks & mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Competitive intel access blocked — LinkedIn login walls, paywalled papers | High | Medium | Track the blocked sources in docs/bio_competitive/access_notes.md; escalate a manual-fetch request rather than fabricating content. Use GitHub + bioRxiv + arXiv + company blogs as the fallback spine. |
| Biomni-style exec() of LLM-generated code is unsafe without sandboxing | High | High | All WS2 analyses run inside the bwrap sandbox per quest_analysis_sandboxing_spec.md. No os.system / unsandboxed subprocess.run of LLM-generated code. WS4's GPU launcher extends the existing sandbox; does not bypass it. |
| GPU cost blows the budget | Medium | High | WS4 pilots one model. resource_tracker enforces wall-time and VRAM caps. Cost ledger debit runs before the job, not after, to prevent silent overrun. Senate cap on GPU hours per week. |
| K-Dense skills repo churn — upstream renames or deprecates skills between our syncs | Medium | Low | Monthly refresh task in WS3 diffs the registry and surfaces deletions as governance tickets rather than silently dropping them. |
| "Analysis parity" becomes a parade of shallow notebooks | Medium | High | quest_quality_standards_spec.md acceptance gate: <50KB artifact = reject; no-debate = reject; no-hypothesis = reject. Coordinator holds the gate. |
| Upstream license conflict — Biomni Apache 2.0 is permissive; any proprietary Phylo / K-Dense SaaS endpoints are not | Medium | Medium | WS1 intel report flags license changes. WS3 uses only the open Apache-2.0 skills repo, not K-Dense's SaaS endpoints. Senate reviews any API-call dependency on paid upstream services before adoption. |
Related quests
- quest_forge_spec.md — tool registry / sandboxing / tool-augmented analysis;
  WS3 skills adoption lands here.
- quest_real_data_pipeline_spec.md — real datasets for WS2 analyses; the 15
  showcase analyses cannot ship with synthetic data.
- quest_epistemic_rigor.md — debate + evidence + trust scoring infrastructure
  that WS5 hooks into; also the home of the new WS-rigor-ruleset workstream
  absorbing Alpha1 Science's 8-dim biomedical rigor rubric.
- quest_experiment_extraction_spec.md — structured experiment records are the
  ground truth that WS2 analyses compare their predictions against.
- artifact_enrichment_quest_spec.md — artifact quality gates that WS2
  deliverables must clear; the ≥50KB artifact requirement comes from here.
- quest_analysis_sandboxing_spec.md — bwrap sandbox that WS4's GPU launcher
  extends.
- quest_economics_spec.md — token economy, resource allocation, and cost
  ledger that WS5 debits against per analysis.
Related competitive-intel docs
- [docs/bio_competitive/README.md](../../bio_competitive/README.md) — tree
  overview and provenance rules.
- [docs/bio_competitive/biomni_profile.md](../../bio_competitive/biomni_profile.md)
- [docs/bio_competitive/k_dense_profile.md](../../bio_competitive/k_dense_profile.md)
- [docs/bio_competitive/amass_profile.md](../../bio_competitive/amass_profile.md)
- [docs/bio_competitive/alpha1_science_profile.md](../../bio_competitive/alpha1_science_profile.md)
- [docs/bio_competitive/openai_prism_profile.md](../../bio_competitive/openai_prism_profile.md)
- [docs/bio_competitive/comparison_matrix.md](../../bio_competitive/comparison_matrix.md)
- [docs/bio_competitive/access_notes.md](../../bio_competitive/access_notes.md)
Work Log
_No entries yet._