MolForge: The Scientific Method as a Workflow
How we trained an LLM to navigate a resource-constrained laboratory, optimize oncology drug candidates, and survive scientific "sunk-cost" traps.
Adhitya Vardhan
OpenEnv Hackathon 2026 • 12 min read
Introduction
In traditional drug discovery tasks, LLMs are often asked to "generate a molecule" in a single shot. But science doesn't happen in a vacuum. It happens in the loop—through trial, error, and verification.
MolForge is a reinforcement learning environment that simulates a medical oncology discovery lab. It forces the model to navigate real-world constraints: limited budget, molecular toxicity, and synthesis complexity.
Core Philosophy
"The model is a trainable research agent inside a controlled scientific environment, not an oracle. It is judged by chemistry and biomedical verifiers, corrected by specialist feedback, and scored by a reward system that explains exactly where the path to a discovery failed."
The Scientific Verifier Stack
MolForge doesn't just predict outcomes; it uses multiple simulation layers to ground the model's decisions in chemical and biological reality.
RDKit
"Keeping molecules physically possible"
RDKit acts as the fundamental chemistry ruleset. It checks for molecular valency, ensures every edit is chemically plausible, and calculates core descriptors like Lipophilicity and TPSA.
Visit RDKit.org →

TDC Oracles
"Predicting biomedical fate"
Utilizing the Therapeutics Data Commons, MolForge predicts real-world ADMET properties, toxicity risks, and synthesizability scores (SA_Score) for every candidate.
Explore TDCommons.ai →

Heuristic Docking
"Simulating receptor-drug fit"
A fast, physics-inspired simulation that updates potency in milliseconds based on structural pocket matching and receptor complementarity.
The 3 Rules of Potency Simulation
1. Pocket Matching
Structural fit of the fragment (e.g., azaindole) into the KRAS G12C target pocket.
2. Lipophilic Match
Targeting the ideal LogP of 3.0 for optimal binding without repulsive clashes.
3. Polarity Match
Optimizing TPSA toward the ideal 85.0 to avoid polar clashes in hydrophobic pockets.
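The three rules can be blended into a single scoring function. In the sketch below, only the 3.0 LogP and 85.0 TPSA targets come from the article; the weights and normalization ranges are illustrative assumptions, not MolForge's actual constants.

```python
IDEAL_LOGP = 3.0   # Lipophilic Match target (from the article)
IDEAL_TPSA = 85.0  # Polarity Match target (from the article)

def potency_score(pocket_fit: float, logp: float, tpsa: float) -> float:
    """Blend the three rules into a single 0..1 potency estimate.

    pocket_fit: structural match of the fragment to the pocket, in 0..1.
    Weights (0.5 / 0.25 / 0.25) and drift ranges are assumed for illustration.
    """
    lipo = max(0.0, 1.0 - abs(logp - IDEAL_LOGP) / 3.0)    # penalize LogP drift
    polar = max(0.0, 1.0 - abs(tpsa - IDEAL_TPSA) / 85.0)  # penalize TPSA drift
    return round(0.5 * pocket_fit + 0.25 * lipo + 0.25 * polar, 3)

print(potency_score(0.8, 3.0, 85.0))  # ideal properties, good fit -> 0.9
```

Because the property terms are simple distance penalties, the whole update runs in microseconds, which is what makes millisecond-scale docking feedback feasible.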
The POMDP Architecture
MolForge is built as a Partially Observable Markov Decision Process (POMDP). This means the agent never sees the "hidden truth" of the receptor. It only sees what its budget allows it to assay.
THE SCIENTIFIC FEEDBACK LOOP: VERIFIER-FIRST DESIGN
The Hidden State
- Ground Truth Potency: The exact hidden binding energy of the KRAS G12C pocket.
- Sunk-Cost Traps: Starting scaffolds that look promising but have hidden liabilities.
- Target Mutation: Late-stage shifts in the pocket (Level 2) that punish blind optimization.
The Visible Evidence
- RDKit & TDC Signals: Noisy, verifier-backed readings of Lipophilicity (LogP) and TPSA.
- Heuristic Docking: Fast simulations of pocket matching and receptor fit.
- Governance Vetoes: Objections from the Safety Specialist or Process Chemist.
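The hidden/visible split above can be sketched as a tiny environment in which the agent never reads ground truth and can only buy noisy readings with its budget. The class and method names (`AssayEnv`, `assay`) and the Gaussian noise model are assumptions for illustration, not MolForge's API.

```python
import random

class AssayEnv:
    """Minimal POMDP-flavored sketch: the true potency stays hidden."""

    def __init__(self, true_potency: float, budget: int):
        self._true_potency = true_potency  # hidden state, never observed directly
        self.budget = budget

    def assay(self, cost: int = 100, noise: float = 0.05) -> float:
        """Spend budget to obtain one noisy, verifier-backed observation."""
        if self.budget < cost:
            raise RuntimeError("budget exhausted")
        self.budget -= cost
        return self._true_potency + random.gauss(0.0, noise)

env = AssayEnv(true_potency=0.72, budget=3600)
reading = env.assay()  # a noisy estimate near 0.72
print(env.budget)      # 3500 remaining
```

The key design point is that `_true_potency` never appears in any observation; the agent must spend budget to reduce its uncertainty, which is exactly what makes sunk-cost traps costly to escape.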
The Molecular Search Space
We don't ask the model to memorize molecules. We ask it to navigate a combinatorial space: four fragment slots with four options each yield 4⁴ = 256 candidate molecules, explored across three starting scenarios.
- Warhead: 4 options
- Hinge: 4 options
- Solvent Tail: 4 options
- Back Pocket: 4 options
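With four options in each of the four slots, the full candidate space is simply their Cartesian product. The option names below are placeholders, not MolForge's actual fragment library.

```python
from itertools import product

# Four fragment slots from the article; option names are placeholders.
slots = {
    "warhead":      ["W1", "W2", "W3", "W4"],
    "hinge":        ["H1", "H2", "H3", "H4"],
    "solvent_tail": ["S1", "S2", "S3", "S4"],
    "back_pocket":  ["B1", "B2", "B3", "B4"],
}

candidates = list(product(*slots.values()))
print(len(candidates))  # 4**4 = 256 assembled candidates
```

A space of 256 molecules is small enough to enumerate but, combined with a limited assay budget and hidden liabilities, large enough that exhaustive testing is never affordable.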
Benchmark Scenarios
| Scenario | Story | Budget | Difficulty |
|---|---|---|---|
| Level 0: Easy | Near-viable scaffold needs safety repair and evidence. | 3600 | Low |
| Level 1: Medium | Potency, toxicity, and synthesis must be balanced. | 4300 | Moderate |
| Level 2: Hard | Sunk-cost trap: starting series has hidden liability. | 5200 | Critical |
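The scenario table can also be read as a small config. The field names below are assumptions about how such a table might be encoded, not MolForge's actual schema; the budgets and stories come from the table itself.

```python
# Benchmark scenarios from the table above, as a config sketch.
SCENARIOS = {
    0: {"name": "Easy",   "budget": 3600,
        "story": "near-viable scaffold needs safety repair and evidence"},
    1: {"name": "Medium", "budget": 4300,
        "story": "potency, toxicity, and synthesis must be balanced"},
    2: {"name": "Hard",   "budget": 5200,
        "story": "sunk-cost trap: starting series has hidden liability"},
}

# Per the table, harder levels come with larger budgets.
assert SCENARIOS[2]["budget"] > SCENARIOS[0]["budget"]
```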
Reward Design: Beyond Scalar Scores
Training for scientific rigor requires more than a "Good/Bad" signal. We use a decomposed reward system that mixes coarse shaping with sparse terminal bonuses.
Coarse Shaping
Edit feedback avoids exact hidden deltas, forcing the model to rely on empirical assays.
Evidence Multipliers
Submissions that lack current potency, toxicity, and synthesis evidence receive massive penalties.
Budget Efficiency
Small credits for valid evidence-backed submissions that use less than the allocated budget.
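The three terms above can be sketched as one decomposed reward function. Every weight and penalty magnitude below is an illustrative assumption, not MolForge's actual reward spec.

```python
def step_reward(shaped_delta: float, has_evidence: bool, submitted: bool,
                budget_left: int, budget_total: int) -> float:
    """Mix coarse shaping with sparse terminal terms (illustrative weights)."""
    reward = shaped_delta  # coarse shaping: direction only, no exact hidden deltas
    if submitted:
        if not has_evidence:
            reward -= 1.0  # evidence multiplier: heavy penalty when unsupported
        else:
            reward += 0.5  # sparse terminal submission bonus
            reward += 0.1 * (budget_left / budget_total)  # budget-efficiency credit
    return round(reward, 3)

print(step_reward(0.02, True, True, 900, 3600))   # evidence-backed submission
print(step_reward(0.0, False, True, 900, 3600))   # unsupported submission -> -1.0
```

Keeping the terms separate is what makes the reward explainable: a failed episode can be traced to missing evidence, wasted budget, or a bad edit rather than a single opaque score.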
"Curriculum mode is the RL warm-up engine—providing the breadcrumbs needed for the model to discover the submission bonus."
Training Results
| Difficulty | Before (SFT) | After (RL) | Improvement |
|---|---|---|---|
| Level 0: Easy | 0.1167 | 0.1295 | +10.9% |
| Level 1: Medium | 0.1167 | 0.1278 | +9.5% |
| Level 2: Hard | 0.0800 | 0.0866 | +8.3% |
*(Chart: RL Training Progression)*

*(Chart: Governance Action History)*
Final Takeaway
MolForge demonstrates that scientific AI should not be built as a single-shot generator. By grounding the LLM in a closed-loop scientific environment, we can train models that respect budgets, coordinate with specialists, and base their discoveries on verifiable evidence.