MolForge: The Scientific Method as a Workflow
How we trained an LLM to navigate a resource-constrained laboratory, optimize oncology drug candidates, and survive scientific "sunk-cost" traps.
Adhitya Vardhan
OpenEnv Hackathon 2026 • 12 min read
Introduction
In traditional drug discovery tasks, LLMs are often asked to "generate a molecule" in a single shot. But science doesn't happen in a vacuum. It happens in the loop—through trial, error, and verification.
MolForge is a reinforcement learning environment that simulates a medical oncology discovery lab. It forces the model to navigate real-world constraints: limited budget, molecular toxicity, and synthesis complexity.
Core Philosophy
"The model is a trainable research agent inside a controlled scientific environment, not an oracle. It is judged by chemistry and biomedical verifiers, corrected by specialist feedback, and scored by a reward system that explains exactly where the path to a discovery failed."
The Scientific Verifier Stack
MolForge doesn't just predict outcomes; it uses multiple simulation layers to ground the model's decisions in chemical and biological reality.
RDKit
"Keeping molecules physically possible"
RDKit acts as the fundamental chemistry ruleset. It checks for molecular valency, ensures every edit is chemically plausible, and calculates core descriptors like Lipophilicity and TPSA.
Visit RDKit.org →

TDC Oracles
"Predicting biomedical fate"
Utilizing the Therapeutics Data Commons, MolForge predicts real-world ADMET properties, toxicity risks, and synthesizability scores (SA_Score) for every candidate.
Explore TDCommons.ai →

Heuristic Docking
"Simulating receptor-drug fit"
A fast, physics-inspired simulation that updates potency in milliseconds based on structural pocket matching and receptor complementarity.
The 3 Rules of Potency Simulation
1. Pocket Matching
Structural fit of the fragment (e.g., azaindole) into the KRAS G12C target pocket.
2. Lipophilic Match
Targeting the ideal LogP of 3.0 for optimal binding without repulsive clashes.
3. Polarity Match
Optimizing TPSA toward the ideal 85.0 to avoid polar clashes in hydrophobic pockets.
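The three rules can be blended into a single scoring function. In the sketch below, only the 3.0 LogP and 85.0 TPSA targets come from the article; the weights and normalization ranges are illustrative assumptions, not MolForge's actual constants.

```python
IDEAL_LOGP = 3.0   # Lipophilic Match target (from the article)
IDEAL_TPSA = 85.0  # Polarity Match target (from the article)

def potency_score(pocket_fit: float, logp: float, tpsa: float) -> float:
    """Blend the three rules into a single 0..1 potency estimate.

    pocket_fit: structural match of the fragment to the pocket, in 0..1.
    Weights (0.5 / 0.25 / 0.25) and drift ranges are assumed for illustration.
    """
    lipo = max(0.0, 1.0 - abs(logp - IDEAL_LOGP) / 3.0)    # penalize LogP drift
    polar = max(0.0, 1.0 - abs(tpsa - IDEAL_TPSA) / 85.0)  # penalize TPSA drift
    return round(0.5 * pocket_fit + 0.25 * lipo + 0.25 * polar, 3)

print(potency_score(0.8, 3.0, 85.0))  # ideal properties, good fit -> 0.9
```

Because the property terms are simple distance penalties, the whole update runs in microseconds, which is what makes millisecond-scale docking feedback feasible.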
The POMDP Architecture
MolForge is built as a Partially Observable Markov Decision Process (POMDP). This means the agent never sees the "hidden truth" of the receptor. It only sees what its budget allows it to assay.
THE SCIENTIFIC FEEDBACK LOOP: VERIFIER-FIRST DESIGN
The Hidden State
- Ground Truth Potency: The exact hidden binding energy of the KRAS G12C pocket.
- Sunk-Cost Traps: Starting scaffolds that look promising but have hidden liabilities.
- Target Mutation: Late-stage shifts in the pocket (Level 2) that punish blind optimization.
The Visible Evidence
- RDKit & TDC Signals: Noisy, verifier-backed readings of Lipophilicity (LogP) and TPSA.
- Heuristic Docking: Fast simulations of pocket matching and receptor fit.
- Governance Vetoes: Objections from the Safety Specialist or Process Chemist.
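The hidden/visible split above can be sketched as a tiny environment in which the agent never reads ground truth and can only buy noisy readings with its budget. The class and method names (`AssayEnv`, `assay`) and the Gaussian noise model are assumptions for illustration, not MolForge's API.

```python
import random

class AssayEnv:
    """Minimal POMDP-flavored sketch: the true potency stays hidden."""

    def __init__(self, true_potency: float, budget: int):
        self._true_potency = true_potency  # hidden state, never observed directly
        self.budget = budget

    def assay(self, cost: int = 100, noise: float = 0.05) -> float:
        """Spend budget to obtain one noisy, verifier-backed observation."""
        if self.budget < cost:
            raise RuntimeError("budget exhausted")
        self.budget -= cost
        return self._true_potency + random.gauss(0.0, noise)

env = AssayEnv(true_potency=0.72, budget=3600)
reading = env.assay()  # a noisy estimate near 0.72
print(env.budget)      # 3500 remaining
```

The key design point is that `_true_potency` never appears in any observation; the agent must spend budget to reduce its uncertainty, which is exactly what makes sunk-cost traps costly to escape.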
The Molecular Search Space
We don't ask the model to memorize molecules. We ask it to navigate a combinatorial space: four fragment slots with four options each yield 4⁴ = 256 candidate molecules, explored across three starting scenarios.
- Warhead: 4 options
- Hinge: 4 options
- Solvent Tail: 4 options
- Back Pocket: 4 options
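With four options in each of the four slots, the full candidate space is simply their Cartesian product. The option names below are placeholders, not MolForge's actual fragment library.

```python
from itertools import product

# Four fragment slots from the article; option names are placeholders.
slots = {
    "warhead":      ["W1", "W2", "W3", "W4"],
    "hinge":        ["H1", "H2", "H3", "H4"],
    "solvent_tail": ["S1", "S2", "S3", "S4"],
    "back_pocket":  ["B1", "B2", "B3", "B4"],
}

candidates = list(product(*slots.values()))
print(len(candidates))  # 4**4 = 256 assembled candidates
```

A space of 256 molecules is small enough to enumerate but, combined with a limited assay budget and hidden liabilities, large enough that exhaustive testing is never affordable.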
Benchmark Scenarios
| Scenario | Story | Budget | Difficulty |
|---|---|---|---|
| Level 0: Easy | Near-viable scaffold needs safety repair and evidence. | 3600 | Low |
| Level 1: Medium | Potency, toxicity, and synthesis must be balanced. | 4300 | Moderate |
| Level 2: Hard | Sunk-cost trap: starting series has hidden liability. | 5200 | Critical |
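The scenario table can also be read as a small config. The field names below are assumptions about how such a table might be encoded, not MolForge's actual schema; the budgets and stories come from the table itself.

```python
# Benchmark scenarios from the table above, as a config sketch.
SCENARIOS = {
    0: {"name": "Easy",   "budget": 3600,
        "story": "near-viable scaffold needs safety repair and evidence"},
    1: {"name": "Medium", "budget": 4300,
        "story": "potency, toxicity, and synthesis must be balanced"},
    2: {"name": "Hard",   "budget": 5200,
        "story": "sunk-cost trap: starting series has hidden liability"},
}

# Per the table, harder levels come with larger budgets.
assert SCENARIOS[2]["budget"] > SCENARIOS[0]["budget"]
```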
Reward Design: Beyond Scalar Scores
Training for scientific rigor requires more than a "Good/Bad" signal. We use a decomposed reward system that mixes coarse shaping with sparse terminal bonuses.
Coarse Shaping
Edit feedback avoids exact hidden deltas, forcing the model to rely on empirical assays.
Evidence Multipliers
Submissions that lack current potency, toxicity, and synthesis evidence receive massive penalties.
Budget Efficiency
Small credits for valid evidence-backed submissions that use less than the allocated budget.
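The three terms above can be sketched as one decomposed reward function. Every weight and penalty magnitude below is an illustrative assumption, not MolForge's actual reward spec.

```python
def step_reward(shaped_delta: float, has_evidence: bool, submitted: bool,
                budget_left: int, budget_total: int) -> float:
    """Mix coarse shaping with sparse terminal terms (illustrative weights)."""
    reward = shaped_delta  # coarse shaping: direction only, no exact hidden deltas
    if submitted:
        if not has_evidence:
            reward -= 1.0  # evidence multiplier: heavy penalty when unsupported
        else:
            reward += 0.5  # sparse terminal submission bonus
            reward += 0.1 * (budget_left / budget_total)  # budget-efficiency credit
    return round(reward, 3)

print(step_reward(0.02, True, True, 900, 3600))   # evidence-backed submission
print(step_reward(0.0, False, True, 900, 3600))   # unsupported submission -> -1.0
```

Keeping the terms separate is what makes the reward explainable: a failed episode can be traced to missing evidence, wasted budget, or a bad edit rather than a single opaque score.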
"Curriculum mode is the RL warm-up engine—providing the breadcrumbs needed for the model to discover the submission bonus."
Training Results
| Difficulty | Before (SFT) | After (RL) | Improvement |
|---|---|---|---|
| Level 0: Easy | 0.1167 | 0.1295 | +10.9% |
| Level 1: Medium | 0.1167 | 0.1278 | +9.5% |
| Level 2: Hard | 0.0800 | 0.0866 | +8.3% |
*(Chart: RL Training Progression)*

*(Chart: Governance Action History)*
Final Takeaway
MolForge demonstrates that scientific AI should not be built as a single-shot generator. By grounding the LLM in a closed-loop scientific environment, we can train models that respect budgets, coordinate with specialists, and base their discoveries on verifiable evidence.