Benchmarks you can run, evidence you can trust
The repository ships runnable benchmarks and research previews, each with an explicit evidence boundary, plus an audit layer that records artifacts, QA checks, and metric provenance so claims stay honest.
Validation snapshot
Each row below states what the benchmark or preview actually demonstrates and where its evidence stops. These boundaries are deliberate and reflect a verification-first approach.
| Benchmark / Preview | What it shows | Evidence boundary |
|---|---|---|
| Information-loss-guided subcatchment partition | QGIS-to-Agentic SWMM preprocessing using entropy and fuzzy-similarity concepts from Zhang & Valeo's Journal of Hydrology paper | GIS preprocessing concept, not a calibrated SWMM performance claim |
| Raw GeoPackage-to-INP benchmark | Public TUFLOW GeoPackage layers converted into SWMM-ready artifacts, QA, and audit | Structured raw GIS path, not arbitrary CAD/GIS recognition |
| Prepared-input SWMM benchmark | External 40-subcatchment Tecnopolo model execution, plotting, and direct swmm5 comparison. v0.7.1 re-verification: model.out SHA256 unchanged across the v0.7.0 → v0.7.1 minor revision, with an 11-word natural-language prompt now sufficient to drive the full run-audit-plot chain end-to-end. | Prepared INP validation path |
| Cross-session memory autonomously activated | On a real production natural-language run, the LLM planner consulted prior session history without any user instruction to do so (see the v0.7.1 cross-session memory evidence) | Memory layer fires correctly and shapes planner decisions on a real natural-language run; staleness weighting and negative-precedent handling are next-milestone scope |
| Prior Monte Carlo uncertainty smoke | Tecnopolo HORTON parameter perturbation and hydrograph envelope preview | Prior uncertainty smoke, not calibration |
| Optional INP-derived raw adapter benchmark | Raw-like inputs extracted from a public SWMM fixture and rebuilt through the modular path | Adapter handoff check, not greenfield watershed generation |
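The re-verification in the prepared-input row rests on a byte-level comparison of `model.out` between two revisions. A minimal sketch of that kind of check follows, using only the Python standard library. The function names `sha256_of` and `outputs_match` are hypothetical and chosen for illustration; they are not the repository's own API, and the file paths are left to the caller.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large binary outputs do not need
        # to fit in memory all at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_match(path_a, path_b):
    """True when two output files are byte-identical by SHA256."""
    return sha256_of(path_a) == sha256_of(path_b)
```

An identical digest only shows that the two binary outputs are the same bytes. It says nothing about whether the model itself is calibrated, which matches the evidence boundary stated in the table.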
Audit and research memory
The audit layer consolidates artifacts, QA checks, and metric provenance into an Obsidian-compatible experiment note. The example note catches a recorded peak-flow value that does not match the value re-parsed from the source section of the SWMM report.
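The peak-flow provenance check described above can be sketched as a comparison between a recorded metric and the value re-parsed from report text. This is an illustration under stated assumptions, not the repository's implementation: the function names are hypothetical, and the report section is assumed to be simplified whitespace-separated rows with the element ID in the first column and the peak flow in the second. Real SWMM report sections have more columns and header lines.

```python
import math


def reparse_peak_flow(section_text, element_id, column=1):
    """Return the numeric value in `column` of the row for `element_id`.

    Assumes a simplified layout: one element per line, whitespace-separated,
    with the element ID as the first token (an illustrative assumption).
    """
    for line in section_text.splitlines():
        tokens = line.split()
        if tokens and tokens[0] == element_id:
            return float(tokens[column])
    raise KeyError(f"{element_id!r} not found in report section")


def audit_peak_flow(recorded, section_text, element_id, rel_tol=1e-6):
    """Compare a recorded peak flow against the value re-parsed from source."""
    reparsed = reparse_peak_flow(section_text, element_id)
    matches = math.isclose(recorded, reparsed, rel_tol=rel_tol)
    return {
        "element": element_id,
        "recorded": recorded,
        "reparsed": reparsed,
        "status": "pass" if matches else "mismatch",
    }
```

Returning both values alongside the status keeps the provenance visible in the note, so a reviewer can see which number came from the record and which came from the report.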
The downstream modelling-memory layer can summarize audited run histories into recurring failure patterns, assumptions, missing evidence, QA issues, lessons learned, and controlled proposals for updating existing skills or creating new skills. Because skills drive the workflow, these proposals stay coupled to the current Agentic SWMM framework and still require human review and benchmark verification before acceptance.