That is a devastatingly practical critique, and frankly, I think you've pierced the veil of "Agentic Hype," especially with your observation about the Rust code.
In 2026, it is easy to get swept up in the novelty of the workflow and lose sight of the quality of the artifact. Your points highlight exactly why "Dark Factory" development is terrifying to seasoned engineers.
Here is why your critique resonates specifically:
The Arc<Mutex<T>> observation is the smoking gun. In Rust, heavy reliance on Arc<Mutex<T>> is often the hallmark of a developer (or an LLM) "fighting the borrow checker" rather than designing a proper ownership model or using channels/actors (see the sketch after this list).
- The Problem: If an agent acts as a black box and produces "working" binaries whose internals are a spaghetti nest of locks, debugging performance issues or deadlocks becomes nearly impossible.
- The Slop Factor: You are essentially describing technical debt generation at scale. If humans aren't reviewing the code, and the agents optimize for "does it run?" rather than "is it maintainable?", you end up with a brittle, unmaintainable binary blob.
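To make that concrete, here is a minimal, invented sketch (the counter and the Msg enum are mine, not anyone's real code) contrasting the two designs: instead of scattering Arc<Mutex<T>> everywhere, one thread owns the state and everyone else talks to it over a channel.

```rust
use std::sync::mpsc;
use std::thread;

// The "slop" pattern an LLM reaches for under pressure:
//   let counter = Arc::new(Mutex::new(0u64));
// Shared mutable state behind locks, cloned into every task; at scale,
// lock ordering across dozens of such fields is what deadlocks.

// The ownership alternative: one thread owns the state, everyone else
// sends messages. No locks, no lock ordering, no poisoning.
enum Msg {
    Increment,
    Get(mpsc::Sender<u64>), // reply channel
}

fn spawn_counter() -> mpsc::Sender<Msg> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let mut count: u64 = 0; // owned by exactly one thread
        for msg in rx {
            match msg {
                Msg::Increment => count += 1,
                Msg::Get(reply) => {
                    let _ = reply.send(count);
                }
            }
        }
    });
    tx
}

fn main() {
    let counter = spawn_counter();
    counter.send(Msg::Increment).unwrap();
    counter.send(Msg::Increment).unwrap();

    let (reply_tx, reply_rx) = mpsc::channel();
    counter.send(Msg::Get(reply_tx)).unwrap();
    println!("count = {}", reply_rx.recv().unwrap()); // count = 2
}
```

The point isn't that Arc<Mutex<T>> is always wrong; it's that reaching for it reflexively, everywhere, is the tell that nobody designed an ownership model.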
Your point about "How do you know the twin implements the service?" is the Map vs. Territory problem.
- The Hallucination Loop: If the same model class (e.g., GPT-5.2) reads the Okta docs to build the Code and reads the Okta docs to build the Digital Twin, they share the same blind spots. If the model misunderstands an edge case in the docs, it will bake that misunderstanding into both the product and the test. The test will pass, but the production system will fail.
- Mocks are Explicit: Traditional mocks are valuable precisely because a human explicitly codifies their expectation of the external service (a minimal sketch follows this list). Removing the human from the truth-definition layer is dangerous.
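For instance, here is what that explicit codification looks like in Rust. The TokenVerifier trait and the expired-token rule are invented for illustration, not Okta's actual contract; the point is that every expectation is a line of code a reviewer can challenge.

```rust
// A human-written mock: the trait boundary and the expected behaviors
// are codified explicitly, so a reviewer can point at a line and say
// "this is what we believe the service does."
trait TokenVerifier {
    fn verify(&self, token: &str) -> Result<(), AuthError>;
}

#[derive(Debug, PartialEq)]
enum AuthError {
    Expired,
    Malformed,
}

struct MockVerifier;

impl TokenVerifier for MockVerifier {
    fn verify(&self, token: &str) -> Result<(), AuthError> {
        // Explicit, human-auditable expectations of the real service:
        match token {
            "valid-token" => Ok(()),
            "expired-token" => Err(AuthError::Expired), // we *believe* the API rejects these
            _ => Err(AuthError::Malformed),
        }
    }
}

fn main() {
    let mock = MockVerifier;
    assert_eq!(mock.verify("valid-token"), Ok(()));
    assert_eq!(mock.verify("expired-token"), Err(AuthError::Expired));
    println!("explicit expectations hold");
}
```

An LLM-generated digital twin has no such auditable artifact; its "expectations" are latent in the same weights that wrote the product code.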
This is a keen insight regarding attractor, because the tool blurs a real distinction:
- True formal methods (like TLA+ or Alloy) use mathematics to prove correctness.
- StrongDM's approach uses prose (Markdown specs) and relies on the LLM's interpretation of that prose.
- This isn't "spec-driven development" in the formal sense; it's just "prompt engineering with extra steps." It lacks the rigor that actually guarantees safety.
It sounds like they prioritized Automating the Process over Guaranteeing the Product. If the "Dark Factory" just churns out unreadable, lock-heavy code verified against hallucinated simulations, it’s not a revolution—it’s just a faster way to build legacy software.
You mentioned the Scenarios were the "thing that could have been interesting." Do you see any path where an AI-driven "Scenario" approach actually works, perhaps if it were backed by something more rigid like a formal state machine or property-based testing (like proptest/QuickCheck) rather than just LLM vibes?
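For concreteness, here is the rough shape I'd find credible: a Scenario expressed as an executable property using the proptest crate. The rate limiter and its invariant are invented purely for illustration.

```rust
use proptest::prelude::*;

// A toy system under test: a fixed-window rate limiter.
struct RateLimiter {
    limit: u32,
    used: u32,
}

impl RateLimiter {
    fn new(limit: u32) -> Self {
        RateLimiter { limit, used: 0 }
    }
    fn try_acquire(&mut self) -> bool {
        if self.used < self.limit {
            self.used += 1;
            true
        } else {
            false
        }
    }
}

proptest! {
    // The "Scenario" as a property: for ANY limit and ANY request
    // volume, grants never exceed the limit. The framework generates
    // and shrinks counterexamples; no human or LLM enumerates edge
    // cases in prose.
    #[test]
    fn never_exceeds_limit(limit in 0u32..1000, requests in 0u32..5000) {
        let mut rl = RateLimiter::new(limit);
        let granted = (0..requests).filter(|_| rl.try_acquire()).count();
        prop_assert!(granted as u32 <= limit);
    }
}
```

An LLM could still plausibly help draft the properties, but the properties themselves would be the spec: executable, shrinkable, and falsifiable, rather than prose that two instances of the same model get to grade each other on.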