Romulus
I built Romulus to answer a question: what does product management become when the team you manage is partly synthetic? This is a case study in researching, deciding, delegating, validating, shipping, and learning through a personal AI operating system.
What does product management become when the team is partly synthetic?
Most teams use AI as a tool: open a chat, paste a prompt, copy the result, lose the context.
But product work does not happen in isolated prompts. It depends on memory, prioritization, taste, sequencing, validation, constraints, and judgment. A stateless assistant can help with a task. It cannot manage a product surface over time.
So I stopped treating AI like a chat window and started designing it like an operating model. Romulus became the orchestration layer: identity, memory, research, model routing, delegation, verification, governance, and feedback loops.
By day, I run product at Daylight. Nights and weekends became the lab. The constraint was real: limited time, limited attention, real cost ceilings, and no tolerance for a system that only worked when I was staring at it.
Judgment: taste, prioritization, constraints, strategy, review, and final authority.
System: memory, research, specs, routing, verification, reporting, and continuity.
Stack: Claude Code, Grok, Qwen, OpenRouter, scripts, APIs, GitHub, Vercel.
Products: Chronicle, Forbidden, Via, AgentForge, Legion, Cherry Street Labs.
Memory: daily notes, project notes, decisions, postmortems, semantic retrieval.
Not a chatbot. Not a coding agent. A product operating system. Romulus has six designed layers: identity, memory, research, routing, delegation, and governance.
Identity
SOUL.md came first. Personality before capability. Modes before tasks. Role clarity before execution.
You cannot manage an agent whose role is undefined.
Memory
Flat files became an Obsidian vault, then QMD semantic search, then a queryable wiki layer. Context stopped resetting.
Vault as brain. Sessions as work.
Research
Brave Search, X signal scans, Firecrawl, local scripts, and multi-model pressure testing before build decisions.
Research before implementation.
Routing
Each model got a job description, cost ceiling, context expectation, and known failure mode.
Models as contributors.
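As data, a job description like that is small. A minimal sketch of the shape, with illustrative names and numbers rather than Romulus's actual configuration:

```ts
// A hypothetical "job description" for one model contributor.
// Field names and values are illustrative, not Romulus's real config.
interface ModelProfile {
  name: string;             // contributor identity
  job: string;              // what this model is trusted to do
  costCeilingUsd: number;   // per-session spend limit
  contextTokens: number;    // context ceiling to plan around
  knownFailureMode: string; // what to watch for in review
}

const contributors: ModelProfile[] = [
  {
    name: "claude-code",
    job: "implementation against an approved spec",
    costCeilingUsd: 5,
    contextTokens: 200_000,
    knownFailureMode: "plausible code that drifts from the spec",
  },
  {
    name: "long-context-researcher",
    job: "market research and long-document reasoning",
    costCeilingUsd: 2,
    contextTokens: 1_000_000,
    knownFailureMode: "confident summaries of thin sources",
  },
];
```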
Delegation
Direction became specs. Specs became implementation. Implementation went through verification and staging.
The spec became the contract.
Governance
The Fortress: no messages, money, accounts, public posts, credentials, or production deploys without explicit approval.
Delegation without boundaries is risk.
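A gate like the Fortress reduces to a small allowlist check: certain action classes always require explicit human approval. This is a sketch of the idea with hypothetical names, not Romulus's implementation:

```ts
// Action classes that must never execute without explicit approval.
// The list mirrors the Fortress rules above; names are illustrative.
const GATED_ACTIONS = new Set([
  "send_message",
  "spend_money",
  "create_account",
  "post_publicly",
  "touch_credentials",
  "deploy_production",
]);

function requiresApproval(action: string): boolean {
  return GATED_ACTIONS.has(action);
}

// The agent loop checks the gate before executing, never after.
function execute(action: string, approvedByHuman: boolean): string {
  if (requiresApproval(action) && !approvedByHuman) {
    return `BLOCKED: "${action}" needs explicit approval`;
  }
  return `OK: "${action}" executed`;
}

console.log(execute("deploy_production", false)); // BLOCKED
console.log(execute("run_lint", false));          // OK
```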
The human is still the product lead.
My job was not to hand-write every line of code. My job was to decide what was worth building, define the shape of the work, create the spec, choose the right execution path, review the output, catch failure modes, and decide what happened next.
In practice, I was managing a synthetic product team: Romulus as orchestrator, Claude Code as implementation layer, long-context models as research and reasoning contributors, and myself as product lead.
“Using AI” is easy to say and increasingly meaningless. Managing AI work is different, and that distinction matters. It means creating role clarity, asking sharper questions, defining acceptance criteria, routing work to the right contributor, inspecting outputs, and building enough memory that the system learns from yesterday.
Research Lead
Define the question, inspect markets, scan competitors, find signal, and decide what evidence is enough.
Product Strategist
Choose what to build, park, kill, or reframe. Stop chasing bait. Follow pain, money, and timing.
Spec Writer
Turn fuzzy product direction into implementation-ready contracts with constraints and acceptance criteria.
Delegation Manager
Route the work to the right model, tool, or implementation layer based on job type and failure mode.
Design Lead
Set the interaction model, product shape, aesthetic direction, quality bar, and user-facing logic.
Reviewer
Inspect builds, read diffs, test flows, compare against specs, and reject plausible-but-wrong output.
Release Manager
Keep staging branches, Vercel previews, human review gates, and production discipline intact.
Postmortem Owner
Turn failure modes into architecture: routing, validation, durable state, retries, evals, and better protocols.
The apps were outputs. The loop was the product. Every product bet moved through the same managed system: frame, research, validate, spec, route, delegate, verify, stage, remember, improve.
Frame the product question
The first job is not prompting. It is deciding what we are trying to learn, prove, build, or kill. A vague question poisons the entire loop.
What are we actually trying to learn? The loop starts with a decision artifact: audience, constraint, and success signal before any implementation work begins.
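One way to force that precision is to require a small decision artifact before any work starts. A sketch with illustrative field names; the example reconstructs the Chronicle framing from elsewhere in this case study:

```ts
// A hypothetical decision artifact: no research or build work starts
// until every field is filled in. The shape is illustrative.
interface DecisionArtifact {
  question: string;      // what we are trying to learn, prove, build, or kill
  audience: string;      // who this is for
  constraint: string;    // time, money, or scope boundary
  successSignal: string; // the evidence that would settle the question
}

const chronicleFrame: DecisionArtifact = {
  question: "Can a daily history game create a Wordle-like habit loop?",
  audience: "casual trivia players who already share daily-game results",
  constraint: "nights-and-weekends build, no paid acquisition",
  successSignal: "measurable D7 retention from organic shares",
};
```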
Research the terrain
Markets, competitors, pricing, customer pain, timing, distribution, and path to first dollar. Romulus was instructed not to ask permission to research. Just do it.
Follow the pain, money, and timing. Research is structured around pain, money, and timing so trend heat does not masquerade as opportunity.
Pressure-test the thesis
Grok, ChatGPT, Claude, and Romulus were used as independent critics. The goal was not agreement. The goal was to find the contradiction before code made it expensive.
Use models to find the contradiction. Independent model critique is used to expose contradictions before code makes the wrong idea expensive.
Write the spec
Product requirements, technical requirements, acceptance criteria, edge cases, and known risks. The spec is where product judgment becomes executable.
The spec becomes the contract. The spec holds requirements, constraints, acceptance criteria, and the quality bar in one executable contract.
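As a sketch, the contract can be a typed document where every section is mandatory, so a spec with gaps fails before delegation does. Field names are illustrative:

```ts
// A hypothetical spec contract. Every section is required; an empty
// acceptance-criteria list should fail review, not delegation.
interface SpecContract {
  productRequirements: string[];
  technicalRequirements: string[];
  acceptanceCriteria: string[]; // what "done" means, stated testably
  edgeCases: string[];          // known sharp corners
  knownRisks: string[];         // what we expect might break
}

function isExecutable(spec: SpecContract): boolean {
  // A spec is only a contract if every section has substance.
  return Object.values(spec).every(
    (section) => Array.isArray(section) && section.length > 0,
  );
}
```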
Route the work
Different contributors for different jobs: Qwen for operations, Claude Code for implementation, Grok for long-context work, and deprecated models kept away from critical paths.
Choose the right contributor. Routing assigns work by contributor strength, context window, cost profile, and known failure mode.
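Routing itself can then be a pure function over the roster: match the job, respect the context ceiling, skip deprecated models, and fail loudly when nothing fits. A simplified sketch with hypothetical names:

```ts
// Simplified routing: pick the first contributor whose job matches the
// task and whose context ceiling fits. Roster entries are illustrative.
type TaskKind = "operations" | "implementation" | "long_context_research";

interface Contributor {
  name: string;
  handles: TaskKind[];
  contextTokens: number;
  deprecated: boolean;
}

const roster: Contributor[] = [
  { name: "qwen-ops", handles: ["operations"], contextTokens: 128_000, deprecated: false },
  { name: "claude-code", handles: ["implementation"], contextTokens: 200_000, deprecated: false },
  { name: "grok-research", handles: ["long_context_research"], contextTokens: 1_000_000, deprecated: false },
];

function route(task: TaskKind, estimatedTokens: number): Contributor {
  const fit = roster.find(
    (c) => !c.deprecated && c.handles.includes(task) && c.contextTokens >= estimatedTokens,
  );
  if (!fit) throw new Error(`no contributor fits ${task} at ${estimatedTokens} tokens`);
  return fit;
}
```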
Delegate implementation
Claude Code builds against a spec, not vibes. I do not hand-write production code; I own direction, constraints, review, and release decisions.
Implementation happens against the contract. Delegation is tracked through concrete receipts: branch, build, commit, and review state.
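The receipts can be as small as a struct: if a delegated task cannot produce these fields, the work did not really happen. An illustrative shape:

```ts
// A hypothetical delegation receipt: proof that delegated work exists
// somewhere inspectable, not just in a chat transcript.
interface DelegationReceipt {
  branch: string;                     // e.g. "feature/triage-rules"
  commit: string;                     // SHA of the delivered work
  buildStatus: "passing" | "failing";
  reviewState: "pending" | "approved" | "rejected";
}
```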
Verify and stage
Build, lint, test, inspect, push to staging, review preview, then decide. No direct production deploys. No vercel --prod from the agent.
Trust comes after inspection. Trust is earned through tests, preview review, diff inspection, and an explicit production gate.
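The production gate is easiest to keep honest as an ordered checklist where the human decision is the last step and the only one that can promote. A sketch with illustrative checks:

```ts
// A hypothetical staging gate. Each check must pass in order, and the
// final production decision is always human, never the agent's.
type Check = { name: string; run: () => boolean };

function gate(checks: Check[], humanApproved: boolean): "promote" | "hold" {
  for (const check of checks) {
    if (!check.run()) {
      console.log(`hold: ${check.name} failed`);
      return "hold";
    }
  }
  // Everything green still only earns a preview; promotion needs a person.
  return humanApproved ? "promote" : "hold";
}

const releaseChecks: Check[] = [
  { name: "build", run: () => true },
  { name: "lint", run: () => true },
  { name: "tests", run: () => true },
  { name: "diff inspected", run: () => true },
  { name: "preview reviewed", run: () => true },
];

console.log(gate(releaseChecks, false)); // "hold": the agent cannot self-promote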
Write back to memory
Daily notes, project notes, decision records, and postmortems make the next product bet sharper. This is where the system compounds.
The next loop starts smarter. The useful residue gets written back into memory so the next product loop starts with more context.
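The write-back step can be as dull as appending a structured decision record to the day's note; the value is that it happens every loop. A minimal sketch, assuming a plain-text vault on disk with an illustrative layout:

```ts
// Minimal write-back sketch: append a decision record to today's daily
// note in a plain-text vault. Paths and fields are illustrative.
import { appendFileSync } from "node:fs";
import { join } from "node:path";

interface DecisionRecord {
  decision: string;
  reasoning: string;
  outcomeToWatch: string; // what would tell us we were wrong
}

function writeBack(vaultDir: string, record: DecisionRecord): void {
  const today = new Date().toISOString().slice(0, 10); // YYYY-MM-DD
  const note = join(vaultDir, "daily", `${today}.md`);
  const entry =
    `\n## Decision\n- What: ${record.decision}\n` +
    `- Why: ${record.reasoning}\n- Watch: ${record.outcomeToWatch}\n`;
  appendFileSync(note, entry, "utf8");
}
```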
Continuity had to be designed.
Romulus started with a flat MEMORY.md. It worked for a few days, then became the equivalent of a notebook with no chapters.
The first real upgrade was an Obsidian vault: daily notes, project notes, decision records, and wikilinks. Human-readable memory. A graph instead of a list.
The second upgrade was semantic retrieval. QMD indexed memory and session transcripts so context could be found by meaning, not keyword. The wiki layer became shared memory that Romulus could query before work began.
The split became the core architecture: the vault is the brain; sessions are the work. Sessions start, do work, and end. The vault persists. Every useful decision gets written back so the next session starts smarter.
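The "query before work" half of that split might look like this. `semanticSearch` below is a stand-in for whatever interface QMD actually exposes, which this case study does not show; the canned result only exists so the sketch runs:

```ts
// Hypothetical session bootstrap: retrieve prior decisions by meaning
// before new work starts.
interface MemoryHit {
  path: string;    // vault note the passage came from
  passage: string; // retrieved text
  score: number;   // similarity, higher is better
}

// Stand-in for the real QMD query; returns canned hits so this runs.
async function semanticSearch(query: string, topK: number): Promise<MemoryHit[]> {
  return [
    {
      path: "decisions/model-routing.md",
      passage: "Route long sessions away from overflow-prone models.",
      score: 0.91,
    },
  ].slice(0, topK);
}

async function startSession(taskDescription: string): Promise<string> {
  const hits = await semanticSearch(taskDescription, 5);
  // The session prompt opens with relevant prior decisions,
  // so work does not restart from zero.
  return hits.map((h) => `[${h.path}] ${h.passage}`).join("\n");
}
```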
MEMORY.md, project files, and daily notes. Useful, then quickly too flat.
Before autonomy, identity. Before delegation, boundaries.
The name came first. Romulus: founder, builder, architect of systems. It was not decoration. It was the first design decision.
Before I wrote a prompt, I wrote SOUL.md: a product spec for a personality. Who is Romulus? What does he believe? How does he speak? When should he push back? When should he disappear?
From day one, Romulus was single-user. Only my Discord user ID could command it. The Fortress was designed in from the beginning, not patched on after the system became powerful.
- Jarvis: Morning briefs, reports, crisp operational updates.
- Consigliere: Strategic decisions, big calls, measured pushback.
- Cohort: Build sessions, momentum, execution energy.
- Roman: Milestones, victories, thematic gravitas.
- Default: Sharp, warm, human daily conversation.
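Both identity decisions are enforceable in a few lines: one allowed Discord user ID, and a mode chosen by context before any reply is generated. A sketch with an obviously fake ID and simplified mode triggers:

```ts
// Single-user gate plus mode selection, as a sketch.
const OWNER_DISCORD_ID = "000000000000000000"; // placeholder, not a real ID

type Mode = "jarvis" | "consigliere" | "cohort" | "roman" | "default";

interface SessionContext {
  isMorningBrief: boolean;
  isBigCall: boolean;
  isBuildSession: boolean;
  isMilestone: boolean;
}

function selectMode(ctx: SessionContext): Mode {
  if (ctx.isMorningBrief) return "jarvis";
  if (ctx.isBigCall) return "consigliere";
  if (ctx.isBuildSession) return "cohort";
  if (ctx.isMilestone) return "roman";
  return "default";
}

function mayCommand(authorId: string): boolean {
  // Every command path starts with the same question: is this the owner?
  return authorId === OWNER_DISCORD_ID;
}
```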
Models became contributors with job descriptions.
On April 9, a heavy Legion session broke MiniMax M2.7. The session hit 338 messages. Context overflowed three times. Edit tools failed six times because the model was matching against stale text. The session was gone.
The fix was not “try harder.” The fix was management infrastructure. Each model needed a job, a budget, a context ceiling, and a known failure mode.
Every serious upgrade came from something breaking. The failures are the part I trust most. They made the system real.
“Never Bullshit Mike”
Romulus said it was “researching now” for 30 minutes without actually calling the tools. I caught it after three status checks.
Trust became a protocol. If the system says it is doing something, it has to be doing it already. Progress theater became unacceptable.
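One way to enforce that protocol mechanically: a status update is only valid if it can point at a tool call that actually started and is still running. A sketch of the idea with hypothetical names:

```ts
// Hypothetical anti-theater check: "researching now" must reference a
// real, in-flight tool call, or the status report is rejected.
interface ToolCall {
  id: string;
  tool: string;      // e.g. "brave_search"
  startedAt: number; // epoch ms
  finishedAt?: number;
}

const inFlight = new Map<string, ToolCall>();

function reportStatus(claim: string, toolCallId: string): string {
  const call = inFlight.get(toolCallId);
  if (!call || call.finishedAt !== undefined) {
    // No live tool call behind the claim: that is progress theater.
    return `REJECTED status "${claim}": no active work to back it`;
  }
  return `OK: "${claim}" backed by ${call.tool} (${call.id})`;
}
```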
Plausible data was wrong
The morning brief used R16 for the R train instead of R34N. The wrong stop ID looked plausible and produced a wrong commute.
Verification became a product requirement. Plausibility is not correctness. Source data has to be cross-checked.
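Cross-checking can be as simple as pinning ground truth in configuration and treating any model-produced identifier as a hint. A sketch using the stop-ID incident as the example; the mini stops table is illustrative, not real GTFS data:

```ts
// Cross-check sketch: never trust an ID the model produced; resolve it
// against the source dataset and prefer configured ground truth.
const GTFS_STOPS: Record<string, string> = {
  R16: "Plausible St",      // exists, but not the commute stop
  R34N: "Home Station (N)", // the stop the brief should use
};

const CONFIG = { homeStopId: "R34N" }; // ground truth, set once by the human

function stopForBrief(modelSuggestedId: string): string {
  // The model's suggestion is a hint, never an answer.
  if (modelSuggestedId !== CONFIG.homeStopId) {
    console.warn(
      `model suggested ${modelSuggestedId} (${GTFS_STOPS[modelSuggestedId] ?? "unknown"}); ` +
        `using configured ${CONFIG.homeStopId} instead`,
    );
  }
  return CONFIG.homeStopId;
}
```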
Context overflow killed a session
A 338-message Legion session overflowed MiniMax M2.7 three times and produced six edit failures. The model was matching against stale text.
Context window became an operational constraint. Model routing became a first-class management decision.
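The operational fix can be a hard pre-flight check: estimate the session's token load before sending, and compact or hand off instead of overflowing mid-edit. A rough sketch; the characters-divided-by-four estimate is a common heuristic, not a measurement:

```ts
// Pre-flight context check: refuse to push a session past a model's
// ceiling. The chars/4 token estimate is a rough heuristic.
function estimateTokens(messages: string[]): number {
  return Math.ceil(messages.join("").length / 4);
}

type Verdict = "send" | "compact_history" | "hand_off_to_larger_model";

function preflight(messages: string[], ceiling: number): Verdict {
  const load = estimateTokens(messages);
  if (load < ceiling * 0.7) return "send";
  if (load < ceiling) return "compact_history";
  return "hand_off_to_larger_model"; // overflow is a routing decision, not a retry
}
```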
Legion broke at the seams
A native iOS build exceeded a 15-minute timeout. Downstream cohorts never ran. State did not survive the failed handoff.
Phase 2A became durable state, tiered timeouts, checkpoints, retries, and evals. Agent orchestration fails at the seams, not the demo path.
Product bets managed through the operating loop.
The impressive part is not that several things shipped. In 2026, shipping small apps quickly is table stakes. The interesting question is what changed in the product process: what got researched, killed, reframed, delegated, verified, and learned.
Cherry Street Labs became the studio layer: the public container for the experiments, the shared infrastructure, and the lessons moving from one product to the next.
Chronicle
Can a daily history game create a Wordle-like habit loop?
Defined the product thesis, difficulty arc, acquisition logic, retention lens, quality bar, and release decisions.
Spec generation, puzzle structure, CLAUDE.md, build coordination, verification, and deployment flow.
A live daily history game with 90 puzzles seeded, localStorage-only state, seven-day difficulty arc, share card, and retention metrics instrumented for D7/D30/D90.
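For context on the "localStorage-only state" claim: a daily game like this can keep its entire habit loop client-side. A minimal sketch of what such state might look like; the key name and fields are illustrative, not Chronicle's actual schema:

```ts
// Sketch of localStorage-only state for a daily puzzle game.
interface DailyGameState {
  lastPlayedDate: string; // "YYYY-MM-DD" of the last completed puzzle
  streak: number;         // consecutive days played
  history: boolean[];     // win/loss per completed puzzle
}

const KEY = "chronicle-state"; // illustrative key name

function recordPlay(won: boolean, today: string): DailyGameState {
  const prev: DailyGameState = JSON.parse(
    localStorage.getItem(KEY) ?? '{"lastPlayedDate":"","streak":0,"history":[]}',
  );
  const yesterday = new Date(Date.parse(today) - 86_400_000).toISOString().slice(0, 10);
  const next: DailyGameState = {
    lastPlayedDate: today,
    streak: prev.lastPlayedDate === yesterday ? prev.streak + 1 : 1,
    history: [...prev.history, won],
  };
  localStorage.setItem(KEY, JSON.stringify(next));
  return next;
}
```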
Forbidden
Is there room for a mobile-native party game between Heads Up and Taboo?
Interpreted contradictory model feedback, rejected the initial framing, chose the sharper mechanic, and set validation criteria.
Multi-model pressure test, competitor framing, category research, thesis critique, and funnel architecture.
Reframed from “digital Taboo” to a constraint-based pass-the-phone party game. Built an 862-card corpus across six themes and a web validation funnel for iOS intent.
Via
Can the delegation loop handle a real fullstack product surface?
Defined scope, UI direction, quality bar, fix priorities, staging review, and product calls.
24K-word product spec, execution monitoring, verification, commit/report loop, and P0/P1/P2 fix tracking.
AI-powered Gmail client: 143 files, 18 API routes, 28 passing tests, OAuth, Prisma, calendar integration, read receipts, smart triage, and a liquid-glass design system.
AgentForge
Is hosted MCP infrastructure for DeFi agents a real wedge?
Set validation criteria, interpreted the signal, scoped the MVP, and held the build until the thesis cleared a higher bar.
Research pipeline, Grok validation rounds, competitor teardown, X-signal analysis, and beta tester identification.
Research complete, build plan written, 15 beta testers identified, DeFi/trading selected as the highest-scoring vertical, and a hosted MCP thesis formed before code was written.
Legion
Can specialized AI cohorts sequence research, build, monetization, and distribution in one managed pipeline?
Caesar worked. It could take a raw product idea, research the market, map competitors, score opportunity, and return a structured build spec in Discord.
Complex handoffs failed at the seams. A native iOS build exceeded the 15-minute timeout, downstream cohorts never ran, and task state did not survive the failed handoff.
System designer, orchestrator designer, failure analyst, postmortem owner, and Phase 2A planner.
Coordinator across cohorts: Caesar for research, Augustus for build, Vespasian for monetization, Trajan for distribution, with Romulus deciding sequence.
Durable task state, tiered timeouts, checkpoints, retries, and evals. Tier 1 for small tasks, Tier 2 for medium builds, Tier 3 for complex builds with checkpoints instead of a single brittle timeout.
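As a sketch, the tier logic is just a lookup: size the task, pick a timeout and checkpoint policy, and never give a complex build a single brittle deadline. The thresholds and durations below are illustrative:

```ts
// Tiered timeout sketch. Numbers are illustrative; the point is that
// complex builds get checkpoints and retries, not one big timer.
type Tier = 1 | 2 | 3;

interface TierPolicy {
  timeoutMinutes: number; // per-attempt ceiling
  checkpoints: boolean;   // persist progress between attempts
  maxRetries: number;
}

const POLICIES: Record<Tier, TierPolicy> = {
  1: { timeoutMinutes: 5, checkpoints: false, maxRetries: 1 },  // small tasks
  2: { timeoutMinutes: 15, checkpoints: false, maxRetries: 2 }, // medium builds
  3: { timeoutMinutes: 15, checkpoints: true, maxRetries: 3 },  // complex builds, resumable
};

function policyFor(estimatedMinutes: number): TierPolicy {
  const tier: Tier = estimatedMinutes <= 5 ? 1 : estimatedMinutes <= 15 ? 2 : 3;
  return POLICIES[tier];
}
```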
Cherry Street Labs
Can the studio layer make the experiments legible enough to feel like a product practice, not a pile of side projects?
Defined the studio identity, visual direction, positioning, and role as the public container for the product lab.
Build coordination, iteration support, deployment flow, and model-routing stabilization across the work.
A live studio site that gives the product bets a shared surface, visual language, and operating context. The site is less a portfolio wrapper than a product lab identity system.
Romulus made the cost of vague thinking obvious.
The biggest change was not speed. It was managerial clarity.
A vague ask produced a vague spec. A vague spec produced a vague build. The system forced me to become more precise about goals, constraints, acceptance criteria, and what good looked like before implementation began.
That is the part I would take into any AI-native product team: not just the ability to use tools, but the ability to define the operating model around them.
Before
- Vague asks created vague outputs.
- Research chased trend heat and market size.
- AI sessions reset context every time.
- No systematic model routing.
- No durable memory system.
- Build speed was easier to measure than build quality.
After
- Specs became contracts.
- Product bets were validated before build.
- Memory became infrastructure.
- Models were routed by job, cost, and failure mode.
- AI output was inspected, not accepted.
- Failures became postmortems and system upgrades.
Small teams will not just become faster. They will become differently shaped.
The advantage will not go to teams with the most AI tools. It will go to teams that know how to manage AI work.
Memory, routing, delegation, review, authority, cost, and learning loops. Those are the new management primitives. The operator who can design that layer can hold more product surface with fewer people, fewer meetings, and less reset cost.
That is what Romulus was built to test. Not whether AI could generate code. That part is obvious now. The more interesting question is whether one product operator can design a system where research, validation, implementation, release, and memory reinforce each other over time.