Romulus
A Case Study
MMXXVI

Designing the management system for a synthetic product team.

I built Romulus to answer a question: what does product management become when the team you manage is partly synthetic? This is a case study in researching, deciding, delegating, validating, shipping, and learning through a personal AI operating system.

Feb 3 → May 1, 2026 · Brooklyn, NY · Mike Battaglia
01 · The Question

What does product management become when the team is partly synthetic?

Most teams use AI as a tool: open a chat, paste a prompt, copy the result, lose the context.

But product work does not happen in isolated prompts. It depends on memory, prioritization, taste, sequencing, validation, constraints, and judgment. A stateless assistant can help with a task. It cannot manage a product surface over time.

So I stopped treating AI like a chat window and started designing it like an operating model. Romulus became the orchestration layer: identity, memory, research, model routing, delegation, verification, governance, and feedback loops.

By day, I run product at Daylight. Nights and weekends became the lab. The constraint was real: limited time, limited attention, real cost ceilings, and no tolerance for a system that only worked when I was staring at it.

I was not trying to build faster. I was trying to learn how to manage AI work.
Product
Judgment
Mike
Human Lead
Direction

Taste, prioritization, constraints, strategy, review, and final authority.

Orchestrator
Romulus

Memory, research, specs, routing, verification, reporting, and continuity.

Contributors
Models + Tools

Claude Code, Grok, Qwen, OpenRouter, scripts, APIs, GitHub, Vercel.

Evidence
Product Bets

Chronicle, Forbidden, Via, AgentForge, Legion, Cherry Street Labs.

Learning Loop
The Vault

Daily notes, project notes, decisions, postmortems, semantic retrieval.

02 · The System I Designed

Not a chatbot. Not a coding agent. A product operating system. Romulus has six designed layers: identity, memory, research, routing, delegation, and governance.

Layer 01
SOUL.md
Modes · voice · authority

Identity

SOUL.md came first. Personality before capability. Modes before tasks. Role clarity before execution.

You cannot manage an agent whose role is undefined.

Layer 02
Vault Graph
Daily notes · projects · decisions

Memory

Flat files became an Obsidian vault, then QMD semantic search, then a queryable wiki layer. Context stopped resetting.

Vault as brain. Sessions as work.

Layer 03
Signal Board
Market · X · competitors

Research

Brave Search, X signal scans, Firecrawl, local scripts, and multi-model pressure testing before build decisions.

Research before implementation.

Layer 04
Routing Matrix
Job · cost · context · risk

Routing

Each model got a job description, cost ceiling, context expectation, and known failure mode.

Models as contributors.

Layer 05
Spec Contract
Direction → build → verify

Delegation

Direction became specs. Specs became implementation. Implementation went through verification and staging.

The spec became the contract.

Layer 06
Fortress
Approvals · money · deploys

Governance

The Fortress: no messages, money, accounts, public posts, credentials, or production deploys without explicit approval.

Delegation without boundaries is risk.
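The Fortress boundary amounts to an allow-nothing-sensitive-by-default gate. A minimal sketch, where the action names mirror the list above but the function shape is purely illustrative, not Romulus' actual implementation:

```python
# Fortress-style approval gate (illustrative). Protected actions fail
# closed unless a human has explicitly approved them.
PROTECTED_ACTIONS = {
    "send_message", "move_money", "create_account",
    "public_post", "use_credentials", "deploy_production",
}

def execute(action: str, approved: bool = False) -> str:
    """Run an action, refusing protected ones without explicit approval."""
    if action in PROTECTED_ACTIONS and not approved:
        raise PermissionError(f"'{action}' requires explicit human approval")
    return f"{action}: executed"
```

Read-only work passes straight through; anything on the list stops until a human flips the approval bit.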

03 · PMing the AI

The human is still the product lead.

My job was not to hand-write every line of code. My job was to decide what was worth building, define the shape of the work, create the spec, choose the right execution path, review the output, catch failure modes, and decide what happened next.

In practice, I was managing a synthetic product team: Romulus as orchestrator, Claude Code as implementation layer, long-context models as research and reasoning contributors, and myself as product lead.

“Using AI” is easy to say and increasingly meaningless. Managing AI work is different, and that distinction matters. It means creating role clarity, asking sharper questions, defining acceptance criteria, routing work to the right contributor, inspecting outputs, and building enough memory that the system learns from yesterday.

Research Lead

Define the question, inspect markets, scan competitors, find signal, and decide what evidence is enough.

Product Strategist

Choose what to build, park, kill, or reframe. Stop chasing bait. Follow pain, money, and timing.

Spec Writer

Turn fuzzy product direction into implementation-ready contracts with constraints and acceptance criteria.

Delegation Manager

Route the work to the right model, tool, or implementation layer based on job type and failure mode.

Design Lead

Set the interaction model, product shape, aesthetic direction, quality bar, and user-facing logic.

Reviewer

Inspect builds, read diffs, test flows, compare against specs, and reject plausible-but-wrong output.

Release Manager

Keep staging branches, Vercel previews, human review gates, and production discipline intact.

Postmortem Owner

Turn failure modes into architecture: routing, validation, durable state, retries, evals, and better protocols.

04 · The Operating Loop

The apps were outputs. The loop was the product. Every product bet moved through the same managed system: frame, research, validate, spec, route, delegate, verify, stage, remember, improve.

01

Frame the product question

The first job is not prompting. It is deciding what we are trying to learn, prove, build, or kill. A vague question poisons the entire loop.

What are we actually trying to learn? The loop starts with a decision artifact: audience, constraint, and success signal before any implementation work begins.

02

Research the terrain

Markets, competitors, pricing, customer pain, timing, distribution, and path to first dollar. Romulus was instructed not to ask permission to research. Just do it.

Follow the pain, money, and timing. Research is structured around pain, money, and timing so trend heat does not masquerade as opportunity.

03

Pressure-test the thesis

Grok, ChatGPT, Claude, and Romulus were used as independent critics. The goal was not agreement. The goal was to find the contradiction before code made it expensive.

Use models to find the contradiction. Independent model critique is used to expose contradictions before code makes the wrong idea expensive.

04

Write the spec

Product requirements, technical requirements, acceptance criteria, edge cases, and known risks. The spec is where product judgment becomes executable.

The spec becomes the contract. The spec holds requirements, constraints, acceptance criteria, and the quality bar in one executable contract.
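One way to picture the spec-as-contract idea is a data structure with a hard acceptance check. The field names and the Chronicle-flavored example are illustrative, not the actual spec format:

```python
from dataclasses import dataclass, field

@dataclass
class Spec:
    """A spec as an executable contract; field names are illustrative."""
    goal: str
    constraints: list[str] = field(default_factory=list)
    acceptance_criteria: list[str] = field(default_factory=list)
    known_risks: list[str] = field(default_factory=list)

    def accept(self, passed: set[str]) -> bool:
        """A build is accepted only if every criterion is met -- no partial credit."""
        return all(c in passed for c in self.acceptance_criteria)

spec = Spec(
    goal="Daily history game with a Wordle-like habit loop",
    acceptance_criteria=["localStorage-only state", "share card", "90 puzzles seeded"],
)
```

The point of the shape: acceptance criteria are enumerated before implementation, so "done" is a set membership test, not a vibe.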

05

Route the work

Different contributors for different jobs: Qwen for operations, Claude Code for implementation, Grok for long-context work, and deprecated models kept away from critical paths.

Choose the right contributor. Routing assigns work by contributor strength, context window, cost profile, and known failure mode.

06

Delegate implementation

Claude Code builds against a spec, not vibes. I do not hand-write production code; I own direction, constraints, review, and release decisions.

Implementation happens against the contract. Delegation is tracked through concrete receipts: branch, build, commit, and review state.

07

Verify and stage

Build, lint, test, inspect, push to staging, review preview, then decide. No direct production deploys. No vercel --prod from the agent.

Trust comes after inspection. Trust is earned through tests, preview review, diff inspection, and an explicit production gate.
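The production gate reads like a checklist where the last two items can never be automated away. A sketch with hypothetical check names:

```python
def ready_to_ship(checks: dict[str, bool]) -> bool:
    """Production gate: every check must pass; check names are illustrative."""
    required = ("build", "lint", "tests", "preview_reviewed", "human_approved")
    return all(checks.get(name, False) for name in required)

# The agent can carry a branch through build, lint, tests, and staging,
# but preview review and final approval stay human.
```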

08

Write back to memory

Daily notes, project notes, decision records, and postmortems make the next product bet sharper. This is where the system compounds.

The next loop starts smarter. The useful residue gets written back into memory so the next product loop starts with more context.

05 · Architecture

Continuity had to be designed.

Romulus started with a flat MEMORY.md. It worked for a few days, then became the equivalent of a notebook with no chapters.

The first real upgrade was an Obsidian vault: daily notes, project notes, decision records, and wikilinks. Human-readable memory. A graph instead of a list.

The second upgrade was semantic retrieval. QMD indexed memory and session transcripts so context could be found by meaning, not keyword. The wiki layer became shared memory that Romulus could query before work began.

The split became the core architecture: the vault is the brain; sessions are the work. Sessions start, do work, and end. The vault persists. Every useful decision gets written back so the next session starts smarter.
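The vault-as-brain, sessions-as-work split can be sketched as a write-back at session end plus retrieval at session start. QMD retrieves by meaning; the keyword match below is a deliberate simplification, and every name here is illustrative:

```python
import datetime

vault: list[dict] = []  # persists across sessions; the "brain"

def end_session(decisions: list[str]) -> None:
    """Write the session's useful residue back to the vault."""
    today = datetime.date.today().isoformat()
    vault.extend({"date": today, "note": d} for d in decisions)

def recall(query: str) -> list[str]:
    """Naive stand-in for semantic retrieval: keyword match over notes."""
    return [e["note"] for e in vault if query.lower() in e["note"].lower()]

end_session(["Demoted MiniMax M2.7 after context overflow",
             "Route long-context research to Grok 4 Fast"])
```

Sessions are disposable; the vault is not. The next session calls `recall` before doing anything else.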

Flat Files
MEMORY.md, project files, and daily notes. Useful, then quickly too flat.
Phase 01
Obsidian Vault
83 files migrated: core memory, projects, daily notes, and decisions with wikilinks.
Phase 02
QMD Search
340 files indexed across memory and sessions. Retrieval by meaning, not just matching words.
Phase 03
Wiki Layer
27 sources ingested into a shared memory layer queried before sessions.
Phase 04
A stateless assistant can help with a task. It cannot manage a product surface over time.
06 · Role Clarity

Before autonomy, identity. Before delegation, boundaries.

The name came first. Romulus: founder, builder, architect of systems. It was not decoration. It was the first design decision.

Before I wrote a prompt, I wrote SOUL.md: a product spec for a personality. Who is Romulus? What does he believe? How does he speak? When should he push back? When should he disappear?

From day one, Romulus was single-user. Only my Discord user ID could command it. The Fortress was designed in from the beginning, not patched on after the system became powerful.

  • Jarvis: Morning briefs, reports, crisp operational updates.
  • Consigliere: Strategic decisions, big calls, measured pushback.
  • Cohort: Build sessions, momentum, execution energy.
  • Roman: Milestones, victories, thematic gravitas.
  • Default: Sharp, warm, human daily conversation.
You cannot manage an agent whose role is undefined.
Routing

Models became contributors with job descriptions.

On April 9, a heavy Legion session broke MiniMax M2.7. The session hit 338 messages. Context overflowed three times. Edit tools failed six times because the model was matching against stale text. The session was gone.

The fix was not “try harder.” The fix was management infrastructure. Each model needed a job, a budget, a context ceiling, and a known failure mode.

Qwen 3.6 Plus
Daily chat, cron jobs, morning briefs, low-cost operations.
$0.40 / $2.00 per M tokens
Claude Code CLI
Implementation layer. All production coding through subscription, not API.
Flat
Grok 4 Fast
Long-context sessions, heavy reasoning, complex research.
2M context
MiniMax M2.7
Demoted after the April 9 context failure. No important work.
Deprecated
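The table above is effectively a routing matrix. A sketch of how such a matrix resolves work to a contributor; the job tags and every context ceiling except Grok's 2M are placeholders, not real limits:

```python
# Illustrative routing matrix: model names mirror the case study,
# job tags and most ceilings are made up for the example.
ROUTING = {
    "qwen-3.6-plus": {"jobs": {"chat", "cron", "briefs"},  "ceiling": 128_000,   "active": True},
    "claude-code":   {"jobs": {"implementation"},          "ceiling": 200_000,   "active": True},
    "grok-4-fast":   {"jobs": {"research", "reasoning"},   "ceiling": 2_000_000, "active": True},
    "minimax-m2.7":  {"jobs": set(),                       "ceiling": 200_000,   "active": False},  # demoted
}

def route(job: str, est_tokens: int) -> str:
    """Pick the first active contributor whose job description and ceiling fit."""
    for name, m in ROUTING.items():
        if m["active"] and job in m["jobs"] and est_tokens <= m["ceiling"]:
            return name
    raise LookupError(f"no contributor cleared for {job!r} at {est_tokens} tokens")
```

The demoted model simply has an empty job list, so it can never be selected -- deprecation as configuration rather than memory.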
Delegation without authority boundaries is not leverage. It is risk.
07 · Failure Modes

Every serious upgrade came from something breaking. The failures are the part I trust most. They made the system real.

I

“Never Bullshit Mike”

Failure

Romulus said it was “researching now” for 30 minutes without actually calling the tools. I caught it after three status checks.

System Upgrade

Trust became a protocol. If the system says it is doing something, it has to be doing it already. Progress theater became unacceptable.

II

Plausible data was wrong

Failure

The morning brief used R16 for the R train instead of R34N. The wrong stop ID looked plausible and produced a wrong commute.

System Upgrade

Verification became a product requirement. Plausibility is not correctness. Source data has to be cross-checked.

III

Context overflow killed a session

Failure

A 338-message Legion session overflowed MiniMax M2.7 three times and produced six edit failures. The model was matching against stale text.

System Upgrade

Context window became an operational constraint. Model routing became a first-class management decision.

IV

Legion broke at the seams

Failure

A native iOS build exceeded a 15-minute timeout. Downstream cohorts never ran. State did not survive the failed handoff.

System Upgrade

Phase 2A became durable state, tiered timeouts, checkpoints, retries, and evals. Agent orchestration fails at the seams, not the demo path.

// receipts that matter
Not “how many apps did AI make?” The better question is whether the operating system became more capable, cheaper to run, and harder to fool.
340 indexed files · Memory + sessions, semantically searchable
Ideas evaluated · Researched, killed, parked, reframed, or built
Cost reduction · OpenRouter burn after routing discipline
338-message failure · The Legion session that made routing real
08 · Product Bets

Product bets managed through the operating loop. The impressive part is not that several things shipped. In 2026, shipping small apps quickly is table stakes. The interesting question is what changed in the product process: what got researched, killed, reframed, delegated, verified, and learned.

Cherry Street Labs became the studio layer: the public container for the experiments, the shared infrastructure, and the lessons moving from one product to the next.

Product Bet I

Chronicle

Live · thischronicle.com · D7 / D30 / D90 instrumented
Product Receipt
Daily history game · localStorage state · retention loop
Product Question

Can a daily history game create a Wordle-like habit loop?

My Role

Defined the product thesis, difficulty arc, acquisition logic, retention lens, quality bar, and release decisions.

Romulus’ Role

Spec generation, puzzle structure, CLAUDE.md, build coordination, verification, and deployment flow.

Outcome

A live daily history game with 90 puzzles seeded, localStorage-only state, seven-day difficulty arc, share card, and retention metrics instrumented for D7/D30/D90.

Delegation works best when the product shape is constrained.
Product Bet II

Forbidden

Live · playforbidden.com · Market validation funnel
Validation Funnel
Constraint-based pass-the-phone game · thesis reframed before build
Product Question

Is there room for a mobile-native party game between Heads Up and Taboo?

My Role

Interpreted contradictory model feedback, rejected the initial framing, chose the sharper mechanic, and set validation criteria.

Romulus’ Role

Multi-model pressure test, competitor framing, category research, thesis critique, and funnel architecture.

Outcome

Reframed from “digital Taboo” to a constraint-based pass-the-phone party game. Built an 862-card corpus across six themes and a web validation funnel for iOS intent.

The best AI-assisted product work is not confirming your idea. It is changing the idea before you waste build time.
Product Bet III

Via

Awaiting Deploy · Built April 17–18, 2026
Fullstack Build
143 files
18 API routes
28 passing tests
Product Question

Can the delegation loop handle a real fullstack product surface?

My Role

Defined scope, UI direction, quality bar, fix priorities, staging review, and product calls.

Romulus’ Role

24K-word product spec, execution monitoring, verification, commit/report loop, and P0/P1/P2 fix tracking.

Outcome

AI-powered Gmail client: 143 files, 18 API routes, 28 passing tests, OAuth, Prisma, calendar integration, read receipts, smart triage, and a liquid-glass design system.

The quality of the build depended on the quality of management. I did not hand-write production code; I owned direction, spec, review, and release decisions.
Research Bet IV

AgentForge

Research Complete · Build Queued
Research Map
15 beta testers
DeFi · highest-scoring vertical
MCP · hosted thesis
Product Question

Is hosted MCP infrastructure for DeFi agents a real wedge?

My Role

Set validation criteria, interpreted the signal, scoped the MVP, and held the build until the thesis cleared a higher bar.

Romulus’ Role

Research pipeline, Grok validation rounds, competitor teardown, X-signal analysis, and beta tester identification.

Outcome

Research complete, build plan written, 15 beta testers identified, DeFi/trading selected as the highest-scoring vertical, and a hosted MCP thesis formed before code was written.

AI leverage is not just building faster. It is deciding what deserves to be built.
Postmortem V

Legion

Postmortem · Phase 2A · Phase 1 live
Cohort Map
Durable state · tiered timeouts · checkpoints · evals
Product Question

Can specialized AI cohorts sequence research, build, monetization, and distribution in one managed pipeline?

What Worked

Caesar worked. It could take a raw product idea, research the market, map competitors, score opportunity, and return a structured build spec in Discord.

What Broke

Complex handoffs failed at the seams. A native iOS build exceeded the 15-minute timeout, downstream cohorts never ran, and task state did not survive the failed handoff.

My Role

System designer, orchestrator designer, failure analyst, postmortem owner, and Phase 2A planner.

Romulus’ Role

Coordinator across cohorts: Caesar for research, Augustus for build, Vespasian for monetization, Trajan for distribution, with Romulus deciding sequence.

Phase 2A Response

Durable task state, tiered timeouts, checkpoints, retries, and evals. Tier 1 for small tasks, Tier 2 for medium builds, Tier 3 for complex builds with checkpoints instead of a single brittle timeout.
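The Phase 2A shape -- durable state, tiers, checkpoints, retries -- can be sketched as a resumable runner. Tier durations and function names here are invented for illustration:

```python
TIER_TIMEOUTS = {1: 120, 2: 600, 3: 1800}  # seconds; illustrative values only

def run_durable(task, tier: int, state: dict, max_retries: int = 2):
    """Run a task whose progress survives failure via a checkpointed state dict."""
    timeout = TIER_TIMEOUTS[tier]
    for attempt in range(max_retries + 1):
        try:
            return task(state, timeout)  # task checkpoints into `state` as it runs
        except TimeoutError:
            continue  # state persists; the next attempt resumes, not restarts
    raise RuntimeError(f"gave up after {max_retries + 1} attempts at step {state.get('step')}")
```

The contrast with the failure above: a single brittle timeout loses everything, while a checkpointed retry resumes from the last step that survived.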

The architecture was promising; the substrate was not ready. Knowing the difference is engineering maturity.
Studio Layer VI

Cherry Street Labs

Live · cherrystreetlabs.com · Studio identity layer
Studio Surface
The container that makes the operating model legible.
Product Question

Can the studio layer make the experiments legible enough to feel like a product practice, not a pile of side projects?

My Role

Defined the studio identity, visual direction, positioning, and role as the public container for the product lab.

Romulus’ Role

Build coordination, iteration support, deployment flow, and model-routing stabilization across the work.

Outcome

A live studio site that gives the product bets a shared surface, visual language, and operating context. The site is less a portfolio wrapper than a product lab identity system.

The work needed a public surface and a coherent studio frame. Cherry Street Labs became the container that made the operating model visible.
09 · What Changed

Romulus made the cost of vague thinking obvious.

The biggest change was not speed. It was managerial clarity.

A vague ask produced a vague spec. A vague spec produced a vague build. The system forced me to become more precise about goals, constraints, acceptance criteria, and what good looked like before implementation began.

That is the part I would take into any AI-native product team: not just the ability to use tools, but the ability to define the operating model around them.

Before

  • Vague asks created vague outputs.
  • Research chased trend heat and market size.
  • AI sessions reset context every time.
  • No systematic model routing.
  • No durable memory system.
  • Build speed was easier to measure than build quality.

After

  • Specs became contracts.
  • Product bets were validated before build.
  • Memory became infrastructure.
  • Models were routed by job, cost, and failure mode.
  • AI output was inspected, not accepted.
  • Failures became postmortems and system upgrades.
10 · Forward View

Small teams will not just become faster. They will become differently shaped.

Plate V · Horizon

The advantage will not go to teams with the most AI tools. It will go to teams that know how to manage AI work.

Memory, routing, delegation, review, authority, cost, and learning loops. Those are the new management primitives. The operator who can design that layer can hold more product surface with fewer people, fewer meetings, and less reset cost.

That is what Romulus was built to test. Not whether AI could generate code. That part is obvious now. The more interesting question is whether one product operator can design a system where research, validation, implementation, release, and memory reinforce each other over time.

The products are evidence. The operating model is the point.