lullabeast

Autonomous dev pipeline · runs on OpenClaw

Bring an idea. Leave with a working MVP.

Describe what you want to build in plain English. A team of agents (planner, executor, reviewer) implements it phase by phase against a real git repo, with deterministic checks between every step and an escalation path back to you when they get stuck.

Open source · In beta · The agents' code runs on a machine you trust

Why it exists

Cheap models fail in predictable ways.

Open-weight models like Qwen, Kimi, and MiniMax cost a fraction of frontier prices. They also fail in known, repeatable ways. Lullabeast is built around those failures instead of pretending they don't happen.

Frontier models need less of this. Cheap models need all of it. That's the whole product.

  • How it's handled: Every deletion has to be announced up front. If a file disappears that wasn't announced, the gate flags it and the phase re-runs.

  • How it's handled: Context is cleared between tasks and each task stays atomic, so focus stays narrow against a clear definition of done. The reviewer then checks that those principles held across the whole phase.

  • How it's handled: A phase can't close until the gate confirms every required test exists, is structured correctly, and passes.

  • How it's handled: Each agent gets three real attempts before anything is flagged to you, and timeouts catch an agent that never starts or hangs mid-run, so one dropped turn never sinks the run.
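To make the first of those concrete, here is a minimal sketch of an announced-deletions check. It is an illustration under assumptions, not Lullabeast's actual gate: the function name `undeclared_deletions` and the file names are hypothetical, and the real gate may gather its file lists differently.

```python
def undeclared_deletions(before: set[str], after: set[str], announced: set[str]) -> set[str]:
    """Return files that vanished between two snapshots without being announced."""
    deleted = before - after      # everything that disappeared during the phase
    return deleted - announced    # keep only the deletions nobody declared up front


# Hypothetical snapshots of tracked files before and after a phase.
before = {"app.py", "legacy.py", "tests/test_app.py"}
after = {"app.py"}
announced = {"legacy.py"}         # the plan said this file would be removed

unexpected = undeclared_deletions(before, after, announced)
print(unexpected)                 # {'tests/test_app.py'}
```

An empty result would let the phase continue; anything else is grounds to flag it and re-run.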

How it works

From a conversation to a committed branch.

You shape the idea in a chat with prd-creator, refining a structured PRD until it's right. The roadmap-converter turns that into a roadmap of ordered phases, and from there every phase runs the same short loop, where nothing advances on an agent's own say-so.

Setup · once per project

Idea: What you want to build, in plain English.
PRD: prd-creator turns it into a structured spec.
Roadmap: roadmap-converter builds it from ordered phases.

Per phase · repeats until the roadmap is done

Planner · Plans the work. Reads the phase spec and writes the plan, generating the phase's tests up front (test-driven). Emits → plan + tests
gate
Executor · Writes the code. Writes the code to make those tests pass, then commits to the git branch. Emits → commit
gate
Reviewer · Checks it behaves. Verifies the result behaves. Passes it forward, or sends it back to the executor to fix. Emits → verdict
gate
Phase lands → next phase

Every gate between those steps is plain Python, not a model. It reads git and the filesystem for hard evidence (a real commit, passing tests, a clean diff) and refuses to advance without it. That's the part that makes cheap models safe to trust with your repo.
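As a rough illustration of what "plain Python, not a model" can mean here, the sketch below collects evidence with ordinary git commands and decides with ordinary conditionals. Everything in it is an assumption for illustration: `Evidence`, `collect_evidence`, `gate_failures`, and the test command are hypothetical names, not Lullabeast's real gate code.

```python
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    head_sha: str            # commit the branch points at now
    base_sha: str            # commit the phase started from
    dirty_paths: list[str]   # files with uncommitted changes
    tests_passed: bool       # exit status of the test command


def _git(repo: str, *args: str) -> str:
    """Run a git command inside `repo` and return its stdout."""
    result = subprocess.run(
        ["git", "-C", repo, *args], capture_output=True, text=True, check=True
    )
    return result.stdout.rstrip()  # keep leading spaces; porcelain lines rely on them


def collect_evidence(repo: str, base_sha: str, test_cmd: list[str]) -> Evidence:
    """Read git and run the tests; no model output is consulted."""
    status = _git(repo, "status", "--porcelain")
    tests = subprocess.run(test_cmd, cwd=repo)
    return Evidence(
        head_sha=_git(repo, "rev-parse", "HEAD"),
        base_sha=base_sha,
        dirty_paths=[line[3:] for line in status.splitlines()],
        tests_passed=tests.returncode == 0,
    )


def gate_failures(e: Evidence) -> list[str]:
    """Return the reasons the gate refuses to advance (empty means pass)."""
    failures = []
    if e.head_sha == e.base_sha:
        failures.append("no new commit since the phase started")
    if e.dirty_paths:
        failures.append("uncommitted changes: " + ", ".join(e.dirty_paths))
    if not e.tests_passed:
        failures.append("tests are failing")
    return failures


# Hand-built evidence, so the example runs without a repository.
ok = Evidence(head_sha="b2c3d4", base_sha="a1b2c3", dirty_paths=[], tests_passed=True)
stalled = Evidence(head_sha="a1b2c3", base_sha="a1b2c3", dirty_paths=["app.py"], tests_passed=False)
print(gate_failures(ok))        # []
print(gate_failures(stalled))
```

Splitting collection from the decision keeps the decision a pure function, so it can be checked without touching a real repo.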

Needs your attention

When a gate can't be satisfied (repeated failures, a blown budget, an ambiguous spec), the run stops and asks you instead of guessing. The escalation agent can read and explain, but it has no shell, no edits, no browser.
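The stop-and-ask behaviour can be pictured as a small loop around each step. This is a sketch under assumptions: the three-attempt limit comes from this page, but `run_step`, the cost-returning `attempt` callable, and the budget handling are hypothetical stand-ins for whatever the pipeline really does.

```python
from typing import Callable

MAX_ATTEMPTS = 3  # this page states three attempts before anything is flagged


def run_step(attempt: Callable[[], float], gate: Callable[[], bool], budget_usd: float) -> str:
    """Try a step up to MAX_ATTEMPTS times; return "advance" or an escalation reason.

    `attempt` runs the agent once and returns what that run cost (assumed).
    `gate` is the deterministic check that must pass before moving on.
    """
    spent = 0.0
    for _ in range(MAX_ATTEMPTS):
        spent += attempt()
        if gate():
            return "advance"
        if spent > budget_usd:
            return "escalate: budget exceeded"
    return "escalate: repeated failures"


# A flaky agent whose work fails the gate once, then passes.
gate_results = iter([False, True])
outcome = run_step(lambda: 0.10, lambda: next(gate_results), budget_usd=1.00)
print(outcome)  # advance
```

The function only reports that a human is needed; it never tries to guess its way past a failed gate.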

Completion report · optional

When the whole roadmap is finished, the pipeline writes up what shipped across every phase and suggests where to take it next, so you land with a clear place to pick up.

The dashboard

A dashboard, not a command line.

The CLI is only for install and launch; everything real happens in the browser, built for everyone, not just people who live in a terminal.

127.0.0.1:18790 · Pipeline Monitor
The Lullabeast Pipeline Monitor mid-run: the live planner, executor, and reviewer loop with agent attempts, roadmap progress, per-phase cost and tokens, and a real-time activity feed.
  • Chat an idea into a PRD, then generate the roadmap. No files to hand-write.
  • Launch and watch live: the planner, executor, reviewer loop, phase by phase.
  • See what it spends: cost and tokens, per phase and per agent, as it runs.
  • Answer escalations inline: when a phase needs you, resolve it right there.

Living proof

MultiLife: built twice to compare.

I gave it one brief: a multi-team Conway's Game of Life where colored teams contest territory, with live analytics and a verdict when the board stabilizes. Then I ran the same PRD and the same eleven phases two ways, on local Qwen and on cloud open-weight models, to see what the price of "cheap" actually buys.

Local · RTX 4090 · 48GB (modded)
Planner: Qwen3.6-27B-MTP Q8_0
Executor: Qwen3.6-27B-MTP Q8_0
Reviewer: Qwen3.6-27B Q8_0
Runtime: 3h 27m
Phases: 11 / 11 ✓ · no retries
Tokens: 21.9M
Tokens by agent: Planner 15% · Executor 54% · Reviewer 31%

Total cost

--

no API spend; local compute (power) wasn't metered

Cloud · hosted open models · measured
Planner: glm-5.2
Executor: kimi-k2.7-code
Reviewer: kimi-k2.7-code
Runtime: 2h 4m
Phases: 11 / 11 ✓ · 2 retries
Tokens: 27.5M
Tokens by agent: Planner 14% · Executor 61% · Reviewer 26%

Total cost

$6.90

real API bill: $1.21 planner / $3.86 executor / $1.84 reviewer

Same PRD, same eleven phases, both runs measured end to end. No estimates anywhere on this page.
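For a sense of scale, the cloud figures above work out to roughly a quarter of a dollar per million tokens. The arithmetic below uses only the numbers reported on this page; the variable names are mine.

```python
total_cost_usd = 6.90          # reported total for the cloud run
total_tokens_millions = 27.5   # reported token count for the cloud run
bill = {"planner": 1.21, "executor": 3.86, "reviewer": 1.84}

usd_per_million = total_cost_usd / total_tokens_millions
print(f"${usd_per_million:.2f} per million tokens")   # $0.25 per million tokens

# The per-agent lines sum to a cent more than the headline total,
# presumably from rounding each line separately.
print(round(sum(bill.values()), 2))                   # 6.91
```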

Both are real and clickable. Go see for yourself.

What else it builds

More than a one-off.

MultiLife is one of several. The same pipeline has also built GridBeast (a mini spreadsheet tool), a 2048 with correct merge semantics, a live regex tester, and SVG Pictionary, a multi-screen game with persistent state and live simultaneous LLM calls. Every one ships with the exact PRD and phased roadmap that drove it.

What it won't do

The honest limits.

The whole point is to be straight about where this is and isn't ready. Here's what to expect before you install anything.

It's in beta

I've worked on this for months and made significant improvements in output, but it's still far from perfect. I'm releasing it now to get real-world feedback and find where it breaks. Bug reports and contributions are welcome!

Coding agents are as risky as their access

The pipeline executes agent-written code directly on your host. Because of that I run it in a VM, and I'd suggest doing the same.

It won't deploy for you

You get a working build committed to a real git branch, not a hosted, production-ready service. Lullabeast takes on the initial development; enhancements and deployment are separate for now.

Hard projects/phases escalate

It shines on small, focused webapps. If a scoped task is too large or intricate, it stops and asks for help rather than accepting output that fails to meet expectations.

It's a single-user local tool

One access token, loopback only. No accounts, roles, or multi-user support, by design for now.

Get started

Install OpenClaw, then Lullabeast.

Lullabeast is the pipeline and dashboard. The agents themselves run inside OpenClaw, which you install and run separately, so that's step one. I won't pretend setup is frictionless: getting OpenClaw configured is the real work, and the installer handles as much of the rest as it can. If you hit a wall, SETUP.md covers the silent-failure modes.

  1. Install and start OpenClaw.

    The agents' runtime. Have the gateway listening on its default port; a "connection refused" means it isn't up yet.

    curl -s http://localhost:18789/v1/models

  2. Install Lullabeast.

    Clone, then run the interactive installer. It registers the agents with OpenClaw and generates your dashboard token. Safe to re-run.

    git clone https://github.com/bigbraingoldfish/lullabeast.git autodev-ui
    cd autodev-ui
    ./install.sh

  3. Run the dashboard.

    From the repo root; the module form is required. It prints your tokenized access URL at startup.

    source .env
    python -m ui.server