Task Manager Lifecycle
This page is the canonical contributor map for how a task moves through the
task manager in libs/tasks/.
If you are debugging why a task is not moving, keep three layers separate:
- Task spec: markdown under tasks/workstreams/*/tasks/*.md
- Runtime truth: SQLite rows in the shared task-manager control plane
- UI grouping: derived buckets such as Ready or Needs attention
Most confusion comes from mixing those layers.
Core rule
The heartbeat reads markdown once at the top of a cycle with
sync_markdown_task_specs_to_sqlite(). After that, runtime surfaces should read
from SQLite only.
That means:
- task markdown defines spec-owned fields such as title, owner, dependencies, milestone, and worksheet content
- SQLite defines runtime-owned fields such as queue state, runs, PR state, delivery state, and operator classification
- UI buckets are projections, not authoritative state
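The ownership split can be illustrated with a small sketch. The table layout, the column names, and the sync_spec helper are hypothetical and are not the real schema or the real sync_markdown_task_specs_to_sqlite(); the only point is that a spec sync overwrites spec-owned columns and leaves runtime-owned columns alone.

```python
import sqlite3

# Hypothetical subset of spec-owned columns; the real schema is not shown on this page.
SPEC_COLUMNS = ("title", "owner", "milestone")


def sync_spec(conn: sqlite3.Connection, slug: str, spec: dict) -> None:
    """Copy spec-owned fields into SQLite without touching runtime-owned columns."""
    # Create the row on first sight, then update only the spec-owned columns.
    conn.execute("INSERT OR IGNORE INTO tasks (slug) VALUES (?)", (slug,))
    assignments = ", ".join(f"{col} = ?" for col in SPEC_COLUMNS)
    values = [spec.get(col) for col in SPEC_COLUMNS]
    conn.execute(f"UPDATE tasks SET {assignments} WHERE slug = ?", (*values, slug))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (slug TEXT PRIMARY KEY, title TEXT, owner TEXT, "
    "milestone TEXT, queue_state TEXT)"
)
sync_spec(conn, "demo-task", {"title": "Demo", "owner": "alice", "milestone": "m1"})

# Runtime code (not the spec sync) owns queue_state.
conn.execute("UPDATE tasks SET queue_state = 'queued' WHERE slug = 'demo-task'")

# A later sync changes the title but must not reset queue_state.
sync_spec(conn, "demo-task", {"title": "Demo v2", "owner": "alice", "milestone": "m1"})
row = conn.execute(
    "SELECT title, queue_state FROM tasks WHERE slug = 'demo-task'"
).fetchone()
```

In this sketch the second sync updates the title while the queue state set by runtime code survives, which is the behaviour the core rule asks for.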
Who owns transitions
| Component | Main entrypoints | What it is allowed to move |
|---|---|---|
| Spec sync | sync_markdown_task_specs_to_sqlite() | Markdown spec into SQLite task-spec rows |
| Queueing | kickoff_task(), queue_task() | planned -> queued |
| Heartbeat | run_heartbeat_cycle(), build_manager_state() | Periodic refresh, reconciliation, classification, dispatch pickup |
| Dispatcher | run_dispatch() | queued/dispatching -> running, worker follow-up loops, stale-gate detection on worker exit |
| Run finalizer | finish_dispatch_run() | running -> exited/failed/interrupted/closeout_* |
| Delivery controller | _refresh_task_runtime_view() and PR repair helpers | PR discovery, CI/conflict state, review routing |
| Stage/stuck rules | tick.py hooks (on_entry, ongoing, stuck rules) | Nudges, retries, stall breaking; one recovery engine that runs in the fast cycle |
| Reconciler | reconcile_live_runs(), reconcile_merged_tasks(), reconcile_terminal_runs_to_sqlite_runtime() | Dead-worker cleanup, merged-task completion, terminal drift cleanup |
| Incident detector | process_stale_runs() in incident.py | Emits durable incident row, sets driver_action=incident_emitted, delivers to tasks tmux agent |
| Tasks agent | tasks tmux session (operator or manager-spawned) | Judgment-based repair: nudge, restart, escalate to user |
Stage model
The canonical stage enum lives in libs/tasks/tasks/manager/stages.py.
There are 12 stages:
flowchart TD
planned["planned"]
queued["queued"]
launching["launching"]
running["running"]
closeout["closeout"]
pr_testing["pr_testing"]
pr_red["pr_red"]
pr_conflicts["pr_conflicts"]
pr_green["pr_green"]
pr_merged["pr_merged"]
done["done"]
cancelled["cancelled"]
planned --> queued
queued --> launching
launching --> running
running --> closeout
closeout --> pr_testing
pr_testing --> pr_red
pr_testing --> pr_conflicts
pr_testing --> pr_green
pr_red --> pr_green
pr_conflicts --> pr_green
pr_green --> pr_merged
pr_merged --> done
planned --> cancelled
queued --> cancelled
launching --> cancelled
running --> cancelled
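The edges in the diagram can also be written down as data, which is handy when scanning a transition log for unusual jumps. This is a sketch only: the names TRANSITIONS, is_diagram_edge, and terminal_stages are hypothetical, and because stages are derived from observed state, the diagram describes the expected path rather than an enforced state machine.

```python
from typing import Dict, List, Set

# Edges copied one-for-one from the stage diagram above.
TRANSITIONS: Dict[str, Set[str]] = {
    "planned": {"queued", "cancelled"},
    "queued": {"launching", "cancelled"},
    "launching": {"running", "cancelled"},
    "running": {"closeout", "cancelled"},
    "closeout": {"pr_testing"},
    "pr_testing": {"pr_red", "pr_conflicts", "pr_green"},
    "pr_red": {"pr_green"},
    "pr_conflicts": {"pr_green"},
    "pr_green": {"pr_merged"},
    "pr_merged": {"done"},
    "done": set(),
    "cancelled": set(),
}


def is_diagram_edge(src: str, dst: str) -> bool:
    """Return True when src -> dst appears as an arrow in the diagram."""
    return dst in TRANSITIONS.get(src, set())


def terminal_stages() -> List[str]:
    """Stages with no outgoing arrow in the diagram."""
    return sorted(stage for stage, nxt in TRANSITIONS.items() if not nxt)
```

A logged transition that is not a diagram edge is not necessarily a bug, but it is a reasonable place to start looking.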
Stage derivation
Stages are derived every tick from observable state — they are never stored
as mutable fields. derive_stage_from_state() in stages.py maps the following
inputs to the canonical stage:
| Input | Description |
|---|---|
| is_cancelled | frontmatter status is a cancelled value |
| is_completed | frontmatter status is a completed value |
| queued_at | non-empty queue row timestamp |
| run_state | latest run row state column |
| dispatch_state | computed dispatch liveness (running, worker_gone, etc.) |
| pr_state | PR state from pr_metadata (OPEN, MERGED, CLOSED) |
| pr_ci_state | CI rollup (success, failure, pending) |
| pr_mergeable | mergeability check (MERGEABLE, CLEAN, CONFLICTING, etc.) |
Precedence (highest first):
1. cancelled: terminal abandon beats every other signal
2. active worker (starting/running run_state): beats PR and queue state
3. PR state: MERGED folds into DONE (if completed) or PR_MERGED; OPEN dispatches to pr_conflicts/pr_red/pr_green/pr_testing
4. completed (frontmatter): DONE when no PR signal in step 3
5. queued: beats stale closeout from prior runs
6. closeout: any terminal run state with no PR, not queued, not completed
7. planned (default)
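The precedence order can be sketched as a chain of early returns. This is an illustration, not the body of derive_stage_from_state(): the function name derive_stage_sketch is hypothetical, the mapping of a starting run to launching is an assumption, the order of the checks inside the OPEN branch is an assumption, and dispatch_state is left out for brevity.

```python
from enum import Enum


class Stage(str, Enum):
    PLANNED = "planned"
    QUEUED = "queued"
    LAUNCHING = "launching"
    RUNNING = "running"
    CLOSEOUT = "closeout"
    PR_TESTING = "pr_testing"
    PR_RED = "pr_red"
    PR_CONFLICTS = "pr_conflicts"
    PR_GREEN = "pr_green"
    PR_MERGED = "pr_merged"
    DONE = "done"
    CANCELLED = "cancelled"


def derive_stage_sketch(*, is_cancelled=False, is_completed=False, queued_at="",
                        run_state="", pr_state="", pr_ci_state="", pr_mergeable=""):
    # 1. Cancelled beats every other signal.
    if is_cancelled:
        return Stage.CANCELLED
    # 2. An active worker beats PR and queue state.
    #    Assumption: "starting" corresponds to the launching stage.
    if run_state == "starting":
        return Stage.LAUNCHING
    if run_state == "running":
        return Stage.RUNNING
    # 3. PR state.
    if pr_state == "MERGED":
        return Stage.DONE if is_completed else Stage.PR_MERGED
    if pr_state == "OPEN":
        # Assumption: conflicts are checked before CI status.
        if pr_mergeable == "CONFLICTING":
            return Stage.PR_CONFLICTS
        if pr_ci_state == "failure":
            return Stage.PR_RED
        if pr_ci_state == "success":
            return Stage.PR_GREEN
        return Stage.PR_TESTING
    # 4. Completed frontmatter with no PR signal.
    if is_completed:
        return Stage.DONE
    # 5. A queue row beats stale closeout from prior runs.
    if queued_at:
        return Stage.QUEUED
    # 6. Any remaining run state is treated as terminal here.
    if run_state:
        return Stage.CLOSEOUT
    # 7. Default.
    return Stage.PLANNED
```

For example, a task with a queue row and a failed prior run resolves to queued rather than closeout, because step 5 is checked before step 6.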
Key files
| File | What it does |
|---|---|
| stages.py | Stage enum, derive_stage_from_state, observe_task, STAGES table with hooks |
| tick.py | Per-task tick loop: seed/advance/noop based on declared vs observed stage |
| stuck.py | Stuck machine state, escalate_to_human, needs_human_slugs |
| transitions.py | Stage transition log (SQLite stage_transitions table) |
| service.py | Heartbeat integration: run_heartbeat_cycle, run_manager_loop |
Stage hooks
Each stage entry in STAGES (a dict[Stage, StageDef]) can declare:
- on_entry: runs once when the tick advances into this stage
- ongoing: runs every tick while the task remains in this stage
- stuck: list of StuckRule entries with when, unstick, and on_broke
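One plausible shape for these declarations is a pair of dataclasses. The field names follow the list above; everything else (the Hook alias, the name and max_attempts fields and their defaults, the 600-second idle threshold, and the dict-based task) is an assumption made for the sketch rather than the real definitions in stages.py.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

Hook = Callable[[Any], None]


@dataclass
class StuckRule:
    name: str
    when: Callable[[Any], bool]        # predicate: is the task stuck?
    unstick: Optional[Hook] = None     # automatic repair attempt
    on_broke: Optional[Hook] = None    # escalation once repairs are exhausted
    max_attempts: int = 3              # assumed default


@dataclass
class StageDef:
    on_entry: Optional[Hook] = None
    ongoing: Optional[Hook] = None
    stuck: List[StuckRule] = field(default_factory=list)


events: List[tuple] = []

# Hypothetical table keyed by stage name; the real STAGES is keyed by the Stage enum.
STAGES_SKETCH: Dict[str, StageDef] = {
    "running": StageDef(
        ongoing=lambda task: events.append(("liveness_check", task["slug"])),
        stuck=[
            StuckRule(
                name="idle_too_long",
                when=lambda task: task["idle_seconds"] > 600,  # assumed threshold
                unstick=lambda task: events.append(("nudge", task["slug"])),
            )
        ],
    ),
    "pr_merged": StageDef(
        on_entry=lambda task: events.append(("mark_completed", task["slug"])),
    ),
}

task = {"slug": "demo-task", "idle_seconds": 900}
stage = STAGES_SKETCH["running"]
stage.ongoing(task)
for rule in stage.stuck:
    if rule.when(task) and rule.unstick:
        rule.unstick(task)
```

Keeping hooks as plain callables on a per-stage record means a new stage behaviour is a table entry rather than a new branch in the tick loop.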
Current hooks:
| Stage | Hook type | Function | What it does |
|---|---|---|---|
| RUNNING | ongoing | _running_ongoing | O(1) PID liveness check via get_run(task.run_id); marks dead workers lost |
| RUNNING | stuck worker_died | _unstick_worker_vanished | Requeues task when worker vanished and retries remain |
| RUNNING | stuck idle_too_long | _unstick_running_idle | Sends nudge message to idle worker |
| RUNNING | stuck worker_retries_exhausted | → escalate_to_human | Escalates when max retries exceeded |
| PR_MERGED | on_entry | _pr_merged_on_entry | Marks task completed in registry when PR merges |
Tick loop
tick(tasks) in tick.py runs once per heartbeat cycle per task:
- Compute observed = observe_task(task)
- Look up declared = current_stage(task.slug) from the transition log
- If no declared stage: seed the task (write first transition)
- If declared != observed: advance (write new transition, fire on_entry for new stage)
- If declared == observed: noop (fire ongoing hook if declared)
- After entry/ongoing hooks: evaluate stuck rules for the current stage
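The seed/advance/noop decision can be sketched with an in-memory transition log. The names tick_one, transition_log, ON_ENTRY, and ONGOING are hypothetical, the real log lives in SQLite, and stuck-rule evaluation is reduced to a recorded marker. Whether stuck rules also run on the seeding tick is not stated above; this sketch assumes they do not.

```python
from typing import List, Optional, Tuple

# (slug, from_stage, to_stage); stands in for the SQLite transition log.
transition_log: List[Tuple[str, Optional[str], str]] = []
fired: List[tuple] = []

# Hypothetical hook tables keyed by stage name.
ON_ENTRY = {"pr_merged": lambda slug: fired.append(("on_entry", "pr_merged", slug))}
ONGOING = {"running": lambda slug: fired.append(("ongoing", "running", slug))}


def current_stage(slug: str) -> Optional[str]:
    """Return the most recently declared stage for slug, or None if never seeded."""
    for logged_slug, _from_stage, to_stage in reversed(transition_log):
        if logged_slug == slug:
            return to_stage
    return None


def tick_one(slug: str, observed: str) -> str:
    declared = current_stage(slug)
    if declared is None:
        # Seed: write the first transition, no hooks.
        transition_log.append((slug, None, observed))
        return "seed"
    if declared != observed:
        # Advance: write a new transition and fire on_entry for the new stage.
        transition_log.append((slug, declared, observed))
        hook, action = ON_ENTRY.get(observed), "advance"
    else:
        # Noop: stage unchanged, fire the ongoing hook if one is declared.
        hook, action = ONGOING.get(observed), "noop"
    if hook:
        hook(slug)
    # Stuck rules are evaluated after the entry/ongoing hook.
    fired.append(("stuck_rules_evaluated", observed, slug))
    return action


actions = [tick_one("demo-task", stage) for stage in ("running", "running", "pr_merged")]
```

Three consecutive ticks that observe running, running, and pr_merged yield seed, noop, and advance, and only two rows land in the log.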
Stuck machine
The stuck machine in stuck.py tracks per-task attempts for each stuck rule:
- when(task) fires → increment attempt counter
- If attempts < max_attempts: call unstick(task) (Tier 1 repair)
- If attempts >= max_attempts: transition to broke, call on_broke(task) (Tier 2 escalation)
- escalate_to_human(task) writes a needs_human event; needs_human_slugs() reads it for the operator view
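The attempt counting can be sketched as follows. The evaluate_rule function and its in-memory state are hypothetical stand-ins for stuck.py. Two assumptions are made explicit: the comparison happens after the increment, as the list order suggests, and a rule that has gone broke stays broke and is not re-evaluated.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple

Key = Tuple[str, str]  # (task slug, rule name)

attempts: Dict[Key, int] = defaultdict(int)
broke: Set[Key] = set()
log: List[str] = []


def evaluate_rule(slug: str, rule_name: str, when: Callable[[], bool],
                  unstick: Callable[[], None], on_broke: Callable[[], None],
                  max_attempts: int) -> str:
    key = (slug, rule_name)
    # Assumption: once a rule is broke, it is no longer evaluated.
    if key in broke or not when():
        return "idle"
    attempts[key] += 1
    if attempts[key] < max_attempts:
        unstick()      # Tier 1: automatic repair
        return "unstick"
    broke.add(key)
    on_broke()         # Tier 2: escalation
    return "broke"


results = []
for _ in range(4):
    results.append(
        evaluate_rule(
            "demo-task",
            "idle_too_long",
            when=lambda: True,
            unstick=lambda: log.append("nudge"),
            on_broke=lambda: log.append("needs_human"),
            max_attempts=3,
        )
    )
```

Under these assumptions a rule with max_attempts=3 gets two repair attempts before the third firing escalates.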
Who owns transitions
| Component | Main entrypoints | What it moves |
|---|---|---|
| Spec sync | sync_markdown_task_specs_to_sqlite() | Markdown spec → SQLite task-spec rows |
| Queueing | kickoff_task(), queue_task() | planned → queued |
| Heartbeat | run_heartbeat_cycle() | Periodic refresh, classification, dispatch pickup, stage tick |
| Dispatcher | run_dispatch() | queued/launching → running, worker follow-up, stale-gate detection on worker exit |
| Stage tick | tick.py hooks | on_entry, ongoing, stuck rules |
| Delivery controller | _refresh_task_runtime_view() + PR helpers | PR discovery, CI/conflict state, review routing |
| Stage/stuck rules | tick.py hooks | Nudges, retries, stall breaking |
| Incident detector | process_stale_runs() | Emits incident row, sets driver_action=incident_emitted, delivers to tasks agent |
Stage-by-stage audit
| Stage | What it means | Who puts a task here | What normally moves it next |
|---|---|---|---|
| planned | Task exists, not queued, no active runtime | Task creation / markdown sync | kickoff_task() |
| queued | Queue row exists, no live worker yet | kickoff_task(), stage_unstick retry | Heartbeat dispatch pickup |
| launching | Run row started, worker bootstrapping | Dispatcher start_dispatch_run() | Worker heartbeat advances to running |
| running | Active worker run exists | Dispatcher | Worker exit, stuck-rule nudge/retry, dead-worker detection |
| closeout | Worker finished, no open PR, task not completed | Run terminal state | PR creation, task completion, operator action |
| pr_testing | Open PR, CI not yet determined | PR discovery | CI result arrives |
| pr_red | Open PR, CI failing | CI failure | Fix pushed, CI passes |
| pr_conflicts | Open PR, merge conflicts | Conflict detected | Rebase / conflict resolution |
| pr_green | Open PR, CI passing, mergeable | CI + mergeability pass | PR merge |
| pr_merged | PR merged, task not yet completed | Merge event | _pr_merged_on_entry marks completed |
| done | Terminal success | _pr_merged_on_entry, or completed frontmatter | None |
| cancelled | Terminal abandoned | Cancellation intent | None |
Happy path
sequenceDiagram
participant User
participant Queue as Queue API
participant Heartbeat
participant Dispatch
participant Delivery
User->>Queue: kickoff_task() / queue_task()
Heartbeat->>Heartbeat: sync_markdown_task_specs_to_sqlite()
Heartbeat->>Dispatch: launch_dispatch()
Dispatch->>Dispatch: start_dispatch_run()
Dispatch->>Dispatch: run worker + liveness heartbeat
Dispatch->>Dispatch: finish_dispatch_run()
Delivery->>Delivery: persist PR metadata / CI / mergeability
Heartbeat->>Heartbeat: tick() _pr_merged_on_entry marks completed
Normal contributor expectation:
- Task is queued.
- Heartbeat picks it up and dispatches.
- Dispatch creates the worktree and registers a run.
- Worker runs until closeout.
- Closeout creates a PR.
- Delivery tracks CI and mergeability while PR is open.
- PR merge triggers _pr_merged_on_entry, which marks the task completed.
Repair paths
| Problem | Where it is noticed | Automatic action |
|---|---|---|
| Dead worker PID | reconcile_live_runs() during heartbeat | Mark latest active run lost, surface as worker_gone |
| Idle worker | Stage/stuck rules in tick.py | Queue a nudge message |
| failed / interrupted / worker_gone with no PR and retries left | Stage/stuck rules in tick.py | Requeue task |
| Stale owned run idle past threshold | process_stale_runs() in heartbeat | Emit incident row, set driver_action=incident_emitted, deliver packet to tasks agent |
| PR CI failing | Delivery controller | Post repair instructions / surface attention |
| PR conflicts | Delivery controller | Post repair instructions / surface attention |
| PR merged | reconcile_merged_tasks() | Mark task completed in SQLite |
| Terminal task still has active run or queue row | reconcile_terminal_runs_to_sqlite_runtime() | Close stale runtime rows |
UI buckets are not lifecycle stages
The homepage does not show raw stages. It groups tasks into display buckets via classify_tasks().
| UI bucket | Meaning |
|---|---|
| Ready | Task is queued, unregistered, non-terminal, no dependency blockers |
| Waiting on dependency | Task is queued but blockers still exist |
| Active | Task is running/launching, or in a healthy open-PR delivery state |
| Needs attention | A repair/anomaly signal exists |
| Needs human | Stuck machine escalated to needs_human |
| Other | Everything else |
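To make the "projection" idea concrete, here is a rough sketch of a bucket function. It is not classify_tasks(): the name bucket_for, its parameters, the order in which buckets are checked, and the choice of pr_testing and pr_green as the "healthy" open-PR stages are all assumptions, and the "unregistered" condition on Ready is omitted.

```python
from typing import Sequence

# Assumption: these stages count as active or healthy delivery states.
ACTIVE_STAGES = {"launching", "running", "pr_testing", "pr_green"}


def bucket_for(stage: str, blockers: Sequence[str] = (),
               attention_signals: Sequence[str] = (),
               needs_human: bool = False) -> str:
    """Project a derived stage plus signals onto a display bucket."""
    # Assumed precedence: escalation first, then anomaly signals.
    if needs_human:
        return "Needs human"
    if attention_signals:
        return "Needs attention"
    if stage == "queued":
        return "Waiting on dependency" if blockers else "Ready"
    if stage in ACTIVE_STAGES:
        return "Active"
    return "Other"
```

The bucket is computed from the stage and signals on every call and is never written back, so two tasks in the same stage can land in different buckets.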
Practical debugging order
When a task looks wrong, check in this order:
- Is the task spec present and synced into SQLite?
- Is there a queue row?
- Is there a latest run row, and what is its state?
- Does the run still have a live PID?
- Is there persisted PR metadata?
- What stage does observe_task() derive?
- What attention signals did classify_tasks() add?
That sequence mirrors how the manager itself decides what the task is.
Related docs
- ai_notes/core/TASK_MANAGER_MINIMAL_ARCHITECTURE.md
- ai_notes/core/task-manager-layers.md
- libs/tasks/tasks/tools/cli-reference.md