Task Manager Lifecycle¶

This page is the canonical contributor map for how a task moves through the task manager in libs/tasks/.

If you are debugging why a task is not moving, keep three layers separate:

Task spec: markdown under tasks/workstreams/*/tasks/*.md
Runtime truth: SQLite rows in the shared task-manager control plane
UI grouping: derived buckets such as Ready or Needs attention

Most confusion comes from mixing those layers.

Core rule¶

The heartbeat reads markdown once at the top of a cycle with sync_markdown_task_specs_to_sqlite(). After that, runtime surfaces should read from SQLite only.

That means:

task markdown defines spec-owned fields such as title, owner, dependencies, milestone, and worksheet content
SQLite defines runtime-owned fields such as queue state, runs, PR state, delivery state, and operator classification
UI buckets are projections, not authoritative state
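
A minimal sketch of that ownership split, assuming a hypothetical single `tasks` table whose column names are invented for illustration (the real control-plane schema is not shown on this page). The sync upserts only spec-owned columns, so a runtime-owned column survives the next markdown sync:

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# not the real task-manager control-plane schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    slug        TEXT PRIMARY KEY,
    title       TEXT NOT NULL,   -- spec-owned (comes from markdown)
    owner       TEXT,            -- spec-owned (comes from markdown)
    queue_state TEXT             -- runtime-owned (never written by spec sync)
)
"""


def sync_spec(conn: sqlite3.Connection, slug: str, title: str, owner: str) -> None:
    """Upsert spec-owned columns only; runtime-owned columns are left untouched."""
    conn.execute(
        """
        INSERT INTO tasks (slug, title, owner) VALUES (?, ?, ?)
        ON CONFLICT(slug) DO UPDATE SET
            title = excluded.title,
            owner = excluded.owner
        """,
        (slug, title, owner),
    )


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

sync_spec(conn, "demo-task", "Old title", "alice")            # first cycle
conn.execute(
    "UPDATE tasks SET queue_state = 'queued' WHERE slug = 'demo-task'"
)                                                             # runtime write
sync_spec(conn, "demo-task", "New title", "alice")            # next cycle's sync

row = conn.execute(
    "SELECT title, queue_state FROM tasks WHERE slug = 'demo-task'"
).fetchone()
```

After the second sync the title follows the markdown while `queue_state` keeps its runtime value. The upsert syntax needs SQLite 3.24 or newer.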

Who owns transitions¶

Component	Main entrypoints	What it is allowed to move
Spec sync	`sync_markdown_task_specs_to_sqlite()`	Markdown spec into SQLite task-spec rows
Queueing	`kickoff_task()`, `queue_task()`	`planned -> queued`
Heartbeat	`run_heartbeat_cycle()`, `build_manager_state()`	Periodic refresh, reconciliation, classification, dispatch pickup
Dispatcher	`run_dispatch()`	`queued/dispatching -> running`, worker follow-up loops, stale-gate detection on worker exit
Run finalizer	`finish_dispatch_run()`	`running -> exited/failed/interrupted/closeout_*`
Delivery controller	`_refresh_task_runtime_view()` and PR repair helpers	PR discovery, CI/conflict state, review routing
Stage/stuck rules	`tick.py` hooks (`on_entry`, `ongoing`, stuck rules)	Nudges, retries, stall breaking — one recovery engine, runs in the fast cycle
Reconciler	`reconcile_live_runs()`, `reconcile_merged_tasks()`, `reconcile_terminal_runs_to_sqlite_runtime()`	Dead-worker cleanup, merged-task completion, terminal drift cleanup
Incident detector	`process_stale_runs()` in `incident.py`	Emits durable incident row, sets `driver_action=incident_emitted`, delivers to `tasks` tmux agent
Tasks agent	`tasks` tmux session (operator or manager-spawned)	Judgment-based repair: nudge, restart, escalate to user

Stage model¶

The canonical stage enum lives in libs/tasks/tasks/manager/stages.py. There are 12 stages:

flowchart TD
    planned["planned"]
    queued["queued"]
    launching["launching"]
    running["running"]
    closeout["closeout"]
    pr_testing["pr_testing"]
    pr_red["pr_red"]
    pr_conflicts["pr_conflicts"]
    pr_green["pr_green"]
    pr_merged["pr_merged"]
    done["done"]
    cancelled["cancelled"]

    planned --> queued
    queued --> launching
    launching --> running
    running --> closeout
    closeout --> pr_testing
    pr_testing --> pr_red
    pr_testing --> pr_conflicts
    pr_testing --> pr_green
    pr_red --> pr_green
    pr_conflicts --> pr_green
    pr_green --> pr_merged
    pr_merged --> done
    planned --> cancelled
    queued --> cancelled
    launching --> cancelled
    running --> cancelled
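
The flowchart above can also be read as plain data. The sketch below copies its edges into a dict (plain strings stand in for the real stage enum) and derives which stages are terminal and what is reachable from a given stage. This page does not say whether the manager enforces these edges, so treat it as a reading aid rather than the implementation:

```python
from typing import Dict, List, Set

# Edges copied from the flowchart; strings stand in for the real Stage enum.
EDGES: Dict[str, Set[str]] = {
    "planned": {"queued", "cancelled"},
    "queued": {"launching", "cancelled"},
    "launching": {"running", "cancelled"},
    "running": {"closeout", "cancelled"},
    "closeout": {"pr_testing"},
    "pr_testing": {"pr_red", "pr_conflicts", "pr_green"},
    "pr_red": {"pr_green"},
    "pr_conflicts": {"pr_green"},
    "pr_green": {"pr_merged"},
    "pr_merged": {"done"},
    "done": set(),
    "cancelled": set(),
}


def reachable(start: str) -> Set[str]:
    """Return every stage that can be reached from `start` by following edges."""
    seen: Set[str] = set()
    stack: List[str] = [start]
    while stack:
        node = stack.pop()
        for nxt in EDGES[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Stages with no outgoing edge are terminal.
terminal_stages = sorted(stage for stage, nxt in EDGES.items() if not nxt)
```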

Stage derivation¶

Stages are derived every tick from observable state — they are never stored as mutable fields. derive_stage_from_state() in stages.py maps the following inputs to the canonical stage:

Input	Description
`is_cancelled`	frontmatter status is a cancelled value
`is_completed`	frontmatter status is a completed value
`queued_at`	non-empty queue row timestamp
`run_state`	latest run row `state` column
`dispatch_state`	computed dispatch liveness (`running`, `worker_gone`, etc.)
`pr_state`	PR state from `pr_metadata` (`OPEN`, `MERGED`, `CLOSED`)
`pr_ci_state`	CI rollup (`success`, `failure`, `pending`)
`pr_mergeable`	mergeability check (`MERGEABLE`, `CLEAN`, `CONFLICTING`, etc.)

Precedence (highest first):

1. cancelled: terminal abandon beats every other signal
2. active worker (starting/running run_state): beats PR and queue state
3. PR state: MERGED folds into DONE (if completed) or PR_MERGED; OPEN dispatches to pr_conflicts/pr_red/pr_green/pr_testing
4. completed (frontmatter): DONE when no PR signal in step 3
5. queued: beats stale closeout from prior runs
6. closeout: any terminal run state with no PR, not queued, not completed
7. planned (default)
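
The precedence above can be sketched as a single function. This is an illustration under stated assumptions, not the real `derive_stage_from_state()`: the signature, the mapping of `starting` to `launching`, the order of checks for an open PR, and treating any other non-empty `run_state` as terminal are all guesses, and `dispatch_state` is left out for brevity.

```python
from enum import Enum


class Stage(str, Enum):
    PLANNED = "planned"
    QUEUED = "queued"
    LAUNCHING = "launching"
    RUNNING = "running"
    CLOSEOUT = "closeout"
    PR_TESTING = "pr_testing"
    PR_RED = "pr_red"
    PR_CONFLICTS = "pr_conflicts"
    PR_GREEN = "pr_green"
    PR_MERGED = "pr_merged"
    DONE = "done"
    CANCELLED = "cancelled"


def derive_stage_sketch(
    *,
    is_cancelled: bool = False,
    is_completed: bool = False,
    queued_at: str = "",
    run_state: str = "",
    pr_state: str = "",
    pr_ci_state: str = "",
    pr_mergeable: str = "",
) -> Stage:
    """Hypothetical re-statement of the documented precedence, highest first."""
    if is_cancelled:                      # 1. cancelled beats everything
        return Stage.CANCELLED
    if run_state == "starting":           # 2. active worker (assumed mapping)
        return Stage.LAUNCHING
    if run_state == "running":
        return Stage.RUNNING
    if pr_state == "MERGED":              # 3. PR state
        return Stage.DONE if is_completed else Stage.PR_MERGED
    if pr_state == "OPEN":
        if pr_mergeable == "CONFLICTING":
            return Stage.PR_CONFLICTS
        if pr_ci_state == "failure":
            return Stage.PR_RED
        if pr_ci_state == "success":
            return Stage.PR_GREEN
        return Stage.PR_TESTING
    if is_completed:                      # 4. completed frontmatter, no PR signal
        return Stage.DONE
    if queued_at:                         # 5. queued beats stale closeout
        return Stage.QUEUED
    if run_state:                         # 6. any remaining run state is treated as terminal
        return Stage.CLOSEOUT
    return Stage.PLANNED                  # 7. default
```

For example, a task with a failed prior run that has been requeued derives as `queued`, not `closeout`, because step 5 is checked before step 6.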

Key files¶

File	What it does
`stages.py`	Stage enum, `derive_stage_from_state`, `observe_task`, STAGES table with hooks
`tick.py`	Per-task tick loop: seed/advance/noop based on declared vs observed stage
`stuck.py`	Stuck machine state, `escalate_to_human`, `needs_human_slugs`
`transitions.py`	Stage transition log (SQLite `stage_transitions` table)
`service.py`	Heartbeat integration: `run_heartbeat_cycle`, `run_manager_loop`

Stage hooks¶

Each stage entry in STAGES (a dict[Stage, StageDef]) can declare:

on_entry — runs once when the tick advances into this stage
ongoing — runs every tick while the task remains in this stage
stuck — list of StuckRule entries with when, unstick, and on_broke
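
A minimal sketch of what such declarations could look like. The field names `on_entry`, `ongoing`, `stuck`, `when`, `unstick`, and `on_broke` come from this page; the `name` and `max_attempts` fields, the dict-based task stand-in, the string keys, and the 600-second idle threshold are assumptions made for illustration:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Task = Dict[str, object]          # stand-in for the real task object
Hook = Callable[[Task], None]


@dataclass
class StuckRule:
    name: str                                  # assumed field, e.g. "idle_too_long"
    when: Callable[[Task], bool]               # predicate: is the task stuck?
    unstick: Optional[Hook] = None             # automatic repair attempt
    on_broke: Optional[Hook] = None            # called once attempts run out
    max_attempts: int = 3                      # assumed default


@dataclass
class StageDef:
    on_entry: Optional[Hook] = None            # runs once on entering the stage
    ongoing: Optional[Hook] = None             # runs every tick while in the stage
    stuck: List[StuckRule] = field(default_factory=list)


calls: List[str] = []                          # records hook activity for the demo

# Keys are plain strings here; the real table is keyed by the Stage enum.
STAGES_SKETCH: Dict[str, StageDef] = {
    "running": StageDef(
        ongoing=lambda task: calls.append(f"liveness:{task['slug']}"),
        stuck=[
            StuckRule(
                name="idle_too_long",
                when=lambda task: task.get("idle_seconds", 0) > 600,
                unstick=lambda task: calls.append(f"nudge:{task['slug']}"),
            )
        ],
    ),
    "pr_merged": StageDef(
        on_entry=lambda task: calls.append(f"completed:{task['slug']}"),
    ),
}
```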

Current hooks:

Stage	Hook type	Function	What it does
RUNNING	`ongoing`	`_running_ongoing`	O(1) PID liveness check via `get_run(task.run_id)`; marks dead workers `lost`
RUNNING	stuck `worker_died`	`_unstick_worker_vanished`	Requeues task when worker vanished and retries remain
RUNNING	stuck `idle_too_long`	`_unstick_running_idle`	Sends nudge message to idle worker
RUNNING	stuck `worker_retries_exhausted`	→ `escalate_to_human`	Escalates when max retries exceeded
PR_MERGED	`on_entry`	`_pr_merged_on_entry`	Marks task `completed` in registry when PR merges

Tick loop¶

tick(tasks) in tick.py runs once per heartbeat cycle and does the following for each task:

1. Compute observed = observe_task(task)
2. Look up declared = current_stage(task.slug) from the transition log
3. If there is no declared stage: seed the task (write the first transition)
4. If declared != observed: advance (write a new transition, fire on_entry for the new stage)
5. If declared == observed: noop (fire the stage's ongoing hook, if it defines one)
6. After entry/ongoing hooks: evaluate stuck rules for the current stage
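
The seed/advance/noop decision can be sketched with an in-memory list standing in for the `stage_transitions` table. The function name `tick_one`, the tuple layout, and passing the observed stage in directly (instead of calling `observe_task`) are assumptions; the sketch also fires no `on_entry` on seed, since this page only mentions writing the first transition there:

```python
from typing import List, Optional, Tuple

# Append-only stand-in for the stage_transitions table: (slug, stage).
transition_log: List[Tuple[str, str]] = []
# Records which hook would fire: (hook, stage, slug).
hook_calls: List[Tuple[str, str, str]] = []


def current_stage(slug: str) -> Optional[str]:
    """Return the most recently logged stage for `slug`, or None if never seen."""
    for logged_slug, stage in reversed(transition_log):
        if logged_slug == slug:
            return stage
    return None


def tick_one(slug: str, observed: str) -> str:
    """Compare declared vs observed stage and return the action taken."""
    declared = current_stage(slug)
    if declared is None:
        transition_log.append((slug, observed))        # seed
        action = "seed"
    elif declared != observed:
        transition_log.append((slug, observed))        # advance
        hook_calls.append(("on_entry", observed, slug))
        action = "advance"
    else:
        hook_calls.append(("ongoing", observed, slug))  # noop
        action = "noop"
    # Stuck rules for the current stage would be evaluated here.
    return action


actions = []
for stage in ("queued", "queued", "running"):
    actions.append(tick_one("demo", stage))
```

Because the log is append-only, the declared stage is always the last row written for the task, which keeps the "never stored as a mutable field" property.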

Stuck machine¶

The stuck machine in stuck.py tracks per-task attempts for each stuck rule:

when(task) fires → increment attempt counter
If attempts < max_attempts: call unstick(task) (Tier 1 repair)
If attempts >= max_attempts: transition to broke, call on_broke(task) (Tier 2 escalation)
escalate_to_human(task) writes a needs_human event; needs_human_slugs() reads it for the operator view
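
A sketch of the attempt counter, following the listed order literally (increment first, then compare). The function names, the `(slug, rule_name)` key, and the in-memory storage are assumptions; whether the real code compares before or after the increment, or resets the counter when a rule stops firing, is not stated on this page:

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Set, Tuple

attempts: DefaultDict[Tuple[str, str], int] = defaultdict(int)
needs_human: Set[str] = set()      # stand-in for persisted needs_human events


def escalate_to_human_sketch(slug: str) -> None:
    """Hypothetical stand-in for escalate_to_human(task)."""
    needs_human.add(slug)


def evaluate_rule(
    slug: str,
    rule_name: str,
    fired: bool,
    max_attempts: int,
    unstick: Callable[[str], None],
    on_broke: Callable[[str], None],
) -> str:
    """Apply one stuck rule for one task and return what happened."""
    if not fired:
        return "quiet"
    key = (slug, rule_name)
    attempts[key] += 1
    if attempts[key] < max_attempts:
        unstick(slug)               # Tier 1 repair
        return "unstick"
    on_broke(slug)                  # Tier 2 escalation
    return "broke"


nudges: List[str] = []
outcomes: List[str] = []
for _ in range(3):
    outcomes.append(
        evaluate_rule("demo", "idle_too_long", True, 3, nudges.append, escalate_to_human_sketch)
    )
```

With `max_attempts=3` this ordering gives two repair attempts before escalation on the third firing.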

Who owns transitions¶

Component	Main entrypoints	What it moves
Spec sync	`sync_markdown_task_specs_to_sqlite()`	Markdown spec → SQLite task-spec rows
Queueing	`kickoff_task()`, `queue_task()`	`planned → queued`
Heartbeat	`run_heartbeat_cycle()`	Periodic refresh, classification, dispatch pickup, stage tick
Dispatcher	`run_dispatch()`	`queued/launching → running`, worker follow-up, stale-gate detection on worker exit
Stage tick / stuck rules	`tick.py` hooks (`on_entry`, `ongoing`, stuck rules)	Nudges, retries, stall breaking
Delivery controller	`_refresh_task_runtime_view()` + PR helpers	PR discovery, CI/conflict state, review routing
Incident detector	`process_stale_runs()`	Emits incident row, sets `driver_action=incident_emitted`, delivers to `tasks` agent

Stage-by-stage audit¶

Stage	What it means	Who puts a task here	What normally moves it next
`planned`	Task exists, not queued, no active runtime	Task creation / markdown sync	`kickoff_task()`
`queued`	Queue row exists, no live worker yet	`kickoff_task()`, stage_unstick retry	Heartbeat dispatch pickup
`launching`	Run row started, worker bootstrapping	Dispatcher `start_dispatch_run()`	Worker heartbeat advances to `running`
`running`	Active worker run exists	Dispatcher	Worker exit, stuck-rule nudge/retry, dead-worker detection
`closeout`	Worker finished, no open PR, task not completed	Run terminal state	PR creation, task completion, operator action
`pr_testing`	Open PR, CI not yet determined	PR discovery	CI result arrives
`pr_red`	Open PR, CI failing	CI failure	Fix pushed, CI passes
`pr_conflicts`	Open PR, merge conflicts	Conflict detected	Rebase / conflict resolution
`pr_green`	Open PR, CI passing, mergeable	CI + mergeability pass	PR merge
`pr_merged`	PR merged, task not yet completed	Merge event	`_pr_merged_on_entry` marks completed
`done`	Terminal success	`_pr_merged_on_entry`, or completed frontmatter	None
`cancelled`	Terminal abandoned	Cancellation intent	None

Happy path¶

sequenceDiagram
    participant User
    participant Queue as Queue API
    participant Heartbeat
    participant Dispatch
    participant Delivery

    User->>Queue: kickoff_task() / queue_task()
    Heartbeat->>Heartbeat: sync_markdown_task_specs_to_sqlite()
    Heartbeat->>Dispatch: launch_dispatch()
    Dispatch->>Dispatch: start_dispatch_run()
    Dispatch->>Dispatch: run worker + liveness heartbeat
    Dispatch->>Dispatch: finish_dispatch_run()
    Delivery->>Delivery: persist PR metadata / CI / mergeability
    Heartbeat->>Heartbeat: tick() runs _pr_merged_on_entry, marks task completed

Normal contributor expectation:

1. Task is queued.
2. Heartbeat picks it up and dispatches.
3. Dispatch creates the worktree and registers a run.
4. Worker runs until closeout.
5. Closeout creates a PR.
6. Delivery tracks CI and mergeability while the PR is open.
7. PR merge triggers _pr_merged_on_entry, which marks the task completed.

Repair paths¶

Problem	Where it is noticed	Automatic action
Dead worker PID	`reconcile_live_runs()` during heartbeat	Mark latest active run `lost`, surface as `worker_gone`
Idle worker	Stage/stuck rules in `tick.py`	Queue a nudge message
`failed` / `interrupted` / `worker_gone` with no PR and retries left	Stage/stuck rules in `tick.py`	Requeue task
Stale owned run idle past threshold	`process_stale_runs()` in heartbeat	Emit incident row, set `driver_action=incident_emitted`, deliver packet to `tasks` agent
PR CI failing	Delivery controller	Post repair instructions / surface attention
PR conflicts	Delivery controller	Post repair instructions / surface attention
PR merged	`reconcile_merged_tasks()`	Mark task completed in SQLite
Terminal task still has active run or queue row	`reconcile_terminal_runs_to_sqlite_runtime()`	Close stale runtime rows

UI buckets are not lifecycle stages¶

The homepage does not show raw stages. It groups tasks into display buckets via classify_tasks().

UI bucket	Meaning
`Ready`	Task is queued, unregistered, non-terminal, no dependency blockers
`Waiting on dependency`	Task is queued but blockers still exist
`Active`	Task is running/launching, or in a healthy open-PR delivery state
`Needs attention`	A repair/anomaly signal exists
`Needs human`	Stuck machine escalated to `needs_human`
`Other`	Everything else
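
To make the "projection, not state" point concrete, here is a hypothetical bucket function. It is not `classify_tasks()`: the parameter names, the precedence between buckets, and the choice of `pr_testing` and `pr_green` as the healthy open-PR stages are all assumptions, and the `unregistered` condition for `Ready` is not modeled.

```python
from typing import Sequence

# Assumption: stages counted as "Active" when no attention signal is present.
ACTIVE_STAGES = {"launching", "running", "pr_testing", "pr_green"}


def bucket_for(
    stage: str,
    *,
    has_blockers: bool = False,
    attention_signals: Sequence[str] = (),
    needs_human: bool = False,
) -> str:
    """Project a derived stage plus signals onto a display bucket."""
    if needs_human:
        return "Needs human"
    if attention_signals:
        return "Needs attention"
    if stage == "queued":
        return "Waiting on dependency" if has_blockers else "Ready"
    if stage in ACTIVE_STAGES:
        return "Active"
    return "Other"
```

The same stage can land in different buckets depending on signals, which is why a bucket should never be read back as the task's stage.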

Practical debugging order¶

When a task looks wrong, check in this order:

1. Is the task spec present and synced into SQLite?
2. Is there a queue row?
3. Is there a latest run row, and what is its state?
4. Does the run still have a live PID?
5. Is there persisted PR metadata?
6. What stage does observe_task() derive?
7. What attention signals did classify_tasks() add?

That sequence mirrors how the manager itself decides what the task is.
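
For the live-PID check, a generic POSIX probe is to send signal 0, which performs the existence and permission checks without delivering a signal. This is a standard technique shown for manual debugging; it is not necessarily how `_running_ongoing` or `reconcile_live_runs()` implement the check, and the helper name `pid_alive` is hypothetical:

```python
import os


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only).

    Signal 0 is never delivered; the call only checks existence and permissions.
    Do not use this on Windows, where os.kill has different semantics.
    """
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False      # no such process
    except PermissionError:
        return True       # process exists but belongs to another user
    return True
```

Note that a zombie process still counts as existing, and PIDs can be reused, so a `True` result only says that some process currently holds that PID.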

Related docs¶

ai_notes/core/TASK_MANAGER_MINIMAL_ARCHITECTURE.md
ai_notes/core/task-manager-layers.md
libs/tasks/tasks/tools/cli-reference.md
