Task Manager Lifecycle

This page is the canonical contributor map for how a task moves through the task manager in libs/tasks/.

If you are debugging why a task is not moving, keep three layers separate:

  1. Task spec: markdown under tasks/workstreams/*/tasks/*.md
  2. Runtime truth: SQLite rows in the shared task-manager control plane
  3. UI grouping: derived buckets such as Ready or Needs attention

Most confusion comes from mixing those layers.

Core rule

The heartbeat reads markdown once at the top of a cycle with sync_markdown_task_specs_to_sqlite(). After that, runtime surfaces should read from SQLite only.

That means:

  • task markdown defines spec-owned fields such as title, owner, dependencies, milestone, and worksheet content
  • SQLite defines runtime-owned fields such as queue state, runs, PR state, delivery state, and operator classification
  • UI buckets are projections, not authoritative state
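
The ownership split above can be illustrated with a minimal sketch. It is not the real sync code: `apply_spec_sync` and the field names are hypothetical stand-ins taken from the bullets above. The only point it makes is that a spec sync refreshes spec-owned fields and leaves runtime-owned fields alone.

```python
# Hypothetical field names based on the bullets above; the real schema may differ.
SPEC_OWNED = {"title", "owner", "dependencies", "milestone", "worksheet"}


def apply_spec_sync(row: dict, spec: dict) -> dict:
    """Return a copy of `row` with only spec-owned fields refreshed from markdown."""
    updated = dict(row)
    for key in SPEC_OWNED:
        if key in spec:
            updated[key] = spec[key]
    return updated


row = {"title": "Old title", "owner": "alice", "queue_state": "queued"}
# A stray runtime key in the markdown spec is ignored by the sync.
spec = {"title": "New title", "owner": "alice", "queue_state": "planned"}
synced = apply_spec_sync(row, spec)
```

In this sketch `synced["title"]` follows the markdown, while `synced["queue_state"]` keeps the SQLite value.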

Who owns transitions

| Component | Main entrypoints | What it is allowed to move |
|---|---|---|
| Spec sync | sync_markdown_task_specs_to_sqlite() | Markdown spec into SQLite task-spec rows |
| Queueing | kickoff_task(), queue_task() | planned -> queued |
| Heartbeat | run_heartbeat_cycle(), build_manager_state() | Periodic refresh, reconciliation, classification, dispatch pickup |
| Dispatcher | run_dispatch() | queued/dispatching -> running, worker follow-up loops, stale-gate detection on worker exit |
| Run finalizer | finish_dispatch_run() | running -> exited/failed/interrupted/closeout_* |
| Delivery controller | _refresh_task_runtime_view() and PR repair helpers | PR discovery, CI/conflict state, review routing |
| Stage/stuck rules | tick.py hooks (on_entry, ongoing, stuck rules) | Nudges, retries, stall breaking (one recovery engine, runs in the fast cycle) |
| Reconciler | reconcile_live_runs(), reconcile_merged_tasks(), reconcile_terminal_runs_to_sqlite_runtime() | Dead-worker cleanup, merged-task completion, terminal drift cleanup |
| Incident detector | process_stale_runs() in incident.py | Emits durable incident row, sets driver_action=incident_emitted, delivers to tasks tmux agent |
| Tasks agent | tasks tmux session (operator or manager-spawned) | Judgment-based repair: nudge, restart, escalate to user |

Stage model

The canonical stage enum lives in libs/tasks/tasks/manager/stages.py. There are 12 stages:

flowchart TD
    planned["planned"]
    queued["queued"]
    launching["launching"]
    running["running"]
    closeout["closeout"]
    pr_testing["pr_testing"]
    pr_red["pr_red"]
    pr_conflicts["pr_conflicts"]
    pr_green["pr_green"]
    pr_merged["pr_merged"]
    done["done"]
    cancelled["cancelled"]

    planned --> queued
    queued --> launching
    launching --> running
    running --> closeout
    closeout --> pr_testing
    pr_testing --> pr_red
    pr_testing --> pr_conflicts
    pr_testing --> pr_green
    pr_red --> pr_green
    pr_conflicts --> pr_green
    pr_green --> pr_merged
    pr_merged --> done
    planned --> cancelled
    queued --> cancelled
    launching --> cancelled
    running --> cancelled
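
As an illustration only, the twelve stages and the edges drawn in the flowchart above can be written down as data. This is a sketch, not the contents of stages.py: the `Stage` member names, `CHARTED_EDGES`, and `is_charted_edge` are assumptions made for the example. It encodes only the edges in the diagram and does not claim that other jumps are impossible.

```python
from enum import Enum


class Stage(str, Enum):
    PLANNED = "planned"
    QUEUED = "queued"
    LAUNCHING = "launching"
    RUNNING = "running"
    CLOSEOUT = "closeout"
    PR_TESTING = "pr_testing"
    PR_RED = "pr_red"
    PR_CONFLICTS = "pr_conflicts"
    PR_GREEN = "pr_green"
    PR_MERGED = "pr_merged"
    DONE = "done"
    CANCELLED = "cancelled"


# Edges copied one-for-one from the flowchart; nothing is added.
CHARTED_EDGES = {
    Stage.PLANNED: {Stage.QUEUED, Stage.CANCELLED},
    Stage.QUEUED: {Stage.LAUNCHING, Stage.CANCELLED},
    Stage.LAUNCHING: {Stage.RUNNING, Stage.CANCELLED},
    Stage.RUNNING: {Stage.CLOSEOUT, Stage.CANCELLED},
    Stage.CLOSEOUT: {Stage.PR_TESTING},
    Stage.PR_TESTING: {Stage.PR_RED, Stage.PR_CONFLICTS, Stage.PR_GREEN},
    Stage.PR_RED: {Stage.PR_GREEN},
    Stage.PR_CONFLICTS: {Stage.PR_GREEN},
    Stage.PR_GREEN: {Stage.PR_MERGED},
    Stage.PR_MERGED: {Stage.DONE},
    Stage.DONE: set(),
    Stage.CANCELLED: set(),
}


def is_charted_edge(src: Stage, dst: Stage) -> bool:
    """True when the flowchart draws an arrow from src to dst."""
    return dst in CHARTED_EDGES[src]
```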

Stage derivation

Stages are derived every tick from observable state — they are never stored as mutable fields. derive_stage_from_state() in stages.py maps the following inputs to the canonical stage:

| Input | Description |
|---|---|
| is_cancelled | frontmatter status is a cancelled value |
| is_completed | frontmatter status is a completed value |
| queued_at | non-empty queue row timestamp |
| run_state | latest run row state column |
| dispatch_state | computed dispatch liveness (running, worker_gone, etc.) |
| pr_state | PR state from pr_metadata (OPEN, MERGED, CLOSED) |
| pr_ci_state | CI rollup (success, failure, pending) |
| pr_mergeable | mergeability check (MERGEABLE, CLEAN, CONFLICTING, etc.) |

Precedence (highest first):

  1. cancelled — terminal abandon beats every other signal
  2. active worker (starting/running run_state) — beats PR and queue state
  3. PR state: MERGED folds into DONE (if completed) or PR_MERGED; OPEN dispatches to pr_conflicts/pr_red/pr_green/pr_testing
  4. completed (frontmatter) — DONE when no PR signal in step 3
  5. queued — beats stale closeout from prior runs
  6. closeout — any terminal run state with no PR, not queued, not completed
  7. planned (default)
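
The precedence list above can be read as a chain of early returns. The following is a minimal sketch under stated assumptions, not the actual derive_stage_from_state(): the name `derive_stage_sketch` is hypothetical, `dispatch_state` is left out, "starting" is assumed to map to launching, and an open PR with passing CI is treated as green unless mergeability reports CONFLICTING.

```python
from enum import Enum


class Stage(str, Enum):
    PLANNED = "planned"
    QUEUED = "queued"
    LAUNCHING = "launching"
    RUNNING = "running"
    CLOSEOUT = "closeout"
    PR_TESTING = "pr_testing"
    PR_RED = "pr_red"
    PR_CONFLICTS = "pr_conflicts"
    PR_GREEN = "pr_green"
    PR_MERGED = "pr_merged"
    DONE = "done"
    CANCELLED = "cancelled"


def derive_stage_sketch(*, is_cancelled=False, is_completed=False, queued_at="",
                        run_state="", pr_state="", pr_ci_state="",
                        pr_mergeable="") -> Stage:
    # 1. cancelled beats every other signal
    if is_cancelled:
        return Stage.CANCELLED
    # 2. an active worker beats PR and queue state
    if run_state == "starting":
        return Stage.LAUNCHING  # assumption: "starting" corresponds to launching
    if run_state == "running":
        return Stage.RUNNING
    # 3. PR state
    if pr_state == "MERGED":
        return Stage.DONE if is_completed else Stage.PR_MERGED
    if pr_state == "OPEN":
        if pr_mergeable == "CONFLICTING":
            return Stage.PR_CONFLICTS
        if pr_ci_state == "failure":
            return Stage.PR_RED
        if pr_ci_state == "success":
            return Stage.PR_GREEN
        return Stage.PR_TESTING
    # 4. completed frontmatter with no PR signal
    if is_completed:
        return Stage.DONE
    # 5. queued beats stale closeout from prior runs
    if queued_at:
        return Stage.QUEUED
    # 6. any run state left at this point is not active, so treat it as terminal
    if run_state:
        return Stage.CLOSEOUT
    # 7. default
    return Stage.PLANNED
```

For example, a task with a failed prior run and a fresh queue row resolves to queued rather than closeout, because step 5 is checked before step 6.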

Key files

| File | What it does |
|---|---|
| stages.py | Stage enum, derive_stage_from_state, observe_task, STAGES table with hooks |
| tick.py | Per-task tick loop: seed/advance/noop based on declared vs observed stage |
| stuck.py | Stuck machine state, escalate_to_human, needs_human_slugs |
| transitions.py | Stage transition log (SQLite stage_transitions table) |
| service.py | Heartbeat integration: run_heartbeat_cycle, run_manager_loop |

Stage hooks

Each stage entry in STAGES (a dict[Stage, StageDef]) can declare:

  • on_entry — runs once when the tick advances into this stage
  • ongoing — runs every tick while the task remains in this stage
  • stuck — list of StuckRule entries with when, unstick, and on_broke
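
A rough picture of what such a table could look like is sketched below. The field names on_entry, ongoing, stuck, when, unstick, and on_broke come from the list above. Everything else is an assumption for illustration: the dict-based `Task` stand-in, string stage keys instead of the Stage enum, the `name` and `max_attempts` fields, and the 600-second idle threshold.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

Task = dict[str, Any]          # stand-in for the real task object
Hook = Callable[[Task], None]


@dataclass
class StuckRule:
    name: str
    when: Callable[[Task], bool]   # does the rule apply right now?
    unstick: Hook                  # automatic repair attempt
    on_broke: Hook                 # called once repairs are exhausted
    max_attempts: int = 3          # assumption: default chosen for the example


@dataclass
class StageDef:
    on_entry: Optional[Hook] = None
    ongoing: Optional[Hook] = None
    stuck: list[StuckRule] = field(default_factory=list)


log: list[tuple[str, str]] = []


def nudge(task: Task) -> None:
    log.append(("nudge", task["slug"]))


def escalate(task: Task) -> None:
    log.append(("escalate", task["slug"]))


# String keys keep the sketch short; the source describes dict[Stage, StageDef].
STAGES = {
    "planned": StageDef(),
    "running": StageDef(
        ongoing=lambda task: log.append(("liveness_check", task["slug"])),
        stuck=[
            StuckRule(
                name="idle_too_long",
                when=lambda task: task.get("idle_seconds", 0) > 600,
                unstick=nudge,
                on_broke=escalate,
            )
        ],
    ),
}
```

A stage with no hooks is simply an empty `StageDef()`, so the tick loop can treat every stage uniformly.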

Current hooks:

| Stage | Hook type | Function | What it does |
|---|---|---|---|
| RUNNING | ongoing | _running_ongoing | O(1) PID liveness check via get_run(task.run_id); marks dead workers lost |
| RUNNING | stuck worker_died | _unstick_worker_vanished | Requeues task when worker vanished and retries remain |
| RUNNING | stuck idle_too_long | _unstick_running_idle | Sends nudge message to idle worker |
| RUNNING | stuck worker_retries_exhausted | escalate_to_human | Escalates when max retries exceeded |
| PR_MERGED | on_entry | _pr_merged_on_entry | Marks task completed in registry when PR merges |

Tick loop

tick(tasks) in tick.py runs once per heartbeat cycle. For each task:

  1. Compute observed = observe_task(task)
  2. Look up declared = current_stage(task.slug) from the transition log
  3. If no declared stage: seed the task (write first transition)
  4. If declared != observed: advance (write new transition, fire on_entry for new stage)
  5. If declared == observed: noop (fire the stage's ongoing hook, if it defines one)
  6. After entry/ongoing hooks: evaluate stuck rules for the current stage
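
The seed/advance/noop decision in steps 1 to 5 can be sketched as follows. This is a simplified illustration, not the code in tick.py: `tick_one` is a hypothetical name, the transition log is an in-memory dict instead of the SQLite stage_transitions table, stages are plain strings, and stuck-rule evaluation (step 6) is left out.

```python
from typing import Callable, Optional

HookTable = dict[str, Callable[[str], None]]


def tick_one(slug: str, observed: str, transition_log: dict[str, list[str]],
             on_entry: Optional[HookTable] = None,
             ongoing: Optional[HookTable] = None) -> str:
    """Process one task for one tick and return 'seed', 'advance', or 'noop'."""
    history = transition_log.get(slug)
    declared = history[-1] if history else None

    if declared is None:
        # No declared stage yet: write the first transition.
        transition_log.setdefault(slug, []).append(observed)
        return "seed"

    if declared != observed:
        # Stage changed: record it and fire on_entry for the new stage.
        transition_log[slug].append(observed)
        hook = (on_entry or {}).get(observed)
        if hook:
            hook(slug)
        return "advance"

    # Stage unchanged: fire the ongoing hook if the stage has one.
    hook = (ongoing or {}).get(observed)
    if hook:
        hook(slug)
    return "noop"


transitions: dict[str, list[str]] = {}
fired: list[tuple[str, str]] = []
entry_hooks = {"pr_merged": lambda slug: fired.append(("on_entry", slug))}
ongoing_hooks = {"running": lambda slug: fired.append(("ongoing", slug))}

results = [
    tick_one("demo", "running", transitions, entry_hooks, ongoing_hooks),
    tick_one("demo", "running", transitions, entry_hooks, ongoing_hooks),
    tick_one("demo", "pr_merged", transitions, entry_hooks, ongoing_hooks),
]
```

Whether seeding also fires on_entry is not stated in the steps above, so the sketch does not fire it.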

Stuck machine

The stuck machine in stuck.py tracks per-task attempts for each stuck rule:

  • when(task) fires → increment attempt counter
  • If attempts < max_attempts: call unstick(task) (Tier 1 repair)
  • If attempts >= max_attempts: transition to broke, call on_broke(task) (Tier 2 escalation)
  • escalate_to_human(task) writes a needs_human event; needs_human_slugs() reads it for the operator view
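
The attempt-counting behavior can be sketched with a small in-memory tracker. `StuckTracker` and its method names are hypothetical, and the real state in stuck.py is presumably persisted rather than held in memory. The sketch reads the bullets literally: the counter is incremented first, then compared against max_attempts.

```python
from collections import defaultdict
from typing import Callable


class StuckTracker:
    def __init__(self, max_attempts: int) -> None:
        self.max_attempts = max_attempts
        self.attempts: dict[tuple[str, str], int] = defaultdict(int)
        self.broke: set[tuple[str, str]] = set()

    def evaluate(self, slug: str, rule_name: str, when: Callable[[], bool],
                 unstick: Callable[[], None], on_broke: Callable[[], None]) -> str:
        key = (slug, rule_name)
        if key in self.broke or not when():
            return "idle"
        self.attempts[key] += 1
        if self.attempts[key] < self.max_attempts:
            unstick()           # Tier 1 repair
            return "unstick"
        self.broke.add(key)     # Tier 2 escalation, fired once
        on_broke()
        return "broke"


events: list[str] = []
tracker = StuckTracker(max_attempts=3)
outcomes = [
    tracker.evaluate(
        "demo", "idle_too_long",
        when=lambda: True,
        unstick=lambda: events.append("nudge"),
        on_broke=lambda: events.append("needs_human"),
    )
    for _ in range(4)
]
```

With max_attempts=3 under this reading, the rule gets two automatic repairs before the third firing escalates, and later ticks do nothing further for that rule.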

Who owns transitions

| Component | Main entrypoints | What it moves |
|---|---|---|
| Spec sync | sync_markdown_task_specs_to_sqlite() | Markdown spec → SQLite task-spec rows |
| Queueing | kickoff_task(), queue_task() | planned → queued |
| Heartbeat | run_heartbeat_cycle() | Periodic refresh, classification, dispatch pickup, stage tick |
| Dispatcher | run_dispatch() | queued/launching → running, worker follow-up, stale-gate detection on worker exit |
| Stage tick | tick.py hooks | on_entry, ongoing, stuck rules |
| Delivery controller | _refresh_task_runtime_view() + PR helpers | PR discovery, CI/conflict state, review routing |
| Stage/stuck rules | tick.py hooks | Nudges, retries, stall breaking |
| Incident detector | process_stale_runs() | Emits incident row, sets driver_action=incident_emitted, delivers to tasks agent |

Stage-by-stage audit

| Stage | What it means | Who puts a task here | What normally moves it next |
|---|---|---|---|
| planned | Task exists, not queued, no active runtime | Task creation / markdown sync | kickoff_task() |
| queued | Queue row exists, no live worker yet | kickoff_task(), stage_unstick retry | Heartbeat dispatch pickup |
| launching | Run row started, worker bootstrapping | Dispatcher start_dispatch_run() | Worker heartbeat advances to running |
| running | Active worker run exists | Dispatcher | Worker exit, stuck-rule nudge/retry, dead-worker detection |
| closeout | Worker finished, no open PR, task not completed | Run terminal state | PR creation, task completion, operator action |
| pr_testing | Open PR, CI not yet determined | PR discovery | CI result arrives |
| pr_red | Open PR, CI failing | CI failure | Fix pushed, CI passes |
| pr_conflicts | Open PR, merge conflicts | Conflict detected | Rebase / conflict resolution |
| pr_green | Open PR, CI passing, mergeable | CI + mergeability pass | PR merge |
| pr_merged | PR merged, task not yet completed | Merge event | _pr_merged_on_entry marks completed |
| done | Terminal success | _pr_merged_on_entry, or completed frontmatter | None |
| cancelled | Terminal abandoned | Cancellation intent | None |

Happy path

sequenceDiagram
    participant User
    participant Queue as Queue API
    participant Heartbeat
    participant Dispatch
    participant Delivery

    User->>Queue: kickoff_task() / queue_task()
    Heartbeat->>Heartbeat: sync_markdown_task_specs_to_sqlite()
    Heartbeat->>Dispatch: launch_dispatch()
    Dispatch->>Dispatch: start_dispatch_run()
    Dispatch->>Dispatch: run worker + liveness heartbeat
    Dispatch->>Dispatch: finish_dispatch_run()
    Delivery->>Delivery: persist PR metadata / CI / mergeability
    Heartbeat->>Heartbeat: tick() _pr_merged_on_entry marks completed

Normal contributor expectation:

  1. Task is queued.
  2. Heartbeat picks it up and dispatches.
  3. Dispatch creates the worktree and registers a run.
  4. Worker runs until closeout.
  5. Closeout creates a PR.
  6. Delivery tracks CI and mergeability while PR is open.
  7. PR merge triggers _pr_merged_on_entry, which marks the task completed.

Repair paths

| Problem | Where it is noticed | Automatic action |
|---|---|---|
| Dead worker PID | reconcile_live_runs() during heartbeat | Mark latest active run lost, surface as worker_gone |
| Idle worker | Stage/stuck rules in tick.py | Queue a nudge message |
| failed / interrupted / worker_gone with no PR and retries left | Stage/stuck rules in tick.py | Requeue task |
| Stale owned run idle past threshold | process_stale_runs() in heartbeat | Emit incident row, set driver_action=incident_emitted, deliver packet to tasks agent |
| PR CI failing | Delivery controller | Post repair instructions / surface attention |
| PR conflicts | Delivery controller | Post repair instructions / surface attention |
| PR merged | reconcile_merged_tasks() | Mark task completed in SQLite |
| Terminal task still has active run or queue row | reconcile_terminal_runs_to_sqlite_runtime() | Close stale runtime rows |

UI buckets are not lifecycle stages

The homepage does not show raw stages. It groups tasks into display buckets via classify_tasks().

| UI bucket | Meaning |
|---|---|
| Ready | Task is queued, unregistered, non-terminal, no dependency blockers |
| Waiting on dependency | Task is queued but blockers still exist |
| Active | Task is running/launching, or in a healthy open-PR delivery state |
| Needs attention | A repair/anomaly signal exists |
| Needs human | Stuck machine escalated to needs_human |
| Other | Everything else |
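
To make the "projection, not state" point concrete, here is a rough sketch of how a bucket could be computed from a derived stage plus a few signals. It is not classify_tasks(): `bucket_for` is a hypothetical helper, the check order is an assumption, "unregistered" is read as "no run registered yet", and pr_testing and pr_green are assumed to be the healthy open-PR states.

```python
from typing import Sequence

# Assumption: which open-PR stages count as healthy delivery states.
HEALTHY_PR_STAGES = {"pr_testing", "pr_green"}


def bucket_for(stage: str, *, has_blockers: bool = False, has_run: bool = False,
               attention_signals: Sequence[str] = (),
               needs_human: bool = False) -> str:
    """Map a derived stage and its signals to a display bucket (sketch only)."""
    if needs_human:
        return "Needs human"
    if attention_signals:
        return "Needs attention"
    if stage == "queued":
        if has_blockers:
            return "Waiting on dependency"
        if not has_run:
            return "Ready"
    if stage in {"launching", "running"} or stage in HEALTHY_PR_STAGES:
        return "Active"
    return "Other"
```

The same stage can land in different buckets depending on its signals, which is why a bucket should never be read back as lifecycle state.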

Practical debugging order

When a task looks wrong, check in this order:

  1. Is the task spec present and synced into SQLite?
  2. Is there a queue row?
  3. Is there a latest run row, and what is its state?
  4. Does the run still have a live PID?
  5. Is there persisted PR metadata?
  6. What stage does observe_task() derive?
  7. What attention signals did classify_tasks() add?

That sequence mirrors how the manager itself decides what the task is.

Related docs

  • ai_notes/core/TASK_MANAGER_MINIMAL_ARCHITECTURE.md
  • ai_notes/core/task-manager-layers.md
  • libs/tasks/tasks/tools/cli-reference.md