
An open-source spec for orchestration: Symphony

# Symphony Service Specification

Status: Draft v1 (language-agnostic)

Purpose: Define a service that orchestrates coding agents to get project work done.

## 1. Problem Statement

Symphony is a long-running automation service that continuously reads work from an issue tracker
(Linear in this specification version), creates an isolated workspace for each issue, and runs a
coding agent session for that issue inside the workspace.

The service solves four operational problems:

- It turns issue execution into a repeatable daemon workflow instead of manual scripts.
- It isolates agent execution in per-issue workspaces so agent commands run only inside per-issue
  workspace directories.
- It keeps the workflow policy in-repo (`WORKFLOW.md`) so teams version the agent prompt and runtime
  settings with their code.
- It provides enough observability to operate and debug multiple concurrent agent runs.

Implementations are expected to document their trust and safety posture explicitly. This
specification does not require a single approval, sandbox, or operator-confirmation policy; some
implementations may target trusted environments with a high-trust configuration, while others may
require stricter approvals or sandboxing.

Important boundary:

- Symphony is a scheduler/runner and tracker reader.
- Ticket writes (state transitions, comments, PR links) are typically performed by the coding agent
  using tools available in the workflow/runtime environment.
- A successful run may end at a workflow-defined handoff state (for example `Human Review`), not
  necessarily `Done`.
## 2. Goals and Non-Goals

### 2.1 Goals

- Poll the issue tracker on a fixed cadence and dispatch work with bounded concurrency.
- Maintain a single authoritative orchestrator state for dispatch, retries, and reconciliation.
- Create deterministic per-issue workspaces and preserve them across runs.
- Stop active runs when issue state changes make them ineligible.
- Recover from transient failures with exponential backoff.
- Load runtime behavior from a repository-owned `WORKFLOW.md` contract.
- Expose operator-visible observability (at minimum structured logs).
- Support restart recovery without requiring a persistent database.

### 2.2 Non-Goals

- Rich web UI or multi-tenant control plane.
- Prescribing a specific dashboard or terminal UI implementation.
- General-purpose workflow engine or distributed job scheduler.
- Built-in business logic for how to edit tickets, PRs, or comments. (That logic lives in the
  workflow prompt and agent tooling.)
- Mandating strong sandbox controls beyond what the coding agent and host OS provide.
- Mandating a single default approval, sandbox, or operator-confirmation posture for all
  implementations.

## 3. System Overview

### 3.1 Main Components

1. `Workflow Loader`
   – Reads `WORKFLOW.md`.
   – Parses YAML front matter and prompt body.
   – Returns `{config, prompt_template}`.
2. `Config Layer`
   – Exposes typed getters for workflow config values.
   – Applies defaults and environment variable indirection.
   – Performs validation used by the orchestrator before dispatch.
3. `Issue Tracker Client`
   – Fetches candidate issues in active states.
   – Fetches current states for specific issue IDs (reconciliation).
   – Fetches terminal-state issues during startup cleanup.
   – Normalizes tracker payloads into a stable issue model.
4. `Orchestrator`
   – Owns the poll tick.
   – Owns the in-memory runtime state.
   – Decides which issues to dispatch, retry, stop, or release.
   – Tracks session metrics and retry queue state.
5. `Workspace Manager`
   – Maps issue identifiers to workspace paths.
   – Ensures per-issue workspace directories exist.
   – Runs workspace lifecycle hooks.
   – Cleans workspaces for terminal issues.
6. `Agent Runner`
   – Creates the workspace.
   – Builds the prompt from issue + workflow template.
   – Launches the coding agent app-server client.
   – Streams agent updates back to the orchestrator.
7. `Status Surface` (optional)
   – Presents human-readable runtime status (for example terminal output, dashboard, or other
     operator-facing view).
8. `Logging`
   – Emits structured runtime logs to one or more configured sinks.

### 3.2 Abstraction Levels

Symphony is easiest to port when kept in these layers:

1. `Policy Layer` (repo-defined)
   – `WORKFLOW.md` prompt body.
   – Team-specific rules for ticket handling, validation, and handoff.
2. `Configuration Layer` (typed getters)
   – Parses front matter into typed runtime settings.
   – Handles defaults, environment tokens, and path normalization.
3. `Coordination Layer` (orchestrator)
   – Polling loop, issue eligibility, concurrency, retries, reconciliation.
4. `Execution Layer` (workspace + agent subprocess)
   – Filesystem lifecycle, workspace preparation, coding-agent protocol.
5. `Integration Layer` (Linear adapter)
   – API calls and normalization for tracker data.
6. `Observability Layer` (logs + optional status surface)
   – Operator visibility into orchestrator and agent behavior.

### 3.3 External Dependencies

- Issue tracker API (Linear for `tracker.kind: linear` in this specification version).
- Local filesystem for workspaces and logs.
- Optional workspace population tooling (for example Git CLI, if used).
- Coding-agent executable that supports JSON-RPC-like app-server mode over stdio.
- Host environment authentication for the issue tracker and coding agent.

## 4. Core Domain Model

### 4.1 Entities

#### 4.1.1 Issue

Normalized issue record used by orchestration, prompt rendering, and observability output.

Fields:

- `id` (string)
  – Stable tracker-internal ID.
- `identifier` (string)
  – Human-readable ticket key (example: `ABC-123`).
- `title` (string)
- `description` (string or null)
- `priority` (integer or null)
  – Lower numbers are higher priority in dispatch sorting.
- `state` (string)
  – Current tracker state name.
- `branch_name` (string or null)
  – Tracker-provided branch metadata if available.
- `url` (string or null)
- `labels` (list of strings)
  – Normalized to lowercase.
- `blocked_by` (list of blocker refs)
  – Each blocker ref contains:
    – `id` (string or null)
    – `identifier` (string or null)
    – `state` (string or null)
- `created_at` (timestamp or null)
- `updated_at` (timestamp or null)

#### 4.1.2 Workflow Definition

Parsed `WORKFLOW.md` payload:

- `config` (map)
  – YAML front matter root object.
- `prompt_template` (string)
  – Markdown body after front matter, trimmed.

#### 4.1.3 Service Config (Typed View)

Typed runtime values derived from `WorkflowDefinition.config` plus environment resolution.

Examples:

- poll interval
- workspace root
- active and terminal issue states
- concurrency limits
- coding-agent executable/args/timeouts
- workspace hooks
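The normalized issue record in Section 4.1.1 maps directly onto a plain data structure. A non-normative Python sketch (class names `Issue` and `BlockerRef` are illustrative, not mandated by the spec):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class BlockerRef:
    # All blocker-ref fields are best-effort and may be absent.
    id: Optional[str] = None
    identifier: Optional[str] = None
    state: Optional[str] = None

@dataclass
class Issue:
    id: str                          # stable tracker-internal ID
    identifier: str                  # human-readable key, e.g. "ABC-123"
    title: str
    state: str                       # current tracker state name
    description: Optional[str] = None
    priority: Optional[int] = None   # lower = higher dispatch priority
    branch_name: Optional[str] = None
    url: Optional[str] = None
    labels: list[str] = field(default_factory=list)        # lowercased
    blocked_by: list[BlockerRef] = field(default_factory=list)
    created_at: Optional[str] = None  # ISO-8601 timestamp or null
    updated_at: Optional[str] = None
```

Keeping this model free of tracker-specific types is what lets the integration layer (Section 11) stay swappable.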
#### 4.1.4 Workspace

Filesystem workspace assigned to one issue identifier.

Fields (logical):

- `path` (workspace path; current runtime typically uses absolute paths, but relative roots are
  possible if configured without path separators)
- `workspace_key` (sanitized issue identifier)
- `created_now` (boolean, used to gate the `after_create` hook)

#### 4.1.5 Run Attempt

One execution attempt for one issue.

Fields (logical):

- `issue_id`
- `issue_identifier`
- `attempt` (integer or null; `null` for first run, `>= 1` for retries/continuation)
- `workspace_path`
- `started_at`
- `status`
- `error` (optional)

#### 4.1.6 Live Session (Agent Session Metadata)

State tracked while a coding-agent subprocess is running.

Fields:

- `session_id` (string; `thread_id` and `turn_id` joined with `-`)
- `thread_id` (string)
- `turn_id` (string)
- `codex_app_server_pid` (string or null)
- `last_codex_event` (string/enum or null)
- `last_codex_timestamp` (timestamp or null)
- `last_codex_message` (summarized payload)
- `codex_input_tokens` (integer)
- `codex_output_tokens` (integer)
- `codex_total_tokens` (integer)
- `last_reported_input_tokens` (integer)
- `last_reported_output_tokens` (integer)
- `last_reported_total_tokens` (integer)
- `turn_count` (integer)
  – Number of coding-agent turns started within the current worker lifetime.

#### 4.1.7 Retry Entry

Scheduled retry state for an issue.

Fields:

- `issue_id`
- `identifier` (best-effort human ID for status surfaces/logs)
- `attempt` (integer, 1-based for retry queue)
- `due_at_ms` (monotonic clock timestamp)
- `timer_handle` (runtime-specific timer reference)
- `error` (string or null)
#### 4.1.8 Orchestrator Runtime State

Single authoritative in-memory state owned by the orchestrator.

Fields:

- `poll_interval_ms` (current effective poll interval)
- `max_concurrent_agents` (current effective global concurrency limit)
- `running` (map `issue_id -> running entry`)
- `claimed` (set of issue IDs reserved/running/retrying)
- `retry_attempts` (map `issue_id -> RetryEntry`)
- `completed` (set of issue IDs; bookkeeping only, not dispatch gating)
- `codex_totals` (aggregate tokens + runtime seconds)
- `codex_rate_limits` (latest rate-limit snapshot from agent events)

### 4.2 Stable Identifiers and Normalization Rules

- `Issue ID`
  – Use for tracker lookups and internal map keys.
- `Issue Identifier`
  – Use for human-readable logs and workspace naming.
- `Workspace Key`
  – Derive from `issue.identifier` by replacing any character not in `[A-Za-z0-9._-]` with `_`.
  – Use the sanitized value for the workspace directory name.
- `Normalized Issue State`
  – Compare states after `lowercase`.
- `Session ID`
  – Compose from coding-agent `thread_id` and `turn_id`, joined with `-`.
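The normalization rules above are mechanical; a minimal, non-normative sketch:

```python
import re

def workspace_key(identifier: str) -> str:
    # Replace any character outside [A-Za-z0-9._-] with "_".
    return re.sub(r"[^A-Za-z0-9._-]", "_", identifier)

def session_id(thread_id: str, turn_id: str) -> str:
    # Session IDs are the thread and turn IDs joined with "-".
    return f"{thread_id}-{turn_id}"

def normalize_state(state: str) -> str:
    # Tracker state names are compared case-insensitively.
    return state.lower()
```

Because the workspace key is derived deterministically from the identifier, workspaces survive restarts without any orchestrator database.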
## 5. Workflow Specification (Repository Contract)

### 5.1 File Discovery and Path Resolution

Workflow file path precedence:

1. Explicit application/runtime setting (set by CLI startup path).
2. Default: `WORKFLOW.md` in the current process working directory.

Loader behavior:

- If the file cannot be read, return a `missing_workflow_file` error.
- The workflow file is expected to be repository-owned and version-controlled.

### 5.2 File Format

`WORKFLOW.md` is a Markdown file with optional YAML front matter.

Design note:

- `WORKFLOW.md` should be self-contained enough to describe and run different workflows (prompt,
  runtime settings, hooks, and tracker selection/config) without requiring out-of-band
  service-specific configuration.

Parsing rules:

- If the file starts with `---`, parse lines until the next `---` as YAML front matter.
- Remaining lines become the prompt body.
- If front matter is absent, treat the entire file as prompt body and use an empty config map.
- YAML front matter must decode to a map/object; non-map YAML is an error.
- Prompt body is trimmed before use.

Returned workflow object:

- `config`: front matter root object (not nested under a `config` key).
- `prompt_template`: trimmed Markdown body.

### 5.3 Front Matter Schema

Top-level keys:

- `tracker`
- `polling`
- `workspace`
- `hooks`
- `agent`
- `codex`

Unknown keys should be ignored for forward compatibility.

Extensibility:

- The workflow front matter is extensible. Optional extensions may define additional top-level keys
  (for example `server`) without changing the core schema above.
- Extensions should document their field schema, defaults, validation rules, and whether changes
  apply dynamically or require restart.
- Common extension: `server.port` (integer) enables the optional HTTP server described in a later
  section.

#### 5.3.1 `tracker` (object)

Fields:

- `kind` (string)
  – Required for dispatch.
  – Current supported value: `linear`
- `endpoint` (string)
  – Default for `tracker.kind == "linear"`: `https://api.linear.app/graphql`
- `api_key` (string)
  – May be a literal token or `$VAR_NAME`.
  – Canonical environment variable for `tracker.kind == "linear"`: `LINEAR_API_KEY`.
  – If `$VAR_NAME` resolves to an empty string, treat the key as missing.
- `project_slug` (string)
  – Required for dispatch when `tracker.kind == "linear"`.
- `active_states` (list of strings)
  – Default: `Todo`, `In Progress`
- `terminal_states` (list of strings)
  – Default: `Closed`, `Cancelled`, `Canceled`, `Duplicate`, `Done`

#### 5.3.2 `polling` (object)

Fields:

- `interval_ms` (integer or string integer)
  – Default: `30000`
  – Changes should be re-applied at runtime and affect future tick scheduling without restart.

#### 5.3.3 `workspace` (object)

Fields:

- `root` (path string or `$VAR`)
  – Default: `/symphony_workspaces`
  – `~` and strings containing path separators are expanded.
  – Bare strings without path separators are preserved as-is (relative roots are allowed but
    discouraged).
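For concreteness, an illustrative `WORKFLOW.md` front matter using the keys defined so far (values are examples only, not defaults or recommendations; `hooks`, `agent`, and `codex` are covered in the following subsections):

```yaml
---
tracker:
  kind: linear
  api_key: $LINEAR_API_KEY
  project_slug: my-project
  active_states: [Todo, In Progress]
polling:
  interval_ms: 30000
workspace:
  root: ~/symphony_workspaces
---
```

Everything after the closing `---` becomes `prompt_template`.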
#### 5.3.4 `hooks` (object)

Fields:

- `after_create` (multiline shell script string, optional)
  – Runs only when a workspace directory is newly created.
  – Failure aborts workspace creation.
- `before_run` (multiline shell script string, optional)
  – Runs before each agent attempt, after workspace preparation and before launching the coding
    agent.
  – Failure aborts the current attempt.
- `after_run` (multiline shell script string, optional)
  – Runs after each agent attempt (success, failure, timeout, or cancellation) once the workspace
    exists.
  – Failure is logged but ignored.
- `before_remove` (multiline shell script string, optional)
  – Runs before workspace deletion if the directory exists.
  – Failure is logged but ignored; cleanup still proceeds.
- `timeout_ms` (integer, optional)
  – Default: `60000`
  – Applies to all workspace hooks.
  – Non-positive values should be treated as invalid and fall back to the default.
  – Changes should be re-applied at runtime for future hook executions.

#### 5.3.5 `agent` (object)

Fields:

- `max_concurrent_agents` (integer or string integer)
  – Default: `10`
  – Changes should be re-applied at runtime and affect subsequent dispatch decisions.
- `max_retry_backoff_ms` (integer or string integer)
  – Default: `300000` (5 minutes)
  – Changes should be re-applied at runtime and affect future retry scheduling.
- `max_concurrent_agents_by_state` (map `state_name -> positive integer`)
  – Default: empty map.
  – State keys are normalized (`lowercase`) for lookup.
  – Invalid entries (non-positive or non-numeric) are ignored.
#### 5.3.6 `codex` (object)

For Codex-owned config values such as `approval_policy`, `thread_sandbox`, and
`turn_sandbox_policy`, supported values are defined by the targeted Codex app-server version.
Implementors should treat them as pass-through Codex config values rather than relying on a
hand-maintained enum in this spec. To inspect the installed Codex schema, run
`codex app-server generate-json-schema --out <path>` and inspect the relevant definitions referenced
by `v2/ThreadStartParams.json` and `v2/TurnStartParams.json`. Implementations may validate these
fields locally if they want stricter startup checks.

Fields:

- `command` (string shell command)
  – Default: `codex app-server`
  – The runtime launches this command via `bash -lc` in the workspace directory.
  – The launched process must speak a compatible app-server protocol over stdio.
- `approval_policy` (Codex `AskForApproval` value)
  – Default: implementation-defined.
- `thread_sandbox` (Codex `SandboxMode` value)
  – Default: implementation-defined.
- `turn_sandbox_policy` (Codex `SandboxPolicy` value)
  – Default: implementation-defined.
- `turn_timeout_ms` (integer)
  – Default: `3600000` (1 hour)
- `read_timeout_ms` (integer)
  – Default: `5000`
- `stall_timeout_ms` (integer)
  – Default: `300000` (5 minutes)
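Launching `codex.command` as specified, via `bash -lc` in the workspace with stdio piped for the protocol stream, can be sketched as follows (`launch_app_server` is a hypothetical helper name, not part of the spec):

```python
import subprocess

def app_server_argv(command: str) -> list[str]:
    # The configured command string runs through a login shell,
    # so PATH/profile setup applies inside the workspace.
    return ["bash", "-lc", command]

def launch_app_server(command: str, workspace: str) -> subprocess.Popen:
    return subprocess.Popen(
        app_server_argv(command),
        cwd=workspace,            # launched in the workspace directory
        stdin=subprocess.PIPE,    # JSON-RPC-like requests written here
        stdout=subprocess.PIPE,   # protocol stream is read from stdout only
        stderr=subprocess.PIPE,   # diagnostics only, never protocol JSON
        text=True,
    )
```

Piping stderr separately matters: Section 10 forbids protocol JSON parsing on stderr.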
## 6. Configuration Layer

### 6.1 Value Resolution Precedence

Config values resolve in this order (highest precedence first):

1. Explicit application/runtime settings (for example the CLI startup path over the cwd default).
2. YAML front matter values.
3. Environment indirection via `$VAR_NAME` inside selected YAML values.
4. Built-in defaults.

Value coercion semantics:

- Path/command fields support:
  – `~` home expansion
  – `$VAR` expansion for env-backed path values
  – Apply expansion only to values intended to be local filesystem paths; do not rewrite URIs or
    arbitrary shell command strings.

### 6.2 Dynamic Reload Semantics

Dynamic reload is required:

- The software should watch `WORKFLOW.md` for changes.
- On change, it should re-read and re-apply workflow config and prompt template without restart.
- The software should attempt to adjust live behavior to the new config (for example polling
  cadence, concurrency limits, active/terminal states, codex settings, workspace paths/hooks, and
  prompt content for future runs).
- Reloaded config applies to future dispatch, retry scheduling, reconciliation decisions, hook
  execution, and agent launches.
- Implementations are not required to restart in-flight agent sessions automatically when config
  changes.
- Extensions that manage their own listeners/resources (for example an HTTP server port change) may
  require restart unless the implementation explicitly supports live rebind.
- Implementations should also re-validate/reload defensively during runtime operations (for example
  before dispatch) in case filesystem watch events are missed.
- Invalid reloads should not crash the service; keep operating with the last known good effective
  configuration and emit an operator-visible error.
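A defensive reload can be as simple as comparing the file's mtime before sensitive operations. This non-normative sketch keeps the last known good config when a reload fails to parse:

```python
import os

class WorkflowReloader:
    """Reload WORKFLOW.md when its mtime changes; keep the last good config."""

    def __init__(self, path, parse):
        self.path = path
        self.parse = parse      # callable: file text -> parsed workflow object
        self.mtime = None
        self.current = None

    def maybe_reload(self):
        mtime = os.stat(self.path).st_mtime
        if mtime == self.mtime:
            return self.current
        self.mtime = mtime
        try:
            with open(self.path) as f:
                self.current = self.parse(f.read())
        except Exception:
            # Invalid reload: keep operating on the last known good config.
            # A real implementation would also emit an operator-visible error.
            pass
        return self.current
```

Calling `maybe_reload()` at the top of every tick satisfies the "re-validate defensively" bullet even if filesystem watch events are missed.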
### 6.3 Dispatch Preflight Validation

This validation is a scheduler preflight run before attempting to dispatch new work. It validates
the workflow/config needed to poll and launch workers; it is not a full audit of all possible
workflow behavior.

Startup validation:

- Validate configuration before starting the scheduling loop.
- If startup validation fails, fail startup and emit an operator-visible error.

Per-tick dispatch validation:

- Re-validate before each dispatch cycle.
- If validation fails, skip dispatch for that tick, keep reconciliation active, and emit an
  operator-visible error.

Validation checks:

- Workflow file can be loaded and parsed.
- `tracker.kind` is present and supported.
- `tracker.api_key` is present after `$` resolution.
- `tracker.project_slug` is present when required by the selected tracker kind.
- `codex.command` is present and non-empty.
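The checks above are pure functions of the parsed config. A non-normative sketch returning the first error; tracker error names follow Section 11.4, while `missing_codex_command` is a hypothetical name for the last check:

```python
import os

def resolve_env(value):
    # "$VAR_NAME" indirection; an empty expansion counts as missing.
    if isinstance(value, str) and value.startswith("$"):
        return os.environ.get(value[1:], "") or None
    return value or None

def preflight(config: dict):
    """Return None if dispatch may proceed, else a normalized error string."""
    tracker = config.get("tracker", {})
    codex = config.get("codex", {})
    if tracker.get("kind") != "linear":
        return "unsupported_tracker_kind"
    if not resolve_env(tracker.get("api_key")):
        return "missing_tracker_api_key"
    if not tracker.get("project_slug"):
        return "missing_tracker_project_slug"
    command = codex.get("command", "codex app-server")
    if not isinstance(command, str) or not command.strip():
        return "missing_codex_command"
    return None
```

Running this both at startup (fail hard) and per tick (skip dispatch, keep reconciling) covers both validation modes.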
### 6.4 Config Fields Summary (Cheat Sheet)

This section is intentionally redundant so a coding agent can implement the config layer quickly.

- `tracker.kind`: string, required, currently `linear`
- `tracker.endpoint`: string, default `https://api.linear.app/graphql` when `tracker.kind=linear`
- `tracker.api_key`: string or `$VAR`, canonical env `LINEAR_API_KEY` when `tracker.kind=linear`
- `tracker.project_slug`: string, required when `tracker.kind=linear`
- `tracker.active_states`: list of strings, default `["Todo", "In Progress"]`
- `tracker.terminal_states`: list of strings, default `["Closed", "Cancelled", "Canceled", "Duplicate", "Done"]`
- `polling.interval_ms`: integer, default `30000`
- `workspace.root`: path, default `/symphony_workspaces`
- `worker.ssh_hosts` (extension): list of SSH host strings, optional; when omitted, work runs
  locally
- `worker.max_concurrent_agents_per_host` (extension): positive integer, optional; shared per-host
  cap applied across configured SSH hosts
- `hooks.after_create`: shell script or null
- `hooks.before_run`: shell script or null
- `hooks.after_run`: shell script or null
- `hooks.before_remove`: shell script or null
- `hooks.timeout_ms`: integer, default `60000`
- `agent.max_concurrent_agents`: integer, default `10`
- `agent.max_turns`: integer, default `20`
- `agent.max_retry_backoff_ms`: integer, default `300000` (5m)
- `agent.max_concurrent_agents_by_state`: map of positive integers, default `{}`
- `codex.command`: shell command string, default `codex app-server`
- `codex.approval_policy`: Codex `AskForApproval` value, default implementation-defined
- `codex.thread_sandbox`: Codex `SandboxMode` value, default implementation-defined
- `codex.turn_sandbox_policy`: Codex `SandboxPolicy` value, default implementation-defined
- `codex.turn_timeout_ms`: integer, default `3600000`
- `codex.read_timeout_ms`: integer, default `5000`
- `codex.stall_timeout_ms`: integer, default `300000`
- `server.port` (extension): integer, optional; enables the optional HTTP server; `0` may be used
  for an ephemeral local bind, and the CLI `--port` flag overrides it

## 7. Orchestration State Machine

The orchestrator is the only component that mutates scheduling state. All worker outcomes are
reported back to it and converted into explicit state transitions.

### 7.1 Issue Orchestration States

These are not the same as tracker states (`Todo`, `In Progress`, etc.); they are the service's
internal claim states.

1. `Unclaimed`
   – Issue is not running and has no retry scheduled.
2. `Claimed`
   – Orchestrator has reserved the issue to prevent duplicate dispatch.
   – In practice, claimed issues are either `Running` or `RetryQueued`.
3. `Running`
   – Worker task exists and the issue is tracked in the `running` map.
4. `RetryQueued`
   – Worker is not running, but a retry timer exists in `retry_attempts`.
5. `Released`
   – Claim removed because the issue is terminal, non-active, missing, or the retry path completed
     without re-dispatch.

Important nuance:

- A successful worker exit does not mean the issue is done forever.
- The worker may continue through multiple back-to-back coding-agent turns before it exits.
- After each normal turn completion, the worker re-checks the tracker issue state.
- If the issue is still in an active state, the worker should start another turn on the same live
  coding-agent thread in the same workspace, up to `agent.max_turns`.
- The first turn should use the full rendered task prompt.
- Continuation turns should send only continuation guidance to the existing thread, not resend the
  original task prompt that is already present in thread history.
- Once the worker exits normally, the orchestrator still schedules a short continuation retry
  (about 1 second) so it can re-check whether the issue remains active and needs another worker
  session.
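The in-worker continuation loop described above can be sketched with injected callables. All names here are hypothetical illustrations, not spec-mandated interfaces:

```python
def run_worker_turns(start_turn, fetch_state, active_states, task_prompt,
                     continuation_prompt, max_turns=20):
    """Run up to max_turns coding-agent turns on one live thread.

    start_turn(prompt) -> True on normal turn completion, False otherwise.
    fetch_state()      -> current tracker state name for this issue.
    """
    turns = 0
    prompt = task_prompt          # first turn: full rendered task prompt
    active = {s.lower() for s in active_states}
    while turns < max_turns:
        turns += 1
        if not start_turn(prompt):
            return "failed"
        # Re-check the tracker after each normal turn completion.
        if fetch_state().lower() not in active:
            return "succeeded"
        prompt = continuation_prompt   # continuation guidance only, not the task prompt
    return "succeeded"  # max_turns exhausted; orchestrator re-checks via continuation retry
```

Note how the task prompt is sent exactly once; continuation turns rely on the thread's existing history.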
### 7.2 Run Attempt Lifecycle

A run attempt transitions through these phases:

1. `PreparingWorkspace`
2. `BuildingPrompt`
3. `LaunchingAgentProcess`
4. `InitializingSession`
5. `StreamingTurn`
6. `Finishing`
7. `Succeeded`
8. `Failed`
9. `TimedOut`
10. `Stalled`
11. `CanceledByReconciliation`

Distinct terminal reasons are important because retry logic and logs differ.

### 7.3 Transition Triggers

- `Poll Tick`
  – Reconcile active runs.
  – Validate config.
  – Fetch candidate issues.
  – Dispatch until slots are exhausted.
- `Worker Exit (normal)`
  – Remove running entry.
  – Update aggregate runtime totals.
  – Schedule continuation retry (attempt `1`) after the worker exhausts or finishes its in-process
    turn loop.
- `Worker Exit (abnormal)`
  – Remove running entry.
  – Update aggregate runtime totals.
  – Schedule exponential-backoff retry.
- `Codex Update Event`
  – Update live session fields, token counters, and rate limits.
- `Retry Timer Fired`
  – Re-fetch active candidates and attempt re-dispatch, or release the claim if no longer eligible.
- `Reconciliation State Refresh`
  – Stop runs whose issue states are terminal or no longer active.
- `Stall Timeout`
  – Kill the worker and schedule a retry.

### 7.4 Idempotency and Recovery Rules

- The orchestrator serializes state mutations through one authority to avoid duplicate dispatch.
- `claimed` and `running` checks are required before launching any worker.
- Reconciliation runs before dispatch on every tick.
- Restart recovery is tracker-driven and filesystem-driven (no durable orchestrator DB required).
- Startup terminal cleanup removes stale workspaces for issues already in terminal states.
## 8. Polling, Scheduling, and Reconciliation

### 8.1 Poll Loop

At startup, the service validates config, performs startup cleanup, schedules an immediate tick, and
then repeats every `polling.interval_ms`.

The effective poll interval should be updated when workflow config changes are re-applied.

Tick sequence:

1. Reconcile running issues.
2. Run dispatch preflight validation.
3. Fetch candidate issues from the tracker using active states.
4. Sort issues by dispatch priority.
5. Dispatch eligible issues while slots remain.
6. Notify observability/status consumers of state changes.

If per-tick validation fails, dispatch is skipped for that tick, but reconciliation still happens.

### 8.2 Candidate Selection Rules

An issue is dispatch-eligible only if all of the following are true:

- It has `id`, `identifier`, `title`, and `state`.
- Its state is in `active_states` and not in `terminal_states`.
- It is not already in `running`.
- It is not already in `claimed`.
- Global concurrency slots are available.
- Per-state concurrency slots are available.
- The blocker rule for the `Todo` state passes:
  – If the issue state is `Todo`, do not dispatch when any blocker is non-terminal.

Sorting order (stable intent):

1. `priority` ascending (1..4 are preferred; null/unknown sorts last)
2. `created_at` oldest first
3. `identifier` lexicographic tie-breaker

### 8.3 Concurrency Control

Global limit:

- `available_slots = max(max_concurrent_agents - running_count, 0)`

Per-state limit:

- `max_concurrent_agents_by_state[state]` if present (state key normalized)
- otherwise fall back to the global limit

The runtime counts issues by their current tracked state in the `running` map.

Optional SSH host limit:

- When `worker.max_concurrent_agents_per_host` is set, each configured SSH host may run at most
  that many concurrent agents at once.
- Hosts at that cap are skipped for new dispatch until capacity frees up.

### 8.4 Retry and Backoff

Retry entry creation:

- Cancel any existing retry timer for the same issue.
- Store `attempt`, `identifier`, `error`, `due_at_ms`, and the new timer handle.

Backoff formula:

- Normal continuation retries after a clean worker exit use a short fixed delay of `1000` ms.
- Failure-driven retries use `delay = min(10000 * 2^(attempt - 1), agent.max_retry_backoff_ms)`.
- The exponential term is capped by the configured max retry backoff (default `300000` / 5m).
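The backoff formula above, as a direct sketch:

```python
def retry_delay_ms(attempt: int, clean_exit: bool,
                   max_retry_backoff_ms: int = 300_000) -> int:
    # Clean worker exits get a short fixed continuation delay (~1s).
    if clean_exit:
        return 1_000
    # Failure-driven retries: 10s, 20s, 40s, ... capped at the configured max.
    return min(10_000 * 2 ** (attempt - 1), max_retry_backoff_ms)
```

With the default cap, attempts 1 through 5 yield 10s, 20s, 40s, 80s, 160s, and every later attempt yields the 5-minute cap.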
Retry handling behavior:

1. Fetch active candidate issues (not all issues).
2. Find the specific issue by `issue_id`.
3. If not found, release the claim.
4. If found and still candidate-eligible:
   – Dispatch if slots are available.
   – Otherwise requeue with error `no available orchestrator slots`.
5. If found but no longer active, release the claim.

Notes:

- Terminal-state workspace cleanup is handled by startup cleanup and active-run reconciliation
  (including terminal transitions for currently running issues).
- Retry handling mainly operates on active candidates and releases claims when the issue is absent,
  rather than performing terminal cleanup itself.

### 8.5 Active Run Reconciliation

Reconciliation runs every tick and has two parts.

Part A: Stall detection

- For each running issue, compute `elapsed_ms` since:
  – `last_codex_timestamp` if any event has been seen, else
  – `started_at`
- If `elapsed_ms > codex.stall_timeout_ms`, terminate the worker and queue a retry.

Turn outcome mapping:

- `turn/completed` -> success
- `turn/failed` -> failure
- `turn/cancelled` -> failure
- turn timeout (`turn_timeout_ms`) -> failure
- subprocess exit -> failure

Continuation processing:

- If the worker decides to continue after a successful turn, it should issue another `turn/start`
  on the same live `threadId`.
- The app-server subprocess should remain alive across those continuation turns and be stopped only
  when the worker run is ending.

Line handling requirements:

- Read protocol messages from stdout only.
- Buffer partial stdout lines until a newline arrives.
- Attempt JSON parse on complete stdout lines.
- Stderr is not part of the protocol stream:
  – ignore it or log it as diagnostics
  – do not attempt protocol JSON parsing on stderr
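The stdout framing rules above amount to a small incremental decoder; a non-normative sketch:

```python
import json

class LineDecoder:
    """Buffer partial stdout chunks and yield parsed JSON protocol messages."""

    def __init__(self):
        self.buffer = ""

    def feed(self, chunk: str):
        self.buffer += chunk
        messages = []
        while "\n" in self.buffer:
            line, self.buffer = self.buffer.split("\n", 1)
            if not line.strip():
                continue
            try:
                messages.append(json.loads(line))
            except json.JSONDecodeError:
                # Malformed protocol line; an implementation would emit `malformed`.
                continue
        return messages
```

Feeding raw read chunks through one decoder instance per subprocess keeps partial lines from ever reaching the JSON parser.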
### 10.4 Emitted Runtime Events (Upstream to Orchestrator)

The app-server client emits structured events to the orchestrator callback. Each event should
include:

- `event` (enum/string)
- `timestamp` (UTC timestamp)
- `codex_app_server_pid` (if available)
- optional `usage` map (token counts)
- payload fields as needed

Important emitted events may include:

- `session_started`
- `startup_failed`
- `turn_completed`
- `turn_failed`
- `turn_cancelled`
- `turn_ended_with_error`
- `turn_input_required`
- `approval_auto_approved`
- `unsupported_tool_call`
- `notification`
- `other_message`
- `malformed`

### 10.5 Approval, Tool Calls, and User Input Policy

Approval, sandbox, and user-input behavior is implementation-defined.

Policy requirements:

- Each implementation should document its chosen approval, sandbox, and operator-confirmation
  posture.
- Approval requests and user-input-required events must not leave a run stalled indefinitely. An
  implementation should either satisfy them, surface them to an operator, auto-resolve them, or
  fail the run according to its documented policy.

Example high-trust behavior:

- Auto-approve command execution approvals for the session.
- Auto-approve file-change approvals for the session.
- Treat user-input-required turns as hard failure.

Unsupported dynamic tool calls:

- Supported dynamic tool calls that are explicitly implemented and advertised by the runtime should
  be handled according to their extension contract.
- If the agent requests a dynamic tool call (`item/tool/call`) that is not supported, return a tool
  failure response and continue the session.
- This prevents the session from stalling on unsupported tool execution paths.
Optional client-side tool extension:

- An implementation may expose a limited set of client-side tools to the app-server session.
- Current optional standardized tool: `linear_graphql`.
- If implemented, supported tools should be advertised to the app-server session during startup
  using the protocol mechanism supported by the targeted Codex app-server version.
- Unsupported tool names should still return a failure result and continue the session.

`linear_graphql` extension contract:

- Purpose: execute a raw GraphQL query or mutation against Linear using Symphony's configured
  tracker auth for the current session.
- Availability: only meaningful when `tracker.kind == "linear"` and valid Linear auth is configured.
- Preferred input shape:

  ```json
  {
    "query": "single GraphQL query or mutation document",
    "variables": {
      "optional": "graphql variables object"
    }
  }
  ```

- `query` must be a non-empty string.
- `query` must contain exactly one GraphQL operation.
- `variables` is optional and, when present, must be a JSON object.
- Implementations may additionally accept a raw GraphQL query string as shorthand input.
- Execute one GraphQL operation per tool call.
- If the provided document contains multiple operations, reject the tool call as invalid input.
- `operationName` selection is intentionally out of scope for this extension.
- Reuse the configured Linear endpoint and auth from the active Symphony workflow/runtime config; do
  not require the coding agent to read raw tokens from disk.
- Tool result semantics:
  – transport success + no top-level GraphQL `errors` -> `success=true`
  – top-level GraphQL `errors` present -> `success=false`, but preserve the GraphQL response body
    for debugging
  – invalid input, missing auth, or transport failure -> `success=false` with an error payload
- Return the GraphQL response or error payload as structured tool output that the model can inspect
  in-session.

Illustrative responses (equivalent payload shapes are acceptable if they preserve the same outcome):

```json
{"id":"","result":{"approved":true}}
{"id":"","result":{"success":false,"error":"unsupported_tool_call"}}
```
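A sketch of the input-validation and result-mapping halves of this contract. Non-normative: the single-operation check here is a naive keyword count (it can misfire on keywords inside string literals), and the error strings are hypothetical; a real implementation would parse the GraphQL document properly:

```python
import re

def validate_tool_input(payload):
    """Return None if the linear_graphql input is acceptable, else an error string."""
    if isinstance(payload, str):              # optional raw-string shorthand
        payload = {"query": payload}
    if not isinstance(payload, dict):
        return "invalid_input"
    query = payload.get("query")
    if not isinstance(query, str) or not query.strip():
        return "invalid_input"
    variables = payload.get("variables")
    if variables is not None and not isinstance(variables, dict):
        return "invalid_input"
    # Naive single-operation check: count top-level operation keywords.
    ops = re.findall(r"\b(query|mutation|subscription)\b", query)
    if len(ops) > 1:
        return "invalid_input"
    return None

def tool_result(transport_ok: bool, body):
    # Map a GraphQL response onto the success/failure semantics above.
    if not transport_ok:
        return {"success": False, "error": "transport_failure"}
    if body and body.get("errors"):
        return {"success": False, "response": body}   # preserve body for debugging
    return {"success": True, "response": body}
```

Note that a GraphQL-level error is still a successful tool transport: the model gets the full error body to inspect in-session.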
run attempt immediately.1113- The client detects this via:1114 – explicit method (`item/tool/requestUserInput`), or1115 – turn methods/flags indicating input is required.1117### 10.6 Timeouts and Error Mapping1119Timeouts:1121- `codex.read_timeout_ms`: request/response timeout during startup and sync requests1122- `codex.turn_timeout_ms`: total turn stream timeout1123- `codex.stall_timeout_ms`: enforced by orchestrator based on event inactivity1125Error mapping (recommended normalized categories):1127- `codex_not_found`1128- `invalid_workspace_cwd`1129- `response_timeout`1130- `turn_timeout`1131- `port_exit`1132- `response_error`1133- `turn_failed`1134- `turn_cancelled`1135- `turn_input_required`1137### 10.7 Agent Runner Contract1139The `Agent Runner` wraps workspace + prompt + app-server client.1141Behavior:11431. Create/reuse workspace for issue.11442. Build prompt from workflow template.11453. Start app-server session.11464. Forward app-server events to orchestrator.11475. On any error, fail the worker attempt (the orchestrator will retry).1151- Workspaces are intentionally preserved after successful runs.1153## 11. Issue Tracker Integration Contract (Linear-Compatible)1155### 11.1 Required Operations1157An implementation must support these tracker adapter operations:11591. `fetch_candidate_issues()`1160 – Return issues in configured active states for a configured project.11622. `fetch_issues_by_states(state_names)`1163 – Used for startup terminal cleanup.11653. 
`fetch_issue_states_by_ids(issue_ids)`1166 – Used for active-run reconciliation.1168### 11.2 Query Semantics (Linear)1170Linear-specific requirements for `tracker.kind == “linear”`:1172- `tracker.kind == “linear”`1173- GraphQL endpoint (default `https://api.linear.app/graphql`)1174- Auth token sent in `Authorization` header1175- `tracker.project_slug` maps to Linear project `slugId`1176- Candidate issue query filters project using `project: { slugId: { eq: $projectSlug } }`1177- Issue-state refresh query uses GraphQL issue IDs with variable type `[ID!]`1178- Pagination required for candidate issues1179- Page size default: `50`1180- Network timeout: `30000 ms`1182Important:1184- Linear GraphQL schema details can drift. Keep query construction isolated and test the exact query1185 fields/types required by this specification.1187A non-Linear implementation may change transport details, but the normalized outputs must match the1188domain model in Section 4.1190### 11.3 Normalization Rules1192Candidate issue normalization should produce fields listed in Section 4.1.1.1194Additional normalization details:1196- `labels` – > lowercase strings1197- `blocked_by` – > derived from inverse relations where relation type is `blocks`1198- `priority` – > integer only (non-integers become null)1199- `created_at` and `updated_at` – > parse ISO-8601 timestamps1201### 11.4 Error Handling Contract1203Recommended error categories:1205- `unsupported_tracker_kind`1206- `missing_tracker_api_key`1207- `missing_tracker_project_slug`1208- `linear_api_request` (transport failures)1209- `linear_api_status` (non-200 HTTP)1210- `linear_graphql_errors`1211- `linear_unknown_payload`1212- `linear_missing_end_cursor` (pagination integrity error)1214Orchestrator behavior on tracker errors:1216- Candidate fetch failure: log and skip dispatch for this tick.1217- Running-state refresh failure: log and keep active workers running.1218- Startup terminal cleanup failure: log warning and continue 
startup.1220### 11.5 Tracker Writes (Important Boundary)1222Symphony does not require first-class tracker write APIs in the orchestrator.1224- Ticket mutations (state transitions, comments, PR metadata) are typically handled by the coding1225 agent using tools defined by the workflow prompt.1226- The service remains a scheduler/runner and tracker reader.1227- Workflow-specific success often means “reached the next handoff state” (for example1228 `Human Review`) rather than tracker terminal state `Done`.1229- If the optional `linear_graphql` client-side tool extension is implemented, it is still part of1230 the agent toolchain rather than orchestrator business logic.1232## 12. Prompt Construction and Context Assembly1234### 12.1 Inputs1236Inputs to prompt rendering:1238- `workflow.prompt_template`1239- normalized `issue` object1240- optional `attempt` integer (retry/continuation metadata)1242### 12.2 Rendering Rules1244- Render with strict variable checking.1245- Render with strict filter checking.1246- Convert issue object keys to strings for template compatibility.1247- Preserve nested arrays/maps (labels, blockers) so templates can iterate.1249### 12.3 Retry/Continuation Semantics1251`attempt` should be passed to the template because the workflow prompt may provide different1252instructions for:1254- first run (`attempt` null or absent)1255- continuation run after a successful prior session1256- retry after error/timeout/stall1258### 12.4 Failure Semantics1260If prompt rendering fails:1262- Fail the run attempt immediately.1263- Let the orchestrator treat it like any other worker failure and decide retry behavior.1265## 13. 
Logging, Status, and Observability1267### 13.1 Logging Conventions1269Required context fields for issue-related logs:1271- `issue_id`1272- `issue_identifier`1274Required context for coding-agent session lifecycle logs:1276- `session_id`1278Message formatting requirements:1280- Use stable `key=value` phrasing.1281- Include action outcome (`completed`, `failed`, `retrying`, etc.).1282- Include concise failure reason when present.1283- Avoid logging large raw payloads unless necessary.1285### 13.2 Logging Outputs and Sinks1287The spec does not prescribe where logs must go (stderr, file, remote sink, etc.).1289Requirements:1291- Operators must be able to see startup/validation/dispatch failures without attaching a debugger.1292- Implementations may write to one or more sinks.1293- If a configured log sink fails, the service should continue running when possible and emit an1294 operator-visible warning through any remaining sink.1296### 13.3 Runtime Snapshot / Monitoring Interface (Optional but Recommended)1298If the implementation exposes a synchronous runtime snapshot (for dashboards or monitoring), it1299should return:1301- `running` (list of running session rows)1302- each running row should include `turn_count`1303- `retrying` (list of retry queue rows)1304- `codex_totals`1305 – `input_tokens`1306 – `output_tokens`1307 – `total_tokens`1308 – `seconds_running` (aggregate runtime seconds as of snapshot time, including active sessions)1309- `rate_limits` (latest coding-agent rate limit payload, if available)1311Recommended snapshot error modes:1313- `timeout`1314- `unavailable`1316### 13.4 Optional Human-Readable Status Surface1318A human-readable status surface (terminal output, dashboard, etc.) 
is optional and1319implementation-defined.1321If present, it should draw from orchestrator state/metrics only and must not be required for1322correctness.1324### 13.5 Session Metrics and Token Accounting1326Token accounting rules:1328- Agent events may include token counts in multiple payload shapes.1329- Prefer absolute thread totals when available, such as:1330 – `thread/tokenUsage/updated` payloads1331 – `total_token_usage` within token-count wrapper events1332- Ignore delta-style payloads such as `last_token_usage` for dashboard/API totals.1333- Extract input/output/total token counts leniently from common field names within the selected1334 payload.1335- For absolute totals, track deltas relative to last reported totals to avoid double-counting.1336- Do not treat generic `usage` maps as cumulative totals unless the event type defines them that1338- Accumulate aggregate totals in orchestrator state.1340Runtime accounting:1342- Runtime should be reported as a live aggregate at snapshot/render time.1343- Implementations may maintain a cumulative counter for ended sessions and add active-session1344 elapsed time derived from `running` entries (for example `started_at`) when producing a1345 snapshot/status view.1346- Add run duration seconds to the cumulative ended-session runtime when a session ends (normal exit1347 or cancellation/termination).1348- Continuous background ticking of runtime totals is not required.1350Rate-limit tracking:1352- Track the latest rate-limit payload seen in any agent update.1353- Any human-readable presentation of rate-limit data is implementation-defined.1355### 13.6 Humanized Agent Event Summaries (Optional)1357Humanized summaries of raw agent protocol events are optional.1359If implemented:1361- Treat them as observability-only output.1362- Do not make orchestrator logic depend on humanized strings.
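The delta-tracking rule from Section 13.5 can be sketched as follows. This is an illustrative sketch, not a required implementation; the `session` dict stands in for the Live Session fields of Section 4.1.6, and the payload field names mirror that section:

```python
def apply_token_totals(session: dict, totals: dict) -> None:
    """Fold an absolute thread-total payload into session counters.

    Tracks deltas against the last reported totals so repeated absolute
    snapshots are not double-counted into the aggregate.
    """
    for kind in ("input", "output", "total"):
        reported = int(totals.get(f"{kind}_tokens", 0))
        last = session.get(f"last_reported_{kind}_tokens", 0)
        delta = max(0, reported - last)  # absolute totals only ever grow
        session[f"last_reported_{kind}_tokens"] = reported
        session[f"codex_{kind}_tokens"] = session.get(f"codex_{kind}_tokens", 0) + delta
```

Replaying the same absolute payload twice leaves the accumulated counters unchanged, which is exactly the double-counting protection the rule asks for.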
# Symphony Service Specification

Status: Draft v1 (language-agnostic)

Purpose: Define a service that orchestrates coding agents to get project work done.

## 1. Problem Statement

Symphony is a long-running automation service that continuously reads work from an issue tracker
(Linear in this specification version), creates an isolated workspace for each issue, and runs a
coding agent session for that issue inside the workspace.

The service solves four operational problems:

- It turns issue execution into a repeatable daemon workflow instead of manual scripts.
- It isolates agent execution in per-issue workspaces so agent commands run only inside per-issue
  workspace directories.
- It keeps the workflow policy in-repo (`WORKFLOW.md`) so teams version the agent prompt and runtime
  settings with their code.
- It provides enough observability to operate and debug multiple concurrent agent runs.

Implementations are expected to document their trust and safety posture explicitly. This
specification does not require a single approval, sandbox, or operator-confirmation policy; some
implementations may target trusted environments with a high-trust configuration, while others may
require stricter approvals or sandboxing.

Important boundary:

- Symphony is a scheduler/runner and tracker reader.
- Ticket writes (state transitions, comments, PR links) are typically performed by the coding agent
  using tools available in the workflow/runtime environment.
- A successful run may end at a workflow-defined handoff state (for example `Human Review`), not
  necessarily `Done`.

## 2. Goals and Non-Goals

### 2.1 Goals

- Poll the issue tracker on a fixed cadence and dispatch work with bounded concurrency.
- Maintain a single authoritative orchestrator state for dispatch, retries, and reconciliation.
- Create deterministic per-issue workspaces and preserve them across runs.
- Stop active runs when issue state changes make them ineligible.
- Recover from transient failures with exponential backoff.
- Load runtime behavior from a repository-owned `WORKFLOW.md` contract.
- Expose operator-visible observability (at minimum structured logs).
- Support restart recovery without requiring a persistent database.

### 2.2 Non-Goals

- Rich web UI or multi-tenant control plane.
- Prescribing a specific dashboard or terminal UI implementation.
- General-purpose workflow engine or distributed job scheduler.
- Built-in business logic for how to edit tickets, PRs, or comments. (That logic lives in the
  workflow prompt and agent tooling.)
- Mandating strong sandbox controls beyond what the coding agent and host OS provide.
- Mandating a single default approval, sandbox, or operator-confirmation posture for all
  implementations.

## 3. System Overview

### 3.1 Main Components

1. `Workflow Loader`
   - Reads `WORKFLOW.md`.
   - Parses YAML front matter and prompt body.
   - Returns `{config, prompt_template}`.
2. `Config Layer`
   - Exposes typed getters for workflow config values.
   - Applies defaults and environment variable indirection.
   - Performs validation used by the orchestrator before dispatch.
3. `Issue Tracker Client`
   - Fetches candidate issues in active states.
   - Fetches current states for specific issue IDs (reconciliation).
   - Fetches terminal-state issues during startup cleanup.
   - Normalizes tracker payloads into a stable issue model.
4. `Orchestrator`
   - Owns the poll tick.
   - Owns the in-memory runtime state.
   - Decides which issues to dispatch, retry, stop, or release.
   - Tracks session metrics and retry queue state.
5. `Workspace Manager`
   - Maps issue identifiers to workspace paths.
   - Ensures per-issue workspace directories exist.
   - Runs workspace lifecycle hooks.
   - Cleans workspaces for terminal issues.
6. `Agent Runner`
   - Creates workspace.
   - Builds prompt from issue + workflow template.
   - Launches the coding agent app-server client.
   - Streams agent updates back to the orchestrator.
7. `Status Surface` (optional)
   - Presents human-readable runtime status (for example terminal output, dashboard, or other
     operator-facing view).
8. `Logging`
   - Emits structured runtime logs to one or more configured sinks.

### 3.2 Abstraction Levels

Symphony is easiest to port when kept in these layers:

1. `Policy Layer` (repo-defined)
   - `WORKFLOW.md` prompt body.
   - Team-specific rules for ticket handling, validation, and handoff.
2. `Configuration Layer` (typed getters)
   - Parses front matter into typed runtime settings.
   - Handles defaults, environment tokens, and path normalization.
3. `Coordination Layer` (orchestrator)
   - Polling loop, issue eligibility, concurrency, retries, reconciliation.
4. `Execution Layer` (workspace + agent subprocess)
   - Filesystem lifecycle, workspace preparation, coding-agent protocol.
5. `Integration Layer` (Linear adapter)
   - API calls and normalization for tracker data.
6. `Observability Layer` (logs + optional status surface)
   - Operator visibility into orchestrator and agent behavior.

### 3.3 External Dependencies

- Issue tracker API (Linear for `tracker.kind: linear` in this specification version).
- Local filesystem for workspaces and logs.
- Optional workspace population tooling (for example Git CLI, if used).
- Coding-agent executable that supports JSON-RPC-like app-server mode over stdio.
- Host environment authentication for the issue tracker and coding agent.

## 4. Core Domain Model

### 4.1 Entities

#### 4.1.1 Issue

Normalized issue record used by orchestration, prompt rendering, and observability output.

Fields:

- `id` (string)
  - Stable tracker-internal ID.
- `identifier` (string)
  - Human-readable ticket key (example: `ABC-123`).
- `title` (string)
- `description` (string or null)
- `priority` (integer or null)
  - Lower numbers are higher priority in dispatch sorting.
- `state` (string)
  - Current tracker state name.
- `branch_name` (string or null)
  - Tracker-provided branch metadata if available.
- `url` (string or null)
- `labels` (list of strings)
  - Normalized to lowercase.
- `blocked_by` (list of blocker refs)
  - Each blocker ref contains:
    - `id` (string or null)
    - `identifier` (string or null)
    - `state` (string or null)
- `created_at` (timestamp or null)
- `updated_at` (timestamp or null)
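The normalized issue record above can be expressed directly as a typed model. This is an illustrative sketch in Python, not a required representation; field names follow Section 4.1.1:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class BlockerRef:
    """Reference to a blocking issue (all fields may be unknown)."""
    id: Optional[str] = None
    identifier: Optional[str] = None
    state: Optional[str] = None

@dataclass
class Issue:
    """Normalized issue record (fields from Section 4.1.1)."""
    id: str                                  # stable tracker-internal ID
    identifier: str                          # human-readable key, e.g. ABC-123
    title: str
    state: str                               # current tracker state name
    description: Optional[str] = None
    priority: Optional[int] = None           # lower = higher dispatch priority
    branch_name: Optional[str] = None
    url: Optional[str] = None
    labels: list[str] = field(default_factory=list)            # lowercased
    blocked_by: list[BlockerRef] = field(default_factory=list)
    created_at: Optional[datetime] = None
    updated_at: Optional[datetime] = None
```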

#### 4.1.2 Workflow Definition

Parsed `WORKFLOW.md` payload:

- `config` (map)
  - YAML front matter root object.
- `prompt_template` (string)
  - Markdown body after front matter, trimmed.

#### 4.1.3 Service Config (Typed View)

Typed runtime values derived from `WorkflowDefinition.config` plus environment resolution.

Examples:

- poll interval
- workspace root
- active and terminal issue states
- concurrency limits
- coding-agent executable/args/timeouts
- workspace hooks

#### 4.1.4 Workspace

Filesystem workspace assigned to one issue identifier.

Fields (logical):

- `path` (workspace path; current runtime typically uses absolute paths, but relative roots are
  possible if configured without path separators)
- `workspace_key` (sanitized issue identifier)
- `created_now` (boolean, used to gate `after_create` hook)

#### 4.1.5 Run Attempt

One execution attempt for one issue.

Fields (logical):

- `issue_id`
- `issue_identifier`
- `attempt` (integer or null, `null` for first run, `>=1` for retries/continuation)
- `workspace_path`
- `started_at`
- `status`
- `error` (optional)

#### 4.1.6 Live Session (Agent Session Metadata)

State tracked while a coding-agent subprocess is running.

Fields:

- `session_id` (string, `<thread_id>-<turn_id>`)
- `thread_id` (string)
- `turn_id` (string)
- `codex_app_server_pid` (string or null)
- `last_codex_event` (string/enum or null)
- `last_codex_timestamp` (timestamp or null)
- `last_codex_message` (summarized payload)
- `codex_input_tokens` (integer)
- `codex_output_tokens` (integer)
- `codex_total_tokens` (integer)
- `last_reported_input_tokens` (integer)
- `last_reported_output_tokens` (integer)
- `last_reported_total_tokens` (integer)
- `turn_count` (integer)
  - Number of coding-agent turns started within the current worker lifetime.

#### 4.1.7 Retry Entry

Scheduled retry state for an issue.

Fields:

- `issue_id`
- `identifier` (best-effort human ID for status surfaces/logs)
- `attempt` (integer, 1-based for retry queue)
- `due_at_ms` (monotonic clock timestamp)
- `timer_handle` (runtime-specific timer reference)
- `error` (string or null)

#### 4.1.8 Orchestrator Runtime State

Single authoritative in-memory state owned by the orchestrator.

Fields:

- `poll_interval_ms` (current effective poll interval)
- `max_concurrent_agents` (current effective global concurrency limit)
- `running` (map `issue_id -> running entry`)
- `claimed` (set of issue IDs reserved/running/retrying)
- `retry_attempts` (map `issue_id -> RetryEntry`)
- `completed` (set of issue IDs; bookkeeping only, not dispatch gating)
- `codex_totals` (aggregate tokens + runtime seconds)
- `codex_rate_limits` (latest rate-limit snapshot from agent events)

### 4.2 Stable Identifiers and Normalization Rules

- `Issue ID`
  - Use for tracker lookups and internal map keys.
- `Issue Identifier`
  - Use for human-readable logs and workspace naming.
- `Workspace Key`
  - Derive from `issue.identifier` by replacing any character not in `[A-Za-z0-9._-]` with `_`.
  - Use the sanitized value for the workspace directory name.
- `Normalized Issue State`
  - Compare states after `lowercase`.
- `Session ID`
  - Compose from coding-agent `thread_id` and `turn_id` as `<thread_id>-<turn_id>`.
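The normalization rules above are small enough to sketch in a few lines. This is an illustrative sketch of the rules as stated, not a mandated implementation:

```python
import re

def workspace_key(identifier: str) -> str:
    """Replace any character outside [A-Za-z0-9._-] with '_'."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", identifier)

def session_id(thread_id: str, turn_id: str) -> str:
    """Compose the session id from the thread and turn ids."""
    return f"{thread_id}-{turn_id}"

def normalize_state(state: str) -> str:
    """Issue states are compared case-insensitively."""
    return state.lower()
```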

## 5. Workflow Specification (Repository Contract)

### 5.1 File Discovery and Path Resolution

Workflow file path precedence:

1. Explicit application/runtime setting (set by CLI startup path).
2. Default: `WORKFLOW.md` in the current process working directory.

Loader behavior:

- If the file cannot be read, return a `missing_workflow_file` error.
- The workflow file is expected to be repository-owned and version-controlled.

### 5.2 File Format

`WORKFLOW.md` is a Markdown file with optional YAML front matter.

Design note:

`WORKFLOW.md` should be self-contained enough to describe and run different workflows (prompt,
runtime settings, hooks, and tracker selection/config) without requiring out-of-band
service-specific configuration.

Parsing rules:

- If the file starts with `---`, parse lines until the next `---` as YAML front matter.
- Remaining lines become the prompt body.
- If front matter is absent, treat the entire file as prompt body and use an empty config map.
- YAML front matter must decode to a map/object; non-map YAML is an error.
- The prompt body is trimmed before use.

Returned workflow object:

- `config`: front matter root object (not nested under a `config` key).
- `prompt_template`: trimmed Markdown body.

### 5.3 Front Matter Schema

Top-level keys:

- `tracker`
- `polling`
- `workspace`
- `hooks`
- `agent`
- `codex`

Unknown keys should be ignored for forward compatibility.

The workflow front matter is extensible. Optional extensions may define additional top-level keys
(for example `server`) without changing the core schema above.
Extensions should document their field schema, defaults, validation rules, and whether changes
apply dynamically or require restart.
Common extension: `server.port` (integer) enables the optional HTTP server described in Section

#### 5.3.1 `tracker` (object)

Fields:

- `kind` (string)
  - Required for dispatch.
  - Current supported value: `linear`
- `endpoint` (string)
  - Default for `tracker.kind == "linear"`: `https://api.linear.app/graphql`
- `api_key` (string)
  - May be a literal token or `$VAR_NAME`.
  - Canonical environment variable for `tracker.kind == "linear"`: `LINEAR_API_KEY`.
  - If `$VAR_NAME` resolves to an empty string, treat the key as missing.
- `project_slug` (string)
  - Required for dispatch when `tracker.kind == "linear"`.
- `active_states` (list of strings)
  - Default: `Todo`, `In Progress`
- `terminal_states` (list of strings)
  - Default: `Closed`, `Cancelled`, `Canceled`, `Duplicate`, `Done`

#### 5.3.2 `polling` (object)

Fields:

- `interval_ms` (integer or string integer)
  - Default: `30000`
  - Changes should be re-applied at runtime and affect future tick scheduling without restart.

#### 5.3.3 `workspace` (object)

Fields:

- `root` (path string or `$VAR`)
  - Default: `/symphony_workspaces`
  - `~` and strings containing path separators are expanded.
  - Bare strings without path separators are preserved as-is (relative roots are allowed but
    discouraged).

#### 5.3.4 `hooks` (object)

Fields:

- `after_create` (multiline shell script string, optional)
  - Runs only when a workspace directory is newly created.
  - Failure aborts workspace creation.
- `before_run` (multiline shell script string, optional)
  - Runs before each agent attempt, after workspace preparation and before launching the coding
    agent.
  - Failure aborts the current attempt.
- `after_run` (multiline shell script string, optional)
  - Runs after each agent attempt (success, failure, timeout, or cancellation) once the workspace
    exists.
  - Failure is logged but ignored.
- `before_remove` (multiline shell script string, optional)
  - Runs before workspace deletion if the directory exists.
  - Failure is logged but ignored; cleanup still proceeds.
- `timeout_ms` (integer, optional)
  - Default: `60000`
  - Applies to all workspace hooks.
  - Non-positive values should be treated as invalid and fall back to the default.
  - Changes should be re-applied at runtime for future hook executions.

#### 5.3.5 `agent` (object)

Fields:

- `max_concurrent_agents` (integer or string integer)
  - Default: `10`
  - Changes should be re-applied at runtime and affect subsequent dispatch decisions.
- `max_retry_backoff_ms` (integer or string integer)
  - Default: `300000` (5 minutes)
  - Changes should be re-applied at runtime and affect future retry scheduling.
- `max_concurrent_agents_by_state` (map `state_name -> positive integer`)
  - Default: empty map.
  - State keys are normalized (`lowercase`) for lookup.
  - Invalid entries (non-positive or non-numeric) are ignored.
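The interaction between the global cap and the per-state caps can be sketched as a single dispatch check. This is illustrative only; the function name and argument shapes are assumptions, while the lowercase state-key matching follows the rule above:

```python
def can_dispatch(
    issue_state: str,
    running_states: list[str],
    max_concurrent_agents: int,
    max_by_state: dict[str, int],
) -> bool:
    """Check global and per-state concurrency caps before dispatching.

    `running_states` holds the tracker state of each currently running
    issue; per-state cap keys are matched after lowercasing.
    """
    if len(running_states) >= max_concurrent_agents:
        return False  # global cap reached
    state_key = issue_state.lower()
    cap = max_by_state.get(state_key)
    if cap is not None:
        in_state = sum(1 for s in running_states if s.lower() == state_key)
        if in_state >= cap:
            return False  # per-state cap reached
    return True
```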

#### 5.3.6 `codex` (object)

For Codex-owned config values such as `approval_policy`, `thread_sandbox`, and
`turn_sandbox_policy`, supported values are defined by the targeted Codex app-server version.
Implementors should treat them as pass-through Codex config values rather than relying on a
hand-maintained enum in this spec. To inspect the installed Codex schema, run
`codex app-server generate-json-schema --out <path>` and inspect the relevant definitions referenced
by `v2/ThreadStartParams.json` and `v2/TurnStartParams.json`. Implementations may validate these
fields locally if they want stricter startup checks.

Fields:

- `command` (string shell command)
  - Default: `codex app-server`
  - The runtime launches this command via `bash -lc` in the workspace directory.
  - The launched process must speak a compatible app-server protocol over stdio.
- `approval_policy` (Codex `AskForApproval` value)
  - Default: implementation-defined.
- `thread_sandbox` (Codex `SandboxMode` value)
  - Default: implementation-defined.
- `turn_sandbox_policy` (Codex `SandboxPolicy` value)
  - Default: implementation-defined.
- `turn_timeout_ms` (integer)
  - Default: `3600000` (1 hour)
- `read_timeout_ms` (integer)
  - Default: `5000`
- `stall_timeout_ms` (integer)
  - Default: `300000` (5 minutes)
  - If `<= 0`, stall detection is disabled.

### 5.4 Prompt Template Contract

The Markdown body of `WORKFLOW.md` is the per-issue prompt template.

Rendering requirements:

- Use a strict template engine (Liquid-compatible semantics are sufficient).
- Unknown variables must fail rendering.
- Unknown filters must fail rendering.

Template input variables:

- `issue` (object)
  - Includes all normalized issue fields, including labels and blockers.
- `attempt` (integer or null)
  - `null`/absent on first attempt.
  - Integer on retry or continuation run.

Fallback prompt behavior:

- If the workflow prompt body is empty, the runtime may use a minimal default prompt
  (`You are working on an issue from Linear.`).
- Workflow file read/parse failures are configuration/validation errors and should not silently fall
  back to the default prompt.
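A real implementation would use a strict-mode Liquid engine, but the "unknown variables must fail rendering" requirement can be illustrated with a minimal renderer. This sketch handles only `{{ name }}` interpolation with dotted lookups into nested dicts; it is not a Liquid implementation and its names are assumptions:

```python
import re

_VAR = re.compile(r"\{\{\s*([a-zA-Z_][\w.]*)\s*\}\}")

def render_strict(template: str, variables: dict) -> str:
    """Minimal strict renderer: every {{ name }} must resolve, else raise.

    Dotted names (e.g. issue.title) walk into nested dicts, mirroring how
    the normalized issue object is exposed to the template.
    """
    def resolve(match: re.Match) -> str:
        path = match.group(1).split(".")
        value: object = variables
        for part in path:
            if not isinstance(value, dict) or part not in value:
                raise KeyError(f"unknown template variable: {match.group(1)}")
            value = value[part]
        return str(value)

    return _VAR.sub(resolve, template)
```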

### 5.5 Workflow Validation and Error Surface

Error classes:

- `missing_workflow_file`
- `workflow_parse_error`
- `workflow_front_matter_not_a_map`
- `template_parse_error` (during prompt rendering)
- `template_render_error` (unknown variable/filter, invalid interpolation)

Dispatch gating behavior:

- Workflow file read/YAML errors block new dispatches until fixed.
- Template errors fail only the affected run attempt.

## 6. Configuration Specification

### 6.1 Source Precedence and Resolution Semantics

Configuration precedence:

1. Workflow file path selection (runtime setting -> cwd default).
2. YAML front matter values.
3. Environment indirection via `$VAR_NAME` inside selected YAML values.
4. Built-in defaults.

Value coercion semantics:

- Path/command fields support:
  - `~` home expansion
  - `$VAR` expansion for env-backed path values
- Apply expansion only to values intended to be local filesystem paths; do not rewrite URIs or
  arbitrary shell command strings.
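The `$VAR_NAME` indirection and `~` expansion rules can be sketched as two small helpers. This is illustrative; the real config layer also applies the path-separator rules from Section 5.3.3, which this sketch omits:

```python
import os

def resolve_config_value(raw: str) -> str:
    """Resolve `$VAR_NAME` indirection; unset/empty env values come back empty,
    which callers must treat as a missing value (see tracker.api_key rules)."""
    if raw.startswith("$"):
        return os.environ.get(raw[1:], "")
    return raw

def resolve_path_value(raw: str) -> str:
    """Expand env indirection and `~` for values meant to be local paths."""
    value = resolve_config_value(raw)
    return os.path.expanduser(value)
```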

### 6.2 Dynamic Reload Semantics

Dynamic reload is required:

- The software should watch `WORKFLOW.md` for changes.
- On change, it should re-read and re-apply workflow config and prompt template without restart.
- The software should attempt to adjust live behavior to the new config (for example polling
  cadence, concurrency limits, active/terminal states, codex settings, workspace paths/hooks, and
  prompt content for future runs).
- Reloaded config applies to future dispatch, retry scheduling, reconciliation decisions, hook
  execution, and agent launches.
- Implementations are not required to restart in-flight agent sessions automatically when config
  changes.
- Extensions that manage their own listeners/resources (for example an HTTP server port change) may
  require restart unless the implementation explicitly supports live rebind.
- Implementations should also re-validate/reload defensively during runtime operations (for example
  before dispatch) in case filesystem watch events are missed.
- Invalid reloads should not crash the service; keep operating with the last known good effective
  configuration and emit an operator-visible error.

### 6.3 Dispatch Preflight Validation

This validation is a scheduler preflight run before attempting to dispatch new work. It validates
the workflow/config needed to poll and launch workers, not a full audit of all possible workflow
behavior.

Startup validation:

- Validate configuration before starting the scheduling loop.
- If startup validation fails, fail startup and emit an operator-visible error.

Per-tick dispatch validation:

- Re-validate before each dispatch cycle.
- If validation fails, skip dispatch for that tick, keep reconciliation active, and emit an
  operator-visible error.

Validation checks:

- The workflow file can be loaded and parsed.
- `tracker.kind` is present and supported.
- `tracker.api_key` is present after `$` resolution.
- `tracker.project_slug` is present when required by the selected tracker kind.
- `codex.command` is present and non-empty.
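The validation checks above can be sketched as a pure function over an already-`$`-resolved config map. This is an illustrative sketch; the function name, message strings, and the assumption that env indirection has already been applied are all choices of this sketch, not requirements:

```python
def preflight_errors(config: dict) -> list[str]:
    """Return validation errors that must block dispatch (empty list = OK).

    Assumes `$VAR` indirection has already been resolved, so empty values
    count as missing.
    """
    errors = []
    tracker = config.get("tracker", {})
    kind = tracker.get("kind")
    if not kind:
        errors.append("missing tracker.kind")
    elif kind != "linear":
        errors.append(f"unsupported tracker.kind: {kind}")
    if not tracker.get("api_key"):
        errors.append("missing tracker.api_key")
    if kind == "linear" and not tracker.get("project_slug"):
        errors.append("missing tracker.project_slug")
    codex = config.get("codex", {})
    if not (codex.get("command") or "codex app-server").strip():
        errors.append("empty codex.command")
    return errors
```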

### 6.4 Config Fields Summary (Cheat Sheet)

This section is intentionally redundant so a coding agent can implement the config layer quickly.

- `tracker.kind`: string, required, currently `linear`
- `tracker.endpoint`: string, default `https://api.linear.app/graphql` when `tracker.kind=linear`
- `tracker.api_key`: string or `$VAR`, canonical env `LINEAR_API_KEY` when `tracker.kind=linear`
- `tracker.project_slug`: string, required when `tracker.kind=linear`
- `tracker.active_states`: list of strings, default `["Todo", "In Progress"]`
- `tracker.terminal_states`: list of strings, default `["Closed", "Cancelled", "Canceled", "Duplicate", "Done"]`
- `polling.interval_ms`: integer, default `30000`
- `workspace.root`: path, default `/symphony_workspaces`
- `worker.ssh_hosts` (extension): list of SSH host strings, optional; when omitted, work runs
  locally
- `worker.max_concurrent_agents_per_host` (extension): positive integer, optional; shared per-host
  cap applied across configured SSH hosts
- `hooks.after_create`: shell script or null
- `hooks.before_run`: shell script or null
- `hooks.after_run`: shell script or null
- `hooks.before_remove`: shell script or null
- `hooks.timeout_ms`: integer, default `60000`
- `agent.max_concurrent_agents`: integer, default `10`
- `agent.max_turns`: integer, default `20`
- `agent.max_retry_backoff_ms`: integer, default `300000` (5m)
- `agent.max_concurrent_agents_by_state`: map of positive integers, default `{}`
- `codex.command`: shell command string, default `codex app-server`
- `codex.approval_policy`: Codex `AskForApproval` value, default implementation-defined
- `codex.thread_sandbox`: Codex `SandboxMode` value, default implementation-defined
- `codex.turn_sandbox_policy`: Codex `SandboxPolicy` value, default implementation-defined
- `codex.turn_timeout_ms`: integer, default `3600000`
- `codex.read_timeout_ms`: integer, default `5000`
- `codex.stall_timeout_ms`: integer, default `300000`
- `server.port` (extension): integer, optional; enables the optional HTTP server, `0` may be used
  for ephemeral local bind, and CLI `--port` overrides it
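Putting the cheat sheet together, a minimal `WORKFLOW.md` might look like the following. This is an illustrative example, not part of the spec: the project slug, repo URL, and hook body are hypothetical placeholder values, and most omitted keys fall back to the defaults listed above:

```yaml
---
tracker:
  kind: linear
  api_key: $LINEAR_API_KEY
  project_slug: my-project          # hypothetical slug
polling:
  interval_ms: 30000
workspace:
  root: ~/symphony_workspaces
hooks:
  after_create: |
    git clone "$REPO_URL" .          # hypothetical population step
  timeout_ms: 60000
agent:
  max_concurrent_agents: 10
  max_retry_backoff_ms: 300000
codex:
  command: codex app-server
  turn_timeout_ms: 3600000
---

You are working on issue {{ issue.identifier }}: {{ issue.title }}.
```

Everything below the second `---` is the trimmed prompt body described in Section 5.4.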

## 7. Orchestration State Machine

The orchestrator is the only component that mutates scheduling state. All worker outcomes are
reported back to it and converted into explicit state transitions.

### 7.1 Issue Orchestration States

This is not the same as tracker states (`Todo`, `In Progress`, etc.). This is the service's internal
claim state.

1. `Unclaimed`
   - Issue is not running and has no retry scheduled.
2. `Claimed`
   - Orchestrator has reserved the issue to prevent duplicate dispatch.
   - In practice, claimed issues are either `Running` or `RetryQueued`.
3. `Running`
   - Worker task exists and the issue is tracked in the `running` map.
4. `RetryQueued`
   - Worker is not running, but a retry timer exists in `retry_attempts`.
5. `Released`
   - Claim removed because the issue is terminal, non-active, missing, or the retry path completed
     without re-dispatch.

Important nuance:

- A successful worker exit does not mean the issue is done forever.
- The worker may continue through multiple back-to-back coding-agent turns before it exits.
- After each normal turn completion, the worker re-checks the tracker issue state.
- If the issue is still in an active state, the worker should start another turn on the same live
  coding-agent thread in the same workspace, up to `agent.max_turns`.
- The first turn should use the full rendered task prompt.
- Continuation turns should send only continuation guidance to the existing thread, not resend the
  original task prompt that is already present in thread history.
- Once the worker exits normally, the orchestrator still schedules a short continuation retry
  (about 1 second) so it can re-check whether the issue remains active and needs another worker
  session.

626
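The in-worker turn loop described above can be sketched as follows. Names such as `run_turn`, `fetch_issue_state`, and the continuation prompt text are illustrative, not part of this spec:

```python
# Sketch of the worker's in-process turn loop (names are illustrative).
def run_worker_turns(issue_id, thread, task_prompt, max_turns,
                     fetch_issue_state, active_states):
    """Run up to max_turns coding-agent turns on one live thread.

    The first turn sends the full rendered task prompt; continuation
    turns send only short continuation guidance, since the original
    prompt is already in the thread history.
    """
    turns_run = 0
    for turn in range(max_turns):
        prompt = task_prompt if turn == 0 else "Continue working on the issue."
        thread.run_turn(prompt)
        turns_run += 1
        # Re-check tracker state after each normal turn completion.
        state = fetch_issue_state(issue_id)
        if state not in active_states:
            break
    return turns_run
```

After this loop returns and the worker exits, the orchestrator's short continuation retry decides whether a fresh worker session is needed.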

### 7.2 Run Attempt Lifecycle

A run attempt transitions through these phases:

1. `PreparingWorkspace`
2. `BuildingPrompt`
3. `LaunchingAgentProcess`
4. `InitializingSession`
5. `StreamingTurn`
6. `Finishing`
7. `Succeeded`
8. `Failed`
9. `TimedOut`
10. `Stalled`
11. `CanceledByReconciliation`

Distinct terminal reasons are important because retry logic and log output differ between them.

### 7.3 Transition Triggers

`Poll Tick`
Reconcile active runs. Validate config. Fetch candidate issues. Dispatch until slots are exhausted.

`Worker Exit (normal)`
Remove the running entry. Update aggregate runtime totals. Schedule a continuation retry
(attempt `1`) after the worker exhausts or finishes its in-process turn loop.

`Worker Exit (abnormal)`
Remove the running entry. Update aggregate runtime totals. Schedule an exponential-backoff retry.

`Codex Update Event`
Update live session fields, token counters, and rate limits.

`Retry Timer Fired`
Re-fetch active candidates and attempt re-dispatch, or release the claim if no longer eligible.

`Reconciliation State Refresh`
Stop runs whose issue states are terminal or no longer active.

`Stall Timeout`
Kill the worker and schedule a retry.

### 7.4 Idempotency and Recovery Rules

- The orchestrator serializes state mutations through one authority to avoid duplicate dispatch.
- `claimed` and `running` checks are required before launching any worker.
- Reconciliation runs before dispatch on every tick.
- Restart recovery is tracker-driven and filesystem-driven (no durable orchestrator DB required).
- Startup terminal cleanup removes stale workspaces for issues already in terminal states.

## 8. Polling, Scheduling, and Reconciliation

### 8.1 Poll Loop

At startup, the service validates config, performs startup cleanup, schedules an immediate tick, and
then repeats every `polling.interval_ms`.

The effective poll interval should be updated when workflow config changes are re-applied.

Tick sequence:

1. Reconcile running issues.
2. Run dispatch preflight validation.
3. Fetch candidate issues from the tracker using active states.
4. Sort issues by dispatch priority.
5. Dispatch eligible issues while slots remain.
6. Notify observability/status consumers of state changes.

If per-tick validation fails, dispatch is skipped for that tick, but reconciliation still happens.
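The tick sequence above can be sketched as a single function; the orchestrator methods are hypothetical names for the numbered steps, not a prescribed API:

```python
# Sketch of one poll tick (method names mirror the tick sequence above).
def poll_tick(orch):
    orch.reconcile_running_issues()  # step 1: reconciliation always runs
    if orch.validate_dispatch_config():  # step 2: preflight validation
        issues = orch.fetch_candidates(orch.config.active_states)  # step 3
        for issue in orch.sort_by_dispatch_priority(issues):  # step 4
            if not orch.has_free_slots():
                break
            if orch.is_dispatch_eligible(issue):  # step 5
                orch.dispatch(issue)
    # Validation failure skips dispatch only; status consumers still hear
    # about any state changes from reconciliation (step 6).
    orch.notify_status_consumers()
```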

### 8.2 Candidate Selection Rules

An issue is dispatch-eligible only if all of the following are true:

- It has `id`, `identifier`, `title`, and `state`.
- Its state is in `active_states` and not in `terminal_states`.
- It is not already in `running`.
- It is not already in `claimed`.
- Global concurrency slots are available.
- Per-state concurrency slots are available.
- The blocker rule for the `Todo` state passes: if the issue state is `Todo`, do not dispatch while
  any blocker is non-terminal.

Sorting order (stable intent):

1. `priority` ascending (1..4 are preferred; null/unknown sorts last)
2. `created_at` oldest first
3. `identifier` lexicographic tie-breaker
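The sorting order can be expressed as a sort key. A sketch, assuming issues are dicts shaped like this spec's normalized issue model:

```python
# Sketch of the dispatch sort key from the sorting order above.
def dispatch_sort_key(issue):
    # priority 1..4 sorts ascending; null/unknown priorities sort last.
    priority = issue.get("priority")
    known = priority is not None and 1 <= priority <= 4
    return (
        0 if known else 1,        # known priorities before null/unknown
        priority if known else 0, # then ascending priority
        issue["created_at"],      # then oldest first (ISO timestamps compare lexically)
        issue["identifier"],      # lexicographic tie-breaker
    )
```

Applied as `sorted(candidates, key=dispatch_sort_key)`; a stable sort preserves the intended tie-breaking.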

### 8.3 Concurrency Control

Global limit:

`available_slots = max(max_concurrent_agents - running_count, 0)`

Per-state limit:

`max_concurrent_agents_by_state[state]` if present (state key normalized); otherwise fall back to
the global limit.

The runtime counts issues by their current tracked state in the `running` map.

Optional SSH host limit:

When `worker.max_concurrent_agents_per_host` is set, each configured SSH host may run at most that
many concurrent agents. Hosts at that cap are skipped for new dispatch until capacity frees up.
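A sketch of the slot math; the specific normalization shown (trim plus lowercase) is an assumption, since the spec only says state keys are normalized:

```python
# Sketch of global and per-state concurrency limits.
def available_global_slots(max_concurrent_agents, running_count):
    # Never negative, even if running_count temporarily exceeds the limit.
    return max(max_concurrent_agents - running_count, 0)

def per_state_limit(state, max_by_state, global_limit):
    # Assumed normalization: trim whitespace and lowercase before lookup.
    normalized = state.strip().lower()
    table = {k.strip().lower(): v for k, v in max_by_state.items()}
    return table.get(normalized, global_limit)
```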

### 8.4 Retry and Backoff

Retry entry creation:

- Cancel any existing retry timer for the same issue.
- Store `attempt`, `identifier`, `error`, `due_at_ms`, and the new timer handle.

Backoff formula:

- Normal continuation retries after a clean worker exit use a short fixed delay of `1000` ms.
- Failure-driven retries use `delay = min(10000 * 2^(attempt - 1), agent.max_retry_backoff_ms)`.
- The exponential term is capped by the configured max retry backoff (default `300000` ms, i.e. 5 min).
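The backoff formulas above can be sketched directly:

```python
# Sketch of the retry delay computation.
CONTINUATION_DELAY_MS = 1000   # fixed delay after a clean worker exit
BASE_DELAY_MS = 10000          # first failure-driven retry delay

def retry_delay_ms(attempt, max_retry_backoff_ms=300000, clean_exit=False):
    """Return the delay before re-dispatching an issue.

    Clean worker exits get the short fixed continuation delay; failures
    back off exponentially, capped at agent.max_retry_backoff_ms.
    """
    if clean_exit:
        return CONTINUATION_DELAY_MS
    return min(BASE_DELAY_MS * 2 ** (attempt - 1), max_retry_backoff_ms)
```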

Retry handling behavior:

1. Fetch active candidate issues (not all issues).
2. Find the specific issue by `issue_id`.
3. If not found, release the claim.
4. If found and still candidate-eligible: dispatch if slots are available; otherwise requeue with
   the error `no available orchestrator slots`.
5. If found but no longer active, release the claim.

Terminal-state workspace cleanup is handled by startup cleanup and active-run reconciliation
(including terminal transitions for currently running issues). Retry handling mainly operates on
active candidates and releases claims when the issue is absent, rather than performing terminal
cleanup itself.
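A sketch of the retry-timer-fired handler; it simplifies "candidate-eligible" to an active-state check, and the orchestrator method names are illustrative:

```python
# Sketch of retry handling steps 1-5 (method names are illustrative).
def on_retry_timer_fired(orch, issue_id, attempt):
    candidates = orch.fetch_candidates(orch.config.active_states)  # step 1
    issue = next((i for i in candidates if i["id"] == issue_id), None)  # step 2
    if issue is None:
        orch.release_claim(issue_id)  # step 3: absent, give up the claim
        return
    if issue["state"] not in orch.config.active_states:
        orch.release_claim(issue_id)  # step 5: no longer active
        return
    if orch.has_free_slots():
        orch.dispatch(issue)          # step 4: re-dispatch
    else:
        orch.requeue_retry(issue_id, attempt + 1,
                           error="no available orchestrator slots")
```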

### 8.5 Active Run Reconciliation

Reconciliation runs every tick and has two parts.

Part A: Stall detection

For each running issue, compute `elapsed_ms` since:

- `last_codex_timestamp` if any event has been seen, else
- `started_at`

If `elapsed_ms > codex.stall_timeout_ms`, terminate the worker and queue a retry.
If `stall_timeout_ms <= 0`, skip stall detection entirely.
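Part A reduces to a small predicate per running issue:

```python
# Sketch of the Part A stall check for one running issue.
def is_stalled(now_ms, started_at_ms, last_codex_timestamp_ms, stall_timeout_ms):
    if stall_timeout_ms <= 0:
        return False  # stall detection disabled entirely
    # Measure elapsed time from the last Codex event if one was seen,
    # otherwise from the run start.
    reference_ms = (last_codex_timestamp_ms
                    if last_codex_timestamp_ms is not None
                    else started_at_ms)
    return (now_ms - reference_ms) > stall_timeout_ms
```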

Part B: Tracker state refresh

Fetch current issue states for all running issue IDs. For each running issue:

- If the tracker state is terminal: terminate the worker and clean the workspace.
- If the tracker state is still active: update the in-memory issue snapshot.
- If the tracker state is neither active nor terminal: terminate the worker without workspace
  cleanup.

If the state refresh fails, keep workers running and try again on the next tick.
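The per-issue decision in Part B can be sketched as a pure mapping from the refreshed state to an action (the action names are illustrative):

```python
# Sketch of the Part B decision for one running issue.
def refresh_action(state, active_states, terminal_states):
    """Map a refreshed tracker state to a reconciliation action."""
    if state in terminal_states:
        return "terminate_and_clean_workspace"
    if state in active_states:
        return "update_snapshot"
    # Neither active nor terminal, e.g. a handoff state like Human Review.
    return "terminate_keep_workspace"
```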

### 8.6 Startup Terminal Workspace Cleanup

When the service starts:

1. Query the tracker for issues in terminal states.
2. For each returned issue identifier, remove the corresponding workspace directory.
3. If the terminal-issues fetch fails, log a warning and continue startup.

This prevents stale terminal workspaces from accumulating after restarts.
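A sketch of the startup cleanup, assuming a `fetch_terminal_issues` callable and workspace directories named directly by identifier (the sanitization described in Section 9 is omitted for brevity):

```python
import logging
import shutil
from pathlib import Path

# Sketch of startup terminal workspace cleanup.
def startup_terminal_cleanup(workspace_root, fetch_terminal_issues):
    try:
        issues = fetch_terminal_issues()  # step 1: query terminal-state issues
    except Exception:
        # Step 3: a failed fetch must not block startup.
        logging.warning("terminal-issues fetch failed; continuing startup")
        return
    for issue in issues:
        # Step 2: remove the per-issue workspace directory if present.
        path = Path(workspace_root) / issue["identifier"]
        shutil.rmtree(path, ignore_errors=True)
```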

## 9. Workspace Management and Safety

### 9.1 Workspace Layout

Workspace root:

`workspace.root` (normalized path; the current config layer expands path-like values and preserves
bare relative names)

Per-issue workspace path:

`<workspace.root>/<workspace_key>`

Workspace persistence:

- Workspaces are reused across runs for the same issue.
- Successful runs do not auto-delete workspaces.

### 9.2 Workspace Creation and Reuse

Input: `issue.identifier`

Algorithm summary:

1. Sanitize the identifier to `workspace_key`.
2. Compute the workspace path under the workspace root.
3. Ensure the workspace path exists as a directory.
4. Mark `created_now=true` only if the directory was created during this call; otherwise
   `created_now=false`.
5. If `created_now=true`, run the `after_create` hook if configured.
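Steps 1-4 can be sketched as follows; the exact sanitization rule (replace anything outside `[A-Za-z0-9._-]` with `-`) is an assumption, not mandated by this spec:

```python
import re
from pathlib import Path

# Sketch of workspace creation/reuse (sanitization rule is an assumption).
def ensure_workspace(workspace_root, identifier):
    # Step 1: sanitize the identifier into a filesystem-safe key.
    workspace_key = re.sub(r"[^A-Za-z0-9._-]", "-", identifier)
    # Step 2: compute the path under the workspace root.
    path = Path(workspace_root) / workspace_key
    # Steps 3-4: create the directory if needed and record whether we did.
    created_now = not path.is_dir()
    path.mkdir(parents=True, exist_ok=True)
    return path, created_now
```

When `created_now` is true, the caller then runs the configured `after_create` hook (step 5).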

This section does not assume any specific repository/VCS workflow. Workspace preparation beyond
directory creation (for example dependency bootstrap, checkout/sync, code generation) is
implementation-defined and is typically handled via hooks.

### 9.3 Optional Workspace Population (Implementation-Defined)

The spec does not require any built-in VCS or repository bootstrap behavior.

Implementations may populate or synchronize the workspace using implementation-defined logic and/or
hooks (for example `after_create` and/or `before_run`).

Failure handling:

- Workspace population/synchronization failures return an error for the current attempt.
- If a failure happens while creating a brand-new workspace, implementations may remove the
  partially prepared directory.
- Reused workspaces should not be destructively reset on population failure unless that policy is
  explicitly chosen and documented.

### 9.4 Workspace Hooks

Supported hooks:

- `hooks.after_create`
- `hooks.before_run`
- `hooks.after_run`
- `hooks.before_remove`

Execution contract:

Execute in a local shell context appropriate to the host OS, with the workspace directory as `cwd`.
On POSIX systems, `sh -lc <script>` is a typical invocation.
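A sketch of this contract for POSIX hosts, bounded by `hooks.timeout_ms`; it assumes hook scripts are plain shell strings and returns the completed process for the caller to inspect:

```python
import subprocess

# Sketch of hook execution: run the configured shell script in the
# workspace directory, bounded by hooks.timeout_ms (POSIX sh shown).
def run_hook(script, workspace_dir, timeout_ms=60000):
    if not script:
        return None  # hook not configured (null in WORKFLOW.md)
    return subprocess.run(
        ["sh", "-lc", script],
        cwd=workspace_dir,       # workspace directory is the cwd
        timeout=timeout_ms / 1000,
        capture_output=True,
        text=True,
    )
```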

Shape
Shape
Stay Ahead

Explore More Insights

Stay ahead with more perspectives on cutting-edge power, infrastructure, energy,  bitcoin and AI solutions. Explore these articles to uncover strategies and insights shaping the future of industries.

Shape

Cirrascale to offer on-prem Google Gemini models

Google Distributed Cloud can be deployed in customer-controlled environments, including installations that are disconnected from the Internet, which is a key requirement for some government and critical-infrastructure users. One of the big challenges is that these models are incredibly valuable and they need to be delivered in a trusted, secure

Read More »

Golden Pass LNG ships first export cargo

Editor’s Note: Updated Apr. 23 to include information provided by the US Energy Information Administration.  Golden Pass LNG, a joint venture between QatarEnergy and ExxonMobil Corp., has loaded and shipped its first LNG export cargo from the plant in Sabine Pass, Tex. The departure comes following first LNG production from Train 1 late last month. Once fully operational, Golden Pass LNG expects to export about 18 million tons/year (tpy) of LNG. Golden Pass LNG is the 10th LNG plant in the US, the US Energy Information Administration (EIA) noted in a separate release Apr. 23. It is the only new US LNG export plant currently expected to begin LNG shipments this year, EIA said. Construction and commissioning continue on Trains 2 and 3, which are expected to come online in turn, following stable operation of Train 1. EIA noted Golden Pass aims to start up Train 2 in second-half 2026 and Train 3 in first-half 2027. QatarEnergy holds 70% interest in Golden Pass LNG, while ExxonMobil holds the remaining 30%. LNG demand  ExxonMobil forecasts natural gas demand to rise 20% by 2050 and LNG demand to rise by 3% per year through 2050. The operator is developing four LNG projects and, by 2030, expects to double its supply compared to 2020 to more than 40 million tpy.

Read More »

Ecopetrol agrees to acquire equity stake in Brava Energia with plans for increased ownership

State-owned Ecopetrol SA, Bogotá, Colombia, has agreed to acquire a 26% equity stake in Brava Energia SA from a group of shareholders and plans to launch a tender offer to increase its ownership to 51%, which would give it control of the Brazilian oil and gas independent. The move would add exposure to roughly 81,000 boe/d of production and 459 MMboe of reserves, expanding Ecopetrol’s footprint in Brazil. Ecopetrol entered into share purchase agreement with Jive, Yellowstone, and Bloco Somah Printemps Quantum, which together constitute a group holding about 26% of the outstanding common shares of Brava Energia. Brava Energia, the second-largest independent company listed in the Brazilian market in terms of reserves and production, was incorporated in 2024 from the merger between 3R Petroleum Óleo e Gás SA and Enauta Participações SA. Completion of the deal is subject to certain conditions, including, among others, approval by Brazil’s Administrative Council for Economic Defense (CADE), the grant of certain waivers and consents considering Brava’s financing instruments and relevant commercial agreements, as well as the purchase by Ecopetrol SA, or one of its affiliates or subsidiaries within the Ecopetrol Group, of the number of shares required to achieve a 51% controlling stake of Brava’s voting share capital. Ecopetrol plans to launch a voluntary tender offer on the B3 stock exchange in Brazil to buy additional shares to reach 51% controlling stake at R$23.00 per share, subject to regulatory requirements and certain conditions. Ecopetrol in Brazil In Brazil, Ecopetrol, through subsidiary Ecopetrol Óleo e Gás do Brasil Ltda., holds 30% interest in 11 blocks in the southern area of Santos basin in consortium with Shell Brasil Petróleo Ltda. (operator, 70%).  The company also holds a 30% non-operated interest in Gato do Mato (BM-S-54) and Sul de Gato do Mato (production sharing agreement), which

Read More »

China leads global oil stockpiles in 2025

China, the United States, and Japan held the world’s largest strategic oil inventories as of December 2025, the US Energy Information Administration (EIA) said in a recent note.  The EIA examined significant global buildup in strategic oil inventories as of December 2025, prior to the International Energy Agency (IEA)-coordinated emergency release in March 2026 triggered by the Strait of Hormuz disruption. These reserves—first established by OECD countries in the 1970s—continue to serve as a critical buffer against supply shocks. China holds the largest volume of oil inventories globally. EIA estimates about 360 million bbl in government-held stocks and roughly 1 billion bbl in commercial inventories, bringing its total to nearly 1.4 billion bbl. The agency said China added about 1.1 million b/d to inventories in 2025, reflecting an aggressive stockpiling strategy. The US follows, with about 413 million bbl in its Strategic Petroleum Reserve (SPR) as of December 2025, alongside more than 400 million bbl in commercial crude stocks, EIA said. Japan ranks third, holding 263 million bbl in government reserves, with an additional 220 million bbl required under Japan’s Oil Stockpiling Act. OECD Europe held about 179 million bbl, and South Korea maintained roughly 79 million bbl.  Among non-OECD countries, estimates are less transparent, EIA noted. Saudi Arabia held about 82 million bbl, Iran 71 million bbl, and the UAE 34 million bbl in on-land inventories, while India’s SPR totaled 21.4 million bbl, with plans to expand storage capacity domestically and abroad. Global estimates remain conservative due to limited transparency and varying definitions of “strategic” inventories, EIA said. In most countries, only government or national oil company holdings are counted, though China is a key exception where commercial inventories are included due to state-directed stockpiling. EIA plans to update its assessment periodically in its Short-Term Energy Outlook beginning this May.

Read More »

Peace signals temper crude rally, Europe jet fuel tightness intensifies

Tentative diplomatic signals offer limited relief to markets still dominated by supply disruption concerns surrounding the Iran war. At the time of writing, Brent crude futures hovered around $105–106/bbl after earlier trading above $107/bbl, while West Texas Intermediate (WTI) held near $95–97/bbl. Prices softened modestly following reports of renewed diplomatic engagement. Iranian Foreign Minister Seyed Abbas Araghchi is expected to visit Pakistan for talks. Separately, Israel and Lebanon agreed to extend their ceasefire by 3 weeks after meetings with US officials in Washington. Stay updated on oil price volatility, shipping disruptions, LNG market analysis, and production output at OGJ’s Iran war content hub. Despite these developments, market participants remain cautious, with analysts warning that any easing in risk premiums may prove temporary. Ongoing tensions linked to the US-Iran conflict continue to disrupt flows through the Strait of Hormuz, a critical artery for global oil trade. In remarks at CNBC’s Converge Live conference, Fatih Birol, executive director of the International Energy Agency (IEA), described the situation as an unprecedented energy security challenge, noting that the strait is operating under what he termed a “double blockade,” severely constraining tanker movements. The impact is being felt acutely in refined product markets, particularly in Europe’s aviation sector. With Middle Eastern exports curtailed, European refiners have shifted output toward jet fuel production, though with limited flexibility. According to Frans Everts of Shell plc, refineries across the region are operating in “max jet mode,” with only marginal capacity to increase yields further. Inventory data underscore the tightening balance. Jet fuel and kerosene stocks in the Amsterdam-Rotterdam-Antwerp hub fell to 597,000 metric tons, the lowest level since April 2020, declining 10% year-on-year. 
In response, Europe has increasingly relied on imports from the US Gulf Coast to offset lost Middle Eastern supply. According to IEA, global oil supply

Read More »

Shell to expand Canadian operations with $16.4-billion acquisition of ARC Resources

Shell plc has agreed to acquire ARC Resources Ltd. in a transaction valued at about $16.4 billion, including $13.6 billion in equity and roughly $2.8 billion in assumed net debt and leases. The acquisition is expected to strengthen Shell’s integrated gas portfolio and expand its position in Canada through the addition of long-life Montney resources in British Columbia and Alberta, the companies said Apr. 27. “ARC is a high-quality, low-cost, and top-quartile low carbon intensity producer in the Montney that complements our existing footprint in Canada and strengthens our resource base for decades,” said Wael Sawan, Shell chief executive officer. “This establishes Canada as a heartland for Shell while furthering our strategy to deliver more value with less emissions.” ARC produced 374,000 boe/d in 2025 (before royalties). Its assets overlap with Shell’s existing Groundbirch position in British Columbia and the Gold Creek development in Alberta. Groundbirch supplies gas to the 14-million tonnes/year LNG Canada liquefaction plant (Shell, 40%), as well as to the domestic market.

Read More »

Brent holds above $100/bbl; US shale response remains restrained

Global crude markets remained firmly supported Apr. 27 as the ongoing Iran conflict and continued disruption in the Strait of Hormuz reinforced a persistent geopolitical risk premium, offsetting intermittent diplomatic signals. Brent crude traded in the upper-$100/bbl range, while West Texas Intermediate (WTI) held in the high-$90s/bbl, reflecting tight physical supply conditions and uncertainty surrounding Middle East export flows. Stay updated on oil price volatility, shipping disruptions, LNG market analysis, and production output at OGJ’s Iran war content hub. While diplomatic efforts between the US and Iran have produced occasional signs of progress—including reported proposals to reopen the strait—negotiations remain fragile. The situation has evolved into a prolonged stalemate, with neither a full escalation nor a clear resolution in sight. Current market structure reflects a geopolitically driven pricing regime, with volatility concentrated in near-term crude futures while longer-dated contracts remain relatively anchored. The impact of Iran-related supply disruptions is being priced primarily into prompt contracts, whereas deferred benchmarks—such as 2027 WTI—have moved more modestly, holding in the low-$70/bbl range. This divergence suggests that traders view the current supply shock as severe but not necessarily permanent, with expectations of eventual normalization. However, according to the latest Dallas Fed survey, 86% of US oil and gas executives view another future Hormuz disruption within the next 5 years as somewhat or very likely, while 40% do not expect normalization of Hormuz traffic by August. A further 35% believe less than 90% of shut-in Gulf production will eventually return. These figures suggest the industry is calibrating its medium-term strategy around a world of elevated and recurring geopolitical risk. 
US shale response remains restrained According to an analysis from Macquire, despite favorable pricing, the US upstream response is expected to be measured. With average breakeven levels near $43/bbl WTI, current prices offer highly attractive margins

Read More »

AI data flows force rethink of data center networking at Backblaze

According to a report that Backblaze released this morning, traffic from content delivery networks and hosting and Internet services providers have stayed largely within historical norms over the past year. But traffic from hyperscalers and neoclouds fluctuated dramatically, with steep climbs in September and October and another uptick in March. Another network traffic change related to AI is geography. “Traditionally, it didn’t matter where cloud infrastructure was located,” says Nowak. But with AI workloads, if storage is close to compute, enterprises get lower latency and higher throughput. Today, Virginia and California have a high concentration of AI compute providers. This, in turn, brings in more storage companies. “In July, we chose to double our footprint in US East to increase the proximity to hyperscalers and neoclouds,” says Nowak. And that, in turn, leads to even more demand for compute, and even greater concentration. “There’s a snowball effect,” Nowak says. Why neoclouds for AI? Enterprises might think that they don’t need to worry about network traffic details if they’re using a hyperscaler for their AI workloads because the data and the processing both stay within the cloud. But there are advantages to using a third-party storage provider combined with neoclouds for the GPUs. According to a report released by Synergy Research Group in early April, neocloud revenues hit $9 billion in the fourth quarter of 2025, a 223% year-over-year increase. Revenues passed $25 billion for the whole year and are expected to hit $400 billion by 2031.

Read More »

TD Cowen: AI Adoption Is Already Here. Infrastructure Demand Is What Comes Next.

Enterprise AI adoption is no longer emerging. It is already embedded and beginning to scale in ways that will reshape data center demand. The latest TD Cowen GenAI Adoption Survey makes that clear. Across 689 U.S. enterprises, 92% are now using at least one major AI platform, with Microsoft Copilot, Google Gemini, and ChatGPT forming the core triad of daily enterprise tooling. That’s the baseline. The more important story is what comes next. AI is moving quickly from assistive software to autonomous systems, and that shift carries direct implications for compute demand, power consumption, and infrastructure design. From Copilots to Autonomous Systems Today’s enterprise AI footprint is already broad, but it is still largely human-in-the-loop. That is beginning to change. Roughly a third of respondents say they already have semi-autonomous AI agents running in production, while another large cohort is piloting or planning deployments over the next 12 to 18 months. By 2027, more than three-quarters expect to be running AI agents capable of executing multi-step workflows without human intervention. This is not incremental adoption. It is a step-function shift. Autonomous agents don’t just respond to prompts; they execute tasks, interact with enterprise systems, and continuously access data. For data centers, that translates into more persistent, baseline load: exactly the kind of demand profile that stresses power delivery, increases utilization, and accelerates capacity planning timelines. To wit: AI is moving from a bursty workload to a continuous one. ROI Is No Longer the Question At the same time, the debate around AI return on investment is effectively over. Three-quarters of respondents report positive ROI, while only a small minority report negative outcomes. A meaningful share is already seeing multiples of return on their investments. The implication seems straightforward: AI budgets are becoming durable. This is no longer experimental spend that

Read More »

BYOP Moves to the Center of Data Center Strategy

Self-Sufficiency Becomes a Feature, Not a Risk Consider Wyoming’s Project Jade, where county commissioners approved an AI campus tied to 2.7 GW of new natural gas-fired generation being developed by Tallgrass Energy. Reporting from POWER described the project as a “bring your own power” model designed for a high degree of self-sufficiency, with a mix of natural gas generation and Bloom fuel cells. The campus is expected to scale significantly over time. What stands out is not only the size, but the positioning. Self-sufficiency is becoming a selling point both for developers seeking to de-risk timelines, and for local stakeholders wary of overloading existing utility infrastructure. Fuel Cells and Nuclear: The Middle Ground and the Long Game Fuel cells occupy an important middle ground in this shift. Bloom Energy’s 2026 report positions fuel cells as a leading onsite option due to shorter lead times, modular deployment, and lower local emissions. Market activity suggests that interest is real. For developers, fuel cells can be easier to permit than large turbine installations and can be deployed incrementally. That makes them effective as bridge-to-grid solutions or as permanent components of hybrid architectures. Advanced nuclear remains the most strategically significant, but least immediate, BYOP pathway. Companies including Switch and other data center operators have explored partnerships with Oklo around its Aurora small modular reactor design. Nuclear holds long-term appeal because it offers firm, low-carbon power at scale. But for current AI buildouts, it remains a future option rather than a near-term construction solution. The immediate reality is that gas and modular onsite systems are closing the time-to-power gap, while nuclear is being positioned as a longer-duration successor as licensing and deployment timelines evolve. The model itself is also evolving. BYOP is beginning to blur the line between developer, energy provider, and compute customer. 
Reuters

Read More »

Microsoft Builds for Two Worlds: Sovereign Cloud and AI Factories

So far in 2026, across the United States and overseas, Microsoft is building an infrastructure portfolio at full hyperscale. The strategy runs on two tracks. The first is familiar: sovereign cloud expansion involving new regions, local data residency, and compliance-driven enterprise infrastructure. The second is larger and more consequential: purpose-built AI factory campuses designed for dense GPU clusters, liquid cooling, private fiber, and power acquisition at a scale that extends far beyond traditional cloud infrastructure. Despite reports last year that Microsoft was pulling back on data center development, the company is accelerating. It is not only advancing its own large-scale campuses, but also absorbing premium AI capacity originally aligned with OpenAI. In Texas and Norway, projects tied to OpenAI’s infrastructure plans have shifted back into Microsoft’s orbit. Even after contractual changes gave OpenAI greater flexibility to source compute elsewhere, Microsoft remains the market’s most reliable backstop buyer for top-tier AI infrastructure. It no longer needs to control every OpenAI build to maintain its position. In 2026, Microsoft is still the company best positioned to turn uncertain AI demand into deployed capacity, e.g. concrete, steel, power, and silicon at scale. Building at Industrial Scale The clearest indicator of Microsoft’s intent is its capital spending. In its January 2026 earnings cycle, Reuters reported that Microsoft’s quarterly capital expenditures reached a record $37.5 billion, up nearly 66% year over year. The company’s cloud backlog rose to $625 billion, with roughly 45% of remaining performance obligations tied to OpenAI. About two-thirds of that quarterly capex was directed toward compute chips. To be clear: this is no speculative buildout. Microsoft is deploying capital against a massive, committed demand pipeline, even as it maintains significant exposure to OpenAI-driven workloads. 
The company is solving two infrastructure problems at once: supporting broad Azure and Copilot growth, while ensuring

Read More »

AI’s Execution Era: Aligned and Netrality on Power, Speed, and the New Data Center Reality

At Data Center World 2026, the industry didn’t need convincing that something fundamental has shifted. “This feels different,” said Bill Kleyman as he opened a keynote fireside with Phill Lawson-Shanks and Amber Caramella. “In the past 24 months, we’ve seen more evolution… than in the two decades before.” What followed was less a forecast than a field report from the front lines of the AI infrastructure buildout—where demand is immediate, power is decisive, and execution is everything. A Different Kind of Growth Cycle For Caramella, the shift starts with scale—and speed. “What feels fundamentally different is just the sheer pace and breadth of the demand combined with a real shift in architecture,” she said. Vacancy rates have collapsed even as capacity expands. AI workloads are not just additive—they are redefining absorption curves across the market. But the deeper change is behavioral. “Over 75% of people are using AI in their day-to-day business… and now the conversation is shifting to agentic AI,” Caramella noted. That shift—from tools to delegated workflows—points to a second wave of infrastructure demand that has not yet fully materialized. Lawson-Shanks framed the transformation in more structural terms. The industry, he said, has always followed a predictable chain: workload → software → hardware → facility → location. That chain has broken. “We had a very predictable industry… prior to Covid. And Covid changed everything,” he said, describing how hyperscale demand compressed deployment cycles overnight. What followed was a surge that utilities—and supply chains—were not prepared to meet. From Capacity to Constraint: Power Becomes Strategy If AI has a gating factor, it is no longer compute. It is power. “Before it used to be an operational convenience,” Caramella said. “Now it’s a strategic advantage—or constraint if you don’t have it.” That shift is reshaping executive decision-making. Power is no

Read More »

The Trillion-Dollar AIDC Boom Gets Real: Omdia Maps the Path From Megaclusters to Microgrids

The AI data center buildout is getting bigger, denser, and more electrically complex than even many bullish observers expected. That was the core message from Omdia’s Data Center World analyst summit, where Senior Director Vlad Galabov and Practice Lead Shen Wang laid out a view of the market that has grown more expansive in just the past year. What had been a large-scale infrastructure story is now, in Omdia’s telling, something closer to a full-stack industrial transition: hyperscalers are still leading, but enterprises, second-tier cloud providers, and new AI use cases are beginning to add demand on top of demand. Omdia’s updated forecast reflects that shift. Galabov said the firm has now raised its 2030 projection for data center investment beyond the $1.6 trillion figure it showed a year ago, arguing that surging AI usage, expanding buyer classes, and the emergence of new power infrastructure categories have all forced a rethink. “One of the reasons why we raised it is that people keep using more AI,” Galabov said. “And that just means more money, because we need to buy more GPUs to run the AI.” That is the simple version. The more consequential one is that AI is no longer behaving like a contained technology cycle. It is spilling outward into adjacent infrastructure markets, including batteries, gas-fired onsite generation, and high-voltage DC power architectures that until recently sat well outside the mainstream data center conversation. A Market Moving Faster Than the Forecasts Galabov opened by revisiting the predictions Omdia made last year for 2030. On several fronts, he said, the market is already validating them faster than expected. AI applications are becoming commonplace. AI has become the dominant driver of data center investment. Self-generation is no longer a fringe strategy. Even some of the rack-scale architecture concepts that once looked

Microsoft will invest $80B in AI data centers in fiscal 2025

And Microsoft isn’t the only one ramping up its investments in AI-enabled data centers. Rival cloud service providers are all investing in upgrading or opening new data centers to capture a larger chunk of business from developers and users of large language models (LLMs). In a report published in October 2024, Bloomberg Intelligence estimated that demand for generative AI would push Microsoft, AWS, Google, Oracle, Meta, and Apple to devote a combined $200 billion to capex in 2025, up from $110 billion in 2023.

Microsoft is one of the biggest spenders, followed closely by Google and AWS, Bloomberg Intelligence said. Its estimate of Microsoft’s capital spending on AI, at $62.4 billion for calendar 2025, is lower than Smith’s claim that the company will invest $80 billion in the fiscal year to June 30, 2025. Both figures, though, are far higher than Microsoft’s 2020 capital expenditure of “just” $17.6 billion. The majority of the increased spending is tied to cloud services and the expansion of AI infrastructure needed to provide compute capacity for OpenAI workloads.

Separately, last October Amazon CEO Andy Jassy said his company planned total capex of $75 billion in 2024 and even more in 2025, with much of it going to AWS, its cloud computing division.

John Deere unveils more autonomous farm machines to address skilled labor shortage

Self-driving tractors might be the path to self-driving cars. John Deere has revealed a new line of autonomous machines and tech across agriculture, construction and commercial landscaping. The Moline, Illinois-based John Deere has been in business for 187 years, yet it’s been a regular presence as a non-tech company showing off technology at the big tech trade show in Las Vegas, and it is back at CES 2025 with more autonomous tractors and other vehicles. This is not something we usually cover, but John Deere has a lot of data that is interesting in the big picture of tech.

The message from the company is that there aren’t enough skilled farm laborers to do the work that its customers need. It’s been a challenge for most of the last two decades, said Jahmy Hindman, CTO at John Deere, in a briefing. Much of the tech will come this fall and after that. He noted that the average farmer in the U.S. is over 58 and works 12 to 18 hours a day to grow food for us. And he said the American Farm Bureau Federation estimates there are roughly 2.4 million farm jobs that need to be filled annually, and the agricultural workforce continues to shrink. (This is my hint to the anti-immigration crowd.)

John Deere’s autonomous 9RX Tractor. Farmers can oversee it using an app.

While each of these industries experiences its own set of challenges, a commonality across all is skilled labor availability. In construction, about 80% of contractors struggle to find skilled labor. And in commercial landscaping, 86% of landscaping business owners can’t find labor to fill open positions, he said. “They have to figure out how to do

2025 playbook for enterprise AI success, from agents to evals

2025 is poised to be a pivotal year for enterprise AI. The past year has seen rapid innovation, and this year will see the same. This has made it more critical than ever to revisit your AI strategy to stay competitive and create value for your customers. From scaling AI agents to optimizing costs, here are the five critical areas enterprises should prioritize for their AI strategy this year.

1. Agents: the next generation of automation

AI agents are no longer theoretical. In 2025, they’re indispensable tools for enterprises looking to streamline operations and enhance customer interactions. Unlike traditional software, agents powered by large language models (LLMs) can make nuanced decisions, navigate complex multi-step tasks, and integrate seamlessly with tools and APIs. At the start of 2024, agents were not ready for prime time, making frustrating mistakes like hallucinating URLs. They started getting better as frontier large language models themselves improved. “Let me put it this way,” said Sam Witteveen, cofounder of Red Dragon, a company that develops agents for companies and that recently reviewed the 48 agents it built last year. “Interestingly, the ones that we built at the start of the year, a lot of those worked way better at the end of the year just because the models got better.” Witteveen shared this in the video podcast we filmed to discuss these five big trends in detail. Models are getting better and hallucinating less, and they’re also being trained to do agentic tasks. Another feature that the model providers are researching is a way to use the LLM as a judge, and as models get cheaper (something we’ll cover below), companies can use three or more models to

OpenAI’s red teaming innovations define new essentials for security leaders in the AI era

OpenAI has taken a more aggressive approach to red teaming than its AI competitors, demonstrating its security teams’ advanced capabilities in two areas: multi-step reinforcement and external red teaming. OpenAI recently released two papers that set a new competitive standard for improving the quality, reliability and safety of AI models in these two techniques and more.

The first paper, “OpenAI’s Approach to External Red Teaming for AI Models and Systems,” reports that specialized teams outside the company have proven effective in uncovering vulnerabilities that might otherwise have made it into a released model, because in-house testing techniques may have missed them. In the second paper, “Diverse and Effective Red Teaming with Auto-Generated Rewards and Multi-Step Reinforcement Learning,” OpenAI introduces an automated framework that relies on iterative reinforcement learning to generate a broad spectrum of novel, wide-ranging attacks.

Going all-in on red teaming pays practical, competitive dividends

It’s encouraging to see competitive intensity in red teaming growing among AI companies. When Anthropic released its AI red team guidelines in June of last year, it joined AI providers including Google, Microsoft, Nvidia, OpenAI, and even the U.S.’s National Institute of Standards and Technology (NIST), which had all released red teaming frameworks. Investing heavily in red teaming yields tangible benefits for security leaders in any organization. OpenAI’s paper on external red teaming provides a detailed analysis of how the company strives to create specialized external teams that include cybersecurity and subject matter experts. The goal is to see if knowledgeable external teams can defeat models’ security perimeters and find gaps in their security, biases and controls that prompt-based testing couldn’t find. What makes OpenAI’s recent papers noteworthy is how well they define using human-in-the-middle
