# Monitoring & Observability
Use this guide for runtime health, scheduled-task visibility, and alerting. This is the canonical monitoring document; the former root-level `MONITORING.md` and `ERROR_ALERTS.md` documents are absorbed here.
## Canonical Paths
The default configured paths are:
- legacy state file artifact: `orchestrator_state.json` may still appear in older setups, but it is not the current default runtime target
- logs directory: `logs/`
- digest output: `logs/digests/`
- runtime state target: `./orchestrator/data/orchestrator-state.json`
If local configuration overrides these paths, defer to `orchestrator_config.json`.
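To confirm the active paths from a shell, a minimal sketch, assuming `orchestrator_config.json` is valid JSON at the repo root with top-level keys named `stateFile` and `logsDir` (the key names are assumptions, not confirmed schema):

```bash
# Key names below are assumed, not confirmed against the real config schema.
jq '{stateFile, logsDir}' orchestrator_config.json
```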
## Live Logs
```bash
tail -f logs/orchestrator.log
grep "heartbeat" logs/orchestrator.log | tail -10
grep "error\|ERROR" logs/orchestrator.log
```

## State Checks
The orchestrator persists its current runtime state to the configured `stateFile` target. In the default repo-native setup, that is a local JSON file under `./orchestrator/data/`.
```bash
curl -fsS -H "Authorization: Bearer $API_KEY" http://127.0.0.1:3312/api/dashboard/overview | jq
curl -fsS -H "Authorization: Bearer $API_KEY" http://127.0.0.1:3312/api/tasks/catalog | jq '.tasks | length'
curl -fsS -H "Authorization: Bearer $API_KEY" http://127.0.0.1:3312/api/health/extended | jq '.truthLayer'
```

Useful fields to watch (pulled together in the sketch after this list):
- `lastStartedAt`
- `taskHistory`
- `redditResponses`
- `rssDrafts`
- `driftRepairs`
- `deployedAgents`
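A one-shot pull of those fields from the overview payload, assuming they all sit at the top level of the response (their exact nesting is not confirmed here):

```bash
curl -fsS -H "Authorization: Bearer $API_KEY" http://127.0.0.1:3312/api/dashboard/overview \
  | jq '{lastStartedAt, taskHistory, redditResponses, rssDrafts, driftRepairs, deployedAgents}'
```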
## Heartbeat Health
The orchestrator enqueues an internal maintenance heartbeat every 5 minutes. It is part of scheduled control-plane upkeep, not a normal public trigger path.
```bash
curl -fsS -H "Authorization: Bearer $API_KEY" http://127.0.0.1:3312/api/dashboard/overview | jq '.health.lastHeartbeatAt'
```

If the latest heartbeat is stale, treat it as a runtime health warning and check process liveness immediately.
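A minimal staleness check you could wire into cron; a sketch assuming `lastHeartbeatAt` is an ISO-8601 timestamp and GNU `date` is available (both are assumptions), with 15 minutes as the threshold per the failure patterns below:

```bash
#!/usr/bin/env bash
# Exit non-zero if the last heartbeat is older than 15 minutes.
last=$(curl -fsS -H "Authorization: Bearer $API_KEY" \
  http://127.0.0.1:3312/api/dashboard/overview | jq -r '.health.lastHeartbeatAt')
age=$(( $(date +%s) - $(date -d "$last" +%s) ))
if [ "$age" -gt 900 ]; then
  echo "heartbeat stale: ${age}s since $last" >&2
  exit 1
fi
```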
## Scheduled Task Monitoring
The default recurring tasks are:
- `nightly-batch`
- `send-digest`
- `heartbeat`
Watch the relevant events:
```bash
grep -E "nightly-batch|send-digest|heartbeat" logs/orchestrator.log | tail -20
curl -fsS -H "Authorization: Bearer $API_KEY" http://127.0.0.1:3312/api/dashboard/overview | jq '.recentTasks[:10]'
curl -fsS -H "Authorization: Bearer $API_KEY" "http://127.0.0.1:3312/api/tasks/runs?includeInternal=true&limit=10&type=heartbeat" | jq '.runs'
ls -lah logs/digests/digest-*.json
```

When `nightly-batch` runs, verify the following (a scripted pass is sketched after the list):
- a digest file was created in `logs/digests/`
- the task appears in `taskHistory`
- the next `send-digest` task completed or logged a clear failure
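A minimal post-run pass; a sketch assuming digest filenames include the run date and that `taskHistory` entries carry a `type` field (both assumptions):

```bash
#!/usr/bin/env bash
# Warn if last night's batch left no expected traces.
today=$(date +%F)
ls logs/digests/digest-*"$today"*.json >/dev/null 2>&1 \
  || echo "WARN: no digest file for $today" >&2
curl -fsS -H "Authorization: Bearer $API_KEY" http://127.0.0.1:3312/api/dashboard/overview \
  | jq -e '.taskHistory[]? | select(.type=="nightly-batch")' >/dev/null \
  || echo "WARN: nightly-batch not found in taskHistory" >&2
```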
## Task And Agent Visibility
```bash
curl -fsS -H "Authorization: Bearer $API_KEY" http://127.0.0.1:3312/api/dashboard/overview | jq '.recentTasks[] | select(.status=="error")'
curl -fsS -H "Authorization: Bearer $API_KEY" http://127.0.0.1:3312/api/agents/overview | jq '.agents[] | {id, lifecycleMode, hostServiceStatus}'
```

This gives you recent failures, agent-heavy task flows, and the current deployment memory tracked by the runtime.
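For a rough live view during an incident, a polling sketch (the 60-second cadence is arbitrary; the `select(.status=="error")` filter comes from the command above):

```bash
# Re-poll the dashboard for error tasks once a minute; Ctrl-C to stop.
while sleep 60; do
  curl -fsS -H "Authorization: Bearer $API_KEY" http://127.0.0.1:3312/api/dashboard/overview \
    | jq -c '.recentTasks[] | select(.status=="error")'
done
```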
## GitHub Push Monitoring
The orchestrator can watch the latest GitHub Actions result for the checked-out repository, so a failed push shows up in System Health even if you never open GitHub manually.
What it needs:
- a working `gh` CLI in the orchestrator runtime environment
- `gh auth status` already authenticated for the target repository
- the default repo remote, or an explicit override through `GITHUB_ACTIONS_MONITOR_REPO`
What it exposes:
- `dependencies.github` in `GET /api/health/extended`
- a runtime note on the System Health page when the latest workflow is failed, still running, or otherwise warning
- an observed-truth signal and incident when the latest workflow conclusion is failed
Quick check:
```bash
gh auth status
curl -fsS -H "Authorization: Bearer $API_KEY" http://127.0.0.1:3312/api/health/extended | jq '.dependencies.github'
```

When the latest workflow fails, treat the pushed repo state as degraded until the workflow is green again.
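To cross-check the orchestrator's view against GitHub directly, the same `gh` CLI can report the latest run itself (`status`, `conclusion`, and `workflowName` are standard `gh run list` JSON fields):

```bash
# Latest workflow run straight from GitHub, independent of the orchestrator.
gh run list --limit 1 --json status,conclusion,workflowName
```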
## Service Lifecycle Visibility
The current runtime distinguishes worker-first agents from service-expected agents. Do not infer an agent's lifecycle mode from `src/service.ts` alone.
Check the operator surfaces first:
```bash
curl -fsS -H "Authorization: Bearer $API_KEY" http://127.0.0.1:3312/api/agents/overview | jq '.agents[] | {id, lifecycleMode, hostServiceStatus, serviceUnitName, serviceInstalled, serviceRunning}'
curl -fsS -H "Authorization: Bearer $API_KEY" http://127.0.0.1:3312/api/health/extended | jq '.workers'
```

What to look for:
- `lifecycleMode=="service-expected"` means host unit coverage is part of the runtime contract (a filter sketch follows this list)
- `hostServiceStatus` tells you whether the unit is running, installed but stopped, not installed, probe-unavailable, or not applicable
- `workers.serviceExpectedGapCount` should stay visible during host/service troubleshooting
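A quick filter for agents that expect a host unit but do not currently have one running; a sketch assuming `serviceRunning` is a boolean on each agent entry:

```bash
curl -fsS -H "Authorization: Bearer $API_KEY" http://127.0.0.1:3312/api/agents/overview \
  | jq '.agents[] | select(.lifecycleMode=="service-expected" and .serviceRunning!=true) | {id, hostServiceStatus}'
```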
On Linux hosts, confirm the unit state directly:
```bash
systemctl show doc-specialist.service reddit-helper.service --property=Id,LoadState,ActiveState,SubState,UnitFileState --no-pager
```

## Alerts
The orchestrator supports built-in alerting for failure accumulation and critical runtime problems.
Common environment variables:
```bash
export ALERTS_ENABLED=true
export ALERT_SEVERITY_THRESHOLD=error
export SLACK_ERROR_WEBHOOK=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
export ALERT_EMAIL_TO=ops@example.com
export EMAIL_API_URL=https://your-email-service/send
export EMAIL_API_KEY=your-api-key
```
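Before trusting the Slack channel during an incident, you can smoke-test the webhook itself with Slack's standard incoming-webhook payload (this exercises the webhook, not the orchestrator's notifier):

```bash
# Post a test message directly to the configured webhook.
curl -fsS -X POST -H 'Content-type: application/json' \
  --data '{"text":"orchestrator alert channel test"}' \
  "$SLACK_ERROR_WEBHOOK"
```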
Alert behavior to expect:

- repeated task failures escalate in severity
- missed heartbeat windows should be treated as critical
- notification delivery failures should still appear in logs even when the external channel fails
## Quick Health Pass
```bash
ps aux | grep "node\|tsx" | grep -v grep
ls -la logs/
curl -fsS http://127.0.0.1:3312/health | jq
curl -fsS -H "Authorization: Bearer $API_KEY" http://127.0.0.1:3312/api/dashboard/overview | jq '.health.lastHeartbeatAt'
```
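The same pass as one script that prints a PASS/FAIL line per check; a convenience sketch, not shipped tooling:

```bash
#!/usr/bin/env bash
# One-line verdict per health check.
check() { "$@" >/dev/null 2>&1 && echo "PASS: $*" || echo "FAIL: $*"; }
check pgrep -f "node|tsx"
check test -d logs
check curl -fsS http://127.0.0.1:3312/health
check curl -fsS -H "Authorization: Bearer $API_KEY" http://127.0.0.1:3312/api/dashboard/overview
```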
## Common Failure Patterns

- No heartbeat for more than 10-15 minutes: check whether the orchestrator process is down or hung.
- Missing digest file after `nightly-batch`: check `logs/orchestrator.log` and `/api/dashboard/overview` for batch errors.
- Notification expected but nothing arrived: verify webhook/email configuration and look for notifier errors in the log.
- State or log growth looks abnormal: inspect the configured `stateFile`, queue-related arrays, and artifact retention.
## Escalation Rule
When runtime health looks wrong:
- Check process liveness.
- Check the latest heartbeat and `/api/runtime/facts` (see the check below).
- Check the most recent failing task record.
- Inspect notifier errors if alerts did not arrive.
- Use Common Issues and Debugging for deeper recovery.
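A direct pull of the runtime-facts endpoint referenced above (the endpoint path comes from this guide; no payload shape is assumed):

```bash
curl -fsS -H "Authorization: Bearer $API_KEY" http://127.0.0.1:3312/api/runtime/facts | jq
```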