Ductile Health Check Procedure¶
Operational procedure for reviewing the day-to-day health of a running Ductile instance. Useful as a daily status review or as the first thing to run when investigating flaky behaviour.
Assumes a systemd-managed deployment. Adjust service name, binary path, and config path to match your instance.
Step 1 — Service & binary sanity¶
systemctl --user is-active <service-name>
systemctl --user status <service-name> --no-pager | sed -n '1,8p'
ductile version
ductile system status
Expected: active. Record the version, commit, and uptime — these let you correlate
errors against deploys later.
Known quirk: while the daemon is running, `ductile system status` reports DEGRADED
with `pid_lock: FAIL (pid <N>)`. The reported PID is that of the running daemon. This is
expected when calling `system status` against a live instance and is not a real failure.
Verify that the PID matches the systemd Main PID and move on.
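The PID comparison can be scripted. The following is a minimal sketch, assuming the status output contains the `pid_lock: FAIL (pid <N>)` text quoted above and that your systemd supports `systemctl --user show -p MainPID --value`. The function names are hypothetical helpers, not part of Ductile.

```python
import re
import subprocess
from typing import Optional

# Pattern based on the quirk described above; the exact output format is an assumption.
PID_LOCK_RE = re.compile(r"pid_lock:\s*FAIL\s*\(pid\s+(\d+)\)")


def pid_from_status(status_text: str) -> Optional[int]:
    """Return the PID named in a pid_lock FAIL line, or None if there is none."""
    match = PID_LOCK_RE.search(status_text)
    return int(match.group(1)) if match else None


def systemd_main_pid(service: str) -> int:
    """Ask systemd for the unit's Main PID (0 means the unit is not running)."""
    out = subprocess.run(
        ["systemctl", "--user", "show", "-p", "MainPID", "--value", service],
        check=True, capture_output=True, text=True,
    ).stdout
    return int(out.strip() or 0)


def is_expected_quirk(status_text: str, main_pid: int) -> bool:
    """True when the reported lock holder is the daemon systemd is supervising."""
    pid = pid_from_status(status_text)
    return pid is not None and main_pid > 0 and pid == main_pid
```

Pass the captured output of `ductile system status` together with `systemd_main_pid("<service-name>")`. A `False` result means the lock holder is some other process and deserves a closer look.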
Step 2 — Recent errors¶
Scan a 24h window for error-level log entries, and separately scan everything since the current binary was started, so that pre-deploy and post-deploy issues are distinguishable.
# 24h error scan
journalctl --user -u <service-name> --since "24 hours ago" --no-pager \
| grep -i -E '"level":"ERROR"|panic|FATAL'
# Since service restart (use the "Active: since" timestamp from `systemctl status`)
journalctl --user -u <service-name> --since "<HH:MM:SS>" --no-pager \
| grep -i -E '"level":"ERROR"|panic|FATAL'
Common patterns and their meaning:
| Pattern | Meaning |
|---|---|
| `baggage path "ductile.route_depth" is immutable` | Routing/baggage propagation issue. The plugin itself may have succeeded; failure is in routed-context creation downstream. |
| `plugin fingerprint check failed (strict mode)` | A plugin entrypoint was edited without `ductile config lock`. Review the change, then re-lock. |
| `failed to create event context for pipeline entry` | Usually a symptom of a baggage/routing bug; investigate the underlying cause rather than the symptom. |
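To see at a glance which of these patterns dominates a scan, the matched lines can be tallied. This is a rough sketch: the substrings are copied from the patterns above, while the short labels and the `tally_patterns` name are hypothetical. It also assumes that quotes inside JSON log lines may be backslash-escaped.

```python
from collections import Counter
from typing import Iterable

# Substrings from the common patterns above; the labels are hypothetical shorthand.
KNOWN_PATTERNS = {
    'baggage path "ductile.route_depth" is immutable': "baggage-immutable",
    "plugin fingerprint check failed (strict mode)": "fingerprint",
    "failed to create event context for pipeline entry": "event-context",
}


def tally_patterns(lines: Iterable[str]) -> Counter:
    """Count log lines per known pattern; anything unmatched is counted as 'other'."""
    counts: Counter = Counter()
    for line in lines:
        # JSON-formatted log lines escape inner quotes, so undo that before matching.
        text = line.replace('\\"', '"')
        for needle, label in KNOWN_PATTERNS.items():
            if needle in text:
                counts[label] += 1
                break
        else:
            counts["other"] += 1
    return counts
```

Pipe the output of either scan into a small script that calls `tally_patterns(sys.stdin)`. A large `other` count suggests an error that is not yet covered by the patterns listed here.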
Step 3 — 24h job stats per scheduled plugin¶
Enumerate scheduled plugins from the live config:
Then collect per-plugin 24h counts:
FROM=$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)
for plugin in <plugin-a> <plugin-b> <plugin-c>; do
  ductile job logs --from "$FROM" --plugin "$plugin" --limit 200 --json 2>/dev/null \
    | python3 -c "
import sys,json
d=json.load(sys.stdin)
# logs holds at most --limit entries; total is the full count for the window
logs=d.get('logs') or []
total=d.get('total',0)
succ=sum(1 for j in logs if j.get('Status')=='succeeded')
fail=sum(1 for j in logs if j.get('Status')=='failed')
print(f'{\"$plugin\":<22} total={total:<4} in_window={len(logs)} succ={succ:<4} fail={fail}')
"
done
Also query any event-driven plugins you care about; they may show 0 in the window, which is fine if no upstream event triggered them.
CLI gotchas:
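- Per-job JSON field names are capitalized: `Status`, `Plugin`, `CreatedAt`, `LastError`,
`Stderr`, `Result`. The top-level `logs` and `total` keys are lowercase.
- `--limit` is capped at 200. When total > in_window, you are seeing only the most recent 200
entries, so the succ and fail counts cover only those entries; `total` is still the true 24h count.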
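The truncation caveat is easy to miss when reading the counts by eye. The sketch below assumes the JSON shape used by the script in this step (a top-level `logs` list and `total` count, with a capitalized `Status` per job). The `summarize` helper and its output keys are hypothetical.

```python
from typing import Any, Dict


def summarize(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Summarise one plugin's job-log payload and flag a truncated window."""
    logs = payload.get("logs") or []  # tolerate a missing or null list
    total = payload.get("total", 0)
    succeeded = sum(1 for job in logs if job.get("Status") == "succeeded")
    failed = sum(1 for job in logs if job.get("Status") == "failed")
    return {
        "total": total,
        "in_window": len(logs),
        "succeeded": succeeded,
        "failed": failed,
        # True when the returned entries are only part of the 24h total.
        "truncated": total > len(logs),
        # Rate over the returned entries only, not over the full total.
        "failure_rate": failed / len(logs) if logs else None,
    }
```

When `truncated` is true, treat `failure_rate` as a sample of the most recent entries rather than a figure for the whole 24h window.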
Step 4 — Investigate failures¶
For any plugin showing failures, pull full details, including `Result`, `LastError`, and
`Stderr`, by adding `--include-result`:
FROM=$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)
ductile job logs --from "$FROM" --plugin <name> --limit 200 --include-result --json 2>/dev/null \
| python3 -c "
import sys,json
d=json.load(sys.stdin)
# 'or []' guards against a null logs field, as in Step 3
for j in d.get('logs') or []:
    if j.get('Status')=='failed':
        print(f\" {j.get('CreatedAt')} cmd={j.get('Command')} attempt={j.get('Attempt')}\")
        for k in ('LastError','Stderr'):
            v=j.get(k) or ''
            if v: print(f' {k}:', v[:300])
        res=j.get('Result')
        if res: print(' Result:', json.dumps(res)[:300])
"
Watch for cases where Result.status == "ok" but LastError is set: the plugin itself
succeeded, and the failure is downstream in Ductile's routing/context layer. Those are
core bugs, not plugin bugs, and usually need to be matched against recent upstream commits.
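That check can be applied mechanically to each failed job. The sketch below is illustrative only: it assumes `Result` is either a JSON object or a JSON-encoded string with a `status` key, and the `failure_origin` name and its return labels are hypothetical.

```python
import json
from typing import Any, Dict


def failure_origin(job: Dict[str, Any]) -> str:
    """Return 'not-failed', 'core-suspect', or 'other' for one job entry."""
    if job.get("Status") != "failed":
        return "not-failed"
    result = job.get("Result")
    if isinstance(result, str):
        # Assumption: Result may arrive as a JSON-encoded string.
        try:
            result = json.loads(result)
        except ValueError:
            result = None
    plugin_ok = isinstance(result, dict) and result.get("status") == "ok"
    # Plugin reported success but the job still carries an error: look downstream.
    if plugin_ok and job.get("LastError"):
        return "core-suspect"
    return "other"
```

Jobs labelled `core-suspect` are the ones to compare against recent upstream commits. Everything labelled `other` still needs the usual plugin-level investigation.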
For job lineage (baggage, attempts, routing context) across a pipeline:
Step 5 — Deploy correlation¶
If errors cluster before a timestamp and stop after it, confirm a deploy/restart explains it rather than transient recovery:
# find service restarts
journalctl --user -u <service-name> --since "24 hours ago" --no-pager \
| grep -E 'Started|ductile running'
# binary age
ls -la "$(command -v ductile)"
Match against `git log` in the ductile source tree between the old and new `commit:` values
(from `ductile version`) to identify which commits fixed which errors.
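Whether errors really stop at a restart can be checked by partitioning their timestamps around it. This is a minimal sketch, assuming timezone-qualified timestamps with whole seconds in the same form as the `FROM` value built in Step 3. The helper names are hypothetical, and the actual `CreatedAt` format should be confirmed before relying on it.

```python
from datetime import datetime
from typing import Iterable, List, Tuple


def parse_ts(value: str) -> datetime:
    """Parse a timestamp such as 2024-05-01T10:00:00Z into an aware datetime."""
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11, so normalise it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def split_by_restart(error_times: Iterable[str], restart: str) -> Tuple[List[str], List[str]]:
    """Split error timestamps into (before restart, at or after restart)."""
    cutoff = parse_ts(restart)
    before: List[str] = []
    after: List[str] = []
    for ts in error_times:
        (before if parse_ts(ts) < cutoff else after).append(ts)
    return before, after
```

An empty second list is consistent with the restart having fixed the problem. Any entries in it mean the errors continued under the new binary.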
Step 6 — Verdict¶
Summarise as:
- Service state: active/degraded (ignoring the `pid_lock` quirk), binary version, uptime.
- 24h job totals: overall success rate, per-plugin failure counts.
- Failures: root cause(s), whether already patched in the running binary, whether operator action is needed.
- Post-restart window: clean or not (most important signal for "is it healthy now?").
Target: the post-restart window has zero errors. Pre-restart errors with a matching fix already in the running binary are history, not present-day problems.
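If the verdict is recorded by a script rather than by hand, the target above reduces to a simple predicate. The following sketch assumes the figures have already been collected in the earlier steps. The `HealthVerdict` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthVerdict:
    service_active: bool       # Step 1, ignoring the pid_lock quirk
    version: str               # from `ductile version`
    total_jobs: int            # 24h total across plugins
    failed_jobs: int           # 24h failures across plugins
    post_restart_errors: int   # errors since the current binary started

    @property
    def success_rate(self) -> Optional[float]:
        """Share of jobs that did not fail, or None when no jobs ran."""
        if self.total_jobs == 0:
            return None
        return (self.total_jobs - self.failed_jobs) / self.total_jobs

    @property
    def healthy(self) -> bool:
        """Healthy means active now with a clean post-restart window."""
        return self.service_active and self.post_restart_errors == 0

    def summary(self) -> str:
        state = "healthy" if self.healthy else "needs attention"
        rate = "n/a" if self.success_rate is None else f"{self.success_rate:.1%}"
        return (f"{state}: version={self.version} jobs={self.total_jobs} "
                f"failed={self.failed_jobs} success_rate={rate} "
                f"post_restart_errors={self.post_restart_errors}")
```

Pre-restart failures lower `success_rate` but do not affect `healthy`, which mirrors the distinction between history and present-day problems.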