Spec — Autonomous Data Health Investigator (Airflow × Databricks × Claude Agent SDK)

Autonomous Data Health Investigator

Airflow → Databricks → Claude Agent SDK loop

Draft 2026-05-20 Owner: Michał Filipów PM partner: Sid Dani

An Airflow-scheduled Databricks job computes data-quality statistics (z-scores, distribution checks). When a metric trips, instead of paging a human, the job invokes a Claude Agent SDK loop that writes and executes diagnostic code on the fly, interprets results, and posts a structured RCA to Slack or email.

Problem

Data-quality alerts today are dumb: "metric_x crossed threshold y." A human still has to log into Databricks, write a notebook, slice the data five ways, and figure out why. That's 20–60 minutes of toil per alert, fragmented across the team, and the investigation playbook lives in people's heads — not in code. The team wants the first 80% of every investigation to happen automatically, so the human picks up at "here's the likely cause, here's the evidence."

"I'd love for it to not just alert us, but actually investigate — write and execute code on the fly to dig into why the data looks bad, and then send a notification." — Michał Filipów, 2026-05-19

Goals & non-goals

Goals

Detect: Airflow + Databricks compute z-scores and distribution checks on a schedule
Investigate: when a metric trips, a Claude Agent SDK loop autonomously writes Python, executes against Databricks, and iterates
Summarize: agent produces a structured RCA (likely cause, evidence, confidence, recommended action)
Notify: post to Slack channel or email with the RCA + links to the queries it ran
Bounded: every run has hard caps (turns, $, time) and cannot mutate data

Non-goals (for v1)

Auto-remediation — agent does not write fixes back to pipelines
Custom UI — output is Slack/email + persisted artifacts
Cross-pipeline correlation — v1 investigates one anomaly at a time
Replacing the existing alerting system — this enriches alerts, doesn't replace them
Real-time / streaming — Airflow-scheduled is the trigger surface for v1

What the Agent SDK loop actually does

This is the part to lock in the whiteboard. The Claude Agent SDK gives you a harness with five primitives — system prompt, tools, the agent loop, session state, structured output. Here's how each one maps to Michał's use case.

┌─────────────────────────────────────────────────────────────────────┐ │ AIRFLOW DAG TASK (Databricks operator) │ │ ───────────────────────────────────── │ │ 1. Run scheduled metric checks (z-scores, distribution drift) │ │ 2. If anomaly detected → build context payload: │ │ { table, metric, z_score, window, expected, actual, │ │ lineage_summary, recent_dag_runs } │ │ 3. Spawn Claude Agent SDK subprocess (asyncio) │ └──────────────────────────┬──────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────────┐ │ CLAUDE AGENT SDK LOOP │ │ ───────────────────────── │ │ system_prompt: "You are a data-quality investigator. The anomaly │ │ is {payload}. Use the tools to form and test hypotheses." │ │ │ │ tools (the harness): │ │ • run_sql(query) → Databricks SQL Warehouse (read-only) │ │ • execute_python(code) → sandboxed code-exec for stats/plots │ │ • read_lineage(table) → Unity Catalog lineage API │ │ • post_finding(rca) → emits the final structured RCA │ │ │ │ LOOP (until done or cap hit): │ │ ┌─ turn N ────────────────────────────────────────────────────┐ │ │ │ 1. Claude reasons: "z-score on revenue spiked → check if │ │ │ │ one source_id dominates the change" │ │ │ │ 2. Calls run_sql("SELECT source_id, SUM(...) ...") │ │ │ │ 3. Sees result → calls execute_python to compute │ │ │ │ contribution % vs baseline │ │ │ │ 4. Forms new hypothesis OR concludes │ │ │ └──────────────────────────────────────────────────────────────┘ │ │ │ │ Caps: max_turns=20 · max_budget_usd=$2 · timeout=15min │ │ Exit when: agent calls post_finding(rca) OR cap hits │ └──────────────────────────┬──────────────────────────────────────────┘ │ ▼ Slack/email notification with structured RCA

Concretely — the five Agent SDK primitives mapped

SDK primitive	What you put in it
System prompt	The investigator role + the anomaly payload + the "playbook" of how a senior data engineer would approach this kind of check. Keep it short; offload the playbook to skills (below).
Tools (`@tool`)	Four custom tools. `run_sql` (Databricks SQL connector, read-only role), `execute_python` (sandboxed code interpreter — stats, plots, contribution analysis), `read_lineage` (Unity Catalog lineage so the agent knows upstream tables), `post_finding` (the final structured RCA emitter). The agent composes these — no math tool needed, it writes numpy/scipy in `execute_python`.
Agent loop	Provided by the SDK. You set `max_turns`, `max_budget_usd`, `permission_mode`. The SDK handles tool dispatch, retries, and turn accounting. Loop exits when the agent calls `post_finding` or hits a cap.
Session state	Session ID = `{dag_id}:{run_id}:{task_id}`. If Airflow re-runs the task, the SDK resumes the session instead of burning fresh budget. Persist the full transcript to a Databricks Volume per turn for post-mortem.
Structured output	Pydantic model for the RCA: `root_cause`, `evidence_queries[]`, `confidence`, `suggested_action`, `escalate_to_human:bool`. Validate; if invalid, fail the Airflow task — do not auto-retry a non-deterministic loop.

Direct answer to your math-tool question: no separate math tool. Math, stats, plots all happen inside execute_python. The agent writes scipy.stats.zscore(...) or pandas.groupby(...).contribution() — same as a human would in a notebook. Exposing calculate_zscore() as a discrete tool is overfitting; you'd be re-implementing what a code interpreter already does.

Skills — the playbook layer

"Skills" in Agent SDK = markdown files the agent loads when relevant, holding playbooks/conventions. For this use case, ship 3–4 skills v1:

zscore-investigation.md — "when z-score on a daily aggregate spikes, here are the five cuts to try first: source_id, market, device_type, day_of_week, recent schema changes."
distribution-drift.md — "when KS-test fails, compare percentile shapes; check for upstream backfill; check for new categorical values."
databricks-sql-conventions.md — table naming, partition columns, common joins. Saves wasted turns.
output-format.md — the exact RCA Pydantic schema with examples. Keeps the final turn cheap.

Skills are how you turn "the senior engineer's mental model" into something the agent reliably applies — without bloating the system prompt.

Evals — non-negotiable

Without evals you don't know if the loop is doing useful work. Build a corpus of 10–20 historical anomalies (we have these in the Slack DQ-alerts channel) with known ground-truth root causes. Eval harness runs the agent against the anomaly payload without the known answer, then scores:

Correctness: did the agent identify the right root cause? (LLM-as-judge against ground truth, plus exact-match on the dominant dimension)
Efficiency: turns used, $ spent, queries issued — distribution, not average
Safety: any attempted writes, any unbounded scans, any infinite-loop patterns
Quality of evidence: are the cited queries reproducible and actually load-bearing?

Re-run evals on every prompt or skill change. Gate prod rollout on 70%+ correctness on the corpus.

Success metrics

Metric	Target (v1)	How measured
RCA correctness on eval corpus	≥ 70%	LLM-judge + manual spot-check on 20-anomaly corpus
Median time-to-RCA	< 90 seconds	Airflow task duration on real triggers
Cost per investigation	< $0.50 median, $2 hard cap	Agent SDK cost telemetry → DuckDB (Sid's pipe)
Human override rate	< 30% of fired alerts	Slack reaction taxonomy: 👍 useful / 👎 wrong / 🤷 unclear
Loop divergence (3× repeated query)	0 in prod	Call-hash dedup guard fires → alert
Engineering hours saved / week	≥ 4 hrs (team)	Self-reported on weekly retro for 4 weeks

User stories

As a data engineer on-call, I want data-quality alerts to arrive with a likely root cause and the evidence queries that support it, so that I can confirm in 2 minutes instead of investigating from scratch.

As a pipeline owner, I want the investigation agent to be read-only with hard cost caps, so that a runaway loop can never mutate prod data or burn through budget.

As a data platform lead, I want every investigation persisted with full transcript and queries, so that we can audit decisions and continuously improve the agent's skills.

As a PM partner (Sid), I want to plug agent telemetry (turns, cost, latency) into the existing claude-code-telemetry DuckDB pipe, so that we see usage and ROI in one dashboard.

Scope — 4 milestones

Skeleton

Airflow DAG with mock anomaly → Agent SDK loop with execute_python only. No Databricks yet. Proves the harness works end-to-end in Cowork.

Databricks tools

Add run_sql via databricks-sql-connector (read-only role). Swap to managed MCP when AHD-3061 unblocks. Wire read_lineage.

Skills + evals

Ship 3–4 markdown skills. Build 10–20 anomaly eval corpus from past DQ alerts. Hit ≥70% correctness target. Wire telemetry into Sid's DuckDB pipe.

Pilot

Turn on for one non-critical pipeline. 2-week shadow run — agent posts to a private channel, humans grade outcomes. Promote to real alerting channel when override rate < 30%.

Open questions

Michał

Exactly which pipelines are in v1 scope? Need a list of 1–2 to pilot — ideally ones where DQ alerts are noisy and the team is already manually triaging.

Michał

What's the existing anomaly-detection logic — is it already in production Databricks code, or are we building both the detector and the investigator? Affects scope by a week.

Sid + Bob

Databricks Managed MCP via Cowork — AHD-3061 unblock timeline. If > 2 weeks out, ship v1 with the SQL-connector fallback and migrate later.

Sid

Wire investigation telemetry (turns, cost, model, outcome) into claude-code-telemetry DuckDB. One additional table or extend existing? Defer until milestone 03.

Michał + team

Notification channel taxonomy — one channel for all DQ investigations, or per-pipeline? Affects how we collect 👍/👎 grading signal.

Michał

What does "good RCA" look like to the team — is there a shape (1-line headline + 3 evidence bullets) the team prefers? Lock this in output-format.md skill before evals.

Sid

Open-source potential — is this generalizable enough to be a public reference repo (anonymized)? Champion-program demo candidate.

Draft · 2026-05-20 · beru-workspace/3-Resources/meeting-prep/2026-05-20-michal-data-health-spec.html
Built from Michał's 2026-05-19 message. Replaces the earlier 2026-05-20-michal-agent-loop.html which was framed before his actual ask was known.