Autonomous Data Health Investigator
Airflow → Databricks → Claude Agent SDK loop
Draft 2026-05-20 Owner: Michał Filipów PM partner: Sid Dani

An Airflow-scheduled Databricks job computes data-quality statistics (z-scores, distribution checks). When a metric trips, instead of paging a human, the job invokes a Claude Agent SDK loop that writes and executes diagnostic code on the fly, interprets results, and posts a structured RCA to Slack or email.

Problem

Data-quality alerts today are dumb: "metric_x crossed threshold y." A human still has to log into Databricks, write a notebook, slice the data five ways, and figure out why. That's 20–60 minutes of toil per alert, fragmented across the team, and the investigation playbook lives in people's heads — not in code. The team wants the first 80% of every investigation to happen automatically, so the human picks up at "here's the likely cause, here's the evidence."

"I'd love for it to not just alert us, but actually investigate — write and execute code on the fly to dig into why the data looks bad, and then send a notification." — Michał Filipów, 2026-05-19

Goals & non-goals

Goals

  • Detect: Airflow + Databricks compute z-scores and distribution checks on a schedule
  • Investigate: when a metric trips, a Claude Agent SDK loop autonomously writes Python, executes against Databricks, and iterates
  • Summarize: agent produces a structured RCA (likely cause, evidence, confidence, recommended action)
  • Notify: post to Slack channel or email with the RCA + links to the queries it ran
  • Bounded: every run has hard caps (turns, $, time) and cannot mutate data

Non-goals (for v1)

  • Auto-remediation — agent does not write fixes back to pipelines
  • Custom UI — output is Slack/email + persisted artifacts
  • Cross-pipeline correlation — v1 investigates one anomaly at a time
  • Replacing the existing alerting system — this enriches alerts, doesn't replace them
  • Real-time / streaming — Airflow-scheduled is the trigger surface for v1

What the Agent SDK loop actually does

This is the part to lock in the whiteboard. The Claude Agent SDK gives you a harness with five primitives — system prompt, tools, the agent loop, session state, structured output. Here's how each one maps to Michał's use case.

┌─────────────────────────────────────────────────────────────────────┐ │ AIRFLOW DAG TASK (Databricks operator) │ │ ───────────────────────────────────── │ │ 1. Run scheduled metric checks (z-scores, distribution drift) │ │ 2. If anomaly detected → build context payload: │ │ { table, metric, z_score, window, expected, actual, │ │ lineage_summary, recent_dag_runs } │ │ 3. Spawn Claude Agent SDK subprocess (asyncio) │ └──────────────────────────┬──────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────────┐ │ CLAUDE AGENT SDK LOOP │ │ ───────────────────────── │ │ system_prompt: "You are a data-quality investigator. The anomaly │ │ is {payload}. Use the tools to form and test hypotheses." │ │ │ │ tools (the harness): │ │ • run_sql(query) → Databricks SQL Warehouse (read-only) │ │ • execute_python(code) → sandboxed code-exec for stats/plots │ │ • read_lineage(table) → Unity Catalog lineage API │ │ • post_finding(rca) → emits the final structured RCA │ │ │ │ LOOP (until done or cap hit): │ │ ┌─ turn N ────────────────────────────────────────────────────┐ │ │ │ 1. Claude reasons: "z-score on revenue spiked → check if │ │ │ │ one source_id dominates the change" │ │ │ │ 2. Calls run_sql("SELECT source_id, SUM(...) ...") │ │ │ │ 3. Sees result → calls execute_python to compute │ │ │ │ contribution % vs baseline │ │ │ │ 4. Forms new hypothesis OR concludes │ │ │ └──────────────────────────────────────────────────────────────┘ │ │ │ │ Caps: max_turns=20 · max_budget_usd=$2 · timeout=15min │ │ Exit when: agent calls post_finding(rca) OR cap hits │ └──────────────────────────┬──────────────────────────────────────────┘ │ ▼ Slack/email notification with structured RCA

Concretely — the five Agent SDK primitives mapped

SDK primitiveWhat you put in it
System prompt The investigator role + the anomaly payload + the "playbook" of how a senior data engineer would approach this kind of check. Keep it short; offload the playbook to skills (below).
Tools (@tool) Four custom tools. run_sql (Databricks SQL connector, read-only role), execute_python (sandboxed code interpreter — stats, plots, contribution analysis), read_lineage (Unity Catalog lineage so the agent knows upstream tables), post_finding (the final structured RCA emitter). The agent composes these — no math tool needed, it writes numpy/scipy in execute_python.
Agent loop Provided by the SDK. You set max_turns, max_budget_usd, permission_mode. The SDK handles tool dispatch, retries, and turn accounting. Loop exits when the agent calls post_finding or hits a cap.
Session state Session ID = {dag_id}:{run_id}:{task_id}. If Airflow re-runs the task, the SDK resumes the session instead of burning fresh budget. Persist the full transcript to a Databricks Volume per turn for post-mortem.
Structured output Pydantic model for the RCA: root_cause, evidence_queries[], confidence, suggested_action, escalate_to_human:bool. Validate; if invalid, fail the Airflow task — do not auto-retry a non-deterministic loop.
Direct answer to your math-tool question: no separate math tool. Math, stats, plots all happen inside execute_python. The agent writes scipy.stats.zscore(...) or pandas.groupby(...).contribution() — same as a human would in a notebook. Exposing calculate_zscore() as a discrete tool is overfitting; you'd be re-implementing what a code interpreter already does.

Skills — the playbook layer

"Skills" in Agent SDK = markdown files the agent loads when relevant, holding playbooks/conventions. For this use case, ship 3–4 skills v1:

Skills are how you turn "the senior engineer's mental model" into something the agent reliably applies — without bloating the system prompt.

Evals — non-negotiable

Without evals you don't know if the loop is doing useful work. Build a corpus of 10–20 historical anomalies (we have these in the Slack DQ-alerts channel) with known ground-truth root causes. Eval harness runs the agent against the anomaly payload without the known answer, then scores:

Re-run evals on every prompt or skill change. Gate prod rollout on 70%+ correctness on the corpus.

Success metrics

MetricTarget (v1)How measured
RCA correctness on eval corpus≥ 70%LLM-judge + manual spot-check on 20-anomaly corpus
Median time-to-RCA< 90 secondsAirflow task duration on real triggers
Cost per investigation< $0.50 median, $2 hard capAgent SDK cost telemetry → DuckDB (Sid's pipe)
Human override rate< 30% of fired alertsSlack reaction taxonomy: 👍 useful / 👎 wrong / 🤷 unclear
Loop divergence (3× repeated query)0 in prodCall-hash dedup guard fires → alert
Engineering hours saved / week≥ 4 hrs (team)Self-reported on weekly retro for 4 weeks

User stories

As a data engineer on-call, I want data-quality alerts to arrive with a likely root cause and the evidence queries that support it, so that I can confirm in 2 minutes instead of investigating from scratch.
As a pipeline owner, I want the investigation agent to be read-only with hard cost caps, so that a runaway loop can never mutate prod data or burn through budget.
As a data platform lead, I want every investigation persisted with full transcript and queries, so that we can audit decisions and continuously improve the agent's skills.
As a PM partner (Sid), I want to plug agent telemetry (turns, cost, latency) into the existing claude-code-telemetry DuckDB pipe, so that we see usage and ROI in one dashboard.

Scope — 4 milestones

01

Skeleton

Airflow DAG with mock anomaly → Agent SDK loop with execute_python only. No Databricks yet. Proves the harness works end-to-end in Cowork.

02

Databricks tools

Add run_sql via databricks-sql-connector (read-only role). Swap to managed MCP when AHD-3061 unblocks. Wire read_lineage.

03

Skills + evals

Ship 3–4 markdown skills. Build 10–20 anomaly eval corpus from past DQ alerts. Hit ≥70% correctness target. Wire telemetry into Sid's DuckDB pipe.

04

Pilot

Turn on for one non-critical pipeline. 2-week shadow run — agent posts to a private channel, humans grade outcomes. Promote to real alerting channel when override rate < 30%.

Open questions

Michał
Exactly which pipelines are in v1 scope? Need a list of 1–2 to pilot — ideally ones where DQ alerts are noisy and the team is already manually triaging.
Michał
What's the existing anomaly-detection logic — is it already in production Databricks code, or are we building both the detector and the investigator? Affects scope by a week.
Sid + Bob
Databricks Managed MCP via Cowork — AHD-3061 unblock timeline. If > 2 weeks out, ship v1 with the SQL-connector fallback and migrate later.
Sid
Wire investigation telemetry (turns, cost, model, outcome) into claude-code-telemetry DuckDB. One additional table or extend existing? Defer until milestone 03.
Michał + team
Notification channel taxonomy — one channel for all DQ investigations, or per-pipeline? Affects how we collect 👍/👎 grading signal.
Michał
What does "good RCA" look like to the team — is there a shape (1-line headline + 3 evidence bullets) the team prefers? Lock this in output-format.md skill before evals.
Sid
Open-source potential — is this generalizable enough to be a public reference repo (anonymized)? Champion-program demo candidate.