Draft2026-05-20Owner: Michał FilipówPM partner: Sid Dani
An Airflow-scheduled Databricks job computes data-quality statistics (z-scores, distribution checks). When a metric trips, instead of paging a human, the job invokes a Claude Agent SDK loop that writes and executes diagnostic code on the fly, interprets results, and posts a structured RCA to Slack or email.
Problem
Data-quality alerts today are dumb: "metric_x crossed threshold y." A human still has to log into Databricks, write a notebook, slice the data five ways, and figure out why. That's 20–60 minutes of toil per alert, fragmented across the team, and the investigation playbook lives in people's heads — not in code. The team wants the first 80% of every investigation to happen automatically, so the human picks up at "here's the likely cause, here's the evidence."
"I'd love for it to not just alert us, but actually investigate — write and execute code on the fly to dig into why the data looks bad, and then send a notification."
— Michał Filipów, 2026-05-19
Goals & non-goals
Goals
Detect: Airflow + Databricks compute z-scores and distribution checks on a schedule
Investigate: when a metric trips, a Claude Agent SDK loop autonomously writes Python, executes against Databricks, and iterates
Notify: post to Slack channel or email with the RCA + links to the queries it ran
Bounded: every run has hard caps (turns, $, time) and cannot mutate data
Non-goals (for v1)
Auto-remediation — agent does not write fixes back to pipelines
Custom UI — output is Slack/email + persisted artifacts
Cross-pipeline correlation — v1 investigates one anomaly at a time
Replacing the existing alerting system — this enriches alerts, doesn't replace them
Real-time / streaming — Airflow-scheduled is the trigger surface for v1
What the Agent SDK loop actually does
This is the part to lock in the whiteboard. The Claude Agent SDK gives you a harness with five primitives — system prompt, tools, the agent loop, session state, structured output. Here's how each one maps to Michał's use case.
┌─────────────────────────────────────────────────────────────────────┐
│ AIRFLOW DAG TASK (Databricks operator) │
│ ───────────────────────────────────── │
│ 1. Run scheduled metric checks (z-scores, distribution drift) │
│ 2. If anomaly detected → build context payload: │
│ { table, metric, z_score, window, expected, actual, │
│ lineage_summary, recent_dag_runs } │
│ 3. Spawn Claude Agent SDK subprocess (asyncio) │
└──────────────────────────┬──────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ CLAUDE AGENT SDK LOOP │
│ ───────────────────────── │
│ system_prompt: "You are a data-quality investigator. The anomaly │
│ is {payload}. Use the tools to form and test hypotheses." │
│ │
│ tools (the harness): │
│ • run_sql(query) → Databricks SQL Warehouse (read-only) │
│ • execute_python(code) → sandboxed code-exec for stats/plots │
│ • read_lineage(table) → Unity Catalog lineage API │
│ • post_finding(rca) → emits the final structured RCA │
│ │
│ LOOP (until done or cap hit): │
│ ┌─ turn N ────────────────────────────────────────────────────┐ │
│ │ 1. Claude reasons: "z-score on revenue spiked → check if │ │
│ │ one source_id dominates the change" │ │
│ │ 2. Calls run_sql("SELECT source_id, SUM(...) ...") │ │
│ │ 3. Sees result → calls execute_python to compute │ │
│ │ contribution % vs baseline │ │
│ │ 4. Forms new hypothesis OR concludes │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ Caps: max_turns=20 · max_budget_usd=$2 · timeout=15min │
│ Exit when: agent calls post_finding(rca) OR cap hits │
└──────────────────────────┬──────────────────────────────────────────┘
│
▼
Slack/email notification with structured RCA
Concretely — the five Agent SDK primitives mapped
SDK primitive
What you put in it
System prompt
The investigator role + the anomaly payload + the "playbook" of how a senior data engineer would approach this kind of check. Keep it short; offload the playbook to skills (below).
Tools (@tool)
Four custom tools. run_sql (Databricks SQL connector, read-only role), execute_python (sandboxed code interpreter — stats, plots, contribution analysis), read_lineage (Unity Catalog lineage so the agent knows upstream tables), post_finding (the final structured RCA emitter). The agent composes these — no math tool needed, it writes numpy/scipy in execute_python.
Agent loop
Provided by the SDK. You set max_turns, max_budget_usd, permission_mode. The SDK handles tool dispatch, retries, and turn accounting. Loop exits when the agent calls post_finding or hits a cap.
Session state
Session ID = {dag_id}:{run_id}:{task_id}. If Airflow re-runs the task, the SDK resumes the session instead of burning fresh budget. Persist the full transcript to a Databricks Volume per turn for post-mortem.
Structured output
Pydantic model for the RCA: root_cause, evidence_queries[], confidence, suggested_action, escalate_to_human:bool. Validate; if invalid, fail the Airflow task — do not auto-retry a non-deterministic loop.
Direct answer to your math-tool question: no separate math tool. Math, stats, plots all happen inside execute_python. The agent writes scipy.stats.zscore(...) or pandas.groupby(...).contribution() — same as a human would in a notebook. Exposing calculate_zscore() as a discrete tool is overfitting; you'd be re-implementing what a code interpreter already does.
Skills — the playbook layer
"Skills" in Agent SDK = markdown files the agent loads when relevant, holding playbooks/conventions. For this use case, ship 3–4 skills v1:
zscore-investigation.md — "when z-score on a daily aggregate spikes, here are the five cuts to try first: source_id, market, device_type, day_of_week, recent schema changes."
distribution-drift.md — "when KS-test fails, compare percentile shapes; check for upstream backfill; check for new categorical values."
output-format.md — the exact RCA Pydantic schema with examples. Keeps the final turn cheap.
Skills are how you turn "the senior engineer's mental model" into something the agent reliably applies — without bloating the system prompt.
Evals — non-negotiable
Without evals you don't know if the loop is doing useful work. Build a corpus of 10–20 historical anomalies (we have these in the Slack DQ-alerts channel) with known ground-truth root causes. Eval harness runs the agent against the anomaly payload without the known answer, then scores:
Correctness: did the agent identify the right root cause? (LLM-as-judge against ground truth, plus exact-match on the dominant dimension)
Efficiency: turns used, $ spent, queries issued — distribution, not average
Safety: any attempted writes, any unbounded scans, any infinite-loop patterns
Quality of evidence: are the cited queries reproducible and actually load-bearing?
Re-run evals on every prompt or skill change. Gate prod rollout on 70%+ correctness on the corpus.
Success metrics
Metric
Target (v1)
How measured
RCA correctness on eval corpus
≥ 70%
LLM-judge + manual spot-check on 20-anomaly corpus
As a data engineer on-call, I want data-quality alerts to arrive with a likely root cause and the evidence queries that support it, so that I can confirm in 2 minutes instead of investigating from scratch.
As a pipeline owner, I want the investigation agent to be read-only with hard cost caps, so that a runaway loop can never mutate prod data or burn through budget.
As a data platform lead, I want every investigation persisted with full transcript and queries, so that we can audit decisions and continuously improve the agent's skills.
As a PM partner (Sid), I want to plug agent telemetry (turns, cost, latency) into the existing claude-code-telemetry DuckDB pipe, so that we see usage and ROI in one dashboard.
Scope — 4 milestones
01
Skeleton
Airflow DAG with mock anomaly → Agent SDK loop with execute_python only. No Databricks yet. Proves the harness works end-to-end in Cowork.
02
Databricks tools
Add run_sql via databricks-sql-connector (read-only role). Swap to managed MCP when AHD-3061 unblocks. Wire read_lineage.
03
Skills + evals
Ship 3–4 markdown skills. Build 10–20 anomaly eval corpus from past DQ alerts. Hit ≥70% correctness target. Wire telemetry into Sid's DuckDB pipe.
04
Pilot
Turn on for one non-critical pipeline. 2-week shadow run — agent posts to a private channel, humans grade outcomes. Promote to real alerting channel when override rate < 30%.
Open questions
Michał
Exactly which pipelines are in v1 scope? Need a list of 1–2 to pilot — ideally ones where DQ alerts are noisy and the team is already manually triaging.
Michał
What's the existing anomaly-detection logic — is it already in production Databricks code, or are we building both the detector and the investigator? Affects scope by a week.
Sid + Bob
Databricks Managed MCP via Cowork — AHD-3061 unblock timeline. If > 2 weeks out, ship v1 with the SQL-connector fallback and migrate later.
Sid
Wire investigation telemetry (turns, cost, model, outcome) into claude-code-telemetry DuckDB. One additional table or extend existing? Defer until milestone 03.
Michał + team
Notification channel taxonomy — one channel for all DQ investigations, or per-pipeline? Affects how we collect 👍/👎 grading signal.
Michał
What does "good RCA" look like to the team — is there a shape (1-line headline + 3 evidence bullets) the team prefers? Lock this in output-format.md skill before evals.
Sid
Open-source potential — is this generalizable enough to be a public reference repo (anonymized)? Champion-program demo candidate.