Data Safety with AI


Pingfan Hu & Dr. John Helveston
George Washington University

2026 Agentic Engineering Workshop

Recap

recap · parts 01 + 02
Speed + reproducibility  ≠  credibility.
01
Counter-productivity
the visible cost
02
The dilemma
no ground truth
03
A workflow
human as gatekeeper

Section 1
Counter-productivity of AI

Fast ≠ Better

the trap of speed
Fast
Better
Throughput up. Quality not guaranteed.

The reviewer’s question

"
Could you answer in a sentence — without re‑reading the AI's explanation?
— the reviewer's question
if yes
AI raised speed.
vs
if no
AI lowered quality.

Three failure modes

mode 01
Plausible code,
wrong assumptions
Code runs. Shape is right. Join, group, or filter is wrong.
mode 02
Confident summaries
of noisy results
Reads like an abstract. Confidence not calibrated to evidence.
mode 03
Velocity that
hides the error
Errors compound. By the time you notice, ten cells need re-tracing.
each one invisible at the moment it happens
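A minimal sketch of mode 01, using hypothetical survey data: the AI-drafted inner join runs cleanly and the output shape looks plausible, but a respondent missing from the second table is silently dropped from every downstream estimate.

```python
import pandas as pd

# Hypothetical survey: 4 respondents, one absent from the demographics table.
responses = pd.DataFrame({"id": [1, 2, 3, 4], "score": [10, 12, 9, 15]})
demo = pd.DataFrame({"id": [1, 2, 3], "age": [34, 51, 29]})

# Plausible AI draft: default inner join. Runs fine, shape looks right,
# but respondent 4 vanishes without any error or warning.
merged_inner = responses.merge(demo, on="id")             # 3 rows

# The defensible version keeps everyone and makes the gap visible.
merged_left = responses.merge(demo, on="id", how="left")  # 4 rows, age NaN for id 4

print(len(merged_inner), len(merged_left))  # 3 4
```

The code is correct in both cases; only the assumption about who should stay in the sample differs, which is exactly why a passing run proves nothing.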

The reasoning is the analysis

by hand
You think
Decide which to drop, which to transform, which test to run — and why.
decide justify remember
accepted
It produces
Code arrives. It runs. You no longer know why it is right.
accept ship forget
The reasoning is the analysis. The code is just its trace.

Where AI clearly helps

Boilerplate
SAFE · 01
Imports, themes, axis labels, file I/O.
library() theme_*
Translation
SAFE · 02
R↔Python, Stata→R, SQL dialects.
R → Py Stata → R
Documentation
SAFE · 03
Variable names, docstrings, READMEs.
naming README
Library lookup
SAFE · 04
"How do I do X in polars?"
syntax examples
rule of thumb  ·  delegate where the right answer is checkable in 5 seconds

Section 2
The dilemma in data science

Missing values — it depends why

a column has missing values
What should you do? It depends why.
at random
Sensor blip. Forgot to record.
✓ impute / drop ✕ treat as zero
systematic
Question skipped on purpose.
✓ encode as category ✕ mean impute
structurally undefined
"Spouse income" for unmarried.
✓ filter subgroup ✕ impute
unit changed
Mid-collection switch.
✓ split at change ✕ treat as one
AI's default move  ·  mean impute — wrong in 3 of 4
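A short sketch of the systematic case, on a hypothetical survey item skipped on purpose by respondents who never used the service: mean imputation invents opinions for people who had none, while an explicit category preserves the information in the skip.

```python
import numpy as np
import pandas as pd

# Hypothetical item: satisfaction score, structurally skipped by non-users.
df = pd.DataFrame({"satisfaction": [4.0, 5.0, np.nan, 3.0, np.nan]})

# AI's default move: mean impute. Fabricates answers for the skippers.
mean_imputed = df["satisfaction"].fillna(df["satisfaction"].mean())

# A defensible alternative for systematic missingness: an explicit category.
labeled = df["satisfaction"].astype("object").fillna("did_not_use")

print(mean_imputed.tolist())  # [4.0, 5.0, 4.0, 3.0, 4.0]
print(labeled.tolist())       # [4.0, 5.0, 'did_not_use', 3.0, 'did_not_use']
```

Both lines run without error; only the reasoning about *why* the values are missing distinguishes them.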

Every interesting decision

missing values is just one example
?
Outliers
Drop them — or are they the phenomenon?
?
Model assumptions
Violation: fatal — or tolerable for this question?
?
Collinearity
Drop the covariate — or keep for theory?
?
p = 0.04
An effect — or evidence you ran enough tests?
Two competent analysts can disagree — and both be doing real science.
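The p = 0.04 question has a back-of-envelope answer. Assuming k independent tests with no real effect anywhere, the chance of at least one "significant" result at α = 0.05 grows quickly:

```python
# If no real effect exists, P(at least one p < alpha across k tests)
# is 1 - (1 - alpha)^k for independent tests.
alpha = 0.05
for k in (1, 5, 20):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:>2} tests -> P(at least one 'significant') = {p_any:.2f}")
# At 20 tests that probability is 0.64 -- a lone p = 0.04 may say more
# about how many tests you ran than about the world.
```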

Why a harness loop fails here

an automated loop, no ground truth
01
Generator
picks plausible default
02
Reviewer
agrees — same training data
03
Verifier
checks code runs, not the decision
04
Output ships
silently wrong
why it fails
No verifier with ground truth  →  reviewer is just another generator  →  two LLMs agreeing is not evidence.

The credibility question

"
Reproducible code is not the same as a defensible analysis.
— the credibility gap
Your output is a claim about the world. Your job is to defend it.

Section 3
A successful workflow

Human as gatekeeper

the principle
AI assists
drafts code, mechanics
YOU
decide · sign off
Defensible
claim about the world
Decide which steps a human must own. Then enforce that boundary.

Two non-negotiables

non-negotiable 01
You own every
data decision
What to keep, drop, transform, impute, model, exclude — and why.
drop impute model exclude
non-negotiable 02
You write the
methods section
The audit trail of your analysis. The one piece you should never delegate.
choices why defense
everything else can flex

Delegation map

Boilerplate & syntax
delegate
checkable in seconds
First draft of transformation code
delegate, then read
reason every line yourself
Handling missing values
keep
no ground truth — your call
Picking a model specification
keep
theory + judgment
Interpreting model output
keep
this is the analysis
Drafting the methods section
keep
how you defend the work
Reformatting the methods section
delegate
mechanics, not decisions

Pin the decisions in code

analysis.py
# Drop respondents who skipped the income question.
# Structural skip (only employed respondents see it),
# not a refusal — see codebook §4.2.
df = df[df["employed"] == 1]
defensible later
Reviewers six months from now can read your reasoning.
guardrail for AI
Stops the next session from "helpfully" rewriting it.
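One way to enforce the pinned decision, sketched on hypothetical data with the same column names as `analysis.py`: put an assertion right next to the filter, so a later session that rewrites it (or a data file that changes shape) fails loudly instead of shipping silently.

```python
import pandas as pd

# Hypothetical data standing in for the survey in analysis.py.
df = pd.DataFrame({"employed": [1, 1, 0, 1],
                   "income": [50_000, 62_000, None, 48_000]})

n_before = len(df)
# Pinned decision: structural skip, not a refusal -- see codebook §4.2.
df = df[df["employed"] == 1]

# Guardrail: if the filter is rewritten or the data changes unexpectedly,
# this stops the pipeline instead of producing a silently wrong result.
assert len(df) < n_before, "expected the structural-skip filter to drop rows"
assert df["income"].notna().all(), "employed respondents should report income"
```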

Verify against an independent check

three independent checks
check 01
A published
summary stat
Codebook, prior paper, official table.
check 02
A back-of-envelope
calculation
Done by hand. No model, no AI.
check 03
A second analyst
(not your session)
Different chat, different prompt, different model.
Disagreement  →  the AI-assisted pipeline is wrong until you can explain the gap.
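Checks 01 and 02 can live in the pipeline itself. A sketch, with both numbers as hypothetical placeholders: compare your estimate against the published figure, with a tolerance taken from the source's own stated precision.

```python
# Check 01 as code: pipeline estimate vs. a published summary stat.
# Both values below are hypothetical placeholders.
published_mean_income = 54_200  # from the codebook / official table
pipeline_mean_income = 53_950   # produced by the AI-assisted pipeline

tolerance = 0.02  # 2%, set from the published source's stated precision
gap = abs(pipeline_mean_income - published_mean_income) / published_mean_income
assert gap < tolerance, f"pipeline disagrees with published stat by {gap:.1%}"
```

If the assertion fires, the slide's rule applies: the pipeline is wrong until you can explain the gap.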

Two issues data science makes worse

issue 01
Reproducibility
AI sessions are not deterministic. Different versions, different prompts, different output.
# helper drafted with claude-sonnet-4-6, 2026-05-12
issue 02
Disclosure
Journals, conferences, funders ask. The honest answer is almost always yes.
"Code drafts were generated with [model] and reviewed and modified by the authors before use."
one comment, one sentence  ·  protects future-you

Verify, then trust

"
Plausible is not the same as correct.
— the data-science gap
old
"Trust, but verify."
new
"Verify, then trust."

Three takeaways

Counter-productivity
Speed without supervision = throughput without quality.
The dilemma
No ground truth → two LLMs agreeing is not evidence.
A workflow
You own data decisions. You write the methods section.

Thank you

"
AI is a powerful collaborator. It is not a competent analyst.
The difference is who signs the paper.
— closing thought
thank you
2026 Agentic Engineering Workshop  ·  The George Washington University