Data Safety with AI


Pingfan Hu & Dr. John Helveston
George Washington University

2026 Agentic Engineering Workshop

Recap

recap · parts 01 + 02
Speed + reproducibility  ≠  credibility.
01
Counter-productivity
the visible cost
02
The dilemma
no ground truth
03
A workflow
human as gatekeeper

Section 1
Counter-productivity of AI

Fast ≠ Better

the trap of speed
Fast
Better
Throughput up. Quality not guaranteed.

The reviewer’s question

"
Could you answer in a sentence — without re‑reading the AI's explanation?
— the reviewer's question
if yes
AI raised speed.
vs
if no
AI lowered quality.

Three failure modes

mode 01
Plausible code,
wrong assumptions
Code runs. Shape is right. Join, group, or filter is wrong.
mode 02
Confident summaries
of noisy results
Reads like an abstract. Confidence not calibrated to evidence.
mode 03
Velocity that
hides the error
Errors compound. By the time you notice, ten cells need re-tracing.
each one invisible at the moment it happens
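A minimal sketch of mode 01, using hypothetical survey data: the AI-drafted inner join runs cleanly and the output shape looks plausible, but a respondent missing from the second table is silently dropped from every downstream estimate.

```python
import pandas as pd

# Hypothetical survey: 4 respondents, one absent from the demographics table.
responses = pd.DataFrame({"id": [1, 2, 3, 4], "score": [10, 12, 9, 15]})
demo = pd.DataFrame({"id": [1, 2, 3], "age": [34, 51, 29]})

# Plausible AI draft: default inner join. Runs fine, shape looks right,
# but respondent 4 vanishes without any error or warning.
merged_inner = responses.merge(demo, on="id")             # 3 rows

# The defensible version keeps everyone and makes the gap visible.
merged_left = responses.merge(demo, on="id", how="left")  # 4 rows, age NaN for id 4

print(len(merged_inner), len(merged_left))  # 3 4
```

The code is correct in both cases; only the assumption about who should stay in the sample differs, which is exactly why a passing run proves nothing.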

The reasoning is the analysis

by hand
You think
Decide which to drop, which to transform, which test to run — and why.
decide justify remember
accepted
It produces
Code arrives. It runs. You no longer know why it is right.
accept ship forget
The reasoning is the analysis. The code is just its trace.

Where AI clearly helps

Boilerplate
SAFE · 01
Imports, themes, axis labels, file I/O.
library() theme_*
Translation
SAFE · 02
R↔Python, Stata→R, SQL dialects.
R → Py Stata → R
Documentation
SAFE · 03
Variable names, docstrings, READMEs.
naming README
Library lookup
SAFE · 04
"How do I do X in polars?"
syntax examples
rule of thumb  ·  delegate where the right answer is checkable in 5 seconds

Section 2
The dilemma in data science

Missing values — it depends why

a column has missing values
What should you do? It depends why.
at random
Sensor blip. Forgot to record.
✓ impute / drop ✕ treat as zero
systematic
Question skipped on purpose.
✓ encode as category ✕ mean impute
structurally undefined
"Spouse income" for unmarried.
✓ filter subgroup ✕ impute
unit changed
Mid-collection switch.
✓ split at change ✕ treat as one
AI's default move  ·  mean impute — wrong in 3 of 4
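A short sketch of the systematic case, on a hypothetical survey item skipped on purpose by respondents who never used the service: mean imputation invents opinions for people who had none, while an explicit category preserves the information in the skip.

```python
import numpy as np
import pandas as pd

# Hypothetical item: satisfaction score, structurally skipped by non-users.
df = pd.DataFrame({"satisfaction": [4.0, 5.0, np.nan, 3.0, np.nan]})

# AI's default move: mean impute. Fabricates answers for the skippers.
mean_imputed = df["satisfaction"].fillna(df["satisfaction"].mean())

# A defensible alternative for systematic missingness: an explicit category.
labeled = df["satisfaction"].astype("object").fillna("did_not_use")

print(mean_imputed.tolist())  # [4.0, 5.0, 4.0, 3.0, 4.0]
print(labeled.tolist())       # [4.0, 5.0, 'did_not_use', 3.0, 'did_not_use']
```

Both lines run without error; only the reasoning about *why* the values are missing distinguishes them.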

Every interesting decision

missing values is just one example
?
Outliers
Drop them — or are they the phenomenon?
?
Model assumptions
Violation: fatal — or tolerable for this question?
?
Collinearity
Drop the covariate — or keep for theory?
?
p = 0.04
An effect — or evidence you ran enough tests?
Two competent analysts can disagree — and both be doing real science.
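The p = 0.04 question has a back-of-envelope answer. Assuming k independent tests with no real effect anywhere, the chance of at least one "significant" result at α = 0.05 grows quickly:

```python
# If no real effect exists, P(at least one p < alpha across k tests)
# is 1 - (1 - alpha)^k for independent tests.
alpha = 0.05
for k in (1, 5, 20):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:>2} tests -> P(at least one 'significant') = {p_any:.2f}")
# At 20 tests that probability is 0.64 -- a lone p = 0.04 may say more
# about how many tests you ran than about the world.
```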

Why a harness loop fails here

an automated loop, no ground truth
01
Generator
picks plausible default
02
Reviewer
agrees — same training data
03
Verifier
checks code runs, not the decision
04
Output ships
silently wrong
why it fails
No verifier with ground truth  →  reviewer is just another generator  →  two LLMs agreeing is not evidence.

The credibility question

"
Reproducible code is not the same as a defensible analysis.
— the credibility gap
Your output is a claim about the world. Your job is to defend it.

Section 3
A successful workflow

Human as gatekeeper

the principle
AI assists
drafts code, mechanics
YOU
decide · sign off
Defensible
claim about the world
Decide which steps a human must own. Then enforce that boundary.

Two non-negotiables

non-negotiable 01
You own every
data decision
What to keep, drop, transform, impute, model, exclude — and why.
drop impute model exclude
non-negotiable 02
You write the
methods section
The audit trail of your analysis. The one piece you should never delegate.
choices why defense
everything else can flex

Delegation map

Boilerplate & syntax
delegate
checkable in seconds
First draft of transformation code
delegate, then read
reason every line yourself
Handling missing values
keep
no ground truth — your call
Picking a model specification
keep
theory + judgment
Interpreting model output
keep
this is the analysis
Drafting the methods section
keep
how you defend the work
Reformatting the methods section
delegate
mechanics, not decisions

Pin the decisions in code

analysis.py
# Drop respondents who skipped the income question.
# Structural skip (only employed respondents see it),
# not a refusal — see codebook §4.2.
df = df[df["employed"] == 1]
defensible later
Reviewers six months from now can read your reasoning.
guardrail for AI
Stops the next session from "helpfully" rewriting it.
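One way to enforce the pinned decision, sketched on hypothetical data with the same column names as `analysis.py`: put an assertion right next to the filter, so a later session that rewrites it (or a data file that changes shape) fails loudly instead of shipping silently.

```python
import pandas as pd

# Hypothetical data standing in for the survey in analysis.py.
df = pd.DataFrame({"employed": [1, 1, 0, 1],
                   "income": [50_000, 62_000, None, 48_000]})

n_before = len(df)
# Pinned decision: structural skip, not a refusal -- see codebook §4.2.
df = df[df["employed"] == 1]

# Guardrail: if the filter is rewritten or the data changes unexpectedly,
# this stops the pipeline instead of producing a silently wrong result.
assert len(df) < n_before, "expected the structural-skip filter to drop rows"
assert df["income"].notna().all(), "employed respondents should report income"
```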

Verify against an independent check

three independent checks
check 01
A published
summary stat
Codebook, prior paper, official table.
check 02
A back-of-envelope
calculation
Done by hand. No model, no AI.
check 03
A second analyst
(not your session)
Different chat, different prompt, different model.
Disagreement  →  the AI-assisted pipeline is wrong until you can explain the gap.
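Checks 01 and 02 can live in the pipeline itself. A sketch, with both numbers as hypothetical placeholders: compare your estimate against the published figure, with a tolerance taken from the source's own stated precision.

```python
# Check 01 as code: pipeline estimate vs. a published summary stat.
# Both values below are hypothetical placeholders.
published_mean_income = 54_200  # from the codebook / official table
pipeline_mean_income = 53_950   # produced by the AI-assisted pipeline

tolerance = 0.02  # 2%, set from the published source's stated precision
gap = abs(pipeline_mean_income - published_mean_income) / published_mean_income
assert gap < tolerance, f"pipeline disagrees with published stat by {gap:.1%}"
```

If the assertion fires, the slide's rule applies: the pipeline is wrong until you can explain the gap.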

Two issues data science makes worse

issue 01
Reproducibility
AI sessions are not deterministic. Different versions, different prompts, different output.
# helper drafted with claude-sonnet-4-6, 2026-05-12
issue 02
Disclosure
Journals, conferences, funders ask. The honest answer is almost always yes.
"Code drafts were generated with [model] and reviewed and modified by the authors before use."
one comment, one sentence  ·  protects future-you

Verify, then trust

"
Plausible is not the same as correct.
— the data-science gap
old
"Trust, but verify."
new
"Verify, then trust."

Three takeaways

Counter-productivity
Speed without supervision = throughput without quality.
The dilemma
No ground truth → two LLMs agreeing is not evidence.
A workflow
You own data decisions. You write the methods section.

Thank you

"
AI is a powerful collaborator. It is not a competent analyst.
The difference is who signs the paper.
— closing thought
thank you
2026 Agentic Engineering Workshop  ·  The George Washington University