Data Safety with AI
Pingfan Hu
&
Dr. John Helveston
George Washington University
2026 Agentic Engineering Workshop
Recap
recap · parts 01 + 02
Speed
+
reproducibility
≠
credibility
.
01
Counter-productivity
the visible cost
02
The dilemma
no ground truth
03
A workflow
human as gatekeeper
Section 1
Counter-productivity of AI
Fast
≠
Better
the trap of speed
Fast
≠
Better
Throughput up.
Quality not guaranteed.
The reviewer’s question
"
Could you answer it
in a sentence
— without re‑reading the AI's explanation?
— the reviewer's question
if yes
AI raised
speed
.
vs
if no
AI lowered
quality
.
Three failure modes
mode 01
Plausible code,
wrong assumptions
Code runs. Shape is right.
Join, group, or filter is wrong.
mode 02
Confident summaries
of noisy results
Reads like an abstract.
Confidence not calibrated to evidence.
mode 03
Velocity that
hides the error
Errors compound.
By the time you notice, ten cells need re-tracing.
each one invisible at the moment it happens
The
reasoning
is the analysis
by hand
You think
Decide which to drop, which to transform, which test to run — and
why
.
decide
justify
remember
accepted
It produces
Code arrives. It runs.
You no longer know why it is right.
accept
ship
forget
The reasoning
is
the analysis.
The code is just its trace.
Where AI
clearly
helps
Boilerplate
SAFE · 01
Imports, themes, axis labels, file I/O.
library()
theme_*
Translation
SAFE · 02
R↔Python, Stata→R, SQL dialects.
R → Py
Stata → R
Documentation
SAFE · 03
Variable names, docstrings, READMEs.
naming
README
Library lookup
SAFE · 04
"How do I do X in
polars
?"
syntax
examples
rule of thumb ·
delegate where the right answer is checkable in
5 seconds
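For translation tasks (SAFE · 02), a 5-second check can be one hand-computable input run through the AI-drafted code. The helper below is a hypothetical example of such a translated function, not code from this deck:

```python
# Hypothetical AI-translated helper (say, ported from a Stata ado-file).
# Before trusting it on real data, run one input you can verify by hand.
def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# 5-second check: (1*1 + 3*3) / (1 + 3) = 10 / 4 = 2.5
assert weighted_mean([1, 3], [1, 3]) == 2.5
```

If the hand check fails, the translation goes back; if it passes, you delegated mechanics without delegating judgment.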
Section 2
The dilemma in data science
Missing values —
it depends why
a column has missing values
What should you do?
It depends why.
at random
Sensor blip. Forgot to record.
✓ impute / drop
✕ as-zero
systematic
Question skipped on purpose.
✓ encode as category
✕ mean impute
structurally undefined
"Spouse income" for unmarried.
✓ filter subgroup
✕ impute
unit changed
Mid-collection switch.
✓ split at change
✕ treat as one
AI's default move ·
mean impute —
wrong in 3 of 4
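The four cases above could be sketched in pandas; every column name, value, and flag below is illustrative, not from any real dataset:

```python
import pandas as pd

# Illustrative survey rows; all names and values are made up.
df = pd.DataFrame({
    "reading":       [4.1, None, 3.9],        # at random: sensor blip
    "q7":            ["yes", None, "no"],     # systematic: skipped on purpose
    "married":       [0, 1, 0],
    "spouse_income": [None, 52_000.0, None],  # undefined for the unmarried
})

# At random → impute (or drop) is defensible.
df["reading"] = df["reading"].fillna(df["reading"].mean())

# Systematic → keep the skip as its own category; never mean-impute.
df["q7"] = df["q7"].fillna("skipped")

# Structurally undefined → analyze only the subgroup where it exists.
spouse = df.loc[df["married"] == 1, "spouse_income"]
```

The point is not the syntax: each branch encodes a different answer to "why is it missing?", and only the first one matches AI's default move.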
Every interesting decision
missing values is just one example
?
Outliers
Drop them — or are they the phenomenon?
?
Model assumptions
Violation: fatal — or tolerable for this question?
?
Collinearity
Drop the covariate — or keep for theory?
?
p = 0.04
An effect — or evidence you ran enough tests?
Two competent analysts can
disagree
— and both be doing real science.
Why a
harness loop
fails here
an automated loop, no ground truth
01
Generator
picks plausible default
→
02
Reviewer
agrees — same training data
→
03
Verifier
checks code runs, not the decision
→
04
Output ships
silently wrong
why it fails
No verifier with ground truth → reviewer is just another generator →
two LLMs agreeing is not evidence
.
The credibility question
"
Reproducible code
is not the same as a
defensible analysis
.
— the credibility gap
Your output is a
claim about the world
. Your job is to
defend it
.
Section 3
A successful workflow
Human as
gatekeeper
the principle
AI assists
drafts code, mechanics
→
YOU
decide · sign off
→
Defensible
claim about the world
Decide which steps a human
must
own. Then enforce that boundary.
Two non-negotiables
non-negotiable 01
You own every
data decision
What to keep, drop, transform, impute, model, exclude — and
why
.
drop
impute
model
exclude
non-negotiable 02
You write the
methods section
The audit trail of your analysis. The one piece you should
never
delegate.
choices
why
defense
everything else can flex
Delegation map
Boilerplate & syntax · delegate · checkable in seconds
First draft of transformation code · delegate, then read · reason every line yourself
Handling missing values · keep · no ground truth — your call
Picking a model specification · keep · theory + judgment
Interpreting model output · keep · this is the analysis
Drafting the methods section · keep · how you defend the work
Reformatting the methods section · delegate · mechanics, not decisions
Pin the decisions in code
analysis.py
# Drop respondents who skipped the income question.
# Structural skip (only employed respondents see it),
# not a refusal — see codebook §4.2.
df = df[df["employed"] == 1]
defensible later
Reviewers six months from now can read your reasoning.
guardrail for AI
Stops the next session from "helpfully" rewriting it.
Verify against an
independent
check
three independent checks
check 01
A published
summary stat
Codebook, prior paper, official table.
check 02
A back-of-envelope
calculation
Done by hand. No model, no AI.
check 03
A second analyst
(not your session)
Different chat, different prompt, different model.
Disagreement → the
AI-assisted pipeline is wrong
until you can explain the gap.
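Check 01 can even live inside the pipeline as an assertion. The codebook figure, tolerance, and pipeline value below are all hypothetical stand-ins:

```python
# Compare a pipeline estimate against an independently published figure.
PUBLISHED_MEAN_INCOME = 41_200   # hypothetical value from a codebook table
TOLERANCE = 0.02                 # 2% relative gap, illustrative

pipeline_mean = 41_050           # stand-in for the statistic your code computed

gap = abs(pipeline_mean - PUBLISHED_MEAN_INCOME) / PUBLISHED_MEAN_INCOME
assert gap <= TOLERANCE, (
    f"pipeline mean is {gap:.1%} off the codebook value: "
    "treat the pipeline as wrong until you can explain the gap"
)
```

A failing assertion stops the run, which makes the "wrong until explained" rule mechanical rather than a matter of memory.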
Two issues data science makes worse
issue 01
Reproducibility
AI sessions are
not deterministic
. Different versions, different prompts, different output.
# helper drafted with claude-sonnet-4-6, 2026-05-12
issue 02
Disclosure
Journals, conferences, funders ask. The honest answer is almost always
yes
.
"Code drafts were generated with [model] and reviewed and modified by the authors before use."
one comment, one sentence · protects future-you
Verify
, then trust
"
Plausible
is not the same as
correct
.
— the data-science gap
old
"
Trust
, but verify."
→
new
"
Verify
, then trust."
Three takeaways
✓
Counter-productivity
Speed without supervision =
throughput without quality
.
✓
The dilemma
No ground truth →
two LLMs agreeing is not evidence
.
✓
A workflow
You own data decisions.
You write the methods section.
Thank you
"
AI is a powerful
collaborator
. It is not a competent
analyst
.
The difference is
who signs the paper
.
— closing thought
thank you
Pingfan Hu
·
Dr. John Helveston
2026 Agentic Engineering Workshop · The George Washington University