Jackson Shuey
ongoing · May 2026 · Mind

EHRzipper

A schema-reconciliation engine for oncology data. Many incompatible EHR sources in, one audit-ready real-world-evidence dataset out — modeled on Flatiron Health's advanced-NSCLC cohort.

Stack: Python · Pydantic · Snowflake · Anthropic · Streamlit · FHIR · LOINC / RxNorm · Kaplan-Meier
Links: github

The problem

The same clinical fact arrives three different ways: a lab value comes as FHIR JSON from one hospital, as an HL7v2 message from another, and as a CSV column with an ad hoc name from a third. To compute a cohort or a survival curve, every one has to land in a single canonical field, and you have to prove how it got there. EHRzipper is that reconciliation layer, built on an advanced-NSCLC cohort modeled on Flatiron's published methodology.

Reconcile — three tiers, full audit

Each incoming column is routed through three tiers, stopping at the first that fires: a deterministic lookup for known LOINC / RxNorm / ICD-10 codes (no model call), then Claude Haiku for the ambiguous ones, then append-as-new. Every decision — which tier, which canonical target, why — lands in an append-only log; corrections add a row, nothing is overwritten. Running the three synthetic sources through it produced 74 traceable decisions in Snowflake.
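
The routing described above can be sketched in a few lines. This is a minimal illustration and not the project's actual code: `KNOWN_CODES`, `Decision`, `reconcile`, and the `ask_model` callback are hypothetical names, and the model tier is stubbed as a plain function so the example runs without any API call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical lookup table: (code system, code) -> canonical field name.
KNOWN_CODES = {
    ("LOINC", "718-7"): "hemoglobin",
    ("ICD-10", "C34.90"): "lung_cancer_diagnosis",
}


@dataclass(frozen=True)
class Decision:
    """One immutable audit row: which tier fired, the target, and why."""
    source: str
    column: str
    tier: str      # "lookup", "model", or "append_new"
    target: str
    reason: str


def reconcile(
    log: List[Decision],
    source: str,
    column: str,
    code: Optional[Tuple[str, str]] = None,
    ask_model: Optional[Callable[[str], Optional[str]]] = None,
) -> Decision:
    """Route one column through the tiers, stopping at the first that fires."""
    if code in KNOWN_CODES:
        # Tier 1: deterministic lookup, no model call.
        decision = Decision(source, column, "lookup", KNOWN_CODES[code],
                            f"exact match on {code[0]} {code[1]}")
    else:
        # Tier 2: ask the (stubbed) model; None means it has no suggestion.
        guess = ask_model(column) if ask_model else None
        if guess is not None:
            decision = Decision(source, column, "model", guess,
                                "model suggestion")
        else:
            # Tier 3: nothing matched, so the column becomes a new field.
            decision = Decision(source, column, "append_new", column,
                                "no match; added as new canonical field")
    log.append(decision)  # append-only: rows are added, never rewritten
    return decision
```

Under this sketch a correction would be a further `Decision` appended for the same column, so the latest row wins while the earlier ones stay readable.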

Store — four layers in Snowflake

Data flows raw landing → per-source staging → canonical core → analytic marts, each layer rebuildable from the one below. The store is live in Snowflake: 50 patients reconciled from FHIR, HL7v2, and CSV, with healthcare-aware types that preserve date precision, convert units, and validate coded values at the boundary rather than silently corrupting a number downstream.
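
As an illustration of what "healthcare-aware types" could mean at the boundary, here is a small sketch using only the standard library. It is an assumption-laden stand-in: the project lists Pydantic in its stack, where validators would play the same role, and `PartialDate` and `hemoglobin_g_per_dl` are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PartialDate:
    """A date that remembers how precise the source actually was."""
    year: int
    month: Optional[int] = None
    day: Optional[int] = None

    @classmethod
    def parse(cls, text: str) -> "PartialDate":
        """Accept 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD'; reject anything else."""
        parts = [int(p) for p in text.split("-")]  # ValueError on non-digits
        if not 1 <= len(parts) <= 3:
            raise ValueError(f"unrecognized date: {text!r}")
        parsed = cls(*parts)
        # Constructing a real date validates ranges (e.g. month 13 fails)
        # without inventing precision on the stored value.
        date(parsed.year, parsed.month or 1, parsed.day or 1)
        return parsed

    @property
    def precision(self) -> str:
        if self.day is not None:
            return "day"
        if self.month is not None:
            return "month"
        return "year"


# Divisors to reach g/dL; 10 g/L equals 1 g/dL.
_PER_G_PER_DL = {"g/dL": 1.0, "g/L": 10.0}


def hemoglobin_g_per_dl(value: float, unit: str) -> float:
    """Convert a hemoglobin value to g/dL, failing loudly on unknown units."""
    if unit not in _PER_G_PER_DL:
        raise ValueError(f"unsupported hemoglobin unit: {unit!r}")
    return value / _PER_G_PER_DL[unit]
```

The point of the design is that a month-level date stays month-level instead of being padded to the first of the month, and an unrecognized unit raises at load time rather than producing a plausible but wrong number.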

Derive — the endpoints pharma buys

On top of the core sit the deliverables: lines of therapy (1st through 3rd line, with combination-regimen and maintenance rules) and real-world overall survival as a Kaplan-Meier curve. A separate LLM layer abstracts stage, ECOG, biomarkers, and progression out of free-text notes at 94% measured accuracy against known ground truth. A Streamlit app ties it together — build a cohort, read the survival curve, inspect the audit trail.
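
For readers unfamiliar with the survival endpoint, the Kaplan-Meier estimate can be written out directly. The sketch below is a plain-Python version of the standard product-limit estimator, not the project's implementation; `kaplan_meier` and its argument names are chosen for this example, and a production pipeline would more likely rely on an established survival-analysis library.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def kaplan_meier(
    durations: Sequence[float], events: Sequence[bool]
) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time.

    durations: follow-up time per patient.
    events: True if death was observed, False if the patient was censored.
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # everyone leaving the risk set at time t
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after t.
        at_risk -= exits[t]
    return curve
```

Patients censored at the same time as an event are still counted as at risk for that event, which is the usual convention. Censoring matters in real-world data because many patients are lost to follow-up before death is recorded, and dropping them would bias survival downward.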