Zachary Jacokes

I'm a biomedical data scientist building scalable, reproducible systems that extract reliable structure from complex, high-dimensional data. I transmute the noise in clinical and biological datasets into interpretable, real-world insights.

Outside of work, I'm a father of two young boys and a basketball and tennis enthusiast. I enjoy thoughtful conversations and welcome opportunities to connect.

Email GitHub Google Scholar LinkedIn

About Me

I used to think the beauty of data science was in the arc: raw data in, clean result out, story told. That instinct wasn't exactly wrong, but it was the naïve version, the one you believe before you've spent years inside real datasets.

What I've actually learned is that the interesting part isn't the trajectory. It's the process of building everything so the structure holds up. The pipeline that makes an analysis repeatable across multiple sites. The quality checks that catch when a scanner drifts. The database design that means a clinician in Seattle and a clinician in Los Angeles are actually recording the same thing. That's where the real work lives.

How I got here

I studied psychology at Emory, originally out of curiosity about how the mind works. What actually grabbed me wasn't the clinical side, though; it was the methodology. The studies that stuck with me were the ones with disciplined design, and I got more interested in that than in the findings themselves.

At Georgia Tech, I worked on a computational experiment using expectation-maximization to model social smiling behavior, and that was the moment machine learning clicked. Not just conceptually, but as a tool you could point at a messy problem to pull structure out of it.

After that, the direction was pretty clear. I wanted to be the person who makes the analysis possible, not just the person who runs it.

The multi-site years

I spent four years at USC's Lab of Neuroimaging working as part of a data coordination center for a multi-site study. The job was making sure data collected by different teams, on different scanners, in different cities could be compared meaningfully. That meant building workflows, running quality control, and constantly asking whether the variation in a dataset is telling you something about biology or just about which MRI machine someone happened to use.

At UVA, I took that a step further and built systems from scratch: REDCap databases for dozens of clinical instruments, automated preprocessing pipelines, HIPAA-compliant data handling, the whole stack. I trained clinical staff; I wrote the documentation; I was the person who got the call when something broke. Most people touch one piece of a data pipeline. I've worked across all of it, from data entry to model output.

The PhD

I already knew how to build infrastructure, so the doctorate wasn't a pivot. It was about going deeper on what happens inside that infrastructure. How to represent and model high-dimensional biomedical data, and, importantly, how you know whether what you've found is real.

My research boils down to one question: can you extract reliable signal from noisy, heterogeneous, multi-site clinical data? That touches dimensionality reduction, spectral embeddings, cross-site harmonization, and brain-behavior modeling, but the common thread is skepticism. When you're working with data from multiple sites, the hardest problem is figuring out whether the pattern you found would survive a different scanner, a different cohort, or a different Tuesday. Once that's clear, modeling becomes a question of fit, not faith.

What I do

I build and run data systems in biomedical environments where nothing is simple: multi-site studies, inconsistent inputs, regulatory constraints, and evolving scientific questions. I've published 14 peer-reviewed papers and a book chapter, presented at international conferences, built large-scale data pipelines, and mentored students and staff along the way.

The ethos that defines how I work is this: I don't treat modeling as separate from data infrastructure. If the data feeding your model aren't trustworthy, the model doesn't matter. If the result doesn't generalize past the dataset that produced it, you haven't learned anything. And if the whole system falls apart when you're not watching it, it's not a system.

What's next

I'm drawn to problems where the answers aren't settled. Real-world evidence, neurotechnology, clinical data platforms: places where the data are messy because the biology is messy. Those are the environments where careful infrastructure and honest modeling matter most.

The version of me who wrote a grad school application once said he wanted to "participate in the data revolution." I still like the energy of that, even if I'd phrase it differently now. These days, the goal is more specific: build systems and models that make complex data usable, and do it reliably, at scale, and without anyone having to hold them together by hand.