We need to build the GitHub of scientific data
I was reading an article published in 2017 that made a bold claim: a scientific dataset's odds of being available drop by 17% each year. That's not just bad. It's a crisis.
Think about what that means. A 17% annual decline compounds: within four years of publication, roughly half of a paper's data has effectively vanished. Gone. Inaccessible.
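The arithmetic is easy to check. Here's a minimal back-of-the-envelope sketch, treating the 17% figure as a simple annual decay in availability (an assumption on my part; the underlying study framed it as a yearly drop in the odds of a dataset still being extant):

```python
# Back-of-the-envelope decay: assume 17% of still-available datasets
# become unavailable each year (a simplification of the study's claim).
annual_retention = 1 - 0.17  # 83% of available datasets survive each year

for years in range(1, 9):
    still_available = annual_retention ** years
    print(f"{years} year(s) after publication: {still_available:.0%} still available")

# Year 4 comes out around 47% -- roughly half the data already gone.
```

Under that same assumption, by year eight less than a quarter of the data is still reachable.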
This is insane. We're living in an age where storage is cheaper than ever. Where we can save every photo we take, every tweet we post, every half-formed idea we jot down. Yet the foundation of scientific progress, the data itself, is vanishing at an alarming rate.
That study tracked biology datasets, but the same pattern holds across every field of science. And it's not because scientists don't care. It's because we've built a system that makes it hard to do the right thing.
Here's the problem: when a scientist publishes a paper, they care about the paper itself. They're incentivized to. The data? It's an afterthought. They might upload it to their university's server. Or to a field-specific repository. But then they move on to the next idea. And when they change institutions, or when the funding for that server runs out, or when the repository decides to "upgrade" its systems... poof. The data is gone.
This wouldn't be a big deal if data didn't matter. But data is everything in science. It's the raw material from which knowledge is built. Losing it doesn't just inconvenience a few researchers. It leaves published results that can't be verified, reproduced, or built on.
Imagine if software worked this way. Imagine if, every time you used a library, you had to find the original developer and hope they had a copy. It would be absurd. Yet that's essentially what we're asking scientists to do.
We need to fix this. And I think the solution is surprisingly simple: we need a GitHub for scientific data.
Think about what GitHub did for code. Before GitHub, sharing code was a pain. You had to set up servers. You had to host your own version control. You had to figure out how to handle contributions from others. GitHub made all of that easy. It unleashed a wave of collaboration and innovation that transformed the software industry. There were other solutions out there, of course, but none of them stood the test of time the way GitHub has.
We need the same thing for scientific data. A single, global platform where:
Data is stored permanently.
Everything is indexed and easily searchable.
Version control is built-in.
Licensing is clear and standardized.
Forking and modifying datasets is as easy as forking a GitHub repo (see the sketch after this list).
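To make that list less abstract, here is a purely hypothetical sketch of what a single dataset record on such a platform might carry. Every name in it (Dataset, DatasetVersion, fork) is invented for illustration; it describes no existing system.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DatasetVersion:
    version: str          # e.g. "v2.1.0" -- version control built in
    content_hash: str     # immutable pointer to the stored files
    changelog: str = ""   # what changed and why

@dataclass
class Dataset:
    identifier: str                    # permanent, citable ID (DOI-like)
    title: str
    license: str                       # clear, standardized license, e.g. "CC-BY-4.0"
    keywords: list = field(default_factory=list)   # indexed for search
    versions: list = field(default_factory=list)   # full history, never overwritten
    forked_from: Optional[str] = None  # lineage, so derived datasets stay attributed

def fork(original: Dataset, new_identifier: str) -> Dataset:
    """Forking is just copying the record and pointing back at its parent."""
    return Dataset(
        identifier=new_identifier,
        title=f"{original.title} (fork)",
        license=original.license,
        keywords=list(original.keywords),
        versions=list(original.versions),
        forked_from=original.identifier,
    )
```

The specific fields don't matter. What matters is that identity, versioning, licensing, and lineage are first-class parts of the record rather than afterthoughts bolted on later.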
This isn't just a nice-to-have. It's essential for the future of science. Because right now, we're building on quicksand. Each new study, each new discovery, relies on data that might not be there in a few years. We're creating a house of cards that's destined to collapse.
Some will argue that we already have solutions for this. There are field-specific repositories, institutional databases, and even attempts at cross-disciplinary platforms like Zenodo and Figshare. But none of them has solved the whole problem. They're fragmented and hard to use. They lack the network effects that made GitHub so powerful.
We need a single, universal platform. It should be the default place to store and share scientific data. A platform so easy to use, so obviously beneficial that not using it would seem crazy.
This isn't just about making scientists' lives easier. It's about accelerating the pace of scientific progress itself. It's about solving big problems faster.
We need to build the GitHub of scientific data.