Ideas

The Accidental Curators: Humans Becoming the Quality Filter for AI-Generated Content

I was reading Chip Huyen's "AI Engineering" at 7 AM (because apparently that's when my brain decides to process complex ideas best), when I stumbled upon something that made me put down my mate and start scribbling notes.

In Chapter 2, Huyen discusses a fascinating problem: the ratio of human-generated text to LLM-generated content is shifting dramatically. We're producing less original human content relative to the exponential growth of AI-generated material needed to train these models. The math is simple but concerning: if models keep training on content generated by other models, we risk a quality degradation loop. Garbage in, garbage out, but at scale.

The Hypothesis: We're All Curators Now

What if we're looking at this wrong?

Yes, there's less purely human-generated content. But something interesting is happening in the space between human and machine: we've become curators, editors, refiners of AI output. And that role might be more valuable for training than we think. Every time a developer vibe codes, they're not just accepting suggestions blindly. They're tweaking variable names, adjusting logic, fixing edge cases. Every time someone uses Gemini, Claude or Cursor to write an article, they're reorganizing paragraphs, adding personal touches, correcting hallucinations.

Those tweaks? Those adjustments? They're signals. Incredibly rich signals about where the model failed to capture human intent.

The GitHub Laboratory

Look at GitHub. Millions of lines of code flowing in daily. An increasing percentage generated by AI assistants. But here's the kicker: most of that AI-generated code gets modified by humans before being committed.

Those modifications aren't just bug fixes. They're a dataset of human intention correction. They're showing us exactly where Claude got it 90% right but missed that crucial 10%. Where Cursor suggested something functional but not idiomatic. Where the model understood the what but missed the why.

If we could isolate those human edits, separating the AI-generated baseline from the human refinements, we'd have something incredibly valuable: a natural human reinforcement learning dataset at massive scale.
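
To make that concrete, here's a minimal sketch of what isolating that delta could look like, assuming we already had both the raw AI suggestion and the version the human committed. The example strings and the extract_human_delta helper are entirely hypothetical, just to illustrate the idea:

    import difflib

    def extract_human_delta(ai_suggestion: str, human_commit: str) -> list[dict]:
        """Keep only the spans where the committed code differs from the AI baseline."""
        ai_lines = ai_suggestion.splitlines()
        human_lines = human_commit.splitlines()
        corrections = []
        for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, ai_lines, human_lines).get_opcodes():
            if tag == "equal":
                continue  # the model already got this part right
            corrections.append({
                "kind": tag,                         # 'replace', 'delete', or 'insert'
                "model_wrote": ai_lines[i1:i2],      # what the assistant produced
                "human_wanted": human_lines[j1:j2],  # what the developer committed instead
            })
        return corrections

    # Hypothetical example: the assistant's draft vs. what actually got committed.
    ai_suggestion = 'def get_user(id):\n    return db.query(f"SELECT * FROM users WHERE id = {id}")'
    human_commit = 'def get_user(user_id: int):\n    return db.query("SELECT * FROM users WHERE id = ?", (user_id,))'

    for c in extract_human_delta(ai_suggestion, human_commit):
        print(c["kind"], c["model_wrote"], "->", c["human_wanted"])

Each record that comes out of something like this is exactly that crucial 10%: the place where the human had to step in.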

The Signal in the Noise

Traditional reinforcement learning from human feedback (RLHF) is expensive. You need human raters, structured feedback sessions, careful prompt engineering. It doesn't scale easily.

But what if the feedback is already there, hiding in plain sight? Every git diff, every edited draft, every refined prompt—they're all examples of humans saying "not quite, let me show you what I meant."

The challenge isn't generating more synthetic data. It's identifying and extracting the human signal from the increasingly AI-saturated corpus of content.
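
One way to picture that extraction: pair each AI draft with the version the human actually kept, and store them in the chosen/rejected format that preference-tuning methods (RLHF reward models, DPO) already consume. A rough sketch, with purely hypothetical field names and data:

    import json

    def to_preference_pair(prompt: str, ai_output: str, human_refined: str) -> dict | None:
        """Turn one edit event (model draft vs. final version) into a preference example.
        If the human changed nothing, there's no correction signal to learn from."""
        if ai_output.strip() == human_refined.strip():
            return None
        return {
            "prompt": prompt,         # what the human originally asked for
            "rejected": ai_output,    # the raw model draft
            "chosen": human_refined,  # the version the human actually shipped
        }

    # Hypothetical edit log: (prompt, model draft, what the human kept).
    edit_log = [
        ("Write a commit message for this diff", "Fixed stuff",
         "fix: handle empty config file on startup"),
        ("Summarize the meeting notes", "The meeting happened.",
         "The meeting happened."),  # unchanged, so it gets dropped
    ]

    with open("preference_pairs.jsonl", "w") as f:
        for prompt, draft, final in edit_log:
            pair = to_preference_pair(prompt, draft, final)
            if pair:
                f.write(json.dumps(pair) + "\n")

The point isn't this particular format. It's that the raw material for it is already being produced every day, as a side effect of people doing their jobs.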

The Questions This Raises

Can we develop methods to distinguish between purely AI-generated content and AI-generated-but-human-refined content? More importantly, can we extract the delta—the difference between what the model produced and what the human wanted?

Would training on these "correction signals" be more efficient than training on purely human-generated content? After all, mistakes are often better teachers than successes.

And the big one: as we become better at curating AI output, are we inadvertently creating the next generation's training data through our editorial choices?

Why This Matters

If this hypothesis holds any water, it suggests that the "data crisis" for LLMs might not be as dire as it seems. We're not running out of training data—we're evolving into a new paradigm where the human-AI collaboration itself becomes the dataset.

The bottleneck shifts from "how much content can humans generate" to "how can we capture and learn from human curation at scale?"

It's 9 AM now. My mate is cold and I need to start work. But I can't stop thinking about those millions of git commits, each one a tiny lesson in what humans really want when they ask AI for help.

Maybe we're not training AI on human content anymore. Maybe we're training it on human intention... one correction at a time.


What do you think? Am I onto something or just overcaffeinated at 9 AM? Let me know on Twitter or LinkedIn.