AI Versus The Internet

On May 6, 2024, OpenAI and StackOverflow issued statements announcing that they would be collaborating on their respective AI missions. OverflowAI, StackOverflow's LLM project, would depend on OpenAI, and presumably OpenAI would get access to StackOverflow's 59 million questions and answers.

On the surface, this makes a ton of sense. StackOverflow has a massive repository of natural language articles spanning several languages and subject areas. These questions even have labeled answers, meaning that OpenAI would get access to a data repository of questions paired with labeled, correct answers. You don't need to know much about AI to know how valuable that is.

OpenAI is the leading LLM developer, and the application of LLMs to StackOverflow makes sense. They can help cut down on duplicate questions, answer questions faster and, possibly, produce answers with rich and hopefully correct context, as opposed to some of the... toxic or incomplete StackOverflow answers you see fairly frequently.

You might be surprised, then, to find that there was significant controversy following this announcement. This Wired article sums it up fairly well. StackOverflow users are furious because they feel the platform is selling off their contributions for cash to a company that is doing a lot of ill on platforms like StackOverflow. They feel the platform has changed course after being initially resistant to, and even adversarial toward, AI.

This conflict is a great example of a much larger, and more important, conflict going on behind the scenes.

AI Versus The Internet

As an "old millennial" I grew up just old enough to experience the internet in my youth in the post-Dotcom Bubble era, but not old enough to be highly active in the pre-social media era. As such, social media, whether it be Facebook, Twitter, YouTube or forums, were always an integral part of the internet I experienced growing up.

This means my view of the internet has been "tainted" somewhat, because I grew up in the social media era and not the "the internet is a public resource, information deserves to be free, crypto-libertarian-esque" era. I still espouse those beliefs, but the reality I've grown up in has been that of a very different internet. The "information should be free and the internet shouldn't be ruled by corporations" ideology is, for me, an idealistic one, while those who grew up in the earlier internet era see it as one to return to.

When StackOverflow announced they were collaborating with OpenAI, my immediate response was that of pragmatic pessimism: "Well, that makes sense, even though it sucks." To me, it was bound to happen. The internet as I've seen it is one in decline, so this furtherance of the decline fits the mold.

That said, this fight is an important one, especially if you're not as enveloped in pragmatic pessimism as I am. The idea that one's contributions to the public square can be captured and monetized by a corporation that takes the "move fast and break things" ideology to the extreme, a company hyper-fixated on developing an Artificial General Intelligence, is one we should fight back against. Folks are understandably angry that work they contributed to the greater good of the public square is being captured and monetized.

This isn't just OpenAI. Twitter, Meta and Google have long been capturing our data and monetizing it, regardless of our will. You can say that the "Terms of Service" nobody reads constitutes consent, but it is neither fully consensual nor fully informed. Your options are to segregate yourself from the public square of the internet, or to agree to let your online identity be captured and sold. This is not a new battle; it is one as old as the version of the internet I was born into.

What constitutes ethical usage of The Commons?

If I were to walk into a common area and grow a bunch of corn for folks to eat collectively, but someone walked into my patch of corn, took it all and opened up a stall to sell the corn he did not grow or pay for, we could all probably agree that that's unethical usage of the commons. It's theft. Someone is benefiting from the work of someone else with zero repayment or consent.

What happens when the corn doesn't disappear, though?

When OpenAI scrapes publicly available, non-personal data from StackOverflow, answers that someone put their heart and soul into in order to help out their fellow developers, that information doesn't disappear. GitHub's Copilot made bountiful use of all the open source code uploaded to GitHub over the years, but that code stayed on GitHub and was just as available after it was used to train Copilot as it was before. What Copilot did was take individual codebases that had (limited, but real) value and create a product that arguably added value back to them. All of a sudden, those codebases were valuable not just for their function and for what developers could learn from them; they were also valuable because they fed into a product that is now being used to build new projects. Nothing, technically, was lost.

But in our heads, something still feels wrong, doesn't it?

To me, the answer lies in the way we feel about plagiarism. If I rip off a blog post from someone else and don't cite it, nothing of value is lost for the original blogger. Sure, maybe they lose a couple bucks of ad revenue, memberships, etc., but probably not, and the argument that would ensue when the blogger discovers my misstep likely wouldn't center on the money.

It would center on me taking credit for work I didn't do.

We all take pride in our work, in our accomplishments, and in our creations. This is especially true for things we put out into the world to make it a better place, or to help people. That pride is a good thing: you should feel a certain level of pride in contributing to StackOverflow, donating your time to helping others with zero, or close to zero, expectation of future reward.

That, to me, is what a lot of the discussion around the OpenAI+StackOverflow collaboration is about: OpenAI is taking the contributions of others, contributions made for the unselfish benefit of the fellow developer, and almost bastardizing them by capitalizing on that work and, in creating a product from it, functionally taking credit for it. We dislike it not because value is being removed from the environment in the theft of StackOverflow's data; we dislike it because we have a natural inclination against the capitalization of other people's work.

There's nothing illegal about what StackOverflow and OpenAI are doing. It's probably not even unethical. There might be something to say if StackOverflow is breaking the rules set in its own terms of service, but let's be honest: nobody should expect companies to stick to their own ToS.

Rules for thee, but not for me.

What we are angry about, then, is not a legalistic problem, nor is it a clear-cut ethical one. We're angry about a values problem. In our minds, an important value, a norm we have set, that open source contributions should remain open and not be capitalized upon by giant corporations, is being broken by these AI companies.

And values problems are hard to argue.