What is Generative AI and LLMs, really?


One day, I needed to sanitize an HTML input and went to check out how the Loofah gem solved a problem I was having.

Out of curiosity, I asked ChatGPT to provide an example of how to do it.

The answer it gave me was an exact copy and paste from Loofah. Except, there was no attribution. I only knew it was a copy because I had looked at Loofah’s implementation first.
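For context, sanitizing HTML with Loofah looks roughly like the sketch below. It's a minimal example; the :prune scrubber here is just one option, and the right scrubber depends on what you need to strip.

```ruby
require "loofah"

unsafe_html = '<p onclick="steal()">Hello<script>alert("hi")</script></p>'

# Parse the fragment and scrub it. The :prune scrubber removes unsafe
# tags (and their contents) and unsafe attributes; Loofah also ships
# other scrubbers, such as :strip and :escape, with different trade-offs.
safe_html = Loofah.fragment(unsafe_html).scrub!(:prune).to_s

puts safe_html # => something like "<p>Hello</p>"
```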

This was not cool. I shared it with my colleagues, and we decided to start writing about the recent popularization of Large Language Models (LLMs) and the ethical implications of these so-called “revolutionary” Artificial Intelligence (AI) tools.

There’s a lot of discussion about the “fair use” of the training data. And I’m not the only one questioning that argument. I wanted to do something about it.

My amazing colleague, Mike Burns, accepted my invitation for a series of blog posts that dive deep into this topic. We decided to start from the beginning.

Where to begin to understand more about AI and LLMs?

In this first post, Mike lays the groundwork for the next chapters. The focus is on the technological implementation of LLMs, not on the companies commercializing them (stay tuned for the next post).

Most of the talk around AI has nothing to do with AI; it’s mostly focused on driving accountability and blame away from people and towards “algorithms”.

Generative AI and LLMs

I (Mike) don’t know what generative AI is, so let’s talk about Large Language Models (LLMs).

A Large Language Model is a set of data structures and algorithms. You can think of it like linked lists and binary search: a set of tools in the programmer’s toolbox, with specific use cases and drawbacks.

Some venture capitalists and visionary thinkers have made them into something they’re not. Often, they’ve made them into a scapegoat for their own goals, a prerequisite for allowing them to do what they want.

Let’s talk about that.

LLMs are not coming for your job

An LLM is not managing a company; that is done by people in the company. So if someone is saying that “LLMs are coming for your job”, what they mean is that a person is going to fire you and then have software do your job. That person is making that decision.

Algorithms using LLMs have a specific use case. They generate

  • words
  • pixels
  • bytes

They do not produce correct words; they produce likely words. Think of your phone’s predictive keyboard feature, but better.
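To make “likely words” concrete, here’s a toy sketch of that idea in Ruby. It is nothing like how a real LLM works internally, but it shows the same basic behavior: suggest the statistically likely next word, with no notion of whether it’s correct.

```ruby
# Toy "predictive keyboard": count which word tends to follow which,
# then always suggest the most frequent follower. A real LLM is vastly
# more sophisticated, but it is still predicting likely continuations,
# not checking facts.
corpus = "the cat sat on the mat the cat ate the fish".split

followers = Hash.new { |hash, word| hash[word] = Hash.new(0) }
corpus.each_cons(2) { |word, follower| followers[word][follower] += 1 }

def predict(followers, word)
  followers[word].max_by { |_next_word, count| count }&.first
end

puts predict(followers, "the") # => "cat" (the most likely word, not necessarily the "correct" one)
```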

The tasks that you do at work likely cannot be replaced with such an algorithm. So when a manager fires someone and then uses software to do their job, that’s a decision that the manager is making at the expense of the customer.

And that’s a choice. Of a human.

(Stefanni here)

Since we’re talking about the impact of these tools: as with any other “revolutionary” tool, we can never know what will happen, or whether it’s indeed a revolutionary tool after all. I agree with the view Cal Newport presents in his video “The AI Revolution: How To Get Ahead While Others Panic”.

Could it be that companies will realize that ChatGPT doesn’t do anything extraordinary that a regular employee couldn’t do? Or that hiring a person to do the job is cheaper than paying for API requests?

I guess we will find out soon.

Back to Mike.

LLMs don’t plagiarize artwork

OpenAI, one of the big names commercializing LLMs, trained their LLM on the Web, circa September 2021. GitHub, another company investing heavily in LLMs, trained theirs on the repositories that they host.

DeviantArt, Meta, Microsoft, Stability, Midjourney, Grammarly, StackOverflow, and others have engaged in similar practices: training their LLMs on works created by others, without their consent.

This was a choice that all of them made. This is not a requirement for training an LLM.

LLMs are a tool

All of this is to point out that the algorithms and data structures comprising an LLM are a tool, not a product, and people need to be careful with the tools that they use.

Of course, this is not new. LLMs did not usher in the idea of software harming people.

At every point in the history of computers, we have had to wrestle with the ethics of the products we are making.

No aspect of the data structures available has made that easier or harder. It is still our choice whether to make a product for good, and it is our burden to consider the systemic effects of the products we build.

Your programming tools have biases

Whether it’s the strongly-opinionated Rails stack or the opinionatedly-unopinionated world of JavaScript libraries, how we model software is shaped by the APIs we are given.

Let’s explore that in the context of one such LLM API: OpenAI’s.

OpenAI is a network API

All of this is honestly a moot point for the vast majority of us:

We are not building LLMs.

Instead, we make API calls to endpoints that claim to be backed by LLMs.

Just as when you make an API call to get the exchange rate for a currency, you need to be clear to yourself and your users about the limitations of the data. There’s nothing new here.
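For example, a call to OpenAI’s chat completions endpoint is just another HTTP request, and the text that comes back deserves the same caveats you’d put on any third-party data. This is a rough sketch: the endpoint and parameters follow OpenAI’s public docs at the time of writing, and the prompt is made up.

```ruby
require "net/http"
require "json"
require "uri"

# A plain HTTP request to an endpoint that claims to be backed by an LLM.
uri = URI("https://api.openai.com/v1/chat/completions")

request = Net::HTTP::Post.new(uri)
request["Authorization"] = "Bearer #{ENV.fetch("OPENAI_API_KEY")}"
request["Content-Type"] = "application/json"
request.body = {
  model: "gpt-3.5-turbo",
  messages: [{ role: "user", content: "Summarize this support ticket: ..." }]
}.to_json

response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
  http.request(request)
end

# The reply is likely-sounding text, not verified fact. Treat it with the
# same caveats you'd attach to any third-party data before showing it to users.
reply = JSON.parse(response.body).dig("choices", 0, "message", "content")
puts reply
```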

And just as when logging data to your customer support platform, you need to be careful of what data you send over the wire. For us, consultants on our fourth HIPAA training, this is especially not new.
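A hypothetical sketch of that kind of care: scrub obvious identifiers before anything leaves your system. The patterns below are illustrative only, not a complete PHI or PII strategy.

```ruby
# Hypothetical redaction step before logging or sending text to any
# third-party service. The patterns are illustrative only; real PHI/PII
# handling needs a far more deliberate approach than two regexes.
EMAIL_PATTERN = /[\w.+-]+@[\w-]+\.[\w.]+/
SSN_PATTERN   = /\b\d{3}-\d{2}-\d{4}\b/

def redact(text)
  text.gsub(EMAIL_PATTERN, "[REDACTED EMAIL]")
      .gsub(SSN_PATTERN, "[REDACTED SSN]")
end

puts redact("Patient jane@example.com, SSN 123-45-6789, reports a rash.")
# => "Patient [REDACTED EMAIL], SSN [REDACTED SSN], reports a rash."
```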

What are the specific, inherent problems of LLMs and GPTs?

Generative Pre-trained Transformers (GPTs), such as those used by OpenAI’s ChatGPT, are awful for the environment. Absolutely abysmal. They are bad in a unique and new way.

Quick note: we will explore this section more in the next post.

Your ethics drive the product, not your tools

The long and short of it is that the existence of LLM algorithms and data structures is not a threat. If you can find a way to make use of a computer that is bad at math and sometimes wrong, that on its own is not a problem.

The existential crisis is in making a product that harms people, which can be done with or without LLMs. It is your responsibility to build your products with care, whether you use insertion sort or machine learning.

What is your responsibility and the correct thing to do?

As with any other tool that surfaces widespread ethical problems, understanding how these tools work is critical if we want to do something about them.

We hope this post clarified some terms about AI, LLMs, GPTs, etc.

In the next post, we will explore the product-facing, commercial side of the companies training LLMs in more depth, including their impact on the environment.

In the meantime, we have a question for you: do you know of any ethically trained LLM, or an LLM that respects copyright? If so, we’d love to know.