How Language Models Became the Foundation of AI Engineering

This is my second post on AI Engineering by Chip Huyen. Chapter 1 already covers a lot of interesting ground: where language models come from, what foundation models actually are, and how AI engineering emerged as a discipline.

The Statistical Nature of Language

The idea that language follows statistical patterns isn't new. We've known for centuries that certain letters appear more frequently than others. For example, E is the most common letter in English.

In 1951, Claude Shannon published a landmark paper on the statistical nature of language, and many of his concepts are still used today (e.g., entropy).
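To make entropy a bit more concrete for myself, here is a minimal Python sketch. The sample sentence and the function names are my own, not from the book. It counts how often each letter appears and then computes the entropy of that distribution in bits, using H = -sum(p * log2(p)).

```python
import math
from collections import Counter


def letter_frequencies(text):
    """Relative frequency of each letter (case-insensitive, letters only)."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {letter: count / total for letter, count in counts.items()}


def shannon_entropy(probabilities):
    """Entropy in bits: H = -sum(p * log2(p)), skipping zero probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Made-up sample text; a real estimate would need a much larger corpus.
sample = "the quick brown fox jumps over the lazy dog and then the dog sleeps"
freqs = letter_frequencies(sample)

# Most frequent letters first, then the entropy of the letter distribution.
print(sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)[:3])
print(round(shannon_entropy(freqs.values()), 2))
```

If all 26 letters were equally likely, the entropy would be log2(26), roughly 4.7 bits per letter. Real text scores lower because some letters are far more common than others, and that predictability is exactly what a language model exploits.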

This matters because language models are, at their core, statistical machines. They encode patterns about how language works and use those patterns to predict what comes next.

Tokens: The Basic Unit

The fundamental unit of a language model is the token. Tokenization is the process of splitting text into these units. Tokens sit between characters and full words: they keep the vocabulary much smaller than a word-level one while still carrying more meaning than individual characters. If you want to see this in action, OpenAI has a tokenizer tool where you can input any text and see exactly how it gets split.

I think it's really cool that you can also see how a language model represents text internally.
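To get a feel for what splitting text into tokens means, I wrote a toy sketch in Python. It is not how OpenAI's tokenizer works: the vocabulary below is hypothetical and hand-picked, and the greedy longest-match rule is a simplification I chose for illustration. Real tokenizers learn their vocabulary from data.

```python
# Hypothetical vocabulary for illustration. Real tokenizers learn theirs from data.
VOCAB = ["token", "ization", "izer", "un", "believ", "able", "s", " "]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}
UNKNOWN_ID = -1  # placeholder ID for anything outside the toy vocabulary


def tokenize(text, vocab=VOCAB):
    """Greedy longest-match split; unknown characters become one-character tokens."""
    pieces = sorted(vocab, key=len, reverse=True)  # try the longest pieces first
    tokens = []
    i = 0
    while i < len(text):
        match = next((p for p in pieces if text.startswith(p, i)), text[i])
        tokens.append(match)
        i += len(match)
    return tokens


def encode(text):
    """Map each token to its integer ID, which is what a model actually consumes."""
    return [TOKEN_TO_ID.get(token, UNKNOWN_ID) for token in tokenize(text)]


for word in ["tokenization", "tokenizers", "unbelievable"]:
    print(word, tokenize(word), encode(word))
```

The point of the sketch is the shape of the idea: words are broken into reusable pieces, and each piece maps to an integer ID.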

Two Kinds of Language Models

The book distinguishes two types of language models:

Masked language models are trained to fill in the blank — given a sentence with a missing word, predict what goes there. BERT is the classic example. These are useful for tasks like text classification and code debugging.

Autoregressive language models predict the next token in a sequence, using only the tokens that came before it. This is the approach behind GPT and the models we associate with generative AI.
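The difference between the two is easier to see with a toy example. The Python sketch below is my own simplification: it uses plain word counts over a made-up three-sentence corpus instead of a neural network, and the function names are hypothetical. The only thing it tries to show is which context each kind of model gets to look at.

```python
from collections import Counter, defaultdict

# Made-up corpus, just large enough to show the two prediction styles.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat ate the fish",
]

next_counts = defaultdict(Counter)  # previous word -> counts of the next word
fill_counts = defaultdict(Counter)  # (left, right) -> counts of the middle word

for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        next_counts[prev][nxt] += 1
    for left, mid, right in zip(words, words[1:], words[2:]):
        fill_counts[(left, right)][mid] += 1


def most_likely(counter):
    """Most frequent entry, or None if this context was never seen."""
    return counter.most_common(1)[0][0] if counter else None


def predict_next(prev):
    """Autoregressive style: only the preceding word is available."""
    return most_likely(next_counts.get(prev, Counter()))


def fill_blank(left, right):
    """Masked style: the words on both sides of the blank are available."""
    return most_likely(fill_counts.get((left, right), Counter()))


print(predict_next("sat"))       # what tends to follow "sat"
print(fill_blank("the", "ate"))  # what fits in "the ___ ate"
```

A real model conditions on far more context than a single neighbor, but the asymmetry is the same: the autoregressive predictor never sees what comes after the position it is predicting.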

Claude actually created a diagram to explain it to me:

The Breakthrough That Made LLMs Possible

In traditional machine learning, there are two main approaches: supervised learning and unsupervised learning. For supervised learning, you need labeled data: inputs paired with known correct outputs. Creating those labels requires humans, and that can be very expensive.

The breakthrough for text was self-supervision: models learned to infer the labels directly from the input data. This meant models could train on essentially all the text available on the internet without anyone manually labeling anything. That's how we went from language models to large language models.
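Here is a small Python sketch of what inferring labels from the input looks like for next-token prediction. The function name and the context_size parameter are my own, and I split on whitespace instead of using a real tokenizer. Every position in the text produces a training example for free, because the label is just the token that comes next.

```python
def next_token_examples(tokens, context_size=3):
    """Turn a token sequence into (context, label) pairs.

    The label is simply the next token, so no human labeling is needed.
    """
    examples = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        examples.append((context, tokens[i]))
    return examples


# Whitespace splitting stands in for a real tokenizer here.
tokens = "I love street food".split()
for context, label in next_token_examples(tokens):
    print(context, "->", label)
```

A sequence of n tokens yields n - 1 examples, which is why the amount of raw text, not the amount of labeling budget, becomes the limit.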

To get a sense of the scale: GPT-1 (2018) had 117 million parameters. GPT-2 (2019) jumped to 1.5 billion. GPT-3 (2020) reached 175 billion. GPT-4 is estimated at 1.7 trillion, though at this point the big labs have stopped disclosing these numbers.
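Some quick arithmetic helps those numbers sink in. The Python sketch below computes the jump between generations and a rough size for the weights alone. The 2 bytes per parameter figure is my assumption (16-bit weights); actual storage depends on the numeric format used. I left GPT-4 out because its parameter count is only an estimate.

```python
# Parameter counts as quoted above (GPT-4 omitted: its figure is unconfirmed).
PARAMS = {
    "GPT-1": 117e6,
    "GPT-2": 1.5e9,
    "GPT-3": 175e9,
}
BYTES_PER_PARAM = 2  # assumption: 16-bit weights


def weights_gb(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


names = list(PARAMS)
for prev, cur in zip(names, names[1:]):
    growth = PARAMS[cur] / PARAMS[prev]
    print(f"{prev} -> {cur}: about {growth:.0f}x more parameters")

for name, n_params in PARAMS.items():
    print(f"{name}: ~{weights_gb(n_params):,.1f} GB of weights")
```

This only counts the weights. Training and serving a model need considerably more memory than that.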

Foundation Models and Multi-Modality

LLMs were trained on text, but the same principles extended to images, video, audio, and other formats. This is called multi-modality, and it gave us Large Multi-modal Models (LMMs). Both LLMs and LMMs are foundation models: general-purpose models trained on massive datasets by a handful of well-resourced labs, then adapted by everyone else.

The adaptation happens mostly through three techniques:

Prompt engineering (giving the model specific instructions)
RAG (connecting the model to external data sources)
Fine-tuning (further training on domain-specific data)
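The first two techniques can be sketched without touching a model at all, because both come down to deciding what text the model sees. The Python example below is a toy under my own assumptions: the documents are hypothetical, retrieval is naive word overlap (real RAG systems typically use embeddings or a search index), and the prompt wording is just one plausible instruction. Fine-tuning is not shown, since it changes the model's weights rather than its input.

```python
import re

# Hypothetical knowledge base standing in for an external data source.
DOCUMENTS = [
    "Refunds are accepted within 30 days of purchase.",
    "Shipping to Europe takes 5 to 7 business days.",
    "Support is available by email on weekdays.",
]


def words(text):
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, k=1):
    """Naive retrieval: rank documents by words shared with the question."""
    q = words(question)
    ranked = sorted(documents, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, documents):
    """Prompt engineering plus RAG: instructions, retrieved context, question."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents))
    return (
        "Answer using only the context below. "
        "If the answer is not there, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


print(build_prompt("How long does shipping to Europe take?", DOCUMENTS))
```

The resulting string is what you would send to whichever model API you use.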

From Foundation Models to AI Engineering

AI engineering is the discipline of building applications on top of these foundation models. Training foundation models is so prohibitively expensive that only a few organizations can do it, which created the "model as a service" paradigm, and with it, a need for engineers who can take these models and adapt them to solve real-world problems.

What kind of applications can be built?

Chip references AWS's categorization of generative AI use cases into three buckets:

Customer experience
Employee productivity
Process optimization

AI has already proven valuable in marketing, code generation (Chip notes that experts say AI is significantly better at generating frontend code than backend code, uh-oh 😅), information aggregation, and workflow automation: from booking restaurants and planning trips for end users, to lead management and invoicing for enterprises.

The Moat Question

One of the most interesting discussions in this chapter is about product defensibility. The low barrier to entry in AI is both a blessing and a curse. If something is easy for you to build, it's easy for your competitors too.

Chip mentions a VC general partner's crushing view: many startups' entire products could become a feature inside Google Docs or Microsoft Office. If their product takes off, what stops a tech giant from assigning three engineers to replicate it in two weeks?

Chip further argues that in AI, competitive advantages come from three places: technology, data, and distribution. Most startups would be using the same foundation models, so technology is not a moat. Distribution advantages tend to belong to big companies. That leaves data as the real edge for startups and individuals: even if you couldn't use your data to train a model, the behavioral data you collect (what your users want, how they interact with your product) is invaluable.

The AI Engineering Stack

AI engineering differs from ML engineering in a fundamental way: it's not about developing models from scratch, but about adapting and evaluating them. The core responsibilities boil down to three things:

Evaluation
Prompt engineering
Building the AI interface

Evaluation is especially critical because foundation models are open-ended. Unlike traditional ML models with narrow, well-defined outputs, these models can produce almost anything, which makes measuring success much harder.
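A tiny Python example shows why this is hard. The sketch below is my own illustration, not a method from the book: it compares a strict exact-match check with a token-overlap F1 score. Both are simple, commonly used text metrics, and both have trouble with free-form answers.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase the text and keep only alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(prediction, reference):
    """True only if the normalized answers are identical."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between the two answers."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "Paris"
prediction = "The capital of France is Paris."  # correct, but phrased freely

print(exact_match(prediction, reference))
print(round(token_f1(prediction, reference), 2))
```

The answer is correct, yet exact match rejects it and the overlap score is low because of the extra words. A classifier with a fixed set of labels never has this problem.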

Prompt engineering is about extracting desirable behaviors from a model without modifying its weights. This includes providing context, connecting tools, and managing memory systems.

Why Full-Stack Engineers Have an Edge

Finally, Chip describes how AI engineering has flipped the traditional ML workflow. Before, you started with data and models, and the product came last. Now, as an AI engineer you can start with the product, validate it with users, and only then invest in data and model optimization. This rewards fast iterators.

In terms of programming languages, Python still dominates, but JavaScript is growing (LangChain.js, Vercel AI SDK, etc.), making this space increasingly accessible to full-stack engineers. Chip puts it clearly: full-stack engineers have an advantage over traditional ML engineers in their ability to quickly turn ideas into demos, get feedback, and iterate.
