"Large Language Model" is the buzzword of the decade. But strip away the hype, and what is it? It's a probability machine—a very, very smart one.
The Core Concept: Next-Token Prediction
At its heart, an LLM like GPT-4 does one thing: it predicts the next word (or "token") in a sequence. If you input: "The quick brown fox jumps over the ______", the model assigns a probability to every token in its vocabulary.
- "lazy" (85%)
- "fence" (10%)
- "moon" (0.001%)
It selects "lazy", appends it to the sequence, and then predicts the next token based on the extended sequence. It repeats this autoregressively, one token at a time.
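That autoregressive loop can be sketched in a few lines. The "model" here is just a hard-coded lookup table with made-up probabilities; a real LLM computes the distribution with a neural network, but the generation loop has the same shape.

```python
# Toy "model": maps a context string to a next-token distribution.
# The probabilities are illustrative, not from a real model.
TOY_MODEL = {
    "The quick brown fox jumps over the": {"lazy": 0.85, "fence": 0.10, "moon": 0.05},
    "The quick brown fox jumps over the lazy": {"dog": 0.90, "cat": 0.10},
}

def next_token(context):
    """Greedy decoding: always pick the highest-probability token."""
    probs = TOY_MODEL[context]
    return max(probs, key=probs.get)

def generate(context, steps):
    """Autoregressive loop: append the chosen token, then predict again."""
    for _ in range(steps):
        context = context + " " + next_token(context)
    return context

print(generate("The quick brown fox jumps over the", 2))
# The quick brown fox jumps over the lazy dog
```

The key point is in `generate`: each prediction is fed back in as input for the next one, which is why errors early in a generation can compound.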
How Does It "Know" Things?
It doesn't "know" facts the way a database does. It stores statistical relationships between concepts, learned by reading terabytes of text: books, Wikipedia, code, and websites.
It learned that "Paris" appears often near "France" and "capital". It learned that `function()` usually shows up inside code.
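A crude proxy for these learned associations is co-occurrence counting: tally how often word pairs appear in the same sentence. This is only a sketch of the idea; real models encode associations in dense weight matrices, not counts, and train on vastly more text than this toy corpus.

```python
from collections import Counter
from itertools import combinations

# Tiny invented corpus, standing in for real training data.
corpus = [
    "Paris is the capital of France",
    "France has Paris as its capital",
    "The moon orbits the Earth",
]

# Count how often each pair of words shares a sentence -- a rough
# stand-in for the statistical associations an LLM absorbs in training.
pair_counts = Counter()
for sentence in corpus:
    words = set(sentence.lower().split())
    for a, b in combinations(sorted(words), 2):
        pair_counts[(a, b)] += 1

print(pair_counts[("france", "paris")])  # 2 -- strong association
print(pair_counts[("france", "moon")])   # 0 -- no association
```

"Paris" and "France" score high; "France" and "moon" score zero. That asymmetry, scaled up enormously, is what lets the model complete "The capital of France is ______" correctly without storing the fact as a database row.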
Tokens vs. Words
Models don't see words; they see tokens. A token can be a full word ("apple") or a part of a word ("ing").
- The word "antidisestablishmentarianism" might be split into 5 tokens.
- This is why models sometimes struggle with simple math or with spelling words in reverse: they see the token `384`, not the digits 3, 8, 4.
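A toy greedy longest-match tokenizer makes the effect concrete. The vocabulary below is invented for illustration; real models use byte-pair encoding with vocabularies of tens of thousands of tokens, so the exact splits differ, but the principle is the same.

```python
# Invented vocabulary for illustration only.
VOCAB = {"anti", "dis", "establish", "ment", "arian", "ism", "384", "3", "8", "4"}

def tokenize(text):
    """Greedy longest-match tokenization over a fixed vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        # Take the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: emit it as-is
            i += 1
    return tokens

print(tokenize("antidisestablishmentarianism"))
# ['anti', 'dis', 'establish', 'ment', 'arian', 'ism']
print(tokenize("384"))
# ['384'] -- one opaque token, not three digits
```

Because "384" matches as a single vocabulary entry, the model never sees its individual digits, which is exactly why digit-level tasks like reversing a number can trip it up.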
Temperature: The Creativity Dial
When generating text, you set a "temperature" (commonly between 0 and 1, though many APIs allow higher values).
- Temp 0: Always pick the most likely word. Good for coding or factual answers.
- Temp 1: Sample from the model's probability distribution, occasionally picking less likely tokens. This creates variety and "creativity", but increases the risk of hallucinations (making things up).
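Under the hood, temperature rescales the model's log-probabilities before sampling. Here is a minimal sketch using made-up log-probabilities; dividing them by the temperature before re-normalizing sharpens the distribution at low temperatures and flattens it at high ones.

```python
import math
import random

def sample_with_temperature(logprobs, temperature, rng=random.Random(0)):
    """Divide log-probs by temperature, re-normalize, then sample.
    Temperature 0 is treated as greedy (argmax) decoding."""
    if temperature == 0:
        return max(logprobs, key=logprobs.get)
    scaled = {tok: lp / temperature for tok, lp in logprobs.items()}
    total = sum(math.exp(v) for v in scaled.values())
    probs = {tok: math.exp(v) / total for tok, v in scaled.items()}
    r, cum = rng.random(), 0.0
    for tok, p in probs.items():
        cum += p
        if r < cum:
            return tok
    return tok  # guard against floating-point rounding

# Illustrative log-probabilities (not from a real model).
logprobs = {"lazy": math.log(0.85), "fence": math.log(0.10), "moon": math.log(0.05)}
print(sample_with_temperature(logprobs, 0))    # always "lazy"
print(sample_with_temperature(logprobs, 1.0))  # usually "lazy", sometimes others
```

At temperature 0 the output is deterministic, which is why it suits code and factual lookups; raising the temperature shifts probability mass toward the tail, where both creative phrasing and fabrication live.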
Understanding these limits helps you use LLMs better. They are reasoners and writers, not truth engines.
