Understandable Explanation of How an LLM Works
Posted: 9/17/2025 8:04:21 AM
By: PrintableKanjiEmblem
Times Read: 69
Topic: Reference

Overview

Large Language Models (LLMs) are transformer‑based neural nets trained on terabytes of text to learn the conditional distribution of the next token given its context.
The training pipeline is a blend of data engineering, deep‑learning research, and large‑scale distributed systems.
Below is a “full stack” walk‑through aimed at someone who’s built distributed software and written low‑level code.


1. Data Pipeline

| Stage | What you do | Why it matters |
|---|---|---|
| Corpus collection | Web‑scraping, public corpora (Common Crawl, Wikipedia, books, code), proprietary data. | Determines the model's inductive biases and knowledge. |
| Deduplication / filtering | Remove exact or near‑duplicates, apply profanity filters and harmful‑content flags. | Prevents data leakage, reduces redundancy, keeps the dataset clean. |
| Sharding | Split the corpus into shards (~1 GB each). | Enables parallel ingestion and deterministic random access. |
| Tokenization | Byte‑Pair Encoding (BPE), SentencePiece, or WordPiece. | Turns raw text into integer IDs; the choice influences vocabulary size and OOV handling. |
| Pre‑tokenization transforms | Lowercasing, whitespace normalisation, special‑token insertion (e.g., beginning/end‑of‑sequence markers). | Standardises the input and helps the model learn boundaries. |
| Dataset construction | Convert shards to a dataset that supports streaming: TextDataset, IterableDataset. | Allows reading data without keeping it all in memory. |
| Sequence packing | Break each shard into fixed‑length token sequences (e.g., 2048 tokens). | Enables efficient batching and padding‑free computation. |
| Bucketing / collate | Group similar‑length sequences together. | Minimises padding, improves GPU utilisation. |

Implementation hint: Use torch.utils.data.IterableDataset with a generator that reads the shards sequentially, then yields mini‑batches to a collate function that pads to the max length in the batch.
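A minimal sketch of that pattern, assuming each shard is a JSONL file with one pre‑tokenised example (a list of token IDs) per line; the file names and pad_id are illustrative assumptions, not a fixed format:

import json
import torch
from torch.utils.data import IterableDataset, DataLoader

class ShardStream(IterableDataset):
    """Streams token-ID sequences from shard files without loading them all into memory."""
    def __init__(self, shard_paths):
        self.shard_paths = shard_paths

    def __iter__(self):
        for path in self.shard_paths:              # read shards sequentially
            with open(path) as f:
                for line in f:                     # one pre-tokenised example per line
                    yield torch.tensor(json.loads(line), dtype=torch.long)

def collate(batch, pad_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(seq.size(0) for seq in batch)
    out = torch.full((len(batch), max_len), pad_id, dtype=torch.long)
    for i, seq in enumerate(batch):
        out[i, : seq.size(0)] = seq
    return out

loader = DataLoader(ShardStream(["shard_000.jsonl", "shard_001.jsonl"]),
                    batch_size=8, collate_fn=collate)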


2. Model Architecture

| Sub‑module | Purpose | Typical design choices |
|---|---|---|
| Embedding layer | Turns token IDs into dense vectors. | nn.Embedding(vocab_size, hidden_dim); optionally tied to the output projection. |
| Positional encoding | Adds order information. | Learned positional embeddings or sinusoidal embeddings. |
| Transformer blocks | Core transformer: self‑attention + MLP + residuals. | 12–96 layers, 768–12288 hidden dim, 12–128 heads. |
| LayerNorm | Stabilises training. | Pre‑ or post‑LayerNorm depending on the variant. |
| Output head | Predicts the next‑token distribution. | Linear layer tied to the embedding (weight = embedding.weight). |

Key equations

  1. Scaled Dot‑Product Attention: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$
  2. Multi‑Head Attention: $\mathrm{MH}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}$
  3. Position‑wise Feed‑Forward: $\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$
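The scaled dot‑product attention formula maps almost line for line to code. This is a minimal single‑head sketch (no masking, dropout, or multi‑head reshaping); the tensor shapes are illustrative:

import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq, seq)
    weights = torch.softmax(scores, dim=-1)              # attention weights per query
    return weights @ V                                   # weighted sum of values

# Example: batch of 2 sequences, 16 tokens, head dim 64
Q = K = V = torch.randn(2, 16, 64)
out = scaled_dot_product_attention(Q, K, V)   # shape (2, 16, 64)
# PyTorch >= 2.0 also ships torch.nn.functional.scaled_dot_product_attention as a fused version.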

3. Training Objective & Loss

| Objective | Loss | Why it is chosen |
|---|---|---|
| Autoregressive LM | Cross‑entropy over the next token | Enables free‑form text generation. |
| Masked LM (BERT style) | Cross‑entropy over masked tokens | Enables bidirectional context; used for fine‑tuning. |

For GPT‑style models the loss is computed over every token in the sequence:

$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{CE}\!\big(p_{\theta}(x_{i+1} \mid x_{1:i}),\ \mathrm{one\text{-}hot}(x_{i+1})\big)$
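In code this amounts to shifting the sequence by one position and averaging cross‑entropy over all predicted positions; a minimal sketch with illustrative tensor shapes:

import torch
import torch.nn.functional as F

vocab_size = 50257
logits = torch.randn(4, 2048, vocab_size)        # model outputs for a (batch, seq) of tokens
tokens = torch.randint(0, vocab_size, (4, 2048))  # the input token IDs

# Predict token i+1 from positions 1..i: drop the last logit, drop the first target.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = tokens[:, 1:].reshape(-1)

loss = F.cross_entropy(shift_logits, shift_labels)   # mean CE over all predicted tokens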

Regularisation tricks

| Trick | Effect |
|---|---|
| Label smoothing (e.g., 0.1) | Reduces over‑confidence, improves calibration. |
| Weight decay (AdamW) | Encourages smaller weights, improves generalisation. |
| Dropout (0.1–0.2 on attention & FFN) | Mitigates overfitting. |
| Stochastic Depth (DropPath) | Randomly skips entire layers during training. |
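Most of these are one‑liners in PyTorch. A hedged sketch of how they are typically wired in; the specific values simply mirror the table, not any particular published run:

import torch
import torch.nn as nn

# dropout=0.1 is applied inside the attention and feed-forward sub-layers
model = nn.TransformerEncoderLayer(d_model=768, nhead=12, dropout=0.1)

optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-4, weight_decay=0.1)  # decoupled weight decay
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)                            # label smoothing

# Stochastic depth / DropPath is usually applied per residual branch;
# torchvision.ops.stochastic_depth provides a ready-made implementation.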

4. Optimisation

| Component | Detail | Why it matters |
|---|---|---|
| Optimizer | AdamW (Adam with decoupled weight decay). | The de‑facto standard for transformers; converges reliably. |
| Learning‑rate schedule | Cosine decay with linear warm‑up (e.g., 3 k steps); sketched below. | Stabilises early training, avoids catastrophic updates. |
| Batch size | Effective batch = gradient_accumulation_steps × local_batch × num_devices. | Larger batches improve the gradient estimate but need more memory. |
| Gradient clipping | Clip by norm (e.g., 1.0). | Prevents exploding gradients in long sequences. |
| Mixed precision | FP16 with loss scaling, or BF16 on TPUs and recent GPUs. | Cuts memory, speeds up training, keeps the loss stable. |
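The warm‑up + cosine‑decay schedule can be expressed as a small lambda over the step count. A sketch assuming `model` is the transformer defined earlier and illustrative step counts (warmup_steps=3000, total_steps=300000):

import math
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-4, weight_decay=0.1)

warmup_steps, total_steps = 3_000, 300_000

def lr_lambda(step):
    if step < warmup_steps:                          # linear warm-up
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay to 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Per training step: clip gradients before the optimizer update, then advance the schedule.
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()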

Example hyper‑parameters for a 175B GPT‑3‑style run:

| Param | Value |
|---|---|
| hidden_dim | 12288 |
| num_heads | 96 |
| num_layers | 96 |
| seq_len | 2048 |
| vocab_size | 50257 |
| batch_size (per GPU) | 8 × 2048 tokens |
| accum_steps | 16 |
| learning_rate | 0.00025 |
| weight_decay | 0.1 |
| warmup_steps | 250k |
| total_steps | 300k |
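To make the batch‑size row concrete, here is the token count per optimiser step implied by this table, assuming (purely for illustration) 128 data‑parallel GPUs:

tokens_per_gpu_batch = 8 * 2048   # batch_size (per GPU): 8 sequences of 2048 tokens
accum_steps = 16
num_gpus = 128                    # illustrative device count, not part of the table

tokens_per_step = tokens_per_gpu_batch * accum_steps * num_gpus
print(tokens_per_step)            # 33,554,432 ≈ 33.5M tokens per optimiser step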

5. Distributed Training & Hardware

5.1 Parallelism Strategies

| Strategy | What it splits | Typical use‑case |
|---|---|---|
| Data Parallelism (DDP) | Input batches | Simple; scales well as long as a full model replica fits on each GPU |
| Tensor Parallelism | Weight matrices inside a layer | Needed when individual layers are too large for one device (e.g., GPT‑3, PaLM) |
| Pipeline Parallelism | Stages of the model across devices | Helps when the whole model exceeds one GPU's memory but each stage fits |
| Sharded Optimizer State (ZeRO) | Optimizer state and gradients (AdamW moments) | Cuts per‑GPU memory without changing the training maths |
| Off‑loading (CPU, NVMe) | Activations, gradients, optimizer state | Allows scaling beyond GPU memory limits, at a bandwidth cost |

Popular libraries: DeepSpeed (ZeRO‑3) and NVIDIA Megatron‑LM (tensor + pipeline parallelism), among others.

5.2 Hardware Landscape

| Hardware | Memory per device | Throughput | Notes |
|---|---|---|---|
| NVIDIA A100 80 GB | 80 GB | ~312 TFLOP/s FP16 | Standard for 70–175 B models |
| NVIDIA H100 80 GB | 80 GB | ~1000 TFLOP/s FP16, ~2000 TFLOP/s FP8 | Higher bandwidth, supports FP8 training |
| Google TPU v4 | 32 GB | ~275 TFLOP/s BF16 | Native BF16, designed for pod‑scale training |
| AMD Instinct MI250X | 128 GB | ~383 TFLOP/s FP16 | Good for pipeline parallelism |

Memory budgeting:

total_mem ≈ weights + gradients + optimizer state + activations

With mixed‑precision AdamW, weights, gradients, and optimizer state come to roughly 16 bytes per parameter; a typical rule of thumb is to keep activations to ~40 % of total memory. Activation checkpointing (aka recomputation) trades compute for memory.
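A back‑of‑the‑envelope version of that budget, assuming mixed‑precision AdamW (FP16 weights and gradients plus FP32 master weights and two moment buffers) and ignoring activations:

def adamw_state_bytes_per_param():
    # FP16 weight (2) + FP16 gradient (2) + FP32 master weight (4)
    # + FP32 Adam first moment (4) + FP32 second moment (4)
    return 2 + 2 + 4 + 4 + 4   # = 16 bytes

params = 175e9                                    # 175B-parameter model
weights_grads_optimizer = params * adamw_state_bytes_per_param()
print(f"{weights_grads_optimizer / 1e12:.1f} TB before activations")   # ~2.8 TB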


6. Training Workflow (Pseudo‑code)

# Simplified training loop using DeepSpeed ZeRO-3

import deepspeed
import torch
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=50257,
    n_embd=12288,
    n_layer=96,
    n_head=96,
    bos_token_id=50256,
    eos_token_id=50256,
)

# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler);
# the engine wraps the model and owns backward()/step().
model = GPT2LMHeadModel(config)
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=deepspeed_args,
    model=model,
    model_parameters=model.parameters(),
    config=deepspeed_config,
)

data_loader = build_dataset(...)

for epoch in range(num_epochs):
    for step, batch in enumerate(data_loader):
        inputs = batch['input_ids'].to(model_engine.device)
        labels = batch['labels'].to(model_engine.device)

        outputs = model_engine(inputs, labels=labels)
        loss = outputs.loss  # DeepSpeed scales the loss for gradient accumulation internally

        model_engine.backward(loss)
        model_engine.step()  # optimizer step, zero_grad, and LR schedule in one call

        if step % 100 == 0:
            print(f'Epoch {epoch} Step {step} Loss {loss.item()}')

Key points:

  1. DeepSpeed ZeRO‑3 shards optimizer state, gradients, and parameters across devices.
  2. Gradient accumulation lets you simulate a huge batch on a few GPUs.
  3. Mixed precision (FP16/BF16) is enabled through the DeepSpeed config (see the sketch below), not automatically.
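For reference, a minimal sketch of the deepspeed_config assumed by the loop above; the keys are standard DeepSpeed options, but the values simply mirror the earlier hyper‑parameter table and will differ for a real run:

deepspeed_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 16,
    "gradient_clipping": 1.0,
    "zero_optimization": {"stage": 3},   # shard params, grads, and optimizer state
    "fp16": {"enabled": True},           # or "bf16": {"enabled": True} on supporting hardware
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 2.5e-4, "weight_decay": 0.1},
    },
}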

7. Post‑Training

7.1 Fine‑tuning & Adaptation

| Approach | When to use | Typical pipeline |
|---|---|---|
| LoRA (Low‑Rank Adaptation) | Adapting to a new domain with few trainable parameters | Freeze the base model, add rank‑r adapters to each attention/feed‑forward projection; train only the adapters (see the sketch below). |
| Prefix Tuning | Prompt‑style adaptation for few‑shot tasks | Prepend learnable prefix vectors to the attention keys and values. |
| P‑tuning | Fine‑tune the prompt embeddings | Insert a small learnable token sequence into the input. |
| Full‑model fine‑tune | Large domain shift or supervised task | Train all weights; requires the most compute. |
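A minimal sketch of the LoRA idea: wrap one frozen linear projection with a rank‑r update. In practice you would use a library such as PEFT rather than hand‑rolling this, and the module path in the usage comment is hypothetical:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = base(x) + (x A^T B^T) * (alpha / r): base frozen, only A and B train."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                       # freeze the pretrained projection
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)    # down-projection
        self.B = nn.Parameter(torch.zeros(base.out_features, r))          # up-projection, zero init => no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

# Example (hypothetical module path): wrap the query projection of one attention block
# block.attn.q_proj = LoRALinear(block.attn.q_proj, r=8)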

7.2 RLHF (Reinforcement Learning from Human Feedback)

  1. Reward model – trained on human preference rankings with a pairwise ranking loss (preferred vs. rejected responses); see the sketch after this list.
  2. Proximal Policy Optimization – fine‑tune the language model to maximise reward while a KL penalty keeps it close to the original policy.
  3. Safety constraints – policy‑based or rejection sampling.
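A minimal sketch of the reward‑model objective from step 1; the scalar scores per response are assumed to come from a placeholder reward model that is not shown here:

import torch
import torch.nn.functional as F

def reward_ranking_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor):
    """Maximise the margin between the human-preferred and the rejected response."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# reward_chosen / reward_rejected: one scalar score per response pair, shape (batch,)
loss = reward_ranking_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))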

8. Evaluation & Metrics

| Metric | What it tells you |
|---|---|
| Perplexity | Exponential of the average negative log‑likelihood on validation data; lower is better. |
| BLEU / ROUGE | N‑gram overlap for summarisation / translation tasks. |
| Zero‑shot benchmarks (SuperGLUE, MMLU, etc.) | General knowledge and reasoning ability. |
| Calibration error | How well probability estimates match empirical correctness. |
| Bias & toxicity tests | Detects harmful or skewed outputs. |

Tip: For large models, perplexity alone is insufficient; use a suite of downstream tasks and human evaluation.
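Since the training loss is already a mean token‑level cross‑entropy (in nats), validation perplexity is just its exponential; a minimal sketch with an illustrative loss value:

import math

mean_val_cross_entropy = 2.19            # illustrative mean CE over the validation set, in nats
perplexity = math.exp(mean_val_cross_entropy)
print(f"perplexity = {perplexity:.1f}")  # ~8.9: roughly as uncertain as an 8.9-way uniform choice per token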


9. Common Pitfalls & Gotchas

| Pitfall | Fix |
|---|---|
| Memory fragmentation | Use contiguous tensors; avoid in‑place ops that break the autograd graph. |
| Gradient explosion on long contexts | Clip gradients by norm; adaptive optimisers such as AdamW help. |
| Over‑fitting to short‑text datasets | Mix in longer documents; use longer context windows during training. |
| Data leakage (the model memorises and regurgitates training data) | Deduplication, privacy‑preserving filtering. |
| Inadequate hyper‑parameter tuning | Automate sweeps with Optuna or Ray Tune; monitor learning‑rate dynamics. |

10. Quick Reference: Tooling

| Tool | Purpose | Language |
|---|---|---|
| TensorFlow / PyTorch | Core DL frameworks | Python |
| DeepSpeed / ZeRO | Model & optimizer sharding | Python |
| Megatron‑LM | Large‑scale transformer training | Python |
| JAX / Flax | TPU‑native training | Python |
| NVMe off‑load | Gradient/activation off‑loading | C/C++ |
| Dask / Ray | Distributed data prep | Python |
| MLflow / Weights & Biases | Experiment tracking | Python |

Bottom line

Training an LLM is essentially training a gigantic, multi‑layer transformer on a massive, clean, tokenised corpus, using distributed, memory‑efficient parallelism, and a carefully tuned optimisation loop. Once you master the data pipeline and the distributed system, the rest is largely a matter of hyper‑parameter sweeps and scaling experiments.

Happy training! 🚀
