The foundational research behind modern AI — 23 papers, each summarized in plain English with a link to the full text. Browse by era, sort the list, or follow the Timeline to see how each discovery led to the next.
The neural-net building blocks everything else stands on.
The 'deep learning works' moment that launched the modern era.
AlexNet achieves breakthrough ImageNet classification by combining deep convolution with ReLU, dropout, and GPU-accelerated training.
AlexNet won the 2012 ImageNet competition (ILSVRC-2012) with a deep convolutional neural network of 60 million parameters, trained on 1.2 million images across 1000 classes. On the earlier ILSVRC-2010 test set the model reached 37.5% top-1 error and 17.0% top-5 error, well ahead of the best prior result of 45.7% top-1. The paper introduced several key innovations: ReLU activations, which reached a fixed training error about 6× faster than tanh units in the authors' CIFAR-10 experiment; a dual-GPU parallelization scheme to work around memory limits; local response normalization to improve generalization; and overlapping pooling. To limit overfitting in such a large model, the authors used two data augmentation strategies (random crops with horizontal reflections, which enlarged the training set 2048×, and PCA-based color jittering) and dropout regularization (randomly zeroing 50% of hidden units during training). The network was trained on two GTX 580 GPUs for 5–6 days using SGD with momentum.
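As a rough illustration of two of the ingredients named above, here is a minimal sketch of ReLU and dropout on plain Python lists. The function names and the `rng` parameter are assumptions made for this example; scaling activations at test time (rather than during training) follows the convention described in the AlexNet paper, and real implementations operate on tensors rather than lists.

```python
import random

def relu(values):
    """ReLU: keep positive activations, zero out the rest."""
    return [max(0.0, v) for v in values]

def dropout(values, drop_prob=0.5, training=True, rng=None):
    """Randomly zero units during training; scale activations at test time."""
    if not training:
        # Test time: every unit is kept, so scale by the keep probability.
        return [v * (1.0 - drop_prob) for v in values]
    if rng is None:
        rng = random.Random()
    return [0.0 if rng.random() < drop_prob else v for v in values]
```

With `drop_prob=0.5`, roughly half of the units are zeroed on each training pass, so the network cannot rely on any single hidden unit being present.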
Recurrent architecture that remembers across long sequences; dominant before transformers.
LSTM introduces gated memory cells to enable recurrent networks to learn long-range dependencies by controlling gradient flow.
This paper introduces Long Short-Term Memory (LSTM), a recurrent neural network architecture designed to solve the vanishing gradient problem that prevented standard RNNs from learning long-range dependencies. The core idea is a memory cell guarded by learned gates that control information flow, so error signals can propagate over many time steps without exponential decay. The original paper used input and output gates; the forget gate that is now part of the standard formulation was added in later work. The gates are learned functions that decide what to store and what to expose at each time step. LSTMs learned tasks requiring memory over more than 1000 time steps, far beyond what vanilla RNNs could achieve. This architecture became the practical standard for sequence modeling (speech, language, time series) for roughly two decades, until attention-based transformers emerged. The paper's contribution was fundamentally architectural: it showed that carefully designed connectivity and learned gating could preserve information across long sequences, making recurrent models viable for real-world sequential data.
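The gating idea can be sketched with a single scalar memory cell. This is a toy version of the now-common formulation with input, forget, and output gates, not the exact equations of the 1997 paper; the `params` layout (one `(w_x, w_h, bias)` triple per gate) is an assumption made for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell.

    params maps "input", "forget", "output", "candidate" to a
    (w_x, w_h, bias) triple; this layout is hypothetical.
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))       # how much new content to write
    f = sigmoid(pre("forget"))      # how much old memory to keep
    o = sigmoid(pre("output"))      # how much memory to expose
    g = math.tanh(pre("candidate")) # proposed new content
    c = f * c_prev + i * g          # additive cell update
    h = o * math.tanh(c)
    return h, c
```

When the forget gate saturates near 1 and the input gate near 0, the cell state is carried forward almost unchanged, which is the mechanism that lets information survive across many steps.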
The backpropagation algorithm that makes training neural networks possible.
Backpropagation enables efficient gradient computation in multi-layer neural networks via reverse-mode automatic differentiation.
This paper popularized backpropagation, the foundational method for training multi-layer neural networks. The problem: networks with hidden layers could not be trained effectively because there was no practical way to compute gradients for weights in layers that do not directly produce the output. The core idea is to use the chain rule of calculus to propagate error signals backward through the network, layer by layer, so that the contribution of each weight to the final error can be computed. With these gradients, all weights can be updated by gradient descent. The key result is a demonstration that networks with hidden layers can learn non-linear mappings on benchmark problems (like XOR) that single-layer networks cannot. This removed the barrier to training multi-layer networks and became the standard algorithm for neural network optimization, enabling the deep learning era decades later.
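The backward pass described above can be sketched for a tiny network with one sigmoid hidden layer and one sigmoid output, trained on squared error. The parameter layout `(W1, b1, W2, b2)` and the function names are assumptions made for this example; it illustrates the chain rule and is not the notation of the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    W1, b1, W2, b2 = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)
    return hidden, output

def loss(params, x, target):
    _, out = forward(params, x)
    return 0.5 * (out - target) ** 2

def backward(params, x, target):
    """Gradients of the loss with respect to every weight and bias."""
    W1, b1, W2, b2 = params
    hidden, out = forward(params, x)
    # Error signal at the output unit (chain rule through the sigmoid).
    delta_out = (out - target) * out * (1.0 - out)
    grad_W2 = [delta_out * h for h in hidden]
    grad_b2 = delta_out
    # Propagate the error signal backward through W2 to the hidden units.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(W2, hidden)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    grad_b1 = list(delta_hidden)
    return grad_W1, grad_b1, grad_W2, grad_b2
```

A common sanity check is to compare these analytic gradients with finite differences of `loss`; the two should agree to several decimal places.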
Embeddings, sequence models, and the first attention mechanisms.
Residual connections that let very deep networks train; relied on by transformers.
Skip connections enable training of very deep networks (152 layers) by learning residual functions, winning ImageNet 2015 with 3.57% top-5 error.
Deep neural networks become harder to train as you add more layers, even when theory suggests deeper should be better. This paper introduces residual connections (skip connections) that let each block learn the *difference* from its input rather than an entirely new representation. Instead of stacking layers in the traditional way, ResNets reformulate training so that a block learns a residual function F(x) that is added to the original input x, giving output F(x) + x. This simple architectural change makes it practical to train networks 8× deeper than previous work (152 layers vs. 19 for VGG) while still having lower computational complexity. On ImageNet, an ensemble of these residual networks achieved 3.57% top-5 error on the test set, winning ILSVRC 2015. The approach also yielded a 28% relative improvement in object detection on COCO, showing that the gains from depth carry over to other visual recognition tasks.
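The F(x) + x formulation can be sketched in a few lines. Here `transform` stands in for the learned layers inside a block (convolutions, normalization, and activations in the real architecture); the function name and the list-based vectors are assumptions made for this example.

```python
def residual_block(x, transform):
    """Return F(x) + x, where transform plays the role of the learned F."""
    fx = transform(x)
    return [f + xi for f, xi in zip(fx, x)]

def stack(x, transforms):
    """Apply a sequence of residual blocks one after another."""
    for transform in transforms:
        x = residual_block(x, transform)
    return x
```

If every `transform` outputs zeros, the whole stack reduces to the identity, no matter how many blocks it has. This is one way to see why adding blocks need not make the network harder to fit.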
The optimizer almost everything is still trained with.
Adam combines momentum and adaptive learning rates by tracking gradient means and variances per parameter, enabling effective training with minimal hyperparameter tuning.
Adam is an adaptive learning rate optimizer for stochastic gradient-based training of neural networks. It maintains exponential moving averages of the first and second moments of the gradients (the mean and the uncentered variance) and uses them, after bias correction, to set a separate effective step size for each parameter. This lets it handle sparse gradients, non-stationary objectives, and noisy data with little manual learning rate tuning. The algorithm is simple to implement, memory-efficient, and invariant to diagonal rescaling of the gradients. Kingma and Ba give a regret bound under online convex optimization assumptions and report experiments on logistic regression, multilayer networks, and convolutional networks in which Adam outperforms or matches SGD with momentum and other adaptive methods like AdaGrad.
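The update rule can be sketched directly from the description above. The default hyperparameters below (step size 0.001, decay rates 0.9 and 0.999, epsilon 1e-8) are the values suggested in the Adam paper; the function name `adam_minimize` and the `grad_fn` callback are assumptions made for this example.

```python
import math

def adam_minimize(grad_fn, theta, steps, lr=0.001,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam for a fixed number of steps and return the parameters."""
    theta = list(theta)
    m = [0.0] * len(theta)  # running mean of gradients
    v = [0.0] * len(theta)  # running mean of squared gradients
    for t in range(1, steps + 1):
        grads = grad_fn(theta)
        for i, g in enumerate(grads):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g
            v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
            # Bias correction compensates for the zero initialization.
            m_hat = m[i] / (1.0 - beta1 ** t)
            v_hat = v[i] / (1.0 - beta2 ** t)
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta
```

Because each step divides by the square root of the second-moment estimate, multiplying a gradient by a constant leaves the step almost unchanged, which is the rescaling invariance mentioned above.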
The actual origin of the attention mechanism, three years before the Transformer.
Attention mechanism enables decoders to dynamically focus on source sentence parts, eliminating the fixed-length vector bottleneck in neural machine translation.
This paper identifies a critical limitation in early neural machine translation systems: the fixed-length vector bottleneck. Encoder-decoder models compressed entire source sentences into a single fixed-size vector before decoding, losing information and struggling with long sentences. The authors introduce the attention mechanism, which allows the decoder to dynamically focus on relevant parts of the source sentence when generating each target word. Rather than explicitly segmenting the source, the model learns soft alignments—weighted attention distributions over all source words. Tested on English-to-French translation, this approach matches state-of-the-art phrase-based statistical systems while enabling interpretable alignment patterns that align with linguistic intuition. The attention mechanism became foundational to modern sequence-to-sequence architectures.
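The soft alignment step can be sketched as follows: score every source position against the current decoder state, normalize the scores with a softmax, and take the weighted average of the encoder states as the context vector. The additive scoring form (a small tanh layer) follows the general shape used in the paper, but the parameter names `W`, `U`, and `v` and the list-based vectors are assumptions made for this example.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_score(decoder_state, encoder_state, W, U, v):
    """Score one source position: v . tanh(W s + U h)."""
    hidden = []
    for w_row, u_row in zip(W, U):
        a = sum(w * s for w, s in zip(w_row, decoder_state))
        b = sum(u * h for u, h in zip(u_row, encoder_state))
        hidden.append(math.tanh(a + b))
    return sum(vi * hi for vi, hi in zip(v, hidden))

def attend(decoder_state, encoder_states, W, U, v):
    """Return (alignment weights, context vector) for one decoding step."""
    scores = [additive_score(decoder_state, h, W, U, v)
              for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context
```

The weights form a probability distribution over source positions, which is what makes the learned alignments easy to inspect.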
Encoder-decoder framing for translation and generation.
A deep (four-layer) LSTM encoder-decoder with reversed source sentences outperforms a phrase-based SMT baseline on WMT'14 English-to-French translation.
This paper introduces a practical end-to-end neural approach to sequence-to-sequence learning using deep LSTMs. The core idea is simple: use one LSTM to encode a variable-length input sequence into a fixed-dimensional vector, then use a second LSTM to decode the target sequence from that vector. Tested on WMT'14 English-to-French translation, the model achieves 34.8 BLEU score on direct translation (beating the 33.3 SMT baseline) and 36.5 BLEU when rescoring SMT n-best lists. A key practical finding is that reversing the word order of source sentences improves performance significantly (test perplexity drops from 5.8 to 4.7, BLEU improves from 25.9 to 30.6), because it introduces short-term dependencies that make optimization easier. The model also handles long sentences well—contrary to expectations—and learns meaningful sentence representations sensitive to word order and voice.
Words as vectors; meaning as geometry.
Word2Vec proposes CBOW and Skip-gram architectures that efficiently learn word embeddings at scale, achieving about 66% accuracy on semantic-syntactic word relationships at far lower computational cost than earlier neural language models.
This paper introduces Word2Vec, two efficient neural architectures for learning word embeddings from massive datasets. The core problem: previous methods for generating word vectors were computationally expensive, limiting scale to a few hundred million words. The authors propose CBOW (Continuous Bag-of-Words), which predicts a target word from surrounding context words, and Skip-gram, which predicts surrounding words from a target word. Both use log-linear classifiers with hierarchical softmax to reduce the dominant cost from O(H×V) to O(H×log(V)), where H is the hidden layer size and V is the vocabulary size. The models are trained on up to 6 billion words from Google News and learn vectors of 300 to 1000 dimensions; the paper reports that high-quality vectors can be learned from a 1.6-billion-word dataset in less than a day. The key result: Skip-gram achieves about 66% accuracy on a new 19,544-question semantic-syntactic test set (measuring relationships like "Paris is to France as Rome is to Italy"), compared to 24% for previous RNNLM approaches. The vectors capture meaningful linear relationships: subtracting one word vector and adding another recovers semantic analogies. This matters because efficient, scalable word embeddings became foundational infrastructure for many downstream NLP tasks.
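The difference between the two objectives is easiest to see in the training examples each one builds from a sentence. The sketch below uses a fixed context window for simplicity, which is an assumption of this example; the function names are also illustrative.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each (center, context) pair is one prediction target."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: the surrounding words jointly predict the word in the middle."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples
```

For the sentence `["the", "cat", "sat"]` with `window=1`, Skip-gram yields four (center, context) pairs, while CBOW yields three examples, one per word.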
The architecture that became the hinge point of the field.
The Transformer architecture; the single hinge point of the field.
Transformer architecture replaces recurrence and convolution with pure attention mechanisms, achieving better translation quality with faster parallel training.
This paper introduces the Transformer architecture, a fundamentally new approach to sequence modeling that replaces recurrent neural networks (RNNs) and convolutional layers with pure attention mechanisms. Prior translation systems used RNNs or CNNs in encoder-decoder setups; recurrent models in particular process tokens one after another, which makes them slow to train. The Transformer uses multi-head self-attention to let all positions in a sequence attend to each other in parallel, enabling significantly faster training without sacrificing quality. On WMT 2014 English-to-German translation it achieves 28.4 BLEU, more than 2 BLEU above the prior best results, and on English-to-French it sets a new single-model state of the art of 41.8 BLEU after training for just 3.5 days on 8 GPUs, a fraction of the training cost of the previous best models. The authors also show that the architecture generalizes beyond translation to constituency parsing.
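The core operation, scaled dot-product attention, can be sketched on plain Python lists. This toy version omits the learned query, key, and value projections, the multiple heads, and masking, so it illustrates the mechanism only; the function names are assumptions made for this example.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, one query row at a time."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs

def self_attention(X):
    """Every position attends to every position of the same sequence."""
    return scaled_dot_product_attention(X, X, X)
```

Each output row depends only on the inputs, not on the previous output row, so all rows can be computed at the same time. That independence is what allows parallel training.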
Pretraining at scale and the rise of large language models.
In-context learning; the paper that started the current wave.
GPT-3 (175B parameters) achieves competitive few-shot performance on diverse NLP tasks without fine-tuning, showing scale alone enables task-agnostic learning from text demonstrations.
This paper introduces GPT-3, a 175-billion-parameter autoregressive language model trained on large-scale text data. The key finding is that scaling to this size enables strong few-shot learning—performing new tasks with only a few examples provided in text prompts, without any gradient-based fine-tuning. The authors test GPT-3 on diverse NLP benchmarks including translation, question-answering, arithmetic, and reasoning tasks. Results show that GPT-3 often matches or approaches performance of prior fine-tuned models while remaining task-agnostic. The work demonstrates that task-specific training data and fine-tuning become less critical as model scale increases, fundamentally challenging the then-standard paradigm of pre-train-then-fine-tune. The paper also documents failure cases and notes that GPT-3 can generate realistic synthetic news, raising societal concerns.
Performance as a predictable function of compute, data, and model size.
Language model loss follows power-law scaling with model size, dataset size, and compute; larger models are more sample-efficient when compute budgets are fixed.
This paper systematically measures how neural language model performance improves with scale. The researchers trained models ranging from 768 to 1.5 billion non-embedding parameters on datasets ranging from 22 million to 23 billion tokens, measuring cross-entropy loss, with some trends spanning more than seven orders of magnitude. They found power-law relationships: loss decreases predictably as you increase model size, dataset size, or total compute. Crucially, they showed that architectural choices like width and depth matter far less than these three scaling factors. The paper also derives a practical formula for optimally distributing a fixed compute budget: larger models trained on a relatively modest amount of data and stopped well before convergence are more sample-efficient and reach lower losses than smaller models trained to convergence. This finding contradicted conventional wisdom and directly influenced the training strategies of subsequent large language models.
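A power law in model size can be sketched as L(N) = (N_c / N)^alpha. The constants below are illustrative placeholders chosen for this example, not the fitted values reported in the paper, and the function names are assumptions.

```python
import math

def power_law_loss(n_params, n_c=1.0e13, alpha=0.08):
    """Loss as a power law in parameter count (placeholder constants)."""
    return (n_c / n_params) ** alpha

def fit_exponent(n1, loss1, n2, loss2):
    """Recover alpha from two (size, loss) points on a pure power law."""
    return -math.log(loss2 / loss1) / math.log(n2 / n1)
```

On a log-log plot such a relationship is a straight line with slope -alpha, so each doubling of model size multiplies the loss by the same constant factor, 2 to the power -alpha.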
Reframed every NLP task as text-to-text.
T5 unifies NLP tasks as text-to-text problems and systematically explores transfer learning design choices, achieving SOTA results through scale and clean pre-training data.
This paper introduces T5 (Text-to-Text Transfer Transformer), a unified framework that reformulates all NLP tasks as text-to-text problems—feeding text input to a sequence-to-sequence model and generating text output. The authors conduct a comprehensive empirical study comparing different pre-training objectives (like denoising and language modeling), encoder-decoder architectures, dataset sizes, and fine-tuning strategies across dozens of tasks including summarization, question answering, and classification. They introduce C4, a large cleaned web corpus for pre-training, and show that scaling up both model size and data significantly improves performance. By systematically optimizing these components and combining them at scale, T5 achieves state-of-the-art results across GLUE, SuperGLUE, SQuAD, and other benchmarks. The work provides practical guidance on transfer learning design decisions for NLP practitioners and releases models, datasets, and code to enable reproducibility.
Scaling generative pretraining; emergent zero-shot task ability.
Large Transformer language models trained on diverse web text learn to perform multiple NLP tasks zero-shot via language modeling alone, without task-specific supervision or architecture changes.
This paper introduces GPT-2, a 1.5B parameter Transformer language model trained on WebText, a 40GB dataset of web documents filtered by Reddit upvotes. The key contribution is demonstrating that large language models can perform diverse NLP tasks without explicit task-specific training or fine-tuning (zero-shot transfer). By training purely on next-token prediction across varied domains, GPT-2 learns to handle reading comprehension, machine translation, summarization, and question answering when conditioned on a suitable natural language prompt. In the zero-shot setting the model achieves state-of-the-art results on 7 of 8 tested language modeling datasets (including LAMBADA and the Children's Book Test), improves on the Winograd Schema Challenge, and reaches 55 F1 on CoQA reading comprehension without using its 127K training examples. Results show a consistent scaling trend: larger models improve zero-shot performance log-linearly across tasks, suggesting that capacity is essential for unsupervised multitask learning. The paper also includes careful analysis of data overlap between training and test sets, finding a small but consistent benefit from incidental overlap.
Bidirectional pretraining; dominated NLP benchmarks for years.
Bidirectional pre-training on masked language modeling enables transformers to achieve state-of-the-art results across diverse NLP tasks with minimal fine-tuning.
BERT introduces a bidirectional pre-training method for transformer-based language models that fundamentally changed how NLP systems are built. Previous models like GPT used left-to-right context only; BERT conditions on both left and right context simultaneously across all layers during pre-training on unlabeled text. The model uses two simple training objectives: masked language modeling (randomly masking tokens and predicting them) and next sentence prediction (predicting whether the second of two sentences actually follows the first). After pre-training on large unlabeled corpora, BERT can be fine-tuned with minimal task-specific modifications, typically just one additional output layer, to achieve state-of-the-art results across diverse NLP tasks. The paper reports new state-of-the-art results on 11 NLP tasks, including GLUE (80.5%, +7.7 points), MultiNLI (86.7%, +4.6 points), SQuAD v1.1 (93.2 F1, +1.5), and SQuAD v2.0 (83.1 F1, +5.1).
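The masking step of masked language modeling can be sketched as below. The 15% selection rate and the 80/10/10 split between the mask token, a random token, and the unchanged token follow the recipe described in the BERT paper; the function name, the `vocab` argument, and the use of `None` for positions that are not predicted are assumptions made for this example.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted inputs, labels); labels are None where no
    prediction is required."""
    if rng is None:
        rng = random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append("[MASK]")
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # random replacement
            else:
                inputs.append(tok)  # left unchanged on purpose
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels
```

Because the model sees the whole corrupted sentence at once, it can use words on both sides of a masked position when predicting it, which is the bidirectional conditioning described above.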
The pretrain-then-finetune recipe for language.
Two-stage pre-training and fine-tuning with a Transformer language model achieves state-of-the-art on diverse NLP tasks with minimal architecture changes.
GPT-1 demonstrates that large performance gains on NLP tasks can be achieved by combining unsupervised language model pre-training with supervised fine-tuning. The paper trains a 12-layer Transformer decoder on the BooksCorpus (7,000+ books with long contiguous text) using standard language modeling loss. After pre-training, the model is fine-tuned on downstream tasks—natural language inference, question answering, semantic similarity, and text classification—by adding only a linear output layer and using task-specific input transformations that convert structured inputs into token sequences. The approach achieves state-of-the-art results on 9 of 12 evaluated benchmarks, with notable improvements: 8.9% on StoryCloze (commonsense reasoning), 5.7% on RACE (reading comprehension), 1.5% on MultiNLI (textual entailment), and 5.5% on GLUE. The paper shows that the Transformer architecture's structured attention mechanism is crucial—replacing it with an LSTM drops average performance by 5.6 points—and that each pre-trained layer transfers useful functionality for solving target tasks.
Scaling, alignment, efficiency, and reasoning.
Reasoning emerging from pure RL without supervised fine-tuning; current frontier of reasoning-model training.
Pure RL trains LLMs to develop reasoning without human demonstrations, achieving state-of-the-art math and coding performance through emergent self-verification and adaptive strategies.
DeepSeek-R1 addresses a fundamental challenge in LLM reasoning: existing models rely heavily on human-annotated chain-of-thought demonstrations, which are expensive and limit capability gains. The paper demonstrates that reasoning abilities can be developed through pure reinforcement learning (RL) without human-labeled reasoning trajectories. The RL framework trains models to solve verifiable tasks (math, competitive coding, STEM problems) by rewarding correct outputs, which causes emergent reasoning patterns like self-reflection, answer verification, and adaptive strategy selection to spontaneously develop. The resulting model outperforms supervised baselines trained on human demonstrations across multiple benchmarks. Additionally, the reasoning patterns learned by the large model can be distilled to improve smaller models, making advanced reasoning accessible without scaling to massive models.
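Rewarding verifiable outputs can be sketched with a rule-based check plus a normalization step across several sampled answers to the same problem. The exact-match reward and the group-normalized advantage below are simplifying assumptions for illustration (the normalization is in the style of group-relative policy optimization); the summary above does not specify these details, and the function names are hypothetical.

```python
import statistics

def accuracy_reward(model_answer, reference_answer):
    """1.0 if the final answer matches the reference exactly, else 0.0."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def group_advantages(rewards):
    """Score each sampled answer relative to the others in its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All samples were equally good or bad: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

Answers that beat the group average receive a positive advantage and are reinforced; no human-written reasoning trace is needed, only a way to check the final answer.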
The open-weight model that catalyzed the open ecosystem.
Openly released 7B-65B parameter models trained on public data match or beat much larger proprietary LLMs on many standard benchmarks.
LLaMA introduces a family of openly released language models ranging from 7B to 65B parameters, trained exclusively on publicly available datasets. The key contribution is demonstrating that training smaller models on more data can yield models that rival much larger proprietary ones: LLaMA-13B beats GPT-3 (175B parameters) on most benchmarks, and LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B. The authors train on 1.0 to 1.4 trillion tokens using a standard transformer architecture with a few modifications (pre-normalization with RMSNorm, SwiGLU activations, and rotary positional embeddings). By releasing the model weights to the research community, LLaMA broadened access to high-quality foundation models and enabled widespread community research and fine-tuning, in contrast with prior closed approaches from OpenAI and Google.
Eliciting step-by-step reasoning from large models.
Few-shot prompting with explicit reasoning steps dramatically improves LLM performance on math, commonsense, and symbolic reasoning tasks without any model training.
Chain-of-thought (CoT) prompting is a simple technique where language models are shown a few examples that explicitly work through intermediate reasoning steps before arriving at a final answer. Rather than jumping directly to answers, this approach encourages models to 'think out loud' by generating step-by-step reasoning. The paper demonstrates that when applied to large language models (tested on models up to 540B parameters), CoT prompting substantially improves performance on reasoning-heavy tasks including arithmetic word problems, commonsense reasoning, and symbolic manipulation. A concrete example: providing just eight CoT exemplars to a 540B model achieves state-of-the-art results on GSM8K (grade-school math word problems), outperforming even finetuned GPT-3 baselines that used separate verifier models. The key insight is that this emergent reasoning capability requires sufficient model scale—smaller models see minimal benefit—and that explicit intermediate steps are crucial for unlocking complex reasoning in otherwise standard language models.
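Building a chain-of-thought prompt amounts to string formatting: each exemplar shows a question, worked reasoning, and a final answer, and the new question is appended with the answer left open. The "Q:"/"A:" layout, the dictionary keys, and the function name are assumptions made for this example, and any exemplar passed in would be written by the user.

```python
def build_cot_prompt(exemplars, question):
    """Format few-shot exemplars with reasoning, then the new question."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: {ex['reasoning']} The answer is {ex['answer']}."
        )
    # The model is expected to continue with its own reasoning and answer.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```

The only difference from standard few-shot prompting is the `reasoning` text placed before each answer; no model weights are changed.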
Alignment via AI feedback rather than human labels (RLAIF).
Train harmless AI assistants using only written principles and AI-generated feedback, eliminating the need for human-labeled harmful examples.
This paper introduces Constitutional AI, a method for training AI assistants to be harmless without requiring human-labeled datasets of harmful outputs. Instead of human feedback on what is harmful, the approach uses a set of written principles (a 'constitution') to guide self-improvement. The process has two stages: first, a supervised learning phase in which the model critiques and revises its own outputs based on the constitution; then, a reinforcement learning phase in which a preference model trained on AI-generated comparisons (rather than human labels) provides the reward signal, enabling 'RL from AI Feedback' (RLAIF). The result is an assistant that engages with harmful requests by explaining its objections rather than giving evasive refusals. This greatly reduces the need for expensive human labeling of harmful content while maintaining or improving performance in human evaluations.
Corrected the data-vs-size tradeoff; reshaped how models are sized.
Model size and training tokens should be scaled equally for compute-optimal LLM training; current models are undertrained relative to their size.
This paper addresses a fundamental question in LLM training: given a fixed compute budget, how should you allocate it between model size and training data? The researchers trained over 400 transformer models ranging from 70M to 16B parameters on 5B to 500B tokens to find optimal allocation strategies. They found that most existing large language models were significantly undertrained: the field had been scaling model size while keeping the amount of training data roughly constant, which is suboptimal. The key finding: model size and training tokens should be scaled equally. For every doubling of model parameters, the number of training tokens should also double. They validated this with Chinchilla, a 70B-parameter model trained with the same compute budget as Gopher but on 4× more data, which outperformed larger models like Gopher (280B), GPT-3 (175B), and Megatron-Turing NLG (530B) on downstream tasks, achieving 67.5% on MMLU, an improvement of more than 7 percentage points over Gopher.
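The allocation rule can be worked through with a little arithmetic. The sketch assumes the common approximation that training cost is about 6 × parameters × tokens in FLOPs; that approximation, the function names, and the example numbers in the usage note are assumptions of this example rather than figures from the summary.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute, assuming C = 6 * N * D."""
    return 6.0 * n_params * n_tokens

def scale_equally(n_params, n_tokens, compute_multiplier):
    """Grow parameters and tokens by the same factor for a larger budget."""
    factor = compute_multiplier ** 0.5
    return n_params * factor, n_tokens * factor

def tokens_for_budget(flops, n_params):
    """Tokens affordable for a given model size under a fixed budget."""
    return flops / (6.0 * n_params)
```

Under this approximation, a 4× larger budget means 2× the parameters and 2× the tokens. Holding the budget fixed, shrinking a model from 280B to 70B parameters frees enough compute for 4× as many tokens, matching the Gopher and Chinchilla comparison above.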
The RLHF alignment recipe behind ChatGPT.
Fine-tune language models with human feedback (supervised learning + RLHF) to align them with user intent, achieving better outputs from much smaller models.
This paper addresses a critical problem: scaling up language models does not automatically make them better at following user instructions or intentions. Large models like GPT-3 often produce outputs that are untruthful, toxic, or unhelpful. The authors propose a fine-tuning approach, resulting in models called InstructGPT, to align language models with human preferences. First, they fine-tune GPT-3 using supervised learning on human demonstrations of desired behavior, written for both labeler-written prompts and prompts submitted through the API. Second, they collect human rankings of model outputs, train a reward model on those rankings, and apply reinforcement learning from human feedback (RLHF) to further optimize the model. The results are striking: outputs from a 1.3B parameter InstructGPT model are preferred in human evaluations over those from the 175B parameter GPT-3, despite the smaller model having about 100× fewer parameters. InstructGPT also shows improvements in truthfulness and reductions in toxic output, with minimal regression on standard NLP benchmarks. This work demonstrates that fine-tuning with human feedback is an effective path toward alignment.
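Turning human rankings into a training signal can be sketched with a pairwise loss: for every pair of responses, the reward model is penalized when the response ranked higher by the labeler does not receive the higher score. The loss form, the negative log-sigmoid of the score difference, matches the one described in the InstructGPT paper; the function names and the list-based interface are assumptions made for this example.

```python
import math
from itertools import combinations

def pairwise_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)), written with log1p."""
    diff = reward_preferred - reward_rejected
    return math.log1p(math.exp(-diff))

def ranking_loss(rewards_best_to_worst):
    """Average pairwise loss over all pairs implied by one ranking.

    The list holds reward-model scores ordered from the response the
    labeler ranked best to the one ranked worst.
    """
    pairs = list(combinations(rewards_best_to_worst, 2))
    return sum(pairwise_loss(a, b) for a, b in pairs) / len(pairs)
```

The loss is small when scores agree with the human ordering and grows when they disagree; the trained reward model then supplies the scalar reward for the reinforcement learning stage.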
Cheap parameter-efficient fine-tuning; foundational for practical adaptation.
LoRA cuts the number of trainable parameters by up to 10,000× by freezing the pretrained weights and training small low-rank matrices, matching full fine-tuning quality without inference overhead.
LoRA addresses the impractical cost of fine-tuning massive pre-trained language models by freezing the model weights and injecting small trainable rank-decomposition matrices (low-rank updates) into each Transformer layer. Instead of updating all 175B parameters of GPT-3, LoRA trains only a tiny fraction, reducing the number of trainable parameters by 10,000× and the GPU memory requirement by 3× compared to fine-tuning GPT-3 175B with Adam. The method performs on par with or better than full fine-tuning on RoBERTa, DeBERTa, GPT-2, and GPT-3 across downstream tasks, while offering higher training throughput and adding no inference latency, since the low-rank update can be merged into the frozen weights. The paper includes empirical analysis suggesting that the weight updates needed for adaptation have low intrinsic rank, which explains why the decomposition works.
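The low-rank update can be sketched for a single weight matrix: the frozen matrix W is left untouched and the adapted layer computes W x + B (A x), where A and B are the small trainable factors. The list-of-lists matrices, the function names, and the `scale` argument are assumptions made for this example.

```python
def matvec(matrix, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]

def lora_forward(W, A, B, x, scale=1.0):
    """Frozen path W x plus the trainable low-rank path B (A x)."""
    base = matvec(W, x)            # W: d_out x d_in, never updated
    low = matvec(B, matvec(A, x))  # A: r x d_in, B: d_out x r
    return [b + scale * l for b, l in zip(base, low)]

def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full fine-tuning vs. the low-rank factors."""
    full = d_out * d_in
    lora = rank * (d_in + d_out)
    return full, lora
```

For a hypothetical 4096 × 4096 matrix with rank 8, the factors hold 65,536 trainable values against 16,777,216 in the full matrix. When B starts at zero, the adapted layer initially reproduces the pretrained layer exactly.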
Sparse Mixture-of-Experts scaling, now standard in frontier models.
Simplified single-expert routing enables sparse trillion-parameter Transformers that train 4-7x faster than dense models with the same compute budget.
Switch Transformers simplify Mixture of Experts (MoE) routing to scale language models to a trillion parameters while maintaining constant computational cost. The key insight: instead of routing each input to multiple experts and blending their outputs, route each token to a single expert via a simplified learned routing function. This reduces communication overhead and training instability compared to prior MoE work. The authors introduce training techniques (selective precision, expert dropout, load balancing) that stabilize training of sparse models in lower precision (bfloat16). Empirically, Switch Transformers based on T5 achieve 7x pre-training speedup with identical FLOPs, and extend effectively to 101 languages in multilingual settings. The largest model (1 trillion parameters) trains 4x faster than T5-XXL, demonstrating that sparsity can scale to extreme model sizes without the traditional complexity and communication costs that limited prior MoE adoption.
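Single-expert routing can be sketched as follows: a router scores each token against every expert, the token is sent only to the top-scoring expert, and that expert's output is scaled by the router probability. Load balancing, capacity limits, and batching are omitted, and the function names and data layout are assumptions made for this example.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def switch_route(token, router_weights):
    """Pick the single best expert for a token; return (index, gate)."""
    logits = [sum(w * t for w, t in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]

def switch_layer(tokens, router_weights, experts):
    """Run each token through exactly one expert, scaled by its gate."""
    outputs = []
    for token in tokens:
        index, gate = switch_route(token, router_weights)
        expert_out = experts[index](token)
        outputs.append([gate * value for value in expert_out])
    return outputs
```

Adding more experts increases the total parameter count, but each token still passes through only one of them, so the compute per token stays roughly constant.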
Updated 6/3/2026, 7:59:50 PM