Papers

The foundational research behind modern AI — 53 papers, each summarized in plain English with a link to the full text. Browse by era, sort the list, or follow the Timeline to see how each discovery led to the next.

Foundations

The neural-net building blocks everything else stands on.

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

2015 · Sergey Ioffe, Christian Szegedy · arXiv:1502.03167

Read

Normalizing layer inputs within mini-batches stabilizes training, enables higher learning rates, and accelerates deep network convergence by 14×.

During neural network training, the distribution of inputs to each layer shifts as the weights of earlier layers update, a problem the authors call internal covariate shift. This instability forces practitioners to use small learning rates and careful parameter initialization, and makes networks with saturating activations (like sigmoid) especially hard to train. The paper proposes batch normalization: normalizing each layer's inputs over the mini-batch during training, then scaling and shifting them with learnable parameters. This stabilizes training dynamics, permitting much higher learning rates and reducing sensitivity to initialization. Applied to ImageNet classification, a batch-normalized network reaches the baseline's accuracy in 14× fewer training steps. The method also acts as a regularizer, in some cases removing the need for dropout. An ensemble of batch-normalized networks achieved 4.9% top-5 error on ImageNet validation, improving on the previous state of the art and exceeding the reported accuracy of human raters.

Internal covariate shift (input distribution drift during training) necessitates careful tuning; normalization per mini-batch mitigates this.
Batch normalization is a layer operation with learnable scale and shift parameters, integrated into the model architecture.
Reaches baseline accuracy on ImageNet classification in 14× fewer training steps; the higher learning rates that normalization permits are what drive the faster convergence.
Acts as an implicit regularizer, reducing or eliminating the need for dropout in some cases.
Achieved 4.9% top-5 ImageNet validation error with an ensemble, exceeding the reported accuracy of human raters at the time.
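
The training-time transform described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's full algorithm: it omits the running statistics used at inference, and the function name `batch_norm_train` and the `eps` value are assumptions made for the example.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch x of shape (batch, features), then scale and shift."""
    mean = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                      # per-feature (biased) mini-batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # roughly zero mean, unit variance
    return gamma * x_hat + beta              # learnable scale (gamma) and shift (beta)

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # badly scaled activations
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma` fixed at one and `beta` at zero, each column of `y` has approximately zero mean and unit variance regardless of how the input was scaled.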

normalization · training optimization · deep learning · convolutional networks · regularization · internal covariate shift

Auto-Encoding Variational Bayes

2013 · Diederik P Kingma, Max Welling · arXiv:1312.6114

Read

Introduces the reparameterization trick and VAE framework for tractable learning in models with intractable continuous latent variable posteriors.

This paper solves a core problem in probabilistic modeling: how to train and perform inference on models with continuous latent variables when computing the true posterior is intractable. The authors introduce the Variational Autoencoder (VAE) framework, which reparameterizes the variational lower bound (evidence lower bound, or ELBO) in a way that makes it amenable to gradient-based optimization. The key insight is replacing the intractable posterior with a learned inference network (encoder) that maps data to latent distributions. This enables efficient training via standard backpropagation on mini-batches. The method works by jointly optimizing a generative model and inference model using the reparameterization trick to sample from continuous latent variables differentiably. Experiments confirm the approach scales to large datasets while maintaining theoretical validity. This framework became foundational for deep generative modeling, enabling practical training of models with continuous latent representations.

The reparameterization trick reformulates the ELBO as a differentiable objective optimizable with stochastic gradient descent on large datasets.
A learned recognition network (encoder) approximates the intractable posterior, avoiding expensive inference computation per sample.
The framework jointly trains an encoder and decoder via a single objective, enabling end-to-end gradient-based optimization.
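The method applies to directed probabilistic models with continuous latent variables and i.i.d. data; it does not need closed-form expectations under the approximate posterior, only a posterior that can be sampled through a differentiable transformation of noise.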
Scalability to large datasets comes from amortized inference—one encoder handles all samples rather than fitting per-sample posteriors.
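
The reparameterization trick is easiest to see for a diagonal Gaussian posterior. The sketch below assumes that case, with a standard normal prior; the function names are hypothetical and the encoder that would produce `mu` and `log_var` is left out.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z ~ N(mu, sigma^2) as a deterministic function of (mu, log_var) and noise."""
    eps = rng.standard_normal(mu.shape)       # noise that does not depend on parameters
    return mu + np.exp(0.5 * log_var) * eps   # z = mu + sigma * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

rng = np.random.default_rng(0)
mu = np.array([[0.5, -1.0]])        # stand-in for encoder outputs
log_var = np.array([[0.0, -2.0]])
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

Because the randomness enters only through `eps`, gradients of a loss on `z` can flow back into `mu` and `log_var`, which is what makes the lower bound trainable with ordinary backpropagation.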

variational inference · latent variable models · generative models · gradient estimation · reparameterization trick · deep generative models

ImageNet Classification with Deep Convolutional Neural Networks (AlexNet)

2012 · Krizhevsky, Sutskever, Hinton

Read

The 'deep learning works' moment that launched the modern era.

AlexNet achieves breakthrough ImageNet classification by combining deep convolution with ReLU, dropout, and GPU-accelerated training.

AlexNet won the 2012 ImageNet competition (ILSVRC-2012) by training a deep convolutional neural network with 60 million parameters on 1.2 million images across 1000 classes. On the ILSVRC-2010 test set the model achieved 37.5% top-1 and 17.0% top-5 error, a large improvement over the best prior result of 45.7% top-1. The paper combined several key techniques: ReLU activations, which reached a fixed training error about 6× faster than tanh in the authors' comparison; a dual-GPU parallelization scheme to work around memory limits; local response normalization to improve generalization; and overlapping pooling. To limit overfitting despite the large model size, the authors used two data augmentation strategies (random crops with horizontal reflections, which expand the training set by a factor of 2048, and PCA-based color jittering) and dropout regularization (randomly zeroing 50% of hidden units during training). The network was trained on two GTX 580 GPUs for 5–6 days using SGD with momentum.

ReLU activations train ~6× faster than saturating nonlinearities like tanh, making deep networks practical to train.
Dropout regularization (randomly zeroing 50% of neurons) reduces overfitting by preventing co-adaptation, at the cost of roughly doubling the number of iterations needed to converge.
Data augmentation via random crops and horizontal reflections (2048× expansion) plus PCA color shifts is essential to prevent overfitting in a 60M-parameter network.
Dual-GPU parallelization with selective layer communication enables training networks exceeding single-GPU memory while keeping training time competitive.
Architecture depth matters: removing any of the middle convolutional layers degraded top-1 accuracy by ~2%, confirming that depth is critical for performance.
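
The two simplest ingredients, ReLU and dropout, fit in a short NumPy sketch. It follows the scheme summarized above (zero each unit with probability 0.5 during training, use all units scaled by 0.5 at test time); the function names and the `p_drop` parameter are assumptions for the example.

```python
import numpy as np

def relu(x):
    """Non-saturating activation: max(0, x)."""
    return np.maximum(x, 0.0)

def dropout(h, rng, p_drop=0.5, train=True):
    """Zero each unit with probability p_drop while training; scale at test time."""
    if train:
        mask = rng.random(h.shape) >= p_drop   # keep a unit with probability 1 - p_drop
        return h * mask
    return h * (1.0 - p_drop)                  # all units active, outputs scaled down

rng = np.random.default_rng(0)
h = relu(np.array([-1.0, 2.0, 3.0, 4.0]))
h_train = dropout(h, rng, train=True)
h_test = dropout(h, rng, train=False)
```

Many current libraries instead rescale the surviving units during training ("inverted dropout"), which leaves the test-time path untouched; the two variants agree in expectation.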

convolutional neural networks · relu activation · dropout regularization · gpu acceleration · data augmentation · image classification

Long Short-Term Memory

1997 · Hochreiter, Schmidhuber

Read

Recurrent architecture that remembers across long sequences; dominant before transformers.

LSTM introduces gated memory cells to enable recurrent networks to learn long-range dependencies by controlling gradient flow.

This paper introduces Long Short-Term Memory (LSTM), a recurrent neural network architecture designed to solve the vanishing gradient problem that prevented standard RNNs from learning long-range dependencies. The core idea is a memory cell whose self-connection carries error back through time unchanged, guarded by multiplicative input and output gates that control what is written to the cell and what is read from it. (The forget gate found in most modern LSTMs was added in follow-up work by Gers et al. in 2000.) The gates are learned functions, so the network itself decides what to store and what to expose at each time step. LSTMs learned tasks with time lags of more than 1000 steps, far beyond what vanilla RNNs could handle. The architecture became the practical standard for sequence modeling (speech, language, time series) for roughly two decades, until attention-based Transformers emerged. The paper's contribution was fundamentally architectural: it showed that carefully designed connectivity and learned gating could preserve information across long sequences, making recurrent models viable for real-world sequential data.

Addresses the vanishing gradient problem in RNNs with multiplicative input and output gates that modulate information flow through memory cells; the now-standard forget gate came later (Gers et al., 2000).
Demonstrates learning on sequences with 1000+ step dependencies, orders of magnitude longer than standard RNNs could handle.
Memory cell design decouples error signal propagation from neuron activation, allowing gradients to flow across many time steps.
Became the dominant architecture for sequence tasks (language modeling, speech recognition, machine translation) until Transformers displaced it in the late 2010s.
Gating mechanism is differentiable and learned end-to-end, requiring no special training algorithms beyond backpropagation through time.
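
A single time step of a gated memory cell can be sketched as follows. The sketch follows the 1997 formulation in having only input and output gates, but simplifies it: `tanh` stands in for the paper's squashing functions, and the packed weight layout and the name `lstm_step` are assumptions for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One step of a memory cell block with input and output gates.

    W has shape (3 * hidden, n_in + hidden); its rows are stacked as
    [input gate, output gate, candidate].
    """
    z = W @ np.concatenate([x, h_prev]) + b
    i, o, g = np.split(z, 3)
    i, o, g = sigmoid(i), sigmoid(o), np.tanh(g)
    c = c_prev + i * g        # additive update: the cell's self-connection has weight 1
    h = o * np.tanh(c)        # output gate controls what the rest of the network sees
    return h, c

hidden, n_in = 2, 3
W = np.zeros((3 * hidden, n_in + hidden))
# Strongly negative input-gate bias keeps the gate shut, so nothing is written.
b = np.concatenate([np.full(hidden, -50.0), np.zeros(hidden), np.ones(hidden)])
h, c = np.zeros(hidden), np.array([0.7, -0.3])
for _ in range(1000):
    h, c = lstm_step(np.ones(n_in), h, c, W, b)
```

With the input gate closed, the cell state `c` is carried unchanged across all 1000 steps, which is the mechanism that lets error signals survive long time lags.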

recurrent-neural-networks · long-term-dependencies · gradient-flow · sequence-modeling · gating-mechanisms · memory-architectures

Learning representations by back-propagating errors

1986 · Rumelhart, Hinton, Williams

Read

The backpropagation algorithm that makes training neural networks possible.

Backpropagation enables efficient gradient computation in multi-layer neural networks via reverse-mode automatic differentiation.

This paper presents the backpropagation algorithm, the foundational method for training multi-layer neural networks. The problem: networks with hidden layers could not be trained effectively because there was no practical way to compute gradients for weights that did not directly produce the output. The core idea is to use the chain rule of calculus to propagate error signals backward through the network, layer by layer, which gives the contribution of every weight to the final error. With these gradients, all weights can be updated by gradient descent. The key result is that networks with hidden layers learn useful internal representations and solve tasks that single-layer networks cannot, such as detecting mirror symmetry in an input vector. This removed the main barrier to training multi-layer networks and became the standard algorithm for neural network optimization, enabling the deep learning era decades later.

Backpropagation computes gradients by applying the chain rule backward through the network, from output error to input weights.
The algorithm works for networks with any number of hidden layers, overcoming the limitation of single-layer perceptrons.
Weights are updated using gradient descent: each weight change is proportional to the negative gradient of the loss.
Demonstrated on tasks that require hidden units, such as detecting mirror symmetry in an input vector and learning family-tree relationships.
Became the de facto standard for training neural networks and remains the core of modern deep learning frameworks.
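
The backward pass for a network with one hidden layer can be written out by hand. The sketch below assumes sigmoid hidden units, a linear output, and a squared-error loss, and uses the four XOR patterns only as a small non-linearly-separable toy dataset; layer sizes and the learning rate are arbitrary choices for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(params, x, t):
    """Squared-error loss and its gradients for a one-hidden-layer network."""
    W1, b1, W2, b2 = params
    h = sigmoid(x @ W1 + b1)          # forward: hidden activations
    y = h @ W2 + b2                   # forward: linear output
    loss = 0.5 * np.sum((y - t) ** 2)
    dy = y - t                        # dL/dy at the output
    dW2 = h.T @ dy
    db2 = dy.sum(axis=0)
    dh = dy @ W2.T                    # error propagated back to the hidden layer
    dz = dh * h * (1.0 - h)           # chain rule through the sigmoid
    dW1 = x.T @ dz
    db1 = dz.sum(axis=0)
    return loss, [dW1, db1, dW2, db2]

rng = np.random.default_rng(0)
x = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
t = np.array([[0.0], [1.0], [1.0], [0.0]])
params = [rng.normal(size=(2, 3)), np.zeros(3), rng.normal(size=(3, 1)), np.zeros(1)]
loss, grads = loss_and_grads(params, x, t)

# One gradient-descent update: move each weight against its gradient.
lr = 0.1
new_params = [p - lr * g for p, g in zip(params, grads)]
```

A quick way to gain confidence in hand-written gradients like these is to compare one entry against a finite-difference estimate of the loss.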

backpropagation · gradient descent · multi-layer networks · automatic differentiation · neural network training

Pre-Transformer

Embeddings, sequence models, and the first attention mechanisms.

Deep Residual Learning for Image Recognition

2015 · Kaiming He, Xiangyu Zhang, Shaoqing Ren +1 · arXiv:1512.03385

Read

Residual connections that let very deep networks train; relied on by transformers.

Skip connections enable training of very deep networks (152 layers) by learning residual functions, winning ImageNet 2015 with 3.57% error.

Deep neural networks become harder to train as you add more layers, even when theory suggests deeper should be better. This paper introduces residual connections (skip connections) that let each layer learn the *difference* from its input rather than learning an entirely new representation. Instead of stacking layers in the traditional way, ResNets reformulate training so that a layer learns a residual function F(x) added to the original input x, giving output F(x) + x. This simple architectural change makes it practical to train networks 8× deeper than previous work (152 layers vs. 19 for VGG) while actually reducing computational complexity. On ImageNet, their ResNet-152 ensemble achieved 3.57% error, winning ILSVRC 2015. The approach also improved object detection on COCO by 28% relative to prior methods, demonstrating that increased depth directly translates to better visual recognition across multiple tasks.

Residual connections (F(x) + x) address the degradation problem, in which adding layers to a plain network raises training error, and allow practical training of networks 8× deeper than VGG.
An ensemble of residual networks achieves 3.57% top-5 ImageNet test error and won the ILSVRC 2015 classification task; ResNets also took first place in ImageNet detection and localization, even though the 152-layer network has lower complexity than VGG-16/19.
The paper demonstrates that depth is fundamental to visual recognition: deeper representations directly improve performance on ImageNet, CIFAR-10, and COCO detection/segmentation.
Skip connections work even at extreme depths (100–1000 layers on CIFAR-10), showing the approach scales beyond typical network sizes.
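
The F(x) + x formulation is small enough to sketch directly. This example uses dense layers in place of the paper's convolutions and omits batch normalization, so it illustrates only the shortcut itself; the function name and shapes are assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = relu(F(x) + x), with F a two-layer transform (dense here for brevity)."""
    f = relu(x @ W1) @ W2      # residual function F(x)
    return relu(f + x)         # identity shortcut adds the input back

rng = np.random.default_rng(0)
x = np.abs(rng.normal(size=(2, 4)))          # non-negative inputs, as after a ReLU
W1 = rng.normal(size=(4, 4))
y_identity = residual_block(x, W1, np.zeros((4, 4)))   # F(x) = 0: block passes x through
y = residual_block(x, W1, 0.1 * rng.normal(size=(4, 4)))
```

When the residual branch outputs zero, the block reduces to the identity, so an extra block never has to relearn its input in order to do no harm.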

residual networks · skip connections · image classification · deep learning · optimization · imagenet

Adam: A Method for Stochastic Optimization

2014 · Diederik P. Kingma, Jimmy Ba · arXiv:1412.6980

Read

The optimizer almost everything is still trained with.

Adam combines momentum and adaptive learning rates by tracking gradient means and variances per parameter, enabling effective training with minimal hyperparameter tuning.

Adam is an adaptive learning rate optimizer for stochastic gradient-based training of neural networks. It maintains exponential moving averages of the gradient and the squared gradient (estimates of the first moment and the uncentered second moment), corrects both for their initialization bias, and uses them to set a separate effective step size for each parameter. This lets it handle sparse gradients, non-stationary objectives, and noisy data with little manual learning rate tuning. The algorithm is simple to implement, memory-efficient, and invariant to diagonal rescaling of the gradients. Kingma and Ba give a regret bound under online convex optimization assumptions and show empirical results on image and sentiment classification tasks (MNIST, CIFAR-10, IMDB) where Adam matches or outperforms SGD with momentum and other adaptive methods like AdaGrad.

Adam maintains exponential moving averages of gradients (first moment) and squared gradients (second moment) to compute adaptive per-parameter learning rates.
The hyperparameters have intuitive interpretations and suggested defaults (α=0.001, β₁=0.9, β₂=0.999, ε=10⁻⁸) that typically need little tuning across different problems.
Adam is invariant to diagonal rescaling of gradients and handles sparse, noisy, and non-stationary objectives better than vanilla SGD.
The analysis gives an O(√T) regret bound in the online convex optimization setting (average regret O(1/√T)), comparable to the best known bounds for that setting; it does not cover non-convex problems.
AdaMax, a variant based on the infinity norm, replaces the second-moment estimate with an exponentially weighted running maximum of gradient magnitudes.
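
The update rule is compact enough to write out in full. The sketch below uses the defaults suggested in the paper; the quadratic test function and the larger `alpha=0.05` in the loop are assumptions chosen only to make the demo converge quickly.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([5.0, -3.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 2001):
    grad = 2.0 * theta                        # gradient of f(theta) = ||theta||^2
    theta, m, v = adam_step(theta, grad, m, v, t, alpha=0.05)
```

On the very first step the update has magnitude close to `alpha` for every parameter, whatever the scale of its gradient, which is the rescaling invariance noted in the list above.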

optimization · adaptive learning rates · stochastic gradient descent · momentum · convergence analysis

Neural Machine Translation by Jointly Learning to Align and Translate

2014 · Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio · arXiv:1409.0473

Read

The actual origin of the attention mechanism, three years before the Transformer.

Attention mechanism enables decoders to dynamically focus on source sentence parts, eliminating the fixed-length vector bottleneck in neural machine translation.

This paper identifies a critical limitation in early neural machine translation systems: the fixed-length vector bottleneck. Encoder-decoder models compressed entire source sentences into a single fixed-size vector before decoding, losing information and struggling with long sentences. The authors introduce the attention mechanism, which allows the decoder to dynamically focus on relevant parts of the source sentence when generating each target word. Rather than explicitly segmenting the source, the model learns soft alignments—weighted attention distributions over all source words. Tested on English-to-French translation, this approach matches state-of-the-art phrase-based statistical systems while enabling interpretable alignment patterns that align with linguistic intuition. The attention mechanism became foundational to modern sequence-to-sequence architectures.

Fixed-length context vectors in encoder-decoder models create a bottleneck, especially for long sequences.
Soft attention allows the decoder to learn weighted alignments over all source positions when generating each target word.
Attention weights are learned end-to-end via backpropagation and produce interpretable alignments matching linguistic structure.
The approach achieved competitive performance with phrase-based statistical machine translation on English-French translation.
Attention became a core component of modern transformers and sequence-to-sequence models across NLP tasks.
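
Soft alignment can be sketched as an additive scoring function followed by a softmax. The parameter names (`W_a`, `U_a`, `v_a`) and all dimensions are assumptions for the example, and the recurrent encoder and decoder that would produce the inputs are omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """Score each source position against the decoder state, then average.

    s_prev: decoder state, shape (n,)
    H:      encoder states for T source positions, shape (T, m)
    """
    scores = np.tanh(s_prev @ W_a + H @ U_a) @ v_a   # one score per source position
    alpha = softmax(scores)                          # soft alignment weights
    context = alpha @ H                              # weighted sum of encoder states
    return context, alpha

rng = np.random.default_rng(0)
T, n, m, d = 6, 4, 5, 8
context, alpha = additive_attention(
    rng.normal(size=n), rng.normal(size=(T, m)),
    rng.normal(size=(n, d)), rng.normal(size=(m, d)), rng.normal(size=d),
)
```

The decoder receives a fresh `context` vector for every target word, so no single fixed-length vector has to carry the whole source sentence.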

attention mechanism · neural machine translation · encoder-decoder · sequence-to-sequence · alignment

Sequence to Sequence Learning with Neural Networks

2014 · Sutskever, Vinyals, Le · arXiv:1409.3215

Read

Encoder-decoder framing for translation and generation.

Four-layer LSTM encoder-decoder architecture with reversed source sentences achieves strong neural machine translation results on WMT'14, outperforming a phrase-based SMT baseline.

This paper introduces a practical end-to-end neural approach to sequence-to-sequence learning using deep LSTMs. The core idea is simple: use one LSTM to encode a variable-length input sequence into a fixed-dimensional vector, then use a second LSTM to decode the target sequence from that vector. Tested on WMT'14 English-to-French translation, the model achieves 34.8 BLEU score on direct translation (beating the 33.3 SMT baseline) and 36.5 BLEU when rescoring SMT n-best lists. A key practical finding is that reversing the word order of source sentences improves performance significantly (test perplexity drops from 5.8 to 4.7, BLEU improves from 25.9 to 30.6), because it introduces short-term dependencies that make optimization easier. The model also handles long sentences well—contrary to expectations—and learns meaningful sentence representations sensitive to word order and voice.

Encoder-decoder LSTM architecture: one LSTM reads input sequence to produce fixed vector, second LSTM decodes output from that vector, enabling variable-length input-output mappings.
Reversing source sentence word order cuts test perplexity by ~19% and boosts BLEU by ~4.7 points by reducing minimal time lag between source and target words.
Deep (4-layer) LSTMs substantially outperform shallow ones; the model has 384M parameters and represents each source sentence with 8000 real numbers.
Model handles long sentences effectively despite initial skepticism about LSTM memory limits, with no degradation on sentences under 35 words.
Gradient clipping (norm threshold of 5) and bucketing sentences by length in batches are critical training techniques; 8-GPU parallelization achieved 6,300 words/second throughput.
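
Two of the training tricks listed above, source reversal and gradient-norm clipping, are simple to sketch. The helper name `clip_by_global_norm` and the toy gradients are assumptions; the threshold of 5 is the value quoted in the summary.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = max_norm / total if total > max_norm else 1.0
    return [g * scale for g in grads], total

# Source reversal: feed "C B A" so the first source word sits next to the first target word.
source = ["A", "B", "C"]
reversed_source = source[::-1]

grads = [np.array([6.0, 8.0]), np.array([0.0])]   # joint norm is 10
clipped, norm_before = clip_by_global_norm(grads)
```

Clipping changes only the length of the gradient, not its direction, so an occasional exploding gradient shrinks to a bounded step instead of derailing training.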

sequence-to-sequence · lstm · machine translation · encoder-decoder · neural machine translation · recurrent neural networks

Efficient Estimation of Word Representations in Vector Space (Word2Vec)

2013 · Mikolov, Chen, Corrado +1 · arXiv:1301.3781

Read

Words as vectors; meaning as geometry.

Word2Vec proposes CBOW and Skip-gram architectures that efficiently learn word embeddings at scale, reaching about 66% accuracy on semantic-syntactic word relationships with the largest model; good vectors can be trained from 1.6 billion words in under a day.

This paper introduces Word2Vec, two efficient neural architectures for learning word embeddings from massive datasets. The core problem: previous methods for generating word vectors were computationally expensive, which limited training to a few hundred million words. The authors propose CBOW (Continuous Bag-of-Words), which predicts a target word from surrounding context words, and Skip-gram, which predicts surrounding words from a target word. Both are log-linear models that drop the non-linear hidden layer and use hierarchical softmax, so the cost of the output layer grows with log(V) rather than V, where V is vocabulary size. High-quality vectors can be learned from a 1.6-billion-word corpus in under a day, and the largest models (up to 1000 dimensions) were trained on a 6-billion-word Google News corpus with distributed training. The key result: the largest Skip-gram model reaches about 66% accuracy on a new 19,544-question semantic-syntactic test set (measuring relationships like "Paris is to France as Rome is to Italy"), compared to 24% for an earlier RNN language model. The vectors capture meaningful linear relationships: subtracting one word vector and adding another recovers semantic analogies. This matters because efficient, scalable word embeddings became foundational infrastructure for downstream NLP tasks.

CBOW predicts target word from context; Skip-gram predicts context from target word—both far simpler than prior feedforward/recurrent language models.
Hierarchical softmax reduces output layer complexity from V to log(V) using Huffman trees, enabling training on 1.6B+ word datasets.
Skip-gram excels at semantic relationships (55% accuracy) while CBOW handles syntactic tasks better (64% accuracy), showing complementary strengths.
Word vector arithmetic captures linguistic analogies: vector("King") - vector("Man") + vector("Woman") ≈ vector("Queen"), enabling analogy solving.
Training complexity is O(E×T×Q) where Q depends on architecture; Skip-gram Q = C×(D+D×log(V)), comparable to CBOW despite larger context window.
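
The analogy arithmetic can be tried on a toy vocabulary. The three-dimensional vectors below are hand-made for illustration (loosely "royalty", "male", "female" axes) and are not learned embeddings; the `analogy` helper is likewise an assumption.

```python
import numpy as np

# Hypothetical toy vectors, not trained embeddings.
vectors = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "apple": np.array([0.2, 0.5, 0.5]),
}

def analogy(a, b, c, vectors):
    """Return the word closest (by cosine) to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    target = vectors[b] - vectors[a] + vectors[c]
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

answer = analogy("man", "king", "woman", vectors)   # king - man + woman
```

Excluding the three query words from the search matters in practice, because the nearest neighbour of the target vector is often one of the inputs.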

word embeddings · skip-gram · cbow · language modeling · distributional semantics · efficient nlp

The Transformer

The architecture that became the hinge point of the field.

Attention Is All You Need

2017 · Ashish Vaswani, Noam Shazeer, Niki Parmar +5 · arXiv:1706.03762

Read

The Transformer architecture; the single hinge point of the field.

Transformer architecture replaces recurrence and convolution with pure attention mechanisms, achieving better translation quality with faster parallel training.

This paper introduces the Transformer architecture, a fundamentally new approach to sequence modeling that replaces recurrent neural networks (RNNs) and convolutional layers with pure attention mechanisms. Prior translation systems used RNNs or CNNs in encoder-decoder setups; recurrent models in particular process tokens one after another, which limits parallelism and makes training slow. The Transformer uses multi-head self-attention to let all positions in a sequence attend to each other in parallel, enabling significantly faster training without sacrificing quality. On WMT 2014 English-to-German translation, it achieves 28.4 BLEU (an improvement of over 2 BLEU on the prior best results), and on English-to-French it reaches a single-model state of the art of 41.8 BLEU after training for just 3.5 days on 8 GPUs, a small fraction of the training cost of the best earlier models. The authors also demonstrate that the architecture generalizes beyond translation to constituency parsing.

Multi-head self-attention allows all sequence positions to attend to each other in parallel, eliminating sequential RNN bottlenecks.
Achieved 28.4 BLEU on WMT 2014 En-De and 41.8 BLEU on En-Fr, surpassing previous best results while requiring much less training time.
Computation is parallelizable across all positions in a sequence rather than sequential in time, so the big model trains in 3.5 days on 8 GPUs, a small fraction of the training cost of the best earlier models.
No recurrence or convolution needed; positional encodings inject sequence order information into the attention-based computation.
Generalizes beyond machine translation to parsing tasks with both high and low-resource data regimes.
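
The core operation, scaled dot-product attention, is a few matrix products. The sketch below shows a single head with no masking and no learned projections; shapes and variable names are assumptions for the example.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one head, without masking."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n_queries, n_keys)
    scores = scores - scores.max(axis=-1, keepdims=True) # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 5, 8, 8
X = rng.normal(size=(seq_len, d_k))
# Self-attention: queries, keys, and values all come from the same sequence.
out, weights = scaled_dot_product_attention(X, X, rng.normal(size=(seq_len, d_v)))
```

Every row of `weights` is computed from the same two matrix products, which is why all positions can be processed at once instead of step by step.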

transformers · attention · sequence-to-sequence · machine translation · parallelization · neural architecture

LLM Era

Pretraining at scale and the rise of large language models.

Emergent Abilities of Large Language Models

2022 · Jason Wei, Yi Tay, Rishi Bommasani +13 · arXiv:2206.07682

Read

Large language models exhibit sudden, unpredictable capability jumps at certain scales that cannot be extrapolated from smaller model performance trends.

This paper investigates a counterintuitive phenomenon in large language models: certain capabilities appear suddenly at larger scales, rather than improving gradually. The authors define emergence as an ability completely absent in smaller models but present in larger ones—meaning you cannot predict its appearance by simply extending trends from smaller model performance curves. They document multiple tasks where this nonlinear jump occurs, such as few-shot reasoning and mathematical problem-solving. The core finding challenges the assumption that scaling follows smooth, predictable trajectories; instead, models can unlock entirely new competencies once they reach certain size thresholds. This has practical implications: it suggests that continued scaling may unlock capabilities that aren't obvious from current smaller model behavior, making it difficult to plan compute budgets or predict model utility without actually building and testing larger versions.

Emergence is defined as an ability absent in smaller models yet present in larger ones: performance sits near random until a critical scale and then rises sharply, so it cannot be extrapolated from smaller models.
Multiple domains show this pattern: in-context learning, chain-of-thought reasoning, and mathematical reasoning emerge at specific model size thresholds.
Standard scaling laws fail to predict emergent abilities, invalidating simple extrapolation methods used to forecast large model performance.
Emergence suggests future scaling may unlock capabilities that are invisible in current small-to-medium model benchmarks.
The authors liken the phenomenon to a phase transition, while noting that scale is not the only factor: better data, architectures, or training procedures can lower the threshold at which an ability appears.

scaling laws · emergent capabilities · language models · few-shot learning · phase transitions · in-context learning

Finetuned Language Models Are Zero-Shot Learners

2021 · Jason Wei, Maarten Bosma, Vincent Y. Zhao +6 · arXiv:2109.01652

Read

Instruction tuning a 137B model on 60+ diverse NLP tasks lets it generalize to unseen tasks, beating the larger zero-shot GPT-3 (175B) on most of them.

This paper demonstrates that finetuning a 137B parameter language model on diverse NLP tasks described through natural language instructions significantly improves its ability to generalize to completely unseen tasks at test time. The authors created FLAN by instruction-tuning a pretrained model across 60+ tasks, where each task was expressed using natural language instruction templates rather than task-specific formats. On a benchmark of 25 held-out tasks, FLAN outperformed the unmodified baseline and matched or exceeded zero-shot performance of GPT-3 (175B parameters) on most tasks. Notably, FLAN also surpassed few-shot GPT-3 performance on several established benchmarks including ANLI, RTE, BoolQ, and multiple reasoning datasets. Ablation analysis showed that three factors were critical: the number of distinct finetuning datasets used, the scale of the model, and whether tasks were described using natural language instructions versus other formats.

Instruction tuning—finetuning on tasks with natural language instruction templates—substantially improves zero-shot transfer to held-out task types.
The 137B instruction-tuned model outperformed its unmodified counterpart and surpassed zero-shot GPT-3 (175B) on 20 of 25 evaluation tasks, without any task-specific examples.
Natural language instruction format, dataset diversity during finetuning, and model scale were all necessary contributors to the improvement.
Instruction-tuned FLAN surpassed even few-shot GPT-3 on several established benchmarks, showing instruction tuning unlocks stronger generalization than in-context learning alone.

instruction tuning · zero-shot learning · transfer learning · language model finetuning · generalization

Language Models are Few-Shot Learners

2020 · Tom B. Brown, Benjamin Mann, Nick Ryder +27 · arXiv:2005.14165

Read

In-context learning; the paper that started the current wave.

GPT-3 (175B parameters) achieves competitive few-shot performance on diverse NLP tasks without fine-tuning, showing scale alone enables task-agnostic learning from text demonstrations.

This paper introduces GPT-3, a 175-billion-parameter autoregressive language model trained on large-scale text data. The key finding is that scaling to this size enables strong few-shot learning—performing new tasks with only a few examples provided in text prompts, without any gradient-based fine-tuning. The authors test GPT-3 on diverse NLP benchmarks including translation, question-answering, arithmetic, and reasoning tasks. Results show that GPT-3 often matches or approaches performance of prior fine-tuned models while remaining task-agnostic. The work demonstrates that task-specific training data and fine-tuning become less critical as model scale increases, fundamentally challenging the then-standard paradigm of pre-train-then-fine-tune. The paper also documents failure cases and notes that GPT-3 can generate realistic synthetic news, raising societal concerns.

GPT-3 reaches 175 billion parameters—10x larger than prior dense models—trained on 300B tokens from Common Crawl, Wikipedia, Books, and WebText.
Few-shot performance improves dramatically with model scale; GPT-3 often matches fine-tuned baselines on standard benchmarks (e.g., SuperGLUE, TriviaQA, LAMBADA) using only task descriptions and examples in the prompt.
No gradient updates or fine-tuning required; tasks specified entirely via text interaction—the model completes or responds based on context.
Identifies systematic weaknesses: GPT-3 struggles on some datasets, has data contamination issues from web-scale training, and shows gaps in syntactic and semantic understanding on specific benchmarks.
Generated news articles are difficult for humans to distinguish from real articles, flagging potential misuse risks.
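
Specifying a task "entirely via text" amounts to building a prompt string. The sketch below assumes a simple `source => target` layout and a hypothetical helper name; it only builds the prompt and calls no model.

```python
def few_shot_prompt(task_description, examples, query):
    """Build a prompt: task description, K worked examples, then the unanswered query."""
    lines = [task_description, ""]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")          # the model is expected to continue from here
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
```

Passing an empty `examples` list yields the zero-shot form of the same prompt, and a single pair yields the one-shot form; in all cases the model's weights stay fixed.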

large language models · few-shot learning · gpt-3 · scaling laws · in-context learning · pre-training

Scaling Laws for Neural Language Models

2020 · Jared Kaplan, Sam McCandlish, Tom Henighan +7 · arXiv:2001.08361

Read

Performance as a predictable function of compute, data, and model size.

Language model loss follows power-law scaling with model size, data size, and compute; larger models are more sample-efficient for fixed budgets.

This paper investigates how language model performance scales with three key resources: model size, training data size, and computational budget. The researchers discovered that loss follows predictable power-law relationships across all three dimensions, with patterns holding consistently across seven orders of magnitude. Surprisingly, architectural choices like layer count or width matter far less than these three factors within reasonable ranges. The work also characterizes how overfitting depends on model and dataset scale, and how training speed varies with model size. A critical finding: larger models learn more efficiently from less data, which inverts conventional wisdom. The practical implication is that optimal compute-budgeted training means building very large models trained on relatively small datasets, then stopping well before full convergence—rather than training smaller models to completion on massive datasets.

Loss decreases as a power-law function of model size, dataset size, and total compute across 7+ orders of magnitude.
Network depth and width have minimal impact on performance within practical ranges; the three resource-scaling laws dominate.
Larger models are markedly more sample-efficient: they reach a given loss with fewer optimization steps and fewer data points than smaller models.
Optimal compute-efficient training stops far before convergence, allocating more compute to model scale than data volume.
Simple equations predict overfitting and training speed, enabling principled compute-budget allocation before expensive training runs.
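
A power law is a straight line in log-log space, which is how such trends are usually fit and extrapolated. The sketch below generates synthetic losses from the form L(N) = (N_c / N)^α and recovers the exponent; the constants are illustrative assumptions, not the paper's fitted values.

```python
import numpy as np

# Synthetic data from L(N) = (N_c / N) ** alpha. Constants are made up for illustration.
N_c, alpha = 1e14, 0.08
N = np.logspace(6, 9, 8)                 # model sizes from 1e6 to 1e9 parameters
L = (N_c / N) ** alpha

# log L = alpha * log N_c - alpha * log N, so a linear fit recovers both constants.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
alpha_hat = -slope
N_c_hat = np.exp(intercept / alpha_hat)

# Extrapolate the fitted law to a model 10x larger than any in the fit.
predicted_loss = (N_c_hat / 1e10) ** alpha_hat
```

On real measurements the fit is noisy and the trend eventually bends, so an extrapolation like `predicted_loss` is best read as a forecast to be checked, not a guarantee.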

scaling laws · language models · compute efficiency · power laws · loss prediction · training optimization

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

2019 · Colin Raffel, Noam Shazeer, Adam Roberts +6 · arXiv:1910.10683

Read

Reframed every NLP task as text-to-text.

T5 unifies NLP tasks as text-to-text problems and systematically benchmarks pre-training objectives, architectures, and datasets to achieve state-of-the-art results.

This work presents a systematic investigation of transfer learning strategies for NLP tasks by proposing a unified text-to-text framework called T5, where all NLP problems—whether classification, summarization, translation, or question-answering—are reformulated as sequence-to-sequence tasks. The authors conduct extensive experiments comparing different pre-training objectives (like denoising and language modeling), model architectures (encoder-decoder variants), unlabeled datasets, and fine-tuning strategies across dozens of benchmarks. They introduce C4, a large-scale cleaned web corpus for pre-training, and combine architectural insights with scale to achieve state-of-the-art performance on GLUE, SuperGLUE, SQuAD, CNN/Daily Mail, and other standard datasets. The work provides practical guidance on which design choices matter most for transfer learning—for instance, showing that encoder-decoder Transformer models trained on denoising objectives with larger datasets outperform alternatives—and releases code, models, and data to enable reproducibility.

Converting all NLP tasks into text-to-text format enables a single architecture to handle diverse problems like classification, QA, summarization, and translation without task-specific heads.
Encoder-decoder Transformer models with denoising pre-training objectives outperform decoder-only and other alternatives on downstream tasks.
Larger unlabeled datasets and model scale significantly improve transfer learning performance; C4 corpus (750GB cleaned web text) proved more effective than previous sources.
Systematic ablation studies quantify the impact of design choices: relative position embeddings, layer normalization placement, and dropout rates matter, but some intuitive choices had marginal gains.
Released pre-trained T5 models, C4 dataset, and code enable reproducible research and lower barriers for practitioners applying transfer learning.
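
The text-to-text framing above can be sketched as a small function that turns any task example into a pair of plain strings. This is a minimal illustration, not the released T5 code: the function name, the task keys, and the dictionary fields are hypothetical, and the prefixes and label strings follow the style of the examples shown in the paper.

```python
def to_text_to_text(task, example):
    """Cast one task example to an (input_text, target_text) pair of strings."""
    if task == "translate_en_de":
        return ("translate English to German: " + example["source"], example["target"])
    if task == "summarize":
        return ("summarize: " + example["document"], example["summary"])
    if task == "cola":
        # Classification is handled by emitting the label itself as text,
        # so no task-specific output head is needed.
        label = "acceptable" if example["label"] == 1 else "not acceptable"
        return ("cola sentence: " + example["sentence"], label)
    raise ValueError(f"unknown task: {task}")
```

Because every task shares this string-in, string-out shape, the same model, loss, and decoding procedure can be reused across all of them.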

transfer learning, transformers, sequence-to-sequence, pre-training, scaling, multi-task learning

Language Models are Unsupervised Multitask Learners (GPT-2)

2019 · Radford, Wu, Child +3

Read

Scaling generative pretraining; emergent zero-shot task ability.

Large Transformer language models trained on diverse web text learn to perform multiple NLP tasks zero-shot via language modeling alone, without task-specific supervision or architecture changes.

This paper introduces GPT-2, a 1.5B parameter Transformer language model trained on WebText—a 40GB dataset of high-quality web documents filtered through Reddit upvotes. The key contribution is demonstrating that large language models can perform diverse NLP tasks without explicit task-specific training or fine-tuning (zero-shot transfer). By training purely on next-token prediction across varied domains, GPT-2 learns to implicitly handle reading comprehension, machine translation, summarization, and question answering by conditioning on natural language task descriptions. The model achieves state-of-the-art results on 7 of 8 language modeling benchmarks (LAMBADA, Children's Book Test, Winograd Schema) and reaches 55 F1 on CoQA reading comprehension without using its 127K training examples. Results show scaling laws: larger models consistently improve zero-shot performance log-linearly across tasks, suggesting that capacity is essential for unsupervised multitask learning. The paper also includes careful analysis of data overlap between training and test sets, finding modest but consistent benefits from incidental overlap.

GPT-2 (1.5B params) achieves competitive or SOTA results on 7/8 language modeling datasets in zero-shot settings: LAMBADA (8.63 perplexity), Children's Book Test (93.3% accuracy), Winograd Schema (70.7% accuracy).
Model capacity is critical: performance improves log-linearly with model size across tasks, from 117M to 1.5B parameters, demonstrating that scale alone enables task learning without explicit supervision.
WebText dataset construction emphasizes quality over size: 40GB of text from Reddit-upvoted links was preferred to raw Common Crawl scrapes because of the latter's data-quality problems, reflecting the view that curated training data matters for downstream generalization.
Zero-shot task conditioning uses natural language prompts (e.g., 'translate to French:' or 'TL;DR:') rather than architectural changes, enabling a single model to handle diverse tasks without modification.
Data overlap analysis finds that common benchmark test sets have roughly 1-6% 8-gram overlap with the WebText training data (about 3.2% on average), suggesting results largely reflect genuine generalization rather than memorization.

transformer, language modeling, zero-shot learning, scaling laws, transfer learning, multitask learning

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

2018 · Jacob Devlin, Ming-Wei Chang, Kenton Lee +1 · arXiv:1810.04805

Read

Bidirectional pretraining; dominated NLP benchmarks for years.

Bidirectional masked-language pre-training on transformers achieves state-of-the-art results across eleven NLP tasks with minimal task-specific tuning.

BERT introduces a pre-training approach that learns bidirectional context representations using transformer encoders. Unlike prior models that processed text left-to-right or right-to-left, BERT conditions on both directions at once by masking random tokens during pre-training and predicting them from the surrounding context. The model is trained on unlabeled text at scale, then adapted to downstream tasks by adding a task-specific output layer, with no major architectural changes needed. The approach proved more effective than existing language models across eleven NLP benchmarks. On GLUE, a broad task suite, BERT achieved a score of 80.5% (7.7 points above prior best). It reached 86.7% accuracy on MultiNLI (natural language inference), 93.2 F1 on SQuAD v1.1 reading comprehension, and 83.1 F1 on SQuAD v2.0 (which adds unanswerable questions). This work demonstrated that simple masked-language pre-training at scale could produce strong general-purpose representations suitable for diverse downstream applications.

BERT uses masked token prediction during pre-training: random words are hidden and the model learns to recover them using both left and right context in all layers.
Pre-trained BERT transfers effectively to downstream tasks; fine-tuning requires only one added output layer and typically converges in hours on standard hardware.
Measured improvements: GLUE +7.7 points, MultiNLI +4.6 points, SQuAD v1.1 +1.5 F1 points, SQuAD v2.0 +5.1 F1 points over prior best results.
The model is conceptually simpler than contemporaries yet outperforms them, suggesting bidirectional pre-training is more sample-efficient than unidirectional approaches.
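
The masked-token objective described above can be sketched in a few lines. This is a simplified illustration rather than the original implementation: the function name and the toy vocabulary are hypothetical, and the 15% selection rate with an 80/10/10 split between [MASK], a random token, and the unchanged token is taken from the BERT paper rather than from the summary here.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels); labels are None where no prediction is required."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token here
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK)               # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                # 10%: keep the token unchanged
        else:
            labels.append(None)  # position is ignored by the loss
            inputs.append(tok)
    return inputs, labels
```

Since the prediction at a masked position can attend to tokens on both sides, every layer sees left and right context, which is the property the summary calls bidirectional.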

pre-training, transformers, language-understanding, fine-tuning, masked-language-modeling, nlp-benchmarks

Improving Language Understanding by Generative Pre-Training (GPT-1)

2018 · Radford, Narasimhan, Salimans +1

Read

The pretrain-then-finetune recipe for language.

Two-stage pre-training and fine-tuning with a Transformer language model achieves state-of-the-art on diverse NLP tasks with minimal architecture changes.

GPT-1 demonstrates that large performance gains on NLP tasks can be achieved by combining unsupervised language model pre-training with supervised fine-tuning. The paper trains a 12-layer Transformer decoder on the BooksCorpus (7,000+ books with long contiguous text) using standard language modeling loss. After pre-training, the model is fine-tuned on downstream tasks—natural language inference, question answering, semantic similarity, and text classification—by adding only a linear output layer and using task-specific input transformations that convert structured inputs into token sequences. The approach achieves state-of-the-art results on 9 of 12 evaluated benchmarks, with notable improvements: 8.9% on StoryCloze (commonsense reasoning), 5.7% on RACE (reading comprehension), 1.5% on MultiNLI (textual entailment), and 5.5% on GLUE. The paper shows that the Transformer architecture's structured attention mechanism is crucial—replacing it with an LSTM drops average performance by 5.6 points—and that each pre-trained layer transfers useful functionality for solving target tasks.

Pre-train a 12-layer Transformer decoder on BooksCorpus (~100 epochs, 512-token sequences) using language modeling loss, then fine-tune with only a linear layer added for each downstream task.
Task-specific input transformations (concatenation with delimiters, both orderings for similarity, document+question+answer triples) adapt structured inputs to the sequence-based pre-trained model without changing core architecture.
State-of-the-art on 9/12 tasks: improvements of 8.9% StoryCloze, 5.7% RACE, 1.5% MultiNLI, and 5.5% GLUE over prior methods; outperforms task-specific architectures and single-model LSTM baselines.
Ablations confirm Transformer architecture (vs LSTM) and pre-training itself are critical; auxiliary language modeling loss during fine-tuning helps large datasets but not small ones; each pre-trained layer contributes functional value.
Zero-shot analysis shows the pre-trained model implicitly learns to perform downstream tasks during language modeling, with performance steadily improving over training updates.
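
The task-specific input transformations mentioned above can be sketched as list manipulations over token sequences. This is an illustrative sketch under assumptions: the function names are hypothetical and the special-token strings are placeholders for the learned start, delimiter, and extract tokens.

```python
# Placeholder strings standing in for the learned special tokens (assumed names).
START, DELIM, EXTRACT = "<s>", "$", "<e>"

def entailment_input(premise, hypothesis):
    """One sequence: premise and hypothesis joined by a delimiter."""
    return [START] + premise + [DELIM] + hypothesis + [EXTRACT]

def similarity_inputs(a, b):
    """Similarity has no inherent ordering, so both orderings are produced."""
    return [entailment_input(a, b), entailment_input(b, a)]

def multiple_choice_inputs(document, question, answers):
    """One sequence per candidate answer; each is scored independently."""
    return [[START] + document + question + [DELIM] + ans + [EXTRACT] for ans in answers]
```

In each case the pre-trained Transformer is left unchanged; only the way structured inputs are flattened into a sequence differs, and a linear layer on top of the final representation produces the prediction.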

pre-training, fine-tuning, transformers, transfer learning, language modeling, semi-supervised learning

Training & Capabilities

Scaling, alignment, efficiency, and reasoning.

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

2025 · DeepSeek-AI, Daya Guo, Dejian Yang +27 · arXiv:2501.12948

Read

Reasoning emerging from pure RL without supervised fine-tuning; current frontier of reasoning-model training.

Pure reinforcement learning trains LLMs to reason effectively without human-labeled trajectories, achieving better performance on math and coding through emergent self-verification and strategy adaptation.

This paper addresses the challenge of building reasoning capabilities in large language models without relying on extensive human-annotated examples. The authors developed a reinforcement learning framework that trains models to improve reasoning by optimizing for correct final answers on tasks where correctness can be automatically verified. During training, the model learns emergent behaviors like checking its own work, reconsidering approaches, and adapting strategies—without explicit instruction to do so. When evaluated on mathematics, coding competitions, and STEM problems, the RL-trained model outperforms versions trained on human-written reasoning examples. A notable secondary contribution is that reasoning patterns learned by the larger model can be extracted and used to improve smaller models' reasoning abilities, making advanced reasoning more accessible across model scales.

RL framework learns reasoning without needing human-written step-by-step solutions, only final answer correctness signals.
Model spontaneously develops behaviors like self-reflection, verification, and dynamic strategy switching during training.
Outperforms supervised learning baselines on verifiable tasks: mathematics, coding competitions, and STEM problems.
Reasoning patterns from large models can be distilled to improve smaller model reasoning capability.
Approach is applicable to any domain where correct answers are automatically verifiable.

reinforcement learning, reasoning, chain-of-thought, distillation, mathematics, code generation

Mixtral of Experts

2024 · Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux +23 · arXiv:2401.04088

Read

Sparse mixture-of-experts model that activates only 13B of 47B parameters per token, matching 70B-parameter models while reducing inference compute.

Mixtral 8x7B is a sparse mixture-of-experts language model that achieves strong performance while maintaining computational efficiency during inference. The model uses eight feedforward expert networks at each layer, with a router that selects exactly two experts per token. Although the full model contains 47 billion parameters, only 13 billion are active for any given input, reducing computation cost. The model was trained on 32k token context windows. In benchmarks, Mixtral matches or exceeds Llama 2 70B and GPT-3.5 across standard tasks, with particularly large gains in mathematics, code generation, and multilingual understanding. An instruction-tuned variant (Mixtral 8x7B-Instruct) outperforms several commercial models including GPT-3.5 Turbo and Claude 2.1 on human evaluation tasks. Both versions are released openly under Apache 2.0 licensing.

Router network selects 2 out of 8 experts per token at each layer, enabling different expert combinations across timesteps.
Achieves 70B-class performance with roughly 5x fewer active parameters (13B vs 70B), improving inference efficiency and throughput.
Outperforms Llama 2 70B significantly on math, code, and multilingual benchmarks despite using far fewer active parameters per token.
Instruction-tuned variant surpasses GPT-3.5 Turbo and Claude 2.1 on human preference evaluations.
Trained with 32k token context length and released under permissive open-source license.
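
The top-2 routing described above can be sketched for a single token as follows. This is a simplified illustration, not Mixtral's implementation: the function name is hypothetical, the experts are arbitrary callables standing in for feedforward blocks, and the router logits are passed in directly instead of being computed by a learned gate.

```python
import math

def top2_moe(x, router_logits, experts):
    """Route one token vector x through the two highest-scoring experts."""
    # Indices of the two largest router logits.
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:2]
    # Softmax over the selected logits only, so the two weights sum to 1.
    m = max(router_logits[i] for i in top)
    weights = [math.exp(router_logits[i] - m) for i in top]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted sum of the chosen experts' outputs; the other experts never run.
    out = [0.0] * len(x)
    for w, i in zip(weights, top):
        y = experts[i](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top
```

Only two of the eight experts are evaluated per token, which is why the active parameter count is much smaller than the total parameter count.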

mixture of experts, sparse models, routing networks, inference efficiency, instruction tuning, language models

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

2023 · Rafael Rafailov, Archit Sharma, Eric Mitchell +3 · arXiv:2305.18290

Read

DPO replaces complex RLHF with direct supervised fine-tuning using preference pairs and a classification loss, eliminating reward models and RL instability.

Training language models to follow human preferences typically requires reinforcement learning from human feedback (RLHF), a multi-stage process: first train a separate reward model to score outputs, then use RL algorithms like PPO to optimize the language model against that reward estimate. This approach is complex, unstable, and computationally expensive. Direct Preference Optimization (DPO) replaces this pipeline with a single supervised fine-tuning step. The key insight is reparameterizing the reward function so that the optimal policy can be derived in closed form. Instead of training a reward model and running RL, DPO directly optimizes the language model using a classification loss on pairs of preferred and dispreferred responses. Experiments show DPO matches or exceeds PPO-based RLHF on sentiment control, summarization, and dialogue tasks while being simpler to implement, more stable, and requiring no reward model fitting or RL sampling during training.

DPO derives an optimal policy in closed form from a reparameterized reward function, enabling single-stage fine-tuning without separate reward modeling.
The method uses only a classification loss on human preference pairs, requiring no reinforcement learning, sampling during training, or extensive hyperparameter tuning.
Empirical results show DPO equals or outperforms PPO-based RLHF on sentiment control, summarization, and dialogue benchmarks despite being substantially simpler.
The approach is computationally lightweight and more stable than RLHF pipelines, making preference alignment more accessible and reproducible.
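
The classification loss on preference pairs can be written out for a single pair as a short sketch. The function and argument names are hypothetical; the inputs are assumed to be summed log-probabilities of the preferred and dispreferred responses under the policy being trained and under a frozen reference model, and beta is the usual temperature on the implicit reward.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Negative log-sigmoid of the implicit reward margin for one preference pair."""
    # Implicit rewards: scaled log-ratio of policy to reference for each response.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) computed stably as softplus(-margin).
    z = -margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy equals the reference, the margin is zero and the loss is log 2; the loss falls as the policy raises the preferred response's likelihood relative to the dispreferred one, with no reward model or sampling involved.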

preference alignment, rlhf, language model fine-tuning, reward modeling, human feedback

Let's Verify Step by Step

2023 · Hunter Lightman, Vineet Kosaraju, Yura Burda +7 · arXiv:2305.20050

Read

Process supervision on intermediate reasoning steps beats outcome supervision for mathematical reasoning, achieving 78% MATH accuracy with released 800K step-level labels.

Large language models struggle with multi-step reasoning tasks despite recent advances, frequently making logical errors partway through solutions. This work compares two training approaches: outcome supervision (feedback on final answers) versus process supervision (feedback on intermediate steps). The authors trained models on the MATH dataset, a challenging benchmark of mathematical problems requiring sequential reasoning. Process supervision substantially outperformed outcome supervision—a process-supervised model achieved 78% accuracy on a representative MATH test subset, while outcome supervision lagged significantly behind. The team also demonstrated that active learning (strategically selecting which steps to label) further improved process supervision's effectiveness. To enable follow-up research, they released PRM800K, a dataset containing 800,000 human-annotated step-level labels collected during their experiments.

Process supervision (labeling correctness of each step) significantly outperforms outcome supervision (labeling only final answers) for training models on multi-step math problems.
Selecting solutions with a process-supervised reward model solved 78% of problems from a representative subset of the MATH test set, demonstrating practical gains on a competitive benchmark.
Active learning strategies reduce annotation cost by selecting high-value intermediate steps to label, improving sample efficiency.
PRM800K dataset of 800,000 step-level labels is released publicly, lowering barriers for research on process-based training methods.
The findings suggest that detailed intermediate feedback matters more than final-answer feedback for developing reliable mathematical reasoning.
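
One way a process reward model can be used at test time is to score each candidate solution from its per-step correctness probabilities and keep the best one. The sketch below assumes, following the paper's description, that a solution's score is the product of its step probabilities; the function names and the candidate format are hypothetical.

```python
import math

def solution_score(step_probs):
    """Probability that every step is correct, taken as the product over steps."""
    return math.prod(step_probs)

def best_of_n(candidates):
    """candidates: list of (final_answer, step_probs). Return the top-scored answer."""
    return max(candidates, key=lambda c: solution_score(c[1]))[0]
```

A single badly scored step drags the whole solution down, which is how step-level feedback penalizes a chain that reaches an answer through a flawed intermediate step.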

process supervision, reward modeling, mathematical reasoning, step-level feedback, active learning, multi-step reasoning

Llama 2: Open Foundation and Fine-Tuned Chat Models

2023 · Hugo Touvron, Louis Martin, Kevin Stone +27 · arXiv:2307.09288

Read

Meta released Llama 2, open-source models up to 70B parameters with chat variants competitive with closed-source alternatives on benchmarks and human evaluations.

Meta released Llama 2, a family of open-source language models at scales of 7B, 13B, and 70B parameters (a 34B variant was also trained and evaluated but not released). The base models were pretrained on 2 trillion tokens and made publicly available. The team then fine-tuned variants called Llama 2-Chat specifically for dialogue, using supervised fine-tuning followed by reinforcement learning from human feedback (RLHF). These chat models were evaluated on benchmarks like MMLU, GSM8K, and HumanEval, where they performed competitively against closed-source models like GPT-3.5. Human raters assessed the models on helpfulness and safety; Llama 2-Chat-70B achieved win rates comparable to ChatGPT on human evaluation tasks. The paper documents their instruction-tuning procedure, safety techniques including adversarial prompting and red-teaming, and reproducibility details to allow community extensions.

Llama 2 comes in three sizes (7B, 13B, 70B parameters); all base models and chat fine-tuned versions are publicly released.
Chat models used supervised fine-tuning on instruction data followed by RLHF to optimize for helpfulness and safety.
Llama 2-Chat-70B achieved competitive or superior performance to GPT-3.5 on human evaluation for helpfulness; safety red-teaming reduced harmful outputs.
Models outperformed open-source alternatives on standard benchmarks (MMLU, GSM8K, HumanEval) and showed improved harmlessness through systematic safety improvements.
Paper includes reproducibility details on training procedure, instruction-tuning data curation, and adversarial testing methodology to enable community research.

llama, instruction tuning, rlhf, open-source llms, dialogue models, safety and alignment

LLaMA: Open and Efficient Foundation Language Models

2023 · Hugo Touvron, Thibaut Lavril, Gautier Izacard +11 · arXiv:2302.13971

Read

The open-weight model that catalyzed the open ecosystem.

Open-source language models trained on public data alone match or beat proprietary models at smaller parameter counts through efficient training.

LLaMA addresses the assumption that building competitive large language models requires proprietary datasets or massive scale. The researchers trained a family of models from 7 billion to 65 billion parameters using only publicly available text data, including CommonCrawl, C4, GitHub, Wikipedia, and Books. They prioritized training on more tokens rather than simply scaling model size: the 13B variant, trained on 1 trillion tokens, matched or exceeded GPT-3 (175B) on most standard benchmarks. The 65B model, trained on 1.4 trillion tokens, was competitive with the strongest models of the time, Chinchilla-70B and PaLM-540B. The key insight was that compute-efficient scaling and high-quality public data could replace the proprietary datasets thought to be necessary, lowering barriers to state-of-the-art model development and enabling open research through model release.

LLaMA-13B outperforms GPT-3 (175B) on most benchmarks despite being 13× smaller, showing parameter efficiency matters.
Training used only public datasets (CommonCrawl, C4, GitHub, Wikipedia, Books) with no proprietary data needed.
Scaling training data more than model size (1.0 to 1.4 trillion training tokens) drives performance gains efficiently.
LLaMA-65B is competitive with frontier models Chinchilla-70B and PaLM-540B, showing that top-tier performance is possible without closed datasets.
Models released to research community, removing a major barrier to reproducible large-model research.

language models, scaling efficiency, open-source, foundation models, transformers, benchmarks

QLoRA: Efficient Finetuning of Quantized LLMs

2023 · Tim Dettmers, Artidoro Pagnoni, Ari Holtzman +1 · arXiv:2305.14314

Read

QLoRA cuts finetuning memory by combining 4-bit quantization with LoRA adapters, enabling 65B-model finetuning on a single 48GB GPU without sacrificing performance.

QLoRA enables efficient finetuning of large language models by combining 4-bit quantization with Low Rank Adapters (LoRA). The method reduces GPU memory requirements dramatically—a 65B parameter model can now be finetuned on a single 48GB GPU while maintaining performance equivalent to full 16-bit finetuning. The approach introduces three key innovations: NF4 (a quantization data type optimized for normally distributed weights), double quantization (quantizing the quantization constants themselves to further reduce memory), and paged optimizers to handle memory spikes during training. The authors finetuned over 1,000 models across multiple datasets and architectures (LLaMA, T5) to validate the approach. Their best model, Guanaco, achieved 99.3% of ChatGPT's performance on the Vicuna benchmark using only 24 hours of single-GPU training. Results demonstrate that small, high-quality instruction datasets combined with QLoRA can match or exceed prior state-of-the-art results on chatbot tasks, even with smaller base models.

NF4 quantization format is information-theoretically optimal for normally distributed LLM weights and forms the core of the memory savings.
Double quantization further compresses memory by quantizing the quantization constants, reducing average footprint with minimal performance loss.
Paged optimizers manage GPU memory spikes during backward passes, enabling stable training on hardware with limited VRAM.
Guanaco models trained with QLoRA on high-quality instruction data reached 99.3% of ChatGPT's performance level on the Vicuna benchmark; the approach was validated across 1,000+ finetuned models spanning scales and datasets.
GPT-4 evaluation correlates well with human judgement on chatbot tasks and is proposed as a cost-effective alternative to human evaluation.
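
The memory savings above come down to simple arithmetic, sketched here for the frozen base weights only (adapters, activations, and optimizer state are ignored). The function names are hypothetical, and the block sizes (64 weights per 32-bit constant, 256 constants per second-level block, 8-bit second-level constants) are the settings reported in the QLoRA paper, not values stated in this summary.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Storage for the base weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

def quant_constant_overhead_bits(block_size=64, const_bits=32,
                                 double_quant=False, dq_block=256):
    """Extra bits per parameter spent on quantization constants."""
    if not double_quant:
        return const_bits / block_size
    # Constants are themselves stored in 8 bits, with one 32-bit
    # second-level constant shared by every dq_block first-level constants.
    return 8 / block_size + const_bits / (block_size * dq_block)

fp16_gb = weight_memory_gb(65e9, 16)  # 16-bit baseline for a 65B model
nf4_gb = weight_memory_gb(65e9, 4)    # 4-bit weights, before constants
```

Under these assumptions the 16-bit weights need about 130 GB and the 4-bit weights about 32.5 GB, while double quantization trims the constant overhead from 0.5 to roughly 0.127 bits per parameter, which is what brings a 65B model within reach of a single 48GB GPU.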

quantization, lora, finetuning, memory efficiency, large language models, instruction tuning

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

2022 · Jason Wei, Xuezhi Wang, Dale Schuurmans +6 · arXiv:2201.11903

Read

Eliciting step-by-step reasoning from large models.

Few-shot prompting with explicit reasoning steps dramatically improves LLM performance on math, commonsense, and symbolic reasoning tasks without any model training.

Chain-of-thought (CoT) prompting is a simple technique where language models are shown a few examples that explicitly work through intermediate reasoning steps before arriving at a final answer. Rather than jumping directly to answers, this approach encourages models to 'think out loud' by generating step-by-step reasoning. The paper demonstrates that when applied to large language models (tested on models up to 540B parameters), CoT prompting substantially improves performance on reasoning-heavy tasks including arithmetic word problems, commonsense reasoning, and symbolic manipulation. A concrete example: providing just eight CoT exemplars to a 540B model achieves state-of-the-art results on GSM8K (grade-school math word problems), outperforming even finetuned GPT-3 baselines that used separate verifier models. The key insight is that this emergent reasoning capability requires sufficient model scale—smaller models see minimal benefit—and that explicit intermediate steps are crucial for unlocking complex reasoning in otherwise standard language models.

Chain-of-thought prompting requires showing only a handful of worked examples with intermediate reasoning steps; no model finetuning or training needed.
Performance gains are most pronounced on reasoning-heavy tasks: arithmetic word problems (GSM8K, SVAMP, MAWPS), commonsense QA (CommonsenseQA, StrategyQA), and symbolic reasoning (last-letter concatenation, coin flip).
The technique exhibits a scaling property: reasoning benefits emerge primarily in large models (>100B parameters); smaller models show minimal improvement.
A single 540B model with CoT exemplars matched or exceeded prior SOTA including finetuned GPT-3 with auxiliary verifier modules, suggesting CoT is more sample-efficient than prior approaches.
The mechanism works across model families (GPT-3, LaMDA, PaLM, UL2, Codex), indicating CoT is a general prompting principle rather than architecture-specific.
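
Building a chain-of-thought prompt is mostly string assembly, as the sketch below shows. The function name and the exemplar dictionary fields are hypothetical; the worked example follows the tennis-ball exemplar used in the paper, and the exact "Q:/A:" layout is an assumption.

```python
def build_cot_prompt(exemplars, question):
    """Prepend worked examples (question, reasoning, answer) to a new question."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Q: {ex['question']}\nA: {ex['reasoning']} The answer is {ex['answer']}."
        )
    # The final block is left open so the model continues with its own reasoning.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

exemplars = [{
    "question": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
                "Each can has 3 tennis balls. How many tennis balls does he have now?",
    "reasoning": "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
                 "6 tennis balls. 5 + 6 = 11.",
    "answer": "11",
}]
prompt = build_cot_prompt(exemplars, "A baker has 12 eggs and uses 5. How many are left?")
```

The only difference from standard few-shot prompting is the reasoning text inside each exemplar; no weights are changed.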

prompting, reasoning, scaling laws, in-context learning, language models

Constitutional AI: Harmlessness from AI Feedback

2022 · Yuntao Bai, Saurav Kadavath, Sandipan Kundu +27 · arXiv:2212.08073

Read

Alignment via AI feedback rather than human labels (RLAIF).

Train harmless AI assistants using only written principles and AI-generated feedback, eliminating the need for human-labeled harmful examples.

This paper introduces Constitutional AI, a method for training AI assistants to be harmless without requiring human-labeled datasets of harmful outputs. Instead of human feedback on what's harmful, the approach uses a set of written principles (a 'constitution') to guide self-improvement. The process has two stages: first, a supervised learning phase where the model critiques and revises its own outputs based on the constitution, then a reinforcement learning phase where an AI-trained reward model (rather than humans) evaluates which responses are better, enabling 'RL from AI Feedback' (RLAIF). The result is an assistant that refuses harmful requests while explaining its reasoning rather than evasively shutting down conversation. This dramatically reduces the need for expensive human labeling while maintaining or improving performance on human evaluations.

Constitutional AI uses a written set of principles (constitution) rather than human labels to define what outputs should be avoided.
Two-stage training: supervised self-critiquing phase where the model revises its own outputs, followed by RL phase with AI-trained preference model as reward.
RLAIF (RL from AI Feedback) replaces human preference labeling with model-based evaluation, reducing annotation costs significantly.
Trained assistants engage transparently with harmful queries by explaining objections rather than evasively refusing, improving user experience.
Chain-of-thought reasoning in both stages improves performance and makes AI decision-making more interpretable to human judges.

rlhf, ai feedback, harmlessness, reinforcement learning, self-improvement, preference modeling

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

2022 · Tri Dao, Daniel Y. Fu, Stefano Ermon +2 · arXiv:2205.14135

Read

IO-aware tiling algorithm that speeds up exact transformer attention 2-3× by reducing GPU memory transfers, enabling longer sequences without quality loss.

Transformer models suffer from slow attention computation and high memory usage because standard self-attention has quadratic time and space complexity relative to sequence length. While approximate attention methods exist, they often fail to deliver actual wall-clock speedups in practice. This paper introduces FlashAttention, an algorithm that optimizes attention by treating GPU memory hierarchy as a first-class concern. The core idea is tiling: breaking attention computation into blocks that fit in fast on-chip SRAM, reducing expensive transfers between high-bandwidth memory (HBM) and SRAM. FlashAttention computes exact attention (no quality loss) while reducing HBM accesses compared to standard implementations. The algorithm achieves 15% end-to-end speedup on BERT-large with 512 tokens, 3× speedup on GPT-2 at 1K tokens, and 2.4× on long-range tasks. The authors extend the method to block-sparse attention for approximate computation, enabling much longer sequences. Models using FlashAttention train faster and support longer contexts, improving perplexity on GPT-2 by 0.7 points and unlocking performance on Path-X (16K tokens) and Path-256 (64K tokens) benchmarks where standard transformers previously failed.

FlashAttention reduces HBM-SRAM data movement via tiling without approximating attention, preserving model quality while speeding training.
Achieves 15% speedup on BERT-large, 3× on GPT-2 (1K length), and 2.4× on long-range tasks compared to standard implementations.
Block-sparse variant enables 16K and 64K token sequences, unlocking new capabilities like Path-X and Path-256 benchmarks.
IO complexity, not FLOPs, is the bottleneck in attention—treating memory hierarchy explicitly yields practical speedups that approximate methods miss.
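
The reason tiling can stay exact is that softmax can be accumulated block by block with a running maximum and a running denominator. The pure-Python sketch below shows this for a single query row; it illustrates the arithmetic only, not the GPU kernel, and the function names, the block size, and the omission of the usual 1/sqrt(d) scaling are simplifying assumptions.

```python
import math

def naive_attention_row(q, K, V):
    """Reference: materialize all scores for one query, then softmax."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(w[j] * V[j][c] for j in range(len(V))) / z for c in range(len(V[0]))]

def tiled_attention_row(q, K, V, block=2):
    """Same result, visiting keys and values one block at a time."""
    m = -math.inf              # running max of scores seen so far
    l = 0.0                    # running softmax denominator
    acc = [0.0] * len(V[0])    # running unnormalized output
    for start in range(0, len(K), block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in Kb]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)  # rescale earlier partial sums (0.0 on first block)
        p = [math.exp(s - m_new) for s in scores]
        l = l * scale + sum(p)
        acc = [a * scale + sum(pj * vj[c] for pj, vj in zip(p, Vb))
               for c, a in enumerate(acc)]
        m = m_new
    return [a / l for a in acc]
```

Because only one block of keys and values plus a few running statistics are needed at a time, the full score matrix never has to be written out to slow memory, which is the source of the reduced HBM traffic.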

attention mechanisms, gpu optimization, io-aware algorithms, transformers, memory efficiency, long sequences

Self-Consistency Improves Chain of Thought Reasoning in Language Models

2022 · Xuezhi Wang, Jason Wei, Dale Schuurmans +5 · arXiv:2203.11171

Read

Sample multiple reasoning paths and aggregate by majority vote to boost chain-of-thought reasoning performance across math and commonsense benchmarks.

Chain-of-thought prompting helps language models solve complex reasoning problems by generating step-by-step solutions, but relying on a single greedy decoding path leaves performance gains on the table. This work introduces self-consistency, a decoding strategy that samples multiple diverse reasoning paths from the model rather than taking only the highest-probability one, then aggregates by selecting the answer that appears most frequently across all paths. The key insight is that hard reasoning problems have many valid solution routes that converge on the same correct answer. Testing on arithmetic benchmarks (GSM8K, SVAMP, AQuA) and commonsense reasoning (StrategyQA, ARC-challenge) shows substantial improvements: GSM8K jumps 17.9 percentage points, SVAMP rises 11 points, and gains hold across all tested datasets. The method is simple to implement—no retraining needed—and works by letting the model's own distribution of reasoning patterns vote on the final answer.

Self-consistency replaces greedy decoding with sampling: generate K diverse reasoning traces and select the most common final answer.
Tested on five benchmarks (GSM8K, SVAMP, AQuA, StrategyQA, ARC-challenge) with gains ranging from 3.9% to 17.9% over single-path chain-of-thought.
Requires no model retraining or fine-tuning; works as a post-hoc decoding strategy on existing chain-of-thought prompts.
Strongest gains appear on arithmetic reasoning (GSM8K +17.9%) and moderate gains on commonsense tasks, suggesting effectiveness varies by task complexity.
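
The sample-then-vote procedure can be sketched in a few lines. Here `sample_fn` is a hypothetical callable that runs one sampled chain-of-thought generation and returns only the parsed final answer; the function name and the returned agreement fraction are illustrative additions.

```python
from collections import Counter

def self_consistent_answer(sample_fn, prompt, k=10):
    """Sample k reasoning paths and return the majority answer with its vote share."""
    answers = [sample_fn(prompt) for _ in range(k)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / k
```

The reasoning text itself is discarded before voting; only final answers are compared, so different solution routes that agree on the result reinforce each other.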

chain-of-thought, prompting, decoding strategy, reasoning, ensemble methods, language models

Self-Instruct: Aligning Language Models with Self-Generated Instructions

2022 · Yizhong Wang, Yeganeh Kordi, Swaroop Mishra +4 · arXiv:2212.10560

Read

Language models can improve their instruction-following ability by generating and filtering their own synthetic instruction-response data for finetuning.

This paper addresses the bottleneck in instruction-tuning language models: human-written instruction datasets are scarce, narrow, and expensive to produce. The authors propose Self-Instruct, a bootstrapping method where a pretrained model generates its own instruction-response pairs. The pipeline works by sampling diverse task instructions from the model itself, creating corresponding inputs and outputs, then filtering out low-quality or duplicate examples before using the synthetic data to finetune the original model. Applied to GPT-3, this approach achieves 33% absolute improvement on the Super-NaturalInstructions benchmark, matching the performance of InstructGPT-001 (which used private human-annotated data). On a held-out set of expert-written novel tasks evaluated by humans, Self-Instruct-tuned GPT-3 outperforms models trained on existing public datasets, closing the gap to InstructGPT-001 to just 5 percentage points. The method is annotation-free at scale and the authors release their synthetic dataset publicly.

Self-Instruct bootstraps task diversity from the model's own generations rather than relying on limited human-written instruction datasets.
The pipeline includes quality filtering and deduplication to remove invalid or redundant synthetic examples before finetuning.
GPT-3 finetuned with Self-Instruct improves by 33% absolute on Super-NaturalInstructions, on par with InstructGPT-001, and trails InstructGPT-001 by only 5 percentage points on held-out expert-written tasks.
The method is nearly annotation-free and scales to large synthetic datasets, enabling broader instruction-tuning without expensive human labeling.
Results validated through both benchmark evaluation and human judgment on novel, out-of-distribution tasks.
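The filtering and deduplication step can be sketched as follows. This is an illustrative assumption rather than the paper's exact filter: it drops empty candidates and any candidate whose token-level ROUGE-L F-score against an already kept instruction reaches a threshold. The function names and the `max_similarity=0.7` default are hypothetical choices made for this example.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(a, b):
    """Token-level ROUGE-L F-score between two strings (0.0 to 1.0)."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    lcs = lcs_len(ta, tb)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(ta), lcs / len(tb)
    return 2 * precision * recall / (precision + recall)

def filter_instructions(candidates, pool, max_similarity=0.7):
    """Keep candidates that are non-empty and not too similar to anything kept so far."""
    kept = []
    for cand in candidates:
        if not cand.strip():
            continue  # drop empty generations
        if all(rouge_l(cand, other) < max_similarity for other in pool + kept):
            kept.append(cand)
    return kept

pool = ["Translate the sentence into French."]
candidates = [
    "Translate the sentence into German.",  # near-duplicate of the pool entry
    "Write a haiku about autumn.",
    "",
    "Write a haiku about autumn.",          # exact duplicate
]
print(filter_instructions(candidates, pool))
```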

instruction-tuning · synthetic data generation · bootstrapping · gpt-3 · zero-shot generalization · data filtering

Training Compute-Optimal Large Language Models

2022 · Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch +19 · arXiv:2203.15556

Read

Corrected the data-vs-size tradeoff; reshaped how models are sized.

Model size and training tokens should scale equally for compute efficiency; most LLMs were undertrained, as shown by Chinchilla outperforming much larger models.

This paper investigates how to optimally divide a fixed compute budget between model size and training data volume when training transformer language models. The authors trained over 400 models ranging from 70M to 16B parameters on datasets of 5 to 500 billion tokens to find the compute-optimal frontier. They discovered that most large language models at the time were severely undertrained—allocating too much compute to model parameters and too little to data. The key finding is that model size and training tokens should scale proportionally: doubling model parameters should be paired with doubling the training data. Based on this principle, they built Chinchilla, a 70B-parameter model trained on 4× more data than its competitor Gopher (280B parameters) while using equivalent total compute. Chinchilla outperformed larger models including GPT-3 (175B), Gopher, and Megatron-Turing NLG (530B) across multiple benchmarks, achieving 67.5% on MMLU and requiring substantially less compute for inference and fine-tuning.

The compute-optimal allocation rule is roughly equal scaling: doubling parameters requires doubling training tokens, not holding data constant.
Chinchilla (70B parameters, 4× more tokens) matched Gopher's compute budget but surpassed Gopher (280B), GPT-3 (175B), and Megatron-Turing NLG (530B) on downstream tasks.
Contemporary large models like GPT-3 and Gopher were overparameterized relative to their training data, wasting compute on unnecessary model size.
The study covered 400+ trained models across a 70M–16B parameter range and 5–500B token budgets, providing empirical evidence for the scaling law.
Smaller compute-optimal models enable faster inference and cheaper fine-tuning, making them more practical for deployment than larger undertrained alternatives.
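The equal-scaling rule can be made concrete with a small calculation. The sketch below rests on two assumptions that do not appear in the summary above: the common approximation that training compute is about 6 × parameters × tokens, and a ratio of roughly 20 training tokens per parameter (which corresponds to a 70B model trained on about 1.4T tokens). Both are rules of thumb, not exact fitted values.

```python
import math

def compute_optimal(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget between parameters N and tokens D.

    Assumes C ~= 6 * N * D and D = tokens_per_param * N, so
    N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget of a 70B-parameter model trained on 1.4T tokens (assumed figures).
budget = 6 * 70e9 * 1.4e12
n, d = compute_optimal(budget)
print(f"{n:.3g} params, {d:.3g} tokens")

# Quadrupling compute doubles both parameters and tokens under this rule.
n4, d4 = compute_optimal(4 * budget)
print(f"{n4 / n:.2f}x params, {d4 / d:.2f}x tokens")
```

Under these assumptions, extra compute is split evenly between a larger model and more data, which is the proportional scaling the key points describe.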

scaling laws · compute efficiency · training data allocation · transformers · language models · model optimization

Training language models to follow instructions with human feedback

2022 · Long Ouyang, Jeff Wu, Xu Jiang +17 · arXiv:2203.02155

Read

The RLHF alignment recipe behind ChatGPT.

Fine-tuning GPT-3 with human feedback yields models that follow instructions better and produce fewer harmful outputs; raters prefer even a much smaller fine-tuned model to the original.

Larger language models don't automatically follow user instructions better or avoid harmful outputs. This paper demonstrates a two-stage fine-tuning approach to align models with user intent. Starting with GPT-3, the researchers first collected demonstrations of desired behavior from human labelers on a dataset of API prompts, then used supervised learning to create a baseline model. Next, they gathered human preference rankings comparing model outputs and applied reinforcement learning from human feedback (RLHF) to further optimize the model. The resulting InstructGPT models—particularly the 1.3B parameter version—were preferred by human raters over the full 175B GPT-3 in evaluations, despite having 100× fewer parameters. InstructGPT also showed improvements in factual accuracy and less toxic generation, with minimal decline on standard NLP benchmarks. While the approach has limitations, it establishes human feedback fine-tuning as a practical method for steering model behavior toward user goals.

Two-stage pipeline: supervised learning on labeler demonstrations, then RLHF on ranked model outputs, outperforms scaling alone.
Human evaluators preferred outputs from the 1.3B InstructGPT model over those from 175B GPT-3, despite its 100× fewer parameters.
Models showed gains in truthfulness and toxicity reduction with minimal regression on public NLP tasks.
Human feedback fine-tuning is a viable alignment technique for steering language model behavior without retraining from scratch.
Dataset constructed from both labeler-written prompts and real API usage patterns, grounding the approach in practical user needs.
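To show how ranked outputs can become a training signal, here is a minimal sketch of a pairwise preference loss of the kind commonly used to train a reward model before the reinforcement learning step. It is an illustration under stated assumptions, not the authors' implementation: the rewards are plain floats standing in for a reward model's scalar outputs, and the function name is hypothetical.

```python
import math

def pairwise_preference_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)), computed stably.

    The loss is small when the preferred output already scores higher
    and large when the ranking is violated.
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(m)) == softplus(-m) == max(-m, 0) + log1p(exp(-|m|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

print(pairwise_preference_loss(2.0, 0.0))  # ranking respected: low loss
print(pairwise_preference_loss(0.0, 2.0))  # ranking violated: high loss
```

A ranking of several outputs can be broken into such pairs; the resulting reward model then scores samples during the RLHF stage.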

rlhf · instruction-following · alignment · human-feedback · supervised-fine-tuning · gpt-3

LoRA: Low-Rank Adaptation of Large Language Models

2021 · Edward J. Hu, Yelong Shen, Phillip Wallis +5 · arXiv:2106.09685

Read

Cheap parameter-efficient fine-tuning; foundational for practical adaptation.

LoRA cuts the number of trainable parameters by up to 10,000× by freezing the pretrained weights and training low-rank update matrices, matching full fine-tuning quality with no added inference latency.

Fine-tuning massive pre-trained language models for downstream tasks is expensive—GPT-3 with 175 billion parameters requires storing and updating enormous weight matrices. LoRA addresses this by freezing the original model weights and adding small trainable matrices decomposed into low-rank factors at each Transformer layer. Instead of updating all parameters, only these rank-decomposed adapters are trained during fine-tuning. On tasks using GPT-3, RoBERTa, DeBERTa, and GPT-2, LoRA achieved comparable or better accuracy than full fine-tuning while reducing trainable parameters by 10,000× and GPU memory by 3×. Crucially, LoRA adds no inference cost—the low-rank updates can be merged into weights at deployment. The authors empirically demonstrate that language model adaptation is rank-deficient, meaning the weight changes during fine-tuning lie in a much lower-dimensional subspace than the full parameter space, justifying the low-rank approach.

LoRA injects trainable rank-decomposed matrices into Transformer layers while keeping pre-trained weights frozen, cutting trainable parameters from billions to millions.
On GPT-3 175B, LoRA achieved 10,000× parameter reduction and 3× GPU memory savings compared to Adam fine-tuning, with equal or better downstream task performance.
Low-rank updates can be merged into weights at inference time, eliminating any latency penalty unlike prior adapter methods.
Empirical analysis shows language model adaptation is rank-deficient—weight changes concentrate in a low-dimensional subspace, explaining why low-rank factors suffice.
Method is model-agnostic and verified across multiple architectures: GPT-2, GPT-3, RoBERTa, and DeBERTa with public PyTorch implementation released.
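The frozen-weight-plus-low-rank-update idea, and the merge that removes inference overhead, can be sketched with tiny matrices. This is a toy illustration in pure Python, not the released implementation: the dimensions, the `alpha / rank` scaling convention, and all function names are assumptions made for this example.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add_scaled(a, b, scale):
    """Element-wise a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def lora_forward(w, a, b, x, alpha):
    """h = W x + (alpha / r) * B (A x); W stays frozen, only A and B train."""
    rank = len(a)
    return add_scaled(matmul(w, x), matmul(b, matmul(a, x)), alpha / rank)

def merge_lora(w, a, b, alpha):
    """Fold the low-rank update into W so inference uses one matrix."""
    rank = len(a)
    return add_scaled(w, matmul(b, a), alpha / rank)

rng = random.Random(0)

def rand(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

d_out, d_in, rank, alpha = 4, 6, 2, 4.0
w = rand(d_out, d_in)   # frozen pretrained weight
a = rand(rank, d_in)    # trainable down-projection
b = rand(d_out, rank)   # trainable up-projection (nonzero, as if after training)
x = rand(d_in, 1)

adapted = lora_forward(w, a, b, x, alpha)
merged = matmul(merge_lora(w, a, b, alpha), x)
max_gap = max(abs(p[0] - q[0]) for p, q in zip(adapted, merged))
print(max_gap)  # tiny floating-point difference only

# Trainable-parameter comparison for one hypothetical 4096 x 4096 weight at rank 8.
print(4096 * 4096, 8 * (4096 + 4096))
```

The merged matrix gives the same output as the two-branch form, which is why the adapter adds no latency once deployed.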

fine-tuning · parameter-efficient adaptation · low-rank decomposition · large language models · transformers · inference optimization

RoFormer: Enhanced Transformer with Rotary Position Embedding

2021 · Jianlin Su, Yu Lu, Shengfeng Pan +3 · arXiv:2104.09864

Read

RoPE encodes token positions as rotation matrices, naturally incorporating relative position information into self-attention and improving long-text classification.

This paper addresses how to encode positional information in transformer models, proposing Rotary Position Embedding (RoPE) as an alternative to existing methods like absolute or relative position embeddings. RoPE encodes each token's absolute position using rotation matrices in the embedding space, which naturally encodes relative distance information in the self-attention mechanism. The key insight is that rotating query and key vectors by angles proportional to their positions makes the attention score depend on position only through the relative offset between tokens, not on their absolute positions. RoFormer, a transformer enhanced with RoPE, was evaluated on long-text classification benchmarks and consistently outperformed baselines. The method offers practical advantages: it is flexible with respect to sequence length, exhibits a natural decay of inter-token dependency as relative distance grows (tokens farther apart contribute less), and works with linear attention variants. The authors provide theoretical analysis explaining why this geometric approach to position encoding improves performance on downstream tasks.

Rotary Position Embedding applies rotation matrices to query/key vectors, so attention scores depend on position only through the relative offset between tokens.
RoPE is flexible with respect to sequence length, and inter-token dependency decays as relative distance increases.
RoFormer consistently outperforms existing position encoding methods on long-text classification benchmarks.
The approach works with both standard and linear self-attention mechanisms.
RoPE is implemented in Hugging Face Transformers and has been adopted in production models.
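The relative-position property can be checked numerically. The sketch below rotates consecutive pairs of vector components by position-dependent angles and shows that the query-key dot product is unchanged when both positions shift by the same amount. The pairing of adjacent dimensions and the frequency base of 10000 are conventional choices assumed here, and the function names are hypothetical.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of components by position * theta_i."""
    d = len(vec)
    assert d % 2 == 0, "dimension must be even"
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]

# Same relative offset (4), very different absolute positions.
near = dot(rope(q, 3), rope(k, 7))
far = dot(rope(q, 103), rope(k, 107))
print(near, far)
```

Both scores agree up to floating-point error, because each 2D rotation contributes only the angle difference between the two positions.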

position embeddings · transformers · self-attention · rotary encoding · long-sequence modeling

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

2021 · William Fedus, Barret Zoph, Noam Shazeer · arXiv:2101.03961

Read

Sparse Mixture-of-Experts scaling, now standard in frontier models.

Simplified single-expert routing enables sparse trillion-parameter Transformers that train 4-7x faster than dense models with the same compute budget.

Switch Transformers simplify Mixture of Experts (MoE) routing to scale language models to a trillion parameters while maintaining constant computational cost. The key insight: instead of routing each input to multiple experts and blending their outputs, route each token to a single expert via a simplified learned routing function. This reduces communication overhead and training instability compared to prior MoE work. The authors introduce training techniques (selective precision, expert dropout, load balancing) that stabilize training of sparse models in lower precision (bfloat16). Empirically, Switch Transformers based on T5 achieve 7x pre-training speedup with identical FLOPs, and extend effectively to 101 languages in multilingual settings. The largest model (1 trillion parameters) trains 4x faster than T5-XXL, demonstrating that sparsity can scale to extreme model sizes without the traditional complexity and communication costs that limited prior MoE adoption.

Switch routing sends each token to the single highest-scoring expert chosen by a learned router, reducing the routing computation and communication overhead of traditional multi-expert MoE.
Stable training of large sparse models relies on expert dropout, an auxiliary load-balancing loss, and selective precision: the router is computed in float32 while the rest of the model runs in bfloat16.
Models scale to 1 trillion parameters on public corpora (C4) and achieve 4x speedup over T5-XXL with the same computational FLOPs, validating efficiency of the sparsity approach.
Switch Transformers improve over the mT5-Base baseline across all 101 languages in the multilingual setting, suggesting sparse routing generalizes across diverse multilingual data.
Sparse models maintain or improve downstream task performance (SuperGLUE, GLUE) compared to dense baselines despite lower per-token capacity.
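Top-1 routing and an auxiliary balancing term can be sketched in a few lines. This is a simplified illustration, not the released code: router logits are given directly instead of being produced by a learned layer, expert capacity limits are ignored, and the coefficient `alpha=0.01` and all function names are assumptions for this example.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def switch_route(router_logits):
    """Pick one expert per token; return expert ids, gate values, and router probabilities."""
    probs = [softmax(row) for row in router_logits]
    experts = [max(range(len(p)), key=p.__getitem__) for p in probs]
    gates = [p[e] for p, e in zip(probs, experts)]  # scales the chosen expert's output
    return experts, gates, probs

def load_balancing_loss(experts, probs, alpha=0.01):
    """alpha * N * sum_i f_i * P_i, smallest when tokens spread evenly over experts."""
    n_experts, n_tokens = len(probs[0]), len(probs)
    f = [experts.count(i) / n_tokens for i in range(n_experts)]              # token fraction
    p = [sum(row[i] for row in probs) / n_tokens for i in range(n_experts)]  # mean router prob
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))

balanced_logits = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
collapsed_logits = [[1.0, 0.0]] * 4

for name, logits in [("balanced", balanced_logits), ("collapsed", collapsed_logits)]:
    experts, gates, probs = switch_route(logits)
    print(name, experts, round(load_balancing_loss(experts, probs), 5))
```

The collapsed case, where every token picks the same expert, gets the higher penalty, which is what pushes the router toward even utilization.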

mixture of experts · sparse models · scaling laws · transformers · routing · training stability

Multimodal

Teaching models to see: vision encoders and vision-language models.

Visual Instruction Tuning

2023 · Haotian Liu, Chunyuan Li, Qingyang Wu +1 · arXiv:2304.08485

Read

LLaVA uses GPT-4-generated instruction data to train an end-to-end vision-language model that reaches 85.1% of GPT-4's score on a synthetic multimodal instruction-following benchmark.

This paper introduces LLaVA, a multimodal model that combines a vision encoder with a large language model to handle both visual and language tasks. The key innovation is using language-only GPT-4 to generate instruction-following training data that pairs images with language tasks, a technique that had proven successful for text-only LLMs but was less explored for multimodal systems. The authors connect a CLIP vision encoder to the Vicuna LLM and train the system end-to-end on the GPT-4-generated data. Results show LLaVA achieves an 85.1% relative score compared with GPT-4 on a synthetic instruction-following benchmark and, when fine-tuned on Science QA and combined with GPT-4, reaches 92.53% accuracy. The work demonstrates that instruction tuning with synthetic data is effective for building capable multimodal assistants, and the authors release their data, model, and code publicly.

Language-only GPT-4 can generate multimodal instruction-following data when given textual descriptions of images, enabling effective instruction tuning for vision-language models.
LLaVA connects a CLIP vision encoder to the Vicuna LLM in an end-to-end trainable architecture for general visual and language understanding.
The model achieves 85.1% relative performance versus GPT-4 on synthetic multimodal instruction tasks and 92.53% on Science QA.
Instruction tuning with synthetic data scales multimodal capabilities similar to how it improved text-only LLMs on zero-shot generalization.

vision-language models · instruction tuning · multimodal learning · synthetic data generation · llm-based data augmentation · visual reasoning

Flamingo: a Visual Language Model for Few-Shot Learning

2022 · Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc +24 · arXiv:2204.14198

Read

Vision-language model combining frozen pretrained vision and language components achieves state-of-the-art few-shot learning on visual tasks without task-specific fine-tuning.

Flamingo addresses the challenge of building vision-language models that learn new tasks from only a few examples, without task-specific fine-tuning. The core innovation is an architecture that connects independent pretrained vision and language models while handling mixed sequences of images, videos, and text in any order. The model uses cross-attention mechanisms to inject visual information into a frozen language model, allowing it to learn from in-context examples. Trained on large-scale web data containing naturally interleaved images and text, Flamingo achieves strong few-shot performance across diverse tasks: visual question-answering (open and multiple-choice), image/video captioning, and more. On multiple benchmarks, a single Flamingo model matches or exceeds performance of models trained with thousands of times more task-specific labeled data, demonstrating that scale and multimodal pretraining enable rapid adaptation without retraining.

Architecture uses cross-attention to inject visual features into a frozen language model, preserving language capabilities while adding vision understanding.
Handles arbitrarily interleaved sequences of images, videos, and text during both training and inference, enabling few-shot prompting.
A single model achieves state-of-the-art few-shot results on image/video captioning, visual QA (both open-ended and multiple-choice), and other vision tasks with only a handful of examples.
Trained on large-scale multimodal web data; outperforms task-specific fine-tuned models that use thousands of times more labeled examples.
Supports flexible input: images, videos, or mixed sequences; no architectural changes needed across different task types.

vision-language models · few-shot learning · multimodal pretraining · cross-attention · in-context learning · visual question-answering

Learning Transferable Visual Models From Natural Language Supervision

2021 · Alec Radford, Jong Wook Kim, Chris Hallacy +9 · arXiv:2103.00020

Read

Train image and text encoders jointly on 400M internet image-caption pairs to learn zero-shot transferable vision models using natural language descriptions.

Traditional computer vision models learn to recognize a fixed set of predefined categories, requiring labeled data for any new concept. This work trains an image encoder and text encoder jointly on 400 million image-caption pairs from the internet by predicting which captions match which images. The dual-encoder architecture learns rich visual representations without task-specific labels. At inference, natural language descriptions become the interface—you can describe any visual concept in words and the model performs zero-shot classification by measuring similarity between image embeddings and text embeddings of class labels. Evaluation across 30+ datasets (ImageNet, OCR, video action recognition, geo-localization, fine-grained classification) shows the learned representations transfer effectively without any downstream fine-tuning. The model matches ResNet-50's ImageNet accuracy zero-shot, using no labeled ImageNet data, while remaining competitive or superior on many other tasks. This demonstrates that internet-scale image-text supervision is a scalable alternative to task-specific labeling.

Dual-encoder architecture (image encoder + text encoder) trained on contrastive image-caption matching learns generalizable visual representations without fixed label sets.
Achieves ResNet-50 ImageNet accuracy in zero-shot mode (no fine-tuning, no ImageNet labels), matching supervised baselines trained on 1.28M labeled examples.
Transfers effectively to 30+ downstream tasks including OCR, video action recognition, geo-localization, and fine-grained classification without task-specific training.
Natural language serves as the unified interface—describe new visual concepts in text and the model infers them via embedding similarity, enabling arbitrary concept specification.
Scales efficiently: larger models and datasets improve performance, and the approach proves more label-efficient than traditional supervised pre-training methods.
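Zero-shot classification by embedding similarity can be sketched as follows. The embeddings here are made-up three-dimensional vectors standing in for the outputs of the image and text encoders; real encoders, prompts, and dimensions are not reproduced, and the function names are hypothetical.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_classify(image_embedding, label_embeddings):
    """Return the label whose text embedding has the highest cosine similarity to the image."""
    img = normalize(image_embedding)
    scores = {
        label: sum(a * b for a, b in zip(img, normalize(emb)))
        for label, emb in label_embeddings.items()
    }
    return max(scores, key=scores.get), scores

# Toy stand-ins for text-encoder outputs of natural-language class descriptions.
labels = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image = [0.8, 0.3, 0.1]  # toy stand-in for an image-encoder output

best, scores = zero_shot_classify(image, labels)
print(best)
```

Adding a new class only requires writing a new description and embedding it; no labeled images or retraining are involved.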

vision-language models · contrastive learning · zero-shot transfer · image-text supervision · embedding alignment · scaling

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

2020 · Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov +9 · arXiv:2010.11929

Read

Pure transformer applied to image patches achieves state-of-the-art vision performance with better computational efficiency than CNNs.

This paper demonstrates that pure transformer architectures can effectively handle image recognition without relying on convolutional networks. The key insight is treating an image as a sequence of fixed-size patches—dividing images into 16×16 pixel blocks and feeding them sequentially to a standard transformer encoder. When pre-trained on large datasets, Vision Transformer (ViT) achieves competitive or superior performance compared to state-of-the-art CNNs on benchmarks like ImageNet, CIFAR-100, and VTAB while using fewer computational resources for training. The approach successfully transfers learned representations to smaller datasets, showing that transformers scale effectively to vision tasks when given sufficient pre-training data. This challenges the conventional wisdom that vision problems inherently require convolutional inductive biases and opens the door to unified architectures across both language and vision domains.

ViT divides images into 16×16 patches, embeds them linearly, and processes the sequence with standard transformer blocks—no convolutions needed.
Performance on ImageNet, CIFAR-100, and VTAB matches or exceeds contemporary CNN architectures when pre-trained on large-scale data.
When pre-trained on large datasets and then transferred to smaller benchmarks, ViT requires fewer computational resources to train than comparable state-of-the-art CNNs, demonstrating effective transfer learning.
Transformers do not require convolutional inductive biases to succeed in vision; sufficient data and scale compensate for weaker image-specific architectural constraints.
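The patch-to-sequence step can be illustrated with plain nested lists. This sketch only shows the splitting and flattening; the learned linear projection, position embeddings, and class token are omitted. The 224×224 RGB input size is an assumption chosen for the example.

```python
def patchify(image, patch=16):
    """Split an H x W x C image (nested lists) into flattened patch vectors, row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ]
            patches.append(flat)
    return patches

# Hypothetical 224 x 224 RGB image filled with zeros.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image)
print(len(tokens), len(tokens[0]))  # 196 patches, each 16 * 16 * 3 = 768 values
```

Each flattened patch plays the role of a word token: it is linearly embedded and the resulting sequence goes through standard transformer blocks.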

vision transformers · image classification · transformer architecture · pre-training · transfer learning · patch embedding

Inference Efficiency

Serving models faster and cheaper — attention, caching, decoding.

Efficient Memory Management for Large Language Model Serving with PagedAttention

2023 · Woosuk Kwon, Zhuohan Li, Siyuan Zhuang +6 · arXiv:2309.06180

Read

PagedAttention and vLLM reduce LLM serving memory waste by treating KV cache like OS virtual memory, enabling 2-4× throughput improvements.

Serving large language models at high throughput requires processing many requests simultaneously, but the key-value cache required for each request consumes substantial memory and changes size unpredictably. Existing systems waste memory through fragmentation and duplicate storage, reducing how many requests can be processed in a batch. This paper introduces PagedAttention, an attention mechanism that borrows virtual memory concepts from operating systems: the KV cache is divided into fixed-size pages that can be allocated non-contiguously, eliminating fragmentation. The authors built vLLM, a serving system implementing PagedAttention that enables memory reuse across requests and within request sequences. Experiments on models like LLaMA and OPT show vLLM achieves 2-4× higher throughput than prior systems like FasterTransformer and Orca at equivalent latency. The gains are largest for longer sequences, bigger models, and complex decoding schemes.

KV cache fragmentation and redundant duplication are major bottlenecks in LLM serving; fixed-size paging removes external fragmentation and limits internal waste to the last partially filled block of each sequence.
vLLM enables cache sharing between requests and within sequences (e.g., shared prefixes in batch sampling), cutting memory duplication.
Throughput gains scale with sequence length and model size: longer contexts amplify the benefit of efficient memory management.
The system was tested on standard models (LLaMA, OPT) and outperformed optimized baselines FasterTransformer and Orca in controlled latency regimes.
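The virtual-memory analogy can be sketched as a block table that maps each sequence's logical token positions to physical cache blocks. This toy allocator is an illustration only, not vLLM's implementation: it tracks block ids rather than tensors, omits prefix sharing and copy-on-write, and the class and method names are hypothetical.

```python
class PagedKVCache:
    """Toy block-table allocator: sequences grow one token at a time."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # physical block ids not in use
        self.block_tables = {}               # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # no block yet, or last block full
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())           # any free block works
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
cache.append_token("a")
cache.append_token("a")
cache.append_token("b")
cache.append_token("a")  # "a" needs a second block, which is not adjacent to its first
print(cache.block_tables, len(cache.free))
cache.release("a")
print(len(cache.free))
```

Because blocks are allocated on demand and need not be contiguous, memory is never reserved for tokens that have not been generated yet.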

llm serving · memory management · kv cache · attention optimization · batching · throughput

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

2023 · Tri Dao · arXiv:2307.08691

Read

FlashAttention-2 doubles attention layer speed through better GPU work partitioning, reaching near-GEMM efficiency for training longer transformer sequences.

Transformer attention layers become a bottleneck when processing long sequences because computation and memory scale quadratically with sequence length. The original FlashAttention addressed this by leveraging GPU memory hierarchy to achieve linear memory usage and 2-4× speedup, but only reached 25-40% of theoretical GPU peak performance. FlashAttention-2 improves GPU utilization through three optimizations: reducing non-matrix operations in the attention algorithm, parallelizing single-head attention across multiple thread blocks to improve occupancy, and distributing work between warps to minimize shared memory traffic. These changes deliver roughly 2× faster runtime than FlashAttention while reaching 50-73% of theoretical peak on NVIDIA A100 GPUs—approaching the efficiency of standard matrix multiplication. In end-to-end GPT training, this translates to 225 TFLOPs/s per A100 (72% model FLOPs utilization), enabling practical training on longer sequences.

Reduces attention runtime 2× versus FlashAttention by fixing thread block occupancy and shared memory access patterns on GPUs.
Achieves 50-73% of peak FLOPs/s on A100 versus 25-40% for original FlashAttention, closing the gap to optimized matrix operations.
Single-head attention now parallelizes across thread blocks instead of using one, increasing GPU utilization without sacrificing accuracy.
End-to-end GPT training reaches 225 TFLOPs/s per A100, enabling practical scaling to longer sequences with higher hardware efficiency.
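The utilization figure follows from simple arithmetic. The sketch below assumes a dense half-precision tensor-core peak of 312 TFLOPs/s for the A100, a hardware figure that is not stated in the summary above.

```python
def flops_utilization(achieved_tflops, peak_tflops):
    """Fraction of theoretical peak throughput actually achieved."""
    return achieved_tflops / peak_tflops

A100_PEAK_TFLOPS = 312.0  # assumed FP16/BF16 dense tensor-core peak

utilization = flops_utilization(225.0, A100_PEAK_TFLOPS)
print(f"{utilization:.0%}")
```

With these inputs the ratio comes out at about 72%, consistent with the model FLOPs utilization quoted for end-to-end GPT training.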

attention mechanisms · gpu optimization · transformers · memory efficiency · gpu kernels · training efficiency

Fast Inference from Transformers via Speculative Decoding

2022 · Yaniv Leviathan, Matan Kalman, Yossi Matias · arXiv:2211.17192

Read

Accelerate transformer decoding 2–3× by letting a small model draft several tokens that the large model then verifies in a single parallel pass, without changing the output distribution.

Autoregressive language models generate text one token at a time, making inference slow for long sequences. This work proposes speculative decoding, which accelerates generation by running a smaller, faster model to predict multiple tokens ahead, then verifying those predictions in parallel using the full model. The key insight is that language modeling contains both hard and easy subtasks—harder decisions benefit from full model capacity, while easier spans can be handled by smaller models. Rather than serially running the large model K times to decode K tokens, speculative decoding uses the small model to propose candidates and validates them together, potentially generating several tokens per forward pass through the large model. The method preserves exact output distribution, requires no retraining or architecture changes, and achieves 2–3× speedup on T5-XXL compared to standard T5X decoding.

Small auxiliary models can approximate easier language modeling subtasks; hard decisions use the full model, reducing expensive forward passes.
Speculative execution with a rejection-sampling verification method preserves the exact output distribution of the large model.
No retraining or architectural modifications needed; works on existing off-the-shelf models like T5-XXL.
Multiple tokens can be generated per large-model forward pass by validating small-model proposals in parallel.
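The verification rule that keeps the output distribution exact can be sketched for a single drafted token. This is a minimal illustration of the accept-or-resample step under stated assumptions: distributions are small dictionaries rather than model outputs, only one position is verified, and the function name is hypothetical.

```python
import random

def verify_token(token, p_target, q_draft, rng=random):
    """Accept a token drafted from q_draft so that the result is distributed as p_target.

    Accept with probability min(1, p/q); on rejection, resample from the
    normalized positive part of (p - q).
    """
    accept_prob = min(1.0, p_target.get(token, 0.0) / q_draft[token])
    if rng.random() < accept_prob:
        return token, True
    residual = {t: max(0.0, p - q_draft.get(t, 0.0)) for t, p in p_target.items()}
    total = sum(residual.values())  # positive whenever a rejection can occur
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return rng.choices(tokens, weights=weights)[0], False

# Toy distributions: the draft model over-weights "b" relative to the target.
p = {"a": 0.7, "b": 0.3}
q = {"a": 0.4, "b": 0.6}

rng = random.Random(0)
trials = 20000
count_a = 0
for _ in range(trials):
    drafted = rng.choices(list(q), weights=list(q.values()))[0]
    final, _accepted = verify_token(drafted, p, q, rng)
    count_a += final == "a"
print(count_a / trials)  # close to the target probability of "a"
```

In the full method the large model scores all drafted positions in one forward pass, and drafting restarts after the first rejected token.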

inference optimization · speculative execution · transformers · decoding · token generation · sampling

Agents & Tool Use

Reasoning-and-acting loops, tool calling, and agentic benchmarks.

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

2023 · Carlos E. Jimenez, John Yang, Alexander Wettig +4 · arXiv:2310.06770

Read

SWE-bench evaluates language models on 2,294 real GitHub issues; state-of-the-art models solve fewer than 2%, revealing a critical capability gap in practical code repair.

This paper introduces SWE-bench, a benchmark for evaluating whether language models can fix real software bugs. The dataset contains 2,294 actual GitHub issues from 12 popular Python repositories, paired with the pull request solutions. Models receive a codebase and an issue description, then must edit the code to resolve the problem. Unlike synthetic code tasks, these issues demand multi-file changes, interaction with execution environments, and reasoning across large codebases. Evaluation results show current models struggle significantly: Claude 2, the best performer, solves only 1.96% of issues. Even fine-tuned models like SWE-Llama achieve minimal success rates. The benchmark highlights a major gap between general language model capabilities and practical software engineering tasks that require sustained context handling, complex reasoning, and coordinated edits.

Benchmark contains 2,294 real GitHub issues from 12 Python repositories with ground-truth pull request solutions.
Tasks require understanding multi-file codebases, coordinating changes across functions and classes, and executing code to verify fixes.
Best model (Claude 2) achieves a 1.96% success rate; the fine-tuned SWE-Llama variant and other SOTA models perform worse.
Existing code generation benchmarks underestimate complexity; real-world issue resolution demands sustained reasoning beyond isolated function generation.
Dataset provides a sustainable evaluation framework for measuring progress toward practical, autonomous AI software engineers.

code generation · software engineering · benchmarks · language models · program repair · evaluation

Toolformer: Language Models Can Teach Themselves to Use Tools

2023 · Timo Schick, Jane Dwivedi-Yu, Roberto Dessì +5 · arXiv:2302.04761

Read

Language models can learn to call external tools via APIs to improve reasoning and factual tasks while maintaining language capabilities.

Large language models are powerful at learning from examples and following instructions, but they fail at tasks requiring precise computation or current factual knowledge—areas where simpler specialized tools excel. This work presents Toolformer, a method for teaching language models to autonomously invoke external APIs (calculator, search engine, Q&A system, translator, calendar) at appropriate moments during text generation. The model learns when to call each tool, what inputs to provide, and how to incorporate results back into token prediction, all trained with minimal supervised examples per tool. On downstream benchmarks, Toolformer matches or exceeds the performance of substantially larger models while preserving general language understanding. The approach is self-supervised, requiring only a few demonstrations to teach tool use without extensive labeled data.

Toolformer learns to invoke tools (calculator, search, Q&A, translation, calendar) directly within token generation, deciding when and what arguments to pass.
Training requires only a handful of in-context examples per tool, using a self-supervised approach rather than extensive annotation.
The model achieves competitive performance with much larger models on zero-shot tasks in math, factual QA, and other benchmarks.
Tool integration preserves the model's core language modeling ability—it doesn't trade general capability for tool use.
The approach works across multiple tool types and scales, showing that smaller models augmented with tools can outperform larger models without augmentation.

tool use · external apis · in-context learning · self-supervised learning · language model augmentation · reasoning

ReAct: Synergizing Reasoning and Acting in Language Models

2022 · Shunyu Yao, Jeffrey Zhao, Dian Yu +4 · arXiv:2210.03629

Read

Interleaving reasoning traces with external actions lets language models ground their thinking, reduce hallucination, and solve complex tasks more reliably.

ReAct addresses a gap in how language models tackle complex problems by combining two typically separate approaches: reasoning and action. The method prompts models to generate reasoning steps (like thinking through a problem) alongside executable actions (like querying external sources) in an interleaved pattern. When answering questions or verifying facts, the model can reason about what it needs, take an action to fetch real information, observe the result, and adjust its reasoning accordingly. This feedback loop prevents hallucination—making up false information—and error accumulation that plague pure reasoning approaches. The authors tested ReAct on four benchmarks: HotpotQA and FEVER (where it queried Wikipedia), plus ALFWorld and WebShop (interactive environments). Performance improvements were substantial: 34% and 10% absolute gains on decision-making tasks, outperforming methods trained with imitation or reinforcement learning, using only one or two examples. Generated trajectories were also rated as more interpretable and trustworthy by human evaluators.

ReAct alternates between generating reasoning steps and executable actions, allowing the model to observe external information and correct its course mid-task.
On HotpotQA and FEVER, the approach reduced hallucination and error propagation by retrieving Wikipedia facts rather than relying on learned knowledge alone.
On ALFWorld and WebShop interactive benchmarks, ReAct achieved 34% and 10% absolute success-rate improvements over imitation and RL baselines with minimal in-context examples.
Human evaluation showed ReAct trajectories were more interpretable and trustworthy than chain-of-thought reasoning or action-only baselines.
The method works with standard language model prompting; no additional training or fine-tuning required.
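The interleaved reason-act-observe pattern can be sketched as a short control loop. Everything model-related here is a stand-in: `scripted_model` is a hypothetical function that plays the role of the prompted language model, and the `search` tool is a tiny dictionary lookup rather than a real Wikipedia API.

```python
def react_loop(model_step, tools, question, max_steps=5):
    """Alternate Thought / Action / Observation until the model issues 'finish'."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, argument = model_step(transcript)
        transcript.append(f"Thought: {thought}")
        transcript.append(f"Action: {action}[{argument}]")
        if action == "finish":
            return argument, transcript
        observation = tools[action](argument)  # ground the next thought in tool output
        transcript.append(f"Observation: {observation}")
    return None, transcript  # step budget exhausted without an answer

def scripted_model(transcript):
    """Hypothetical stand-in for a prompted language model."""
    observations = [line for line in transcript if line.startswith("Observation: ")]
    if not observations:
        return ("I should look up the capital of France.", "search", "capital of France")
    answer = observations[-1][len("Observation: "):]
    return ("The observation answers the question.", "finish", answer)

knowledge = {"capital of France": "Paris"}
tools = {"search": lambda query: knowledge.get(query, "no result")}

answer, transcript = react_loop(scripted_model, tools, "What is the capital of France?")
print(answer)
print("\n".join(transcript))
```

The transcript doubles as the prompt for the next step, which is how each observation can change the model's subsequent reasoning.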

prompting · reasoning · agent architectures · tool use · hallucination mitigation · interactive decision making

Evaluation

How the field measures capability — the benchmarks everyone cites.

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

2022 · Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao +27 · arXiv:2206.04615

Read

BIG-bench, a 204-task benchmark, systematically measures language model capabilities across scale and architecture, revealing performance patterns, human-model gaps, and scaling laws.

This work presents BIG-bench, a large-scale evaluation suite designed to measure language model capabilities and limitations as they scale. The benchmark comprises 204 diverse tasks spanning linguistics, mathematics, reasoning, physics, biology, and other domains—deliberately selected to challenge current models rather than showcase known strengths. The authors evaluated multiple model families including GPT models, Google's dense transformers, and Switch sparse transformers, ranging from millions to hundreds of billions of parameters, and compared results against expert human performance. Key findings show that performance improves with scale but remains poor in absolute terms compared to humans; performance patterns are consistent across different model architectures; tasks with gradual improvement tend to involve memorization, while tasks showing sudden capability jumps involve multi-step reasoning or brittle evaluation metrics; and social bias increases with scale in ambiguous contexts but can be mitigated through prompt engineering. The work establishes a foundation for understanding scaling behavior and identifying capability emergence in large language models.

BIG-bench contains 204 tasks, contributed by 450 authors across 132 institutions, covering linguistics, math, reasoning, science, and bias. The tasks are designed to probe capabilities beyond those of current models.
Model performance improves with scale across all architectures (dense and sparse transformers), but remains far below the expert human baseline on most tasks.
Tasks showing gradual improvement with scale typically involve knowledge or memorization; breakthrough behaviors emerge on multi-step reasoning tasks or on tasks scored with brittle evaluation metrics.
Sparse transformers (Switch-style) show comparable or better performance than dense models of similar scale, suggesting efficiency gains without accuracy loss.
Social bias generally increases with model scale in ambiguous contexts but can be reduced through targeted prompting strategies.
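
One way to see how a brittle metric can create an apparent jump: if an answer counts only when every token is right, smooth gains in per-token accuracy turn into a sharp rise in exact-match score. The sketch below is a toy model, not an analysis from the paper. It assumes token errors are independent, and `exact_match_rate` and the chosen accuracies are illustrative.

```python
def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Probability that every token of an answer is correct.

    Assumes (for illustration only) that token errors are independent.
    """
    return per_token_accuracy ** answer_length


if __name__ == "__main__":
    # Per-token accuracy improves smoothly, yet the all-or-nothing
    # exact-match score stays near zero and then rises quickly.
    for p in (0.5, 0.7, 0.9, 0.99):
        print(f"per-token {p:.2f} -> exact match {exact_match_rate(p, 10):.3f}")
```

Under this assumption, a metric that awards partial credit would track the per-token curve instead, which is why the choice of metric can change whether a capability looks sudden or gradual.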

benchmarking, scaling laws, language models, transformers, sparse models, capability evaluation

Holistic Evaluation of Language Models

2022 · Percy Liang, Rishi Bommasani, Tony Lee +27 · arXiv:2211.09110

Unified evaluation framework measuring 30 language models on 42 scenarios and 7 metrics to expose capabilities, limitations, and trade-offs previously hidden by fragmented benchmarking.

Language models had become central to NLP, but evaluation was fragmented: different models were tested on different benchmarks, which made direct comparison difficult and hid important failure modes. HELM addresses this with a unified evaluation framework covering 42 scenarios (16 core, 26 targeted) and 7 metrics: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. The authors evaluated 30 prominent models (open, limited-access, and closed) under identical conditions. Before HELM, models had on average been evaluated on only 17.9% of the core HELM scenarios; HELM raises that coverage to 96.0%. The framework reveals trade-offs between metrics (for example, a model might excel at accuracy but fail on fairness) and exposes performance gaps on underrepresented domains like non-standard English dialects. The authors released all prompts, completions, and a modular toolkit to make the benchmark transparent and extensible.

Before HELM, the 30 models had on average been evaluated on only 17.9% of HELM's core scenarios; HELM raises this coverage to 96.0%, enabling direct comparison.
Seven metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency) are measured on each core scenario where possible, to avoid optimizing for accuracy alone at the expense of safety and fairness.
The benchmark includes 21 scenarios never before used in mainstream LM evaluation, capturing gaps like question-answering on neglected English dialects.
All raw model outputs and prompts are released publicly alongside a modular toolkit, enabling community audit and extension beyond the initial 42 scenarios.
Evaluation reveals critical trade-offs: models that perform well on one metric often degrade on others, requiring practitioners to make explicit design choices.
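
Calibration is the least self-explanatory of the seven metrics. A common way to quantify it is expected calibration error (ECE), which compares a model's stated confidence with its observed accuracy inside confidence bins. The sketch below is a generic equal-width-bin ECE written for illustration, not HELM's own code; the function name and the default bin count are assumptions.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy per bin.

    confidences: predicted probability of the chosen answer, in [0, 1].
    correct:     1 if the chosen answer was right, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the share of examples in bin
    return ece
```

A model that says "80% sure" and is right 80% of the time scores near zero; a model that says "90% sure" and is right a quarter of the time scores a large gap, even if its accuracy matches a better-calibrated model.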

benchmarking, language models, evaluation metrics, fairness, robustness, toxicity detection

Measuring Massive Multitask Language Understanding

2020 · Dan Hendrycks, Collin Burns, Steven Basart +4 · arXiv:2009.03300

MMLU benchmark measures language model knowledge across 57 diverse subjects; even largest models fall far short of expert performance and struggle with ethics and law.

This paper introduces MMLU (Massive Multitask Language Understanding), a benchmark dataset designed to evaluate how well language models handle diverse knowledge and reasoning tasks. The benchmark spans 57 subjects ranging from elementary math and history to law, computer science, and ethics. Each subject contains multiple-choice questions testing factual knowledge and problem-solving ability across academic and professional domains. The authors evaluate several models including GPT-3 variants. They find that even the largest GPT-3 model achieves only modest performance—roughly 20 percentage points above random guessing on average—and fails to reach expert-level accuracy on any of the 57 tasks. A critical finding is that models show uneven performance across subjects and often cannot reliably identify when they are wrong. Performance on socially sensitive domains like law and ethics remains near random, revealing major gaps in model behavior. The benchmark serves as a diagnostic tool to measure breadth and depth of model knowledge and pinpoint specific weaknesses across a wide range of human expertise areas.

MMLU contains 57 tasks spanning math, history, law, science, medicine, and other professional domains to test broad world knowledge and reasoning.
The largest GPT-3 model scores about 20 percentage points above the random baseline on average but falls short of expert-level accuracy on all 57 subjects.
Models show uneven performance across domains and cannot reliably detect when their answers are incorrect.
Knowledge gaps are especially pronounced in socially important areas like law and morality, where accuracy is near random.
The benchmark enables diagnosis of specific model weaknesses across diverse knowledge areas, not just overall capability.
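
To put the 20-point margin in absolute terms: MMLU questions are multiple choice with four options, so random guessing scores 25%. The sketch below shows that arithmetic and a simple per-subject average. The subject names and answer sheets are made up for illustration and are not results from the paper.

```python
NUM_CHOICES = 4  # MMLU questions offer four answer options
CHANCE_ACCURACY = 1 / NUM_CHOICES


def accuracy(predictions, answers):
    """Fraction of questions where the predicted letter matches the key."""
    matches = sum(p == a for p, a in zip(predictions, answers))
    return matches / len(answers)


def margin_over_chance(acc):
    """Percentage points above the random-guessing baseline."""
    return 100 * (acc - CHANCE_ACCURACY)


if __name__ == "__main__":
    # Hypothetical answer sheets for two subjects (not real MMLU data).
    results = {
        "subject_a": accuracy(list("ABCDABCD"), list("ABCDABDC")),
        "subject_b": accuracy(list("AAAA"), list("ABCD")),
    }
    average = sum(results.values()) / len(results)
    print(f"average accuracy {average:.2f}, "
          f"{margin_over_chance(average):.0f} points above chance")
```

Averaging per subject rather than per question keeps small subjects from being drowned out by large ones, which matters when the goal is to diagnose weak areas.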

benchmarking, language models, knowledge evaluation, multitask learning, gpt-3, model limitations

State Space Models

Linear-time alternatives to attention for long sequences.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

2023 · Albert Gu, Tri Dao · arXiv:2312.00752

Input-dependent state space model (Mamba) matches Transformer performance with 5× faster inference and linear sequence scaling.

Transformers dominate foundation models but struggle with long sequences due to quadratic attention complexity. Prior alternatives like linear attention and state space models (SSMs) run faster but underperform on language tasks because they can't adapt their processing based on input content. This paper introduces Mamba, which makes SSM parameters input-dependent so the model can selectively decide whether to keep or discard information at each step. Because input-dependent parameters rule out the efficient convolution that earlier SSMs rely on, the authors also develop a hardware-aware parallel algorithm that computes the model in recurrent mode. The resulting architecture drops attention and MLP layers entirely, achieving 5× higher inference throughput than Transformers while scaling linearly in sequence length. On language modeling, a 3 billion parameter Mamba model outperforms same-sized Transformers and matches Transformers twice its size. The approach generalizes across modalities including audio and genomics, with demonstrated improvements even on sequences approaching one million tokens.

Mamba makes SSM parameters functions of the input token, enabling content-based selective propagation or forgetting of information along sequences.
A hardware-aware parallel scan computes these data-dependent SSMs efficiently in recurrent mode, since input dependence rules out the convolutional shortcut used by earlier SSMs.
A 3B Mamba model outperforms 3B Transformers and matches Transformers twice its size on language pretraining and downstream tasks.
Architecture removes attention and MLP blocks entirely, replacing them with simplified selective SSM layers.
Scales linearly in sequence length and achieves 5× higher inference throughput than Transformers, while scaling well to million-token sequences.
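
The selection mechanism can be sketched as a recurrence whose step size and input/output projections are recomputed from each token. The NumPy code below is a minimal sequential sketch under stated assumptions, not the paper's fused parallel scan: the weight names (`W_delta`, `W_B`, `W_C`) are hypothetical, the state matrix is taken to be diagonal, and the discretization is a simplified one (exponential for A, step size times B for the input).

```python
import numpy as np


def softplus(z):
    """Smooth positive function used to keep the step size > 0."""
    return np.log1p(np.exp(z))


def selective_scan(x, A, W_delta, W_B, W_C):
    """Run an input-dependent (selective) SSM over a sequence.

    x:       (T, D) input sequence with D channels.
    A:       (D, N) negative diagonal state entries per channel.
    W_delta: (D, D), W_B: (D, N), W_C: (D, N) hypothetical projections.
    Returns  (T, D) outputs.
    """
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))            # hidden state, one N-vector per channel
    y = np.zeros((T, D))
    for t in range(T):
        # Parameters are functions of the current token: this is the
        # "selective" part that fixed-parameter SSMs lack.
        delta = softplus(x[t] @ W_delta)        # (D,) step size per channel
        B = x[t] @ W_B                          # (N,) input projection
        C = x[t] @ W_C                          # (N,) output projection
        A_bar = np.exp(delta[:, None] * A)      # (D, N) decay in (0, 1)
        B_bar = delta[:, None] * B[None, :]     # (D, N) simplified discretization
        h = A_bar * h + B_bar * x[t][:, None]   # keep or overwrite state
        y[t] = h @ C                            # (D,) read out
    return y
```

A large step size pushes the decay toward zero, so the state is mostly overwritten by the current token; a small step size leaves the state nearly unchanged, so the token is effectively ignored. The cost is one state update per token, which is where the linear scaling in sequence length comes from.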

state space models, sequence modeling, linear complexity, efficient inference, language modeling, hardware-aware algorithms

Efficiently Modeling Long Sequences with Structured State Spaces

2021 · Albert Gu, Karan Goel, Christopher Ré · arXiv:2111.00396

S4 makes state space models efficient for long sequences by reparameterizing the state matrix so that computation reduces to a Cauchy kernel, outperforming Transformers on long-range benchmarks.

Sequence modeling often fails on very long inputs (10,000+ steps) because RNNs, CNNs, and Transformers lose either efficiency or memory capacity at that length. This paper introduces S4, which treats sequences as continuous dynamical systems by simulating a structured state space model. The key insight is conditioning the state matrix A with a low-rank correction so that it can be diagonalized stably, which reduces the SSM computation to a Cauchy kernel, a well-studied mathematical operation. This makes S4 much faster and more memory-efficient than naive SSM implementations while maintaining theoretical advantages for capturing long-range dependencies. On benchmarks, S4 matches or beats specialized architectures: it achieves 91% on sequential CIFAR-10 without augmentation, substantially closes the gap to Transformers on language and image modeling while generating 60× faster, and solves the Path-X task (16k steps) that prior methods cannot handle. The contribution is making continuous state space modeling practical for real sequence problems.

S4 parameterizes the state matrix A with a low-rank correction to enable stable diagonalization and reduce computation to a Cauchy kernel operation.
Achieves 91% accuracy on sequential CIFAR-10 without data augmentation, on par with a larger 2-D ResNet.
Solves the length-16k Path-X task from Long Range Arena, on which prior methods fail completely.
Generates 60× faster than Transformers while closing the performance gap on language and image modeling.
Works as a unified architecture across diverse modalities and tasks without requiring specialized variants.
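
The efficiency argument rests on a standard property of linear state space models: once discretized, the same model can be run step by step as a recurrence or all at once as a convolution with kernel (C B̄, C Ā B̄, C Ā² B̄, ...). The sketch below checks that equivalence numerically using the bilinear discretization. It builds the kernel naively by repeated matrix multiplication, which is the costly step that S4's Cauchy-kernel reformulation is meant to avoid; the function names are illustrative assumptions.

```python
import numpy as np


def discretize(A, B, step):
    """Bilinear transform of a continuous SSM x' = Ax + Bu."""
    identity = np.eye(A.shape[0])
    inverse = np.linalg.inv(identity - (step / 2) * A)
    return inverse @ (identity + (step / 2) * A), inverse @ (step * B)


def run_recurrent(A_bar, B_bar, C, u):
    """Step-by-step view: h_t = A_bar h_{t-1} + B_bar u_t, y_t = C h_t."""
    h = np.zeros(A_bar.shape[0])
    outputs = []
    for u_t in u:
        h = A_bar @ h + B_bar * u_t
        outputs.append(C @ h)
    return np.array(outputs)


def ssm_kernel(A_bar, B_bar, C, length):
    """Naive kernel K[k] = C A_bar^k B_bar (S4 computes this far faster)."""
    kernel = []
    v = B_bar.copy()
    for _ in range(length):
        kernel.append(C @ v)
        v = A_bar @ v
    return np.array(kernel)


def run_convolutional(A_bar, B_bar, C, u):
    """Convolutional view: y = K * u, truncated to the input length."""
    kernel = ssm_kernel(A_bar, B_bar, C, len(u))
    return np.convolve(u, kernel)[:len(u)]
```

The recurrent form suits autoregressive generation because each new step needs only the previous state, while the convolutional form suits training because the whole sequence is processed in parallel.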

state space models, long-range dependencies, sequence modeling, efficient transformers, structured kernels, cauchy matrix

More Papers

Papers added outside the curated reading path.

Flash-KMeans: Fast and Memory-Efficient Exact K-Means

2026 · Shuo Yang, Haocheng Xi, Yilong Zhao +10 · arXiv:2603.09229

GPU-optimized K-means that eliminates memory materialization and write contention, achieving 17.9× speedup for practical online clustering.

K-means clustering is traditionally treated as an offline preprocessing step, but this work makes it practical for online AI systems by addressing GPU memory and compute bottlenecks. The authors identified two critical performance problems in existing GPU implementations: (1) the assignment step explicitly materializes a large N×K distance matrix in GPU memory, creating severe bandwidth waste, and (2) the centroid update step suffers from hardware contention when many tokens scatter-write to the same cluster centers. Flash-KMeans fixes these through two kernel innovations: FlashAssign fuses distance computation with argmin to avoid materializing distances at all, and sort-inverse update reorganizes data to convert chaotic atomic writes into efficient localized reduction operations. Additional optimizations include overlapping stream operations and compile-time cache tuning. On NVIDIA H200 GPUs, Flash-KMeans achieves a 17.9× speedup over prior baselines and runs 33× faster than NVIDIA's cuML library and over 200× faster than FAISS. This makes exact K-means viable as a first-class online component in modern AI pipelines.

The bottleneck in GPU K-means is not algorithmic but system-level: the N×K distance matrix materialization and atomic write contention on centroid updates.
FlashAssign kernel fuses distance computation with argmin reduction, bypassing intermediate distance matrix storage entirely.
Sort-inverse update transforms scattered atomic writes into sequential segment reductions, eliminating hardware contention on cluster centers.
Tested on NVIDIA H200; achieves a 33× speedup over cuML and over 200× over FAISS in exact K-means clustering.
Open-sourced implementation enables K-means as an online primitive in end-to-end AI systems rather than offline-only preprocessing.
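
The data flow behind the two ideas can be sketched on the CPU with NumPy. This is an illustration of the general pattern only, not the paper's CUDA kernels: `assign_tiled` keeps a running best distance and index while visiting centroids in tiles, so the full N×K matrix never exists, and `update_sorted` sorts points by label and sums contiguous segments instead of scatter-adding into shared centroids. The function names, the tile size, and the prefix-sum trick are assumptions made for this example.

```python
import numpy as np


def assign_tiled(points, centroids, tile=256):
    """Nearest-centroid labels without storing the full N x K matrix."""
    n = points.shape[0]
    best_dist = np.full(n, np.inf)
    best_idx = np.zeros(n, dtype=np.int64)
    point_sq = (points * points).sum(axis=1)
    for start in range(0, centroids.shape[0], tile):
        block = centroids[start:start + tile]
        # Squared distances to this tile only: shape (N, tile).
        dist = (point_sq[:, None]
                - 2.0 * (points @ block.T)
                + (block * block).sum(axis=1)[None, :])
        local = dist.argmin(axis=1)
        local_dist = dist[np.arange(n), local]
        # Fold the tile's winner into the running minimum.
        better = local_dist < best_dist
        best_dist[better] = local_dist[better]
        best_idx[better] = start + local[better]
    return best_idx


def update_sorted(points, labels, old_centroids):
    """Centroid update via sort plus contiguous segment sums."""
    k = old_centroids.shape[0]
    order = np.argsort(labels, kind="stable")
    sorted_labels = labels[order]
    # Start and end of each cluster's segment in the sorted order.
    starts = np.searchsorted(sorted_labels, np.arange(k), side="left")
    ends = np.searchsorted(sorted_labels, np.arange(k), side="right")
    # Prefix sums turn every segment sum into a single subtraction.
    prefix = np.vstack([np.zeros((1, points.shape[1])),
                        np.cumsum(points[order], axis=0)])
    sums = prefix[ends] - prefix[starts]
    counts = (ends - starts)[:, None]
    # Empty clusters keep their previous centroid.
    return np.where(counts > 0, sums / np.maximum(counts, 1), old_centroids)
```

One Lloyd iteration under these assumptions is `labels = assign_tiled(X, C)` followed by `C = update_sorted(X, labels, C)`. Peak extra memory in the assignment step is N×tile rather than N×K, which is the property the fused kernel exploits on the GPU.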

gpu optimization, k-means clustering, memory efficiency, io-aware algorithms, kernel fusion, hardware-aware design

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

2024 · Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal +5 · arXiv:2406.17557

FineWeb is a 15-trillion token public pretraining dataset with documented curation choices that outperforms other open datasets and includes an educational subset improving reasoning benchmarks.

Large language models' capabilities depend heavily on pretraining data quality, but most state-of-the-art models use proprietary datasets with little public documentation. This work introduces FineWeb, a 15-trillion token dataset built from 96 Common Crawl snapshots, which produces better downstream performance than existing open pretraining datasets. The authors systematically document and test their data curation pipeline, including deduplication and filtering approaches, to clarify which decisions matter. They also release FineWeb-Edu, a 1.3-trillion token subset filtered for educational content, which significantly improves performance on knowledge and reasoning tasks like MMLU and ARC. By publishing both the cleaned datasets and the curation code used in ablation studies, the work aims to demystify how to build high-quality pretraining corpora at scale and provide reproducible baselines for the community.

FineWeb derives 15 trillion tokens from 96 Common Crawl snapshots; FineWeb-Edu is a 1.3-trillion-token subset filtered for educational content that yields stronger results on MMLU and ARC.
Authors systematically ablate deduplication and filtering strategies to isolate which design choices improve model performance.
Full data curation code and ablation models are released openly, enabling reproducible dataset engineering research.
Models trained on FineWeb outperform those trained on other publicly available pretraining datasets of comparable size.

pretraining datasets, data curation, common crawl, deduplication, scaling, llm training

Updated 6/17/2026, 9:05:44 AM
