Princeton COS 597A: Efficient Systems for Foundation Models (Fall 2025)
Schedule: Tue 10:40am-12pm, Friend Center 016
Instructor: Tri Dao (tridao.me)
Office hours: Thu 1:30-2:30pm, COS 420
Prerequisites: COS 324. Additional familiarity with machine learning systems will help but is not required.
Communication: Please make sure you’re on the course Slack workspace. The link to the Slack workspace is on Canvas.
Description
As models increase in size and training budget, they not only systematically improve in upstream quality but also exhibit novel emergent capabilities. This increase in scale raises proportionate difficulties for practitioners: foundation model training and inference lie at a unique interdisciplinary crossroads, combining open problems in algorithms, system design, and software engineering.
The goal of this course is to give an overview of emerging research questions and challenges associated with foundation model training and inference. We will focus on training and inference systems and algorithms for foundation models, whether aimed at scaling up or at reducing compute, time, memory, bandwidth, and energy requirements:
- Training and inference systems, either distributed at large scale or in resource-constrained scenarios;
- Algorithms for improved training and inference efficiency;
- Systems for foundation models, such as novel programming languages or compilers.
This course is primarily intended for PhD students studying topics related to machine learning systems. That said, the course is open to any student who has excelled in COS 324.
Structure and Grading
- 40% Participation in paper discussions
- 25% Paper presentation/lecture
- 35% Research project (report and presentation)
Paper Reading and Discussion
A major component of this course is reading and discussing research papers in depth. To ensure a lively and focused discussion, you should closely read each paper before the lecture in which it will be discussed, and come to class prepared with several points that will substantially contribute to the group discussion. Your participation grade will be determined by attendance and, more importantly, by substantive contributions to the paper discussions in class; as a rule of thumb, given the small class size, you should aim for at least two discussion contributions (deep questions, observations, etc.) per lecture.
Paper Presentation/Lecture
In each class, 2-3 students will present the scheduled paper and lead its discussion. Presentations should start with a (roughly) 20-25 minute overview of the paper; in many cases, especially for the first paper in a given topic, presenters are also responsible for providing background on the area (please reach out to the instructor for background pointers). This part of the presentation should be “conference style,” i.e., covering the domain and relevant background, the problem statement and challenges, the solution, results, and potential limitations and improvements. However, the presentation should go into more detail than a typical conference talk, particularly on the design of the proposed solution; for this reason, public conference slides for the paper can be used as an aid, but they will not suffice for the lecture. The remainder of the lecture will involve leading the discussion, both by fielding questions and by posing questions to keep the conversation going. Non-presenters are expected to actively participate and bring discussion points (including questions) of their own. Active participation leads to a lively discussion that benefits everyone.
Research Project
In addition to paper reading, this course includes a semester-long research project, carried out in pairs. The goal of the project is not necessarily to fully implement a research idea. Instead, students are encouraged to pick a problem that is new and exciting to them and to focus primarily on building (small-scale) prototypes and collecting measurements that motivate the problem and their solution. Implementation is thus a key aspect of the project, but students are encouraged to aim high and not feel restricted to topics or ideas that can be 100% implemented before the course concludes. The scope of acceptable topics is quite broad: anything related to improving training or inference for generative models (language, image, multi-modal, etc.) is fair game. Extensions of ongoing research projects are allowed if in scope; please see the instructor to discuss your ongoing project and how you would like to extend it for the course. You are strongly encouraged to begin thinking about project topics early in the semester by reviewing the reading list and topics and by discussing ideas with the instructor.
Course Schedule
Week 1 - Sep 2: Course introduction. Overview of foundation model training and inference.
Reading:
Week 2 - Sep 9: Scaling laws: why foundation models are large
Reading:
- Scaling Laws for Neural Language Models
- Training Compute-Optimal Large Language Models (Chinchilla scaling law)
- Language Models are Few-Shot Learners (GPT3)
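As a preview of the reading, the Chinchilla paper fits training loss with a parametric form roughly like

L(N, D) ≈ E + A / N^α + B / D^β

where N is the number of model parameters and D the number of training tokens; the constants E, A, B, α, β are fit empirically, and compute-optimal choices of N and D follow from minimizing this expression under a fixed compute budget. (This is a sketch of the functional form only, not the fitted values.)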
Week 3 - Sep 16: Hardware characteristics
Reading:
Week 4 - Sep 23: Distributed training: Tensor Parallel, Pipeline Parallel, Sequence Parallel
Reading:
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- Reducing Activation Recomputation in Large Transformer Models
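For intuition before the reading, here is a toy NumPy sketch of Megatron-style tensor parallelism for a two-layer MLP: the first weight matrix is split by columns and the second by rows, so each simulated "device" computes a partial output independently and a single all-reduce (here just a sum) recovers the exact result. Shapes, names, and the two-shard simulation are ours for illustration, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 8, 16                      # hidden size and MLP intermediate size
x = rng.normal(size=(4, d))       # a batch of 4 token activations
W1 = rng.normal(size=(d, h))      # first MLP weight (column-parallel)
W2 = rng.normal(size=(h, d))      # second MLP weight (row-parallel)

def relu(z):
    return np.maximum(z, 0)

# Reference: the full, single-device MLP forward pass.
y_ref = relu(x @ W1) @ W2

# Tensor parallelism over 2 simulated devices:
# device i holds a column shard of W1 and the matching row shard of W2.
W1_shards = np.split(W1, 2, axis=1)   # each (d, h/2)
W2_shards = np.split(W2, 2, axis=0)   # each (h/2, d)

# Each device computes a partial output with no communication between the two
# matmuls, because ReLU is applied elementwise to its own shard.
partials = [relu(x @ W1_i) @ W2_i for W1_i, W2_i in zip(W1_shards, W2_shards)]

# The all-reduce across devices is just a sum of the partial outputs here.
y_tp = sum(partials)

print(np.allclose(y_ref, y_tp))   # True
```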
Week 5 - Sep 30: Attention optimizations
Reading:
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- Ring Attention with Blockwise Transformers for Near-Infinite Context
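Before reading FlashAttention, it helps to see the online-softmax recurrence it builds on. The NumPy sketch below processes the keys/values in blocks while keeping a running row max and normalizer, and reproduces full softmax attention exactly; it ignores everything that makes the paper hard (SRAM tiling, the backward pass, GPU kernels), and the names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, block = 128, 32, 16
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))
v = rng.normal(size=(n, d))
scale = 1.0 / np.sqrt(d)

# Reference: standard attention that materializes the full n x n score matrix.
s = (q @ k.T) * scale
p = np.exp(s - s.max(axis=-1, keepdims=True))
o_ref = (p / p.sum(axis=-1, keepdims=True)) @ v

# Blockwise attention with the online-softmax recurrence: keep a running
# row max m, running normalizer l, and unnormalized output accumulator acc.
m = np.full((n, 1), -np.inf)
l = np.zeros((n, 1))
acc = np.zeros((n, d))
for start in range(0, n, block):
    kb = k[start:start + block]
    vb = v[start:start + block]
    sb = (q @ kb.T) * scale                        # scores against this K/V block
    m_new = np.maximum(m, sb.max(axis=-1, keepdims=True))
    correction = np.exp(m - m_new)                 # rescale old stats to the new max
    pb = np.exp(sb - m_new)
    l = l * correction + pb.sum(axis=-1, keepdims=True)
    acc = acc * correction + pb @ vb
    m = m_new
o_tiled = acc / l

print(np.allclose(o_ref, o_tiled))   # True
```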
Week 6 - Oct 7: Mixture of experts
Reading:
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
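As a warm-up for the MoE readings, the sketch below shows the basic top-k routing pattern these papers share: a router scores the experts for each token, the token is sent to its top-k experts, and the expert outputs are combined with renormalized router weights. This is a toy per-token loop; real systems batch tokens per expert, add load-balancing losses, and run experts across devices. All names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, top_k, d, n_tokens = 8, 2, 16, 4

x = rng.normal(size=(n_tokens, d))
router_w = rng.normal(size=(d, num_experts))
experts = [rng.normal(size=(d, d)) for _ in range(num_experts)]  # toy "experts": linear maps

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

probs = softmax(x @ router_w)                   # (n_tokens, num_experts) routing probabilities

y = np.zeros_like(x)
for t in range(n_tokens):
    top = np.argsort(probs[t])[-top_k:]         # indices of the top-k experts for this token
    gate = probs[t, top] / probs[t, top].sum()  # renormalize over the chosen experts
    for g, e in zip(gate, top):
        y[t] += g * (x[t] @ experts[e])         # weighted sum of the chosen experts' outputs

print(y.shape)   # (4, 16)
```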
Oct 14: No class
Week 7 - Oct 21: Inference: quantization and sparsity
Reading:
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- SqueezeLLM: Dense-and-Sparse Quantization
- AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration
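All three papers this week improve on the simplest baseline, per-group round-to-nearest weight quantization, sketched below in NumPy for a 4-bit symmetric scheme. GPTQ, SqueezeLLM, and AWQ each refine this baseline differently (second-order weight updates, sensitivity-aware dense-and-sparse splitting, activation-aware scaling); the sketch shows only the baseline, and the names and sizes are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 256))       # a toy weight matrix
group_size = 64                      # quantization group size along the input dimension
qmax = 7                             # symmetric 4-bit: codes will land in [-7, 7]

w_groups = w.reshape(w.shape[0], -1, group_size)              # (rows, groups, group_size)
scales = np.abs(w_groups).max(axis=-1, keepdims=True) / qmax  # one scale per group
codes = np.clip(np.round(w_groups / scales), -qmax, qmax)     # integer codes per group
w_deq = (codes * scales).reshape(w.shape)                     # dequantized weights

print(f"mean absolute quantization error: {np.abs(w - w_deq).mean():.4f}")
```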
Week 8 - Oct 28: Inference: speculative decoding, architectural optimization
Reading:
- Fast Inference from Transformers via Speculative Decoding
- Fast Transformer Decoding: One Write-Head is All You Need
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
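To preview speculative decoding before the reading: a small draft model proposes several tokens and the large target model checks them together, keeping the longest verified prefix so each target pass can emit more than one token. The sketch below uses a simplified greedy acceptance rule (accept while the draft token matches the target's argmax) rather than the paper's rejection-sampling scheme, and the two "models" are stand-in functions we made up.

```python
import numpy as np

VOCAB = 100

def fake_logits(ctx, seed):
    """Stand-in for a model forward pass: deterministic logits from the context."""
    r = np.random.default_rng(hash((tuple(ctx), seed)) % (2**32))
    return r.normal(size=VOCAB)

def greedy(ctx, seed):
    return int(np.argmax(fake_logits(ctx, seed)))

def speculative_step(ctx, num_draft=4, draft_seed=1, target_seed=2):
    # 1) The draft model proposes num_draft tokens autoregressively (cheap).
    draft, d_ctx = [], list(ctx)
    for _ in range(num_draft):
        t = greedy(d_ctx, draft_seed)
        draft.append(t)
        d_ctx.append(t)

    # 2) The target model verifies the draft (in practice, one batched pass):
    #    accept tokens while they match, then take the target's token at the
    #    first mismatch, so the step always advances by at least one token.
    accepted, t_ctx = [], list(ctx)
    for t in draft:
        target_tok = greedy(t_ctx, target_seed)
        if target_tok == t:
            accepted.append(t)
            t_ctx.append(t)
        else:
            accepted.append(target_tok)
            break
    return list(ctx) + accepted

print(speculative_step([1, 2, 3]))
```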
Week 9 - Nov 4: Inference serving: PagedAttention, SGLang, chunked prefill
Reading:
- Efficient Memory Management for Large Language Model Serving with PagedAttention
- SGLang: Efficient Execution of Structured Language Model Programs
- Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
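The core data structure behind PagedAttention is simple enough to sketch before the reading: the KV cache is carved into fixed-size blocks, and each sequence keeps a block table mapping its logical token positions to physical blocks, so memory is allocated on demand and freed blocks are reused across sequences. Below is a toy allocator in plain Python; the class and field names are ours, not vLLM's.

```python
BLOCK_SIZE = 16          # tokens per KV-cache block

class PagedKVCache:
    """Toy block allocator mapping each sequence's logical blocks to physical blocks."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}            # seq_id -> list of physical block ids
        self.seq_lens = {}                # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:      # current block is full (or this is the first token)
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def slot(self, seq_id, pos):
        """Physical (block, offset) holding the KV for token `pos` of `seq_id`."""
        table = self.block_tables[seq_id]
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free_sequence(self, seq_id):
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(20):                       # a 20-token sequence needs two blocks
    cache.append_token("seq0")
print(cache.block_tables["seq0"], cache.slot("seq0", 17))
```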
Week 10 - Nov 11: Software framework, compilers
Reading:
- Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations
- PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation
- HybridFlow: A Flexible and Efficient RLHF Framework
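For the PyTorch 2 reading, the user-facing entry point is torch.compile, which captures a Python function into a graph and hands it to a compiler backend. A minimal usage example (requires PyTorch 2.x; the toy function is ours):

```python
import torch

def f(x):
    # An arbitrary toy function; the compiler can fuse these elementwise ops.
    return torch.sin(x) ** 2 + torch.cos(x) ** 2

compiled_f = torch.compile(f)        # graph capture + compilation happen on first call
x = torch.randn(1000)
print(torch.allclose(f(x), compiled_f(x)))   # True (up to default tolerances)
```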
Week 11 - Nov 18: New architectures
Reading:
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- Titans: Learning to Memorize at Test Time
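For the Mamba reading, the heart of the model is a selective state-space recurrence whose B, C, and step size Δ depend on the input. The sketch below runs a diagonal, single-channel version of that recurrence sequentially in NumPy, ignoring the paper's hardware-aware parallel scan and the surrounding block (convolution, gating, projections); shapes, names, the random per-step parameters (drawn at random here rather than computed from x), and the simplified discretization of B are our choices, not the exact ones in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_state = 32, 8
x = rng.normal(size=seq_len)                 # one input channel over time

A = -np.exp(rng.normal(size=d_state))        # fixed diagonal state matrix (negative for stability)
# "Selective" parameters: in Mamba these are computed from the input;
# here they are just per-step arrays for illustration.
B = rng.normal(size=(seq_len, d_state))
C = rng.normal(size=(seq_len, d_state))
delta = np.log1p(np.exp(rng.normal(size=seq_len)))   # softplus keeps delta_t > 0

h = np.zeros(d_state)
y = np.zeros(seq_len)
for t in range(seq_len):
    a_bar = np.exp(delta[t] * A)             # discretized state transition (zero-order hold)
    b_bar = delta[t] * B[t]                  # simplified (Euler) discretization of B
    h = a_bar * h + b_bar * x[t]             # selective SSM recurrence
    y[t] = C[t] @ h                          # output readout at step t

print(y[:4])
```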