Princeton COS 597A: Efficient Systems for Foundation Models (Fall 2025)
Schedule: Tue 10:40am-12pm, Friend Center 016
Instructor: Tri Dao (tridao.me)
Office hours: Thu 1:30-2:30pm, COS 420
Prerequisites: COS 324. Additional familiarity with machine learning systems will help but is not required.
Communication: Please make sure you’re on the course Slack workspace. The link to the Slack workspace is on Canvas.
Description
As models increase in size and training budget, they not only systematically improve in upstream quality but also exhibit novel emergent capabilities. This increase in scale raises proportionate difficulties for practitioners: foundation model training and inference lie at a unique interdisciplinary crossroads, combining open problems in algorithms, system design, and software engineering.
The goal of this course is to give an overview of emerging research questions and challenges associated with foundation model training and inference. We will focus on training and inference systems and algorithms for foundation models, whether aimed at scaling up or at reducing compute, time, memory, bandwidth, and energy requirements:
- Training and inference systems, either distributed at large scale or in resource-constrained scenarios;
- Algorithms for improved training and inference efficiency;
- Systems for foundation models, such as novel programming languages or compilers.
This course is primarily intended for PhD students studying topics related to machine learning systems. That said, the course is open to any student who has excelled in COS 324.
Structure and Grading
- 40% Participation in paper discussions
- 25% Paper presentation/lecture
- 35% Research project (report and presentation)
Paper Reading and Discussion
A major component of this course is reading and discussing research papers in depth. To ensure a lively and focused discussion, you should closely read each paper before the lecture in which it will be discussed, and come to class prepared with several points that will substantially contribute to the group discussion. Your participation grade will be determined by attendance and, more importantly, by substantive contributions to the paper discussions in class; as a rule of thumb, given the small class size, you should aim for at least two discussion contributions (deep questions, observations, etc.) per lecture.
Paper Presentation/Lecture
In each class, 2-3 students will present the scheduled paper and lead its discussion. Presentations should start with a (roughly) 20-25 minute overview of the paper; in many cases, especially for the first paper in a given topic, presenters are also responsible for providing background on the area (please reach out to the instructor for background pointers). This part of the presentation should be “conference style,” i.e., covering the domain and relevant background, the problem statement and challenges, the solution, results, and potential limitations and improvements. However, the presentation should go into more detail than a typical conference talk, particularly on the design of the proposed solution; for this reason, public conference slides for the paper can be used as an aid, but they will not suffice for the lecture. The remainder of the lecture will involve leading the discussion, both by fielding questions and by posing questions to keep the conversation going. Non-presenters are expected to actively participate and bring discussion points (including questions) of their own. Active participation leads to a lively discussion that benefits everyone.
Research Project
In addition to paper reading, this course includes a semester-long research project, carried out in pairs. The goal of the project is not necessarily to fully implement a research idea. Instead, students are encouraged to pick a problem that is new and exciting to them and to focus primarily on building (small-scale) prototypes and collecting measurements that motivate the problem and their solution. Implementation is thus a key aspect of the project, but students are encouraged to aim high and not feel restricted to topics or ideas that can be 100% implemented before the course concludes. The scope of acceptable topics is quite broad: anything related to improving training or inference for generative models (language, image, multi-modal, etc.) is fair game. Extensions of ongoing research projects are allowed if in scope; please see the instructor to discuss your ongoing project and how you would like to extend it for the course. You are strongly encouraged to begin thinking about project topics early in the semester by reviewing the reading list and topics and by discussing ideas with the instructor.
Course Schedule
Week 1 - Sep 2: Course introduction. Overview of foundation model training and inference.
Reading:
Week 2 - Sep 9: Scaling laws: why foundation models are large
Reading:
- Scaling Laws for Neural Language Models
- Training Compute-Optimal Large Language Models (Chinchilla scaling law)
- Language Models are Few-Shot Learners (GPT3)
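As a preview of the reading, the Chinchilla paper fits training loss with a parametric form roughly like

L(N, D) ≈ E + A / N^α + B / D^β

where N is the number of model parameters and D the number of training tokens; the constants E, A, B, α, β are fit empirically, and compute-optimal choices of N and D follow from minimizing this expression under a fixed compute budget. (This is a sketch of the functional form only, not the fitted values.)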
Week 3 - Sep 16: Hardware characteristics
Reading:
Week 4 - Sep 23: Distributed training: Tensor Parallel, Pipeline Parallel, Sequence Parallel
Reading:
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- Reducing Activation Recomputation in Large Transformer Models
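For intuition before the reading, here is a toy NumPy sketch of Megatron-style tensor parallelism for a two-layer MLP: the first weight matrix is split by columns and the second by rows, so each simulated "device" computes a partial output independently and a single all-reduce (here just a sum) recovers the exact result. Shapes, names, and the two-shard simulation are ours for illustration, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 8, 16                      # hidden size and MLP intermediate size
x = rng.normal(size=(4, d))       # a batch of 4 token activations
W1 = rng.normal(size=(d, h))      # first MLP weight (column-parallel)
W2 = rng.normal(size=(h, d))      # second MLP weight (row-parallel)

def relu(z):
    return np.maximum(z, 0)

# Reference: the full, single-device MLP forward pass.
y_ref = relu(x @ W1) @ W2

# Tensor parallelism over 2 simulated devices:
# device i holds a column shard of W1 and the matching row shard of W2.
W1_shards = np.split(W1, 2, axis=1)   # each (d, h/2)
W2_shards = np.split(W2, 2, axis=0)   # each (h/2, d)

# Each device computes a partial output with no communication between the two
# matmuls, because ReLU is applied elementwise to its own shard.
partials = [relu(x @ W1_i) @ W2_i for W1_i, W2_i in zip(W1_shards, W2_shards)]

# The all-reduce across devices is just a sum of the partial outputs here.
y_tp = sum(partials)

print(np.allclose(y_ref, y_tp))   # True
```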
Week 5 - Sep 30: Attention optimizations
Reading:
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- Ring Attention with Blockwise Transformers for Near-Infinite Context
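Before reading FlashAttention, it helps to see the online-softmax recurrence it builds on. The NumPy sketch below processes the keys/values in blocks while keeping a running row max and normalizer, and reproduces full softmax attention exactly; it ignores everything that makes the paper hard (SRAM tiling, the backward pass, GPU kernels), and the names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, block = 128, 32, 16
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))
v = rng.normal(size=(n, d))
scale = 1.0 / np.sqrt(d)

# Reference: standard attention that materializes the full n x n score matrix.
s = (q @ k.T) * scale
p = np.exp(s - s.max(axis=-1, keepdims=True))
o_ref = (p / p.sum(axis=-1, keepdims=True)) @ v

# Blockwise attention with the online-softmax recurrence: keep a running
# row max m, running normalizer l, and unnormalized output accumulator acc.
m = np.full((n, 1), -np.inf)
l = np.zeros((n, 1))
acc = np.zeros((n, d))
for start in range(0, n, block):
    kb = k[start:start + block]
    vb = v[start:start + block]
    sb = (q @ kb.T) * scale                        # scores against this K/V block
    m_new = np.maximum(m, sb.max(axis=-1, keepdims=True))
    correction = np.exp(m - m_new)                 # rescale old stats to the new max
    pb = np.exp(sb - m_new)
    l = l * correction + pb.sum(axis=-1, keepdims=True)
    acc = acc * correction + pb @ vb
    m = m_new
o_tiled = acc / l

print(np.allclose(o_ref, o_tiled))   # True
```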
Week 6 - Oct 7: Mixture of experts
Reading:
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
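As a warm-up for the MoE readings, the sketch below shows the basic top-k routing pattern these papers share: a router scores the experts for each token, the token is sent to its top-k experts, and the expert outputs are combined with renormalized router weights. This is a toy per-token loop; real systems batch tokens per expert, add load-balancing losses, and run experts across devices. All names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, top_k, d, n_tokens = 8, 2, 16, 4

x = rng.normal(size=(n_tokens, d))
router_w = rng.normal(size=(d, num_experts))
experts = [rng.normal(size=(d, d)) for _ in range(num_experts)]  # toy "experts": linear maps

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

probs = softmax(x @ router_w)                   # (n_tokens, num_experts) routing probabilities

y = np.zeros_like(x)
for t in range(n_tokens):
    top = np.argsort(probs[t])[-top_k:]         # indices of the top-k experts for this token
    gate = probs[t, top] / probs[t, top].sum()  # renormalize over the chosen experts
    for g, e in zip(gate, top):
        y[t] += g * (x[t] @ experts[e])         # weighted sum of the chosen experts' outputs

print(y.shape)   # (4, 16)
```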
Oct 14: No class
Week 7 - Oct 21: Inference: quantization and sparsity
Reading:
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- SqueezeLLM: Dense-and-Sparse Quantization
- AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration
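All three papers this week improve on the simplest baseline, per-group round-to-nearest weight quantization, sketched below in NumPy for a 4-bit symmetric scheme. GPTQ, SqueezeLLM, and AWQ each refine this baseline differently (second-order weight updates, sensitivity-aware dense-and-sparse splitting, activation-aware scaling); the sketch shows only the baseline, and the names and sizes are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 256))       # a toy weight matrix
group_size = 64                      # quantization group size along the input dimension
qmax = 7                             # symmetric 4-bit: codes will land in [-7, 7]

w_groups = w.reshape(w.shape[0], -1, group_size)              # (rows, groups, group_size)
scales = np.abs(w_groups).max(axis=-1, keepdims=True) / qmax  # one scale per group
codes = np.clip(np.round(w_groups / scales), -qmax, qmax)     # integer codes per group
w_deq = (codes * scales).reshape(w.shape)                     # dequantized weights

print(f"mean absolute quantization error: {np.abs(w - w_deq).mean():.4f}")
```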
Week 8 - Oct 28: Inference: speculative decoding, architectural optimization
Reading:
- Fast Inference from Transformers via Speculative Decoding
- Fast Transformer Decoding: One Write-Head is All You Need
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
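To preview speculative decoding before the reading: a small draft model proposes several tokens and the large target model checks them together, keeping the longest verified prefix so each target pass can emit more than one token. The sketch below uses a simplified greedy acceptance rule (accept while the draft token matches the target's argmax) rather than the paper's rejection-sampling scheme, and the two "models" are stand-in functions we made up.

```python
import numpy as np

VOCAB = 100

def fake_logits(ctx, seed):
    """Stand-in for a model forward pass: deterministic logits from the context."""
    r = np.random.default_rng(hash((tuple(ctx), seed)) % (2**32))
    return r.normal(size=VOCAB)

def greedy(ctx, seed):
    return int(np.argmax(fake_logits(ctx, seed)))

def speculative_step(ctx, num_draft=4, draft_seed=1, target_seed=2):
    # 1) The draft model proposes num_draft tokens autoregressively (cheap).
    draft, d_ctx = [], list(ctx)
    for _ in range(num_draft):
        t = greedy(d_ctx, draft_seed)
        draft.append(t)
        d_ctx.append(t)

    # 2) The target model verifies the draft (in practice, one batched pass):
    #    accept tokens while they match, then take the target's token at the
    #    first mismatch, so the step always advances by at least one token.
    accepted, t_ctx = [], list(ctx)
    for t in draft:
        target_tok = greedy(t_ctx, target_seed)
        if target_tok == t:
            accepted.append(t)
            t_ctx.append(t)
        else:
            accepted.append(target_tok)
            break
    return list(ctx) + accepted

print(speculative_step([1, 2, 3]))
```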
Week 9 - Nov 4: Inference serving: PagedAttention, SGLang, chunked prefill
Reading:
- Efficient Memory Management for Large Language Model Serving with PagedAttention
- SGLang: Efficient Execution of Structured Language Model Programs
- Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
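The core data structure behind PagedAttention is simple enough to sketch before the reading: the KV cache is carved into fixed-size blocks, and each sequence keeps a block table mapping its logical token positions to physical blocks, so memory is allocated on demand and freed blocks are reused across sequences. Below is a toy allocator in plain Python; the class and field names are ours, not vLLM's.

```python
BLOCK_SIZE = 16          # tokens per KV-cache block

class PagedKVCache:
    """Toy block allocator mapping each sequence's logical blocks to physical blocks."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}            # seq_id -> list of physical block ids
        self.seq_lens = {}                # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:      # current block is full (or this is the first token)
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def slot(self, seq_id, pos):
        """Physical (block, offset) holding the KV for token `pos` of `seq_id`."""
        table = self.block_tables[seq_id]
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free_sequence(self, seq_id):
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(20):                       # a 20-token sequence needs two blocks
    cache.append_token("seq0")
print(cache.block_tables["seq0"], cache.slot("seq0", 17))
```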
Week 10 - Nov 11: Software framework, compilers
Reading:
- Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations
- PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation
- HybridFlow: A Flexible and Efficient RLHF Framework
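For the PyTorch 2 reading, the user-facing entry point is torch.compile, which captures a Python function into a graph and hands it to a compiler backend. A minimal usage example (requires PyTorch 2.x; the toy function is ours):

```python
import torch

def f(x):
    # An arbitrary toy function; the compiler can fuse these elementwise ops.
    return torch.sin(x) ** 2 + torch.cos(x) ** 2

compiled_f = torch.compile(f)        # graph capture + compilation happen on first call
x = torch.randn(1000)
print(torch.allclose(f(x), compiled_f(x)))   # True (up to default tolerances)
```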
Week 11 - Nov 18: New architectures
Reading:
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- Titans: Learning to Memorize at Test Time
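For the Mamba reading, the heart of the model is a selective state-space recurrence whose B, C, and step size Δ depend on the input. The sketch below runs a diagonal, single-channel version of that recurrence sequentially in NumPy, ignoring the paper's hardware-aware parallel scan and the surrounding block (convolution, gating, projections); shapes, names, the random per-step parameters (drawn at random here rather than computed from x), and the simplified discretization of B are our choices, not the exact ones in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_state = 32, 8
x = rng.normal(size=seq_len)                 # one input channel over time

A = -np.exp(rng.normal(size=d_state))        # fixed diagonal state matrix (negative for stability)
# "Selective" parameters: in Mamba these are computed from the input;
# here they are just per-step arrays for illustration.
B = rng.normal(size=(seq_len, d_state))
C = rng.normal(size=(seq_len, d_state))
delta = np.log1p(np.exp(rng.normal(size=seq_len)))   # softplus keeps delta_t > 0

h = np.zeros(d_state)
y = np.zeros(seq_len)
for t in range(seq_len):
    a_bar = np.exp(delta[t] * A)             # discretized state transition (zero-order hold)
    b_bar = delta[t] * B[t]                  # simplified (Euler) discretization of B
    h = a_bar * h + b_bar * x[t]             # selective SSM recurrence
    y[t] = C[t] @ h                          # output readout at step t

print(y[:4])
```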