Schedule
Given the pace of innovation in this area, the following schedule is subject to change.
Color Legend: Presenter, Reviewer, Scriber
Introduction
- Jan 8
- Course Introduction
- Anand
- How to Read a Paper
- How to Give a Bad Talk
- Writing Reviews for Systems Conferences
- Paper Presentation Preferences: fill out the form here
- Jan 10
- Overview of Challenges
- Anand
- Challenges and Applications of Large Language Models
- Understanding LLMs: A Comprehensive Overview from Training to Inference
- Jan 12
- Paper Presentation Preferences Due
- Jan 15
- No class (Martin Luther King, Jr. Day)
Basics of LLMs
- Jan 17
- Transformers
- Attention Is All You Need
- Hugging Face Transformers Course
- Improving Language Understanding by Generative Pre-Training (GPT) (Required)
- Amitrajit Rohan Huayi
- Let's build GPT: from scratch, in code, spelled out
- The Illustrated Transformer (Required)
- Ganesh Aniruddha Aayush
- The Transformer Family Version 2.0
- Jan 22
- Diffusion
- Denoising Diffusion Probabilistic Models
- High-Resolution Image Synthesis with Latent Diffusion Models (Required)
- Sashankh Azeez Vitaly
- Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers
- The Illustrated Stable Diffusion (Required)
- Abhimanyu Arpan Zeyu
- Jan 24
- Multimodality
- Flamingo: a Visual Language Model for Few-Shot Learning
- Multimodality and Large Multimodal Models (Required)
- Alexander Rishi Jingli
- Visual Instruction Tuning (Required)
- Devashish Apoorva Mohit
Project
- Jan 28
- Project Group Formation Due
- Jan 29
- Project Ideas
- Anand
- How to Write a Great Research Paper
- Hints and Principles for Computer System Design
Pre-training
- Jan 31
- Training Approaches for Large Models
- Fully Sharded Data Parallel: faster AI training with fewer GPUs (Required)
- Mingzheng Mithilesh Shivashankar
- PaLM: Scaling Language Modeling with Pathways
- Pathways: Asynchronous Distributed Dataflow for ML (Required)
- Rajveer Sera Ziyuan
- PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
- Feb 5
- Automating Parallelization Techniques for Large Model Training
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (Required)
- Aniruddha Aayush Vima
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (Required)
- Aditya Shubham Amitrajit
- GSPMD: General and Scalable Parallelization for ML Computation Graphs
- Feb 7
- Extreme Scale Training
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models (Required)
- Rohan Kartik Abhimanyu
- ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning (Required)
- Divya Ganesh Chinmay
- Feb 9
- Project Proposal Due
Fine-tuning
- Feb 12
- Making Large Models Useful with Alignment
- Aligning Large Language Models with Human: A Survey
- LoRA: Low-Rank Adaptation of Large Language Models (Required)
- Mohit Alexander Sashankh
- The Power of Scale for Parameter-Efficient Prompt Tuning
- Training Language Models to Follow Instructions with Human Feedback (Required)
- Apoorva Aditya Zachary
- Feb 14
- Is Alignment Really Useful?
- Finetuned Language Models are Zero-Shot Learners (Required)
- Arpan Rajveer Devashish
- Large Language Models are Zero-Shot Reasoners
- LIMA: Less Is More for Alignment (Required)
- Jingli Huayi Mithilesh
Retrieval & Augmentation
- Feb 19
- Information Retrieval
- ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction
- Improving Language Models by Retrieving from Trillions of Tokens (Required)
- Shivashankar Vitaly Mingzheng
- REALM: Retrieval-Augmented Language Model Pre-Training (Required)
- Aayush Abhimanyu Sera
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study
- Feb 21
- Accommodating Longer Contexts
- Extending Context Window of Large Language Models via Positional Interpolation
- Lost in the Middle: How Language Models Use Long Contexts (Required)
- Rishi Divya Azeez
- MemGPT: Towards LLMs as Operating Systems (Required)
- Ziyuan Amitrajit Aniruddha
Inference
- Feb 26
- State-of-the-Art Inference Approaches
- DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
- Efficient Memory Management for Large Language Model Serving with PagedAttention (Required)
- Zachary Sashankh Shubham
- Orca: A Distributed Serving System for Transformer-Based Generative Models (Required)
- Kartik Chinmay Rohan
- Feb 28
- Removing Inefficiencies
- DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference (Required)
- Azeez Devashish Ganesh
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills (Required)
- Aditya Vima Apoorva
- Mar 4
- Techniques to Accelerate Decoding
- Accelerating Large Language Model Decoding with Speculative Sampling
- Breaking the Sequential Dependency of LLM Inference Using Lookahead Decoding (Required)
- Huayi Shivashankar Alexander
- Fast Inference from Transformers via Speculative Decoding (Required)
- Vitaly Ziyuan Rajveer
- Mar 6
- Optimizations for Large Model Inference
- Efficiently Scaling Transformer Inference
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Required)
- Mithilesh Mingzheng Arpan
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (Required)
- Sera Mohit Divya
- Full Stack Optimization of Transformer Inference: a Survey
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory
Project
- Mar 11
- Mid-Semester Project presentations
- Mar 13
- Mid-Semester Project presentations
- Mar 18
- No class (Spring break)
- Mar 20
- No class (Spring break)
Special Topics
- Mar 25
- Mixture-of-Experts
- DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
- Fast Inference of Mixture-of-Experts Language Models with Offloading (Required)
- Vima Jingli Aditya
- Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts (Required)
- Shubham Mithilesh Kartik
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- Tutel: Adaptive Mixture-of-Experts at Scale
- Mar 27
- Model Compression
- Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (Required)
- Chinmay Zachary Rishi
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Required)
- Aniruddha Mohit Abhimanyu
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- QLoRA: Efficient Finetuning of Quantized LLMs
- SqueezeLLM: Dense-and-Sparse Quantization
- Apr 1
- Dynamism in Large Models
- Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
- Confident Adaptive Language Modeling (Required)
- Arpan Sera Ziyuan
- Optimizing Dynamic Neural Networks with Brainstorm (Required)
- Vima Zachary Huayi
- Apr 3
- Legal & Ethical Considerations
- Ethical and social risks of harm from Language Models (Required)
- Rohan Kartik Aayush
- Foundation Models and Fair Use
- On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜 (Required)
- Rajveer Shivashankar Mingzheng
- Apr 8
- Class Canceled (Solar Eclipse)
- Apr 10
- Security Implications
- Extracting Training Data from Diffusion Models
- Extracting Training Data from Large Language Models (Required)
- Sashankh Apoorva Azeez
- Identifying and Mitigating the Security Risks of Generative AI (Required)
- Alexander Rishi Chinmay
Conclusion
- Apr 10
- Course Wrap-up
- Anand
- Apr 15
- Final Project presentations
- Apr 17
- Final Project presentations
- Apr 22
- Final Project presentations
- Apr 26
- Final Project Report + Code Due