How to Efficiently Serve an LLM

Large Language Models (LLMs) have become crucial due to their performance, but their size poses significant serving challenges. This video covers strategies to optimize LLM serving systems for better efficiency.

Key Steps in LLM Inference:

1. Request Handling: Users send requests via HTTPS/gRPC, and the LLM server schedules them based on Quality of Experience (QoE) metrics: TTFT (Time to First Token) and TDS (Token Delivery Speed).
2. Inference Phases:
   - Prefill Phase: Processes all input tokens in parallel to build the KV cache, exploiting the GPU's parallel processing.
   - Decode Phase: Generates output tokens sequentially, one at a time, and is the main target for optimization.

Optimization Techniques:

1. Batching: Combines multiple requests to maximize resource use.
2. Model Quantization: Reduces model weight precision to free up GPU memory.
3. Paged Attention: Manages KV-cache memory in fixed-size blocks to avoid fragmentation.
4. Prefill Chunking: Interleaves chunks of a long prefill with decode steps from other requests.
5. Prefill/Decode Disaggregation: Runs the two phases on separate workers and transfers the KV cache between them.
6. KV Cache Compression: Shrinks the KV cache to speed up network transfer for large context lengths.
7. Speculative Decoding: Uses a smaller draft model for faster token generation.
8. Radix Attention: Reuses the KV cache across requests with shared prefixes instead of recomputing it.
9. Early Rejection: Predicts infeasible requests early to save resources.

Minimal Python sketches of these ideas follow the chapter list below.

For a detailed dive into each optimization, check my blog post: https://ahmedtremo.com/posts/How-to-E...

00:00 Introduction
00:49 Prefill/Decode
02:00 Pricing
03:05 Continuous Batching
03:41 Quantization
04:37 Prefill Chunking
05:56 Disaggregated Arch
06:30 Radix Attention
07:54 Early Rejection
08:33 KV Compression
09:32 PagedAttention/vAttention
11:00 QoE Scheduling
11:40 Speculative Decoding
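QoE metrics. The two scheduling signals named in the video, TTFT and TDS, can be measured from a streamed response. A minimal sketch (the function and the fake stream are illustrative, not part of any real serving API):

```python
import time

def measure_qoe(token_stream):
    # TTFT: seconds until the first token arrives.
    # TDS: tokens per second delivered after the first token.
    start = time.monotonic()
    first = None
    count = 0
    for _ in token_stream:
        if first is None:
            first = time.monotonic()
        count += 1
    end = time.monotonic()
    ttft = (first - start) if first is not None else float("inf")
    window = (end - first) if first is not None else 0.0
    tds = (count - 1) / window if window > 0 else 0.0
    return ttft, tds

# Fake generator standing in for a streaming LLM response.
def fake_stream():
    for tok in ["Hello", ",", " world"]:
        time.sleep(0.01)
        yield tok

print(measure_qoe(fake_stream()))
```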
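Continuous batching. The key idea is iteration-level scheduling: after every decode step, finished requests leave the batch and waiting requests immediately take their slots, so short requests never wait for the longest one. A toy sketch, assuming a hypothetical step_fn interface (real schedulers such as vLLM's are far richer):

```python
from collections import deque

def continuous_batching_loop(pending, step_fn, max_batch=8):
    # step_fn(batch) runs ONE decode step for every request in `batch`
    # and returns the set of requests that just finished.
    active = []
    while pending or active:
        while pending and len(active) < max_batch:   # refill free slots
            active.append(pending.popleft())
        done = step_fn(active)
        active = [r for r in active if r not in done]

# Toy usage: each request just counts down the tokens it has left.
class Req:
    def __init__(self, name, tokens_left):
        self.name, self.left = name, tokens_left

def toy_step(batch):
    done = set()
    for r in batch:
        r.left -= 1
        if r.left == 0:
            print(r.name, "finished")
            done.add(r)
    return done

continuous_batching_loop(deque([Req("a", 2), Req("b", 5), Req("c", 1)]),
                         toy_step, max_batch=2)
```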
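Quantization. The memory win comes from storing each weight in fewer bytes. A minimal sketch of symmetric per-tensor int8 quantization; real schemes (GPTQ, AWQ, per-channel scales) are considerably more sophisticated:

```python
import numpy as np

def quantize_int8(w):
    # Map fp32 weights to int8: 1 byte per value instead of 4,
    # freeing GPU memory for the KV cache.
    scale = float(np.abs(w).max()) / 127.0 or 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print("bytes:", w.nbytes, "->", q.nbytes)
print("max abs error:", np.abs(w - dequantize_int8(q, s)).max())
```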
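Paged attention. The fragmentation fix is a page table per sequence: the KV cache lives in fixed-size blocks, and each sequence holds a list of block ids instead of one contiguous region, so blocks are shared and freed exactly. A toy allocator in that spirit (vLLM's real allocator manages GPU tensors, not Python lists):

```python
class BlockAllocator:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.table = {}    # seq_id -> list of block ids (the "page table")
        self.length = {}   # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        n = self.length.get(seq_id, 0)
        if n % self.block_size == 0:                 # current block is full
            if not self.free:
                raise MemoryError("out of KV blocks: preempt or reject")
            self.table.setdefault(seq_id, []).append(self.free.pop())
        self.length[seq_id] = n + 1

    def release(self, seq_id):
        # Return every block at once; no fragmentation is left behind.
        self.free.extend(self.table.pop(seq_id, []))
        self.length.pop(seq_id, None)

alloc = BlockAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    alloc.append_token("req-1")                      # 3 tokens -> 2 blocks
print(alloc.table["req-1"], alloc.free)
alloc.release("req-1")
print(alloc.free)
```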
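Prefill chunking. A long prompt is split into fixed-size chunks, and each chunk is batched together with one decode step for the requests already running, so a huge prefill never stalls everyone else's token delivery. A sketch, with run_batch as a hypothetical stand-in for the engine's forward pass:

```python
def chunked_prefill(prompt_tokens, decode_batch, run_batch, chunk=256):
    # Each iteration forms one mixed batch: a slice of the new prompt
    # plus one decode token for every ongoing request.
    for i in range(0, len(prompt_tokens), chunk):
        run_batch(prefill=prompt_tokens[i:i + chunk], decode=decode_batch)

# Toy usage: print what each mixed batch contains.
chunked_prefill(
    prompt_tokens=list(range(10)),
    decode_batch=["req-A", "req-B"],
    run_batch=lambda prefill, decode: print(len(prefill), "prefill tokens +", decode),
    chunk=4,
)
```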
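KV cache compression. In a disaggregated setup the KV cache built by the prefill worker has to cross the network to the decode worker, so its size directly costs latency. A minimal sketch using per-tensor int8 quantization (real systems use finer-grained scales and fancier codecs; the shapes here are toy values):

```python
import numpy as np

def compress_kv(kv):
    # 1 byte per value instead of 2 (fp16): halves the network payload.
    scale = float(np.abs(kv).max()) / 127.0 or 1.0
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def decompress_kv(q, scale):
    return q.astype(np.float16) * np.float16(scale)

kv = np.random.randn(2, 32, 128).astype(np.float16)  # toy (layers, tokens, dim)
q, s = compress_kv(kv)
print(kv.nbytes, "bytes ->", q.nbytes, "bytes")
```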
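Speculative decoding. A cheap draft model proposes k tokens; the expensive target model verifies them and keeps the longest agreeing prefix, so several tokens can land per target-model step. A heavily simplified greedy version (draft_next and target_next are hypothetical greedy next-token functions; production code verifies probabilities and batches the target's checks into one forward pass):

```python
def speculative_decode(draft_next, target_next, ctx, k=4, max_new=12):
    out = list(ctx)
    while len(out) - len(ctx) < max_new:
        proposal = []
        for _ in range(k):                       # cheap draft rollout
            proposal.append(draft_next(out + proposal))
        for t in proposal:                       # target verifies the prefix
            if target_next(out) != t:
                out.append(target_next(out))     # first disagreement: take target's token
                break
            out.append(t)
        else:
            out.append(target_next(out))         # all k accepted: one bonus token
    return out[len(ctx):len(ctx) + max_new]

# Toy models over integer "tokens": target counts up, draft is often right.
target = lambda seq: seq[-1] + 1
draft = lambda seq: seq[-1] + (1 if seq[-1] % 4 else 2)
print(speculative_decode(draft, target, [0]))
```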
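Radix attention. Requests that share a prompt prefix (system prompt, few-shot examples, chat history) can reuse the KV cache for the shared tokens instead of recomputing prefill. A toy prefix cache in that spirit; SGLang's real radix tree stores references to KV blocks, here we only find how many tokens are reusable:

```python
class RadixCache:
    def __init__(self):
        self.root = {}                      # token -> child node

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def longest_prefix(self, tokens):
        # Number of leading tokens whose KV cache is already present.
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, n = node[t], n + 1
        return n

cache = RadixCache()
cache.insert(["sys", "you", "are", "helpful", "hi"])
print(cache.longest_prefix(["sys", "you", "are", "helpful", "bye"]))  # -> 4
```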
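Early rejection. The point is to refuse work the server cannot finish before prefill burns GPU time, rather than preempting a request mid-generation. A sketch of such an admission check; the thresholds and request fields are hypothetical:

```python
def admit(request, free_kv_blocks, block_size=16, max_model_len=8192):
    # Estimate the worst-case KV-cache footprint up front.
    worst_case = len(request["prompt_tokens"]) + request.get("max_new_tokens", 512)
    if worst_case > max_model_len:
        return False, "prompt + generation exceeds context window"
    blocks_needed = -(-worst_case // block_size)          # ceiling division
    if blocks_needed > free_kv_blocks:
        return False, "not enough free KV-cache blocks"
    return True, "admitted"

print(admit({"prompt_tokens": list(range(100)), "max_new_tokens": 50},
            free_kv_blocks=8))
```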