---
id: 2026-03-19-llm-inference-benchmark
title: "LLM Inference Speed Benchmark: vLLM vs TGI vs Triton"
slug: llm-inference-benchmark
status: topic
content_type: article
channels:
  - wechat
  - x
language: en
source_urls: []
assets:
  - 05-assets/llm-inference-benchmark/llm-inference-benchmark_fig.png
  - 05-assets/llm-inference-benchmark/llm-inference-benchmark_fig.pdf
cover_image: ""
template: article
owner: content-forge
created_at: 2026-03-19T00:00:00+08:00
updated_at: 2026-03-19T00:00:00+08:00
---

# LLM Inference Speed Benchmark: vLLM vs TGI vs Triton

## Background

We benchmarked three popular LLM inference engines (vLLM, TGI, and Triton) serving Llama-3-70B on four NVIDIA A100 80GB GPUs.

## Experimental Setup

- Model: Llama-3-70B (FP16)
- Hardware: 4x NVIDIA A100 80GB, NVLink
- Input length: 512 tokens
- Output length: 256 tokens
- Batch sizes: 1, 4, 8, 16, 32, 64
- Metrics: tokens/second (throughput), p99 ms/token (latency)

## Results

### Throughput (tokens/sec)

| Batch Size | vLLM | TGI  | Triton |
|------------|------|------|--------|
| 1          | 42   | 38   | 35     |
| 4          | 156  | 142  | 128    |
| 8          | 298  | 265  | 241    |
| 16         | 512  | 478  | 420    |
| 32         | 890  | 810  | 695    |
| 64         | 1240 | 1100 | 920    |

### Latency p99 (ms/token)

| Batch Size | vLLM | TGI | Triton |
|------------|------|-----|--------|
| 1          | 24   | 26  | 29     |
| 4          | 26   | 29  | 31     |
| 8          | 27   | 30  | 33     |
| 16         | 31   | 33  | 38     |
| 32         | 36   | 39  | 46     |
| 64         | 52   | 58  | 70     |

## Architecture

The benchmark pipeline follows this flow:

Client → Load Balancer → Inference Engine (vLLM/TGI/Triton) → GPU Cluster → Response

Each engine uses a different batching strategy:

- vLLM: Continuous batching with PagedAttention
- TGI: Dynamic batching with FlashAttention-2
- Triton: Static batching with TensorRT-LLM backend

## Key Findings

1. vLLM achieves 20-35% higher throughput than Triton across all batch sizes
2. The gap widens at larger batch sizes (20% at bs=1 vs 35% at bs=64)
3. vLLM p99 latency is consistently 16-26% lower than Triton
4. TGI sits between vLLM and Triton on both metrics
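
The relative gaps between vLLM and Triton can be recomputed directly from the two result tables. The sketch below is a minimal illustration, not part of the benchmark harness: the numbers are copied from the tables, and the helper names `relative_gain` and `relative_reduction` are hypothetical, introduced only for this example.

```python
# Values copied from the throughput and p99 latency tables in this article.
BATCH_SIZES = [1, 4, 8, 16, 32, 64]

THROUGHPUT = {  # tokens/sec
    "vLLM": [42, 156, 298, 512, 890, 1240],
    "TGI": [38, 142, 265, 478, 810, 1100],
    "Triton": [35, 128, 241, 420, 695, 920],
}

LATENCY_P99 = {  # ms/token
    "vLLM": [24, 26, 27, 31, 36, 52],
    "TGI": [26, 29, 30, 33, 39, 58],
    "Triton": [29, 31, 33, 38, 46, 70],
}


def relative_gain(values, baseline):
    """Percent by which each value exceeds the matching baseline value."""
    return [100.0 * (v - b) / b for v, b in zip(values, baseline)]


def relative_reduction(values, baseline):
    """Percent by which each value falls below the matching baseline value."""
    return [100.0 * (b - v) / b for v, b in zip(values, baseline)]


# vLLM measured against Triton as the baseline, one entry per batch size.
throughput_gain = relative_gain(THROUGHPUT["vLLM"], THROUGHPUT["Triton"])
latency_cut = relative_reduction(LATENCY_P99["vLLM"], LATENCY_P99["Triton"])

for bs, gain, cut in zip(BATCH_SIZES, throughput_gain, latency_cut):
    print(f"bs={bs:>2}: throughput +{gain:.1f}%, p99 latency -{cut:.1f}%")
```

Swapping `"Triton"` for `"TGI"` as the baseline gives the corresponding vLLM vs TGI comparison from the same tables.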