| id | title | slug | status | content_type | channels | language | source_urls | assets | cover_image | template | owner | created_at | updated_at |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2026-03-19-llm-inference-benchmark | LLM Inference Speed Benchmark: vLLM vs TGI vs Triton | llm-inference-benchmark | topic | article | | en | | | | article | content-forge | 2026-03-19T00:00:00+08:00 | 2026-03-19T00:00:00+08:00 |
LLM Inference Speed Benchmark: vLLM vs TGI vs Triton
Background
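We benchmarked three popular LLM inference engines (vLLM, Hugging Face Text Generation Inference (TGI), and NVIDIA Triton Inference Server) serving Llama-3-70B on four NVIDIA A100 80GB GPUs.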
Experimental Setup
- Model: Llama-3-70B (FP16)
- Hardware: 4x NVIDIA A100 80GB, NVLink
- Input length: 512 tokens
- Output length: 256 tokens
- Batch sizes: 1, 4, 8, 16, 32, 64
- Metrics: throughput (tokens/second) and p99 latency (ms/token)
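The setup above can be captured in a small configuration object, which also makes the two metric definitions explicit. This is a minimal sketch, not the harness used for the benchmark: `BenchmarkConfig` and its methods are hypothetical names, and it assumes that throughput counts generated (output) tokens only, which the setup does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkConfig:
    """Hypothetical container for the setup listed above."""
    model: str = "Llama-3-70B"
    precision: str = "FP16"
    num_gpus: int = 4
    input_tokens: int = 512
    output_tokens: int = 256
    batch_sizes: tuple = (1, 4, 8, 16, 32, 64)

    def generated_tokens(self, batch_size: int) -> int:
        """Output tokens produced by one full batch."""
        return batch_size * self.output_tokens

    def throughput(self, batch_size: int, elapsed_s: float) -> float:
        """Tokens/second, assuming only generated tokens are counted."""
        return self.generated_tokens(batch_size) / elapsed_s

    def ms_per_token(self, elapsed_s: float) -> float:
        """Per-sequence latency in milliseconds per generated token."""
        return elapsed_s * 1000.0 / self.output_tokens


config = BenchmarkConfig()
for bs in config.batch_sizes:
    print(f"batch={bs:>2}  generated tokens per batch={config.generated_tokens(bs)}")
```

Timing a batch with a wall clock and passing the elapsed seconds to `throughput` and `ms_per_token` would yield numbers in the same units as the tables in the Results section.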
Results
Throughput (tokens/sec)
| Batch Size | vLLM | TGI | Triton |
|---|---|---|---|
| 1 | 42 | 38 | 35 |
| 4 | 156 | 142 | 128 |
| 8 | 298 | 265 | 241 |
| 16 | 512 | 478 | 420 |
| 32 | 890 | 810 | 695 |
| 64 | 1240 | 1100 | 920 |
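One way to read the throughput table is as scaling efficiency: the measured throughput at a batch size divided by what perfectly linear scaling from batch size 1 would give. The sketch below copies the table values; `scaling_efficiency` is a hypothetical helper introduced for illustration.

```python
# Throughput in tokens/sec, copied from the table above.
THROUGHPUT = {
    "vLLM":   {1: 42, 4: 156, 8: 298, 16: 512, 32: 890, 64: 1240},
    "TGI":    {1: 38, 4: 142, 8: 265, 16: 478, 32: 810, 64: 1100},
    "Triton": {1: 35, 4: 128, 8: 241, 16: 420, 32: 695, 64: 920},
}


def scaling_efficiency(engine: str, batch_size: int) -> float:
    """Measured throughput relative to linear scaling from batch size 1."""
    rows = THROUGHPUT[engine]
    return rows[batch_size] / (batch_size * rows[1])


for engine in THROUGHPUT:
    cells = ", ".join(
        f"bs={bs}: {scaling_efficiency(engine, bs):.2f}" for bs in THROUGHPUT[engine]
    )
    print(f"{engine:<7} {cells}")
```

A value of 1.00 means throughput grew in proportion to batch size; lower values show how much of the ideal gain is lost as the batch grows.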
Latency p99 (ms/token)
| Batch Size | vLLM | TGI | Triton |
|---|---|---|---|
| 1 | 24 | 26 | 29 |
| 4 | 26 | 29 | 31 |
| 8 | 27 | 30 | 33 |
| 16 | 31 | 33 | 38 |
| 32 | 36 | 39 | 46 |
| 64 | 52 | 58 | 70 |
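The latency table can be summarized by how much p99 latency grows between the smallest and largest batch size. This is a small sketch over the table values; `latency_growth` is a hypothetical helper name.

```python
# p99 latency in ms/token, copied from the table above.
LATENCY_P99_MS = {
    "vLLM":   {1: 24, 4: 26, 8: 27, 16: 31, 32: 36, 64: 52},
    "TGI":    {1: 26, 4: 29, 8: 30, 16: 33, 32: 39, 64: 58},
    "Triton": {1: 29, 4: 31, 8: 33, 16: 38, 32: 46, 64: 70},
}


def latency_growth(engine: str, small: int = 1, large: int = 64) -> float:
    """Ratio of p99 latency at the large batch size to the small one."""
    rows = LATENCY_P99_MS[engine]
    return rows[large] / rows[small]


for engine in LATENCY_P99_MS:
    print(f"{engine:<7} p99 grows {latency_growth(engine):.2f}x from bs=1 to bs=64")
```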
Architecture
The benchmark pipeline follows this flow:
Client → Load Balancer → Inference Engine (vLLM/TGI/Triton) → GPU Cluster → Response
Each engine uses different batching strategies:
- vLLM: Continuous batching with PagedAttention
- TGI: Dynamic batching with FlashAttention-2
- Triton: Static batching with TensorRT-LLM backend
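The difference between static and continuous batching can be illustrated with a toy scheduler that counts decoding steps. This is a simplified sketch, not the scheduler of any of the three engines: it ignores prefill, KV-cache memory, and attention kernels, and the function names and request lengths are invented for illustration. It uses variable output lengths (unlike the fixed 256-token outputs in the setup) because that is where the two strategies diverge.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(lengths), slots):
        steps += max(lengths[start:start + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """Freed slots are refilled from the queue at every decoding step."""
    queue = deque(lengths)
    active = []  # remaining tokens for each running sequence
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Every active sequence emits one token; finished ones leave the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical workload: output lengths in tokens, 4 batch slots.
lengths = [256, 32, 32, 32, 256, 32, 32, 32]
print("static:    ", static_batching_steps(lengths, slots=4))
print("continuous:", continuous_batching_steps(lengths, slots=4))
```

In this toy model the short requests in a static batch leave their slots idle until the 256-token request finishes, while the continuous scheduler reuses those slots immediately. When every request has the same output length and all arrive at once, both functions return the same step count.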
Key Findings
- vLLM achieves 20-35% higher throughput than Triton across all batch sizes
- The gap widens at larger batch sizes (20% at bs=1 vs 35% at bs=64)
- vLLM p99 latency is consistently 16-26% lower than Triton
- TGI sits between vLLM and Triton on both metrics
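The relative gaps between engines can be recomputed directly from the two results tables. The sketch below copies the table values; `pct_higher`, `pct_lower`, and the variable names are hypothetical helpers introduced for illustration.

```python
BATCH_SIZES = [1, 4, 8, 16, 32, 64]

# Values copied from the Results tables.
THROUGHPUT_TOK_S = {
    "vLLM":   [42, 156, 298, 512, 890, 1240],
    "TGI":    [38, 142, 265, 478, 810, 1100],
    "Triton": [35, 128, 241, 420, 695, 920],
}
LATENCY_P99_MS = {
    "vLLM":   [24, 26, 27, 31, 36, 52],
    "TGI":    [26, 29, 30, 33, 39, 58],
    "Triton": [29, 31, 33, 38, 46, 70],
}


def pct_higher(value: float, baseline: float) -> float:
    """Percent by which value exceeds baseline."""
    return (value - baseline) / baseline * 100


def pct_lower(value: float, baseline: float) -> float:
    """Percent by which value falls below baseline."""
    return (baseline - value) / baseline * 100


throughput_gain = [
    pct_higher(v, t)
    for v, t in zip(THROUGHPUT_TOK_S["vLLM"], THROUGHPUT_TOK_S["Triton"])
]
latency_reduction = [
    pct_lower(v, t)
    for v, t in zip(LATENCY_P99_MS["vLLM"], LATENCY_P99_MS["Triton"])
]

for bs, gain, cut in zip(BATCH_SIZES, throughput_gain, latency_reduction):
    print(f"bs={bs:>2}  throughput +{gain:.1f}%  p99 latency -{cut:.1f}%")
```

Swapping `"Triton"` for `"TGI"` in the two comprehensions gives the corresponding vLLM vs TGI comparison.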