---
id: 2026-03-19-llm-inference-benchmark
title: "LLM Inference Speed Benchmark: vLLM vs TGI vs Triton"
slug: llm-inference-benchmark
status: topic
content_type: article
channels:
- wechat
- x
language: en
source_urls: []
assets:
- 05-assets/llm-inference-benchmark/llm-inference-benchmark_fig.png
- 05-assets/llm-inference-benchmark/llm-inference-benchmark_fig.pdf
cover_image: ""
template: article
owner: content-forge
created_at: 2026-03-19T00:00:00+08:00
updated_at: 2026-03-19T00:00:00+08:00
---

# LLM Inference Speed Benchmark: vLLM vs TGI vs Triton

## Background

We benchmarked three popular LLM inference engines, vLLM, TGI (Text Generation Inference), and Triton (with the TensorRT-LLM backend), serving Llama-3-70B on four NVIDIA A100 80GB GPUs.

## Experimental Setup

- Model: Llama-3-70B (FP16)
- Hardware: 4x NVIDIA A100 80GB, NVLink
- Input length: 512 tokens
- Output length: 256 tokens
- Batch sizes: 1, 4, 8, 16, 32, 64
- Metrics: throughput (tokens/second), p99 latency (ms/token)

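The setup above can be written down as a small sweep configuration. This is a minimal sketch, not the harness used for the benchmark: the `BenchmarkConfig` class and its field and method names are hypothetical, and only the values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkConfig:
    """Hypothetical container for the sweep parameters listed in the setup."""
    model: str = "Llama-3-70B"
    dtype: str = "fp16"
    num_gpus: int = 4
    input_tokens: int = 512
    output_tokens: int = 256
    batch_sizes: tuple = (1, 4, 8, 16, 32, 64)
    engines: tuple = ("vLLM", "TGI", "Triton")

    def runs(self):
        """Every (engine, batch size) pair in the sweep."""
        return [(engine, bs) for engine in self.engines for bs in self.batch_sizes]

    def generated_tokens(self, batch_size):
        """Output tokens produced by one full batch."""
        return batch_size * self.output_tokens


config = BenchmarkConfig()
print(len(config.runs()), "engine/batch-size combinations")
```

Keeping the sweep in one immutable object makes it easy to log alongside the results, so each table row can be traced back to the exact parameters that produced it.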
## Results

### Throughput (tokens/sec)

| Batch Size | vLLM | TGI  | Triton |
|------------|------|------|--------|
| 1          | 42   | 38   | 35     |
| 4          | 156  | 142  | 128    |
| 8          | 298  | 265  | 241    |
| 16         | 512  | 478  | 420    |
| 32         | 890  | 810  | 695    |
| 64         | 1240 | 1100 | 920    |

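To compare engines row by row, the relative gain of one engine over another can be computed directly from the throughput table. A minimal sketch: the numbers are copied from the table, while the `THROUGHPUT` dictionary and the `relative_gain` helper are illustrative names, not part of any benchmark tooling.

```python
# Throughput in tokens/sec, copied from the table above.
THROUGHPUT = {
    1: {"vLLM": 42, "TGI": 38, "Triton": 35},
    4: {"vLLM": 156, "TGI": 142, "Triton": 128},
    8: {"vLLM": 298, "TGI": 265, "Triton": 241},
    16: {"vLLM": 512, "TGI": 478, "Triton": 420},
    32: {"vLLM": 890, "TGI": 810, "Triton": 695},
    64: {"vLLM": 1240, "TGI": 1100, "Triton": 920},
}


def relative_gain(batch_size, engine="vLLM", baseline="Triton"):
    """Percent by which `engine` out-produces `baseline` at one batch size."""
    row = THROUGHPUT[batch_size]
    return (row[engine] / row[baseline] - 1.0) * 100.0


for bs in THROUGHPUT:
    print(f"bs={bs:>2}  vLLM vs Triton: {relative_gain(bs):5.1f}%  "
          f"vLLM vs TGI: {relative_gain(bs, baseline='TGI'):5.1f}%")
```

Expressing the gap as a percentage of the baseline keeps the comparison independent of absolute throughput, which grows by roughly 30x across the sweep.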
### Latency p99 (ms/token)

| Batch Size | vLLM | TGI | Triton |
|------------|------|-----|--------|
| 1          | 24   | 26  | 29     |
| 4          | 26   | 29  | 31     |
| 8          | 27   | 30  | 33     |
| 16         | 31   | 33  | 38     |
| 32         | 36   | 39  | 46     |
| 64         | 52   | 58  | 70     |

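The two tables can be cross-checked against each other. The sketch below assumes, which the setup does not state explicitly, that throughput is aggregated over the whole batch while latency is measured per request; under that assumption each request receives one token every `1000 * batch_size / throughput` milliseconds on average. The helper name `implied_ms_per_token` is illustrative.

```python
# Rows copied from the two results tables: batch size -> (vLLM, TGI, Triton).
ENGINES = ("vLLM", "TGI", "Triton")
THROUGHPUT_TOK_S = {
    1: (42, 38, 35), 4: (156, 142, 128), 8: (298, 265, 241),
    16: (512, 478, 420), 32: (890, 810, 695), 64: (1240, 1100, 920),
}
P99_MS_PER_TOKEN = {
    1: (24, 26, 29), 4: (26, 29, 31), 8: (27, 30, 33),
    16: (31, 33, 38), 32: (36, 39, 46), 64: (52, 58, 70),
}


def implied_ms_per_token(batch_size, throughput_tok_s):
    """Mean per-request time per token, ASSUMING throughput is summed over the batch."""
    return 1000.0 * batch_size / throughput_tok_s


for bs, row in THROUGHPUT_TOK_S.items():
    for name, tput, p99 in zip(ENGINES, row, P99_MS_PER_TOKEN[bs]):
        implied = implied_ms_per_token(bs, tput)
        print(f"bs={bs:>2} {name:<6} implied={implied:5.1f} ms/token  p99={p99} ms/token")
```

Under this assumption the implied values land within 1 ms/token of the reported p99 figure in every row, which suggests the two tables are internally consistent.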
## Architecture

The benchmark pipeline follows this flow:

Client → Load Balancer → Inference Engine (vLLM/TGI/Triton) → GPU Cluster → Response

Each engine uses different batching strategies:
- vLLM: Continuous batching with PagedAttention
- TGI: Dynamic batching with FlashAttention-2
- Triton: Static batching with TensorRT-LLM backend

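The difference between static and continuous batching can be illustrated with a toy step counter. This is a deliberately simplified model and not any engine's actual scheduler: it assumes one decode step per generated token, ignores prefill, memory limits, and attention kernels, and uses varying output lengths purely for illustration (the benchmark itself fixes output length at 256 tokens). Both function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        total += max(lengths[start:start + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    queue = deque(lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots before the next step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Each active request emits one token; drop those that just finished.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


output_lengths = [8, 2, 2, 2, 8, 2, 2, 2]  # tokens to generate per request
print("static:    ", static_batching_steps(output_lengths, batch_size=4))
print("continuous:", continuous_batching_steps(output_lengths, batch_size=4))
```

In this toy workload the static scheduler needs 16 steps because short requests wait for the longest one in their batch, while the continuous scheduler needs 10. When every request has the same output length, the two counts are identical in this model.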
## Key Findings

1. vLLM achieves 20-35% higher throughput than Triton across all batch sizes
2. The gap widens at larger batch sizes (20% at bs=1 vs 35% at bs=64)
3. vLLM p99 latency is consistently 16-26% lower than Triton
4. TGI sits between vLLM and Triton on both metrics