vault: auto-sync 2026-03-20 02:00
parent ab44aca650
commit eee871b788
77
01-topics/2026-03-19-llm-inference-benchmark.md
Normal file
@@ -0,0 +1,77 @@
---
id: 2026-03-19-llm-inference-benchmark
title: "LLM Inference Speed Benchmark: vLLM vs TGI vs Triton"
slug: llm-inference-benchmark
status: topic
content_type: article
channels:
- wechat
- x
language: en
source_urls: []
assets:
- 05-assets/llm-inference-benchmark/llm-inference-benchmark_fig.png
- 05-assets/llm-inference-benchmark/llm-inference-benchmark_fig.pdf
cover_image: ""
template: article
owner: content-forge
created_at: 2026-03-19T00:00:00+08:00
updated_at: 2026-03-19T00:00:00+08:00
---

# LLM Inference Speed Benchmark: vLLM vs TGI vs Triton

## Background

We benchmarked three popular LLM inference engines (vLLM, TGI, and Triton) serving Llama-3-70B on NVIDIA A100 80GB GPUs.

## Experimental Setup

- Model: Llama-3-70B (FP16)
- Hardware: 4x NVIDIA A100 80GB, NVLink
- Input length: 512 tokens
- Output length: 256 tokens
- Batch sizes: 1, 4, 8, 16, 32, 64
- Metrics: throughput (tokens/second), p99 latency (ms/token)
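
The two metrics can be related to raw timings roughly as follows. This is a minimal sketch, not the harness used for this benchmark: the `RunTiming` structure, its field names, and the sample numbers are hypothetical, and the latency computed here is a mean rather than the p99 reported in the results.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    """Timing for one batch run (hypothetical structure, not from the benchmark)."""

    batch_size: int
    output_tokens_per_request: int
    decode_seconds: float  # wall-clock time spent generating the outputs


def throughput_tokens_per_sec(run: RunTiming) -> float:
    # Aggregate throughput counts tokens generated across the whole batch.
    total_tokens = run.batch_size * run.output_tokens_per_request
    return total_tokens / run.decode_seconds


def latency_ms_per_token(run: RunTiming) -> float:
    # Each request receives one token per decode step, so per-request
    # latency does not divide by the batch size.
    return run.decode_seconds * 1000.0 / run.output_tokens_per_request


# Made-up example values, chosen only to keep the arithmetic easy to follow.
run = RunTiming(batch_size=8, output_tokens_per_request=256, decode_seconds=8.0)
print(throughput_tokens_per_sec(run))  # 256.0
print(latency_ms_per_token(run))       # 31.25
```

A p99 figure would additionally need per-token timestamps so that the 99th percentile of the inter-token gaps can be taken.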

## Results

### Throughput (tokens/sec)

| Batch Size | vLLM  | TGI   | Triton |
|------------|-------|-------|--------|
| 1          | 42    | 38    | 35     |
| 4          | 156   | 142   | 128    |
| 8          | 298   | 265   | 241    |
| 16         | 512   | 478   | 420    |
| 32         | 890   | 810   | 695    |
| 64         | 1240  | 1100  | 920    |
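
To make the relative differences in the throughput table easier to read, the percentage advantage over Triton can be computed per batch size. The sketch below only re-derives numbers from the table; the names `THROUGHPUT` and `percent_gain` are illustrative.

```python
# Throughput in tokens/sec, copied from the table above.
THROUGHPUT = {
    1: {"vLLM": 42, "TGI": 38, "Triton": 35},
    4: {"vLLM": 156, "TGI": 142, "Triton": 128},
    8: {"vLLM": 298, "TGI": 265, "Triton": 241},
    16: {"vLLM": 512, "TGI": 478, "Triton": 420},
    32: {"vLLM": 890, "TGI": 810, "Triton": 695},
    64: {"vLLM": 1240, "TGI": 1100, "Triton": 920},
}


def percent_gain(engine: str, baseline: str = "Triton") -> dict[int, int]:
    """Throughput advantage of `engine` over `baseline`, in whole percent."""
    return {
        bs: round((row[engine] / row[baseline] - 1.0) * 100.0)
        for bs, row in THROUGHPUT.items()
    }


vllm_gain = percent_gain("vLLM")
tgi_gain = percent_gain("TGI")
print(vllm_gain)  # {1: 20, 4: 22, 8: 24, 16: 22, 32: 28, 64: 35}
print(tgi_gain)   # {1: 9, 4: 11, 8: 10, 16: 14, 32: 17, 64: 20}
```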

### Latency p99 (ms/token)

| Batch Size | vLLM | TGI  | Triton |
|------------|------|------|--------|
| 1          | 24   | 26   | 29     |
| 4          | 26   | 29   | 31     |
| 8          | 27   | 30   | 33     |
| 16         | 31   | 33   | 38     |
| 32         | 36   | 39   | 46     |
| 64         | 52   | 58   | 70     |
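
The latency table can be summarized in two ways: how much lower each engine's p99 is than Triton's, and how much each engine's p99 grows from batch size 1 to 64. The sketch below derives both from the table values only; `LATENCY_P99`, `percent_reduction`, and `slowdown` are illustrative names.

```python
# p99 latency in ms/token, copied from the table above.
LATENCY_P99 = {
    1: {"vLLM": 24, "TGI": 26, "Triton": 29},
    4: {"vLLM": 26, "TGI": 29, "Triton": 31},
    8: {"vLLM": 27, "TGI": 30, "Triton": 33},
    16: {"vLLM": 31, "TGI": 33, "Triton": 38},
    32: {"vLLM": 36, "TGI": 39, "Triton": 46},
    64: {"vLLM": 52, "TGI": 58, "Triton": 70},
}


def percent_reduction(engine: str, baseline: str = "Triton") -> dict[int, int]:
    """How much lower `engine`'s p99 latency is than `baseline`'s, in whole percent."""
    return {
        bs: round((1.0 - row[engine] / row[baseline]) * 100.0)
        for bs, row in LATENCY_P99.items()
    }


def slowdown(engine: str, small: int = 1, large: int = 64) -> float:
    """Ratio of p99 latency at the large batch size to the small one."""
    return round(LATENCY_P99[large][engine] / LATENCY_P99[small][engine], 2)


vllm_reduction = percent_reduction("vLLM")
print(vllm_reduction)  # {1: 17, 4: 16, 8: 18, 16: 18, 32: 22, 64: 26}
print({e: slowdown(e) for e in ("vLLM", "TGI", "Triton")})
# {'vLLM': 2.17, 'TGI': 2.23, 'Triton': 2.41}
```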

## Architecture

The benchmark pipeline follows this flow:

Client → Load Balancer → Inference Engine (vLLM/TGI/Triton) → GPU Cluster → Response

Each engine uses different batching strategies:
- vLLM: Continuous batching with PagedAttention
- TGI: Dynamic batching with FlashAttention-2
- Triton: Static batching with TensorRT-LLM backend
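
To illustrate why the batching strategy matters, here is a toy model that counts decode steps under static versus continuous batching. It is a sketch under strong assumptions, not how any of the three engines is implemented: each request needs a fixed number of decode steps, every step costs the same, and prefill, memory limits, and attention kernels are ignored. The function names and the sample request lengths are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths: list[int], slots: int) -> int:
    """Fixed batches: each batch runs until its longest request finishes."""
    return sum(
        max(lengths[i:i + slots]) for i in range(0, len(lengths), slots)
    )


def continuous_batching_steps(lengths: list[int], slots: int) -> int:
    """Slots are refilled from the queue as soon as a request finishes."""
    queue = deque(lengths)
    active: list[int] = []  # remaining decode steps per running request
    steps = 0
    while queue or active:
        # Refill free slots before each decode step.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; finished ones free their slot.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical workload: two long requests mixed with short ones.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
print(static_batching_steps(lengths, slots=4))      # 16
print(continuous_batching_steps(lengths, slots=4))  # 10
```

In this toy workload the static scheduler leaves three of four slots idle while each long request finishes, whereas the continuous scheduler starts the second long request as soon as slots free up.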

## Key Findings

1. vLLM achieves 20-35% higher throughput than Triton across all batch sizes
2. The gap widens at larger batch sizes (20% at bs=1 vs 35% at bs=64)
3. vLLM p99 latency is consistently 16-26% lower than Triton's
4. TGI sits between vLLM and Triton on both metrics at every batch size