content-forge-vault/01-topics/2026-03-19-llm-inference-benchmark.md
2026-03-20 02:00:01 +08:00

id: 2026-03-19-llm-inference-benchmark
title: LLM Inference Speed Benchmark: vLLM vs TGI vs Triton
slug: llm-inference-benchmark
status: topic
content_type: article
channels: wechat, x
language: en
source_urls:
assets: 05-assets/llm-inference-benchmark/llm-inference-benchmark_fig.png, 05-assets/llm-inference-benchmark/llm-inference-benchmark_fig.pdf
cover_image:
template: article
owner: content-forge
created_at: 2026-03-19T00:00:00+08:00
updated_at: 2026-03-19T00:00:00+08:00

LLM Inference Speed Benchmark: vLLM vs TGI vs Triton

Background

We benchmarked three popular LLM inference engines (vLLM, TGI, and Triton) serving Llama-3-70B on four NVIDIA A100 80GB GPUs.

Experimental Setup

  • Model: Llama-3-70B (FP16)
  • Hardware: 4x NVIDIA A100 80GB, NVLink
  • Input length: 512 tokens
  • Output length: 256 tokens
  • Batch sizes: 1, 4, 8, 16, 32, 64
  • Metrics: throughput (tokens/sec) and p99 latency (ms/token)
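
The source does not say how the two metrics were computed. One plausible reading, sketched below in Python, treats throughput as total generated tokens divided by wall-clock time and p99 latency as the nearest-rank 99th percentile of the gaps between consecutive token arrivals. The function names and the choice of percentile method are assumptions made for illustration.

```python
import math


def throughput_tokens_per_sec(total_tokens: int, wall_seconds: float) -> float:
    """Aggregate throughput: tokens generated across all requests per second."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return total_tokens / wall_seconds


def p99_ms_per_token(token_times_s: list[float]) -> float:
    """p99 of inter-token gaps in milliseconds (nearest-rank percentile).

    token_times_s holds the arrival time of each generated token, in seconds.
    """
    if len(token_times_s) < 2:
        raise ValueError("need at least two token timestamps")
    gaps_ms = sorted(
        (later - earlier) * 1000.0
        for earlier, later in zip(token_times_s, token_times_s[1:])
    )
    rank = math.ceil(0.99 * len(gaps_ms))  # 1-based nearest rank
    return gaps_ms[rank - 1]
```

With only a handful of samples the nearest-rank p99 equals the maximum gap, so a real run would need many tokens per configuration for the percentile to be meaningful.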

Results

Throughput (tokens/sec)

Batch Size   vLLM   TGI    Triton
1            42     38     35
4            156    142    128
8            298    265    241
16           512    478    420
32           890    810    695
64           1240   1100   920
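
One way to read the throughput table is as scaling efficiency: how much of the ideal linear speed-up each engine keeps as the batch grows. The sketch below defines efficiency as throughput at batch size N divided by N times the batch-size-1 throughput. That definition is an assumption chosen for illustration, not a metric reported by the benchmark.

```python
# Throughput (tokens/sec) copied from the table above, keyed by batch size.
THROUGHPUT = {
    "vLLM":   {1: 42, 4: 156, 8: 298, 16: 512, 32: 890, 64: 1240},
    "TGI":    {1: 38, 4: 142, 8: 265, 16: 478, 32: 810, 64: 1100},
    "Triton": {1: 35, 4: 128, 8: 241, 16: 420, 32: 695, 64: 920},
}


def scaling_efficiency(by_batch: dict[int, float]) -> dict[int, float]:
    """Fraction of ideal linear scaling retained at each batch size."""
    base = by_batch[1]
    return {bs: tput / (bs * base) for bs, tput in by_batch.items()}


if __name__ == "__main__":
    for engine, by_batch in THROUGHPUT.items():
        eff = scaling_efficiency(by_batch)
        print(engine, {bs: round(e, 2) for bs, e in eff.items()})
```

Under this definition all three engines retain less than half of ideal linear scaling at batch size 64 (roughly 0.41 to 0.46).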

Latency p99 (ms/token)

Batch Size   vLLM   TGI    Triton
1            24     26     29
4            26     29     31
8            27     30     33
16           31     33     38
32           36     39     46
64           52     58     70
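
The engine-to-engine gaps can be derived directly from the two tables above. The sketch below compares vLLM against Triton row by row. The helper names `pct_higher` and `pct_lower` are illustrative and not part of the original benchmark.

```python
BATCH_SIZES = [1, 4, 8, 16, 32, 64]

# Values copied from the throughput and p99 latency tables above.
THROUGHPUT = {
    "vLLM":   [42, 156, 298, 512, 890, 1240],
    "Triton": [35, 128, 241, 420, 695, 920],
}
P99_MS = {
    "vLLM":   [24, 26, 27, 31, 36, 52],
    "Triton": [29, 31, 33, 38, 46, 70],
}


def pct_higher(value: float, baseline: float) -> float:
    """How much larger value is than baseline, in percent of baseline."""
    return (value - baseline) / baseline * 100.0


def pct_lower(value: float, baseline: float) -> float:
    """How much smaller value is than baseline, in percent of baseline."""
    return (baseline - value) / baseline * 100.0


def vllm_vs_triton() -> list[tuple[int, float, float]]:
    """Return (batch size, throughput gain %, p99 latency reduction %) rows."""
    rows = []
    for i, bs in enumerate(BATCH_SIZES):
        gain = pct_higher(THROUGHPUT["vLLM"][i], THROUGHPUT["Triton"][i])
        cut = pct_lower(P99_MS["vLLM"][i], P99_MS["Triton"][i])
        rows.append((bs, gain, cut))
    return rows


if __name__ == "__main__":
    for bs, gain, cut in vllm_vs_triton():
        print(f"bs={bs:>2}  throughput +{gain:.1f}%  p99 latency -{cut:.1f}%")
```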

Architecture

The benchmark pipeline follows this flow:

Client → Load Balancer → Inference Engine (vLLM/TGI/Triton) → GPU Cluster → Response

Each engine uses different batching strategies:

  • vLLM: Continuous batching with PagedAttention
  • TGI: Dynamic batching with FlashAttention-2
  • Triton: Static batching with TensorRT-LLM backend
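
To illustrate why the batching strategy matters, the toy model below counts decode steps for static versus continuous batching when output lengths vary. It is a deliberately simplified sketch: it ignores prefill, KV-cache memory, and PagedAttention, does not model TGI's dynamic batching, and is not how any of the three engines actually schedules work. All function names are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lens: list[int], batch_size: int) -> int:
    """Fixed batches in arrival order; each runs until its longest sequence ends."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens: list[int], batch_size: int) -> int:
    """A freed slot is refilled from the queue before the next decode step.

    Assumes every output length is at least 1.
    """
    waiting = deque(output_lens)
    running: list[int] = []  # tokens still to generate per active sequence
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode step emits one token per active sequence
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


if __name__ == "__main__":
    lens = [2, 8, 2, 8]
    print(static_batching_steps(lens, batch_size=2))      # 16
    print(continuous_batching_steps(lens, batch_size=2))  # 12
```

In this example the static scheduler leaves a slot idle while the short sequence in each batch waits for the long one, whereas the continuous scheduler reuses the slot as soon as it frees up.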

Key Findings

  1. vLLM achieves 20-35% higher throughput than Triton across all batch sizes
  2. The gap generally widens at larger batch sizes (20% at bs=1 vs 35% at bs=64)
  3. vLLM p99 latency is consistently 16-26% lower than Triton
  4. TGI sits between vLLM and Triton on both metrics