title

slug

status

content_type

channels

language

source_urls

assets

cover_image

template

owner

created_at

updated_at

2026-03-19-llm-inference-benchmark

LLM Inference Speed Benchmark: vLLM vs TGI vs Triton

llm-inference-benchmark

topic

article

wechat

05-assets/llm-inference-benchmark/llm-inference-benchmark_fig.png

05-assets/llm-inference-benchmark/llm-inference-benchmark_fig.pdf

article

content-forge

2026-03-19T00:00:00+08:00

LLM Inference Speed Benchmark: vLLM vs TGI vs Triton

Background

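We benchmarked three popular LLM inference engines, vLLM, Hugging Face Text Generation Inference (TGI), and NVIDIA Triton Inference Server, serving Llama-3-70B in FP16 on four NVIDIA A100 80GB GPUs.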

Experimental Setup

Model: Llama-3-70B (FP16)
Hardware: 4x NVIDIA A100 80GB, NVLink
Input length: 512 tokens
Output length: 256 tokens
Batch sizes: 1, 4, 8, 16, 32, 64
Metrics: throughput (tokens/second) and p99 latency (ms/token)
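
For reference, the two metrics can be derived from raw timings roughly as follows. This is a minimal sketch, not the harness used for these runs: `BenchmarkConfig`, `throughput_tokens_per_sec`, and `ms_per_token` are hypothetical names, and counting only generated tokens toward throughput is an assumption the setup above does not state.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BenchmarkConfig:
    """Setup parameters transcribed from the list above (class name is hypothetical)."""
    model: str = "Llama-3-70B"
    dtype: str = "fp16"
    num_gpus: int = 4
    input_tokens: int = 512
    output_tokens: int = 256
    batch_sizes: Tuple[int, ...] = (1, 4, 8, 16, 32, 64)


def throughput_tokens_per_sec(batch_size: int, output_tokens: int, wall_seconds: float) -> float:
    """Generated tokens per second for one batch.

    Assumption: only output tokens are counted, not the prompt tokens.
    """
    return batch_size * output_tokens / wall_seconds


def ms_per_token(output_tokens: int, request_seconds: float) -> float:
    """Average time per generated token for a single request, in milliseconds."""
    return request_seconds * 1000.0 / output_tokens
```

For example, a batch of 4 requests that each generate 256 tokens in 8 seconds of wall time corresponds to 128 tokens/second.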

Results

Throughput (tokens/sec)

Batch Size	vLLM	TGI	Triton
1	42	38	35
4	156	142	128
8	298	265	241
16	512	478	420
32	890	810	695
64	1240	1100	920
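
The relative gap between engines can be read straight off this table. The snippet below is a small sketch that recomputes it; the `THROUGHPUT` dictionary simply transcribes the rows above, and `relative_gain` is a hypothetical helper name.

```python
# Throughput in tokens/sec, transcribed from the table above.
THROUGHPUT = {
    1: {"vLLM": 42, "TGI": 38, "Triton": 35},
    4: {"vLLM": 156, "TGI": 142, "Triton": 128},
    8: {"vLLM": 298, "TGI": 265, "Triton": 241},
    16: {"vLLM": 512, "TGI": 478, "Triton": 420},
    32: {"vLLM": 890, "TGI": 810, "Triton": 695},
    64: {"vLLM": 1240, "TGI": 1100, "Triton": 920},
}


def relative_gain(batch_size: int, engine: str, baseline: str) -> float:
    """Percentage by which `engine` out-produces `baseline` at one batch size."""
    row = THROUGHPUT[batch_size]
    return (row[engine] / row[baseline] - 1.0) * 100.0


for bs in THROUGHPUT:
    gain = relative_gain(bs, "vLLM", "Triton")
    print(f"bs={bs:>2}: vLLM vs Triton {gain:+.1f}%")
```

For these rows the gain of vLLM over Triton comes out at roughly 20% at batch size 1 and roughly 35% at batch size 64.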

Latency p99 (ms/token)

Batch Size	vLLM	TGI	Triton
1	24	26	29
4	26	29	31
8	27	30	33
16	31	33	38
32	36	39	46
64	52	58	70
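
The source does not say how p99 was computed. One common definition is the nearest-rank percentile over per-token latency samples, sketched below; the function name and the sample values are illustrative assumptions, not measurements from this benchmark.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical per-token latencies in ms: mostly fast, with two slow outliers.
latencies_ms = [24.0] * 98 + [30.0, 50.0]
p99 = percentile_nearest_rank(latencies_ms, 99)
```

With 100 samples, p99 is the 99th value in sorted order, so a single extreme outlier does not move it but a second one does.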

Architecture

The benchmark pipeline follows this flow:

Client → Load Balancer → Inference Engine (vLLM/TGI/Triton) → GPU Cluster → Response

Each engine uses different batching strategies:

vLLM: Continuous batching with PagedAttention
TGI: Dynamic batching with FlashAttention-2
Triton: Static batching with TensorRT-LLM backend
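
To illustrate why the batching strategy matters, here is a toy simulation that counts decode steps under static and continuous batching. It is a sketch under strong assumptions: one token per request per step, no prefill cost, no memory limits, and hypothetical output lengths. It does not model PagedAttention, FlashAttention-2, or TGI's dynamic batching, and it is not how any of the three engines is implemented.

```python
from collections import deque
from typing import Sequence


def static_batching_steps(output_lengths: Sequence[int], batch_size: int) -> int:
    """Fixed batches: each batch occupies the GPU until its longest request finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(output_lengths: Sequence[int], batch_size: int) -> int:
    """Slots freed by finished requests are refilled from the queue before every step."""
    queue = deque(output_lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1  # one decode step yields one token per in-flight request
        running = [left - 1 for left in running if left > 1]
    return steps


# Hypothetical mixed workload: short and long generations interleaved.
lengths = [2, 8, 2, 8]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
```

In this toy workload static batching needs 16 steps and continuous batching needs 12, because short requests no longer hold a slot idle while waiting for the longest request in their batch.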

Key Findings

vLLM achieves 20-35% higher throughput than Triton across all batch sizes
The gap widens at larger batch sizes (20% at bs=1 vs 35% at bs=64)
vLLM p99 latency is consistently 16-26% lower than Triton
TGI sits between vLLM and Triton on both metrics
