Prefix KV Cache Transfer Between DP Ranks
To address the KV cache hit-rate challenges in DP deployments, we introduce a prefix KV cache transfer mechanism between DP ranks that significantly reduces redundant prefill computation.
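To make the idea concrete, here is a minimal Python sketch of the matching side of such a mechanism. All names are hypothetical and the actual LightLLM transfer path (block layout, NVLink/RDMA movement) is not shown; this only illustrates how a rank can reuse a peer's cached prefix instead of recomputing it:

```python
# Illustrative sketch (not LightLLM's actual API): each DP rank keeps a map
# from token prefixes to the KV block ids holding them. On a local miss, a
# rank asks its peers for the longest matching prefix and fetches those
# blocks rather than re-running the prefill for the shared tokens.
from typing import Dict, List, Tuple

class PrefixKVIndex:
    def __init__(self) -> None:
        # token prefix -> KV block ids caching that prefix
        self._index: Dict[Tuple[int, ...], List[int]] = {}

    def insert(self, tokens: List[int], block_ids: List[int]) -> None:
        self._index[tuple(tokens)] = block_ids

    def longest_match(self, tokens: List[int]) -> Tuple[int, List[int]]:
        # Scan progressively shorter prefixes; real systems use a radix tree.
        for end in range(len(tokens), 0, -1):
            blocks = self._index.get(tuple(tokens[:end]))
            if blocks is not None:
                return end, blocks
        return 0, []

def resolve_prefix(tokens: List[int], local: PrefixKVIndex,
                   peers: List[PrefixKVIndex]) -> Tuple[int, List[int]]:
    """Return (matched_len, block_ids), preferring the local rank's cache."""
    best_len, best_blocks = local.longest_match(tokens)
    for peer in peers:
        n, blocks = peer.longest_match(tokens)
        if n > best_len:
            # In a real deployment these blocks would be transferred between
            # ranks; here we only record which blocks to fetch.
            best_len, best_blocks = n, blocks
    return best_len, best_blocks

if __name__ == "__main__":
    rank0, rank1 = PrefixKVIndex(), PrefixKVIndex()
    rank1.insert([1, 2, 3, 4], block_ids=[7, 8])  # peer already served this prefix
    matched, blocks = resolve_prefix([1, 2, 3, 4, 5], local=rank0, peers=[rank1])
    print(matched, blocks)  # 4 tokens reused -> only token 5 needs prefill
```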
Accelerating MinerU Multimodal Inference with LightLLM
LightLLM now provides optimized support for MinerU multimodal models: we've reduced RPyC communication overhead, accelerated image preprocessing, and optimized ViT batching and downstream scheduling, resulting in significant end-to-end performance improvements.
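As a rough illustration of the ViT batching idea (a toy model, not MinerU or LightLLM code): instead of encoding each request's image separately, queued images are stacked into one tensor so the vision encoder runs a single, larger forward pass:

```python
# Toy sketch of ViT batching: one fused forward over all queued images,
# then the per-request features are split back out. Assumes equal image sizes.
import torch
import torch.nn as nn
from typing import List

class TinyViT(nn.Module):
    def __init__(self, dim: int = 64) -> None:
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # (B, 3, H, W) -> patch tokens (B, N, dim) -> encoded tokens
        x = self.patch_embed(images).flatten(2).transpose(1, 2)
        return self.encoder(x)

@torch.inference_mode()
def encode_batched(vit: TinyViT, pending: List[torch.Tensor]) -> List[torch.Tensor]:
    # Single batched forward for all queued images, split results per request.
    batch = torch.stack(pending)  # (B, 3, H, W)
    feats = vit(batch)
    return list(feats.unbind(0))

if __name__ == "__main__":
    vit = TinyViT().eval()
    queue = [torch.randn(3, 224, 224) for _ in range(4)]  # four waiting requests
    per_request = encode_batched(vit, queue)
    print(len(per_request), per_request[0].shape)  # 4 x (196, 64)
```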
Accelerating Token Generation with MTP (Multi-Token Prediction)
This post describes how LightLLM leverages Multi-Token Prediction (MTP) to accelerate decoding through efficient KV cache reuse and custom operators.
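A schematic sketch of the MTP accept/verify loop follows (hypothetical names, not the LightLLM implementation): a cheap draft head proposes several next tokens at once, the main model scores them in a single batched forward, and only the longest agreeing prefix is accepted, so its KV cache entries are kept rather than recomputed:

```python
# Minimal MTP-style speculative loop: draft k tokens, verify in one pass,
# accept the agreeing prefix plus the model's correction at the first mismatch.
import random
from typing import Callable, List

def mtp_step(prefix: List[int],
             draft: Callable[[List[int], int], List[int]],
             verify: Callable[[List[int], List[int]], List[int]],
             k: int = 3) -> List[int]:
    proposed = draft(prefix, k)          # k speculative tokens, one cheap pass
    expected = verify(prefix, proposed)  # main model's token at each position
    accepted: List[int] = []
    for p, e in zip(proposed, expected):
        if p != e:
            accepted.append(e)           # take the model's correction and stop
            break
        accepted.append(p)               # match: keep token and its KV entries
    return prefix + accepted

if __name__ == "__main__":
    random.seed(0)

    def draft(pre: List[int], k: int) -> List[int]:
        # Toy draft head: usually continues last+1, occasionally guesses wrong.
        out = list(pre)
        for _ in range(k):
            out.append(out[-1] + (2 if random.random() < 0.2 else 1))
        return out[len(pre):]

    def verify(pre: List[int], prop: List[int]) -> List[int]:
        # Toy main model: the correct next token is always last+1.
        return [(pre + prop[:i])[-1] + 1 for i in range(len(prop))]

    print(mtp_step([10], draft, verify))  # emits up to k tokens per step
```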
LightLLM v1.1.0: Now Available!
We're thrilled to announce the official release of LightLLM v1.1.0! This major update introduces a CPU-GPU unified folding architecture that significantly reduces system-level CPU overhead, deep optimizations for DeepSeek and Qwen3-MoE including DeepEP/DeepGEMM integration, fused MoE Triton optimizations, a balanced DP request scheduler, MTP support, and more.
Pre$^3$: Unlocking Faster, Structured LLM Generation with Deterministic Pushdown Automata
We are delighted to introduce our paper, "Pre$^3$: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation," on constrained decoding, which has been accepted to the ACL 2025 Main Conference.
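As a rough illustration of the underlying idea (not Pre$^3$'s actual construction), the sketch below uses a tiny deterministic pushdown automaton to mask out tokens that would break a balanced-bracket grammar during decoding:

```python
# Toy grammar-constrained decoding: a deterministic PDA tracks a stack of open
# brackets; at each step, tokens that would make the output ill-formed are
# removed from the candidate set before sampling.
import random
from typing import List

class BalancedParenDPA:
    """Deterministic PDA accepting balanced '(' ')' strings."""
    def __init__(self) -> None:
        self.stack: List[str] = []

    def allowed(self, vocab: List[str]) -> List[str]:
        ok = []
        for tok in vocab:
            if tok == "(":
                ok.append(tok)                   # push is always legal
            elif tok == ")" and self.stack:
                ok.append(tok)                   # pop only with an open paren
            elif tok == "<eos>" and not self.stack:
                ok.append(tok)                   # finish only when balanced
        return ok

    def advance(self, tok: str) -> None:
        if tok == "(":
            self.stack.append(tok)
        elif tok == ")":
            self.stack.pop()

if __name__ == "__main__":
    random.seed(1)
    dpa, vocab, out = BalancedParenDPA(), ["(", ")", "<eos>"], []
    while True:
        tok = random.choice(dpa.allowed(vocab))  # stand-in for masked sampling
        if tok == "<eos>":
            break
        out.append(tok)
        dpa.advance(tok)
    print("".join(out))                          # always well balanced
```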
LightLLM v1.0.0: Now Available!
We're thrilled to announce the official release of LightLLM v1.0.0! This major update brings groundbreaking improvements, including minimal inter-process communication overhead, the fastest DeepSeek-R1 serving performance on a single H200, and prototype support for PD-Disaggregation.