Prefix KV Cache Transfer Between DP Ranks
To address the KV cache hit-rate challenges in DP deployments, we introduce a prefix KV cache transfer mechanism between DP ranks that significantly reduces redundant prefill computation.
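To make the idea concrete, here is a minimal Python sketch of the matching side of such a mechanism. All names are hypothetical and the actual LightLLM transfer path (block layout, NVLink/RDMA movement) is not shown; this only illustrates how a rank can reuse a peer's cached prefix instead of recomputing it:

```python
# Illustrative sketch (not LightLLM's actual API): each DP rank keeps a map
# from token prefixes to the KV block ids holding them. On a local miss, a
# rank asks its peers for the longest matching prefix and fetches those
# blocks rather than re-running the prefill for the shared tokens.
from typing import Dict, List, Tuple

class PrefixKVIndex:
    def __init__(self) -> None:
        # token prefix -> KV block ids caching that prefix
        self._index: Dict[Tuple[int, ...], List[int]] = {}

    def insert(self, tokens: List[int], block_ids: List[int]) -> None:
        self._index[tuple(tokens)] = block_ids

    def longest_match(self, tokens: List[int]) -> Tuple[int, List[int]]:
        # Scan progressively shorter prefixes; real systems use a radix tree.
        for end in range(len(tokens), 0, -1):
            blocks = self._index.get(tuple(tokens[:end]))
            if blocks is not None:
                return end, blocks
        return 0, []

def resolve_prefix(tokens: List[int], local: PrefixKVIndex,
                   peers: List[PrefixKVIndex]) -> Tuple[int, List[int]]:
    """Return (matched_len, block_ids), preferring the local rank's cache."""
    best_len, best_blocks = local.longest_match(tokens)
    for peer in peers:
        n, blocks = peer.longest_match(tokens)
        if n > best_len:
            # In a real deployment these blocks would be transferred between
            # ranks; here we only record which blocks to fetch.
            best_len, best_blocks = n, blocks
    return best_len, best_blocks

if __name__ == "__main__":
    rank0, rank1 = PrefixKVIndex(), PrefixKVIndex()
    rank1.insert([1, 2, 3, 4], block_ids=[7, 8])  # peer already served this prefix
    matched, blocks = resolve_prefix([1, 2, 3, 4, 5], local=rank0, peers=[rank1])
    print(matched, blocks)  # 4 tokens reused -> only token 5 needs prefill
```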
Accelerating MinerU Multimodal Inference with LightLLM
LightLLM now provides optimized support for MinerU multimodal models: we've reduced RPyC communication overhead, accelerated image preprocessing, and optimized ViT batching and downstream scheduling, resulting in significant end-to-end performance improvements.
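As a rough illustration of the ViT batching idea (a toy model, not MinerU or LightLLM code): instead of encoding each request's image separately, queued images are stacked into one tensor so the vision encoder runs a single, larger forward pass:

```python
# Toy sketch of ViT batching: one fused forward over all queued images,
# then the per-request features are split back out. Assumes equal image sizes.
import torch
import torch.nn as nn
from typing import List

class TinyViT(nn.Module):
    def __init__(self, dim: int = 64) -> None:
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # (B, 3, H, W) -> patch tokens (B, N, dim) -> encoded tokens
        x = self.patch_embed(images).flatten(2).transpose(1, 2)
        return self.encoder(x)

@torch.inference_mode()
def encode_batched(vit: TinyViT, pending: List[torch.Tensor]) -> List[torch.Tensor]:
    # Single batched forward for all queued images, split results per request.
    batch = torch.stack(pending)  # (B, 3, H, W)
    feats = vit(batch)
    return list(feats.unbind(0))

if __name__ == "__main__":
    vit = TinyViT().eval()
    queue = [torch.randn(3, 224, 224) for _ in range(4)]  # four waiting requests
    per_request = encode_batched(vit, queue)
    print(len(per_request), per_request[0].shape)  # 4 x (196, 64)
```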
Accelerating Token Generation with MTP (Multi-Token Prediction)
This post describes how LightLLM leverages Multi-Token Prediction (MTP) to accelerate decoding through efficient KV cache reuse and custom operators.
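A schematic sketch of the MTP accept/verify loop follows (hypothetical names, not the LightLLM implementation): a cheap draft head proposes several next tokens at once, the main model scores them in a single batched forward, and only the longest agreeing prefix is accepted, so its KV cache entries are kept rather than recomputed:

```python
# Minimal MTP-style speculative loop: draft k tokens, verify in one pass,
# accept the agreeing prefix plus the model's correction at the first mismatch.
import random
from typing import Callable, List

def mtp_step(prefix: List[int],
             draft: Callable[[List[int], int], List[int]],
             verify: Callable[[List[int], List[int]], List[int]],
             k: int = 3) -> List[int]:
    proposed = draft(prefix, k)          # k speculative tokens, one cheap pass
    expected = verify(prefix, proposed)  # main model's token at each position
    accepted: List[int] = []
    for p, e in zip(proposed, expected):
        if p != e:
            accepted.append(e)           # take the model's correction and stop
            break
        accepted.append(p)               # match: keep token and its KV entries
    return prefix + accepted

if __name__ == "__main__":
    random.seed(0)

    def draft(pre: List[int], k: int) -> List[int]:
        # Toy draft head: usually continues last+1, occasionally guesses wrong.
        out = list(pre)
        for _ in range(k):
            out.append(out[-1] + (2 if random.random() < 0.2 else 1))
        return out[len(pre):]

    def verify(pre: List[int], prop: List[int]) -> List[int]:
        # Toy main model: the correct next token is always last+1.
        return [(pre + prop[:i])[-1] + 1 for i in range(len(prop))]

    print(mtp_step([10], draft, verify))  # emits up to k tokens per step
```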
LightLLM v1.1.0: Now Available!
We're thrilled to announce the official release of LightLLM v1.1.0! This major update introduces a CPU-GPU unified folding architecture that significantly reduces system-level CPU overhead, deep optimizations for DeepSeek and Qwen3-MoE including DeepEP/DeepGEMM integration, fused MoE Triton optimizations, a balanced DP request scheduler, MTP support, and more.
Pre$^3$: Unlocking Faster, Structured LLM Generation with Deterministic Pushdown Automata
We are delighted to introduce our paper, "Pre$^3$: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation," on constrained decoding, which has been accepted to the ACL 2025 Main Conference.
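As a rough illustration of the underlying idea (not Pre$^3$'s actual construction), the sketch below uses a tiny deterministic pushdown automaton to mask out tokens that would break a balanced-bracket grammar during decoding:

```python
# Toy grammar-constrained decoding: a deterministic PDA tracks a stack of open
# brackets; at each step, tokens that would make the output ill-formed are
# removed from the candidate set before sampling.
import random
from typing import List

class BalancedParenDPA:
    """Deterministic PDA accepting balanced '(' ')' strings."""
    def __init__(self) -> None:
        self.stack: List[str] = []

    def allowed(self, vocab: List[str]) -> List[str]:
        ok = []
        for tok in vocab:
            if tok == "(":
                ok.append(tok)                   # push is always legal
            elif tok == ")" and self.stack:
                ok.append(tok)                   # pop only with an open paren
            elif tok == "<eos>" and not self.stack:
                ok.append(tok)                   # finish only when balanced
        return ok

    def advance(self, tok: str) -> None:
        if tok == "(":
            self.stack.append(tok)
        elif tok == ")":
            self.stack.pop()

if __name__ == "__main__":
    random.seed(1)
    dpa, vocab, out = BalancedParenDPA(), ["(", ")", "<eos>"], []
    while True:
        tok = random.choice(dpa.allowed(vocab))  # stand-in for masked sampling
        if tok == "<eos>":
            break
        out.append(tok)
        dpa.advance(tok)
    print("".join(out))                          # always well balanced
```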
LightLLM v1.0.0: Now Available!
We're thrilled to announce the official release of LightLLM v1.0.0! This major update brings groundbreaking improvements, including minimal inter-process communication overhead, the fastest DeepSeek-R1 serving performance on a single H200, and prototype support for PD-Disaggregation.