Accelerating Token Generation with MTP (Multi-Token Prediction)
Leveraging Multi-Token Prediction to optimize inference performance through efficient KV cache reuse and custom operators.
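For intuition, here is a minimal Python sketch of the draft-and-verify loop behind MTP-style decoding. This is an illustration under our own simplifying assumptions, not LightLLM's implementation: `main_model` and `mtp_head` are hypothetical stand-ins for the full forward pass and the lightweight multi-token prediction head.

```python
# Toy sketch of MTP-style draft-and-verify decoding (illustrative only;
# `main_model` and `mtp_head` are hypothetical, not LightLLM APIs).

def generate_step(main_model, mtp_head, prefix, num_draft=3):
    """Draft `num_draft` tokens cheaply, then check them all with one
    full-model pass; conceptually, the prefix's KV cache is reused."""
    draft = mtp_head(prefix, num_draft)        # cheap multi-token proposal
    predictions = main_model(prefix + draft)   # predictions[j] = next token after position j
    accepted = []
    for i, tok in enumerate(draft):
        verified = predictions[len(prefix) + i - 1]  # main model's pick for this slot
        accepted.append(verified)
        if verified != tok:                    # first disagreement: stop accepting drafts
            break
    return accepted

# Demo with stub models: the "main model" predicts (position + 1) % 5,
# and the draft head happens to agree on all three drafted tokens.
def main_model(seq):
    return [(i + 1) % 5 for i in range(len(seq))]

def mtp_head(prefix, n):
    return [(len(prefix) + i) % 5 for i in range(n)]

print(generate_step(main_model, mtp_head, [0, 1, 2]))  # [3, 4, 0]
```

The win comes from verifying several drafted tokens in a single full-model pass over the shared prefix, so the expensive model runs once per batch of drafts rather than once per token.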
LightLLM v1.1.0: Now Available!
We're thrilled to announce the official release of LightLLM v1.1.0! This major update introduces a CPU-GPU unified folding architecture that significantly reduces system-level CPU overhead, along with deep optimizations for DeepSeek and Qwen3-MoE, including DeepEP/DeepGEMM integration, fused MoE Triton optimizations, a balanced DP request scheduler, and MTP support, among other improvements.
Pre$^3$: Unlocking Faster, Structured LLM Generation with Deterministic Pushdown Automata
We are delighted to introduce our paper on constrained decoding, "Pre$^3$: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation," which has been accepted to the ACL 2025 Main Conference.
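To sketch the underlying idea in toy form (our own simplified example, not the Pre$^3$ algorithm): a deterministic pushdown automaton keeps a stack of open brackets, so at each decoding step the set of grammar-preserving next tokens can be computed by inspecting only the top of the stack.

```python
# Simplified illustration of pushdown-automaton-based constrained decoding
# (a toy bracket grammar of our own, not the Pre^3 construction): a stack
# tracks open brackets so only grammar-preserving next tokens are allowed.

OPEN = {"(": ")", "[": "]", "{": "}"}
CLOSE = set(OPEN.values())

def allowed_tokens(stack, vocab, max_depth=8):
    """Return the subset of `vocab` that keeps the sequence well-bracketed."""
    ok = set()
    for tok in vocab:
        if tok in OPEN:
            if len(stack) < max_depth:            # cap nesting depth
                ok.add(tok)
        elif tok in CLOSE:
            if stack and OPEN[stack[-1]] == tok:  # must match the stack top
                ok.add(tok)
        else:
            ok.add(tok)                           # plain tokens are always legal
    return ok

def step(stack, tok):
    """Advance the automaton's stack after emitting `tok`."""
    if tok in OPEN:
        stack.append(tok)
    elif tok in CLOSE:
        stack.pop()
    return stack

vocab = ["(", ")", "[", "]", "x"]
stack = []
for tok in ["(", "[", "x"]:
    assert tok in allowed_tokens(stack, vocab)
    stack = step(stack, tok)
print(sorted(allowed_tokens(stack, vocab)))  # ['(', '[', ']', 'x'] -- ')' is masked
```

Because each transition touches only the top of the stack, the legality check stays constant-time per token regardless of how long the generated prefix grows.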
LightLLM v1.0.0: Now Available!
We're thrilled to announce the official release of LightLLM v1.0.0! This major update brings groundbreaking improvements, including minimal inter-process communication overhead, the fastest DeepSeek-R1 serving performance on a single H200, and prototype support for PD-Disaggregation.