A lightweight, high-performance large language model serving framework
Welcome to LightLLM! LightLLM is a large language model (LLM) inference and serving framework written entirely in Python, known for its lightweight design, easy extensibility, and high performance. It draws on the strengths of many open-source projects, including but not limited to FasterTransformer, TGI, vLLM, and FlashAttention.
Key Features
- Multi-process Collaboration: Tokenization, language model inference, vision model inference, and other tasks are performed asynchronously, significantly improving GPU utilization.
- Zero-padding: Supports nopad attention computation across multiple models, efficiently handling batches whose requests differ greatly in length without padding them to a common size (see the packing sketch after this list).
- Dynamic Batching: Performs dynamic batch scheduling of requests, merging new requests into the running batch as capacity allows (see the scheduling sketch after this list).
- FlashAttention: Integrates FlashAttention to improve speed and reduce GPU memory usage during inference.
- Tensor Parallelism: Utilizes multiple GPUs for tensor parallelism to accelerate inference speed.
- Token Attention: Implements a token-level KV cache memory management mechanism, achieving zero memory waste during inference (illustrated by the allocator sketch after this list).
- High-performance Routing: Works together with Token Attention to manage GPU memory precisely at the token level, optimizing system throughput.
- INT8 KV Cache: Stores the KV cache in INT8, roughly doubling the maximum number of tokens that fit in memory (see the quantization sketch after this list). Currently supported only for Llama-architecture models.
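
To make the zero-padding idea concrete, the sketch below packs variable-length requests into a single flat token sequence plus cumulative boundaries, instead of padding every request to the longest one. The `pack_requests` helper is a hypothetical illustration; the cu_seqlens-style layout follows the convention used by FlashAttention's variable-length interface, not necessarily LightLLM's internal format.

```python
# Illustrative sketch: batching without padding. `pack_requests` is a
# hypothetical helper, not a LightLLM API.
import torch


def pack_requests(token_id_lists):
    """Concatenate per-request token ids into one flat tensor and record
    cumulative sequence boundaries, so no padding positions are computed."""
    lengths = torch.tensor([len(t) for t in token_id_lists])
    flat = torch.tensor([tok for req in token_id_lists for tok in req])
    cu_seqlens = torch.zeros(len(token_id_lists) + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(lengths, dim=0)
    return flat, cu_seqlens


flat, cu_seqlens = pack_requests([[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]])
print(flat)        # tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(cu_seqlens)  # tensor([0, 3, 5, 10], dtype=torch.int32)
# A 3 + 2 + 5 token batch attends over 10 rows instead of the 3 * 5 = 15
# rows a pad-to-longest batch would require.
```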
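
Token Attention's token-level KV cache management can be pictured as a flat pool of per-token KV slots that are allocated and released individually, so no memory is reserved for positions a request never uses. The `TokenKVPool` class below is a minimal sketch of that idea under assumed names, not LightLLM's actual memory manager.

```python
# Minimal sketch of token-granular KV cache management. TokenKVPool and its
# layout are illustrative assumptions, not LightLLM's real classes.
import torch


class TokenKVPool:
    """Preallocates KV slots for a fixed token budget and hands out slots
    one token at a time, shared across all running requests."""

    def __init__(self, max_tokens, num_heads, head_dim,
                 dtype=torch.float16, device="cpu"):
        # One KV slot per token: (slot, K/V, head, head_dim).
        self.kv_buffer = torch.empty(
            (max_tokens, 2, num_heads, head_dim), dtype=dtype, device=device)
        self.free_slots = list(range(max_tokens))

    def alloc(self, num_tokens):
        # Grab exactly as many slots as the request needs right now.
        if num_tokens > len(self.free_slots):
            raise RuntimeError("KV cache exhausted")
        slots = self.free_slots[:num_tokens]
        self.free_slots = self.free_slots[num_tokens:]
        return slots

    def free(self, slots):
        # Slots go back to the pool the moment a request finishes, so
        # nothing is wasted on padded or reserved-but-unused positions.
        self.free_slots.extend(slots)


pool = TokenKVPool(max_tokens=8, num_heads=2, head_dim=4)
req_a = pool.alloc(3)   # 3-token prompt
req_b = pool.alloc(5)   # 5-token prompt fills the rest of the pool
pool.free(req_a)        # request A finishes; its 3 slots are reusable at once
print(req_b, pool.free_slots)
```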
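
The combination of dynamic batching and the router's token-level memory accounting can be sketched as admitting waiting requests into the running batch only while a worst-case token count stays within the KV cache budget. The `Request` and `Scheduler` names and the max-output-length heuristic are assumptions for illustration, not LightLLM's router implementation.

```python
# Illustrative token-budget scheduler (dynamic batching + routing idea).
# All names and the admission heuristic are assumptions, not LightLLM's API.
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt_len: int        # tokens already in the prompt
    max_new_tokens: int    # upper bound on generated tokens


@dataclass
class Scheduler:
    token_budget: int                        # total KV slots available
    waiting: deque = field(default_factory=deque)
    running: list = field(default_factory=list)

    def _worst_case_tokens(self, reqs):
        # Upper bound on KV slots these requests could ever occupy.
        return sum(r.prompt_len + r.max_new_tokens for r in reqs)

    def step(self):
        # Admit waiting requests while the worst case still fits the budget,
        # so the batch grows dynamically without risking KV cache overflow.
        while self.waiting:
            candidate = self.waiting[0]
            if self._worst_case_tokens(self.running + [candidate]) > self.token_budget:
                break
            self.running.append(self.waiting.popleft())
        return self.running


sched = Scheduler(token_budget=64)
sched.waiting.extend([Request(10, 20), Request(8, 16), Request(30, 30)])
print(len(sched.step()))   # 2: the first two fit (30 + 24 = 54 <= 64), the third waits
```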
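
The capacity gain from the INT8 KV cache follows directly from storing each cached element in one byte instead of FP16's two, with a scale kept per token for dequantization at attention time. The symmetric per-token quantization below is a hedged sketch of the general technique, not LightLLM's CUDA kernel.

```python
# Sketch of symmetric per-token INT8 quantization for cached K/V tensors.
# Shapes and helper names are illustrative, not LightLLM's implementation.
import torch


def quantize_kv(kv):
    """kv: (num_tokens, num_heads, head_dim). Returns int8 values plus one
    float scale per token."""
    kv = kv.float()
    scale = kv.abs().amax(dim=(1, 2), keepdim=True).clamp(min=1e-6) / 127.0
    q = torch.clamp((kv / scale).round(), -127, 127).to(torch.int8)
    return q, scale


def dequantize_kv(q, scale):
    return q.float() * scale


kv = torch.randn(4, 2, 8)
q, scale = quantize_kv(kv)
# int8 uses half the bytes of fp16, so the same GPU memory holds roughly
# twice as many cached tokens.
print(q.element_size(), kv.half().element_size())   # 1 vs. 2 bytes per element
print((dequantize_kv(q, scale) - kv).abs().max())   # small quantization error
```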