A lightweight, high-performance large language model serving framework
Welcome to LightLLM! LightLLM is a large language model (LLM) inference and serving framework written entirely in Python, known for its lightweight design, easy extensibility, and high performance. It draws on the strengths of many open-source projects, including but not limited to FasterTransformer, TGI, vLLM, and FlashAttention.
Key Features
- Multi-process Collaboration: Tokenization, language model inference, vision model inference, and other tasks are performed asynchronously, significantly improving GPU utilization.
- Zero-padding: Supports nopad attention computation across multiple models, efficiently handling batches whose requests differ greatly in length without padding them to a common size (see the packing sketch after this list).
- Dynamic Batching: Performs dynamic batch scheduling of requests, merging new requests into the running batch as capacity allows (see the scheduling sketch after this list).
- FlashAttention: Integrates FlashAttention to improve speed and reduce GPU memory usage during inference.
- Tensor Parallelism: Utilizes multiple GPUs for tensor parallelism to accelerate inference speed.
- Token Attention: Implements a token-level KV cache memory management mechanism, achieving zero memory waste during inference (illustrated by the allocator sketch after this list).
- High-performance Routing: Works together with Token Attention to manage GPU memory precisely at the token level, optimizing system throughput.
- INT8 KV Cache: Stores the KV cache in INT8, roughly doubling the maximum number of tokens that fit in memory (see the quantization sketch after this list). Currently supported only for Llama-architecture models.
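
To make the zero-padding idea concrete, the sketch below packs variable-length requests into a single flat token sequence plus cumulative boundaries, instead of padding every request to the longest one. The `pack_requests` helper is a hypothetical illustration; the cu_seqlens-style layout follows the convention used by FlashAttention's variable-length interface, not necessarily LightLLM's internal format.

```python
# Illustrative sketch: batching without padding. `pack_requests` is a
# hypothetical helper, not a LightLLM API.
import torch


def pack_requests(token_id_lists):
    """Concatenate per-request token ids into one flat tensor and record
    cumulative sequence boundaries, so no padding positions are computed."""
    lengths = torch.tensor([len(t) for t in token_id_lists])
    flat = torch.tensor([tok for req in token_id_lists for tok in req])
    cu_seqlens = torch.zeros(len(token_id_lists) + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(lengths, dim=0)
    return flat, cu_seqlens


flat, cu_seqlens = pack_requests([[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]])
print(flat)        # tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(cu_seqlens)  # tensor([0, 3, 5, 10], dtype=torch.int32)
# A 3 + 2 + 5 token batch attends over 10 rows instead of the 3 * 5 = 15
# rows a pad-to-longest batch would require.
```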
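
Token Attention's token-level KV cache management can be pictured as a flat pool of per-token KV slots that are allocated and released individually, so no memory is reserved for positions a request never uses. The `TokenKVPool` class below is a minimal sketch of that idea under assumed names, not LightLLM's actual memory manager.

```python
# Minimal sketch of token-granular KV cache management. TokenKVPool and its
# layout are illustrative assumptions, not LightLLM's real classes.
import torch


class TokenKVPool:
    """Preallocates KV slots for a fixed token budget and hands out slots
    one token at a time, shared across all running requests."""

    def __init__(self, max_tokens, num_heads, head_dim,
                 dtype=torch.float16, device="cpu"):
        # One KV slot per token: (slot, K/V, head, head_dim).
        self.kv_buffer = torch.empty(
            (max_tokens, 2, num_heads, head_dim), dtype=dtype, device=device)
        self.free_slots = list(range(max_tokens))

    def alloc(self, num_tokens):
        # Grab exactly as many slots as the request needs right now.
        if num_tokens > len(self.free_slots):
            raise RuntimeError("KV cache exhausted")
        slots = self.free_slots[:num_tokens]
        self.free_slots = self.free_slots[num_tokens:]
        return slots

    def free(self, slots):
        # Slots go back to the pool the moment a request finishes, so
        # nothing is wasted on padded or reserved-but-unused positions.
        self.free_slots.extend(slots)


pool = TokenKVPool(max_tokens=8, num_heads=2, head_dim=4)
req_a = pool.alloc(3)   # 3-token prompt
req_b = pool.alloc(5)   # 5-token prompt fills the rest of the pool
pool.free(req_a)        # request A finishes; its 3 slots are reusable at once
print(req_b, pool.free_slots)
```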
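
The combination of dynamic batching and the router's token-level memory accounting can be sketched as admitting waiting requests into the running batch only while a worst-case token count stays within the KV cache budget. The `Request` and `Scheduler` names and the max-output-length heuristic are assumptions for illustration, not LightLLM's router implementation.

```python
# Illustrative token-budget scheduler (dynamic batching + routing idea).
# All names and the admission heuristic are assumptions, not LightLLM's API.
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt_len: int        # tokens already in the prompt
    max_new_tokens: int    # upper bound on generated tokens


@dataclass
class Scheduler:
    token_budget: int                        # total KV slots available
    waiting: deque = field(default_factory=deque)
    running: list = field(default_factory=list)

    def _worst_case_tokens(self, reqs):
        # Upper bound on KV slots these requests could ever occupy.
        return sum(r.prompt_len + r.max_new_tokens for r in reqs)

    def step(self):
        # Admit waiting requests while the worst case still fits the budget,
        # so the batch grows dynamically without risking KV cache overflow.
        while self.waiting:
            candidate = self.waiting[0]
            if self._worst_case_tokens(self.running + [candidate]) > self.token_budget:
                break
            self.running.append(self.waiting.popleft())
        return self.running


sched = Scheduler(token_budget=64)
sched.waiting.extend([Request(10, 20), Request(8, 16), Request(30, 30)])
print(len(sched.step()))   # 2: the first two fit (30 + 24 = 54 <= 64), the third waits
```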
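
The capacity gain from the INT8 KV cache follows directly from storing each cached element in one byte instead of FP16's two, with a scale kept per token for dequantization at attention time. The symmetric per-token quantization below is a hedged sketch of the general technique, not LightLLM's CUDA kernel.

```python
# Sketch of symmetric per-token INT8 quantization for cached K/V tensors.
# Shapes and helper names are illustrative, not LightLLM's implementation.
import torch


def quantize_kv(kv):
    """kv: (num_tokens, num_heads, head_dim). Returns int8 values plus one
    float scale per token."""
    kv = kv.float()
    scale = kv.abs().amax(dim=(1, 2), keepdim=True).clamp(min=1e-6) / 127.0
    q = torch.clamp((kv / scale).round(), -127, 127).to(torch.int8)
    return q, scale


def dequantize_kv(q, scale):
    return q.float() * scale


kv = torch.randn(4, 2, 8)
q, scale = quantize_kv(kv)
# int8 uses half the bytes of fp16, so the same GPU memory holds roughly
# twice as many cached tokens.
print(q.element_size(), kv.half().element_size())   # 1 vs. 2 bytes per element
print((dequantize_kv(q, scale) - kv).abs().max())   # small quantization error
```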