About LightLLM

LightLLM

A lightweight, high-performance large language model serving framework

Welcome to LightLLM! LightLLM is a large language model inference and serving framework written entirely in Python, featuring a lightweight design, easy extensibility, and high performance. LightLLM draws on the strengths of many open-source implementations, including but not limited to FasterTransformer, TGI, vLLM, and FlashAttention.

Key Features

  • Multi-process Collaboration: Tokenization, language model inference, vision model inference, and other tasks are performed asynchronously, significantly improving GPU utilization.
  • Zero-padding: Supports nopad (unpadded) attention computation across multiple models, efficiently handling batches of requests with large length differences.
  • Dynamic Batching: Schedules requests into batches dynamically, admitting new requests and retiring finished ones between decoding iterations (a minimal sketch follows this list).
  • FlashAttention: Integrates FlashAttention to improve speed and reduce GPU memory usage during inference.
  • Tensor Parallelism: Utilizes multiple GPUs for tensor parallelism to accelerate inference speed.
  • Token Attention: Implements a token-level KV cache memory management mechanism, achieving zero memory waste during inference (illustrated in the sketch after this list).
  • High-performance Routing: Works together with Token Attention to manage GPU memory precisely at the token level, optimizing system throughput.
  • INT8 KV Cache: Quantizing the KV cache to INT8 can double the maximum number of cached tokens. Currently only Llama-architecture models are supported.
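
The following is a minimal, simplified sketch of the token-level KV cache idea behind Token Attention; the class and method names are illustrative, not LightLLM's actual API. Each request borrows exactly as many cache slots as it has tokens and returns them when it finishes, so no slot sits reserved but unused.

```python
# Illustrative sketch only, not LightLLM's implementation: a shared pool of
# per-token KV cache slots that requests borrow and return.
class TokenKVCachePool:
    def __init__(self, total_slots: int):
        self.free_slots = list(range(total_slots))   # free slot indices
        self.request_slots = {}                      # request_id -> slot indices

    def alloc(self, request_id: str, num_tokens: int) -> list:
        if num_tokens > len(self.free_slots):
            raise MemoryError("not enough free KV cache slots")
        slots = [self.free_slots.pop() for _ in range(num_tokens)]
        self.request_slots.setdefault(request_id, []).extend(slots)
        return slots

    def free(self, request_id: str) -> None:
        # Return every slot held by a finished request to the pool.
        self.free_slots.extend(self.request_slots.pop(request_id, []))


pool = TokenKVCachePool(total_slots=8)
pool.alloc("req-1", num_tokens=3)   # prefill: 3 prompt tokens
pool.alloc("req-1", num_tokens=1)   # decode: 1 new token per step
pool.free("req-1")                  # all 4 slots become reusable immediately
```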

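A similarly simplified sketch of dynamic batching, under the assumption of a single decoding loop (the request ids and the decode_step stand-in are hypothetical, not LightLLM's scheduler): finished requests leave the batch and queued requests join it between iterations, so the batch is rebuilt every step rather than fixed up front.

```python
# Illustrative sketch only: rebuild the running batch between decode steps.
from collections import deque

waiting = deque(["req-2", "req-3"])   # hypothetical queued request ids
running = ["req-1"]                   # requests currently decoding
MAX_BATCH = 2

def decode_step(batch):
    # Stand-in for one forward pass over the whole batch; returns the ids
    # that emitted an end-of-sequence token this step (here: the oldest one).
    return {batch[0]} if batch else set()

while running or waiting:
    while waiting and len(running) < MAX_BATCH:   # admit while there is room
        running.append(waiting.popleft())
    finished = decode_step(running)
    running = [r for r in running if r not in finished]
```
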
Supported Models