
vLLM

Easy, fast, and cheap LLM serving for everyone

| Documentation | Blog | Paper | Discord | Twitter/X | Developer Slack |


Latest News πŸ”₯

  • [2024/11] We hosted the seventh vLLM meetup with Snowflake! Please find the meetup slides here.

  • [2024/10] We have just created a developer slack (slack.vllm.ai) focusing on coordinating contributions and discussing features. Please feel free to join us there!

  • [2024/10] Ray Summit 2024 held a special track for vLLM! Please find the opening talk slides from the vLLM team here. Learn more from the talks from other vLLM contributors and users!

  • [2024/09] We hosted the sixth vLLM meetup with NVIDIA! Please find the meetup slides here.

  • [2024/07] We hosted the fifth vLLM meetup with AWS! Please find the meetup slides here.

  • [2024/07] In partnership with Meta, vLLM officially supports Llama 3.1 with FP8 quantization and pipeline parallelism! Please check out our blog post here.

  • [2024/06] We hosted the fourth vLLM meetup with Cloudflare and BentoML! Please find the meetup slides here.

  • [2024/04] We hosted the third vLLM meetup with Roblox! Please find the meetup slides here.

  • [2024/01] We hosted the second vLLM meetup with IBM! Please find the meetup slides here.

  • [2023/10] We hosted the first vLLM meetup with a16z! Please find the meetup slides here.

  • [2023/08] We would like to express our sincere gratitude to Andreessen Horowitz (a16z) for providing a generous grant to support the open-source development and research of vLLM.

  • [2023/06] We officially released vLLM! FastChat-vLLM integration has powered LMSYS Vicuna and Chatbot Arena since mid-April. Check out our blog post.


About

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

  • State-of-the-art serving throughput

  • Efficient management of attention key and value memory with PagedAttention

  • Continuous batching of incoming requests

  • Fast model execution with CUDA/HIP graph

  • Quantization: GPTQ, AWQ, INT4, INT8, and FP8 (see the example after this list)

  • Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.

  • Speculative decoding

  • Chunked prefill
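
As a rough illustration of how a couple of these features are enabled, the sketch below loads an AWQ-quantized checkpoint and turns on chunked prefill through the LLM constructor. The checkpoint name is only an example, and the exact flags may vary across vLLM versions.

# Hedged sketch: loading a quantized checkpoint with chunked prefill enabled.
# The checkpoint name is illustrative; other AWQ-quantized models on Hugging Face
# should work the same way.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",  # example AWQ checkpoint
    quantization="awq",                    # must match the checkpoint's quantization method
    enable_chunked_prefill=True,           # split long prompts into chunks during prefill
)

params = SamplingParams(temperature=0.8, max_tokens=64)
print(llm.generate(["The capital of France is"], params)[0].outputs[0].text)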

Performance benchmark: We include a performance benchmark at the end of our blog post. It compares the performance of vLLM against other LLM serving engines (TensorRT-LLM, SGLang, and LMDeploy). The implementation lives in the nightly-benchmarks folder, and you can reproduce this benchmark using our one-click runnable script.

vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models

  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more

  • Tensor parallelism and pipeline parallelism support for distributed inference

  • Streaming outputs

  • OpenAI-compatible API server (see the example after this list)

  • Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPUs, and AWS Neuron

  • Prefix caching support

  • Multi-LoRA support
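
The OpenAI-compatible server listed above can be queried with the standard openai Python client once a server is running (started, for example, with "vllm serve <model>" or "python -m vllm.entrypoints.openai.api_server --model <model>"). The sketch below assumes a server is already listening on localhost:8000 and uses an illustrative model name.

# Hedged sketch: querying a locally running vLLM OpenAI-compatible server.
# Assumes the server was started separately, e.g.:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
# and that the openai client package is installed.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's default OpenAI-compatible endpoint
    api_key="EMPTY",                      # the server does not require a real key by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the model the server was launched with
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
)
print(response.choices[0].message.content)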

vLLM seamlessly supports most popular open-source models on Hugging Face, including:

  • Transformer-like LLMs (e.g., Llama)

  • Mixture-of-Experts LLMs (e.g., Mixtral)

  • Embedding models (e.g., E5-Mistral; see the example below)

  • Multi-modal LLMs (e.g., LLaVA)

Find the full list of supported models here.
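
For embedding models such as E5-Mistral, the offline API exposes an encode call that returns one vector per prompt. The sketch below is only an illustration: the model name is an example, and the exact output fields may differ between vLLM versions.

# Hedged sketch: computing embeddings offline with an embedding model.
from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct", enforce_eager=True)
outputs = llm.encode(["query: how does PagedAttention manage KV-cache memory?"])
for out in outputs:
    embedding = out.outputs.embedding  # one list of floats per input prompt
    print(len(embedding))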

Getting Started

Install vLLM with pip or from source:

pip install vllm
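
Once installed, a minimal offline-inference example looks roughly like the following (the model name is only an illustration):

# Minimal offline generation sketch; the model name is an example.
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")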

Visit our documentation to learn more.

Contributing

We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.

Sponsors

vLLM is a community project. Our compute resources for development and testing are supported by the following organizations. Thank you for your support!

  • a16z

  • AMD

  • Anyscale

  • AWS

  • Crusoe Cloud

  • Databricks

  • DeepInfra

  • Dropbox

  • Google Cloud

  • Lambda Labs

  • Nebius

  • NVIDIA

  • Replicate

  • Roblox

  • RunPod

  • Sequoia Capital

  • Skywork AI

  • Trainy

  • UC Berkeley

  • UC San Diego

  • ZhenFund

We also have an official fundraising venue through OpenCollective. We plan to use the funds to support the development, maintenance, and adoption of vLLM.

Citation

If you use vLLM for your research, please cite our paper:

@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}

Contact Us

  • For technical questions and feature requests, please use GitHub Issues or Discussions.

  • For discussions with fellow users, please use Discord.

  • For coordinating contributions and development, please use Slack.

  • For security disclosures, please use GitHub's security advisory feature.

  • For collaborations and partnerships, please contact us at vllm-questions AT lists.berkeley.edu.