Commit Graph

107 Commits

Author SHA1 Message Date
Sherlock Xu b0925b3878 docs: Add BentoML deployment doc (#3336)
Signed-off-by: Sherlock113 <sherlockxu07@gmail.com>
2024-03-12 10:34:30 -07:00
Zhuohan Li 4c922709b6 Add distributed model executor abstraction (#3191) 2024-03-11 11:03:45 -07:00
Philipp Moritz 657061fdce [docs] Add LoRA support information for models (#3299) 2024-03-11 00:54:51 -07:00
Roger Wang 99c3cfb83c [Docs] Fix Unmocked Imports (#3275) 2024-03-08 09:58:01 -08:00
Jialun Lyu 27a7b070db Add document for vllm paged attention kernel. (#2978) 2024-03-04 09:23:34 -08:00
Liangfu Chen d0fae88114 [DOC] add setup document to support neuron backend (#2777) 2024-03-04 01:03:51 +00:00
Sage Moore ce4f5a29fb Add Automatic Prefix Caching (#2762)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-03-02 00:50:01 -08:00
Yuan Tang 49d849b3ab docs: Add tutorial on deploying vLLM model with KServe (#2586)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2024-03-01 11:04:14 -08:00
Ganesh Jagadeesan a8683102cc multi-lora documentation fix (#3064) 2024-02-27 21:26:15 -08:00
Woosuk Kwon 8b430d7dea [Minor] Fix StableLMEpochForCausalLM -> StableLmForCausalLM (#3046) 2024-02-26 20:23:50 -08:00
张大成 48a8f4a7fd Support Orion model (#2539)
Co-authored-by: zhangdacheng <zhangdacheng@ainirobot.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-02-26 19:17:06 -08:00
Harry Mellor ef978fe411 Port metrics from aioprometheus to prometheus_client (#2730) 2024-02-25 11:54:00 -08:00
Zhuohan Li a9c8212895 [FIX] Add Gemma model to the doc (#2966) 2024-02-21 09:46:15 -08:00
Isotr0py ab3a5a8259 Support OLMo models. (#2832) 2024-02-18 21:05:15 -08:00
jvmncs 8f36444c4f multi-LoRA as extra models in OpenAI server (#2775)
how to serve the loras (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.api_server \
 --model meta-llama/Llama-2-7b-hf \
 --enable-lora \
 --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
the above server will list 3 separate values if the user queries `/models`: one for the base served model, and one each for the specified lora modules. in this case sql-lora and sql-lora2 point to the same underlying lora, but this need not be the case. lora config values take the same values they do in EngineArgs

no work has been done here to scope client permissions to specific models
2024-02-17 12:00:48 -08:00
Philipp Moritz 317b29de0f Remove Yi model definition, please use LlamaForCausalLM instead (#2854)
Co-authored-by: Roy <jasonailu87@gmail.com>
2024-02-13 14:22:22 -08:00
Simon Mo f964493274 [CI] Ensure documentation build is checked in CI (#2842) 2024-02-12 22:53:07 -08:00
Philipp Moritz 4ca2c358b1 Add documentation section about LoRA (#2834) 2024-02-12 17:24:45 +01:00
Hongxia Yang 0580aab02f [ROCm] support Radeon™ 7900 series (gfx1100) without using flash-attention (#2768) 2024-02-10 23:14:37 -08:00
Philipp Moritz 931746bc6d Add documentation on how to do incremental builds (#2796) 2024-02-07 14:42:02 -08:00
Massimiliano Pronesti 5ed704ec8c docs: fix langchain (#2736) 2024-02-03 18:17:55 -08:00
Fengzhe Zhou cd9e60c76c Add Internlm2 (#2666) 2024-02-01 09:27:40 -08:00
Zhuohan Li 1af090b57d Bump up version to v0.3.0 (#2656) 2024-01-31 00:07:07 -08:00
zhaoyang-star 9090bf02e7 Support FP8-E5M2 KV Cache (#2279)
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
Hongxia Yang 6b7de1a030 [ROCm] add support to ROCm 6.0 and MI300 (#2274) 2024-01-26 12:41:10 -08:00
Junyang Lin 2832e7b9f9 fix names and license for Qwen2 (#2589) 2024-01-24 22:37:51 -08:00
LastWhisper 223c19224b Fix the syntax error in the doc of supported_models (#2584) 2024-01-24 11:22:51 -08:00
Erfan Al-Hossami 9c1352eb57 [Feature] Simple API token authentication and pluggable middlewares (#1106) 2024-01-23 15:13:00 -08:00
Junyang Lin 94b5edeb53 Add qwen2 (#2495) 2024-01-22 14:34:21 -08:00
Hyunsung Lee e1957c6ebd Add StableLM3B model (#2372) 2024-01-16 20:32:40 -08:00
Simon 827cbcd37c Update quickstart.rst (#2369) 2024-01-12 12:56:18 -08:00
Zhuohan Li f745847ef7 [Minor] Fix the format in quick start guide related to Model Scope (#2425) 2024-01-11 19:44:01 -08:00
Jiaxiang 6549aef245 [DOC] Add additional comments for LLMEngine and AsyncLLMEngine (#1011) 2024-01-11 19:26:49 -08:00
Zhuohan Li fd4ea8ef5c Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221) 2024-01-03 11:30:22 -08:00
Shivam Thakkar 1db83e31a2 [Docs] Update installation instructions to include CUDA 11.8 xFormers (#2246) 2023-12-22 23:20:02 -08:00
Ronen Schaffer c17daa9f89 [Docs] Fix broken links (#2222) 2023-12-20 12:43:42 -08:00
avideci de60a3fb93 Added DeciLM-7b and DeciLM-7b-instruct (#2062) 2023-12-19 02:29:33 -08:00
kliuae 1b7c791d60 [ROCm] Fixes for GPTQ on ROCm (#2180) 2023-12-18 10:41:04 -08:00
Suhong Moon 3ec8c25cd0 [Docs] Update documentation for gpu-memory-utilization option (#2162) 2023-12-17 10:51:57 -08:00
Woosuk Kwon f8c688d746 [Minor] Add Phi 2 to supported models (#2159) 2023-12-17 02:54:57 -08:00
Woosuk Kwon 26c52a5ea6 [Docs] Add CUDA graph support to docs (#2148) 2023-12-17 01:49:20 -08:00
Woosuk Kwon b81a6a6bb3 [Docs] Add supported quantization methods to docs (#2135) 2023-12-15 13:29:22 -08:00
Antoni Baum 21d93c140d Optimize Mixtral with expert parallelism (#2090) 2023-12-13 23:55:07 -08:00
Woosuk Kwon 096827c284 [Docs] Add notes on ROCm-supported models (#2087) 2023-12-13 09:45:34 -08:00
Woosuk Kwon 6565d9e33e Update installation instruction for vLLM + CUDA 11.8 (#2086) 2023-12-13 09:25:59 -08:00
TJian f375ec8440 [ROCm] Upgrade xformers version for ROCm & update doc (#2079)
Co-authored-by: miloice <jeffaw99@hotmail.com>
2023-12-13 00:56:05 -08:00
Ikko Eltociear Ashimine c0ce15dfb2 Update run_on_sky.rst (#2025)
sharable -> shareable
2023-12-11 10:32:58 -08:00
Woosuk Kwon 4ff0203987 Minor fixes for Mixtral (#2015) 2023-12-11 09:16:15 -08:00
Simon Mo c85b80c2b6 [Docker] Add cuda arch list as build option (#1950) 2023-12-08 09:53:47 -08:00
TJian 6ccc0bfffb Merge EmbeddedLLM/vllm-rocm into vLLM main (#1836)
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Amir Balwel <amoooori04@gmail.com>
Co-authored-by: root <kuanfu.liu@akirakan.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: kuanfu <kuanfu.liu@embeddedllm.com>
Co-authored-by: miloice <17350011+kliuae@users.noreply.github.com>
2023-12-07 23:16:52 -08:00