96 Commits

Author SHA1 Message Date
Harry Mellor ef978fe411 Port metrics from aioprometheus to prometheus_client (#2730) 2024-02-25 11:54:00 -08:00
Zhuohan Li a9c8212895 [FIX] Add Gemma model to the doc (#2966) 2024-02-21 09:46:15 -08:00
Isotr0py ab3a5a8259 Support OLMo models. (#2832) 2024-02-18 21:05:15 -08:00
jvmncs 8f36444c4f multi-LoRA as extra models in OpenAI server (#2775)
How to serve the LoRAs (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.openai.api_server \
 --model meta-llama/Llama-2-7b-hf \
 --enable-lora \
 --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
The above server will list three separate values if the user queries `/models`: one for the base served model, and one for each of the specified LoRA modules. In this case, sql-lora and sql-lora2 point to the same underlying LoRA, but this need not be the case. LoRA config options accept the same values as they do in EngineArgs.

No work has been done here to scope client permissions to specific models.
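
As a rough illustration of the model listing described above, the sketch below maps `name=path` pairs (the shape passed to `--lora-modules`) to the model IDs a client would expect to see. The helper names `parse_lora_modules` and `served_model_ids` and the `/tmp/lora` path are hypothetical and exist only for this example; this is not vLLM's own parsing code.

```python
def parse_lora_modules(specs):
    """Turn ['name=path', ...] into a {name: path} mapping (hypothetical helper)."""
    modules = {}
    for spec in specs:
        # Split on the first '=' only, so paths containing '=' stay intact.
        name, sep, path = spec.partition("=")
        if not sep or not name or not path:
            raise ValueError(f"expected name=path, got {spec!r}")
        modules[name] = path
    return modules


def served_model_ids(base_model, lora_specs):
    """Base model first, then one entry per LoRA module name (assumed ordering)."""
    return [base_model, *parse_lora_modules(lora_specs)]


ids = served_model_ids(
    "meta-llama/Llama-2-7b-hf",
    ["sql-lora=/tmp/lora", "sql-lora2=/tmp/lora"],  # both names share one path
)
print(ids)  # ['meta-llama/Llama-2-7b-hf', 'sql-lora', 'sql-lora2']
```

Two names pointing at the same path still yield two distinct entries, which matches the three-value listing described above.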
2024-02-17 12:00:48 -08:00
Philipp Moritz 317b29de0f Remove Yi model definition, please use LlamaForCausalLM instead (#2854)
Co-authored-by: Roy <jasonailu87@gmail.com>
2024-02-13 14:22:22 -08:00
Simon Mo f964493274 [CI] Ensure documentation build is checked in CI (#2842) 2024-02-12 22:53:07 -08:00
Philipp Moritz 4ca2c358b1 Add documentation section about LoRA (#2834) 2024-02-12 17:24:45 +01:00
Hongxia Yang 0580aab02f [ROCm] support Radeon™ 7900 series (gfx1100) without using flash-attention (#2768) 2024-02-10 23:14:37 -08:00
Philipp Moritz 931746bc6d Add documentation on how to do incremental builds (#2796) 2024-02-07 14:42:02 -08:00
Massimiliano Pronesti 5ed704ec8c docs: fix langchain (#2736) 2024-02-03 18:17:55 -08:00
Fengzhe Zhou cd9e60c76c Add Internlm2 (#2666) 2024-02-01 09:27:40 -08:00
Zhuohan Li 1af090b57d Bump up version to v0.3.0 (#2656) 2024-01-31 00:07:07 -08:00
zhaoyang-star 9090bf02e7 Support FP8-E5M2 KV Cache (#2279)
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
Hongxia Yang 6b7de1a030 [ROCm] add support to ROCm 6.0 and MI300 (#2274) 2024-01-26 12:41:10 -08:00
Junyang Lin 2832e7b9f9 fix names and license for Qwen2 (#2589) 2024-01-24 22:37:51 -08:00
LastWhisper 223c19224b Fix the syntax error in the doc of supported_models (#2584) 2024-01-24 11:22:51 -08:00
Erfan Al-Hossami 9c1352eb57 [Feature] Simple API token authentication and pluggable middlewares (#1106) 2024-01-23 15:13:00 -08:00
Junyang Lin 94b5edeb53 Add qwen2 (#2495) 2024-01-22 14:34:21 -08:00
Hyunsung Lee e1957c6ebd Add StableLM3B model (#2372) 2024-01-16 20:32:40 -08:00
Simon 827cbcd37c Update quickstart.rst (#2369) 2024-01-12 12:56:18 -08:00
Zhuohan Li f745847ef7 [Minor] Fix the format in quick start guide related to Model Scope (#2425) 2024-01-11 19:44:01 -08:00
Jiaxiang 6549aef245 [DOC] Add additional comments for LLMEngine and AsyncLLMEngine (#1011) 2024-01-11 19:26:49 -08:00
Zhuohan Li fd4ea8ef5c Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221) 2024-01-03 11:30:22 -08:00
Shivam Thakkar 1db83e31a2 [Docs] Update installation instructions to include CUDA 11.8 xFormers (#2246) 2023-12-22 23:20:02 -08:00
Ronen Schaffer c17daa9f89 [Docs] Fix broken links (#2222) 2023-12-20 12:43:42 -08:00
avideci de60a3fb93 Added DeciLM-7b and DeciLM-7b-instruct (#2062) 2023-12-19 02:29:33 -08:00
kliuae 1b7c791d60 [ROCm] Fixes for GPTQ on ROCm (#2180) 2023-12-18 10:41:04 -08:00
Suhong Moon 3ec8c25cd0 [Docs] Update documentation for gpu-memory-utilization option (#2162) 2023-12-17 10:51:57 -08:00
Woosuk Kwon f8c688d746 [Minor] Add Phi 2 to supported models (#2159) 2023-12-17 02:54:57 -08:00
Woosuk Kwon 26c52a5ea6 [Docs] Add CUDA graph support to docs (#2148) 2023-12-17 01:49:20 -08:00
Woosuk Kwon b81a6a6bb3 [Docs] Add supported quantization methods to docs (#2135) 2023-12-15 13:29:22 -08:00
Antoni Baum 21d93c140d Optimize Mixtral with expert parallelism (#2090) 2023-12-13 23:55:07 -08:00
Woosuk Kwon 096827c284 [Docs] Add notes on ROCm-supported models (#2087) 2023-12-13 09:45:34 -08:00
Woosuk Kwon 6565d9e33e Update installation instruction for vLLM + CUDA 11.8 (#2086) 2023-12-13 09:25:59 -08:00
TJian f375ec8440 [ROCm] Upgrade xformers version for ROCm & update doc (#2079)
Co-authored-by: miloice <jeffaw99@hotmail.com>
2023-12-13 00:56:05 -08:00
Ikko Eltociear Ashimine c0ce15dfb2 Update run_on_sky.rst (#2025)
sharable -> shareable
2023-12-11 10:32:58 -08:00
Woosuk Kwon 4ff0203987 Minor fixes for Mixtral (#2015) 2023-12-11 09:16:15 -08:00
Simon Mo c85b80c2b6 [Docker] Add cuda arch list as build option (#1950) 2023-12-08 09:53:47 -08:00
TJian 6ccc0bfffb Merge EmbeddedLLM/vllm-rocm into vLLM main (#1836)
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Amir Balwel <amoooori04@gmail.com>
Co-authored-by: root <kuanfu.liu@akirakan.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: kuanfu <kuanfu.liu@embeddedllm.com>
Co-authored-by: miloice <17350011+kliuae@users.noreply.github.com>
2023-12-07 23:16:52 -08:00
AguirreNicolas 24f60a54f4 [Docker] Adding number of nvcc_threads during build as envar (#1893) 2023-12-07 11:00:32 -08:00
gottlike 42c02f5892 Fix quickstart.rst typo jinja (#1964) 2023-12-07 08:34:44 -08:00
Peter Götz d940ce497e Fix typo in adding_model.rst (#1947)
adpated -> adapted
2023-12-06 10:04:26 -08:00
Massimiliano Pronesti c07a442854 chore(examples-docs): upgrade to OpenAI V1 (#1785) 2023-12-03 01:11:22 -08:00
Simon Mo 5313c2cb8b Add Production Metrics in Prometheus format (#1890) 2023-12-02 16:37:44 -08:00
Simon Mo 4cefa9b49b [Docs] Update the AWQ documentation to highlight performance issue (#1883) 2023-12-02 15:52:47 -08:00
Woosuk Kwon e5452ddfd6 Normalize head weights for Baichuan 2 (#1876) 2023-11-30 20:03:58 -08:00
Adam Brusselback 66785cc05c Support chat template and echo for chat API (#1756) 2023-11-30 16:43:13 -08:00
Massimiliano Pronesti 05a38612b0 docs: add instruction for langchain (#1162) 2023-11-30 10:57:44 -08:00
Simon Mo 0f621c2c7d [Docs] Add information about using shared memory in docker (#1845) 2023-11-29 18:33:56 -08:00
Casper a921d8be9d [DOCS] Add engine args documentation (#1741) 2023-11-22 12:31:27 -08:00