wassname/vllm

mirror of https://github.com/wassname/vllm.git synced 2026-07-02 02:44:20 +08:00

Author	SHA1	Message	Date
Joshua Rosenkranz	b12518d3cf	[Model] MLPSpeculator speculative decoding support (#4947 ) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com> Co-authored-by: Nick Hill <nickhill@us.ibm.com> Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com>	2024-06-20 20:23:12 -04:00
Michael Goin	8065a7e220	[Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (#5718 )	2024-06-20 17:00:13 -06:00
Isotr0py	7d46c8d378	[Bugfix] Fix sampling_params passed incorrectly in Phi3v example (#5684 )	2024-06-19 17:58:32 +08:00
Ronen Schaffer	7879f24dcc	[Misc] Add OpenTelemetry support (#4687 ) This PR adds basic support for OpenTelemetry distributed tracing. It includes changes to enable tracing functionality and improve monitoring capabilities. I've also added a markdown with print-screens to guide users how to use this feature. You can find it here	2024-06-19 01:17:03 +09:00
Isotr0py	daef218b55	[Model] Initialize Phi-3-vision support (#4986 )	2024-06-17 19:34:33 -07:00
Cyrus Leung	0e9164b40a	[mypy] Enable type checking for test directory (#5017 )	2024-06-15 04:45:31 +00:00
Allen.Dou	d74674bbd9	[Misc] Fix arg names (#5524 )	2024-06-14 09:47:44 -07:00
Allen.Dou	55d6361b13	[Misc] Fix arg names in quantizer script (#5507 )	2024-06-13 19:02:53 -07:00
Travis Johnson	51602eefd3	[Frontend] [Core] Support for sharded tensorized models (#4990 ) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Co-authored-by: Sanger Steel <sangersteel@gmail.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2024-06-12 14:13:52 -07:00
Roger Wang	7a9cb294ae	[Frontend] Add OpenAI Vision API Support (#5237 ) Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-06-07 11:23:32 -07:00
Zhuohan Li	bd0e7802e0	[Bugfix] Add warmup for prefix caching example (#5235 )	2024-06-03 19:36:41 -07:00
Cyrus Leung	7a64d24aad	[Core] Support image processor (#4197 )	2024-06-02 22:56:41 -07:00
Daniil Arapov	c2d6d2f960	[Bugfix]: Fix issues related to prefix caching example (#5177 ) (#5180 )	2024-06-01 15:53:52 -07:00
chenqianfzh	b9c0605a8e	[Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776 )	2024-06-01 14:51:10 -06:00
Ronen Schaffer	ae495c74ea	[Doc]Replace deprecated flag in readme (#4526 )	2024-05-29 22:26:33 +00:00
Cyrus Leung	5ae5ed1e60	[Core] Consolidate prompt arguments to LLM engines (#4328 ) Co-authored-by: Roger Wang <ywang@roblox.com>	2024-05-28 13:29:31 -07:00
Antoni Baum	c5711ef985	[Doc] Update Ray Data distributed offline inference example (#4871 )	2024-05-17 10:52:11 -07:00
Alex Wu	dbc0754ddf	[docs] Fix typo in examples filename openi -> openai (#4864 )	2024-05-17 00:42:17 +09:00
Aurick Qiao	30e754390c	[Core] Implement sharded state loader (#4690 ) Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-05-15 22:11:54 -07:00
Alex Wu	52f8107cf2	[Frontend] Support OpenAI batch file format (#4794 ) Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-05-15 19:13:36 -04:00
Sanger Steel	8bc68e198c	[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 (#4208 )	2024-05-13 14:57:07 -07:00
Chang Su	e254497b66	[Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734 )	2024-05-11 11:30:37 -07:00
Hao Zhang	ebce310b74	[Model] Snowflake arctic model implementation (#4652 ) Co-authored-by: Dash Desai <1723932+iamontheinet@users.noreply.github.com> Co-authored-by: Aurick Qiao <qiao@aurick.net> Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com> Co-authored-by: Aurick Qiao <aurickq@users.noreply.github.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-05-09 22:37:14 +00:00
Robert Shaw	cea64430f6	[Bugfix] Update grafana.json (#4711 )	2024-05-09 10:10:13 -07:00
Danny Guinther	b8afa8b95a	[MISC] Rework logger to enable pythonic custom logging configuration to be provided (#4273 )	2024-05-01 17:34:40 -07:00
Ronen Schaffer	bf480c5302	Add more Prometheus metrics (#2764 ) Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com> Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>	2024-04-28 15:59:33 -07:00
James Fleming	2b7949c1c2	AQLM CUDA support (#3287 ) Co-authored-by: mgoin <michael@neuralmagic.com>	2024-04-23 13:59:33 -04:00
Harry Mellor	3d925165f2	Add example scripts to documentation (#4225 ) Co-authored-by: Harry Mellor <hmellor@oxts.com>	2024-04-22 16:36:54 +00:00
Antoni Baum	69e1d2fb69	[Core] Refactor model loading code (#4097 )	2024-04-16 11:34:39 -07:00
Sanger Steel	d619ae2d19	[Doc] Add better clarity for tensorizer usage (#4090 ) Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>	2024-04-15 13:28:25 -07:00
Sanger Steel	711a000255	[Frontend] [Core] feat: Add model loading using `tensorizer` (#3476 )	2024-04-13 17:13:01 -07:00
Cade Daniel	e0dd4d3589	[Misc] Fix linter issues in examples/fp8/quantizer/quantize.py (#3864 )	2024-04-04 21:57:33 -07:00
Adrian Abeyta	2ff767b513	Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290 ) Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Co-authored-by: HaiShaw <hixiao@gmail.com> Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com> Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com> Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu> Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com> Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com> Co-authored-by: guofangze <guofangze@kuaishou.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-04-03 14:15:55 -07:00
Woosuk Kwon	c0935c96d3	[Bugfix] Set enable_prefix_caching=True in prefix caching example (#3703 )	2024-03-28 16:26:30 -07:00
Simon Mo	a4075cba4d	[CI] Add test case to run examples scripts (#3638 )	2024-03-28 14:36:10 -07:00
xwjiang2010	64172a976c	[Feature] Add vision language model support. (#3042 )	2024-03-25 14:16:30 -07:00
SangBin Cho	01bfb22b41	[CI] Try introducing isort. (#3495 )	2024-03-25 07:59:47 -07:00
Zhuohan Li	e90fc21f2e	[Hardware][Neuron] Refactor neuron support (#3471 )	2024-03-22 01:22:17 +00:00
Simon Mo	8e67598aa6	[Misc] fix line length for entire codebase (#3444 )	2024-03-16 00:36:29 -07:00
Dinghow Yang	cf6ff18246	Fix Baichuan chat template (#3340 )	2024-03-15 21:02:12 -07:00
Dinghow Yang	253a98078a	Add chat templates for ChatGLM (#3418 )	2024-03-14 23:19:22 -07:00
Dinghow Yang	21539e6856	Add chat templates for Falcon (#3420 )	2024-03-14 23:19:02 -07:00
Allen.Dou	a37415c31b	allow user to choose which vllm's metrics to display in grafana (#3393 )	2024-03-14 06:35:13 +00:00
DAIZHENWEI	654865e21d	Support Mistral Model Inference with transformers-neuronx (#3153 )	2024-03-11 13:19:51 -07:00
Sage Moore	ce4f5a29fb	Add Automatic Prefix Caching (#2762 ) Co-authored-by: ElizaWszola <eliza@neuralmagic.com> Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-03-02 00:50:01 -08:00
Liangfu Chen	3b7178cfa4	[Neuron] Support inference with transformers-neuronx (#2569 )	2024-02-28 09:34:34 -08:00
jvmncs	8f36444c4f	multi-LoRA as extra models in OpenAI server (#2775 )	2024-02-17 12:00:48 -08:00
how to serve the loras (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
the above server will list 3 separate values if the user queries `/models`: one for the base served model, and one each for the specified lora modules. in this case sql-lora and sql-lora2 point to the same underlying lora, but this need not be the case.
lora config values take the same values they do in EngineArgs
no work has been done here to scope client permissions to specific models
Cheng Su	4abf6336ec	Add one example to run batch inference distributed on Ray (#2696 )	2024-02-02 15:41:42 -08:00
Robert Shaw	93b38bea5d	Refactor Prometheus and Add Request Level Metrics (#2316 )	2024-01-31 14:58:07 -08:00
Simon Mo	1e4277d2d1	lint: format all python file instead of just source code (#2567 )	2024-01-23 15:53:06 -08:00

77 Commits