Commit Graph

  • 84e4e37d14 [Minor] Fix type annotations (#1238) Woosuk Kwon 2023-10-02 15:28:31 -07:00
  • a60b353005 support sharding llama2-70b on more than 8 GPUs (#1209) Zhuohan Li 2023-10-02 15:26:33 -07:00
  • ebe4d1db3a Fix boundary check in paged attention kernel (#1241) Liang 2023-10-02 02:35:06 +08:00
  • b5a10eb0ef Added dtype arg to benchmarks (#1228) kg6-sleipnir 2023-10-01 00:04:03 -04:00
  • 0967102c6d fixing typo in tiiuae/falcon-rw-7b model name (#1226) Usama Ahmed 2023-09-29 23:40:25 +03:00
  • e2fb71ec9f Bump up the version to v0.2.0 (#1212) Woosuk Kwon 2023-09-28 15:30:38 -07:00
  • f936657eb6 Provide default max model length (#1224) Woosuk Kwon 2023-09-28 14:44:02 -07:00
  • 6f88f762bf Fix OOM in attention kernel test (#1223) Woosuk Kwon 2023-09-28 14:33:24 -07:00
  • 202351d5bf Add Mistral to supported model list (#1221) Woosuk Kwon 2023-09-28 14:33:04 -07:00
  • 2e8e49fce3 [Fix] Remove false assertion (#1222) Woosuk Kwon 2023-09-28 10:52:38 -07:00
  • a8e98aee0c Fix Mistral model (#1220) Woosuk Kwon 2023-09-28 10:44:05 -07:00
  • bb1ba58f06 [Mistral] Mistral-7B-v0.1 support (#1196) Chris Bamford 2023-09-28 19:41:03 +02:00
  • 7bedab5748 Add rope_scaling to Qwen (#1210) Qing 2023-09-28 15:49:23 +08:00
  • 20f7cc4cde Add skip_special_tokens sampling params (#1186) Dan Lord 2023-09-27 19:21:42 -07:00
  • 649aa730c5 Use standard extras for uvicorn (#1166) Danilo Peixoto 2023-09-27 21:41:36 -03:00
  • a19bc5c628 Automatically configure max_num_batched_tokens (#1198) Woosuk Kwon 2023-09-27 16:34:00 -07:00
  • 28e616c4e3 fix qwen-14b model (#1173) Qing 2023-09-28 07:33:16 +08:00
  • 30e775281d fix typo (#1184) Wang Ran (汪然) 2023-09-28 07:22:45 +08:00
  • 21877b0d75 Support Longchat and RoPE scaling (#555) Lily Liu 2023-09-27 03:36:02 -07:00
  • cf5cb1e33e Allocate more shared memory to attention kernel (#1154) Antoni Baum 2023-09-26 22:27:13 -07:00
  • 03ffd0a022 Add comments on RoPE initialization (#1176) Woosuk Kwon 2023-09-26 10:48:33 -07:00
  • a425bd9a9a [Setup] Enable TORCH_CUDA_ARCH_LIST for selecting target GPUs (#1074) Woosuk Kwon 2023-09-26 10:21:08 -07:00
  • bbbf86565f Align max_tokens behavior with openai (#852) Wen Sun 2023-09-24 09:10:13 +08:00
  • 9f6be8692e Fix config for Falcon (#1164) Woosuk Kwon 2023-09-23 17:38:43 -07:00
  • f187877945 [FIX] Simplify sampler logic (#1156) Zhuohan Li 2023-09-23 17:21:56 -07:00
  • 947b794146 [Sampler] Vectorized sampling (simplified) (#1048) Zhuohan Li 2023-09-22 17:48:04 -07:00
  • 8d926e91f1 Announce the First vLLM Meetup (#1148) Woosuk Kwon 2023-09-22 11:37:14 -07:00
  • 4ee52bb169 Docs: Fix broken link to openai example (#1145) Nick Perez 2023-09-22 14:36:09 -04:00
  • 7d7e3b78a3 Use --ipc=host in docker run for distributed inference (#1125) Woosuk Kwon 2023-09-21 18:26:47 -07:00
  • f98b745a81 feat: support stop_token_ids parameter. (#1097) Ricardo Lu 2023-09-22 06:34:02 +08:00
  • 2d1e86f1b1 clean api code, remove redundant background task. (#1102) Roy 2023-09-22 04:25:05 +08:00
  • 1ac4ccf73c Add float16 and float32 (#1115) Woosuk Kwon 2023-09-21 00:52:47 -07:00
  • 2ac4d5e2bf Replace DtypeTensor (#1123) Woosuk Kwon 2023-09-21 00:51:47 -07:00
  • 3302f0aef3 rope_theta and max_position_embeddings from config (#1096) Antoni Baum 2023-09-20 13:35:11 -07:00
  • 6f2dd6c37e Add documentation to Triton server tutorial (#983) Tanmay Verma 2023-09-20 10:32:40 -07:00
  • bc0644574c Add gpu_memory_utilization and swap_space to LLM (#1090) Woosuk Kwon 2023-09-19 22:16:04 -07:00
  • 400b8289f7 Add pyarrow to dependencies & Print warning on Ray import error (#1094) Woosuk Kwon 2023-09-18 22:36:17 -07:00
  • c1026311b5 [Community] Add vLLM Discord server (#1086) Zhuohan Li 2023-09-18 12:23:35 -07:00
  • 2b1c116b5a Add minimum capability requirement for AWQ (#1064) Woosuk Kwon 2023-09-18 12:02:01 -07:00
  • cc796b1358 Convert before transpose (#1073) Woosuk Kwon 2023-09-18 11:51:48 -07:00
  • f029ef94d7 Fix get_max_num_running_seqs for waiting and swapped seq groups (#1068) Zhuohan Li 2023-09-18 11:49:40 -07:00
  • 95592fa00a align llm_engine and async_engine. (#1081) Roy 2023-09-19 02:49:10 +08:00
  • fbe66e1d0b added support for quantize on LLM module (#1080) orellavie1212 2023-09-18 21:04:21 +03:00
  • 90979c38f8 [FIX] Don't initialize parameter by default (#1067) Zhuohan Li 2023-09-17 17:15:38 -07:00
  • e21d7687a9 Fix hanging when prompt exceeds limit (#1029) 陈序 2023-09-17 16:48:56 +08:00
  • ff36139ffc Remove AsyncLLMEngine busy loop, shield background task (#1059) Antoni Baum 2023-09-17 00:29:08 -07:00
  • e3e79e9e8a Implement AWQ quantization support for LLaMA (#1032) Woosuk Kwon 2023-09-16 00:03:37 -07:00
  • b9fe4616f9 Abort when coroutine is cancelled (#1020) Jerry Yang 2023-09-15 08:40:18 +08:00
  • 64ca424e75 Fix warning message on LLaMA FastTokenizer (#1037) Woosuk Kwon 2023-09-14 17:33:32 -07:00
  • b5f93d0631 Only fail if logit_bias has actual values (#1045) Lukas Kreussel 2023-09-15 02:33:01 +02:00
  • a58936966f Add pandas to requirements.txt (#1047) Woosuk Kwon 2023-09-14 17:31:38 -07:00
  • dd54a4b026 Fix detokenization leaving special tokens (#1044) Antoni Baum 2023-09-14 16:37:03 -07:00
  • eda1a7cad3 Announce paper release (#1036) Woosuk Kwon 2023-09-13 17:38:13 -07:00
  • f04908cae7 [FIX] Minor bug fixes (#1035) Zhuohan Li 2023-09-13 16:38:12 -07:00
  • ab019eea75 Add Model Revision Support (#1014) Jasmond L 2023-09-14 06:20:02 +08:00
  • 9841d48a10 Use TGI-like incremental detokenization (#984) Antoni Baum 2023-09-13 13:38:01 -07:00
  • 3272d7a0b7 Fix typo in README.md (#1033) Ikko Eltociear Ashimine 2023-09-14 04:55:23 +09:00
  • 0bb1e885a0 Make max_model_len configurable (#972) Antoni Baum 2023-09-12 16:29:19 -07:00
  • d6545ad22e add option to shorten prompt print in log (#991) leiwen83 2023-09-13 06:10:14 +08:00
  • 90eb3f43ca Bump up the version to v0.1.7 (#1013) Woosuk Kwon 2023-09-11 00:54:30 -07:00
  • e67b4f2c2a Use FP32 in RoPE initialization (#1004) Woosuk Kwon 2023-09-11 00:26:35 -07:00
  • d6770d1f23 Update setup.py (#1006) Woosuk Kwon 2023-09-10 23:42:45 -07:00
  • b9cecc2635 [Docs] Update installation page (#1005) Woosuk Kwon 2023-09-10 14:23:31 -07:00
  • 898285c9bf fix: CUDA error when inferencing with Falcon-40B base model (#992) Kyujin Cho 2023-09-10 17:39:02 +09:00
  • a62de9ecfd Fix wrong dtype in PagedAttentionWithALiBi bias (#996) Antoni Baum 2023-09-09 14:58:35 -07:00
  • 4042d192f5 fix "tansformers_module" ModuleNotFoundError when load model with trust_remote_code=True (#871) Jingru 2023-09-09 08:21:30 +08:00
  • 1117aa1411 Bump up the version to v0.1.6 (#989) Zhuohan Li 2023-09-08 00:07:46 -07:00
  • 080438477f Start background task in AsyncLLMEngine.generate (#988) Antoni Baum 2023-09-08 00:03:39 -07:00
  • 4b5bcf8906 faster startup of vLLM (#982) Robert Irvine 2023-09-08 06:48:54 +01:00
  • 852ef5b4f5 Bump up the version to v0.1.5 (#944) Woosuk Kwon 2023-09-08 08:15:31 +09:00
  • db09d4ad83 [FIX] Fix Alibi implementation in PagedAttention kernel (#945) Zhuohan Li 2023-09-07 15:53:14 -07:00
  • c957c741d9 Enable safetensors loading for all models (#974) Zhuohan Li 2023-09-07 15:49:52 -07:00
  • c07ece5ca4 Make AsyncLLMEngine more robust & fix batched abort (#969) Antoni Baum 2023-09-07 13:43:45 -07:00
  • 7a9c20c715 Bum up transformers version (#976) Woosuk Kwon 2023-09-08 05:15:53 +09:00
  • 005ba458b5 Set torch default dtype in a context manager (#971) Antoni Baum 2023-09-06 23:39:37 -07:00
  • 320a622ec4 [BugFix] Implement RoPE for GPT-J (#941) Woosuk Kwon 2023-09-06 11:54:33 +09:00
  • c9927c1a6a Use queue for finished requests (#957) Antoni Baum 2023-09-05 19:27:23 -07:00
  • fbd80ad409 Clean up kernel unit tests (#938) Woosuk Kwon 2023-09-06 08:57:38 +09:00
  • 22379d5513 fix: typo (#948) Wen Sun 2023-09-05 14:22:30 +08:00
  • 1696725879 Initialize AsyncLLMEngine bg loop correctly (#943) Antoni Baum 2023-09-04 17:41:22 -07:00
  • 002800f081 Align vLLM's beam search implementation with HF generate (#857) Zhuohan Li 2023-09-04 17:29:42 -07:00
  • e15932bb60 Only emit warning about internal tokenizer if it isn't being used (#939) Nelson Liu 2023-09-04 08:50:55 -07:00
  • ce741ba3e4 Refactor AsyncLLMEngine (#880) Antoni Baum 2023-09-03 21:43:43 -07:00
  • bf87484efa [BugFix] Fix NaN errors in paged attention kernel (#936) Woosuk Kwon 2023-09-04 09:20:06 +09:00
  • 8ce9c50d40 Avoid compiling kernels for double data type (#933) Woosuk Kwon 2023-09-02 14:59:47 +09:00
  • 32b6816e55 Add tests for models (#922) Woosuk Kwon 2023-09-01 11:19:43 +09:00
  • c128d69856 Fix README.md Link (#927) Zhuohan Li 2023-08-31 17:18:34 -07:00
  • 55b28b1eee [Docs] Minor fixes in supported models (#920) Woosuk Kwon 2023-09-01 08:28:39 +09:00
  • e11222333f fix: bug fix when penalties are negative (#913) Dong-Yong Lee 2023-09-01 00:37:17 +09:00
  • 28873a2799 Improve _prune_hidden_states micro-benchmark (#707) Aman Gupta Karmani 2023-08-31 00:28:43 -04:00
  • 0080d8329d Add acknowledgement to a16z grant Zhuohan Li 2023-08-30 02:17:27 -07:00
  • 0d93f15694 Accelerate LLaMA model loading (#234) JFDuan 2023-08-30 16:00:13 +08:00
  • becd7a56f1 Enable request body OpenAPI spec for OpenAI endpoints (#865) lplcor 2023-08-29 21:54:08 -07:00
  • 75471386de use flash-attn via xformers (#877) Aman Gupta Karmani 2023-08-30 00:52:13 -04:00
  • d2b2eed67c [Fix] Fix a condition for ignored sequences (#867) Zhuohan Li 2023-08-27 23:00:56 -07:00
  • 4b6f069b6f Add support for CodeLlama (#854) Antoni Baum 2023-08-25 12:44:07 -07:00
  • 791d79de32 Bump up the version to v0.1.4 (#846) Woosuk Kwon 2023-08-25 12:28:00 +09:00
  • 94d2f59895 Set replacement=True in torch.multinomial (#858) Woosuk Kwon 2023-08-25 12:22:01 +09:00
  • 75c0ca9d43 Clean up code (#844) wenjun93 2023-08-24 07:44:15 +08:00
  • 2a4ec90854 Fix for breaking changes in xformers 0.0.21 (#834) Woosuk Kwon 2023-08-23 17:44:21 +09:00