Commit Graph

  • 7d761fe3c1 [FIX] Fix the case when input_is_parallel=False for ScaledActivation (#1737) Zhuohan Li 2023-11-20 23:56:48 -08:00
  • cf35d8f3d7 [BugFix] Fix TP support for AWQ (#1731) Woosuk Kwon 2023-11-20 21:42:45 -08:00
  • 4bb6b67188 fix RAM OOM when load large models in tensor parallel mode. (#1395) boydfd 2023-11-21 11:02:42 +08:00
  • 819b18e7ba Rewrite torch.repeat_interleave to remove cpu synchronization (#1599) ljss 2023-11-21 09:46:32 +08:00
  • 19849db573 [Fix] Fix bugs in scheduler (#1727) Zhuofan 2023-11-21 08:10:50 +08:00
  • 3d4ceb292c Fix hanging in the scheduler caused by long prompts (#1534) 陈序 2023-11-21 08:06:49 +08:00
  • f5a37c6c6c [BugFix] Fix a bug in loading safetensors (#1732) Woosuk Kwon 2023-11-20 15:51:18 -08:00
  • 32c927b53f [FIX] Update the doc link in README.md (#1730) Zhuohan Li 2023-11-20 12:46:24 -08:00
  • 5ffc0d13a2 Migrate linter from pylint to ruff (#1665) Simon Mo 2023-11-20 11:58:01 -08:00
  • 112627e8b2 [Docs] Fix the code block's format in deploying_with_docker page (#1722) Wen Sun 2023-11-20 17:22:39 +08:00
  • 37c1e3c218 Documentation about official docker image (#1709) Simon Mo 2023-11-19 20:56:26 -08:00
  • 06e9ebebd5 Add instructions to install vLLM+cu118 (#1717) Woosuk Kwon 2023-11-18 23:48:58 -08:00
  • c5f7740d89 Bump up to v0.2.2 (#1689) Woosuk Kwon 2023-11-18 21:57:07 -08:00
  • be66d9b125 Fix warning msg on quantization (#1715) Woosuk Kwon 2023-11-18 21:49:55 -08:00
  • e1054247ba [Optimization] Implement fused add rmsnorm (#1667) ljss 2023-11-19 10:18:02 +08:00
  • 8d17774f92 Add AWQ support for all models (#1714) Woosuk Kwon 2023-11-18 17:56:47 -08:00
  • e946260cf3 use get_tensor in safe_open (#1696) twaka 2023-11-19 09:45:18 +09:00
  • edb305584b Support download models from www.modelscope.cn (#1588) liuyhwangyh 2023-11-18 12:38:31 +08:00
  • bb00f66e19 Use quantization_config in hf config (#1695) Woosuk Kwon 2023-11-17 16:23:49 -08:00
  • e87557b069 Support Min P Sampler (#1642) Roy 2023-11-18 08:20:49 +08:00
  • dcc543a298 [Minor] Fix comment (#1704) Zhuofan 2023-11-18 01:42:49 +08:00
  • 0fc280b06c Update the adding-model doc according to the new refactor (#1692) Zhuohan Li 2023-11-16 18:46:26 -08:00
  • 20d0699d49 [Fix] Fix comm test (#1691) Zhuohan Li 2023-11-16 16:28:39 -08:00
  • 686f5e3210 Return usage for openai streaming requests (#1663) Iskren Ivov Chernev 2023-11-17 01:28:36 +02:00
  • 415d109527 [Fix] Update Supported Models List (#1690) Zhuohan Li 2023-11-16 14:47:26 -08:00
  • 521b35f799 Support Microsoft Phi 1.5 (#1664) maximzubkov 2023-11-16 23:28:39 +01:00
  • cb08cd0d75 [Minor] Fix duplication of ignored seq group in engine step (#1666) Simon Mo 2023-11-16 13:11:41 -08:00
  • 2a2c135b41 Fix loading error when safetensors contains empty tensor (#1687) twaka 2023-11-17 03:38:10 +09:00
  • 65ea2ddf17 feat(config): support parsing torch.dtype (#1641) Aaron Pham 2023-11-16 04:31:06 -05:00
  • b514d3c496 Revert MptConfig to MPTConfig (#1668) Megha Agarwal 2023-11-16 01:19:39 -08:00
  • 7076fa1c9f TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models (#1622) Zhuohan Li 2023-11-15 22:50:41 -08:00
  • 660a7fcfa4 Add DeepSpeed MII backend to benchmark script (#1649) Woosuk Kwon 2023-11-14 12:35:30 -08:00
  • 054072bee5 [Minor] Move RoPE selection logic to get_rope (#1633) Woosuk Kwon 2023-11-12 16:04:50 -08:00
  • eb825c1e74 Fix #1474 - AssertionError:assert param_slice.shape == loaded_weight.shape (#1631) lirui 2023-11-13 07:53:12 +08:00
  • 1b290ace4f Run default _AsyncLLMEngine._run_workers_async in threadpool (#1628) Dominik Schwabe 2023-11-11 23:50:44 +01:00
  • 0d578228ca config parser: add ChatGLM2 seq_length to _get_and_verify_max_len (#1617) Sin 2023-11-10 11:29:51 +08:00
  • aebfcb262a Dockerfile: Upgrade Cuda to 12.1 (#1609) GhaziSyed 2023-11-09 20:49:02 +01:00
  • ab9e8488d5 Add Yi model to quantization support (#1600) forpanyang 2023-11-10 03:47:14 +08:00
  • fd58b73a40 Build CUDA11.8 wheels for release (#1596) Woosuk Kwon 2023-11-09 03:52:29 -08:00
  • 8efe23f150 Fix input_metadata.selected_token_indices in worker prepare_inputs (#1546) Yanming W 2023-11-09 06:19:12 +08:00
  • 06458a0b42 Upgrade to CUDA 12 (#1527) Zhuohan Li 2023-11-08 14:17:49 -08:00
  • 1a2bbc9301 ChatGLM Support (#1261) GoHomeToMacDonal 2023-11-07 08:09:33 +08:00
  • e7f579eb97 Support Yi model (#1567) Roy 2023-11-07 07:26:03 +08:00
  • 8516999495 Add Quantization and AutoAWQ to docs (#1235) Casper 2023-11-05 06:43:39 +01:00
  • 9f669a9a7c Support YaRN models (#1264) Antoni Baum 2023-11-03 14:12:48 -07:00
  • 555bdcc5a3 Added logits processor API to sampling params (#1469) Noam Gat 2023-11-03 23:12:15 +02:00
  • 54ca1ba71d docs: add description (#1553) lots-o 2023-11-04 01:14:52 +09:00
  • 9738b84a08 Force paged attention v2 for long contexts (#1510) Antoni Baum 2023-11-01 16:24:32 -07:00
  • 1fe0990023 Remove MPTConfig (#1529) Woosuk Kwon 2023-11-01 15:29:05 -07:00
  • 7e90a2d117 Add /health Endpoint for both Servers (#1540) Fluder-Paradyne 2023-11-01 22:59:44 +05:30
  • 5687d584fe [BugFix] Set engine_use_ray=True when TP>1 (#1531) ljss 2023-11-01 17:14:18 +08:00
  • cf8849f2d6 Add MptForCausalLM key in model_loader (#1526) Wenfei Yan 2023-10-31 15:46:53 -07:00
  • e575df33b1 [Small] Formatter only checks lints in changed files (#1528) Cade Daniel 2023-10-31 15:39:38 -07:00
  • 0ce8647dc5 Fix integer overflows in attention & cache ops (#1514) Woosuk Kwon 2023-10-31 15:19:30 -07:00
  • 9cabcb7645 Add Dockerfile (#1350) Stephen Krider 2023-10-31 12:36:47 -07:00
  • 7b895c5976 [Fix] Fix duplicated logging messages (#1524) Zhuohan Li 2023-10-31 09:04:47 -07:00
  • 7013a80170 Add support for spaces_between_special_tokens Dan Lord 2023-10-30 16:52:56 -07:00
  • 79a30912b8 Add py.typed so consumers of vLLM can get type checking (#1509) Jared Roesch 2023-10-30 14:50:47 -07:00
  • 2f3d36a8a1 Fix logging so we actually get info level entries in the log. (#1494) Adam Brusselback 2023-10-30 13:02:21 -04:00
  • ac8d36f3e5 Refactor LLMEngine demo script for clarity and modularity (#1413) iongpt 2023-10-30 18:14:37 +02:00
  • 15f5632365 Delay GPU->CPU sync in sampling (#1337) Antoni Baum 2023-10-30 09:01:34 -07:00
  • aa9af07cac Fix bias in InternLM (#1501) Woosuk Kwon 2023-10-30 00:24:18 +01:00
  • 69be658bba Support repetition_penalty (#1424) ljss 2023-10-30 01:02:41 +08:00
  • beac8dd461 fix: don't skip first special token. (#1497) Ricardo Lu 2023-10-29 19:26:36 +08:00
  • 28b47d1e49 Add rope_scaling to Aquila model (#1457) Qing 2023-10-29 19:25:21 +08:00
  • 1f24755bf8 Support SqueezeLLM (#1326) chooper1 2023-10-22 03:14:59 -03:00
  • bf31d3606a Pin pydantic dependency versions (#1429) Thiago Salvatore 2023-10-21 15:18:58 -03:00
  • d189170b6c remove useless statements (#1408) Wang Ran (汪然) 2023-10-20 23:52:07 +08:00
  • f61dc8072f Fix type hints (#1427) Light Lin 2023-10-20 23:50:47 +08:00
  • f8a1e39fae [BugFix] Define __eq__ in SequenceGroupOutputs (#1389) Woosuk Kwon 2023-10-17 01:09:44 -07:00
  • a132435204 Fix typo (#1383) Wang Ran (汪然) 2023-10-17 12:53:37 +08:00
  • 9524867701 Add Mistral 7B to test_models (#1366) Woosuk Kwon 2023-10-16 17:49:54 -07:00
  • c1376e0f82 Change scheduler & input tensor shape (#1381) Woosuk Kwon 2023-10-16 17:48:42 -07:00
  • 651c614aa4 Bump up the version to v0.2.1 (#1355) Zhuohan Li 2023-10-16 12:58:57 -07:00
  • d3a5bd9fb7 Fix sampler test (#1379) Woosuk Kwon 2023-10-16 12:57:26 -07:00
  • e8ef4c0820 Fix PyTorch index URL in workflow (#1378) Woosuk Kwon 2023-10-16 12:37:56 -07:00
  • 348897af31 Fix PyTorch version to 2.0.1 in workflow (#1377) Woosuk Kwon 2023-10-16 11:27:17 -07:00
  • 9d9072a069 Implement prompt logprobs & Batched topk for computing logprobs (#1328) Zhuohan Li 2023-10-16 10:56:50 -07:00
  • 928de46888 Implement PagedAttention V2 (#1348) Woosuk Kwon 2023-10-16 00:59:57 -07:00
  • 29678cd213 Minor fix on AWQ kernel launch (#1356) Woosuk Kwon 2023-10-15 21:53:56 -07:00
  • d0740dff1b Fix error message on TORCH_CUDA_ARCH_LIST (#1239) Woosuk Kwon 2023-10-14 14:47:43 -07:00
  • de89472897 Fix the issue for AquilaChat2-* models (#1339) Lu Wang 2023-10-13 11:51:29 -07:00
  • e7c8555d06 Bump up transformers version & Remove MistralConfig (#1254) Woosuk Kwon 2023-10-13 10:05:26 -07:00
  • ec3b5ce9cc Improve detokenization performance (#1338) Antoni Baum 2023-10-13 09:59:07 -07:00
  • 6368e777a8 Add Aquila2 to README (#1331) ldwang 2023-10-13 03:11:16 +08:00
  • 875afe38ab Add blacklist in model checkpoint (#1325) Woosuk Kwon 2023-10-12 01:05:37 -07:00
  • ee8217e5be Add Mistral to quantization model list (#1278) amaleshvemula 2023-10-11 09:26:24 +02:00
  • 980dd4a2c4 Fix overflow in awq kernel (#1295) CHU Tianxiang 2023-10-11 15:19:53 +08:00
  • 8285736840 workaround of AWQ for Turing GPUs (#1252) twaka 2023-10-11 11:48:16 +09:00
  • 91fce82c6f change the timing of sorting logits (#1309) yhlskt23 2023-10-11 11:37:42 +09:00
  • ac5cf86aa6 Fix __repr__ of SequenceOutputs (#1311) Wang Ran (汪然) 2023-10-11 00:58:28 +08:00
  • 6a6119554c lock torch version to 2.0.1 (#1290) yanxiyue 2023-10-11 00:21:57 +08:00
  • b95ee898fe [Minor] Fix comment in mistral.py (#1303) Zhuohan Li 2023-10-09 19:44:37 -07:00
  • 9eed4d1f3e Update README.md (#1292) Zhuohan Li 2023-10-08 23:15:50 -07:00
  • 6b5296aa3a [FIX] Explain why the finished_reason of ignored sequences are length (#1289) Zhuohan Li 2023-10-08 15:22:38 -07:00
  • ee92b58b3a Move bfloat16 check to worker (#1259) Antoni Baum 2023-10-07 22:10:44 -07:00
  • 09ff7f106a API server support ipv4 / ipv6 dualstack (#1288) Yunfeng Bai 2023-10-07 15:15:54 -07:00
  • acbed3ef40 Use monotonic time where appropriate (#1249) Antoni Baum 2023-10-02 19:22:05 -07:00
  • 66d18a7fb0 add support for tokenizer revision (#1163) Federico Cassano 2023-10-02 22:19:46 -04:00
  • ba0bfd40e2 TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic (#1181) Zhuohan Li 2023-10-02 15:36:09 -07:00