Commit Graph

  • 14f0b39cda [Bugfix] Fix a bug in RequestOutput.finished (#202) Woosuk Kwon 2023-06-22 00:17:24 -07:00
  • 2e0d314384 fix-ray (#193) Zhuohan Li 2023-06-22 00:21:41 +08:00
  • 67d96c29fb Use slow tokenizer for open llama models (#168) Woosuk Kwon 2023-06-19 23:19:47 -07:00
  • 033f5c78f5 Remove e.g. in README (#167) Zhuohan Li 2023-06-20 14:00:28 +08:00
  • 794e578de0 [Minor] Fix URLs (#166) Woosuk Kwon 2023-06-19 22:57:14 -07:00
  • caddfc14c1 [Minor] Fix icons in doc (#165) Woosuk Kwon 2023-06-19 20:35:38 -07:00
  • fc72e39de3 Change image urls (#164) Zhuohan Li 2023-06-20 11:15:15 +08:00
  • b7e62d3454 Fix repo & documentation URLs (#163) Woosuk Kwon 2023-06-19 20:03:40 -07:00
  • 364536acd1 [Docs] Minor fix (#162) Woosuk Kwon 2023-06-19 19:58:23 -07:00
  • 0b32a987dd Add and list supported models in README (#161) Zhuohan Li 2023-06-20 10:57:46 +08:00
  • 570fb2e9cc [PyPI] Fix package info in setup.py (#158) Woosuk Kwon 2023-06-19 18:05:01 -07:00
  • a255885f83 Add logo and polish readme (#156) Zhuohan Li 2023-06-19 16:31:13 +08:00
  • 5822ede66e Add performance figures for dark mode (#160) Woosuk Kwon 2023-06-18 23:46:24 -07:00
  • 0370afa2e5 Remove benchmark_async_llm_server.py (#155) Zhuohan Li 2023-06-19 11:12:37 +08:00
  • 7e2a913c64 [Minor] Fix CompletionOutput.__repr__ (#157) Woosuk Kwon 2023-06-18 19:58:25 -07:00
  • 3f92038b99 Add comments on swap space (#154) Woosuk Kwon 2023-06-18 11:39:35 -07:00
  • dcda03b4cb Write README and front page of doc (#147) Woosuk Kwon 2023-06-18 03:19:38 -07:00
  • bf5f121c02 Reduce GPU memory utilization to make sure OOM doesn't happen (#153) Zhuohan Li 2023-06-18 17:33:50 +08:00
  • bec7b2dc26 Add quickstart guide (#148) Zhuohan Li 2023-06-18 01:26:12 +08:00
  • 0b98ba15c7 Change the name to vLLM (#150) Woosuk Kwon 2023-06-17 03:07:40 -07:00
  • e5464ee484 Rename servers to engines (#152) Zhuohan Li 2023-06-17 17:25:21 +08:00
  • bab8f3dd0d [Minor] Fix benchmark_throughput.py (#151) Woosuk Kwon 2023-06-16 21:00:52 -07:00
  • eedb46bf03 Rename servers and change port numbers to reduce confusion (#149) Zhuohan Li 2023-06-17 00:13:02 +08:00
  • 311490a720 Add script for benchmarking serving throughput (#145) Woosuk Kwon 2023-06-14 19:55:38 -07:00
  • da5ddcd544 Remove redundant code in ColumnParallelLinear (#146) Woosuk Kwon 2023-06-10 21:25:11 -07:00
  • 5020e1e80c Non-streaming simple fastapi server (#144) Zhuohan Li 2023-06-11 01:43:07 +08:00
  • 4298374265 Add docstrings for LLMServer and related classes and examples (#142) Zhuohan Li 2023-06-07 18:25:20 +08:00
  • e38074b1e6 Support FP32 (#141) Woosuk Kwon 2023-06-07 00:40:21 -07:00
  • 376725ce74 [PyPI] Packaging for PyPI distribution (#140) Woosuk Kwon 2023-06-05 20:03:14 -07:00
  • 456941cfe4 [Docs] Write the Adding a New Model section (#138) Woosuk Kwon 2023-06-05 20:01:26 -07:00
  • 1a956e136b Fix various issues of async servers (#135) Zhuohan Li 2023-06-05 23:44:50 +08:00
  • 8274ca23ac Add docstrings for LLM (#137) Woosuk Kwon 2023-06-04 12:52:41 -07:00
  • 62ec38ea41 Document supported models (#127) Woosuk Kwon 2023-06-02 22:35:17 -07:00
  • 0eda2e0953 Add .readthedocs.yaml (#136) Woosuk Kwon 2023-06-02 22:27:44 -07:00
  • 211318d44a Add throughput benchmarking script (#133) Woosuk Kwon 2023-05-28 03:20:05 -07:00
  • 337871c6fd Enable LLaMA fast tokenizer (#132) Woosuk Kwon 2023-05-28 02:51:42 -07:00
  • 56b7f0efa4 Add a doc for installation (#128) Woosuk Kwon 2023-05-27 01:13:06 -07:00
  • d721168449 Improve setup script & Add a guard for bfloat16 kernels (#130) Woosuk Kwon 2023-05-27 00:59:32 -07:00
  • 4a151dd453 Add activation registry (#126) Woosuk Kwon 2023-05-25 00:09:07 -07:00
  • 057daef778 OpenAI Compatible Frontend (#116) Zhuohan Li 2023-05-23 21:39:50 -07:00
  • e86717833d Incrementally decode output tokens (#121) Woosuk Kwon 2023-05-23 20:46:32 -07:00
  • aedba6d5ec Print warnings/errors for large swap space (#123) Woosuk Kwon 2023-05-23 18:22:26 -07:00
  • a283ec2eec Add contributing guideline and mypy config (#122) Woosuk Kwon 2023-05-23 17:58:51 -07:00
  • 3f942acfe1 Fix latency benchmark script (#118) Woosuk Kwon 2023-05-22 17:03:40 -07:00
  • 19d2899439 Add initial sphinx docs (#120) Woosuk Kwon 2023-05-22 17:02:44 -07:00
  • 655a5e48df Introduce LLM class for offline inference (#115) Woosuk Kwon 2023-05-21 17:04:18 -07:00
  • f746ced08d Implement stop strings and best_of (#114) Woosuk Kwon 2023-05-21 11:18:00 -07:00
  • c3442c1f6f Refactor system architecture (#109) Woosuk Kwon 2023-05-20 13:06:59 -07:00
  • 7297fa6f7c Remove unused parts in Megatron-LM code and add copyright notice (#110) Zhuohan Li 2023-05-20 09:11:34 -06:00
  • b7955ef17b Fix timeout error in the FastAPI frontend (#34) Zhuohan Li 2023-05-19 14:00:46 -06:00
  • f756799b84 Use runtime profiling to replace manual memory analyzers (#81) Zhuohan Li 2023-05-19 11:35:44 -06:00
  • 825d8892b5 Use pytest format for unit tests (#107) Woosuk Kwon 2023-05-17 17:11:23 -07:00
  • b322fd1607 Add docstrings to some modules and classes (#100) Woosuk Kwon 2023-05-14 22:32:38 -07:00
  • 667ba3995c Add copyright headers to source files adapted from FT (#104) Woosuk Kwon 2023-05-14 22:19:19 -07:00
  • 707ec647bb Add copyright headers for HF models (#103) Woosuk Kwon 2023-05-14 21:54:32 -07:00
  • 89988ec8c2 Add Apache-2.0 license (#102) Woosuk Kwon 2023-05-14 18:05:19 -07:00
  • 6208d622ca Minor code cleaning for SamplingParams (#99) Woosuk Kwon 2023-05-12 18:07:09 -07:00
  • 42f1042e1c Enhance SamplingParams (#96) Woosuk Kwon 2023-05-11 15:45:30 -07:00
  • 55f8b0a5de Implement presence and frequency penalties (#95) Woosuk Kwon 2023-05-10 23:39:12 -07:00
  • 9f88db35da Support top-k sampling (#94) Woosuk Kwon 2023-05-10 12:51:36 -07:00
  • ae356774ab Avoid sorting waiting queue & Minor code cleaning (#93) Woosuk Kwon 2023-05-10 01:57:07 -07:00
  • e331957784 Log system stats (#90) Woosuk Kwon 2023-05-10 01:06:53 -07:00
  • 8d66a7b6d7 Rename variables and methods (#91) Woosuk Kwon 2023-05-10 00:58:31 -07:00
  • ce26e57fd3 Update sample prompts in simple_server.py (#89) Woosuk Kwon 2023-05-09 16:47:39 -07:00
  • 85eb631839 Use slow tokenizer for LLaMA (#84) Woosuk Kwon 2023-05-09 16:03:44 -07:00
  • add055e151 Enhance model loader (#83) Woosuk Kwon 2023-05-09 15:46:42 -07:00
  • 7c041ab578 Refactor system architecture (#82) Woosuk Kwon 2023-05-09 15:30:12 -07:00
  • 8917782af6 Add a system logger (#85) Woosuk Kwon 2023-05-08 23:03:35 -07:00
  • 7addca5935 Specify python package dependencies in requirements.txt (#78) Woosuk Kwon 2023-05-07 16:30:43 -07:00
  • c84e924287 [Minor] Fix a dtype bug (#79) Woosuk Kwon 2023-05-06 02:12:12 -07:00
  • c9d5b6d4a8 Replace FlashAttention with xformers (#70) Woosuk Kwon 2023-05-05 02:01:08 -07:00
  • 189ae23133 Use dtype from model config & Add Dolly V2 (#63) Woosuk Kwon 2023-05-04 03:05:37 -07:00
  • e548c1488a Add support for GPT-2 (#60) Woosuk Kwon 2023-05-04 02:59:56 -07:00
  • 130d5fd8c7 Fix a bug in attention kernel (#68) Woosuk Kwon 2023-05-04 02:56:09 -07:00
  • e070829ae8 Support bfloat16 data type (#54) Woosuk Kwon 2023-05-03 14:09:44 -07:00
  • 436e523bf1 Refactor attention kernels (#53) Woosuk Kwon 2023-05-03 13:40:13 -07:00
  • 27f1410d06 New weight loader without np copy (#52) Zhuohan Li 2023-05-03 15:32:04 +08:00
  • 4858f3bb45 Add an option to launch cacheflow without ray (#51) Zhuohan Li 2023-04-30 15:42:17 +08:00
  • a96d63c21d Add support for GPT-NeoX (Pythia) (#50) Woosuk Kwon 2023-04-28 00:32:10 -07:00
  • aa50b17ca7 Change plotting script Woosuk Kwon 2023-04-17 04:49:14 +00:00
  • 0f4b32199e Support various block sizes & Change default block size to 16 (#38) Woosuk Kwon 2023-04-15 09:03:24 -07:00
  • 84eee24e20 Collect system stats in scheduler & Add scripts for experiments (#30) Woosuk Kwon 2023-04-12 15:03:49 -07:00
  • e3cec88aa5 Memcpy kernel for flash attention (#29) Siyuan (Ryans) Zhuang 2023-04-10 18:22:49 -07:00
  • b9926f7f66 Support block size 32 (#35) Woosuk Kwon 2023-04-09 23:07:18 -07:00
  • ee88a7e5f3 Add an option to use dummy model weights (#33) Woosuk Kwon 2023-04-08 23:36:12 -07:00
  • c267b1a02c Add query stride to multi_query_cached_kv_attention & Add kernel benchmark script (#27) Woosuk Kwon 2023-04-08 13:36:09 -07:00
  • 0f40557af6 Implement block copy kernel to optimize beam search (#32) Woosuk Kwon 2023-04-07 17:45:07 -07:00
  • a490aafa36 Fix potential bugs in FastAPI frontend and add comments (#28) Zhuohan Li 2023-04-06 13:44:24 +08:00
  • 12659a0bd7 Add CUDA graph-based all reduce launcher (#26) Woosuk Kwon 2023-04-05 11:16:57 -07:00
  • 21b3671bbc Basic attention kernel that supports cached KV + (multi-)prompts (#24) Siyuan (Ryans) Zhuang 2023-04-04 20:34:46 -07:00
  • 897cb2ae28 Optimize data movement (#20) Woosuk Kwon 2023-04-02 00:30:17 -07:00
  • 1f01a18d39 Merge QKV into one linear layer (#15) Zhuohan Li 2023-04-02 15:23:29 +08:00
  • 2c5cd0defe Add ninja to dependency (#21) Woosuk Kwon 2023-04-01 19:00:20 -07:00
  • a90c97d727 Use FP32 for log probabilities (#19) Woosuk Kwon 2023-03-31 23:33:43 -07:00
  • e3f00d191e Modify README to include info on loading LLaMA (#18) Zhuohan Li 2023-04-01 01:07:57 +08:00
  • 09e9245478 Add custom kernel for RMS normalization (#16) Woosuk Kwon 2023-03-31 09:51:22 -07:00
  • c45f3c3ab6 Optimize tensor parallel execution speed (#17) Zhuohan Li 2023-04-01 00:51:08 +08:00
  • 7a7929abe8 Implement preemption via recomputation & Refactor scheduling logic (#12) Woosuk Kwon 2023-03-30 14:51:46 -07:00
  • 88c0268a18 Implement custom kernel for LLaMA rotary embedding (#14) Woosuk Kwon 2023-03-30 11:04:21 -07:00
  • 80a2f812f1 Implement LLaMA (#9) Woosuk Kwon 2023-03-29 21:25:32 -07:00