Made some adjustments to the code in peft and gptq for llama so that LoRA finetuning works with a 4-bit base model. The same adjustments can be made for 2, 3 and 8 bits.
This was numerically unstable at first; that issue has been resolved.
Reconstructing the fp16 matrix from the 4-bit data and calling torch.matmul greatly increased the inference speed.
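
As a rough illustration of the reconstruct-then-matmul idea, here is a minimal sketch. It is not the code from this repo: the helper names (`pack_4bit`, `unpack_4bit`, `dequantize_4bit`, `matmul_4bit`), the two-values-per-byte uint8 layout, and the per-column `scales` / `zeros` are all assumptions chosen to keep the example short. The real GPTQ storage format differs.

```python
import torch


def pack_4bit(q: torch.Tensor) -> torch.Tensor:
    """Pack integers in [0, 15], shape (in_features, out_features), into uint8.

    Hypothetical layout: two consecutive rows share one byte
    (even row in the low nibble, odd row in the high nibble).
    in_features is assumed to be even.
    """
    q = q.to(torch.uint8)
    return q[0::2] | (q[1::2] << 4)


def unpack_4bit(packed: torch.Tensor) -> torch.Tensor:
    """Inverse of pack_4bit: returns uint8 values in [0, 15]."""
    lo = packed & 0x0F
    hi = packed >> 4
    # Interleave the rows back into their original order: lo0, hi0, lo1, hi1, ...
    return torch.stack((lo, hi), dim=1).reshape(-1, packed.shape[1])


def dequantize_4bit(packed, scales, zeros, dtype=torch.float16):
    """Rebuild a dense weight matrix: w = (q - zero) * scale, per output column."""
    q = unpack_4bit(packed).to(dtype)
    return (q - zeros.to(dtype)) * scales.to(dtype)


def matmul_4bit(x, packed, scales, zeros):
    """Reconstruct the dense matrix in x's dtype, then use the regular matmul."""
    w = dequantize_4bit(packed, scales, zeros, dtype=x.dtype)
    return torch.matmul(x, w)
```

The design trade-off in this sketch is that the dense matrix is materialized temporarily for each call, which costs extra memory compared with the packed form, in exchange for using the standard `torch.matmul` path. On a GPU, `x` would typically be an fp16 tensor so that the reconstructed matrix is fp16 as well.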