Or any other OSS LLM model from Hugging Face.
(Quick-and-dirty instructions that should hopefully let you reproduce it...)
In general, choose an exl2-based model with the ExLlamav2_HF model loader; that combination seems to work best and most reliably.
Here are the steps for the 3.5bpw quant of Mixtral 8x7B, which gave good results and fits into 24 GB VRAM.
- Start a pod with at least 24GB VRAM on runpod.io
- Choose the "RunPod TheBloke LLMs" template (its README is on GitHub)
- Let it deploy (doesn't take long)
- Click the "Connect" button and choose the second option, the one with port 7860.
- In the oobabooga/text-generation-webui GUI, go to the "Model" tab, paste "turboderp/Mixtral-8x7B-instruct-exl2:3.5bpw" into the "Download model" input box and click the Download button (takes a few minutes; a scripted alternative is sketched after this list)
- Reload the model selector, and choose the model in the dropdown
- Choose "ExLlamav2_HF" as model loader (but that should be automatic)
- Set "max_seq_len" to 16384 (otherwise won't fit into 24 GB later)
- Finally, click the "Load" button to load the model
- Your (almost) self-hosted LLM model should be ready in another minute or two.
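If you'd rather script the download step than click through the GUI, here is a minimal sketch using huggingface_hub (the local_dir path assumes the template's default /workspace/text-generation-webui layout, and the folder name mirrors what the webui's downloader produces):

```python
from huggingface_hub import snapshot_download

# Fetch the 3.5bpw quant (a branch/revision of the repo) into the
# models folder, where the "Model" tab will find it. The target
# path is an assumption based on the template's default layout.
snapshot_download(
    repo_id="turboderp/Mixtral-8x7B-instruct-exl2",
    revision="3.5bpw",
    local_dir="/workspace/text-generation-webui/models/turboderp_Mixtral-8x7B-instruct-exl2_3.5bpw",
)
```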
API Endpoints
And if you want to use its OpenAI-compatible API endpoint, use "https://${RUNPOD_ID}-5000.proxy.runpod.net/v1/" (or append "chat/completions" for the chat-completion endpoint); it should work with all OpenAI-API-compatible libraries and tools. Streaming is also supported.
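For example, with the official openai Python package (a minimal sketch; the RUNPOD_ID placeholder, the dummy api_key, and the model name are assumptions, and the webui usually answers with whatever model is currently loaded regardless of the model field):

```python
from openai import OpenAI

# Point the client at the pod's proxied API port (5000).
# Replace RUNPOD_ID with your pod's ID; the key can be any
# non-empty string unless you configured an API key in the webui.
client = OpenAI(
    base_url="https://RUNPOD_ID-5000.proxy.runpod.net/v1/",
    api_key="sk-dummy",
)

# Streamed chat completion; the "model" value here is assumed
# for readability, the loaded model answers either way.
stream = client.chat.completions.create(
    model="turboderp_Mixtral-8x7B-instruct-exl2_3.5bpw",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```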
Template Variables (optional):
If you want the model to load automatically, you can add the following to the UI_ARGS template variable after you have downloaded the model manually (the webui doesn't start when it can't find the model). The next time you stop/start the pod, it will load the model right away:
--model turboderp_Mixtral-8x7B-instruct-exl2_3.5bpw --max_seq_len 16384
Unfortunately, automatic downloading of a model via the MODEL template variable currently doesn't work for models that specify a branch. I opened a GitHub issue for this: https://github.com/TheBlokeAI/dockerLLM/issues/16. If/when that is resolved, you can additionally set MODEL to the value below, and everything should hopefully work automatically, even when starting from scratch:
turboderp/Mixtral-8x7B-instruct-exl2:3.5bpw