Or any other OSS LLM from Hugging Face.

(Quick-and-dirty instructions to hopefully reproduce it.)

In general, choose an exl2-based model with the ExLlamav2_HF model loader; that combination seems to work best and most reliably.

Here are the steps for the 3.5bpw version of Mixtral 8x7B, which gave good results and fits into 24 GB of VRAM.

  • Start a pod with at least 24 GB of VRAM on runpod.io
  • Choose the "RunPod TheBloke LLMs" template (README on GitHub)
  • Let it deploy (doesn't take long)
  • Click the "Connect" button and choose the second option, the one with port 7860.
  • In the oobabooga/text-generation-webui GUI, go to the "Model" tab, paste "turboderp/Mixtral-8x7B-instruct-exl2:3.5bpw" into the "Download model" input box, and click the Download button (takes a few minutes)
  • Reload the model selector and choose the model in the dropdown
  • Choose "ExLlamav2_HF" as the model loader (this should happen automatically)
  • Set "max_seq_len" to 16384 (otherwise the model won't fit into 24 GB later)
  • Finally, click the "Load" button to load the model
  • Your (almost) self-hosted LLM should be ready in another minute or two. A quick way to verify it from outside the pod is shown below.
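Once the model is loaded, a quick sanity check from your own machine is to list the available models via the OpenAI-compatible API described in the next section. Here is a minimal sketch using Python and the requests package; it assumes you replace RUNPOD_ID with your pod's actual ID:

    import requests

    # The webui's API extension exposes the OpenAI-style model list on port 5000
    url = "https://RUNPOD_ID-5000.proxy.runpod.net/v1/models"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    print(resp.json())  # should include the downloaded Mixtral model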

API Endpoints

And if you want to use its OpenAI-compatible API endpoint, use "https://${RUNPOD_ID}-5000.proxy.runpod.net/v1/" (or append "chat/completions" for the chat-completions endpoint). It should work with all OpenAI-API-compatible libraries and tools, and it also supports streaming; see the sketch below.
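For example, here is a minimal sketch using the official openai Python package (v1+). It assumes the pod is running, RUNPOD_ID is replaced with your actual pod ID, and any placeholder API key is accepted (the webui doesn't check it by default):

    from openai import OpenAI

    client = OpenAI(
        base_url="https://RUNPOD_ID-5000.proxy.runpod.net/v1",
        api_key="sk-placeholder",  # not checked by the webui, but the client insists on one
    )

    # The webui serves whatever model is currently loaded,
    # so the model name here is mostly informational
    resp = client.chat.completions.create(
        model="turboderp_Mixtral-8x7B-instruct-exl2_3.5bpw",
        messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    )
    print(resp.choices[0].message.content)

    # Streaming works too: tokens are printed as they arrive
    stream = client.chat.completions.create(
        model="turboderp_Mixtral-8x7B-instruct-exl2_3.5bpw",
        messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
    print()

Any other OpenAI-compatible client (LangChain, llama-index, curl, etc.) should work the same way by pointing its base URL at the port-5000 proxy address.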

Template Variables (optional):

If you want the model to load automatically, you can add the following to the UI_ARGS template variable after you have downloaded the model manually (the webui doesn't start when it can't find the model). Note that the flag uses the local directory name, in which the "/" and ":" of the download string are replaced by underscores. The next time you stop/start the pod, it will load the model right away:

--model turboderp_Mixtral-8x7B-instruct-exl2_3.5bpw --max_seq_len 16384

Unfortunately, automatic downloading of a model via the MODEL template variable currently doesn't work for models that specify a branch. I opened a GitHub issue for this: https://github.com/TheBlokeAI/dockerLLM/issues/16. If/when that is resolved, you can additionally set MODEL to the value below, and then everything should hopefully work automatically, even when starting from scratch:

turboderp/Mixtral-8x7B-instruct-exl2:3.5bpw
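For reference, assuming the underscore directory-name convention noted above still applies, the two template variables together would then look something like this:

    MODEL=turboderp/Mixtral-8x7B-instruct-exl2:3.5bpw
    UI_ARGS=--model turboderp_Mixtral-8x7B-instruct-exl2_3.5bpw --max_seq_len 16384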