Or any other OSS LLM from Hugging Face.

(Quick-and-dirty instructions to hopefully reproduce it.)

In general, choose an exl2-based model with the ExLlamav2_HF model loader; that combination seems to work best and most reliably.

Here are the steps for the 3.5bpw version of Mixtral 8x7B, which gave good results and fits into 24 GB of VRAM.

  • Start a pod with at least 24 GB of VRAM on runpod.io
  • Choose the "RunPod TheBloke LLMs" template (README on GitHub)
  • Let it deploy (doesn't take long)
  • Click the "Connect" button and choose the second option, the one with port 7860.
  • In the oobabooga/text-generation-webui GUI, go to the "Model" tab, paste "turboderp/Mixtral-8x7B-instruct-exl2:3.5bpw" into the "Download model" input box, and click the Download button (takes a few minutes)
  • Reload the model selector and choose the model in the dropdown
  • Choose "ExLlamav2_HF" as the model loader (this should happen automatically)
  • Set "max_seq_len" to 16384 (otherwise the model won't fit into 24 GB later)
  • Finally, click the "Load" button to load the model
  • Your (almost) self-hosted LLM should be ready in another minute or two. A quick way to verify it from outside the pod is shown below.
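Once the model is loaded, a quick sanity check from your own machine is to list the available models via the OpenAI-compatible API described in the next section. Here is a minimal sketch using Python and the requests package; it assumes you replace RUNPOD_ID with your pod's actual ID:

    import requests

    # The webui's API extension exposes the OpenAI-style model list on port 5000
    url = "https://RUNPOD_ID-5000.proxy.runpod.net/v1/models"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    print(resp.json())  # should include the downloaded Mixtral model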

API Endpoints

And if you want to use its OpenAI-compatible API endpoint, use "https://${RUNPOD_ID}-5000.proxy.runpod.net/v1/" (or append "chat/completions" for the chat-completions endpoint). It should work with all OpenAI-API-compatible libraries and tools, and it also supports streaming; see the sketch below.
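For example, here is a minimal sketch using the official openai Python package (v1+). It assumes the pod is running, RUNPOD_ID is replaced with your actual pod ID, and any placeholder API key is accepted (the webui doesn't check it by default):

    from openai import OpenAI

    client = OpenAI(
        base_url="https://RUNPOD_ID-5000.proxy.runpod.net/v1",
        api_key="sk-placeholder",  # not checked by the webui, but the client insists on one
    )

    # The webui serves whatever model is currently loaded,
    # so the model name here is mostly informational
    resp = client.chat.completions.create(
        model="turboderp_Mixtral-8x7B-instruct-exl2_3.5bpw",
        messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    )
    print(resp.choices[0].message.content)

    # Streaming works too: tokens are printed as they arrive
    stream = client.chat.completions.create(
        model="turboderp_Mixtral-8x7B-instruct-exl2_3.5bpw",
        messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
    print()

Any other OpenAI-compatible client (LangChain, llama-index, curl, etc.) should work the same way by pointing its base URL at the port-5000 proxy address.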

Template Variables (optional):

If you want the model to load automatically, you can add the following to the UI_ARGS template variable after you have downloaded the model manually (the webui doesn't start when it can't find the model). Note that the flag uses the local directory name, in which the "/" and ":" of the download string are replaced by underscores. The next time you stop/start the pod, it will load the model right away:

--model turboderp_Mixtral-8x7B-instruct-exl2_3.5bpw --max_seq_len 16384

Unfortunately, automatic downloading of a model via the MODEL template variable currently doesn't work for models that specify a branch. I opened a GitHub issue for this: https://github.com/TheBlokeAI/dockerLLM/issues/16. If/when that is resolved, you can additionally set MODEL to the value below, and then everything should hopefully work automatically, even when starting from scratch:

turboderp/Mixtral-8x7B-instruct-exl2:3.5bpw
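For reference, assuming the underscore directory-name convention noted above still applies, the two template variables together would then look something like this:

    MODEL=turboderp/Mixtral-8x7B-instruct-exl2:3.5bpw
    UI_ARGS=--model turboderp_Mixtral-8x7B-instruct-exl2_3.5bpw --max_seq_len 16384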