Ep 28. How to Host Open Source LLMs

You just bought some GPUs and want to host open source LLMs. With thousands of models available, you’re wondering, “Where should I start?” Here’s what you can expect, and what you should do.

Welcome to Episode 28 in Prolego’s Generative AI series. Many companies are choosing open source LLMs created by companies like Meta and Mistral over proprietary ones from companies like OpenAI or Google. This choice is motivated by policy constraints, a desire for increased control, security concerns, or cost savings.

These benefits come with the additional challenge of maintaining the open source models yourself. Here is what you can expect.

One challenge is deciding which models to support. HuggingFace currently hosts thousands of LLMs, with new ones appearing daily. They vary in size, performance, and license. You will also find multiple versions of the same model that have been modified to reduce their memory footprint, a process called quantization.
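
As a rough illustration, here is a minimal sketch of loading a 4-bit quantized model with the Hugging Face transformers and bitsandbytes libraries; the model ID and configuration values are examples, not recommendations.

```python
# Minimal sketch: load a 4-bit quantized model with transformers + bitsandbytes.
# The model ID and settings below are illustrative; check the model card for
# the license and recommended configuration before deploying.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example model, swap in your choice

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                        # 4-bit weights roughly quarter fp16 memory
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                        # spread layers across available GPUs
)
print(f"Approximate weight memory: {model.get_memory_footprint() / 1e9:.1f} GB")
```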

They also perform differently on different tasks. A model good at summarizing text may not work well as an agent.

You also need new policies. “Safety” constraints usually degrade LLM performance, so some engineers remove them. Is this acceptable? Well, it depends. You don’t want a model telling teenagers how to make meth, so safety constraints make sense. But these same constraints could render a model useless for law enforcement.

Maintaining your environment can also require unanticipated effort. Each LLM is tied to a specific configuration, so your operating system and key libraries need to be compatible with it. In traditional software engineering, a development team can usually get by with an older version of Python, but there is little it can do if an LLM is incompatible with that version.
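
One way to catch this early is a startup check that fails fast when the runtime drifts from what you validated. Here is a minimal sketch, assuming a PyTorch and transformers stack; the pinned versions are placeholders for whatever you actually tested against.

```python
# Minimal sketch: fail fast if the runtime no longer matches the versions your
# models were validated against. The pinned values are placeholders.
import sys

import torch
import transformers

EXPECTED_PYTHON = (3, 10)        # placeholder: the Python minor version you validated
EXPECTED_TRANSFORMERS = "4.38"   # placeholder: pin to whatever you actually tested

assert sys.version_info[:2] == EXPECTED_PYTHON, (
    f"Python {sys.version_info[:2]} was not validated against your models"
)
assert transformers.__version__.startswith(EXPECTED_TRANSFORMERS), (
    f"transformers {transformers.__version__} was not validated against your models"
)
assert torch.cuda.is_available(), "No CUDA device visible; check drivers and toolkit"
print(f"OK: torch {torch.__version__}, CUDA {torch.version.cuda}")
```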

Finally, the systems architecture and deployment patterns are rapidly changing. Teams are now experimenting with different LLMs for different tasks, for example using a small model to summarize text and a larger one for prompt orchestration and reasoning. A team may suddenly need additional GPUs to support these designs.
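
Here is a minimal sketch of that kind of routing, assuming each model is served behind an OpenAI-compatible endpoint such as vLLM’s API server. The ports, model names, and the route_task helper are illustrative assumptions.

```python
# Minimal sketch: route tasks to different self-hosted models, each served
# behind an OpenAI-compatible endpoint (for example, vLLM's API server).
# The ports, model names, and route_task helper are illustrative assumptions.
from openai import OpenAI

small_client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # small summarizer
large_client = OpenAI(base_url="http://localhost:8001/v1", api_key="unused")  # large reasoner

def route_task(task: str, prompt: str) -> str:
    """Send summarization to the small model and everything else to the large one."""
    client, model = (
        (small_client, "small-summarizer") if task == "summarize"
        else (large_client, "large-reasoner")
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(route_task("summarize", "Summarize this incident report: ..."))
```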

So those are some challenges. Now what should you do?

  1. First, design a process that allows for frequent change. Your application teams will want access to the latest models, but you must balance this demand with your evaluation and testing process.
  2. Second, begin with a handful of the most popular models in different sizes. Currently, models from Mistral and Meta are a good starting point. The best sources are conversations on sites like X and Reddit, or you can just email me and we’ll make some suggestions.
  3. Finally, have your data science or application team design a representative evaluation framework to run against new models. General benchmarks like the MMLU are not nearly as effective as ones based on your own data and tasks; a minimal sketch follows this list.
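
Here is a minimal sketch of such a harness, assuming a JSONL file of your own prompts and expected answers. The file name, the generate callable, and the exact-match scoring are placeholders for your tasks and scoring rules.

```python
# Minimal sketch: score a candidate model on your own labeled examples.
# The file name, generate callable, and exact-match scoring are placeholders.
import json

def evaluate(generate, eval_path="eval_set.jsonl"):
    """generate: a callable that maps a prompt string to the model's output string."""
    scores = []
    with open(eval_path) as f:
        for line in f:
            example = json.loads(line)        # e.g. {"prompt": "...", "expected": "..."}
            output = generate(example["prompt"])
            # Swap this exact-match check for a scorer that fits your task
            scores.append(float(example["expected"].strip().lower() in output.lower()))
    return sum(scores) / len(scores)

# Usage: run the same eval set against each candidate model
# score = evaluate(my_generate_fn)
# print(f"Task accuracy: {score:.1%}")
```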

Check out Episodes 6 and 25 below for tips on designing one.

Let’s Future Proof Your Business.