You just built an LLM app and you’re wondering … how can I make it better? Here are 13 optimization paths and when you should use them.
Welcome to Episode 29 in Prolego’s Generative AI series. Building a demo app or MVP with LLMs is straightforward. Getting it to perform in production? Well, that is much harder. There is very little guidance on diagnosing your challenges or on the specific steps to mitigate them. For example, should you improve your prompts, fine-tune the model, or add agents? What about tools?
Many resources address specific optimizations, but I haven’t seen anything that describes the big picture. So, we created a list of 13 LLM optimization choices and when you should pursue them. I’ve summarized all 13 in Prolego’s LLM Optimization Playbook. You can download a free copy here.
I’m assuming you have an evaluation framework, and you’ve decided whether to host your own open source LLMs or rely on proprietary ones like GPT-4. If not, check out Episodes 25 and 28.
Let’s start with Model optimizations.
- Choose the right model for the task. Your very first step is picking a general model like GPT-4 or a specialized one like Phind. Usually you’ll want to start with the most powerful general model available. Unfortunately, this isn’t a decision you can easily reverse: your prompts and system architecture depend on your model choice.
- Choose the model size or version. This choice is more critical if you are using open source LLMs because you have so many options. Larger models are smarter, but they also run slower and take up more GPU memory. You will continuously revisit this decision.
- Quantize the model. Quantization reduces the precision of an open source model’s weights to shrink its memory footprint and increase speed. In practice these gains come with very little performance degradation. You can probably find a quantized version on Hugging Face; otherwise you’ll need specialized skills. See the loading sketch after this list.
- Fine-tune the model. You can retrain the model on a curated data set to improve performance. You will want to do this later, if ever: fine-tuning requires significant effort to build the training data and specialized skills to get performance improvements. See the LoRA sketch after this list.
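To make the quantization step concrete, here is a minimal sketch of loading a 4-bit quantized open source model with the Hugging Face transformers and bitsandbytes libraries. The model name and settings are illustrative, not recommendations.

```python
# Load an open source model in 4-bit precision to cut GPU memory use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example model, not a recommendation

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit precision
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 for speed
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers on available GPUs automatically
)
```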
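And if you do eventually fine-tune, a parameter-efficient approach like LoRA keeps the effort manageable. Here is a minimal sketch with the peft library, assuming you have already curated a training set; the base model and hyperparameters are placeholders.

```python
# Attach small LoRA adapters to a base model so only they are trained.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")  # example base model
lora = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the small adapter weights will train
# ...then train with transformers.Trainer (or trl's SFTTrainer) on your curated examples.
```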
Let’s transition to Prompt optimizations.
- Improve prompts. This is your first and best option for increasing performance. Prompt improvements happen at the system level or through better user training. You will do this continuously.
- Provide examples. Show the LLM how to complete a task when it can’t do so through prompt improvements alone. Examples are helpful when you need consistent output formats or a task is ambiguous. Since examples take up prompt space, they increase cost and latency; if you need a lot of them, fine-tuning might be a better choice. See the few-shot sketch after this list.
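Here is a minimal few-shot sketch using the OpenAI Python client. The ticket-classification task and its labels are made up for illustration; the point is that the examples pin down the exact output format.

```python
# Few-shot prompting: show the model two worked examples before the real input.
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system", "content": "Classify each support ticket as BILLING, BUG, or OTHER. Reply with the label only."},
    # Two examples demonstrating the expected output format:
    {"role": "user", "content": "I was charged twice this month."},
    {"role": "assistant", "content": "BILLING"},
    {"role": "user", "content": "The export button crashes the app."},
    {"role": "assistant", "content": "BUG"},
    # The real input:
    {"role": "user", "content": "How do I reset my password?"},
]

response = client.chat.completions.create(model="gpt-4", messages=messages)
print(response.choices[0].message.content)  # expected: OTHER
```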
Now a few Context optimizations.
- Add relevant context, or RAG. Everyone, including me, talks about retrieval-augmented generation (RAG) as a solution, but it is really an optimization. Passing relevant text along with the prompt is the most cost-effective way to customize LLMs with your data, though parsing documents while retaining their structure can be a significant engineering challenge. See the retrieval sketch after this list.
- Provide structured data access. You can use LLMs to generate SQL and query database tables. This is a more challenging workflow that usually requires agents and tools, but if your task depends on information in tables, you may have to do it. See the text-to-SQL sketch after this list.
- Integrate multiple information sources. You may discover the LLM needs information beyond what RAG and structured data provide. For example, a task may require understanding arcane terminology, in which case you may need to pass definitions to the LLM.
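Here is a minimal RAG sketch using the OpenAI Python client. The word-overlap scorer is a stand-in for a real retriever (typically a vector search over embedded document chunks); the policy chunks and question are made up.

```python
# RAG in miniature: pick the most relevant chunk, then pass it as context.
from openai import OpenAI

client = OpenAI()

chunks = [
    "Policy 4.2: Employees accrue 1.5 vacation days per month.",
    "Policy 7.1: Expense reports are due within 30 days of travel.",
    "Policy 9.3: Remote work requires manager approval.",
]

question = "How many vacation days do employees earn each month?"

# Stand-in retrieval: score chunks by word overlap with the question.
def score(chunk: str) -> int:
    return len(set(chunk.lower().split()) & set(question.lower().split()))

context = max(chunks, key=score)

prompt = (
    "Answer using only the context below. If the answer is not there, say so.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)

response = client.chat.completions.create(
    model="gpt-4", messages=[{"role": "user", "content": prompt}]
)
print(response.choices[0].message.content)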
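And here is a minimal text-to-SQL sketch: the model sees the schema, writes a query, and we run it. The table, data, and question are invented, and a production version would need to validate and sandbox the generated SQL before executing it.

```python
# Text-to-SQL: the LLM writes a query against a known schema, we execute it.
import sqlite3
from openai import OpenAI

client = OpenAI()
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
               [(1, "EMEA", 120.0), (2, "AMER", 80.0), (3, "EMEA", 45.5)])

question = "What is the total order amount for EMEA?"

sql = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": "Schema: orders(id INTEGER, region TEXT, amount REAL)\n"
                   f"Write one SQLite query answering: {question}\n"
                   "Return only the SQL, with no explanation or formatting.",
    }],
).choices[0].message.content.strip()

# A real implementation should extract, validate, and sandbox the SQL first.
rows = db.execute(sql).fetchall()
print(rows)
```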
Finally, Workflow optimizations.
- Implement an agent. Agents interact with users or tools through multiple LLM calls, and you will introduce them as soon as you cannot complete the task with a single LLM call. They increase cost, latency, and system complexity, but are usually necessary for production applications.
- Add tools. You can let the model dynamically select specialized functions, either ones you build or external APIs. Use tools whenever there is a more efficient way than an LLM to complete part of the task. Do this early and continuously. See the tool-calling sketch after this list.
- Provide out-of-context variables. If you need to process more data than the LLM’s context window can handle, a combination of tools and agents can process the data and pass the results to the LLM. This is a complex step that should only be taken if necessary. See the chunk-and-summarize sketch after this list.
- Orchestrate multiple agents. Sometimes you need results from multiple models or agents, such as reasoning with GPT-4 and summarizing text with Mistral 7B. Since each needs its own dedicated prompts, this step increases complexity. See the orchestration sketch after this list.
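Here is a minimal sketch of a tool-calling agent using the OpenAI Python client: the model decides whether to call a function, we run it, and the result goes back to the model for a final answer. The get_order_status tool and its data are hypothetical.

```python
# A simple agent loop: keep calling the model until it stops requesting tools.
import json
from openai import OpenAI

client = OpenAI()

def get_order_status(order_id: str) -> str:
    return json.dumps({"order_id": order_id, "status": "shipped"})  # fake lookup

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the shipping status of an order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

messages = [{"role": "user", "content": "Where is order 8472?"}]

while True:
    reply = client.chat.completions.create(
        model="gpt-4", messages=messages, tools=tools
    ).choices[0].message
    messages.append(reply)
    if not reply.tool_calls:
        break  # no more tool requests; reply.content is the final answer
    for call in reply.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_order_status(**args)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})

print(reply.content)
```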
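For data that won’t fit in the context window, one common pattern is map-reduce: summarize each chunk separately, then combine the partial summaries. The file name and chunk size below are placeholders.

```python
# Map-reduce over a document that is too large for a single LLM call.
from openai import OpenAI

client = OpenAI()

def summarize(text: str) -> str:
    return client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"Summarize in 2 sentences:\n{text}"}],
    ).choices[0].message.content

document = open("large_report.txt").read()   # assume this exceeds the context window
chunks = [document[i:i + 8000] for i in range(0, len(document), 8000)]

partials = [summarize(c) for c in chunks]    # map step: summarize each chunk
final = summarize("\n".join(partials))       # reduce step: combine the summaries
print(final)
```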
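Finally, a minimal orchestration sketch in the spirit of the GPT-4 plus Mistral 7B example: the small model handles cheap summarization and the large model reasons over the result. The local Mistral endpoint is an assumption (any OpenAI-compatible server would do), and the prompts and file are illustrative.

```python
# Route work between a small, cheap model and a large, capable one.
from openai import OpenAI

gpt4 = OpenAI()                                          # hosted GPT-4
mistral = OpenAI(base_url="http://localhost:8000/v1",    # assumed local OpenAI-compatible server
                 api_key="not-needed")

def ask(client: OpenAI, model: str, prompt: str) -> str:
    return client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content

report = open("incident_log.txt").read()                 # placeholder input

summary = ask(mistral, "mistral-7b-instruct", f"Summarize this log:\n{report}")
answer = ask(gpt4, "gpt-4", f"Given this summary, what is the likely root cause?\n{summary}")
print(answer)
```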
That’s a high-level overview of all 13. In coming Episodes we will show you an example of each.