Five Lessons from my first LLM project

Kipkorir Arap Kirui
8 min read · Oct 23, 2023


Photo by Rod Long on Unsplash

Background

The correct information at the right time is critical in a complex landscape such as the climate sector. A key contributor to that complexity is the sheer volume of data you must sift through before you can decide. Alongside two other workstreams (a mobile app and a project support platform), we worked with the Africa Climate Summit (ACS) secretariat to give the team actionable insights drawn from a wealth of reports, research, and documents related to the five pillars of ACS. We developed an LLM-powered platform that aggregates, categorizes, and analyzes extensive documents, offering a consolidated knowledge hub. The hub was queried through a chat-like interface, giving access to specific, context-aware information and cross-pillar analysis for a complete view of the landscape. The platform would surface clear action points from the analyzed data, supporting investment decisions, decision-making structures, and the identification of priority areas.

To deliver this project, we leveraged GPT-3.5 and the Azure stack. More specifically, we used the following (a rough sketch of how the pieces fit together follows the list):

  • Azure Blob Storage — to store and manage files securely in the cloud, accessible from anywhere
  • Azure Cognitive Search — powerful search capabilities to quickly locate and retrieve information from our stored data
  • Azure OpenAI integration — cutting-edge AI models to generate text and perform complex natural language processing tasks
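
To make this concrete, here is a minimal retrieval-augmented sketch of how these pieces can fit together. It is illustrative only: the endpoints, keys, index name, and the "content" field are hypothetical placeholders, and it assumes the 0.x openai Python SDK configured for Azure OpenAI.

```python
import openai
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Hypothetical endpoint, key, and index name -- replace with your own.
search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="acs-documents",
    credential=AzureKeyCredential("<search-key>"),
)

# 1. Retrieve the most relevant document chunks for the user's question.
question = "What are the key investment priorities across the five pillars?"
results = search_client.search(question, top=3)
context = "\n\n".join(doc["content"] for doc in results)  # assumes a 'content' field

# 2. Ask GPT-3.5 to answer using only the retrieved context.
openai.api_type = "azure"
openai.api_base = "https://<your-openai-resource>.openai.azure.com"
openai.api_version = "2023-05-15"
openai.api_key = "<openai-key>"

response = openai.ChatCompletion.create(
    engine="gpt-35-turbo",  # your Azure deployment name
    messages=[
        {"role": "system", "content": "Answer strictly from the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(response["choices"][0]["message"]["content"])
```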

This project was challenging: we had only six weeks to execute it and faced a steep learning curve. Despite the constraints, we delivered an MVP that provides insights based on the documents we were given. We wish we had had more time to improve response times and to continue training on new documents. Samuel Otieno was our data scientist on this project; Amos Wanene, our AI/ML engineer; and Felix Mokua, our front-end lead. I served as product strategist and project manager. Ian Mutai offered additional technical support, and Brian Nyangena was the bridge between us and the client. The project was further complicated by the fact that I was running two other projects simultaneously.

In this post, I will share five key takeaways from the project. Since this was my first LLM project, much of what I share will be familiar ground for more experienced practitioners in the sector. I am sharing as I learn rather than waiting to become a subject matter expert first.

LLM projects are unlike your typical software projects

When executing a software project like our work on the Africa Climate Summit mobile app, the approach and the outputs are predictable if you have prior experience building software products. For example, for a product like Instagram or eCitizen, the teams behind them use different UI screens to guide users through the product to the intended outcome, with visual cues and language communicating progress at each step. Now compare this to a product like ChatGPT or Bard, which has a very different interaction model: a conversational AI interface (commonly called a chatbot). In an LLM-based project, the heavy lifting happens in the backend, while the user sees only a chat interface with limited options, including limited control over the responses they get. You can't execute these two kinds of projects using the same design and development principles.

I have three tips for saving time and money:

  • Understand your users' needs and expected outputs before starting any project, especially an LLM project. Building and maintaining an LLM project is costly, so making changes later is expensive. This is compounded by the conversational AI interface: you have fewer UI screens and less copy to guide users.
  • Invest a lot of time understanding how 'offline' processes will translate to your AI engine. For example, in our project, we scored projects in the climate space based on their environmental impact, job creation, and economic impact, and we had to translate this into a rubric the engine could apply (see the sketch after this list). Your LLM can only be as accurate as your understanding of the decision rules and your translation of them into your engine.
  • Treat your LLM projects like research projects. Everyone on your team should embrace a culture of research, even for simple decisions. LLM talent is still developing, and changes later on are expensive. On our project, we moved from implementing quickly to documenting the problem and the proposed approach first. This reduced back-and-forth and improved input. Keep research outputs short and to the point.
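
As an illustration of translating an 'offline' scoring process into something the engine can apply, here is a minimal sketch of a rubric-driven prompt. The criteria come from our project, but the weights, scale, and wording are hypothetical placeholders.

```python
# Hypothetical rubric: the criteria are from our project; weights and scale are illustrative.
RUBRIC = {
    "environmental impact": 0.4,
    "job creation": 0.3,
    "economic impact": 0.3,
}

def build_scoring_prompt(project_description: str) -> str:
    """Build a prompt asking the model to score a project against the rubric."""
    criteria = "\n".join(
        f"- {name} (weight {weight}): score from 1 (poor) to 5 (excellent)"
        for name, weight in RUBRIC.items()
    )
    return (
        "You are scoring climate-sector projects against a fixed rubric.\n"
        f"Criteria:\n{criteria}\n\n"
        "Return one line per criterion as 'criterion: score - one-sentence rationale', "
        "then a weighted total.\n\n"
        f"Project description:\n{project_description}"
    )

print(build_scoring_prompt("A solar mini-grid serving 2,000 rural households."))
```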

Choose your model carefully

GPT-3.5 is one of the best-known and most advanced LLMs. In theory, it is an easy choice for your project. Not so fast, though; the decision is more nuanced than it looks. The needs of your project and the talent you have access to are critical when choosing a model.

Let us start with your needs. While general-purpose LLMs like OpenAI's GPT-4, Meta AI's LLaMA 2, or Google's PaLM 2 claim to be powerful (they are), they are not a good fit for every project. For example, if you want to train your model for specific use cases, these models are not ideal. Instead, smaller models, commonly called SLMs, such as LLaMA, Chinchilla, BERT, and Alpaca, might be better suited. An example of a specific use case is the work we were doing: our goal was for users to get insights from the climate documents they provided without relying on the general knowledge base baked into GPT. In my opinion, smaller models excel at such tasks. We have also seen big players such as Google release smaller domain-specific models like Med-PaLM 2, a medically tuned version of PaLM.

Access to talent is one of our primary considerations any time we take on a project, and it is even more critical in a nascent industry such as LLMs, where talent naturally coalesces around a few platforms. I was surprised that I couldn't find people with industry experience on GPT-3.5 + Azure outside the team I was working with, nor could I find a community focused on this stack. When working on something new, weigh the availability of experienced talent heavily.

Cost is a significant factor. Spend time figuring this out

Based on information from different sources, it costs approximately $700,000 to run ChatGPT for a single day. The cost of building and maintaining LLMs is significantly higher than that of most, if not all, other tech products. This is a common theme in AI because of the costs of the underlying hardware and the tech skills required. If you use any of these models, you help foot that bill. We are also at a point where the technology is relatively new, so learning to optimize your costs will take trial and error. While working with the Azure stack, we received bills we didn't understand and couldn't get the support we needed to figure them out. Whatever you are building needs to make business sense, even when it solves a real problem for your users. Some of the essential costs to think about are listed below (a back-of-the-envelope sketch follows the list):

  • Initial and recurring set-up costs, such as minimum monthly bills
  • Development and testing costs — these can add up to a tidy sum as you build and validate your model
  • API requests — you will typically be charged per request
  • Request types — some services, such as semantic search, are more expensive than simple keyword search
  • Rate limits — platforms often throttle the number of requests within a given period; going past the limit costs more
  • Scaling costs — what happens when your product moves from 1,000 daily users to 10,000?
  • Retraining costs — understand how much it will cost to retrain all or a large portion of your model
  • Hosting and compute services — you will need somewhere to store all those files, and compute resources to process them
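
To make this concrete, here is a back-of-the-envelope sketch of a monthly API bill. All the numbers (prices per 1,000 tokens, tokens per request, traffic) are hypothetical placeholders; check your provider's current pricing.

```python
# Hypothetical figures -- substitute your provider's actual pricing and your own traffic.
PRICE_PER_1K_INPUT_TOKENS = 0.0015   # USD, illustrative
PRICE_PER_1K_OUTPUT_TOKENS = 0.002   # USD, illustrative

def monthly_api_cost(requests_per_day: int,
                     input_tokens: int,
                     output_tokens: int,
                     days: int = 30) -> float:
    """Estimate the monthly model-API bill for a given traffic profile."""
    cost_per_request = (
        input_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
        + output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS
    )
    return requests_per_day * days * cost_per_request

# 1,000 vs. 10,000 daily users, one request each, with a retrieved context attached.
print(monthly_api_cost(1_000, input_tokens=2_000, output_tokens=500))   # 120.0
print(monthly_api_cost(10_000, input_tokens=2_000, output_tokens=500))  # 1200.0
```

Note how a 10x jump in users translates directly into a 10x API bill before you even touch search, storage, and hosting costs.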

Cost is critical because your product must be sustainable for you to keep offering it.

To fine-tune or not

Tuning is the process of adjusting an LLM to improve its performance on a specific task. The five key approaches are pre-training, fine-tuning, in-context learning, few-shot learning, and zero-shot learning. I will not delve deeply into this topic, as it is an entire article on its own, but I found this post a good guide — LLM Tuning Made Simple: Types, Pros, Cons, and When to Use Each. We used GPT-3.5, so we skipped pre-training, as GPT is a pre-trained model. We still needed to do more for the model to suit our needs.

The recommended first approach is prompt engineering: carefully crafting or designing prompts to generate the desired responses from a language model. It involves constructing input queries or instructions that elicit specific, valuable information or responses from the model. We leveraged this approach to get better results from our model. When the results were still not as good as we hoped, we explored fine-tuning. This was not the best approach: fine-tuning is complex and expensive, and it is better suited to shaping form than adding substance. The initial cost of fine-tuning is astronomical whether you use human data annotators or AI agents, and if you are not hosting the fine-tuned LLM yourself, you will pay a very high rate per API call because specific resources are fully dedicated to your model. Fine-tuning makes sense when you need the output in a particular form, such as a sentiment analysis bot (satisfied, dissatisfied, moderate, etc.), or when you need to configure the LLM with personality details (name, quirks). Never use fine-tuning for substance (knowledge) unless a cost-benefit analysis and the client's needs heavily warrant it.

How, then, do you improve your model without fine-tuning? A combination of prompt engineering, in-context learning, and few-shot learning is your best bet. In-context learning is a technique in which the LLM uses the context provided within the input to learn and adapt to the specific task; few-shot learning lets the LLM adapt to a specific task from just a few examples.
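
Here is a minimal sketch of few-shot prompting, using the sentiment-format example from earlier. The labels come from that scenario; the feedback texts are hypothetical, and the call assumes the 0.x openai SDK configured for Azure as in the earlier sketch.

```python
import openai  # assumes openai.api_type/base/version/key are already set for Azure

# A few labelled examples teach the model the exact output format we want.
messages = [
    {"role": "system",
     "content": "Classify customer feedback as satisfied, dissatisfied, or moderate. "
                "Reply with the label only."},
    # Few-shot examples (hypothetical feedback texts):
    {"role": "user", "content": "The support team resolved my issue in minutes!"},
    {"role": "assistant", "content": "satisfied"},
    {"role": "user", "content": "I waited two weeks and still have no answer."},
    {"role": "assistant", "content": "dissatisfied"},
    # The actual input to classify:
    {"role": "user", "content": "The product works, though setup was confusing."},
]

response = openai.ChatCompletion.create(engine="gpt-35-turbo", messages=messages)
print(response["choices"][0]["message"]["content"])  # expected: "moderate"
```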

LLMs offer great opportunities, especially for businesses

It took only five days for ChatGPT to reach a million users. Google launched Bard almost three months after ChatGPT. The number of models and datasets available on Hugging Face, a platform for the AI community founded in 2016, has grown exponentially since the generative AI buzz started, as have the number of AI startups and the rate of VC funding. Is this all hype similar to what we saw with crypto and blockchain? In a new and buzzy space, there is some truth to that, but so far we have seen many practical and highly impactful applications of LLMs. For emerging markets, my wager is that LLMs will first gain traction in business applications before consumer products. LLMs can help businesses with content generation, customer support, market research, and sentiment analysis, to name a few. The cost of developing and maintaining LLMs will be a significant hindrance, but those who identify critical problems that LLMs can solve will have a competitive advantage. For example, can LLMs help insurance companies reduce fraudulent claims? Can LLMs help businesses make sense of all the documents they have? The opportunities are endless, and I can't wait to see what builders will do in this space. I am not as optimistic about consumer-facing products yet because of the costs of running LLMs, but as the technology gets commoditized, such use cases will be unlocked too.

Are you interested in learning more about generative AI and LLMs? I found the introductory course by Google helpful on this journey — Generative AI on Google Cloud: New training content, from introductory to advanced. I have also found the AI Kenya community active and helpful.

