Two main concerns businesses have about building LLM applications are cost and time. AI implementation does require more resources compared to traditional app development, but with the right approach, building and deploying LLM applications can be fast and cost-effective.
We’ve already discussed the best strategy for implementing AI, which relies on using what already exists rather than building AI from scratch. In this article, we’ll explain how our team at Modeso creates AI-enabled apps using foundation LLMs such as OpenAI's GPT-4 and the LangChain framework, which allows us to integrate language models into our solutions.
But first, let's briefly talk about what LLMs are and how they work.
LLMs, or Large Language Models, are deep learning models trained on huge amounts of data to recognize and generate text. LLMs can be trained to perform various tasks, from text translation to code writing. Users interact with language models through text messages, also known as prompts. Here’s what happens to a prompt inside an LLM.
Before an LLM can process a prompt (text), the text is broken into smaller subword units called tokens. These tokens are sent to the LLM, which operates as a “black box”, hiding the logic behind its decisions.
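To make this concrete, here’s a rough sketch of tokenization using OpenAI’s open-source tiktoken library; the encoding name is an assumption based on GPT-4-era models, and other LLMs use different tokenizers.

```python
# Tokenization sketch: split text into subword tokens with tiktoken.
# "cl100k_base" is the encoding used by GPT-4-era OpenAI models; other
# models and providers use different tokenizers.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

text = "LLMs break text into tokens before processing it."
token_ids = encoding.encode(text)                    # integer token IDs
tokens = [encoding.decode([t]) for t in token_ids]   # readable subword pieces

print(f"{len(token_ids)} tokens: {tokens}")
```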
Some examples of LLMs include GPT-4, Claude 2, Gemini, Jurassic-1, Command, LLaMA 3, and more.
Let’s take Gemini for instance. It’s a suite of LLMs by Google. Initially, these models were pre-trained on extensive text data to understand language patterns. To enhance their capabilities, Gemini was fine-tuned with additional multimodal data, allowing it to process and respond to various types of content, such as text and images.
One more example is OpenAI. OpenAI’s models were trained on publicly available data, data licensed from third parties, and data provided by humans, but these models do not have access to the original datasets. Instead, they have learned patterns, structures, and associations between words and phrases to generate responses. Think of it as a person who writes an essay based on what they have learned, rather than quoting a book word for word.
Although trained with vast amounts of high-quality data, LLMs might still provide unpredictable or irrelevant answers. Here are some challenges to consider when integrating LLMs into your application.
If you look at some popular examples of LLM-based solutions like ChatGPT, you will notice a small disclaimer at the bottom saying “ChatGPT can make mistakes. Consider checking important information.”
LLMs are designed to generate an output no matter what. So if, for example, the language model doesn’t have enough context to give the user an accurate answer, it will still produce an output, even if it doesn’t match the user’s intent.
The most common mistakes an LLM can make when generating outputs include answers that sound confident but are inaccurate, responses based on outdated information, and outputs that are irrelevant or miss the user’s intent.
To address some of these challenges, developers employ Retrieval-Augmented Generation (RAG). At Modeso, we have hands-on experience implementing RAG. Here’s why we believe this technique is a must-have for developing LLM-based solutions.
RAG extends the already powerful capabilities of LLMs by giving them new information from external sources. Combined with the model’s knowledge, the new user-specific data acquired through RAG is used to create more up-to-date and contextually relevant output.
RAG allows developers to connect the LLM with external data sources, including internal company databases, multimedia repositories, and more. However, to ensure the LLM can interpret data from these external sources, it must first be converted into a unified format. Here’s how it’s done with RAG.
A typical RAG workflow includes three stages, namely indexing, retrieval, and generation. Let’s take a closer look at each stage.
First, the data is broken into smaller pieces called chunks. These chunks can be sentences, paragraphs, or any logical data segments. Each chunk is converted into an embedding and stored in a database. This way, the system can search through its knowledge base for relevant information and generate more accurate responses.
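To illustrate, here’s a minimal sketch of the indexing stage using LangChain with the open-source Chroma vector store. Import paths follow recent LangChain releases and may differ in older versions; the source file name, chunk size, and storage directory are placeholders, not recommendations.

```python
# Indexing sketch: load data, split it into chunks, embed the chunks,
# and store the embeddings in a vector database.
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings   # requires OPENAI_API_KEY
from langchain_chroma import Chroma

docs = TextLoader("company_handbook.txt").load()     # hypothetical data source

# 1. Break the data into chunks (~500 characters with a small overlap).
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# 2. Convert each chunk into an embedding and store it in Chroma.
vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(),
    persist_directory="./rag_index",
)
```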
Once a user enters a prompt, the prompt is also converted into an embedding. The system compares the prompt embedding with the embeddings in the database to retrieve the chunks that are most relevant to the user’s query.
The chunks corresponding to the most relevant embeddings are combined with the original prompt and sent to the LLM. The model processes this augmented input and generates more context-appropriate output.
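Continuing the indexing sketch above, the retrieval and generation stages might look like this; the question, prompt wording, number of retrieved chunks, and model name are illustrative.

```python
# Retrieval + generation sketch: embed the user's question, fetch the most
# similar chunks, and pass them to the LLM together with the original question.
# `vector_store` is the Chroma instance built in the indexing sketch.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

question = "What is our refund policy?"              # example user prompt

# Retrieval: the vector store embeds the question and returns similar chunks.
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
relevant_chunks = retriever.invoke(question)
context = "\n\n".join(doc.page_content for doc in relevant_chunks)

# Generation: the retrieved context augments the original question.
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o")                     # any chat model works here
answer = llm.invoke(prompt.format_messages(context=context, question=question))
print(answer.content)
```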
In this way, RAG selects only the information relevant to the query, so the system is not overloaded with unnecessary data. This also reduces the number of tokens the LLM needs to process, which lowers costs.
Developers often implement RAG systems when building LLM-based apps. For streamlined implementation, we use the LangChain framework, a suite of tools designed to simplify the development process.
LangChain is an open-source framework that simplifies the process of developing LLM-based applications. It provides a comprehensive set of RAG building blocks, like document loaders, embedding models, and retrievers, making the implementation of RAG systems faster and simpler for developers.
When developing AI-powered apps, three main components of LangChain are crucial – Model I/O, Retrieval, and Composition. Although it’s possible to manually write the necessary code, these tools simplify development and significantly reduce the workload. Let’s review these LangChain components in detail.
Model I/O is a set of tools designed to format and manage language model input and output. The toolset helps translate the application’s technical elements, like variables and data structures, into a form the LLM can understand. After processing, it receives the output from the LLM and reformats it to fit the application’s requirements.
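As a rough sketch of what Model I/O looks like in practice, the example below formats application variables with a prompt template and parses the model’s reply back into a plain string; the translation task and model name are illustrative.

```python
# Model I/O sketch: a prompt template formats application variables for the
# model, and an output parser turns the reply back into a plain Python value.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Translate the following text to {language}:\n\n{text}"
)
llm = ChatOpenAI(model="gpt-4o")
parser = StrOutputParser()                     # extracts the text of the reply

messages = prompt.format_messages(language="German", text="Good morning!")
result = parser.invoke(llm.invoke(messages))   # a plain string
print(result)
```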
Another key component is the retrieval mechanism. It helps fetch relevant information from a knowledge base to provide additional context for the LLM.
With RAG, data from various sources is transformed into a numerical format (embeddings), and so is the user prompt. Next, the prompt embedding is compared against the embeddings in the data store to identify the most relevant information, which is then used to augment the prompt before it is passed to the model.
Once the user prompt is extended with relevant pieces retrieved from the data store, it’s sent to the LLM. The final output is then generated based on this augmented input and the knowledge the LLM gained during training.
Composition in LangChain offers tools to connect external systems, like databases and APIs, with LangChain core primitives, thereby enhancing the functionality, flexibility, and efficiency of the RAG pipeline.
To understand how composition works, here’s an example. Suppose a user asks how much one plus one is. The AI identifies the parameters from the user’s prompt (“add one plus one”) and calls the tool that handles addition. Once the tool returns its result, the LLM receives the output and passes it back to the user.
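Here’s a hedged sketch of that flow using LangChain’s tool-calling support; the add tool and model name are made up for illustration, and the method names follow recent LangChain releases.

```python
# Composition sketch: the LLM extracts the parameters and requests a call to
# an external "add" tool instead of doing the arithmetic itself.
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def add(a: int, b: int) -> int:
    """Add two numbers and return the result."""
    return a + b

llm = ChatOpenAI(model="gpt-4o").bind_tools([add])

ai_message = llm.invoke("How much is one plus one?")
for call in ai_message.tool_calls:          # e.g. {"name": "add", "args": {"a": 1, "b": 1}, ...}
    print(add.invoke(call["args"]))         # -> 2
```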
Composition isn’t limited to math and calculations only. With it on board, LLMs can also connect with sources like Wikipedia, broadening their ability to answer questions, provide information, and engage in more complex conversations.
To give the LLM the capabilities of external tools, LangChain Composition provides three components: Agents, Tools, and Chains.
Agents are specialized components that use the LLM to decide which action to take in response to the user query and in which order. Simply put, based on the user input, an agent decides which tool to call. With LangChain, developers can build various agent types, like a tool calling agent, XML agent, structured chat agent, and more.
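As a sketch of how that looks in code, the example below builds a tool-calling agent around the same kind of addition tool; the prompt wording and model are assumptions, and the agent APIs shown here follow recent LangChain releases.

```python
# Agent sketch: the LLM acts as a reasoning engine that decides which tool to
# call, and AgentExecutor runs the chosen tool and feeds the result back.
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def add(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),   # space for the agent's tool calls
])

agent = create_tool_calling_agent(ChatOpenAI(model="gpt-4o"), [add], prompt)
executor = AgentExecutor(agent=agent, tools=[add])
print(executor.invoke({"input": "How much is one plus one?"})["output"])
```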
Tools are interfaces that allow LangChain agents and chains to interact with information outside the LLM’s training data. You can use built-in tools to connect to Google Drive, Bing Search, YouTube, and other services. LangChain also offers toolkits, collections of tools that work together to perform certain tasks.
Unlike agents, where the LLM acts as a reasoning engine that decides which actions to take, chains have their sequence of actions hardcoded. A chain is a series of automated steps, such as taking a video, converting it to a different format, adding the result to a storage system, and retrieving related information from the database to include in the final prompt.
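For contrast with the agent example above, here’s a minimal sketch of a chain whose steps are fixed in code; the summarization task is just a placeholder.

```python
# Chain sketch: the sequence of steps (prompt -> model -> parser) is hardcoded,
# not chosen by the LLM at runtime.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

summarize_chain = (
    ChatPromptTemplate.from_template("Summarize this text in one sentence:\n\n{text}")
    | ChatOpenAI(model="gpt-4o")
    | StrOutputParser()
)

print(summarize_chain.invoke({"text": "LangChain is an open-source framework ..."}))
```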
If you’re looking to build an LLM-powered application, LangChain is a way to speed up the development process. Thanks to this framework, developers can build their applications as a series of clearly defined steps, saving time and budget.
Before we move on to deployment, one more important thing should be discussed: vector databases.
As we’ve mentioned, in the context of retrieval, numerical representations of data (embeddings) are stored in specialized vector databases. These databases organize data as vectors that can be searched by similarity, allowing efficient retrieval of relevant information.
One of the most popular vector databases is Pinecone. It provides efficient and scalable storage for vector embeddings and offers three plans to meet various needs: a Starter plan (free, for beginners or small-scale projects), a Standard plan (for production applications at any scale), and an Enterprise plan (for mission-critical production applications). You can find pricing details for the Standard and Enterprise plans on Pinecone’s website.
You can also use free alternatives, like Chroma. It’s an open-source vector database for storing and managing vector embeddings. While Pinecone is often preferred for its advanced features, Chroma is a more flexible and budget-friendly alternative worth considering for LLM app development.
Be it Chroma, Pinecone, or any other database, LangChain offers a flexible interface to seamlessly switch between vector databases. You could start with a free option and upgrade later, or switch between services without rewriting your entire codebase.
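As a hedged sketch of what that flexibility looks like, switching backends mostly comes down to changing one construction call; package names follow recent LangChain releases, the Pinecone index name is made up, and `chunks` stands for the document chunks produced during indexing.

```python
# Swapping vector stores: the same chunks and embeddings, two interchangeable backends.
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_pinecone import PineconeVectorStore   # requires PINECONE_API_KEY

embeddings = OpenAIEmbeddings()

# Option 1: free, local Chroma.
vector_store = Chroma.from_documents(chunks, embeddings)

# Option 2: managed Pinecone -- only this line changes.
# vector_store = PineconeVectorStore.from_documents(chunks, embeddings, index_name="my-index")

# Everything downstream (retrievers, chains) stays the same.
retriever = vector_store.as_retriever()
```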
At this point, we’ve covered almost everything about developing LLM-based apps. The only thing left to discuss is how to deploy them to users.
There are at least two ways to deploy an LLM application:
To help you weigh the pros and cons, let’s take a look at each option.
Alongside its free tools, LangChain offers additional services called LangServe and LangSmith. LangServe is a library that makes the deployment of LLM apps straightforward: it exposes your AI components as a REST API that the rest of your application can call seamlessly. This way, LangServe turns your LLM app into an API server without requiring a separate deployment setup.
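As a rough sketch, exposing a chain with LangServe can be as small as this; the chain, route path, and module name are placeholders.

```python
# LangServe sketch: expose a LangChain chain as a REST API.
from fastapi import FastAPI
from langserve import add_routes
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

chain = (
    ChatPromptTemplate.from_template("Answer briefly: {question}")
    | ChatOpenAI(model="gpt-4o")
)

app = FastAPI(title="LLM app")
add_routes(app, chain, path="/answer")   # adds /answer/invoke, /answer/stream, etc.

# Run locally with: uvicorn server:app --reload  (assuming this file is server.py)
```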
LangSmith is an all-in-one platform for developing, deploying, and monitoring LLM-based applications. This tool provides a monitoring dashboard to track what’s happening in your application, including consumption metrics for various tools and usage patterns, helping developers identify areas for improvement.
LangSmith is a paid service, offering pricing plans for different teams.
The second option is to create a Python backend application and deploy it on the Google Cloud Platform (GCP). At Modeso, we use GCP for most applications, which allows us to manage packaging, deployment, and monitoring within the cloud ecosystem.
While the custom approach lacks LangSmith’s specialized monitoring tools, it gives you full control over deployment and customization.
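For comparison, a custom backend can be a plain FastAPI service that calls the chain directly and is then containerized for GCP (for example, on Cloud Run); all names here are illustrative.

```python
# Custom backend sketch: a FastAPI endpoint that invokes the chain directly.
from fastapi import FastAPI
from pydantic import BaseModel
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

chain = (
    ChatPromptTemplate.from_template("Answer briefly: {question}")
    | ChatOpenAI(model="gpt-4o")
)

app = FastAPI()

class Query(BaseModel):
    question: str

@app.post("/answer")
def answer(query: Query) -> dict:
    result = chain.invoke({"question": query.question})
    return {"answer": result.content}
```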
The choice of a deployment option depends on your project needs. Using LangServe/LangSmith involves additional costs, but you will get the functionality to streamline the development process. On the other hand, if you don’t actually need this functionality right now, why pay for it? You can always start with a custom Python backend and later transition to LangServe/LangSmith if these tools become necessary.
This was a short overview of how we build context-aware LLM applications that dynamically respond to user prompts. Want to build an AI-based project? Let’s discuss our next steps.