Enterprises are good at managing structured data: time-series transactional data, telemetry, logs, and records. This data is cleaned, transformed, and loaded for analytical use cases, and dashboards and reports are built on top of it to guide the organization's strategic next moves. When it comes to unstructured data, however, there is no comparable standard solution. Yet unstructured data represents roughly 90% of the data an organization generates, according to a study by IDC, and it is where the deep expertise and knowledge of everyone in the company lies hidden.
Is there a way to then access and democratize this knowledge?
Enter RAG platforms
RAG is short for Retrieval Augmented Generation, and the key to RAG is the Retrieval part of the name. Before looking at how retrieval works, we need to understand why RAG is a solution to the unstructured data problem. Tools like ChatGPT, Claude, and Gemini are based on large language models (LLMs) trained on internet-scale data. An LLM carries the historical context of everything it was trained on, the facts that were publicly available at training time, and some fine-tuning to make sure it behaves in helpful, non-destructive ways. But one thing these LLMs cannot access is the enterprise knowledge that was never part of their training data. You might wonder whether training a new LLM on the enterprise data would solve the problem. It would, but training LLMs is cost prohibitive, especially when enterprise data changes daily. These tools can also accept unstructured data such as PDFs and images and answer based on those files, but LLMs have limited context windows: they can only take in a limited amount of text as input. Another well-known issue is that LLMs hallucinate (it is intrinsic to how they work), which becomes a real problem in enterprise use cases.
This is where RAG comes in. RAG uses a general-purpose LLM to reason about and reply to queries, but adds a system that retrieves the context relevant to each query and supplies it to the LLM. A good analogy is that of a librarian:
• A student approaches a librarian with a research topic (query).
• The librarian finds books, journals, and articles relevant to the topic (retrieval).
• The librarian then helps the student synthesize this information into a well-informed, accurate research paper (augmented generation).
In this analogy, the librarian does not need to know the contents of the entire library; she just needs to know how the books are organized so she can retrieve the right ones for the request.
Unstructured data in the enterprise first needs to be stored in a way that it can easily be retrieved. The process starts with ingesting the data from its source. Unstructured data usually means PDFs, images, videos, and audio content, so ingestion involves extracting language from all of these file formats: text, images, and tables from PDFs; text summaries of images; transcribed text from audio and video. This is known as metadata extraction. Once the metadata is extracted, it is contextualized.
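As a rough illustration, the extraction step for PDFs might look like the short Python sketch below. It assumes the open-source pypdf library and a hypothetical local folder of documents; images, audio, and video would instead go through OCR or transcription models.

from pathlib import Path
from pypdf import PdfReader

def extract_pdf_text(folder: str) -> dict[str, str]:
    # Return a mapping of file name -> extracted raw text for every PDF in the folder.
    documents = {}
    for pdf_path in Path(folder).glob("*.pdf"):
        reader = PdfReader(str(pdf_path))
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        documents[pdf_path.name] = text
    return documents

docs = extract_pdf_text("./enterprise_docs")  # "./enterprise_docs" is a hypothetical folder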
The contextualization step involves chunking the text and converting it to vectors using embedding models. Chunking splits large documents into smaller, manageable pieces so they can be retrieved individually as context. Embedding models are trained on natural language and capture the semantic meaning of the text they are given; their output is a vector, which is simply a numerical representation of a chunk of text. For example, the sentence 'Books on AI are available in the technology section of the library', when passed to an embedding model, becomes a vector that captures the semantic meaning of the sentence.
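A minimal sketch of chunking and embedding, assuming the sentence-transformers package and one of its small general-purpose models (any embedding model would play the same role):

from sentence_transformers import SentenceTransformer

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Naive fixed-size character chunking with a small overlap between chunks.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")  # one common choice of embedding model

sentence = "Books on AI are available in the technology section of the library"
vector = model.encode([sentence])[0]  # a dense numeric vector (384 floats for this model)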
The output vectors from the embedding model are then stored in a vector database. Think of it as a repository of all the knowledge in your enterprise, stored in a way that can be easily retrieved. Vector databases are built so that similar vectors sit close together and dissimilar vectors sit far apart. Going back to the library analogy, thriller books are shelved next to mystery books, while business books are shelved farther away.
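To make the idea concrete, here is a toy in-memory stand-in for a vector database, using cosine similarity so that similar chunks really do end up close together. A production deployment would use a dedicated vector database (Milvus, pgvector, Pinecone, and the like) instead.

import numpy as np

class ToyVectorStore:
    # In-memory stand-in for a vector database, for illustration only.
    def __init__(self):
        self.vectors: list[np.ndarray] = []
        self.chunks: list[str] = []

    def add(self, vector: np.ndarray, chunk: str) -> None:
        # Store a unit-normalized vector alongside its source chunk.
        self.vectors.append(np.asarray(vector) / np.linalg.norm(vector))
        self.chunks.append(chunk)

    def search(self, query_vector: np.ndarray, k: int = 3) -> list[str]:
        # Return the k chunks whose vectors are most similar to the query vector.
        q = np.asarray(query_vector) / np.linalg.norm(query_vector)
        scores = np.array([v @ q for v in self.vectors])  # cosine similarity
        top = scores.argsort()[::-1][:k]
        return [self.chunks[i] for i in top]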
Once the enterprise data is available in the vector database, we are ready to build the RAG application on top of it. The first and most crucial step in RAG is retrieving the chunks relevant to the query. Suppose all the books in a library have been ingested and chunked, and the vectors from the embedding model stored in a vector database. If a user now asks 'What are the best thriller books in this library?', we first pass this query to the same embedding model to generate a vector. That vector is then fed to the vector database to perform a similarity search and retrieve the relevant chunks. Because the vector database organizes information by the underlying meaning of the chunks, this query might return chunks from books considered 'best' sellers in the 'thriller' genre. Once the chunks are retrieved, we pass them as context to the base LLM to generate the response to the user's query.
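Continuing the toy snippets above (the SentenceTransformer model and ToyVectorStore are assumed to be the ones defined earlier, with the store already populated), retrieval is only a few lines:

query = "What are the best thriller books in this library?"
query_vector = model.encode([query])[0]            # embed the query with the same model
relevant_chunks = store.search(query_vector, k=5)  # store is a populated ToyVectorStore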
The 'Augmented Generation' step of the RAG system takes the chunks retrieved from the vector database and provides them as context to the LLM. A typical prompt to the LLM would be:
"Answer the question based on the context. If the context is not relevant, say 'I don't know'.
Question: <User Query>
Context: <Retrieved Chunks>"
The LLM then reads the question, looks at the retrieved chunks, and generates a concise answer grounded in the relevant context it has. And if the retrieved context is not relevant to the question, such as 'What is the weather in San Francisco?' when the vector database holds no such information, the system responds with 'I don't know', reducing hallucinations and made-up answers.
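Assembled in code, that prompt might look like the sketch below; call_llm is a hypothetical placeholder for whichever chat-completion API or locally hosted model the enterprise uses:

PROMPT_TEMPLATE = """Answer the question based on the context.
If the context is not relevant, say 'I don't know'.

Question: {question}

Context:
{context}
"""

def answer(question: str, retrieved_chunks: list[str]) -> str:
    # Join the retrieved chunks into a single context block and hand it to the LLM.
    prompt = PROMPT_TEMPLATE.format(
        question=question,
        context="\n---\n".join(retrieved_chunks),
    )
    return call_llm(prompt)  # call_llm is a placeholder, not a real library function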
This is an overview of how a basic RAG system works. To summarize:
• Unstructured data and knowledge are ingested, chunked, converted to vectors, and stored in vector databases
• User asks a query
• The query is converted to a vector and compared against the vectors in the database
• Similar chunks are returned as context
• The query along with the context is provided to an LLM
• The LLM responds to the query based on the retrieved context, hence Retrieval Augmented Generation
Now that we understand why RAG can solve the unstructured data problem and how RAG works, we need to understand the challenges we will face when integrating it into existing business processes in the enterprise.
Security and Privacy: Enterprise data must be highly secure and is bound by data governance and data security policies. This requires the RAG system to be deployed within the enterprise's existing infrastructure and to connect to data sources securely. A further consideration is using local LLMs for generation and local embedding models for contextualization, so that confidential enterprise data is never sent to third-party cloud LLMs as context, which could breach data security policies. Masking PII before it is stored in the vector database also helps ensure that highly confidential data is not disclosed in generated responses. The final consideration is restricting access to specific datasets to specific users, in line with any data access policies in the enterprise.
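As a small illustration of the PII-masking consideration, text can be scrubbed before it ever reaches the vector database. The regular expressions below are a simplified sketch, not a complete PII solution; a production system would use a dedicated PII-detection service.

import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def mask_pii(text: str) -> str:
    # Replace each detected PII value with a generic placeholder label.
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(mask_pii("Contact Jane at jane.doe@corp.com or 604-555-1234"))
# -> Contact Jane at [EMAIL] or [PHONE]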
Scalability: Enterprise data is large in volume, so the system must be built to scale. Ingestion jobs need to be managed in a distributed, parallelized way to improve performance, and embedding model inference also needs to be served in a distributed way to speed up ingestion. When storing many vectors, the vector database must be partitioned and indexed efficiently to maintain database performance and retrieval speed. Once many users across the organization rely on the RAG system, reliability and responsiveness become key factors, which requires continuous monitoring of system stability and deploying the right resources to keep the platform scalable.
Accuracy: One of the most important things to acknowledge about generative AI is its non-deterministic nature: we can never fully predict how a model will behave or what answers it will generate. It therefore becomes especially important to build a system that is accurate and to monitor that accuracy continuously. RAG evaluation frameworks use an LLM to generate question-answer pairs from the ingested data, then use other LLMs to judge whether the RAG system's response is relevant to each query. This agentic, LLM-as-judge approach scores how many answers the RAG system got right and shows how well it is working. Tracing tools can link an inaccurate answer back to its source, which helps continuously improve the system. Various parameters in the RAG pipeline affect its accuracy, so it is important to experiment with them and test thoroughly before deploying to production.
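One common shape for such an evaluation loop is sketched below; rag_answer and judge_llm are hypothetical placeholders for the deployed RAG pipeline and a separate judging model.

JUDGE_PROMPT = """You are grading a RAG system.
Question: {question}
Reference answer: {reference}
System answer: {answer}
Reply with 1 if the system answer is correct and grounded, otherwise 0."""

def evaluate(qa_pairs: list[tuple[str, str]]) -> float:
    # Fraction of generated answers that the judge model marks as correct.
    scores = []
    for question, reference in qa_pairs:
        answer = rag_answer(question)  # rag_answer is the RAG system under test (placeholder)
        verdict = judge_llm(JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer))
        scores.append(1 if verdict.strip() == "1" else 0)
    return sum(scores) / len(scores)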
User Experience: Once you have deployed a system that has ingested all your unstructured data into vector databases, you also need to think about how it will serve specific use cases in the enterprise. RAG systems typically offer a chat interface that end users interact with as a copilot. AI agents are another interesting way of connecting your RAG system to multi-step process automation. Generation can also be used to summarize or create documents and reports across the organization. Modern advances in speech-to-speech models let you interact with your enterprise data purely by voice, and when multimodal LLMs are used in a RAG system, you can also show it images and ask questions about them. Augmented-reality tools can use a live video stream to recognize what you are seeing and guide your next steps, grounded in your enterprise data.
Up-to-date context: With unstructured data in the enterprise, change is constant: new reports, documents, and communications are generated daily. The RAG system's context in the vector database needs to be updated at a similar frequency, so workflows and pipelines that continuously ingest new data as it lands become important wherever decisions depend on the latest context. In such near-real-time use cases, the retrieval accuracy of the RAG system can also drift over time, so it is crucial to monitor it constantly and make configuration updates to keep accuracy in check. Giving the RAG system tools to search the web and call APIs can also provide access to context stored in other systems, which can be useful for answering certain questions.
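A skeletal version of such a refresh pipeline, assuming a hypothetical embed_and_upsert helper that re-extracts, chunks, embeds, and upserts a changed document into the vector database:

from pathlib import Path

def refresh_index(folder: str, last_indexed: dict[str, float]) -> None:
    # Re-ingest only the documents that changed since the last run.
    for path in Path(folder).glob("*.pdf"):
        mtime = path.stat().st_mtime
        if last_indexed.get(path.name, 0.0) < mtime:
            embed_and_upsert(path)  # placeholder for the full ingestion pipeline
            last_indexed[path.name] = mtime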
ThoughtsWin Systems leads the way in transformative technology solutions. Our proficiency in Data & Cloud Engineering, AI & Analytics, Modernization & Migration to Cloud, and Strategy & Governance powers business innovation.
© 2025 ThoughtsWin Systems Inc. All rights reserved.