Building a Local Microeconomics Chatbot with RAG and Gemma 3

This tutorial demonstrates how to build a 100% local microeconomics chatbot using Google's open-source Gemma 3 model and Retrieval-Augmented Generation (RAG). By leveraging Kolosal AI for local inference and BM25 for document retrieval, users can create a private, cost-effective AI assistant that provides context-aware answers based on a local economics knowledge base.

Large Language Models (LLMs) like GPT-3 or ChatGPT are powerful, but running them through cloud APIs can raise concerns around data privacy and cost. Local LLMs address this by allowing you to run models entirely on your own hardware, with no internet required. For example, Google’s new open-source model Gemma 3 can handle text and images with a 128K context window in its larger variants, yet its smallest version is only ~529 MB — small enough to run on a typical laptop. This means you can have a highly capable AI model on your PC without relying on any external service.
One major benefit of keeping everything local is data privacy. If you’re dealing with sensitive or proprietary information, sending data to a cloud API might be unacceptable. In fact, one developer faced a situation working with classified documents where using an external API wasn’t an option. By running the LLM app locally with a tool like Kolosal AI, all data stays on the machine — no external dependencies, no risk of leaks. You get full control over your AI assistant and its data.
However, even a powerful model like Gemma 3 has limits. LLMs are trained on fixed datasets (so they might not know recent or niche info), and they sometimes hallucinate incorrect answers. This is where Retrieval-Augmented Generation (RAG) comes in. RAG is an approach that augments the LLM with a retrieval step: the model fetches relevant information from an external knowledge base (your documents) and uses it to craft more accurate answers. In essence, we supply the model with up-to-date, specific context so it doesn’t rely solely on its internal memory.
In this tutorial, we’ll combine these ideas to build a microeconomics Q&A chatbot that runs 100% locally. We’ll use Gemma 3 (running via Kolosal AI on our machine) as the brain of the chatbot, and implement a RAG pipeline for retrieval. The chatbot will be able to answer microeconomics questions by looking up information from a local knowledge base of economics content and then generating answers.
We’ll cover everything step-by-step in a beginner-friendly way:
  1. Setting up the environment: installing Kolosal AI and downloading Gemma 3 for local use.
  2. Walking through the chatbot’s architecture: how documents are loaded and indexed, how we retrieve relevant info using a BM25 algorithm, and how Gemma 3 generates queries and answers.
  3. Showing code snippets (Python and Docker) to illustrate how it’s implemented.
  4. Finally, running the chatbot in a local web app (Streamlit) via Docker, and demonstrating it in action.
By the end, you’ll have a working local AI chatbot that can answer microeconomics questions with the help of your custom data. Let’s dive in!

Setting Up Kolosal AI and Gemma 3 Locally

Our first task is to get Gemma 3 up and running on your computer. We’ll use Kolosal AI, a lightweight open-source application that makes it easy to run local AI models offline. Kolosal AI provides an intuitive interface to download models and even offers an API mode so other programs (like our chatbot) can use the model.
Follow these steps to install Kolosal AI and obtain the Gemma 3 model:
  1. Download Kolosal AI: Go to the official Kolosal AI website and download the installer for Windows (an .exe file). The download is only ~20 MB, and Kolosal supports both CPU (AVX2) and GPU acceleration out of the box.
  2. Install Kolosal AI: Run the KolosalAI_Installer.exe you downloaded. You might see a Windows SmartScreen prompt — choose “More info” and “Run anyway” (Kolosal AI is open-source and safe).
  3. Launch Kolosal AI: After installation, open Kolosal AI from the Start menu (or your Desktop shortcut). You’ll be greeted by a clean, developer-friendly interface. On first launch, no model is loaded yet. We’ll fix that by downloading Gemma 3.
  4. Download the Gemma 3 Model: Kolosal AI has a built-in Model Manager for fetching new models. Open it, look for Gemma 3, and download the 1B variant (the ~529 MB model mentioned earlier).
  5. Load the Gemma 3 Model: In Kolosal’s interface, select the Gemma 3 1B model to load it. The app will initialize the model in memory — after a few seconds it should indicate that Gemma 3 is loaded and ready. You can now try a quick test in the Kolosal chat UI: for example, ask “What’s 2+2?” or any simple query and see Gemma 3 respond. 🎉 Congrats! You now have Gemma 3 running locally on your PC.
  6. Enable API Access (Kolosal Server Mode): Our chatbot app will need to send requests to Gemma 3. Kolosal AI version 0.1.7 introduced a “server” feature that exposes an OpenAI-compatible API endpoint locally. In Kolosal, go to the settings or model menu and look for an option to enable the API server. Turn it on, and note the default port (by default Kolosal uses port 8080 for its API). Once enabled, Kolosal will listen at http://localhost:8080/v1 for API calls. We’ll use this to connect our Python app to Gemma 3. (If you don’t see such an option, ensure you have the latest Kolosal AI version. Alternatively, Kolosal might auto-run the API when the model is loaded.)
Note: The Kolosal API is OpenAI-compatible, meaning we can use the standard OpenAI Python library to send requests to our local Gemma 3 as if it were OpenAI. We’ll configure the library to use http://localhost:8080/v1 as the base URL and a dummy API key. This is exactly what our chatbot code will do.
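If you want to confirm the endpoint works before building the app, a quick smoke test along these lines should do it. This is a sketch under the assumptions we just described: the server listens on port 8080, accepts any API key, and the placeholder model name "kolosal" is used, as in the chatbot code later on.
import openai

# Minimal smoke test for the local Kolosal endpoint (assumes port 8080 and a dummy key)
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="sk-dummy")

response = client.chat.completions.create(
    model="kolosal",  # placeholder model name, same convention as the chatbot code below
    messages=[{"role": "user", "content": "What's 2+2?"}],
    max_tokens=32,
)
print(response.choices[0].message.content)
If this prints a sensible answer, the local backend is ready for the chatbot.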
With Kolosal AI running Gemma 3 and the API enabled, we have our local LLM backend ready. Now, let’s build the retrieval-augmented chatbot that will utilize it.

Chatbot Architecture Overview

Our chatbot uses a classic Retrieval-Augmented Generation (RAG) pipeline. In simple terms, it will retrieve relevant information from a set of documents (our microeconomics knowledge base) based on the user’s question, and generate an answer using the LLM (Gemma 3) with that information as context.
The architecture consists of a few components working in sequence: the user’s question is turned into a search query, relevant documents are retrieved, and Gemma 3 generates an answer from that context.
Let’s break down the components of our system:
  1. Knowledge Base: A collection of microeconomics documents (text files) that contain facts and explanations (e.g. supply and demand, elasticity, market structures).
  2. Document Loader: Code that loads these documents and prepares them for search.
  3. BM25 Retriever: A retrieval system that finds documents relevant to the query. BM25 is a traditional algorithm that scores documents based on keyword matches.
  4. Query Generator: (Optional step) The chatbot uses Gemma 3 itself to refine the user’s question into an optimized search query (for better retrieval results).
  5. Answer Generator: Finally, Gemma 3 is prompted with the user’s question plus the retrieved document contents, so it can synthesize a detailed answer.
Everything besides the LLM runs in Python via the Streamlit app, which also provides the web interface for chatting. Streamlit makes it easy to create a chat UI with just a few lines of code.

Document Loading and Knowledge Base

First, we need to load our microeconomics documents. In our project repository, we have a folder parsed_documents/ containing several markdown files (e.g. supply_demand.md, elasticity.md, monopoly.md, etc.), each covering a topic in microeconomics. We also have a knowledge.json index file that lists these documents along with a pre-computed summary of each.
The code below (from main.py) shows how we load the knowledge base at startup:
import json
from llama_index.core import Document

# Load the documents from a JSON index
with open("knowledge.json", "r") as f:
    knowledge_data = json.load(f)

documents = []
for item in knowledge_data:
    # Read the full text of each document
    with open(item["filename"], "r") as f:
        full_text = f.read()
    # Create a Document object (from llama_index) with a summary and the full text as metadata
    doc = Document(
        text=item["summary"],
        metadata={"filename": item["filename"], "text": full_text}
    )
    documents.append(doc)
Let’s unpack this: The knowledge.json file contains entries like {"filename": "parsed_documents/elasticity.md", "summary": "Elasticity measures how responsive quantity… (etc)"} for each document. We iterate through each entry.
For each document, we open the file and read its content into full_text. We then create a Document object (using the LlamaIndex library’s data structures). We store the short summary as the text of the Document (this will be used for search) and put the full text into the metadata. We also keep the filename in metadata for reference.
All Document objects are collected into the list of documents. At the end of this, the documents represent our knowledge base, ready to be searched. Each Document holds a concise summary (to make retrieval efficient) and the original text (to provide detail when formulating answers).
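As a quick sanity check (not part of the repository code), you can print what was loaded to confirm every file was found and summarized as expected:
# Optional sanity check: list each loaded document and the start of its summary
for doc in documents:
    print(doc.metadata["filename"], "->", doc.text[:60])
print(f"Loaded {len(documents)} documents")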

BM25 Retrieval of Relevant Context

With the knowledge base in memory, we need a way to find which pieces of information are relevant to a user’s query. We use a BM25 retriever for this. BM25 is a scoring algorithm from the information retrieval world (commonly used in search engines) that ranks documents based on term frequency and inverse document frequency. In short, it finds documents that contain the user’s query terms (or similar terms) and ranks them by relevance.
We utilize the BM25 implementation from the llama-index library (installed via llama-index-retrievers-bm25). Setting it up is straightforward:
from llama_index.retrievers.bm25 import BM25Retriever

# Create a BM25 retriever to get the top 3 relevant documents
retriever = BM25Retriever.from_defaults(nodes=documents, similarity_top_k=3)
This initializes the retriever with our list of Document nodes. We’ve asked for the top 3 matches (similarity_top_k=3) to be returned for each query. The retriever will use the text field of each Document (which we set to the summary) to do the matching. Three documents should be enough context for a single question in our case.
When a user asks a question, we will call retriever.retrieve(query) to get the top 3 documents most related to that query. Those documents (or their content) will then be passed to Gemma 3 to help formulate the answer.
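To get a feel for what the retriever returns, you can run a standalone query (the question here is just an example). Each result carries the metadata we stored on the Document plus a BM25 relevance score:
# Try a standalone BM25 query and inspect the scored results
results = retriever.retrieve("price elasticity of demand")
for node in results:
    # Each result carries the metadata we stored, plus a BM25 relevance score
    print(node.metadata["filename"], "score:", node.score)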

LLM-Powered Query Generation

Before we retrieve documents, our system does an interesting extra step: it uses the LLM to generate a better search query from the user’s input. This is an advanced (but insightful) technique to improve retrieval results. The idea is that users might ask questions in a casual way, but we can prompt the LLM to rephrase that question into a set of keywords or boolean expressions that are likely to match our documents well.
For example, a user might ask: “How does price elasticity affect consumer demand?” The LLM could turn this into a search query like: “price elasticity” AND “consumer demand” AND microeconomics which explicitly mentions the key concepts. This refined query can lead to more direct matches in the texts.
In our code, we have defined a search_system_prompt that instructs Gemma 3 to act as a “Search Query Generator” specialized in microeconomics. We won’t paste the whole prompt here (it’s a bit long), but in summary, it says: “Analyze the user’s chat history, extract key microeconomics concepts, and form a concise BM25 query with AND/OR, phrases, etc.” It even gives examples of turning questions into boolean keyword queries.
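For reference, a heavily abbreviated version of such a prompt might look like the sketch below. This is an illustration only, not the repository’s actual search_system_prompt or search_user_prompt:
# Illustrative only; the real search_system_prompt in the repo is longer and more detailed
search_system_prompt = (
    "You are a Search Query Generator specialized in microeconomics. "
    "Read the chat history, extract the key microeconomics concepts, and return a single "
    "concise BM25 query using AND/OR and quoted phrases. Return only the query.\n"
    'Example: "price elasticity" AND "consumer demand" AND microeconomics'
)

search_user_prompt = "Chat history:\n{chat_history}\n\nGenerate the search query."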
When the user enters a new question, we build a chat_history string (containing past conversation turns, if any) and then do:
import openai

# Initialize the OpenAI-compatible client to interact with Kolosal AI
llm = openai.OpenAI(base_url="http://host.docker.internal:8080/v1", api_key="sk-dummy")

search_query = llm.chat.completions.create(
    model="kolosal",
    messages=[
        {"role": "system", "content": search_system_prompt},
        {"role": "user", "content": search_user_prompt.format(chat_history=built_chat_history)}
    ],
    max_tokens=128
)
Here, llm is our OpenAI-compatible client pointing to the local Gemma 3. We send two messages: the system prompt (instructions) and the user prompt, which includes the chat history. Gemma 3 will then respond with a suggested search query. We use model="kolosal" just as a placeholder name (since Kolosal AI is hosting the model), and we limit max_tokens=128 because the query should be short.
We use host.docker.internal here because the app will run in a Docker container (more on that next), and that address lets the container reach the host machine. If you were running this script outside of Docker, you would use base_url="http://localhost:8080/v1". The api_key is just a dummy string — Kolosal doesn’t actually require authentication by default, but the OpenAI library expects some API key string to be set.
The result search_query contains the AI’s answer. We extract the query string via search_query.choices[0].message.content. That content might look something like:
"price elasticity" AND "consumer demand" AND microeconomics
Now we feed this string into the retriever:
retrieved_documents = retriever.retrieve(search_query.choices[0].message.content)
The BM25 retriever will find, say, the “elasticity.md” and “supply_demand.md” documents as relevant (for our example query). We then prepare the content of those docs to send to the answer generator:
built_documents = "" for doc in retrieved_documents: built_documents += f"Document name: {doc.metadata['filename']}\n" built_documents += f"{doc.metadata['text']}\n\n"
We concatenate the document names and full text (from metadata) into one big context string built_documents. This string now holds the actual content that Gemma 3 should use to answer the question.

Answer Synthesis with Gemma 3

Finally comes the answer generation. We will prompt Gemma 3 with:
  1. The conversation history (so it knows what was asked and the context of the dialog).
  2. The optimized search query it generated (for transparency, though this might not be strictly necessary).
  3. The content of the retrieved documents.
  4. An instruction to formulate a clear answer using this information.
In main.py, we prepared another system prompt answer_system_prompt that says (in essence): “You are an Answer Generator. Given the chat history, a search query, and some documents with relevant info, provide a clear and concise answer to the user’s question. Do not mention sources or the process.” We also have an answer_user_prompt template that will insert the actual history, query, and docs.
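Again purely for illustration (the repository’s prompts are more detailed), the pair could look roughly like this; note that the placeholders match the .format() call in the snippet below:
# Illustrative only; simplified versions of the answer prompts
answer_system_prompt = (
    "You are an Answer Generator for microeconomics questions. Using the chat history, "
    "the search query, and the provided documents, give a clear and concise answer. "
    "Do not mention the sources or describe the process."
)

answer_user_prompt = (
    "Chat history:\n{chat_history}\n\n"
    "Search query: {search_query}\n\n"
    "Documents:\n{documents}\n\n"
    "Answer the user's latest question."
)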
Here’s how we call the model for the final answer:
answer_messages = [
    {"role": "system", "content": answer_system_prompt},
    {"role": "user", "content": answer_user_prompt.format(
        chat_history=built_chat_history,
        search_query=search_query.choices[0].message.content,
        documents=built_documents
    )}
]

# Generate the answer (streamed)
stream = llm.chat.completions.create(
    model="kolosal",
    messages=answer_messages,
    stream=True,
    max_tokens=1024
)
We pass in the system and user message as described. We request streaming (stream=True), which in Streamlit allows us to display the answer as it’s being generated, token by token. We also allow up to 1024 tokens for the answer, so the response can be quite detailed if needed.
Using Streamlit’s chat elements, the code writes the streaming response in real-time to the app:
full_response = st.write_stream(stream)
st.session_state.messages.append({"role": "assistant", "content": full_response})
This st.write_stream will output the text as Gemma 3 generates it, and then we store the complete answer in the chat history state.
At this point, the user sees the answer in the chat interface. We also display the “Referenced Documents” (the retrieved context) in collapsible sections below the answer, so the user can expand and see which notes were used to answer their question. This is helpful for transparency and verification of the answer.
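For readers who haven’t used Streamlit’s chat elements before, the scaffolding around the pipeline looks roughly like the sketch below. It is simplified (main.py differs in details) and reuses the stream and retrieved_documents variables from the snippets above:
import streamlit as st

# Simplified sketch of the Streamlit chat scaffolding around the RAG pipeline
if "messages" not in st.session_state:
    st.session_state.messages = [{
        "role": "assistant",
        "content": "Welcome to the Microeconomics Chatbot! "
                   "Please ask me any questions you have about microeconomics."
    }]

# Replay the conversation so far
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if prompt := st.chat_input("Ask me anything about microeconomics"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)
    with st.chat_message("assistant"):
        # ... run the query generation, retrieval, and answer steps described above ...
        full_response = st.write_stream(stream)
        # Show which documents were used, in a collapsible section
        with st.expander("Referenced Documents"):
            for doc in retrieved_documents:
                st.markdown(f"**{doc.metadata['filename']}**")
                st.markdown(doc.metadata["text"])
    st.session_state.messages.append({"role": "assistant", "content": full_response})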

In Summary

  1. User enters a question in the chat box.
  2. The system (Gemma 3) reformulates it into a search query.
  3. Top 3 documents are retrieved via BM25.
  4. Those documents + original question go back to Gemma 3 to generate the final answer.
  5. The answer is streamed to the UI.
All of this happens in a few seconds on a local machine with Gemma 3 1B. Pretty neat!
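If it helps to see those five steps in one place, here is a condensed, non-streaming recap stitched together from the snippets above. Treat it as a sketch rather than the actual code: main.py interleaves these steps with the Streamlit calls shown earlier.
# Condensed recap of the pipeline, reusing llm, retriever, and the prompts defined above
def answer_question(built_chat_history: str) -> str:
    # 1) Ask Gemma 3 to turn the conversation into a BM25 search query
    query_response = llm.chat.completions.create(
        model="kolosal",
        messages=[
            {"role": "system", "content": search_system_prompt},
            {"role": "user", "content": search_user_prompt.format(chat_history=built_chat_history)},
        ],
        max_tokens=128,
    )
    search_query = query_response.choices[0].message.content

    # 2) Retrieve the top 3 documents with BM25
    retrieved_documents = retriever.retrieve(search_query)
    built_documents = "".join(
        f"Document name: {doc.metadata['filename']}\n{doc.metadata['text']}\n\n"
        for doc in retrieved_documents
    )

    # 3) Ask Gemma 3 to answer using the retrieved documents as context
    answer = llm.chat.completions.create(
        model="kolosal",
        messages=[
            {"role": "system", "content": answer_system_prompt},
            {"role": "user", "content": answer_user_prompt.format(
                chat_history=built_chat_history,
                search_query=search_query,
                documents=built_documents,
            )},
        ],
        max_tokens=1024,
    )
    return answer.choices[0].message.content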

Running it on Docker

We’ve packaged this project as a Docker container for easy setup. Docker allows us to encapsulate all the dependencies (Python, Streamlit, libraries, code) so you can run the app without manually installing Python packages on your host system.
Prerequisites: Make sure you have Docker installed on your machine. Also, ensure Kolosal AI is running Gemma 3 and the API is enabled on port 8080 (from the earlier steps).
The repository includes a Dockerfile that defines the container. In short, it uses a slim Python 3.10 base image, copies the code and requirements.txt, installs the needed packages (streamlit, openai, llama-index-retrievers-bm25), and then runs streamlit run main.py. For reference, here are the key parts of the Dockerfile:
FROM python:3.10-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8501

CMD ["streamlit", "run", "main.py", "--server.port=8501", "--server.address=0.0.0.0"]
By default, Streamlit serves the app on port 8501. We expose that and run the app.
To build and run the Docker container, follow these steps in your terminal:
  1. Clone the repository (if you haven’t already):
    git clone https://github.com/FarrelRamdhani/Microeconomic-Chatbot.git
  2. Change into the project directory:
    cd Microeconomic-Chatbot
  3. Build the Docker image:
    docker build -t microeconomic-chatbot .
    This packages the app into an image named microeconomic-chatbot. (You can choose another name/tag if you like.)
  4. Run the Docker container:
    docker run -p 8501:8501 microeconomic-chatbot
    (On Linux, also pass --add-host=host.docker.internal:host-gateway so the container can reach the Kolosal API on the host.)
We publish port 8501 so that the Streamlit app is accessible on our host. The container will start and launch the Streamlit server inside.
Open the app in your browser: Once the container is up, go to http://localhost:8501 in your web browser. You should see the Streamlit interface for the chatbot.
At first, you’ll see a welcome message in the app: “Welcome to the Microeconomics Chatbot! Please ask me any questions you have about microeconomics.” (This is set in the Streamlit app code.) There’s a text input box at the bottom that says “Ask me anything about microeconomics”.
Go ahead and ask a question! For example, you might type: “What is price elasticity of demand?” and hit Enter. The query will go through the RAG pipeline we described. After a brief moment, you should see an AI response appear in the chat.
An example Q&A interaction in the chatbot UI. The user asks about price elasticity, and the assistant (Gemma 3) responds using the retrieved knowledge on that topic.
You can continue the conversation by asking follow-up questions, or start a new topic. The app maintains the chat history in the session, so you could ask something like “Can you give an example of that?” after the elasticity explanation, and it will remember you’re still talking about elasticity.
Below each answer, you’ll see a “Referenced Documents” section. Expanding it will show which documents were retrieved for that answer, along with their content. This can help you trust but verify the chatbot’s answer against the source material.
Because everything is running locally (both the model and the app), you might be pleasantly surprised by how fast the responses are, considering Gemma 3 is a smaller model. And you never had to send your questions or data to any server — it’s all on your machine (your laptop basically became an AI server!).
In this tutorial, we built a fully local AI chatbot that leverages a Retrieval-Augmented Generation approach to answer questions on microeconomics. We went through setting up Kolosal AI and Gemma 3, indexing custom documents, and using a BM25 retriever to supply context to our local LLM. The result is a chatbot that can provide informed, context-specific answers without ever touching the internet.