I built my own LLM chat, and it’s running offline


Recently, I took on a project to build AChat, a chat application that runs on my local network and operates entirely offline. The challenge was to use a powerful AI model, Meta’s LLaMa, while keeping everything on-premise, without relying on cloud services. It turns out that when you have a GPU-equipped server at home, it’s entirely possible to pull off. Here’s how I built it, the technologies I used, and why it was such a rewarding experience.

The Architecture: AI Locally, Without Internet

At the core of AChat is Meta’s LLaMa model, which I run using llama.cpp. With GPU offloading, the model runs at full speed, and every request is processed locally on my server. This was key to keeping everything offline and independent of external services.
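To make that concrete, here is a minimal sketch of what talking to a locally running model can look like, assuming the model is exposed through llama.cpp’s bundled HTTP server (llama-server) started with its GPU-offload option. The flag names and the /completion endpoint follow llama.cpp’s documentation and may differ between versions; none of this is AChat’s actual code.

```go
// Minimal sketch: sending a prompt to a llama.cpp server on the same machine.
// Assumes llama-server was started with GPU offloading, roughly:
//   llama-server -m model.gguf --n-gpu-layers 35 --port 8080
// (flags and the /completion endpoint are taken from llama.cpp's docs
// and may vary by version).
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type completionRequest struct {
	Prompt   string `json:"prompt"`
	NPredict int    `json:"n_predict"`
}

type completionResponse struct {
	Content string `json:"content"`
}

func complete(prompt string) (string, error) {
	body, err := json.Marshal(completionRequest{Prompt: prompt, NPredict: 256})
	if err != nil {
		return "", err
	}
	resp, err := http.Post("http://localhost:8080/completion", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var out completionResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.Content, nil
}

func main() {
	answer, err := complete("Explain GPU offloading in one sentence.")
	if err != nil {
		panic(err)
	}
	fmt.Println(answer)
}
```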

To manage multiple chats and preserve context across sessions, I wrote a backend in Go, utilizing the Gin-Gonic framework. This backend handles queuing, context management, and history. It’s a pretty straightforward architecture but does the job efficiently—allowing multiple users to interact with the model, even resuming conversations from where they left off after closing the app. The model responses are fast, and the entire thing runs without hitting the internet once.
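Here is a stripped-down sketch of how that queuing and context idea could look in Go with Gin. It is not AChat’s actual code: the job and chatServer types, the /chat/:session route, and callModel (which would be replaced by the real llama.cpp call) are illustrative assumptions.

```go
// Sketch of the queuing and context idea: every request goes through one
// channel, so only a single prompt reaches the model at a time, while each
// session keeps its own history. Names are illustrative, not AChat's code.
package main

import (
	"net/http"
	"sync"

	"github.com/gin-gonic/gin"
)

type job struct {
	sessionID string
	prompt    string
	reply     chan string
}

type chatServer struct {
	mu      sync.Mutex
	history map[string][]string // per-session conversation history
	jobs    chan job            // serializes access to the model
}

// callModel stands in for the actual model call (e.g. the llama.cpp sketch above).
func callModel(prompt string) string {
	return "stub answer to: " + prompt
}

func newChatServer() *chatServer {
	s := &chatServer{history: make(map[string][]string), jobs: make(chan job, 64)}
	go s.worker()
	return s
}

// worker drains the queue one job at a time and records both sides of the
// exchange in the session history.
func (s *chatServer) worker() {
	for j := range s.jobs {
		answer := callModel(j.prompt)

		s.mu.Lock()
		s.history[j.sessionID] = append(s.history[j.sessionID], j.prompt, answer)
		s.mu.Unlock()

		j.reply <- answer
	}
}

func main() {
	s := newChatServer()
	r := gin.Default()

	r.POST("/chat/:session", func(c *gin.Context) {
		var req struct {
			Prompt string `json:"prompt"`
		}
		if err := c.BindJSON(&req); err != nil {
			c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
			return
		}
		j := job{sessionID: c.Param("session"), prompt: req.Prompt, reply: make(chan string, 1)}
		s.jobs <- j
		c.JSON(http.StatusOK, gin.H{"answer": <-j.reply})
	})

	r.Run(":9090")
}
```

Because the worker owns the history map updates and the channel serializes model access, several users can chat at once without stepping on each other, and a session’s history is there to resume when they come back.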

Frontend and Desktop Client

The frontend of AChat is built with webpack, SCSS, HTML, and TypeScript. This allows users to access the chat from any browser on the local network. But I wanted something even more convenient, so I built a desktop client using Electron. This removes the need to type URLs, providing a native desktop experience with an icon in the Start menu and a simplified, clean interface.

Keeping Everything in Sync: HTTP/2 and Server-Sent Events

A key part of making AChat feel seamless was keeping the backend and frontend in sync, especially since the system needs to handle multiple simultaneous conversations. For this, I used HTTP/2 and Server-Sent Events (SSE), so real-time updates flow from the server to the frontend with minimal latency. This approach avoids polling and keeps the communication efficient and snappy, even with multiple concurrent users.
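For anyone who hasn’t used SSE with Gin, here is a minimal sketch of what such a streaming endpoint can look like. Gin’s c.Stream and c.SSEvent helpers are real; the "token" event name and the faked token channel are illustrative only, not AChat’s implementation.

```go
// Minimal SSE sketch with Gin: the handler holds the connection open and
// pushes events as they become available, so the frontend never has to poll.
package main

import (
	"io"
	"time"

	"github.com/gin-gonic/gin"
)

func main() {
	r := gin.Default()

	r.GET("/events", func(c *gin.Context) {
		// In a setup like AChat's, this channel would be fed by the model
		// worker; here it is faked so the example runs on its own.
		tokens := make(chan string)
		go func() {
			defer close(tokens)
			for _, t := range []string{"Hello", " from", " the", " server"} {
				tokens <- t
				time.Sleep(200 * time.Millisecond)
			}
		}()

		c.Stream(func(w io.Writer) bool {
			if t, ok := <-tokens; ok {
				c.SSEvent("token", t) // writes one SSE frame: "event: token\ndata: ..."
				return true           // keep the connection open
			}
			return false // channel closed: end the stream
		})
	})

	r.Run(":9090")
}
```

On the browser side, the standard EventSource API can subscribe to an endpoint like this and receive each event as it arrives.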

Why Go Local?

Running the model locally provides several benefits. First, there’s no internet dependency, which means fast responses without the latency of cloud-based services. Second, the entire system is under my control, from data management to performance tuning. There’s a level of satisfaction that comes with knowing exactly what’s happening under the hood at all times.

Building the Stack: Rediscovering Full-Stack Development

While most of my recent experience has been in backend development, this project allowed me to dive back into the full stack, and honestly, it was fun. Working with frontend technologies like webpack and TypeScript took me out of my comfort zone, but it was rewarding to build an interface that looks good and works well.

On the backend, Go was a natural fit for the queuing and context management tasks. It’s fast, and writing clear, modular code in Go helped keep things maintainable. Additionally, integrating with llama.cpp was a great way to bridge AI with Golang and C++, ensuring that the LLaMa model could be efficiently managed and that GPU offloading worked as expected.

GPU Offloading: The Fun of Optimizing AI

One of the most interesting parts of this project was getting the LLaMa model running smoothly on GPUs. Offloading the heavy lifting of the AI model to my GPU-equipped server gave me significant performance improvements. I was able to handle complex conversations at a speed that would be impractical if I had to rely purely on CPU. It felt like I was running a miniature AI lab in my own home.

The Outcome

This was more than just a side project; it was a deep dive into modern AI, full-stack development, and system optimization. I learned a lot by integrating the various components and getting them to work in harmony. Building AChat gave me the chance to play with architecture, write efficient code, and refresh some of my frontend skills. And it was fun. It’s not every day you get to build your own local AI-powered chat, but the process turned out to be both a technical challenge and a genuinely enjoyable experience.

In the end, AChat works just as intended: it runs completely offline, processes conversations quickly, and keeps everything local. It’s become a useful tool in my daily life, and knowing that everything is self-contained and under my control makes it even better.