ML Repo — Architecture and External RAG Server Design (for Ollama/Open WebUI)
My Open WebUI/SearxNG configs, plugins, RAG server, and a custom program that runs the AI's code in isolated Docker containers.
Last updated: 2025-09-10
Tip
Looking for the compose version of this? See the compose
Summary :3
This repository wires together a local AI stack built around Open WebUI, Ollama, SearxNG, and two custom utilities: a code runner (executes model-generated code inside sandboxed containers) and a headless research browser UI. The current compose setup already gives you working RAG (retrieval-augmented generation) inside Open WebUI without needing a separate RAG service.
Repo map and how each piece fits
.
├─ docker-compose.yml
├─ searxng.yml # searxng settings; defaults, json+html enabled; not a public instance
├─ cloudflared-tunnel-config.yml # cloudflare tunnel routing to ollama, openwebui, and tools
├─ README.md
├─ LICENSE # apache-2.0
│
├─ rag-server/
│ ├─ Dockerfile # Runs the file that does the RAG stuff
│ └─ index.tsx # Does the RAG stuff
│
├─ browser/
│ └─ Dockerfile # builds browser-use/web-ui (playwright chromium) on :7788
│
└─ coderunner/
├─ Dockerfile # bun-based service that exposes an OpenAPI tool for sandboxed code exec
├─ index.ts # the server; integrates with Open WebUI as a tool via /openapi.json
└─ package.json # @types/node only (dev) to feed the OCD
Open WebUI (in docker-compose.yml)
- purpose: chat UI + orchestration layer; includes a built-in knowledge base + RAG with chunking, embedding, search, and prompt templating.
- notable: backed by Postgres in this compose; exposes 4000:8080.
- storage: the open-webui docker volume holds app data; Postgres uses pgdata.
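To make that concrete, a minimal sketch of how the Open WebUI service might be wired up in docker-compose.yml (the image tag, DATABASE_URL credentials, and data path are illustrative assumptions, not a copy of this repo's compose file):

```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "4000:8080"                      # host 4000 -> container 8080
    environment:
      # assumption: Open WebUI picks up Postgres via DATABASE_URL
      DATABASE_URL: "postgresql://owui:owui@postgres:5432/owui"
    volumes:
      - open-webui:/app/backend/data     # app data lives in the open-webui volume
    depends_on:
      - postgres
```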
Postgres (in docker-compose.yml)
- purpose: persistence for Open WebUI features (users, knowledge, etc.); health-checked with pg_isready.
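Continuing the same illustrative sketch, the pg_isready healthcheck might look like this (credentials and intervals are placeholders):

```yaml
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: owui
      POSTGRES_PASSWORD: owui
      POSTGRES_DB: owui
    volumes:
      - pgdata:/var/lib/postgresql/data  # the pgdata volume mentioned above
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U owui -d owui"]
      interval: 10s
      timeout: 5s
      retries: 5
```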
SearxNG (in docker-compose.yml + searxng.yml)
- purpose: metasearch engine used by Open WebUI tools/agents for live web lookups.
- config highlights:
use_default_settings: true, public_instance: false, limiter: false; formats: html and json.
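As a rough sketch, those highlights map onto searxng.yml roughly like this (key placement follows SearxNG's settings.yml conventions; the actual file in this repo is authoritative):

```yaml
use_default_settings: true

server:
  public_instance: false    # not meant to be a public instance
  limiter: false            # no rate limiting for local/private use

search:
  formats:                  # json is what tool/agent integrations consume
    - html
    - json
```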
Coderunner service (coderunner/)
- what it is: a small HTTP server (Bun runtime) that executes pure source code in short-lived, sandboxed containers.
- why it exists: lets Open WebUI tools run code safely with tight resource limits (no network, read-only fs, cgroup limits, --cap-drop=ALL, no-new-privileges).
- integration contract: exposes an OpenAPI schema at /openapi.json and a single POST /execute endpoint. Open WebUI can import this as a tool server.
- security posture: pulls allow-listed base images (gcc, python, node, bun, etc.), mounts only a tmpfs workdir, times out jobs ≈25s, and runs with non-root uid/gid. The container has access to the host's docker socket only to run the sandbox containers.
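For a feel of what launching one sandbox looks like, here is a minimal Bun sketch that mirrors the flags listed above (the workdir path, memory/cpu values, and function name are illustrative assumptions; coderunner/index.ts is the real implementation):

```ts
// Illustrative only: spawn a short-lived, locked-down container for one job
async function runSandboxed(image: string, cmd: string[], timeoutMs = 25_000) {
  const proc = Bun.spawn(
    [
      "docker", "run", "--rm",
      "--network=none",                 // no network inside the sandbox
      "--read-only",                    // read-only root filesystem
      "--cap-drop=ALL",                 // drop all Linux capabilities
      "--security-opt", "no-new-privileges",
      "--memory=256m", "--cpus=0.5",    // cgroup limits (values assumed)
      "--tmpfs", "/work:rw,size=64m",   // the only writable path is a tmpfs workdir
      "--user", "1000:1000",            // non-root uid/gid
      image, ...cmd,
    ],
    { stdout: "pipe", stderr: "pipe" },
  );

  const timer = setTimeout(() => proc.kill(), timeoutMs);  // enforce the ~25s budget
  const [stdout, stderr] = await Promise.all([
    new Response(proc.stdout).text(),
    new Response(proc.stderr).text(),
  ]);
  clearTimeout(timer);
  return { exitCode: await proc.exited, stdout, stderr };
}
```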
Browser-use web-ui (browser/)
- purpose: “autonomous” research browser UI (chromium via playwright), reachable on :7788.
- built from the upstream browser-use/web-ui repo, with python deps and browsers installed in the image.
Cloudflared tunnel (cloudflared-tunnel-config.yml)
- maps hostnames (like mlep.domain.com for Ollama, owebui.domain.com for Open WebUI, and a tools host) to the internal services. Useful for private, authenticated access without public inbound ports.
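The ingress side of such a config generally looks like the sketch below (the tunnel id, hostnames, and internal ports are placeholders; cloudflared-tunnel-config.yml in this repo has the real mapping):

```yaml
tunnel: <tunnel-id>
credentials-file: /etc/cloudflared/<tunnel-id>.json

ingress:
  - hostname: mlep.domain.com        # Ollama API
    service: http://ollama:11434
  - hostname: owebui.domain.com      # Open WebUI
    service: http://open-webui:8080
  - hostname: tools.domain.com       # coderunner / other tool servers
    service: http://coderunner:8000
  - service: http_status:404         # required catch-all as the last rule
```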
Why I currently don’t use an external RAG server
Open WebUI ships with pretty good knowledge / RAG support: add files/URLs, it chunks + embeds, indexes, retrieves, and automatically prefixes retrieved context to the model prompt using a RAG template. For lightweight to mid-sized corpora and single-user/small-team usage, that’s often all you need.
Stay with built-in RAG if most of these are true:
- total corpus is ≤ ~100k chunks and grows slowly.
- single user or small team (no multi-tenant isolation needed).
- no special retrieval logic (hybrid lexical+semantic, rerankers, metadata filters) beyond what Open WebUI provides.
- tolerance for “UI-managed” knowledge; you don’t need programmatic ingestion pipelines or job queues.
When an external RAG server makes sense
Adopt a decoupled RAG service when you need one or more of:
- bigger data / throughput: millions of chunks, higher QPS, horizontal scaling.
- advanced retrieval: custom chunkers, hybrid search (bm25 + vector), reranking, time-decay, per-tenant filters, embeddings A/B, or multi-modal (image/audio) retrieval.
- programmatic ingestion: CI-driven pipelines from git/docs/confluence/S3; delta updates; background jobs.
- governance / isolation: strict multi-tenant separation, PII retention controls, audit trails.
- interoperability: a clean HTTP API and OpenAPI so other apps (beyond Open WebUI) can reuse your index.
External RAG Server — Design and Reference Implementation
This is a small, dependency-light service designed to run with Bun and integrate with both Ollama and Open WebUI.
Goals
- minimal moving parts; runs fine on a single host.
- uses Ollama for embeddings and chat.
- supports collections, upserts, queries, and an opinionated /chat that does retrieve-then-generate.
- ships an OpenAPI schema so Open WebUI can import it as a tool server.
- default in-memory store (persisted to JSON) for simplicity; optional adapters for vector DBs later.
API surface
- GET /openapi.json – schema for tool integration.
- POST /collections – create a logical collection { name }.
- GET /collections – list collections.
- POST /upsert – { collection, items: [{ id?, text, metadata? }] }; chunks + embeds text and stores vectors.
- POST /query – { collection, query, topK?=5, where? } --> nearest chunks with scores.
- POST /chat – { collection, query, topK?=5, model?, embedModel? } --> runs RAG and calls Ollama chat, returns the answer + citations.
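As a sketch of the request/response shapes implied by that list (field names beyond those shown above, such as chunk and score, are assumptions about how the reference implementation might name things):

```ts
// Illustrative shapes for the RAG server's HTTP API
interface UpsertRequest {
  collection: string;
  items: { id?: string; text: string; metadata?: Record<string, unknown> }[];
}

interface QueryRequest {
  collection: string;
  query: string;
  topK?: number;                      // defaults to 5
  where?: Record<string, unknown>;    // optional metadata filter
}

interface QueryHit {
  id: string;
  chunk: string;                      // retrieved chunk text (name assumed)
  score: number;                      // similarity score (assumed)
  metadata?: Record<string, unknown>;
}

interface ChatRequest extends QueryRequest {
  model?: string;                     // e.g. llama3.1
  embedModel?: string;                // e.g. nomic-embed-text
}

interface ChatResponse {
  answer: string;
  citations: QueryHit[];              // the chunks that grounded the answer
}
```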
Storage Strategy
- default: in-memory + JSON file on disk (./data/rag.json); good for dev/small usage.
- plug-in adapters: swap in Qdrant, SQLite-Vec, pgvector, Weaviate, etc., without changing the HTTP API.
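One way to keep the HTTP API stable across backends is to hide the store behind a small interface; a sketch of what that seam could look like (method names and the vector layout are assumptions, not the repo's actual code):

```ts
// The adapter seam where Qdrant, pgvector, SQLite-Vec, etc. could plug in
interface StoredChunk {
  id: string;
  text: string;
  embedding: number[];
  metadata?: Record<string, unknown>;
}

interface VectorStore {
  createCollection(name: string): Promise<void>;
  listCollections(): Promise<string[]>;
  upsert(collection: string, chunks: StoredChunk[]): Promise<void>;
  query(
    collection: string,
    embedding: number[],
    topK: number,
    where?: Record<string, unknown>,
  ): Promise<{ chunk: StoredChunk; score: number }[]>;
}
```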
Add to docker-compose.yml
```yaml
services:
  rag:
    build:
      context: ./rag-server
      dockerfile: Dockerfile
    environment:
      OLLAMA_BASE: "http://mlep.domain.com:11434"
      OLLAMA_CHAT_MODEL: "llama3.1"
      OLLAMA_EMBED_MODEL: "nomic-embed-text"
    volumes:
      - rag_data:/app/data
    networks:
      - internal
    restart: unless-stopped

volumes:
  rag_data:
```
If you already expose services via cloudflared, add another hostname mapping for the rag container (- hostname: rag.domain.com -> service: http://rag:8788).
Wiring the RAG server into Open WebUI and Ollama
1. Pull models
- ollama pull nomic-embed-text (embeddings)
- ollama pull llama3.1 (chat)
2. Expose the OpenAPI to Open WebUI as a tool server
- in Open WebUI --> settings --> tools --> add tool server
- paste the url for the cloudflared hostname
- you’ll now see tool functions like listCollections, createCollection, upsert, query, and chat available to the assistant
3. Usage pattern inside a chat
- to build a knowledge base, call the createCollection and upsert tools with your documents
- to answer, call chat, which performs retrieve-then-generate against your chosen collection
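The same flow can also be driven over plain HTTP from outside the chat UI; a rough end-to-end sketch (the rag.domain.com hostname, collection name, and document text are placeholders):

```ts
// Illustrative client flow: create a collection, ingest a document, ask a question
const BASE = "https://rag.domain.com";

async function post<T>(path: string, body: unknown): Promise<T> {
  const res = await fetch(`${BASE}${path}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`${path} failed: ${res.status}`);
  return res.json() as Promise<T>;
}

await post("/collections", { name: "handbook" });

await post("/upsert", {
  collection: "handbook",
  items: [{ text: "Deploys happen every Tuesday at 10:00 UTC.", metadata: { source: "ops.md" } }],
});

const { answer, citations } = await post<{ answer: string; citations: unknown[] }>("/chat", {
  collection: "handbook",
  query: "When do deploys happen?",
});
console.log(answer, citations);
```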
FAQ — Built-in vs. External RAG
Q: will Open WebUI’s built-in RAG conflict with this server? no — you can use either, or both. Open WebUI’s knowledge base is great for ad-hoc use. this service is for programmatic/control-plane needs or when you outgrow the UI’s storage/retrieval.
Q: how do I enforce tenant isolation? use one collection per tenant and never mix. for stronger guarantees, run separate RAG instances or choose Qdrant with per-collection access control.
Q: can I use my own chunker/reranker?
yes. place them ahead of /upsert and /query respectively, or add endpoints like /rerank and /embed to experiment.
Q: can this call OpenAI-compatible endpoints instead of native Ollama?
yes. Ollama exposes an experimental OpenAI-compatible API, so you can add a thin client if your other tools already point at /v1/chat/completions.
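A minimal sketch of such a thin client, assuming Ollama is reachable at the mlep.domain.com hostname used earlier (model name and error handling are illustrative):

```ts
// Thin client against Ollama's OpenAI-compatible chat endpoint
async function chatCompletion(prompt: string): Promise<string> {
  const res = await fetch("http://mlep.domain.com:11434/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3.1",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;   // OpenAI-style response shape
}
```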
License
This write-up and reference code are provided under the same Apache-2.0 terms as the repository.