Knowledge Base (KB) and RAG Flow Guide
This document provides an overview of the Knowledge Base (KB) system and explains how Retrieval-Augmented Generation (RAG) is implemented within Voicing AI.
It covers RAG fundamentals, KB architecture, indexing workflows, and the retrieval API.
1. Overview
The Knowledge Base subsystem allows users to attach documents (URLs or files) and convert them into searchable vector embeddings stored in Qdrant.
These embeddings are later retrieved using similarity search to support Retrieval-Augmented Generation (RAG).
Table of Contents
- What RAG is
- How RAG works in general
- RAG in Voicing AI
- High-Level KB Architecture
- System Flow (Overview)
- Knowledge Base Data Model
- KB Workflow
- Create Knowledge Base
- Add URL Source
- Add File Source
- Retrieval (R in RAG)
- Query Endpoint
- Augmentation (A in RAG)
- Generation
- End-to-End Example
- Summary
2. What is RAG?
RAG stands for:
- Retrieval
- Augmented
- Generation
RAG enables an AI system to answer user queries using external knowledge, such as documentation, PDFs, webpages, or manuals.
2.1 How RAG Works
Step 1: Indexing
Documents are processed into smaller text chunks.
Each chunk is converted into a vector embedding and stored in a vector database such as Qdrant.
Step 2: Retrieval
A user query is embedded and compared against stored vectors.
The most relevant chunks are returned.
Step 3: Generation
An LLM uses these retrieved chunks as context to produce an answer.
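As a toy illustration of these three steps (not how Voicing AI implements them), the following self-contained Python sketch uses a naive word-count "embedding" and prints the augmented prompt instead of calling a real LLM:

```python
# Toy, self-contained illustration of the three RAG steps. The word-count
# "embedding" and the printed prompt stand in for a real embedding model,
# vector database, and LLM.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Placeholder embedding: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Step 1: Indexing - store each chunk together with its embedding.
chunks = [
    "To reset your password, open Settings and choose Security.",
    "Invoices are emailed on the first day of each month.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Step 2: Retrieval - embed the query and rank chunks by similarity.
query = "How do I reset my password?"
query_vec = embed(query)
ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
top_chunks = [chunk for chunk, _ in ranked[:1]]

# Step 3: Generation - a real system would send this prompt to an LLM.
print("CONTEXT:\n" + "\n\n".join(top_chunks) + "\n\nQUESTION: " + query)
```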
2.2 RAG in Voicing AI
The Knowledge Base subsystem implements only the Retrieval portion of RAG:
- Indexing → delegated to an external knowledge_base.Indexer library
- Storage → handled by Qdrant
- Retrieval → exposed via /knowledge-base/query/{kb_id}
Actual text generation using LLMs happens in downstream services (e.g. RAGService).
3. High-Level KB Architecture
The Knowledge Base system consists of the following components:
| Component | Description |
|---|---|
| KB API | Exposes endpoints for creating KBs, adding sources, and querying |
| KnowledgeBaseService | Core orchestration logic |
| KnowledgeBaseIndexer | External library that chunks, embeds, and indexes data |
| Qdrant | Vector database storing embeddings |
| DocumentRetriever | External library used for similarity search |
| Schemas | Pydantic models for request/response formats |
3.1 System Flow
Add Source (URL/File)
↓
Store source entry in DB
↓
Background Indexing Task
↓
KnowledgeBaseIndexer
↓
Chunks + Embeddings → Qdrant Collection
↑
Query Endpoint performs Retrieval via DocumentRetriever
4. Knowledge Base Data Model
A KB record contains:
{
"id": "uuid",
"name": "Help Center",
"collection_name": "kb_xxxxxxxxxxxxxxxxxxxx",
"sources": [...],
"stats": {
"documents": 12,
"chunks": 187
},
"status": "active"
}
Each source entry contains:
{
"id": "uuid",
"type": "url",
"name": "FAQ",
"content": "https://example.com/faq",
"status": "completed",
"metadata": {
"pages": 12,
"chunks": 48
}
}
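The request/response schemas live in the Schemas component as Pydantic models. The sketch below mirrors the JSON examples above; the field names and defaults are illustrative, not the actual definitions:

```python
# Illustrative Pydantic models mirroring the JSON examples above; the real
# schemas in the Schemas component may differ.
from uuid import UUID
from pydantic import BaseModel, Field

class KBSource(BaseModel):
    id: UUID
    type: str                 # "url" or "file"
    name: str
    content: str              # source URL or stored file reference
    status: str               # e.g. "pending", "completed"
    metadata: dict = Field(default_factory=dict)

class KBStats(BaseModel):
    documents: int = 0
    chunks: int = 0

class KnowledgeBase(BaseModel):
    id: UUID
    name: str
    collection_name: str      # Qdrant collection backing this KB
    sources: list[KBSource] = Field(default_factory=list)
    stats: KBStats = Field(default_factory=KBStats)
    status: str = "active"
```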
5. KB Workflow
This section describes the workflow from KB creation to indexing and retrieval.
5.1 Create a Knowledge Base
Endpoint
POST /api/v1/knowledge-base/
Request
{
"name": "Support Docs",
"description": "Internal documentation"
}
Internal Behavior
- KnowledgeBaseService.create_knowledge_base() generates a unique Qdrant collection_name.
- A new KB entry is created in Postgres.
- The KB is ready to accept sources for indexing.
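The exact naming scheme for collection_name is internal to KnowledgeBaseService; a plausible sketch, assuming a random hex suffix that matches the kb_xxxxxxxxxxxxxxxxxxxx shape shown in the data model:

```python
# Hypothetical sketch of collection_name generation; the actual scheme used
# by KnowledgeBaseService.create_knowledge_base() is not documented here.
import secrets

def generate_collection_name() -> str:
    return "kb_" + secrets.token_hex(10)  # 20 hex characters after the prefix

print(generate_collection_name())  # e.g. kb_3f9c1a7d2e84b05c6d1f
```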
5.2 Add URL Source
Endpoint
POST /api/v1/knowledge-base/{kb_id}/sources/url
Request
{
"url": "https://docs.example.com/faq",
"name": "FAQ",
"indexer_config": {
"chunk_size": 1000,
"chunk_overlap": 200
}
}
Workflow
- A source record is appended to the KB with status = "pending".
- A background task _index_url_source(...) begins indexing (a sketch follows this list).
- The external KnowledgeBaseIndexer performs:
  - Web scraping
  - Text extraction
  - Chunking
  - Embedding
  - Upserting vectors into Qdrant (collection_name)
- Source status is updated to "completed" with metadata (pages, chunks).
5.3 Add File Source
Endpoint
POST /api/v1/knowledge-base/{kb_id}/sources/file
Internals
- File is validated, uploaded to storage, and saved as a KB source.
- Background indexing reads the file and processes it:
stats = await indexer.index_file(uploaded_file)
- Chunks and embeddings are stored in Qdrant.
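A hedged client-side example of adding a file source with httpx; the base URL, bearer token, and multipart field name are placeholders and may differ in your deployment:

```python
# Illustrative client call; base URL, token, and the "file" field name are
# assumptions about the deployment, not documented values.
import httpx

kb_id = "your-kb-uuid"
with open("help.pdf", "rb") as f:
    resp = httpx.post(
        f"http://localhost:8000/api/v1/knowledge-base/{kb_id}/sources/file",
        files={"file": ("help.pdf", f, "application/pdf")},
        headers={"Authorization": "Bearer <token>"},
    )
resp.raise_for_status()
print(resp.json())  # the new source entry, typically with status "pending"
```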
6. Retrieval (R in RAG)
Retrieval performs vector similarity search against the KB’s Qdrant collection.
6.1 Query Endpoint
POST /api/v1/knowledge-base/query/{kb_id}
Request
{
"query": "How do I reset my password?",
"top_results": 5
}
Internal Workflow (KnowledgeBaseService.search())
# Instantiate the external retriever with the KB's indexer configuration.
retriever = DocumentRetriever(self.indexer_config)
# Embed the query and run a similarity search against the KB's Qdrant collection.
result = await retriever.search(
    query=query,
    collection_name=kb.collection_name,
    limit=top_results
)
Retrieval Behavior
- Embeds the query.
- Searches Qdrant using cosine similarity.
- Returns an ordered list of chunks.
Example Response
{
"documents": [
{
"content": "To reset your password, open Settings...",
"metadata": { "filename": "help.pdf" },
"score": 0.93,
"chunk_id": "chunk-uuid",
"source_url": "https://docs.example.com/help",
"title": "Help Center"
}
],
"query": "How do I reset my password?",
"total_found": 42,
"retrieval_time": 0.12
}
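A minimal client-side call against the query endpoint, assuming a local deployment and bearer auth (both placeholders):

```python
# Illustrative retrieval call; base URL and auth header are placeholders.
import httpx

kb_id = "your-kb-uuid"
resp = httpx.post(
    f"http://localhost:8000/api/v1/knowledge-base/query/{kb_id}",
    json={"query": "How do I reset my password?", "top_results": 5},
    headers={"Authorization": "Bearer <token>"},
)
resp.raise_for_status()
for doc in resp.json()["documents"]:
    print(f"{doc['score']:.2f}  {doc['content'][:60]}")
```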
7. Augmentation (A in RAG)
Augmentation is the step where retrieved KB chunks are transformed into LLM-ready context.
Although the detailed orchestration lives outside this repo, the typical flow in Voicing AI looks like:
1. Call the KB retrieval API
   - A higher-level service (e.g. RAGService, an orchestration layer for AI assistants) calls POST /api/v1/knowledge-base/query/{kb_id} and receives a RetrievalResult.
2. Select and post-process chunks
   - Optionally re-rank, deduplicate, or filter by score or metadata (for example, limit to certain document types or recency).
   - Truncate or summarize content to fit within the target model’s context window (based on max tokens/characters).
3. Build the augmented prompt
   - Serialize chunks into a prompt section such as: CONTEXT:\n[chunk1]\n\n[chunk2]\n...
   - Attach them to the LLM request as:
     - System messages (instructions + context),
     - Tool/function parameters, or
     - Extra fields in a JSON payload interpreted by the LLM orchestration layer.
   - Include identifiers (titles, filenames, URLs) so the model can reference or cite specific sources.
4. Add guardrails and formatting
   - Prepend clear instructions like: “Answer only using the provided context. If the answer is not present, say you don’t know.”
   - Optionally inject citation markers (e.g. [1], [2]) tied to retrieved documents so the UI can show which snippet supported which part of the answer.
In summary, the Augmentation layer is responsible for turning the raw documents[] from the Knowledge Base retrieval API into a structured, well-constrained prompt that the LLM can safely and effectively use for final answer generation.
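As a concrete sketch (the real orchestration lives outside this repo), the helper below turns documents[] from the retrieval response into chat-style messages. The score threshold, character budget, and message layout are illustrative choices, not the behavior of any specific Voicing AI service:

```python
# Illustrative augmentation helper; threshold, budget, and message layout are
# assumptions, not the behavior of any specific Voicing AI service.
def build_augmented_prompt(query: str, documents: list[dict],
                           min_score: float = 0.5, max_chars: int = 6000) -> list[dict]:
    kept, used = [], 0
    # Keep well-scored chunks until the character budget is spent.
    for i, doc in enumerate(d for d in documents if d["score"] >= min_score):
        snippet = doc["content"][: max_chars - used]
        if not snippet:
            break
        kept.append(f"[{i + 1}] ({doc.get('title', 'untitled')}) {snippet}")
        used += len(snippet)

    system = (
        "Answer only using the provided context. "
        "If the answer is not present, say you don't know.\n\n"
        "CONTEXT:\n" + "\n\n".join(kept)
    )
    # Chat-style messages ready for an LLM call in RAGService or similar.
    return [{"role": "system", "content": system},
            {"role": "user", "content": query}]
```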
8. Generation
The Knowledge Base module does not generate answers.
After retrieval, downstream services such as RAGService, RagOpenAILLMService, or other LLM orchestrators inject KB chunks into prompts and generate the final natural-language response.
The KB is responsible only for supplying relevant context.
9. End-to-End Example
User uploads PDF
↓
KB stores file → background indexing
↓
KnowledgeBaseIndexer chunks + embeds → Qdrant
↓
User queries KB
↓
DocumentRetriever fetches top-k relevant chunks
↓
LLM consumes chunks → final answer (outside KB)
10. Summary
- Voicing AI KB implements the Retrieval component of RAG.
- URL and file sources are indexed asynchronously into Qdrant.
- Retrieval API returns relevant chunks using vector similarity search.
- LLM-based answer generation is performed outside the KB subsystem.
This forms the foundation for all RAG-assisted workflows within Voicing AI.
Last updated: 04/12/2025
Version: 1.0