Knowledge Base (KB) and RAG Flow Guide
This document provides an overview of the Knowledge Base (KB) system and explains how Retrieval-Augmented Generation (RAG) is implemented within Voicing AI.
It covers RAG fundamentals, KB architecture, indexing workflows, and the retrieval API.
1. Overview
The Knowledge Base subsystem allows users to attach documents (URLs or files) and convert them into searchable vector embeddings stored in Qdrant.
These embeddings are later retrieved using similarity search to support Retrieval-Augmented Generation (RAG).
Table of Contents
- What RAG is
- How RAG works in general
- RAG in Voicing AI
- High-Level KB Architecture
- System Flow (Overview)
- Knowledge Base Data Model
- KB Workflow
- Create Knowledge Base
- Add URL Source
- Add File Source
- Retrieval (R in RAG)
- Query Endpoint
- Augmentation (A in RAG)
- Generation
- End-to-End Example
- Summary
2. What is RAG?
RAG stands for:
- Retrieval
- Augmented
- Generation
RAG enables an AI system to answer user queries using external knowledge, such as documentation, PDFs, webpages, or manuals.
2.1 How RAG Works
Step 1: Indexing
Documents are processed into smaller text chunks.
Each chunk is converted into a vector embedding and stored in a vector database such as Qdrant.
Step 2: Retrieval
A user query is embedded and compared against stored vectors.
The most relevant chunks are returned.
Step 3: Generation
An LLM uses these retrieved chunks as context to produce an answer.
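As a toy illustration of these three steps (not how Voicing AI implements them), the following self-contained Python sketch uses a naive word-count "embedding" and prints the augmented prompt instead of calling a real LLM:

```python
# Toy, self-contained illustration of the three RAG steps. The word-count
# "embedding" and the printed prompt stand in for a real embedding model,
# vector database, and LLM.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Placeholder embedding: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Step 1: Indexing - store each chunk together with its embedding.
chunks = [
    "To reset your password, open Settings and choose Security.",
    "Invoices are emailed on the first day of each month.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Step 2: Retrieval - embed the query and rank chunks by similarity.
query = "How do I reset my password?"
query_vec = embed(query)
ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
top_chunks = [chunk for chunk, _ in ranked[:1]]

# Step 3: Generation - a real system would send this prompt to an LLM.
print("CONTEXT:\n" + "\n\n".join(top_chunks) + "\n\nQUESTION: " + query)
```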
2.2 RAG in Voicing AI
The Knowledge Base subsystem implements only the Retrieval portion of RAG:
- Indexing → delegated to an external knowledge_base.Indexer library
- Storage → handled by Qdrant
- Retrieval → exposed via /knowledge-base/query/{kb_id}
Actual text generation using LLMs happens in downstream services (e.g. RAGService).
3. High-Level KB Architecture
The Knowledge Base system consists of the following components:
| Component | Description |
|---|---|
| KB API | Exposes endpoints for creating KBs, adding sources, and querying |
| KnowledgeBaseService | Core orchestration logic |
| KnowledgeBaseIndexer | External library that chunks, embeds, and indexes data |
| Qdrant | Vector database storing embeddings |
| DocumentRetriever | External library used for similarity search |
| Schemas | Pydantic models for request/response formats |
3.1 System Flow
Add Source (URL/File)
↓
Store source entry in DB
↓
Background Indexing Task
↓
KnowledgeBaseIndexer
↓
Chunks + Embeddings → Qdrant Collection
↑
Query Endpoint performs Retrieval via DocumentRetriever
4. Knowledge Base Data Model
A KB record contains:
{
"id": "uuid",
"name": "Help Center",
"collection_name": "kb_xxxxxxxxxxxxxxxxxxxx",
"sources": [...],
"stats": {
"documents": 12,
"chunks": 187
},
"status": "active"
}
Each source entry contains:
{
"id": "uuid",
"type": "url",
"name": "FAQ",
"content": "https://example.com/faq",
"status": "completed",
"metadata": {
"pages": 12,
"chunks": 48
}
}
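The request/response schemas live in the Schemas component as Pydantic models. The sketch below mirrors the JSON examples above; the field names and defaults are illustrative, not the actual definitions:

```python
# Illustrative Pydantic models mirroring the JSON examples above; the real
# schemas in the Schemas component may differ.
from uuid import UUID
from pydantic import BaseModel, Field

class KBSource(BaseModel):
    id: UUID
    type: str                 # "url" or "file"
    name: str
    content: str              # source URL or stored file reference
    status: str               # e.g. "pending", "completed"
    metadata: dict = Field(default_factory=dict)

class KBStats(BaseModel):
    documents: int = 0
    chunks: int = 0

class KnowledgeBase(BaseModel):
    id: UUID
    name: str
    collection_name: str      # Qdrant collection backing this KB
    sources: list[KBSource] = Field(default_factory=list)
    stats: KBStats = Field(default_factory=KBStats)
    status: str = "active"
```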
5. KB Workflow
This section describes the workflow from KB creation to indexing and retrieval.
5.1 Create a Knowledge Base
Endpoint
POST /api/v1/knowledge-base/
Request
{
"name": "Support Docs",
"description": "Internal documentation"
}
Internal Behavior
- KnowledgeBaseService.create_knowledge_base() generates a unique Qdrant collection_name.
- A new KB entry is created in Postgres.
- The KB is ready to accept sources for indexing.
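The exact naming scheme for collection_name is internal to KnowledgeBaseService; a plausible sketch, assuming a random hex suffix that matches the kb_xxxxxxxxxxxxxxxxxxxx shape shown in the data model:

```python
# Hypothetical sketch of collection_name generation; the actual scheme used
# by KnowledgeBaseService.create_knowledge_base() is not documented here.
import secrets

def generate_collection_name() -> str:
    return "kb_" + secrets.token_hex(10)  # 20 hex characters after the prefix

print(generate_collection_name())  # e.g. kb_3f9c1a7d2e84b05c6d1f
```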
5.2 Add URL Source
Endpoint
POST /api/v1/knowledge-base/{kb_id}/sources/url
Request
{
"url": "https://docs.example.com/faq",
"name": "FAQ",
"indexer_config": {
"chunk_size": 1000,
"chunk_overlap": 200
}
}
Workflow
- A source record is appended to the KB with status = "pending".
- A background task _index_url_source(...) begins indexing (a sketch follows this list).
- The external KnowledgeBaseIndexer performs:
  - Web scraping
  - Text extraction
  - Chunking
  - Embedding
  - Upserting vectors into Qdrant (collection_name)
- Source status is updated to "completed" with metadata (pages, chunks).
5.3 Add File Source
Endpoint
POST /api/v1/knowledge-base/{kb_id}/sources/file
Internals
- File is validated, uploaded to storage, and saved as a KB source.
- Background indexing reads the file and processes it:
stats = await indexer.index_file(uploaded_file)
- Chunks and embeddings are stored in Qdrant.
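A hedged client-side example of adding a file source with httpx; the base URL, bearer token, and multipart field name are placeholders and may differ in your deployment:

```python
# Illustrative client call; base URL, token, and the "file" field name are
# assumptions about the deployment, not documented values.
import httpx

kb_id = "your-kb-uuid"
with open("help.pdf", "rb") as f:
    resp = httpx.post(
        f"http://localhost:8000/api/v1/knowledge-base/{kb_id}/sources/file",
        files={"file": ("help.pdf", f, "application/pdf")},
        headers={"Authorization": "Bearer <token>"},
    )
resp.raise_for_status()
print(resp.json())  # the new source entry, typically with status "pending"
```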
6. Retrieval (R in RAG)
Retrieval performs vector similarity search against the KB’s Qdrant collection.
6.1 Query Endpoint
POST /api/v1/knowledge-base/query/{kb_id}
Request
{
"query": "How do I reset my password?",
"top_results": 5
}
Internal Workflow (KnowledgeBaseService.search())
# Instantiate the external retriever with the KB's indexer configuration.
retriever = DocumentRetriever(self.indexer_config)
# Embed the query and run a similarity search against the KB's Qdrant collection.
result = await retriever.search(
    query=query,
    collection_name=kb.collection_name,
    limit=top_results
)
Retrieval Behavior
- Embeds the query.
- Searches Qdrant using cosine similarity.
- Returns an ordered list of chunks.
Example Response
{
"documents": [
{
"content": "To reset your password, open Settings...",
"metadata": { "filename": "help.pdf" },
"score": 0.93,
"chunk_id": "chunk-uuid",
"source_url": "https://docs.example.com/help",
"title": "Help Center"
}
],
"query": "How do I reset my password?",
"total_found": 42,
"retrieval_time": 0.12
}
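A minimal client-side call against the query endpoint, assuming a local deployment and bearer auth (both placeholders):

```python
# Illustrative retrieval call; base URL and auth header are placeholders.
import httpx

kb_id = "your-kb-uuid"
resp = httpx.post(
    f"http://localhost:8000/api/v1/knowledge-base/query/{kb_id}",
    json={"query": "How do I reset my password?", "top_results": 5},
    headers={"Authorization": "Bearer <token>"},
)
resp.raise_for_status()
for doc in resp.json()["documents"]:
    print(f"{doc['score']:.2f}  {doc['content'][:60]}")
```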
7. Augmentation (A in RAG)
Augmentation is the step where retrieved KB chunks are transformed into LLM-ready context.
Although the detailed orchestration lives outside this repo, the typical flow in Voicing AI looks like:
1. Call the KB retrieval API
   - A higher-level service (e.g. RAGService, an orchestration layer for AI assistants) calls POST /api/v1/knowledge-base/query/{kb_id} and receives a RetrievalResult.
2. Select and post-process chunks
   - Optionally re-rank, deduplicate, or filter by score or metadata (for example, limit to certain document types or recency).
   - Truncate or summarize content to fit within the target model’s context window (based on max tokens/characters).
3. Build the augmented prompt
   - Serialize chunks into a prompt section such as: CONTEXT:\n[chunk1]\n\n[chunk2]\n...
   - Attach them to the LLM request as:
     - System messages (instructions + context),
     - Tool/function parameters, or
     - Extra fields in a JSON payload interpreted by the LLM orchestration layer.
   - Include identifiers (titles, filenames, URLs) so the model can reference or cite specific sources.
4. Add guardrails and formatting
   - Prepend clear instructions like: “Answer only using the provided context. If the answer is not present, say you don’t know.”
   - Optionally inject citation markers (e.g. [1], [2]) tied to retrieved documents so the UI can show which snippet supported which part of the answer.
In summary, the Augmentation layer is responsible for turning the raw documents[] from the Knowledge Base retrieval API into a structured, well-constrained prompt that the LLM can safely and effectively use for final answer generation.
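As a concrete sketch (the real orchestration lives outside this repo), the helper below turns documents[] from the retrieval response into chat-style messages. The score threshold, character budget, and message layout are illustrative choices, not the behavior of any specific Voicing AI service:

```python
# Illustrative augmentation helper; threshold, budget, and message layout are
# assumptions, not the behavior of any specific Voicing AI service.
def build_augmented_prompt(query: str, documents: list[dict],
                           min_score: float = 0.5, max_chars: int = 6000) -> list[dict]:
    kept, used = [], 0
    # Keep well-scored chunks until the character budget is spent.
    for i, doc in enumerate(d for d in documents if d["score"] >= min_score):
        snippet = doc["content"][: max_chars - used]
        if not snippet:
            break
        kept.append(f"[{i + 1}] ({doc.get('title', 'untitled')}) {snippet}")
        used += len(snippet)

    system = (
        "Answer only using the provided context. "
        "If the answer is not present, say you don't know.\n\n"
        "CONTEXT:\n" + "\n\n".join(kept)
    )
    # Chat-style messages ready for an LLM call in RAGService or similar.
    return [{"role": "system", "content": system},
            {"role": "user", "content": query}]
```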
8. Generation
The Knowledge Base module does not generate answers.
After retrieval, downstream services such as RAGService, RagOpenAILLMService, or other LLM orchestrators inject KB chunks into prompts and generate the final natural-language response.
The KB is responsible only for supplying relevant context.
9. End-to-End Example
User uploads PDF
↓
KB stores file → background indexing
↓
KnowledgeBaseIndexer chunks + embeds → Qdrant
↓
User queries KB
↓
DocumentRetriever fetches top-k relevant chunks
↓
LLM consumes chunks → final answer (outside KB)
10. Summary
- Voicing AI KB implements the Retrieval component of RAG.
- URL and file sources are indexed asynchronously into Qdrant.
- Retrieval API returns relevant chunks using vector similarity search.
- LLM-based answer generation is performed outside the KB subsystem.
This forms the foundation for all RAG-assisted workflows within Voicing AI.
Last updated: 04/12/2025
Version: 1.0