Technical Architecture
Architecture Overview - RAG Solution Based on QA Pre-generation
This system adopts a typical three-stage RAG (Retrieval-Augmented Generation) architecture: construction (ETL), retrieval (Retrieve), and generation (Generate):
- Construction (ETL): Extract knowledge from the original documents, generate structured Q&A pairs, vectorize them, and store them in a vector database;
- Retrieval: When a user asks a question, combine keyword matching and semantic search with fusion ranking to locate the most relevant knowledge points;
- Generation: Use a large language model to generate a natural language answer from the retrieval results.
A modular design across these stages supports availability, scalability, and straightforward engineering implementation.
Knowledge Construction (ETL)
ETL is the foundational component of the entire system, aiming to transform unstructured original documents into knowledge points usable for retrieval. This process mainly includes the following steps:
Document Acquisition
Knowledge sources include official help documents, plus help posts and tutorial posts from the technical forum:
- Official help documents: approximately 4,000 articles
- Technical forum - Help Center: approximately 50,000 posts
- Technical forum - Topic Tutorials: approximately 4,000 posts
We plan to add the following content sources in the future:
- API interface documentation
- Public course video transcripts
- Product plugin descriptions, app marketplace product materials
- Success case analyses
The crawler supports scheduled incremental crawling, with shorter crawl cycles for frequently updated content to keep the knowledge base current.
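As a rough illustration of per-source crawl cycles, the sketch below decides which sources are due for an incremental crawl. The source names and intervals are hypothetical; the design above does not specify them.

```python
from datetime import datetime, timedelta

# Hypothetical crawl cycles: frequently updated sources get shorter intervals.
CRAWL_INTERVALS = {
    "official_docs": timedelta(days=7),
    "forum_help_center": timedelta(hours=6),
    "forum_tutorials": timedelta(days=1),
}


def due_sources(last_crawled: dict[str, datetime], now: datetime) -> list[str]:
    """Return the sources whose crawl interval has elapsed (or that were never crawled)."""
    due = []
    for source, interval in CRAWL_INTERVALS.items():
        last = last_crawled.get(source)
        if last is None or now - last >= interval:
            due.append(source)
    return due
```

A scheduler could call due_sources on a fixed tick and crawl only the returned sources, fetching just the documents changed since the last run.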
Chunking
Unlike traditional paragraph segmentation methods, we adopted a "QA chunking" method based on large language models, converting document content into multiple question-answer pairs (QA Pairs). This approach focuses more on actual knowledge points, improving the effectiveness of subsequent retrieval and generation.
The specific process is as follows:
- Use an LLM to convert the original documents into structured Q&A pairs;
- Verify the quality of the generated QA pairs, including manual spot checks of samples for accuracy;
- Output standardized JSON for convenient downstream processing.
Compared with coarse-grained text paragraphs, the QA pair format is closer to users' actual query intent, which helps improve the recall and relevance of the retrieval system.
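The automated part of the quality check could look like the following minimal sketch, which parses the LLM output and keeps only well-formed QA pairs. It assumes the LLM returns a JSON array of objects with "Question" and "Answer" keys; the actual output schema and verification rules are not given in this document.

```python
import json


def parse_qa_pairs(raw: str) -> list[dict]:
    """Parse LLM output as a JSON array of QA pairs and drop malformed entries.

    Assumed schema (hypothetical): [{"Question": "...", "Answer": "..."}, ...]
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []

    valid = []
    for item in data:
        if not isinstance(item, dict):
            continue
        question = item.get("Question")
        answer = item.get("Answer")
        # Keep only entries where both fields are non-empty strings.
        if isinstance(question, str) and isinstance(answer, str):
            question, answer = question.strip(), answer.strip()
            if question and answer:
                valid.append({"Question": question, "Answer": answer})
    return valid
```

Entries that pass this structural check would still go through the manual spot check described above, since a well-formed pair can be factually wrong.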
Text Embedding
To improve retrieval accuracy and robustness, we adopt a "dual vectorization" strategy, representing each entry with both sparse vectors (BM25) and dense semantic vectors (Dense Vector).
Additionally, we introduced multiple enhancement mechanisms during the vectorization process:
Summary Generation
For each QA pair, a large language model generates a brief summary of the source document's context. The summary helps the generation model understand the background of the entry.
Full Answer Extension
For important or high-frequency questions, we generate an additional, more detailed answer and store it in the payload. It can be displayed in the frontend or used to give the generation stage richer context.
Prefix Mechanism
To avoid "spatial aliasing", where similar questions from different documents land close together in vector space, we prepend the category and title of the source document to the original text before vectorization.
Template:
[Category/Title] + Question
Example:
[Forguncy/Connect to External Database/Connect to Oracle] What needs to be done before connecting to Oracle database?
This strategy distinguishes questions that are semantically similar but apply to different scenarios, improving retrieval precision.
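The prefix step can be sketched as a small helper. The function name add_prefix and the second example path are illustrative assumptions; only the "[Category/Title] + text" template comes from the description above.

```python
def add_prefix(category: str, title: str, text: str) -> str:
    """Prepend "[Category/Title]" to the text that will be vectorized."""
    return f"[{category}/{title}] {text}"


# Two near-identical questions from different documents stay distinguishable.
oracle = add_prefix(
    "Forguncy/Connect to External Database",
    "Connect to Oracle",
    "What needs to be done before connecting to the database?",
)
mysql = add_prefix(
    "Forguncy/Connect to External Database",
    "Connect to MySQL",  # hypothetical sibling document
    "What needs to be done before connecting to the database?",
)
print(oracle)
print(mysql)
```

Because the prefix becomes part of the embedded text, both the sparse and dense representations carry the document context.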
Vector Fields
Four vector fields are generated for each QA pair:
- Prefix_Question_Dense
- Prefix_Answer_Dense
- Prefix_Question_Sparse
- Prefix_Answer_Sparse
Payload Fields
Each knowledge entry contains the following metadata for subsequent retrieval and generation use:
- Question: Questions users might ask
- Answer: Concise standard answer
- FullAnswer: Detailed explanation version
- Summary: Context summary
- Url: Original article link
- Title: Article title
- Category: Category classification
- Date: Creation time (Unix timestamp)
With this design, the vectorization scheme covers both keyword matching and semantic understanding, and the payload gives the generation stage richer context to work with.
Payload Example
{
  "Question": "What needs to be done before connecting to Oracle database?",
  "Answer": "Oracle needs to be configured first, and only after configuration is complete can you connect to Oracle.",
  "FullAnswer": "# Preparation for Connecting to Oracle Database Before connecting to Oracle database, some necessary configuration steps need to be completed......",
  "Summary": "This document introduces how to connect to Oracle database, including steps for configuring Oracle and specific operations for connecting Oracle in Forguncy......",
  "Url": "https://www.grapecity.com.cn/solutions/huozige/help/docs/connecttoexternaldatabase/connecttooracle",
  "Title": "Connect to Oracle",
  "Category": "Forguncy Chinese Documentation/Chapter 15 Connect to External Database",
  "Date": 1746093654
}
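To show how one knowledge entry ties the payload to the four vector fields, here is a minimal sketch. The QAPayload class and vector_inputs function are hypothetical names; the field names follow the lists above, and the embedding calls themselves (BM25 and dense encoders) are left out because the document does not name them.

```python
from dataclasses import dataclass, asdict


@dataclass
class QAPayload:
    Question: str
    Answer: str
    FullAnswer: str
    Summary: str
    Url: str
    Title: str
    Category: str
    Date: int  # creation time as a Unix timestamp


def vector_inputs(p: QAPayload) -> dict[str, str]:
    """Return the text that would be fed to each of the four vector fields.

    Assumption: the sparse and dense variants of a field embed the same
    prefixed text and differ only in the encoder applied afterwards.
    """
    prefix = f"[{p.Category}/{p.Title}]"
    question = f"{prefix} {p.Question}"
    answer = f"{prefix} {p.Answer}"
    return {
        "Prefix_Question_Dense": question,
        "Prefix_Answer_Dense": answer,
        "Prefix_Question_Sparse": question,
        "Prefix_Answer_Sparse": answer,
    }
```

Calling asdict on a QAPayload yields a plain dictionary with the same keys as the payload example, which is the shape a vector database would typically store next to the vectors.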
Retrieval Mechanism
The retrieval stage finds the knowledge entries most relevant to the user's question. We adopt a multi-channel hybrid retrieval strategy and fuse the channel rankings with the RRF (Reciprocal Rank Fusion) algorithm, which balances relevance and diversity in the returned results.
Hybrid Retrieval Strategy
We employ two mainstream retrieval methods:
- Sparse Retrieval (BM25): Based on inverted index and keyword frequency, suitable for queries with clear keywords;
- Dense Retrieval (Dense Vector): Based on semantic vectors, suitable for fuzzy semantic understanding and complex expressions.
Each retrieval path independently returns its top 40 results (TopK=40). The RRF algorithm then fuses these rankings, and the top 8 fused results (TopK=8) are returned.
The retrieval path is as follows:
UserInput → [Question(BM25 & Dense), Answer(BM25 & Dense)]{TopK=40} → RRF Fusion Ranking → Hits{TopK=8}
We search the question and answer fields of each QA pair at the same time to improve overall recall. For example, if a user's wording is closer to the answer text than to the question text, the entry can still be matched and returned.
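The fusion step can be sketched as follows. In RRF, each document's fused score is the sum over all rankings of 1 / (k + rank), where rank is its 1-based position in that ranking. The constant k = 60 is a commonly used default and an assumption here; the document does not state which value the system uses. The function name rrf_fuse and the use of string IDs are also illustrative.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_k: int = 8) -> list[str]:
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ordered[:top_k]]


# Toy rankings standing in for the four channels:
# question/BM25, question/dense, answer/BM25, answer/dense.
channels = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
    ["d", "b", "a"],
]
print(rrf_fuse(channels, top_k=3))
```

In the pipeline described above, each of the four lists would hold up to 40 IDs and top_k would be 8. RRF only uses rank positions, so BM25 scores and dense similarity scores never need to be normalized against each other.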
Currently, no time decay factor is introduced because multiple versions of product documents coexist, and users may still need information from older versions.
Generation Mechanism
In the generation stage, the system generates natural and fluent Chinese answers based on retrieved relevant knowledge combined with large language model capabilities.
Input Structure
The system receives the following two main inputs:
- User question (User Input)
- Retrieval results (Hits from Retrieval)
Both inputs are merged into a prompt and passed to the large language model for reasoning and answer generation.
Prompt Template Example
"""
You are conversing with users. Please comprehensively refer to the context and the user questions and knowledge base retrieval results below to answer user questions. Attach document links when answering.
## User Question
{keyword}
## Knowledge Base Retrieval Results
{hits_text}
"""
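Filling the template might look like the sketch below. How {hits_text} is serialized is not specified in this document, so the format_hits layout (title, question, answer, and link per hit) is an assumption, as are the function names.

```python
PROMPT_TEMPLATE = """\
You are conversing with users. Please comprehensively refer to the context and the user questions and knowledge base retrieval results below to answer user questions. Attach document links when answering.
## User Question
{keyword}
## Knowledge Base Retrieval Results
{hits_text}
"""


def format_hits(hits: list[dict]) -> str:
    """Serialize retrieved payloads into plain text (layout is an assumption)."""
    blocks = []
    for i, hit in enumerate(hits, start=1):
        blocks.append(
            f"### Result {i}\n"
            f"Title: {hit['Title']}\n"
            f"Question: {hit['Question']}\n"
            f"Answer: {hit['Answer']}\n"
            f"Url: {hit['Url']}"
        )
    return "\n\n".join(blocks)


def build_prompt(keyword: str, hits: list[dict]) -> str:
    """Fill the prompt template with the user question and retrieval results."""
    return PROMPT_TEMPLATE.format(keyword=keyword, hits_text=format_hits(hits))
```

Including the Url field in every hit gives the model what it needs to follow the instruction to attach document links.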
Reasoning Mode
In addition to direct answer output, we also support a "reasoning mode," in which a reasoning-type LLM explicitly outputs its intermediate reasoning, making answers more transparent and increasing user trust.
Multi-turn Q&A Support
The system supports multi-turn conversations and uses the conversation history to identify what the user is actually asking.
The process is as follows:
Historical Conversation → [LLM Prompt] → Rewritten Question → [Rewritten Question + Retrieval Results] → [LLM Prompt] → Generate Final Answer
The following prompt guides the model to identify the user's intent:
"""
You are a question generator. You need to identify the question users want to query from the following conversation and directly output that text, which will be used to retrieve relevant knowledge in the knowledge base.
## Conversation Content
{contents}
"""
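The question-rewriting step can be sketched as below. The LLM is passed in as a plain callable so the example stays independent of any provider; that interface, the "role: text" history format, and the function names are assumptions.

```python
from typing import Callable

REWRITE_TEMPLATE = """\
You are a question generator. You need to identify the question users want to query from the following conversation and directly output that text, which will be used to retrieve relevant knowledge in the knowledge base.
## Conversation Content
{contents}
"""


def format_history(history: list[tuple[str, str]]) -> str:
    """Render (role, text) turns as one line per turn (format is an assumption)."""
    return "\n".join(f"{role}: {text}" for role, text in history)


def rewrite_question(history: list[tuple[str, str]], llm: Callable[[str], str]) -> str:
    """Ask the LLM to condense the conversation into one standalone question."""
    prompt = REWRITE_TEMPLATE.format(contents=format_history(history))
    return llm(prompt).strip()
```

The rewritten question then replaces the raw user input in retrieval, and the final answer is generated from the rewritten question plus the retrieval results.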
Summary
With the architecture described above, we built a complete RAG knowledge retrieval and Q&A system based on QA pre-generation. Document parsing, QA pair generation, vector index construction, multi-path hybrid retrieval, and LLM-based answer generation work together to deliver the knowledge service.
Beyond standard text processing and semantic retrieval, the Prefix mechanism, Full Answer extension, and RRF fusion ranking are the main measures used to improve the system's accuracy, robustness, and practicality.
Next Stage: Engineering Implementation Solution
With the core functions designed and validated, the next step focuses on the engineering implementation of the system, ensuring its availability, stability, and scalability in production environments.
The "Engineering Solution" chapter covers the following in detail:
1. System Deployment Architecture
- Overall system structure diagram (including knowledge base construction module, retrieval service, generation service, LLM middleware, etc.)
- Call relationships and data flow between components
- Microservice splitting and containerized deployment recommendations
2. Core Component Description
- Knowledge Retrieval and Q&A Service: Vector retrieval based on Qdrant + MySQL metadata auxiliary queries
- Server Application: Provides unified API interfaces externally, coordinates retrieval and generation processes
- Client Application: Intelligent Q&A components integrated into product UI
- ETL Application: Scheduled task-driven QA extraction and index update module
- Third-party LLM Service Integration: Such as Tongyi Qianwen, DeepSeek, etc., supporting flexible configuration and switching
3. Performance Optimization Strategies
- Asynchronous processing and caching mechanisms (such as Redis caching high-frequency question results)
- Vector index preloading and memory mapping optimization
- Load balancing design for retrieval and generation tasks
4. Rate Limiting and Fault Tolerance Mechanisms
- Request frequency control (Rate Limiting) and token bucket algorithm implementation
- LLM call failure retry, timeout circuit breaker and degradation strategies
- Error log collection and monitoring alert system integration
5. User Experience Enhancement Measures
- Response latency optimization (such as preloading, asynchronous loading of partial context)
- Multi-turn conversation state management and intent recognition optimization
- Visual feedback mechanisms (users can rate answer quality or provide corrections)
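As a preview of the rate limiting item in the outline above, here is a minimal token bucket sketch. It is a generic illustration of the algorithm, not the system's implementation; the class name, parameters, and injectable clock are assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable for testing
        self.tokens = capacity      # start with a full bucket
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume tokens if the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A service would typically keep one bucket per user or API key and reject or queue a request when allow returns False.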