Engineering Solution
After completing the technical architecture design for knowledge construction and retrieval generation, to ensure the system can run stably in production environments with good scalability and user experience, we further developed a detailed engineering implementation solution. This solution covers multiple key aspects including deployment architecture, rate limiting mechanisms, performance optimization, and user experience enhancement.
This chapter will provide comprehensive elaboration from dimensions of system deployment, component responsibilities, request control strategies, performance metrics, and user interaction experience, aiming to build a highly available, maintainable, and continuously evolving intelligent Q&A system.
Deployment Architecture
The entire system deployment is divided into two main modules: Knowledge Retrieval and Q&A Service and Knowledge Base Construction Service (ETL). Each module consists of multiple core components, each with its own responsibilities, working collaboratively.
Knowledge Retrieval and Q&A Service
This module is responsible for receiving user question requests, executing retrieval and generation processes, and returning final answers. Main components include:
Qdrant Vector Database
- Responsible for storing and querying semantic vectors and sparse vectors of QA pairs;
- Supports efficient approximate nearest neighbor search (ANN), combined with RRF for hybrid ranking;
- Provides RESTful interfaces and SDKs for integration into retrieval services.
MySQL Database
- Records user query logs and feedback records from large model Q&A
Server Application
- Core server-side logic, exposing unified API interfaces externally;
- Coordinates retrieval and generation processes, handles multi-path requests;
- Supports general capabilities like logging and exception catching.
Client Application
- User frontend interface, integrated into technical community or help center pages;
- Supports functions like inputting questions, viewing answers, like/dislike feedback, clicking links to jump to original text;
- Provides multi-turn conversation support, recognizes context and automatically performs question rewriting.
Third-party LLM Services
- Interfaces with external large language models and TextEmbedding services through APIs (such as Alibaba Cloud Bailian Platform, DeepSeek, etc.);
- Supports flexible configuration of different model services for easy switching and testing;
- Monitors request frequency, response time, and error rates to ensure service quality.
Knowledge Base Construction Service (ETL)
This module is responsible for periodically extracting QA content from original documents, vectorizing and updating indexes, serving as the core component for knowledge updates throughout the system.
ETL Application
- Scheduled task-driven data extraction and processing program;
- Executes the complete process of document crawling, QA pair generation, text embedding, and vector database insertion;
- Supports both full construction and incremental update modes, adapting to different content source update frequencies.
Third-party LLM Services
- Shares the same LLM integration capabilities with the Q&A service, used for QA extraction and summary generation tasks;
- Shares the same TextEmbedding integration capabilities with the Q&A service, used for generating text semantic vectors and sparse vectors;
- Can dynamically adjust concurrency and timeout limits based on requirements to improve construction efficiency.
Request Rate Limiting Strategy
To prevent sudden traffic from impacting the system while ensuring fairness and resource utilization, we introduced rate limiting mechanisms at the service entry layer, implementing fine-grained request control using the Moving Window Rate Limiter algorithm.
Multi-dimensional Rate Limiting Settings
- By Interface Dimension: Set independent rate limiting rules for "search interface," "generation interface," etc.;
- By Time Dimension: Support rate limiting control at three levels: per minute, per hour, and per day;
Moving Window Algorithm Advantages
Compared to traditional Fixed Window algorithms, the moving window algorithm avoids potential "sudden request bursts" at window boundaries, providing higher accuracy and stability.
Example strategy:
- Maximum 6,000 search requests per minute;
- Maximum 100,000 search requests per hour;
- Maximum 1,000,000 search requests per day.
Circuit Breaker and Degradation Mechanisms
When the system detects anomalies or excessively high response delays in downstream services (such as LLM or vector databases), circuit breaker mechanisms are triggered:
- Display "Current load is high, please try again later";
- Combine with alert systems to notify operations personnel for timely intervention.
Performance Targets and Monitoring
Overall system performance directly determines user experience and system stability. We set the following performance targets and established corresponding monitoring systems.
Retrieval Performance
- Concurrency support: Evaluate system daily usage requirements and instantaneous concurrent requests based on user monthly and daily UV and PV;
- Network bandwidth: Evaluate bandwidth usage based on text size of retrieval results, especially downstream bandwidth;
- Response time: Considering user experience, ensure completion of a complete retrieval process within at least 2 seconds;
- Support asynchronous loading of partial results to improve perceived speed.
Generation Performance
- Depends on third-party LLM response speed, typically within 30 seconds;
- Generation services implement internal request queuing and concurrency control to avoid exceeding LLM interface concurrency limits;
- Set maximum wait timeout thresholds, automatically degrading to backup answers when exceeded.
Monitoring Metrics
- Request success rate, failure rate, average response time;
- LLM service call status (Token usage, duration, error codes);
- Vector database query performance (QPS, latency distribution);
- Log collection and analysis.
User Experience Optimization
To improve user satisfaction and engagement during actual usage, we conducted multi-faceted optimization design at the product level.
Search Interface Optimization
- Simple and intuitive interface, providing product category selection and question input box;
- Display clear answer results with source links;
- Support Markdown format display of detailed answers (Full Answer).
Multi-turn Conversation Support
- System can understand context and automatically recognize follow-up intentions;
- Convert ambiguous expressions into clear question expressions through question rewriting mechanisms;
- Support context memory functionality, eliminating need to repeat background information during continuous interactions.
User Feedback Mechanisms
- Provide "like/dislike" buttons in the answer area to collect user satisfaction;
- All feedback data is stored in database for subsequent quality assessment and model optimization.
- Users or technical support can submit specific suggestions or bug reports, and the team will assess task priorities and iterate the system;
Continuous System Optimization
- Control response time within seconds to improve user perceived fluency;
- Regularly perform incremental updates of the knowledge base to maintain content freshness;
- Continuously optimize retrieval and generation strategies based on user feedback and usage behavior data.
Summary
Through the design and implementation of the above engineering solution, we built an intelligent Q&A system with clear structure, complete functionality, and strong scalability. From layered design of deployment architecture to fine control of rate limiting mechanisms; from reasonable setting of performance metrics to continuous optimization of user interaction experience—every aspect revolves around the goals of "stable availability, efficient response, and good experience."
These engineering practices not only ensure normal system operation but also lay a solid foundation for subsequent data analysis and product iteration.
Looking Forward to the Next Stage: Implementation Results Analysis
After completing system construction, we enter the critical validation and evaluation stage. The next chapter will focus on implementation results analysis of this system in actual business scenarios, including:
- Usage Data Statistics: Daily active users, question frequency, popular question distribution;
- Accuracy Assessment: Answer accuracy and relevance scores under manual spot-checking;
- User Feedback Analysis: Like/dislike ratios, common types of dissatisfaction issues;
- Performance Review: Average response time, interface success rate, system stability;
- Typical Application Scenario Cases: Such as customer service Q&A, document queries, newcomer guidance, etc.;
- Future Optimization Directions: Propose improvement plans based on data analysis, including model fine-tuning, retrieval strategy optimization, new feature planning, etc.
Through in-depth analysis of real usage scenarios, we will comprehensively evaluate the system's capability boundaries and optimization space, providing basis for subsequent functionality enhancement and technical upgrades.
Please see the detailed interpretation in the "Implementation Results" chapter.