Pointwise (scoring): score each passage independently (1-10) and output the result as {id: score}
Listwise (ranking): output the ranking directly, e.g. "id1>id3>id0"
Pairwise (comparison): compare passages two at a time; the most expensive approach (O(K²))
Decision: choose Pointwise, because its output is structured, easy to optimize, and parallelizes well.
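To make the pointwise contract concrete, here is a minimal sketch of a single scoring call. The prompt wording and the `llm_complete` helper are illustrative assumptions, not the article's actual implementation.

```python
import json

# Hypothetical helper: wrap whatever LLM client is in use and return its raw text reply.
def llm_complete(prompt: str) -> str:
    raise NotImplementedError("plug in your chat-completions client here")

def pointwise_scores(query: str, passages: list[str]) -> dict[str, int]:
    """Score every passage independently and return {passage_id: score}."""
    blocks = "".join(
        f"<passage id='id{i}'>{p}</passage>" for i, p in enumerate(passages)
    )
    prompt = (
        "Score each passage from 0 to 10 for relevance to the query. "
        'Return compact JSON such as {"id0":8,"id1":6}.\n'
        f"<query>{query}</query><passages>{blocks}</passages>"
    )
    return json.loads(llm_complete(prompt))
```

Because each passage is scored independently, the candidate set can be split across several such calls and the score maps merged afterwards, which is what enables the parallelization described later.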
High latency: too many output tokens make each call slow
Unstable formatting: the LLM may emit duplicate IDs, missing IDs, or malformed output
Large input: 40 passages × 200 tokens ≈ 8,000 tokens, which puts pressure on the context window
Position bias: the LLM is sensitive to input order and tends to overrate passages that appear early
Remove whitespace: spaces are expensive tokens; switching to a compact JSON format cuts tokens by 28% (illustrated in the sketch below)
Threshold filtering: output only passages scoring ≥5 and omit the low scorers, cutting latency by a further 50%
Failed attempt: dropping the "id" prefix to save another 20% of tokens made the model confuse indices with scores, hurting quality
Result: fewer output tokens substantially reduce end-to-end latency.
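A small illustration of the two output-side optimizations (the 28% and 50% figures above are the article's measurements; this snippet only shows the mechanism):

```python
import json

scores = {"id0": 8, "id1": 6, "id2": 3, "id3": 9}

# Threshold filter: keep only passages scoring >= 5, as the output contract requires.
kept = {pid: s for pid, s in scores.items() if s >= 5}

# Compact serialization: no space after ',' or ':', so the model emits fewer characters/tokens.
compact = json.dumps(kept, separators=(",", ":"))
default = json.dumps(kept)

print(compact)   # {"id0":8,"id1":6,"id3":9}
print(default)   # {"id0": 8, "id1": 6, "id3": 9}
```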
Split the K candidate passages into N batches processed in parallel (e.g. 40 passages → 4 batches × 10 passages):
Batch assignment strategy:
Problem: contiguous splits amplify position bias (the first batch would be all high-scoring passages)
Solution: round-robin assignment B_j = {p_t | t mod N = j} (sketched in the code after this list)
This guarantees each batch contains a mix of high-, medium-, and low-similarity passages
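A minimal sketch of the round-robin assignment, assuming the passages arrive already sorted by vector-search similarity; the function name is illustrative.

```python
def round_robin_batches(passages: list[str], num_batches: int) -> list[list[tuple[int, str]]]:
    """Assign passage t (in similarity order) to batch t mod N, keeping its original index.

    Because the input is sorted by similarity, each batch receives a mix of high-, medium-,
    and low-similarity candidates instead of one batch getting all the best ones.
    """
    batches = [[] for _ in range(num_batches)]
    for t, passage in enumerate(passages):
        batches[t % num_batches].append((t, passage))
    return batches

# 40 candidates -> 4 batches of 10; batch 0 holds original indices 0, 4, 8, ...
batches = round_robin_batches([f"passage {i}" for i in range(40)], num_batches=4)
assert [idx for idx, _ in batches[0]] == list(range(0, 40, 4))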
Key engineering details:
Merge rule: sort by LLM score; use a BGE cross-encoder to break ties or fill in missing passages
Score calibration: add detailed grading criteria, a rubric, and few-shot examples so the scoring scale stays consistent across batches
Fault tolerance: set a per-batch timeout and fall back to the BGE cross-encoder on failure (see the sketch after this list)
Latency trade-off: monitor P95/P99 tail latency; accept a small probability of timeouts in exchange for lower average latency
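One way the dispatch, timeout fallback, and merge rule could fit together, as a hedged sketch. `score_batch_with_llm` and `cross_encoder_score` are placeholder names for the LLM call and the BGE cross-encoder, not functions from the article.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real scoring calls.
def score_batch_with_llm(query: str, batch: list[tuple[int, str]]) -> dict[int, int]:
    raise NotImplementedError  # one LLM call scoring the whole batch, returns {index: score}

def cross_encoder_score(query: str, passage: str) -> float:
    raise NotImplementedError  # e.g. a BGE cross-encoder relevance score

def rerank(query: str, batches: list[list[tuple[int, str]]], timeout_s: float = 2.0):
    """Score all batches in parallel; a batch that times out or fails falls back to the cross-encoder."""
    llm_scores = {}
    with ThreadPoolExecutor(max_workers=len(batches)) as pool:
        futures = {pool.submit(score_batch_with_llm, query, batch): batch for batch in batches}
        for future, batch in futures.items():
            try:
                llm_scores.update(future.result(timeout=timeout_s))
            except Exception:
                # Timeout or malformed output: mark the whole batch as unscored by the LLM.
                for idx, _ in batch:
                    llm_scores[idx] = None

    # Merge: primary key is the LLM score; ties and unscored items are ordered by the cross-encoder.
    def sort_key(item: tuple[int, str]) -> tuple[float, float]:
        idx, passage = item
        llm = llm_scores.get(idx)
        return (llm if llm is not None else -1.0, cross_encoder_score(query, passage))

    all_items = [item for batch in batches for item in batch]
    return sorted(all_items, key=sort_key, reverse=True)
```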
Benefits:
A smaller, faster, cheaper LLM can be used
Shorter input/output per batch, combined with prompt caching, reduces latency further
The round-robin strategy effectively mitigates position bias
Copilot has to retrieve several entity types (internal documents, past conversations, public articles), and past conversations can easily drown out authoritative content:
Solution:
Per-channel retrieval: retrieve each content type independently
Layered reranking: rerank within each channel with the LLM
Heuristic merging: ensure the top results contain a balanced mix of source types (see the merge sketch below)
Result: citations of conversation snippets dropped by 27%, while citations of public and internal articles rose by 63%.
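The article only says a heuristic merge is used, not which one; a simple interleaving heuristic that keeps every channel represented near the top might look like this sketch.

```python
from itertools import zip_longest

def merge_channels(ranked_by_channel: dict[str, list[str]], top_k: int = 10) -> list[str]:
    """Interleave per-channel reranked lists so every content type appears among the top results."""
    merged, seen = [], set()
    for row in zip_longest(*ranked_by_channel.values()):
        for doc in row:
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged[:top_k]

channels = {
    "internal_docs": ["doc_a", "doc_b"],
    "public_articles": ["art_x", "art_y"],
    "past_conversations": ["conv_1", "conv_2", "conv_3"],
}
print(merge_channels(channels, top_k=5))
# ['doc_a', 'art_x', 'conv_1', 'doc_b', 'art_y']
```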
The prompt has four core components:
Grading criteria: a detailed 0-10 rubric emphasizing "actionability" and "intent match"
10: perfect match, exact steps requiring no interpretation
5-6: partially relevant, requires adaptation by the user
0-4: excluded outright and never output
Input format: <query> plus a <passages> block of <passage id='id0'>...</passage> elements
Output format: compact JSON such as {"id0":8,"id1":6} (a parsing/validation sketch follows this list), with strict constraints:
Only output passages scoring 5-10
No whitespace, no extra text
Preserve the original ID order
Few-shot examples: keep the scoring scale consistent
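Since the model can still emit duplicate IDs, unknown IDs, or out-of-range scores, the caller needs a defensive parser. A minimal sketch, assuming `idN` keys and the 5-10 score band from the prompt:

```python
import json

def parse_reranker_output(raw: str, num_passages: int) -> dict[str, int]:
    """Parse the compact JSON scores and drop anything that violates the output contract.

    Returns {} when the output is unparseable, so the caller can fall back
    (e.g. to a cross-encoder) instead of crashing.
    """
    try:
        data = json.loads(raw.strip())
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}

    valid_ids = {f"id{i}" for i in range(num_passages)}
    scores = {}
    for key, value in data.items():
        # Keep only known IDs with integer scores in the allowed 5-10 band.
        if key in valid_ids and isinstance(value, int) and 5 <= value <= 10:
            scores[key] = value
    return scores

print(parse_reranker_output('{"id0":8,"id1":6,"id9":11,"oops":7}', num_passages=5))
# {'id0': 8, 'id1': 6}
```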
Advantage of LLM reranking: quality is clearly better than open-source cross-encoders, confirming that "reranking quality directly affects RAG performance"
Cost: even after optimization there is still +0.9s of latency, and the parallel system adds complexity
Next step: train a custom lightweight reranker by distilling knowledge from the LLM reranker as the teacher model, targeting lower latency at comparable quality
Engineering > algorithms: the LLM's inherent latency is tackled with engineering measures such as token optimization, parallelization, and batching strategy
Systems thinking: consider the whole pipeline (vector search → reranking → generation) rather than optimizing a single stage in isolation
Data-driven: every decision is backed by A/B tests, e.g. the cost-effectiveness trade-off between Pointwise and Listwise
Practical and open: the production-grade prompt and the full optimization path are shared directly, making the approach highly actionable
When it applies: applications that need high answer quality, can tolerate second-level latency, and already have a stable RAG pipeline. For lower latency, follow the custom-model distillation path mentioned in the article.
You are a customer support answer service. Your task is to evaluate help center passages and score their relevance to a given customer query for a retrieval augmented generation (RAG) system.
**Evaluation Process:**
1. Analyze the customer's query to identify both explicit needs and implicit context including underlying user goals
2. Assess each passage's ability to directly resolve the query or provide substantive supporting information with actionable guidance
3. Score based on how effectively the passage addresses the query's core intent while considering potential interpretations
**Grading Criteria:** <grading_scale>
10: EXCEPTIONAL match - Contains exact step-by-step instructions that perfectly match the query's specific scenario. Must include all required parameters/context and resolve the issue completely without any ambiguity. Reserved for definitive solutions that exactly mirror the user's described situation and require no interpretation.
9: NEAR-PERFECT solution - Contains all critical steps for resolution but may lack one minor non-essential detail. Addresses the precise query parameters with specialized information. Solution must be directly applicable without requiring adaptation or assumptions.
8: STRONG MATCH - Provides complete technical resolution through specific instructions, but may require simple logical inferences for full application. Covers all essential components but might need minor contextualization.
7: GOOD MATCH - Contains substantial relevant details that address core aspects of the query, but lacks one important element for complete resolution. Provides concrete guidance requiring some user interpretation.
6: PARTIAL match – General guidance on the right topic but lacks the specifics for direct application. May only resolve a subset of the request.
5: LIMITED relevance – Related context or approach, but indirect. Requires substantial effort to adapt to the user's exact need.
4: TANGENTIAL – Mentions related concepts/keywords with little practical connection to the request. Minimal actionable value.
3: VAGUE domain info – Talks about the general area but not the query's specifics. No concrete, actionable steps.
2: TOKEN overlap – Shares isolated terms without context or intent aligned to the request. Similarity is coincidental.
1: IRRELEVANT – Uses query terms in a completely unrelated way. No meaningful link to the user's goal.
0: UNRELATED – No thematic or contextual connection to the query at all. </grading_scale>
**Input Format:**
<input_format>
<query> // The customer's question or request </query> <passages> <passage id='id0'>...</passage> <passage id='id1'>...</passage> ... </passages> </input_format>
**Output Format:**
<output_format> Return your response in a valid JSON (skip spaces): {{"id0":score0,"id1":score1,...}}
Strict guidelines:
- Return ONLY a well-formed valid JSON with passage IDs as keys
- Each key must be a passage id in the format "idN"
- Each score must be an integer between 5 and 10. EXCLUDE passages that score below 5 (i.e. 0, 1, 2, 3 or 4)
- Integer values only, no decimals
- Skip spaces in the JSON
- No additional text or formatting
- Maintain original passage ID order
- Note: If NO passages score 5+, return empty JSON object </output_format>
<examples>
{few_shot_examples}
</examples>
Source: Using LLMs as a Reranker for RAG: A Practical Guide (/research)