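2. Naive RAG

# 2.1 Definition

Naive RAG is the most basic form of RAG and follows the simplest, most direct pipeline: split documents into chunks, embed the chunks and store the vectors, retrieve by similarity, and pass the retrieved text straight to the LLM to generate an answer. It is simple to implement, but in practice it often suffers from low retrieval precision and poor use of the retrieved context. That makes it a useful starting point for understanding how RAG works and where later optimizations come in.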

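# 2.2 Typical architecture of naive RAG

# 2.2.1 Knowledge base construction

Document preprocessing → chunking → embedding → storage (vector database)

Document preprocessing:

Document loading: support multiple formats (PDF, Word, TXT, HTML, etc.)
Text cleaning: remove irrelevant formatting, special characters, and duplicated content
Content extraction: extract plain text from structured documents

Chunking strategies:

Fixed-length chunking: split at a fixed number of characters or tokens (e.g. 512 tokens per chunk)
Paragraph chunking: split at natural paragraph boundaries
Sentence chunking: split at sentence-ending punctuation such as periods and question marks
Overlapping chunks: keep some overlap between adjacent chunks (e.g. 50-100 tokens)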
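
The fixed-length and overlapping strategies can be combined in a few lines. The following is a minimal sketch, not part of the original pipeline: `chunk_with_overlap` is a hypothetical helper, and whitespace splitting stands in for a real tokenizer.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=50):
    """Split a token list into fixed-size chunks; neighbours share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks

# Assumption: whitespace "tokens" are used only to keep the demo dependency-free.
tokens = "a b c d e f g h i j".split()
chunks = chunk_with_overlap(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last token of the previous one, so a sentence that straddles a boundary is still partly visible in both chunks.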

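Embedding:

Embedding model selection: commonly used models include text-embedding-ada-002
Vector storage: store each text chunk together with its vector in the vector database

# 2.2.2 Retrieval-augmented generation

User query → query embedding → similarity search → context construction → LLM generation

Query processing:

Query embedding: embed the user's question with the same embedding model used at indexing time

Retrieval:

Similarity computation: usually cosine similarity or dot product
Top-K retrieval: return the K chunks with the highest similarity
Threshold filtering: optionally drop results below a similarity threshold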
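
The three retrieval steps above can be sketched without any vector database. This is an illustrative brute-force version under simple assumptions: `cosine_similarity` and `top_k` are hypothetical helper names, and the two-dimensional vectors are toy data rather than real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2, min_score=None):
    """Return up to k (chunk_index, score) pairs, best first."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    if min_score is not None:
        # Optional threshold filtering
        scored = [pair for pair in scored if pair[0] >= min_score]
    return [(i, score) for score, i in scored[:k]]

# Toy vectors standing in for chunk embeddings
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.0], chunk_vecs, k=2, min_score=0.5)
print(hits)
```

A vector database performs the same ranking, usually with an approximate index so that it does not have to compare the query against every stored vector.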

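Generation:

Prompt construction: combine the retrieved chunks and the user query into a structured prompt
LLM call: send the prompt to the large language model to generate the final answer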

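# 2.3 Implementation example

To make the workings of naive RAG concrete, this section walks through a complete example. Suppose we want to build a question-answering system over the Llama 2 technical paper, so that users can ask natural-language questions and get accurate information about the Llama 2 models.

Scenario:

Knowledge source: the official Llama 2 technical paper (PDF)
Target functionality: answer users' technical questions about the Llama 2 models
Tech stack: OpenAI embeddings + Chroma vector database + a GPT-4-class model (gpt-4o in the code)

System architecture:

Document processing layer: extracts and preprocesses text from the PDF
Embedding layer: converts text into vectors with an embedding model
Storage layer: stores and retrieves vectors with a vector database
Application layer: handles user queries and generates the final answer

The following sections implement each component in turn.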

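# 2.3.1 Document extraction and splitting

import fitz  # PyMuPDF

def extract_text_from_pdf(filename, page_numbers=None, min_line_length=1):
    """Extract text from a PDF (optionally only the given 1-based pages) and group it into paragraphs."""
    paragraphs = []
    buffer = ''
    full_text = ''

    # Open the PDF; the context manager closes the file when done
    with fitz.open(filename) as doc:
        for i in range(len(doc)):
            # If specific pages were requested, skip all others (page numbers are 1-based)
            if page_numbers is not None and (i + 1) not in page_numbers:
                continue

            # Extract the plain text of this page
            page = doc.load_page(i)
            full_text += page.get_text("text") + '\n'

    # Rebuild paragraphs: a line shorter than min_line_length (e.g. a blank line)
    # ends the current paragraph
    hyphenated = False  # True if the previous line ended with a hyphen
    for line in full_text.split('\n'):
        if len(line) >= min_line_length:
            # Join lines with a space, except right after a hyphenated line break
            if buffer and not hyphenated:
                buffer += ' '
            hyphenated = line.endswith('-')
            buffer += line[:-1] if hyphenated else line
        elif buffer:
            paragraphs.append(buffer)
            buffer = ''
            hyphenated = False

    if buffer:
        paragraphs.append(buffer)

    return paragraphs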

# 2.3.2 Vector database

ChromaDB is lightweight and easy to deploy, so this demo uses it as the vector database.

import chromadb
from chromadb.config import Settings

class VectorDBConnector:
    def __init__(self, collection_name, embedding_fn):
        # Alternative: an in-process, in-memory client (no server needed)
        # chroma_client = chromadb.Client(Settings(allow_reset=True))
        # Connect to a Chroma server running on localhost:8000
        chroma_client = chromadb.HttpClient(
            settings=Settings(allow_reset=True),
            host="localhost",
            port=8000
        )

        # Create the collection, or reuse it if it already exists
        self.collection = chroma_client.get_or_create_collection(name=collection_name)
        self.embedding_fn = embedding_fn

    def add_documents(self, documents):
        '''Add documents and their vectors to the collection'''
        # Number of documents currently in the collection
        current_count = self.collection.count()

        # Only add documents when the collection is empty, so reruns do not insert duplicates
        if current_count == 0:
            self.collection.add(
                embeddings=self.embedding_fn(documents),  # one vector per document
                documents=documents,  # original text of each document
                ids=[f"id{i}" for i in range(len(documents))]  # one id per document
            )

    def search(self, query, top_n):
        '''Query the vector database for the top_n most similar documents'''
        results = self.collection.query(
            query_embeddings=self.embedding_fn([query]),
            n_results=top_n
        )
        return results

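# 2.3.3 Embedding and LLM integration

This part integrates OpenAI's embedding model and GPT chat model. The embedding model converts text into vectors; the GPT model generates the final answer from the retrieved context. A well-designed prompt template steers the model to rely on the retrieved information, which improves the accuracy and relevance of the answer.

from openai import OpenAI

# The API key must already be set in the environment (OPENAI_API_KEY)
client = OpenAI()

prompt_template = """
You are a question-answering bot.
Your task is to answer the user's question based on the known information given below.

Known information:
{context}

User question:
{query}

If the known information does not contain the answer to the user's question, or is insufficient to answer it, reply only with "I cannot answer your question."
Do not output any information or answers that are not contained in the known information.
Please answer the user's question in English.
"""

def get_embeddings(texts, model="text-embedding-ada-002", dimensions=None):
    """Wrapper around the OpenAI embeddings API; returns one vector per input text"""
    # text-embedding-ada-002 does not accept the `dimensions` parameter
    if model == "text-embedding-ada-002":
        dimensions = None
    if dimensions:
        data = client.embeddings.create(
            input=texts, model=model, dimensions=dimensions).data
    else:
        data = client.embeddings.create(input=texts, model=model).data
    return [x.embedding for x in data]

def get_completion(prompt, model="gpt-4o"):
    '''Wrapper around the OpenAI chat completions API'''
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,  # sampling randomness; 0 gives the least random output
    )
    return response.choices[0].message.content

def build_prompt(prompt_template, **kwargs):
    '''Fill the prompt template with the given values'''
    inputs = {}
    for k, v in kwargs.items():
        # A list of strings (e.g. retrieved chunks) is joined with blank lines
        if isinstance(v, list) and all(isinstance(elem, str) for elem in v):
            val = '\n\n'.join(v)
        else:
            val = v
        inputs[k] = val
    return prompt_template.format(**inputs)

class RAG_Bot:
    def __init__(self, vector_db, llm_api, n_results=2):
        self.vector_db = vector_db
        self.llm_api = llm_api
        self.n_results = n_results

    def chat(self, user_query):
        # 1. Retrieve
        search_results = self.vector_db.search(user_query, self.n_results)

        # 2. Build the prompt (documents[0] holds the hits for the first and only query)
        prompt = build_prompt(
            prompt_template, context=search_results['documents'][0], query=user_query)

        print(f"prompt:{prompt}")
        # 3. Call the LLM
        response = self.llm_api(prompt)
        return response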

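# 2.3.4 System integration and main program

This step wires all of the components implemented above into a complete naive RAG system.

# The code from section 2.3.3 is expected to live in openai_call.py
from openai_call import get_embeddings, get_completion, RAG_Bot
from pdfutil import extract_text_from_pdf
from vector_db_connector import VectorDBConnector

# https://qiniu.agiadventurer.com/llama2.pdf
pdf_path = 'llama2.pdf'  # replace with the path to your PDF file
# Use only pages 3 and 4; lines shorter than 10 characters act as paragraph breaks
paragraphs = extract_text_from_pdf(pdf_path, [3, 4], min_line_length=10)

vector_db = VectorDBConnector("demo", get_embeddings)
# Add the paragraphs to the vector database
vector_db.add_documents(paragraphs)

user_query = "How many parameters does Llama 2 have?"
# Optional: inspect the raw retrieval results (the bot runs its own search)
results = vector_db.search(user_query, 2)

bot = RAG_Bot(
    vector_db,
    llm_api=get_completion
)

response = bot.chat(user_query)

print(f"LLM answer:\n{response}")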

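# 2.3.5 Analysis of the run output

Below is the full trace of the system handling the user query "How many parameters does Llama 2 have?". The output shows the prompt built from the two chunks retrieved from the vector database (text originally extracted from the PDF), followed by the answer the GPT model generates from that context.

prompt:
You are a question-answering bot.
Your task is to answer the user's question based on the known information given below.

Known information: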
Figure 3: Safety human evaluation results for Llama 2-Chat compared to other open-source and closed source models. Human raters judged model generations for safety violations across ~2,000 adversarial prompts consisting of both single and multi-turn prompts. More details can be found in Section 4.4. It is important to caveat these safety results with the inherent bias of LLM evaluations due to limitations of the prompt set, subjectivity of the review guidelines, and subjectivity of individual raters. Additionally, these safety evaluations are performed using content standards that are likely to be biased towards the Llama 2-Chat models. We are releasing the following models to the general public for research and commercial use‡: 1. Llama 2, an updated version of Llama 1, trained on a new mix of publicly available data. We also increased the size of the pretraining corpus by 40%, doubled the context length of the model, and adopted grouped-query attention (Ainslie et al., 2023). We are releasing variants of Llama 2 with 7B, 13B, and 70B parameters. We have also trained 34B variants, which we report on in this paper but are not releasing.§ 2. Llama 2-Chat, a ﬁne-tuned version of Llama 2 that is optimized for dialogue use cases. We release variants of this model with 7B, 13B, and 70B parameters as well. We believe that the open release of LLMs, when done safely, will be a net beneﬁt to society. Like all LLMs, Llama 2 is a new technology that carries potential risks with use (Bender et al., 2021b; Weidinger et al., 2021; Solaiman et al., 2023). Testing conducted to date has been in English and has not — and could not — cover all scenarios. Therefore, before deploying any applications of Llama 2-Chat, developers should perform safety testing and tuning tailored to their speciﬁc applications of the model. We provide a responsible use guide¶ and code examples‖ to facilitate the safe deployment of Llama 2 and Llama 2-Chat. 
More details of our responsible release strategy can be found in Section 5.3. The remainder of this paper describes our pretraining methodology (Section 2), ﬁne-tuning methodology (Section 3), approach to model safety (Section 4), key observations and insights (Section 5), relevant related work (Section 6), and conclusions (Section 7). ‡https://ai.meta.com/resources/models-and-libraries/llama/ §We are delaying the release of the 34B model due to a lack of time to suﬃciently red team. ¶https://ai.meta.com/llama ‖https://github.com/facebookresearch/llama

Introduction Large Language Models (LLMs) have shown great promise as highly capable AI assistants that excel in complex reasoning tasks requiring expert knowledge across a wide range of ﬁelds, including in specialized domains such as programming and creative writing. They enable interaction with humans through intuitive chat interfaces, which has led to rapid and widespread adoption among the general public. The capabilities of LLMs are remarkable considering the seemingly straightforward nature of the training methodology. Auto-regressive transformers are pretrained on an extensive corpus of self-supervised data, followed by alignment with human preferences via techniques such as Reinforcement Learning with Human Feedback (RLHF). Although the training methodology is simple, high computational requirements have limited the development of LLMs to a few players. There have been public releases of pretrained LLMs (such as BLOOM (Scao et al., 2022), LLaMa-1 (Touvron et al., 2023), and Falcon (Penedo et al., 2023)) that match the performance of closed pretrained competitors like GPT-3 (Brown et al., 2020) and Chinchilla (Hoﬀmann et al., 2022), but none of these models are suitable substitutes for closed “product” LLMs, such as ChatGPT, BARD, and Claude. These closed product LLMs are heavily ﬁne-tuned to align with human preferences, which greatly enhances their usability and safety. This step can require signiﬁcant costs in compute and human annotation, and is often not transparent or easily reproducible, limiting progress within the community to advance AI alignment research. In this work, we develop and release Llama 2, a family of pretrained and ﬁne-tuned LLMs, Llama 2 and Llama 2-Chat, at scales up to 70B parameters. On the series of helpfulness and safety benchmarks we tested, Llama 2-Chat models generally perform better than existing open-source models. 
They also appear to be on par with some of the closed-source models, at least on the human evaluations we performed (see Figures 1 and 3). We have taken measures to increase the safety of these models, using safety-speciﬁc data annotation and tuning, as well as conducting red-teaming and employing iterative evaluations. Additionally, this paper contributes a thorough description of our ﬁne-tuning methodology and approach to improving LLM safety. We hope that this openness will enable the community to reproduce ﬁne-tuned LLMs and continue to improve the safety of those models, paving the way for more responsible development of LLMs. We also share novel observations we made during the development of Llama 2 and Llama 2-Chat, such as the emergence of tool usage and temporal organization of knowledge.

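User question:
How many parameters does Llama 2 have?

If the known information does not contain the answer to the user's question, or is insufficient to answer it, reply only with "I cannot answer your question."
Do not output any information or answers that are not contained in the known information.
Please answer the user's question in English.

LLM answer:
Llama 2 comes in variants with 7B, 13B, and 70B parameters.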

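# 2.4 Limitations of naive RAG

# 2.4.1 Retrieval quality

Inaccurate semantic matching:

Missing context: vector similarity alone cannot capture complex semantic relationships

Irrelevant retrieval results:

Noisy documents: chunks unrelated to the topic of the query are retrieved
Redundant information: several similar chunks repeat the same information
Missed key information: important content can be missed because it is phrased differently from the query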
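
To make the redundancy problem concrete, the sketch below filters near-duplicate chunks before they reach the prompt. It is one possible mitigation, not something the naive pipeline does: `jaccard` and `drop_near_duplicates` are hypothetical helpers, the 0.8 threshold is an arbitrary assumption, and word-set overlap is only a rough stand-in for semantic similarity.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two strings (0.0 to 1.0)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty strings count as identical
    return len(words_a & words_b) / len(words_a | words_b)

def drop_near_duplicates(chunks, threshold=0.8):
    """Keep a chunk only if it is not too similar to any chunk kept before it."""
    kept = []
    for chunk in chunks:
        if all(jaccard(chunk, earlier) < threshold for earlier in kept):
            kept.append(chunk)
    return kept

retrieved = [
    "Llama 2 has 7B 13B and 70B variants",
    "llama 2 has 7B 13B and 70B variants",   # same content, different casing
    "Llama 2 doubles the context length",
]
unique_chunks = drop_near_duplicates(retrieved)
print(unique_chunks)
```

Because the input is assumed to be ordered by relevance, the first (highest-ranked) copy of each near-duplicate group is the one that survives.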

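# 2.4.2 Limits of the chunking strategy

Broken semantic integrity:

Information split across chunks: important information is divided between different chunks
Lost context: chunking can break the document's logical structure and contextual relationships
Broken references: links to structured content such as tables and figures are lost

Chunk granularity:

Chunks too small: information is incomplete and lacks enough context
Chunks too large: too much irrelevant information is included, which hurts retrieval precision
Fixed chunking: cannot adapt to the characteristics of different document types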

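# 2.4.3 Poor use of context

Weak information integration:

Fragmented information: it is hard to combine information from multiple chunks
Insufficient reasoning: the system cannot reason logically across documents
Ignored temporal relationships: the chronological order and development of information is ignored

Context window limits:

Length limit: the LLM's context length caps how much retrieved content can be passed in
Truncation: important information may be cut off because of the length limit
No prioritization: the system cannot distinguish how important different retrieved results are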
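
One way to see the length and truncation issues is to look at how retrieved chunks would be packed into a fixed budget. The sketch below is an assumption-laden illustration: `pack_context` is a hypothetical helper, the default token counter just counts whitespace-separated words, and a real system would use the tokenizer of its target model.

```python
def pack_context(ranked_chunks, max_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily keep the highest-ranked chunks that fit within max_tokens."""
    selected, used = [], 0
    for chunk in ranked_chunks:  # assumed to be ordered best first
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            continue  # skip the whole chunk rather than cut it mid-sentence
        selected.append(chunk)
        used += cost
    return selected, used

ranked = ["one two three", "four five six seven", "eight nine"]
selected, used = pack_context(ranked, max_tokens=5)
print(selected, used)
```

The second chunk is dropped even though it ranks above the third, which shows how a tight budget can silently discard relevant content.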

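# 2.4.4 No feedback mechanism

No self-optimization:

No feedback on retrieval quality: the system has no way to know how good its retrieval results are
Unknown user satisfaction: there is no mechanism for collecting user feedback
Errors go uncorrected: the system cannot learn from its mistakes and improve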

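# 2.5 Directions for improving naive RAG

# 2.5.1 Retrieval optimization

Hybrid retrieval: combine keyword search with vector search
Query expansion: expand the query with synonyms and related terms
Reranking: apply a second ranking pass to the retrieved results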
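
Hybrid retrieval needs some way to merge the keyword ranking and the vector ranking. The text does not prescribe one; the sketch below uses Reciprocal Rank Fusion (RRF) as one common choice. The document ids and both input rankings are made-up examples, and `k=60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists: each list contributes 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

keyword_hits = ["d3", "d1", "d2"]   # hypothetical keyword search results, best first
vector_hits = ["d1", "d4", "d3"]    # hypothetical vector search results, best first
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

RRF only looks at ranks, so it avoids having to normalize keyword scores and cosine similarities onto a common scale.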

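# 2.5.2 Chunking optimization

Smart chunking: chunk according to document structure and semantics
Hierarchical chunking: build a hierarchical representation of the document
Dynamic chunking: adjust the chunking strategy dynamically based on the query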
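
As a small step toward structure-aware chunking, the sketch below packs whole sentences into chunks so that no sentence is cut in half. It is a simplified illustration under stated assumptions: `sentence_chunks` is a hypothetical helper, the regular expression only handles English punctuation followed by whitespace, and chunk size is measured in characters rather than tokens.

```python
import re

def sentence_chunks(text, max_chars=200):
    """Group consecutive sentences into chunks of at most max_chars characters."""
    # Split after '.', '!' or '?' when followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)   # current chunk is full, start a new one
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks

text = "Llama 2 is open. It has three sizes. The largest has 70B parameters."
pieces = sentence_chunks(text, max_chars=40)
for piece in pieces:
    print(piece)
```

A sentence longer than `max_chars` still becomes its own oversized chunk; handling that case would need a fallback such as fixed-length splitting.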

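# 2.5.3 Generation optimization

Multi-round retrieval: retrieve additional information dynamically as generation proceeds
Answer verification: check the generated answer for factual accuracy
Confidence estimation: assess how trustworthy the answer is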
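
Answer verification and confidence estimation are usually done with a second model call or an entailment model. As a much cruder stand-in, the sketch below measures how many of the answer's tokens appear in the retrieved context. `support_score` is a hypothetical helper, and lexical overlap is only a rough proxy for groundedness, not a real fact check.

```python
import re

def _tokens(text):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def support_score(answer, context_chunks):
    """Fraction of answer tokens that also occur somewhere in the context."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set()
    for chunk in context_chunks:
        context_tokens |= _tokens(chunk)
    return len(answer_tokens & context_tokens) / len(answer_tokens)

context = ["We are releasing variants of Llama 2 with 7B, 13B, and 70B parameters."]
grounded = support_score("Llama 2 has 7B, 13B, and 70B variants", context)
ungrounded = support_score("Llama 2 has 175B parameters", context)
print(grounded, ungrounded)
```

A low score can be used to flag an answer for a stricter check or to fall back to the "cannot answer" response, but a high score does not prove the answer is correct.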

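Despite its many limitations, naive RAG laid the foundation for the development of RAG, and understanding how it works and where it falls short is a prerequisite for mastering advanced RAG techniques. In practice, naive RAG usually needs targeted optimization for the specific scenario at hand.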


Last updated: 2025/09/18, 08:17:39
