
# A Complete Example

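This is the final lesson of this minimalist `LangChain` primer. We will use what we learned in the previous nine lessons to build a fully featured LLM application. The application is built on the `LangChain` framework, uses the content of a `PDF` file as its knowledge base, and lets users ask questions about that file.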

We use a `LangChain` QA chain together with `Chroma` to implement semantic search over the PDF document. The example code uses the [AWS Serverless Developer Guide](https://docs.aws.amazon.com/pdfs/serverless/latest/devguide/serverless-core.pdf), an 84-page PDF document.

1. Install the required `Python` packages

In [1]:
!pip install -q langchain==0.0.235 openai chromadb pymupdf tiktoken

[?25l [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/1.3 MB[0m [31m?[0m eta [36m-:--:--[0m[2K [91m━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.2/1.3 MB[0m [31m6.5 MB/s[0m eta [36m0:00:01[0m[2K [91m━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━[0m [32m0.7/1.3 MB[0m [31m10.1 MB/s[0m eta [36m0:00:01[0m[2K [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━[0m [32m1.2/1.3 MB[0m [31m11.9 MB/s[0m eta [36m0:00:01[0m[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m10.3 MB/s[0m eta [36m0:00:00[0m
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m73.6/73.6 kB[0m [31m8.5 MB/s[0m eta [36m0:00:00[0m
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m405.5/405.5 kB[0m [31m13.1 MB/s[0m eta [36m0:00:00[0m
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m14.1/14.1 MB[0m [31m44.7 MB/s[0m eta [36m0:00:00[0m
[2K [90m━━━━━━━━━━━━━━━━━━

2. Set up the OpenAI environment

In [2]:
import os
os.environ['OPENAI_API_KEY'] = ''  # paste your OpenAI API key here
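
Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key is either already exported in the environment or typed in interactively. `ensure_api_key` is a hypothetical helper name for illustration, not part of `LangChain` or the OpenAI client:

```python
import os
from getpass import getpass


def ensure_api_key(env=os.environ, prompt=getpass, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, asking for it only if it is missing."""
    key = env.get(name, "").strip()
    if not key:
        # Nothing usable in the environment: ask without echoing the input.
        key = prompt(f"Enter {name}: ").strip()
        if not key:
            raise ValueError(f"{name} must not be empty")
        env[name] = key  # later cells read the key from the environment
    return key
```

Calling `ensure_api_key()` instead of assigning the key directly would leave `os.environ['OPENAI_API_KEY']` populated for the cells that follow.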

3. Download the PDF file of the AWS Serverless Developer Guide

In [3]:
!wget https://docs.aws.amazon.com/pdfs/serverless/latest/devguide/serverless-core.pdf

PDF_NAME = 'serverless-core.pdf'

--2023-08-17 11:42:20-- https://docs.aws.amazon.com/pdfs/serverless/latest/devguide/serverless-core.pdf
Resolving docs.aws.amazon.com (docs.aws.amazon.com)... 108.159.227.88, 108.159.227.51, 108.159.227.3, ...
Connecting to docs.aws.amazon.com (docs.aws.amazon.com)|108.159.227.88|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4727395 (4.5M) [application/pdf]
Saving to: ‘serverless-core.pdf’


2023-08-17 11:42:21 (12.0 MB/s) - ‘serverless-core.pdf’ saved [4727395/4727395]



4. Load the PDF file

In [4]:
from langchain.document_loaders import PyMuPDFLoader
docs = PyMuPDFLoader(PDF_NAME).load()

print(f'There are {len(docs)} document(s) in {PDF_NAME}.')
print(f'There are {len(docs[0].page_content)} characters in the first page of your document.')

There are 84 document(s) in serverless-core.pdf.
There are 27 characters in the first page of your document.


5. Split the documents and store the vector data of their text embeddings

In [5]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_docs = text_splitter.split_documents(docs)

embeddings = OpenAIEmbeddings()

vectorstore = Chroma.from_documents(split_docs, embeddings, collection_name="serverless_guide")
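
To make the `chunk_size=1000` and `chunk_overlap=200` settings concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` is implemented: the real splitter first tries to break on separators such as paragraphs, lines, and spaces, so its chunks are usually shorter and end at more natural boundaries. `chunk_with_overlap` is a hypothetical name used only for this illustration:

```python
def chunk_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each new window starts this far after the last
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


page = "x" * 2500
print([len(c) for c in chunk_with_overlap(page)])  # [1000, 1000, 900]
```

The overlap means that a sentence cut at the end of one chunk still appears whole at the start of the next, which tends to help retrieval at the cost of storing some text twice.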

6. Create a QA chain backed by OpenAI

In [6]:
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

llm = OpenAI(temperature=0)
chain = load_qa_chain(llm, chain_type="stuff")
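
With `chain_type="stuff"`, all input documents are placed ("stuffed") into a single prompt, which works as long as the combined text fits in the model's context window. The idea can be sketched as follows. The template wording and the `build_stuff_prompt` name are assumptions for illustration, not the prompt `LangChain` ships:

```python
def build_stuff_prompt(documents, question):
    """Concatenate every document into one context block and append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


print(build_stuff_prompt(["First passage.", "Second passage."], "What is covered?"))
```

Because every retrieved chunk ends up in one request, the number of chunks retrieved and their size together bound the prompt length.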

7. Run a similarity search for the question

In [7]:
query = "What is the use case of AWS Serverless?"
similar_docs = vectorstore.similarity_search(query, k=3)  # k=3: return the three most similar chunks
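
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below illustrates this with cosine similarity over hand-written toy vectors. The vectors, the `top_k` helper, and the choice of cosine similarity are assumptions for illustration; the metric `Chroma` actually applies depends on how the collection is configured:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed, k=3):
    """Return the k texts whose vectors are most similar to the query vector."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy three-dimensional "embeddings"; real embeddings have far more dimensions.
toy_index = [
    ("use cases", [0.9, 0.1, 0.0]),
    ("pricing", [0.1, 0.9, 0.0]),
    ("security", [0.0, 0.2, 0.9]),
    ("event-driven architecture", [0.7, 0.0, 0.3]),
]
print(top_k([1.0, 0.0, 0.0], toy_index, k=2))
# ['use cases', 'event-driven architecture']
```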

In [8]:
similar_docs

[Document(page_content='Serverless\nDeveloper Guide', metadata={'author': 'AWS', 'creationDate': 'D:20230817052259Z', 'creator': 'ZonBook XSL Stylesheets with Apache FOP', 'file_path': 'serverless-core.pdf', 'format': 'PDF 1.4', 'keywords': 'Serverless, serverless guide, getting started serverless, event-driven architecture, Lambda, API Gateway, DynamoDB, serverless, developer, guide, learn serverless, serverless, use-case, serverless, prerequisites, serverless, serverless, fundamentals, even-driven, architecture, serverless, fundamentals, serverless, developer_experience, lifecycle, deploy, packaging, serverless, hands-on, tutorial, workshop, next steps, security, serverless, compute, api, gateway, serverless, database, nosql', 'modDate': '', 'page': 0, 'producer': 'Apache FOP Version 2.6', 'source': 'serverless-core.pdf', 'subject': '', 'title': 'Serverless - Developer Guide', 'total_pages': 84, 'trapped': ''}),
 Document(page_content='needed to build serverless solutions.\nIn server

8. Use the QA chain to answer the question from the relevant documents

In [9]:
chain.run(input_documents=similar_docs, question=query)

' AWS Serverless can be used for interactive web- and API-based microservices or applications, data processing applications, real-time streaming applications, machine learning, and IT automation and service orchestration.'
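
Steps 7 and 8 can be folded into one reusable function. The following is a minimal sketch under the assumption that the search and chain callables behave like `vectorstore.similarity_search` and `chain.run` in the cells above; `answer_question` is a hypothetical helper, and the callables are passed in so the function can be tried without an API key:

```python
def answer_question(question, search, run_chain, k=3):
    """Retrieve the k most relevant chunks, then let the QA chain answer from them."""
    documents = search(question, k)
    if not documents:
        return "No relevant passages were found."
    # The raw completion may start with a space, so trim surrounding whitespace.
    return run_chain(input_documents=documents, question=question).strip()
```

In this notebook it could be called as `answer_question(query, vectorstore.similarity_search, chain.run)`.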