{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "authorship_tag": "ABX9TyNqlLD/LEX3MZ6Uw13WCE8x", "include_colab_link": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "source": [ "# 一个完整的例子\n", "\n", "这是该 `LangChain` 极简入门系列的最后一讲。我们将利用过去9讲学习的知识,来完成一个具备完整功能集的LLM应用。该应用基于 `LangChain` 框架,以某 `PDF` 文件的内容为知识库,提供给用户基于该文件内容的问答能力。\n", "\n", "我们利用 `LangChain` 的QA chain,结合 `Chroma` 来实现PDF文档的语义化搜索。示例代码所引用的是[AWS Serverless\n", "Developer Guide](https://docs.aws.amazon.com/pdfs/serverless/latest/devguide/serverless-core.pdf),该PDF文档共84页。" ], "metadata": { "id": "69PRFT6WO-oK" } }, { "cell_type": "markdown", "source": [ "1. 安装必要的 `Python` 包" ], "metadata": { "id": "OBehQYkOPPWe" } }, { "cell_type": "code", "source": [ "!pip install -q langchain==0.0.235 openai chromadb pymupdf tiktoken" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "_amYPxT-PULc", "outputId": "d3d7515d-b214-4140-b39b-a9a209cf00b9" }, "execution_count": 1, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/1.3 MB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r\u001b[2K \u001b[91m━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.2/1.3 MB\u001b[0m \u001b[31m6.5 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.7/1.3 MB\u001b[0m \u001b[31m10.1 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━\u001b[0m \u001b[32m1.2/1.3 MB\u001b[0m \u001b[31m11.9 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.3/1.3 MB\u001b[0m \u001b[31m10.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m73.6/73.6 kB\u001b[0m \u001b[31m8.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m405.5/405.5 kB\u001b[0m \u001b[31m13.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m14.1/14.1 MB\u001b[0m \u001b[31m44.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m57.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m90.0/90.0 kB\u001b[0m \u001b[31m12.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.1/3.1 MB\u001b[0m \u001b[31m59.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25h Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n", " Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n", " Preparing metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.4/58.4 kB\u001b[0m \u001b[31m6.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m59.5/59.5 kB\u001b[0m \u001b[31m7.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.3/5.3 MB\u001b[0m \u001b[31m74.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.9/5.9 MB\u001b[0m \u001b[31m84.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.8/7.8 MB\u001b[0m \u001b[31m103.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m67.3/67.3 kB\u001b[0m \u001b[31m8.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25h Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n", " Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n", " Preparing metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m49.4/49.4 kB\u001b[0m \u001b[31m5.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m67.0/67.0 kB\u001b[0m \u001b[31m7.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m46.0/46.0 kB\u001b[0m \u001b[31m5.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.3/58.3 kB\u001b[0m \u001b[31m7.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m428.8/428.8 kB\u001b[0m \u001b[31m36.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.1/4.1 MB\u001b[0m \u001b[31m116.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.3/1.3 MB\u001b[0m \u001b[31m78.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m129.9/129.9 kB\u001b[0m \u001b[31m15.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m86.8/86.8 kB\u001b[0m \u001b[31m11.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25h Building wheel for chroma-hnswlib (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n", " Building wheel for pypika (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n" ] } ] }, { "cell_type": "markdown", "source": [ "2. 设置OpenAI环境" ], "metadata": { "id": "8Hihrnw_PeIA" } }, { "cell_type": "code", "source": [ "import os\n", "os.environ['OPENAI_API_KEY'] = ''" ], "metadata": { "id": "dALQoneUPgEH" }, "execution_count": 2, "outputs": [] }, { "cell_type": "markdown", "source": [ "3. 下载PDF文件AWS Serverless Developer Guide" ], "metadata": { "id": "8aB0OBRFP5FC" } }, { "cell_type": "code", "source": [ "!wget https://docs.aws.amazon.com/pdfs/serverless/latest/devguide/serverless-core.pdf\n", "\n", "PDF_NAME = 'serverless-core.pdf'" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "zF-PFO9BP6wr", "outputId": "1d761def-1df0-4043-f00e-be6dd913f1b2" }, "execution_count": 3, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "--2023-08-17 11:42:20-- https://docs.aws.amazon.com/pdfs/serverless/latest/devguide/serverless-core.pdf\n", "Resolving docs.aws.amazon.com (docs.aws.amazon.com)... 108.159.227.88, 108.159.227.51, 108.159.227.3, ...\n", "Connecting to docs.aws.amazon.com (docs.aws.amazon.com)|108.159.227.88|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 4727395 (4.5M) [application/pdf]\n", "Saving to: ‘serverless-core.pdf’\n", "\n", "serverless-core.pdf 100%[===================>] 4.51M 12.0MB/s in 0.4s \n", "\n", "2023-08-17 11:42:21 (12.0 MB/s) - ‘serverless-core.pdf’ saved [4727395/4727395]\n", "\n" ] } ] }, { "cell_type": "markdown", "source": [ "4. 加载PDF文件" ], "metadata": { "id": "WqBDCt0HQFAA" } }, { "cell_type": "code", "source": [ "from langchain.document_loaders import PyMuPDFLoader\n", "docs = PyMuPDFLoader(PDF_NAME).load()\n", "\n", "print (f'There are {len(docs)} document(s) in {PDF_NAME}.')\n", "print (f'There are {len(docs[0].page_content)} characters in the first page of your document.')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "bniPzdhUQSlw", "outputId": "01342468-bef4-4e9f-9b56-1ab6eac65ab4" }, "execution_count": 4, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "There are 84 document(s) in serverless-core.pdf.\n", "There are 27 characters in the first page of your document.\n" ] } ] }, { "cell_type": "markdown", "source": [ "5. 拆分文档并存储文本嵌入的向量数据" ], "metadata": { "id": "V9kvXY9uQ1mI" } }, { "cell_type": "code", "source": [ "from langchain.embeddings.openai import OpenAIEmbeddings\n", "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", "from langchain.vectorstores import Chroma\n", "\n", "text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)\n", "split_docs = text_splitter.split_documents(docs)\n", "\n", "embeddings = OpenAIEmbeddings()\n", "\n", "vectorstore = Chroma.from_documents(split_docs, embeddings, collection_name=\"serverless_guide\")" ], "metadata": { "id": "G4d8cwQTQ2fa" }, "execution_count": 5, "outputs": [] }, { "cell_type": "markdown", "source": [ "6. 基于OpenAI创建QA链" ], "metadata": { "id": "-T6_mIR8RwEF" } }, { "cell_type": "code", "source": [ "from langchain.llms import OpenAI\n", "from langchain.chains.question_answering import load_qa_chain\n", "\n", "llm = OpenAI(temperature=0)\n", "chain = load_qa_chain(llm, chain_type=\"stuff\")" ], "metadata": { "id": "BsW99LnUR2Ns" }, "execution_count": 6, "outputs": [] }, { "cell_type": "markdown", "source": [ "7. 基于提问,进行相似性查询" ], "metadata": { "id": "ED54hPgfSXYL" } }, { "cell_type": "code", "source": [ "query = \"What is the use case of AWS Serverless?\"\n", "similar_docs = vectorstore.similarity_search(query, 3, include_metadata=True)" ], "metadata": { "id": "bPmKM4zXSam9" }, "execution_count": 7, "outputs": [] }, { "cell_type": "code", "source": [ "similar_docs" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "DNog2ekVSxPa", "outputId": "78bf8f0c-bdf2-4b9c-8a08-433c7706b758" }, "execution_count": 8, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[Document(page_content='Serverless\\nDeveloper Guide', metadata={'author': 'AWS', 'creationDate': 'D:20230817052259Z', 'creator': 'ZonBook XSL Stylesheets with Apache FOP', 'file_path': 'serverless-core.pdf', 'format': 'PDF 1.4', 'keywords': 'Serverless, serverless guide, getting started serverless, event-driven architecture, Lambda, API Gateway, DynamoDB, serverless, developer, guide, learn serverless, serverless, use-case, serverless, prerequisites, serverless, serverless, fundamentals, even-driven, architecture, serverless, fundamentals, serverless, developer_experience, lifecycle, deploy, packaging, serverless, hands-on, tutorial, workshop, next steps, security, serverless, compute, api, gateway, serverless, database, nosql', 'modDate': '', 'page': 0, 'producer': 'Apache FOP Version 2.6', 'source': 'serverless-core.pdf', 'subject': '', 'title': 'Serverless - Developer Guide', 'total_pages': 84, 'trapped': ''}),\n", " Document(page_content='needed to build serverless solutions.\\nIn serverless solutions, you focus on writing code that serves your customers, without managing servers. \\nServerless technologies are pay-as-you-go, can scale both up and down, and are easy to expand across \\ngeographic regions.\\nMeeting you where you are now\\nWe expect that you are a developer with some traditional web application development experience, \\nbut you are new to Amazon Web Services or serverless architectures. We also assume you want to get \\n1', metadata={'author': 'AWS', 'creationDate': 'D:20230817052259Z', 'creator': 'ZonBook XSL Stylesheets with Apache FOP', 'file_path': 'serverless-core.pdf', 'format': 'PDF 1.4', 'keywords': 'Serverless, serverless guide, getting started serverless, event-driven architecture, Lambda, API Gateway, DynamoDB, serverless, developer, guide, learn serverless, serverless, use-case, serverless, prerequisites, serverless, serverless, fundamentals, even-driven, architecture, serverless, fundamentals, serverless, developer_experience, lifecycle, deploy, packaging, serverless, hands-on, tutorial, workshop, next steps, security, serverless, compute, api, gateway, serverless, database, nosql', 'modDate': '', 'page': 4, 'producer': 'Apache FOP Version 2.6', 'source': 'serverless-core.pdf', 'subject': '', 'title': 'Serverless - Developer Guide', 'total_pages': 84, 'trapped': ''}),\n", " Document(page_content='infrastructure (regions, ARNs, security model). Then, it will introduce the shift in mindset you need to \\nmake to start your serverless journey. Next, it will dive into event-driven architecture and other serverless \\nconcepts necessary to transition to serverless.\\nAlong the way, the guide will provide a list of curated resources, such as articles, workshops, and \\ntutorials, to reinforce your learning with hands-on activities.\\nThe guide will focus on common serverless use cases, such as:\\n• Interactive Web- and API-based microservices or applications\\n• Data processing applications\\n• Real-time streaming applications\\n• Machine learning\\n• IT automation and service orchestration\\nTo avoid information overload, scope will be limited to a few essential serverless services to get started:\\n• AWS Identity and Access Management for service security\\n• AWS Lambda for compute functions\\n• Amazon API Gateway for integrating HTTP/S requests with other services that handle the requests', metadata={'author': 'AWS', 'creationDate': 'D:20230817052259Z', 'creator': 'ZonBook XSL Stylesheets with Apache FOP', 'file_path': 'serverless-core.pdf', 'format': 'PDF 1.4', 'keywords': 'Serverless, serverless guide, getting started serverless, event-driven architecture, Lambda, API Gateway, DynamoDB, serverless, developer, guide, learn serverless, serverless, use-case, serverless, prerequisites, serverless, serverless, fundamentals, even-driven, architecture, serverless, fundamentals, serverless, developer_experience, lifecycle, deploy, packaging, serverless, hands-on, tutorial, workshop, next steps, security, serverless, compute, api, gateway, serverless, database, nosql', 'modDate': '', 'page': 5, 'producer': 'Apache FOP Version 2.6', 'source': 'serverless-core.pdf', 'subject': '', 'title': 'Serverless - Developer Guide', 'total_pages': 84, 'trapped': ''})]" ] }, "metadata": {}, "execution_count": 8 } ] }, { "cell_type": "markdown", "source": [ "8. 基于相关文档,利用QA链完成回答" ], "metadata": { "id": "1XecjykTSnve" } }, { "cell_type": "code", "source": [ "chain.run(input_documents=similar_docs, question=query)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 54 }, "id": "E4YOeM8aSuEY", "outputId": "14ccdea1-f586-45cf-bf4c-cce99322739c" }, "execution_count": 9, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "' AWS Serverless can be used for interactive web- and API-based microservices or applications, data processing applications, real-time streaming applications, machine learning, and IT automation and service orchestration.'" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" } }, "metadata": {}, "execution_count": 9 } ] } ] }