
Hello, friends! I'm Xiaozhi, your AI guide, and welcome to our daily AI study session. Today we'll dive into the wonderful world of AI together and explore "LLM Tutorial for Developers: Retrieval (English Version)", covering every technique this article walks through. As always: "No need to venture into the unknown, just awaken your own potential!" Follow along with Xiaozhi and we'll learn, apply what we learn, and discover more of what we're capable of. Without further ado, let's begin this AI learning journey.

LLM Tutorial for Developers: Retrieval (English Version)

1. Similarity search

from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

persist_directory = 'docs/chroma/cs229_lectures/'
embedding = OpenAIEmbeddings()
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)
print(vectordb._collection.count())

209
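The persisted Chroma collection loaded above is built in an earlier chapter of the tutorial. As a minimal sketch of how it could have been created (the PDF path and splitter settings are assumptions borrowed from section 7 of this article, not shown in this excerpt):

# Sketch only: building the persisted collection that this section reloads.
# Reuses `embedding` and `persist_directory` defined above.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_documents(docs)
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory,
)
vectordb.persist()  # write the collection to disk so it can be reloaded later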

A simple example

texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground)
fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some
varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known
mushrooms.""",
]
smalldb = Chroma.from_texts(texts, embedding=embedding)
question = "Tell me about all-white mushrooms with large fruiting bodies"
print("Similarity search:")
print(smalldb.similarity_search(question, k=2))
print("MMR search:")
print(smalldb.max_marginal_relevance_search(question, k=2, fetch_k=3))


Similarity search:
[Document(page_content='A mushroom with a large fruiting body is the Amanita
phalloides. Some varieties are all-white.', metadata={}),
Document(page_content='The Amanita phalloides has a large and imposing epigeous
(aboveground) fruiting body (basidiocarp).', metadata={})]
MMR search:
[Document(page_content='A mushroom with a large fruiting body is the Amanita
phalloides. Some varieties are all-white.', metadata={}),
Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most
poisonous of all known mushrooms.', metadata={})]
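For reference, maximal marginal relevance picks each next document by balancing relevance to the query against redundancy with what has already been selected. In its standard form, with query Q, candidate pool R (of size fetch_k), already-selected set S, and trade-off weight λ:

\mathrm{MMR} = \arg\max_{D_i \in R \setminus S} \left[ \lambda \, \mathrm{sim}(D_i, Q) - (1 - \lambda) \max_{D_j \in S} \mathrm{sim}(D_i, D_j) \right]

This is why the MMR result above keeps the sentence about the Death Cap's toxicity instead of returning two near-duplicate descriptions of the fruiting body.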

2. Maximal marginal relevance (MMR)

question = "what did they say about matlab?"
docs_ss = vectordb.similarity_search(question, k=3)
print("Similarity search:")
print("docs[0]: ")
print(docs_ss[0].page_content[:100])
print()
print("docs[1]: ")
print(docs_ss[1].page_content[:100])
print()

docs_mmr = vectordb.max_marginal_relevance_search(question, k=3)
print("MMR search:")
print("mmr[0]: ")
print(docs_mmr[0].page_content[:100])
print()
print("MMR search:")
print("mmr[1]: ")
print(docs_mmr[1].page_content[:100])

Similarity search:
docs[0]:
those homeworks will be done in either MATLA B or in Octave, which is sort of — I
know some people

docs[1]:
those homeworks will be done in either MATLA B or in Octave, which is sort of — I
know some people

MMR search:
mmr[0]:
those homeworks will be done in either MATLA B or in Octave, which is sort of — I
know some people

MMR search:
mmr[1]:
algorithm then? So what's different? How come I was making all that noise
earlier about
least squa
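Note how plain similarity search returns the same chunk twice (the source PDFs contain duplicated pages), while MMR's second result brings in new material. The diversity trade-off is tunable; a hedged sketch with illustrative parameter values:

docs_mmr = vectordb.max_marginal_relevance_search(
    question,
    k=3,              # documents to return
    fetch_k=20,       # candidate pool fetched before diversity re-ranking
    lambda_mult=0.5,  # 1.0 favors pure relevance, 0.0 maximum diversity
)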

3. Working with metadata

question = "what did they say about regression in the third lecture?"

docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"}
)

for d in docs:
    print(d.metadata)

{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 0}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 14}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 4}
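Chroma's metadata filters can also express compound conditions. A minimal sketch using Chroma's `$and`/`$eq`/`$gte` operators (the page cutoff here is purely illustrative):

docs = vectordb.similarity_search(
    question,
    k=3,
    filter={
        "$and": [
            {"source": {"$eq": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"}},
            {"page": {"$gte": 5}},  # restrict to page 5 onwards
        ]
    },
)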

4. Self-query retriever

from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

llm = OpenAI(temperature=0)
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of "
                    "`docs/cs229_lectures/MachineLearning-Lecture01.pdf`, "
                    "`docs/cs229_lectures/MachineLearning-Lecture02.pdf`, or "
                    "`docs/cs229_lectures/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]
document_content_description = "Lecture notes"
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)
question = "what did they say about regression in the third lecture?"
docs = retriever.get_relevant_documents(question)
for d in docs:
    print(d.metadata)

/root/autodl-tmp/env/gpt/lib/python3.10/site-packages/langchain/chains/llm.py:275: UserWarning: The predict_and_parse method
is deprecated, instead pass an output parser directly to LLMChain.
  warnings.warn(

query='regression' filter=Comparison(comparator=<Comparator.EQ: 'eq'>,
attribute='source', value='docs/cs229_lectures/MachineLearning-Lecture03.pdf')
limit=None
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 14}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 0}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 10}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 10}
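The verbose output shows what the retriever inferred on our behalf: the LLM split the question into a semantic query ('regression') plus a structured filter on source, so no hand-written metadata filter was needed. Note that SelfQueryRetriever relies on the lark parsing library; if it is not installed, run `pip install lark` first.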

5. Compression

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i,
          d in enumerate(docs)]))

llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)  # the compressor
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

"MATLAB is I guess part of the programming language that makes it very easy to
write codes using matrices, to write code for numerical routines, to move data
around, to plot data. And it's sort of an extremely easy to learn tool to use for
implementing a lot of learning algorithms."
----------------------------------------------------------------------------------------------------
Document 2:

"MATLAB is I guess part of the programming language that makes it very easy to
write codes using matrices, to write code for numerical routines, to move data
around, to plot data. And it's sort of an extremely easy to learn tool to use for
implementing a lot of learning algorithms."
----------------------------------------------------------------------------------------------------
Document 3:

"And the student said, "Oh, it was the MATLAB." So for those of you that don't
know MATLAB yet, I hope you do learn it. It's not hard, and we'll actually have a
short MATLAB tutorial in one of the discussion sections for those of you that
don't know it."
----------------------------------------------------------------------------------------------------
Document 4:

"And the student said, "Oh, it was the MATLAB." So for those of you that don't
know MATLAB yet, I hope you do learn it. It's not hard, and we'll actually have a
short MATLAB tutorial in one of the discussion sections for those of you that
don't know it."
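Notice that Documents 1 and 2 (and likewise 3 and 4) are identical: the compressor extracted the relevant sentences, but the underlying similarity retriever still surfaced near-duplicate chunks. The next section removes this redundancy by combining compression with MMR.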

6. Combining the techniques

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type="mmr")
)
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

"MATLAB is I guess part of the programming language that makes it very easy to
write codes using matrices, to write code for numerical routines, to move data
around, to plot data. And it's sort of an extremely easy to learn tool to use for
implementing a lot of learning algorithms."
----------------------------------------------------------------------------------------------------
Document 2:

"And the student said, "Oh, it was the MATLAB." So for those of you that don't
know MATLAB yet, I hope you do learn it. It's not hard, and we'll actually have a
short MATLAB tutorial in one of the discussion sections for those of you that
don't know it."
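With search_type="mmr" on the base retriever, the duplicates from the previous section disappear: the two compressed documents now each contribute distinct information.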

7. Other types of retrieval

from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the PDF
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)

# Split the text
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_text(joined_page_text)

# Retrieve
svm_retriever = SVMRetriever.from_texts(splits, embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)
question = "What are major topics for this class?"
print("SVM:")
docs_svm = svm_retriever.get_relevant_documents(question)
print(docs_svm[0])
question = "what did they say about matlab?"
print("TF-IDF:")
docs_tfidf = tfidf_retriever.get_relevant_documents(question)
print(docs_tfidf[0])

SVM:
page_content="let me just check what questions you have righ t now. So if there
are no questions, I'll just \nclose with two reminders, which are after class
today or as you start to talk with other \npeople in this class, I just encourage
you again to start to form project partners, to try to \nfind project partners to
do your project with. And also, this is a good time to start forming \nstudy
groups, so either talk to your friends or post in the newsgroup, but we just
\nencourage you to try to star t to do both of those today, okay? Form study
groups, and try \nto find two other project partners. \nSo thank you. I'm
looking forward to teaching this class, and I'll see you in a couple of \ndays.
[End of Audio] \nDuration: 69 minutes" metadata={}
TF-IDF:
page_content="Saxena and Min Sun here did, wh ich is given an image like this,
right? This is actually a \npicture taken of the Stanford campus. You can apply
that sort of cl ustering algorithm and \ngroup the picture into regions. Let me
actually blow that up so that you can see it more \nclearly. Okay. So in the
middle, you see the lines sort of groupi ng the image together, \ngrouping the
image into [inaudible] regions. \nAnd what Ashutosh and Min did was they then
applied the learning algorithm to say can \nwe take this clustering and us e it
to build a 3D model of the world? And so using the \nclustering, they then had a
lear ning algorithm try to learn what the 3D structure of the \nworld looks like
so that they could come up with a 3D model that you can sort of fly \nthrough,
okay? Although many people used to th ink it's not possible to take a single
\nimage and build a 3D model, but using a lear ning algorithm and that sort of
clustering \nalgorithm is the first step. They were able to. \nI'll just show
you one more example. I like this because it's a picture of Stanford with our
\nbeautiful Stanford campus. So again, taking th e same sort of clustering
algorithms, taking \nthe same sort of unsupervised learning algor ithm, you can
group the pixels into different \nregions. And using that as a pre-processing
step, they eventually built this sort of 3D model of Stanford campus in a single
picture. You can sort of walk into the ceiling, look" metadata={}
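Both of these retrievers work directly over the raw text splits rather than a vector store: SVMRetriever still uses the embeddings (it fits a support vector machine per query to separate relevant from irrelevant chunks), whereas TFIDFRetriever is a purely lexical, keyword-based method. They are easy alternatives to benchmark, though the vector-store retrievers above remain the workhorses of this tutorial.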


Hey friends, today's AI exploration has come to a close, and that's everything on "LLM Tutorial for Developers: Retrieval (English Version)". Thanks for your company; I hope this trip helped you understand and enjoy AI a little more. Remember, asking precise questions is the key to unlocking AI's potential! If you'd like to keep learning about AI, follow our official site "AI智研社"; we promise you'll come away with plenty!


Copyright: please credit the source when reposting: https://www.ai-blog.cn/2824.html
