LLM Tutorial for Developers: Vector Databases and Word Embeddings

English Version
1. Load the documents

from langchain.document_loaders import PyPDFLoader

# Load the PDFs
loaders = [
    # The first lecture is loaded twice on purpose, to simulate messy data
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture03.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
Split the documents

# Split the text
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,   # target size of each chunk: each split aims for about 1500 characters
    chunk_overlap=150  # number of characters shared between consecutive chunks
)
splits = text_splitter.split_documents(docs)
print(len(splits))
209
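The 209 splits above come from chunking the loaded pages. As a simplified illustration of what `chunk_size` and `chunk_overlap` mean, here is a plain fixed-window character slicer; note this is not LangChain's actual algorithm, since RecursiveCharacterTextSplitter prefers to break on separators such as paragraph and line boundaries rather than at exact character offsets, and `naive_split` is a name invented for this sketch.

```python
def naive_split(text, chunk_size, chunk_overlap):
    """Slice `text` into windows of `chunk_size` characters,
    stepping forward by (chunk_size - chunk_overlap) each time.
    A toy stand-in for a text splitter, NOT LangChain's algorithm."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = naive_split("abcdefghij" * 30, chunk_size=100, chunk_overlap=10)
print(len(chunks))                        # 300 chars, step 90 -> 4 windows
print(chunks[0][-10:] == chunks[1][:10])  # consecutive chunks share 10 chars
```

The overlap means the tail of each chunk is repeated at the head of the next, so a sentence cut at a boundary still appears whole in at least one chunk.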
2. Embeddings

from langchain.embeddings.openai import OpenAIEmbeddings
import numpy as np

embedding = OpenAIEmbeddings()

sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

print("Sentence 1 VS sentence 2")
print(np.dot(embedding1, embedding2))
print("Sentence 1 VS sentence 3")
print(np.dot(embedding1, embedding3))
print("Sentence 2 VS sentence 3")
print(np.dot(embedding2, embedding3))
Sentence 1 VS sentence 2
0.9632026347895142
Sentence 1 VS sentence 3
0.7711302839662464
Sentence 2 VS sentence 3
0.759699788340627
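The plain dot product works as a similarity score here because OpenAI embeddings are returned normalized to unit length, so the dot product equals the cosine similarity. A small pure-Python sketch of that relationship, using toy 2-D vectors instead of real embeddings (no API call; the helper names are illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def normalize(v):
    """Scale v to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = normalize(a), normalize(b)

# For unit vectors, the plain dot product IS the cosine similarity.
dot_unit = sum(x * y for x, y in zip(ua, ub))
print(round(dot_unit, 6))                 # 0.96
print(round(cosine_similarity(a, b), 6))  # 0.96
```

This is why sentence 1 and sentence 2 (dogs vs. canines) score about 0.96 while the unrelated weather sentence scores about 0.76 against both.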
3. Initialize Chroma

from langchain.vectorstores import Chroma

persist_directory = 'docs/chroma/cs229_lectures/'

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory  # lets us persist the vector store to disk
)
print(vectordb._collection.count())
100%|██████████| 1/1 [00:02<00:00, 2.62s/it]
209
4. Similarity search

question = "is there an email i can ask for help"
docs = vectordb.similarity_search(question, k=3)
print("Length of docs: ", len(docs))
print("Page content:")
print(docs[0].page_content)
Length of docs: 3
Page content:
cs229-qa@cs.stanford.edu. This goes to an account that's read by all the TAs and me. So
rather than sending us email individually, if you send email to this account, it will
actually let us get back to you maximally quickly with answers to your questions.
If you're asking questions about homework problems, please say in the subject line which
assignment and which question the email refers to, since that will also help us to route
your question to the appropriate TA or to me appropriately and get the response back to
you quickly.
Let's see. Skipping ahead, let's see, for homework, one midterm, one open and term
project. Notice on the honor code. So one thing that I think will help you to succeed and
do well in this class and even help you to enjoy this class more is if you form a study
group.
So start looking around where you're sitting now or at the end of class today, mingle a
little bit and get to know your classmates. I strongly encourage you to form study groups
and sort of have a group of people to study with and have a group of your fellow students
to talk over these concepts with. You can also post on the class news group if you want to
use that to try to form a study group.
But some of the problems sets in this class are reasonably difficult. People that have
taken the class before may tell you they were very difficult. And just I bet it would be
more fun for you, and you'd probably have a better learning experience if you form a
Persist the database

vectordb.persist()
5. Duplicate chunks

question = "what did they say about matlab?"
docs = vectordb.similarity_search(question, k=5)
print("docs[0]")
print(docs[0])
print("docs[1]")
print(docs[1])
docs[0]
page_content='those homeworks will be done in either MATLAB or in Octave, which is sort of, I\nknow
some people call it a free version of MATLAB, which it sort of is, sort of isn't.\nSo I guess
for those of you that haven't seen MATLAB before, and I know most of you\nhave, MATLAB is I
guess part of the programming language that makes it very easy to\nwrite codes using matrices,
to write code for numerical routines, to move data around, to\nplot data. And it's sort of an
extremely easy to learn tool to use for implementing a lot of\nlearning algorithms.\nAnd in
case some of you want to work on your own home computer or something if you\ndon't have a
MATLAB license, for the purposes of this class, there's also [inaudible]\nwrite that down
[inaudible] MATLAB, there's also a software package called Octave\nthat you can download for
free off the Internet. And it has somewhat fewer features than\nMATLAB, but it's free, and for
the purposes of this class, it will work for just about\neverything.\nSo actually I, well, so
yeah, just a side comment for those of you that haven't seen\nMATLAB before I guess, once a
colleague of mine at a different university, not at\nStanford, actually teaches another
machine learning course. He's taught it for many years.\nSo one day, he was in his office,
and an old student of his from, like, ten years ago came\ninto his office and he said,
"Oh, professor, professor, thank you so much for your' metadata={'source':
'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 8}
docs[1]
page_content='those homeworks will be done in either MATLAB or in Octave, which is sort of,
I\nknow ...' metadata={'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 8}
(docs[1] is the very same chunk as docs[0]: because Lecture01 was loaded twice, the top two results are identical.)
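Since the duplicate results come from indexing the same PDF twice, one straightforward fix is to deduplicate chunks by content before building the vector store. A minimal sketch using a hash of the page text; the `Document` class here is a tiny stand-in for LangChain's, and `dedupe` is a helper invented for this example, not library API:

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Document:
    """Minimal stand-in for a LangChain Document (illustrative only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def dedupe(docs):
    """Keep only the first occurrence of each distinct page_content."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.page_content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = [
    Document("those homeworks will be done in either MATLAB or in Octave", {"page": 8}),
    Document("those homeworks will be done in either MATLAB or in Octave", {"page": 8}),
    Document("welcome to CS229", {"page": 0}),
]
print(len(dedupe(docs)))  # 2
```

Running `dedupe(splits)` before `Chroma.from_documents` would keep the index free of exact-duplicate chunks, at the cost of also collapsing pages that legitimately repeat verbatim.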
6. Retrieving a wrong answer

question = "what did they say about regression in the third lecture?"
docs = vectordb.similarity_search(question, k=5)
for doc in docs:
    print(doc.metadata)
print("docs-4:")
print(docs[4].page_content)
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 0}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 14}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture02.pdf', 'page': 0}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 6}
{'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 8}
docs-4:
into his office and he said, "Oh, professor, professor, thank you so much for your
machine learning class. I learned so much from it. There's this stuff that I learned in your
class, and I now use every day. And it's helped me make lots of money, and here's a
picture of my big house."
So my friend was very excited. He said, "Wow. That's great. I'm glad to hear this
machine learning stuff was actually useful. So what was it that you learned? Was it
logistic regression? Was it the PCA? Was it the data networks? What was it that you
learned that was so helpful?" And the student said, "Oh, it was the MATLAB."
So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard,
and we'll actually have a short MATLAB tutorial in one of the discussion sections for
those of you that don't know it.
Okay. The very last piece of logistical thing is the discussion sections. So discussion
sections will be taught by the TAs, and attendance at discussion sections is optional,
although they'll also be recorded and televised. And we'll use the discussion sections
mainly for two things. For the next two or three weeks, we'll use the discussion sections
to go over the prerequisites to this class or if some of you haven't seen probability or
statistics for a while or maybe algebra, we'll go over those in the discussion sections as a
refresher for those of you that want one.
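The docs[4] chunk above comes from Lecture01 even though the question asked about the third lecture: pure semantic search has no notion of "third lecture". One simple remedy is to constrain results by metadata, either via the vector store's filtering support (Chroma's `similarity_search` accepts a `filter` argument) or, as sketched here, by post-filtering the returned documents in plain Python; the `Doc` class and `filter_by_source` helper are stand-ins invented for this illustration:

```python
def filter_by_source(docs, source):
    """Keep only documents whose metadata 'source' matches `source`."""
    return [d for d in docs if d.metadata.get("source") == source]

class Doc:
    """Tiny stand-in for a retrieved Document (illustrative only)."""
    def __init__(self, source, page):
        self.metadata = {"source": source, "page": page}

# The five sources/pages printed in the metadata output above:
results = [
    Doc("docs/cs229_lectures/MachineLearning-Lecture03.pdf", 0),
    Doc("docs/cs229_lectures/MachineLearning-Lecture03.pdf", 14),
    Doc("docs/cs229_lectures/MachineLearning-Lecture02.pdf", 0),
    Doc("docs/cs229_lectures/MachineLearning-Lecture03.pdf", 6),
    Doc("docs/cs229_lectures/MachineLearning-Lecture01.pdf", 8),
]
lecture3_only = filter_by_source(
    results, "docs/cs229_lectures/MachineLearning-Lecture03.pdf")
print(len(lecture3_only))  # 3
```

Post-filtering after retrieval can return fewer than k documents, so filtering inside the vector store query is usually preferable when the store supports it.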