
Hello, friends! I'm Xiaozhi, your AI guide, and welcome to today's AI study session. In this installment we dive into "Getting Started with LLMs for Developers: Document Splitting (English Version)" and work through every technique it covers. As always: you don't need to journey into the unknown, you only need to awaken your own potential! Follow along with me and, step by step, we'll learn these tools well enough to put them to use and discover what more we're capable of. Without further ado, let's begin this AI learning journey.

Getting Started with LLMs for Developers: Document Splitting (English Version)

English Version

1. Splitting short sentences

# Import the text splitters
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

chunk_size = 26    # chunk size
chunk_overlap = 4  # overlap between adjacent chunks

# Initialize the text splitters
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

Recursive character splitter:

text = "a b c d e f g h i j k l m n o p q r s t u v w x y z"  # test text
r_splitter.split_text(text)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

len("l m n o p q r s t u v w x")

25
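Each chunk stays within the 26-character limit, and adjacent chunks share material because of the overlap. A small check, not in the original post, makes the 4-character overlap visible:

# Compare the tail of each chunk with the head of the next one
chunks = r_splitter.split_text(text)
for prev, nxt in zip(chunks, chunks[1:]):
    print(repr(prev[-chunk_overlap:]), "->", repr(nxt[:chunk_overlap]))
# ' l m' -> 'l m '
# ' w x' -> 'w x '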

Character splitter (by default, CharacterTextSplitter splits only on "\n\n"; this text contains none, so it comes back as a single chunk):

# Character text splitter
c_splitter.split_text(text)

['a b c d e f g h i j k l m n o p q r s t u v w x y z']

Character splitter with a space as the separator:

# Use a space as the separator
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator=' '
)
c_splitter.split_text(text)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

2. Splitting long text

# A long paragraph, to be split recursively
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator=' '
)
'''
The recursive character splitter takes an ordered list of separators: double newline,
single newline, space, and the empty string. It first tries to split on "\n\n", then
falls back to each later separator in turn as needed.
'''
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""]
)

Character splitter result:

c_splitter.split_text(some_text)

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']

Recursive character splitter result:

# Split result
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
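Both chunks respect the 450-character budget. A quick check, not in the original post, makes that visible:

# Print each chunk's length; both should be <= 450
for i, chunk in enumerate(r_splitter.split_text(some_text)):
    print(i, len(chunk))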

Adding a sentence-level separator:

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", "(?<=\. )", " ", ""]
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related.",
 'For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns.',
 'Carriage returns are the "backslash n" you see embedded in this string.',
 'Sentences have a period at the end, but also, have a space.and words are separated by space.']
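The "(?<=\. )" separator is a regular-expression lookbehind: it matches the zero-width position just after a period and a space, so each sentence keeps its own punctuation. A standalone illustration of the pattern, not from the original post:

import re

# Split after every ". " without consuming the period
print(re.split(r"(?<=\. )", "One. Two. Three."))
# ['One. ', 'Two. ', 'Three.']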

3. Token-based splitting

# Use the token splitter with chunk_size=1 and chunk_overlap=0,
# which splits any string into a list of single tokens
from langchain.text_splitter import TokenTextSplitter
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)
text1 = "foo bar bazzyfoo"
text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']
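Tokens are not words: "bazzyfoo" comes back as three tokens. To inspect the tokenization directly you can call tiktoken yourself; a minimal sketch, assuming TokenTextSplitter's default gpt2 encoding:

import tiktoken

enc = tiktoken.get_encoding("gpt2")  # TokenTextSplitter defaults to this encoding
tokens = enc.encode("foo bar bazzyfoo")
print([enc.decode([t]) for t in tokens])
# ['foo', ' bar', ' b', 'az', 'zy', 'foo'], matching the splitter output above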

4. Splitting a custom Markdown document

# Define a Markdown document
from langchain.document_loaders import NotionDirectoryLoader    # Notion loader
from langchain.text_splitter import MarkdownHeaderTextSplitter  # Markdown splitter

markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n \
## Chapter 2\n\n \
Hi this is Molly"""

# Headers to split on, and the metadata keys to record them under
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

# Initialize the Markdown header splitter and split the document
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

print("The first chunk")
print(md_header_splits[0])
# Document(page_content='Hi this is Jim  \nHi this is Joe  \n### Section  \nHi this is Lance', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'})
print("The second chunk")
print(md_header_splits[1])
# Document(page_content='Hi this is Molly', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 2'})

The first chunk
page_content='Hi this is Jim  \nHi this is Joe  \n### Section  \nHi this is Lance' metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'}
The second chunk
page_content='Hi this is Molly' metadata={'Header 1': 'Title', 'Header 2': 'Chapter 2'}
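Header-scoped splits can still exceed a size budget, so in practice they are often chunked again. A minimal sketch, not from the original post; split_documents keeps each Document's header metadata while splitting its page_content:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Size-based splitting on top of the header-based splits
text_splitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=30)
splits = text_splitter.split_documents(md_header_splits)
# each chunk in splits retains metadata such as {'Header 1': 'Title', 'Header 2': 'Chapter 1'}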

5. Splitting Markdown documents from a database

# Load the contents of a Notion database
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()
txt = ' '.join([d.page_content for d in docs])  # concatenate the documents
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
# Initialize the document splitter
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(txt)  # split the text
print(md_header_splits[0])  # first split

page_content="Let's talk about stress. Too much stress.  \nWe know this can be a topic.  \nSo let's get this conversation going.  \n
[Intro: two things you should know](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/Intro%20two%20things%20you%20should%20know%20b5fd0c5393a9498b93396e79fe71e8bf.md)  \n
[What is stress](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/What%20is%20stress%20b198b685ed6a474ab14f6fafff7004b6.md)  \n
[When is there too much stress?](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/When%20is%20there%20too%20much%20stress%20dc135b9a86a843cbafd115aa128c5c90.md)  \n
[What can I do](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/What%20can%20I%20do%2009c1b13703ef42d4a889e2059c5b25fe.md)  \n
[What can Blendle do?](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/What%20can%20Blendle%20do%20618ab89df4a647bf96e7b432af82779f.md)  \n
[Good reads](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/Good%20reads%20e817491d84d549f886af972e0668192e.md)  \n
Go to **#letstalkaboutstress** on slack to chat about this topic" metadata={'Header 1': '#letstalkaboutstress'}


Hey friends, today's AI exploration has come to an end. That's everything on "Getting Started with LLMs for Developers: Document Splitting (English Version)". Thanks for coming along; I hope this session helped you understand AI a little better and enjoy it a little more. Remember, asking precise questions is the key to unlocking an AI's potential! If you'd like to keep learning, follow our site "AI智研社" for much more.


Copyright: when reposting, please credit the source: https://www.ai-blog.cn/2791.html
