
LLM Tutorial for Developers: Document Loading (English Version)

Xiaozhi · AI Tutorials · January 17, 2025



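Hello, friends! I'm Xiaozhi, your AI guide. Welcome to today's AI study session. Today we explore "LLM Tutorial for Developers: Document Loading (English Version)" and work through every point covered in this article. As always: "You don't need to trek into the unknown; you only need to awaken your potential!" Follow along and we will learn this material, put it to use, and discover more of what we can do. Without further ado, let's begin.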


English Version

1. Loading a PDF document

from langchain.document_loaders import PyPDFLoader

# Create a PyPDFLoader instance; the argument is the path of the PDF to load
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")

# Call the loader's load() method to read the PDF
pages = loader.load()

2. Exploring the loaded data

print("Type of pages: ", type(pages))
print("Length of pages: ", len(pages))

page = pages[0]
print("Type of page: ", type(page))
print("Page_content: ", page.page_content[:500])
print("Meta Data: ", page.metadata)

Type of pages: <class 'list'>
Length of pages: 22
Type of page: <class 'langchain.schema.document.Document'>
Page_content: MachineLearning-Lecture01
Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine
learning class. So what I wanna do today is ju st spend a little time going over
the logistics
of the class, and then we’ll start to talk a bit about machine learning.
By way of introduction, my name’s Andrew Ng and I’ll be instru ctor for this
class. And so
I personally work in machine learning, and I’ ve worked on it for about 15 years
now, and
I actually think that machine learning i
Meta Data: {'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}
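
The same five print calls are repeated after every loader in this tutorial. As a rough sketch, they could be folded into one helper that returns the facts as a dictionary. The names `describe_documents` and `FakeDocument` are made up for illustration; `FakeDocument` is only a stand-in with the two attributes used here (`page_content` and `metadata`), so the sketch runs without LangChain installed.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing page_content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def describe_documents(docs: list, preview_chars: int = 500) -> dict:
    """Collect what the exploration step prints into a dictionary."""
    summary: dict[str, Any] = {
        "container_type": type(docs).__name__,
        "count": len(docs),
    }
    if docs:  # guard against an empty result before indexing
        first = docs[0]
        summary["item_type"] = type(first).__name__
        summary["preview"] = first.page_content[:preview_chars]
        summary["metadata"] = first.metadata
    return summary


if __name__ == "__main__":
    sample = [FakeDocument("MachineLearning-Lecture01", {"page": 0})]
    for key, value in describe_documents(sample).items():
        print(f"{key}: {value}")
```

With a real loader, the idea would be to pass the list returned by `loader.load()` in place of `sample`. The guard on an empty list matters because `pages[0]` raises an `IndexError` when a loader returns nothing.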

3. Loading YouTube audio

# Note: the video is long and network problems are likely, so this code was not run here; readers can run it themselves

from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser

from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

url = "https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir = "docs/youtube/"

# Create a GenericLoader instance
loader = GenericLoader(
    # Download the audio of the YouTube video at url and store it under the local path save_dir
    YoutubeAudioLoader([url], save_dir),

    # Use the OpenAIWhisperParser to transcribe the audio into text
    OpenAIWhisperParser()
)

# Call the loader's load() method to load the video's audio
docs = loader.load()
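
Because this step downloads audio and calls a remote transcription service, it can help to check the environment before running it. The following is a minimal sketch under assumptions that are not stated in the tutorial: it assumes the loaders need the `yt_dlp`, `pydub`, and `openai` packages and an `OPENAI_API_KEY` environment variable. The helper names `missing_packages` and `preflight` are hypothetical.

```python
import importlib.util
import os
from typing import Mapping, Optional

# Assumed requirements for the YouTube audio step; adjust to your setup.
REQUIRED = ["yt_dlp", "pydub", "openai"]


def missing_packages(names: list[str]) -> list[str]:
    """Return the top-level package names that cannot be imported here."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def preflight(required: list[str],
              env: Optional[Mapping[str, str]] = None) -> list[str]:
    """List the problems that would likely stop the audio step."""
    env = os.environ if env is None else env
    problems = [f"missing package: {n}" for n in missing_packages(required)]
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in preflight(REQUIRED):
        print(problem)
```

An empty list from `preflight(REQUIRED)` only means the assumed prerequisites are present; it does not guarantee that the download or the transcription will succeed.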

4. Exploring the loaded data

print("Type of docs: ", type(docs))
print("Length of docs: ", len(docs))

doc = docs[0]
print("Type of doc: ", type(doc))
print("Page_content: ", doc.page_content[:500])
print("Meta Data: ", doc.metadata)

5. Loading a web page

from langchain.document_loaders import WebBaseLoader

# Create a WebBaseLoader instance
url = "https://github.com/basecamp/handbook/blob/master/37signals-is-you.md"
header = {'User-Agent': 'python-requests/2.27.1',
          'Accept-Encoding': 'gzip, deflate, br',
          'Accept': '*/*',
          'Connection': 'keep-alive'}
loader = WebBaseLoader(web_path=url, header_template=header)

# Call the loader's load() method to fetch the page
pages = loader.load()

6. Exploring the loaded data

print("Type of pages: ", type(pages))
print("Length of pages: ", len(pages))
page = pages[0]
print("Type of page: ", type(page))
print("Page_content: ", page.page_content[:500])
print("Meta Data: ", page.metadata)

Type of pages: <class 'list'>
Length of pages: 1
Type of page: <class 'langchain.schema.document.Document'>
Page_content: {"payload":{"allShortcutsEnabled":false,"fileTree":{"":{"items":[{"name":"37signals-is-you.md","path":"37signals-is-you.md","contentType":"file"},{"name":"LICENSE.md","path":"LICENSE.md","contentType":"file"},{"name":"README.md","path":"README.md","contentType":"file"},{"name":"benefits-and-perks.md","path":"benefits-and-perks.md","contentType":"file"},{"name":"code-of-conduct.md","path":"code-of-conduct.md","contentType":"file"},{"name":"faq.md","path":"faq.md","contentType":"file"},{"name":"ge
Meta Data: {'source': 'https://github.com/basecamp/handbook/blob/master/37signals-is-you.md'}

Further processing

import json
convert_to_json = json.loads(page.page_content)
extracted_markdown = convert_to_json['payload']['blob']['richText']
print(extracted_markdown)

37signals Is You
Everyone working at 37signals represents 37signals. When a customer gets a
response from Merissa on support, Merissa is 37signals. When a customer reads a
tweet by Eron that our systems are down, Eron is 37signals. In those situations,
all the other stuff we do to cultivate our best image is secondary. What’s right
in front of someone in a time of need is what they’ll remember.
That’s what we mean when we say marketing is everyone’s responsibility, and that
it pays to spend the time to recognize that. This means avoiding the bullshit of
outage language and bending our policies, not just lending your ears. It means
taking the time to get the writing right and consider how you’d feel if you were
on the other side of the interaction.
The vast majority of our customers come from word of mouth and much of that word
comes from people in our audience. This is an audience we’ve been educating and
entertaining for 20 years and counting, and your voice is part of us now, whether
you like it or not! Tell us and our audience what you have to say!
This goes for tools and techniques as much as it goes for prose. 37signals not
only tries to out-teach the competition, but also out-share and out-collaborate.
We’re prolific open source contributors through Ruby on Rails, Trix, Turbolinks,
Stimulus, and many other projects. Extracting the common infrastructure that
others could use as well is satisfying, important work, and we should continue to
do that.
It’s also worth mentioning that joining 37signals can be all-consuming. We’ve
seen it happen. You dig 37signals, so you feel pressure to contribute, maybe
overwhelmingly so. The people who work here are some of the best and brightest in
our industry, so the self-imposed burden to be exceptional is real. But here’s
the thing: stop it. Settle in. We’re glad you love this job because we all do
too, but at the end of the day it’s a job. Do your best work, collaborate with
your team, write, read, learn, and then turn off your computer and play with your
dog. We’ll all be better for it.
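
The extraction above indexes straight into `['payload']['blob']['richText']`, which raises an exception if the page is not JSON or if the site changes its payload layout. A more defensive variant might look like the sketch below. The function name `extract_rich_text` is hypothetical, and the key path is simply the one used in this tutorial, not a documented, stable format.

```python
import json
from typing import Optional


def extract_rich_text(page_content: str) -> Optional[str]:
    """Return payload.blob.richText from a JSON string, or None if absent."""
    try:
        node = json.loads(page_content)
    except json.JSONDecodeError:
        return None  # the page body was not JSON at all
    # Walk the key path used in this tutorial, stopping at the first gap.
    for key in ("payload", "blob", "richText"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node if isinstance(node, str) else None


if __name__ == "__main__":
    sample = json.dumps({"payload": {"blob": {"richText": "37signals Is You"}}})
    print(extract_rich_text(sample))
    print(extract_rich_text("<html>not json</html>"))
```

Returning `None` instead of raising lets the caller decide whether to skip the page, fall back to the raw `page_content`, or report an error.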

7. Loading Notion documents

from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
pages = loader.load()

8. Exploring the loaded data

print("Type of pages: ", type(pages))
print("Length of pages: ", len(pages))

page = pages[0]
print("Type of page: ", type(page))
print("Page_content: ", page.page_content[:500])
print("Meta Data: ", page.metadata)

Type of pages: <class 'list'>
Length of pages: 51
Type of page: <class 'langchain.schema.document.Document'>
Page_content: # #letstalkaboutstress
Let’s talk about stress. Too much stress.
We know this can be a topic.
So let’s get this conversation going.
[Intro: two things you should know](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/Intro%20two%20things%20you%20should%20know%20b5fd0c5393a9498b93396e79fe71e8bf.md)
[What is stress](#letstalkaboutstress%2064040a0733074994976118bbe0acc7fb/What%20is%20stress%20b198b685ed6a474ab14f6fafff7004b6.md)
[When is there too much stress?](#letstalkaboutstress%2
Meta Data: {'source': 'docs/Notion_DB/#letstalkaboutstress 64040a0733074994976118bbe0acc7fb.md'}
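
The `source` value in the metadata carries a long identifier after the page name. If a readable title is wanted, one possible approach is to strip that suffix, as sketched below. The sketch assumes the exported file name ends in a space followed by a 32-character lowercase hex ID, as in the output above; the function name `notion_title` is made up for illustration.

```python
import re
from pathlib import PurePosixPath

# Assumption: the export appends a space and a 32-character hex ID to the name.
_NOTION_ID = re.compile(r"\s[0-9a-f]{32}$")


def notion_title(source: str) -> str:
    """Strip the directory, the extension, and a trailing hex ID, if any."""
    stem = PurePosixPath(source).stem  # file name without ".md"
    return _NOTION_ID.sub("", stem)


if __name__ == "__main__":
    src = "docs/Notion_DB/#letstalkaboutstress 64040a0733074994976118bbe0acc7fb.md"
    print(notion_title(src))
```

File names that do not match the assumed pattern pass through unchanged apart from losing the directory and extension, so the helper is safe to apply to every `page.metadata['source']` value.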


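That wraps up today's AI exploration, friends. We have now covered "LLM Tutorial for Developers: Document Loading (English Version)". Thank you for joining; I hope this session left you understanding AI better and enjoying it more. Remember, asking precise questions is the key to unlocking AI's potential! If you would like to learn more about AI, follow our official site, "AI智研社".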
