URL
This example covers how to load HTML
documents from a list of URLs
into the Document
format that we can use downstream.
Unstructured URL Loader
You have to install the unstructured
library:
!pip install -U unstructured
from langchain_community.document_loaders import UnstructuredURLLoader
API Reference:
urls = [
"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023",
"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023",
]
Pass in ssl_verify=False with headers=headers to get past ssl_verification error.
loader = UnstructuredURLLoader(urls=urls)
data = loader.load()
Selenium URL Loader
This covers how to load HTML documents from a list of URLs using the
SeleniumURLLoader
.
Using Selenium
allows us to load pages that require JavaScript to
render.
To use the SeleniumURLLoader
, you have to install selenium
and
unstructured
.
!pip install -U selenium unstructured
from langchain_community.document_loaders import SeleniumURLLoader
API Reference:
urls = [
"https://www.youtube.com/watch?v=dQw4w9WgXcQ",
"https://goo.gl/maps/NDSHwePEyaHMFGwh8",
]
loader = SeleniumURLLoader(urls=urls)
data = loader.load()
Playwright URL Loader
This covers how to load HTML documents from a list of URLs using the
PlaywrightURLLoader
.
Playwright enables reliable end-to-end testing for modern web apps.
As in the Selenium case, Playwright
allows us to load and render the
JavaScript pages.
To use the PlaywrightURLLoader
, you have to install playwright
and
unstructured
. Additionally, you have to install the
Playwright Chromium
browser:
!pip install -U playwright unstructured
!playwright install
from langchain_community.document_loaders import PlaywrightURLLoader
API Reference:
urls = [
"https://www.youtube.com/watch?v=dQw4w9WgXcQ",
"https://goo.gl/maps/NDSHwePEyaHMFGwh8",
]
loader = PlaywrightURLLoader(urls=urls, remove_selectors=["header", "footer"])
data = loader.load()