Microsoft Excel
The UnstructuredExcelLoader is used to load Microsoft Excel files.
The loader works with both .xlsx and .xls files. The page content
will be the raw text of the Excel file. If you use the loader in
"elements" mode, an HTML representation of the Excel file will be
available in the document metadata under the text_as_html key.
from langchain_community.document_loaders import UnstructuredExcelLoader
loader = UnstructuredExcelLoader("example_data/stanley-cups.xlsx", mode="elements")
docs = loader.load()
docs[0]
Document(page_content='\n  \n    \n      Team\n      Location\n      Stanley Cups\n    \n    \n      Blues\n      STL\n      1\n    \n    \n      Flyers\n      PHI\n      2\n    \n    \n      Maple Leafs\n      TOR\n      13\n    \n  \n', metadata={'source': 'example_data/stanley-cups.xlsx', 'filename': 'stanley-cups.xlsx', 'file_directory': 'example_data', 'filetype': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', 'page_number': 1, 'page_name': 'Stanley Cups', 'text_as_html': '<table border="1" class="dataframe">\n  <tbody>\n    <tr>\n      <td>Team</td>\n      <td>Location</td>\n      <td>Stanley Cups</td>\n    </tr>\n    <tr>\n      <td>Blues</td>\n      <td>STL</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <td>Flyers</td>\n      <td>PHI</td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <td>Maple Leafs</td>\n      <td>TOR</td>\n      <td>13</td>\n    </tr>\n  </tbody>\n</table>', 'category': 'Table'})
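The text_as_html value shown in the metadata above can be turned back into rows of cells. The following is a minimal sketch using only the Python standard library; `TableRows` and `html_table_to_rows` are hypothetical helper names, not part of LangChain, and the HTML string is copied from the output above with its whitespace trimmed.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self._in_cell = False
            # Trim surrounding whitespace once the cell is complete.
            self.rows[-1][-1] = self.rows[-1][-1].strip()

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def html_table_to_rows(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Same table as docs[0].metadata["text_as_html"], whitespace trimmed.
text_as_html = (
    '<table border="1" class="dataframe"><tbody>'
    "<tr><td>Team</td><td>Location</td><td>Stanley Cups</td></tr>"
    "<tr><td>Blues</td><td>STL</td><td>1</td></tr>"
    "<tr><td>Flyers</td><td>PHI</td><td>2</td></tr>"
    "<tr><td>Maple Leafs</td><td>TOR</td><td>13</td></tr>"
    "</tbody></table>"
)

rows = html_table_to_rows(text_as_html)
header, body = rows[0], rows[1:]
for row in body:
    print(dict(zip(header, row)))
```

With a loaded document, the same function could be applied to docs[0].metadata["text_as_html"] instead of the literal string.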
Using Azure AI Document Intelligence
Azure AI Document Intelligence (formerly known as
Azure Form Recognizer) is a machine-learning-based service that extracts text (including handwriting), tables, document structures (e.g., titles and section headings), and key-value pairs from digital or scanned PDFs, images, Office, and HTML files. Document Intelligence supports
JPEG/JPG, PNG, BMP, TIFF, HEIF, DOCX, XLSX, PPTX, and HTML.
The current implementation of a loader using Document Intelligence
can incorporate content page-wise and turn it into LangChain documents.
The default output format is markdown, which can be easily chained with
MarkdownHeaderTextSplitter for semantic document chunking. You can
also use mode="single" or mode="page" to return plain text, either as a
single document or split by page.
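To illustrate what header-based chunking of markdown output looks like, here is a simplified standard-library sketch. It is a stand-in for the idea, not LangChain's MarkdownHeaderTextSplitter implementation; the function name `split_markdown_by_headers` and the sample markdown are assumptions made for illustration.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_headers(markdown, max_level=2):
    """Group body lines under the header path that was active when they appeared."""
    chunks = []
    path = {}  # header level -> header title
    buffer = []

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"headers": dict(path), "text": text})
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()  # close the chunk that belongs to the previous header path
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level.
            for stale in [k for k in path if k >= level]:
                del path[stale]
            path[level] = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return chunks


# Hypothetical markdown, standing in for the loader's default output.
sample = "# Stanley Cups\nOverview.\n## Teams\nBlues, Flyers, Maple Leafs"

chunks = split_markdown_by_headers(sample)
for chunk in chunks:
    print(chunk["headers"], "->", chunk["text"])
```

Each chunk keeps the headers it sits under, which is what makes the chunks semantically meaningful for retrieval.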
Prerequisite
An Azure AI Document Intelligence resource in one of the 3 preview
regions: East US, West US2, West Europe. Follow this
document
to create one if you don't have one. You will pass <endpoint> and
<key> as parameters to the loader.
%pip install --upgrade --quiet  langchain langchain-community azure-ai-documentintelligence
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
file_path = "<filepath>"
endpoint = "<endpoint>"
key = "<key>"
loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=endpoint, api_key=key, file_path=file_path, api_model="prebuilt-layout"
)
documents = loader.load()
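Rather than hard-coding the endpoint and key, they can be read from the environment before being passed to the loader. This is a minimal sketch under stated assumptions: the variable names AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT and AZURE_DOCUMENT_INTELLIGENCE_KEY and the helper `read_azure_settings` are hypothetical choices, not names the loader itself looks for.

```python
import os

# Hypothetical environment variable names; pick whatever suits your deployment.
ENDPOINT_VAR = "AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT"
KEY_VAR = "AZURE_DOCUMENT_INTELLIGENCE_KEY"


def read_azure_settings(env=os.environ):
    """Return (endpoint, key), failing early if either value is missing or empty."""
    missing = [name for name in (ENDPOINT_VAR, KEY_VAR) if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[ENDPOINT_VAR], env[KEY_VAR]


# Demonstration with an explicit mapping so the sketch runs without real credentials.
endpoint, key = read_azure_settings({ENDPOINT_VAR: "<endpoint>", KEY_VAR: "<key>"})
```

In real use, call read_azure_settings() with no arguments and pass the returned values as api_endpoint and api_key.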