Subtitle
The SubRip file format is described on the Matroska multimedia container format website as "perhaps the most basic of all subtitle formats." SubRip (SubRip Text) files are named with the extension .srt, and contain formatted lines of plain text in groups separated by a blank line. Subtitles are numbered sequentially, starting at 1. The timecode format used is hours:minutes:seconds,milliseconds, with time units fixed to two zero-padded digits and fractions fixed to three zero-padded digits (00:00:00,000). The fractional separator is the comma, since the program was written in France.
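To make the layout described above concrete, here is a minimal sketch that parses SRT text using only the Python standard library. It is not how `pysrt` or `SRTLoader` work internally. The names `parse_timecode`, `parse_srt`, and `sample` are hypothetical, and the sketch assumes the usual `-->` arrow between start and end times, which the description above does not mention. It also ignores optional positioning data that some files place after the end time.

```python
import re
from datetime import timedelta

# Matches an SRT timecode such as 00:01:02,345 (comma as fractional separator).
TIMECODE = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def parse_timecode(text):
    """Convert 'HH:MM:SS,mmm' into a timedelta."""
    match = TIMECODE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not an SRT timecode: {text!r}")
    hours, minutes, seconds, millis = (int(group) for group in match.groups())
    return timedelta(
        hours=hours, minutes=minutes, seconds=seconds, milliseconds=millis
    )


def parse_srt(content):
    """Split SRT text into (index, start, end, text) tuples."""
    entries = []
    # Groups are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.splitlines()
        if len(lines) < 2:
            continue  # skip malformed groups in this simplified sketch
        index = int(lines[0].strip())
        # Assumption: start and end times are separated by '-->'.
        start, end = (parse_timecode(part) for part in lines[1].split("-->"))
        entries.append((index, start, end, "\n".join(lines[2:])))
    return entries


# Hypothetical sample data, not taken from the example file.
sample = """1
00:00:01,000 --> 00:00:04,500
Hello there.

2
00:00:05,000 --> 00:00:07,250
A second subtitle,
on two lines.
"""

for entry in parse_srt(sample):
    print(entry)
```

Each tuple holds the sequence number, the start and end times as `timedelta` objects, and the subtitle text with its internal line breaks preserved.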
How to load data from subtitle (.srt) files
Please download the example .srt file from here.
%pip install --upgrade --quiet pysrt
from langchain_community.document_loaders import SRTLoader
loader = SRTLoader(
"example_data/Star_Wars_The_Clone_Wars_S06E07_Crisis_at_the_Heart.srt"
)
docs = loader.load()
docs[0].page_content[:100]
'<i>Corruption discovered\nat the core of the Banking Clan!</i> <i>Reunited, Rush Clovis\nand Senator A'
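The output above still contains inline formatting tags such as `<i>` and line breaks from the subtitle file. If plain text is preferred downstream, one possible post-processing step is sketched below. The helper `strip_markup` is hypothetical and not part of LangChain or `pysrt`. It assumes the tags are simple, HTML-like, and contain no `>` inside them.

```python
import re

# Prefix of the loader output shown above, copied verbatim.
text = (
    "<i>Corruption discovered\nat the core of the Banking Clan!</i> "
    "<i>Reunited, Rush Clovis\nand Senator A"
)


def strip_markup(raw):
    """Remove simple inline tags and normalize whitespace."""
    # Drop tags such as <i>, </i>, <b>, or <font color="...">.
    without_tags = re.sub(r"</?[A-Za-z][^>]*>", "", raw)
    # Collapse line breaks and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", without_tags).strip()


print(strip_markup(text))
```

The same function could be applied to `docs[0].page_content` after loading, at the cost of losing the original line breaks and emphasis.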