@100ideas
Forked from glasslion/vtt2text.py
Last active December 3, 2023 17:37
This script converts a YouTube subtitle file (vtt) to plain text.

how to extract subtitles or closed captions from a YouTube URL

Hello, I personally was looking for a simple, minimal script that performed just this function: parsing vtt, discarding timecodes, merging chronologically close lines into larger blocks, and outputting the result as a human-readable txt file. Just wanted to say that for my use case I prefer the way it merges multiple lines under a less fine-grained timecode.

@glasslion, thanks a lot for sharing this script!

vtt2text.py is a nice little script by glasslion I just found that seems to do what I am looking for: convert a subtitle file, even closed-captioning "roll-up" style webvtt formats like what I have, into a human-friendly full-page transcript.

Here are some usage notes:

# install youtube-dl & clone glasslion's vtt2text.py script
$ git clone https://gist.github.com/glasslion/b2fcad16bc8a9630dbd7a945ab5ebf5e caps2txt
Cloning into 'caps2txt'...
$ cd ./caps2txt

$ youtube-dl -o ytdl-subs --skip-download --write-sub --sub-format vtt "https://www.youtube.com/watch?v=KzWS7gJX5Z8"
[youtube] KzWS7gJX5Z8: Downloading webpage
[info] Writing video subtitles to: ytdl-subs.en.vtt

# 'l' is an alias for 'tree --dirsfirst -aFCNL 1'
$ l
.
├── .git/
├── ytdl-subs.en.vtt
└── vtt2text.py

# convert...
$ python3 vtt2text.py ytdl-subs.en.vtt
$ l
.
├── .git/
├── vtt2text.py
├── ytdl-subs.en.txt
└── ytdl-subs.en.vtt

1 directory, 3 files

$ head -n 40 ytdl-subs.en.vtt ytdl-subs.en.txt
==> ytdl-subs.en.vtt <==
WEBVTT
Kind: captions
Language: en

00:03:54.333 --> 00:03:55.201 align:start position:0%

TH<00:03:54.366><c>E </c><00:03:54.399><c>SE</c><00:03:54.433><c>RG</c><00:03:54.466><c>EA</c><00:03:54.500><c>NT</c><00:03:54.533><c> A</c><00:03:54.566><c>T </c><00:03:54.600><c>AR</c><00:03:54.633><c>MS</c><00:03:54.666><c>: </c><00:03:54.700><c>MA</c><00:03:54.733><c>DA</c><00:03:54.766><c>M</c><00:03:55.101><c> </c>

00:03:55.201 --> 00:03:55.334 align:start position:0%
THE SERGEANT AT ARMS: MADAM


00:03:55.334 --> 00:03:57.236 align:start position:0%
THE SERGEANT AT ARMS: MADAM
SP<00:03:55.367><c>EA</c><00:03:55.401><c>KE</c><00:03:55.434><c>R,</c><00:03:55.468><c> T</c><00:03:55.501><c>HE</c><00:03:56.102><c> V</c><00:03:56.135><c>IC</c><00:03:56.168><c>E </c><00:03:56.202><c>PR</c><00:03:56.235><c>ES</c><00:03:56.268><c>ID</c><00:03:56.302><c>EN</c><00:03:56.335><c>T </c><00:03:56.368><c>AN</c><00:03:56.402><c>D</c><00:03:57.103><c> </c>

00:03:57.236 --> 00:03:57.369 align:start position:0%
SPEAKER, THE VICE PRESIDENT AND


00:03:57.369 --> 00:07:49.535 align:start position:0%
SPEAKER, THE VICE PRESIDENT AND
TH<00:03:57.403><c>E </c><00:03:57.436><c>UN</c><00:03:57.470><c>IT</c><00:03:57.503><c>ED</c><00:03:57.536><c> S</c><00:03:57.570><c>TA</c><00:03:57.603><c>TE</c><00:03:57.636><c>S </c><00:03:57.670><c>SE</c><00:03:57.703><c>NA</c><00:03:57.736><c>TE</c><00:03:57.770><c>.</c>

00:07:49.535 --> 00:07:50.603 align:start position:0%

TH<00:07:49.568><c>E </c><00:07:49.601><c>SP</c><00:07:49.635><c>EA</c><00:07:49.668><c>KE</c><00:07:49.702><c>R:</c><00:07:49.735><c> T</c><00:07:49.768><c>HE</c><00:07:50.303><c> H</c><00:07:50.336><c>OU</c><00:07:50.369><c>SE</c><00:07:50.403><c> C</c><00:07:50.436><c>OM</c><00:07:50.469><c>ES</c><00:07:50.503><c> </c>

00:07:50.603 --> 00:07:50.736 align:start position:0%
THE SPEAKER: THE HOUSE COMES


00:07:50.736 --> 00:07:54.773 align:start position:0%
THE SPEAKER: THE HOUSE COMES
TO<00:07:50.770><c>UR</c><00:07:50.803><c>ED</c><00:07:50.836><c> F</c><00:07:50.870><c>OR</c><00:07:51.304><c> T</c><00:07:51.337><c>HI</c><00:07:51.370><c>S</c><00:07:54.506><c> I</c><00:07:54.540><c>MP</c><00:07:54.573><c>OR</c><00:07:54.606><c>TA</c><00:07:54.640><c>NT</c><00:07:54.673><c>, </c>

00:07:54.773 --> 00:07:54.907 align:start position:0%
TOURED FOR THIS IMPORTANT,



==> ytdl-subs.en.txt <==

00:03
THE SERGEANT AT ARMS: MADAM  SPEAKER, THE VICE PRESIDENT AND

00:07
THE UNITED STATES SENATE. THE SPEAKER: THE HOUSE COMES
TOURED FOR THIS IMPORTANT,  HISTORIC MEETING. LET US REMIND THAT EACH SIDE,

00:08
HOUSE AND SENATE, DEMOCRATS AND  REPUBLICANS, EACH HAVE 11
MEMBERS ALLOWED TO BE PRESENT ON THE FLOOR. OTHERS MAY BE IN THE GALLERY.
THIS IS AT THE GUIDANCE OF THE  OFFICIATING -- ATTENDING
PHYSICIAN AND THE SERGEANT AT  ARMS. THE GENTLEMAN ON THE REPUBLICAN
SIDE OF THE AISLE WILL PLEASE  OBSERVE THE SOCIAL DISTANCING
AND AGREE TO WHAT WE HAVE, 11  MEMBERS ON EACH SIDE, SO THAT --
RESPONSIBILITIES TO THIS  CHAMBER, TO THIS RESPONSIBILITY, AND TO THIS HOUSE OF
REPRESENTATIVES. PLEASE EXIT THE FLOOR IF YOU DO
NOT HAVE AN ASSIGNED ROLE FROM  YOUR LEADERSHIP.
YOU CAN SHARE WITH YOUR STAFF IF YOU WANT TO HAVE A FEW MORE, BUT

00:09
YOU CANNOT BE TOGETHER ON THE  FLOOR OF THE HOUSE WITH THAT
MANY PEOPLE IN HERE. I'LL THANK THE SENATE AND  THOSE -- LET'S GO.
LET'S JUST START. &gt;&gt; MADAM SPEAKER. VICE PRESIDENT PENCE: MADAM
SPEAKER, MEMBERS OF CONGRESS,  PURSUANT TO THE CONSTITUTION AND
THE LAWS OF THE UNITED STATES,  THE SENATE AND HOUSE OF
REPRESENTATIVES ARE MEETING IN  JOINT SESSION TO VERIFY THE
CERTIFICATES AND COUNT THE VOTES OF THE ELECTORS IN THE SEVERAL
STATES FOR PRESIDENT AND VICE  PRESIDENT OF THE UNITED STATES.
AFTER ASCERTAINMENT HAS BEEN  HAD, CORRECT IN FORM, THE
TELLERS WILL COUNT AND MAKE A  LIST OF THE VOTES CAST BY THE

00:10
ELECTORS OF THE SEVERAL STATES. THE TELLERS ON THE PART OF THE
TWO HOUSES HAVE TAKEN THEIR  PLACES AT THE CLERK'S DESK.
WITHOUT OBJECTION, THE TELLERS  WILL DISPENSED WITH THE READING
OF THE FORMAL PORTIONS OF THE  CERTIFICATES. AFTER ASCERTAINING THAT THE
CERTIFICATES ARE REGULAR IN FORM AND AUTHENTIC, THE TELLERS WILL
ANNOUNCE THE VOTES CAST BY THE  ELECTORS FOR EACH STATE,
BEGINNING WITH ALABAMA. WHICH THE PARLIAMENTARIANS  ADVISE ME IS THE ONLY

@Crowdscriber/caption-parser - Scala vtt parser that de-dupes cues in roll-up style captions

check out the implementation to see how the deduping works - a state machine plus a regex matcher that discriminates roll-up cues from finished ones.
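
For comparison, here is a rough Python approximation of that idea (my own sketch, not the Scala implementation): cues whose text still contains inline timing/<c> tags are treated as unfinished roll-up "repaints", only the plain-text cues are kept, and adjacent repeats are dropped.

import re

def dedupe_rollup(cues):
    """Collapse roll-up cues to one line each.

    `cues` is assumed to be a list of dicts with 'start' and 'text' keys
    (whatever shape your parser gives you); this is a simplified sketch,
    not the caption-parser state machine.
    """
    finished = []
    for cue in cues:
        # cues still containing <c>/<HH:MM:SS.mmm> markup are being "painted";
        # the finished line shows up again later as plain text, so skip them
        if re.search(r'<[^>]+>', cue['text']):
            continue
        text = cue['text'].strip()
        # roll-up captions repeat the previous finished line; drop adjacent repeats
        if finished and finished[-1]['text'] == text:
            continue
        finished.append({'start': cue['start'], 'text': text})
    return finished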

@bausano and others, if you want more control over the parsing and the structure of the output format, check out the webvtt-py python package. I learned about it from a blog post written by William Morgan.

He wrote a tutorial showing how to programmatically fetch vtt caption files from google/youtube in bulk, then use webvtt and a pandas DataFrame in Python to parse and extract the caption content, including formatting it into tidy csv files to use as a downstream NLP corpus. Sounds like just what you are looking for...

Creating an NLP data set from YouTube subtitles. William Morgan, Mar 8, 2019

This project started out just like most data science projects do: collecting data. In my case I needed subtitles from videos on YouTube. Not just any videos, but videos of math lectures. The idea was to process the subtitles using NLP techniques and build a classifier that could differentiate subjects in mathematics. In this article I will show you both of the ways I like to “scrape” subtitles from YouTube videos: manually downloading and cleaning the subtitles, and programmatically obtaining the subtitles using the API and youtube-dl.

from https://medium.com/@morga046/creating-an-nlp-data-set-from-youtube-subtitles-fb59c0955c2

# code details from W Morgan (python):
import os

# First, we need a list of the .vtt files:
filenames_vtt = [os.fsdecode(file) for file in os.listdir(os.getcwd()) if os.fsdecode(file).endswith(".vtt")]

#Check file names
filenames_vtt[:2]

# Then, we write a function to extract the information and store it.
import webvtt
import pandas as pd
def convert_vtt(filenames):    
    #create an assets folder if one does not yet exist
    if os.path.isdir('{}/assets'.format(os.getcwd())) == False:
        os.makedirs('assets')
    #extract the text and times from the vtt file
    for file in filenames:
        captions = webvtt.read(file)
        text_time = pd.DataFrame()
        text_time['text'] = [caption.text for caption in captions]
        text_time['start'] = [caption.start for caption in captions]
        text_time['stop'] = [caption.end for caption in captions]
        text_time.to_csv('assets/{}.csv'.format(file[:-4]),index=False) #-4 to remove '.vtt'
        #remove files from local drive
        os.remove(file)
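
The excerpt ends there; presumably the function is then applied to the file list gathered above, something along the lines of:

convert_vtt(filenames_vtt)  # writes one csv per .vtt into ./assets and removes the original files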

another option: node-webvtt

nice coding style but no attempt to deal with duplicate cues.

here's a browser code sandbox to play around in: https://frontarm.com/demoboard/?id=344821fa-577d-42ed-939c-8d6468d7685c


another option: @plussub/srt-vtt-parser

it is well-written in TypeScript with minimal dependencies, but it's not so obvious without diving into the source how to implement de-duplication. will look at other libs in the meantime.

var srtVttParser = require("@plussub/srt-vtt-parser")

/*
* note, the webvtt files I've been working with begin with the required WEBVTT line but also have two lines of metadata.
* see https://github.com/osk/node-webvtt#metadata for background and code that works properly with it. srt-vtt-parser
* chokes if these are present.
*
* typical header of file from youtube-dl:
*
*    WEBVTT
*    Kind: captions
*    Language: en
* 
*    <timecode>...
*/

let input = `WEBVTT

00:03:54.333 --> 00:03:55.201 align:start position:0%
TH<00:03:54.366><c>E </c><00:03:54.399><c>SE</c><00:03:54.433><c>RG</c><00:03:54.466><c>EA</c><00:03:54.500><c>NT</c><00:03:54.533><c> A</c><00:03:54.566><c>T </c><00:03:54.600><c>AR</c><00:03:54.633><c>MS</c><00:03:54.666><c>: </c><00:03:54.700><c>MA</c><00:03:54.733><c>DA</c><00:03:54.766><c>M</c><00:03:55.101><c> </c>

00:03:55.201 --> 00:03:55.334 align:start position:0%
THE SERGEANT AT ARMS: MADAM 
 
00:03:55.334 --> 00:03:57.236 align:start position:0%
THE SERGEANT AT ARMS: MADAM 
SP<00:03:55.367><c>EA</c><00:03:55.401><c>KE</c><00:03:55.434><c>R,</c><00:03:55.468><c> T</c><00:03:55.501><c>HE</c><00:03:56.102><c> V</c><00:03:56.135><c>IC</c><00:03:56.168><c>E </c><00:03:56.202><c>PR</c><00:03:56.235><c>ES</c><00:03:56.268><c>ID</c><00:03:56.302><c>EN</c><00:03:56.335><c>T </c><00:03:56.368><c>AN</c><00:03:56.402><c>D</c><00:03:57.103><c> </c>

00:03:57.236 --> 00:03:57.369 align:start position:0%
SPEAKER, THE VICE PRESIDENT AND 
 
00:03:57.369 --> 00:07:49.535 align:start position:0%
SPEAKER, THE VICE PRESIDENT AND 
TH<00:03:57.403><c>E </c><00:03:57.436><c>UN</c><00:03:57.470><c>IT</c><00:03:57.503><c>ED</c><00:03:57.536><c> S</c><00:03:57.570><c>TA</c><00:03:57.603><c>TE</c><00:03:57.636><c>S </c><00:03:57.670><c>SE</c><00:03:57.703><c>NA</c><00:03:57.736><c>TE</c><00:03:57.770><c>.</c>

00:07:49.535 --> 00:07:50.603 align:start position:0%
TH<00:07:49.568><c>E </c><00:07:49.601><c>SP</c><00:07:49.635><c>EA</c><00:07:49.668><c>KE</c><00:07:49.702><c>R:</c><00:07:49.735><c> T</c><00:07:49.768><c>HE</c><00:07:50.303><c> H</c><00:07:50.336><c>OU</c><00:07:50.369><c>SE</c><00:07:50.403><c> C</c><00:07:50.436><c>OM</c><00:07:50.469><c>ES</c><00:07:50.503><c> </c>

00:07:50.603 --> 00:07:50.736 align:start position:0%
THE SPEAKER: THE HOUSE COMES 

00:07:50.736 --> 00:07:54.773 align:start position:0%
THE SPEAKER: THE HOUSE COMES 
TO<00:07:50.770><c>UR</c><00:07:50.803><c>ED</c><00:07:50.836><c> F</c><00:07:50.870><c>OR</c><00:07:51.304><c> T</c><00:07:51.337><c>HI</c><00:07:51.370><c>S</c><00:07:54.506><c> I</c><00:07:54.540><c>MP</c><00:07:54.573><c>OR</c><00:07:54.606><c>TA</c><00:07:54.640><c>NT</c><00:07:54.673><c>, </c>

00:07:54.773 --> 00:07:54.907 align:start position:0%
TOURED FOR THIS IMPORTANT, 
 
00:07:54.907 --> 00:07:55.708 align:start position:0%
TOURED FOR THIS IMPORTANT, 
HI<00:07:54.940><c>ST</c><00:07:54.973><c>OR</c><00:07:55.007><c>IC</c><00:07:55.475><c> M</c><00:07:55.508><c>EE</c><00:07:55.541><c>TI</c><00:07:55.575><c>NG</c><00:07:55.608><c>.</c>

00:07:55.708 --> 00:07:55.842 align:start position:0%
HISTORIC MEETING.
`
console.log(JSON.stringify(srtVttParser.parse(input), null, 2))

result:

{
  "entries": [
    {
      "id": "",
      "from": 234333,
      "to": 235201,
      "text": "TH<00:03:54.366><c>E </c><00:03:54.399><c>SE</c><00:03:54.433><c>RG</c><00:03:54.466><c>EA</c><00:03:54.500><c>NT</c><00:03:54.533><c> A</c><00:03:54.566><c>T </c><00:03:54.600><c>AR</c><00:03:54.633><c>MS</c><00:03:54.666><c>: </c><00:03:54.700><c>MA</c><00:03:54.733><c>DA</c><00:03:54.766><c>M</c><00:03:55.101><c> </c>"
    },
    {
      "id": "",
      "from": 235201,
      "to": 235334,
      "text": "THE SERGEANT AT ARMS: MADAM "
    },
    {
      "id": "",
      "from": 235334,
      "to": 237236,
      "text": "THE SERGEANT AT ARMS: MADAM \nSP<00:03:55.367><c>EA</c><00:03:55.401><c>KE</c><00:03:55.434><c>R,</c><00:03:55.468><c> T</c><00:03:55.501><c>HE</c><00:03:56.102><c> V</c><00:03:56.135><c>IC</c><00:03:56.168><c>E </c><00:03:56.202><c>PR</c><00:03:56.235><c>ES</c><00:03:56.268><c>ID</c><00:03:56.302><c>EN</c><00:03:56.335><c>T </c><00:03:56.368><c>AN</c><00:03:56.402><c>D</c><00:03:57.103><c> </c>"
    },
    {
      "id": "",
      "from": 237236,
      "to": 237369,
      "text": "SPEAKER, THE VICE PRESIDENT AND "
    },
    {
      "id": "",
      "from": 237369,
      "to": 469535,
      "text": "SPEAKER, THE VICE PRESIDENT AND \nTH<00:03:57.403><c>E </c><00:03:57.436><c>UN</c><00:03:57.470><c>IT</c><00:03:57.503><c>ED</c><00:03:57.536><c> S</c><00:03:57.570><c>TA</c><00:03:57.603><c>TE</c><00:03:57.636><c>S </c><00:03:57.670><c>SE</c><00:03:57.703><c>NA</c><00:03:57.736><c>TE</c><00:03:57.770><c>.</c>"
    },
    {
      "id": "",
      "from": 469535,
      "to": 470603,
      "text": "TH<00:07:49.568><c>E </c><00:07:49.601><c>SP</c><00:07:49.635><c>EA</c><00:07:49.668><c>KE</c><00:07:49.702><c>R:</c><00:07:49.735><c> T</c><00:07:49.768><c>HE</c><00:07:50.303><c> H</c><00:07:50.336><c>OU</c><00:07:50.369><c>SE</c><00:07:50.403><c> C</c><00:07:50.436><c>OM</c><00:07:50.469><c>ES</c><00:07:50.503><c> </c>"
    },
    {
      "id": "",
      "from": 470603,
      "to": 470736,
      "text": "THE SPEAKER: THE HOUSE COMES "
    },
    {
      "id": "",
      "from": 470736,
      "to": 474773,
      "text": "THE SPEAKER: THE HOUSE COMES \nTO<00:07:50.770><c>UR</c><00:07:50.803><c>ED</c><00:07:50.836><c> F</c><00:07:50.870><c>OR</c><00:07:51.304><c> T</c><00:07:51.337><c>HI</c><00:07:51.370><c>S</c><00:07:54.506><c> I</c><00:07:54.540><c>MP</c><00:07:54.573><c>OR</c><00:07:54.606><c>TA</c><00:07:54.640><c>NT</c><00:07:54.673><c>, </c>"
    },
    {
      "id": "",
      "from": 474773,
      "to": 474907,
      "text": "TOURED FOR THIS IMPORTANT, "
    },
    {
      "id": "",
      "from": 474907,
      "to": 475708,
      "text": "TOURED FOR THIS IMPORTANT, \nHI<00:07:54.940><c>ST</c><00:07:54.973><c>OR</c><00:07:55.007><c>IC</c><00:07:55.475><c> M</c><00:07:55.508><c>EE</c><00:07:55.541><c>TI</c><00:07:55.575><c>NG</c><00:07:55.608><c>.</c>"
    },
    {
      "id": "",
      "from": 475708,
      "to": 475842,
      "text": "HISTORIC MEETING."
    }
  ]
}

---

## misc notes and research on webvtt format and conventions
basically, most tools and users assume the file is more like a subtitle file - polished, no duplicate lines, etc. - but youtube, c-span, and many other video producers that are more oriented towards producing and sharing live broadcasts produce less polished "roll-up style live captioning" files that are valid webvtt but include a lot of repeated lines.

- https://gist.github.com/glasslion/b2fcad16bc8a9630dbd7a945ab5ebf5e
  - this script works
- **https://www.reddit.com/r/youtubedl/comments/jvn6jx/how_to_convert_vtt_subtitles_into_human_readable/**
- https://github.com/glut23/webvtt-py
- https://medium.com/@morga046/creating-an-nlp-data-set-from-youtube-subtitles-fb59c0955c2
- https://python-pytube.readthedocs.io/en/latest/user/quickstart.html#subtitle-caption-tracks
- https://www.ccextractor.org/public:gsoc:subtitle_extractor_technical_docs
- https://github.com/jdepoix/youtube-transcript-api#cli
- https://github.com/TimEllis/vttprocessor
  - hosted: https://www.lancaster.ac.uk/staff/ellist/vtttocaqdas.html

overlapping cue timing in webvtt
- gets messy in live captioning that build cues incrementally
- https://github.com/w3c/webvtt/issues/318
- do these older pro format conventions follow `CEA608`?
  - https://dvcs.w3.org/hg/text-tracks/raw-file/default/608toVTT/608toVTT.html
  - https://github.com/Dash-Industry-Forum/cea608.js/blob/aa3d036106f3f06aaebea57c470b70f238683f11/lib/cea608-towebvtt.js
- pop-on vs paint-on vs roll-up caption modes & fed regulations & webvtt
  - https://www.w3.org/community/texttracks/wiki/RollupCaptions
- *videojs/http-streaming: fix: VTTCues with identical time intervals being incorrectly removed* (that criterion is sketched in Python below, after this list)
  - https://github.com/videojs/http-streaming/pull/1005
  - > We can meet those criteria by only removing cues that have identical time intervals and identical text. This will ensure we remove any cues that overlap VTT segments, while keeping any cues that are actually intended to be displayed at the same time (which we can reasonably assume will have different text)
- **caption-parser: WebVTT De-duping**
  - > "A lot of times you aren't using a caption display mechanism that supports multi-line rollup captions. In these situations, you really want to "de-duplicate" the captions by only keeping one line of captions. SubtitleUtil provides a vttToSubtitles convenience method that lets you control whether or not captions are de-duped."
  - https://github.com/crowdscriber/caption-parser/#webvtt-de-duping
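
A tiny Python illustration of the videojs/http-streaming criterion quoted above (my own sketch, assuming cues with start/end/text fields): a cue is dropped only when an earlier cue had both the identical time interval and identical text.

def remove_exact_duplicates(cues):
    """Keep cues unless an earlier cue had the same (start, end, text)."""
    seen = set()
    kept = []
    for cue in cues:
        key = (cue['start'], cue['end'], cue['text'])
        if key in seen:
            continue  # identical interval AND identical text: a true duplicate
        seen.add(key)
        kept.append(cue)
    return kept

Cues that merely overlap in time but carry different text are intentionally kept, since those may be meant to display at the same time.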


- SRT vs TTML vs webvtt
  - https://mux.com/blog/subtitles-captions-webvtt-hls-and-those-magic-flags/

---

https://stackoverflow.com/a/54818581


Another option is to use `youtube-dl`:

    youtube-dl --skip-download --write-auto-sub $youtube_url

The default format is `vtt` and the other available format is `ttml` (`--sub-format ttml`).

    --write-sub
           Write subtitle file

    --write-auto-sub
           Write automatically generated subtitle file (YouTube only)

    --all-subs
           Download all the available subtitles of the video

    --list-subs
           List all available subtitles for the video

    --sub-format FORMAT
           Subtitle format, accepts formats preference, for example: "srt" or "ass/srt/best"

    --sub-lang LANGS
           Languages of the subtitles to download (optional) separated by commas, use --list-subs for available language tags

You can use `ffmpeg` to convert the subtitle file to another format:

    ffmpeg -i input.vtt output.srt

This is what the VTT subtitles look like:

    WEBVTT
    Kind: captions
    Language: en

    00:00:01.429 --> 00:00:04.249 align:start position:0%

    ladies<00:00:02.429><c> and</c><00:00:02.580><c> gentlemen</c><c.colorE5E5E5><00:00:02.879><c> I'd</c></c><c.colorCCCCCC><00:00:03.870><c> like</c></c><c.colorE5E5E5><00:00:04.020><c> to</c><00:00:04.110><c> thank</c></c>

    00:00:04.249 --> 00:00:04.259 align:start position:0%
    ladies and gentlemen<c.colorE5E5E5> I'd</c><c.colorCCCCCC> like</c><c.colorE5E5E5> to thank
     </c>

    00:00:04.259 --> 00:00:05.930 align:start position:0%
    ladies and gentlemen<c.colorE5E5E5> I'd</c><c.colorCCCCCC> like</c><c.colorE5E5E5> to thank
    you<00:00:04.440><c> for</c><00:00:04.620><c> coming</c><00:00:05.069><c> tonight</c><00:00:05.190><c> especially</c></c><c.colorCCCCCC><00:00:05.609><c> at</c></c>

    00:00:05.930 --> 00:00:05.940 align:start position:0%
    you<c.colorE5E5E5> for coming tonight especially</c><c.colorCCCCCC> at
     </c>

    00:00:05.940 --> 00:00:07.730 align:start position:0%
    you<c.colorE5E5E5> for coming tonight especially</c><c.colorCCCCCC> at
    such<00:00:06.180><c> short</c><00:00:06.690><c> notice</c></c>

    00:00:07.730 --> 00:00:07.740 align:start position:0%
    such short notice


    00:00:07.740 --> 00:00:09.620 align:start position:0%
    such short notice
    I'm<00:00:08.370><c> sure</c><c.colorE5E5E5><00:00:08.580><c> mr.</c><00:00:08.820><c> Irving</c><00:00:09.000><c> will</c><00:00:09.120><c> fill</c><00:00:09.300><c> you</c><00:00:09.389><c> in</c><00:00:09.420><c> on</c></c>

    00:00:09.620 --> 00:00:09.630 align:start position:0%
    I'm sure<c.colorE5E5E5> mr. Irving will fill you in on
     </c>

    00:00:09.630 --> 00:00:11.030 align:start position:0%
    I'm sure<c.colorE5E5E5> mr. Irving will fill you in on
    the<00:00:09.750><c> circumstances</c><00:00:10.440><c> that's</c><00:00:10.620><c> brought</c><00:00:10.920><c> us</c></c>

    00:00:11.030 --> 00:00:11.040 align:start position:0%
    <c.colorE5E5E5>the circumstances that's brought us
     </c>

Here are the same subtitles without the part at the top of the file and without tags:

    00:00:01.429 --> 00:00:04.249 align:start position:0%

    ladies and gentlemen I'd like to thank

    00:00:04.249 --> 00:00:04.259 align:start position:0%
    ladies and gentlemen I'd like to thank


    00:00:04.259 --> 00:00:05.930 align:start position:0%
    ladies and gentlemen I'd like to thank
    you for coming tonight especially at

    00:00:05.930 --> 00:00:05.940 align:start position:0%
    you for coming tonight especially at


    00:00:05.940 --> 00:00:07.730 align:start position:0%
    you for coming tonight especially at
    such short notice

    00:00:07.730 --> 00:00:07.740 align:start position:0%
    such short notice


    00:00:07.740 --> 00:00:09.620 align:start position:0%
    such short notice
    I'm sure mr. Irving will fill you in on

    00:00:09.620 --> 00:00:09.630 align:start position:0%
    I'm sure mr. Irving will fill you in on


    00:00:09.630 --> 00:00:11.030 align:start position:0%
    I'm sure mr. Irving will fill you in on
    the circumstances that's brought us

You can see that each subtitle text is repeated three times. There is a new subtitle text every eighth line (3rd, 11th, 19th, and 27th).

This converts the VTT subtitles to a simpler format:

    sed '1,/^$/d' *.vtt| # remove the part at the top
    sed 's/<[^>]*>//g'| # remove tags
    awk -F. 'NR%8==1{printf"%s ",$1}NR%8==3' # print each new subtitle text and its start time without milliseconds

This is what the output of the command above looks like:

    00:00:01 ladies and gentlemen I'd like to thank
    00:00:04 you for coming tonight especially at
    00:00:05 such short notice
    00:00:07 I'm sure mr. Irving will fill you in on
    00:00:09 the circumstances that's brought us

This prints the closed captions of a video in the simplified format:

`cap()(cd /tmp;rm -f -- *.vtt;youtube-dl --skip-download --write-auto-sub -- "$1";sed '1,/^$/d' -- *.vtt|sed 's/<[^>]*>//g'|awk -F. 'NR%8==1{printf"%s ",$1}NR%8==3')`

The command below downloads the captions of all videos on a channel. When there is an error like `Unable to extract video data`, `-i` (`--ignore-errors`) causes `youtube-dl` to skip the video instead of exiting with an error.

`youtube-dl -i --skip-download --write-auto-sub -o '%(upload_date)s.%(title)s.%(id)s.%(ext)s' https://www.youtube.com/channel/$channelid;for f in *.vtt;do sed '1,/^$/d' "$f"|sed 's/<[^>]*>//g'|awk -F. 'NR%8==1{printf"%s ",$1}NR%8==3'>"${f%.vtt}";done`
"""
Convert YouTube subtitles(vtt) to human readable text.
Download only subtitles from YouTube with youtube-dl:
youtube-dl --skip-download --convert-subs vtt <video_url>
Note that default subtitle format provided by YouTube is ass, which is hard
to process with simple regex. Luckily youtube-dl can convert ass to vtt, which
is easier to process.
To conver all vtt files inside a directory:
find . -name "*.vtt" -exec python vtt2text.py {} \;
"""
import sys
import re
def remove_tags(text):
"""
Remove vtt markup tags
"""
tags = [
r'</c>',
r'<c(\.color\w+)?>',
r'<\d{2}:\d{2}:\d{2}\.\d{3}>',
]
for pat in tags:
text = re.sub(pat, '', text)
# extract timestamp, only kep HH:MM
text = re.sub(
r'(\d{2}:\d{2}):\d{2}\.\d{3} --> .* align:start position:0%',
r'\g<1>',
text
)
text = re.sub(r'^\s+$', '', text, flags=re.MULTILINE)
return text
def remove_header(lines):
"""
Remove vtt file header
"""
pos = -1
for mark in ('##', 'Language: en',):
if mark in lines:
pos = lines.index(mark)
lines = lines[pos+1:]
return lines
def merge_duplicates(lines):
"""
Remove duplicated subtitles. Duplacates are always adjacent.
"""
last_timestamp = ''
last_cap = ''
for line in lines:
if line == "":
continue
if re.match('^\d{2}:\d{2}$', line):
if line != last_timestamp:
yield line
last_timestamp = line
else:
if line != last_cap:
yield line
last_cap = line
def merge_short_lines(lines):
buffer = ''
for line in lines:
if line == "" or re.match('^\d{2}:\d{2}$', line):
yield '\n' + line
continue
if len(line+buffer) < 80:
buffer += ' ' + line
else:
yield buffer.strip()
buffer = line
yield buffer
def main():
vtt_file_name = sys.argv[1]
txt_name = re.sub(r'.vtt$', '.txt', vtt_file_name)
with open(vtt_file_name) as f:
text = f.read()
text = remove_tags(text)
lines = text.splitlines()
lines = remove_header(lines)
lines = merge_duplicates(lines)
lines = list(lines)
lines = merge_short_lines(lines)
lines = list(lines)
with open(txt_name, 'w') as f:
for line in lines:
f.write(line)
f.write("\n")
if __name__ == "__main__":
main()