Parsing and converting TED Talks JSON subtitles

Date: 2022-01-14 09:00:17

This question is related to this other question @ SuperUser.

I want to download TED Talks and their respective subtitles for offline viewing. For instance, let's take this short talk by Richard St. John; the high-resolution video download URL is the following:

http://www.ted.com/talks/download/video/5118/talk/70

And the respective JSON-encoded English subtitles can be downloaded at:

http://www.ted.com/talks/subtitles/id/70/lang/eng

Here is an excerpt from the beginning of the actual subtitle file:

{"captions":[{"content":"This is really a two hour presentation I give to high school students,","startTime":0,"duration":3000,"startOfParagraph":false},{"content":"cut down to three minutes.","startTime":3000,"duration":1000,"startOfParagraph":false},{"content":"And it all started one day on a plane, on my way to TED,","startTime":4000,"duration":3000,"startOfParagraph":false},{"content":"seven years ago."

And from the end of the subtitle:

{"content":"Or failing that, do the eight things -- and trust me,","startTime":177000,"duration":3000,"startOfParagraph":false},{"content":"these are the big eight things that lead to success.","startTime":180000,"duration":4000,"startOfParagraph":false},{"content":"Thank you TED-sters for all your interviews!","startTime":184000,"duration":2000,"startOfParagraph":false}]}

I want to write an app that automatically downloads the high-resolution version of the video and all the available subtitles. I'm having a really hard time, though: I have to convert the subtitles to a format that VLC (or any other decent video player) understands (.srt or .sub are my first choices), and I have no idea what the startTime and duration keys of the JSON file represent.

What I know so far is this:

  • The downloaded video lasts for 3 minutes and 30 seconds (210 seconds) at 29 FPS, which gives 210 × 29 = 6090 frames.
  • The first caption has a startTime of 0 and a duration of 3000, so it ends at 0 + 3000 = 3000.
  • The last caption has a startTime of 184000 and a duration of 2000, so it ends at 184000 + 2000 = 186000.

It may also be worth noting the following JavaScript snippet from the talk page:

introDuration:16500,
adDuration:4000,
postAdDuration:2000,

So my question is, what logic should I apply to convert startTime and duration values to a .srt compatible format:

1
00:01:30,200 --> 00:01:32,201
MEGA DENG COOPER MINE, INDIA

2
00:01:37,764 --> 00:01:39,039
Watch out, watch out!

Or to a .sub compatible format:

{FRAME_FROM}{FRAME_TO}This is really a two hour presentation I give to high school students,
{FRAME_FROM}{FRAME_TO}cut down to three minutes.

Can anyone help me out with this?


Ninh Bui nailed it; the formula is the following:

introDuration - adDuration + startTime ... introDuration - adDuration + startTime + duration

This approach allows me to convert directly to .srt format (no need to know the length and FPS) in two ways:

00:00:12,500 --> 00:00:15,500
This is really a two hour presentation I give to high school students,

00:00:15,500 --> 00:00:16,500
cut down to three minutes.

And:

00:00:00,16500 --> 00:00:00,19500
And it all started one day on a plane, on my way to TED,

00:00:00,19500 --> 00:00:00,20500
seven years ago.
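
The formula above can be sketched in Python roughly as follows. This is only an illustration: it assumes the introDuration and adDuration values from the JavaScript snippet in the question (16500 and 4000) apply to the talk being converted, and the names to_srt_time and srt_cue are made up for this example.

```python
# Offsets taken from the JavaScript snippet in the question (milliseconds).
# Assumption: other talks may use different values.
INTRO_DURATION = 16500
AD_DURATION = 4000


def to_srt_time(ms):
    """Format a millisecond count as an SRT timestamp, HH:MM:SS,mmm."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return '%02d:%02d:%02d,%03d' % (hours, minutes, seconds, millis)


def srt_cue(index, caption, offset=INTRO_DURATION - AD_DURATION):
    """Build one SRT cue from a caption dict, shifted by the offset."""
    start = offset + caption['startTime']
    end = start + caption['duration']
    return '%d\n%s --> %s\n%s\n' % (
        index, to_srt_time(start), to_srt_time(end), caption['content'])


if __name__ == '__main__':
    first = {'content': 'This is really a two hour presentation '
                        'I give to high school students,',
             'startTime': 0, 'duration': 3000, 'startOfParagraph': False}
    print(srt_cue(1, first))
```

For the first caption this yields the 00:00:12,500 --> 00:00:15,500 range shown in the first example above.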

5 Answers

#1


4  

My guess would be that the times in the JSON are expressed in milliseconds, e.g. 1000 = 1 second. There is probably a main timer, where startTime indicates the time on the timeline at which the subtitle should appear and duration is probably the amount of time the subtitle should remain on screen. This theory is further supported by the arithmetic: 186000 / 1000 = 186 seconds, and 186 / 60 = 3.1 minutes = 3 minutes and 6 seconds. The remaining seconds are probably applause ;-) With this information you should also be able to calculate which frames each subtitle spans: you already know the frames per second, so all you need to do is multiply startTime in seconds by the FPS to get the begin frame. The end frame can be obtained the same way, with the times in seconds: (startTime + duration) * fps :-)
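
As a rough illustration of the frame arithmetic described above, the following sketch converts millisecond values to frame numbers and emits a line in the {FRAME_FROM}{FRAME_TO}text layout from the question. The 29 FPS figure is the one quoted in the question, and the function names are hypothetical.

```python
FPS = 29  # frame rate quoted in the question; adjust for the actual video


def ms_to_frame(ms, fps=FPS):
    """Convert a time in milliseconds to a frame number (rounded down)."""
    return ms * fps // 1000


def sub_line(caption, fps=FPS):
    """Format one caption dict as a {start}{end}text line."""
    start = ms_to_frame(caption['startTime'], fps)
    end = ms_to_frame(caption['startTime'] + caption['duration'], fps)
    return '{%d}{%d}%s' % (start, end, caption['content'])


if __name__ == '__main__':
    caption = {'content': 'cut down to three minutes.',
               'startTime': 3000, 'duration': 1000}
    print(sub_line(caption))  # {87}{116}cut down to three minutes.
```

Note that this applies no offset for any intro; if one is needed, it would have to be added to the times before converting them to frames.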

#2


3  

I made a simple console-based program to download the subtitles. I was thinking of making it available on the web using a userscript system like Greasemonkey... Here is the link to my blog post with the code: http://estebanordano.com.ar/ted-talks-download-subtitles/

#3


1  

I found another site which used this format. I quickly hacked a function to convert them into srt, should be self-explanatory:

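import json
import urllib.request  # Python 3 replacement for urllib2

def json2srt(url, fname):
    # Fetch the JSON document and keep only the list of captions.
    with urllib.request.urlopen(url) as response:
        data = json.load(response)['captions']

    def conv(t):
        # Convert milliseconds to the SRT timestamp form HH:MM:SS,mmm.
        return '%02d:%02d:%02d,%03d' % (
            t // 1000 // 60 // 60,
            t // 1000 // 60 % 60,
            t // 1000 % 60,
            t % 1000)

    with open(fname, 'w', encoding='utf8') as fhandle:
        # SRT cue numbers conventionally start at 1.
        for i, item in enumerate(data, 1):
            fhandle.write('%d\n%s --> %s\n%s\n\n' %
                (i,
                 conv(item['startTime']),
                 # End 1 ms early so consecutive cues do not overlap.
                 conv(item['startTime'] + item['duration'] - 1),
                 item['content']))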

#4


0  

TEDGrabber beta2 : my program : http://sourceforge.net/projects/tedgrabber/

#5


0  

I've written a Python script that downloads any TED video and creates an MKV file with all the subtitles and metadata embedded in it ( https://github.com/oxplot/ted2mkv ).

I used the variable pad_seconds from the JavaScript code of the TED talk page as an offset to be added to all the timestamps in the JSON subtitle files. I assume it is what the Flash player uses.
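
A minimal sketch of that idea, assuming pad_seconds is given in seconds while the JSON timestamps are in milliseconds (neither unit is confirmed here). The helper name shift_captions is hypothetical, and this is not the code from ted2mkv.

```python
def shift_captions(captions, pad_seconds):
    """Return copies of the caption dicts with startTime shifted.

    Assumes pad_seconds is in seconds and startTime is in milliseconds.
    """
    offset_ms = int(round(pad_seconds * 1000))
    return [dict(c, startTime=c['startTime'] + offset_ms) for c in captions]


if __name__ == '__main__':
    captions = [{'content': 'cut down to three minutes.',
                 'startTime': 3000, 'duration': 1000}]
    print(shift_captions(captions, 12.5))
```

The durations are left untouched, so each caption stays on screen for the same length of time and only its position on the timeline moves.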
