Converting text encodings on Linux: subtitle processing

Date: 2023-03-09 03:41:16

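Converting text encodings on Linux is a nuisance when processing subtitles.

Steps: first detect the encoding with Python, then convert it with iconv, then process the format with awk.

Can't file detect the encoding? It can, but it is sometimes wrong: it may report iso-8859-1 for a file that is actually GB2312.

1. Detecting the encoding with Python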

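$ cat t1.py
# -*- coding:utf8 -*-
import sys

# Encodings to try, in order. gb2312 was the right one for my files.
# iso-8859-1 accepts any byte sequence, so it must come last; a "successful"
# decode with it was wrong for these files.
ENCODINGS = ['utf-8', 'gb2312', 'gbk', 'gb18030', 'iso-8859-1']

# f1 = open(sys.argv[2], 'w')  # optional: write the decoded text to a second file
with open(sys.argv[1], 'rb') as f:
    for raw in f:
        # Decode line by line, because the encoding is not consistent within the file.
        for enc in ENCODINGS:
            try:
                line = raw.decode(enc)
            except UnicodeDecodeError:
                continue
            if enc != 'utf-8':
                print(enc)  # report which fallback encoding worked
            break
        else:
            continue  # no encoding worked; skip this line
        line = line.strip()  # strip leading/trailing spaces, tabs, CR and LF
        print(line)
        # f1.write(line)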

The order of encodings was also found by trial and error.
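The trial-decoding idea above can also be applied to a whole file at once instead of line by line. The following is a minimal sketch under that assumption; the function name `guess_encoding` and the `CANDIDATES` tuple are made up for illustration and are not part of t1.py. It only reports the first candidate that decodes without error, which is a heuristic and not proof that the guess is right.

```python
from typing import Optional, Sequence

# Candidate encodings, strictest first. iso-8859-1 is deliberately left out:
# it accepts every byte sequence, so it would never fail and never tell us anything.
CANDIDATES = ("utf-8", "gb2312", "gbk", "gb18030")


def guess_encoding(data: bytes, candidates: Sequence[str] = CANDIDATES) -> Optional[str]:
    """Return the first candidate that decodes all of `data`, or None."""
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue  # this candidate does not fit; try the next one
        return enc
    return None
```

To use it on a subtitle file, read the file in binary mode (`open(path, 'rb').read()`) and pass the bytes in. Pure ASCII input is reported as utf-8, since ASCII is valid in every candidate.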

To detect the encoding with file instead:   file -b --mime-encoding  text

2. Converting with iconv:        iconv -f "GB2312" -t "utf-8" Ep._20:Valar_Morghulis.ass >  Ep._20:Valar_Morghulis.txt

Reference:  http://kjetilvalle.com/posts/text-file-encodings.html
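For cases where the conversion has to happen inside a Python script, the iconv step can be approximated with the standard library. This is a rough sketch, not a replacement for iconv; the function name `convert_file` and its parameters are hypothetical.

```python
from pathlib import Path


def convert_file(src: str, dst: str, from_enc: str, to_enc: str = "utf-8") -> None:
    """Re-encode `src` from `from_enc` to `to_enc` and write the result to `dst`."""
    raw = Path(src).read_bytes()
    # A strict decode raises UnicodeDecodeError on bytes that do not fit
    # `from_enc`, so a wrong source encoding is noticed instead of silently
    # producing garbage.
    text = raw.decode(from_enc)
    Path(dst).write_bytes(text.encode(to_enc))
```

Reading and writing bytes, instead of opening the files in text mode, keeps line endings exactly as they were in the source file.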

Putting it together:

$ cat readme.sh
#!/bin/sh
# Convert every .ass subtitle file to UTF-8, then extract the dialogue text.
TO='utf-8'
for i in *.ass
do
    FROM=$(file -b --mime-encoding "$i")
    p=$(basename "$i" .ass)
    # file reports iso-8859-1 for these files, but they are really GB2312.
    if [ "$FROM" = "iso-8859-1" ]; then
        FROM="GB2312"
    fi
    iconv -f "$FROM" -t "$TO" "$i" > "${p}.txt"
    # Keep the timing prefix and the text of each Dialogue line in the 正文 style.
    awk -F',,' '/Dialogue.*正文/{split($0,arr,",正文,,");split($3,brr,"N");split($3,crr,"{");print "\n"arr[1]"\n" brr[1]"\n"crr[length(crr)-1]}' "${p}.txt" |sed -e 's/.*}//g' -e 's/\\$//g' > "${p}.norm"
done
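The awk one-liner is tied to the layout of these particular files. As a point of comparison, here is a hedged Python sketch of pulling the timing and text out of a single Dialogue line. It is not a translation of the awk command. It assumes the common ASS field order (Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text); the name `parse_dialogue` is invented for illustration.

```python
import re
from typing import List, Optional, Tuple

# Override tags such as {\fs14} are enclosed in braces.
TAG_RE = re.compile(r"\{[^}]*\}")


def parse_dialogue(line: str) -> Optional[Tuple[str, str, List[str]]]:
    """Return (start, end, text_lines) for a Dialogue line, else None."""
    if not line.startswith("Dialogue:"):
        return None
    # Only the first 9 commas separate fields; the text itself may contain commas.
    fields = line[len("Dialogue:"):].strip().split(",", 9)
    if len(fields) < 10:
        return None
    start, end, text = fields[1], fields[2], fields[9]
    # \N marks a line break inside ASS text; drop override tags from each part.
    parts = [TAG_RE.sub("", p).strip() for p in text.split("\\N")]
    return start, end, [p for p in parts if p]
```

To keep only one style, as the awk pattern does with 正文, the caller could additionally check the fourth field of the line before using the result.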