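Introduction
Anyone who processes text programmatically will sooner or later run into input that mixes full-width and half-width characters, and will need a quick way to convert between the two forms. Because the two forms map onto each other in a regular way, the conversion is not complicated.
The rules are:
Full-width characters occupy Unicode code points 65281 to 65374 (hexadecimal 0xFF01 to 0xFF5E).
Half-width characters occupy Unicode code points 33 to 126 (hexadecimal 0x21 to 0x7E).
The space is a special case: the full-width space is 12288 (0x3000) and the half-width space is 32 (0x20).
Apart from the space, full-width and half-width characters appear in the same order in Unicode (half-width + 65248 = full-width, where 65248 is 0xFEE0).
So non-space characters can be converted simply by adding or subtracting that offset, and the space is handled separately.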
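The arithmetic behind these rules can be checked directly. The snippet below is a minimal sketch written for Python 3 (an assumption; the article's own listings target Python 2), and the constant names are made up for illustration.

```python
# Python 3 sketch; the constant names are illustrative, not from the article.
FULL_START, FULL_END = 0xFF01, 0xFF5E   # full-width '!' .. '~'
HALF_START, HALF_END = 0x21, 0x7E       # half-width '!' .. '~'
OFFSET = FULL_START - HALF_START        # distance between the two ranges

# Both ranges hold the same number of characters, so one offset fits all.
assert FULL_END - HALF_END == OFFSET
assert OFFSET == 65248 == 0xFEE0

# The space is the exception: its full-width form lies outside the range.
FULL_SPACE, HALF_SPACE = 0x3000, 0x20
assert FULL_SPACE - HALF_SPACE != OFFSET

print(hex(OFFSET), OFFSET)
```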
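Functions used
chr()
Takes an integer in range(256) (that is, 0 to 255) and returns the corresponding character. This is the Python 2 behavior; in Python 3, chr() accepts any Unicode code point.
unichr()
Works the same way, except that it returns a Unicode character. It exists only in Python 2.
ord()
The inverse of chr() and unichr(): it takes a single character (a string of length 1) and returns its ASCII value or Unicode code point.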
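As a quick illustration of how these functions pair up, here is a small sketch for Python 3, where chr() takes over the role of Python 2's unichr(). The variable names are only for illustration.

```python
# Python 3 sketch: chr() accepts any Unicode code point here,
# so it plays the role that unichr() plays in Python 2.
half = "A"
code = ord(half)              # code point of half-width 'A'
full = chr(code + 0xFEE0)     # full-width 'A', U+FF21

assert code == 65
assert ord(full) == 0xFF21
assert chr(ord(full) - 0xFEE0) == half   # and back again

print(code, full, ord(full))
```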
First, print the mapping (the listings in this article are written for Python 2):
    for i in xrange(33, 127):
        print i, chr(i), i + 65248, unichr(i + 65248)
Output:
33 ! 65281 !
34 " 65282 "
35 # 65283 #
36 $ 65284 $
37 % 65285 %
38 & 65286 &
39 ' 65287 '
40 ( 65288 (
41 ) 65289 )
42 * 65290 *
43 + 65291 +
44 , 65292 ,
45 - 65293 -
46 . 65294 .
47 / 65295 /
48 0 65296 0
49 1 65297 1
50 2 65298 2
51 3 65299 3
52 4 65300 4
53 5 65301 5
54 6 65302 6
55 7 65303 7
56 8 65304 8
57 9 65305 9
58 : 65306 :
59 ; 65307 ;
60 < 65308 <
61 = 65309 =
62 > 65310 >
63 ? 65311 ?
64 @ 65312 @
65 A 65313 A
66 B 65314 B
67 C 65315 C
68 D 65316 D
69 E 65317 E
70 F 65318 F
71 G 65319 G
72 H 65320 H
73 I 65321 I
74 J 65322 J
75 K 65323 K
76 L 65324 L
77 M 65325 M
78 N 65326 N
79 O 65327 O
80 P 65328 P
81 Q 65329 Q
82 R 65330 R
83 S 65331 S
84 T 65332 T
85 U 65333 U
86 V 65334 V
87 W 65335 W
88 X 65336 X
89 Y 65337 Y
90 Z 65338 Z
91 [ 65339 [
92 \ 65340 \
93 ] 65341 ]
94 ^ 65342 ^
95 _ 65343 _
96 ` 65344 `
97 a 65345 a
98 b 65346 b
99 c 65347 c
100 d 65348 d
101 e 65349 e
102 f 65350 f
103 g 65351 g
104 h 65352 h
105 i 65353 i
106 j 65354 j
107 k 65355 k
108 l 65356 l
109 m 65357 m
110 n 65358 n
111 o 65359 o
112 p 65360 p
113 q 65361 q
114 r 65362 r
115 s 65363 s
116 t 65364 t
117 u 65365 u
118 v 65366 v
119 w 65367 w
120 x 65368 x
121 y 65369 y
122 z 65370 z
123 { 65371 {
124 | 65372 |
125 } 65373 }
126 ~ 65374 ~
Converting full-width to half-width:
    def full2half(s):
        n = []
        s = s.decode('utf-8')  # Python 2: decode the UTF-8 byte string to unicode
        for char in s:
            num = ord(char)
            if num == 0x3000:  # full-width space
                num = 32
            elif 0xFF01 <= num <= 0xFF5E:  # other full-width characters
                num -= 0xfee0
            num = unichr(num)
            n.append(num)
        return ''.join(n)
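For readers on Python 3, the same full-width to half-width rule can be sketched with a translation table and str.translate. This is an alternative formulation, not the article's code; the names FULL_TO_HALF and full_to_half are hypothetical, and the input is assumed to be a str that has already been decoded.

```python
# Python 3 sketch; FULL_TO_HALF and full_to_half are illustrative names.
# Map every full-width code point to its half-width counterpart.
FULL_TO_HALF = {code: code - 0xFEE0 for code in range(0xFF01, 0xFF5F)}
FULL_TO_HALF[0x3000] = 0x20  # the space does not follow the offset rule


def full_to_half(text: str) -> str:
    """Return text with full-width ASCII variants and spaces made half-width."""
    return text.translate(FULL_TO_HALF)


print(full_to_half("\uff21\uff22\uff23\u3000\uff11\uff12\uff13"))
```

Characters outside the table, such as Chinese text, pass through unchanged because str.translate leaves unmapped code points alone.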
Converting half-width to full-width:
    def half2full(s):
        n = []
        s = s.decode('utf-8')  # Python 2: decode the UTF-8 byte string to unicode
        for char in s:
            num = ord(char)
            if num == 0x20:  # half-width space
                num = 0x3000
            elif 0x21 <= num <= 0x7E:  # other printable ASCII characters
                num += 0xfee0
            num = unichr(num)
            n.append(num)
        return ''.join(n)
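The reverse direction can be sketched the same way in Python 3. Again this is only an illustration under the assumption that the input is a decoded str; HALF_TO_FULL and half_to_full are hypothetical names. The final lines undo the offset by hand to show that the mapping is reversible.

```python
# Python 3 sketch; HALF_TO_FULL and half_to_full are illustrative names.
HALF_TO_FULL = {code: code + 0xFEE0 for code in range(0x21, 0x7F)}
HALF_TO_FULL[0x20] = 0x3000  # half-width space -> full-width space


def half_to_full(text: str) -> str:
    """Return text with printable ASCII and spaces made full-width."""
    return text.translate(HALF_TO_FULL)


converted = half_to_full("abc 123!")
print(converted)

# Undoing the offset by hand recovers the input.
restored = "".join(
    " " if ch == "\u3000" else chr(ord(ch) - 0xFEE0) for ch in converted
)
assert restored == "abc 123!"
```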
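The implementations above are very simple, but in practice you may not want to convert every character uniformly. In a Chinese article, for example, we might want all letters and digits converted to half-width while common punctuation marks stay full-width; the functions above are not suited to that.
The solution is a custom mapping table.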
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    # FH_* tables map full-width to half-width: each pair is (full, half).
    FH_SPACE = FHS = ((u" ", u" "),)
    FH_NUM = FHN = (
        (u"0", u"0"), (u"1", u"1"), (u"2", u"2"), (u"3", u"3"), (u"4", u"4"),
        (u"5", u"5"), (u"6", u"6"), (u"7", u"7"), (u"8", u"8"), (u"9", u"9"),
    )
    FH_ALPHA = FHA = (
        (u"a", u"a"), (u"b", u"b"), (u"c", u"c"), (u"d", u"d"), (u"e", u"e"),
        (u"f", u"f"), (u"g", u"g"), (u"h", u"h"), (u"i", u"i"), (u"j", u"j"),
        (u"k", u"k"), (u"l", u"l"), (u"m", u"m"), (u"n", u"n"), (u"o", u"o"),
        (u"p", u"p"), (u"q", u"q"), (u"r", u"r"), (u"s", u"s"), (u"t", u"t"),
        (u"u", u"u"), (u"v", u"v"), (u"w", u"w"), (u"x", u"x"), (u"y", u"y"), (u"z", u"z"),
        (u"A", u"A"), (u"B", u"B"), (u"C", u"C"), (u"D", u"D"), (u"E", u"E"),
        (u"F", u"F"), (u"G", u"G"), (u"H", u"H"), (u"I", u"I"), (u"J", u"J"),
        (u"K", u"K"), (u"L", u"L"), (u"M", u"M"), (u"N", u"N"), (u"O", u"O"),
        (u"P", u"P"), (u"Q", u"Q"), (u"R", u"R"), (u"S", u"S"), (u"T", u"T"),
        (u"U", u"U"), (u"V", u"V"), (u"W", u"W"), (u"X", u"X"), (u"Y", u"Y"), (u"Z", u"Z"),
    )
    FH_PUNCTUATION = FHP = (
        (u".", u"."), (u",", u","), (u"!", u"!"), (u"?", u"?"), (u"”", u'"'),
        (u"'", u"'"), (u"‘", u"`"), (u"@", u"@"), (u"_", u"_"), (u":", u":"),
        (u";", u";"), (u"#", u"#"), (u"$", u"$"), (u"%", u"%"), (u"&", u"&"),
        (u"(", u"("), (u")", u")"), (u"‐", u"-"), (u"=", u"="), (u"*", u"*"),
        (u"+", u"+"), (u"-", u"-"), (u"/", u"/"), (u"<", u"<"), (u">", u">"),
        (u"[", u"["), (u"¥", u"\\"), (u"]", u"]"), (u"^", u"^"), (u"{", u"{"),
        (u"|", u"|"), (u"}", u"}"), (u"~", u"~"),
    )
    FH_ASCII = HAC = lambda: ((fr, to) for m in (FH_ALPHA, FH_NUM, FH_PUNCTUATION) for fr, to in m)

    # HF_* tables map half-width to full-width by swapping each FH_* pair.
    HF_SPACE = HFS = ((u" ", u" "),)
    HF_NUM = HFN = lambda: ((h, z) for z, h in FH_NUM)
    HF_ALPHA = HFA = lambda: ((h, z) for z, h in FH_ALPHA)
    HF_PUNCTUATION = HFP = lambda: ((h, z) for z, h in FH_PUNCTUATION)
    HF_ASCII = ZAC = lambda: ((h, z) for z, h in FH_ASCII())


    def convert(text, *maps, **ops):
        """ Full-width/half-width conversion
        args:
            text: unicode string to convert
            maps: conversion maps, applied in order
            skip: characters to leave untouched, as a tuple or string
        return: converted unicode string
        """
        if "skip" in ops:
            skip = ops["skip"]
            if isinstance(skip, basestring):
                skip = tuple(skip)

            def replace(text, fr, to):
                return text if fr in skip else text.replace(fr, to)
        else:
            def replace(text, fr, to):
                return text.replace(fr, to)

        for m in maps:
            if callable(m):  # lazily built map (one of the lambdas above)
                m = m()
            elif isinstance(m, dict):
                m = m.items()
            for fr, to in m:
                text = replace(text, fr, to)
        return text


    if __name__ == '__main__':
        text = u"成田空港—【JR特急成田エクスプレス号・横浜行,2站】—東京—【JR新幹線はやぶさ号・新青森行,6站 】—新青森—【JR特急スーパー白鳥号・函館行,4站 】—函館"
        # skip must be a unicode string so its characters compare equal to the map keys
        print convert(text, FH_ASCII, {u"【": u"[", u"】": u"]", u",": u",", u".": u"。", u"?": u"?", u"!": u"!"}, skip=u",。?!“”")
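The scenario described earlier (letters and digits become half-width, common punctuation becomes full-width) can also be sketched compactly in Python 3 with one merged translation table. This is a simplified illustration and not a port of the listing above: the names ALNUM_TO_HALF, PUNCT_TO_FULL, TABLE and normalize are hypothetical, and the punctuation set is an arbitrary choice for the example.

```python
import string

OFFSET = 0xFEE0

# Letters and digits: full-width -> half-width.
ALNUM_TO_HALF = {
    ord(c) + OFFSET: ord(c) for c in string.ascii_letters + string.digits
}
# A hand-picked set of punctuation (an assumption): half-width -> full-width.
PUNCT_TO_FULL = {ord(c): ord(c) + OFFSET for c in ",!?:;()"}

# The two key sets are disjoint, so one merged table is unambiguous.
TABLE = {**ALNUM_TO_HALF, **PUNCT_TO_FULL}


def normalize(text: str) -> str:
    """Half-width letters and digits, full-width punctuation."""
    return text.translate(TABLE)


print(normalize("\uff30\uff59\uff14 \u5f88\u7b80\u5355,\u5bf9\u5427?"))
```

The period is deliberately left out of the punctuation set: the Chinese full stop is a different character, and converting every "." would also change decimal numbers.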
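Note: the English system does not distinguish between opening and closing quotation marks, so a half-width quote cannot be mapped back to a specific full-width opening or closing quote without extra context.
Summary
That covers converting between full-width and half-width characters in Python. I hope it is useful for your study or work; if you have questions, feel free to leave a comment.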
Original article: http://www.biaodianfu.com/python-convert-between-unicode-fullwidth-halfwidth-characters.html