We are currently migrating one of our Oracle databases to UTF-8 and have found a few records that are near the 4000-byte VARCHAR limit. When we try to migrate these records, they fail because they contain characters that become multibyte UTF-8 characters. What I want to do within PL/SQL is locate these characters to see what they are, and then either change them or remove them.
I would like to do:
SELECT REGEXP_REPLACE(COLUMN, '[^[:ascii:]]', '')
but Oracle does not implement the [:ascii:] character class.
Is there a simple way to do what I want?
16 solutions
#1
6
In a single-byte ASCII-compatible encoding (e.g. Latin-1), ASCII characters are simply bytes in the range 0 to 127. So you can use something like [\x80-\xFF] to detect non-ASCII characters.
#2
22
If you use the ASCIISTR function to convert the Unicode to literals of the form \nnnn, you can then use REGEXP_REPLACE to strip those literals out, like so...
UPDATE table SET field = REGEXP_REPLACE(ASCIISTR(field), '\\[[:xdigit:]]{4}', '')
...where field and table are your field and table names respectively.
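Before running the update, it may help to preview which rows would change and what they would become. The following is only a sketch: my_table and my_col are hypothetical names standing in for your own. One caveat worth checking on your system: as far as I know, ASCIISTR also escapes a literal backslash as \005C, so backslashes in the data would be stripped along with the non-ASCII characters.

```sql
-- Preview only: nothing is modified.
-- Rows are selected when escaping changes the value, i.e. when the value
-- contains at least one character that ASCIISTR rewrites as \XXXX.
SELECT my_col AS original_value,
       REGEXP_REPLACE(ASCIISTR(my_col), '\\[[:xdigit:]]{4}', '') AS stripped_value
FROM   my_table
WHERE  my_col <> ASCIISTR(my_col);
```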
#3
14
I think this will do the trick:
SELECT REGEXP_REPLACE(COLUMN, '[^[:print:]]', '')
#4
9
I wouldn't recommend it for production code, but it makes sense and seems to work:
SELECT REGEXP_REPLACE(COLUMN,'[^' || CHR(1) || '-' || CHR(127) || '],'')
#5
3
There's probably a more direct way using regular expressions. With luck, somebody else will provide it. But here's what I'd do without needing to go to the manuals.
Create a PL/SQL function to receive your input string and return a varchar2.
In the PL/SQL function, call asciistr() on your input. PL/SQL is used because the result may be longer than 4000 bytes, and a varchar2 can hold up to 32K in PL/SQL.
That function converts the non-ASCII characters to \xxxx notation. So you can use regular expressions to find and remove those. Then return the result.
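The steps above could be sketched roughly as follows. The function name strip_non_ascii and the parameter and variable names are assumptions made for illustration, not something given in this answer.

```sql
-- Hypothetical helper: the name and signature are illustrative only.
CREATE OR REPLACE FUNCTION strip_non_ascii (p_input IN VARCHAR2)
  RETURN VARCHAR2
IS
  -- ASCIISTR expands each non-ASCII character to a \XXXX escape,
  -- so the intermediate value can exceed the 4000-byte SQL limit.
  -- 32767 is the PL/SQL maximum for VARCHAR2.
  l_escaped VARCHAR2(32767);
BEGIN
  l_escaped := ASCIISTR(p_input);
  -- Remove every \XXXX escape sequence; plain ASCII text is left as-is.
  RETURN REGEXP_REPLACE(l_escaped, '\\[[:xdigit:]]{4}', '');
END strip_non_ascii;
/
```

It could then be called from SQL, for example as SELECT strip_non_ascii(my_col) FROM my_table, where my_table and my_col are again placeholder names.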
#6
3
The following also works:
select dump(a, 1016), a
from (
  SELECT REGEXP_REPLACE(
           CONVERT('3735844533120%$03 ', 'US7ASCII', 'WE8ISO8859P1'),
           '[^!@/\.,;:<>#$%&()_=[:alnum:][:blank:]]') a
  FROM DUAL);
#7
3
The select may look like the following sample:
select nvalue from table
where length(asciistr(nvalue))!=length(nvalue)
order by nvalue;
#8
2
I had a similar issue and blogged about it here. I started with the regular expression for alphanumerics, then added the few basic punctuation characters I liked:
select dump(a,1016), a, b
from
(select regexp_replace(COLUMN,'[[:alnum:]/''%()> -.:=;[]','') a,
COLUMN b
from TABLE)
where a is not null
order by a;
I used dump with the 1016 variant to output the hex values of the characters I wanted to replace, which I could then use in a utl_raw.cast_to_varchar2.
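To see exactly which bytes are involved, something like the following might help. It is only a sketch: my_table and my_col are hypothetical names, and the ASCIISTR comparison is borrowed from other answers in this thread to narrow the output to rows that contain non-ASCII characters. Format 1016 asks DUMP for hexadecimal byte values (16) together with the character set name (1000).

```sql
-- List each affected value next to its byte-level representation.
SELECT my_col,
       DUMP(my_col, 1016) AS byte_dump
FROM   my_table
WHERE  my_col <> ASCIISTR(my_col);
```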
#9
2
I found the answer here:
http://www.squaredba.com/remove-non-ascii-characters-from-a-column-255.html
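CREATE OR REPLACE FUNCTION O1DW.RECTIFY_NON_ASCII(INPUT_STR IN VARCHAR2)
RETURN VARCHAR2
IS
  str        VARCHAR2(32767);
  cnt        number := 0;
  askey      number := 0;
  OUTPUT_STR VARCHAR2(32767);
begin
  -- Wrap the input in '^' sentinels so the working string is never empty.
  str := '^' || TO_CHAR(INPUT_STR) || '^';
  cnt := length(str);
  for i in 1 .. cnt loop
    askey := ascii(substr(str, i, 1));
    -- Remove control characters (< 32), DEL (127) and anything above ASCII.
    if askey < 32 or askey >= 127 then
      -- The prepended '^' keeps later positions aligned with the loop index.
      str := '^' || REPLACE(str, CHR(askey), '');
    end if;
  end loop;
  -- Strip the sentinels (this also strips leading/trailing '^' and spaces in the data).
  OUTPUT_STR := trim(ltrim(rtrim(trim(str), '^'), '^'));
  RETURN (OUTPUT_STR);
end;
/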
Then run this to update your data:
update o1dw.rate_ipselect_p_20110505
set NCANI = RECTIFY_NON_ASCII(NCANI);
#10
2
Try the following:
-- To detect
select 1 from dual
where regexp_like(trim('xx test text æ¸¬è© ¦ “xmx” number²'),'['||chr(128)||'-'||chr(255)||']','in')
-- To strip out
select regexp_replace(trim('xx test text æ¸¬è© ¦ “xmxmx” number²'),'['||chr(128)||'-'||chr(255)||']','',1,0,'in')
from dual
#11
0
The answer given by Francisco Hayoz is the best. Don't use PL/SQL functions if SQL can do it for you.
Here is a simple test in Oracle 11.2.03:
select s
, regexp_replace(s,'[^'||chr(1)||'-'||chr(127)||']','') "rep ^1-127"
, dump(regexp_replace(s,'['||chr(127)||'-'||chr(225)||']','')) "rep 127-255"
from (
select listagg(c, '') within group (order by c) s
from (select 127+level l,chr(127+level) c from dual connect by level < 129))
And "rep 127-255" is:
Typ=1 Len=30: 226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255
i.e. the characters chr(226) and above are not replaced. Note that the range in the query above ends at chr(225) rather than chr(255), which would account for this; using '['||chr(127)||'-'||chr(255)||']' should give the desired result. If you need to replace other characters, just add them to the regex above, or use nested replace/regexp_replace calls if the replacement is different from '' (the null string).
#12
0
Thanks, this worked for my purposes. BTW there is a missing single-quote in the example, above.
REGEXP_REPLACE (COLUMN, '[^' || CHR (32) || '-' || CHR (127) || ']', ' ')
I used it in a word-wrap function. Occasionally there was an embedded newline (NL / CHR(10) / 0A) in the incoming text that was messing things up.
#13
0
Please note that whenever you use
regexp_like(column, '[A-Z]')
Oracle's regexp engine will match certain characters from the Latin-1 range as well: this applies to all characters that look similar to ASCII characters like Ä->A, Ö->O, Ü->U, etc., so that [A-Z] is not what you know from other environments like, say, Perl.
Instead of fiddling with regular expressions, try changing to the NVARCHAR2 datatype prior to the character set upgrade.
Another approach: instead of cutting away part of the fields' contents you might try the SOUNDEX function, provided your database contains European (i.e. Latin-1) characters only. Or you could just write a function that translates characters from the Latin-1 range into similar-looking ASCII characters, like
- å => a
- ä => a
- ö => o
of course only for text blocks exceeding 4000 bytes when transformed to UTF-8.
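The character mapping described above could be sketched with Oracle's built-in TRANSLATE function. The table and column names (your_table, your_col) and the particular set of mapped characters are assumptions for illustration; extend the two strings in parallel to cover the characters that actually occur in your data.

```sql
-- Each character in the second argument is replaced by the character
-- at the same position in the third argument.
SELECT your_col,
       TRANSLATE(your_col, 'åäöÅÄÖ', 'aaoAAO') AS ascii_col
FROM   your_table;
```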
#14
0
You can try something like the following to search for rows where the column contains non-ASCII characters:
select * from your_table where your_col <> asciistr(your_col);
#15
-2
Do this, it will work.
trim(replace(ntwk_slctor_key_txt, chr(0), ''))
#16
-3
I'm a bit late in answering this question, but had the same problem recently (people cut and paste all sorts of stuff into a string and we don't always know what it is). The following is a simple character whitelist approach:
SELECT est.clients_ref
,TRANSLATE (
est.clients_ref
, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
|| REPLACE (
TRANSLATE (
est.clients_ref
,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
,'~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
)
,'~'
)
,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890#$%^&*()_+-={}|[]:";<>?,./'
)
clean_ref
FROM edms_staging_table est
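To see how the nested calls fit together, here is a reduced sketch with a two-character whitelist ('ab') and a literal input; both are made up for illustration. The inner TRANSLATE turns every whitelisted character into '~', REPLACE removes the '~' characters so that only the unwanted characters remain, and the outer TRANSLATE then deletes those because they have no counterpart in its third argument.

```sql
-- Inner TRANSLATE: 'a€b' becomes '~€~'
-- REPLACE:         '~€~' becomes '€'   (the list of unwanted characters)
-- Outer TRANSLATE: maps a to a and b to b, and drops '€'
SELECT TRANSLATE(
         'a€b',
         'ab' || REPLACE(TRANSLATE('a€b', 'ab', '~~'), '~'),
         'ab') AS clean_value
FROM   dual;
```

For this input the expression should return 'ab'. One thing to keep in mind with this approach: a '~' that is already present in the data is removed from the list of unwanted characters by the REPLACE step, so it survives in the output even though it is not on the whitelist.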