How do I replace all HTML-encoded accents in Perl?

Time: 2021-05-03 00:10:35

I have the following situation:

There is a tool that gets an XSLT from a web interface and embeds the XSLT in an XML file (someone should have been fired). "Unfortunately" I work in a French-speaking country, and therefore the XSLT has a number of words with accents. When the XSLT is embedded in the XML, the tool converts all the accents to their HTML codes (Iacute, igrave, etc.).

My Perl code retrieves the XSLT from the XML and executes it against another XML document using the Xalan command-line tool. Every time there is an accent in the XSLT, the Xalan tool throws an exception.

I initially thought of using a regexp to change all the accents in the XSLT, such as:

# the & is omitted in the codes because it will be rendered in the page
$xslt =~s/Aacute;/Á/gso;
$xslt =~s/aacute;/á/gso;
$xslt =~s/Agrave;/À/gso;
$xslt =~s/Acirc;/Â/gso;
$xslt =~s/agrave;/à/gso;

but doing so means that I have to write a regexp for each of the accent codes....

My question is: is there any way to do this without writing a regexp per code? (Thinking that this is the only solution makes me want to vomit.)

By the way the tool is TeamSite, and it sucks.....

Edited: I forgot to mention that I need a Perl-only solution; security does not let me install any kind of library that they have not checked for a week or so :(

4 solutions

#1


6  

You can try something like HTML::Entities. From the POD:

use HTML::Entities;
$a = "V&aring;re norske tegn b&oslash;r &#230res";
decode_entities($a);
#encode_entities($a, "\200-\377");  ## not needed for what you are doing

In response to your edit: HTML::Entities is not in the Perl core. It might still be installed on your system, because it is used by a lot of other libraries. You can check by running this command:

perl -MHTML::Entities -le 'print "If this prints, then it is installed"'
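
The same check can be made from inside a script, so that the code falls back to something else when the module is missing. The following is only a sketch: `module_available` is a hypothetical helper name, and the sample string is made up. Note also that `decode_entities` decodes every entity it knows, including `&amp;` and `&lt;`, which may not be what you want for text that must remain well-formed XML.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the named module can be loaded
# by this Perl installation, false otherwise.
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # HTML::Entities -> HTML/Entities.pm
    return eval { require $file; 1 } ? 1 : 0;
}

if (module_available('HTML::Entities')) {
    my $text = 'caf&eacute;';                  # made-up sample input
    HTML::Entities::decode_entities($text);    # decodes $text in place
    printf "decoded to %d characters\n", length $text;
}
else {
    print "HTML::Entities is not installed; use a hand-written table instead\n";
}
```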

#2


1  

For your purpose, HTML::Entities is by far the best solution, but if you cannot find an existing package that fits your needs, the following approach is more efficient than multiple s/// statements:

# Do this part once, at file scope (or in a BEGIN block),
# before the first s/// statement that uses it.
my %trans = (
  'Aacute;' => 'Á',
  'aacute;' => 'á',
  'Agrave;' => 'À',
  'Acirc;' => 'Â',
  'agrave;' => 'à',
); # remember that you can generate parts of this hash, for example with map

# Build one alternation of all (escaped) keys and compile it once.
my $alternation = join '|', map { quotemeta } keys %trans;
my $re = qr/$alternation/;

# Place this code in your functions or methods.
# It operates on $_; bind it with =~ to work on another variable such as $xslt.
s/($re)/$trans{$1}/g; # 'o' is almost useless here because $re has already been compiled

Edit: As Chas. Owens pointed out, the e regexp modifier is not needed.
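
As a variation on the hash approach that uses only core Perl, the sketch below generates the table instead of typing it out, and rewrites each named entity as a numeric character reference (for example `&#233;`) rather than a literal accented character. This is an assumption about what suits the asker, not something stated in the question: numeric references are valid in XML without a DTD, so an XML parser should accept them. The names `%mark`, `%codepoint`, and `named_to_numeric` are hypothetical, and the table covers only vowels with four accents plus the cedilla; extend it with whatever names the tool actually emits.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);   # core module since Perl 5.8

# Combining marks for the accent suffixes used in entity names.
my %mark = (grave => 0x300, acute => 0x301, circ => 0x302, uml => 0x308);

# Build entity name => code point for every vowel/accent pair
# (Agrave, eacute, ucirc, ...) by composing letter + combining mark.
my %codepoint;
for my $letter (qw(A E I O U a e i o u)) {
    for my $accent (keys %mark) {
        $codepoint{"$letter$accent"} = ord NFC($letter . chr $mark{$accent});
    }
}

# Names that do not follow the vowel/accent pattern are added by hand.
@codepoint{qw(Ccedil ccedil)} = (0xC7, 0xE7);

# Replace known named entities with numeric character references.
# Unknown names, including XML's own &amp; &lt; &gt;, are left untouched.
sub named_to_numeric {
    my ($text) = @_;
    $text =~ s{&([A-Za-z][A-Za-z0-9]*);}
              { exists $codepoint{$1} ? sprintf('&#%d;', $codepoint{$1}) : "&$1;" }ge;
    return $text;
}

print named_to_numeric('caf&eacute; &amp; cr&egrave;me'), "\n";
# prints: caf&#233; &amp; cr&#232;me
```

A single generic pattern with a hash lookup avoids both the per-code regexps and the large alternation, and leaving unknown names alone keeps the XML escapes intact.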

#3


0  

I don't suppose it's possible to make TeamSite leave it as utf-8/convert it to utf-8?

CGI.pm has an (undocumented) unescapeHTML function. However, since it IS undocumented (and I haven't looked through the source), I don't know if it just handles basic HTML entities (<, >, &) or more. However, I'd GUESS that it only does the basic entities.

#4


0  

Why should someone be fired for putting XSL, which is XML, into an XML file?
