I'm trying to replace certain strings of text and then remove all RTF tags from the same text string.
So the initial value is:
<test>
<data>{\rtf1\ansi\ansicpg1252\uc1\htmautsp\deff2{\fonttbl{\f0\fcharset0 Times New Roman;}{\f2\fcharset0 Segoe UI;}{\f3\fcharset0 arial;}}{\colortbl\red0\green0\blue0;\red255\green255\blue255;}\loch\hich\dbch\pard\plain\ltrpar\itap0{\lang1033\fs16\f3\cf0 \cf0\ql{\ql{{\ltrch Ingredients: roast British chicken breast \'b7 chicken stock mayo and smoked \'b7 prawns with mayo on malted brown bread \'b7 smoked British ham with mustard mayo on oatmeal bread \'b7 .}\li0\ri0\sa0\sb0\fi0\ql\par}
{{\ltrch }{\ltrch }{\ltrch }\li0\ri0\sa0\sb0\fi0\ql\par}
{{\ltrch roast British chicken breast \'b7 chicken stock mayo and smoked : Chicken Breast (25.89%) \'b7 }{\ltrch {\b Wheatflour}}{\ltrch contains }{\ltrch {\b Gluten}}{\ltrch (with Wheatflour \'b7 Calcium Carbonate \'b7 Iron \'b7 Niacin \'b7 Thiamin) \'b7 Water \'b7 Pork (10.32%) \'b7 Malted }{\ltrch {\b Wheatflakes}}{\ltrch (contain }{\ltrch {\b Gluten}}{\ltrch ) \'b7 Rapeseed Oil \'b7 }{\ltrch {\b Wheat}}\li0\ri0\sa0\sb0\fi0\ql\par}
{{\ltArch }{\ltrch }{\ltrch }\li0\ri0\sa0\sb0\fi0\ql\par}
}
}
</test>
So what needs to be done:
- Values like {\b Wheat} should become <bold>Wheat</bold>, where "Wheat" can be any text.
- \'b7 should become a comma (',').
The result would be:
<test>
<data>Ingredients: roast British chicken breast , chicken stock mayo and smoked , prawns with mayo on malted brown bread , smoked British ham with mustard mayo on oatmeal bread , .
roast British chicken breast , chicken stock mayo and smoked : Chicken Breast (25.89%) , <bold> Wheatflour</bold> contains <bold>Gluten</bold>(with Wheatflour , Calcium Carbonate , Iron , Niacin , Thiamin) , Water , Pork (10.32%) , Malted <bold> Wheatflakes</bold>contain <bold> Gluten</bold>, Rapeseed Oil , <bold> Wheat</bold>
</data>
</test>
Can this be done? If so, how?
1 solution
#1
This isn't terribly difficult if you can use XSLT 2.0 or newer, which includes regular expression support. The key for you is the replace() function.
Here's a snippet of code that begins to clean up your RTF mess:
<xsl:template match="data">
  <xsl:copy>
    <!-- Note: XSL variables are _immutable_: once created, their values
         cannot be changed. I use a chain of variables here simply for
         purposes of illustration, as a means of showing each regex
         replacement operation on its own. These could all be stacked
         into a single statement, but that is somewhat harder for
         humans to read. :) -->
    <!-- The bold markup is written with &lt; and &gt; because a literal
         less-than sign is not allowed inside an attribute value;
         disable-output-escaping below turns it back into real tags. -->
    <xsl:variable name="bolded" select="replace(., '\{\\b (.*?)\}', '&lt;bold&gt;$1&lt;/bold&gt;')"/>
    <xsl:variable name="commas" select="replace($bolded, '\\''b7', ',')"/>
    <xsl:variable name="unfonted" select="replace($commas, '\{\\fonttbl\{.*?\}\}', '')"/>
    <xsl:variable name="uncolored" select="replace($unfonted, '\{\\colortbl\\.*?\}', '')"/>
    <xsl:variable name="no-ltrch" select="replace($uncolored, '\{\\ltrch (.*?)\}', '$1')"/>
    <xsl:value-of select="$no-ltrch" disable-output-escaping="yes"/>
  </xsl:copy>
</xsl:template>
This currently outputs (after adding the closing </data> tag that's missing in your sample input XML):
<?xml version="1.0" encoding="UTF-8"?><test>
<data>{\rtf1\ansi\ansicpg1252\uc1\htmautsp\deff2\loch\hich\dbch\pard\plain\ltrpar\itap0{\lang1033\fs16\f3\cf0 \cf0\ql{\ql{Ingredients: roast British chicken breast , chicken stock mayo and smoked , prawns with mayo on malted brown bread , smoked British ham with mustard mayo on oatmeal bread , .\li0\ri0\sa0\sb0\fi0\ql\par}
{ \li0\ri0\sa0\sb0\fi0\ql\par}
{roast British chicken breast , chicken stock mayo and smoked : Chicken Breast (25.89%) , <bold>Wheatflour</bold> contains <bold>Gluten</bold> (with Wheatflour , Calcium Carbonate , Iron , Niacin , Thiamin) , Water , Pork (10.32%) , Malted <bold>Wheatflakes</bold> (contain <bold>Gluten</bold>) , Rapeseed Oil , <bold>Wheat</bold>\li0\ri0\sa0\sb0\fi0\ql\par}
{{\ltArch } \li0\ri0\sa0\sb0\fi0\ql\par}
}
}</data>
</test>
At this point, you just need to figure out the rest of the regular expressions needed to strip out the remainder of the RTF codes.
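As a rough sketch of what those remaining steps might look like, the stylesheet below extends the same variable chain with two further replacements. The no-controls and no-braces variables and their patterns are my own assumptions rather than a complete RTF parser: they assume the text contains no escaped braces or backslashes (\{, \}, \\) and no hex escapes other than \'b7. The identity template is only there so that the example is a complete stylesheet.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy anything that is not handled below. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="data">
    <xsl:copy>
      <!-- Same steps as in the snippet above. -->
      <xsl:variable name="bolded" select="replace(., '\{\\b (.*?)\}', '&lt;bold&gt;$1&lt;/bold&gt;')"/>
      <xsl:variable name="commas" select="replace($bolded, '\\''b7', ',')"/>
      <xsl:variable name="unfonted" select="replace($commas, '\{\\fonttbl\{.*?\}\}', '')"/>
      <xsl:variable name="uncolored" select="replace($unfonted, '\{\\colortbl\\.*?\}', '')"/>
      <xsl:variable name="no-ltrch" select="replace($uncolored, '\{\\ltrch (.*?)\}', '$1')"/>

      <!-- Assumed extra step: drop every remaining control word, i.e. a
           backslash, letters, an optional signed number and at most one
           trailing space (\rtf1, \li0, \par, ...). -->
      <xsl:variable name="no-controls" select="replace($no-ltrch, '\\[a-zA-Z]+-?[0-9]* ?', '')"/>

      <!-- Assumed extra step: drop the group braces that are left over. -->
      <xsl:variable name="no-braces" select="replace($no-controls, '[\{\}]', '')"/>

      <xsl:value-of select="$no-braces" disable-output-escaping="yes"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The order matters here: the bold and \'b7 replacements have to run before the generic control-word pattern, otherwise {\b ...} would lose its \b marker before it could be turned into <bold>. This needs an XSLT 2.0 processor such as Saxon, and it will likely leave some blank lines and stray spaces behind that you may want to tidy up in one more step.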