I'm trying to replace certain strings of text and then remove all RTF tags from the same text string.
So the initial value is:
<data>{\rtf1\ansi\ansicpg1252\uc1\htmautsp\deff2{\fonttbl{\f0\fcharset0 Times New Roman;}{\f2\fcharset0 Segoe UI;}{\f3\fcharset0 arial;}}{\colortbl\red0\green0\blue0;\red255\green255\blue255;}\loch\hich\dbch\pard\plain\ltrpar\itap0{\lang1033\fs16\f3\cf0 \cf0\ql{\ql{{\ltrch Ingredients: roast British chicken breast \'b7 chicken stock mayo and smoked \'b7 prawns with mayo on malted brown bread \'b7 smoked British ham with mustard mayo on oatmeal bread \'b7 .}\li0\ri0\sa0\sb0\fi0\ql\par}
{{\ltrch }{\ltrch }{\ltrch }\li0\ri0\sa0\sb0\fi0\ql\par}
{{\ltrch roast British chicken breast \'b7 chicken stock mayo and smoked : Chicken Breast (25.89%) \'b7 }{\ltrch {\b Wheatflour}}{\ltrch contains }{\ltrch {\b Gluten}}{\ltrch (with Wheatflour \'b7 Calcium Carbonate \'b7 Iron \'b7 Niacin \'b7 Thiamin) \'b7 Water \'b7 Pork (10.32%) \'b7 Malted }{\ltrch {\b Wheatflakes}}{\ltrch (contain }{\ltrch {\b Gluten}}{\ltrch ) \'b7 Rapeseed Oil \'b7 }{\ltrch {\b Wheat}}\li0\ri0\sa0\sb0\fi0\ql\par}
{{\ltArch }{\ltrch }{\ltrch }\li0\ri0\sa0\sb0\fi0\ql\par}
So what needs to be done:
- Values like {\b Wheat} should become <bold>Wheat</bold>, where "Wheat" can be any text.
- \'b7 should become a comma (',').
The result would be:
<data>Ingredients: roast British chicken breast , chicken stock mayo and smoked , prawns with mayo on malted brown bread , smoked British ham with mustard mayo on oatmeal bread , .
roast British chicken breast , chicken stock mayo and smoked : Chicken Breast (25.89%) , <bold> Wheatflour</bold> contains <bold>Gluten</bold>(with Wheatflour , Calcium Carbonate , Iron , Niacin , Thiamin) , Water , Pork (10.32%) , Malted <bold> Wheatflakes</bold>contain <bold> Gluten</bold>, Rapeseed Oil , <bold> Wheat</bold>
Can this be done? If so, how?
1 Solution
This isn't terribly difficult if you can use XSLT 2.0 or newer, which includes regular expression functionality. The key for you is the replace() function.
Here's a snippet of code that begins to clean up your RTF mess:
<xsl:template match="data">
  <!-- Note: XSL variables are _immutable_: once created, their values
       cannot be changed. I use a chain of variables here simply for
       purposes of illustration, as a means of showing each regex
       replacement operation on its own. These could all be stacked
       into a single statement, but that is somewhat harder for
       humans to read. :) -->
  <!-- The angle brackets in the replacement string must be written as
       &lt; and &gt; so that the select attribute stays well-formed XML. -->
  <xsl:variable name="bolded" select="replace(., '\{\\b (.*?)\}', '&lt;bold&gt;$1&lt;/bold&gt;')"/>
  <!-- The doubled apostrophe is XPath's escape for a literal ' inside a string. -->
  <xsl:variable name="commas" select="replace($bolded, '\\''b7', ',')"/>
  <xsl:variable name="unfonted" select="replace($commas, '\{\\fonttbl\{.*?\}\}', '')"/>
  <xsl:variable name="uncolored" select="replace($unfonted, '\{\\colortbl\\.*?\}', '')"/>
  <xsl:variable name="no-ltrch" select="replace($uncolored, '\{\\ltrch (.*?)\}', '$1')"/>
  <!-- Output escaping is disabled so the inserted <bold> markers are written as tags. -->
  <xsl:value-of select="$no-ltrch" disable-output-escaping="yes"/>
</xsl:template>
This currently outputs (after adding the closing </data> tag that's missing in your sample input XML):
<?xml version="1.0" encoding="UTF-8"?><test>
<data>{\rtf1\ansi\ansicpg1252\uc1\htmautsp\deff2\loch\hich\dbch\pard\plain\ltrpar\itap0{\lang1033\fs16\f3\cf0 \cf0\ql{\ql{Ingredients: roast British chicken breast , chicken stock mayo and smoked , prawns with mayo on malted brown bread , smoked British ham with mustard mayo on oatmeal bread , .\li0\ri0\sa0\sb0\fi0\ql\par}
{ \li0\ri0\sa0\sb0\fi0\ql\par}
{roast British chicken breast , chicken stock mayo and smoked : Chicken Breast (25.89%) , <bold>Wheatflour</bold> contains <bold>Gluten</bold> (with Wheatflour , Calcium Carbonate , Iron , Niacin , Thiamin) , Water , Pork (10.32%) , Malted <bold>Wheatflakes</bold> (contain <bold>Gluten</bold>) , Rapeseed Oil , <bold>Wheat</bold>\li0\ri0\sa0\sb0\fi0\ql\par}
{{\ltArch } \li0\ri0\sa0\sb0\fi0\ql\par}
At this point, you just need to figure out the rest of the regular expressions needed to strip out the remainder of the RTF codes.
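As a rough sketch of how the remaining clean-up might look, the stylesheet below repeats the five replacements from the snippet and adds two more. The last two regular expressions and the variable names no-controls and no-braces are my own assumptions, not part of the original snippet, as are the identity template and the literal <data> wrapper. The sketch assumes that every leftover control word has the form backslash, letters, optional number, optional single space, and that the text itself contains no literal braces or other \'xx escapes.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything not handled by a more specific template. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="data">
    <!-- Steps taken from the snippet above. -->
    <xsl:variable name="bolded" select="replace(., '\{\\b (.*?)\}', '&lt;bold&gt;$1&lt;/bold&gt;')"/>
    <xsl:variable name="commas" select="replace($bolded, '\\''b7', ',')"/>
    <xsl:variable name="unfonted" select="replace($commas, '\{\\fonttbl\{.*?\}\}', '')"/>
    <xsl:variable name="uncolored" select="replace($unfonted, '\{\\colortbl\\.*?\}', '')"/>
    <xsl:variable name="no-ltrch" select="replace($uncolored, '\{\\ltrch (.*?)\}', '$1')"/>

    <!-- Assumed extra step 1: remove every remaining control word such as
         \pard, \fs16 or \li0, together with the single space that may
         follow it as a delimiter. -->
    <xsl:variable name="no-controls" select="replace($no-ltrch, '\\[a-zA-Z]+(-?[0-9]+)? ?', '')"/>

    <!-- Assumed extra step 2: remove the group braces that are left over. -->
    <xsl:variable name="no-braces" select="replace($no-controls, '[\{\}]', '')"/>

    <data>
      <xsl:value-of select="$no-braces" disable-output-escaping="yes"/>
    </data>
  </xsl:template>

</xsl:stylesheet>
```

The order matters here: the bold and \'b7 replacements have to run before the generic control-word step, otherwise \b would be stripped before it could be turned into a <bold> tag. If the real data contains escaped braces or other hex escapes, those would need their own replacements before the last two steps.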