Given a markup file like this:
<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="2">India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.</seg>
<seg id="3">Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
<seg id="5">India is also reportedly hoping for a deal on defence collaboration between the two nations.</seg>
</p>
</doc>
<doc sysid="ref" docid="1018-lenta.ru" genre="news" origlang="ru">
<p>
<seg id="1">FANO Russia will hold a final Expert Session</seg>
<seg id="2">The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.</seg>
<seg id="3">The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.</seg>
<seg id="4">At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.</seg>
<seg id="5">In addition, participants at the event plan to discuss the rules for forming an expert panel, which is responsible for evaluating the work of scientific groups, as well as the criteria for carrying out evaluations.</seg>
<seg id="6">The third Expert Session will be the final meeting in a series of events on the formation of a unified approach for all three academies to the evaluation of the effectiveness of activities of scientific organizations.</seg>
<seg id="7">Over the past five months, we were able to achieve this, and the final version of the regulatory documents is undergoing approval.</seg>
<seg id="8">According to the plans for the upcoming session, we should complete the development of procedures for scientometric and expert analysis, and come to an agreement on the stages and timeframes for the evaluation process”, said the Head of FANO’s Expert-Analytical Department, Elena Aksenova.</seg>
<seg id="9">Representatives from more than one hundred Russian scientific institutes will take part in the event.</seg>
<seg id="10">It is expected that a resolution will be adopted based on its results.</seg>
<seg id="11">The meeting will begin at 10 am, Moscow time, on September 16, 2014, at the following address: 14 Solyanka Street, Moscow.</seg>
</p>
</doc>
</srcset>
I can remove the markup tags with sed, as described in "Sed remove tags from html file":
sed -e 's/<[^>]*>//g' file.txt
That leaves empty lines in the output, so I then have to remove them as described in "Delete empty lines using SED":
sed -e 's/<[^>]*>//g' file.txt | sed '/^\s*$/d'
How can I combine the tag-removal and empty-line-removal expressions into a single sed command?
1 solution
#1
2
What about deleting the empty lines right away, in the same sed invocation?
sed -e 's/<[^>]*>//g;/^\s*$/d' file.txt
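A note on portability, offered as a sketch rather than a definitive answer: `\s` is a GNU sed extension, so on other sed implementations the POSIX class `[[:space:]]` is the safer spelling. The function name `strip_tags` below is hypothetical and only there for illustration. Because sed runs the commands in order on each line, the `d` command sees the line after the tags have already been stripped. Like the original one-liner, this only handles tags that open and close on the same line.

```shell
# Hypothetical helper; the name strip_tags is not from the original post.
# 1) remove everything that looks like <...> on the current line
# 2) delete the line if what remains is empty or whitespace-only
# [[:space:]] is the POSIX equivalent of the GNU-only \s.
strip_tags() {
    sed -e 's/<[^>]*>//g' -e '/^[[:space:]]*$/d' "$@"
}
```

It would be called as `strip_tags file.txt`, or at the end of a pipeline with no arguments.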