Perl - 如何计算包含多个xml开始和结束标记的XML文件中的某些标记

时间:2022-06-01 18:18:54

I have an XML file like this.

我有一个像这样的XML文件。

Each line of the file starts and ends with a process_info tag. The file can contain many lines like this, there may be many similar files.

文件的每一行都以process_info标记开头和结尾。该文件可以包含许多这样的行,可能有许多类似的文件。

<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991222</OtherParty><OtherLocation>55009999999991222</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TIM+ZGNA01-99703-1211241250-D.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>29722352746</networkCallReference><switchIdentity>7274</switchIdentity><originatedCode>1</originatedCode><subscriptionType>1</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BGNA05N</incomingAssignedRoute><translatedNumber>12#222</translatedNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BGNA05N</incomingRoute><outgoingRoute>ZBSA1CO</outgoingRoute><mSCIdentification>11556281138800</mSCIdentification><exchangeIdentity>ZGNA01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>0</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>5</chargeableDuration><timeForStopOfCharge>194949</timeForStopOfCharge><timeForStartOfCharge>194944</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#222</calledPartyNumber><callingSubscriberIMEI>355921042890190</callingSubscriberIMEI><callingSubscriberIMSI>724046008971498</callingSubscriberIMSI><callingPartyNumber>11556281020633</callingPartyNumber><typeOfCallingSubscriber>10</typeOfCallingSubscriber><recordSequenceNumber>2987070</recordSequenceNumber><callIdentificationNumber>1362570</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>724046213C64F8A</cellIDForLastCellCalling><cellIDFor1stCellCalling>7240400C64F8A</cellIDFor1stCellCalling><timeForTCSeizureCalling>194943</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11556281138800</MSC_ID><CallStart>20121123194944</CallStart><CallDuration>5</CallDuration><CallDuration_30_inf>30</CallDuration_30_inf><CallDuration_60_inf>60</CallDuration_60_inf><CallDuration_MC>30</CallDuration_MC><CallDuration_30_60>30</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">556281020633</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP30158</OtherZone></event_data><dupChk></dupChk><account map_type="2">556281020633</account><other_account map_type="2">55#222</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DE9C-0000DB98-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report></transaction><start>20121123194944</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF</filename><index_into_file>6</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="707">Error on CDR level; File processing continued.</result><data><file_info result="partial">CDR-Counter: (IN=16, BAD=0): (NORM_ERR=0 DUP_ERR=0, RAL_ERR=0), DUPLICATE=1, DISCARDED=0, OK=15</file_info></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename></process_info>
<process_info><module>pe_gw_a</module><result code="705">Duplicate CDR</result><data><input><event_data origin_id="asn1"><CallType>mosms</CallType><OtherParty ton="1" npi="1" int_code="55">556291860209</OtherParty><OtherLocation>55006234191860209</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeOtherParty>55</IntCodeOtherParty><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TKM_SMS+STKM01-28129-1211241251-A.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report><CDRType>5</CDRType><serviceCentreAddress>11556291860209</serviceCentreAddress><miscellaneousInformation>41</miscellaneousInformation><gSMTeleServiceCode>34</gSMTeleServiceCode><cellIDFor1stCellCalling>7240462003E0000</cellIDFor1stCellCalling><mSCIdentification>11551189848200</mSCIdentification><exchangeIdentity>STKM01</exchangeIdentity><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeForStartOfCharge>124619</timeForStartOfCharge><dateForStartOfCharge>20121124</dateForStartOfCharge><callingSubscriberIMSI>724046012529641</callingSubscriberIMSI><callingPartyNumber>11556282361092</callingPartyNumber></original_cdr><TypeOfCommunication>sms</TypeOfCommunication><CallDuration>0.9</CallDuration><CallStart>20121124124619</CallStart><MSC_ID>11551189848200</MSC_ID><ServedParty int_code="55" ton="1" npi="1">556282361092</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>Sms_SMS___TIM_TIM</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP37744</OtherZone></event_data><dupChk></dupChk><account map_type="2">556282361092</account><other_account map_type="2">556291860209</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DE9D-0000DBC9-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report></transaction><start>20121124124619</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename><index_into_file>15</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991144</OtherParty><OtherLocation>55009999999991144</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TMX+ZBHE01-95068-1211241251-AG.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>447382755812</networkCallReference><switchIdentity>6628</switchIdentity><originatedCode>1</originatedCode><subscriptionType>21</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BMCL01B</incomingAssignedRoute><translatedNumber>12#144</translatedNumber><originatingLocationNumber>11553191938800</originatingLocationNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BMCL01B</incomingRoute><outgoingRoute>XMCL1AO</outgoingRoute><mSCIdentification>11553191938800</mSCIdentification><exchangeIdentity>ZBHE01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>38</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>4</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>426</chargeableDuration><timeForStopOfCharge>182128</timeForStopOfCharge><timeForStartOfCharge>181421</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#144</calledPartyNumber><callingSubscriberIMEI>358855043501160</callingSubscriberIMEI><callingSubscriberIMSI>724023016557605</callingSubscriberIMSI><callingPartyNumber>11553891610047</callingPartyNumber><typeOfCallingSubscriber>10</typeOfCallingSubscriber><recordSequenceNumber>1489944</recordSequenceNumber><callIdentificationNumber>11705419</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>7240238279ADEE5</cellIDForLastCellCalling><cellIDFor1stCellCalling>72402009ADEE5</cellIDFor1stCellCalling><timeForTCSeizureCalling>181417</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11553191938800</MSC_ID><CallStart>20121123181421</CallStart><CallDuration>426</CallDuration><CallDuration_30_inf>426</CallDuration_30_inf><CallDuration_60_inf>426</CallDuration_60_inf><CallDuration_MC>426</CallDuration_MC><CallDuration_30_60>60</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">553891610047</ServedParty><ServedLocation>7240238</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00461</ServedZone><OtherZone>ZP30411</OtherZone></event_data><dupChk></dupChk><account map_type="2">553891610047</account><other_account map_type="2">55#144</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DEA8-0000DBE8-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report></transaction><start>20121123181421</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF</filename><index_into_file>6</index_into_file></process_info>

I'd like to keep a count of all the different values of the result element, so my ouput would be something like this:

我想保留结果元素的所有不同值的计数,所以我的输出将是这样的:

"D14 - Calls *144" count 2
"Duplicate CDR" count 1
"Error on CDR level; File processing continued." count 1

“D14 - 呼叫* 144”计数2“重复CDR”计数1“CDR级别错误;文件处理继续。”数1

How can I should this? I guess using XML:Twig or XML:Parser, but as there are many start/end tags inside a file i'm unable to figure out a solution.

我怎么能这样呢?我想使用XML:Twig或XML:Parser,但由于文件中有许多开始/结束标记,我无法找到解决方案。

5 个解决方案

#1


1  

You could use the excellent DOM parser Mojo::DOM from the Mojolicious suite to count these. It's pretty straightforward. Use a hash (%count) to keep track of how often you found a result. This is the typical Perl idiom for this kind of problems.

您可以使用来自Mojolicious套件的优秀DOM解析器Mojo :: DOM来计算这些。这很简单。使用哈希值(%count)来跟踪您找到结果的频率。对于这类问题,这是典型的Perl习语。

#!/usr/bin/env perl

use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# read all input lines at once
my $dom = Mojo::DOM->new(do {local $/; <DATA>});

# prepare count hash
my %count = ();

# iterate result elements
$dom->find('result')->each(sub {
    my $element = shift;
    $count{$element->text}++;
});

# output
say "$_: $count{$_}" for keys %count;

__DATA__
<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991222</OtherParty><OtherLocation>55009999999991222</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TIM+ZGNA01-99703-1211241250-D.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>29722352746</networkCallReference><switchIdentity>7274</switchIdentity><originatedCode>1</originatedCode><subscriptionType>1</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BGNA05N</incomingAssignedRoute><translatedNumber>12#222</translatedNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BGNA05N</incomingRoute><outgoingRoute>ZBSA1CO</outgoingRoute><mSCIdentification>11556281138800</mSCIdentification><exchangeIdentity>ZGNA01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>0</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>5</chargeableDuration><timeForStopOfCharge>194949</timeForStopOfCharge><timeForStartOfCharge>194944</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#222</calledPartyNumber><callingSubscriberIMEI>355921042890190</callingSubscriberIMEI><callingSubscriberIMSI>724046008971498</callingSubscriberIMSI><callingPartyNumber>11556281020633</callingPartyNumber><typeOfCallingSubscriber>10</typeOfCallingSubscriber><recordSequenceNumber>2987070</recordSequenceNumber><callIdentificationNumber>1362570</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>724046213C64F8A</cellIDForLastCellCalling><cellIDFor1stCellCalling>7240400C64F8A</cellIDFor1stCellCalling><timeForTCSeizureCalling>194943</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11556281138800</MSC_ID><CallStart>20121123194944</CallStart><CallDuration>5</CallDuration><CallDuration_30_inf>30</CallDuration_30_inf><CallDuration_60_inf>60</CallDuration_60_inf><CallDuration_MC>30</CallDuration_MC><CallDuration_30_60>30</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">556281020633</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP30158</OtherZone></event_data><dupChk></dupChk><account map_type="2">556281020633</account><other_account map_type="2">55#222</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DE9C-0000DB98-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report></transaction><start>20121123194944</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF</filename><index_into_file>6</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="707">Error on CDR level; File processing continued.</result><data><file_info result="partial">CDR-Counter: (IN=16, BAD=0): (NORM_ERR=0 DUP_ERR=0, RAL_ERR=0), DUPLICATE=1, DISCARDED=0, OK=15</file_info></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename></process_info>
<process_info><module>pe_gw_a</module><result code="705">Duplicate CDR</result><data><input><event_data origin_id="asn1"><CallType>mosms</CallType><OtherParty ton="1" npi="1" int_code="55">556291860209</OtherParty><OtherLocation>55006234191860209</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeOtherParty>55</IntCodeOtherParty><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TKM_SMS+STKM01-28129-1211241251-A.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report><CDRType>5</CDRType><serviceCentreAddress>11556291860209</serviceCentreAddress><miscellaneousInformation>41</miscellaneousInformation><gSMTeleServiceCode>34</gSMTeleServiceCode><cellIDFor1stCellCalling>7240462003E0000</cellIDFor1stCellCalling><mSCIdentification>11551189848200</mSCIdentification><exchangeIdentity>STKM01</exchangeIdentity><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeForStartOfCharge>124619</timeForStartOfCharge><dateForStartOfCharge>20121124</dateForStartOfCharge><callingSubscriberIMSI>724046012529641</callingSubscriberIMSI><callingPartyNumber>11556282361092</callingPartyNumber></original_cdr><TypeOfCommunication>sms</TypeOfCommunication><CallDuration>0.9</CallDuration><CallStart>20121124124619</CallStart><MSC_ID>11551189848200</MSC_ID><ServedParty int_code="55" ton="1" npi="1">556282361092</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>Sms_SMS___TIM_TIM</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP37744</OtherZone></event_data><dupChk></dupChk><account map_type="2">556282361092</account><other_account map_type="2">556291860209</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DE9D-0000DBC9-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report></transaction><start>20121124124619</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename><index_into_file>15</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991144</OtherParty><OtherLocation>55009999999991144</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TMX+ZBHE01-95068-1211241251-AG.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>447382755812</networkCallReference><switchIdentity>6628</switchIdentity><originatedCode>1</originatedCode><subscriptionType>21</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BMCL01B</incomingAssignedRoute><translatedNumber>12#144</translatedNumber><originatingLocationNumber>11553191938800</originatingLocationNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BMCL01B</incomingRoute><outgoingRoute>XMCL1AO</outgoingRoute><mSCIdentification>11553191938800</mSCIdentification><exchangeIdentity>ZBHE01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>38</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>4</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>426</chargeableDuration><timeForStopOfCharge>182128</timeForStopOfCharge><timeForStartOfCharge>181421</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#144</calledPartyNumber><callingSubscriberIMEI>358855043501160</callingSubscriberIMEI><callingSubscriberIMSI>724023016557605</callingSubscriberIMSI><callingPartyNumber>11553891610047</callingPartyNumber><typeOfCallingSubscriber>10</typeOfCallingSubscriber><recordSequenceNumber>1489944</recordSequenceNumber><callIdentificationNumber>11705419</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>7240238279ADEE5</cellIDForLastCellCalling><cellIDFor1stCellCalling>72402009ADEE5</cellIDFor1stCellCalling><timeForTCSeizureCalling>181417</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11553191938800</MSC_ID><CallStart>20121123181421</CallStart><CallDuration>426</CallDuration><CallDuration_30_inf>426</CallDuration_30_inf><CallDuration_60_inf>426</CallDuration_60_inf><CallDuration_MC>426</CallDuration_MC><CallDuration_30_60>60</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">553891610047</ServedParty><ServedLocation>7240238</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00461</ServedZone><OtherZone>ZP30411</OtherZone></event_data><dupChk></dupChk><account map_type="2">553891610047</account><other_account map_type="2">55#144</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DEA8-0000DBE8-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report></transaction><start>20121123181421</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF</filename><index_into_file>6</index_into_file></process_info>

Output:

Duplicate CDR: 1
Error on CDR level; File processing continued.: 1
D14 - Calls *144: 2

#2


1  

This is conveniently done with any of the Perl XML modules, but since you mention XML::Twig, that is what I have used in this solution.

这可以通过任何Perl XML模块方便地完成,但是既然你提到了XML :: Twig,那就是我在这个解决方案中使用的。

You say there may be many similar XML files but do not say how they are to be identified, so all I can do is offer you a solution for a single file and hope you can extrapolate from here.

你说可能有许多类似的XML文件,但没有说明如何识别它们,所以我所能做的就是为你提供单个文件的解决方案,并希望你能从这里推断出来。

The program works by reading the file line by line, parsing each line as a separate XML document, and extracting the text value of the first child element of the root document that has a result tag. This text value is used as a hash key to keep track of the number of occurrences of each different result.

该程序通过逐行读取文件,将每行解析为单独的XML文档,并提取具有结果标记的根文档的第一个子元素的文本值来工作。此文本值用作哈希键,以跟踪每个不同结果的出现次数。

use strict;
use warnings;

use XML::Twig;

my $twig = XML::Twig->new;

my %results;

open my $fh, '<', 'my.xml' or die $!;

while (<$fh>) {
  $twig->parse($_);
  my $result = $twig->root->first_child('result');
  if ($result) {
    $result = $result->trimmed_text;
    $results{$result}++;
  }
}

for (sort keys %results) {
  my $n = $results{$_};
  printf qq("%s" count %d\n), $_, $n;
}

output

"D14 - Calls *144" count 2
"Duplicate CDR" count 1
"Error on CDR level; File processing continued." count 1

#3


0  

You could use XML::SAX::PurePerl, it is very fail-proof and, in my experience, handle well messy XML:

您可以使用XML :: SAX :: PurePerl,它非常防故障,并且根据我的经验,处理凌乱的XML:

#!/usr/bin/env perl
package Result::Extractor;
use strict;
use warnings qw(all);

use base qw(XML::SAX::Base);

sub new {
    return bless {
        count   => {},
        data    => '',
    };
}

sub start_element {
    my ($self, $el) = @_;
    $self->{data} = '';
}

sub end_element {
    my ($self, $el) = @_;
    if ($el->{Name} eq 'result') {
        ++$self->{count}{$self->{data}};
    }
}

sub characters {
    my ($self, $data) = @_;
    $self->{data} .= $data->{Data};
}

1;

package main;
use strict;
use warnings qw(all);

use Data::Printer;
use XML::SAX::PurePerl;

my $handler = Result::Extractor->new;
my $parser = XML::SAX::PurePerl->new(Handler => $handler);

$parser->parse_string(do { local $/; '<wrapper>' . <DATA> . '</wrapper>' });

p $handler->{count};

__DATA__
<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991222</OtherParty><OtherLocation>55009999999991222</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TIM+ZGNA01-99703-1211241250-D.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>29722352746</networkCallReference><switchIdentity>7274</switchIdentity><originatedCode>1</originatedCode><subscriptionType>1</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BGNA05N</incomingAssignedRoute><translatedNumber>12#222</translatedNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BGNA05N</incomingRoute><outgoingRoute>ZBSA1CO</outgoingRoute><mSCIdentification>11556281138800</mSCIdentification><exchangeIdentity>ZGNA01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>0</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>5</chargeableDuration><timeForStopOfCharge>194949</timeForStopOfCharge><timeForStartOfCharge>194944</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#222</calledPartyNumber><callingSubscriberIMEI>355921042890190</callingSubscriberIMEI><callingSubscriberIMSI>724046008971498</callingSubscriberIMSI><callingPartyNumber>11556281020633</callingPartyNumber><typeOfCallingSubscriber>10</typeOfCallingSubscriber><recordSequenceNumber>2987070</recordSequenceNumber><callIdentificationNumber>1362570</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>724046213C64F8A</cellIDForLastCellCalling><cellIDFor1stCellCalling>7240400C64F8A</cellIDFor1stCellCalling><timeForTCSeizureCalling>194943</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11556281138800</MSC_ID><CallStart>20121123194944</CallStart><CallDuration>5</CallDuration><CallDuration_30_inf>30</CallDuration_30_inf><CallDuration_60_inf>60</CallDuration_60_inf><CallDuration_MC>30</CallDuration_MC><CallDuration_30_60>30</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">556281020633</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP30158</OtherZone></event_data><dupChk></dupChk><account map_type="2">556281020633</account><other_account map_type="2">55#222</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DE9C-0000DB98-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report></transaction><start>20121123194944</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF</filename><index_into_file>6</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="707">Error on CDR level; File processing continued.</result><data><file_info result="partial">CDR-Counter: (IN=16, BAD=0): (NORM_ERR=0 DUP_ERR=0, RAL_ERR=0), DUPLICATE=1, DISCARDED=0, OK=15</file_info></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename></process_info>
<process_info><module>pe_gw_a</module><result code="705">Duplicate CDR</result><data><input><event_data origin_id="asn1"><CallType>mosms</CallType><OtherParty ton="1" npi="1" int_code="55">556291860209</OtherParty><OtherLocation>55006234191860209</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeOtherParty>55</IntCodeOtherParty><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TKM_SMS+STKM01-28129-1211241251-A.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report><CDRType>5</CDRType><serviceCentreAddress>11556291860209</serviceCentreAddress><miscellaneousInformation>41</miscellaneousInformation><gSMTeleServiceCode>34</gSMTeleServiceCode><cellIDFor1stCellCalling>7240462003E0000</cellIDFor1stCellCalling><mSCIdentification>11551189848200</mSCIdentification><exchangeIdentity>STKM01</exchangeIdentity><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeForStartOfCharge>124619</timeForStartOfCharge><dateForStartOfCharge>20121124</dateForStartOfCharge><callingSubscriberIMSI>724046012529641</callingSubscriberIMSI><callingPartyNumber>11556282361092</callingPartyNumber></original_cdr><TypeOfCommunication>sms</TypeOfCommunication><CallDuration>0.9</CallDuration><CallStart>20121124124619</CallStart><MSC_ID>11551189848200</MSC_ID><ServedParty int_code="55" ton="1" npi="1">556282361092</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>Sms_SMS___TIM_TIM</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP37744</OtherZone></event_data><dupChk></dupChk><account map_type="2">556282361092</account><other_account map_type="2">556291860209</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DE9D-0000DBC9-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report></transaction><start>20121124124619</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename><index_into_file>15</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991144</OtherParty><OtherLocation>55009999999991144</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TMX+ZBHE01-95068-1211241251-AG.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>447382755812</networkCallReference><switchIdentity>6628</switchIdentity><originatedCode>1</originatedCode><subscriptionType>21</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BMCL01B</incomingAssignedRoute><translatedNumber>12#144</translatedNumber><originatingLocationNumber>11553191938800</originatingLocationNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BMCL01B</incomingRoute><outgoingRoute>XMCL1AO</outgoingRoute><mSCIdentification>11553191938800</mSCIdentification><exchangeIdentity>ZBHE01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>38</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>4</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>426</chargeableDuration><timeForStopOfCharge>182128</timeForStopOfCharge><timeForStartOfCharge>181421</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#144</calledPartyNumber><callingSubscriberIMEI>358855043501160</callingSubscriberIMEI><callingSubscriberIMSI>724023016557605</callingSubscriberIMSI><callingPartyNumber>11553891610047</callingPartyNumber><typeOfCallingSubscriber>10</typeOfCallingSubscriber><recordSequenceNumber>1489944</recordSequenceNumber><callIdentificationNumber>11705419</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>7240238279ADEE5</cellIDForLastCellCalling><cellIDFor1stCellCalling>72402009ADEE5</cellIDFor1stCellCalling><timeForTCSeizureCalling>181417</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11553191938800</MSC_ID><CallStart>20121123181421</CallStart><CallDuration>426</CallDuration><CallDuration_30_inf>426</CallDuration_30_inf><CallDuration_60_inf>426</CallDuration_60_inf><CallDuration_MC>426</CallDuration_MC><CallDuration_30_60>60</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">553891610047</ServedParty><ServedLocation>7240238</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00461</ServedZone><OtherZone>ZP30411</OtherZone></event_data><dupChk></dupChk><account map_type="2">553891610047</account><other_account map_type="2">55#144</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DEA8-0000DBE8-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report></transaction><start>20121123181421</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF</filename><index_into_file>6</index_into_file></process_info>

Result:

\ {
    'Duplicate CDR'                                    1,
    'D14 - Calls *144'                                 2,
    'Error on CDR level; File processing continued.'   1
}

You can also check XML::SAX::Expat, XML::SAX::ExpatXS and XML::LibXML::SAX; they are faster, but more error-prone.

您还可以检查XML :: SAX :: Expat,XML :: SAX :: ExpatXS和XML :: LibXML :: SAX;它们更快,但更容易出错。

#4


-1  

perl -MXML::Twig -E'XML::Twig->new( twig_handlers => { result => sub { $count{$_->text}++ } })->parsefile( $ARGV[0]); say "$_: $count{$_}" foreach sort keys %count; ' count.xml

would work, IF YOUR DATA WAS XML.

如果您的数据是XML,那将会工作。

It is not.

它不是。

#5


-1  

If you make the assumption that every instance of <result>...</result> is one that you are interested in, then you might be able to get away with a regex:

如果您假设 ... 的每个实例都是您感兴趣的实例,那么您可以使用正则表达式:

my $doc = read_file("file.xml"); # slurp in the doc
my %count;

while ($doc =~ m,<result.*?>(.*?)</result>,g) {
  $count{$1}++;
}

But I would use a real XML processing library for this, like XML::XPath. It's very easy to adapt the example program for XML::Path to your XML file:

但是我会使用一个真正的XML处理库,比如XML :: XPath。将XML :: Path的示例程序调整到XML文件非常容易:

use XML::XPath;
use XML::XPath::XMLParser;

my $xp = XML::XPath->new(filename => 'file.xml');

my $nodeset = $xp->find('/zzz/process_info/result'); # find all results

my %count;
foreach my $node ($nodeset->get_nodelist) {
  $count{ $node->string_value } ++;
}

Note I am using an xpath of /zzz/... - the top level of your XML document must be a single element, so I enclosed your example with <zzz>...</zzz>.

注意我使用的是/ zzz / ...的xpath - XML文档的顶层必须是单个元素,所以我用 ... 附上了你的例子。

This is a much more robust solution since it will only locate result elements which are children of process_info elements.

这是一个更加强大的解决方案,因为它只能找到作为process_info元素的子元素的结果元素。

#1


1  

You could use the excellent DOM parser Mojo::DOM from the Mojolicious suite to count these. It's pretty straightforward. Use a hash (%count) to keep track of how often you found a result. This is the typical Perl idiom for this kind of problems.

您可以使用来自Mojolicious套件的优秀DOM解析器Mojo :: DOM来计算这些。这很简单。使用哈希值(%count)来跟踪您找到结果的频率。对于这类问题,这是典型的Perl习语。

#!/usr/bin/env perl

use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# read all input lines at once
my $dom = Mojo::DOM->new(do {local $/; <DATA>});

# prepare count hash
my %count = ();

# iterate result elements
$dom->find('result')->each(sub {
    my $element = shift;
    $count{$element->text}++;
});

# output
say "$_: $count{$_}" for keys %count;

__DATA__
<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991222</OtherParty><OtherLocation>55009999999991222</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TIM+ZGNA01-99703-1211241250-D.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>29722352746</networkCallReference><switchIdentity>7274</switchIdentity><originatedCode>1</originatedCode><subscriptionType>1</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BGNA05N</incomingAssignedRoute><translatedNumber>12#222</translatedNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BGNA05N</incomingRoute><outgoingRoute>ZBSA1CO</outgoingRoute><mSCIdentification>11556281138800</mSCIdentification><exchangeIdentity>ZGNA01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>0</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>5</chargeableDuration><timeForStopOfCharge>194949</timeForStopOfCharge><timeForStartOfCharge>194944</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#222</calledPartyNumber><callingSubscriberIMEI>355921042890190</callingSubscriberIMEI><callingSubscriberIMSI>724046008971498</callingSubscriberIMSI><callingPartyNumber>11556281020633</callingPartyNumber><typeOfCallingSubscriber>10</typeOfCallingSubscriber><recordSequenceNumber>2987070</recordSequenceNumber><callIdentificationNumber>1362570</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>724046213C64F8A</cellIDForLastCellCalling><cellIDFor1stCellCalling>7240400C64F8A</cellIDFor1stCellCalling><timeForTCSeizureCalling>194943</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11556281138800</MSC_ID><CallStart>20121123194944</CallStart><CallDuration>5</CallDuration><CallDuration_30_inf>30</CallDuration_30_inf><CallDuration_60_inf>60</CallDuration_60_inf><CallDuration_MC>30</CallDuration_MC><CallDuration_30_60>30</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">556281020633</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP30158</OtherZone></event_data><dupChk></dupChk><account map_type="2">556281020633</account><other_account map_type="2">55#222</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DE9C-0000DB98-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report></transaction><start>20121123194944</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF</filename><index_into_file>6</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="707">Error on CDR level; File processing continued.</result><data><file_info result="partial">CDR-Counter: (IN=16, BAD=0): (NORM_ERR=0 DUP_ERR=0, RAL_ERR=0), DUPLICATE=1, DISCARDED=0, OK=15</file_info></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename></process_info>
<process_info><module>pe_gw_a</module><result code="705">Duplicate CDR</result><data><input><event_data origin_id="asn1"><CallType>mosms</CallType><OtherParty ton="1" npi="1" int_code="55">556291860209</OtherParty><OtherLocation>55006234191860209</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeOtherParty>55</IntCodeOtherParty><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TKM_SMS+STKM01-28129-1211241251-A.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report><CDRType>5</CDRType><serviceCentreAddress>11556291860209</serviceCentreAddress><miscellaneousInformation>41</miscellaneousInformation><gSMTeleServiceCode>34</gSMTeleServiceCode><cellIDFor1stCellCalling>7240462003E0000</cellIDFor1stCellCalling><mSCIdentification>11551189848200</mSCIdentification><exchangeIdentity>STKM01</exchangeIdentity><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeForStartOfCharge>124619</timeForStartOfCharge><dateForStartOfCharge>20121124</dateForStartOfCharge><callingSubscriberIMSI>724046012529641</callingSubscriberIMSI><callingPartyNumber>11556282361092</callingPartyNumber></original_cdr><TypeOfCommunication>sms</TypeOfCommunication><CallDuration>0.9</CallDuration><CallStart>20121124124619</CallStart><MSC_ID>11551189848200</MSC_ID><ServedParty int_code="55" ton="1" npi="1">556282361092</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>Sms_SMS___TIM_TIM</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP37744</OtherZone></event_data><dupChk></dupChk><account map_type="2">556282361092</account><other_account map_type="2">556291860209</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DE9D-0000DBC9-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report></transaction><start>20121124124619</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename><index_into_file>15</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991144</OtherParty><OtherLocation>55009999999991144</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TMX+ZBHE01-95068-1211241251-AG.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>447382755812</networkCallReference><switchIdentity>6628</switchIdentity><originatedCode>1</originatedCode><subscriptionType>21</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BMCL01B</incomingAssignedRoute><translatedNumber>12#144</translatedNumber><originatingLocationNumber>11553191938800</originatingLocationNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BMCL01B</incomingRoute><outgoingRoute>XMCL1AO</outgoingRoute><mSCIdentification>11553191938800</mSCIdentification><exchangeIdentity>ZBHE01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>38</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>4</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>426</chargeableDuration><timeForStopOfCharge>182128</timeForStopOfCharge><timeForStartOfCharge>181421</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#144</calledPartyNumber><callingSubscriberIMEI>358855043501160</callingSubscriberIMEI><callingSubscriberIMSI>724023016557605</callingSubscriberIMSI><callingPartyNumber>11553891610047</callingPartyNumber><typeOfCallingSubscriber>10</typeOfCallingSubscriber><recordSequenceNumber>1489944</recordSequenceNumber><callIdentificationNumber>11705419</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>7240238279ADEE5</cellIDForLastCellCalling><cellIDFor1stCellCalling>72402009ADEE5</cellIDFor1stCellCalling><timeForTCSeizureCalling>181417</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11553191938800</MSC_ID><CallStart>20121123181421</CallStart><CallDuration>426</CallDuration><CallDuration_30_inf>426</CallDuration_30_inf><CallDuration_60_inf>426</CallDuration_60_inf><CallDuration_MC>426</CallDuration_MC><CallDuration_30_60>60</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">553891610047</ServedParty><ServedLocation>7240238</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00461</ServedZone><OtherZone>ZP30411</OtherZone></event_data><dupChk></dupChk><account map_type="2">553891610047</account><other_account map_type="2">55#144</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DEA8-0000DBE8-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report></transaction><start>20121123181421</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF</filename><index_into_file>6</index_into_file></process_info>

Output:

Duplicate CDR: 1
Error on CDR level; File processing continued.: 1
D14 - Calls *144: 2

#2


1  

This is conveniently done with any of the Perl XML modules, but since you mention XML::Twig, that is what I have used in this solution.

这可以通过任何Perl XML模块方便地完成,但是既然你提到了XML :: Twig,那就是我在这个解决方案中使用的。

You say there may be many similar XML files but do not say how they are to be identified, so all I can do is offer you a solution for a single file and hope you can extrapolate from here.

你说可能有许多类似的XML文件,但没有说明如何识别它们,所以我所能做的就是为你提供单个文件的解决方案,并希望你能从这里推断出来。

The program works by reading the file line by line, parsing each line as a separate XML document, and extracting the text value of the first child element of the root document that has a result tag. This text value is used as a hash key to keep track of the number of occurrences of each different result.

该程序通过逐行读取文件,将每行解析为单独的XML文档,并提取具有结果标记的根文档的第一个子元素的文本值来工作。此文本值用作哈希键,以跟踪每个不同结果的出现次数。

use strict;
use warnings;

use XML::Twig;

my $twig = XML::Twig->new;

my %results;

open my $fh, '<', 'my.xml' or die $!;

while (<$fh>) {
  $twig->parse($_);
  my $result = $twig->root->first_child('result');
  if ($result) {
    $result = $result->trimmed_text;
    $results{$result}++;
  }
}

for (sort keys %results) {
  my $n = $results{$_};
  printf qq("%s" count %d\n), $_, $n;
}

output

"D14 - Calls *144" count 2
"Duplicate CDR" count 1
"Error on CDR level; File processing continued." count 1

#3


0  

You could use XML::SAX::PurePerl, it is very fail-proof and, in my experience, handle well messy XML:

您可以使用XML :: SAX :: PurePerl,它非常防故障,并且根据我的经验,处理凌乱的XML:

#!/usr/bin/env perl
package Result::Extractor;
use strict;
use warnings qw(all);

use base qw(XML::SAX::Base);

sub new {
    return bless {
        count   => {},
        data    => '',
    };
}

sub start_element {
    my ($self, $el) = @_;
    $self->{data} = '';
}

sub end_element {
    my ($self, $el) = @_;
    if ($el->{Name} eq 'result') {
        ++$self->{count}{$self->{data}};
    }
}

sub characters {
    my ($self, $data) = @_;
    $self->{data} .= $data->{Data};
}

1;

package main;
use strict;
use warnings qw(all);

use Data::Printer;
use XML::SAX::PurePerl;

my $handler = Result::Extractor->new;
my $parser = XML::SAX::PurePerl->new(Handler => $handler);

$parser->parse_string(do { local $/; '<wrapper>' . <DATA> . '</wrapper>' });

p $handler->{count};

__DATA__
<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991222</OtherParty><OtherLocation>55009999999991222</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TIM+ZGNA01-99703-1211241250-D.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>29722352746</networkCallReference><switchIdentity>7274</switchIdentity><originatedCode>1</originatedCode><subscriptionType>1</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BGNA05N</incomingAssignedRoute><translatedNumber>12#222</translatedNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BGNA05N</incomingRoute><outgoingRoute>ZBSA1CO</outgoingRoute><mSCIdentification>11556281138800</mSCIdentification><exchangeIdentity>ZGNA01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>0</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>5</chargeableDuration><timeForStopOfCharge>194949</timeForStopOfCharge><timeForStartOfCharge>194944</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#222</calledPartyNumber><callingSubscriberIMEI>355921042890190</callingSubscriberIMEI><callingSubscriberIMSI>724046008971498</callingSubscriberIMSI><callingPartyNumber>11556281020633</callingPartyNumber><typeOfCallingSubscriber>10</typeOfCallingSubscriber><recordSequenceNumber>2987070</recordSequenceNumber><callIdentificationNumber>1362570</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>724046213C64F8A</cellIDForLastCellCalling><cellIDFor1stCellCalling>7240400C64F8A</cellIDFor1stCellCalling><timeForTCSeizureCalling>194943</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11556281138800</MSC_ID><CallStart>20121123194944</CallStart><CallDuration>5</CallDuration><CallDuration_30_inf>30</CallDuration_30_inf><CallDuration_60_inf>60</CallDuration_60_inf><CallDuration_MC>30</CallDuration_MC><CallDuration_30_60>30</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">556281020633</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP30158</OtherZone></event_data><dupChk></dupChk><account map_type="2">556281020633</account><other_account map_type="2">55#222</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DE9C-0000DB98-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report></transaction><start>20121123194944</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF</filename><index_into_file>6</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="707">Error on CDR level; File processing continued.</result><data><file_info result="partial">CDR-Counter: (IN=16, BAD=0): (NORM_ERR=0 DUP_ERR=0, RAL_ERR=0), DUPLICATE=1, DISCARDED=0, OK=15</file_info></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename></process_info>
<process_info><module>pe_gw_a</module><result code="705">Duplicate CDR</result><data><input><event_data origin_id="asn1"><CallType>mosms</CallType><OtherParty ton="1" npi="1" int_code="55">556291860209</OtherParty><OtherLocation>55006234191860209</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeOtherParty>55</IntCodeOtherParty><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TKM_SMS+STKM01-28129-1211241251-A.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report><CDRType>5</CDRType><serviceCentreAddress>11556291860209</serviceCentreAddress><miscellaneousInformation>41</miscellaneousInformation><gSMTeleServiceCode>34</gSMTeleServiceCode><cellIDFor1stCellCalling>7240462003E0000</cellIDFor1stCellCalling><mSCIdentification>11551189848200</mSCIdentification><exchangeIdentity>STKM01</exchangeIdentity><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeForStartOfCharge>124619</timeForStartOfCharge><dateForStartOfCharge>20121124</dateForStartOfCharge><callingSubscriberIMSI>724046012529641</callingSubscriberIMSI><callingPartyNumber>11556282361092</callingPartyNumber></original_cdr><TypeOfCommunication>sms</TypeOfCommunication><CallDuration>0.9</CallDuration><CallStart>20121124124619</CallStart><MSC_ID>11551189848200</MSC_ID><ServedParty int_code="55" ton="1" npi="1">556282361092</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>Sms_SMS___TIM_TIM</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP37744</OtherZone></event_data><dupChk></dupChk><account map_type="2">556282361092</account><other_account map_type="2">556291860209</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DE9D-0000DBC9-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report></transaction><start>20121124124619</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename><index_into_file>15</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991144</OtherParty><OtherLocation>55009999999991144</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TMX+ZBHE01-95068-1211241251-AG.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>447382755812</networkCallReference><switchIdentity>6628</switchIdentity><originatedCode>1</originatedCode><subscriptionType>21</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BMCL01B</incomingAssignedRoute><translatedNumber>12#144</translatedNumber><originatingLocationNumber>11553191938800</originatingLocationNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BMCL01B</incomingRoute><outgoingRoute>XMCL1AO</outgoingRoute><mSCIdentification>11553191938800</mSCIdentification><exchangeIdentity>ZBHE01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>38</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>4</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>426</chargeableDuration><timeForStopOfCharge>182128</timeForStopOfCharge><timeForStartOfCharge>181421</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#144</calledPartyNumber><callingSubscriberIMEI>358855043501160</callingSubscriberIMEI><callingSubscriberIMSI>724023016557605</callingSubscriberIMSI><callingPartyNumber>11553891610047</callingPartyNumber><typeOfCallingSubscriber>10</typeOfCallingSubscriber><recordSequenceNumber>1489944</recordSequenceNumber><callIdentificationNumber>11705419</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>7240238279ADEE5</cellIDForLastCellCalling><cellIDFor1stCellCalling>72402009ADEE5</cellIDFor1stCellCalling><timeForTCSeizureCalling>181417</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11553191938800</MSC_ID><CallStart>20121123181421</CallStart><CallDuration>426</CallDuration><CallDuration_30_inf>426</CallDuration_30_inf><CallDuration_60_inf>426</CallDuration_60_inf><CallDuration_MC>426</CallDuration_MC><CallDuration_30_60>60</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">553891610047</ServedParty><ServedLocation>7240238</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00461</ServedZone><OtherZone>ZP30411</OtherZone></event_data><dupChk></dupChk><account map_type="2">553891610047</account><other_account map_type="2">55#144</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DEA8-0000DBE8-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report></transaction><start>20121123181421</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF</filename><index_into_file>6</index_into_file></process_info>

Result:

\ {
    'Duplicate CDR'                                    1,
    'D14 - Calls *144'                                 2,
    'Error on CDR level; File processing continued.'   1
}

You can also check XML::SAX::Expat, XML::SAX::ExpatXS and XML::LibXML::SAX; they are faster, but more error-prone.

您还可以检查XML :: SAX :: Expat,XML :: SAX :: ExpatXS和XML :: LibXML :: SAX;它们更快,但更容易出错。

#4


-1  

perl -MXML::Twig -E'XML::Twig->new( twig_handlers => { result => sub { $count{$_->text}++ } })->parsefile( $ARGV[0]); say "$_: $count{$_}" foreach sort keys %count; ' count.xml

would work, IF YOUR DATA WAS XML.

如果您的数据是XML,那将会工作。

It is not.

它不是。

#5


-1  

If you make the assumption that every instance of <result>...</result> is one that you are interested in, then you might be able to get away with a regex:

如果您假设 ... 的每个实例都是您感兴趣的实例,那么您可以使用正则表达式:

my $doc = read_file("file.xml"); # slurp in the doc
my %count;

while ($doc =~ m,<result.*?>(.*?)</result>,g) {
  $count{$1}++;
}

But I would use a real XML processing library for this, like XML::XPath. It's very easy to adapt the example program for XML::Path to your XML file:

但是我会使用一个真正的XML处理库,比如XML :: XPath。将XML :: Path的示例程序调整到XML文件非常容易:

use XML::XPath;
use XML::XPath::XMLParser;

my $xp = XML::XPath->new(filename => 'file.xml');

my $nodeset = $xp->find('/zzz/process_info/result'); # find all results

my %count;
foreach my $node ($nodeset->get_nodelist) {
  $count{ $node->string_value } ++;
}

Note I am using an xpath of /zzz/... - the top level of your XML document must be a single element, so I enclosed your example with <zzz>...</zzz>.

注意我使用的是/ zzz / ...的xpath - XML文档的顶层必须是单个元素,所以我用 ... 附上了你的例子。

This is a much more robust solution since it will only locate result elements which are children of process_info elements.

这是一个更加强大的解决方案,因为它只能找到作为process_info元素的子元素的结果元素。