I have an XML file like this.
Each line of the file starts and ends with a process_info tag. A file can contain many lines like this, and there may be many similar files.
<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991222</OtherParty><OtherLocation>55009999999991222</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TIM+ZGNA01-99703-1211241250-D.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>29722352746</networkCallReference><switchIdentity>7274</switchIdentity><originatedCode>1</originatedCode><subscriptionType>1</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BGNA05N</incomingAssignedRoute><translatedNumber>12#222</translatedNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BGNA05N</incomingRoute><outgoingRoute>ZBSA1CO</outgoingRoute><mSCIdentification>11556281138800</mSCIdentification><exchangeIdentity>ZGNA01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>0</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>5</chargeableDuration><timeForStopOfCharge>194949</timeForStopOfCharge><timeForStartOfCharge>194944</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#222</calledPartyNumber><callingSubscriberIMEI>355921042890190</callingSubscriberIMEI><callingSubscriberIMSI>724046008971498</callingSubscriberIMSI><callingPartyNumber>11556281020633</callingPartyNumber><typeOfCallingSubscriber>10</typeOfCallingSubscriber><recordSequenceNumber>2987070</recordSequenceNumber><cal
lIdentificationNumber>1362570</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>724046213C64F8A</cellIDForLastCellCalling><cellIDFor1stCellCalling>7240400C64F8A</cellIDFor1stCellCalling><timeForTCSeizureCalling>194943</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11556281138800</MSC_ID><CallStart>20121123194944</CallStart><CallDuration>5</CallDuration><CallDuration_30_inf>30</CallDuration_30_inf><CallDuration_60_inf>60</CallDuration_60_inf><CallDuration_MC>30</CallDuration_MC><CallDuration_30_60>30</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">556281020633</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP30158</OtherZone></event_data><dupChk></dupChk><account map_type="2">556281020633</account><other_account map_type="2">55#222</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DE9C-0000DB98-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report></transaction><start>20121123194944</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF</filename><index_into_file>6</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="707">Error on CDR level; File processing continued.</result><data><file_info result="partial">CDR-Counter: (IN=16, BAD=0): (NORM_ERR=0 DUP_ERR=0, RAL_ERR=0), DUPLICATE=1, DISCARDED=0, OK=15</file_info></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename></process_info>
<process_info><module>pe_gw_a</module><result code="705">Duplicate CDR</result><data><input><event_data origin_id="asn1"><CallType>mosms</CallType><OtherParty ton="1" npi="1" int_code="55">556291860209</OtherParty><OtherLocation>55006234191860209</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeOtherParty>55</IntCodeOtherParty><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TKM_SMS+STKM01-28129-1211241251-A.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report><CDRType>5</CDRType><serviceCentreAddress>11556291860209</serviceCentreAddress><miscellaneousInformation>41</miscellaneousInformation><gSMTeleServiceCode>34</gSMTeleServiceCode><cellIDFor1stCellCalling>7240462003E0000</cellIDFor1stCellCalling><mSCIdentification>11551189848200</mSCIdentification><exchangeIdentity>STKM01</exchangeIdentity><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeForStartOfCharge>124619</timeForStartOfCharge><dateForStartOfCharge>20121124</dateForStartOfCharge><callingSubscriberIMSI>724046012529641</callingSubscriberIMSI><callingPartyNumber>11556282361092</callingPartyNumber></original_cdr><TypeOfCommunication>sms</TypeOfCommunication><CallDuration>0.9</CallDuration><CallStart>20121124124619</CallStart><MSC_ID>11551189848200</MSC_ID><ServedParty int_code="55" ton="1" npi="1">556282361092</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>Sms_SMS___TIM_TIM</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP37744</OtherZone></event_data><dupChk></dupChk><account map_type="2">556282361092</account><other_account map_type="2">556291860209</other_account><operation alternate_rating="1" type="charge"/><transaction 
id="0000000050B0DE9D-0000DBC9-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report></transaction><start>20121124124619</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename><index_into_file>15</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991144</OtherParty><OtherLocation>55009999999991144</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TMX+ZBHE01-95068-1211241251-AG.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>447382755812</networkCallReference><switchIdentity>6628</switchIdentity><originatedCode>1</originatedCode><subscriptionType>21</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BMCL01B</incomingAssignedRoute><translatedNumber>12#144</translatedNumber><originatingLocationNumber>11553191938800</originatingLocationNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BMCL01B</incomingRoute><outgoingRoute>XMCL1AO</outgoingRoute><mSCIdentification>11553191938800</mSCIdentification><exchangeIdentity>ZBHE01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>38</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>4</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>426</chargeableDuration><timeForStopOfCharge>182128</timeForStopOfCharge><timeForStartOfCharge>181421</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#144</calledPartyNumber><callingSubscriberIMEI>358855043501160</callingSubscriberIMEI><callingSubscriberIMSI>724023016557605</callingSubscriberIMSI><callingPartyNumber>11553891610047</callingPartyNumber><typeOfCallingSubscriber>10</typeO
fCallingSubscriber><recordSequenceNumber>1489944</recordSequenceNumber><callIdentificationNumber>11705419</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>7240238279ADEE5</cellIDForLastCellCalling><cellIDFor1stCellCalling>72402009ADEE5</cellIDFor1stCellCalling><timeForTCSeizureCalling>181417</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11553191938800</MSC_ID><CallStart>20121123181421</CallStart><CallDuration>426</CallDuration><CallDuration_30_inf>426</CallDuration_30_inf><CallDuration_60_inf>426</CallDuration_60_inf><CallDuration_MC>426</CallDuration_MC><CallDuration_30_60>60</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">553891610047</ServedParty><ServedLocation>7240238</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00461</ServedZone><OtherZone>ZP30411</OtherZone></event_data><dupChk></dupChk><account map_type="2">553891610047</account><other_account map_type="2">55#144</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DEA8-0000DBE8-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report></transaction><start>20121123181421</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF</filename><index_into_file>6</index_into_file></process_info>
I'd like to keep a count of all the different values of the result element, so my output would be something like this:
"D14 - Calls *144" count 2
"Duplicate CDR" count 1
"Error on CDR level; File processing continued." count 1
How can I do this? I guess I could use XML::Twig or XML::Parser, but since there are many start/end tags inside a file, I'm unable to figure out a solution.
5 solutions
#1
1
You could use the excellent DOM parser Mojo::DOM from the Mojolicious suite to count these. It's pretty straightforward: use a hash (%count) to keep track of how often you found each result. This is the typical Perl idiom for this kind of problem.
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;
# read all input lines at once
my $dom = Mojo::DOM->new(do {local $/; <DATA>});
# prepare count hash
my %count = ();
# iterate result elements
$dom->find('result')->each(sub {
    my $element = shift;
    $count{$element->text}++;
});
# output
say "$_: $count{$_}" for keys %count;
__DATA__
<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991222</OtherParty><OtherLocation>55009999999991222</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TIM+ZGNA01-99703-1211241250-D.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>29722352746</networkCallReference><switchIdentity>7274</switchIdentity><originatedCode>1</originatedCode><subscriptionType>1</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BGNA05N</incomingAssignedRoute><translatedNumber>12#222</translatedNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BGNA05N</incomingRoute><outgoingRoute>ZBSA1CO</outgoingRoute><mSCIdentification>11556281138800</mSCIdentification><exchangeIdentity>ZGNA01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>0</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>5</chargeableDuration><timeForStopOfCharge>194949</timeForStopOfCharge><timeForStartOfCharge>194944</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#222</calledPartyNumber><callingSubscriberIMEI>355921042890190</callingSubscriberIMEI><callingSubscriberIMSI>724046008971498</callingSubscriberIMSI><callingPartyNumber>11556281020633</callingPartyNumber><typeOfCallingSubscriber>10</typeOfCallingSubscriber><recordSequenceNumber>2987070</recordSequenceNumber><cal
lIdentificationNumber>1362570</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>724046213C64F8A</cellIDForLastCellCalling><cellIDFor1stCellCalling>7240400C64F8A</cellIDFor1stCellCalling><timeForTCSeizureCalling>194943</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11556281138800</MSC_ID><CallStart>20121123194944</CallStart><CallDuration>5</CallDuration><CallDuration_30_inf>30</CallDuration_30_inf><CallDuration_60_inf>60</CallDuration_60_inf><CallDuration_MC>30</CallDuration_MC><CallDuration_30_60>30</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">556281020633</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP30158</OtherZone></event_data><dupChk></dupChk><account map_type="2">556281020633</account><other_account map_type="2">55#222</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DE9C-0000DB98-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report></transaction><start>20121123194944</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF</filename><index_into_file>6</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="707">Error on CDR level; File processing continued.</result><data><file_info result="partial">CDR-Counter: (IN=16, BAD=0): (NORM_ERR=0 DUP_ERR=0, RAL_ERR=0), DUPLICATE=1, DISCARDED=0, OK=15</file_info></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename></process_info>
<process_info><module>pe_gw_a</module><result code="705">Duplicate CDR</result><data><input><event_data origin_id="asn1"><CallType>mosms</CallType><OtherParty ton="1" npi="1" int_code="55">556291860209</OtherParty><OtherLocation>55006234191860209</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeOtherParty>55</IntCodeOtherParty><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TKM_SMS+STKM01-28129-1211241251-A.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report><CDRType>5</CDRType><serviceCentreAddress>11556291860209</serviceCentreAddress><miscellaneousInformation>41</miscellaneousInformation><gSMTeleServiceCode>34</gSMTeleServiceCode><cellIDFor1stCellCalling>7240462003E0000</cellIDFor1stCellCalling><mSCIdentification>11551189848200</mSCIdentification><exchangeIdentity>STKM01</exchangeIdentity><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeForStartOfCharge>124619</timeForStartOfCharge><dateForStartOfCharge>20121124</dateForStartOfCharge><callingSubscriberIMSI>724046012529641</callingSubscriberIMSI><callingPartyNumber>11556282361092</callingPartyNumber></original_cdr><TypeOfCommunication>sms</TypeOfCommunication><CallDuration>0.9</CallDuration><CallStart>20121124124619</CallStart><MSC_ID>11551189848200</MSC_ID><ServedParty int_code="55" ton="1" npi="1">556282361092</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>Sms_SMS___TIM_TIM</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP37744</OtherZone></event_data><dupChk></dupChk><account map_type="2">556282361092</account><other_account map_type="2">556291860209</other_account><operation alternate_rating="1" type="charge"/><transaction 
id="0000000050B0DE9D-0000DBC9-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report></transaction><start>20121124124619</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename><index_into_file>15</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991144</OtherParty><OtherLocation>55009999999991144</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TMX+ZBHE01-95068-1211241251-AG.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>447382755812</networkCallReference><switchIdentity>6628</switchIdentity><originatedCode>1</originatedCode><subscriptionType>21</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BMCL01B</incomingAssignedRoute><translatedNumber>12#144</translatedNumber><originatingLocationNumber>11553191938800</originatingLocationNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BMCL01B</incomingRoute><outgoingRoute>XMCL1AO</outgoingRoute><mSCIdentification>11553191938800</mSCIdentification><exchangeIdentity>ZBHE01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>38</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>4</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>426</chargeableDuration><timeForStopOfCharge>182128</timeForStopOfCharge><timeForStartOfCharge>181421</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#144</calledPartyNumber><callingSubscriberIMEI>358855043501160</callingSubscriberIMEI><callingSubscriberIMSI>724023016557605</callingSubscriberIMSI><callingPartyNumber>11553891610047</callingPartyNumber><typeOfCallingSubscriber>10</typeO
fCallingSubscriber><recordSequenceNumber>1489944</recordSequenceNumber><callIdentificationNumber>11705419</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>7240238279ADEE5</cellIDForLastCellCalling><cellIDFor1stCellCalling>72402009ADEE5</cellIDFor1stCellCalling><timeForTCSeizureCalling>181417</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11553191938800</MSC_ID><CallStart>20121123181421</CallStart><CallDuration>426</CallDuration><CallDuration_30_inf>426</CallDuration_30_inf><CallDuration_60_inf>426</CallDuration_60_inf><CallDuration_MC>426</CallDuration_MC><CallDuration_30_60>60</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">553891610047</ServedParty><ServedLocation>7240238</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00461</ServedZone><OtherZone>ZP30411</OtherZone></event_data><dupChk></dupChk><account map_type="2">553891610047</account><other_account map_type="2">55#144</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DEA8-0000DBE8-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report></transaction><start>20121123181421</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF</filename><index_into_file>6</index_into_file></process_info>
Output:
Duplicate CDR: 1
Error on CDR level; File processing continued.: 1
D14 - Calls *144: 2
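The order returned by keys %count is not defined, so the lines above can come out in a different order on each run. A minimal sketch of a reporting helper that prints the counts in the format asked for in the question, highest count first; the name format_counts is hypothetical and is not part of Mojo::DOM:

```perl
use strict;
use warnings;

# Hypothetical helper: turn a hash of counts into report lines,
# highest count first, ties broken alphabetically by result text.
sub format_counts {
    my ($count) = @_;
    return map  { sprintf '"%s" count %d', $_, $count->{$_} }
           sort { $count->{$b} <=> $count->{$a} or $a cmp $b }
           keys %$count;
}

# Example input: the counts from the sample data in the question.
my %count = (
    'D14 - Calls *144' => 2,
    'Duplicate CDR'    => 1,
    'Error on CDR level; File processing continued.' => 1,
);
print "$_\n" for format_counts(\%count);
```

Passing the hash by reference keeps the helper independent of how %count was filled, so it should work equally with the other answers here.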
#2
1
This is conveniently done with any of the Perl XML modules, but since you mention XML::Twig, that is what I have used in this solution.
You say there may be many similar XML files but do not say how they are to be identified, so all I can do is offer you a solution for a single file and hope you can extrapolate from here.
The program works by reading the file line by line, parsing each line as a separate XML document, and extracting the text of the first child of the root element that is named result. This text is used as a hash key to keep track of the number of occurrences of each different result.
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->new;
my %results;
open my $fh, '<', 'my.xml' or die $!;
while (<$fh>) {
    $twig->parse($_);
    my $result = $twig->root->first_child('result');
    if ($result) {
        $result = $result->trimmed_text;
        $results{$result}++;
    }
}
for (sort keys %results) {
    my $n = $results{$_};
    printf qq("%s" count %d\n), $_, $n;
}
output
"D14 - Calls *144" count 2
"Duplicate CDR" count 1
"Error on CDR level; File processing continued." count 1
#3
0
You could use XML::SAX::PurePerl; it is very robust and, in my experience, handles messy XML well:
#!/usr/bin/env perl
package Result::Extractor;
use strict;
use warnings qw(all);
use base qw(XML::SAX::Base);
sub new {
    return bless {
        count => {},
        data  => '',
    };
}
sub start_element {
    my ($self, $el) = @_;
    $self->{data} = '';
}
sub end_element {
    my ($self, $el) = @_;
    if ($el->{Name} eq 'result') {
        ++$self->{count}{$self->{data}};
    }
}
sub characters {
    my ($self, $data) = @_;
    $self->{data} .= $data->{Data};
}
1;
package main;
use strict;
use warnings qw(all);
use Data::Printer;
use XML::SAX::PurePerl;
my $handler = Result::Extractor->new;
my $parser = XML::SAX::PurePerl->new(Handler => $handler);
$parser->parse_string(do { local $/; '<wrapper>' . <DATA> . '</wrapper>' });
p $handler->{count};
__DATA__
<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991222</OtherParty><OtherLocation>55009999999991222</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TIM+ZGNA01-99703-1211241250-D.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>29722352746</networkCallReference><switchIdentity>7274</switchIdentity><originatedCode>1</originatedCode><subscriptionType>1</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BGNA05N</incomingAssignedRoute><translatedNumber>12#222</translatedNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BGNA05N</incomingRoute><outgoingRoute>ZBSA1CO</outgoingRoute><mSCIdentification>11556281138800</mSCIdentification><exchangeIdentity>ZGNA01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>0</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>5</chargeableDuration><timeForStopOfCharge>194949</timeForStopOfCharge><timeForStartOfCharge>194944</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#222</calledPartyNumber><callingSubscriberIMEI>355921042890190</callingSubscriberIMEI><callingSubscriberIMSI>724046008971498</callingSubscriberIMSI><callingPartyNumber>11556281020633</callingPartyNumber><typeOfCallingSubscriber>10</typeOfCallingSubscriber><recordSequenceNumber>2987070</recordSequenceNumber><cal
lIdentificationNumber>1362570</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>724046213C64F8A</cellIDForLastCellCalling><cellIDFor1stCellCalling>7240400C64F8A</cellIDFor1stCellCalling><timeForTCSeizureCalling>194943</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11556281138800</MSC_ID><CallStart>20121123194944</CallStart><CallDuration>5</CallDuration><CallDuration_30_inf>30</CallDuration_30_inf><CallDuration_60_inf>60</CallDuration_60_inf><CallDuration_MC>30</CallDuration_MC><CallDuration_30_60>30</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">556281020633</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP30158</OtherZone></event_data><dupChk></dupChk><account map_type="2">556281020633</account><other_account map_type="2">55#222</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DE9C-0000DB98-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report></transaction><start>20121123194944</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF</filename><index_into_file>6</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="707">Error on CDR level; File processing continued.</result><data><file_info result="partial">CDR-Counter: (IN=16, BAD=0): (NORM_ERR=0 DUP_ERR=0, RAL_ERR=0), DUPLICATE=1, DISCARDED=0, OK=15</file_info></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename></process_info>
<process_info><module>pe_gw_a</module><result code="705">Duplicate CDR</result><data><input><event_data origin_id="asn1"><CallType>mosms</CallType><OtherParty ton="1" npi="1" int_code="55">556291860209</OtherParty><OtherLocation>55006234191860209</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeOtherParty>55</IntCodeOtherParty><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TKM_SMS+STKM01-28129-1211241251-A.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report><CDRType>5</CDRType><serviceCentreAddress>11556291860209</serviceCentreAddress><miscellaneousInformation>41</miscellaneousInformation><gSMTeleServiceCode>34</gSMTeleServiceCode><cellIDFor1stCellCalling>7240462003E0000</cellIDFor1stCellCalling><mSCIdentification>11551189848200</mSCIdentification><exchangeIdentity>STKM01</exchangeIdentity><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeForStartOfCharge>124619</timeForStartOfCharge><dateForStartOfCharge>20121124</dateForStartOfCharge><callingSubscriberIMSI>724046012529641</callingSubscriberIMSI><callingPartyNumber>11556282361092</callingPartyNumber></original_cdr><TypeOfCommunication>sms</TypeOfCommunication><CallDuration>0.9</CallDuration><CallStart>20121124124619</CallStart><MSC_ID>11551189848200</MSC_ID><ServedParty int_code="55" ton="1" npi="1">556282361092</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>Sms_SMS___TIM_TIM</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP37744</OtherZone></event_data><dupChk></dupChk><account map_type="2">556282361092</account><other_account map_type="2">556291860209</other_account><operation alternate_rating="1" type="charge"/><transaction 
id="0000000050B0DE9D-0000DBC9-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report></transaction><start>20121124124619</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename><index_into_file>15</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991144</OtherParty><OtherLocation>55009999999991144</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TMX+ZBHE01-95068-1211241251-AG.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>447382755812</networkCallReference><switchIdentity>6628</switchIdentity><originatedCode>1</originatedCode><subscriptionType>21</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BMCL01B</incomingAssignedRoute><translatedNumber>12#144</translatedNumber><originatingLocationNumber>11553191938800</originatingLocationNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BMCL01B</incomingRoute><outgoingRoute>XMCL1AO</outgoingRoute><mSCIdentification>11553191938800</mSCIdentification><exchangeIdentity>ZBHE01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>38</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>4</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>426</chargeableDuration><timeForStopOfCharge>182128</timeForStopOfCharge><timeForStartOfCharge>181421</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#144</calledPartyNumber><callingSubscriberIMEI>358855043501160</callingSubscriberIMEI><callingSubscriberIMSI>724023016557605</callingSubscriberIMSI><callingPartyNumber>11553891610047</callingPartyNumber><typeOfCallingSubscriber>10</typeO
fCallingSubscriber><recordSequenceNumber>1489944</recordSequenceNumber><callIdentificationNumber>11705419</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>7240238279ADEE5</cellIDForLastCellCalling><cellIDFor1stCellCalling>72402009ADEE5</cellIDFor1stCellCalling><timeForTCSeizureCalling>181417</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11553191938800</MSC_ID><CallStart>20121123181421</CallStart><CallDuration>426</CallDuration><CallDuration_30_inf>426</CallDuration_30_inf><CallDuration_60_inf>426</CallDuration_60_inf><CallDuration_MC>426</CallDuration_MC><CallDuration_30_60>60</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">553891610047</ServedParty><ServedLocation>7240238</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00461</ServedZone><OtherZone>ZP30411</OtherZone></event_data><dupChk></dupChk><account map_type="2">553891610047</account><other_account map_type="2">55#144</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DEA8-0000DBE8-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report></transaction><start>20121123181421</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF</filename><index_into_file>6</index_into_file></process_info>
Result:
\ {
'Duplicate CDR' 1,
'D14 - Calls *144' 2,
'Error on CDR level; File processing continued.' 1
}
You can also check XML::SAX::Expat, XML::SAX::ExpatXS and XML::LibXML::SAX; they are faster, but more error-prone.
#4
-1
perl -MXML::Twig -E'XML::Twig->new( twig_handlers => { result => sub { $count{$_->text}++ } })->parsefile( $ARGV[0]); say "$_: $count{$_}" foreach sort keys %count; ' count.xml
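would work, if your data were XML.
It is not: the file has many top-level process_info elements instead of a single root element, so it is not a well-formed XML document.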
#5
-1
If you assume that every <result>...</result> element is one you are interested in, you might be able to get away with a regex:
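use File::Slurp qw(read_file);   # provides read_file
my $doc = read_file("file.xml"); # slurp in the doc
my %count;
# \b and [^>]* keep the match inside a single <result ...> start tag;
# /s lets the element text span a line break
while ($doc =~ m{<result\b[^>]*>(.*?)</result>}gs) {
    $count{$1}++;
}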
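Since the question mentions that there may be many similar files, the same regex idea can be extended to several files at once. The following is only a sketch under stated assumptions: the helper name count_results is hypothetical, the file names are taken from the command line, and each result element is assumed to sit entirely within one line, as in the sample data. It uses only core Perl.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Count the text of every <result ...>...</result> element in the given files.
# count_results is a hypothetical helper name, not part of any module.
sub count_results {
    my @files = @_;
    my %count;
    for my $file (@files) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        while (my $line = <$fh>) {
            # \b and [^>]* keep the match inside one start tag
            $count{$1}++ while $line =~ m{<result\b[^>]*>(.*?)</result>}g;
        }
        close $fh;
    }
    return \%count;
}

# When run as a script: perl count_results.pl file1.xml file2.xml ...
unless (caller) {
    my $count = count_results(@ARGV);
    print "$_: $count->{$_}\n" for sort keys %$count;
}
```

Reading line by line keeps memory use low for large files, at the cost of missing any result element that is split across lines.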
But I would use a real XML processing library for this, like XML::XPath. It's very easy to adapt the example program for XML::XPath to your XML file:
use XML::XPath;
use XML::XPath::XMLParser;
my $xp = XML::XPath->new(filename => 'file.xml');
my $nodeset = $xp->find('/zzz/process_info/result'); # find all results
my %count;
foreach my $node ($nodeset->get_nodelist) {
$count{ $node->string_value } ++;
}
Note that I am using an XPath of /zzz/... because the top level of an XML document must be a single element, so I enclosed your example in <zzz>...</zzz>.
This is a much more robust solution since it will only locate result elements which are children of process_info elements.
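The enclosing root element does not have to be added to the file itself; it can be wrapped around the content in memory. A minimal sketch, assuming the file contains only process_info fragments (no XML declaration) and that XML::XPath is installed; the helper name count_results_xpath is hypothetical:

```perl
use strict;
use warnings;
use XML::XPath;

# count_results_xpath is a hypothetical helper name.
sub count_results_xpath {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $body = do { local $/; <$fh> };   # slurp the whole file
    close $fh;

    # Wrap the fragments in a single root element so the parser accepts them.
    my $xp = XML::XPath->new(xml => "<zzz>$body</zzz>");

    my %count;
    for my $node ($xp->find('/zzz/process_info/result')->get_nodelist) {
        $count{ $node->string_value }++;
    }
    return \%count;
}
```

The sub returns a hash reference mapping each result text to its count, so it can be called once per file and the counts merged afterwards.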
#1
1
You could use the excellent DOM parser Mojo::DOM from the Mojolicious suite to count these. It's pretty straightforward. Use a hash (%count) to keep track of how often you found a result. This is the typical Perl idiom for this kind of problem.
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;
# read all input lines at once
my $dom = Mojo::DOM->new(do {local $/; <DATA>});
# prepare count hash
my %count = ();
# iterate result elements
$dom->find('result')->each(sub {
my $element = shift;
$count{$element->text}++;
});
# output
say "$_: $count{$_}" for keys %count;
__DATA__
<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991222</OtherParty><OtherLocation>55009999999991222</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TIM+ZGNA01-99703-1211241250-D.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>29722352746</networkCallReference><switchIdentity>7274</switchIdentity><originatedCode>1</originatedCode><subscriptionType>1</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BGNA05N</incomingAssignedRoute><translatedNumber>12#222</translatedNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BGNA05N</incomingRoute><outgoingRoute>ZBSA1CO</outgoingRoute><mSCIdentification>11556281138800</mSCIdentification><exchangeIdentity>ZGNA01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>0</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>5</chargeableDuration><timeForStopOfCharge>194949</timeForStopOfCharge><timeForStartOfCharge>194944</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#222</calledPartyNumber><callingSubscriberIMEI>355921042890190</callingSubscriberIMEI><callingSubscriberIMSI>724046008971498</callingSubscriberIMSI><callingPartyNumber>11556281020633</callingPartyNumber><typeOfCallingSubscriber>10</typeOfCallingSubscriber><recordSequenceNumber>2987070</recordSequenceNumber><cal
lIdentificationNumber>1362570</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>724046213C64F8A</cellIDForLastCellCalling><cellIDFor1stCellCalling>7240400C64F8A</cellIDFor1stCellCalling><timeForTCSeizureCalling>194943</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11556281138800</MSC_ID><CallStart>20121123194944</CallStart><CallDuration>5</CallDuration><CallDuration_30_inf>30</CallDuration_30_inf><CallDuration_60_inf>60</CallDuration_60_inf><CallDuration_MC>30</CallDuration_MC><CallDuration_30_60>30</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">556281020633</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP30158</OtherZone></event_data><dupChk></dupChk><account map_type="2">556281020633</account><other_account map_type="2">55#222</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DE9C-0000DB98-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report></transaction><start>20121123194944</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF</filename><index_into_file>6</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="707">Error on CDR level; File processing continued.</result><data><file_info result="partial">CDR-Counter: (IN=16, BAD=0): (NORM_ERR=0 DUP_ERR=0, RAL_ERR=0), DUPLICATE=1, DISCARDED=0, OK=15</file_info></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename></process_info>
<process_info><module>pe_gw_a</module><result code="705">Duplicate CDR</result><data><input><event_data origin_id="asn1"><CallType>mosms</CallType><OtherParty ton="1" npi="1" int_code="55">556291860209</OtherParty><OtherLocation>55006234191860209</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeOtherParty>55</IntCodeOtherParty><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TKM_SMS+STKM01-28129-1211241251-A.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report><CDRType>5</CDRType><serviceCentreAddress>11556291860209</serviceCentreAddress><miscellaneousInformation>41</miscellaneousInformation><gSMTeleServiceCode>34</gSMTeleServiceCode><cellIDFor1stCellCalling>7240462003E0000</cellIDFor1stCellCalling><mSCIdentification>11551189848200</mSCIdentification><exchangeIdentity>STKM01</exchangeIdentity><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeForStartOfCharge>124619</timeForStartOfCharge><dateForStartOfCharge>20121124</dateForStartOfCharge><callingSubscriberIMSI>724046012529641</callingSubscriberIMSI><callingPartyNumber>11556282361092</callingPartyNumber></original_cdr><TypeOfCommunication>sms</TypeOfCommunication><CallDuration>0.9</CallDuration><CallStart>20121124124619</CallStart><MSC_ID>11551189848200</MSC_ID><ServedParty int_code="55" ton="1" npi="1">556282361092</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>Sms_SMS___TIM_TIM</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP37744</OtherZone></event_data><dupChk></dupChk><account map_type="2">556282361092</account><other_account map_type="2">556291860209</other_account><operation alternate_rating="1" type="charge"/><transaction 
id="0000000050B0DE9D-0000DBC9-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report></transaction><start>20121124124619</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename><index_into_file>15</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991144</OtherParty><OtherLocation>55009999999991144</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TMX+ZBHE01-95068-1211241251-AG.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>447382755812</networkCallReference><switchIdentity>6628</switchIdentity><originatedCode>1</originatedCode><subscriptionType>21</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BMCL01B</incomingAssignedRoute><translatedNumber>12#144</translatedNumber><originatingLocationNumber>11553191938800</originatingLocationNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BMCL01B</incomingRoute><outgoingRoute>XMCL1AO</outgoingRoute><mSCIdentification>11553191938800</mSCIdentification><exchangeIdentity>ZBHE01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>38</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>4</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>426</chargeableDuration><timeForStopOfCharge>182128</timeForStopOfCharge><timeForStartOfCharge>181421</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#144</calledPartyNumber><callingSubscriberIMEI>358855043501160</callingSubscriberIMEI><callingSubscriberIMSI>724023016557605</callingSubscriberIMSI><callingPartyNumber>11553891610047</callingPartyNumber><typeOfCallingSubscriber>10</typeO
fCallingSubscriber><recordSequenceNumber>1489944</recordSequenceNumber><callIdentificationNumber>11705419</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>7240238279ADEE5</cellIDForLastCellCalling><cellIDFor1stCellCalling>72402009ADEE5</cellIDFor1stCellCalling><timeForTCSeizureCalling>181417</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11553191938800</MSC_ID><CallStart>20121123181421</CallStart><CallDuration>426</CallDuration><CallDuration_30_inf>426</CallDuration_30_inf><CallDuration_60_inf>426</CallDuration_60_inf><CallDuration_MC>426</CallDuration_MC><CallDuration_30_60>60</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">553891610047</ServedParty><ServedLocation>7240238</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00461</ServedZone><OtherZone>ZP30411</OtherZone></event_data><dupChk></dupChk><account map_type="2">553891610047</account><other_account map_type="2">55#144</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DEA8-0000DBE8-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report></transaction><start>20121123181421</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF</filename><index_into_file>6</index_into_file></process_info>
Output:
Duplicate CDR: 1
Error on CDR level; File processing continued.: 1
D14 - Calls *144: 2
#2
1
This is conveniently done with any of the Perl XML modules, but since you mention XML::Twig, that is what I have used in this solution.
You say there may be many similar XML files but do not say how they are to be identified, so all I can do is offer you a solution for a single file and hope you can extrapolate from here.
The program works by reading the file line by line, parsing each line as a separate XML document, and extracting the text of the first child of the root element that is a result element. This text is used as a hash key to count the occurrences of each distinct result.
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->new;
my %results;
open my $fh, '<', 'my.xml' or die $!;
while (<$fh>) {
$twig->parse($_);
my $result = $twig->root->first_child('result');
if ($result) {
$result = $result->trimmed_text;
$results{$result}++;
}
}
for (sort keys %results) {
my $n = $results{$_};
printf qq("%s" count %d\n), $_, $n;
}
output
"D14 - Calls *144" count 2
"Duplicate CDR" count 1
"Error on CDR level; File processing continued." count 1
#3
0
You could use XML::SAX::PurePerl; it is very robust and, in my experience, handles messy XML well:
#!/usr/bin/env perl
package Result::Extractor;
use strict;
use warnings qw(all);
use base qw(XML::SAX::Base);
sub new {
return bless {
count => {},
data => '',
};
}
sub start_element {
my ($self, $el) = @_;
$self->{data} = '';
}
sub end_element {
my ($self, $el) = @_;
if ($el->{Name} eq 'result') {
++$self->{count}{$self->{data}};
}
}
sub characters {
my ($self, $data) = @_;
$self->{data} .= $data->{Data};
}
1;
package main;
use strict;
use warnings qw(all);
use Data::Printer;
use XML::SAX::PurePerl;
my $handler = Result::Extractor->new;
my $parser = XML::SAX::PurePerl->new(Handler => $handler);
$parser->parse_string(do { local $/; '<wrapper>' . <DATA> . '</wrapper>' });
p $handler->{count};
__DATA__
<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991222</OtherParty><OtherLocation>55009999999991222</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TIM+ZGNA01-99703-1211241250-D.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>29722352746</networkCallReference><switchIdentity>7274</switchIdentity><originatedCode>1</originatedCode><subscriptionType>1</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BGNA05N</incomingAssignedRoute><translatedNumber>12#222</translatedNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BGNA05N</incomingRoute><outgoingRoute>ZBSA1CO</outgoingRoute><mSCIdentification>11556281138800</mSCIdentification><exchangeIdentity>ZGNA01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>0</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>5</chargeableDuration><timeForStopOfCharge>194949</timeForStopOfCharge><timeForStartOfCharge>194944</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#222</calledPartyNumber><callingSubscriberIMEI>355921042890190</callingSubscriberIMEI><callingSubscriberIMSI>724046008971498</callingSubscriberIMSI><callingPartyNumber>11556281020633</callingPartyNumber><typeOfCallingSubscriber>10</typeOfCallingSubscriber><recordSequenceNumber>2987070</recordSequenceNumber><cal
lIdentificationNumber>1362570</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>724046213C64F8A</cellIDForLastCellCalling><cellIDFor1stCellCalling>7240400C64F8A</cellIDFor1stCellCalling><timeForTCSeizureCalling>194943</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11556281138800</MSC_ID><CallStart>20121123194944</CallStart><CallDuration>5</CallDuration><CallDuration_30_inf>30</CallDuration_30_inf><CallDuration_60_inf>60</CallDuration_60_inf><CallDuration_MC>30</CallDuration_MC><CallDuration_30_60>30</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">556281020633</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP30158</OtherZone></event_data><dupChk></dupChk><account map_type="2">556281020633</account><other_account map_type="2">55#222</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DE9C-0000DB98-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF;TICKET=6</Report></transaction><start>20121123194944</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TIM+ZGNA01-99703-1211241250-D.TTF</filename><index_into_file>6</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="707">Error on CDR level; File processing continued.</result><data><file_info result="partial">CDR-Counter: (IN=16, BAD=0): (NORM_ERR=0 DUP_ERR=0, RAL_ERR=0), DUPLICATE=1, DISCARDED=0, OK=15</file_info></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename></process_info>
<process_info><module>pe_gw_a</module><result code="705">Duplicate CDR</result><data><input><event_data origin_id="asn1"><CallType>mosms</CallType><OtherParty ton="1" npi="1" int_code="55">556291860209</OtherParty><OtherLocation>55006234191860209</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeOtherParty>55</IntCodeOtherParty><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TKM_SMS+STKM01-28129-1211241251-A.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report><CDRType>5</CDRType><serviceCentreAddress>11556291860209</serviceCentreAddress><miscellaneousInformation>41</miscellaneousInformation><gSMTeleServiceCode>34</gSMTeleServiceCode><cellIDFor1stCellCalling>7240462003E0000</cellIDFor1stCellCalling><mSCIdentification>11551189848200</mSCIdentification><exchangeIdentity>STKM01</exchangeIdentity><originForCharging>62</originForCharging><chargedParty>00</chargedParty><timeForStartOfCharge>124619</timeForStartOfCharge><dateForStartOfCharge>20121124</dateForStartOfCharge><callingSubscriberIMSI>724046012529641</callingSubscriberIMSI><callingPartyNumber>11556282361092</callingPartyNumber></original_cdr><TypeOfCommunication>sms</TypeOfCommunication><CallDuration>0.9</CallDuration><CallStart>20121124124619</CallStart><MSC_ID>11551189848200</MSC_ID><ServedParty int_code="55" ton="1" npi="1">556282361092</ServedParty><ServedLocation>7240462</ServedLocation><ScenarioName>Sms_SMS___TIM_TIM</ScenarioName><ServedZone>ZO00031</ServedZone><OtherZone>ZP37744</OtherZone></event_data><dupChk></dupChk><account map_type="2">556282361092</account><other_account map_type="2">556291860209</other_account><operation alternate_rating="1" type="charge"/><transaction 
id="0000000050B0DE9D-0000DBC9-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF;TICKET=15</Report></transaction><start>20121124124619</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TKM_SMS+STKM01-28129-1211241251-A.TTF</filename><index_into_file>15</index_into_file></process_info>
<process_info><module>pe_gw_a</module><result code="3">D14 - Calls *144</result><data><input><event_data origin_id="asn1"><CallType>moc</CallType><OtherParty ton="2" npi="1" int_code="55">55009999999991144</OtherParty><OtherLocation>55009999999991144</OtherLocation><IntCodeCallingPartyNumber>55</IntCodeCallingPartyNumber><IntCodeServedParty>55</IntCodeServedParty><TicketType>0</TicketType><original_cdr FILENAME="TMX+ZBHE01-95068-1211241251-AG.TTF"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report><CDRType>1</CDRType><networkCallReference>447382755812</networkCallReference><switchIdentity>6628</switchIdentity><originatedCode>1</originatedCode><subscriptionType>21</subscriptionType><speechCoderPreferenceList>2010005030</speechCoderPreferenceList><radioChannelProperty>30</radioChannelProperty><incomingAssignedRoute>BMCL01B</incomingAssignedRoute><translatedNumber>12#144</translatedNumber><originatingLocationNumber>11553191938800</originatingLocationNumber><miscellaneousInformation>0</miscellaneousInformation><incomingRoute>BMCL01B</incomingRoute><outgoingRoute>XMCL1AO</outgoingRoute><mSCIdentification>11553191938800</mSCIdentification><exchangeIdentity>ZBHE01</exchangeIdentity><tariffClass>0010</tariffClass><chargingCase>1</chargingCase><originForCharging>38</originForCharging><chargedParty>00</chargedParty><timeFromRegisterSeizureToStartOfCharging>4</timeFromRegisterSeizureToStartOfCharging><interruptionTime>0</interruptionTime><chargeableDuration>426</chargeableDuration><timeForStopOfCharge>182128</timeForStopOfCharge><timeForStartOfCharge>181421</timeForStartOfCharge><dateForStartOfCharge>20121123</dateForStartOfCharge><disconnectingParty>00</disconnectingParty><calledPartyNumber>12#144</calledPartyNumber><callingSubscriberIMEI>358855043501160</callingSubscriberIMEI><callingSubscriberIMSI>724023016557605</callingSubscriberIMSI><callingPartyNumber>11553891610047</callingPartyNumber><typeOfCallingSubscriber>10</typeO
fCallingSubscriber><recordSequenceNumber>1489944</recordSequenceNumber><callIdentificationNumber>11705419</callIdentificationNumber><tAC>721421</tAC><internalCauseAndLoc>3</internalCauseAndLoc><eosInfo>00</eosInfo><callPosition>30</callPosition><firstRadioChannelUsed>00</firstRadioChannelUsed><gSMTeleServiceCode>17</gSMTeleServiceCode><cellIDForLastCellCalling>7240238279ADEE5</cellIDForLastCellCalling><cellIDFor1stCellCalling>72402009ADEE5</cellIDFor1stCellCalling><timeForTCSeizureCalling>181417</timeForTCSeizureCalling></original_cdr><TypeOfCommunication>voi</TypeOfCommunication><MSC_ID>11553191938800</MSC_ID><CallStart>20121123181421</CallStart><CallDuration>426</CallDuration><CallDuration_30_inf>426</CallDuration_30_inf><CallDuration_60_inf>426</CallDuration_60_inf><CallDuration_MC>426</CallDuration_MC><CallDuration_30_60>60</CallDuration_30_60><ServedParty ton="1" npi="1" int_code="55">553891610047</ServedParty><ServedLocation>7240238</ServedLocation><ScenarioName>NA</ScenarioName><ServedZone>ZO00461</ServedZone><OtherZone>ZP30411</OtherZone></event_data><dupChk></dupChk><account map_type="2">553891610047</account><other_account map_type="2">55#144</other_account><operation alternate_rating="1" type="charge"/><transaction id="0000000050B0DEA8-0000DBE8-00002876-62F2B0C6"><Report>FILE=/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF;TICKET=6</Report></transaction><start>20121123181421</start></input></data><filename>/gold/rte/data/IncomingCDRs/ASN1/010/TMX+ZBHE01-95068-1211241251-AG.TTF</filename><index_into_file>6</index_into_file></process_info>
Result:
\ {
'Duplicate CDR' 1,
'D14 - Calls *144' 2,
'Error on CDR level; File processing continued.' 1
}
You can also check XML::SAX::Expat, XML::SAX::ExpatXS and XML::LibXML::SAX; they are faster, but more error-prone.