Processing XML files with Ruby and Nokogiri

Date: 2021-11-12 00:50:49

I am new to programming, so please bear with me. I have many XML documents that look like this:

File name: PRIDE_Exp_Complete_Ac_10094.xml.gz

<ExperimentCollection version="2.1">
<Experiment>
    <ExperimentAccession>1015</ExperimentAccession>
    <Title>Protein complexes in Saccharomyces cerevisiae (GPM06600002310)</Title>
    <ShortLabel>GPM06600002310</ShortLabel>
    <Protocol>
        <ProtocolName>None</ProtocolName>
    </Protocol>
    <mzData version="1.05" accessionNumber="1015">
        <cvLookup cvLabel="RESID" fullName="RESID Database of Protein Modifications" version="0.0" address="http://www.ebi.ac.uk/RESID/" />
        <cvLookup cvLabel="UNIMOD" fullName="UNIMOD Protein Modifications for Mass Spectrometry" version="0.0" address="http://www.unimod.org/" />
        <description>
            <admin>
                <sampleName>GPM06600002310</sampleName>
                <sampleDescription comment="Ho, Y., et al., Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002 Jan 10;415(6868):180-3.">
                    <cvParam cvLabel="NEWT" accession="4932" name="Saccharomyces cerevisiae (Baker's yeast)" value="Saccharomyces cerevisiae" />
                </sampleDescription>
                            </admin>
        </description>
        <spectrumList count="0" />
    </mzData>
        </Experiment>

I want to extract the text inside the "Title", "ProtocolName", and "sampleName" elements and save it to a text file that has the same name as the .xml.gz file. I have the following code so far (based on posts I saw on this site), but it does not seem to work:

require 'rubygems'
require 'nokogiri'
doc = Nokogiri::XML(File.open("PRIDE_Exp_Complete_Ac_10094.xml.gz"))
@ExperimentCollection = doc.css("ExperimentCollection Title").map {|node| node.children.text }

Can someone help me?

Thanks

1 Answer

#1



If you are happy with REXML, and there is only one <Experiment> per file, then something like the following should help. (By the way, the text above is invalid XML, since it has no closing </ExperimentCollection> tag.)

require "rexml/document"
include REXML
xml=<<EOD
<Experiment>
    <ExperimentAccession>1015</ExperimentAccession>
    <Title>Protein complexes in Saccharomyces cerevisiae (GPM06600002310)</Title>
    <ShortLabel>GPM06600002310</ShortLabel>
    <Protocol>
        <ProtocolName>None</ProtocolName>
    </Protocol>
    <mzData version="1.05" accessionNumber="1015">
        <cvLookup cvLabel="RESID" fullName="RESID Database of Protein Modifications" version="0.0" address="http://www.ebi.ac.uk/RESID/" />
        <cvLookup cvLabel="UNIMOD" fullName="UNIMOD Protein Modifications for Mass Spectrometry" version="0.0" address="http://www.unimod.org/" />
        <description>
            <admin>
                <sampleName>GPM06600002310</sampleName>
                <sampleDescription comment="Ho, Y., et al., Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002 Jan 10;415(6868):180-3.">
                    <cvParam cvLabel="NEWT" accession="4932" name="Saccharomyces cerevisiae (Baker's yeast)" value="Saccharomyces cerevisiae" />
                </sampleDescription>
                            </admin>
        </description>
        <spectrumList count="0" />
    </mzData>
        </Experiment>
EOD

doc = Document.new xml
# Print the text of each element of interest
puts doc.elements["Experiment/Title"].text
puts doc.elements["Experiment/Protocol/ProtocolName"].text
puts doc.elements["Experiment/mzData/description/admin/sampleName"].text
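The snippet above works on an inline string, whereas the question involves gzipped files and a text file as output. Below is a minimal sketch of how the two could be combined, using only Ruby's standard `zlib` library and REXML. The helper names `extract_fields` and `write_fields` and the `name: value` output format are made up for illustration. It assumes each file is well-formed XML (with the closing </ExperimentCollection> tag present) and contains a single <Experiment>.

```ruby
require "rexml/document"
require "zlib"

# Hypothetical helper: return the text of the three elements of interest.
# Assumes one <Experiment> per document; a missing element yields nil.
def extract_fields(xml)
  doc = REXML::Document.new(xml)
  paths = {
    "Title"        => "//Experiment/Title",
    "ProtocolName" => "//Experiment/Protocol/ProtocolName",
    "sampleName"   => "//Experiment/mzData/description/admin/sampleName"
  }
  paths.transform_values do |path|
    node = REXML::XPath.first(doc, path)
    node && node.text
  end
end

# Hypothetical helper: decompress foo.xml.gz, extract the fields, and
# write them to foo.txt next to it. Returns the output path.
def write_fields(gz_path)
  unless gz_path.end_with?(".xml.gz")
    raise ArgumentError, "expected a .xml.gz file, got #{gz_path}"
  end

  # A .gz file must be decompressed before it can be parsed as XML.
  xml = Zlib::GzipReader.open(gz_path) { |gz| gz.read }
  fields = extract_fields(xml)

  out_path = gz_path.sub(/\.xml\.gz\z/, ".txt")
  lines = fields.map { |name, text| "#{name}: #{text}" }
  File.write(out_path, lines.join("\n") + "\n")
  out_path
end

# When run as a script, process every file named on the command line.
ARGV.each { |path| puts write_fields(path) } if __FILE__ == $PROGRAM_NAME
```

Saved as, say, extract.rb (a hypothetical file name), it could be run as `ruby extract.rb PRIDE_Exp_Complete_Ac_10094.xml.gz`, which would write PRIDE_Exp_Complete_Ac_10094.txt. The same decompression step applies when using Nokogiri: the code in the question passes the compressed file straight to the parser, which is a likely reason it does not work.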
