来自csv.Sniffer的has_header为具有相同布局的文件提供不同的结果

时间:2022-02-20 20:29:38

I have the following snippet of code:

我有以下代码片段:

import csv

def has_header(first_lines):
    sniffer = csv.Sniffer()
    return sniffer.has_header(first_lines) 

Where first_lines are the first 2048 bytes of the file. The function works well most of the time and returns True for a file that begins like this:

其中first_lines是文件的前2048个字节。该函数在大多数情况下都能正常工作,并且对于以如下开头的文件返回True:

SPEC#: 1, SIZE: 18473, TIME: 0.000000
1998.304312 2.15686
1998.773585 3.13725
1999.242914 3.13725
1999.712298 2.7451
2000.181736 2.94118
2000.651230 2.94118
2001.120780 2.15686
2001.590384 2.35294
2002.060043 2.94118
2002.529758 3.13725
2002.999527 2.54902
2003.469352 3.13725
2003.939232 1.96078
2004.409167 1.76471
2004.879158 2.94118
2005.349203 3.72549
2005.819304 3.33333
2006.289459 2.35294
2006.759670 1.76471
2007.229936 3.13725
2007.700258 3.52941
2008.170634 3.92157
2008.641065 3.92157
2009.111552 3.52941
2009.582094 4.70588
2010.052691 3.52941
2010.523343 3.33333
2010.994050 1.37255
2011.464812 2.35294
2011.935630 2.15686
2012.406502 3.52941
2012.877430 3.13725
2013.348413 2.15686
2013.819451 1.96078
2014.290544 1.56863
2014.761693 3.13725
2015.232896 1.76471
2015.704155 1.96078
2016.175469 3.33333
2016.646838 4.90196
2017.118262 3.52941
2017.589741 2.94118
2018.061275 1.96078
2018.532865 1.76471
2019.004510 4.11765
2019.476210 3.72549
2019.947965 2.35294
2020.419775 1.56863
2020.891640 2.15686
2021.363560 0.784314
2021.835536 1.37255
2022.307567 2.94118
2022.779653 2.15686
2023.251794 4.11765
2023.723990 4.5098
2024.196241 2.7451
2024.668548 2.54902
2025.140909 1.56863
2025.613326 2.94118
2026.085798 2.35294
2026.558325 2.94118
2027.030907 3.33333
2027.503545 3.52941
2027.976237 3.72549
2028.448985 5.09804
2028.921788 4.11765
2029.394645 3.92157
2029.867559 3.13725
2030.340527 2.15686
2030.813550 2.35294
2031.286629 5.09804
2031.759762 3.33333
2032.232951 3.52941
2032.706195 3.13725
2033.179494 4.70588
2033.652849 4.31373
2034.126258 3.92157
2034.599723 3.33333
2035.073242 3.72549
2035.546817 2.54902
2036.020447 2.35294
2036.494132 2.15686
2036.967873 2.94118
2037.441668 2.7451
2037.915519 2.15686
2038.389425 2.54902
2038.863385 2.54902
2039.337401 2.35294
2039.811473 3.52941
2040.285599 3.52941
2040.759781 4.11765
2041.234017 3.52941
2041.708309 3.72549
2042.182656 3.33333
2042.657058 2.54902
2043.131515 4.11765
2043.606028 3.52941
2044.080595 4.11765
2044.555218 2.35294
2045.029896 1.96078

However, the function returns False for the following file that has the same overall structure:

但是,对于具有相同整体结构的以下文件,该函数返回False:

SPEC#: 1, SIZE: 18474, TIME: 0.000000
1998.228113 36.8627
1998.697368 30.1961
1999.166679 35.8824
1999.636044 41.3726
2000.105465 38.6275
2000.574941 39.0196
2001.044473 41.9608
2001.514059 37.8431
2001.983701 35.098
2002.453397 37.2549
2002.923149 36.4706
2003.392956 40.5882
2003.862818 39.0196
2004.332735 36.6667
2004.802708 33.3333
2005.272735 37.8431
2005.742818 35.098
2006.212955 33.3333
2006.683148 39.0196
2007.153397 41.3726
2007.623700 41.7647
2008.094058 39.4118
2008.564472 40.7843
2009.034940 44.1176
2009.505464 42.1569
2009.976043 40
2010.446677 38.2353
2010.917366 39.2157
2011.388111 39.2157
2011.858910 36.2745
2012.329765 38.0392
2012.800675 42.3529
2013.271640 44.7059
2013.742660 38.0392
2014.213735 40
2014.684866 39.8039
2015.156051 44.902
2015.627292 41.7647
2016.098588 44.3137
2016.569939 43.9216
2017.041345 50.1961
2017.512806 51.5686
2017.984323 46.4706
2018.455894 44.7059
2018.927521 41.9608
2019.399203 46.6667
2019.870940 41.3726
2020.342732 45.8824
2020.814579 45.2941
2021.286482 45.2941
2021.758439 49.2157
2022.230452 42.1569
2022.702520 45.2941
2023.174643 45.4902
2023.646821 40.5882
2024.119054 47.0588
2024.591343 42.1569
2025.063686 43.7255
2025.536085 46.0784
2026.008539 45.098
2026.481048 44.902
2026.953612 50
2027.426231 48.4314
2027.898906 45.098
2028.371636 49.2157
2028.844420 47.8431
2029.317260 51.7647
2029.790155 49.4118
2030.263105 45.8824
2030.736111 51.1765
2031.209171 47.6471
2031.682287 52.549
2032.155458 50.7843
2032.628684 45.6863
2033.101965 48.6275
2033.575301 49.4118
2034.048692 48.2353
2034.522139 49.0196
2034.995641 51.9608
2035.469197 51.3726
2035.942809 50.1961
2036.416476 54.3137
2036.890199 50
2037.363976 48.8235
2037.837809 49.4118
2038.311696 51.9608
2038.785639 55.098
2039.259637 56.2745
2039.733690 50.7843
2040.207798 55.2941
2040.681962 58.2353
2041.156180 56.8627
2041.630454 60
2042.104783 61.1765
2042.579167 64.3137
2043.053606 60.5882
2043.528100 64.3137
2044.002650 62.1569
2044.477254 60.1961
2044.951914 68.2353
2045.426629 62.1569
2045.901399 62.7451

Is this a bug in has_header or some edge-case where the heuristics of has_header fail?

这是has_header中的错误还是某些边缘情况,其中has_header的启发式失败?

Some differences I have noted:

我注意到的一些差异:

  • the variable header on line 394 in csv.py is ['SPEC#: 1, SIZE: 18473, TIME: 0.000000'] (list of length 1) for the first file where the header is correctly determined and ['SPEC#: 1, SIZE: 184', '4, TIME: 0.000000'] (list of length 2) for the second file.

    csv.py第394行的变量头是['SPEC#:1,SIZE:18473,TIME:0.000000'](长度为1的列表),第一个正确确定标题的文件和['SPEC#:1 ,SIZE:第二个文件的184','4,TIME:0.000000'](长度为2的列表)。

  • after creating a dictionary of types of data for each column, the variable columnTypes is {0: None} for the first file and {} for the second file.

    在为每列创建数据类型字典后,第一个文件的变量columnTypes为{0:None},第二个文件为{}。

1 个解决方案

#1


13  

when i try this (with data1 being your first and data2 your second sample:

当我尝试这个时(数据1是你的第一个,data2是你的第二个样本:

for data in (data1, data2):
    s=csv.Sniffer()
    d = s.sniff(data)
    print(repr(d.delimiter))

I get:

我明白了:

' '
'7'

That means the problem is actually the detection of the delimiter. As you can see in docstring of _guess_delimiter the sniffer uses frequency analysis to determine the best candidate for the delimiter, which in this case happens to be '7' instead of space.

这意味着问题实际上是分隔符的检测。正如您在_guess_delimiter的docstring中所看到的,嗅探器使用频率分析来确定分隔符的最佳候选者,在这种情况下恰好是“7”而不是空格。

If you treat your second sample as csv with a delimiter of '7' you'll get this:

如果您将第二个样本视为带有分隔符“7”的csv,您将得到以下结果:

[['SPEC#: 1, SIZE: 184', '3, TIME: 0.000000'],
 ['1998.304312 2.15686'],
 ['1998.', '', '3585 3.13', '25'],
 ['1999.242914 3.13', '25'],
 ['1999.', '12298 2.', '451'],
 ['2000.181', '36 2.94118'],
 ['2000.651230 2.94118'], ...]

where the first row would be a perfectly acceptable data row instead of a header.

其中第一行是完全可接受的数据行而不是标题。

Unfortunately has_header doesn't let you specify the used dialect manually, but you could patch the sniffer's sniff method to return a dialect with the right delimiter ' '.

不幸的是,has_header不允许您手动指定使用过的方言,但是您可以修补嗅探器的嗅探方法以返回带有正确分隔符''的方言。

#1


13  

when i try this (with data1 being your first and data2 your second sample:

当我尝试这个时(数据1是你的第一个,data2是你的第二个样本:

for data in (data1, data2):
    s=csv.Sniffer()
    d = s.sniff(data)
    print(repr(d.delimiter))

I get:

我明白了:

' '
'7'

That means the problem is actually the detection of the delimiter. As you can see in docstring of _guess_delimiter the sniffer uses frequency analysis to determine the best candidate for the delimiter, which in this case happens to be '7' instead of space.

这意味着问题实际上是分隔符的检测。正如您在_guess_delimiter的docstring中所看到的,嗅探器使用频率分析来确定分隔符的最佳候选者,在这种情况下恰好是“7”而不是空格。

If you treat your second sample as csv with a delimiter of '7' you'll get this:

如果您将第二个样本视为带有分隔符“7”的csv,您将得到以下结果:

[['SPEC#: 1, SIZE: 184', '3, TIME: 0.000000'],
 ['1998.304312 2.15686'],
 ['1998.', '', '3585 3.13', '25'],
 ['1999.242914 3.13', '25'],
 ['1999.', '12298 2.', '451'],
 ['2000.181', '36 2.94118'],
 ['2000.651230 2.94118'], ...]

where the first row would be a perfectly acceptable data row instead of a header.

其中第一行是完全可接受的数据行而不是标题。

Unfortunately has_header doesn't let you specify the used dialect manually, but you could patch the sniffer's sniff method to return a dialect with the right delimiter ' '.

不幸的是,has_header不允许您手动指定使用过的方言,但是您可以修补嗅探器的嗅探方法以返回带有正确分隔符''的方言。