自动检测文件中CSV标题的存在

时间:2022-08-04 23:25:51

Short question: How do I automatically detect whether a CSV file has headers in the first row?

简短问题:如何自动检测CSV文件是否在第一行中有标题?

Details: I've written a small CSV parsing engine that places the data into an object that I can access as (approximately) an in-memory database. The original code was written to parse third-party CSV with a predictable format, but I'd like to be able to use this code more generally.

详细信息:我编写了一个小型CSV解析引擎,将数据放入一个我可以作为(大约)内存数据库访问的对象。编写原始代码是为了解析具有可预测格式的第三方CSV,但我希望能够更普遍地使用此代码。

I'm trying to figure out a reliable way to automatically detect the presence of CSV headers, so the script can decide whether to use the first row of the CSV file as keys / column names or start parsing data immediately. Since all I need is a boolean test, I could easily specify an argument after inspecting the CSV file myself, but I'd rather not have to (go go automation).

我正在尝试找出一种自动检测CSV标头存在的可靠方法,因此脚本可以决定是使用CSV文件的第一行作为键/列名还是立即开始解析数据。因为我需要的只是一个布尔测试,所以我可以在自己检查CSV文件之后轻松指定一个参数,但我宁愿不必(去自动化)。

I imagine I'd have to parse the first 3 to ? rows of the CSV file and look for a pattern of some sort to compare against the headers. I'm having nightmares of three particularly bad cases in which:

我想我必须解析前3个? CSV文件的行,并查找某种类型的模式以与标头进行比较。我正在做三个特别糟糕的噩梦,其中:

  1. The headers include numeric data for some reason
  2. 由于某种原因,标题包括数字数据
  3. The first few rows (or large portions of the CSV) are null
  4. 前几行(或CSV的大部分)为空
  5. There headers and data look too similar to tell them apart
  6. 标题和数据看起来太相似,无法区分它们

If I can get a "best guess" and have the parser fail with an error or spit out a warning if it can't decide, that's OK. If this is something that's going to be tremendously expensive in terms of time or computation (and take more time than it's supposed to save me) I'll happily scrap the idea and go back to working on "important things".

如果我能得到一个“最好的猜测”并让解析器失败并出现错误,或者如果无法做出决定就吐出警告,那就没关系。如果这在时间或计算方面会非常昂贵(并且花费的时间超过它应该拯救我的时间),我很乐意废弃这个想法并回到“重要的事情”上。

I'm working with PHP, but this strikes me as more of an algorithmic / computational question than something that's implementation-specific. If there's a simple algorithm I can use, great. If you can point me to some relevant theory / discussion, that'd be great, too. If there's a giant library that does natural language processing or 300 different kinds of parsing, I'm not interested.

我正在使用PHP,但这让我更像是一个算法/计算问题而不是特定于实现的问题。如果有一个我可以使用的简单算法,那很好。如果你能指出一些相关的理论/讨论,那也很棒。如果有一个巨型库可以进行自然语言处理或300种不同的解析,我对此并不感兴趣。

5 个解决方案

#1


13  

As others have pointed out, you can't do this with 100% reliability. There are cases where getting it 'mostly right' is useful, however - for example, spreadsheet tools with CSV import functionality often try to figure this out on their own. Here's a few heuristics that would tend to indicate the first line isn't a header:

正如其他人所指出的那样,你不能以100%的可靠性来做到这一点。在某些情况下,“大多数情况下”获取它是有用的,例如,具有CSV导入功能的电子表格工具通常会尝试自行解决这个问题。这里有一些启发式方法可能表明第一行不是标题:

  • The first row has columns that are not strings or are empty
  • 第一行包含不是字符串或为空的列
  • The first row's columns are not all unique
  • 第一行的列并非都是唯一的
  • The first row appears to contain dates or other common data formats (eg, xx-xx-xx)
  • 第一行似乎包含日期或其他常见数据格式(例如,xx-xx-xx)

#2


5  

In the most general sense, this is impossible. This is a valid csv file:
Name
Jim
Tom
Bill

从最普遍的意义上讲,这是不可能的。这是一个有效的csv文件:名字Jim Tom Bill

Most csv readers will just take hasHeader as an option, and allow you to pass in your own header if you want. Even in the case you think you can detect, that being character headers and numeric data, you can run into a catastrophic failure. What if your column is a list of BMW series?
M
3
5
7

大多数csv读者只需将hasHeader作为选项,并允许您根据需要传入自己的标题。即使在您认为可以检测到的情况下,即字符标题和数字数据,您也可能会遇到灾难性的失败。如果您的专栏是宝马系列列表怎么办? M 3 5 7

You will process this incorrectly. Worst of all, you will lose the best car!

您将错误地处理此问题。最糟糕的是,你将失去最好的车!

#3


2  

In the purely abstract sense, I don't think there is an foolproof algorithmic answer to your question since it boils down to: "How do I distinguish dataA from dataB if I know nothing about either of them?". There will always be the potential for dataA to be indistinguishable from dataB. That said, I would start with the simple and only add complexity as needed. For example, if examining the first five rows, for a given column (or columns) if the datatype in rows 2-5 are all the same but differ from the datatype in row 1, there's a good chance that a header row is present (increased sample sizes reduce the possibility of error). This would (sorta) solve #1/#3 - perhaps throw an exception if the rows are all populated but the data is indistinguishable to allow the calling program to decide what to do next. For #2, simply don't count a row as a row unless and until it pulls non-null data....that would work in all but an empty file (in which case you'd hit EOF). It would never be foolproof, but it might be "close enough".

在纯粹抽象的意义上,我认为对你的问题没有一个万无一失的算法答案,因为它归结为:“如果我对它们中的任何一个一无所知,我如何区分dataA和dataB?”。 dataA总是有可能与dataB无法区分。也就是说,我会从简单开始,只根据需要增加复杂性。例如,如果检查前五行,对于给定列(或多列),如果第2-5行中的数据类型全部相同但与第1行中的数据类型不同,则很可能存在标题行(增加样本量可减少出错的可能性)。这将(sorta)解决#1 /#3 - 如果所有行都被填充但是数据无法区分以允许调用程序决定下一步该做什么,则可能抛出异常。对于#2,简单地不要将行计为一行,除非并且直到它拉出非空数据....除了空文件之外的所有数据都会起作用(在这种情况下你会点击EOF)。它永远不会是万无一失的,但它可能“足够接近”。

#4


1  

It really depends on just how "general" you want your tool to be. If the data will always be numeric, you have it easy as long as you assume non-numeric headers (which seems like a pretty fair assumption).

这实际上取决于您希望工具的“一般”程度。如果数据总是数字的,只要你假设非数字标题(这似乎是一个非常公平的假设),你就可以轻松实现。

But beyond that, if you don't already know what patterns are present in the data, then you can't really test for them ahead of time.

但除此之外,如果您还不知道数据中存在哪些模式,那么您无法提前测试它们。

FWIW, I actually just wrote a script for parsing out some stuff from TSVs, all from the same source. The source's approach to headers/formatting was so scattered that it made sense to just make the script ask me questions from the command line while executing. (Is this a header? Which columns are important?). So no automation, but it let's me fly through the data sets I'm working on, instead of trying to anticipate each funny formatting case. Also, my answers are saved in a file, so I only have to be involved once per file. Not ideal, but efficient.

FWIW,我实际上只是编写了一个脚本来解析TSV中的一些东西,这些东西都来自同一个源。源代码的标题/格式是如此分散,以至于只需让脚本在执行时从命令行向我询问问题。 (这是一个标题吗?哪些列很重要?)。所以没有自动化,但它让我飞过我正在研究的数据集,而不是试图预测每个有趣的格式化案例。此外,我的答案保存在一个文件中,所以我只需每个文件一次。不理想,但效率很高。

#5


0  

If you CSV has a header like this.

如果CSV有这样的标题。

ID, Name, Email, Date 1, john, john@john.com, 12 jan 2020

ID,姓名,电子邮件,日期1,john,john @ john.com,2020年1月12日

Then doing a filter_var(str, FILTER_VALIDATE_EMAIL) on the header row will fail. Since the email address is only in the row data. So check header row for an email address (assuming your CSV has email addresses in it).

然后在标题行上执行filter_var(str,FILTER_VALIDATE_EMAIL)将失败。由于电子邮件地址仅在行数据中。因此,请检查标题行中的电子邮件地址(假设您的CSV中包含电子邮件地址)。

Second idea. http://php.net/manual/en/function.is-numeric.php Check header row for is_numeric, most likely a header row does not have numeric data in it. But most likely a data row would have numeric data.

第二个想法。 http://php.net/manual/en/function.is-numeric.php检查is_numeric的标题行,很可能标题行中没有数字数据。但很可能数据行会有数字数据。

If you know you have dates in your columns, then checking the header row for a date would also work.

如果您知道列中有日期,那么检查标题行的日期也会有效。

Obviously you need to what type of data you are expecting. I am "expecting" email addresses.

显然,您需要预期的数据类型。我“期待”电子邮件地址。

#1


13  

As others have pointed out, you can't do this with 100% reliability. There are cases where getting it 'mostly right' is useful, however - for example, spreadsheet tools with CSV import functionality often try to figure this out on their own. Here's a few heuristics that would tend to indicate the first line isn't a header:

正如其他人所指出的那样,你不能以100%的可靠性来做到这一点。在某些情况下,“大多数情况下”获取它是有用的,例如,具有CSV导入功能的电子表格工具通常会尝试自行解决这个问题。这里有一些启发式方法可能表明第一行不是标题:

  • The first row has columns that are not strings or are empty
  • 第一行包含不是字符串或为空的列
  • The first row's columns are not all unique
  • 第一行的列并非都是唯一的
  • The first row appears to contain dates or other common data formats (eg, xx-xx-xx)
  • 第一行似乎包含日期或其他常见数据格式(例如,xx-xx-xx)

#2


5  

In the most general sense, this is impossible. This is a valid csv file:
Name
Jim
Tom
Bill

从最普遍的意义上讲,这是不可能的。这是一个有效的csv文件:名字Jim Tom Bill

Most csv readers will just take hasHeader as an option, and allow you to pass in your own header if you want. Even in the case you think you can detect, that being character headers and numeric data, you can run into a catastrophic failure. What if your column is a list of BMW series?
M
3
5
7

大多数csv读者只需将hasHeader作为选项,并允许您根据需要传入自己的标题。即使在您认为可以检测到的情况下,即字符标题和数字数据,您也可能会遇到灾难性的失败。如果您的专栏是宝马系列列表怎么办? M 3 5 7

You will process this incorrectly. Worst of all, you will lose the best car!

您将错误地处理此问题。最糟糕的是,你将失去最好的车!

#3


2  

In the purely abstract sense, I don't think there is an foolproof algorithmic answer to your question since it boils down to: "How do I distinguish dataA from dataB if I know nothing about either of them?". There will always be the potential for dataA to be indistinguishable from dataB. That said, I would start with the simple and only add complexity as needed. For example, if examining the first five rows, for a given column (or columns) if the datatype in rows 2-5 are all the same but differ from the datatype in row 1, there's a good chance that a header row is present (increased sample sizes reduce the possibility of error). This would (sorta) solve #1/#3 - perhaps throw an exception if the rows are all populated but the data is indistinguishable to allow the calling program to decide what to do next. For #2, simply don't count a row as a row unless and until it pulls non-null data....that would work in all but an empty file (in which case you'd hit EOF). It would never be foolproof, but it might be "close enough".

在纯粹抽象的意义上,我认为对你的问题没有一个万无一失的算法答案,因为它归结为:“如果我对它们中的任何一个一无所知,我如何区分dataA和dataB?”。 dataA总是有可能与dataB无法区分。也就是说,我会从简单开始,只根据需要增加复杂性。例如,如果检查前五行,对于给定列(或多列),如果第2-5行中的数据类型全部相同但与第1行中的数据类型不同,则很可能存在标题行(增加样本量可减少出错的可能性)。这将(sorta)解决#1 /#3 - 如果所有行都被填充但是数据无法区分以允许调用程序决定下一步该做什么,则可能抛出异常。对于#2,简单地不要将行计为一行,除非并且直到它拉出非空数据....除了空文件之外的所有数据都会起作用(在这种情况下你会点击EOF)。它永远不会是万无一失的,但它可能“足够接近”。

#4


1  

It really depends on just how "general" you want your tool to be. If the data will always be numeric, you have it easy as long as you assume non-numeric headers (which seems like a pretty fair assumption).

这实际上取决于您希望工具的“一般”程度。如果数据总是数字的,只要你假设非数字标题(这似乎是一个非常公平的假设),你就可以轻松实现。

But beyond that, if you don't already know what patterns are present in the data, then you can't really test for them ahead of time.

但除此之外,如果您还不知道数据中存在哪些模式,那么您无法提前测试它们。

FWIW, I actually just wrote a script for parsing out some stuff from TSVs, all from the same source. The source's approach to headers/formatting was so scattered that it made sense to just make the script ask me questions from the command line while executing. (Is this a header? Which columns are important?). So no automation, but it let's me fly through the data sets I'm working on, instead of trying to anticipate each funny formatting case. Also, my answers are saved in a file, so I only have to be involved once per file. Not ideal, but efficient.

FWIW,我实际上只是编写了一个脚本来解析TSV中的一些东西,这些东西都来自同一个源。源代码的标题/格式是如此分散,以至于只需让脚本在执行时从命令行向我询问问题。 (这是一个标题吗?哪些列很重要?)。所以没有自动化,但它让我飞过我正在研究的数据集,而不是试图预测每个有趣的格式化案例。此外,我的答案保存在一个文件中,所以我只需每个文件一次。不理想,但效率很高。

#5


0  

If you CSV has a header like this.

如果CSV有这样的标题。

ID, Name, Email, Date 1, john, john@john.com, 12 jan 2020

ID,姓名,电子邮件,日期1,john,john @ john.com,2020年1月12日

Then doing a filter_var(str, FILTER_VALIDATE_EMAIL) on the header row will fail. Since the email address is only in the row data. So check header row for an email address (assuming your CSV has email addresses in it).

然后在标题行上执行filter_var(str,FILTER_VALIDATE_EMAIL)将失败。由于电子邮件地址仅在行数据中。因此,请检查标题行中的电子邮件地址(假设您的CSV中包含电子邮件地址)。

Second idea. http://php.net/manual/en/function.is-numeric.php Check header row for is_numeric, most likely a header row does not have numeric data in it. But most likely a data row would have numeric data.

第二个想法。 http://php.net/manual/en/function.is-numeric.php检查is_numeric的标题行,很可能标题行中没有数字数据。但很可能数据行会有数字数据。

If you know you have dates in your columns, then checking the header row for a date would also work.

如果您知道列中有日期,那么检查标题行的日期也会有效。

Obviously you need to what type of data you are expecting. I am "expecting" email addresses.

显然,您需要预期的数据类型。我“期待”电子邮件地址。