How can I detect the delimiter used in a text file?

Date: 2022-09-10 22:46:29

I need to be able to parse both CSV and TSV files. I can't rely on the users to know the difference, so I would like to avoid asking the user to select the type. Is there a simple way to detect which delimiter is in use?

One way would be to read in every line and count both tabs and commas and find out which is most consistently used in every line. Of course, the data could include commas or tabs, so that may be easier said than done.

Edit: Another fun aspect of this project is that I will also need to detect the schema of the file when I read it in because it could be one of many. This means that I won't know how many fields I have until I can parse it.

13 answers

#1


14  

You could show them the results in a preview window, similar to the way Excel does it. It's pretty clear in that case when the wrong delimiter is being used. You could then let them choose from a range of delimiters and have the preview update in real time.

Then you could just make a simple guess as to the delimiter to start with (e.g. does a comma or a tab come first).
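
A minimal sketch of that starting guess, assuming the only candidates are a comma and a tab. The function name `guess_initial_delimiter` is made up for illustration; the result is only meant to seed the preview, not to be trusted on its own.

```python
def guess_initial_delimiter(text, candidates=(",", "\t")):
    """Return whichever candidate appears earliest in text, or None if neither appears."""
    # str.find returns -1 when the character is absent, so filter those out.
    positions = {c: text.find(c) for c in candidates}
    found = {c: pos for c, pos in positions.items() if pos != -1}
    if not found:
        return None
    return min(found, key=found.get)
```

The user can then override the guess from the preview if it turns out to be wrong.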

#2


15  

In Python, there is a Sniffer class in the csv module that can be used to guess a given file's delimiter and quote characters. Its strategy is (quoted from csv.py's docstrings):

[First, look] for text enclosed between two identical quotes (the probable quotechar) which are preceded and followed by the same character (the probable delimiter). For example:

         ,'some text',

The quote with the most wins, same with the delimiter. If there is no quotechar the delimiter can't be determined this way.

In that case, try the following:

The delimiter should occur the same number of times on each row. However, due to malformed data, it may not. We don't want an all or nothing approach, so we allow for small variations in this number.

  1. build a table of the frequency of each character on every line.
  2. build a table of frequencies of this frequency (meta-frequency?), e.g. 'x occurred 5 times in 10 rows, 6 times in 1000 rows, 7 times in 2 rows'
  3. use the mode of the meta-frequency to determine the expected frequency for that character
  4. find out how often the character actually meets that goal
  5. the character that best meets its goal is the delimiter

For performance reasons, the data is evaluated in chunks, so it can try and evaluate the smallest portion of the data possible, evaluating additional chunks as necessary.
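
The frequency steps quoted above can be sketched roughly as follows. This is a simplified illustration, not the code from csv.py: it ignores chunking and the tolerance threshold, it only looks at a fixed set of candidate characters, and the name `guess_by_consistency` is an assumption.

```python
from collections import Counter

def guess_by_consistency(lines, candidates=",\t;|"):
    """Return (delimiter, score): the candidate whose per-line count most often equals its mode."""
    if not lines:
        return None, 0.0
    best, best_score = None, 0.0
    for char in candidates:
        # Frequency of this character on every line.
        counts = [line.count(char) for line in lines]
        # Meta-frequency: how many lines contain the character n times.
        meta = Counter(counts)
        expected, hits = meta.most_common(1)[0]
        if expected == 0:
            continue  # most lines do not contain this character at all
        # Fraction of lines that meet the expected frequency.
        score = hits / len(lines)
        if score > best_score:
            best, best_score = char, score
    return best, best_score
```

A score of 1.0 means every line has the same count; lower scores indicate some malformed rows.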

I'm not going to quote the source code here - it's in the Lib directory of every Python installation.
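
Using the class is short, though. A minimal sketch, assuming you read a sample of the file into a string first; the wrapper name `sniff_delimiter` and the candidate set are my own choices:

```python
import csv

def sniff_delimiter(sample, candidates=",\t;"):
    """Return the delimiter csv.Sniffer guesses for sample, or None if it gives up."""
    try:
        # Restricting `delimiters` keeps the sniffer from picking letters or spaces.
        dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    except csv.Error:
        return None
    return dialect.delimiter
```

For example, `sniff_delimiter(open(path, newline="").read(4096))` would sniff the first few kilobytes of a file, where `path` is whatever file you are importing.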

Remember that CSV can also use semicolons instead of commas as delimiters (e.g. in German versions of Excel, CSVs are semicolon-delimited because commas are used as decimal separators in Germany).

#3


4  

Do you know how many fields should be present per line? If so, I'd read the first few lines of the file and check based on that.

In my experience, "normal" data quite often contains commas but rarely contains tab characters. This would suggest that you should check for a consistent number of tabs in the first few lines, and go with that choice as a preferred guess. Of course, it depends on exactly what data you've got.
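
A rough sketch of that tab-first preference, assuming the file is either tab- or comma-delimited and that five lines are enough of a sample (both are assumptions, as is the name `prefer_tab`):

```python
def prefer_tab(lines, sample_size=5):
    """Return a tab if the first few non-blank lines all share one non-zero tab count, else a comma."""
    sample = [line for line in lines[:sample_size] if line.strip()]
    # A set of per-line tab counts has exactly one element when the count is consistent.
    tab_counts = {line.count("\t") for line in sample}
    if len(tab_counts) == 1 and tab_counts != {0}:
        return "\t"
    return ","
```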

Ultimately, it would be quite possible to have a file which is completely valid for both formats - so you can't make it absolutely foolproof. It'll have to be a "best effort" job.

#4


3  

It's in PHP but this seems to be quite reliable:

$csv = 'something;something;something
someotherthing;someotherthing;someotherthing
';
$candidates = array(',', ';', "\t");
$csvlines = explode("\n", $csv);
foreach ($candidates as $candidatekey => $candidate) {
 $lastcnt = 0;
 foreach ($csvlines as $csvline) {
  if (strlen($csvline) <= 2) continue;
  $thiscnt = substr_count($csvline, $candidate);
  if (($thiscnt == 0) || (($thiscnt != $lastcnt) && ($lastcnt != 0))) {
   unset($candidates[$candidatekey]);
   break;
  }
  $lastcnt = $thiscnt;
 }
}
$delim = array_shift($candidates);
echo $delim;

What it does is the following: for every candidate delimiter, it reads every line in the CSV and checks whether the number of times that separator occurs is constant. If not, the candidate separator is removed, and ultimately you should end up with one separator.

#5


2  

I'd imagine that your suggested solution would be the best way to go. In a well-formed CSV or TSV file, the number of commas or tabs respectively per line should be constant (no variation at all). Do a count of each for every line of the file, and check which one is constant for all lines. It would seem quite unlikely that the counts of both delimiters are constant across all lines, but in this inconceivably rare case, you could of course prompt the user.

If neither the number of tabs nor the number of commas is constant, then display a message to the user telling them that the file is malformed but the program thinks it is a (whatever format has the lowest standard deviation of delimiters per line) file.
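
A minimal sketch of that idea, assuming only commas and tabs are in play. The name `classify_by_spread` is hypothetical, and the tie case (both counts constant) is not handled here; that is where you would prompt the user.

```python
from statistics import pstdev

def classify_by_spread(lines, candidates=(",", "\t")):
    """Return (delimiter, well_formed) using the spread of per-line delimiter counts."""
    spreads = {}
    for char in candidates:
        counts = [line.count(char) for line in lines]
        if sum(counts) == 0:
            continue  # character never appears, so it cannot be the delimiter
        spreads[char] = pstdev(counts)
    if not spreads:
        return None, False
    # Lowest standard deviation wins; a deviation of zero means a constant count.
    best = min(spreads, key=spreads.get)
    return best, spreads[best] == 0
```

When `well_formed` comes back false, the file is probably malformed and the returned delimiter is only the most likely one.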

#6


2  

Just read a few lines, count the number of commas and the number of tabs, and compare them. If there are 20 commas and no tabs, it's CSV. If there are 20 tabs and 2 commas (maybe in the data), it's TSV.
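
As a sketch, assuming a five-line sample (the sample size and the name `guess_by_total` are arbitrary choices):

```python
def guess_by_total(lines, sample_size=5):
    """Compare total comma and tab counts over the first few lines."""
    sample = lines[:sample_size]
    commas = sum(line.count(",") for line in sample)
    tabs = sum(line.count("\t") for line in sample)
    if commas == tabs:
        return None  # equal totals (including zero) give no basis for a guess
    return "\t" if tabs > commas else ","
```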

#7


2  

I ran into a similar need and thought I would share what I came up with. I haven't run a lot of data through it yet, so there may be edge cases. Also, keep in mind that the goal of this function isn't 100% certainty about the delimiter, but a best guess to present to the user.

using System.Collections.Generic;
using System.Linq;

/// <summary>
/// Analyze the given lines of text and try to determine the correct delimiter used. If multiple
/// candidate delimiters are found, the highest frequency delimiter will be returned.
/// </summary>
/// <example>
/// string discoveredDelimiter = DetectDelimiter(dataLines, new char[] { '\t', '|', ',', ':', ';' });
/// </example>
/// <param name="lines">Lines to inspect</param>
/// <param name="delimiters">Delimiters to search for</param>
/// <returns>The most probable delimiter by usage, or null if none found.</returns>
public string DetectDelimiter(IEnumerable<string> lines, IEnumerable<char> delimiters) {
  Dictionary<char, int> delimFrequency = new Dictionary<char, int>();

  // Setup our frequency tracker for given delimiters
  delimiters.ToList().ForEach(curDelim => 
    delimFrequency.Add(curDelim, 0)
  );

  // Get a total sum of all occurrences of each delimiter in the given lines
  delimFrequency.ToList().ForEach(curDelim => 
    delimFrequency[curDelim.Key] = lines.Sum(line => line.Count(p => p == curDelim.Key))
  );

  // Find delimiters that have a frequency evenly divisible by the number of lines
  // (correct & consistent usage) and order them by largest frequency
  var possibleDelimiters = delimFrequency
                    .Where(f => f.Value > 0 && f.Value % lines.Count() == 0)
                    .OrderByDescending(f => f.Value)
                    .ToList();

  // If more than one possible delimiter found, return the most used one
  if (possibleDelimiters.Any()) {
    return possibleDelimiters.First().Key.ToString();
  }
  else {
    return null;
  }   

}

#8


1  

There is no "efficient" way.

#9


1  

Assuming that there are a fixed number of fields per line and that any commas or tabs within values are enclosed by quotes ("), you should be able to work it out on the frequency of each character in each line. If the fields aren't fixed, this is harder, and if quotes aren't used to enclose otherwise delimiting characters, it will be, I suspect, near impossible (and depending on the data, locale-specific).
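
To make the per-line frequency respect quoting, you would count only the delimiters that sit outside quoted values. A minimal sketch, assuming double quotes with the usual doubled-quote escaping and no quoted fields spanning several lines; `count_unquoted` is a made-up name:

```python
def count_unquoted(line, delimiter, quote='"'):
    """Count occurrences of delimiter in line that are not inside a quoted value."""
    in_quotes = False
    count = 0
    for ch in line:
        if ch == quote:
            # A doubled quote ("") toggles twice, so the state is unchanged afterwards.
            in_quotes = not in_quotes
        elif ch == delimiter and not in_quotes:
            count += 1
    return count
```

If this count is the same on every line for exactly one candidate, that candidate is very likely the delimiter.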

#10


1  

In my experience, data rarely contains tabs, so a line of tab delimited fields would (generally) be fairly obvious.

Commas are more difficult, though, especially if you're reading data in non-US locales. Numerical data can contain huge numbers of commas if you're reading files generated in other countries, since many locales use the comma as the decimal separator in floating-point numbers.

In the end, the only safe thing, though, is usually to try, then present it to the user and allow them to adjust, especially if your data will contain commas and/or tabs.

#11


1  

I would assume that in normal text, tabs are very rare except as the first character(s) on a line -- think indented paragraphs or source code. I think if you find embedded tabs (i.e. ones that don't follow commas), you can assume that the tabs are being used as the delimiters and be correct most of the time. This is just a hunch, not verified with any research. I'd of course give the user the option to override the auto-calculated mode.
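
A small sketch of that hunch, with the same caveat that it is unverified. Here "embedded" is taken to mean a tab that is neither part of the leading indentation nor directly after a comma; the name `has_embedded_tab` is hypothetical.

```python
def has_embedded_tab(line):
    """Return True if line has a tab that is not leading indentation and does not follow a comma."""
    body = line.lstrip()  # drop leading indentation (tabs or spaces)
    return any(
        ch == "\t" and body[i - 1] != ","
        for i, ch in enumerate(body)
        if i > 0
    )
```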

#12


1  

Assuming you have a standard set of columns you are going to expect...

I would use FileHelpers (open source project on SourceForge). http://filehelpers.sourceforge.net/

Define two reader templates, one for commas, one for tabs.

If the first one fails, try the second.
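
The same try-then-fall-back pattern, sketched in Python with the standard csv module rather than FileHelpers. It assumes you know the expected number of columns up front; `parse_with_fallback` and `expected_fields` are illustrative names, not part of any library.

```python
import csv
import io

def parse_with_fallback(text, expected_fields, delimiters=(",", "\t")):
    """Try each delimiter in order and return (delimiter, rows) for the first that fits."""
    for delimiter in delimiters:
        rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
        # A reader "fails" here when any row has the wrong number of columns.
        if rows and all(len(row) == expected_fields for row in rows):
            return delimiter, rows
    raise ValueError("no candidate delimiter produced the expected column count")
```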

#13


0  

You can check whether a line is using one delimiter or another like this:

while ((line = readFile.ReadLine()) != null)
{
    if (line.Split('\t').Length > line.Split(',').Length) // tab delimited or comma delimited?
        row = line.Split('\t');
    else
        row = line.Split(',');

    parsedData.Add(row);
}
