I get this kind of malformed CSV data when making an HTTP request for a CSV file

Time: 2022-10-14 17:10:59

I am getting this sort of CSV data when making an HTTP request for a CSV file. The string is badly malformed:

response = '"Subject";"Start Date";"Start Time";"End Date";"End Time";"All day event";"Description""Play football";"16/11/2009";"10:00 PM";"16/11/2009";"11:00 PM";"false";"""Watch 2012";"20/11/2009";"07:00 PM";"20/11/2009";"08:00 PM";"false";""' 

I want to convert this into a list of dictionaries:

[{"Subject": "Play football", "Start Date": "16/11/2009", "Start Time": "10:00 PM", "End Date": "16/11/2009", "End Time": "11:00 PM", "All day event": false, "Description": ""},
 {"Subject": "Watch 2012", "Start Date": "20/11/2009", "Start Time": "07:00 PM", "End Date": "20/11/2009", "End Time": "08:00 PM", "All day event": false, "Description": ""}]

I tried solving this with Python's csv module, but it didn't work.

>>> import csv
>>> from io import StringIO  # Python 3; on Python 2 use: from cStringIO import StringIO

>>> str_obj = StringIO(response)
>>> reader = csv.reader(str_obj, delimiter=';')
>>> [x for x in reader] 
    [['Subject',
      'Start Date',
      'Start Time',
      'End Date',
      'End Time',
      'All day event',
      'Description"Play football',
      '16/11/2009',
      '10:00 PM',
      '16/11/2009',
      '11:00 PM',
      'false',
      '"Watch 2012',
      '20/11/2009',
      '07:00 PM',
      '20/11/2009',
      '08:00 PM',
      'false',
      '']]

I get the result above: everything ends up in a single row, and the last header field is fused with the first data field.
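The fused `Description"Play football` field follows from how the csv module treats quotes: inside a quoted field, a doubled `""` is read as one literal quote character, not as a field or row boundary. A minimal Python 3 sketch of that behaviour (the helper name `parse_semicolon_csv` is made up for illustration):

```python
import csv
from io import StringIO


def parse_semicolon_csv(text):
    """Parse semicolon-delimited CSV text into a list of rows."""
    return list(csv.reader(StringIO(text), delimiter=';'))


# With no newline between the rows, "" is taken as an escaped quote,
# so the last header field and the first data field merge into one.
fused = parse_semicolon_csv('"Description""Play football";"16/11/2009"')
print(fused)      # [['Description"Play football', '16/11/2009']]

# With the newline restored, the same data parses as two separate rows.
separated = parse_semicolon_csv('"Description"\n"Play football";"16/11/2009"')
print(separated)  # [['Description'], ['Play football', '16/11/2009']]
```

So the csv module is behaving as designed here; the missing row separators are what has to be repaired.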

Any sort of help will be appreciated. Thanks in advance.

4 answers

#1


8  

Here's a pyparsing solution:

from pyparsing import QuotedString, Group, delimitedList, OneOrMore

# a row of headings or data is a list of quoted strings, delimited by ';'s
qs = QuotedString('"')
datarow = Group(delimitedList(qs, ';'))

# an entire data set is a single data row containing the headings, followed by
# one or more data rows containing the data
dataset_parser = datarow("headings") + OneOrMore(datarow)("rows")

# parse the returned response
data = dataset_parser.parseString(response)

# create dict by zipping headings with each row's data values
datadict = [dict(zip(data.headings, row)) for row in data.rows]

print(datadict)

Prints:

[{'End Date': '16/11/2009', 'Description': '', 'All day event': 'false', 
  'Start Time': '10:00 PM', 'End Time': '11:00 PM', 'Start Date': '16/11/2009', 
  'Subject': 'Play football'}, 
 {'End Date': '20/11/2009', 'Description': '', 'All day event': 'false', 
  'Start Time': '07:00 PM', 'End Time': '08:00 PM', 'Start Date': '20/11/2009', 
  'Subject': 'Watch 2012'}]

This also handles quoted strings that contain embedded semicolons.
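If installing pyparsing is not an option, the same idea (a row is a run of quoted strings joined by `;`, and a row ends wherever a quoted string is not followed by `;`) can be sketched with the standard `re` module. This is not the pyparsing grammar above, only a rough stand-in; the names `FIELD`, `parse_rows` and `to_dicts` are made up for illustration, and it assumes fields never contain a `"` character:

```python
import re

# One quoted field, plus the ';' that follows it if there is one.
FIELD = re.compile(r'"([^"]*)"(;?)')


def parse_rows(response):
    """Split the response into rows of unquoted field values."""
    rows, current = [], []
    for match in FIELD.finditer(response):
        current.append(match.group(1))
        if not match.group(2):  # no ';' after this field: the row ends here
            rows.append(current)
            current = []
    if current:                 # trailing fields after a final ';'
        rows.append(current)
    return rows


def to_dicts(response):
    """Zip the heading row with every data row."""
    rows = parse_rows(response)
    if not rows:
        return []
    headings, data = rows[0], rows[1:]
    return [dict(zip(headings, row)) for row in data]


response = (
    '"Subject";"Start Date";"Start Time";"End Date";"End Time";'
    '"All day event";"Description"'
    '"Play football";"16/11/2009";"10:00 PM";"16/11/2009";"11:00 PM";"false";""'
    '"Watch 2012";"20/11/2009";"07:00 PM";"20/11/2009";"08:00 PM";"false";""'
)
events = to_dicts(response)
print(events[0]['Subject'])  # Play football
```

Because the `;` is only looked for outside the quotes, a value such as `"x;y"` stays in one field.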

#2


3  

Here's one approach.

I notice there is no delimiter between rows. To clean up the input data, I make a few assumptions:

  • The first "row" is the "heading" of a "table"; its fields will be our dictionary keys
  • There are no empty fields in the first row (i.e. no "")
  • Any other field can be empty (i.e. "")
  • The first occurrence of two successive " characters marks the end of the heading row

First I create a response based on your input string:

>>> response = '"Subject";"Start Date";"Start Time";"End Date";"End Time";"All day event";"Description""Play football";"16/11/2009";"10:00 PM";"16/11/2009";"11:00 PM";"false";"""Watch 2012";"20/11/2009";"07:00 PM";"";"08:00 PM";"false";"""";"17/11/2009";"9:00 AM";"17/11/2009";"10:00 AM";"false";""'    

Note that:

  • the "End Date" for "Watch 2012" is empty
  • there is a third event with an empty "Subject" field

These two modifications illustrate some "edge cases" I'm concerned about.

First I replace every occurrence of two consecutive " characters with a pipe (|), then strip out all remaining " characters because I don't need them:

>>> response.replace('""', '|').replace('"', '')
'Subject;Start Date;Start Time;End Date;End Time;All day event;Description|Play football;16/11/2009;10:00 PM;16/11/2009;11:00 PM;false;|Watch 2012;20/11/2009;07:00 PM;|;08:00 PM;false;||;17/11/2009;9:00 AM;17/11/2009;10:00 AM;false;|'

An empty cell that is not at the start or end of a row (e.g. Watch 2012's End Date) now looks like this: ;|; so let's simply turn it back into an empty field:

>>> response.replace('""', '|').replace('"', '').replace(';|;', ';;')
'Subject;Start Date;Start Time;End Date;End Time;All day event;Description|Play football;16/11/2009;10:00 PM;16/11/2009;11:00 PM;false;|Watch 2012;20/11/2009;07:00 PM;;08:00 PM;false;||;17/11/2009;9:00 AM;17/11/2009;10:00 AM;false;|'

Now each remaining | marks a boundary between two rows, starting with the one between the heading row and the first data row. What happens if we split our string on |?

>>> response.replace('""', '|').replace('"', '').replace(';|;', ';;').split('|')
['Subject;Start Date;Start Time;End Date;End Time;All day event;Description',
 'Play football;16/11/2009;10:00 PM;16/11/2009;11:00 PM;false;',
 'Watch 2012;20/11/2009;07:00 PM;;08:00 PM;false;',
 '',
 ';17/11/2009;9:00 AM;17/11/2009;10:00 AM;false;',
 '']

Looks like we're getting somewhere. There's a problem, though: two items in that list are just the empty string ''. They're there because we sometimes have a | at the end of one row and another at the beginning of the next, and splitting between them creates an empty element:

>>> "a|b||c".split('|')
['a', 'b', '', 'c']

The same goes for a lone delimiter at the end of the string:

>>> "a||b|c|".split('|')
['a', '', 'b', 'c', '']

Let's filter our list to drop those empty "rows":

>>> rows = [row for row in response.replace('""', '|').replace('"', '').replace(';|;', ';;').split('|') if row]
>>> rows
['Subject;Start Date;Start Time;End Date;End Time;All day event;Description',
 'Play football;16/11/2009;10:00 PM;16/11/2009;11:00 PM;false;',
 'Watch 2012;20/11/2009;07:00 PM;;08:00 PM;false;',
 ';17/11/2009;9:00 AM;17/11/2009;10:00 AM;false;']

That's it for massaging the input; now we just need to build the dictionaries. First, let's get the dictionary keys:

>>> dict_keys = rows[0].split(';')
>>> dict_keys
['Subject',
 'Start Date',
 'Start Time',
 'End Date',
 'End Time',
 'All day event',
 'Description']

Then build a list of dictionaries, one for each event:

>>> import itertools
>>> events = []
>>> for row in rows[1:]:
...     d = {}
...     for k, v in zip(dict_keys, row.split(';')):  # zip works on Python 2 and 3; izip is Python 2 only
...         d[k] = v
...     events.append(d)
... 
>>> events
[{'All day event': 'false',
  'Description': '',
  'End Date': '16/11/2009',
  'End Time': '11:00 PM',
  'Start Date': '16/11/2009',
  'Start Time': '10:00 PM',
  'Subject': 'Play football'},
 {'All day event': 'false',
  'Description': '',
  'End Date': '',
  'End Time': '08:00 PM',
  'Start Date': '20/11/2009',
  'Start Time': '07:00 PM',
  'Subject': 'Watch 2012'},
 {'All day event': 'false',
  'Description': '',
  'End Date': '17/11/2009',
  'End Time': '10:00 AM',
  'Start Date': '17/11/2009',
  'Start Time': '9:00 AM',
  'Subject': ''}]
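The steps above can be collected into one function. This is only a Python 3 restatement of the walkthrough under the same assumptions (no | and no " inside the data, non-empty headings); the name `parse_events` is made up for illustration:

```python
def parse_events(response):
    """Turn the newline-less, quoted, ';'-delimited response into a list of dicts."""
    cleaned = (
        response
        .replace('""', '|')     # doubled quotes become a provisional row marker
        .replace('"', '')       # the remaining quotes carry no information
        .replace(';|;', ';;')   # a marker between two ';' was really an empty field
    )
    rows = [row for row in cleaned.split('|') if row]  # drop the empty "rows"
    keys = rows[0].split(';')
    return [dict(zip(keys, row.split(';'))) for row in rows[1:]]


response = (
    '"Subject";"Start Date";"Start Time";"End Date";"End Time";'
    '"All day event";"Description"'
    '"Play football";"16/11/2009";"10:00 PM";"16/11/2009";"11:00 PM";"false";""'
    '"Watch 2012";"20/11/2009";"07:00 PM";"";"08:00 PM";"false";""'
    '"";"17/11/2009";"9:00 AM";"17/11/2009";"10:00 AM";"false";""'
)
events = parse_events(response)
print(len(events))  # 3
```

The `response` here is the same modified string used throughout this answer, split over several source lines for readability.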

Hope that helps!

Some notes:

  • if you expect | to appear in your data, you might want to encode it first, or use a different placeholder character
  • supporting quotes in the data might be tricky (e.g. 'Subject': 'Watching "2012"')
  • I leave conversion of 'All day event' values from string to boolean as an exercise to the reader :D
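For the last note, one possible way to do the string-to-boolean conversion is sketched below. It assumes the only spellings are 'true' and 'false' in any letter case (the sample data only ever shows 'false'), and the helper names `parse_bool` and `with_boolean_flag` are made up for illustration:

```python
def parse_bool(text):
    """Map 'true'/'false' (any case) to a bool; reject anything else."""
    lowered = text.strip().lower()
    if lowered == 'true':
        return True
    if lowered == 'false':
        return False
    raise ValueError('not a boolean string: %r' % text)


def with_boolean_flag(event, key='All day event'):
    """Return a copy of the event dict with event[key] converted to a bool."""
    converted = dict(event)
    converted[key] = parse_bool(converted[key])
    return converted


event = {'Subject': 'Play football', 'All day event': 'false'}
print(with_boolean_flag(event))
# {'Subject': 'Play football', 'All day event': False}
```

Raising on unknown values rather than defaulting to False keeps a corrupted field from passing silently.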

#3


2  

Are you sure this is the response you got?

It looks corrupted to me. In that case, no CSV reader will be able to make sense of it.

Fix the response first; parsing will then be much easier.

#4


1  

response = response.split(';') # split it into words
response = [w[1:-1] for w in response] # strip off the quotes 
response = [w.replace('""','"\n"') for w in response] # add in the newlines
response = ['"%s"'%w for w in response] # add the quotes back
response = ';'.join(response) 

But it won't work if a field in the data contains a ";" character that should have been escaped. You should find out what happened to the missing newlines in the first place.
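Once the newlines are back, the standard csv module can take over. A small Python 3 sketch that wraps the steps above in a function (the name `repair_rows` is made up for illustration) and feeds the result to `csv.DictReader`:

```python
import csv
from io import StringIO


def repair_rows(response):
    """Re-insert the missing newlines between rows, as in the steps above."""
    words = response.split(';')                         # split it into words
    words = [w[1:-1] for w in words]                    # strip off the outer quotes
    words = [w.replace('""', '"\n"') for w in words]    # add in the newlines
    words = ['"%s"' % w for w in words]                 # add the quotes back
    return ';'.join(words)


response = (
    '"Subject";"Start Date";"Start Time";"End Date";"End Time";'
    '"All day event";"Description"'
    '"Play football";"16/11/2009";"10:00 PM";"16/11/2009";"11:00 PM";"false";""'
    '"Watch 2012";"20/11/2009";"07:00 PM";"20/11/2009";"08:00 PM";"false";""'
)
reader = csv.DictReader(StringIO(repair_rows(response)), delimiter=';')
events = [dict(row) for row in reader]
print(events[1]['Subject'])  # Watch 2012
```

The limitation mentioned above still applies: the initial split on ';' breaks any field that contains a semicolon.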
