I am trying the Kaggle challenge here, and unfortunately I am stuck at a very basic step, which I blame on my limited Python knowledge. I am trying to read the dataset into a pandas DataFrame by executing the following command:
test = pd.DataFrame.from_csv("C:/Name/DataMining/hillary/data/output/emails.csv")
The problem is that this file, as you will find, has over 300,000 records, but I am reading only 7,945 rows and 21 columns.
print (test.shape)
(7945, 21)
I have double-checked the file and cannot find anything special about line 7945. Any pointers as to why this could be happening? It seems like a very ordinary situation, so I hope someone who has run into this problem can help me out.
1 Answer
#1
3
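I think it is better to use the read_csv function with the parameters quoting=csv.QUOTE_NONE and error_bad_lines=False. Note that error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on newer versions the equivalent is on_bad_lines='skip'.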
import pandas as pd
import csv
test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, error_bad_lines=False)
print (test.shape)
#(381422, 22)
But some (problematic) rows will be skipped.
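To see why the quoting setting changes the row count so much, here is a minimal sketch using only the standard library csv module. The three-column sample is made up for illustration and is not taken from Emails.csv. It assumes, which the question does not confirm, that the email bodies are quoted fields containing line breaks. With default quoting such a field stays inside one record. With csv.QUOTE_NONE each physical line becomes its own row, and rows with an unexpected number of fields are the ones that get skipped.

```python
import csv
import io

# Hypothetical sample: the second record has a quoted body that
# contains both a comma and a line break.
sample = 'Id,Subject,Body\n1,hello,short\n2,meeting,"line one,\nline two"\n'


def count_rows(text, quoting):
    """Count the rows csv.reader yields under the given quoting mode."""
    return sum(1 for _ in csv.reader(io.StringIO(text), quoting=quoting))


# Default behaviour: quotes are honored, so the multi-line body is one field.
quoted_rows = count_rows(sample, csv.QUOTE_MINIMAL)

# QUOTE_NONE: quote characters are ordinary text, so every physical
# line is parsed as a separate row.
raw_rows = count_rows(sample, csv.QUOTE_NONE)

print(quoted_rows, raw_rows)
```

If the real file behaves like this sample, the smaller shape may simply be the number of logical records, while the larger one counts physical lines. It would be worth comparing both before deciding which result to trust.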
If you want to skip the email body data, you can use:
import pandas as pd
import csv
test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, sep=',', error_bad_lines=False, header=None,
names=["Id","DocNumber","MetadataSubject","MetadataTo","MetadataFrom","SenderPersonId","MetadataDateSent","MetadataDateReleased","MetadataPdfLink","MetadataCaseNumber","MetadataDocumentClass","ExtractedSubject","ExtractedTo","ExtractedFrom","ExtractedCc","ExtractedDateSent","ExtractedCaseNumber","ExtractedDocNumber","ExtractedDateReleased","ExtractedReleaseInPartOrFull","ExtractedBodyText","RawText"])
print (test.shape)
# drop rows with NaN in the MetadataFrom column
test = test.dropna(subset=['MetadataFrom'])
# drop header rows that were read in as data
test = test[test.MetadataFrom != 'MetadataFrom']