Using pandas to read in a large tab-delimited file
df = pd.read_csv(file_path, sep='\t', encoding='latin 1', dtype = str, keep_default_na=False, na_values='')
The problem is that there are 200 columns, and the 3rd column is free text that occasionally contains newline characters. The text is not quoted or otherwise delimited by any special characters, so these records get split across multiple lines, with data ending up in the wrong columns.
There are a fixed number of tabs in each line - that is all I have to go on.
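Since the tab count is the only reliable signal, a quick way to see the damage is to count tabs per physical line: complete records show the expected count, and fragments show fewer. This is only a rough diagnostic sketch; the sample data and the helper name `tab_count_histogram` are made up for illustration.

```python
from collections import Counter

def tab_count_histogram(lines):
    """Count how many physical lines contain each number of tabs."""
    return Counter(line.count("\t") for line in lines)

# Hypothetical sample: 3 tabs per record, and the second record has a
# newline inside its third field, so it is spread over two physical lines.
sample = "a\tb\tc\td\n1\t2\tbroken\ntext\t4\n"

hist = tab_count_histogram(sample.splitlines())
for n_tabs, n_lines in sorted(hist.items()):
    print(f"{n_lines} line(s) with {n_tabs} tab(s)")
```

On a real file, pass the open file object instead of `sample.splitlines()`; any line with fewer tabs than the expected count is part of a broken record.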
2 Answers
#1
3
The idea is to use a regex to find every run of text that contains a given number of tabs and ends in a newline, then build a dataframe from those matches.
import pandas as pd
import re

def wonky_parser(fn):
    with open(fn) as f:
        txt = f.read()
    # The {8} below is where I specified 8 tabs per record
    preparse = re.findall(r'(([^\t]*\t[^\t]*){8}(\n|\Z))', txt)
    # t[0] is the whole match; drop its trailing newline before splitting
    parsed = [t[0].rstrip('\n').split('\t') for t in preparse]
    return pd.DataFrame(parsed)
Pass a filename to the function and get your dataframe back.
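If the file is too large to read into memory in one go, the same tab-count idea can probably be applied line by line instead of with a regex. This is a different technique from the function above, sketched under two assumptions: every record has exactly `n_tabs` tabs, and the last field never contains a newline. The name `iter_records` and the sample data are hypothetical.

```python
import io
import pandas as pd

def iter_records(lines, n_tabs):
    """Yield lists of fields, gluing physical lines together until n_tabs tabs are seen."""
    buf = ""
    for line in lines:
        buf += line
        if buf.count("\t") >= n_tabs:
            # Enough tabs: the record is complete, so emit it and start over.
            yield buf.rstrip("\n").split("\t")
            buf = ""
    if buf.strip():
        # Leftover text that never reached n_tabs tabs; kept as a short row.
        yield buf.rstrip("\n").split("\t")

# Hypothetical sample with 3 tabs per record; the third field of the
# second record contains a newline.
sample = io.StringIO("a\tb\tc\td\n1\t2\tbroken\ntext\t4\n")

df = pd.DataFrame(list(iter_records(sample, n_tabs=3)))
print(df)
```

For a real file, replace `sample` with `open(file_path, encoding='latin-1')` and set `n_tabs` to the number of columns minus one. Embedded newlines are kept inside the field, so they can be replaced afterwards if needed.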
#2
0
Name your third column:
df.columns.values[2] = "some_name"
Then pass your cleaning function through the converters argument:
pd.read_csv("foo.csv", sep='\t', encoding='latin 1', dtype=str, keep_default_na=False, converters={'some_name': lambda x: x.replace('\n', '')})
You can use any function that suits your data in place of the lambda.
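A small sketch of how converters behave, with one important assumption: a converter only sees a field after the parser has already split the file into rows, so this example uses a quoted third field, which keeps the newline inside one field. With unquoted newlines, as in the question, the row is broken before the converter runs. The column name `note` and the sample data are hypothetical.

```python
import io
import pandas as pd

# Hypothetical sample: the third field is quoted, so the parser keeps
# its embedded newline inside a single field.
sample = io.StringIO('id\tcode\tnote\n1\tx\t"two\nlines"\n')

df = pd.read_csv(
    sample,
    sep="\t",
    keep_default_na=False,
    # The converter receives each raw string from the "note" column.
    converters={"note": lambda s: s.replace("\n", " ")},
)
print(df.loc[0, "note"])
```

The same converter can be applied to a dataframe produced by any parser that keeps the newline inside the field, for example with `df["note"].str.replace("\n", " ")`.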