Reading values from a .csv file and converting them to a float array

Time: 2021-09-26 13:26:20

I've run into a small coding problem. I need to read data from a .csv file whose rows look like this:

2011-06-19 17:29:00.000,72,44,56,0.4772,0.3286,0.8497,31.3587,0.3235,0.9147,28.5751,0.3872,0.2803,0,0.2601,0.2073,0.1172,0,0.0,0,5.8922,1,0,0,0,1.2759

Now I need to read an entire file made up of rows like this and parse them into numpy arrays. So far, I have been able to load them into one big string-typed object using code similar to this:

order_hist = np.loadtxt(filename_input,delimiter=',',dtype={'names': ('Year', 'Mon', 'Day', 'Stock', 'Action', 'Amount'), 'formats': ('i4', 'i4', 'i4', 'S10', 'S10', 'i4')})

As of now, every field in this file is read as an S20 data type. I need to extract all of the data in the big order_hist object into a separate array for each column. I do not know how to store the date-time column (I've kept it as a string for now). I need to convert the rest to float, but the code below gives me an error:

    temparr=float[:len(order_hist)]
    for x in range(len(order_hist['Stock'])): 
        temparr[x]=float(order_hist['Stock'][x]);

Can someone show me how to convert all the columns to the arrays that I need? Or point me to a link that explains how?

1 solution

#1


5  

Boy, have I got a treat for you. numpy.genfromtxt has a converters parameter, which allows you to specify a function for each column as the file is parsed. The function is fed the CSV string value. Its return value becomes the corresponding value in the numpy array.

Moreover, the dtype = None parameter tells genfromtxt to make an intelligent guess as to the type of each column. In particular, numeric columns are automatically cast to an appropriate dtype.

For example, suppose your data file contains

2011-06-19 17:29:00.000,72,44,56

Then

import numpy as np
import datetime as DT

def make_date(datestr):
    return DT.datetime.strptime(datestr, '%Y-%m-%d %H:%M:%S.%f')

arr = np.genfromtxt(filename, delimiter = ',',
                    converters = {'Date':make_date},
                    names =  ('Date', 'Stock', 'Action', 'Amount'),
                    dtype = None)
print(arr)
print(arr.dtype)

yields

(datetime.datetime(2011, 6, 19, 17, 29), 72, 44, 56)
[('Date', '|O4'), ('Stock', '<i4'), ('Action', '<i4'), ('Amount', '<i4')]
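The output above comes from the answer's original setup; the exact dtype strings (for example '|O4' or '<i4') depend on the platform and NumPy version. As a minimal sketch for a recent NumPy on Python 3, the version below reads from an in-memory sample instead of a file. The second data row and the isinstance check are assumptions added for illustration: depending on the NumPy version and the encoding argument, the converter may be handed bytes rather than str, so the sketch accepts both.

```python
import io
import datetime as dt

import numpy as np


def make_date(datestr):
    # Depending on NumPy version and the encoding argument, the converter
    # may receive bytes instead of str, so handle both defensively.
    if isinstance(datestr, bytes):
        datestr = datestr.decode('ascii')
    return dt.datetime.strptime(datestr, '%Y-%m-%d %H:%M:%S.%f')


# In-memory stand-in for the CSV file (the second row is made up).
sample = io.StringIO(
    "2011-06-19 17:29:00.000,72,44,56\n"
    "2011-06-19 17:30:00.000,73,45,57\n"
)

arr = np.genfromtxt(sample, delimiter=',',
                    converters={'Date': make_date},
                    names=('Date', 'Stock', 'Action', 'Amount'),
                    dtype=None, encoding='utf-8')

print(arr['Date'][0])              # a datetime.datetime object
print(arr['Stock'].astype(float))  # one numeric column as a float array
```

Indexing the structured array by field name (arr['Stock']) returns that column as an ordinary 1-D array, which can then be cast with astype(float).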

Your real csv file has more columns, so you'd want to add more items to names, but otherwise, the example should still stand.

If you don't really care about naming the extra columns, you can assign placeholder names like this:

arr = np.genfromtxt(filename, delimiter=',',
                    converters={'Date': make_date},
                    names=('Date', 'Stock', 'Action', 'Amount') +
                    tuple('col{i}'.format(i=i) for i in range(22)),
                    dtype = None)

yields

(datetime.datetime(2011, 6, 19, 17, 29), 72, 44, 56, 0.4772, 0.3286, 0.8497, 31.3587, 0.3235, 0.9147, 28.5751, 0.3872, 0.2803, 0, 0.2601, 0.2073, 0.1172, 0, 0.0, 0, 5.8922, 1, 0, 0, 0, 1.2759)
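Since the question asks for a float array per column, here is one possible way to get there from the structured array. This is a sketch, not the only approach: it keeps the timestamp as plain text (no converter), repeats the sample row three times purely for illustration, and the names columns and matrix are hypothetical.

```python
import io

import numpy as np

row = ("2011-06-19 17:29:00.000,72,44,56,0.4772,0.3286,0.8497,31.3587,"
       "0.3235,0.9147,28.5751,0.3872,0.2803,0,0.2601,0.2073,0.1172,0,"
       "0.0,0,5.8922,1,0,0,0,1.2759\n")
sample = io.StringIO(row * 3)  # three identical rows, for illustration only

names = (('Date', 'Stock', 'Action', 'Amount') +
         tuple('col{i}'.format(i=i) for i in range(22)))

# dtype=None lets genfromtxt infer int/float columns; the date stays a string.
arr = np.genfromtxt(sample, delimiter=',', names=names,
                    dtype=None, encoding='utf-8')

# One float array per numeric column, keyed by column name.
columns = {name: arr[name].astype(float)
           for name in arr.dtype.names if name != 'Date'}

# Or a single 2-D float array with one row per line and 25 numeric columns.
matrix = np.column_stack([columns[name] for name in names[1:]])

print(columns['Stock'])
print(matrix.shape)
```

The dictionary form is convenient when columns are used by name; the stacked matrix is easier to pass to numeric routines that expect a plain 2-D array.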

You might also be interested in checking out the pandas module, which is built on top of numpy and takes parsing CSV to an even higher level of luxury: it has a pandas.read_csv function whose parse_dates parameter (given the list of date columns) will automatically parse date strings (using dateutil).

Using pandas, your csv could be parsed with

import pandas as pd

# Column 0 holds the timestamp; the file has no header row.
df = pd.read_csv(filename, parse_dates = [0], header = None,
                    names=('Date', 'Stock', 'Action', 'Amount') +
                    tuple('col{i}'.format(i=i) for i in range(22)))

Note there is no need to specify the make_date function. Just to be clear: pandas.read_csv returns a DataFrame, not a numpy array. The DataFrame may actually be more useful for your purpose, but you should be aware it is a different object with a whole new world of methods to exploit and explore.
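If numpy arrays are still what you need in the end, a DataFrame can be converted back. The sketch below assumes a reasonably recent pandas (to_numpy is not available in very old releases) and uses an in-memory sample whose second row is made up; the variable names dates, values, and stock are hypothetical.

```python
import io

import pandas as pd

tail = (",0.4772,0.3286,0.8497,31.3587,0.3235,0.9147,28.5751,0.3872,"
        "0.2803,0,0.2601,0.2073,0.1172,0,0.0,0,5.8922,1,0,0,0,1.2759\n")
sample = io.StringIO(
    "2011-06-19 17:29:00.000,72,44,56" + tail +
    "2011-06-19 17:30:00.000,73,45,57" + tail
)

names = (['Date', 'Stock', 'Action', 'Amount'] +
         ['col{i}'.format(i=i) for i in range(22)])
df = pd.read_csv(sample, header=None, names=names, parse_dates=[0])

dates = df['Date'].to_numpy()                            # datetime64 values
values = df.drop(columns='Date').to_numpy(dtype=float)   # shape (rows, 25)
stock = df['Stock'].to_numpy(dtype=float)                # a single column

print(values.shape)
print(stock)
```

Dropping the date column before the conversion keeps the result a homogeneous float array rather than an object array.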
