Pandas (Python): reading and working with Java BigInteger / large numbers

I have a data file (CSV) of Nilsimsa hash values. Some of them are as long as 80 characters. I want to read them into Python for data analysis. Is there a way to import the data in Python without losing information?

EDIT: I have tried the implementations proposed in the comments, but they do not work for me. Example data in the CSV file would be: 77241756221441762028881402092817125017724447303212139981668021711613168152184106

2 solutions

#1

Start with a simple text file to read in, just one variable and one row.

%more foo.txt
x
77241756221441762028881402092817125017724447303212139981668021711613168152184106

In [268]: df=pd.read_csv('foo.txt')

Pandas reads it in as a string because the value is too big for int64, and float64 could not represent it exactly. But the information is all there; you didn't lose anything.

In [269]: df.x
Out[269]: 
0    7724175622144176202888140209281712501772444730...
Name: x, dtype: object

In [270]: type(df.x[0])
Out[270]: str
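
If you would rather not depend on type inference, a minimal sketch is to ask for text explicitly with the dtype argument of read_csv. The in-memory csv_text below is a hypothetical stand-in for foo.txt, and the float comparison is only there to illustrate why a float64 column would not be a safe home for these values.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for foo.txt from the example above.
csv_text = (
    "x\n"
    "77241756221441762028881402092817125017724447303212139981668021711613168152184106\n"
)

# dtype=str asks pandas to keep the column as text instead of inferring a type.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

value = df["x"].iloc[0]
print(type(value).__name__)  # str
print(value)                 # every digit from the file is still there

# A float64 cannot hold this many significant digits, so a round trip
# through float does not give back the same integer.
print(int(float(value)) == int(value))  # False
```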

You can then use plain Python to treat it as a number. Recall the caveats from the links in the comments: this won't be as fast as numpy and pandas operations on a whole column stored as int64. It uses the more flexible but slower object mode to handle things.

You can change a column to be stored as Python integers (long in Python 2, arbitrary-precision int in Python 3) like this. Note that the dtype is still object, because everything except the core numpy types (int32, int64, float64, etc.) is stored as objects.

In [271]: df.x = df.x.map(int)
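
As a quick check that the conversion is exact, here is a small sketch under Python 3. The Series is built by hand (with dtype=object, an assumption made to mirror the transcript) rather than read from a file, and the name as_int is just for illustration.

```python
import pandas as pd

raw = "77241756221441762028881402092817125017724447303212139981668021711613168152184106"

# A one-row column of strings, like the one read from the file above.
s = pd.Series([raw], name="x", dtype=object)

# Convert each element to a Python int (arbitrary precision in Python 3).
as_int = s.map(int)

print(as_int.dtype)                    # object
print(type(as_int.iloc[0]).__name__)   # int

# Converting back to text gives the original digits, so nothing was lost.
print(str(as_int.iloc[0]) == raw)      # True
```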

Then you can more or less treat it like a number.

In [272]: df.x * 2
Out[272]: 
0    1544835124428835240577628041856342500354488946...
Name: x, dtype: object

The Series display truncates long values, so you'll have to do some formatting to see the whole number. Or go the numpy route, which shows the whole number by default.

In [273]: df.x.values * 2
Out[273]: array([ 154483512442883524057762804185634250035448894606424279963336043423226336304368212L], dtype=object)
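
Two other ways to see all the digits, sketched under the assumption of a reasonably recent pandas (where display.max_colwidth accepts None): pull the values out as a plain list, or lift the column-width limit for the duration of one display. The names doubled and full_repr are only for illustration.

```python
import pandas as pd

raw = "77241756221441762028881402092817125017724447303212139981668021711613168152184106"
doubled = pd.Series([int(raw)], name="x", dtype=object) * 2

# Option 1: plain Python ints in a list print in full.
print(doubled.tolist())

# Option 2: temporarily lift the column-width limit that causes the "..." truncation.
with pd.option_context("display.max_colwidth", None):
    full_repr = repr(doubled)
print(full_repr)
```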

#2

As @JohnE explains in his answer, we do not lose any information when reading big numbers with pandas. They are stored as dtype=object; to do numerical computation on them, we need to convert the data to a numeric type.

For series:

We apply map(func) to the series in the dataframe:

df['columnName'].map(int)

Whole dataframe:

If, for some reason, the entire dataframe is composed of columns with dtype=object, we can use applymap(func).

From the pandas documentation:

DataFrame.applymap(func): Apply a function to a DataFrame that is intended to operate elementwise, i.e. like doing map(func, series) for each series in the DataFrame

So, to transform all columns in the dataframe:

 df.applymap(int)
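
As far as I know, newer pandas releases (2.1 and later) deprecate applymap in favor of DataFrame.map. A sketch that should not depend on either name is to apply Series.map column by column. The two-column frame and its values below are made up for illustration.

```python
import pandas as pd

# Hypothetical frame in which every column holds big integers as strings.
df = pd.DataFrame(
    {
        "a": ["123456789012345678901234567890", "1"],
        "b": ["2", "987654321098765432109876543210"],
    },
    dtype=object,
)

# Convert every element of every column to a Python int, one column at a time.
converted = df.apply(lambda col: col.map(int))

print(converted.dtypes)  # both columns remain object
print(converted.loc[0, "a"] + converted.loc[0, "b"])  # 123456789012345678901234567892
```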
