How to convert an absolutely massive number to a string in a reasonable amount of time?

This is quite an odd problem, I know, but I'm trying to get a copy of the current largest known prime number into a file. Getting the number in integer form is fairly easy; I just run this:

prime = 2**74207281 - 1

It takes about half a second and works just fine. Operations on it are fairly quick as well: dividing it by 10 (without decimals) to shift the digits is quick. However, str(prime) takes a very long time. I reimplemented str like this and found it was processing only about a hundred digits per second:

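strprime = ""
while prime > 0:
    strprime += str(prime % 10)  # appends the least-significant digit first
    prime //= 10
strprime = strprime[::-1]        # restore most-significant-first order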

Is there a way to do this more efficiently? I'm doing this in Python. Should I even try this with Python, or is there a better tool for this?

4 Solutions

#1

Repeated string concatenation is notoriously inefficient since Python strings are immutable. I would go for

strprime = str(prime)

In my benchmarks, this is consistently the fastest solution. Here's my little benchmark program:

import decimal

def f1(x):
    ''' Definition by OP '''
    strprime = ""
    while x > 0:
        strprime += str(x%10)
        x //= 10
    return strprime

def digits(x):
    while x > 0:
        yield x % 10
        x //= 10

def f2(x):
    ''' Using str.join() to avoid repeated string concatenation
        (digits come out least-significant first, as in f1) '''
    return "".join((chr(48 + d) for d in digits(x)))

def f3(x):
    ''' Plain str() '''
    return str(x)

def f4(x):
    ''' Using Decimal class'''
    return decimal.Decimal(x).to_eng_string()

x = 2**100

if __name__ == '__main__':
    import timeit
    for i in range(1,5):
        funcName = "f" + str(i)
        print(funcName+ ": " + str(timeit.timeit(funcName + "(x)", setup="from __main__ import " + funcName + ", x")))

For me, this prints (using Python 2.7.10):

f1: 15.3430171013
f2: 20.8928260803
f3: 0.310356140137
f4: 2.80087995529

#2

Python's integer-to-string conversion uses a simplistic algorithm with a running time of O(n**2). As the length of the number doubles, the conversion time quadruples.

Some simple tests on my computer show the increase in running time:

$ time py35 -c "n=str(2**1000000)"
user    0m1.808s
$ time py35 -c "n=str(2**2000000)"
user    0m7.128s
$ time py35 -c "n=str(2**4000000)"
user    0m28.444s
$ time py35 -c "n=str(2**8000000)"
user    1m54.164s

Since the actual exponent is about 10 times larger than my last test value, the conversion should take about 100 times longer, or just over 3 hours.
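
As a rough cross-check of that estimate, the quadratic scaling can be applied to the exact exponent ratio instead of the rounded factor of 10. This is only a sketch: it assumes the O(n**2) behaviour holds all the way up, it reuses the 1m54.164s timing from the last test above, and the variable names are made up for illustration.

```python
# Extrapolate the str() timing quadratically from the largest measured case.
measured_exponent = 8000000
measured_seconds = 1 * 60 + 54.164   # "1m54.164s" from the last test above
target_exponent = 74207281

ratio = target_exponent / measured_exponent
estimated_seconds = measured_seconds * ratio ** 2   # assumes O(n**2) scaling
estimated_hours = estimated_seconds / 3600

print("size ratio: %.2f" % ratio)
print("estimated conversion time: %.1f hours" % estimated_hours)
```

With the exact ratio of roughly 9.3 the figure comes out a little under 3 hours, which is the same ballpark as the rounded estimate.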

Can it be done faster? Yes. There are several methods that are faster.

Method 1

It is faster to split the very large number, by dividing it by a power of 10, into two roughly equal-sized smaller numbers. The process is repeated until the pieces are relatively small. Then str() is applied to each piece, and leading zeroes pad each result to the width implied by the power of 10 used in the last split. Finally the strings are joined to form the result. This method is used by the mpmath library, and its documentation implies it should be about 3x faster.
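
A minimal sketch of this splitting idea, assuming a non-negative input, might look like the following. It is not mpmath's code; the function name int_to_str and the leaf_digits parameter are hypothetical, and how much it helps depends on how fast the underlying multiplication and division are.

```python
def int_to_str(n, leaf_digits=1000):
    """Convert a non-negative int to decimal by recursive splitting."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")

    # powers[k] == 10 ** (leaf_digits * 2**k); stop once the top power exceeds n.
    powers = [10 ** leaf_digits]
    while powers[-1] <= n:
        powers.append(powers[-1] ** 2)

    def convert(value, level):
        # Returns exactly leaf_digits * 2**level digits, zero-padded on the left.
        if level == 0:
            return str(value).zfill(leaf_digits)
        # Split into a high half and a low half around a power of 10.
        high, low = divmod(value, powers[level - 1])
        return convert(high, level - 1) + convert(low, level - 1)

    # The padding of the leading piece is not part of the number, so strip it.
    return convert(n, len(powers) - 1).lstrip("0") or "0"


print(int_to_str(2 ** 127 - 1, leaf_digits=5))
```

Padding every piece to a fixed width keeps the recursion simple, at the cost of some wasted work when the top half is mostly zeroes; a tuned version would choose the split points more carefully.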

Method 2

Python's integers are stored in binary format. Binary is great for calculations but binary-to-decimal conversion is the bottleneck. It is possible to define your own integer type that stores the value in blocks of 100 (or some similar value) decimal digits. Operations (exponentiation, multiplication, division) will be slower but conversion to a string will be very fast.

Many years ago, I implemented such a class and used efficient algorithms for multiplication and division. The code is no longer available on the Internet but I did find a backup copy that I tested. The running time was reduced to ~14 seconds.
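
To illustrate why decimal-block storage makes the final conversion cheap, here is a toy sketch. It is not the class described above: the name DecBlocks, the 9-digit block size, and the single times_small operation are all assumptions made for illustration, and it uses plain schoolbook carrying rather than efficient multiplication.

```python
BLOCK_DIGITS = 9
BASE = 10 ** BLOCK_DIGITS


class DecBlocks:
    """Non-negative integer stored as little-endian blocks of 9 decimal digits."""

    def __init__(self, value=0):
        self.blocks = []
        while value:
            value, block = divmod(value, BASE)
            self.blocks.append(block)

    def times_small(self, factor):
        """Multiply in place by a small positive int, carrying between blocks."""
        if factor <= 0:
            raise ValueError("factor must be positive in this sketch")
        carry = 0
        for i, block in enumerate(self.blocks):
            carry, self.blocks[i] = divmod(block * factor + carry, BASE)
        while carry:
            carry, block = divmod(carry, BASE)
            self.blocks.append(block)

    def __str__(self):
        if not self.blocks:
            return "0"
        # Each block is already decimal, so conversion is one linear pass:
        # the top block is printed as-is, the rest are zero-padded to 9 digits.
        head = str(self.blocks[-1])
        tail = "".join(str(b).zfill(BLOCK_DIGITS)
                       for b in reversed(self.blocks[:-1]))
        return head + tail


x = DecBlocks(1)
for _ in range(200):
    x.times_small(2)      # x becomes 2**200
print(x)
```

The trade-off is the one noted above: arithmetic on the blocks is slower than on binary integers, but producing the string no longer needs any large divisions.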

Update

I updated the DecInt code referenced above and it is now available at https://github.com/casevh/DecInt.

If Python's native integer type is used, the total running time is less than 14 seconds on my computer. If gmpy2's integer type is used instead, the running time is ~3.5 seconds.

$ py35 DecInt.py
Calculating 2^74207281
Exponentiation time: 3.236
Conversion to decimal format: 0.304
Total elapsed time: 3.540
Length of result: 22338618 digits

Method 3

I maintain the gmpy2 library, which provides easy access to the GMP library for fast integer arithmetic. GMP implements Method 1 in highly optimized C and assembly code and calculates the prime number and its string representation in ~5 seconds.
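
A short sketch of what this could look like, assuming gmpy2 is installed. The helper name mersenne_decimal is hypothetical, and the fallback to the built-in int exists only so the snippet still runs (slowly) without gmpy2.

```python
try:
    from gmpy2 import mpz   # GMP-backed integers, if gmpy2 is available
except ImportError:
    mpz = int               # fallback for illustration only; much slower


def mersenne_decimal(exponent):
    """Return the decimal digits of 2**exponent - 1 as a string."""
    value = mpz(2) ** exponent - 1
    return str(value)       # with mpz, the conversion is done by GMP


# Small exponent for a quick check; the question's value is 74207281.
print(mersenne_decimal(127))
```

Note that on Python 3.11 and later the plain int fallback would also hit the default limit on int-to-str conversion length for a number this large, so the full-size run really does want the mpz path.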

Method 4

The decimal module in Python stores values as decimal digits. Recent versions of Python 3 include a C implementation of the decimal library that is much faster than the pure-Python implementation included with Python 2. The C implementation runs in just over 3 seconds on my computer.

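from decimal import Decimal, getcontext

ctx = getcontext()
ctx.prec = 23000000     # the result has 22338618 digits, so keep them all
ctx.Emin = -999999999   # the default exponent limits are too small
ctx.Emax = 999999999    # for a value of this magnitude
x = Decimal(2)**74207281 - 1
s = str(x)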

#3

Took about 32 seconds to output the file using WinGhci (Haskell language):

import System.IO

main = writeFile "prime.txt" (show (2^74207281 - 1))

The file was 21 megabytes; the last four digits, 6351.

#4

There is GMP, the GNU Multiple Precision Arithmetic Library. It is designed specifically for handling huge numbers quickly.
