I get a 512^3 array representing a Temperature distribution from a simulation (written in Fortran). The array is stored in a binary file that's about 1/2G in size. I need to know the minimum, maximum and mean of this array and as I will soon need to understand Fortran code anyway, I decided to give it a go and came up with the following very easy routine.
我得到一个512 ^ 3数组表示的温度分布模拟(用Fortran编写)。数组存储在大约1/2G大小的二进制文件中。我需要知道这个数组的最小值、最大值和平均值,因为我很快就需要理解Fortran代码了,所以我决定尝试一下,并得出了以下非常简单的例程。
integer gridsize,unit,j
real mini,maxi
double precision mean
gridsize=512
unit=40
open(unit=unit,file='T.out',status='old',access='stream',&
form='unformatted',action='read')
read(unit=unit) tmp
mini=tmp
maxi=tmp
mean=tmp
do j=2,gridsize**3
read(unit=unit) tmp
if(tmp>maxi)then
maxi=tmp
elseif(tmp<mini)then
mini=tmp
end if
mean=mean+tmp
end do
mean=mean/gridsize**3
close(unit=unit)
This takes about 25 seconds per file on the machine I use. That struck me as being rather long and so I went ahead and did the following in Python:
在我使用的机器上,每个文件大约需要25秒。这让我觉得很长,所以我用Python做了如下的事情:
import numpy
mmap=numpy.memmap('T.out',dtype='float32',mode='r',offset=4,\
shape=(512,512,512),order='F')
mini=numpy.amin(mmap)
maxi=numpy.amax(mmap)
mean=numpy.mean(mmap)
Now, I expected this to be faster of course, but I was really blown away. It takes less than a second under identical conditions. The mean deviates from the one my Fortran routine finds (which I also ran with 128-bit floats, so I somehow trust it more) but only on the 7th significant digit or so.
当然,我原以为会更快,但我真的被吓到了。在相同的条件下只需要不到一秒钟。均值偏离了我的Fortran例程发现(我也运行了128位浮点数,所以我更信任它),但只在第7位有效数字左右。
How can numpy be so fast? I mean you have to look at every entry of an array to find these values, right? Am I doing something very stupid in my Fortran routine for it to take so much longer?
numpy怎么这么快?我的意思是,你必须查看数组的每个元素才能找到这些值,对吧?我在Fortran程序中做了一些非常愚蠢的事情,让它花这么长时间吗?
EDIT:
编辑:
To answer the questions in the comments:
在评论中回答以下问题:
- Yes, also I ran the Fortran routine with 32-bit and 64-bit floats but it had no impact on performance.
- 是的,我也使用32位和64位浮点数运行Fortran例程,但是它对性能没有影响。
- I used
iso_fortran_env
which provides 128-bit floats. - 我使用了iso_fortran_env,它提供128位浮点数。
- Using 32-bit floats my mean is off quite a bit though, so precision is really an issue.
- 使用32位浮点数,我的意思是偏离了很多,所以精度确实是个问题。
- I ran both routines on different files in different order, so the caching should have been fair in the comparison I guess ?
- 我在不同的文件中运行了两个例程,所以缓存应该是比较公平的,我猜?
- I actually tried open MP, but to read from the file at different positions at the same time. Having read your comments and answers this sounds really stupid now and it made the routine take a lot longer as well. I might give it a try on the array operations but maybe that won't even be necessary.
- 我实际上尝试过打开MP,但同时从不同位置读取文件。读了你的评论和答案之后,这听起来真的很愚蠢,而且这也让你的日常生活花费了更长的时间。我可能会对数组操作进行尝试,但这可能是不必要的。
- The files are actually 1/2G in size, that was a typo, Thanks.
- 文件实际上是1/2G,这是一个打印错误,谢谢。
- I will try the array implementation now.
- 现在我将尝试数组实现。
EDIT 2:
编辑2:
I implemented what @Alexander Vogt and @casey suggested in their answers, and it is as fast as numpy
but now I have a precision problem as @Luaan pointed out I might get. Using a 32-bit float array the mean computed by sum
is 20% off. Doing
我实现了@Alexander Vogt和@casey在他们的答案中建议的,它和numpy一样快,但是现在我有一个精度问题,正如@Luaan指出的那样。使用32位浮点数组,求和计算的平均值是20%
...
real,allocatable :: tmp (:,:,:)
double precision,allocatable :: tmp2(:,:,:)
...
tmp2=tmp
mean=sum(tmp2)/size(tmp)
...
Solves the issue but increases computing time (not by very much, but noticeably). Is there a better way to get around this issue? I couldn't find a way to read singles from the file directly to doubles. And how does numpy
avoid this?
解决了这个问题,但是增加了计算时间(不是很多,但是很明显)。有更好的方法来解决这个问题吗?我找不到一种方法从文件中直接读单打到双打。numpy是如何避免这种情况的呢?
Thanks for all the help so far.
谢谢你的帮助。
2 个解决方案
#1
109
Your Fortran implementation suffers two major shortcomings:
您的Fortran实现有两个主要缺点:
- You mix IO and computations (and read from the file entry by entry).
- 您将IO和计算混合(并按条目从文件条目中读取)。
- You don't use vector/matrix operations.
- 你不用向量/矩阵运算。
This implementation does perform the same operation as yours and is faster by a factor of 20 on my machine:
这个实现执行的操作与您的相同,并且在我的机器上的速度是20倍:
program test
integer gridsize,unit
real mini,maxi,mean
real, allocatable :: tmp (:,:,:)
gridsize=512
unit=40
allocate( tmp(gridsize, gridsize, gridsize))
open(unit=unit,file='T.out',status='old',access='stream',&
form='unformatted',action='read')
read(unit=unit) tmp
close(unit=unit)
mini = minval(tmp)
maxi = maxval(tmp)
mean = sum(tmp)/gridsize**3
print *, mini, maxi, mean
end program
The idea is to read in the whole file into one array tmp
in one go. Then, I can use the functions MAXVAL
, MINVAL
, and SUM
on the array directly.
其思想是将整个文件一口气读入一个数组tmp。然后,我可以直接使用函数MAXVAL, MINVAL和SUM。
For the accuracy issue: Simply using double precision values and doing the conversion on the fly as
对于精度问题:简单地使用双精度值并动态地进行转换
mean = sum(real(tmp, kind=kind(1.d0)))/real(gridsize**3, kind=kind(1.d0))
only marginally increases the calculation time. I tried performing the operation element-wise and in slices, but that did only increase the required time at the default optimization level.
只稍微增加了计算时间。我尝试了按元素和分段执行操作,但这只增加了默认优化级别所需的时间。
At -O3
, the element-wise addition performs ~3 % better than the array operation. The difference between double and single precision operations is less than 2% on my machine - on average (the individual runs deviate by far more).
在-O3点,元素加法比数组操作性能好3%。在我的机器上,双精度操作和单精度操作之间的差距小于2%(个人运行偏差要大得多)。
Here is a very fast implementation using LAPACK:
下面是一个使用LAPACK的快速实现:
program test
integer gridsize,unit, i, j
real mini,maxi
integer :: t1, t2, rate
real, allocatable :: tmp (:,:,:)
real, allocatable :: work(:)
! double precision :: mean
real :: mean
real :: slange
call system_clock(count_rate=rate)
call system_clock(t1)
gridsize=512
unit=40
allocate( tmp(gridsize, gridsize, gridsize), work(gridsize))
open(unit=unit,file='T.out',status='old',access='stream',&
form='unformatted',action='read')
read(unit=unit) tmp
close(unit=unit)
mini = minval(tmp)
maxi = maxval(tmp)
! mean = sum(tmp)/gridsize**3
! mean = sum(real(tmp, kind=kind(1.d0)))/real(gridsize**3, kind=kind(1.d0))
mean = 0.d0
do j=1,gridsize
do i=1,gridsize
mean = mean + slange('1', gridsize, 1, tmp(:,i,j),gridsize, work)
enddo !i
enddo !j
mean = mean / gridsize**3
print *, mini, maxi, mean
call system_clock(t2)
print *,real(t2-t1)/real(rate)
end program
This uses the single precision matrix 1-norm SLANGE
on matrix columns. The run-time is even faster than the approach using single precision array functions - and does not show the precision issue.
这使用了矩阵列上的单精度矩阵1范数。运行时甚至比使用单精度数组函数的方法更快,并且不显示精度问题。
#2
54
The numpy is faster because you wrote much more efficient code in python (and much of the numpy backend is written in optimized Fortran and C) and terribly inefficient code in Fortran.
numpy更快,因为您在python中编写了更高效的代码(许多numpy后端是用优化的Fortran和C编写的),而在Fortran中编写的代码非常低效。
Look at your python code. You load the entire array at once and then call functions that can operate on an array.
看看您的python代码。一次加载整个数组,然后调用可以对数组进行操作的函数。
Look at your fortran code. You read one value at a time and do some branching logic with it.
看看你的fortran代码。你一次读取一个值,然后用它做一些分支逻辑。
The majority of your discrepancy is the fragmented IO you have written in Fortran.
你的大部分差异是你用Fortran写的支离破碎的IO。
You can write the Fortran just about the same way as you wrote the python and you'll find it runs much faster that way.
你可以像编写python一样编写Fortran,你会发现它运行得更快。
program test
implicit none
integer :: gridsize, unit
real :: mini, maxi, mean
real, allocatable :: array(:,:,:)
gridsize=512
allocate(array(gridsize,gridsize,gridsize))
unit=40
open(unit=unit, file='T.out', status='old', access='stream',&
form='unformatted', action='read')
read(unit) array
maxi = maxval(array)
mini = minval(array)
mean = sum(array)/size(array)
close(unit)
end program test
#1
109
Your Fortran implementation suffers two major shortcomings:
您的Fortran实现有两个主要缺点:
- You mix IO and computations (and read from the file entry by entry).
- 您将IO和计算混合(并按条目从文件条目中读取)。
- You don't use vector/matrix operations.
- 你不用向量/矩阵运算。
This implementation does perform the same operation as yours and is faster by a factor of 20 on my machine:
这个实现执行的操作与您的相同,并且在我的机器上的速度是20倍:
program test
integer gridsize,unit
real mini,maxi,mean
real, allocatable :: tmp (:,:,:)
gridsize=512
unit=40
allocate( tmp(gridsize, gridsize, gridsize))
open(unit=unit,file='T.out',status='old',access='stream',&
form='unformatted',action='read')
read(unit=unit) tmp
close(unit=unit)
mini = minval(tmp)
maxi = maxval(tmp)
mean = sum(tmp)/gridsize**3
print *, mini, maxi, mean
end program
The idea is to read in the whole file into one array tmp
in one go. Then, I can use the functions MAXVAL
, MINVAL
, and SUM
on the array directly.
其思想是将整个文件一口气读入一个数组tmp。然后,我可以直接使用函数MAXVAL, MINVAL和SUM。
For the accuracy issue: Simply using double precision values and doing the conversion on the fly as
对于精度问题:简单地使用双精度值并动态地进行转换
mean = sum(real(tmp, kind=kind(1.d0)))/real(gridsize**3, kind=kind(1.d0))
only marginally increases the calculation time. I tried performing the operation element-wise and in slices, but that did only increase the required time at the default optimization level.
只稍微增加了计算时间。我尝试了按元素和分段执行操作,但这只增加了默认优化级别所需的时间。
At -O3
, the element-wise addition performs ~3 % better than the array operation. The difference between double and single precision operations is less than 2% on my machine - on average (the individual runs deviate by far more).
在-O3点,元素加法比数组操作性能好3%。在我的机器上,双精度操作和单精度操作之间的差距小于2%(个人运行偏差要大得多)。
Here is a very fast implementation using LAPACK:
下面是一个使用LAPACK的快速实现:
program test
integer gridsize,unit, i, j
real mini,maxi
integer :: t1, t2, rate
real, allocatable :: tmp (:,:,:)
real, allocatable :: work(:)
! double precision :: mean
real :: mean
real :: slange
call system_clock(count_rate=rate)
call system_clock(t1)
gridsize=512
unit=40
allocate( tmp(gridsize, gridsize, gridsize), work(gridsize))
open(unit=unit,file='T.out',status='old',access='stream',&
form='unformatted',action='read')
read(unit=unit) tmp
close(unit=unit)
mini = minval(tmp)
maxi = maxval(tmp)
! mean = sum(tmp)/gridsize**3
! mean = sum(real(tmp, kind=kind(1.d0)))/real(gridsize**3, kind=kind(1.d0))
mean = 0.d0
do j=1,gridsize
do i=1,gridsize
mean = mean + slange('1', gridsize, 1, tmp(:,i,j),gridsize, work)
enddo !i
enddo !j
mean = mean / gridsize**3
print *, mini, maxi, mean
call system_clock(t2)
print *,real(t2-t1)/real(rate)
end program
This uses the single precision matrix 1-norm SLANGE
on matrix columns. The run-time is even faster than the approach using single precision array functions - and does not show the precision issue.
这使用了矩阵列上的单精度矩阵1范数。运行时甚至比使用单精度数组函数的方法更快,并且不显示精度问题。
#2
54
The numpy is faster because you wrote much more efficient code in python (and much of the numpy backend is written in optimized Fortran and C) and terribly inefficient code in Fortran.
numpy更快,因为您在python中编写了更高效的代码(许多numpy后端是用优化的Fortran和C编写的),而在Fortran中编写的代码非常低效。
Look at your python code. You load the entire array at once and then call functions that can operate on an array.
看看您的python代码。一次加载整个数组,然后调用可以对数组进行操作的函数。
Look at your fortran code. You read one value at a time and do some branching logic with it.
看看你的fortran代码。你一次读取一个值,然后用它做一些分支逻辑。
The majority of your discrepancy is the fragmented IO you have written in Fortran.
你的大部分差异是你用Fortran写的支离破碎的IO。
You can write the Fortran just about the same way as you wrote the python and you'll find it runs much faster that way.
你可以像编写python一样编写Fortran,你会发现它运行得更快。
program test
implicit none
integer :: gridsize, unit
real :: mini, maxi, mean
real, allocatable :: array(:,:,:)
gridsize=512
allocate(array(gridsize,gridsize,gridsize))
unit=40
open(unit=unit, file='T.out', status='old', access='stream',&
form='unformatted', action='read')
read(unit) array
maxi = maxval(array)
mini = minval(array)
mean = sum(array)/size(array)
close(unit)
end program test