numpy怎么会比我的Fortran程序快得多呢?

时间:2022-07-04 21:24:56

I get a 512^3 array representing a Temperature distribution from a simulation (written in Fortran). The array is stored in a binary file that's about 1/2G in size. I need to know the minimum, maximum and mean of this array and as I will soon need to understand Fortran code anyway, I decided to give it a go and came up with the following very easy routine.

我得到一个512 ^ 3数组表示的温度分布模拟(用Fortran编写)。数组存储在大约1/2G大小的二进制文件中。我需要知道这个数组的最小值、最大值和平均值,因为我很快就需要理解Fortran代码了,所以我决定尝试一下,并得出了以下非常简单的例程。

  integer gridsize,unit,j
  real mini,maxi
  double precision mean

  gridsize=512
  unit=40
  open(unit=unit,file='T.out',status='old',access='stream',&
       form='unformatted',action='read')
  read(unit=unit) tmp
  mini=tmp
  maxi=tmp
  mean=tmp
  do j=2,gridsize**3
      read(unit=unit) tmp
      if(tmp>maxi)then
          maxi=tmp
      elseif(tmp<mini)then
          mini=tmp
      end if
      mean=mean+tmp
  end do
  mean=mean/gridsize**3
  close(unit=unit)

This takes about 25 seconds per file on the machine I use. That struck me as being rather long and so I went ahead and did the following in Python:

在我使用的机器上,每个文件大约需要25秒。这让我觉得很长,所以我用Python做了如下的事情:

    import numpy

    mmap=numpy.memmap('T.out',dtype='float32',mode='r',offset=4,\
                                  shape=(512,512,512),order='F')
    mini=numpy.amin(mmap)
    maxi=numpy.amax(mmap)
    mean=numpy.mean(mmap)

Now, I expected this to be faster of course, but I was really blown away. It takes less than a second under identical conditions. The mean deviates from the one my Fortran routine finds (which I also ran with 128-bit floats, so I somehow trust it more) but only on the 7th significant digit or so.

当然,我原以为会更快,但我真的被吓到了。在相同的条件下只需要不到一秒钟。均值偏离了我的Fortran例程发现(我也运行了128位浮点数,所以我更信任它),但只在第7位有效数字左右。

How can numpy be so fast? I mean you have to look at every entry of an array to find these values, right? Am I doing something very stupid in my Fortran routine for it to take so much longer?

numpy怎么这么快?我的意思是,你必须查看数组的每个元素才能找到这些值,对吧?我在Fortran程序中做了一些非常愚蠢的事情,让它花这么长时间吗?

EDIT:

编辑:

To answer the questions in the comments:

在评论中回答以下问题:

  • Yes, also I ran the Fortran routine with 32-bit and 64-bit floats but it had no impact on performance.
  • 是的,我也使用32位和64位浮点数运行Fortran例程,但是它对性能没有影响。
  • I used iso_fortran_env which provides 128-bit floats.
  • 我使用了iso_fortran_env,它提供128位浮点数。
  • Using 32-bit floats my mean is off quite a bit though, so precision is really an issue.
  • 使用32位浮点数,我的意思是偏离了很多,所以精度确实是个问题。
  • I ran both routines on different files in different order, so the caching should have been fair in the comparison I guess ?
  • 我在不同的文件中运行了两个例程,所以缓存应该是比较公平的,我猜?
  • I actually tried open MP, but to read from the file at different positions at the same time. Having read your comments and answers this sounds really stupid now and it made the routine take a lot longer as well. I might give it a try on the array operations but maybe that won't even be necessary.
  • 我实际上尝试过打开MP,但同时从不同位置读取文件。读了你的评论和答案之后,这听起来真的很愚蠢,而且这也让你的日常生活花费了更长的时间。我可能会对数组操作进行尝试,但这可能是不必要的。
  • The files are actually 1/2G in size, that was a typo, Thanks.
  • 文件实际上是1/2G,这是一个打印错误,谢谢。
  • I will try the array implementation now.
  • 现在我将尝试数组实现。

EDIT 2:

编辑2:

I implemented what @Alexander Vogt and @casey suggested in their answers, and it is as fast as numpy but now I have a precision problem as @Luaan pointed out I might get. Using a 32-bit float array the mean computed by sum is 20% off. Doing

我实现了@Alexander Vogt和@casey在他们的答案中建议的,它和numpy一样快,但是现在我有一个精度问题,正如@Luaan指出的那样。使用32位浮点数组,求和计算的平均值是20%

...
real,allocatable :: tmp (:,:,:)
double precision,allocatable :: tmp2(:,:,:)
...
tmp2=tmp
mean=sum(tmp2)/size(tmp)
...

Solves the issue but increases computing time (not by very much, but noticeably). Is there a better way to get around this issue? I couldn't find a way to read singles from the file directly to doubles. And how does numpy avoid this?

解决了这个问题,但是增加了计算时间(不是很多,但是很明显)。有更好的方法来解决这个问题吗?我找不到一种方法从文件中直接读单打到双打。numpy是如何避免这种情况的呢?

Thanks for all the help so far.

谢谢你的帮助。

2 个解决方案

#1


109  

Your Fortran implementation suffers two major shortcomings:

您的Fortran实现有两个主要缺点:

  • You mix IO and computations (and read from the file entry by entry).
  • 您将IO和计算混合(并按条目从文件条目中读取)。
  • You don't use vector/matrix operations.
  • 你不用向量/矩阵运算。

This implementation does perform the same operation as yours and is faster by a factor of 20 on my machine:

这个实现执行的操作与您的相同,并且在我的机器上的速度是20倍:

program test
  integer gridsize,unit
  real mini,maxi,mean
  real, allocatable :: tmp (:,:,:)

  gridsize=512
  unit=40

  allocate( tmp(gridsize, gridsize, gridsize))

  open(unit=unit,file='T.out',status='old',access='stream',&
       form='unformatted',action='read')
  read(unit=unit) tmp

  close(unit=unit)

  mini = minval(tmp)
  maxi = maxval(tmp)
  mean = sum(tmp)/gridsize**3
  print *, mini, maxi, mean

end program

The idea is to read in the whole file into one array tmp in one go. Then, I can use the functions MAXVAL, MINVAL, and SUM on the array directly.

其思想是将整个文件一口气读入一个数组tmp。然后,我可以直接使用函数MAXVAL, MINVAL和SUM。


For the accuracy issue: Simply using double precision values and doing the conversion on the fly as

对于精度问题:简单地使用双精度值并动态地进行转换

mean = sum(real(tmp, kind=kind(1.d0)))/real(gridsize**3, kind=kind(1.d0))

only marginally increases the calculation time. I tried performing the operation element-wise and in slices, but that did only increase the required time at the default optimization level.

只稍微增加了计算时间。我尝试了按元素和分段执行操作,但这只增加了默认优化级别所需的时间。

At -O3, the element-wise addition performs ~3 % better than the array operation. The difference between double and single precision operations is less than 2% on my machine - on average (the individual runs deviate by far more).

在-O3点,元素加法比数组操作性能好3%。在我的机器上,双精度操作和单精度操作之间的差距小于2%(个人运行偏差要大得多)。


Here is a very fast implementation using LAPACK:

下面是一个使用LAPACK的快速实现:

program test
  integer gridsize,unit, i, j
  real mini,maxi
  integer  :: t1, t2, rate
  real, allocatable :: tmp (:,:,:)
  real, allocatable :: work(:)
!  double precision :: mean
  real :: mean
  real :: slange

  call system_clock(count_rate=rate)
  call system_clock(t1)
  gridsize=512
  unit=40

  allocate( tmp(gridsize, gridsize, gridsize), work(gridsize))

  open(unit=unit,file='T.out',status='old',access='stream',&
       form='unformatted',action='read')
  read(unit=unit) tmp

  close(unit=unit)

  mini = minval(tmp)
  maxi = maxval(tmp)

!  mean = sum(tmp)/gridsize**3
!  mean = sum(real(tmp, kind=kind(1.d0)))/real(gridsize**3, kind=kind(1.d0))
  mean = 0.d0
  do j=1,gridsize
    do i=1,gridsize
      mean = mean + slange('1', gridsize, 1, tmp(:,i,j),gridsize, work)
    enddo !i
  enddo !j
  mean = mean / gridsize**3

  print *, mini, maxi, mean
  call system_clock(t2)
  print *,real(t2-t1)/real(rate)

end program

This uses the single precision matrix 1-norm SLANGE on matrix columns. The run-time is even faster than the approach using single precision array functions - and does not show the precision issue.

这使用了矩阵列上的单精度矩阵1范数。运行时甚至比使用单精度数组函数的方法更快,并且不显示精度问题。

#2


54  

The numpy is faster because you wrote much more efficient code in python (and much of the numpy backend is written in optimized Fortran and C) and terribly inefficient code in Fortran.

numpy更快,因为您在python中编写了更高效的代码(许多numpy后端是用优化的Fortran和C编写的),而在Fortran中编写的代码非常低效。

Look at your python code. You load the entire array at once and then call functions that can operate on an array.

看看您的python代码。一次加载整个数组,然后调用可以对数组进行操作的函数。

Look at your fortran code. You read one value at a time and do some branching logic with it.

看看你的fortran代码。你一次读取一个值,然后用它做一些分支逻辑。

The majority of your discrepancy is the fragmented IO you have written in Fortran.

你的大部分差异是你用Fortran写的支离破碎的IO。

You can write the Fortran just about the same way as you wrote the python and you'll find it runs much faster that way.

你可以像编写python一样编写Fortran,你会发现它运行得更快。

program test
  implicit none
  integer :: gridsize, unit
  real :: mini, maxi, mean
  real, allocatable :: array(:,:,:)

  gridsize=512
  allocate(array(gridsize,gridsize,gridsize))
  unit=40
  open(unit=unit, file='T.out', status='old', access='stream',&
       form='unformatted', action='read')
  read(unit) array    
  maxi = maxval(array)
  mini = minval(array)
  mean = sum(array)/size(array)
  close(unit)
end program test

#1


109  

Your Fortran implementation suffers two major shortcomings:

您的Fortran实现有两个主要缺点:

  • You mix IO and computations (and read from the file entry by entry).
  • 您将IO和计算混合(并按条目从文件条目中读取)。
  • You don't use vector/matrix operations.
  • 你不用向量/矩阵运算。

This implementation does perform the same operation as yours and is faster by a factor of 20 on my machine:

这个实现执行的操作与您的相同,并且在我的机器上的速度是20倍:

program test
  integer gridsize,unit
  real mini,maxi,mean
  real, allocatable :: tmp (:,:,:)

  gridsize=512
  unit=40

  allocate( tmp(gridsize, gridsize, gridsize))

  open(unit=unit,file='T.out',status='old',access='stream',&
       form='unformatted',action='read')
  read(unit=unit) tmp

  close(unit=unit)

  mini = minval(tmp)
  maxi = maxval(tmp)
  mean = sum(tmp)/gridsize**3
  print *, mini, maxi, mean

end program

The idea is to read in the whole file into one array tmp in one go. Then, I can use the functions MAXVAL, MINVAL, and SUM on the array directly.

其思想是将整个文件一口气读入一个数组tmp。然后,我可以直接使用函数MAXVAL, MINVAL和SUM。


For the accuracy issue: Simply using double precision values and doing the conversion on the fly as

对于精度问题:简单地使用双精度值并动态地进行转换

mean = sum(real(tmp, kind=kind(1.d0)))/real(gridsize**3, kind=kind(1.d0))

only marginally increases the calculation time. I tried performing the operation element-wise and in slices, but that did only increase the required time at the default optimization level.

只稍微增加了计算时间。我尝试了按元素和分段执行操作,但这只增加了默认优化级别所需的时间。

At -O3, the element-wise addition performs ~3 % better than the array operation. The difference between double and single precision operations is less than 2% on my machine - on average (the individual runs deviate by far more).

在-O3点,元素加法比数组操作性能好3%。在我的机器上,双精度操作和单精度操作之间的差距小于2%(个人运行偏差要大得多)。


Here is a very fast implementation using LAPACK:

下面是一个使用LAPACK的快速实现:

program test
  integer gridsize,unit, i, j
  real mini,maxi
  integer  :: t1, t2, rate
  real, allocatable :: tmp (:,:,:)
  real, allocatable :: work(:)
!  double precision :: mean
  real :: mean
  real :: slange

  call system_clock(count_rate=rate)
  call system_clock(t1)
  gridsize=512
  unit=40

  allocate( tmp(gridsize, gridsize, gridsize), work(gridsize))

  open(unit=unit,file='T.out',status='old',access='stream',&
       form='unformatted',action='read')
  read(unit=unit) tmp

  close(unit=unit)

  mini = minval(tmp)
  maxi = maxval(tmp)

!  mean = sum(tmp)/gridsize**3
!  mean = sum(real(tmp, kind=kind(1.d0)))/real(gridsize**3, kind=kind(1.d0))
  mean = 0.d0
  do j=1,gridsize
    do i=1,gridsize
      mean = mean + slange('1', gridsize, 1, tmp(:,i,j),gridsize, work)
    enddo !i
  enddo !j
  mean = mean / gridsize**3

  print *, mini, maxi, mean
  call system_clock(t2)
  print *,real(t2-t1)/real(rate)

end program

This uses the single precision matrix 1-norm SLANGE on matrix columns. The run-time is even faster than the approach using single precision array functions - and does not show the precision issue.

这使用了矩阵列上的单精度矩阵1范数。运行时甚至比使用单精度数组函数的方法更快,并且不显示精度问题。

#2


54  

The numpy is faster because you wrote much more efficient code in python (and much of the numpy backend is written in optimized Fortran and C) and terribly inefficient code in Fortran.

numpy更快,因为您在python中编写了更高效的代码(许多numpy后端是用优化的Fortran和C编写的),而在Fortran中编写的代码非常低效。

Look at your python code. You load the entire array at once and then call functions that can operate on an array.

看看您的python代码。一次加载整个数组,然后调用可以对数组进行操作的函数。

Look at your fortran code. You read one value at a time and do some branching logic with it.

看看你的fortran代码。你一次读取一个值,然后用它做一些分支逻辑。

The majority of your discrepancy is the fragmented IO you have written in Fortran.

你的大部分差异是你用Fortran写的支离破碎的IO。

You can write the Fortran just about the same way as you wrote the python and you'll find it runs much faster that way.

你可以像编写python一样编写Fortran,你会发现它运行得更快。

program test
  implicit none
  integer :: gridsize, unit
  real :: mini, maxi, mean
  real, allocatable :: array(:,:,:)

  gridsize=512
  allocate(array(gridsize,gridsize,gridsize))
  unit=40
  open(unit=unit, file='T.out', status='old', access='stream',&
       form='unformatted', action='read')
  read(unit) array    
  maxi = maxval(array)
  mini = minval(array)
  mean = sum(array)/size(array)
  close(unit)
end program test