Pulling data from an API, memory growth

Date: 2022-06-18 09:47:58

I'm working on a project where I pull data (JSON) from an API. The problem I'm having is that the memory is slowly growing until I get the dreaded fatal error:

Fatal error: Allowed memory size of * bytes exhausted (tried to allocate * bytes) in C:... on line *

I don't think there should be any memory growth. I tried unsetting everything at the end of the loop but no difference. So my question is: am I doing something wrong? Is it normal? What can I do to fix this problem?

<?php

$start = microtime(true);

$time = microtime(true) - $start;
echo "Start: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br/>";

include ('start.php');
include ('connect.php');

set_time_limit(0);

$api_key = 'API-KEY';
$tier = 'Platinum';
$threads = 10; //number of urls called simultaneously

function multiRequest($urls, $start) {

    $time = microtime(true) - $start;
    echo "&nbsp;&nbsp;&nbsp;start function: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br>";

    $nbrURLS = count($urls); // number of urls in array $urls
    $ch = array(); // array of curl handles
    $result = array(); // data to be returned

    $mh = curl_multi_init(); // create a multi handle 

    $time = microtime(true) - $start;
    echo "&nbsp;&nbsp;&nbsp;Creation multi handle: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br>";

    // set URL and other appropriate options
    for($i = 0; $i < $nbrURLS; $i++) {
        $ch[$i]=curl_init();

        curl_setopt($ch[$i], CURLOPT_URL, $urls[$i]);
        curl_setopt($ch[$i], CURLOPT_RETURNTRANSFER, 1); // return data as string
        curl_setopt($ch[$i], CURLOPT_SSL_VERIFYPEER, 0); // Don't verify the certificate

        curl_multi_add_handle ($mh, $ch[$i]); // Add a normal cURL handle to a cURL multi handle
    }

    $time = microtime(true) - $start;
    echo "&nbsp;&nbsp;&nbsp;For loop options: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br>";

    // execute the handles
    do {
        $mrc = curl_multi_exec($mh, $active);          
        curl_multi_select($mh, 0.1); // without this, we will busy-loop here and use 100% CPU
    } while ($active);

    $time = microtime(true) - $start;
    echo "&nbsp;&nbsp;&nbsp;Execution: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br>";

    echo '&nbsp;&nbsp;&nbsp;For loop2<br>';

    // get content and remove handles
    for($i = 0; $i < $nbrURLS; $i++) {

        $error = curl_getinfo($ch[$i], CURLINFO_HTTP_CODE); // Last received HTTP code 

        echo "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;error: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br>";

        //error handling if not 200 ok code
        if($error != 200){

            if($error == 429 || $error == 500 || $error == 503 || $error == 504){
                echo "Again error: $error<br>";
                $result['again'][] = $urls[$i];

            } else {
                echo "Error error: $error<br>";
                $result['errors'][] = array("Url" => $urls[$i], "errornbr" => $error);
            }

        } else {
            $result['json'][] = curl_multi_getcontent($ch[$i]);

            echo "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Content: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br>";
        }

        curl_multi_remove_handle($mh, $ch[$i]);
        curl_close($ch[$i]);
    }

    $time = microtime(true) - $start;
    echo "&nbsp;&nbsp;&nbsp; after loop2: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br>";

    curl_multi_close($mh);

    return $result;
}


$gamesId = mysqli_query($connect, "SELECT gameId FROM `games` WHERE `region` = 'EUW1' AND `tier` = '$tier' LIMIT 20");
$urls = array();

while($result = mysqli_fetch_array($gamesId))
{
    $urls[] = 'https://euw.api.pvp.net/api/lol/euw/v2.2/match/' . $result['gameId'] . '?includeTimeline=true&api_key=' . $api_key;
}

$time = microtime(true) - $start;
echo "After URL array: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br/>";

$x = 1; //number of loops

while($urls){ 

    $chunk = array_splice($urls, 0, $threads); // take the first chunk ($threads) of all urls

    $time = microtime(true) - $start;
    echo "<br>After chunk: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br/>";

    $result = multiRequest($chunk, $start); // Get json

    unset($chunk);

    $nbrComplete = count($result['json']); //number of returned json strings

    echo 'For loop: <br/>';

    for($y = 0; $y < $nbrComplete; $y++){
        // parse the json
        $decoded = json_decode($result['json'][$y], true);

        $time = microtime(true) - $start;
        echo "&nbsp;&nbsp;&nbsp;Decode: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br/>";


    }

    unset($nbrComplete);
    unset($decoded);

    $time = microtime(true) - $start;
    echo $x . ": ". memory_get_peak_usage(true) . " | " . $time . "<br>";

    // reuse urls
    if(isset($result['again'])){
        $urls = array_merge($urls, $result['again']);
        unset($result['again']);
    }

    unset($result);
    unset($time);

    sleep(15); // limit the request rate

    $x++;
}

include ('end.php');

?>

PHP Version 5.3.9 - 100 loops:

loop: memory | time (sec)
1: 5505024 | 0.98330211639404
3: 6291456 | 33.190237045288
65: 6553600 | 1032.1401019096
73: 6815744 | 1160.4345710278
75: 7077888 | 1192.6274609566
100: 7077888 | 1595.2397520542

EDIT:
After trying it with PHP 5.6.14 (XAMPP) on Windows:

loop: memory | time (sec)
1: 5505024 | 1.0365679264069
3: 6291456 | 33.604479074478
60: 6553600 | 945.90159296989
62: 6815744 | 977.82566595078
93: 7077888 | 1474.5941500664
94: 7340032 | 1490.6698410511
100: 7340032 | 1587.2434458733

EDIT2: I only see the memory increase after json_decode

Start: 262144 | 135448
After URL array: 262144 | 151984
After chunk: 262144 | 152272
   start function: 262144 | 152464
   Creation multi handle: 262144 | 152816
   For loop options: 262144 | 161424
   Execution: 3145728 | 1943472
   For loop2
      error: 3145728 | 1943520
      Content: 3145728 | 2095056
      error: 3145728 | 1938952
      Content: 3145728 | 2131992
      error: 3145728 | 1938072
      Content: 3145728 | 2135424
      error: 3145728 | 1933288
      Content: 3145728 | 2062312
      error: 3145728 | 1928504
      Content: 3145728 | 2124360
      error: 3145728 | 1923720
      Content: 3145728 | 2089768
      error: 3145728 | 1918936
      Content: 3145728 | 2100768
      error: 3145728 | 1914152
      Content: 3145728 | 2089272
      error: 3145728 | 1909368
      Content: 3145728 | 2067184
      error: 3145728 | 1904616
      Content: 3145728 | 2102976
    after loop2: 3145728 | 1899824
For loop: 
   Decode: 3670016 | 2962208
   Decode: 4980736 | 3241232
   Decode: 5242880 | 3273808
   Decode: 5242880 | 2802024
   Decode: 5242880 | 3258152
   Decode: 5242880 | 3057816
   Decode: 5242880 | 3169160
   Decode: 5242880 | 3122360
   Decode: 5242880 | 3004216
   Decode: 5242880 | 3277304

4 Answers

#1 (score: 1)

I tested your script on 10 URLs. I removed all of your debug echoes except one at the end of the script and one in the problem loop where json_decode is used. I also opened one of the pages that you decode from the API; it is a very big array, and I think you're right: the issue is in json_decode.

Results and fixes.

Result without changes:

Code:

for($y = 0; $y < $nbrComplete; $y++){
   $decoded = json_decode($result['json'][$y], true);
   $time = microtime(true) - $start;
   echo "Decode: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "\n";
}

Result:

Decode: 3407872 | 2947584
Decode: 3932160 | 2183872
Decode: 3932160 | 2491440
Decode: 4980736 | 3291288
Decode: 6291456 | 3835848
Decode: 6291456 | 2676760
Decode: 6291456 | 4249376
Decode: 6291456 | 2832080
Decode: 6291456 | 4081888
Decode: 6291456 | 3214112
Decode: 6291456 | 244400

Result with unset($decoded):

Code:

for($y = 0; $y < $nbrComplete; $y++){
   $decoded = json_decode($result['json'][$y], true);
   unset($decoded);
   $time = microtime(true) - $start;
   echo "Decode: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "\n";
}

Result:

Decode: 3407872 | 1573296
Decode: 3407872 | 1573296
Decode: 3407872 | 1573296
Decode: 3932160 | 1573296
Decode: 4456448 | 1573296
Decode: 4456448 | 1573296
Decode: 4980736 | 1573296
Decode: 4980736 | 1573296
Decode: 4980736 | 1573296
Decode: 4980736 | 1573296
Decode: 4980736 | 244448

You can also add gc_collect_cycles():

Code:

for($y = 0; $y < $nbrComplete; $y++){
   $decoded = json_decode($result['json'][$y], true);
   unset($decoded);
   gc_collect_cycles();
   $time = microtime(true) - $start;
   echo "Decode: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "\n";
}

It can help in some cases, but it can also lead to performance degradation.

You can try rerunning the script with unset, and with unset + gc, and write back if you still see the same issue after the changes.

Also, I don't see where you actually use the $decoded variable; if that's an oversight in the code, you can remove json_decode entirely :)
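
If you do end up using $decoded (for example, to write the match data to the database), here is a minimal sketch of the loop with the fixes above combined; freeing the raw JSON string right after decoding is my addition, and the database step is only a placeholder:

for($y = 0; $y < $nbrComplete; $y++){
   $decoded = json_decode($result['json'][$y], true);
   // ... use $decoded here, e.g. insert the match data into the database ...
   unset($result['json'][$y]); // free the raw JSON string once it has been decoded
   unset($decoded);            // free the decoded array before the next iteration
}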

#2 (score: 4)

Your method is quite long, and I believe garbage collection won't get fired until the very end of the function, which means your unused variables can build up. If they aren't going to be used anymore, garbage collection will eventually take care of them for you.

You might think about refactoring this code into smaller methods to take advantage of this, along with all the other good things that come with smaller methods (see the sketch after the edit note below). In the meantime, you could try putting gc_collect_cycles(); at the very end of your loop to see if you can free some memory:

if(isset($result['again'])){
    $urls = array_merge($urls, $result['again']);
    unset($result['again']);
}

unset($result);
unset($time);

gc_collect_cycles();//add this line here
sleep(15); // limit the request rate

Edit: the segment I updated actually isn't inside the big function, but I suspect the size of $result may be what tips things over, and it possibly won't get cleaned up until the loop terminates. It's worth a try, however.
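
As a rough sketch of that refactor (processChunk is a hypothetical name, and its body is a placeholder): moving the decode work into its own small function means its locals are released as soon as it returns, instead of lingering in the main loop's scope. The 'again' retry handling from the question is omitted for brevity:

function processChunk(array $jsonStrings) {
    foreach ($jsonStrings as $json) {
        $decoded = json_decode($json, true);
        // ... work with $decoded here ...
    }
    // $json and $decoded are freed when the function returns
}

while($urls){
    $chunk = array_splice($urls, 0, $threads);
    $result = multiRequest($chunk, $start);
    processChunk($result['json']);
    unset($result);
    gc_collect_cycles(); // as suggested above
    sleep(15);
}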

#3 (score: 3)

So my question is: am I doing something wrong? Is it normal? What can I do to fix this problem?

Yes, running out of memory is normal when you use all of it. You are making 10 simultaneous HTTP requests and unserializing the JSON responses into PHP memory. Without limiting the size of the responses, you will always be in danger of running out of memory.
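
If you do want to cap the size of each response, one hedged option (my sketch, not something this answer spells out) is CURLOPT_MAXFILESIZE, which makes cURL refuse a body it knows is too large; note it only works when the server sends a Content-Length header:

// Added to the options loop in the question's multiRequest();
// the 5 MB limit is an arbitrary example value.
curl_setopt($ch[$i], CURLOPT_MAXFILESIZE, 5 * 1024 * 1024);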

What else can you do?

  1. Do not run multiple HTTP connections simultaneously. Turn $threads down to 1 to test this. Note that if there is a memory leak in a C extension, calling gc_collect_cycles() will not free any memory; it only affects memory allocated in the Zend Engine that is no longer reachable.
  2. Save the results to a folder and process them in another script. You can move the processed files into a subdirectory to mark each JSON file as successfully processed (see the sketch after this list).
  3. Investigate forking or a message queue so that multiple processes can each work on a portion of the problem at the same time: either multiple PHP processes listening on a queue bucket, or forked children of the parent process, each with its own process memory.
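
A minimal sketch of option 2 (all file and directory names here are assumptions, and queue/ and queue/done/ must already exist): the fetching script writes each raw response to disk, and a separate script decodes one file at a time and moves it aside when done.

// In the fetching script: persist each raw response instead of decoding it.
foreach ($result['json'] as $i => $json) {
    file_put_contents("queue/batch{$x}_{$i}.json", $json);
}

// In a separate processing script: decode one file per iteration,
// then move it into queue/done/ so it is never handled twice.
foreach (glob('queue/*.json') as $file) {
    $decoded = json_decode(file_get_contents($file), true);
    // ... store the data ...
    rename($file, 'queue/done/' . basename($file));
    unset($decoded);
}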

#4 (score: 1)

So my question is: am I doing something wrong? Is it normal? What can I do to fix this problem?

There is nothing wrong with your code; this is normal behaviour. You are requesting data from an external source, and that data is in turn loaded into memory.

Of course, a solution to your problem could be as simple as:

ini_set('memory_limit', -1);

This allows the script to use as much memory as it needs.


When I'm using dummy content the memory usage stays the same between requests.

This is using PHP 5.5.19 in XAMPP on Windows.

There was a cURL-related memory-leak bug which was fixed in PHP 5.5.4.
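
Relatedly, if the responses themselves are what blow the memory limit, here is a sketch of one option this answer does not spell out: stream each body straight to disk with CURLOPT_FILE instead of buffering it with CURLOPT_RETURNTRANSFER, so it never has to sit in PHP memory (the responses/ directory and file names are assumptions):

// Variation on the options loop from the question's multiRequest().
$fps = array();
for($i = 0; $i < $nbrURLS; $i++) {
    $ch[$i] = curl_init($urls[$i]);
    $fps[$i] = fopen("responses/resp{$i}.json", 'w');
    curl_setopt($ch[$i], CURLOPT_FILE, $fps[$i]); // write the body to the file as it arrives
    curl_setopt($ch[$i], CURLOPT_SSL_VERIFYPEER, 0);
    curl_multi_add_handle($mh, $ch[$i]);
}
// ... run the curl_multi_exec() loop as in the question, then clean up: ...
for($i = 0; $i < $nbrURLS; $i++) {
    curl_multi_remove_handle($mh, $ch[$i]);
    curl_close($ch[$i]);
    fclose($fps[$i]); // flush and release the file handle
}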
