Using eventlet to handle a large number of HTTPS requests that return large amounts of data

Time: 2021-11-03 16:44:33

I am trying to use eventlet to process a large number of data requests: approximately 100,000 requests at a time to a remote server, each of which should generate a 10k-15k byte JSON response. I have to decode the JSON, then perform some data transformations (some field-name changes, some simple transforms like English->metric, but a few that require minor parsing), and send all 100,000 results out the back end as XML in a couple of formats expected by a legacy system.

I'm using the code from the eventlet example that uses imap() ("for body in pool.imap(fetch, urls): ..."), lightly modified. So far eventlet is working well on a small sample (5K urls) to fetch the JSON data. My question is whether I should add the non-I/O processing (JSON decode, field transforms, XML encode) to the fetch() function so that all of that transform processing happens in the greenthread, or whether I should do the bare minimum in the greenthread, return the raw response body, and do the main processing in the "for body in pool.imap():" loop.

I'm concerned that if I do the latter, the data from completed threads will start building up and bloat memory, whereas the former would essentially throttle the process so that the XML output keeps up. Suggestions as to the preferred way to implement this are welcome. Oh, and this will eventually run from cron hourly, so it really has a time window it has to fit into. Thanks!


1 Answer

#1



Ideally, you put each data-processing operation into a separate green thread. Then, only when required, combine several operations into a batch or use a pool to throttle concurrency.


When you do non-IO-bound processing in one loop, you essentially throttle concurrency to one simultaneous task. But you can run those tasks in parallel using an (OS) thread pool via the eventlet.tpool module.


Throttle concurrency only when you have too much parallel CPU-bound code running.

