How to run a Python script in batches?

Date: 2022-04-21 01:14:57

I am looking for a way to run a Python command over a dataset in batches. For example, I want to run the code below on the first 10 rows, print the output, then run it on the next batch, and so on until all rows are processed. The reason for doing this is that it currently takes a long time to run all 1000 rows.

I tried concurrent.futures.ProcessPoolExecutor, but it did not help. Is there a better way to do this?

Here is the code:

import os
import concurrent.futures
import urllib.request

import tensorflow as tf
import xlsxwriter

filename = "/home/shri/Desktop/tf_files/test1"

def getimg(count):
    # Open the CSV file and download every image URL it lists.
    with open("{0}.csv".format(filename), 'r') as csvfile:
        # Iterate over all lines.
        i = 0
        for line in csvfile:
            splitted_line = line.split(',')
            # Check if we have an image URL.
            if splitted_line[1] != '' and splitted_line[1] != "\n":
                urllib.request.urlretrieve(
                    splitted_line[1],
                    '/home/shri/Desktop/tf_files/images/{0}.jpg'.format(splitted_line[0]))
                print("Image saved for {0}".format(splitted_line[0]))
                i += 1
            else:
                print("No result for {0}".format(splitted_line[0]))

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

def run_inference(count):
    # Create a workbook and add a worksheet.
    workbook = xlsxwriter.Workbook('output.xlsx')
    worksheet = workbook.add_worksheet()
    # Start from the first cell. Rows and columns are zero indexed.
    row = 0

    # Search for files in the 'images' dir.
    files_dir = os.getcwd() + '/images'
    files = os.listdir(files_dir)

    # Load the label file once, stripping off trailing newlines.
    label_lines = [line.rstrip() for line
                   in tf.gfile.GFile("retrained_labels.txt")]

    # Unpersist the graph from file once, not once per image.
    with tf.gfile.FastGFile("retrained_graph.pb", 'rb') as graph_file:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(graph_file.read())
        tf.import_graph_def(graph_def, name='')

    with tf.Session() as sess:
        softmax_tensor = sess.graph.get_tensor_by_name('final_result:0')

        # Loop over files; print a prediction for each image.
        for f in files:
            if not f.lower().endswith(('.png', '.jpg', '.jpeg')):
                continue
            image_path = files_dir + '/' + f

            # Read in the image data.
            image_data = tf.gfile.FastGFile(image_path, 'rb').read()

            # Feed the image data as input to the graph and get the prediction.
            predictions = sess.run(softmax_tensor,
                                   {'DecodeJpeg/contents:0': image_data})

            # Sort labels of the prediction in order of confidence.
            top_k = predictions[0].argsort()[-len(predictions[0]):][::-1]

            for node_id in top_k:
                human_string = label_lines[node_id]
                score = predictions[0][node_id]

                worksheet.write_string(row, 1, image_path)
                worksheet.write(row, 2, human_string)
                worksheet.write(row, 3, score)
                print(row)
                print(node_id)
                print(image_path)
                print('%s (score = %.5f)' % (human_string, score))
                row += 1

    workbook.close()

# Note: run_inference ignores its count argument, so submitting it ten
# times just repeats the same work across five threads.
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as e:
    for i in range(10):
        e.submit(run_inference, i)
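
For illustration, here is a minimal, self-contained sketch of the batch-of-10 idea described above, kept separate from the TensorFlow code. process_batch is a hypothetical stand-in for the real per-row work, and the CSV layout (an id in column 0, a URL in column 1) is assumed from getimg:

import concurrent.futures

BATCH_SIZE = 10

def process_batch(rows):
    # Hypothetical worker: replace the body with the real per-row work.
    for row in rows:
        print("processing {0} -> {1}".format(row[0], row[1]))
    return len(rows)

def main():
    with open("test1.csv") as csvfile:
        rows = [line.rstrip("\n").split(",") for line in csvfile]

    # Chunk the rows into batches of BATCH_SIZE.
    batches = [rows[i:i + BATCH_SIZE] for i in range(0, len(rows), BATCH_SIZE)]

    # Run each batch in its own process; map() yields results in submission order.
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        for done in executor.map(process_batch, batches):
            print("finished a batch of {0} rows".format(done))

if __name__ == "__main__":
    main()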

Here is the data in the Excel sheet:

2 solutions

#1

I'd suggest using GNU Parallel. Create a text file (e.g. commands.txt) with each line being a command you need to run:

python mycode.py someargs
python mycode.py someotherargs
...

Then simply run:

parallel -j 8 < commands.txt

It will run 8 instances (or however many you choose) of your script in parallel until the whole command list has been processed.
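
One way to produce such a command file is a short helper script. This is just a sketch: mycode.py and its --start/--end flags are hypothetical names, assuming 1000 rows split into batches of 10:

# Sketch: write commands.txt with one batch per line, assuming a
# hypothetical mycode.py that accepts --start and --end row indices.
BATCH_SIZE = 10
TOTAL_ROWS = 1000

with open("commands.txt", "w") as out:
    for start in range(0, TOTAL_ROWS, BATCH_SIZE):
        out.write("python mycode.py --start {0} --end {1}\n".format(
            start, start + BATCH_SIZE))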

#2

GNU Parallel cannot make a serial program run faster or change a serial program to a parallel program.

What GNU Parallel can do is run a serial program many times in parallel with different arguments. But for this to work, your serial program must be able to run in parallel, and the work must be splittable.

So you need to make your serial program able to take one part of the problem and solve just that part. This may mean that at the end you need to collect all the partial solutions into a full solution.

This technique is called Map-Reduce today. GNU Parallel does the Map-stage.
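
For example, if each parallel job writes its partial results to its own file, the Reduce stage can be a simple merge. This is a sketch assuming a hypothetical partial_<start>.csv naming scheme:

import glob

# Collect all partial solutions into one full solution.
with open("full_result.csv", "w") as full:
    for partial in sorted(glob.glob("partial_*.csv")):
        with open(partial) as p:
            full.write(p.read())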

In your case it will be a good idea to identify which section is slow, and see how you can change that section into something that can be run as partial solutions.

Let us assume it is the fetching of URLs that is slow. Then you write a program that fetches URL number i, where i can be given on the command line:

seq 10000 | parallel -j30 python get_url_number.py {}

Here we run 30 jobs in parallel. This will typically not crash the web server and may be able to fill your bandwidth.
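
A minimal sketch of such a get_url_number.py, assuming the CSV path and the id/URL column layout from the question's getimg function (the script itself is hypothetical):

import sys
import urllib.request

# GNU Parallel passes the row number produced by seq as the only argument.
i = int(sys.argv[1])

with open("/home/shri/Desktop/tf_files/test1.csv") as csvfile:
    lines = csvfile.read().splitlines()

# seq starts at 1, so row i is lines[i - 1]; column 0 is the id, column 1 the URL.
splitted_line = lines[i - 1].split(",")
row_id, url = splitted_line[0], splitted_line[1].strip()

if url:
    urllib.request.urlretrieve(
        url, "/home/shri/Desktop/tf_files/images/{0}.jpg".format(row_id))
    print("Image saved for {0}".format(row_id))
else:
    print("No result for {0}".format(row_id))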
