
Date: 2022-04-21 01:14:57

I am looking for a way to run Python code over a set of data in batches. For example, I want to run the code below for the first 10 rows, print the output, and then run it for the next batch, until the rows run out. The reason is that processing 1000 rows currently takes a long time.


I tried concurrent.futures.ProcessPoolExecutor, but it did not help. Is there a better way to do this?


Here is the code:


import concurrent.futures
import os, sys
import xlwt
import numpy

import tensorflow as tf  # TensorFlow 1.x API (tf.gfile, tf.Session, tf.GraphDef)
import xlsxwriter
import urllib.request

filename = "/home/shri/Desktop/tf_files/test1"

def getimg(count):
    # open file to read
    with open("{0}.csv".format(filename), 'r') as csvfile:
        # iterate on all lines
        i = 0
        for line in csvfile:
            splitted_line = line.split(',')
            # check if we have an image URL
            if splitted_line[1] != '' and splitted_line[1] != "\n":
                # strip the trailing newline before using the field as a URL
                urllib.request.urlretrieve(splitted_line[1].strip(), '/home/shri/Desktop/tf_files/images/{0}.jpg'.format(splitted_line[0]))
                print("Image saved for {0}".format(splitted_line[0]))
                i += 1
            else:
                print("No result for {0}".format(splitted_line[0]))

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

def run_inference(count):
    # Create a workbook and add a worksheet.
    workbook = xlsxwriter.Workbook('output.xlsx')
    worksheet = workbook.add_worksheet()
    # Start from the first cell. Rows and columns are zero indexed.
    row = 0
    col = 0

    # search for files in 'images' dir
    files_dir = os.getcwd() + '/images'
    files = os.listdir(files_dir)

    # loop over files, print prediction if it is an image
    for f in files:
        if f.lower().endswith(('.png', '.jpg', '.jpeg')):
            image_path = files_dir + '/' + f

            # Read in the image_data
            image_data = tf.gfile.FastGFile(image_path, 'rb').read()

            # Loads label file, strips off carriage return
            label_lines = [line.rstrip() for line
                           in tf.gfile.GFile("retrained_labels.txt")]

            # Unpersists graph from file
            with tf.gfile.FastGFile("retrained_graph.pb", 'rb') as graph_file:
                graph_def = tf.GraphDef()
                # parse the serialized graph before importing it
                graph_def.ParseFromString(graph_file.read())
                tf.import_graph_def(graph_def, name='')

            with tf.Session() as sess:
                # Feed the image_data as input to the graph and get first prediction
                softmax_tensor = sess.graph.get_tensor_by_name('final_result:0')

                predictions = sess.run(softmax_tensor,
                                       {'DecodeJpeg/contents:0': image_data})

                # Sort to show labels of first highest prediction in order of confidence
                top_k = predictions[0].argsort()[-len(predictions):][::-1]

                for node_id in top_k:
                    human_string = label_lines[node_id]
                    score = predictions[0][node_id]

                    worksheet.write_string(row, 1, image_path)
                    worksheet.write(row, 2, human_string)
                    worksheet.write(row, 3, score)
                    print('%s (score = %.5f)' % (human_string, score))
                    row += 1

    # the workbook is only written to disk when it is closed
    workbook.close()


with concurrent.futures.ThreadPoolExecutor(max_workers=5) as e:
    for i in range(10):
        e.submit(run_inference, i)

Here is the data in the Excel sheet (screenshot not preserved).



2 answers



I'd suggest using GNU Parallel. Create a text file (here commands.txt) in which each line is a command you need to run, e.g.

python mycode.py someargs
python mycode.py someotherargs

Then simply run


parallel -j 8 < commands.txt

It will bring up 8 (or however many you choose) instances of your script in parallel to process the whole command list.
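
The command list does not have to be written by hand. Below is a minimal Python 3 sketch that writes one command per batch of rows. It assumes a hypothetical mycode.py that accepts --start and --end row arguments; those flags are not part of the original post and would have to be implemented in the script itself.

```python
def batch_commands(total_rows, batch_size, script="mycode.py"):
    """Return one shell command per batch of rows [start, end)."""
    commands = []
    for start in range(0, total_rows, batch_size):
        end = min(start + batch_size, total_rows)
        # --start/--end are hypothetical flags that the script would need to support
        commands.append(
            "python {0} --start {1} --end {2}".format(script, start, end)
        )
    return commands


def write_commands(path, commands):
    """Write the commands to a file, one per line, for GNU Parallel to read."""
    with open(path, "w") as handle:
        handle.write("\n".join(commands) + "\n")
```

Calling write_commands("commands.txt", batch_commands(1000, 10)) would produce 100 lines, one for each batch of 10 rows.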




GNU Parallel cannot make a serial program run faster or change a serial program to a parallel program.

What GNU Parallel can do is run a serial program many times in parallel with different arguments. For this to work, your serial program must be safe to run as several simultaneous instances, and the work must be divisible.


So you need to make your serial program able to take one part of the problem and solve it. In the end you may also need to collect all the partial solutions into a full solution.


This technique is nowadays called Map-Reduce. GNU Parallel does the map stage.
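
As a rough illustration of the two stages, here is a Python 3 sketch in which each job handles a slice of the rows and writes a partial CSV file, and a final step merges the partial files. The function names and the part_*.csv naming are assumptions made for this example, not something from the question's code.

```python
import csv
import glob


def process_batch(rows, start, end, worker):
    """Map stage: apply `worker` to rows[start:end], keeping the row index."""
    return [(i, worker(rows[i])) for i in range(start, min(end, len(rows)))]


def write_partial(path, results):
    """Store one job's (row_index, result) pairs as a small CSV file."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(results)


def merge_partials(pattern):
    """Reduce stage: collect all partial files into one list sorted by row index."""
    merged = []
    for path in glob.glob(pattern):
        with open(path, newline="") as handle:
            merged.extend((int(i), value) for i, value in csv.reader(handle))
    return sorted(merged)
```

Because every result carries its row index, the jobs can finish in any order and the merge step still restores the original row order.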


In your case it is a good idea to identify which section is slow, and to see how you can change that section into something that can be run as partial solutions.
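
One simple way to find the slow section is to time each stage separately. The following Python 3 sketch accumulates wall-clock time per label; the helper names (timed, slowest) and the labels are made up for illustration.

```python
import time
from contextlib import contextmanager

timings = {}  # label -> accumulated seconds


@contextmanager
def timed(label):
    """Add the wall-clock time spent inside the with-block to timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[label] = timings.get(label, 0.0) + elapsed


def slowest(stats):
    """Return the label with the largest accumulated time."""
    return max(stats, key=stats.get)
```

Wrapping the download step in `with timed("download"):` and the prediction step in `with timed("inference"):` and then printing slowest(timings) should show which of the two is worth parallelizing first.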


Let us assume it is the fetching of URLs that is slow. Then you write a program that fetches URL number i, with i given on the command line:


seq 10000 | parallel -j30 python get_url_number.py {}

Here we run 30 jobs in parallel. This will typically not crash the webserver and may be able to fill your bandwidth.
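
A possible shape for get_url_number.py is sketched below in Python 3. It assumes the CSV layout from the question (an ID in the first column, an image URL in the second); the file name test1.csv, the images directory, and the function names are assumptions for this example, and the output directory is expected to exist already.

```python
import csv
import sys
import urllib.request

CSV_PATH = "test1.csv"  # assumption: two columns, id and url
IMAGE_DIR = "images"    # assumption: this directory already exists


def url_for_row(csv_path, number):
    """Return (image_id, url) for the 1-based row `number`, or None if absent."""
    with open(csv_path, newline="") as handle:
        for index, row in enumerate(csv.reader(handle), start=1):
            if index == number:
                if len(row) > 1 and row[1].strip():
                    return row[0], row[1].strip()
                return None
    return None


def fetch(number, csv_path=CSV_PATH, image_dir=IMAGE_DIR):
    """Download the image for one row and return a status message."""
    entry = url_for_row(csv_path, number)
    if entry is None:
        return "No result for row {0}".format(number)
    image_id, url = entry
    urllib.request.urlretrieve(url, "{0}/{1}.jpg".format(image_dir, image_id))
    return "Image saved for {0}".format(image_id)


if __name__ == "__main__":
    if len(sys.argv) == 2:
        print(fetch(int(sys.argv[1])))
    else:
        print("usage: python get_url_number.py ROW_NUMBER")
```

Row numbers are 1-based here because seq starts counting at 1. Each invocation rereads the CSV file, which is acceptable for a few thousand rows but would be worth revisiting for much larger inputs.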




