boto EMR: add a step and auto-terminate

Time: 2022-02-03 00:52:51

Python 2.7.12

boto3==1.3.1

How can I add a step to a running EMR cluster and have the cluster terminate after the step completes, regardless of whether it fails or succeeds?

Create the cluster

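import boto3

# EMR client; region and credentials come from the default boto3 configuration
client = boto3.client('emr')

response = client.run_job_flow(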
    Name=name,
    LogUri='s3://mybucket/emr/',
    ReleaseLabel='emr-5.9.0',
    Instances={
        'MasterInstanceType': instance_type,
        'SlaveInstanceType': instance_type,
        'InstanceCount': instance_count,
        'KeepJobFlowAliveWhenNoSteps': True,
        'Ec2KeyName': 'KeyPair',
        'EmrManagedSlaveSecurityGroup': 'sg-1234',
        'EmrManagedMasterSecurityGroup': 'sg-1234',
        'Ec2SubnetId': 'subnet-1q234',
    },
    Applications=[
        {'Name': 'Spark'},
        {'Name': 'Hadoop'}
    ],
    BootstrapActions=[
        {
            'Name': 'Install Python packages',
            'ScriptBootstrapAction': {
                'Path': 's3://mybucket/code/spark/bootstrap_spark_cluster.sh'
            }
        }
    ],
    VisibleToAllUsers=True,
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
    Configurations=[
        {
            'Classification': 'spark',
            'Properties': {
                'maximizeResourceAllocation': 'true'
            }
        },
    ],
)

Add a step

response = client.add_job_flow_steps(
    JobFlowId=cluster_id,
    Steps=[
        {
            'Name': 'Run Step',
            'ActionOnFailure': 'TERMINATE_CLUSTER',
            'HadoopJarStep': {
                'Args': [
                    'spark-submit',
                    '--deploy-mode', 'cluster',
                    '--py-files',
                    's3://mybucket/code/spark/spark_udfs.py',
                    's3://mybucket/code/spark/{}'.format(spark_script),
                    '--some-arg'
                ],
                'Jar': 'command-runner.jar'
            }
        }
    ]
)

This successfully adds the step and runs it. However, when the step completes successfully, I would like the cluster to terminate automatically, as described for the AWS CLI: http://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html

1 solution

#1



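In your case (creating the cluster with boto3), you can set 'TerminationProtected': False and 'KeepJobFlowAliveWhenNoSteps': False in Instances when you create the cluster. The latter is the run_job_flow counterpart of the CLI's --auto-terminate flag; run_job_flow itself has no 'AutoTerminate' parameter. With these settings the cluster shuts down once its steps have finished running, so the steps should be supplied at creation time.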
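A minimal sketch of the first option, assuming the steps are known when the cluster is created. The helper name `build_auto_terminating_cluster`, the instance type `m4.large`, and the script name `my_job.py` are hypothetical; the block only builds the arguments, and the actual `run_job_flow` call is left as a comment because it needs AWS credentials.

```python
def build_auto_terminating_cluster(name, instance_type, instance_count, steps):
    """Return keyword arguments for client.run_job_flow(**kwargs)."""
    return {
        'Name': name,
        'ReleaseLabel': 'emr-5.9.0',
        'Instances': {
            'MasterInstanceType': instance_type,
            'SlaveInstanceType': instance_type,
            'InstanceCount': instance_count,
            # False: the cluster terminates once no steps are left to run
            'KeepJobFlowAliveWhenNoSteps': False,
            'TerminationProtected': False,
        },
        'Applications': [{'Name': 'Spark'}, {'Name': 'Hadoop'}],
        # Steps are passed at creation so the cluster has work before it stops
        'Steps': steps,
        'JobFlowRole': 'EMR_EC2_DefaultRole',
        'ServiceRole': 'EMR_DefaultRole',
    }


# Hypothetical step, shaped like the one in the question
spark_step = {
    'Name': 'Run Step',
    'ActionOnFailure': 'TERMINATE_CLUSTER',
    'HadoopJarStep': {
        'Jar': 'command-runner.jar',
        'Args': ['spark-submit', '--deploy-mode', 'cluster',
                 's3://mybucket/code/spark/my_job.py'],
    },
}

kwargs = build_auto_terminating_cluster('example', 'm4.large', 3, [spark_step])
# import boto3
# response = boto3.client('emr').run_job_flow(**kwargs)
```

With 'ActionOnFailure': 'TERMINATE_CLUSTER' on the step, the cluster ends on failure as well as on success.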

Another solution is to add a second step, placed immediately after the step you want to run, that kills the cluster. In other words, you run this command as a step:

aws emr terminate-clusters --cluster-ids your_cluster_id
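As a rough sketch, the command above could be wrapped in a step definition like the one used in the question. This assumes that command-runner.jar can invoke the AWS CLI on the master node and that the cluster's EC2 instance role is allowed to terminate the cluster; neither is confirmed by the original post. The helper name `build_terminate_step` and the cluster ID `j-EXAMPLE` are hypothetical.

```python
def build_terminate_step(cluster_id):
    """Build a step that runs `aws emr terminate-clusters` for cluster_id."""
    return {
        'Name': 'Terminate cluster',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['aws', 'emr', 'terminate-clusters',
                     '--cluster-ids', cluster_id],
        },
    }


terminate_step = build_terminate_step('j-EXAMPLE')  # placeholder ID
# client.add_job_flow_steps(JobFlowId=cluster_id,
#                           Steps=[spark_step, terminate_step])
```

The step that does the real work would need 'ActionOnFailure' set to 'CONTINUE' or 'TERMINATE_CLUSTER' so that a failure still ends with the cluster shutting down.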

The tricky part is retrieving the cluster_id. Some solutions are discussed here: "Does an EMR master node know it's cluster id?"
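When the steps are submitted from the same script that created the cluster, the ID is already available as `response['JobFlowId']` from `run_job_flow`. For code running on the master node, one possible approach is to read it from a local metadata file. The sketch below assumes the file `/mnt/var/lib/info/job-flow.json` exists and has a `jobFlowId` key; verify this on your EMR release before relying on it. The helper name `read_cluster_id` is hypothetical.

```python
import json


def read_cluster_id(path='/mnt/var/lib/info/job-flow.json'):
    """Return the cluster ID stored in the node's job-flow metadata file.

    The default path and the 'jobFlowId' key are assumptions about the
    layout of an EMR master node.
    """
    with open(path) as f:
        return json.load(f)['jobFlowId']
```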
