【AWS】Amazon EMR ~ boto3 edition ~

■ Introduction

This post is a continuation of the following entries:

https://dk521123.hatenablog.com/entry/2020/02/20/230519
https://dk521123.hatenablog.com/entry/2020/05/27/175610

It covers how to operate Amazon EMR from Python using boto3.

Table of contents

【1】boto3 API specification
【2】Main APIs
 1)run_job_flow
 2)add_job_flow_steps
【3】Usage notes
 1)An EMR endpoint is required
【4】Tips
 1)Using the AWS Glue Data Catalog
 2)Enabling EMRFS consistent view
【5】Sample
 Example 1: Launching EMR

【1】boto3 API specification

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html

【2】Main APIs

1)run_job_flow

* Launches an EMR cluster

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.run_job_flow
https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html

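KeepJobFlowAliveWhenNoSteps (boolean)

* Specifies whether to keep the cluster running after all steps have completed
  (True: keep it running, False: terminate it)
* It is set inside the Instances argument of run_job_flow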
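
As a minimal sketch of how this flag might be used, the helper below builds the Instances argument for either a transient cluster (terminated after the last step) or a long-running one. The function name build_instances, the instance type, and the node counts are hypothetical values chosen only for illustration.

```python
def build_instances(keep_alive, instance_type="m5.xlarge", core_count=2):
    """Build the Instances argument for run_job_flow (illustrative helper)."""
    return {
        "InstanceGroups": [
            {
                "Name": "master",
                "Market": "ON_DEMAND",
                "InstanceRole": "MASTER",
                "InstanceType": instance_type,
                "InstanceCount": 1,
            },
            {
                "Name": "core",
                "Market": "ON_DEMAND",
                "InstanceRole": "CORE",
                "InstanceType": instance_type,
                "InstanceCount": core_count,
            },
        ],
        # True:  the cluster stays up and waits for further steps
        # False: the cluster terminates once all steps have finished
        "KeepJobFlowAliveWhenNoSteps": keep_alive,
        "TerminationProtected": False,
    }


# Transient cluster for a batch job vs. long-running cluster for ad-hoc steps
transient = build_instances(keep_alive=False)
long_running = build_instances(keep_alive=True)
```

The resulting dict would be passed as Instances=transient (or long_running) to emr_client.run_job_flow.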

2)add_job_flow_steps

* Submits steps to an already running EMR cluster for execution

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.add_job_flow_steps
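
The following is a rough sketch of submitting a step to a running cluster. The helper names build_command_step and submit_steps, the cluster ID, and the S3 path in the usage comment are assumptions made for illustration; only add_job_flow_steps and its JobFlowId / Steps parameters come from the boto3 API.

```python
def build_command_step(name, command_args, action_on_failure="CONTINUE"):
    """Build one step definition that runs a command via command-runner.jar."""
    return {
        "Name": name,
        # TERMINATE_JOB_FLOW | TERMINATE_CLUSTER | CANCEL_AND_WAIT | CONTINUE
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": list(command_args),
        },
    }


def submit_steps(emr_client, cluster_id, steps):
    """Submit steps to a running cluster and return the IDs of the new steps."""
    response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=steps)
    return response["StepIds"]


# Usage (requires boto3, AWS credentials, and a running cluster):
#   import boto3
#   emr_client = boto3.client("emr", region_name="us-west-2")
#   step = build_command_step(
#       "hello_spark",
#       ["spark-submit", "--deploy-mode", "cluster",
#        "s3://your-bucket-name/script-path/my_job.py"],
#   )
#   step_ids = submit_steps(emr_client, "j-XXXXXXXXXXXXX", [step])
```

Taking the client as a parameter keeps the helper easy to exercise with a stub object instead of a real AWS connection.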

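【3】Usage notes

1)An EMR endpoint is required

boto3 must be able to reach an EMR endpoint (service name: elasticmapreduce).
boto3 normally resolves the regional endpoint from region_name, but when calls have to go through another endpoint such as a VPC endpoint, pass it explicitly as endpoint_url.
 => e.g. 'https://vpce-xxxx-xxxx.elasticmapreduce.us-west-2.amazonaws.com'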

https://docs.aws.amazon.com/ja_jp/general/latest/gr/emr.html
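
A small sketch of how the endpoint might be made optional when creating the client. The helper name emr_client_kwargs is hypothetical; service_name, region_name, and endpoint_url are the actual boto3.client arguments.

```python
def emr_client_kwargs(region_name, endpoint_url=None):
    """Build keyword arguments for boto3.client, adding endpoint_url only if given."""
    kwargs = {"service_name": "emr", "region_name": region_name}
    if endpoint_url:
        # e.g. a VPC endpoint URL; omit it to use the default regional endpoint
        kwargs["endpoint_url"] = endpoint_url
    return kwargs


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   emr_client = boto3.client(**emr_client_kwargs(
#       "us-west-2",
#       "https://vpce-xxxx-xxxx.elasticmapreduce.us-west-2.amazonaws.com"))
```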

【4】Tips

1)Using the AWS Glue Data Catalog

* Described on the following official page

Using the AWS Glue Data Catalog as the metastore for Hive
https://docs.aws.amazon.com/ja_jp/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html

Sample

response = emr_client.run_job_flow(
  Instances={
    'InstanceGroups': [
      {
        "Name": "emr_master",
        "Market": "ON_DEMAND",
        # A MASTER instance group takes 1 node (or 3 for a multi-master cluster)
        "InstanceCount": 1,
        "InstanceRole": "MASTER",
        "InstanceType": "m5.4xlarge",
        "Configurations": [
          {
            "Classification": "hive-site",
            "Properties": {
              # ★ Specified here
              "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
            }
          },
        ],
      },
    ],
    # ... (remaining Instances settings omitted)
  },
  # ... (remaining arguments omitted)
)

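2)Enabling EMRFS consistent view

* EMRFS consistent view
* Implemented with reference to the following official page
* Amazon S3 has provided strong read-after-write consistency since December 2020, so this setting is mainly relevant to older setups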

https://docs.aws.amazon.com/ja_jp/emr/latest/ReleaseGuide/emrfs-configure-consistent-view.html

Sample

response = emr_client.run_job_flow(
  # ★ Top-level Configurations argument
  Configurations=[
    {
      "Classification": "emrfs-site",
      "Properties": {
        "fs.s3.consistent": "true"
      }
    }
  ],
  # ... (remaining arguments omitted)
)
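
As one possible way to keep both tips in one place, the sketch below assembles a top-level Configurations list from two switches. The helper name build_configurations and its parameters are hypothetical; the classification names and property keys are the ones used in the samples in this section.

```python
GLUE_CLIENT_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def build_configurations(use_glue_catalog=True, emrfs_consistent_view=False):
    """Build the top-level Configurations list for run_job_flow."""
    configurations = []
    if use_glue_catalog:
        # Use the AWS Glue Data Catalog as the Hive metastore
        configurations.append({
            "Classification": "hive-site",
            "Properties": {
                "hive.metastore.client.factory.class": GLUE_CLIENT_FACTORY,
            },
        })
    if emrfs_consistent_view:
        # Enable EMRFS consistent view
        configurations.append({
            "Classification": "emrfs-site",
            "Properties": {"fs.s3.consistent": "true"},
        })
    return configurations


# Usage: emr_client.run_job_flow(..., Configurations=build_configurations())
```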

【5】Sample

Example 1: Launching EMR

import boto3

emr_endpoint_url = "https://vpce-xxxx-xxxx.elasticmapreduce.us-west-2.amazonaws.com"

emr_client = boto3.client(
  'emr',
  region_name='us-west-2',
  endpoint_url=emr_endpoint_url)

# ★ The cluster is launched here ★
response = emr_client.run_job_flow(
  Name="Hello_world",
  LogUri="s3://your-bucket-name/logs/",
  # https://docs.aws.amazon.com/ja_jp/emr/latest/ReleaseGuide/emr-release-components.html
  ReleaseLabel="emr-5.29.0",
  Applications=[
    {
      "Name": "Spark"
    },
    {
      "Name": "Hive"
    },
  ],
  Instances={
    'InstanceGroups': [
      {
        "Name": "emr_master",
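        # ON_DEMAND | SPOT
        "Market": "ON_DEMAND",
        # A MASTER instance group takes 1 node (or 3 for a multi-master cluster)
        "InstanceCount": 1,
        # MASTER | CORE | TASK
        "InstanceRole": "MASTER",
        "InstanceType": "m5.4xlarge",
        # ... (other settings omitted)
      },
      # { ... } (other instance groups such as CORE / TASK omitted)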
    ],
    'EC2KeyName': "sample_ec2_key_name",
    'KeepJobFlowAliveWhenNoSteps': False,
    'TerminationProtected': False,
    'Ec2SubnetId': "xxxxxxx",
    'EmrManagedMasterSecurityGroup': "xxxxx",
    'EmrManagedSlaveSecurityGroup': "xxxxxx",
    'ServiceAccessSecurityGroup': "xxxxx"
  },
  Steps=[
    {
      'Name': 'hello_hive',
      # TERMINATE_JOB_FLOW | TERMINATE_CLUSTER | CANCEL_AND_WAIT | CONTINUE
      'ActionOnFailure': "TERMINATE_CLUSTER",
      'HadoopJarStep': {
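        # NOTE: command-runner.jar executes Args as a command line on the cluster.
        # To run a script stored in S3, script-runner.jar is commonly used instead.
        'Jar': 'command-runner.jar',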
        'Args': [
          's3://your-bucket-name/script-path/my_script.sh',
        ]
      }
    },
  ],
  # https://dev.classmethod.jp/articles/emr-bootstrap-action/
  BootstrapActions=[
    {
      'Name': 'hello_bootstrap_action',
      'ScriptBootstrapAction': {
        'Path': 's3://your-bucket-name/script-path/install_script.sh',
        'Args': [
          'param1',
        ]
      }
    }
  ],
  # https://qiita.com/kazz_ogawa/items/eea9c378193d84139b5d
  VisibleToAllUsers=True,
  JobFlowRole="xxxxxx",
  ServiceRole="xxxxxx",
  AutoScalingRole="xxxxxx",
  # Tags is a list of Key/Value pairs
  Tags=[
    {
      "Key": "Name",
      "Value": "hello_world_emr",
    },
  ]
)

# [Response]
# {
#    'JobFlowId': 'string',
#    'ClusterArn': 'string'
# }
print(response)
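
Because the sample sets KeepJobFlowAliveWhenNoSteps to False, the cluster terminates once its steps have finished. The following is a minimal sketch of blocking until that happens, using the boto3 EMR waiter cluster_terminated. The helper name wait_until_terminated and the polling values are assumptions; the waiter is expected to raise an error if the cluster ends in a failure state or the attempts run out.

```python
def wait_until_terminated(emr_client, cluster_id, delay_seconds=30, max_attempts=120):
    """Block until the cluster has terminated, then return its final state."""
    waiter = emr_client.get_waiter("cluster_terminated")
    waiter.wait(
        ClusterId=cluster_id,
        # Poll every delay_seconds, giving up after max_attempts polls
        WaiterConfig={"Delay": delay_seconds, "MaxAttempts": max_attempts},
    )
    description = emr_client.describe_cluster(ClusterId=cluster_id)
    return description["Cluster"]["Status"]["State"]


# Usage, continuing from the run_job_flow call above:
#   final_state = wait_until_terminated(emr_client, response["JobFlowId"])
#   print(final_state)
```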

References

http://laughingman7743.hatenablog.com/entry/2016/02/11/185319
https://gist.github.com/laughingman7743/5c675c9b1d9ed02539e6
https://dev.classmethod.jp/articles/boto3-emr-step-wait/

Related articles

Amazon EMR ~ Introduction ~
https://dk521123.hatenablog.com/entry/2020/02/20/230519
Amazon EMR ~ Basics ~
https://dk521123.hatenablog.com/entry/2020/05/27/175610
Amazon EMR ~ Integration with AWS Glue ~
https://dk521123.hatenablog.com/entry/2020/11/12/113312
Amazon EMR ~ EMRFS ~
https://dk521123.hatenablog.com/entry/2020/11/13/145545
Amazon EMR ~ IAM Roles ~
https://dk521123.hatenablog.com/entry/2023/07/24/160124
Troubleshooting Amazon EMR
https://dk521123.hatenablog.com/entry/2020/08/05/144724
Amazon S3 ~ Operating S3 with Python boto3 ~
https://dk521123.hatenablog.com/entry/2019/10/21/230004