■ Introduction
https://dk521123.hatenablog.com/entry/2020/02/20/230519
https://dk521123.hatenablog.com/entry/2020/05/27/175610
This post continues from the two articles above and covers operating Amazon EMR with boto3.
Table of contents
【1】boto3 API specification
【2】Main APIs
 1) run_job_flow
 2) add_job_flow_steps
【3】Usage notes
 1) An EMR endpoint is required
【4】Tips
 1) Using the AWS Glue Data Catalog
 2) Enabling EMRFS consistent view
【5】Sample
 Example 1: Launching EMR
【1】boto3 API specification
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html
【2】Main APIs
1) run_job_flow
* Launches an EMR cluster
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.run_job_flow
https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html
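KeepJobFlowAliveWhenNoSteps (boolean, set inside Instances)
* Specifies whether to keep the cluster after all steps have completed (true: keep it running, false: terminate it)
2) add_job_flow_steps
* Passes steps to an EMR cluster that is already running and has the cluster execute them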
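The call described in 2) can be sketched as follows. This is a minimal sketch under stated assumptions, not code from the original post: `build_command_step` and `submit_steps` are hypothetical helper names, and the cluster ID and script path in the usage comment are placeholders. It assumes the cluster was launched with KeepJobFlowAliveWhenNoSteps=True, so that it is still running when the steps arrive.

```python
def build_command_step(name, args, action_on_failure="CONTINUE"):
    """Build one step definition in the shape add_job_flow_steps expects.

    The helper name and the default ActionOnFailure are illustrative choices.
    """
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            # command-runner.jar runs Args as a command on the cluster
            "Jar": "command-runner.jar",
            "Args": list(args),
        },
    }


def submit_steps(emr_client, cluster_id, steps):
    """Send steps to a running cluster and return the IDs EMR assigns to them."""
    response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=steps)
    return response["StepIds"]


# Usage (placeholders, requires boto3 and AWS credentials):
#   import boto3
#   emr_client = boto3.client("emr", region_name="us-west-2")
#   step = build_command_step(
#       "hello_spark",
#       ["spark-submit", "s3://your-bucket-name/script-path/job.py"])
#   step_ids = submit_steps(emr_client, "j-XXXXXXXXXXXXX", [step])
```

Taking the client as an argument keeps the helpers independent of how the client was created (default endpoint or an explicit endpoint_url).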
【3】Usage notes
1) An EMR endpoint is required
An endpoint for EMR (service name: elasticmapreduce) is required and is passed to boto3.client() as endpoint_url => e.g. 'https://vpce-xxxx-xxxx.elasticmapreduce.us-west-2.amazonaws.com'
https://docs.aws.amazon.com/ja_jp/general/latest/gr/emr.html
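One way to keep the endpoint configurable is sketched below. The helper names `emr_client_kwargs` and `create_emr_client` are hypothetical, and the sketch assumes that when no endpoint URL is given, boto3 should fall back to its default endpoint for the region.

```python
def emr_client_kwargs(region_name, endpoint_url=None):
    """Keyword arguments for boto3.client(); endpoint_url is added only when set."""
    kwargs = {"service_name": "emr", "region_name": region_name}
    if endpoint_url:
        kwargs["endpoint_url"] = endpoint_url
    return kwargs


def create_emr_client(region_name, endpoint_url=None):
    """Create an EMR client, optionally pinned to an explicit endpoint."""
    # Imported here so emr_client_kwargs() can be used without boto3 installed
    import boto3

    return boto3.client(**emr_client_kwargs(region_name, endpoint_url))


# Usage (the URL is the placeholder from the text above):
#   emr_client = create_emr_client(
#       "us-west-2",
#       "https://vpce-xxxx-xxxx.elasticmapreduce.us-west-2.amazonaws.com")
```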
【4】Tips
1) Using the AWS Glue Data Catalog
* Described on the following official page
Using the AWS Glue Data Catalog as the metastore for Hive
https://docs.aws.amazon.com/ja_jp/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html
Sample
response = emr_client.run_job_flow(
    Instances={
        'InstanceGroups': [
            {
                "Name": "emr_master",
                "Market": "ON_DEMAND",
                # A MASTER group has 1 node (3 with multiple primary nodes)
                "InstanceCount": 1,
                "InstanceRole": "MASTER",
                "InstanceType": "m5.4xlarge",
                "Configurations": [
                    {
                        "Classification": "hive-site",
                        "Properties": {
                            # ★ Specify it here
                            "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
                        }
                    },
                ],
            },
        ],
    },
    # ... (remaining arguments omitted)
)
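The same setting can be factored into small builders so that it is not retyped for each cluster. This is a sketch only: `glue_catalog_configuration` and `master_instance_group` are hypothetical helper names, and the instance values simply mirror the sample above. The `classification` parameter is there because AWS documentation describes a spark-hive-site classification taking the same property for Spark SQL, which is outside what the sample covers.

```python
GLUE_FACTORY_CLASS = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configuration(classification="hive-site"):
    """Configuration entry that points the metastore client at the Glue Data Catalog."""
    return {
        "Classification": classification,
        "Properties": {"hive.metastore.client.factory.class": GLUE_FACTORY_CLASS},
    }


def master_instance_group(instance_type="m5.4xlarge"):
    """MASTER instance group carrying the Glue setting (values mirror the sample)."""
    return {
        "Name": "emr_master",
        "Market": "ON_DEMAND",
        "InstanceRole": "MASTER",
        "InstanceType": instance_type,
        "InstanceCount": 1,
        "Configurations": [glue_catalog_configuration()],
    }


# Usage: pass the group inside Instances when calling run_job_flow, e.g.
#   Instances={"InstanceGroups": [master_instance_group()], ...}
```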
2) Enabling EMRFS consistent view
* EMRFS consistent view
* Implemented with reference to the following official page
https://docs.aws.amazon.com/ja_jp/emr/latest/ReleaseGuide/emrfs-configure-consistent-view.html
Sample
response = emr_client.run_job_flow(
    # ★ Configurations at the top level of the call
    Configurations=[
        {
            "Classification": "emrfs-site",
            "Properties": {
                "fs.s3.consistent": "true"
            }
        }
    ],
    # ... (remaining arguments omitted)
)
【5】Sample
Example 1: Launching EMR
import boto3

emr_endpoint_url = "https://vpce-xxxx-xxxx.elasticmapreduce.us-west-2.amazonaws.com"
emr_client = boto3.client(
    'emr',
    region_name='us-west-2',
    endpoint_url=emr_endpoint_url)

# ★ Launch the cluster here ★
response = emr_client.run_job_flow(
    Name="Hello_world",
    LogUri="s3://your-bucket-name/logs/",
    # https://docs.aws.amazon.com/ja_jp/emr/latest/ReleaseGuide/emr-release-components.html
    ReleaseLabel="emr-5.29.0",
    Applications=[
        {"Name": "Spark"},
        {"Name": "Hive"},
    ],
    Instances={
        'InstanceGroups': [
            {
                "Name": "emr_master",
                # ON_DEMAND | SPOT
                "Market": "ON_DEMAND",
                # A MASTER group has 1 node (3 with multiple primary nodes)
                "InstanceCount": 1,
                # MASTER | CORE | TASK
                "InstanceRole": "MASTER",
                "InstanceType": "m5.4xlarge",
                # ...
            },
            # { ... }
        ],
        'EC2KeyName': "sample_ec2_key_name",
        'KeepJobFlowAliveWhenNoSteps': False,
        'TerminationProtected': False,
        'Ec2SubnetId': "xxxxxxx",
        'EmrManagedMasterSecurityGroup': "xxxxx",
        'EmrManagedSlaveSecurityGroup': "xxxxxx",
        'ServiceAccessSecurityGroup': "xxxxx"
    },
    Steps=[
        {
            'Name': 'hello_hive',
            # TERMINATE_JOB_FLOW | TERMINATE_CLUSTER | CANCEL_AND_WAIT | CONTINUE
            'ActionOnFailure': "TERMINATE_CLUSTER",
            'HadoopJarStep': {
                # Note: command-runner.jar runs Args as a command on the master node.
                # To run a shell script stored in S3, script-runner.jar is the usual choice.
                'Jar': 'command-runner.jar',
                'Args': [
                    's3://your-bucket-name/script-path/my_script.sh',
                ]
            }
        },
    ],
    # https://dev.classmethod.jp/articles/emr-bootstrap-action/
    BootstrapActions=[
        {
            'Name': 'hello_bootstrap_action',
            'ScriptBootstrapAction': {
                'Path': 's3://your-bucket-name/script-path/install_script.sh',
                'Args': [
                    'param1',
                ]
            }
        }
    ],
    # https://qiita.com/kazz_ogawa/items/eea9c378193d84139b5d
    VisibleToAllUsers=True,
    JobFlowRole="xxxxxx",
    ServiceRole="xxxxxx",
    AutoScalingRole="xxxxxx",
    # Tags is a list of Key/Value pairs
    Tags=[
        {"Key": "Name", "Value": "hello_world_emr"},
    ]
)

# [Response]
# {
#   'JobFlowId': 'string',
#   'ClusterArn': 'string'
# }
print(response)
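After run_job_flow returns, the JobFlowId in the response can be used to follow the steps until they finish. The sketch below relies on the boto3 EMR waiter named step_complete. `list_step_ids` and `wait_for_steps` are hypothetical helper names, the delay and attempt counts are arbitrary defaults, and only the first page of list_steps results is read.

```python
def list_step_ids(emr_client, cluster_id):
    """Return the step IDs on the first page of list_steps for the cluster."""
    response = emr_client.list_steps(ClusterId=cluster_id)
    return [step["Id"] for step in response["Steps"]]


def wait_for_steps(emr_client, cluster_id, step_ids, delay=30, max_attempts=60):
    """Block until every given step completes; the waiter raises on failure or timeout."""
    waiter = emr_client.get_waiter("step_complete")
    for step_id in step_ids:
        waiter.wait(
            ClusterId=cluster_id,
            StepId=step_id,
            # Poll every `delay` seconds, at most `max_attempts` times per step
            WaiterConfig={"Delay": delay, "MaxAttempts": max_attempts},
        )


# Usage (continuing from a run_job_flow response):
#   cluster_id = response["JobFlowId"]
#   wait_for_steps(emr_client, cluster_id, list_step_ids(emr_client, cluster_id))
```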
References
http://laughingman7743.hatenablog.com/entry/2016/02/11/185319
https://gist.github.com/laughingman7743/5c675c9b1d9ed02539e6
https://dev.classmethod.jp/articles/boto3-emr-step-wait/
Related articles
Amazon EMR ~ Getting started ~
https://dk521123.hatenablog.com/entry/2020/02/20/230519
Amazon EMR ~ Basics ~
https://dk521123.hatenablog.com/entry/2020/05/27/175610
Amazon EMR ~ Integration with AWS Glue ~
https://dk521123.hatenablog.com/entry/2020/11/12/113312
Amazon EMR ~ EMRFS ~
https://dk521123.hatenablog.com/entry/2020/11/13/145545
Amazon EMR ~ IAM roles ~
https://dk521123.hatenablog.com/entry/2023/07/24/160124
Troubleshooting Amazon EMR
https://dk521123.hatenablog.com/entry/2020/08/05/144724
Amazon S3 ~ Operating S3 with Python boto3 ~
https://dk521123.hatenablog.com/entry/2019/10/21/230004