■ Introduction
https://dk521123.hatenablog.com/entry/2019/10/21/230004
This is a continuation of the article above. This time, let's look at how to check whether a file exists on S3 using Boto3.
Table of contents
Option 1: Using list_objects_v2
Option 2: Using the "ClientError: Error Code=404" exception
Supplement: Performance
Option 1: Using list_objects_v2
import boto3

s3_client = boto3.client("s3")


def has_file(
    bucket_name: str,
    target_key: str,
    is_allowed_zero_byte: bool = True,
) -> bool:
    """Check whether a file exists in S3.

    Args:
        bucket_name (str): S3 bucket name
        target_key (str): S3 key to look for
        is_allowed_zero_byte (bool, optional):
            True: treat an empty (0-byte) file as existing,
            False: treat it as missing. Defaults to True.

    Returns:
        bool: True if the file exists, False otherwise.
    """
    response = s3_client.list_objects_v2(
        Bucket=bucket_name,
        Prefix=target_key,
    )
    # "Contents" is absent from the response when no key matches the prefix
    s3_contents = response.get("Contents", [])
    for s3_content in s3_contents:
        # Prefix also matches longer keys, so compare the full key
        if s3_content.get("Key") == target_key:
            if not is_allowed_zero_byte and s3_content.get("Size") == 0:
                # is_allowed_zero_byte=False was specified and the file is empty (0 bytes)
                return False
            return True
    return False
References
https://qiita.com/ikai/items/22281b3cfc9636d8587d
Option 2: Using the "ClientError: Error Code=404" exception
from typing import Optional, Tuple

import boto3
from botocore.exceptions import ClientError

s3_client = boto3.client("s3")


def get_file_content(
    bucket_name: str,
    target_key: str,
    file_code: str = "utf-8",
) -> Tuple[Optional[str], int]:
    """Get file content in S3.

    Args:
        bucket_name (str): S3 bucket name
        target_key (str): S3 key to read
        file_code (str, optional): File encoding. Defaults to "utf-8".

    Raises:
        ClientError: Any error from get_object other than "not found"

    Returns:
        Optional[str]: file content in S3 (None: file not found)
        int: Size of the body in bytes (-1: file not found)
    """
    try:
        response = s3_client.get_object(Bucket=bucket_name, Key=target_key)
        body = response["Body"].read()
        return body.decode(file_code), int(response["ContentLength"])
    except ClientError as ex:
        # get_object reports a missing key as "NoSuchKey";
        # "404" is the code that head_object reports
        if ex.response["Error"]["Code"] in ("404", "NoSuchKey"):
            # File not found
            return None, -1
        raise


def has_file(
    bucket_name: str,
    target_key: str,
    is_allowed_zero_byte: bool = True,
) -> bool:
    """Check whether a file exists in S3.

    Args:
        bucket_name (str): S3 bucket name
        target_key (str): S3 key to look for
        is_allowed_zero_byte (bool, optional):
            True: treat an empty (0-byte) file as existing,
            False: treat it as missing. Defaults to True.

    Returns:
        bool: True if the file exists, False otherwise.
    """
    content, file_size = get_file_content(bucket_name, target_key)
    if content is None:
        return False
    # Reject empty files only when they are explicitly not allowed
    if not is_allowed_zero_byte and file_size <= 0:
        return False
    return True
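The version above downloads the whole object just to learn whether it exists. As a rough sketch of the same exception-based idea without the download, head_object can be used instead; it returns only metadata and reports a missing key with error code "404". The function name has_file_by_head and the choice to pass the client in as an argument are assumptions made for this illustration, not part of the original code.

```python
def has_file_by_head(
    s3_client,
    bucket_name: str,
    target_key: str,
    is_allowed_zero_byte: bool = True,
) -> bool:
    """Check whether a key exists using head_object (metadata only).

    s3_client is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    This is a hypothetical helper for illustration.
    """
    try:
        response = s3_client.head_object(Bucket=bucket_name, Key=target_key)
    except s3_client.exceptions.ClientError as ex:
        # head_object has no response body, so a missing key shows up as "404"
        if ex.response["Error"]["Code"] in ("404", "NoSuchKey"):
            return False
        # Anything else (e.g. access denied) is not "file missing"; re-raise
        raise
    # ContentLength is the object size in bytes
    if not is_allowed_zero_byte and response["ContentLength"] == 0:
        return False
    return True
```

With credentials configured, it could be called as has_file_by_head(boto3.client("s3"), "my-bucket", "path/to/file.txt"), where the bucket and key are placeholders.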
Supplement: Performance
The implementation differs somewhat from the ones above, but the following site compares the two approaches.
https://www.peterbe.com/plog/fastest-way-to-find-out-if-a-file-exists-in-s3
Excerpt:
~~~~~~
FUNCTION: _key_existing_size__list Used 511 times
SUM    148.2740752696991
MEAN   0.2901645308604679
MEDIAN 0.2569708824157715
STDEV  0.17742598775696436

FUNCTION: _key_existing_size__head Used 489 times
SUM    249.79622673988342
MEAN   0.510830729529414
MEDIAN 0.4780092239379883
STDEV  0.14352671121877011

Because it's network bound, it's really important to avoid the 'MEAN' and instead look at the 'MEDIAN'. My home broadband can cause temporary spikes.
Clearly, using client.list_objects_v2 is faster. It's 90% faster than client.head_object.
~~~~~~
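To take similar measurements in your own environment, a small timing harness along these lines might help. It is only a sketch and not the method used in the linked article: the name time_checks and the shape of its arguments are assumptions, and each checker is simply run once per key. Following the advice in the excerpt, the median is reported alongside the mean.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_checks(
    checkers: Dict[str, Callable[[str], bool]],
    keys: List[str],
) -> Dict[str, Dict[str, float]]:
    """Time each checker over all keys (keys must not be empty).

    checkers maps a label to a function that takes an S3 key and
    returns whether it exists. Hypothetical helper for illustration.
    """
    results = {}
    for name, checker in checkers.items():
        durations = []
        for key in keys:
            started = time.perf_counter()
            checker(key)
            durations.append(time.perf_counter() - started)
        results[name] = {
            "count": len(durations),
            "sum": sum(durations),
            "mean": statistics.mean(durations),
            # The median is less sensitive to temporary network spikes
            "median": statistics.median(durations),
        }
    return results
```

A checker can be any one-argument callable, for example lambda key: has_file("my-bucket", key), where the bucket name is a placeholder.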
Related articles
Amazon S3 ~ Introduction ~
https://dk521123.hatenablog.com/entry/2022/02/26/182526
Amazon S3 ~ Boto3 ~
https://dk521123.hatenablog.com/entry/2019/10/21/230004
boto3 API / list_objects_v2: usage caveats and workarounds
https://dk521123.hatenablog.com/entry/2019/12/06/232617
Python ~ Basics / Extracting path information ~
https://dk521123.hatenablog.com/entry/2022/02/23/000000