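■ Introduction
I need to work with Parquet files and want to create test data for them. In Python, pandas looks like the easiest way to handle Parquet, so I am summarizing it here in its own article.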
Table of contents
【1】Installation
【2】Writing and reading Parquet
 1) Output / writing
 2) Input / reading
【3】Samples
 Example 1: Simple write and read
 Example 2: Generating Parquet from CSV files
【1】Installation
* Running only "pip install pandas", as described in the related article below, results in an error when Parquet is used
Pandas ~ Introduction ~
https://dk521123.hatenablog.com/entry/2019/10/22/014957
# Install pandas
pip install pandas

# That alone results in the following error, so...
# Trying to import the above resulted in these errors:
#  - Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.
#  - Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.

# ...also install a Parquet engine (either one alone is sufficient)
pip install pyarrow fastparquet
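Before calling any Parquet function, it can help to check which engine is actually importable. The following is a minimal sketch using only the standard library; the helper name find_parquet_engines is hypothetical and not part of pandas.

```python
import importlib.util

# Engines that pandas can use for Parquet I/O
PARQUET_ENGINES = ("pyarrow", "fastparquet")


def find_parquet_engines():
    """Return the names of the Parquet engines importable in this environment."""
    # find_spec() returns None when a top-level package is not installed
    return [name for name in PARQUET_ENGINES
            if importlib.util.find_spec(name) is not None]


available = find_parquet_engines()
if available:
    print(f"Parquet engine(s) found: {', '.join(available)}")
else:
    print("No Parquet engine found; install pyarrow or fastparquet")
```

If the list is empty, to_parquet() and read_parquet() are expected to fail with the import error quoted above.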
【2】Writing and reading Parquet
1) Output / writing
to_parquet(), which I covered earlier in the related article below, does the job.
https://dk521123.hatenablog.com/entry/2021/04/10/192752
API reference: to_parquet
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_parquet.html
2) Input / reading
Use read_parquet().
API reference: read_parquet
https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html
【3】Samples
Example 1: Simple write and read
import pandas as pd

file_name = "customer.parq"

header_list = ['id', 'name', 'city', 'birth_day', 'created_at']
body_list = [
    ["1", "Mike", "Dublin", "2010-10-11", "2021-11-12 23:12:32"],
    ["2", "Sam", "Tokyo", "1988-01-21", "2021-11-12 23:12:32"],
    ["3", "Tom", "London", "1999-12-31", "2021-11-12 23:12:32"],
]
data_frame = pd.DataFrame(data=body_list, columns=header_list)

# Type conversion
data_frame['id'] = data_frame['id'].astype('int64')
data_frame['name'] = data_frame['name'].astype('str')
data_frame['city'] = data_frame['city'].astype('str')
data_frame['birth_day'] = data_frame['birth_day'].astype('datetime64[ns]')
data_frame['created_at'] = data_frame['created_at'].astype('datetime64[ns]')

print("**************")
print(data_frame)
print("**************")
print(data_frame.info())

# 1) Output / writing - to_parquet()
data_frame.to_parquet(file_name, compression='GZIP')

# 2) Input / reading - read_parquet()
loaded_data_frame = pd.read_parquet(file_name)

print("**************")
print(loaded_data_frame)
print("**************")
print(loaded_data_frame.info())
Output
**************
   id  name    city  birth_day          created_at
0   1  Mike  Dublin 2010-10-11 2021-11-12 23:12:32
1   2   Sam   Tokyo 1988-01-21 2021-11-12 23:12:32
2   3   Tom  London 1999-12-31 2021-11-12 23:12:32
**************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   id          3 non-null      int64
 1   name        3 non-null      object
 2   city        3 non-null      object
 3   birth_day   3 non-null      datetime64[ns]
 4   created_at  3 non-null      datetime64[ns]
dtypes: datetime64[ns](2), int64(1), object(2)
memory usage: 252.0+ bytes
None
**************
   id  name    city  birth_day          created_at
0   1  Mike  Dublin 2010-10-11 2021-11-12 23:12:32
1   2   Sam   Tokyo 1988-01-21 2021-11-12 23:12:32
2   3   Tom  London 1999-12-31 2021-11-12 23:12:32
**************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   id          3 non-null      int64
 1   name        3 non-null      object
 2   city        3 non-null      object
 3   birth_day   3 non-null      datetime64[ns]
 4   created_at  3 non-null      datetime64[ns]
dtypes: datetime64[ns](2), int64(1), object(2)
memory usage: 252.0+ bytes
None
Example 2: Generating Parquet from CSV files
import os
import glob
import pandas as pd

# Place the CSV files under inputs/ (contents shown below)
input_path = "inputs/*.csv"
output_path = "outputs"

def read_csv(input_file_path):
    data_frame = pd.read_csv(input_file_path, encoding='UTF-8')
    return data_frame

def write_parquet(data_frame, output_file_path, compression='GZIP'):
    data_frame.to_parquet(output_file_path, compression=compression)

def get_filename_without_extension(target_path):
    file_name = os.path.basename(target_path)
    return os.path.splitext(file_name)[0]

# For verification
def read_and_print_parquet(file_path):
    loaded_data_frame = pd.read_parquet(file_path)
    print(loaded_data_frame)
    print("**************")
    print(loaded_data_frame.info())
    print("**************")

# Create the output directory if it does not exist yet
os.makedirs(output_path, exist_ok=True)

# sorted() keeps the processing order stable across runs
for path in sorted(glob.glob(input_path)):
    df = read_csv(path)

    file_name_without_extension = get_filename_without_extension(path)
    output_file_path = f"{output_path}/{file_name_without_extension}.parquet"
    write_parquet(df, output_file_path)

    # Verify
    read_and_print_parquet(output_file_path)
inputs/input1.csv
id,name,registered_datetime
1,Mike,"2021-01-12 23:12:32"
2,Tom,"2022-02-12 23:12:32"
3,Smith,"2023-03-12 23:12:32"
4,Kevin,"2024-04-12 23:12:32"
inputs/input2.csv
id,remarks
1,world1
2,world2
3,world3
4,world4
Output
   id   name  registered_datetime
0   1   Mike  2021-01-12 23:12:32
1   2    Tom  2022-02-12 23:12:32
2   3  Smith  2023-03-12 23:12:32
3   4  Kevin  2024-04-12 23:12:32
**************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   id                   4 non-null      int64
 1   name                 4 non-null      object
 2   registered_datetime  4 non-null      object
dtypes: int64(1), object(2)
memory usage: 228.0+ bytes
None
**************
   id remarks
0   1  world1
1   2  world2
2   3  world3
3   4  world4
**************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   id       4 non-null      int64
 1   remarks  4 non-null      object
dtypes: int64(1), object(1)
memory usage: 196.0+ bytes
None
**************
Related articles
Python ~ Parquet ~
https://dk521123.hatenablog.com/entry/2021/11/13/095519
Pandas ~ Introduction ~
https://dk521123.hatenablog.com/entry/2019/10/22/014957
Pandas ~ to_xxxx / Output ~
https://dk521123.hatenablog.com/entry/2021/04/10/192752
Pandas ~ Basics / CSV ~
https://dk521123.hatenablog.com/entry/2020/11/17/000000
Parquet files
https://dk521123.hatenablog.com/entry/2020/06/03/000000
Python ~ TOML ~
https://dk521123.hatenablog.com/entry/2024/01/27/000110
Python ~ Introduction ~
https://dk521123.hatenablog.com/entry/2014/08/07/231242
Python ~ Basics / Strings ~
https://dk521123.hatenablog.com/entry/2019/10/12/075251
Python ~ Basics / String manipulation ~
https://dk521123.hatenablog.com/entry/2023/10/20/000000
Python ~ Basics / Regular expressions ~
https://dk521123.hatenablog.com/entry/2019/09/01/000000
Python ~ Basics / Regular expressions, miscellaneous ~
https://dk521123.hatenablog.com/entry/2020/10/15/000000