【Python】Pandas ~ Parquet ~

■ Introduction

I needed to work with Parquet files and wanted to create some test data.
In Python, handling Parquet via Pandas looked like the easiest approach,
so this post summarizes it separately.

Table of Contents

【1】Installation
【2】Writing / reading Parquet
 1) Output / writing
 2) Input / reading
【3】Samples
 Example 1: Simple read and write
 Example 2: Generating Parquet from CSV files

【1】Installation

* Running only "pip install pandas", as described in the related article below,
 is not enough — handling Parquet then fails with an error

Pandas ~ Introduction ~
https://dk521123.hatenablog.com/entry/2019/10/22/014957

# Install Pandas
pip install pandas

# Since the following errors occur...
# Trying to import the above resulted in these errors:
# - Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.
# - Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.
pip install pyarrow fastparquet

【2】Writing / reading Parquet

1) Output / writing

Use to_parquet(), which was covered previously in the related article below.

https://dk521123.hatenablog.com/entry/2021/04/10/192752

API reference: to_parquet
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_parquet.html

2) Input / reading

Use read_parquet().

API reference: read_parquet
https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html

【3】Samples

Example 1: Simple read and write

import pandas as pd


file_name = "customer.parq"
header_list = ['id', 'name', 'city', 'birth_day', 'created_at']
body_list = [
  ["1", "Mike", "Dublin", "2010-10-11", "2021-11-12 23:12:32"],
  ["2", "Sam", "Tokyo", "1988-01-21", "2021-11-12 23:12:32"],
  ["3", "Tom", "London", "1999-12-31", "2021-11-12 23:12:32"],
]

data_frame = pd.DataFrame(data=body_list, columns=header_list)
# Type conversions
data_frame['id'] = data_frame['id'].astype('int64')
data_frame['name'] = data_frame['name'].astype('str')
data_frame['city'] = data_frame['city'].astype('str')
data_frame['birth_day'] = data_frame['birth_day'].astype('datetime64[ns]')
data_frame['created_at'] = data_frame['created_at'].astype('datetime64[ns]')

print("**************")
print(data_frame)
print("**************")
print(data_frame.info())

# 1) Output / writing - to_parquet()
data_frame.to_parquet(file_name, compression='gzip')

# 2) Input / reading - read_parquet()
loaded_data_frame = pd.read_parquet(file_name)

print("**************")
print(loaded_data_frame)
print("**************")
print(loaded_data_frame.info())

Output

**************
   id  name    city  birth_day          created_at
0   1  Mike  Dublin 2010-10-11 2021-11-12 23:12:32
1   2   Sam   Tokyo 1988-01-21 2021-11-12 23:12:32
2   3   Tom  London 1999-12-31 2021-11-12 23:12:32
**************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   id          3 non-null      int64
 1   name        3 non-null      object
 2   city        3 non-null      object
 3   birth_day   3 non-null      datetime64[ns]
 4   created_at  3 non-null      datetime64[ns]
dtypes: datetime64[ns](2), int64(1), object(2)
memory usage: 252.0+ bytes
None
**************
   id  name    city  birth_day          created_at
0   1  Mike  Dublin 2010-10-11 2021-11-12 23:12:32
1   2   Sam   Tokyo 1988-01-21 2021-11-12 23:12:32
2   3   Tom  London 1999-12-31 2021-11-12 23:12:32
**************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   id          3 non-null      int64
 1   name        3 non-null      object
 2   city        3 non-null      object
 3   birth_day   3 non-null      datetime64[ns]
 4   created_at  3 non-null      datetime64[ns]
dtypes: datetime64[ns](2), int64(1), object(2)
memory usage: 252.0+ bytes
None

Example 2: Generating Parquet from CSV files

import os
import glob
import pandas as pd


# Place CSV files under inputs/ (contents shown below)
input_path = "inputs/*.csv"
output_path = "outputs"
# to_parquet() fails if the output directory doesn't exist
os.makedirs(output_path, exist_ok=True)

def read_csv(input_file_path):
  data_frame = pd.read_csv(input_file_path, encoding='UTF-8')
  return data_frame

def write_parquet(data_frame, output_file_path, compression='gzip'):
  data_frame.to_parquet(output_file_path, compression=compression)

def get_filename_without_extension(target_path):
  file_name = os.path.basename(target_path)
  return os.path.splitext(file_name)[0]

# For checking the result
def read_and_print_parquet(file_path):
  loaded_data_frame = pd.read_parquet(file_path)
  print(loaded_data_frame)
  print("**************")
  print(loaded_data_frame.info())
  print("**************")

for path in glob.glob(input_path):
  df = read_csv(path)

  file_name_without_extension = get_filename_without_extension(path)
  output_file_path = f"{output_path}/{file_name_without_extension}.parquet"
  write_parquet(df, output_file_path)

# Check
  read_and_print_parquet(output_file_path)

inputs/input1.csv

id,name,registered_datetime
1,Mike,"2021-01-12 23:12:32"
2,Tom,"2022-02-12 23:12:32"
3,Smith,"2023-03-12 23:12:32"
4,Kevin,"2024-04-12 23:12:32"

inputs/input2.csv

id,remarks
1,world1
2,world2
3,world3
4,world4

Output

   id   name  registered_datetime
0   1   Mike  2021-01-12 23:12:32
1   2    Tom  2022-02-12 23:12:32
2   3  Smith  2023-03-12 23:12:32
3   4  Kevin  2024-04-12 23:12:32
**************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   id                   4 non-null      int64
 1   name                 4 non-null      object
 2   registered_datetime  4 non-null      object
dtypes: int64(1), object(2)
memory usage: 228.0+ bytes
None
**************
   id remarks
0   1  world1
1   2  world2
2   3  world3
3   4  world4
**************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   id       4 non-null      int64
 1   remarks  4 non-null      object
dtypes: int64(1), object(1)
memory usage: 196.0+ bytes
None
**************

Related articles

Python ~ Parquet ~
https://dk521123.hatenablog.com/entry/2021/11/13/095519
Pandas ~ Introduction ~
https://dk521123.hatenablog.com/entry/2019/10/22/014957
Pandas ~ to_xxxx / Output ~
https://dk521123.hatenablog.com/entry/2021/04/10/192752
Pandas ~ Basics / CSV ~
https://dk521123.hatenablog.com/entry/2020/11/17/000000
Parquet files
https://dk521123.hatenablog.com/entry/2020/06/03/000000
Python ~ TOML ~
https://dk521123.hatenablog.com/entry/2024/01/27/000110
Python ~ Introduction ~
https://dk521123.hatenablog.com/entry/2014/08/07/231242
Python ~ Basics / Strings ~
https://dk521123.hatenablog.com/entry/2019/10/12/075251
Python ~ Basics / String operations ~
https://dk521123.hatenablog.com/entry/2023/10/20/000000
Python ~ Basics / Regular expressions ~
https://dk521123.hatenablog.com/entry/2019/09/01/000000
Python ~ Basics / More on regular expressions ~
https://dk521123.hatenablog.com/entry/2020/10/15/000000