■ Introduction

https://dk521123.hatenablog.com/entry/2021/09/18/232556

I covered Glue 3.0 in the article above.
It appears to use Apache Arrow (v2.0),
so I looked into what Arrow is.

【１】Apache Arrow

* A library for efficiently processing large volumes of data in memory

１）Official site

https://arrow.apache.org/overview/

２）Features

* In-memory columnar data format
 => Storing data in a columnar format lets it be compressed efficiently
 => For details on columnar formats, see the related article below.

https://dk521123.hatenablog.com/entry/2011/02/16/205224

 => The columnar layout also makes processing fast

https://arrow.apache.org/overview/

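Columnar is Fast

... (snip) ...
In particular, the contiguous columnar layout enables vectorization
 using the latest SIMD (Single Instruction, Multiple Data) operations
 included in modern processors.
=> In other words, because each column's values are laid out contiguously
   in memory, they can be processed with the SIMD instructions built into
   modern processors, which apply one instruction to multiple values at once.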
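
To make the row vs. column distinction concrete, here is a minimal sketch using only the Python standard library. It is not Arrow itself: the sample records and the names `rows`, `ids`, and `prices` are made up for illustration, and plain Python does not actually vectorize the loop. It only shows what "each column stored as one contiguous buffer" means.

```python
from array import array

# Row-oriented layout: each record keeps all of its fields together.
# (Hypothetical sample data, not taken from the article.)
rows = [
    (1, 10.5),
    (2, 20.0),
    (3, 30.25),
]

# Column-oriented layout: each field is stored as its own buffer.
# array("q") / array("d") pack the values back to back in memory.
ids = array("q", [row[0] for row in rows])
prices = array("d", [row[1] for row in rows])

# Aggregating one field in the row layout has to visit every record...
row_total = sum(row[1] for row in rows)

# ...while the columnar layout scans a single contiguous buffer.
column_total = sum(prices)

print(row_total, column_total)
print(len(prices) * prices.itemsize, "bytes hold the whole price column")
```

A native library such as Arrow can hand a buffer like `prices` directly to SIMD-capable code, which is the point the quoted passage makes.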

３）Supported languages

* C, C++, C#, Go, Java, JavaScript, Julia,
  MATLAB, Python, R, Ruby, and Rust

Python (PyArrow)
https://arrow.apache.org/docs/python/

Excerpt from the page above:
~~~~~
The Arrow Python bindings (also named “PyArrow”)
~~~~~

【２】Setup - PyArrow

https://arrow.apache.org/install/
https://arrow.apache.org/docs/python/install.html

Excerpt from the pages above:
~~~~~
pip install pyarrow==5.0.*
~~~~~

Here I install v2.0.0 to match Glue 3.0:
~~~~~
pip install pyarrow==2.0.0
~~~~~
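
After pinning a version like this, it can be handy to confirm what actually got installed. The following is a small sketch using the standard library's `importlib.metadata` (Python 3.8+). The helper name `installed_version` is my own, not part of PyArrow.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("pyarrow")
if version is None:
    print("pyarrow is not installed")
else:
    print("pyarrow", version)
```

If the pinned install succeeded, this should report the 2.0.0 release requested above.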

【３】Sample

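import pyarrow
import pyarrow.parquet
import pyarrow.csv

# Read the CSV file into an in-memory Arrow table
input_table = pyarrow.csv.read_csv("./hello.csv")
print(input_table)

# Write the table out as a Parquet file
pyarrow.parquet.write_table(input_table, "./hello.parquet")
print("***********")
# Read the Parquet file back into an Arrow table
output_parquet_table = pyarrow.parquet.read_table("./hello.parquet")
print(output_parquet_table)

print("***********")
# Convert to a pandas DataFrame (pandas must be installed)
df = output_parquet_table.to_pandas()
print(df)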

hello.csv

item1,item2,item3
hello1,world1,Mike
hello2,world2,Tom
hello3,world3,Smith
hello4,world4,Kevin

Output

pyarrow.Table
item1: string
item2: string
item3: string
***********
pyarrow.Table
item1: string
item2: string
item3: string
***********
    item1   item2  item3
0  hello1  world1   Mike
1  hello2  world2    Tom
2  hello3  world3  Smith
3  hello4  world4  Kevin