【Distributed Processing】PySpark ~ Cleaning Data ~

■ Introduction

https://dk521123.hatenablog.com/entry/2020/07/06/232350

Here I redo the processing from the entry above, this time with PySpark.

Contents

【1】Replacing values with a CASE expression
【2】Changing the date/time format

【1】Replacing values with a CASE expression

from pyspark import SparkContext
from pyspark.sql import SparkSession

spark_context = SparkContext()
spark = SparkSession(spark_context)

# Read the CSV (first row is the header)
data_frame = spark.read \
  .option("header", "true") \
  .option("delimiter", ",") \
  .csv("./hello.csv")
# Register the DataFrame as a temp view so it can be queried with SQL
data_frame.createOrReplaceTempView("person")
# Map NULL-like markers (the CSV's \\N, N/A, and the empty string)
# to real NULLs, trimming surrounding whitespace from every column
data_frame = spark.sql("""
SELECT
  CASE WHEN UPPER(TRIM(id)) IN ('\\\\\\\\N', 'N/A', '') THEN NULL ELSE TRIM(id) END AS id
  ,CASE WHEN UPPER(TRIM(name)) IN ('\\\\\\\\N', 'N/A', '') THEN NULL ELSE TRIM(name) END AS name
  ,CASE WHEN UPPER(TRIM(remarks)) IN ('\\\\\\\\N', 'N/A', '') THEN NULL ELSE TRIM(remarks) END AS remarks
  ,CASE WHEN UPPER(TRIM(birth_date)) IN ('\\\\\\\\N', 'N/A', '') THEN NULL ELSE TRIM(birth_date) END AS birth_date
  FROM person
""")
data_frame.show()

Input file: hello.csv

id,name,remarks,birth_date
1,Mike,N/A,2020-12-22
2,Tom,\\N,2020-12-22
3,Naomi,,2020-12-22

Output

+---+-----+-------+----------+
| id| name|remarks|birth_date|
+---+-----+-------+----------+
|  1| Mike|   null|2020-12-22|
|  2|  Tom|   null|2020-12-22|
|  3|Naomi|   null|2020-12-22|
+---+-----+-------+----------+
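
As a side note, the same cleaning can be written without SQL, using the DataFrame API's when() / otherwise(). Below is a minimal sketch, not from the original article, assuming the same hello.csv and the same escaping for the \\N marker; looping over data_frame.columns is just a convenience to apply one rule to every column.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, trim, upper, when

spark = SparkSession.builder.getOrCreate()

data_frame = spark.read \
  .option("header", "true") \
  .option("delimiter", ",") \
  .csv("./hello.csv")

# NULL markers; '\\\\N' in Python source resolves to the CSV's \\N,
# mirroring the escaping used in the SQL version above
null_markers = ['\\\\N', 'N/A', '']

# Same logic as the CASE expression: marker values become NULL,
# everything else is trimmed
for column_name in data_frame.columns:
    data_frame = data_frame.withColumn(
        column_name,
        when(upper(trim(col(column_name))).isin(null_markers), None)
        .otherwise(trim(col(column_name))))

data_frame.show()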

【2】Changing the date/time format

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_unixtime, unix_timestamp, col

spark_context = SparkContext()
spark = SparkSession(spark_context)

# Read the CSV (first row is the header)
data_frame = spark.read \
  .option("header", "true") \
  .option("delimiter", ",") \
  .csv("./hello.csv")

# Parse 'yyyy-MM-dd HH:mm:ss.SSS' strings into Unix time,
# then format them back as 'yyyy/MM/dd'
data_frame = data_frame.withColumn(
  'birth_date',
  from_unixtime(
    unix_timestamp(col('birth_date'), 'yyyy-MM-dd HH:mm:ss.SSS'),
    'yyyy/MM/dd'))
data_frame.show()

Input file: hello.csv

id,name,city,birth_date
1,Joe Bloggs,Dublin,2020-12-22 13:23:45.123
2,Joseph Smith,Salt Lake City,1982-09-11 13:23:45.123
3,Mike,Tokyo,1991-01-01 13:23:45.123

Output

+---+------------+--------------+----------+
| id|        name|          city|birth_date|
+---+------------+--------------+----------+
|  1|  Joe Bloggs|        Dublin|2020/12/22|
|  2|Joseph Smith|Salt Lake City|1982/09/11|
|  3|        Mike|         Tokyo|1991/01/01|
+---+------------+--------------+----------+
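
As a side note, since Spark 2.2 the same conversion can be done with to_timestamp() and date_format(), which skips the round-trip through Unix time (unix_timestamp() also truncates fractional seconds, though that makes no difference for a date-only output format). A minimal sketch, not from the original article, assuming the same hello.csv:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_format, to_timestamp

spark = SparkSession.builder.getOrCreate()

data_frame = spark.read \
  .option("header", "true") \
  .option("delimiter", ",") \
  .csv("./hello.csv")

# Parse the string as a timestamp, then render it as 'yyyy/MM/dd'
data_frame = data_frame.withColumn(
  'birth_date',
  date_format(
    to_timestamp(col('birth_date'), 'yyyy-MM-dd HH:mm:ss.SSS'),
    'yyyy/MM/dd'))
data_frame.show()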


Related articles

Hive / HiveQL ~ Cleaning Data ~
https://dk521123.hatenablog.com/entry/2020/07/06/232350
PySpark ~ Hive ~
https://dk521123.hatenablog.com/entry/2020/05/14/222415
PySpark ~ Setup + Hello World ~
https://dk521123.hatenablog.com/entry/2019/11/14/221126
PySpark ~ CSV ~
https://dk521123.hatenablog.com/entry/2019/11/24/225534
PySpark ~ Handling data containing escaped delimiters ~
https://dk521123.hatenablog.com/entry/2020/07/09/000832
PySpark ~ DataFrame / Data Manipulation ~
https://dk521123.hatenablog.com/entry/2020/01/04/150942