■ Introduction
This post reproduces, in PySpark, the processing done in
https://dk521123.hatenablog.com/entry/2020/07/06/232350
Table of Contents
【1】Replacing values with a CASE expression
【2】Converting the date/datetime format

【1】Replacing values with a CASE expression
from pyspark import SparkContext
from pyspark.sql import SparkSession

spark_context = SparkContext()
spark = SparkSession(spark_context)

data_frame = spark.read \
  .option("header", "true") \
  .option("delimiter", ",") \
  .csv("./hello.csv")

data_frame.createOrReplaceTempView("person")
data_frame = spark.sql("""
SELECT
  CASE WHEN UPPER(TRIM(id)) IN ('\\\\N', 'N/A', '') THEN NULL
    ELSE TRIM(id) END AS id
  , CASE WHEN UPPER(TRIM(name)) IN ('\\\\N', 'N/A', '') THEN NULL
    ELSE TRIM(name) END AS name
  , CASE WHEN UPPER(TRIM(remarks)) IN ('\\\\N', 'N/A', '') THEN NULL
    ELSE TRIM(remarks) END AS remarks
  , CASE WHEN UPPER(TRIM(birth_date)) IN ('\\\\N', 'N/A', '') THEN NULL
    ELSE TRIM(birth_date) END AS birth_date
FROM person
""")
data_frame.show()
Input file: hello.csv
id,name,remarks,birth_date
1,Mike,N/A,2020-12-22
2,Tom,\N,2020-12-22
3,Naomi,,2020-12-22
Output
+---+-----+-------+----------+
| id| name|remarks|birth_date|
+---+-----+-------+----------+
|  1| Mike|   null|2020-12-22|
|  2|  Tom|   null|2020-12-22|
|  3|Naomi|   null|2020-12-22|
+---+-----+-------+----------+
【2】Converting the date/datetime format
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_unixtime, unix_timestamp, col

spark_context = SparkContext()
spark = SparkSession(spark_context)

data_frame = spark.read \
  .option("header", "true") \
  .option("delimiter", ",") \
  .csv("./hello.csv")

data_frame = data_frame.withColumn(
  'birth_date',
  from_unixtime(
    unix_timestamp(col('birth_date'), 'yyyy-MM-dd HH:mm:ss.SSS'),
    'yyyy/MM/dd'))
data_frame.show()
Input file: hello.csv
id,name,city,birth_date
1,Joe Bloggs,Dublin,2020-12-22 13:23:45.123
2,Joseph Smith,Salt Lake City,1982-09-11 13:23:45.123
3,Mike,Tokyo,1991-01-01 13:23:45.123
Output
+---+------------+--------------+----------+
| id|        name|          city|birth_date|
+---+------------+--------------+----------+
|  1|  Joe Bloggs|        Dublin|2020/12/22|
|  2|Joseph Smith|Salt Lake City|1982/09/11|
|  3|        Mike|         Tokyo|1991/01/01|
+---+------------+--------------+----------+
References
https://qiita.com/paulxll/items/1c0833782cd4e1de86e2
Related articles
Hive / HiveQL ~ Cleaning data ~
https://dk521123.hatenablog.com/entry/2020/07/06/232350
PySpark ~ Hive ~
https://dk521123.hatenablog.com/entry/2020/05/14/222415
PySpark ~ Environment setup + Hello World ~
https://dk521123.hatenablog.com/entry/2019/11/14/221126
PySpark ~ CSV ~
https://dk521123.hatenablog.com/entry/2019/11/24/225534
PySpark ~ Handling data containing escaped delimiters ~
https://dk521123.hatenablog.com/entry/2020/07/09/000832
PySpark ~ DataFrame / Data manipulation ~
https://dk521123.hatenablog.com/entry/2020/01/04/150942