■ Introduction

These notes cover UDFs (User Defined Functions) in PySpark.

【１】What is a UDF?

* UDF = User Defined Function
* A feature that lets functions written by the user run as distributed processing on a Spark cluster

【２】How to define a UDF

１）Wrap a function with the udf() function
２）Use a decorator
３）Register it with spark.udf.register()
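The three definition methods listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the related article: the function names (is_adult, to_upper, name_length, run_spark_demo), the sample rows, and the view name "people" are all hypothetical. The PySpark imports sit inside run_spark_demo() so that the plain Python logic can be checked without a Spark installation.

```python
# Sketch of the three ways to define a UDF.
# All function names and sample data here are hypothetical.


def is_adult(age):
    """Plain Python logic: True when age is 20 or over."""
    return age >= 20


def to_upper(text):
    """Plain Python logic: upper-case a string."""
    return text.upper()


def name_length(text):
    """Plain Python logic: number of characters in a string."""
    return len(text)


def run_spark_demo():
    # Imported here so the functions above can be tested without Spark.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import BooleanType, IntegerType, StringType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("Mike", 18), ("Sam", 23)], ["name", "age"])

    # 1) Wrap an existing function with udf()
    is_adult_udf = udf(is_adult, BooleanType())

    # 2) Use udf as a decorator
    @udf(returnType=StringType())
    def to_upper_udf(text):
        return to_upper(text)

    # 3) Register with spark.udf.register() so it can be called from SQL
    spark.udf.register("name_length", name_length, IntegerType())

    df.withColumn("is_adult", is_adult_udf(col("age"))) \
      .withColumn("name_upper", to_upper_udf(col("name"))) \
      .show()

    df.createOrReplaceTempView("people")
    spark.sql("SELECT name, name_length(name) AS name_len FROM people").show()

    spark.stop()
```

Calling run_spark_demo() in an environment with PySpark installed should show both result tables. Method 3 is the one to reach for when the function has to be usable from SQL strings.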

⇒ For actual code, see the related article below.

PySpark ～ Samples for each UDF definition method ～
https://dk521123.hatenablog.com/entry/2021/05/27/100132

⇒ For details on decorators, see the related article below.

Python ～ Basics / Decorators @xxxx ～
https://dk521123.hatenablog.com/entry/2020/05/19/000000

【３】Usage notes

* See the related article below.

https://dk521123.hatenablog.com/entry/2021/05/20/095706

【４】Sample

* For samples other than the one below, see the related article below.

PySpark ～ Samples for each UDF definition method ～
https://dk521123.hatenablog.com/entry/2021/05/27/100132

Example: Using UDFs to add columns to a DataFrame

from pyspark import SparkContext
from pyspark.sql import SparkSession

from pyspark.sql.types import StructType
from pyspark.sql.types import StructField
from pyspark.sql.types import StringType
from pyspark.sql.types import IntegerType
from pyspark.sql.types import DoubleType
from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf
from pyspark.sql.functions import col

spark_context = SparkContext()
spark = SparkSession(spark_context)

# UDFs defined with lambdas
is_adult_udf = udf(
  lambda age: age >= 20, BooleanType())
# 0.0092 is the fixed conversion rate to USD used in this sample
to_usd_udf = udf(
  lambda salary: salary * 0.0092, DoubleType())

rdd = spark_context.parallelize([
  (1, 'Mike', 18, 320000, 'Sales'),
  (2, 'Tom', 19, 200000, 'IT'),
  (3, 'Sam', 23, 320000, 'Sales'),
  (4, 'Kevin', 32, 300000, 'Human resources'),
  (5, 'Bob', 45, 460000, 'IT'),
  (6, 'Alice', 20, 230000, 'Banking'),
  (7, 'Carol', 54, 500000, 'IT'),
])
schema = StructType([
  StructField('id', IntegerType(), False),
  StructField('name', StringType(), False),
  StructField('age', IntegerType(), False),
  StructField('salary', IntegerType(), False),
  StructField('job', StringType(), False),
])
data_frame = spark.createDataFrame(rdd, schema)

# Add columns
data_frame = data_frame \
  .withColumn('is_adult', is_adult_udf(col("age"))) \
  .withColumn('salary_usd', to_usd_udf(data_frame.salary))

data_frame.show()

Output

+---+-----+---+------+---------------+--------+----------+
| id| name|age|salary|            job|is_adult|salary_usd|
+---+-----+---+------+---------------+--------+----------+
|  1| Mike| 18|320000|          Sales|   false|    2944.0|
|  2|  Tom| 19|200000|             IT|   false|    1840.0|
|  3|  Sam| 23|320000|          Sales|    true|    2944.0|
|  4|Kevin| 32|300000|Human resources|    true|    2760.0|
|  5|  Bob| 45|460000|             IT|    true|    4232.0|
|  6|Alice| 20|230000|        Banking|    true|    2116.0|
|  7|Carol| 54|500000|             IT|    true|    4600.0|
+---+-----+---+------+---------------+--------+----------+
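The sample above declares every column as non-nullable, so its lambdas never see a missing value. If a column could contain nulls, a Python UDF receives None for those rows, and a comparison such as None >= 20 raises a TypeError in Python 3. The following is a hedged sketch of a null-safe variant. The named functions is_adult and to_usd, the rate parameter, and the helper add_columns are hypothetical and are not part of the original sample.

```python
# Null-safe variant of the sample's UDF logic (hypothetical names).


def is_adult(age):
    """Return None for a missing age, otherwise True when age is 20 or over."""
    if age is None:
        return None
    return age >= 20


def to_usd(salary, rate=0.0092):
    """Return None for a missing salary, otherwise salary converted at `rate`."""
    if salary is None:
        return None
    return float(salary * rate)


def add_columns(data_frame):
    """Add is_adult and salary_usd columns to a DataFrame with age and salary."""
    # Imported here so the functions above can be tested without Spark.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import BooleanType, DoubleType

    is_adult_udf = udf(is_adult, BooleanType())
    to_usd_udf = udf(to_usd, DoubleType())
    return (data_frame
            .withColumn('is_adult', is_adult_udf(col('age')))
            .withColumn('salary_usd', to_usd_udf(col('salary'))))
```

Defining the logic as named functions instead of lambdas also makes it possible to unit test it as plain Python before wrapping it with udf().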

References

https://blog.amedama.jp/entry/2018/01/31/210755
https://qiita.com/takugenn/items/eb725f6bfa0bc38b5d79

Super-personal programming notes

Memo for Programming.

【Distributed processing】PySpark ～ User-defined functions (UDF) ～
