■ Introduction

I have built up a fair amount of knowledge about Hive
configuration properties, so I am summarizing it here.

Official site
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

* For configuration properties related to Hadoop,
　see the related article below.

Apache Hadoop ~Configuration Properties~
https://dk521123.hatenablog.com/entry/2021/06/23/151148

【０】Partitioning
　１）hive.exec.dynamic.partition
　２）hive.exec.dynamic.partition.mode
　３）hive.exec.max.dynamic.partitions
　４）hive.msck.path.validation
　５）spark.sql.sources.partitionOverwriteMode

【１】Optimization - hive.optimize etc
　１）hive.optimize.skewjoin
　２）hive.optimize.constant.propagation
　３）hive.auto.convert.join

【２】Execution - hive.execution
　１）hive.execution.engine

【３】Statistics - hive.compute
　１）hive.compute.query.using.stats

【４】Compression
　１）hive.exec.compress.output
　２）mapred.output.compression.codec

【５】Metastore
　１）hive.metastore.warehouse.dir / spark.sql.warehouse.dir

【６】Others
　１）hive.cli.print.header
　２）hive.mapred.mode
　３）hive.variable.substitute

【０】Partitioning

* For the following properties, see the related articles below.
　１）hive.exec.dynamic.partition
　２）hive.exec.dynamic.partition.mode
　３）hive.exec.max.dynamic.partitions
　４）hive.msck.path.validation
　５）spark.sql.sources.partitionOverwriteMode

https://dk521123.hatenablog.com/entry/2020/09/18/113637
https://dk521123.hatenablog.com/entry/2021/07/07/093147

【１】Optimization - hive.optimize etc

１）hive.optimize.skewjoin
２）hive.optimize.constant.propagation
３）hive.auto.convert.join

１）hive.optimize.skewjoin

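* skew = distorted, lopsided
* Whether to apply the skew join optimization (default: false)
 => Skewed data is split into sizes that are easier to distribute before the join is executed
 => The diagram on the site below makes this easy to understand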

https://data-flair.training/blogs/skew-join-in-hive/

Sample

SET hive.optimize.skewjoin=true;

Official site
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=82903061#ConfigurationProperties-hive.optimize.skewjoin.compiletime

References
https://support.datafabric.hpe.com/s/article/Join-query-on-Tez-hangs-by-data-skew-at-Reducer-Stage?language=ja
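
The setting above could be used roughly as follows. This is only a sketch: hive.skewjoin.key (the number of rows per join key above which the key is treated as skewed, default 100000) does not appear in the notes above, and the tables orders and customers are hypothetical.

```sql
-- Enable the runtime skew join optimization (default: false)
SET hive.optimize.skewjoin=true;

-- Assumption: a join key with more than this many rows is treated as skewed
SET hive.skewjoin.key=100000;

-- Hypothetical tables: orders is large and heavily skewed on customer_id
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id;
```

Whether this actually helps depends on the data distribution and the execution engine, so it is worth comparing run times with the setting on and off.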

２）hive.optimize.constant.propagation

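* Whether to enable the constant propagation optimization (default: true)
 => Constant propagation optimization = values known to be constant are substituted at compile time so that constant expressions can be simplified
 => For details on constant propagation, see below.

https://en.wikipedia.org/wiki/Constant_folding#Constant_propagation
https://ja.wikipedia.org/wiki/%E5%AE%9A%E6%95%B0%E7%95%B3%E3%81%BF%E8%BE%BC%E3%81%BF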
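
To get a feel for what this optimization does, one option is to compare EXPLAIN output with the setting on and off. The sketch below assumes a hypothetical table sales(id INT, region STRING, amount INT); the exact plan text depends on the Hive version.

```sql
-- Hypothetical table: sales(id INT, region STRING, amount INT)
SET hive.optimize.constant.propagation=true;

-- With the optimization on, the predicate "10 * 100" is expected to
-- appear in the plan as a single constant rather than an expression
EXPLAIN
SELECT id, amount
FROM sales
WHERE region = 'JP'
  AND amount > 10 * 100;

-- Turn it off and run the same EXPLAIN to compare the two plans
SET hive.optimize.constant.propagation=false;
```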

３）hive.auto.convert.join

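* When hive.auto.convert.join=true, Hive automatically converts a common join into a map join based on the input file size (the small table's data is loaded into memory before the job starts)
* The default reportedly differs by version: false in 0.7.0 to 0.10.0, true in 0.11.0 and later (see the site below for details)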

https://debug-life.net/entry/1223

References
https://wyukawa.hatenablog.com/entry/20120328/1332950392
https://developers.microad.co.jp/entry/2020/04/13/063000
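
A minimal sketch of enabling automatic map join conversion. hive.mapjoin.smalltable.filesize (the size in bytes below which a table counts as "small", default 25000000) is not covered in the notes above, and the tables access_log and countries are hypothetical.

```sql
-- Let Hive convert a common join into a map join when one side is small
SET hive.auto.convert.join=true;

-- Assumption: tables under roughly 25 MB are treated as small (in bytes)
SET hive.mapjoin.smalltable.filesize=25000000;

-- Hypothetical tables: access_log is large, countries is a small master table
SELECT l.request_id, c.country_name
FROM access_log l
JOIN countries c
  ON l.country_code = c.country_code;
```

If the small table does not fit in memory, the job can fail or slow down, so setting hive.auto.convert.join=false is a common first step when investigating such problems.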

【２】Execution - hive.execution

１）hive.execution.engine

１）hive.execution.engine

* Specifies the Hive execution engine (mr, tez, or spark)
* For trouble related to hive.execution.engine, see the related article below
 (the error "java.lang.NoClassDefFoundError: scala/collection/Iterable" occurs)

https://dk521123.hatenablog.com/entry/2020/05/28/175428

Sample

-- Use the Spark engine
SET hive.execution.engine=spark;

-- Use the MapReduce engine
SET hive.execution.engine=mr;

Official site
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=82903061#ConfigurationProperties-hive.execution.engine

【３】Statistics - hive.compute

１）hive.compute.query.using.stats

１）hive.compute.query.using.stats

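* Sets whether queries such as min, max, and count(1) are answered
　from the statistics stored in the metastore

=> Setting this to true may improve performance because the statistics
　in the metastore are used, but if those statistics have not been
　updated, unintended results are returned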

Sample

-- Return results without using the statistics in the metastore
SET hive.compute.query.using.stats=false;

References
https://qiita.com/hayanige/items/baad6e39aa6805a8b245
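
If the statistics are to be used, they should be refreshed first. The following is a rough sketch using ANALYZE TABLE on a hypothetical table access_log with a hypothetical column accessed_at.

```sql
-- Hypothetical table: access_log
-- Refresh table-level statistics (row count etc.) in the metastore
ANALYZE TABLE access_log COMPUTE STATISTICS;

-- Refresh column-level statistics (used for min / max)
ANALYZE TABLE access_log COMPUTE STATISTICS FOR COLUMNS;

-- Answer simple aggregates from the metastore statistics
SET hive.compute.query.using.stats=true;
SELECT COUNT(1) FROM access_log;
SELECT MIN(accessed_at), MAX(accessed_at) FROM access_log;
```

When files are added to the table location outside of Hive, the statistics are not updated automatically, which is the typical case where the counts stop matching.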

【４】Compression

* For data compression, see also the related article below.

Hive / HiveQL ~Notes on Data Compression~
https://dk521123.hatenablog.com/entry/2021/08/06/172502

１）hive.exec.compress.output

* true : Compress the output data
* false : Do not compress the output data (default)

２）mapred.output.compression.codec

* Specifies the compression codec

Sample

set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

References
https://blog.amedama.jp/entry/2018/02/15/234725
https://open-groove.net/hive/hive-hacks2/
https://software.fujitsu.com/jp/manual/manualfiles/m150005/j2ul1563/04z200/j1563-03-17-02-03.html
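
A sketch of writing compressed query output with these two settings. It uses mapreduce.output.fileoutputformat.compress.codec, the newer Hadoop name for mapred.output.compression.codec, which is an addition not found in the notes above. The output path and the table access_log are hypothetical.

```sql
-- Compress the final output of the query
SET hive.exec.compress.output=true;

-- Assumption: newer Hadoop property name for the output codec
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;

-- Hypothetical path and table: the files written here should be gzip-compressed
INSERT OVERWRITE DIRECTORY '/tmp/compressed_output'
SELECT * FROM access_log;
```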

【５】Metastore

１）hive.metastore.warehouse.dir / spark.sql.warehouse.dir

* Specifies the location of the default database for the warehouse

hive-site.xml

... (omitted) ...
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/apps/spark/warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>
</configuration>

Usage notes
https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html

Excerpt
~~~~~
Note that the hive.metastore.warehouse.dir property
in hive-site.xml is deprecated since Spark 2.0.0.

Instead, use spark.sql.warehouse.dir to specify
the default location of databases in the warehouse.
~~~~~

However, it still appears to be used on AWS EMR.

https://docs.aws.amazon.com/ja_jp/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html

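Excerpt
~~~~~
When you create a Hive table without specifying a LOCATION,
the table data is stored in the location specified by
the hive.metastore.warehouse.dir property.

You can use the hive-site configuration classification
to specify a location in Amazon S3 for
hive.metastore.warehouse.dir.
~~~~~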

【６】Others

１）hive.cli.print.header

* Turns the column header in query output on or off (default: false)
* hive.cli.print.header = true / false

https://qiita.com/daifuku_mochi2/items/46b2038f733246f91d26
https://portaltan.hatenablog.com/entry/2018/06/29/102139
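
A minimal usage sketch, assuming a hypothetical table users with columns id and name.

```sql
-- Print column names as the first line of the query output
SET hive.cli.print.header=true;

-- Hypothetical table: users(id, name)
SELECT id, name
FROM users
LIMIT 3;
```

This is handy when redirecting query output to a file, since the column names are kept as the first line.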

２）hive.mapred.mode

* hive.mapred.mode = strict / nonstrict
* cf. mapred = MapReduce?

https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

Excerpt
~~~~~~~~~~~
In strict mode, some risky queries are not allowed to run.

For example, full table scans are prevented (see HIVE-10454)
 and ORDER BY requires a LIMIT clause.
~~~~~~~~~~~

The error "SemanticException Cartesian products are disabled for safety reasons" occurs
https://dk521123.hatenablog.com/entry/2021/06/12/093046
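
A short sketch of switching between the two modes, based on the excerpt above. The table access_log and its column accessed_at are hypothetical.

```sql
-- Reject risky queries
SET hive.mapred.mode=strict;

-- Hypothetical table: access_log
-- In strict mode, ORDER BY is expected to be accompanied by LIMIT
SELECT *
FROM access_log
ORDER BY accessed_at
LIMIT 100;

-- Relax the checks for the current session only
SET hive.mapred.mode=nonstrict;
```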

３）hive.variable.substitute

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution#LanguageManualVariableSubstitution-DisablingVariableSubstitution

According to the page above, variable substitution can apparently
be turned off by setting "hive.variable.substitute=false".
For details, see the related article below.

https://dk521123.hatenablog.com/entry/2021/06/24/094254
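
The difference can be sketched as follows. The variable name target_date is a hypothetical example.

```sql
-- Define a Hive variable (hypothetical name: target_date)
SET hivevar:target_date=2021-06-24;

-- Substitution is on by default, so the reference below is
-- replaced with the variable's value before the query runs
SELECT '${hivevar:target_date}';

-- Turn substitution off: the same text is now passed through as written
SET hive.variable.substitute=false;
SELECT '${hivevar:target_date}';
```

Turning substitution off is mainly useful when the query text itself has to contain a literal ${...}.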

プログラムの超個人的なメモ

Memo for Programming.

【Hive】Hive / HiveQL ~Configuration Properties~
