Version: Next

Hive

Hive source connector

Description

Read data from Hive.

When using markdown format, SeaTunnel can parse markdown files stored in Hive tables and extract structured data with elements like headings, paragraphs, lists, code blocks, and tables. Each element is converted to a row with the following schema:

  • element_id: Unique identifier for the element
  • element_type: Type of the element (Heading, Paragraph, ListItem, etc.)
  • heading_level: Level of heading (1-6, null for non-heading elements)
  • text: Text content of the element
  • page_number: Page number (default: 1)
  • position_index: Position index within the document
  • parent_id: ID of the parent element
  • child_ids: Comma-separated list of child element IDs

Note: Markdown format only supports reading, not writing.

tip

To use this connector, you must ensure your Spark/Flink cluster has already integrated Hive. The tested Hive versions are 2.3.9 and 3.1.3.

If you use SeaTunnel Engine, you need to put seatunnel-hadoop3-3.1.4-uber.jar, hive-exec-3.1.3.jar, and libfb303-0.9.3.jar in the $SEATUNNEL_HOME/lib/ directory.

Key features

Reads all the data in a split in one pollNext call. The splits that have been read are saved in the snapshot.

Options

| name                  | type   | required | default value  |
|-----------------------|--------|----------|----------------|
| table_name            | string | yes      | -              |
| metastore_uri         | string | yes      | -              |
| krb5_path             | string | no       | /etc/krb5.conf |
| kerberos_principal    | string | no       | -              |
| kerberos_keytab_path  | string | no       | -              |
| hdfs_site_path        | string | no       | -              |
| hive_site_path        | string | no       | -              |
| hive.hadoop.conf      | Map    | no       | -              |
| hive.hadoop.conf-path | string | no       | -              |
| read_partitions       | list   | no       | -              |
| read_columns          | list   | no       | -              |
| compress_codec        | string | no       | none           |
| common-options        |        | no       | -              |

table_name [string]

The target Hive table name, e.g. db1.table1

metastore_uri [string]

The URI of the Hive metastore

hdfs_site_path [string]

The path of hdfs-site.xml, used to load the HA configuration of the NameNodes
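For example, a source reading from an HA HDFS cluster might be configured as follows (the hdfs-site.xml path is illustrative and depends on your Hadoop installation):

```
Hive {
  table_name = "default.seatunnel_orc"
  metastore_uri = "thrift://namenode001:9083"
  hdfs_site_path = "/etc/hadoop/conf/hdfs-site.xml"
}
```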

hive.hadoop.conf [map]

Properties in Hadoop conf ('core-site.xml', 'hdfs-site.xml', 'hive-site.xml')

hive.hadoop.conf-path [string]

The specified loading path for the 'core-site.xml', 'hdfs-site.xml', 'hive-site.xml' files

read_partitions [list]

The target partitions that the user wants to read from the Hive table. If this parameter is not set, all the data in the table is read.

Tips: Every partition in the partitions list should have the same directory depth. For example, if a Hive table has two partition keys, par1 and par2, setting read_partitions = [par1=xxx, par1=yyy/par2=zzz] is illegal because the entries have different depths.
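For instance, a legal setting for a two-level partition layout with keys par1 and par2 keeps every entry at the same depth (the partition values here are hypothetical):

```
Hive {
  table_name = "default.seatunnel_orc"
  metastore_uri = "thrift://namenode001:9083"
  read_partitions = ["par1=xxx/par2=zzz", "par1=yyy/par2=zzz"]
}
```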

krb5_path [string]

The path of krb5.conf, used for Kerberos authentication

kerberos_principal [string]

The principal for Kerberos authentication

kerberos_keytab_path [string]

The keytab file path for Kerberos authentication

read_columns [list]

The list of columns to read from the data source; the user can use it to implement field projection.
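For example, to project only three columns out of the table (the column names are illustrative):

```
Hive {
  table_name = "default.seatunnel_orc"
  metastore_uri = "thrift://namenode001:9083"
  read_columns = ["pk_id", "name", "score"]
}
```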

compress_codec [string]

The compression codec of files. The supported codecs are as follows:

  • txt: lzo, none
  • json: lzo, none
  • csv: lzo, none
  • orc/parquet: automatically recognizes the compression type; no additional settings required.
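For example, reading a text-format table compressed with lzo might look like this (the table name is hypothetical; orc/parquet tables need no such setting):

```
Hive {
  table_name = "default.seatunnel_text_lzo"
  metastore_uri = "thrift://namenode001:9083"
  compress_codec = "lzo"
}
```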

common options

Source plugin common parameters; please refer to Source Common Options for details.

Example

Example 1: Single table


Hive {
  table_name = "default.seatunnel_orc"
  metastore_uri = "thrift://namenode001:9083"
}

Example 2: Multiple tables

Note: Hive is a structured data source and should use 'table_list'; 'tables_configs' will be removed in the future.


Hive {
  table_list = [
    {
      table_name = "default.seatunnel_orc_1"
      metastore_uri = "thrift://namenode001:9083"
    },
    {
      table_name = "default.seatunnel_orc_2"
      metastore_uri = "thrift://namenode001:9083"
    }
  ]
}


Hive {
  tables_configs = [
    {
      table_name = "default.seatunnel_orc_1"
      metastore_uri = "thrift://namenode001:9083"
    },
    {
      table_name = "default.seatunnel_orc_2"
      metastore_uri = "thrift://namenode001:9083"
    }
  ]
}

Example 3: Kerberos

source {
  Hive {
    table_name = "default.test_hive_sink_on_hdfs_with_kerberos"
    metastore_uri = "thrift://metastore:9083"
    hive.hadoop.conf-path = "/tmp/hadoop"
    plugin_output = hive_source
    hive_site_path = "/tmp/hive-site.xml"
    kerberos_principal = "hive/metastore.seatunnel@EXAMPLE.COM"
    kerberos_keytab_path = "/tmp/hive.keytab"
    krb5_path = "/tmp/krb5.conf"
  }
}

Description:

  • hive_site_path: The path to the hive-site.xml file.
  • kerberos_principal: The principal for Kerberos authentication.
  • kerberos_keytab_path: The keytab file path for Kerberos authentication.
  • krb5_path: The path to the krb5.conf file used for Kerberos authentication.

Run the case:

env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  Hive {
    table_name = "default.test_hive_sink_on_hdfs_with_kerberos"
    metastore_uri = "thrift://metastore:9083"
    hive.hadoop.conf-path = "/tmp/hadoop"
    plugin_output = hive_source
    hive_site_path = "/tmp/hive-site.xml"
    kerberos_principal = "hive/metastore.seatunnel@EXAMPLE.COM"
    kerberos_keytab_path = "/tmp/hive.keytab"
    krb5_path = "/tmp/krb5.conf"
  }
}

sink {
  Assert {
    plugin_input = hive_source
    rules {
      row_rules = [
        {
          rule_type = MAX_ROW
          rule_value = 3
        }
      ],
      field_rules = [
        {
          field_name = pk_id
          field_type = bigint
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        },
        {
          field_name = name
          field_type = string
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        },
        {
          field_name = score
          field_type = int
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        }
      ]
    }
  }
}

Hive on S3

Step 1

Create the lib dir for Hive on EMR.

mkdir -p ${SEATUNNEL_HOME}/plugins/Hive/lib

Step 2

Download the jars from Maven Central into the lib dir.

cd ${SEATUNNEL_HOME}/plugins/Hive/lib
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.6.5/hadoop-aws-2.6.5.jar
wget https://repo1.maven.org/maven2/org/apache/hive/hive-exec/2.3.9/hive-exec-2.3.9.jar

Step 3

Copy the jars from your EMR environment to the lib dir.

cp /usr/share/aws/emr/emrfs/lib/emrfs-hadoop-assembly-2.60.0.jar ${SEATUNNEL_HOME}/plugins/Hive/lib
cp /usr/share/aws/emr/hadoop-state-pusher/lib/hadoop-common-3.3.6-amzn-1.jar ${SEATUNNEL_HOME}/plugins/Hive/lib
cp /usr/share/aws/emr/hadoop-state-pusher/lib/javax.inject-1.jar ${SEATUNNEL_HOME}/plugins/Hive/lib
cp /usr/share/aws/emr/hadoop-state-pusher/lib/aopalliance-1.0.jar ${SEATUNNEL_HOME}/plugins/Hive/lib

Step 4

Run the case.

env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  Hive {
    table_name = "test_hive.test_hive_sink_on_s3"
    metastore_uri = "thrift://ip-192-168-0-202.cn-north-1.compute.internal:9083"
    hive.hadoop.conf-path = "/home/ec2-user/hadoop-conf"
    hive.hadoop.conf = {
      bucket = "s3://ws-package"
      fs.s3a.aws.credentials.provider = "com.amazonaws.auth.InstanceProfileCredentialsProvider"
    }
    read_columns = ["pk_id", "name", "score"]
  }
}

sink {
  Hive {
    table_name = "test_hive.test_hive_sink_on_s3_sink"
    metastore_uri = "thrift://ip-192-168-0-202.cn-north-1.compute.internal:9083"
    hive.hadoop.conf-path = "/home/ec2-user/hadoop-conf"
    hive.hadoop.conf = {
      bucket = "s3://ws-package"
      fs.s3a.aws.credentials.provider = "com.amazonaws.auth.InstanceProfileCredentialsProvider"
    }
  }
}

Hive on OSS

Step 1

Create the lib dir for Hive on EMR.

mkdir -p ${SEATUNNEL_HOME}/plugins/Hive/lib

Step 2

Download the jars from Maven Central into the lib dir.

cd ${SEATUNNEL_HOME}/plugins/Hive/lib
wget https://repo1.maven.org/maven2/org/apache/hive/hive-exec/2.3.9/hive-exec-2.3.9.jar

Step 3

Copy the jars from your EMR environment to the lib dir and delete the conflicting jar.

cp -r /opt/apps/JINDOSDK/jindosdk-current/lib/jindo-*.jar ${SEATUNNEL_HOME}/plugins/Hive/lib
rm -f ${SEATUNNEL_HOME}/lib/hadoop-aliyun-*.jar

Step 4

Run the case.

env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  Hive {
    table_name = "test_hive.test_hive_sink_on_oss"
    metastore_uri = "thrift://master-1-1.c-1009b01725b501f2.cn-wulanchabu.emr.aliyuncs.com:9083"
    hive.hadoop.conf-path = "/tmp/hadoop"
    hive.hadoop.conf = {
      bucket = "oss://emr-osshdfs.cn-wulanchabu.oss-dls.aliyuncs.com"
    }
  }
}

sink {
  Hive {
    table_name = "test_hive.test_hive_sink_on_oss_sink"
    metastore_uri = "thrift://master-1-1.c-1009b01725b501f2.cn-wulanchabu.emr.aliyuncs.com:9083"
    hive.hadoop.conf-path = "/tmp/hadoop"
    hive.hadoop.conf = {
      bucket = "oss://emr-osshdfs.cn-wulanchabu.oss-dls.aliyuncs.com"
    }
  }
}

Changelog

Change | Commit | Version
[Feature][File] Add markdown parser #9714https://github.com/apache/seatunnel/commit/8b3c07844dev
[Improve][API] Optimize the enumerator API semantics and reduce lock calls at the connector level (#9671)https://github.com/apache/seatunnel/commit/9212a771402.3.12
[Feature][connector-hive] hive sink connector support overwrite mode #7843 (#7891)https://github.com/apache/seatunnel/commit/6fafe6f4d32.3.12
[Fix][Connector-V2] Fix hive client thread unsafe (#9282)https://github.com/apache/seatunnel/commit/5dc25897a92.3.11
[improve] update file connectors config (#9034)https://github.com/apache/seatunnel/commit/8041d59dc22.3.11
[Improve] Refactor file enumerator to prevent duplicate put split (#8989)https://github.com/apache/seatunnel/commit/fdf1beae9c2.3.11
Revert " [improve] update localfile connector config" (#9018)https://github.com/apache/seatunnel/commit/cdc79e13ad2.3.10
[improve] update localfile connector config (#8765)https://github.com/apache/seatunnel/commit/def369a85f2.3.10
[Improve][connector-hive] Improved hive file allocation algorithm for subtasks (#8876)https://github.com/apache/seatunnel/commit/89d1878ade2.3.10
[Improve] restruct connector common options (#8634)https://github.com/apache/seatunnel/commit/f3499a6eeb2.3.10
[Fix][Hive] Writing parquet files supports the optional timestamp int96 (#8509)https://github.com/apache/seatunnel/commit/856aea19522.3.10
[Fix] Set all snappy dependency use one version (#8423)https://github.com/apache/seatunnel/commit/3ac977c8d32.3.9
[Fix][Connector-V2] Fix hive krb5 path not work (#8228)https://github.com/apache/seatunnel/commit/e18a4d07b42.3.9
[Improve][dist]add shade check rule (#8136)https://github.com/apache/seatunnel/commit/51ef8000162.3.9
[Feature][File] Support config null format for text file read (#8109)https://github.com/apache/seatunnel/commit/2dbf02df472.3.9
[Improve][API] Unified tables_configs and table_list (#8100)https://github.com/apache/seatunnel/commit/84c0b8d6602.3.9
[Feature][Core] Rename result_table_name/source_table_name to plugin_input/plugin_output (#8072)https://github.com/apache/seatunnel/commit/c7bbd322db2.3.9
[Feature][E2E] Add hive3 e2e test case (#8003)https://github.com/apache/seatunnel/commit/9a24fac2c42.3.9
[Improve][Connector-V2] Change File Read/WriteStrategy setSeaTunnelRowTypeInfo to setCatalogTable (#7829)https://github.com/apache/seatunnel/commit/6b5f74e5242.3.9
[Feature][Restapi] Allow metrics information to be associated to logical plan nodes (#7786)https://github.com/apache/seatunnel/commit/6b7c53d03c2.3.9
[Improve][Zeta] Split the classloader of task group (#7580)https://github.com/apache/seatunnel/commit/3be0d1cc612.3.8
[Feature][Core] Support using upstream table placeholders in sink options and auto replacement (#7131)https://github.com/apache/seatunnel/commit/c4ca74122c2.3.6
[Improve][Hive] Close resources when exception occurs (#7205)https://github.com/apache/seatunnel/commit/561171528b2.3.6
[Hotfix][Hive Connector] Fix Hive hdfs-site.xml and hive-site.xml not be load error (#7069)https://github.com/apache/seatunnel/commit/c23a577f342.3.6
Fix hive load hive_site_path and hdfs_site_path too late (#7017)https://github.com/apache/seatunnel/commit/e2578a5b4d2.3.6
[Bug][connector-hive] Eanble login with kerberos for hive (#6893)https://github.com/apache/seatunnel/commit/26e433e4722.3.6
[Feature][S3 File] Make S3 File Connector support multiple table write (#6698)https://github.com/apache/seatunnel/commit/8f2049b2f12.3.6
[Feature] Hive Source/Sink support multiple table (#5929)https://github.com/apache/seatunnel/commit/4d9287fce42.3.6
[Improve][Hive] udpate hive3 version (#6699)https://github.com/apache/seatunnel/commit/1184c05c292.3.6
[HiveSink]Fix the risk of resource leakage. (#6721)https://github.com/apache/seatunnel/commit/c23804f13b2.3.6
[Improve][Connector-v2] The hive connector support multiple filesystem (#6648)https://github.com/apache/seatunnel/commit/8a4c01fe352.3.6
[Fix][Connector-V2] Fix add hive partition error when partition already existed (#6577)https://github.com/apache/seatunnel/commit/2a0a0b9d192.3.5
Fix HiveMetaStoreProxy#enableKerberos will return true if doesn't enable kerberos (#6307)https://github.com/apache/seatunnel/commit/1dad6f70612.3.4
[Feature][Engine] Unify job env parameters (#6003)https://github.com/apache/seatunnel/commit/2410ab38f02.3.4
[Refactor][File Connector] Put Multiple Table File API to File Base Module (#6033)https://github.com/apache/seatunnel/commit/c324d663b42.3.4
Support using multiple hadoop account (#5903)https://github.com/apache/seatunnel/commit/d69d88d1aa2.3.4
[Improve][Common] Introduce new error define rule (#5793)https://github.com/apache/seatunnel/commit/9d1b2582b22.3.4
Support config column/primaryKey/constraintKey in schema (#5564)https://github.com/apache/seatunnel/commit/eac76b4e502.3.4
[Hotfix][Connector-V2][Hive] fix the bug that hive-site.xml can not be injected in HiveConf (#5261)https://github.com/apache/seatunnel/commit/04ce22ac1e2.3.4
[Improve][Connector-v2][HiveSink]remove drop partition when abort. (#4940)https://github.com/apache/seatunnel/commit/edef87b5232.3.3
[feature][web] hive add option because web need (#5154)https://github.com/apache/seatunnel/commit/5e1511ff0d2.3.3
[Hotfix][Connector-V2][Hive] Support user-defined hive-site.xml (#4965)https://github.com/apache/seatunnel/commit/2a064bcdb02.3.3
Change file type to file_format_type in file source/sink (#4249)https://github.com/apache/seatunnel/commit/973a2fae3c2.3.1
[hotfix] fixed schema options import errorhttps://github.com/apache/seatunnel/commit/656805f2df2.3.1
[chore] Code format with spotless plugin.https://github.com/apache/seatunnel/commit/291214ad6f2.3.1
Merge branch 'dev' into merge/cdchttps://github.com/apache/seatunnel/commit/4324ee19122.3.1
[Improve][Project] Code format with spotless plugin.https://github.com/apache/seatunnel/commit/423b5830382.3.1
[Imprve][Connector-V2][Hive] Support read text table & Column projection (#4105)https://github.com/apache/seatunnel/commit/717620f5422.3.1
[Hotfix][Connector-V2][Hive] Fix hive unknownhost (#4141)https://github.com/apache/seatunnel/commit/f1a1dfe4af2.3.1
[Improve][build] Give the maven module a human readable name (#4114)https://github.com/apache/seatunnel/commit/d7cd6010512.3.1
[Improve][Project] Code format with spotless plugin. (#4101)https://github.com/apache/seatunnel/commit/a2ab1665612.3.1
[Improve][Connector-V2][Hive] Support assign partitions (#3842)https://github.com/apache/seatunnel/commit/6a4a850b4c2.3.1
[Improve][Connector-V2][Hive] Improve config check logic (#3886)https://github.com/apache/seatunnel/commit/b4348f6f442.3.1
[Feature][Connector-V2] Support kerberos in hive and hdfs file connector (#3840)https://github.com/apache/seatunnel/commit/055ad9d8362.3.1
[Feature][Connector] add get source method to all source connector (#3846)https://github.com/apache/seatunnel/commit/417178fb842.3.1
[Improve][Connector-V2] The log outputs detailed exception stack information (#3805)https://github.com/apache/seatunnel/commit/d0c6217f272.3.1
[Feature][Shade] Add seatunnel hadoop3 uber (#3755)https://github.com/apache/seatunnel/commit/5a024bdf8f2.3.0
[Feature][Connector-V2][File] Optimize filesystem utils (#3749)https://github.com/apache/seatunnel/commit/ac4e880fb52.3.0
[Hotfix][OptionRule] Fix option rule about all connectors (#3592)https://github.com/apache/seatunnel/commit/226dc6a1192.3.0
[Hotfix][Connector-V2][Hive] Fix npe of getting file system (#3506)https://github.com/apache/seatunnel/commit/e1fc3d1b012.3.0
[Improve][Connector-V2][Hive] Unified exceptions for hive source & sink connector (#3541)https://github.com/apache/seatunnel/commit/12c0fb91d22.3.0
[Feature][Connector-V2][File] Add option and factory for file connectors (#3375)https://github.com/apache/seatunnel/commit/db286e86312.3.0
[Hotfix][Connector-V2][Hive] Fix the bug that when write data to hive throws NullPointerException (#3258)https://github.com/apache/seatunnel/commit/777bf6b42e2.3.0
[Improve][Connector-V2][Hive] Hive Sink Support msck partitions (#3133)https://github.com/apache/seatunnel/commit/a8738ef3c42.3.0-beta
unify flatten-maven-plugin version (#3078)https://github.com/apache/seatunnel/commit/ed743fddcc2.3.0-beta
[Engine][Merge] fix merge problemhttps://github.com/apache/seatunnel/commit/0e9ceeefc92.3.0-beta
Merge remote-tracking branch 'upstream/dev' into st-enginehttps://github.com/apache/seatunnel/commit/ca80df779a2.3.0-beta
update hive.metastore.version to hive.exec.version (#2879)https://github.com/apache/seatunnel/commit/018ee0a3db2.2.0-beta
[Bug][Connector-V2] Fix hive sink bug (#2870)https://github.com/apache/seatunnel/commit/d661fa011e2.2.0-beta
[Fix][Connector-V2] Fix HiveSource Connector read orc table error (#2845)https://github.com/apache/seatunnel/commit/61720306e72.2.0-beta
[Bug][Connector-V2] Fix hive source text table name (#2797)https://github.com/apache/seatunnel/commit/563637ebd12.2.0-beta
[Improve][Connector-V2] Refactor hive source & sink connector (#2708)https://github.com/apache/seatunnel/commit/a357dca3652.2.0-beta
[DEV][Api] Replace SeaTunnelContext with JobContext and remove singleton pattern (#2706) (#2731)https://github.com/apache/seatunnel/commit/e8929ab6052.3.0-beta
[DEV][Api] Replace SeaTunnelContext with JobContext and remove singleton pattern (#2706)https://github.com/apache/seatunnel/commit/cbf82f755c2.2.0-beta
[#2606]Dependency management split (#2630)https://github.com/apache/seatunnel/commit/fc047be69b2.2.0-beta
[Improve][Connector-V2] Refactor the package of hdfs file connector (#2402)https://github.com/apache/seatunnel/commit/87d0624c5b2.2.0-beta
[Feature][Connector-V2] Add orc file support in connector hive sink (#2311) (#2374)https://github.com/apache/seatunnel/commit/81cb80c0502.2.0-beta
[improve][UT] Upgrade junit to 5.+ (#2305)https://github.com/apache/seatunnel/commit/362319ff3e2.2.0-beta
Decide table format using outputFormat in HiveSinkConfig #2303https://github.com/apache/seatunnel/commit/3a2586f6dc2.2.0-beta
[Feature][Connector-V2-Hive] Add parquet file format support to Hive Sink (#2310)https://github.com/apache/seatunnel/commit/4ab3c21b8d2.2.0-beta
Add BaseHiveCommitInfo for common hive commit info (#2306)https://github.com/apache/seatunnel/commit/0d2f6f4d7c2.2.0-beta
Remove same code to independent method in HiveSinkWriter (#2307)https://github.com/apache/seatunnel/commit/e99e6ee7262.2.0-beta
Avoid potential null pointer risk in HiveSinkWriter#snapshotState (#2302)https://github.com/apache/seatunnel/commit/e7d817f7d22.2.0-beta
[Connector-V2] Add file type check logic in hive connector (#2275)https://github.com/apache/seatunnel/commit/5488337c672.2.0-beta
[Connector-V2] Add parquet file reader for Hive Source Connector (#2199) (#2237)https://github.com/apache/seatunnel/commit/59db97ed342.2.0-beta
Merge from dev to st-engine (#2243)https://github.com/apache/seatunnel/commit/41e530afd52.3.0-beta
StateT of SeaTunnelSource should extend Serializable (#2214)https://github.com/apache/seatunnel/commit/8c426ef8502.2.0-beta
[Bug][connector-hive] filter '_SUCCESS' file in file list (#2235) (#2236)https://github.com/apache/seatunnel/commit/db046515232.2.0-beta
[Bug][hive-connector-v2] Resolve the schema inconsistency bug (#2229) (#2230)https://github.com/apache/seatunnel/commit/62ca0759152.2.0-beta
[Bug][spark-connector-v2-example] fix the bug of no class found. (#2191) (#2192)https://github.com/apache/seatunnel/commit/5dbc2df17e2.2.0-beta
[Connector-V2] Add Hive sink connector v2 (#2158)https://github.com/apache/seatunnel/commit/23ad4ee7352.2.0-beta
[Connector-V2] Add File Sink Connector (#2117)https://github.com/apache/seatunnel/commit/e2283da64f2.2.0-beta
[Connector-V2]Hive Source (#2123)https://github.com/apache/seatunnel/commit/ffcf3f59e22.2.0-beta
[api-draft][Optimize] Optimize module name (#2062)https://github.com/apache/seatunnel/commit/f79e3112b12.2.0-beta