Version: Next

Hudi

Hudi sink connector

Description

Used to write data to Hudi.

Key features

Options

Base configuration:

name	type	required	default value
table_dfs_path	string	yes	-
conf_files_path	string	no	-
table_list	Array	no	-
schema_save_mode	enum	no	CREATE_SCHEMA_WHEN_NOT_EXIST
common-options	Config	no	-

Table list configuration:

name	type	required	default value
table_name	string	yes	-
database	string	no	default
table_type	enum	no	COPY_ON_WRITE
op_type	enum	no	insert
record_key_fields	string	no	-
partition_fields	string	no	-
precombine_field	string	no	-
batch_interval_ms	Int	no	1000
batch_size	Int	no	1000
insert_shuffle_parallelism	Int	no	2
upsert_shuffle_parallelism	Int	no	2
min_commits_to_keep	Int	no	20
max_commits_to_keep	Int	no	30
index_type	enum	no	BLOOM
index_class_name	string	no	-
record_byte_size	Int	no	1024
cdc_enabled	boolean	no	false

Note: When this configuration corresponds to a single table, you can flatten the configuration items in table_list to the outer layer.
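
As a sketch of what this note describes, the two sink blocks below are intended to be equivalent ways of configuring one table. The path, database, and table name are placeholder values, not defaults.

```hocon
# Form 1: the single table is declared inside table_list
sink {
  Hudi {
    table_dfs_path = "hdfs://nameservice/data/"
    table_list = [
      {
        database = "st"
        table_name = "test_table"
        table_type = "COPY_ON_WRITE"
      }
    ]
  }
}

# Form 2: the same table-level items flattened to the outer layer
sink {
  Hudi {
    table_dfs_path = "hdfs://nameservice/data/"
    database = "st"
    table_name = "test_table"
    table_type = "COPY_ON_WRITE"
  }
}
```

Only one of the two forms would appear in a real job file; they are shown together here for comparison.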

table_name [string]

The name of the Hudi table.

database [string]

The database of the Hudi table.

table_dfs_path [string]

The DFS root path of the Hudi table, such as 'hdfs://nameservice/data/hudi/'.

table_type [enum]

The type of the Hudi table. The value is COPY_ON_WRITE or MERGE_ON_READ.

record_key_fields [string]

The record key fields of the Hudi table. They are used to generate the record key. This option must be configured when op_type is UPSERT.

partition_fields [string]

The partition key fields of the Hudi table. They are used to generate the partition.

precombine_field [string]

The precombine field of the Hudi table. It is used to pre-combine records before the actual write.
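
The three field options above are typically set together for an upsert workload. The following is a minimal sketch, assuming a hypothetical table whose columns `order_id`, `order_date`, and `updated_at` exist upstream; none of these names come from the connector itself.

```hocon
sink {
  Hudi {
    table_dfs_path = "hdfs://nameservice/data/"
    database = "st"
    table_name = "orders"             # hypothetical table name
    op_type = "UPSERT"
    record_key_fields = "order_id"    # hypothetical column; required because op_type is UPSERT
    partition_fields = "order_date"   # hypothetical column used to generate the partition
    precombine_field = "updated_at"   # hypothetical column used for pre-combining before the write
  }
}
```

In Hudi, the precombine field is generally used to decide which record wins when several records share the same key, so a monotonically increasing column such as an update timestamp is a common choice.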

index_type [enum]

The index type of the Hudi table. Currently, BLOOM, SIMPLE, and GLOBAL_SIMPLE are supported.

index_class_name [string]

The fully qualified class name of a custom index for the Hudi table, for example org.apache.seatunnel.connectors.seatunnel.hudi.index.CustomHudiIndex.

record_byte_size [Int]

The byte size of each record. This value is used to estimate the approximate number of records in each Hudi data file. Tuning it can reduce write amplification for Hudi data files.

conf_files_path [string]

A semicolon-separated list of local configuration file paths, used to initialize the HDFS client that reads the Hudi table files. For example: '/home/test/hdfs-site.xml;/home/test/core-site.xml;/home/test/yarn-site.xml'.

op_type [enum]

The operation type of the Hudi table. The value is insert, upsert, or bulk_insert.

batch_interval_ms [Int]

The interval, in milliseconds, between batch writes to the Hudi table.

batch_size [Int]

The number of records written to the Hudi table per batch.
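
A minimal sketch of tuning the two batch options together. The values are illustrative only, and the comments assume that a batch is written when either limit is reached, which this page does not state explicitly.

```hocon
sink {
  Hudi {
    table_dfs_path = "hdfs://nameservice/data/"
    table_name = "test_table"
    batch_size = 5000          # assumed: write once this many records are buffered
    batch_interval_ms = 2000   # assumed: also write at least every 2 seconds
  }
}
```

Larger batches generally mean fewer, bigger writes at the cost of higher latency, so these two values are usually adjusted as a pair.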

insert_shuffle_parallelism [Int]

The parallelism used when inserting data into the Hudi table.

upsert_shuffle_parallelism [Int]

The parallelism used when upserting data into the Hudi table.

min_commits_to_keep [Int]

The minimum number of commits to keep for the Hudi table.

max_commits_to_keep [Int]

The maximum number of commits to keep for the Hudi table.
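
A minimal sketch of setting both commit-retention options. It assumes these options map onto Hudi's own commit-retention settings, where the maximum is expected to be larger than the minimum; the values shown are illustrative, not recommendations.

```hocon
sink {
  Hudi {
    table_dfs_path = "hdfs://nameservice/data/"
    table_name = "test_table"
    min_commits_to_keep = 10   # illustrative value, lower than the default of 20
    max_commits_to_keep = 15   # illustrative value; kept greater than min_commits_to_keep
  }
}
```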

cdc_enabled [boolean]

Whether to persist the CDC change log. When enabled, the change data is persisted if necessary, and the table can be queried in CDC query mode.
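
A minimal sketch of enabling the change log for an upsert table. The table and column names are hypothetical, and pairing this option with UPSERT is an illustrative choice rather than a documented requirement.

```hocon
sink {
  Hudi {
    table_dfs_path = "hdfs://nameservice/data/"
    database = "st"
    table_name = "user"             # hypothetical table name
    op_type = "UPSERT"
    record_key_fields = "user_id"   # hypothetical column
    cdc_enabled = true              # persist the CDC change log for this table
  }
}
```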

schema_save_mode [enum]

Before the synchronization task starts, choose how to handle the existing table structure on the target side.
Options:
RECREATE_SCHEMA: Creates the table when it does not exist; drops and recreates it when it already exists.
CREATE_SCHEMA_WHEN_NOT_EXIST: Creates the table when it does not exist; skips creation when it already exists.
ERROR_WHEN_SCHEMA_NOT_EXIST: Reports an error when the table does not exist.
IGNORE: Does not process the table.
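
A minimal sketch of overriding the default mode, for a job that should fail rather than create a missing table. The path, database, and table name are placeholders.

```hocon
sink {
  Hudi {
    table_dfs_path = "hdfs://nameservice/data/"
    database = "st"
    table_name = "test_table"
    # Report an error instead of creating the table when it is missing
    schema_save_mode = "ERROR_WHEN_SCHEMA_NOT_EXIST"
  }
}
```

Since schema_save_mode is listed in the base configuration rather than the table list, it is set at the outer level of the Hudi block here.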

common options

Sink plugin common parameters. Please refer to Sink Common Options for details.

Examples

Single table

sink {
  Hudi {
    table_dfs_path = "hdfs://nameservice/data/"
    database = "st"
    table_name = "test_table"
    table_type = "COPY_ON_WRITE"
    conf_files_path = "/home/test/hdfs-site.xml;/home/test/core-site.xml;/home/test/yarn-site.xml"
    batch_size = 10000
    use.kerberos = true
    kerberos.principal = "test_user@xxx"
    kerberos.principal.file = "/home/test/test_user.keytab"
  }
}

Multiple tables

env {
  parallelism = 1
  job.mode = "STREAMING"
  checkpoint.interval = 5000
}

source {
  Mysql-CDC {
    url = "jdbc:mysql://127.0.0.1:3306/seatunnel"
    username = "root"
    password = "******"
    
    table-names = ["seatunnel.role","seatunnel.user","galileo.Bucket"]
  }
}

transform {
}

sink {
  Hudi {
    table_dfs_path = "hdfs://nameservice/data/"
    conf_files_path = "/home/test/hdfs-site.xml;/home/test/core-site.xml;/home/test/yarn-site.xml"
    table_list = [
      {
        database = "st1"
        table_name = "role"
        table_type = "COPY_ON_WRITE"
        op_type = "INSERT"
        batch_size = 10000
      },
      {
        database = "st1"
        table_name = "user"
        table_type = "COPY_ON_WRITE"
        op_type = "UPSERT"
        # When op_type is 'UPSERT', record_key_fields must be configured
        record_key_fields = "user_id"
        batch_size = 10000
      },
      {
        database = "st1"
        table_name = "Bucket"
        table_type = "MERGE_ON_READ"
      }
    ]
    ...
  }
}

Changelog

Change	Commit	Version
[Fix][Core]fix kotlin jar conflict (#9683)	https://github.com/apache/seatunnel/commit/c4ec5c0be5	2.3.12
[Improve][Connector-Hudi] Add pre-combine field option for hudi sink (#9496)	https://github.com/apache/seatunnel/commit/f134d7e129	2.3.12
[Feature][Checkpoint] Add check script for source/sink state class serialVersionUID missing (#9118)	https://github.com/apache/seatunnel/commit/4f5adeb1c7	2.3.11
[improve] hudi options (#8952)	https://github.com/apache/seatunnel/commit/b24d0e7f86	2.3.10
[Improve] restruct connector common options (#8634)	https://github.com/apache/seatunnel/commit/f3499a6eeb	2.3.10
[Improve][CI]skip ui module, improve module dependent (#8225)	https://github.com/apache/seatunnel/commit/81de0a69cc	2.3.9
[Feature][Connector-V2] Support write cdc changelog event into hudi sink (#7845)	https://github.com/apache/seatunnel/commit/934434cc75	2.3.9
[Feature][Restapi] Allow metrics information to be associated to logical plan nodes (#7786)	https://github.com/apache/seatunnel/commit/6b7c53d03c	2.3.9
[Feature][Connector-V2] Optimize hudi sink (#7662)	https://github.com/apache/seatunnel/commit/0d12520f91	2.3.8
[Improve][Connector] Add multi-table sink option check (#7360)	https://github.com/apache/seatunnel/commit/2489f6446b	2.3.7
[Feature][Core] Support using upstream table placeholders in sink options and auto replacement (#7131)	https://github.com/apache/seatunnel/commit/c4ca74122c	2.3.6
Bump org.xerial.snappy:snappy-java (#7144)	https://github.com/apache/seatunnel/commit/aa26471fb7	2.3.6
[Feature][Connector-V2] [Hudi]Add hudi sink connector (#4405)	https://github.com/apache/seatunnel/commit/dc271dcfb4	2.3.6
[Fix][Connector-V2] Fix connector support SPI but without no args constructor (#6551)	https://github.com/apache/seatunnel/commit/5f3c9c36a5	2.3.5
[Improve][Common] Adapt `FILE_OPERATION_FAILED` to `CommonError` (#5928)	https://github.com/apache/seatunnel/commit/b3dc0bbc21	2.3.4
[Improve][Common] Introduce new error define rule (#5793)	https://github.com/apache/seatunnel/commit/9d1b2582b2	2.3.4
[Hotfix][Zeta] Fix conflict dependency of hadoop-hdfs (#4509)	https://github.com/apache/seatunnel/commit/66923fbdbd	2.3.2
[Improve][build] Give the maven module a human readable name (#4114)	https://github.com/apache/seatunnel/commit/d7cd601051	2.3.1
[Improve][Project] Code format with spotless plugin. (#4101)	https://github.com/apache/seatunnel/commit/a2ab166561	2.3.1
[Feature][Connector] add get source method to all source connector (#3846)	https://github.com/apache/seatunnel/commit/417178fb84	2.3.1
[Feature][API & Connector & Doc] add parallelism and column projection interface (#3829)	https://github.com/apache/seatunnel/commit/b9164b8ba1	2.3.1
[Feature][Connector V2] expose configurable options in Hudi (#3383)	https://github.com/apache/seatunnel/commit/fd4cec3a95	2.3.0
fix hudi connector v2 compile error. (#3728)	https://github.com/apache/seatunnel/commit/4fba0aa024	2.3.0
[Improve][Connector-V2][Hudi] Unified exception for hudi source connector (#3581)	https://github.com/apache/seatunnel/commit/b2fda11ddc	2.3.0
[bug][Connector-V2][Hudi] HashCode may be negative (#3184)	https://github.com/apache/seatunnel/commit/8beffbb603	2.3.0
[DEV][Api] Replace SeaTunnelContext with JobContext and remove singleton pattern (#2706)	https://github.com/apache/seatunnel/commit/cbf82f755c	2.2.0-beta
[#2606]Dependency management split (#2630)	https://github.com/apache/seatunnel/commit/fc047be69b	2.2.0-beta
[improve][UT] Upgrade junit to 5.+ (#2305)	https://github.com/apache/seatunnel/commit/362319ff3e	2.2.0-beta
StateT of SeaTunnelSource should extend `Serializable` (#2214)	https://github.com/apache/seatunnel/commit/8c426ef850	2.2.0-beta
[Connector-V2] Add Hive sink connector v2 (#2158)	https://github.com/apache/seatunnel/commit/23ad4ee735	2.2.0-beta
[Connector-V2]Add Hudi Source (#2147)	https://github.com/apache/seatunnel/commit/eaedc0a3c7	2.2.0-beta
