Client

Key Default isDynamic Description Since Deprecated
celeborn.client.adaptive.optimizeSkewedPartitionRead.enabled false false If this is true, Celeborn will adaptively split skewed partitions instead of reading them by Spark map range. Please note that this feature requires the Celeborn-Optimize-Skew-Partitions-spark3_3.patch. 0.6.0
celeborn.client.application.heartbeatInterval 10s false Interval for client to send heartbeat message to master. 0.3.0 celeborn.application.heartbeatInterval
celeborn.client.application.unregister.enabled true false When true, the Celeborn client will inform the Celeborn master that the application has shut down during client exit, which allows the cluster to release resources immediately, resulting in resource savings. 0.3.2
celeborn.client.application.uuidSuffix.enabled false false Whether to add a UUID suffix to the application id to make it unique. Currently, this only applies to Spark and MR. 0.6.0
celeborn.client.chunk.prefetch.enabled false false Whether to enable chunk prefetch when creating CelebornInputStream. 0.5.1
celeborn.client.closeIdleConnections true false Whether client will close idle connections. 0.3.0
celeborn.client.commitFiles.ignoreExcludedWorker false false When true, LifecycleManager will skip workers which are in the excluded list. 0.3.0
celeborn.client.eagerlyCreateInputStream.threads 32 false Thread count for streamCreatorPool in CelebornShuffleReader. 0.3.1
celeborn.client.excludePeerWorkerOnFailure.enabled true false When true, Celeborn will exclude a partition's peer worker when pushing data to the replica fails. 0.3.0
celeborn.client.excludedWorker.expireTimeout 180s false Timeout for LifecycleManager to clear reserved excluded workers. Defaults to 1.5 * celeborn.master.heartbeat.worker.timeout to cover the worker heartbeat timeout check period. 0.3.0 celeborn.worker.excluded.expireTimeout
celeborn.client.fetch.buffer.size 64k false Size of reducer partition buffer memory for the shuffle reader. Fetched data will be buffered in memory before being consumed. For performance, keep this buffer size no smaller than celeborn.client.push.buffer.max.size. 0.4.0
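For example, a minimal spark-defaults.conf sketch that keeps the two buffers aligned (assuming the Spark plugin, where Celeborn client keys are set with the spark. prefix):

    # Keep the reader's fetch buffer at least as large as the push buffer
    spark.celeborn.client.fetch.buffer.size     64k
    spark.celeborn.client.push.buffer.max.size  64k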
celeborn.client.fetch.dfsReadChunkSize 8m false Max chunk size for DfsPartitionReader. 0.3.1
celeborn.client.fetch.excludeWorkerOnFailure.enabled false false Whether to enable shuffle client-side fetch exclude workers on failure. 0.3.0
celeborn.client.fetch.excludedWorker.expireTimeout <value of celeborn.client.excludedWorker.expireTimeout> false ShuffleClient is a static object that lives for the whole lifecycle of an Executor, so excluded workers are given an expiration time to avoid penalizing workers for transient issues. 0.3.0
celeborn.client.fetch.maxReqsInFlight 3 false Max number of in-flight chunk fetch requests. 0.3.0 celeborn.fetch.maxReqsInFlight
celeborn.client.fetch.maxRetriesForEachReplica 3 false Max retries for fetching a chunk on each replica. 0.3.0 celeborn.fetch.maxRetriesForEachReplica,celeborn.fetch.maxRetries
celeborn.client.fetch.timeout 600s false Timeout for a task to open stream and fetch chunk. 0.3.0 celeborn.fetch.timeout
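A hedged fetch-tuning sketch for a read-heavy job (values are illustrative, not recommendations; spark. prefix assumed as above):

    spark.celeborn.client.fetch.maxReqsInFlight           6     # more parallel chunk fetches
    spark.celeborn.client.fetch.maxRetriesForEachReplica  3
    spark.celeborn.client.fetch.timeout                   600s  # open stream + fetch chunk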
celeborn.client.flink.compression.enabled true false Whether to compress data in Flink plugin. 0.3.0 remote-shuffle.job.enable-data-compression
celeborn.client.flink.inputGate.concurrentReadings 2147483647 false Max concurrent reading channels for an input gate. 0.3.0 remote-shuffle.job.concurrent-readings-per-gate
celeborn.client.flink.inputGate.memory 32m false Memory reserved for an input gate. 0.3.0 remote-shuffle.job.memory-per-gate
celeborn.client.flink.inputGate.supportFloatingBuffer true false Whether to support floating buffer in Flink input gates. 0.3.0 remote-shuffle.job.support-floating-buffer-per-input-gate
celeborn.client.flink.metrics.scope.shuffle <host>.taskmanager.<tm_id>.<job_name>.<task_name>.<subtask_index>.<shuffle_id> false Defines the scope format string that is applied to all metrics scoped to a shuffle. Only effective when an identifier-based reporter is configured. 0.6.0
celeborn.client.flink.partitionConnectionException.enabled false false If enabled, org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException would be thrown when RemoteBufferStreamReader finds that the current exception is about connection failure, then Flink can be aware of the lost Celeborn server side nodes and be able to re-compute affected data. 0.6.0
celeborn.client.flink.resultPartition.memory 64m false Memory reserved for a result partition. 0.3.0 remote-shuffle.job.memory-per-partition
celeborn.client.flink.resultPartition.supportFloatingBuffer true false Whether to support floating buffer for result partitions. 0.3.0 remote-shuffle.job.support-floating-buffer-per-output-gate
celeborn.client.flink.shuffle.fallback.policy AUTO false Celeborn supports the following kinds of fallback policies. 1. ALWAYS: always use the Flink built-in shuffle implementation; 2. AUTO: prefer the Celeborn shuffle implementation, and fall back to the Flink built-in shuffle implementation based on certain factors, e.g. availability of enough workers and quota; 3. NEVER: always use the Celeborn shuffle implementation, and fail fast when it is concluded that fallback is required based on the factors above. 0.6.0
celeborn.client.inputStream.creation.window 16 false Window size within which CelebornShuffleReader pre-creates CelebornInputStreams, for coalesced scenarios where multiple partitions are read. 0.5.1
celeborn.client.mr.pushData.max 32m false Max size for a push data request sent from the MR client. 0.4.0
celeborn.client.partition.reader.checkpoint.enabled false false Whether to checkpoint reads when re-creating a partition reader. Setting this to true minimizes the amount of unnecessary reads during partition read retries. 0.6.0
celeborn.client.push.buffer.initial.size 8k false Initial size of reducer partition buffer memory for the shuffle hash writer. 0.3.0 celeborn.push.buffer.initial.size
celeborn.client.push.buffer.max.size 64k false Max size of reducer partition buffer memory for the shuffle hash writer. Pushed data will be buffered in memory before being sent to a Celeborn worker. For performance, keep this buffer size above 32 KiB. Example: if the reducer count is 2000 and the buffer size is 64 KiB, each task will consume up to 64 KiB * 2000 = 125 MiB of heap memory. 0.3.0 celeborn.push.buffer.max.size
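To make the heap arithmetic concrete, a sketch for a 2000-partition shuffle (spark. prefix assumed as above):

    # Per-task hash writer memory ≈ buffer.max.size * partition count
    # 64 KiB * 2000 partitions = 125 MiB of heap per task
    spark.celeborn.client.push.buffer.initial.size  8k
    spark.celeborn.client.push.buffer.max.size      64k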
celeborn.client.push.excludeWorkerOnFailure.enabled false false Whether to enable shuffle client-side push exclude workers on failures. 0.3.0
celeborn.client.push.limit.inFlight.sleepInterval 50ms false Sleep interval when checking whether Netty in-flight requests are done. 0.3.0 celeborn.push.limit.inFlight.sleepInterval
celeborn.client.push.limit.inFlight.timeout <undefined> false Timeout for Netty in-flight requests to be done. Defaults to celeborn.client.push.timeout * 2. 0.3.0 celeborn.push.limit.inFlight.timeout
celeborn.client.push.limit.strategy SIMPLE false The strategy used to control the push speed. Valid strategies are SIMPLE and SLOWSTART. The SLOWSTART strategy usually works with congestion control mechanism on the worker side. 0.3.0
celeborn.client.push.maxReqsInFlight.perWorker 32 false Max number of Netty in-flight requests per worker. The default max memory of in-flight requests per worker is celeborn.client.push.maxReqsInFlight.perWorker * celeborn.client.push.buffer.max.size * compression ratio (1 in the worst case): 64 KiB * 32 = 2 MiB. The maximum memory will not exceed celeborn.client.push.maxReqsInFlight.total. 0.3.0
celeborn.client.push.maxReqsInFlight.total 256 false Total number of Netty in-flight requests. The maximum memory is celeborn.client.push.maxReqsInFlight.total * celeborn.client.push.buffer.max.size * compression ratio (1 in the worst case): 64 KiB * 256 = 16 MiB. 0.3.0 celeborn.push.maxReqsInFlight
celeborn.client.push.queue.capacity 512 false Push buffer queue size for a task. The maximum memory is celeborn.client.push.buffer.max.size * celeborn.client.push.queue.capacity, default: 64 KiB * 512 = 32 MiB. 0.3.0 celeborn.push.queue.capacity
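Taken together, the defaults imply the following worst-case push-side memory budget, shown here as a sketch (spark. prefix assumed as above):

    # Worst case assumes compression ratio 1:
    #   per worker: 32 reqs   * 64 KiB = 2 MiB
    #   total:      256 reqs  * 64 KiB = 16 MiB
    #   push queue: 512 slots * 64 KiB = 32 MiB
    spark.celeborn.client.push.maxReqsInFlight.perWorker  32
    spark.celeborn.client.push.maxReqsInFlight.total      256
    spark.celeborn.client.push.queue.capacity             512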
celeborn.client.push.replicate.enabled false false When true, the Celeborn worker will replicate shuffle data to another Celeborn worker asynchronously to ensure that pushed shuffle data won't be lost after node failure. It's recommended to set this to false when HDFS is enabled in celeborn.storage.availableTypes. 0.3.0 celeborn.push.replicate.enabled
celeborn.client.push.retry.threads 8 false Thread count for processing shuffle re-send push data requests. 0.3.0 celeborn.push.retry.threads
celeborn.client.push.revive.batchSize 2048 false Max number of partitions in one Revive request. 0.3.0
celeborn.client.push.revive.interval 100ms false Interval for client to trigger Revive to LifecycleManager. The number of partitions in one Revive request is celeborn.client.push.revive.batchSize. 0.3.0
celeborn.client.push.revive.maxRetries 5 false Max retry times for reviving when a Celeborn data push fails. 0.3.0
celeborn.client.push.sendBufferPool.checkExpireInterval 30s false Interval to check expire for send buffer pool. If the pool has been idle for more than celeborn.client.push.sendBufferPool.expireTimeout, the pooled send buffers and push tasks will be cleaned up. 0.3.1
celeborn.client.push.sendBufferPool.expireTimeout 60s false Timeout before clean up SendBufferPool. If SendBufferPool is idle for more than this time, the send buffers and push tasks will be cleaned up. 0.3.1
celeborn.client.push.slowStart.initialSleepTime 500ms false The initial sleep time if the current max in-flight requests is 0. 0.3.0
celeborn.client.push.slowStart.maxSleepTime 2s false If celeborn.client.push.limit.strategy is set to SLOWSTART, the push side sleeps between batches of requests; this controls the max sleep time if the max in-flight request limit stays at 1 for a long time. 0.3.0
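A sketch enabling the slow-start strategy, which usually works together with worker-side congestion control (spark. prefix assumed as above):

    spark.celeborn.client.push.limit.strategy              SLOWSTART
    spark.celeborn.client.push.slowStart.initialSleepTime  500ms
    spark.celeborn.client.push.slowStart.maxSleepTime      2s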
celeborn.client.push.sort.randomizePartitionId.enabled false false Whether to randomize partitionId in the push sorter. If true, partitionId will be randomized when sorting data to avoid skew when pushing to workers. 0.3.0 celeborn.push.sort.randomizePartitionId.enabled
celeborn.client.push.stageEnd.timeout <value of celeborn.<module>.io.connectionTimeout> false Timeout for waiting for StageEnd. During this process, there are celeborn.client.requestCommitFiles.maxRetries retry opportunities for committing files and 1 for the release-slots request. Users can customize this value according to their setting. By default, the value is the max timeout value celeborn.<module>.io.connectionTimeout. 0.3.0 celeborn.push.stageEnd.timeout
celeborn.client.push.takeTaskMaxWaitAttempts 1 false Max wait attempts if no task is available to push to a worker. 0.3.0
celeborn.client.push.takeTaskWaitInterval 50ms false Wait interval if no task is available to push to a worker. 0.3.0
celeborn.client.push.timeout 120s false Timeout for a task to push a data RPC message. This value should be more than twice celeborn.<module>.push.timeoutCheck.interval. 0.3.0 celeborn.push.data.timeout
celeborn.client.readLocalShuffleFile.enabled false false Enable reading local shuffle files for clusters co-deployed with YARN NodeManager. 0.3.1
celeborn.client.readLocalShuffleFile.threads 4 false Thread count for reading local shuffle files. 0.3.1
celeborn.client.registerShuffle.maxRetries 3 false Max retry times for client to register shuffle. 0.3.0 celeborn.shuffle.register.maxRetries
celeborn.client.registerShuffle.retryWait 3s false Wait time before next retry if register shuffle failed. 0.3.0 celeborn.shuffle.register.retryWait
celeborn.client.requestCommitFiles.maxRetries 4 false Max retry times for requestCommitFiles RPC. 0.3.0
celeborn.client.reserveSlots.maxRetries 3 false Max retry times for client to reserve slots. 0.3.0 celeborn.slots.reserve.maxRetries
celeborn.client.reserveSlots.rackaware.enabled false false Whether to place replicas on different racks when allocating slots. 0.3.1 celeborn.client.reserveSlots.rackware.enabled
celeborn.client.reserveSlots.retryWait 3s false Wait time before next retry if reserve slots failed. 0.3.0 celeborn.slots.reserve.retryWait
celeborn.client.rpc.cache.concurrencyLevel 32 false The number of write locks to update rpc cache. 0.3.0 celeborn.rpc.cache.concurrencyLevel
celeborn.client.rpc.cache.expireTime 15s false The time before a cache item is removed. 0.3.0 celeborn.rpc.cache.expireTime
celeborn.client.rpc.cache.size 256 false The max cache items count for rpc cache. 0.3.0 celeborn.rpc.cache.size
celeborn.client.rpc.commitFiles.askTimeout <value of celeborn.rpc.askTimeout> false Timeout for CommitHandler commit files. 0.4.1
celeborn.client.rpc.getReducerFileGroup.askTimeout <value of celeborn.rpc.askTimeout> false Timeout for ask operations while getting reducer file group information. During this process, there are celeborn.client.requestCommitFiles.maxRetries retry opportunities for committing files and 1 for the release-slots request. Users can customize this value according to their setting. 0.2.0
celeborn.client.rpc.maxRetries 3 false Max RPC retry times in client. 0.3.2
celeborn.client.rpc.registerShuffle.askTimeout <value of celeborn.rpc.askTimeout> false Timeout for ask operations during register shuffle. During this process, there are two retry opportunities for requesting slots, one request for establishing a connection with the Worker, and celeborn.client.reserveSlots.maxRetries retry opportunities for reserving slots. Users can customize this value according to their setting. 0.3.0 celeborn.rpc.registerShuffle.askTimeout
celeborn.client.rpc.requestPartition.askTimeout <value of celeborn.rpc.askTimeout> false Timeout for ask operations when requesting a change of partition location, such as reviving or splitting a partition. During this process, there are celeborn.client.reserveSlots.maxRetries retry opportunities for reserving slots. Users can customize this value according to their setting. 0.2.0
celeborn.client.rpc.reserveSlots.askTimeout <value of celeborn.rpc.askTimeout> false Timeout for LifecycleManager request reserve slots. 0.3.0
celeborn.client.rpc.retryWait 1s false Client-specified time to wait before next retry on RpcTimeoutException. 0.5.4
celeborn.client.rpc.shared.threads 16 false Number of shared rpc threads in LifecycleManager. 0.3.2
celeborn.client.shuffle.batchHandleChangePartition.interval 100ms false Interval for LifecycleManager to schedule handling change partition requests in batch. 0.3.0 celeborn.shuffle.batchHandleChangePartition.interval
celeborn.client.shuffle.batchHandleChangePartition.partitionBuckets 256 false Max number of change partition requests which can be concurrently processed. 0.5.0
celeborn.client.shuffle.batchHandleChangePartition.threads 8 false Thread count for LifecycleManager to handle change partition requests in batch. 0.3.0 celeborn.shuffle.batchHandleChangePartition.threads
celeborn.client.shuffle.batchHandleCommitPartition.interval 5s false Interval for LifecycleManager to schedule handling commit partition requests in batch. 0.3.0 celeborn.shuffle.batchHandleCommitPartition.interval
celeborn.client.shuffle.batchHandleCommitPartition.threads 8 false Thread count for LifecycleManager to handle commit partition requests in batch. 0.3.0 celeborn.shuffle.batchHandleCommitPartition.threads
celeborn.client.shuffle.batchHandleReleasePartition.interval 5s false Interval for LifecycleManager to schedule handling release partition requests in batch. 0.3.0
celeborn.client.shuffle.batchHandleReleasePartition.threads 8 false Thread count for LifecycleManager to handle release partition requests in batch. 0.3.0
celeborn.client.shuffle.batchHandleRemoveExpiredShuffles.enabled false false Whether to batch remove expired shuffles. This is an optimization switch on removing expired shuffles. 0.6.0
celeborn.client.shuffle.checkWorker.enabled true false When true, before registering shuffle, the LifecycleManager checks whether the cluster has available workers; if it has none, it falls back to the default shuffle. 0.5.0 celeborn.client.spark.shuffle.checkWorker.enabled
celeborn.client.shuffle.compression.codec LZ4 false The codec used to compress shuffle data. By default, Celeborn provides three codecs: lz4, zstd, none. none means that shuffle compression is disabled. Since Flink version 1.16, zstd is supported for Flink shuffle client. 0.3.0 celeborn.shuffle.compression.codec,remote-shuffle.job.compression.codec
celeborn.client.shuffle.compression.zstd.level 1 false Compression level for Zstd compression codec, its value should be an integer between -5 and 22. Increasing the compression level will result in better compression at the expense of more CPU and memory. 0.3.0 celeborn.shuffle.compression.zstd.level
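For example, switching to Zstandard with a higher level trades CPU for a better compression ratio, as a sketch (spark. prefix assumed as above):

    spark.celeborn.client.shuffle.compression.codec       zstd
    spark.celeborn.client.shuffle.compression.zstd.level  3   # valid range -5..22; higher = better ratio, more CPU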
celeborn.client.shuffle.decompression.lz4.xxhash.instance <undefined> false Decompression XXHash instance for Lz4. Available options: JNI, JAVASAFE, JAVAUNSAFE. 0.3.2
celeborn.client.shuffle.dynamicResourceEnabled false false When enabled, the ChangePartitionManager will obtain candidate workers from the availableWorkers pool during heartbeats when worker resources change. 0.6.0
celeborn.client.shuffle.dynamicResourceFactor 0.5 false The ChangePartitionManager will check whether (unavailable workers / shuffle allocated workers) exceeds this factor before obtaining candidate workers from the requestSlots RPC response, when celeborn.client.shuffle.dynamicResourceEnabled is set to true. 0.6.0
celeborn.client.shuffle.expired.checkInterval 60s false Interval for client to check expired shuffles. 0.3.0 celeborn.shuffle.expired.checkInterval
celeborn.client.shuffle.manager.port 0 false Port used by the LifecycleManager on the Driver. 0.3.0 celeborn.shuffle.manager.port
celeborn.client.shuffle.partition.type REDUCE false Type of shuffle's partition. 0.3.0 celeborn.shuffle.partition.type
celeborn.client.shuffle.partitionSplit.mode SOFT false soft: the shuffle file size might be larger than the split threshold. hard: the shuffle file size will be limited to the split threshold. 0.3.0 celeborn.shuffle.partitionSplit.mode
celeborn.client.shuffle.partitionSplit.threshold 1G false Shuffle file size threshold, if file size exceeds this, trigger split. 0.3.0 celeborn.shuffle.partitionSplit.threshold
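A sketch that caps shuffle file sizes strictly at the threshold (spark. prefix assumed as above):

    spark.celeborn.client.shuffle.partitionSplit.mode       HARD  # never exceed the threshold
    spark.celeborn.client.shuffle.partitionSplit.threshold  1G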
celeborn.client.shuffle.rangeReadFilter.enabled false false If a Spark application has skewed partitions, this can be set to true to improve performance. 0.2.0 celeborn.shuffle.rangeReadFilter.enabled
celeborn.client.shuffle.register.filterExcludedWorker.enabled false false Whether to filter excluded worker when register shuffle. 0.4.0
celeborn.client.shuffle.reviseLostShuffles.enabled false false Whether to revise lost shuffles. 0.6.0
celeborn.client.slot.assign.maxWorkers 10000 false Max workers that slots of one shuffle can be allocated on. Will choose the smaller positive one from Master side and Client side, see celeborn.master.slot.assign.maxWorkers. 0.3.1
celeborn.client.spark.fetch.cleanFailedShuffle false false Whether to clean the disk space occupied by shuffles that cannot be fetched. 0.6.0
celeborn.client.spark.fetch.cleanFailedShuffleInterval 1s false The interval for cleaning failed-to-fetch shuffle files; only valid when celeborn.client.spark.fetch.cleanFailedShuffle is enabled. 0.6.0
celeborn.client.spark.push.dynamicWriteMode.enabled false false Whether to dynamically switch the push write mode based on conditions. If true, the shuffle mode will be determined solely by the partition count. 0.5.0
celeborn.client.spark.push.dynamicWriteMode.partitionNum.threshold 2000 false Threshold of shuffle partition number for dynamically switching push writer mode. When the shuffle partition number is greater than this value, use the sort-based shuffle writer for memory efficiency; otherwise use the hash-based shuffle writer for speed. This configuration only takes effect when celeborn.client.spark.push.dynamicWriteMode.enabled is true. 0.5.0
celeborn.client.spark.push.sort.memory.maxMemoryFactor 0.4 false The max portion of executor memory that can be used for the SortBasedWriter buffer (only valid when celeborn.client.spark.push.sort.memory.useAdaptiveThreshold is enabled). 0.5.0
celeborn.client.spark.push.sort.memory.smallPushTolerateFactor 0.2 false Only in effect when celeborn.client.spark.push.sort.memory.useAdaptiveThreshold is turned on. The larger this value, the more aggressively Celeborn enlarges the sort-based shuffle writer's memory threshold. Specifically, with V as the value of this config, if the number of pushes C when using the sort-based shuffle writer satisfies C >= (1 + V) * C', where C' is the number of pushes a hash-based writer would have needed, the memory threshold is enlarged by 2X. 0.5.0
celeborn.client.spark.push.sort.memory.threshold 64m false When SortBasedPusher uses more memory than this threshold, a data push is triggered. 0.3.0 celeborn.push.sortMemory.threshold
celeborn.client.spark.push.sort.memory.useAdaptiveThreshold false false Adaptively adjust the sort-based shuffle writer's memory threshold. 0.5.0
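A sketch enabling the adaptive threshold together with its related knobs (spark. prefix assumed as above):

    spark.celeborn.client.spark.push.sort.memory.useAdaptiveThreshold     true
    spark.celeborn.client.spark.push.sort.memory.threshold                64m  # starting threshold
    spark.celeborn.client.spark.push.sort.memory.maxMemoryFactor          0.4  # cap at 40% of executor memory
    spark.celeborn.client.spark.push.sort.memory.smallPushTolerateFactor  0.2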
celeborn.client.spark.push.unsafeRow.fastWrite.enabled true false This is Celeborn's optimization on UnsafeRow for Spark, and it's true by default. If you have changed UnsafeRow's memory layout, set this to false. 0.2.2
celeborn.client.spark.shuffle.fallback.numPartitionsThreshold 2147483647 false Celeborn will only accept shuffles with a partition number lower than this configuration value. This configuration only takes effect when celeborn.client.spark.shuffle.fallback.policy is AUTO. 0.5.0 celeborn.shuffle.forceFallback.numPartitionsThreshold,celeborn.client.spark.shuffle.forceFallback.numPartitionsThreshold
celeborn.client.spark.shuffle.fallback.policy AUTO false Celeborn supports the following kinds of fallback policies. 1. ALWAYS: always use the Spark built-in shuffle implementation; 2. AUTO: prefer the Celeborn shuffle implementation, and fall back to the Spark built-in shuffle implementation based on certain factors, e.g. availability of enough workers and quota, shuffle partition number; 3. NEVER: always use the Celeborn shuffle implementation, and fail fast when it is concluded that fallback is required based on the factors above. 0.5.0
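For instance, AUTO can be combined with a partition-count ceiling so that only oversized shuffles fall back to Spark's built-in shuffle (threshold value illustrative; spark. prefix assumed as above):

    spark.celeborn.client.spark.shuffle.fallback.policy                  AUTO
    spark.celeborn.client.spark.shuffle.fallback.numPartitionsThreshold  100000  # fall back above this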
celeborn.client.spark.shuffle.forceFallback.enabled false false Always use spark built-in shuffle implementation. This configuration is deprecated, consider configuring celeborn.client.spark.shuffle.fallback.policy instead. 0.3.0 celeborn.shuffle.forceFallback.enabled
celeborn.client.spark.shuffle.getReducerFileGroup.broadcast.enabled false false Whether to leverage the Spark broadcast mechanism to send the GetReducerFileGroupResponse. If the response size is large and the Spark executor count is large, the Spark driver network may be exhausted because each executor pulls the response from the driver. Broadcasting the GetReducerFileGroupResponse prevents the driver from becoming the bottleneck in sending out multiple copies (one per executor). 0.6.0
celeborn.client.spark.shuffle.getReducerFileGroup.broadcast.miniSize 512k false The size at which we use Broadcast to send the GetReducerFileGroupResponse to the executors. 0.6.0
celeborn.client.spark.shuffle.writer HASH false Celeborn supports the following kinds of shuffle writers. 1. hash: the hash-based shuffle writer works fine when the shuffle partition count is normal; 2. sort: the sort-based shuffle writer works fine when memory pressure is high or the shuffle partition count is huge. This configuration only takes effect when celeborn.client.spark.push.dynamicWriteMode.enabled is false. 0.3.0 celeborn.shuffle.writer
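Two ways to pick the writer, as a sketch (spark. prefix assumed as above):

    # Option A: fix the writer explicitly
    spark.celeborn.client.spark.shuffle.writer  SORT
    # Option B: let the partition count decide (makes the writer setting above inert)
    spark.celeborn.client.spark.push.dynamicWriteMode.enabled                 true
    spark.celeborn.client.spark.push.dynamicWriteMode.partitionNum.threshold  2000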
celeborn.client.spark.stageRerun.enabled true false Whether to enable stage rerun. If true, client throws FetchFailedException instead of CelebornIOException. 0.4.0 celeborn.client.spark.fetch.throwsFetchFailure
celeborn.identity.provider org.apache.celeborn.common.identity.DefaultIdentityProvider false IdentityProvider class name. The default class is org.apache.celeborn.common.identity.DefaultIdentityProvider. Optional values: org.apache.celeborn.common.identity.HadoopBasedIdentityProvider, where the user name is obtained via UserGroupInformation.getUserName; org.apache.celeborn.common.identity.DefaultIdentityProvider, where the user name and tenant id are default values or user-specific values. 0.6.0 celeborn.quota.identity.provider
celeborn.identity.user-specific.tenant default false Tenant id if celeborn.identity.provider is org.apache.celeborn.common.identity.DefaultIdentityProvider. 0.6.0 celeborn.quota.identity.user-specific.tenant
celeborn.identity.user-specific.userName default false User name if celeborn.identity.provider is org.apache.celeborn.common.identity.DefaultIdentityProvider. 0.6.0 celeborn.quota.identity.user-specific.userName
celeborn.master.endpoints <localhost>:9097 false Endpoints of master nodes for Celeborn clients to connect to. The client uses the resolver provided by celeborn.master.endpoints.resolver to resolve the master endpoints. By default Celeborn uses org.apache.celeborn.common.client.StaticMasterEndpointResolver, which takes static master endpoints as input. Allowed pattern: <host1>:<port1>[,<host2>:<port2>]*, e.g. clb1:9097,clb2:9098,clb3:9099. If the port is omitted, 9097 will be used. If the master endpoints are not static, users can pass a custom resolver implementation to discover master endpoints actively using celeborn.master.endpoints.resolver. 0.2.0
celeborn.master.endpoints.resolver org.apache.celeborn.common.client.StaticMasterEndpointResolver false Resolver class that can be used for discovering and updating the master endpoints. This allows users to provide a custom master endpoint resolver implementation, which is useful in environments where the master nodes might change due to scaling operations or infrastructure updates. Clients need to ensure that the provided resolver class is present in the classpath. 0.5.2
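For example, static masters need only the default resolver; a custom resolver is needed when endpoints change at runtime (com.example.MyMasterEndpointResolver is a hypothetical class name; spark. prefix assumed as above):

    # Static resolution (default resolver)
    spark.celeborn.master.endpoints  clb1:9097,clb2:9098,clb3:9099
    # Dynamic resolution via a custom resolver on the client classpath (hypothetical class)
    spark.celeborn.master.endpoints.resolver  com.example.MyMasterEndpointResolver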
celeborn.quota.enabled true false When set to true on the Master side, the master will check quota via the QuotaManager. When set to true on the Client side, the LifecycleManager will ask the Master side to check whether the current user has enough quota before registering a shuffle, and falls back to the default shuffle service when the Master side finds there is not enough quota for the current user. 0.2.0
celeborn.quota.interruptShuffle.enabled false false Whether to enable interrupt shuffle when quota exceeds. 0.6.0
celeborn.storage.availableTypes HDD false Enabled storages. Available options: MEMORY,HDD,SSD,HDFS,S3,OSS. Note: HDD and SSD would be treated as identical. 0.3.0 celeborn.storage.activeTypes
celeborn.storage.hdfs.dir <undefined> false HDFS base directory for Celeborn to store shuffle data. 0.2.0
celeborn.storage.oss.access.key <undefined> false OSS access key for Celeborn to store shuffle data. 0.6.0
celeborn.storage.oss.dir <undefined> false OSS base directory for Celeborn to store shuffle data. 0.6.0
celeborn.storage.oss.endpoint <undefined> false OSS endpoint for Celeborn to store shuffle data. 0.6.0
celeborn.storage.oss.ignore.credentials true false Whether to skip OSS credentials. Disable this config to support the Jindo SDK. 0.6.0
celeborn.storage.oss.secret.key <undefined> false OSS secret key for Celeborn to store shuffle data. 0.6.0
celeborn.storage.s3.dir <undefined> false S3 base directory for Celeborn to store shuffle data. 0.6.0
celeborn.storage.s3.endpoint.region <undefined> false S3 endpoint region for Celeborn to store shuffle data. 0.6.0
celeborn.tags.tagsExpr <undefined> true Expression to filter workers by tags, given as a comma-separated list of tags evaluated as a logical AND of all tags. For example, prod,high-io filters workers that have both the prod and high-io tags. 0.6.0
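A sketch restricting placement to tagged workers (spark. prefix assumed as above):

    # Only use workers carrying both the prod and high-io tags
    spark.celeborn.tags.tagsExpr  prod,high-io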