Migration Guide
Upgrading from 0.5 to 0.6
- Since 0.6.0, Celeborn deprecates `celeborn.worker.congestionControl.low.watermark`. Please use `celeborn.worker.congestionControl.diskBuffer.low.watermark` instead.
- Since 0.6.0, Celeborn deprecates `celeborn.worker.congestionControl.high.watermark`. Please use `celeborn.worker.congestionControl.diskBuffer.high.watermark` instead.
- Since 0.6.0, Celeborn changed the default value of `celeborn.client.spark.fetch.throwsFetchFailure` from `false` to `true`, which means Celeborn enables Spark stage rerun by default.
- Since 0.6.0, Celeborn has introduced a new RESTful API namespace, `/api/v1`, which uses the `application/json` media type for requests and responses. The `celeborn-openapi-client` SDK is also available to help users interact with the new RESTful APIs. The legacy RESTful APIs have been deprecated and will be removed in future releases. See the full RESTful API documentation for details.
- The mappings of the old RESTful APIs to the new RESTful APIs for Master:

  | Old RESTful API | New RESTful API | Note |
  | --- | --- | --- |
  | GET /conf | GET /api/v1/conf | |
  | GET /listDynamicConfigs | GET /api/v1/conf/dynamic | |
  | GET /threadDump | GET /api/v1/thread_dump | |
  | GET /applications | GET /api/v1/applications | |
  | GET /listTopDiskUsedApps | GET /api/v1/applications/top_disk_usages | |
  | GET /hostnames | GET /api/v1/applications/hostnames | |
  | GET /shuffle | GET /api/v1/shuffles | |
  | GET /masterGroupInfo | GET /api/v1/masters | |
  | GET /workerInfo | GET /api/v1/workers | |
  | GET /lostWorkers | GET /api/v1/workers | get the lostWorkers field in response |
  | GET /excludedWorkers | GET /api/v1/workers | get the excludedWorkers field in response |
  | GET /shutdownWorkers | GET /api/v1/workers | get the shutdownWorkers field in response |
  | GET /decommissionWorkers | GET /api/v1/workers | get the decommissioningWorkers field in response |
  | POST /exclude | POST /api/v1/workers/exclude | |
  | GET /workerEventInfo | GET /api/v1/workers/events | |
  | POST /sendWorkerEvent | POST /api/v1/workers/events | |

- The mappings of the old RESTful APIs to the new RESTful APIs for Worker:

  | Old RESTful API | New RESTful API | Note |
  | --- | --- | --- |
  | GET /conf | GET /api/v1/conf | |
  | GET /listDynamicConfigs | GET /api/v1/conf/dynamic | |
  | GET /threadDump | GET /api/v1/thread_dump | |
  | GET /applications | GET /api/v1/applications | |
  | GET /listTopDiskUsedApps | GET /api/v1/applications/top_disk_usages | |
  | GET /shuffle | GET /api/v1/shuffles | |
  | GET /listPartitionLocationInfo | GET /api/v1/shuffles/partitions | |
  | GET /workerInfo | GET /api/v1/workers | |
  | GET /isRegistered | GET /api/v1/workers | get the isRegistered field in response |
  | GET /isDecommissioning | GET /api/v1/workers | get the isDecommissioning field in response |
  | GET /isShutdown | GET /api/v1/workers | get the isShutdown field in response |
  | GET /unavailablePeers | GET /api/v1/workers/unavailable_peers | |
  | POST /exit | POST /api/v1/workers/exit | |
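
When updating scripts that call the legacy Master endpoints, a small lookup table can make the translation mechanical. The following is a minimal sketch, not part of Celeborn: the names `MASTER_API_V1` and `migrate_master_call` are hypothetical, and the entries are copied from the Master mapping table above. The third tuple element is the response field to read when several old endpoints collapse into one new endpoint.

```python
# Hypothetical helper: old Master endpoint -> (method, new path, response field or None).
# Entries are copied from the Master mapping table in this guide.
MASTER_API_V1 = {
    ("GET", "/conf"): ("GET", "/api/v1/conf", None),
    ("GET", "/listDynamicConfigs"): ("GET", "/api/v1/conf/dynamic", None),
    ("GET", "/threadDump"): ("GET", "/api/v1/thread_dump", None),
    ("GET", "/applications"): ("GET", "/api/v1/applications", None),
    ("GET", "/listTopDiskUsedApps"): ("GET", "/api/v1/applications/top_disk_usages", None),
    ("GET", "/hostnames"): ("GET", "/api/v1/applications/hostnames", None),
    ("GET", "/shuffle"): ("GET", "/api/v1/shuffles", None),
    ("GET", "/masterGroupInfo"): ("GET", "/api/v1/masters", None),
    ("GET", "/workerInfo"): ("GET", "/api/v1/workers", None),
    ("GET", "/lostWorkers"): ("GET", "/api/v1/workers", "lostWorkers"),
    ("GET", "/excludedWorkers"): ("GET", "/api/v1/workers", "excludedWorkers"),
    ("GET", "/shutdownWorkers"): ("GET", "/api/v1/workers", "shutdownWorkers"),
    ("GET", "/decommissionWorkers"): ("GET", "/api/v1/workers", "decommissioningWorkers"),
    ("POST", "/exclude"): ("POST", "/api/v1/workers/exclude", None),
    ("GET", "/workerEventInfo"): ("GET", "/api/v1/workers/events", None),
    ("POST", "/sendWorkerEvent"): ("POST", "/api/v1/workers/events", None),
}


def migrate_master_call(method, path):
    """Return (method, new_path, response_field) for a legacy Master call."""
    key = (method.upper(), path)
    if key not in MASTER_API_V1:
        raise KeyError(f"no /api/v1 mapping listed for {method.upper()} {path}")
    return MASTER_API_V1[key]
```

For example, `migrate_master_call("GET", "/lostWorkers")` points at `GET /api/v1/workers` and names `lostWorkers` as the field to extract from the JSON response. A Worker table can be built the same way from the Worker mapping.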
Upgrading from 0.5.0 to 0.5.1
- Since 0.5.1, the Celeborn master REST API `/exclude` request uses media type `application/x-www-form-urlencoded` instead of `text/plain`.
- Since 0.5.1, the Celeborn master REST API `/sendWorkerEvent` request uses the POST method, passes the parameters `type` and `workers` as form parameters, and uses media type `application/x-www-form-urlencoded` instead of `text/plain`.
- Since 0.5.1, the Celeborn worker REST API `/exit` request uses media type `application/x-www-form-urlencoded` instead of `text/plain`.
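
The media type change above affects how a client encodes the request body. As a rough sketch using only the Python standard library, the request below is built (but not sent) with form-encoded parameters. The helper name `build_form_post`, the host and port, and the parameter values `EVENT_TYPE` and `WORKER_ID` are placeholders chosen for illustration; consult the REST API documentation for the accepted values.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_post(base_url, path, params):
    """Build a POST request whose body is application/x-www-form-urlencoded."""
    body = urlencode(params).encode("utf-8")
    return Request(
        url=base_url.rstrip("/") + path,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Placeholder host, port and parameter values (assumptions, not Celeborn defaults).
req = build_form_post(
    "http://master.example:9098",
    "/sendWorkerEvent",
    {"type": "EVENT_TYPE", "workers": "WORKER_ID"},
)
```

Passing `req` to `urllib.request.urlopen` would send it. The same helper applies to the master `/exclude` and worker `/exit` endpoints, which also take form parameters.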
Upgrading from 0.4 to 0.5
- Since 0.5.0, the Celeborn master metric `LostWorkers` is renamed to `LostWorkerCount`.
- Since 0.5.0, the Celeborn worker metric `ChunkStreamCount` is renamed to `ActiveChunkStreamCount`.
- Since 0.5.0, the Celeborn worker metric `CreditStreamCount` is renamed to `ActiveCreditStreamCount`.
- Since 0.5.0, Celeborn configurations support a new tag `isDynamic` that indicates whether the configuration is a dynamic config.
- Since 0.5.0, Celeborn changed the default value of `celeborn.worker.graceful.shutdown.recoverDbBackend` from `LEVELDB` to `ROCKSDB`, which means Celeborn uses RocksDB as the store for the recovery backend. To restore the behavior before Celeborn 0.5, set `celeborn.worker.graceful.shutdown.recoverDbBackend` to `LEVELDB`.
- Since 0.5.0, Celeborn deprecates `celeborn.quota.configuration.path`. Please use `celeborn.dynamicConfig.store.fs.path` instead.
- Since 0.5.0, the Celeborn client removes the configurations `celeborn.client.push.splitPartition.threads`, `celeborn.client.flink.inputGate.minMemory` and `celeborn.client.flink.resultPartition.minMemory`.
- Since 0.5.0, Celeborn deprecates `celeborn.client.spark.shuffle.forceFallback.enabled`. Please use `celeborn.client.spark.shuffle.fallback.policy` instead.
- Since 0.5.0, the Celeborn master REST API `/exclude` uses the POST method, and the parameters `add` and `remove` are passed as form parameters.
- Since 0.5.0, the Celeborn worker REST API `/exit` uses the POST method, and the parameter `type` is passed as a form parameter.
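
Before upgrading, it can help to scan an existing configuration for the keys listed above. The sketch below is a hypothetical checker, not a Celeborn tool: `check_conf_for_0_5` takes a plain dict of key/value pairs and only reports findings. It deliberately does not rewrite values, since a replacement key may expect a different value format than the deprecated one.

```python
# Deprecated key -> replacement key, as listed for 0.5.0 in this guide.
DEPRECATED_IN_0_5 = {
    "celeborn.quota.configuration.path": "celeborn.dynamicConfig.store.fs.path",
    "celeborn.client.spark.shuffle.forceFallback.enabled":
        "celeborn.client.spark.shuffle.fallback.policy",
}

# Client configurations removed in 0.5.0, as listed in this guide.
REMOVED_IN_0_5 = {
    "celeborn.client.push.splitPartition.threads",
    "celeborn.client.flink.inputGate.minMemory",
    "celeborn.client.flink.resultPartition.minMemory",
}


def check_conf_for_0_5(conf):
    """Return human-readable warnings for keys affected by the 0.5.0 upgrade."""
    warnings = []
    for key in conf:
        if key in DEPRECATED_IN_0_5:
            warnings.append(f"{key} is deprecated; use {DEPRECATED_IN_0_5[key]} instead")
        elif key in REMOVED_IN_0_5:
            warnings.append(f"{key} was removed in 0.5.0")
    return warnings
```

Running it over the parsed contents of a configuration file yields one message per affected key and an empty list when nothing needs attention.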
Upgrading from 0.4.0 to 0.4.1
- Since 0.4.1, the Celeborn master limits the estimated partition size used for computing worker slots. This size is now constrained to the range specified by `celeborn.master.estimatedPartitionSize.minSize` and `celeborn.master.estimatedPartitionSize.maxSize`.
- Since 0.4.1, Celeborn changed the fallback configuration of `celeborn.client.rpc.getReducerFileGroup.askTimeout`, `celeborn.client.rpc.registerShuffle.askTimeout` and `celeborn.client.rpc.requestPartition.askTimeout` from `celeborn.<module>.io.connectionTimeout` to `celeborn.rpc.askTimeout`.
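
The fallback change can be pictured as a two-step lookup: use the specific key if it is set, otherwise use `celeborn.rpc.askTimeout`. The sketch below only illustrates that lookup order over a plain dict; `resolve_ask_timeout` is a hypothetical name, and returning `None` stands in for "neither key is set, so the built-in default applies" (the default values themselves are not modeled here).

```python
ASK_TIMEOUT_KEYS = (
    "celeborn.client.rpc.getReducerFileGroup.askTimeout",
    "celeborn.client.rpc.registerShuffle.askTimeout",
    "celeborn.client.rpc.requestPartition.askTimeout",
)
FALLBACK_KEY = "celeborn.rpc.askTimeout"  # fallback since 0.4.1


def resolve_ask_timeout(conf, key):
    """Return the value for `key`, falling back to celeborn.rpc.askTimeout."""
    if key not in ASK_TIMEOUT_KEYS:
        raise ValueError(f"{key} is not one of the affected askTimeout keys")
    if key in conf:
        return conf[key]
    # None means neither key is set and the built-in default applies.
    return conf.get(FALLBACK_KEY)
```

A deployment that previously tuned `celeborn.<module>.io.connectionTimeout` to influence these three timeouts would, under this lookup, need to set `celeborn.rpc.askTimeout` or the specific keys instead.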
Upgrading from 0.3 to 0.4
- Since 0.4.0, Celeborn is not compatible with Celeborn clients of versions below 0.3.0. Note: it is strongly recommended to use the same version of Client and Celeborn Master/Worker in production.
- Since 0.4.0, Celeborn no longer supports `org.apache.spark.shuffle.celeborn.RssShuffleManager`.
- Since 0.4.0, Celeborn changed the default value of `celeborn.<module>.io.numConnectionsPerPeer` from `2` to `1`.
- Since 0.4.0, Celeborn has changed the names of the Prometheus master and worker configurations as shown in the table below:

  | Key Before v0.4.0 | Key After v0.4.0 |
  | --- | --- |
  | `celeborn.metrics.master.prometheus.host` | `celeborn.master.http.host` |
  | `celeborn.metrics.master.prometheus.port` | `celeborn.master.http.port` |
  | `celeborn.metrics.worker.prometheus.host` | `celeborn.worker.http.host` |
  | `celeborn.metrics.worker.prometheus.port` | `celeborn.worker.http.port` |

- Since 0.4.0, Celeborn deprecates `celeborn.worker.storage.baseDir.prefix` and `celeborn.worker.storage.baseDir.number`. Please use `celeborn.worker.storage.dirs` instead.
- Since 0.4.0, Celeborn deprecates `celeborn.storage.activeTypes`. Please use `celeborn.storage.availableTypes` instead.
- Since 0.4.0, the Celeborn worker removes the configuration `celeborn.worker.userResourceConsumption.update.interval`.
- Since 0.4.0, the Celeborn master metric `PartitionWritten` is renamed to `ActiveShuffleSize`.
- Since 0.4.0, the Celeborn master metric `PartitionFileCount` is renamed to `ActiveShuffleFileCount`.
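
The one-to-one key renames in this section lend themselves to a mechanical rewrite of a configuration file. The following is a minimal sketch under stated assumptions: `rewrite_conf` is a hypothetical helper, each setting is assumed to sit on its own line as `key value` or `key=value`, and comment lines start with `#`. The `baseDir.prefix`/`baseDir.number` pair is left out on purpose because it does not map to `celeborn.worker.storage.dirs` by a simple rename.

```python
import re

# One-to-one renames listed for 0.4.0 in this guide.
RENAMED_IN_0_4 = {
    "celeborn.metrics.master.prometheus.host": "celeborn.master.http.host",
    "celeborn.metrics.master.prometheus.port": "celeborn.master.http.port",
    "celeborn.metrics.worker.prometheus.host": "celeborn.worker.http.host",
    "celeborn.metrics.worker.prometheus.port": "celeborn.worker.http.port",
    "celeborn.storage.activeTypes": "celeborn.storage.availableTypes",
}

# Leading whitespace, then the key (up to whitespace or '='), then the rest.
# Comment lines and blank lines do not match and are kept unchanged.
_SETTING = re.compile(r"^(\s*)([^\s=#]+)(.*)$")


def rewrite_line(line):
    """Rename the key of a single setting line; leave other lines untouched."""
    match = _SETTING.match(line)
    if match is None:
        return line
    indent, key, rest = match.groups()
    return indent + RENAMED_IN_0_4.get(key, key) + rest


def rewrite_conf(text):
    """Apply the 0.4.0 key renames to every line of a configuration text."""
    return "\n".join(rewrite_line(line) for line in text.splitlines())
```

Values are carried over verbatim, so it is still worth reviewing the result by hand before deploying it.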
Upgrading from 0.3.1 to 0.3.2
- Since 0.3.1, Celeborn changed the default value of `raft.client.rpc.request.timeout` from `3s` to `10s`.
- Since 0.3.1, Celeborn changed the default value of `raft.client.rpc.watch.request.timeout` from `10s` to `20s`.
Upgrading from 0.3.0 to 0.3.1
- Since 0.3.1, Celeborn changed the default value of `celeborn.worker.directMemoryRatioToResume` from `0.5` to `0.7`.
- Since 0.3.1, Celeborn changed the default value of `celeborn.worker.monitor.disk.check.interval` from `60` to `30`.
- Since 0.3.1, the names of the JVM metrics changed; see CELEBORN-1007 for details.
Upgrading from 0.2 to 0.3
- Celeborn 0.2 Client is compatible with 0.3 Master/Worker, which allows upgrading Master/Worker first and then Client. Note: it is strongly recommended to use the same version of Client and Celeborn Master/Worker in production.
- Since 0.3.0, support for the deprecated `rss.*` configurations is removed. All configurations listed in the 0.2.1 docs still take effect, but some of them are deprecated too; please read the bootstrap logs and follow the suggestions to migrate to the new configurations.
- From 0.3.0 on, the default value of `celeborn.client.push.replicate.enabled` is changed from `true` to `false`; users who want replication should enable it explicitly. For example, to enable replication for Spark, add this Spark config when submitting a job: `spark.celeborn.client.push.replicate.enabled=true`
- From 0.3.0 on, the default value of `celeborn.worker.storage.workingDir` is changed from `hadoop/rss-worker/shuffle_data` to `celeborn-worker/shuffle_data`; users who want to keep the original working dir path should set this configuration explicitly.
- Since 0.3.0, the configuration namespace `celeborn.ha.master` is deprecated and will be removed in a future version. All `celeborn.ha.master.*` configurations should migrate to `celeborn.master.ha.*`.
- Since 0.3.0, the environment variables `CELEBORN_MASTER_HOST` and `CELEBORN_MASTER_PORT` are removed. Instead, `CELEBORN_LOCAL_HOSTNAME` works on both master and worker, and takes higher priority than the configurations defined in the properties file.
- Since 0.3.0, the Celeborn Master URL scheme is changed from `rss://` to `celeborn://`. Users who start Worker with `sbin/start-worker.sh rss://<master-host>:<master-port>` should migrate to `sbin/start-worker.sh celeborn://<master-host>:<master-port>`.
- Since 0.3.0, Celeborn supports overriding Hadoop configuration (`core-site.xml`, `hdfs-site.xml`, etc.) from Celeborn configuration with the additional prefix `celeborn.hadoop.`. On the Spark client side, users should set Hadoop configuration like `spark.celeborn.hadoop.foo=bar`; note that `spark.hadoop.foo=bar` does not take effect. On the Flink client and Celeborn Master/Worker side, users should set it like `celeborn.hadoop.foo=bar`.
- Since 0.3.0, the Celeborn master metric `BlacklistedWorkerCount` is renamed to `ExcludedWorkerCount`.
- Since 0.3.0, the Celeborn master HTTP request URL `/blacklistedWorkers` is renamed to `/excludedWorkers`.
- Since 0.3.0, Celeborn updates the terminology for worker data replication, replacing the previous `master/slave` terminology with `primary/replica`. The corresponding metrics keywords have been adjusted accordingly. The following table lists the changes:

  | Key Before v0.3.0 | Key After v0.3.0 |
  | --- | --- |
  | `MasterPushDataTime` | `PrimaryPushDataTime` |
  | `MasterPushDataHandshakeTime` | `PrimaryPushDataHandshakeTime` |
  | `MasterRegionStartTime` | `PrimaryRegionStartTime` |
  | `MasterRegionFinishTime` | `PrimaryRegionFinishTime` |
  | `SlavePushDataTime` | `ReplicaPushDataTime` |
  | `SlavePushDataHandshakeTime` | `ReplicaPushDataHandshakeTime` |
  | `SlaveRegionStartTime` | `ReplicaRegionStartTime` |
  | `SlaveRegionFinishTime` | `ReplicaRegionFinishTime` |

- Since 0.3.0, Celeborn's Spark shuffle manager changed from `org.apache.spark.shuffle.celeborn.RssShuffleManager` to `org.apache.spark.shuffle.celeborn.SparkShuffleManager`. Users can set the Spark property `spark.shuffle.manager` to `org.apache.spark.shuffle.celeborn.SparkShuffleManager` to use the Celeborn remote shuffle service. In 0.3.0, Celeborn still supports `org.apache.spark.shuffle.celeborn.RssShuffleManager`; it will be removed in 0.4.0.
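
Several items in this section concern Spark-side settings: the shuffle manager class, the explicit replication flag, and the `spark.celeborn.hadoop.` prefix. As an illustrative sketch, the hypothetical helper `spark_submit_conf_args` below turns a dict of Celeborn client keys into `--conf` arguments for `spark-submit`, following the `spark.`-prefixed form shown in the examples above. It only builds an argument list and does not launch anything.

```python
SHUFFLE_MANAGER = "org.apache.spark.shuffle.celeborn.SparkShuffleManager"


def spark_submit_conf_args(celeborn_client_conf):
    """Build spark-submit --conf arguments for a Celeborn-enabled Spark job."""
    spark_conf = {"spark.shuffle.manager": SHUFFLE_MANAGER}
    for key, value in celeborn_client_conf.items():
        # Celeborn client keys are passed to Spark with a "spark." prefix,
        # e.g. spark.celeborn.client.push.replicate.enabled=true.
        spark_conf["spark." + key] = value
    args = []
    for key, value in spark_conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


args = spark_submit_conf_args({
    "celeborn.client.push.replicate.enabled": "true",  # replication is off by default since 0.3.0
    "celeborn.hadoop.foo": "bar",                      # Hadoop override; spark.hadoop.foo is ignored
})
```

The resulting list can be appended to a `spark-submit` command line together with the application jar and its own arguments.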