Building via SBT
Starting from version 0.4.0, the Celeborn project supports building and packaging using SBT. This article provides a detailed guide on how to build the Celeborn project using SBT.
System Requirements
Celeborn Service (master/worker) supports Scala 2.11/2.12/2.13 and Java 8/11/17.
The following table indicates the compatibility of Celeborn Spark and Flink clients with different versions of Spark and Flink for various Java and Scala versions:
Client | Java 8/Scala 2.11 | Java 8/Scala 2.12 | Java 11/Scala 2.12 | Java 17/Scala 2.12 | Java 8/Scala 2.13 | Java 11/Scala 2.13 | Java 17/Scala 2.13 |
---|---|---|---|---|---|---|---|
Spark 2.4 | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
Spark 3.0 | ❌ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ |
Spark 3.1 | ❌ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ |
Spark 3.2 | ❌ | ✔ | ✔ | ❌ | ✔ | ✔ | ❌ |
Spark 3.3 | ❌ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Spark 3.4 | ❌ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Spark 3.5 | ❌ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Flink 1.14 | ❌ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ |
Flink 1.15 | ❌ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ |
Flink 1.16 | ❌ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ |
Flink 1.17 | ❌ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ |
Flink 1.18 | ❌ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ |
Flink 1.19 | ❌ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ |
Flink 1.20 | ❌ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ |
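To find which column of the table applies to your machine, it can help to check the local Java major version first. The following is a minimal sketch and not part of the Celeborn build scripts: java_major and current_java_version are hypothetical helper names, and the parsing assumes the usual `java -version` output, where the first line carries the quoted version string.

```shell
# Hypothetical helper: print the major Java version for a version string
# such as "1.8.0_392" (Java 8), "11.0.21" or "17.0.9".
java_major() {
  case "$1" in
    1.*) v="${1#1.}"; printf '%s\n' "${v%%.*}" ;;   # legacy "1.x" scheme: 1.8.0_392 -> 8
    *)   printf '%s\n' "${1%%[.+-]*}" ;;            # modern scheme: 17.0.9 -> 17
  esac
}

# Hypothetical helper: extract the quoted version from the first line of
# `java -version`, which is written to stderr.
current_java_version() {
  java -version 2>&1 | sed -n '1s/.*"\(.*\)".*/\1/p'
}

# Usage:
#   java_major "$(current_java_version)"
```

With the major version known, pick the matching Java/Scala column and check the row for your Spark or Flink version.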
Useful SBT commands
Packaging the Project
As an example, one can build a version of Celeborn as follows:
./build/sbt clean package
To create a runnable Celeborn distribution like those available on the Celeborn Downloads page, run ./build/make-distribution.sh in the project root directory:
./build/make-distribution.sh --sbt-enabled --release
Maven-Style Profile Management
We have adopted Maven-style profile management for the client modules. For example, you can enable the Spark 3.3 client module by adding -Pspark-3.3:
# ./build/sbt -Pspark-3.3 projects
[info] set current project to celeborn (in build file:/root/celeborn/)
[info] In file:/root/celeborn/
[info] * celeborn
[info] celeborn-client
[info] celeborn-client-spark-3
[info] celeborn-client-spark-3-shaded
[info] celeborn-common
[info] celeborn-master
[info] celeborn-service
[info] celeborn-spark-common
[info] celeborn-spark-group
[info] celeborn-spark-it
[info] celeborn-worker
To enable the Flink 1.15 client module, add -Pflink-1.15:
# ./build/sbt -Pflink-1.15 projects
[info] set current project to celeborn (in build file:/root/celeborn/)
[info] In file:/root/celeborn/
[info] * celeborn
[info] celeborn-client
[info] celeborn-client-flink-1_15
[info] celeborn-client-flink-1_15-shaded
[info] celeborn-common
[info] celeborn-flink-common
[info] celeborn-flink-group
[info] celeborn-flink-it
[info] celeborn-master
[info] celeborn-service
[info] celeborn-worker
By using these profiles, you can easily switch between different client modules for Spark and Flink. These profiles enable specific dependencies and configurations relevant to the chosen version. This way, you can conveniently manage and build the desired configurations of the Celeborn project.
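The listings above suggest a naming pattern: the profile carries the full engine version, the Flink module name embeds that version with an underscore, and the Spark module name embeds only the major version. The sketch below encodes that pattern in hypothetical helper functions (none of these function names exist in the Celeborn repository). It matches the Spark 3.3 and Flink 1.15 examples shown here; treating it as valid for other versions, such as Spark 2.4, is an assumption you should confirm with `./build/sbt -P<profile> projects`.

```shell
# Hypothetical helpers; $1 is an engine version such as "3.3" or "1.15".

spark_profile() {
  printf '%s\n' "-Pspark-$1"
}

# Spark module names use only the major version: 3.3 -> spark-3
spark_shaded_module() {
  printf '%s\n' "celeborn-client-spark-${1%%.*}-shaded"
}

flink_profile() {
  printf '%s\n' "-Pflink-$1"
}

# Flink module names replace the dot with an underscore: 1.15 -> flink-1_15
flink_shaded_module() {
  printf '%s\n' "celeborn-client-flink-$(printf '%s' "$1" | tr . _)-shaded"
}

# Usage (run from the project root):
#   ./build/sbt "$(flink_profile 1.15)" projects
```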
Building Spark/Flink Assembly Client Jars
For example, you can build the Spark 3.3 client assembly jar by running the following commands:
$ ./build/sbt -Pspark-3.3
> project celeborn-client-spark-3-shaded
> assembly
$ # Or, you can use sbt directly with the `-Pspark-3.3` profile:
$ ./build/sbt -Pspark-3.3 celeborn-client-spark-3-shaded/assembly
Similarly, you can build the Flink 1.15 client assembly jar using the following commands:
$ ./build/sbt -Pflink-1.15
> project celeborn-client-flink-1_15-shaded
> assembly
$ # Or, you can use sbt directly with the `-Pflink-1.15` profile:
$ ./build/sbt -Pflink-1.15 celeborn-client-flink-1_15-shaded/assembly
By executing these commands, you will create assembly jar files for the respective Spark and Flink client modules. The assembly jar bundles all the dependencies, allowing the client module to be used independently with all required dependencies included.
Building submodules individually
For instance, you can build the Celeborn Master module using:
$ # sbt
$ ./build/sbt
> project celeborn-master
> package
$ # Or, you can build the celeborn-master module with sbt directly using:
$ ./build/sbt celeborn-master/package
Testing with SBT
To run all tests for the Celeborn project, you can use the following command:
./build/sbt test
Running Tests for Specific Versions of the Spark/Flink Client
For example, to run the test cases for the Spark 3.3 client, use the following command:
$ ./build/sbt -Pspark-3.3 test
$ # only run spark client related modules tests
$ ./build/sbt -Pspark-3.3 celeborn-spark-group/test
Similarly, to run the test cases for the Flink 1.15 client, use the following command:
$ ./build/sbt -Pflink-1.15 test
$ # only run flink client related modules tests
$ ./build/sbt -Pflink-1.15 celeborn-flink-group/test
Running Individual Tests
When developing locally, it’s often convenient to run a single test or a few tests, rather than running the entire test suite.
The fastest way to run individual tests is to keep an sbt console open and use it to re-run tests as necessary. For example, to run all the tests in a particular project, e.g., master:
$ ./build/sbt
> project celeborn-master
> test
You can run a single test suite using the testOnly command. For example, to run the SlotsAllocatorSuiteJ:
> testOnly org.apache.celeborn.service.deploy.master.SlotsAllocatorSuiteJ
The testOnly command accepts wildcards; e.g., you can also run the SlotsAllocatorSuiteJ with:
> testOnly *SlotsAllocatorSuiteJ
Or you can run all of the tests in the master package:
> testOnly org.apache.celeborn.service.deploy.master.*
If you’d like to run just a single Java test in the SlotsAllocatorSuiteJ, e.g., a test with the name testAllocateSlotsForSinglePartitionId, you run the following command in the sbt console:
> testOnly *SlotsAllocatorSuiteJ -- *SlotsAllocatorSuiteJ.testAllocateSlotsForSinglePartitionId
If you’d like to run just a single Scala test in the AppDiskUsageMetricSuite, e.g., a test that includes "app usage snapshot" in the name, you run the following command in the sbt console:
> testOnly *AppDiskUsageMetricSuite -- -z "app usage snapshot"
If you’d prefer, you can run all of these commands on the command line (but this will be slower than running tests using an open console). To do this, you need to surround testOnly and the following arguments in quotes:
$ ./build/sbt "celeborn-master/testOnly *AppDiskUsageMetricSuite -- -z \"app usage snapshot\""
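Getting the nested quotes right by hand is error-prone, so one option is to let the shell assemble the argument. The following is a minimal sketch; sbt_test_only_arg is a hypothetical helper name, and it only covers the ScalaTest `-z` (name substring) form shown above.

```shell
# Hypothetical helper: build the single quoted argument passed to ./build/sbt.
#   $1 = sbt project, e.g. celeborn-master
#   $2 = suite pattern, e.g. '*AppDiskUsageMetricSuite'
#   $3 = substring of the test name, passed to ScalaTest via -z
sbt_test_only_arg() {
  printf '%s/testOnly %s -- -z "%s"\n' "$1" "$2" "$3"
}

# Usage (run from the project root):
#   ./build/sbt "$(sbt_test_only_arg celeborn-master '*AppDiskUsageMetricSuite' 'app usage snapshot')"
```

Single-quoting the suite pattern keeps the shell from expanding the `*` before sbt sees it.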
Accelerating SBT
This section provides instructions on setting up repository mirrors or proxies for a smoother SBT experience. Depending on your location and network conditions, you can choose the appropriate approach to accelerate SBT startup and enhance dependency retrieval.
Accelerating SBT Startup
The SBT startup process involves fetching the SBT bootstrap jar, which is typically obtained from the Maven Central Repository (https://repo1.maven.org/maven2/). If access to this repository is slow or blocked in your network environment, you can speed up SBT startup by configuring a custom artifact repository using the DEFAULT_ARTIFACT_REPOSITORY environment variable.
$ # The following command fetches sbt-launch-x.y.z.jar from https://maven.aliyun.com/nexus/content/groups/public/
$ # Ensure that the URL ends with a trailing slash "/"
$ export DEFAULT_ARTIFACT_REPOSITORY=https://maven.aliyun.com/nexus/content/groups/public/
$ ./build/sbt
This will initiate SBT using the specified repository, allowing for faster download and startup times.
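Because the URL must end with a trailing slash, a small guard can normalize the value before exporting it. This is a sketch only; ensure_trailing_slash is a hypothetical helper and not part of build/sbt.

```shell
# Hypothetical helper: print $1 unchanged if it already ends with "/",
# otherwise print it with a "/" appended.
ensure_trailing_slash() {
  case "$1" in
    */) printf '%s\n' "$1" ;;
    *)  printf '%s/\n' "$1" ;;
  esac
}

# Usage:
#   export DEFAULT_ARTIFACT_REPOSITORY="$(ensure_trailing_slash "https://maven.aliyun.com/nexus/content/groups/public")"
#   ./build/sbt
```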
Custom SBT Repositories
The current repositories embedded within the Celeborn project are detailed below:
[repositories]
local
mavenLocal: file://${user.home}/.m2/repository/
local-preloaded-ivy: file:///${sbt.preloaded-${sbt.global.base-${user.home}/.sbt}/preloaded/}, [organization]/[module]/[revision]/[type]s/[artifact](-[classifier]).[ext]
local-preloaded: file:///${sbt.preloaded-${sbt.global.base-${user.home}/.sbt}/preloaded/}
# The system property value of `celeborn.sbt.default.artifact.repository` is
# fetched from the environment variable `DEFAULT_ARTIFACT_REPOSITORY` and
# assigned within the build/sbt-launch-lib.bash script.
private: ${celeborn.sbt.default.artifact.repository-file:///dev/null}
gcs-maven-central-mirror: https://maven-central.storage-download.googleapis.com/repos/central/data/
maven-central
typesafe-ivy-releases: https://repo.typesafe.com/typesafe/ivy-releases/, [organization]/[module]/[revision]/[type]s/[artifact](-[classifier]).[ext], bootOnly
sbt-ivy-snapshots: https://repo.scala-sbt.org/scalasbt/ivy-snapshots/, [organization]/[module]/[revision]/[type]s/[artifact](-[classifier]).[ext], bootOnly
sbt-plugin-releases: https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/, [organization]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)[revision]/[type]s/[artifact](-[classifier]).[ext]
bintray-typesafe-sbt-plugin-releases: https://dl.bintray.com/typesafe/sbt-plugins/, [organization]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)[revision]/[type]s/[artifact](-[classifier]).[ext]
bintray-spark-packages: https://dl.bintray.com/spark-packages/maven/
typesafe-releases: https://repo.typesafe.com/typesafe/releases/
For many developers, download speeds from the default repositories are slow. To address this, we provide verified public mirror templates for regions with many local developers: repositories-cn.template for developers in the Chinese mainland, and repositories-asia.template for developers across Asia. For example, the following command switches to the Chinese mainland mirrors to speed up dependency downloads:
cp build/sbt-config/repositories-cn.template build/sbt-config/repositories-local
Developers from other regions are strongly encouraged to contribute templates tailored to their areas.
Note
- build/sbt-config/repositories-local takes precedence over build/sbt-config/repositories and is ignored by .gitignore.
- If the environment variable DEFAULT_ARTIFACT_REPOSITORY is set, it has the highest priority among non-local repositories.
- Repository priority is determined by the file order; repositories listed earlier take precedence.
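The first rule in the note can be checked from the shell. The sketch below reports which repositories file is in effect by testing whether the local override exists; effective_repositories_file is a hypothetical helper name, and it only mirrors the precedence described in the note rather than calling into the build scripts.

```shell
# Hypothetical helper: print the repositories file that takes effect.
#   $1 = project root (defaults to the current directory)
effective_repositories_file() {
  root="${1:-.}"
  if [ -f "$root/build/sbt-config/repositories-local" ]; then
    # The local override exists, so it wins.
    printf '%s\n' "$root/build/sbt-config/repositories-local"
  else
    printf '%s\n' "$root/build/sbt-config/repositories"
  fi
}

# Usage (run from the project root):
#   cat "$(effective_repositories_file)"
```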
Similarly, if you need to compile and package within an intranet environment, you can edit build/sbt-config/repositories-local as shown below:
[repositories]
local
mavenLocal: file://${user.home}/.m2/repository/
private: ${celeborn.sbt.default.artifact.repository-file:///dev/null}
private-central: https://example.com/repository/maven/
private-central-http: http://example.com/repository/maven/, allowInsecureProtocol
allowInsecureProtocol is required if you want to use a repository that supports only HTTP and not HTTPS; otherwise, an error is raised (insecure HTTP request is unsupported). For details, please refer to the sbt Launcher Configuration.
For more details on sbt repository configuration, please refer to the SBT documentation.
Publish
SBT supports publishing shaded clients (Spark/Flink/MapReduce) to a private Maven repository, such as Sonatype Nexus or JFrog.
Before executing the publish command, ensure that the following environment variables are correctly set:
Environment Variable | Description |
---|---|
ASF_USERNAME | Sonatype repository username |
ASF_PASSWORD | Sonatype repository password |
SONATYPE_SNAPSHOTS_URL | Sonatype repository URL for snapshot version releases, default is "https://repository.apache.org/content/repositories/snapshots" |
SONATYPE_RELEASES_URL | Sonatype repository URL for official release versions, default is "https://repository.apache.org/service/local/staging/deploy/maven2" |
For example:
export SONATYPE_SNAPSHOTS_URL=http://192.168.3.46:8081/repository/maven-snapshots/
export SONATYPE_RELEASES_URL=http://192.168.3.46:8081/repository/maven-releases/
export ASF_USERNAME=admin
export ASF_PASSWORD=123456
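A pre-flight check can catch missing credentials before the publish command starts. The following is a minimal sketch; check_publish_env is a hypothetical helper name. It treats only ASF_USERNAME and ASF_PASSWORD as required, on the assumption that the two URL variables can fall back to the defaults listed in the table.

```shell
# Hypothetical pre-flight check: report each required variable that is unset
# or empty on stderr, and return non-zero if any is missing.
check_publish_env() {
  missing=0
  for name in ASF_USERNAME ASF_PASSWORD; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (run from the project root):
#   check_publish_env && ./build/sbt -Pspark-3.5 celeborn-client-spark-3-shaded/publish
```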
Publish the shaded client for Spark 3.5:
$ ./build/sbt -Pspark-3.5 celeborn-client-spark-3-shaded/publish
Publish the shaded client for Flink 1.18:
$ ./build/sbt -Pflink-1.18 celeborn-client-flink-1_18-shaded/publish
Publish the shaded client for MapReduce:
$ ./build/sbt -Pmr celeborn-client-mr-shaded/publish
Make sure to complete the necessary build and testing before executing the publish commands.