KubeDL is short for Kubernetes-Deep-Learning. It is a unified operator that supports running multiple types of distributed deep learning/machine learning workloads on Kubernetes.
KubeDL provides the following features:
- Support running prevalent ML/DL workloads in a single operator.
- A modular architecture that can be easily extended for more types of DL/ML workloads with shared libraries, see how to add a custom job workload.
- Enable specific workload types automatically according to the installed CRDs, or explicitly through startup flags.
- Support submitting a job with artifacts synced from a remote source such as GitHub, without rebuilding the image.
- Support gang scheduling with a pluggable interface to support different backend gang schedulers.
- Instrumented with rich Prometheus metrics that provide insight into job stats, such as job launch delay and the current number of pending/running jobs.
- [Work-in-progress] Provide a dashboard for monitoring the jobs' lifecycle and stats.
Install the CRDs:

kubectl apply -f https://raw.githubusercontent.com/alibaba/kubedl/master/config/crd/bases/kubeflow.org_pytorchjobs.yaml
kubectl apply -f https://raw.githubusercontent.com/alibaba/kubedl/master/config/crd/bases/kubeflow.org_tfjobs.yaml
kubectl apply -f https://raw.githubusercontent.com/alibaba/kubedl/master/config/crd/bases/xdl.kubedl.io_xdljobs.yaml
kubectl apply -f https://raw.githubusercontent.com/alibaba/kubedl/master/config/crd/bases/xgboostjob.kubeflow.org_xgboostjobs.yaml
Deploy KubeDL as a deployment
kubectl apply -f https://raw.githubusercontent.com/alibaba/kubedl/master/config/manager/all_in_one.yaml
The official KubeDL operator image is hosted on Docker Hub.
Optional: Enable workload kind selectively
If you only need some of the workload types and want to disable the others, choose one of the following three options:
1. Set the environment variable WORKLOADS_ENABLE in the KubeDL container when deploying. The value is a comma-separated list of the workload types to enable. For example, WORKLOADS_ENABLE=TFJob,PytorchJob enables only the TFJob and PytorchJob workloads and disables all others.
2. Set the startup argument --workloads in the KubeDL container args when deploying. The value format is the same as for WORKLOADS_ENABLE.
3. [DEFAULT] Install only the CRDs you need; KubeDL automatically enables the corresponding workload controllers. Setting WORKLOADS_ENABLE=auto explicitly has the same effect. This is the default approach.
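As an illustration, setting the WORKLOADS_ENABLE environment variable might look like the following Deployment fragment (the container name and image tag here are placeholders, not taken from the actual all_in_one.yaml manifest):

```yaml
# Illustrative excerpt of the KubeDL manager Deployment spec;
# container name and image tag are placeholders, not the real manifest.
spec:
  containers:
  - name: kubedl-manager         # hypothetical container name
    image: kubedl/manager:latest # hypothetical image tag
    env:
    - name: WORKLOADS_ENABLE     # enable only these workload controllers
      value: "TFJob,PytorchJob"
```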
Check the documentation for a full list of operator startup flags.
Run an Example Job
This example demonstrates how to run a simple MNIST TensorFlow job with KubeDL.
Submit the TFJob
kubectl apply -f https://raw.githubusercontent.com/alibaba/kubedl/master/example/tf/tf_job_mnist.yaml
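For orientation, a TFJob manifest like the one applied above has roughly the following shape. This is a sketch rather than the exact contents of tf_job_mnist.yaml; the apiVersion, image tag, and replica count are assumptions:

```yaml
# Rough sketch of a TFJob spec; field values are illustrative,
# not the exact contents of example/tf/tf_job_mnist.yaml.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist
  namespace: kubedl
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 1            # single worker is enough for the demo
      restartPolicy: Never
      template:
        spec:
          containers:
          - name: tensorflow # container name expected by the TFJob controller
            image: kubedl/tf-mnist-demo:latest # hypothetical image tag
```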
Monitor the status of the TensorFlow job
kubectl get tfjobs -n kubedl
kubectl describe tfjob mnist -n kubedl
Delete the job
kubectl delete tfjob mnist -n kubedl
Each supported workload type can be queried the same way, for example:
kubectl get xgboostjob
Check the documentation for the full list of Prometheus metrics exposed by the KubeDL operator.
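If your cluster does not already scrape the operator, a minimal Prometheus scrape configuration might look like the following. The namespace, pod label, and discovery role here are assumptions, not values from the KubeDL manifests:

```yaml
# Hypothetical Prometheus scrape config for the KubeDL operator;
# the namespace and pod label used for discovery are assumptions.
scrape_configs:
- job_name: kubedl-operator
  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names: [kubedl-system]  # assumed operator namespace
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app]
    regex: kubedl             # assumed pod label
    action: keep
```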
Remote Source Sync
KubeDL supports submitting jobs whose artifacts are synced dynamically from a remote source, without rebuilding the image. Currently GitHub is supported, and a pluggable interface allows other sources such as HDFS to be added. Check the documentation for details.
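As an illustration, a git-synced job might be annotated along the following lines. The annotation key and its JSON schema are assumptions made for this sketch; consult the KubeDL documentation for the exact form:

```yaml
# Hypothetical sketch of remote-source sync via a job annotation;
# the annotation key and JSON fields are assumptions, not the
# verified KubeDL API.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-from-git
  annotations:
    kubedl.io/git-sync-config: >-
      {"source": "https://github.com/alibaba/kubedl.git"}
spec:
  # regular job spec follows; omitted in this sketch
```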
A dashboard for monitoring the jobs' lifecycle and stats is currently in progress. The dashboard also provides convenient job operations, including job creation, termination, and deletion. See the demo below.
Build the controller manager binary
Run the tests
Generate manifests, e.g. CRD and RBAC YAML files
Build the Docker image
export IMG=<your_image_name> && make docker-build
Push the image
docker push <your_image_name>
To develop/debug KubeDL controller manager locally, please check the debug guide.
If you have any questions or want to contribute, GitHub issues or pull requests are warmly welcome. You can also contact us via the following channels:
Slack: Join the Channel
Certain implementations rely on existing code from the Kubeflow community, and the credit goes to the original Kubeflow authors.