Kubeflow Training Operator



Overview


Starting from v1.3, this training operator provides Kubernetes custom resources that make it easy to run distributed or non-distributed TensorFlow/PyTorch/Apache MXNet/XGBoost/MPI jobs on Kubernetes.

Note: Up to and including the v1.2 release, Kubeflow Training Operator supported only TFJob on Kubernetes.


For a complete reference of the custom resource definitions, please refer to the API definitions:

- TensorFlow API Definition
- PyTorch API Definition
- Apache MXNet API Definition
- XGBoost API Definition
- MPI API Definition
- PaddlePaddle API Definition

- For details on the API design, please refer to the v1alpha2 design doc.
- For details on the all-in-one operator design, please refer to the All-in-one Kubeflow Training Operator.
- For details on its observability, please refer to the monitoring design doc.

Prerequisites


A Kubernetes cluster and kubectl, both at version 1.23 or later.
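
The version requirement above can be checked with a small comparison helper. The sketch below is illustrative only: the function name `version_at_least` is hypothetical, and it assumes version strings of the form `v<major>.<minor>...`, comparing only the major and minor components. The commented usage line additionally assumes that `jq` is installed.

```shell
# Hypothetical helper: succeeds if version $1 (e.g. "v1.25.3") is at least
# version $2 (e.g. "1.23"). Only major and minor components are compared.
version_at_least() {
  have=${1#v}
  want=${2#v}
  have_major=${have%%.*}; have_rest=${have#*.}; have_minor=${have_rest%%.*}
  want_major=${want%%.*}; want_rest=${want#*.}; want_minor=${want_rest%%.*}
  # Drop any non-digit suffix, such as the "+" in a minor version like "23+".
  have_minor=${have_minor%%[!0-9]*}
  want_minor=${want_minor%%[!0-9]*}
  [ "$have_major" -gt "$want_major" ] ||
    { [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]; }
}

# Example usage (needs kubectl, cluster access, and jq), left commented out:
# server=$(kubectl version -o json | jq -r '.serverVersion.gitVersion')
# version_at_least "$server" 1.23 && echo "cluster version is sufficient"
```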

Installation


Master Branch


```shell
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
```

Stable Release


```shell
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.5.0"
```

TensorFlow Release Only


Users who prefer the original TensorFlow controllers can check out the v1.2-branch. Bug-fix patches are still accepted on this branch.

```shell
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.2.0"
```
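
The three install commands above differ only in the `?ref=` query parameter, which selects the Git ref to install from. As a rough sketch, a small helper can build the URL for any ref. The function name `standalone_overlay_url` is hypothetical and not part of the project.

```shell
# Hypothetical helper: print the kustomize URL of the standalone overlay.
# With an argument, the URL is pinned to that Git ref; without one, the
# unpinned URL used for the master branch install is printed.
standalone_overlay_url() {
  base="github.com/kubeflow/training-operator/manifests/overlays/standalone"
  if [ -n "${1:-}" ]; then
    printf '%s?ref=%s\n' "$base" "$1"
  else
    printf '%s\n' "$base"
  fi
}

# Example usage (needs kubectl and cluster access), left commented out:
# kubectl apply -k "$(standalone_overlay_url v1.5.0)"
```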

Python SDK for Kubeflow Training Operator


Training Operator provides a Python SDK for the custom resources. More docs are available in the sdk/python folder.

Use the pip install command to install the latest release of the SDK:

```sh
pip install kubeflow-training
```

Quick Start


Please refer to quick-start-v1.md and the Kubeflow Training User Guide for more information.

API Documentation


Please refer to the following API documentation:

- Kubeflow.org v1 API Documentation

Community


You can:

- Join our Slack channel.
- Check out who is using this operator.

This project is part of Kubeflow, so please see the README in kubeflow/kubeflow to get in touch with the community.

Contributing


Please refer to DEVELOPMENT.

Change Log


Please refer to CHANGELOG.

Version Matrix


The following table lists the most recent few versions of the operator.

| Operator Version | API Version | Kubernetes Version |
| :--- | :--- | :--- |
| v1.0.x | v1 | 1.16+ |
| v1.1.x | v1 | 1.16+ |
| v1.2.x | v1 | 1.16+ |
| v1.3.x | v1 | 1.18+ |
| v1.4.x | v1 | 1.23+ |
| v1.5.x | v1 | 1.23+ |
| latest (master HEAD) | v1 | 1.23+ |
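
For scripted checks, the table above can be encoded as a simple lookup. This is a minimal sketch: the function name `min_kubernetes_for_operator` is hypothetical, and it covers only the released versions listed in the table (not `latest`).

```shell
# Hypothetical helper: print the minimum Kubernetes version that the table
# lists for a given operator version (e.g. "v1.5.0" or "1.3.2").
min_kubernetes_for_operator() {
  case "${1#v}" in
    1.0.*|1.1.*|1.2.*) echo "1.16" ;;
    1.3.*)             echo "1.18" ;;
    1.4.*|1.5.*)       echo "1.23" ;;
    *)
      # Versions outside the table are reported as an error.
      echo "unknown operator version: $1" >&2
      return 1
      ;;
  esac
}

# Example usage:
# min_kubernetes_for_operator v1.5.0
```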

Acknowledgement


This project was originally started as a distributed training operator for TensorFlow and later we merged efforts from other Kubeflow training operators to provide a unified and simplified experience for both users and developers. We are very grateful to all who filed issues or helped resolve them, asked and answered questions, and were part of inspiring discussions. We'd also like to thank everyone who's contributed to and maintained the original operators.

- PyTorch Operator: list of contributors and maintainers.
- MPI Operator: list of contributors and maintainers.
- XGBoost Operator: list of contributors and maintainers.
- MXNet Operator: list of contributors and maintainers.
Last Updated: 2023-09-03 19:17:54