Overview

Taints and tolerations allow the node to control which pods should (or should not) be scheduled on them.

Taints and Tolerations

A taint allows a node to refuse pod to be scheduled unless that pod has a matching toleration.

You apply taints to a node through the node specification (NodeSpec) and apply tolerations to a pod through the pod specification (PodSpec). A taint on a node instructs the node to repel all pods that do not tolerate the taint.

Taints and tolerations consist of a key, value, and effect. An operator allows you to leave one of these parameters empty.

Table 1. Taint and toleration components
Parameter Description

key

The key is any string, up to 253 characters. The key must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.

value

The value is any string, up to 63 characters. The value must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.

effect

The effect is one of the following:

NoSchedule

  • New pods that do not match the taint are scheduled onto that node.

  • Existing pods on the node remain.

PreferNoSchedule

  • New pods that do not match the taint might be scheduled onto that node, but the scheduler tries not to.

  • Existing pods on the node remain.

NoExecute

  • New pods that do not match the taint cannot be scheduled onto that node.

  • Existing pods on the node that do not have a matching toleration are removed.

operator

Equal

The key/value/effect parameters must match. This is the default.

Exists

The key/effect parameters must match. You must leave a blank value parameter, which matches any.

A toleration matches a taint:

  • If the operator parameter is set to Equal:

    • the key parameters are the same;

    • the value parameters are the same;

    • the effect parameters are the same.

  • If the operator parameter is set to Exists:

    • the key parameters are the same;

    • the effect parameters are the same.

Using Multiple Taints

You can put multiple taints on the same node and multiple tolerations on the same pod. OpenShift Origin processes multiple taints and tolerations as follows:

  1. Process the taints for which the pod has a matching toleration.

  2. The remaining unmatched taints have the indicated effects on the pod:

    • If there is at least one unmatched taint with effect NoSchedule, OpenShift Origin cannot schedule a pod onto that node.

    • If there is no unmatched taint with effect NoSchedule but there is at least one unmatched taint with effect PreferNoSchedule, OpenShift Origin tries to not schedule the pod onto the node.

    • If there is at least one unmatched taint with effect NoExecute, OpenShift Origin evicts the pod from the node (if it is already running on the node), or the pod is not scheduled onto the node (if it is not yet running on the node).

      • Pods that do not tolerate the taint are evicted immediately.

      • Pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever.

      • Pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time.

For example:

  • The node has the following taints:

    $ oadm taint nodes node1 key1=value1:NoSchedule
    $ oadm taint nodes node1 key1=value1:NoExecute
    $ oadm taint nodes node1 key2=value2:NoSchedule
  • The pod has the following tolerations:

    tolerations:
    - key: "key1"
      operator: "Equal"
      value: "value1"
      effect: "NoSchedule"
    - key: "key1"
      operator: "Equal"
      value: "value1"
      effect: "NoExecute"

In this case, the pod cannot be scheduled onto the node, because there is no toleration matching the third taint. The pod continues running if it is already running on the node when the taint is added, because the third taint is the only one of the three that is not tolerated by the pod.

Adding a Taint to an Existing Node

You add a taint to a node using the oadm taint command with the parameters described in the Taint and toleration components table:

$ oadm taint nodes <node-name> <key>=<value>:<effect>

For example:

$ oadm taint nodes node1 key1=value1:NoSchedule

The example places a taint on node1 that has key key1, value value1, and taint effect NoSchedule.

Adding a Toleration to a Pod

To add a toleration to a pod, edit the pod specification to include a tolerations section:

Sample pod configuration file with Equal operator
tolerations:
- key: "key1" (1)
  operator: "Equal" (1)
  value: "value1" (1)
  effect: "NoExecute" (1)
  tolerationSeconds: 3600 (2)
1 The toleration parameters, as described in the Taint and toleration components table.
2 The tolerationSeconds parameter specifies how long a pod can remain bound to a node before being evicted. See Using Toleration Seconds to Delay Pod Evictions below.
Sample pod configuration file with Exists operator
tolerations:
- key: "key1"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 3600

Both of these tolerations match the taint created by the oadm taint command above. A pod with either toleration would be able to schedule onto node1.

Using Toleration Seconds to Delay Pod Evictions

You can specify how long a pod can remain bound to a node before being evicted by specifying the tolerationSeconds parameter in the pod specification. If a taint with the NoExecute effect is added to a node, any pods that do not tolerate the taint are evicted immediately (pods that do tolerate the taint are not evicted). However, if a pod that to be evicted has the tolerationSeconds parameter, the pod is not evicted until that time period expires.

For example:

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600

Here, if this pod is running but does not have a matching taint, the pod stays bound to the node for 3,600 seconds and then be evicted. If the taint is removed before that time, the pod is not evicted.

Setting a Default Value for Toleration Seconds

This plug-in sets the default forgiveness toleration for pods, to tolerate the node.alpha.kubernetes.io/notReady:NoExecute and node.alpha.kubernetes.io/notReady:NoExecute taints for five minutes.

If the pod configuration provided by the user already has either toleration, the default is not added.

To enable Default Toleration Seconds:

  1. Modify the master configuration file (/etc/origin/master/master-config.yaml) to Add DefaultTolerationSeconds to the admissionConfig section:

    admissionConfig:
      pluginConfig:
        DefaultTolerationSeconds:
          configuration:
            kind: DefaultAdmissionConfig
            apiVersion: v1
            disable: false
  2. Restart OpenShift for the changes to take effect:

    # systemctl restart origin-master
  3. Verify that the default was added:

    1. Create a pod:

      $ oc create -f </path/to/file>

      For example:

      $ oc create -f hello-pod.yaml
      pod "hello-pod" created
    2. Check the pod tolerations:

      $ oc describe pod <pod-name> |grep -i toleration

      For example:

      $ oc describe pod hello-pod |grep -i toleration
      Tolerations:    node.alpha.kubernetes.io/notReady=:Exists:NoExecute for 300s

Preventing Pod Eviction for Node Problems

OpenShift Origin can be configured to represent node unreachable and node not ready conditions as taints. This allows per-pod specification of how long to remain bound to a node that becomes unreachable or not ready, rather than using the default of five minutes.

When the Taint Based Evictions feature is enabled, the taints are automatically added by the node controller and the normal logic for evicting pods from Ready nodes is disabled.

  • If a node enters a not ready state, the node.alpha.kubernetes.io/notReady:NoExecute taint is added and pods cannot be scheduled on the node. Existing pods remain for the toleration seconds period.

  • If a node enters a not reachable state, the node.alpha.kubernetes.io/unreachable:NoExecute taint is added and pods cannot be scheduled on the node. Existing pods remain for the toleration seconds period.

To enable Taint Based Evictions:

  1. Modify the master configuration file (/etc/origin/master/master-config.yaml) to add the following to the kubernetesMasterConfig section:

    kubernetesMasterConfig:
       controllerArguments:
            feature-gates:
            - "TaintBasedEvictions=true"
  2. Check that the taint is added to a node:

    oc describe node $node | grep -i taint
    
    Taints: node.alpha.kubernetes.io/notReady:NoExecute
  3. Restart OpenShift for the changes to take effect:

    # systemctl restart origin-master
  4. Add a toleration to pods:

    tolerations:
    - key: "node.alpha.kubernetes.io/unreachable"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 6000

    or

    tolerations:
    - key: "node.alpha.kubernetes.io/notReady"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 6000

To maintain the existing rate limiting behavior of pod evictions due to node problems, the system adds the taints in a rate-limited way. This prevents massive pod evictions in scenarios such as the master becoming partitioned from the nodes.

Daemonsets and Tolerations

DaemonSet pods are created with NoExecute tolerations for node.alpha.kubernetes.io/unreachable and node.alpha.kubernetes.io/notReady with no tolerationSeconds to ensure that DaemonSet pods are never evicted due to these problems, even when the Default Toleration Seconds feature is disabled.

Examples

Taints and tolerations are a flexible way to steer pods away from nodes or evict pods that should not be running on a node. A few of typical scenrios are:

Dedicating a Node for a User

You can specify a set of nodes for exclusive use by a particular set of users.

To specify dedicated nodes:

  1. Add a taint to those nodes:

    For example:

    $ oadm taint nodes node1 dedicated=groupName:NoSchedule
  2. Add a corresponding toleration to the pods by writing a custom admission controller.

    Only the pods with the tolerations are allowed to use the dedicated nodes.

Binding a User to a Node

You can configure a node so that particular users can use only the dedicated nodes.

To configure a node so that users can use only that node:

  1. Add a taint to those nodes:

    For example:

    $ oadm taint nodes node1 dedicated=groupName:NoSchedule
  2. Add a corresponding toleration to the pods by writing a custom admission controller.

    The admission controller should add a node affinity to require that the pods can only schedule onto nodes labeled with the key:value label (dedicated=groupName).

  3. Add a label similar to the taint (such as the key:value label) to the dedicated nodes.

Nodes with Special Hardware

In a cluster where a small subset of nodes have specialized hardware (for example GPUs), you can use taints and tolerations to keep pods that do not need the specialized hardware off of those nodes, leaving the nodes for pods that do need the specialized hardware. You can also require pods that need specialized hardware to use specific nodes.

To ensure pods are blocked from the specialized hardware:

  1. Taint the nodes that have the specialized hardware using one of the following commands:

    $ oadm taint nodes <node-name> disktype=ssd:NoSchedule
    $ oadm taint nodes <node-name> disktype=ssd:PreferNoSchedule
  2. Adding a corresponding toleration to pods that use the special hardware using an admission controller.

For example, the admission controller could use some characteristic(s) of the pod to determine that the pod should be allowed to use the special nodes by adding a toleration.

To ensure pods can only use the specialized hardware, you need some additional mechanism. For example, you could label the nodes that have the special hardware and use node affinity on the pods that need the hardware.