• In most cases, a Pod in OpenShift is only a carrier for containers; the scheduling and automatic control of groups of Pods is accomplished through objects such as Deployment, DaemonSet, RC, Job, and CronJob.
    Pod scheduling itself is performed by the Scheduler component.

Pod scheduling is implemented through the Scheduler component

Deployment/RC: fully automatic scheduling

A Deployment/RC automatically deploys multiple replicas of an application and continuously monitors them to maintain the desired replica count. By default, Pods are scheduled by the Master's Scheduler through a series of algorithmic computations; the user cannot intervene in the scheduling process or its result.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        imagePullPolicy: IfNotPresent
      restartPolicy: Always

NodeSelector: targeted scheduling

By matching a Node's labels against a Pod's nodeSelector attribute, Pods can be scheduled onto a specific set of Nodes.

oc label nodes <node-name> env=dev
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      nodeSelector:
        env: 'dev'
      containers:
      - name: nginx
        image: nginx
        imagePullPolicy: IfNotPresent
      restartPolicy: Always

NodeAffinity: node affinity scheduling

  • More expressive (not limited to the simple case of matching all labels exactly)

  • Supports soft constraints and preference rules, so that when the preferred requirements cannot be satisfied, the scheduler falls back to the next best option and still runs the Pod

  • Two kinds of affinity expressions are currently available

    • requiredDuringSchedulingIgnoredDuringExecution
      The specified rules must be satisfied before a Pod can be scheduled onto a Node (similar to nodeSelector); this is a hard constraint
    • preferredDuringSchedulingIgnoredDuringExecution
      The scheduler tries to satisfy the specified rules and prefers matching Nodes, but does not insist on them; this is a soft constraint. Multiple preference rules can each be given a weight to define their relative priority
  • Operators supported by the NodeAffinity syntax include In/NotIn/Exists/DoesNotExist/Gt/Lt

  • Notes

    • If both nodeSelector and nodeAffinity are defined, both conditions must be satisfied for the Pod to be scheduled onto a Node
    • If nodeAffinity specifies multiple nodeSelectorTerms, matching any one of them is sufficient for the Pod to be scheduled
    • If a nodeSelectorTerms entry contains multiple matchExpressions, a node must satisfy all of the matchExpressions for the Pod to run on it
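The OR/AND semantics described above can be sketched as a nodeAffinity fragment (the env and disk-type labels reuse this section's examples; the staging value is illustrative):

```yaml
# A node is eligible if it matches EITHER nodeSelectorTerms entry (OR);
# within one entry, every matchExpressions item must hold (AND).
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:        # term 1: env=dev AND disk-type=ssd
        - key: env
          operator: In
          values: ["dev"]
        - key: disk-type
          operator: In
          values: ["ssd"]
      - matchExpressions:        # term 2 (alternative): env=staging
        - key: env
          operator: In
          values: ["staging"]
```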
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  restartPolicy: Always
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: env
            operator: In
            values:
            - dev
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: disk-type
            operator: In
            values:
            - ssd

PodAffinity: Pod affinity and anti-affinity scheduling

  • Pods can be constrained based on the labels of other Pods already running on a node

  • A topologyKey must be specified; it expresses the topology domain a node belongs to

    • kubernetes.io/hostname
    • failure-domain.beta.kubernetes.io/zone
    • failure-domain.beta.kubernetes.io/region
  • Notes on setting PodAffinity rules

    • Besides the label selector and topologyKey, a namespaces list can also be specified to restrict matching; it is defined at the same level as labelSelector and topologyKey. By default namespaces refers to the namespace the Pod itself is in; if namespaces is set to "", it means all namespaces
    • A Pod is scheduled onto a Node only after all matchExpressions associated with requiredDuringSchedulingIgnoredDuringExecution are satisfied
      The following example requires the new Pod to be in the same zone as Pods labeled app=test, but not on the same Node as Pods labeled app=nginx
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  restartPolicy: Always
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: failure-domain.beta.kubernetes.io/zone
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - test
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - nginx
          topologyKey: kubernetes.io/hostname

Taints and Tolerations

  • Taints are the opposite of the Affinity features above: they let a Node refuse to run certain Pods

Adding a Taint with the NoSchedule effect to a node (besides NoSchedule, the effect can also be PreferNoSchedule or NoExecute) means that unless a Pod explicitly declares that it tolerates the Taint, it will not be scheduled onto that Node. For a NoExecute Taint, Pods already running on the node that cannot tolerate it are evicted.

oc taint nodes <node-name> key1=value1:NoSchedule
oc taint nodes <node-name> key1=value1:NoExecute
oc taint nodes <node-name> key2=value2:NoSchedule
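A taint can be removed again with the same command by appending `-` to the key or key:effect pair; this is standard `kubectl taint` syntax that `oc` mirrors (`<node-name>` remains a placeholder):

```shell
# remove only the NoSchedule taint with key key1 from the node
oc taint nodes <node-name> key1:NoSchedule-
# remove all taints with key key1, regardless of effect
oc taint nodes <node-name> key1-
```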
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  restartPolicy: Always
  tolerations:
  - key: "key1"
    operator: "Exists"
    effect: "NoSchedule"
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoExecute"
    tolerationSeconds: 3600

DaemonSet: schedule one Pod on each Node

  • Manages exactly one replica of a Pod on every Node in the cluster.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    component: fluentd
  name: logging-fluentd
  namespace: logging
spec:
  selector:
    matchLabels:
      component: fluentd
  template:
    metadata:
      labels:
        component: fluentd
      name: fluentd-elasticsearch
    spec:
      containers:
      - name: fluentd-elasticsearch
        image: docker.io/openshift/origin-logging-fluentd:latest
        imagePullPolicy: IfNotPresent
      nodeSelector:
        logging-infra-fluentd: "true"

Job: batch scheduling

  • Defines batch tasks (starting multiple compute processes, in parallel or serially, to process a batch of work items)

  • Depending on how the batch task is implemented, there are four patterns

    • Job Template Expansion: one Job object per work item; as many independent Jobs are created as there are work items. Suitable when there are few work items and each one processes a relatively large amount of data
    • Queue with Pod Per Work Item: work items are placed in a task queue and a Job object acts as the consumer that completes them; the Job starts multiple Pods, one Pod per work item
    • Queue with Variable Pod Count: work items are placed in a task queue and a Job object acts as the consumer that completes them; the number of Pods the Job starts is variable
    • Single Job with Static Work Assignment: work items are assigned statically
  • By degree of parallelism, Jobs fall into three types

    • Non-parallel Jobs: usually one Job starts one Pod; the Pod is restarted only if it fails, and once the Pod completes normally the Job is finished
    • Parallel Jobs with a fixed completion count: the Job starts multiple Pods and requires .spec.completions to be set; the Job finishes when the number of successfully completed Pods reaches that value. .spec.parallelism controls the degree of parallelism, i.e. how many Pods run concurrently to process the work items
    • Parallel Jobs with a work queue: queue-based parallel Jobs require a separate queue in which the work items are stored; the .spec.completions parameter must not be set
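The difference between these types shows up mainly in .spec.completions and .spec.parallelism; a fixed-completion-count parallel Job could be sketched as follows (the name and command here are illustrative, not from the original):

```yaml
kind: Job
apiVersion: batch/v1
metadata:
  name: parallel-example        # illustrative name
spec:
  completions: 5                # Job succeeds once 5 Pods complete normally
  parallelism: 2                # at most 2 Pods run concurrently
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo processing one work item"]
      restartPolicy: Never
```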
kind: Job
apiVersion: batch/v1
metadata:
  name: process-item
  labels:
    jobgroup: jobexample
spec:
  template:
    metadata:
      name: jobexample
      labels:
        jobgroup: jobexample
    spec:
      containers:
      - name: c
        image: busybox
        command: ["sh", "-c", "echo jobexample && sleep 5"]
      restartPolicy: Never

CronJob: scheduled tasks

  • Scheduled tasks similar to Linux cron

  • Schedule expression: Minutes Hours DayOfMonth Month DayOfWeek
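The fields of the schedule string line up as follows (the example value is illustrative):

```yaml
# minute(0-59) hour(0-23) day-of-month(1-31) month(1-12) day-of-week(0-6, Sunday=0)
schedule: "30 2 * * 1"   # 02:30 every Monday; "*/1 * * * *" means every minute
```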

kind: CronJob
apiVersion: batch/v1
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            command: ["sh", "-c", "echo Hello && sleep 5"]
          restartPolicy: OnFailure

Reference: 《Kubernetes权威指南》 (The Definitive Guide to Kubernetes)