Upgrading an OpenShift Cluster from 3.9 to 3.10

Upgrade method and strategy: automated upgrade

Download the openshift-ansible playbooks

git clone https://github.com/openshift/openshift-ansible.git
cd openshift-ansible
git checkout release-3.10

Back up /etc/origin/master/htpasswd from the master to /root/htpasswd

scp master1:/etc/origin/master/htpasswd /root/htpasswd

Remove the filename key from openshift_master_identity_providers

# /etc/ansible/hosts
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login':'true','challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider'}]
openshift_master_htpasswd_file=/root/htpasswd
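
The change above can be sketched as a small Python helper. This is only an illustration: the function name split_htpasswd_filename and the sample 3.9 value (including its 'filename' path) are assumptions made for the example, not part of openshift-ansible.

```python
import ast


def split_htpasswd_filename(providers_value, local_copy="/root/htpasswd"):
    """Return the two 3.10-style inventory lines for an htpasswd provider.

    providers_value is the right-hand side of
    openshift_master_identity_providers as written in a 3.9 inventory.
    """
    # The inventory value uses Python-style literals, so literal_eval can read it.
    providers = ast.literal_eval(providers_value)
    for provider in providers:
        if provider.get("kind") == "HTPasswdPasswordIdentityProvider":
            # 3.10 expects the file path in its own variable.
            provider.pop("filename", None)
    return [
        "openshift_master_identity_providers=%r" % (providers,),
        "openshift_master_htpasswd_file=%s" % local_copy,
    ]


# Hypothetical 3.9-style value, used only to exercise the helper.
OLD_VALUE = (
    "[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', "
    "'kind': 'HTPasswdPasswordIdentityProvider', "
    "'filename': '/etc/origin/master/htpasswd'}]"
)

for line in split_htpasswd_filename(OLD_VALUE):
    print(line)
```

The helper only prints the lines; copying them into /etc/ansible/hosts is left as a manual step so the inventory is never rewritten unattended.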

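Work around an upgrade error caused by the [glusterfs] host groups

playbooks/common/openshift-cluster/upgrades/v3_10/upgrade_control_plane.yml compares the glusterfs and glusterfs_registry host groups. Inventories written for 3.9 and earlier did not necessarily need these host groups, so they must be added to the Ansible hosts file (they can be left empty).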

# /etc/ansible/hosts
[glusterfs]
[glusterfs_registry]
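
Before running the upgrade, it may help to confirm that both groups are present. The following is a minimal sketch, assuming an INI-style inventory; the function name missing_groups and the sample inventory are made up for illustration.

```python
import re

# Matches section headers such as [glusterfs] or [OSEv3:children].
_GROUP_HEADER = re.compile(r"^[ \t]*\[([^\]:]+)(?::\w+)?\][ \t]*$", re.M)


def missing_groups(inventory_text,
                   required=("glusterfs", "glusterfs_registry")):
    """Return the required group names that have no [group] header."""
    defined = set(_GROUP_HEADER.findall(inventory_text))
    return [group for group in required if group not in defined]


# Hypothetical inventory fragment that defines only one of the two groups.
SAMPLE = """\
[OSEv3:children]
masters
nodes

[masters]
master

[glusterfs]
"""

print(missing_groups(SAMPLE))
```

To check a real inventory, read the file and pass its text, for example missing_groups(open("/etc/ansible/hosts").read()).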

Add openshift_node_group_name to every host in the nodes list of the Ansible hosts file

......
master openshift_node_group_name='node-config-master'
node openshift_node_group_name='node-config-compute'
infra openshift_node_group_name='node-config-infra'
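
A quick way to catch hosts that were missed is to scan the [nodes] section for entries without the variable. This is a rough sketch that assumes one host per line in an INI-style inventory; hosts_missing_group_name and the sample text are illustrative names, not part of openshift-ansible.

```python
def hosts_missing_group_name(inventory_text, section="nodes",
                             var="openshift_node_group_name"):
    """List hosts in [section] whose line does not set the given variable."""
    missing = []
    current = None
    for raw in inventory_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#") or line.startswith(";"):
            continue
        # Track which [section] we are in.
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            continue
        if current == section and (var + "=") not in line:
            missing.append(line.split()[0])
    return missing


# Hypothetical inventory fragment: the infra host has not been updated yet.
SAMPLE = """\
[nodes]
master openshift_node_group_name='node-config-master'
node openshift_node_group_name='node-config-compute'
infra
"""

print(hosts_missing_group_name(SAMPLE))
```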

Upgrade the default node group ConfigMaps

# ansible-playbook -i </path/to/inventory/file> playbooks/openshift-master/openshift_node_group.yml

Check the Python OpenSSL version

python -c 'import OpenSSL.crypto'
# If the import fails, upgrade Python OpenSSL
yum install python2-pip
pip install -U pyopenssl -i https://pypi.douban.com/simple
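
The one-line import test can also be written as a small script that reports the failure reason or the installed version. This is a sketch only; check_module is a hypothetical helper name, and the script should be run with the same Python interpreter that Ansible uses.

```python
import importlib


def check_module(name="OpenSSL.crypto"):
    """Return (ok, detail): ok is True when the module imports cleanly."""
    try:
        importlib.import_module(name)
    except Exception as exc:
        # Typically ImportError, but a mismatched dependency can raise
        # other errors during import, so report whatever happened.
        return False, "%s: %s" % (type(exc).__name__, exc)
    top_level = importlib.import_module(name.split(".")[0])
    return True, getattr(top_level, "__version__", "unknown")


if __name__ == "__main__":
    ok, detail = check_module()
    print("OpenSSL.crypto import %s (%s)" % ("ok" if ok else "FAILED", detail))
```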

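Disable the SELinux check on the etcd data directory (not recommended; it is better to keep SELinux enabled)
Comment out the tasks "Check selinux label of '{{ etcd_data_dir }}'" and "Make sure the '{{ etcd_data_dir }}' has the proper label"
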
# roles/etcd/tasks/backup/backup.yml
...
- name: Detecting Atomic Host Operating System
  stat:
    path: /run/ostree-booted
  register: l_ostree_booted

#- name: Check selinux label of '{{ etcd_data_dir }}'
#  command: >
#    stat -c '%C' {{ etcd_data_dir }}
#  register: l_etcd_selinux_labels
#
#- debug:
#    msg: "{{ l_etcd_selinux_labels }}"
#
#- name: Make sure the '{{ etcd_data_dir }}' has the proper label
#  command: >
#    chcon -t svirt_sandbox_file_t  "{{ etcd_data_dir }}"
#  when:
#  - l_etcd_selinux_labels.rc == 0
#  - "'svirt_sandbox_file_t' not in l_etcd_selinux_labels.stdout"

- name: Generate etcd backup
  command: >
    {{ r_etcd_common_etcdctl_command }} backup --data-dir={{ l_etcd_incontainer_data_dir }}
    --backup-dir={{ l_etcd_incontainer_backup_dir }}
...

Upgrade the cluster

# ansible-playbook -i </path/to/inventory/file> playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.yml

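Problems you may encounter during the upgrade

Web console customization compatibility

The web console is customized by adding CSS and JS file paths to the webconsole-config ConfigMap in the openshift-web-console project. Version 3.9 accepts paths without quotes, whereas version 3.10 requires them to be quoted.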
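
As a rough illustration of the quoting change, the sketch below wraps bare YAML list items in double quotes. It assumes the paths are plain list entries (for example under scriptURLs and stylesheetURLs) and contain no double quotes themselves; quote_list_items and the sample URLs are made up for the example, and the helper should only be applied to the relevant part of the ConfigMap because it quotes every bare list item it sees.

```python
import re

# A list item whose value does not already start with a quote character.
_BARE_ITEM = re.compile(r"^(\s*-\s+)([^\"'\s].*?)\s*$")


def quote_list_items(yaml_text):
    """Wrap unquoted YAML list items in double quotes, line by line."""
    out = []
    for line in yaml_text.splitlines():
        match = _BARE_ITEM.match(line)
        if match:
            out.append('%s"%s"' % (match.group(1), match.group(2)))
        else:
            out.append(line)
    return "\n".join(out)


# Hypothetical 3.9-style fragment with unquoted paths.
SAMPLE = """\
extensions:
  scriptURLs:
    - https://example.com/custom.js
  stylesheetURLs:
    - https://example.com/custom.css
"""

print(quote_list_items(SAMPLE))
```

Running the helper a second time leaves already quoted items unchanged.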

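SELinux must be enabled

Every node (masters, routers, and compute nodes) must have SELinux enabled.

Ceph volumes fail to mount

If Ceph volumes are mounted, the mounts time out during the upgrade. The problem resolves itself once the upgrade has finished.

Inventory format change for HTPasswdPasswordIdentityProvider

In 3.9 the htpasswd file path could be given inside the openshift_master_identity_providers variable. In 3.10 it has to be set in a separate variable, as follows:

openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login':'true','challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider'}]
openshift_master_htpasswd_file=/root/htpasswd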

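OVS cannot be deployed on router nodes that are tainted as dedicated

Version 3.10 deploys its components as containers. If the router nodes carry taints (so that they serve only the router and application performance is protected), the node and OVS services cannot be scheduled onto them. In that case, add tolerations to the corresponding DaemonSets. The DaemonSets currently involved are openshift-node/sync, openshift-sdn/sdn, and openshift-sdn/ovs.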
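
One way to add the tolerations is with oc patch. The sketch below only prints the commands so they can be reviewed first. It assumes a single catch-all toleration ({"operator": "Exists"}), which tolerates every taint and replaces any tolerations already set on the DaemonSet; narrow it to the router taint's key and effect if that is too broad. The helper names are illustrative.

```python
import json
import shlex

# (namespace, DaemonSet name) pairs listed in the text above.
DAEMONSETS = [
    ("openshift-node", "sync"),
    ("openshift-sdn", "sdn"),
    ("openshift-sdn", "ovs"),
]


def toleration_patch():
    """Patch body that sets a catch-all toleration on the pod template."""
    return {
        "spec": {
            "template": {
                "spec": {"tolerations": [{"operator": "Exists"}]}
            }
        }
    }


def patch_commands():
    """Build one 'oc patch' command line per DaemonSet."""
    body = json.dumps(toleration_patch(), sort_keys=True)
    return [
        "oc patch daemonset %s -n %s -p %s" % (name, namespace, shlex.quote(body))
        for namespace, name in DAEMONSETS
    ]


for command in patch_commands():
    print(command)
```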

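To minimize the impact on production workloads, upgrade the cluster in phases

A phased upgrade means upgrading the master (control plane) nodes first and then upgrading the compute nodes one by one.

Run the upgrade_control_plane.yml playbook to upgrade the control plane nodes

$ cd /usr/share/ansible/openshift-ansible
$ ansible-playbook -i </path/to/inventory/file> \
    playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade_control_plane.yml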

Run the upgrade_nodes.yml playbook to upgrade the compute nodes

$ cd /usr/share/ansible/openshift-ansible
$ ansible-playbook -i </path/to/inventory/file> \
    playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade_nodes.yml \
    [-e <customized_node_upgrade_variables>]

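Pass variables with -e to control the pace of the node upgrade. For example:
-e openshift_upgrade_nodes_serial="20%" upgrades only 20% of the nodes at a time.
-e openshift_upgrade_nodes_serial="2" -e openshift_upgrade_nodes_label="region=group1" upgrades only the nodes labeled region=group1, two at a time.
-e openshift_upgrade_nodes_drain_timeout=600 sets the openshift_upgrade_nodes_drain_timeout variable, which specifies how long to wait before a node is marked as failed.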
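
To get a feel for how a serial value splits the nodes into batches, here is a small sketch. It assumes the value behaves like Ansible's serial keyword (a percentage is rounded down, with at least one host per batch); the function names and node names are made up for the example.

```python
def batch_size(serial, total):
    """Translate a serial value such as "20%" or "2" into a host count."""
    value = str(serial).strip()
    if value.endswith("%"):
        # Percentages are rounded down, but never below one host.
        return max(1, int(float(value[:-1]) * total / 100))
    return max(1, int(value))


def batches(nodes, serial):
    """Split the node list into consecutive batches of the computed size."""
    size = batch_size(serial, len(nodes))
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


# Ten hypothetical compute nodes, upgraded 20% at a time.
NODES = ["node%02d" % n for n in range(1, 11)]
for index, group in enumerate(batches(NODES, "20%"), start=1):
    print("batch %d: %s" % (index, ", ".join(group)))
```

With ten nodes, "20%" gives five batches of two nodes each. A smaller value limits how many nodes are out of service at once, at the cost of a longer overall upgrade.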