Troubleshooting notes: VIP and SCAN IP alive on both nodes at the same time
Customer environment:
Two Linux 6.5 servers on VMware vSphere 5.1, running a two-node Oracle 11.2.0.3 RAC cluster.
Customer problem report:
The customer reported that the cluster databases were accessible from outside, but oddly, checking on node 1 showed XXX2.vip as FAILED OVER, while checking on node 2 showed ****1.vip as FAILED OVER.
Problem analysis approach:
(1) Check the cluster status and see whether it matches the customer's description;
(2) Check the OS log (messages), the cluster alert log, the cssd log, the database alert log and so on for abnormal errors (a sketch of the corresponding commands follows this list);
(3) Check the underlying deployment: communication across the networks, and how the disks are carved up and used;
(4) With the customer's approval, restart the cluster, check the logs for anomalies, and observe whether the cluster comes back to a normal state after the restart.
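A minimal sketch of the commands these checks typically involve, assuming the Grid Infrastructure home /u01/app/11.2.0/grid and the host name yyzc01 that appear in the logs later on; adjust the paths for the actual environment and run on both nodes:
$ crsctl stat res -t        # overall cluster resource status
$ crsctl check crs          # health of the CRS, CSS and EVM daemons
$ tail -100 /var/log/messages                                        # OS log (as root)
$ tail -200 /u01/app/11.2.0/grid/log/yyzc01/alertyyzc01.log          # cluster alert log
$ tail -200 /u01/app/11.2.0/grid/log/yyzc01/cssd/ocssd.log           # cssd log
$ tail -200 $ORACLE_BASE/diag/rdbms/<db_name>/<instance>/trace/alert_<instance>.log   # database alert log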
Problem handling:
Logged on to the servers and ran the same check on both nodes, as shown in the screenshots:
Node 1:
$ crsctl stat res -t
Node 2:
$ crsctl stat res -t
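For reference only, the symptom on node 1 would look roughly like the excerpt below. This is an illustrative reconstruction based on the standard 11.2 crsctl stat res -t output format and the node names yyzc01/yyzc02 from the logs, not the customer's actual output:
NAME           TARGET  STATE        SERVER          STATE_DETAILS
ora.yyzc02.vip
      1        ONLINE  INTERMEDIATE yyzc01          FAILED OVER
On node 2 the mirror image appeared: the node 1 VIP shown as FAILED OVER onto yyzc02.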
The cluster status check confirmed that things were exactly as the customer had described. Could the network be the problem?
I pinged every IP address; there was no packet loss or error at all.
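A sketch of that kind of connectivity check; the addresses below are placeholders for the public, VIP, private and SCAN IPs, which are not listed in this note:
$ # placeholder addresses only: node public IPs, VIPs, private IPs and the SCAN IP
$ for ip in 192.168.56.101 192.168.56.102 192.168.56.111 192.168.56.112 10.0.0.1 10.0.0.2 192.168.56.120; do ping -c 3 $ip | grep loss; done    # expect "0% packet loss" for every address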
The operating system logs showed no abnormal output on either node.
In the cluster alert log, a listener status check had failed and node 2 was recorded as having been manually shut down, yet node 2 was clearly up and running when I checked. What could have caused that?
2019-03-05 10:41:16.062: [ohasd(9155)]CRS-2112:The OLR service started on node yyzc01.
2019-03-05 10:41:16.077: [ohasd(9155)]CRS-1301:Oracle High Availability Service started on node yyzc01.
2019-03-05 10:41:16.077: [ohasd(9155)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2019-03-05 10:41:19.200: [gpnpd(9292)]CRS-2328:GPNPD started on node yyzc01.
2019-03-05 10:41:21.679: [cssd(9362)]CRS-1713:CSSD daemon is started in clustered mode
2019-03-05 10:41:23.482: [ohasd(9155)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2019-03-05 10:41:41.460: [cssd(9362)]CRS-1707:Lease acquisition for node yyzc01 number 1 completed
2019-03-05 10:41:42.843: [cssd(9362)]CRS-1605:CSSD voting file is online: /dev/asm-diskg; details in /u01/app/11.2.0/grid/log/yyzc01/cssd/ocssd.log.
2019-03-05 10:41:42.847: [cssd(9362)]CRS-1605:CSSD voting file is online: /dev/asm-diskf; details in /u01/app/11.2.0/grid/log/yyzc01/cssd/ocssd.log.
2019-03-05 10:41:42.854: [cssd(9362)]CRS-1605:CSSD voting file is online: /dev/asm-diske; details in /u01/app/11.2.0/grid/log/yyzc01/cssd/ocssd.log.
2019-03-05 10:41:51.981: [cssd(9362)]CRS-1601:CSSD Reconfiguration complete. Active nodes are yyzc01 .
2019-03-05 10:41:53.963: [ctssd(9517)]CRS-2407:The new Cluster Time Synchronization Service reference node is host yyzc01.
2019-03-05 10:41:53.965: [ctssd(9517)]CRS-2401:The Cluster Time Synchronization Service started on host yyzc01.
2019-03-05 10:41:55.703: [ohasd(9155)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2019-03-05 10:42:18.031: [crsd(9705)]CRS-1012:The OCR service started on node yyzc01.
2019-03-05 10:42:18.802: [evmd(9537)]CRS-1401:EVMD started on node yyzc01.
2019-03-05 10:42:19.971: [crsd(9705)]CRS-1201:CRSD started on node yyzc01.
2019-03-05 10:42:21.273: [/u01/app/11.2.0/grid/bin/oraagent.bin(9814)]CRS-5016:Process "/u01/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/yyzc01/agent/crsd/oraagent_grid/oraagent_grid.log"
2019-03-05 10:42:21.273: [/u01/app/11.2.0/grid/bin/oraagent.bin(9814)]CRS-5016:Process "/u01/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/yyzc01/agent/crsd/oraagent_grid/oraagent_grid.log"
2019-03-05 10:42:21.287: [/u01/app/11.2.0/grid/bin/oraagent.bin(9814)]CRS-5016:Process "/u01/app/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/yyzc01/agent/crsd/oraagent_grid/oraagent_grid.log"
2019-03-05 10:42:22.793: [crsd(9705)]CRS-2772:Server 'yyzc01' has been assigned to pool 'Generic'.
2019-03-05 10:42:22.793: [crsd(9705)]CRS-2772:Server 'yyzc01' has been assigned to pool 'ora.MOEUUMDB'.
2019-03-05 10:42:22.794: [crsd(9705)]CRS-2772:Server 'yyzc01' has been assigned to pool 'ora.MOEUIADB'.
2019-03-05 10:42:23.026: [client(9948)]CRS-4743:File /u01/app/11.2.0/grid/oc4j/j2ee/home/OC4J_DBWLM_config/system-jazn-data.xml was updated from OCR(Size: 13384(New), 13397(Old) bytes)
2019-03-05 10:43:01.702: [cssd(9362)]CRS-1625:Node yyzc02, number 2, was manually shut down
At this point I was still completely in the dark: the alert log had no obvious errors, the cluster databases started normally, and the overall status looked fairly normal. So what was really causing this?
Look at it the other way around. This is a two-node cluster; under normal circumstances, when either node is shut down, its VIP fails over to the surviving node and continues to serve clients, which is exactly the high availability Oracle clusterware is supposed to provide. Here, however, each node considered itself healthy and everything else faulty, so each node simply carried on by itself? No, that cannot be right: in that situation the voting disk arbitrates, and the node that reaches the voting disk first survives and evicts the other node from the cluster. That is Oracle's split-brain resolution mechanism, so only one node should survive, yet here both nodes had started up normally.
In other words, the environment had effectively split into two independent single-node clusters. Checking cssd (the service process that manages the cluster configuration and node membership), there was one telling log message: when fetching the latest information about node 2, only a single node was considered active, which once again confirmed my suspicion.
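A quick way to confirm this split view with standard 11.2 tooling is to ask each node, in turn, which cluster members and voting files it currently sees; in a healthy two-node cluster both nodes list both members as active:
$ olsnodes -n -s              # node names, numbers and status as seen from this node
$ crsctl query css votedisk   # voting files this node's CSSD is actually using
$ crsctl stat res -t          # compare the resource view reported by each node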
Having analyzed this far, it had to be a shared disk problem. Shared disk attributes on node 2:
Shared disk attributes on node 1:
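Besides checking the disk attributes on the virtualization side, one OS-level cross-check is to compare the SCSI identifiers of the "shared" disks on both nodes; matching IDs mean the two nodes really see the same LUN. A sketch, assuming the /dev/asm-disk* udev aliases from the logs and the EL6 scsi_id syntax (option names differ between releases):
$ for d in /dev/asm-diske /dev/asm-diskf /dev/asm-diskg; do echo -n "$d  "; /sbin/scsi_id --whitelisted --replace-whitespace --device=$d; done    # run on each node and compare the output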
Only at this point did it become clear that it really was a shared disk problem. To add some background: in a virtualized environment, shared disks must be attached at the same SCSI positions with the sharing attribute selected. Normally the configuration should look like the following:
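Rendered as .vmx entries rather than the original screenshot, a correctly shared RAC disk is typically presented to both VMs from a dedicated SCSI controller with bus sharing enabled and the same SCSI position; the controller number, path and values below are assumptions for illustration only:
scsi1.present = "TRUE"
scsi1.virtualDev = "lsilogic"
scsi1.sharedBus = "physical"        # bus sharing so both VMs may attach the same disk
scsi1:0.present = "TRUE"
scsi1:0.deviceType = "scsi-hardDisk"
scsi1:0.fileName = "/vmfs/volumes/datastore1/shared/asm_disk01.vmdk"   # both VMs must reference the same VMDK at the same scsi1:0 position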
With that, the problem was basically nailed down. The customer had previously performed a storage migration, and the shared-disk requirement was overlooked during that migration. Afterwards VMware assigned the disks new SCSI positions, so the disks ended up non-shared, and each node could still access its disks and start up on its own, which is exactly the problem observed.