Cluster failure caused by insufficient /tmp space

Oracle · newbie_l · posted 1 year ago · 1460 views

云吞信息--风轻云淡

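Symptoms

An enterprise runs an 11.2.0.3 RAC built on two midrange UNIX servers. Because the workload had changed, the memory was upgraded. The engineer performed the following steps:

Add 4GB of memory to each of the two nodes

Stop the resources and the operating system on node 1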

crsctl stop cluster 

shutdown -h now 

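Activate node 1 so that the new configuration file takes effect; the resources on the server start automatically

Once everything is up, stop the resources and the operating system on node 2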

crsctl stop cluster 

shutdown -h now 

Activate node 2 so that the new configuration file takes effect; the resources on the server start automatically

crs_stat -t -v 

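All resources were reported as running normally, and a tnsping test against the database succeeded

However, a check of the database itself showed that the instance on node 2 had not started. Running the following on node 2: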

crs_stat -t -v 

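reported a CRS communication failure, and the crsctl stop cluster command failed as well. A check of the servers showed that node 1 had rebooted. After the following command was run manually on node 2 to start the resources, all resources on both nodes ran normally again:

crsctl start cluster

The logs for the relevant time window contain the following: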

2015-01-11 20:32:32.985 [cssd(9502952)]CRS-1612:Network communication with node racprd03a (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.513 seconds 

2015-01-11 20:32:40.997 [cssd(9502952)]CRS-1611:Network communication with node racprd03a (1) missing for 75% of timeout interval. Removal of this node from cluster in 6.501 seconds   

2015-01-11 20:32:45.004 [cssd(9502952)]CRS-1610:Network communication with node racprd03a (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.494 seconds 

2015-01-11 20:32:47.506 [cssd(9502952)]CRS-1632:Node racprd03a is being removed from the cluster in cluster incarnation 226490782 

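As the messages above show, node 1 was evicted from the cluster because its network heartbeat was lost; when this happens, the database on node 1 restarts automatically.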
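The countdown to eviction can be pulled out of a long log with a small filter. The following is only a sketch: the function name `heartbeat_warnings` is made up for illustration, and it assumes each warning sits on a single line in the same layout as the excerpt quoted in this post.

```shell
#!/bin/sh
# Sketch only: heartbeat_warnings is a hypothetical helper, not an Oracle tool.
# It reads log text on standard input and, for every CRS-1612/1611/1610
# heartbeat warning, prints: date, time, message code, node name, and the
# number of seconds left before the node is removed.
heartbeat_warnings() {
    sed -n 's/^\([0-9-]* [0-9:.]*\) .*\(CRS-161[012]\):Network communication with node \([^ ]*\) .* in \([0-9.]*\) seconds.*/\1 \2 \3 \4s/p'
}
```

To use it, redirect the clusterware alert log of the surviving node into the function, for example `heartbeat_warnings < /path/to/alert.log` (the path here is a placeholder; the real location depends on the Grid Infrastructure installation).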

2015-01-12 02:41:05.783 [client(20971636)]CRS-10051:CVU found following errors with Clusterware setup : PRVF-7501 : Sufficient space is not available at location "/tmp/" on node "racprd03b" [Required space = 1GB ] PRVF-7573 : Sufficient swap size is not available on node "racprd03b" [Required = 16GB (1.6777216E7KB) ; Found = 8GB (8388608.0KB)] 

PRVF-7573 : Sufficient swap size is not available on node "racprd03a" [Required = 16GB (1.6777216E7KB) ; Found = 8GB (8388608.0KB)] 

PRVF-5305 : The Oracle Clusterware is not healthy on node "racprd03b" 

CRS-4535: Cannot communicate with Cluster Ready Services 

CRS-4529: Cluster Synchronization Services is online 

CRS-4534: Cannot communicate with Event Manager 

PRVF-4557 : Node application "ora.racprd03b.vip" is offline on node "racprd03b" 

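The CVU output shows that /tmp on node 2 did not have the required free space (at least 1GB), which left that node unable to communicate with the cluster properly. In addition, swap on both nodes was only 8GB, while 16GB was required.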
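The two shortfalls reported by CVU can be checked by hand before a node is restarted. The sketch below uses two made-up helper names, `check_min_kb` and `free_kb`; the thresholds are the ones quoted in the PRVF messages above. Swap is compared using the figures from the log rather than a live reading, because the command that reports swap differs from one operating system to another.

```shell
#!/bin/sh
# Sketch only: check_min_kb and free_kb are hypothetical helper names.

# Compare a measured size against a required minimum, both in KB.
# Prints an OK or FAIL line and returns non-zero on FAIL.
check_min_kb() {
    label=$1
    found_kb=$2
    required_kb=$3
    if [ "$found_kb" -ge "$required_kb" ]; then
        echo "OK $label: found ${found_kb}KB, required ${required_kb}KB"
    else
        echo "FAIL $label: found ${found_kb}KB, required ${required_kb}KB"
        return 1
    fi
}

# Available space in KB on the filesystem holding the given path.
# With POSIX output format (-P), column 4 of the second line is "Available".
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# /tmp must have at least 1GB (1048576 KB) free, per PRVF-7501.
check_min_kb "/tmp free space" "$(free_kb /tmp)" 1048576 || true

# Swap figures copied from the PRVF-7573 message: 8GB found, 16GB required.
check_min_kb "swap" 8388608 16777216 || true
```

The `|| true` after each call only keeps the script running so that both results are printed; drop it if the script should stop at the first failure.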

Resolution

1. Resize /tmp and swap;

2. Restart the cluster, after which the problem was resolved.
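After the restart, it is worth confirming that no node still reports the "Cannot communicate" messages seen in the CVU output. A minimal sketch, assuming the health check prints one status message per line as in the log above; `report_unreachable` is a made-up helper name.

```shell
#!/bin/sh
# Sketch only: report_unreachable is a hypothetical helper name.
# It reads clusterware status messages on standard input, prints every
# line saying a component cannot be reached, and returns 1 in that case.
# If no such line exists it prints a short confirmation and returns 0.
report_unreachable() {
    if grep 'Cannot communicate'; then
        return 1
    fi
    echo "all components reachable"
}
```

One way to use it is `crsctl check cluster -all | report_unreachable`, run as a user with access to the Grid Infrastructure binaries.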

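This article was written by newbie_l and is licensed under the Creative Commons Attribution 3.0 China Mainland license. It may be freely reposted and quoted, provided the author is credited and the source is cited.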


1 reply
ruyi #1 · 1 year ago · 0 likes

:joy: Too bad I don't have a RAC environment to practice on; this is way out of my league
