Notes on an Elasticsearch GC Incident

Posted by bluesky blog on March 31, 2016

The Incident

After we ran a very large query, Elasticsearch became noticeably slow. The logic of that query was:

1. Query one month of index data (the indices are split by day, roughly 100 GB per day)

2. Aggregate the results by minute (a sketch of such a query follows)
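
The exact query is not preserved here; the following is only a minimal sketch of that kind of request against a 1.x cluster, assuming a hypothetical daily index naming pattern and an @timestamp field:

curl -XPOST 'http://XXX:9200/index-2016.03.*/_search' -d '{
  "size": 0,
  "aggs": {
    "per_minute": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "minute"
      }
    }
  }
}'

Scanning a month of indices at roughly 100 GB per day and bucketing every document into per-minute buckets is a heavy request on its own.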

We first ruled out network problems. Next, we checked the ES cluster's health and found it in the normal "green" state.
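
The health check is the standard cluster health API; "green" means every primary and replica shard is allocated:

curl -XGET 'http://XXX:9200/_cluster/health?pretty'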

We then looked at the ES logs:

[2016-03-31 16:24:42,335][WARN ][monitor.jvm              ] [host] [gc][old][17436136][4362] duration [54.3s], collections [1]/[55.2s], total [54.3s]/[48.5m], memory [46.2gb]->[45.5gb]/[47.8gb], all_pools {[young] [44.3mb]->[553.4mb]/[1.4gb]}{[survivor] [191.3mb]->[0b]/[191.3mb]}{[old] [45.9gb]->[44.9gb]/[46.1gb]}
[2016-03-31 16:25:45,260][WARN ][monitor.jvm              ] [host] [gc][old][17436139][4363] duration [1m], collections [1]/[1m], total [1m]/[49.5m], memory [46.2gb]->[45.5gb]/[47.8gb], all_pools {[young] [372.4mb]->[84.7mb]/[1.4gb]}{[survivor] [191.3mb]->[0b]/[191.3mb]}{[old] [45.6gb]->[45.4gb]/[46.1gb]}
[2016-03-31 16:26:42,331][WARN ][monitor.jvm              ] [host] [gc][old][17436141][4364] duration [54.8s], collections [1]/[55.5s], total [54.8s]/[50.4m], memory [46.2gb]->[46.1gb]/[47.8gb], all_pools {[young] [235.2mb]->[31.7mb]/[1.4gb]}{[survivor] [191.3mb]->[0b]/[191.3mb]}{[old] [45.8gb]->[46gb]/[46.1gb]}
[2016-03-31 16:27:30,150][WARN ][monitor.jvm              ] [host] [gc][old][17436142][4365] duration [47.1s], collections [1]/[47.8s], total [47.1s]/[51.2m], memory [46.1gb]->[46.1gb]/[47.8gb], all_pools {[young] [31.7mb]->[91mb]/[1.4gb]}{[survivor] [0b]->[0b]/[191.3mb]}{[old] [46gb]->[46gb]/[46.1gb]}
[2016-03-31 16:27:30,362][WARN ][transport                ] [host] Received response for a request that has timed out, sent [403685ms] ago, timed out [348858ms] ago, action [cluster:monitor/nodes/stats[n]], node [[slave7][zhj9v_DbSSyZbMjAKQabdg][slave7][inet[/192.168.100.7:9300]]], id [9007132115]
[2016-03-31 16:28:26,066][WARN ][monitor.jvm              ] [host] [gc][old][17436143][4366] duration [55s], collections [1]/[55.9s], total [55s]/[52.1m], memory [46.1gb]->[46.1gb]/[47.8gb], all_pools {[young] [91mb]->[76mb]/[1.4gb]}{[survivor] [0b]->[0b]/[191.3mb]}{[old] [46gb]->[46.1gb]/[46.1gb]}
[2016-03-31 16:29:22,116][WARN ][monitor.jvm              ] [host] [gc][old][17436144][4367] duration [55.1s], collections [1]/[56s], total [55.1s]/[53m], memory [46.1gb]->[46.4gb]/[47.8gb], all_pools {[young] [76mb]->[279.4mb]/[1.4gb]}{[survivor] [0b]->[0b]/[191.3mb]}{[old] [46.1gb]->[46.1gb]/[46.1gb]}
[2016-03-31 16:30:19,139][WARN ][monitor.jvm              ] [host] [gc][old][17436146][4368] duration [55.1s], collections [1]/[55.8s], total [55.1s]/[53.9m], memory [47.3gb]->[46.1gb]/[47.8gb], all_pools {[young] [1.2gb]->[69mb]/[1.4gb]}{[survivor] [0b]->[0b]/[191.3mb]}{[old] [46.1gb]->[46gb]/[46.1gb]}
[2016-03-31 16:31:17,196][WARN ][monitor.jvm              ] [host] [gc][old][17436147][4369] duration [56.6s], collections [1]/[58s], total [56.6s]/[54.9m], memory [46.1gb]->[46.5gb]/[47.8gb], all_pools {[young] [69mb]->[635.7mb]/[1.4gb]}{[survivor] [0b]->[0b]/[191.3mb]}{[old] [46gb]->[45.9gb]/[46.1gb]}
[2016-03-31 16:32:08,625][WARN ][monitor.jvm              ] [host] [gc][old][17436148][4370] duration [50.4s], collections [1]/[51.4s], total [50.4s]/[55.7m], memory [46.5gb]->[46.5gb]/[47.8gb], all_pools {[young] [635.7mb]->[755mb]/[1.4gb]}{[survivor] [0b]->[0b]/[191.3mb]}{[old] [45.9gb]->[45.8gb]/[46.1gb]}

The logs showed frequent full GCs, which were blocking query requests.
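
The same heap pressure can also be confirmed without reading log files; the nodes stats API reports per-node heap usage and cumulative GC counts and times in its jvm section:

curl -XGET 'http://XXX:9200/_nodes/stats/jvm?pretty'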

Analysis

So what was causing the frequent GCs?

ES is built on Lucene, which is quite mature; in my experience it is unlikely that a single large query alone would cause frequent full GCs.

Looking at the detailed logs around the time the query was running, we saw:

[2016-03-31 15:04:51,407][INFO ][monitor.jvm              ] [host] [gc][young][17432909][9591508] duration [711ms], collections [1]/[3.9s], total [711ms]/[6.3d], memory [32.2gb]->[33.3gb]/[47.8gb], all_pools {[young] [92.8mb]->[322.8mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [32gb]->[32.8gb]/[46.1gb]}
[2016-03-31 15:04:53,659][INFO ][monitor.jvm              ] [host] [gc][young][17432911][9591509] duration [858ms], collections [1]/[1.2s], total [858ms]/[6.3d], memory [34.5gb]->[33.9gb]/[47.8gb], all_pools {[young] [1.4gb]->[38.5mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [32.8gb]->[33.6gb]/[46.1gb]}
[2016-03-31 15:04:59,234][INFO ][monitor.jvm              ] [host] [gc][young][17432915][9591511] duration [753ms], collections [1]/[1s], total [753ms]/[6.3d], memory [36.3gb]->[36gb]/[47.8gb], all_pools {[young] [1.4gb]->[84.8mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [34.7gb]->[35.7gb]/[46.1gb]}
[2016-03-31 15:05:08,071][WARN ][monitor.jvm              ] [host] [gc][young][17432920][9591514] duration [1.5s], collections [1]/[2.1s], total [1.5s]/[6.3d], memory [38.8gb]->[38.5gb]/[47.8gb], all_pools {[young] [1.2gb]->[333.9mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [37.4gb]->[38gb]/[46.1gb]}
[2016-03-31 15:05:24,098][INFO ][monitor.jvm              ] [host] [gc][young][17432929][9591523] duration [835ms], collections [1]/[1.5s], total [835ms]/[6.3d], memory [38.5gb]->[38.3gb]/[47.8gb], all_pools {[young] [1gb]->[270.3mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [37.2gb]->[37.9gb]/[46.1gb]}
[2016-03-31 15:06:27,267][INFO ][monitor.jvm              ] [host] [gc][young][17432982][9591561] duration [840ms], collections [1]/[2s], total [840ms]/[6.3d], memory [36.1gb]->[36.2gb]/[47.8gb], all_pools {[young] [1005.2mb]->[58mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [35gb]->[36gb]/[46.1gb]}
[2016-03-31 15:06:30,686][WARN ][monitor.jvm              ] [host] [gc][young][17432984][9591562] duration [1.1s], collections [1]/[2.4s], total [1.1s]/[6.3d], memory [36.9gb]->[37.1gb]/[47.8gb], all_pools {[young] [785.3mb]->[69.7mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [36gb]->[36.9gb]/[46.1gb]}
[2016-03-31 15:06:34,418][WARN ][monitor.jvm              ] [host] [gc][young][17432987][9591563] duration [1.1s], collections [1]/[1.6s], total [1.1s]/[6.3d], memory [38.4gb]->[37.9gb]/[47.8gb], all_pools {[young] [1.3gb]->[84.4mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [36.9gb]->[37.6gb]/[46.1gb]}

Large queries do indeed trigger GC, but these were young GCs rather than full GCs, each lasting roughly one second.

So what was actually causing the full GCs?

The logs also contained many entries like the following:

[2016-03-31 15:38:23,095][INFO ][cluster.metadata         ] [host] [[index.xx]] remove_mapping [[2016010423]]
[2016-03-31 15:39:38,730][WARN ][monitor.jvm              ] [host] [gc][old][17434637][4291] duration [26.9s], collections [2]/[32.1s], total [26.9s]/[34.2m], memory [41.4gb]->[34gb]/[47.8gb], all_pools {[young] [737.6mb]->[338.8mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [40.5gb]->[33.4gb]/[46.1gb]}
[2016-03-31 15:40:57,341][INFO ][cluster.metadata         ] [host] [[index.xx]] remove_mapping [[2016010509]]
[2016-03-31 15:41:24,516][INFO ][cluster.metadata         ] [host] [[index.xx]] remove_mapping [[2016010510]]
[2016-03-31 15:41:34,363][INFO ][cluster.metadata         ] [host] [[index.xx]] remove_mapping [[2016010511]]
[2016-03-31 15:41:49,896][INFO ][cluster.metadata         ] [host] [[index.xx]] remove_mapping [[2016010512]]
[2016-03-31 15:41:59,444][INFO ][cluster.metadata         ] [host] [[index.xx]] remove_mapping [[2016010513]]
[2016-03-31 15:42:05,476][INFO ][cluster.metadata         ] [host] [[index.xx]] remove_mapping [[2016010514]]
[2016-03-31 15:42:10,364][INFO ][cluster.metadata         ] [host] [[index.xx]] remove_mapping [[2016010515]]
[2016-03-31 15:42:45,147][INFO ][monitor.jvm              ] [host] [gc][young][17434761][9592521] duration [742ms], collections [1]/[1s], total [742ms]/[6.3d], memory [37.9gb]->[37.7gb]/[47.8gb], all_pools {[young] [1.4gb]->[243.5mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [36.2gb]->[37.3gb]/[46.1gb]}
[2016-03-31 16:03:45,683][WARN ][monitor.jvm              ] [host] [gc][old][17435819][4330] duration [29s], collections [2]/[30.9s], total [29s]/[34.7m], memory [46.1gb]->[33.7gb]/[47.8gb], all_pools {[young] [127.9mb]->[560mb]/[1.4gb]}{[survivor] [191.3mb]->[0b]/[191.3mb]}{[old] [45.8gb]->[33.2gb]/[46.1gb]}
[2016-03-31 16:04:09,356][INFO ][cluster.metadata         ] [host] [[index.xx]] remove_mapping [[2016011505]]
[2016-03-31 16:04:09,532][WARN ][transport                ] [host] Received response for a request that has timed out, sent [23725ms] ago, timed out [8725ms] ago, action [cluster:monitor/nodes/stats[n]], node [[slave6][7TsqZySmSrGCE_6gI9PRgA][slave6][inet[/192.168.100.6:9300]]{master=false}], id [9006968492]
[2016-03-31 16:04:13,209][INFO ][cluster.metadata         ] [host] [[index.xx]] remove_mapping [[2016011407]]
[2016-03-31 16:04:42,477][WARN ][monitor.jvm              ] [host] [gc][old][17435839][4332] duration [25.8s], collections [1]/[27.2s], total [25.8s]/[35.2m], memory [45.7gb]->[33.9gb]/[47.8gb], all_pools {[young] [61.2mb]->[285mb]/[1.4gb]}{[survivor] [191.3mb]->[0b]/[191.3mb]}{[old] [45.4gb]->[33.6gb]/[46.1gb]}
[2016-03-31 16:05:02,044][INFO ][cluster.metadata         ] [host] [index.xx update_mapping [2016033116] (dynamic)
[2016-03-31 16:05:03,427][WARN ][transport                ] [host] Received response for a request that has timed out, sent [19920ms] ago, timed out [4920ms] ago, action [cluster:monitor/nodes/stats[n]], node [[slave4][pBWdc_KsQcSczFSbU-nmPA][slave4][inet[/192.168.100.4:9300]]], id [9006974853]
[2016-03-31 16:05:05,545][INFO ][cluster.metadata         ] [host] [[index.xx]] remove_mapping [[2016011306]]
[2016-03-31 16:05:34,199][WARN ][monitor.jvm              ] [host] [gc][old][17435855][4334] duration [25.9s], collections [2]/[27s], total [25.9s]/[35.6m], memory [45.9gb]->[34.8gb]/[47.8gb], all_pools {[young] [233.4mb]->[321.4mb]/[1.4gb]}{[survivor] [191.3mb]->[0b]/[191.3mb]}{[old] [45.5gb]->[34.5gb]/[46.1gb]}
[2016-03-31 16:06:34,742][WARN ][monitor.jvm              ] [host] [gc][old][17435881][4336] duration [28.5s], collections [1]/[29.3s], total [28.5s]/[36.1m], memory [47.2gb]->[36.9gb]/[47.8gb], all_pools {[young] [1.1gb]->[646.5mb]/[1.4gb]}{[survivor] [191.3mb]->[0b]/[191.3mb]}{[old] [45.9gb]->[36.3gb]/[46.1gb]}

The logs showed that mappings were being deleted from one index in large batches.

A teammate explained that these mappings belonged to historical data that had not been cleaned up for a long time; he had just opened several terminal windows at once and run the delete operations in parallel.
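
On this Elasticsearch version (pre-2.0) a type's mapping is removed with the delete-mapping API, so the commands were presumably along these lines (index and type names taken from the log above):

curl -XDELETE 'http://XXX:9200/index.xx/_mapping/2016010423'

Each such delete is a cluster-state change that the master has to publish to every node, so firing many of them from several terminals at once produces a burst of metadata updates.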

It looked like we had found the culprit.

Checking the logs on the other ES nodes, we saw:

[2016-03-31 16:36:42,176][DEBUG][action.admin.indices.refresh] [hsot2] [index.xx][28], node[ESrXwrmuSD6FqJ4l5NJ7mw], [P], s[STARTED]: failed to execute [org.elasticsearch.action.admin.indices.refresh.RefreshRequest@65382d63]
org.elasticsearch.transport.NodeDisconnectedException: [host][inet[/192.168.100.3:9300]][indices:admin/refresh[s]] disconnected

There were many entries like this one: ES was trying to refresh the index.

At this point the picture was clear: because mappings were being deleted from the index in large concurrent batches, ES had to keep updating the cluster metadata held in memory, which consumed too much memory and triggered the full GCs. A web search also turned up reports of a similar problem.

In the case reported there, the type of each document written to the index was treated as a variable: every keyword produced its own "[keyword]" type in the mapping. As a result, practically every incoming log line caused ES to run an update_mapping and update cluster.metadata. A type is part of the mapping, and the mapping also ties the metadata to its types, so using types this way is effectively creating a new table for every keyword. Since the mapping is metadata that must be loaded into memory to manage every field of every index, it eventually grows enormous, memory runs out, and the cluster stops responding and effectively hangs.
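
Our own write pattern has the same flavor: the update_mapping [2016033116] (dynamic) entry in the log above shows a new type being created dynamically for an hour bucket. A hypothetical indexing request illustrating this (the document body is made up):

curl -XPOST 'http://XXX:9200/index.xx/2016033116/' -d '{
  "keyword": "some term",
  "@timestamp": "2016-03-31T16:05:02"
}'

Every previously unseen type name triggers a dynamic update_mapping, and each one permanently grows the cluster metadata that every node holds in memory.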

Resolution

With the cause identified, there were two ways to resolve it:

1. Wait for the GCs to finish

2. Restart the Elasticsearch cluster

Since there was no telling how long the GCs would take, we chose to restart the ES cluster.

Before restarting, we disabled the cluster's shard allocation (its automatic recovery and rebalancing behavior) so it would not spend a long time reshuffling shards after the restart:

curl -XPUT http://XXX:9200/_cluster/settings -d '{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}'

After the restart, we turned it back on:

curl -XPUT http://XXX:9200/_cluster/settings -d '{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}'

Summary

While restarting, we noticed the heap allocated to Elasticsearch was `-Xmx48g -Xms32g`. We changed it to `-Xmx48g -Xms48g`, pinning the lower bound to the upper bound so the JVM does not have to resize the heap after every garbage collection.
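
Where the heap is configured depends on how the node is launched; with the stock 1.x startup scripts it is normally set through the ES_HEAP_SIZE environment variable, which applies the same value to both -Xms and -Xmx, or explicitly via ES_JAVA_OPTS:

# sets both -Xms and -Xmx to 48g for the 1.x startup scripts
export ES_HEAP_SIZE=48g
# or pass the flags explicitly:
# export ES_JAVA_OPTS="-Xms48g -Xmx48g"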

Addendum: I recently read in the official Elastic documentation that the heap should preferably not exceed 32 GB: https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html

Allocating too much heap makes GC pauses longer.
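
The main concern in that guide is that above roughly 32 GB the JVM can no longer use compressed object pointers (compressed oops), so references double in size and the collector has more heap to walk. Whether a given heap size still gets compressed oops can be checked directly against the JVM:

java -Xmx48g -XX:+PrintFlagsFinal -version 2>/dev/null | grep UseCompressedOops

At -Xmx48g this reports false; below roughly 31-32 GB it reports true.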

I have also read that with large heaps the G1 collector may perform better; we have not tried it, and it would need testing.
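
As a rough, untested sketch, switching a 1.x node to G1 would mean replacing the CMS flags set by the stock bin/elasticsearch.in.sh with G1 options:

# untested sketch: in bin/elasticsearch.in.sh, drop the CMS options
# (-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75
#  -XX:+UseCMSInitiatingOccupancyOnly) and use G1 instead:
JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC -XX:MaxGCPauseMillis=200"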

Once the restart finished, the problem was resolved.

When a problem appears, first check whether anyone has recently performed any operations on the cluster.

And most importantly, analyze the logs!

Reference:

引发Elasticsearch OOM之type (a Chinese write-up on document types causing Elasticsearch OOM)