Elasticsearch Rolling Restarts: A Must-Read

Date: 2024-10-14 22:37:33

Keywords: elasticsearch, es, rolling restart, disable shard allocation

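Because I had never really tuned GC for our ES cluster, a flood of scroll queries took it down today and the nodes hung in GC. To get service back for the business as quickly as possible, there was no real alternative to restarting the servers. I then made a dumb mistake during the restart: I forgot to disable shard allocation, so as soon as the nodes came back up the cluster started relocating shards. These notes record the steps for a partial or rolling restart of ES.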

Reference: https://www.elastic.co/guide/en/elasticsearch/guide/current/_rolling_restarts.html

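Typically a server has hung because of GC, load, or a similar problem, or a configuration change that requires a restart has been applied. In these cases you restart some of the servers or do a rolling restart of the whole cluster. In these restart scenarios ES does not actually lose data (meaning data that ES has already stored; for writes that are still in flight, the client has to handle retries itself).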

For the detailed steps, see the official documentation linked above:

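  1. If possible, stop indexing new data. This is not always possible, but will help speed up recovery time. (Addition: if you can, stop reads and writes and then use jmap to trigger a forced GC to see whether the hung process recovers, although stopping user reads and writes is usually not feasible. Also, if possible, run a flush before stopping the service; this is recommended but not required.)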
  2. Disable shard allocation. This prevents Elasticsearch from rebalancing missing shards until you tell it otherwise. If you know the maintenance window will be short, this is a good idea. You can disable allocation as follows:

    PUT /_cluster/settings
    {
      "transient" : {
        "cluster.routing.allocation.enable" : "none"
      }
    }
  3. Shut down a single node.
  4. Perform a maintenance/upgrade.
  5. Restart the node, and confirm that it joins the cluster.
  6. Reenable shard allocation as follows:

    PUT /_cluster/settings
    {
      "transient" : {
        "cluster.routing.allocation.enable" : "all"
      }
    }

    Shard rebalancing may take some time. Wait until the cluster has returned to status green before continuing.

  7. (Before repeating, check the cluster status to see whether the restart caused any problems.) Repeat steps 2 through 6 for the rest of your nodes.
  8. At this point you are safe to resume indexing (if you had previously stopped), but waiting until the cluster is fully balanced before resuming indexing will help to speed up the process.
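
Steps 2 through 6 are easy to get wrong under pressure (which is exactly how I forgot to disable allocation), so it can help to script them. The following is only a minimal sketch in Python using the standard library, not an official tool. The `ES_URL` address, the helper names, and the `restart_node` callback (which stands in for whatever you use to stop, maintain, and start a node and confirm it rejoined) are all assumptions for illustration. The only Elasticsearch endpoints used are `PUT /_cluster/settings`, as in the steps above, and `GET /_cluster/health`.

```python
import json
import time
import urllib.request

# Assumption: a node reachable at this address without authentication.
ES_URL = "http://localhost:9200"


def allocation_settings(mode):
    """Build the PUT /_cluster/settings body used in steps 2 and 6."""
    if mode not in ("all", "none"):
        raise ValueError("mode must be 'all' or 'none'")
    return {"transient": {"cluster.routing.allocation.enable": mode}}


def es_request(method, path, body=None, base_url=ES_URL):
    """Send one JSON request to Elasticsearch and return the decoded reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        base_url + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_green(request=es_request, poll_seconds=5.0, max_polls=120):
    """Poll GET /_cluster/health until status is green; False if we give up."""
    for _ in range(max_polls):
        health = request("GET", "/_cluster/health")
        if health.get("status") == "green":
            return True
        time.sleep(poll_seconds)
    return False


def restart_one_node(restart_node, request=es_request):
    """Run steps 2-6 for a single node.

    restart_node is a hypothetical callback supplied by the caller; it must
    cover steps 3-5 (shut down, maintain, restart, confirm the node rejoined).
    """
    request("PUT", "/_cluster/settings", allocation_settings("none"))  # step 2
    restart_node()                                                     # steps 3-5
    request("PUT", "/_cluster/settings", allocation_settings("all"))   # step 6
    return wait_for_green(request)
```

For a rolling restart you would call `restart_one_node` once per node and stop as soon as it returns `False`, which matches the advice in step 7 to check the cluster status before moving on. The `request` parameter exists so the flow can be dry-run against a stub before pointing it at a real cluster.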