背景:自己之前的项目里面使用了redis作为KV存储,不仅是因为性能,主要是需要用redis的hash数据结构。后来随着业务发展,读写压力越来越大,一开始的做法是读写分离,接着一主多从,发现还是不能很好的解决写redis的压力,又因为自己使用的redis版本比较低还不支持分布式的功能,所以自己想去部署一套分布式的redis存储系统,开始想到的做法是简单的做个hash,hashcode=hash(key/machine_num),接着将对应的key放在对应的机器,可是考虑到机器异常宕机,或者新增机器,数据迁移的代价都比较大,所以自己了解了一下一致性hash,准备用它实现一套分布式的KV存储系统。
主要思想:
一致性hash的经典介绍网上有很多,我感觉下面这篇文章介绍的不错。
http://blog.csdn.net/cywosp/article/details/23397179/
其实主要的思想我认为如下:
- 把机器按照某种hash算法(比如MD5)计算得到机器的hashcode值
- 对于存储的数据,根据数据的key,使用与机器相同的hash算法获取到相应的hashcode值,然后将key写入到顺时针最近的机器。
可以是hashcode(key) <= hashcode(machine)的机器 - 当有新机器加入时,只需要把新加入机器影响到的数据进行重新分配;当删除机器时,只需要把被删除机器的数据重新分配一下,这样可以减小数据的迁移代价。
- 为了维持平衡性,防止雪崩效应,使用虚拟节点代替真实机器,一个真实机器对应多个虚拟节点,这样可以保证数据的分布均衡
下面是Python的代码实现,实现思路就是上面提到的。
代码也是参考网上的:
http://www.cnblogs.com/xuxm2007/archive/2011/08/28/2156015.html
#! /usr/bin/env python
# -*- coding:utf8 -*-
import md5
class HashRing(object):
def __init__(self, nodes=None, replicas=3):
"""
Manages a hash ring
:nodes
a list of object that have a proper __str__ representation
:replicas
indicates how many virtual points should be used per node,
replicas are required to improve the distribution
"""
self.replicas = replicas
self.ring = {}
self._sorted_keys = []
if nodes:
for node in nodes:
self.add_node(node)
def add_node(self, node):
"""
Adds a node to the hash ring(including a number of replicas)
"""
for i in xrange(0, self.replicas):
key = self.gen_key("%s:%s" % (node, i))
self.ring[key] = node
self._sorted_keys.append(key)
self._sorted_keys.sort()
def remove_node(self, node):
"""
Removes node from the hash ring and its replicas
"""
for i in xrange(0, self.replicas):
key = self.gen_key("%s:%s" % (node, i))
del self.ring[key]
self._sorted_keys.remove(key)
def get_node_keys(self, node):
"""
Given a node return all its virturl node keys
"""
keys = []
for i in xrange(0, self.replicas):
key = self.gen_key("%s:%s" % (node, i))
keys.append(key)
return key
def get_node(self, string_key):
"""
Given a string key corresponding node in the hash ring is returned.
If the hash ring is empty, None is returned.
"""
return self.get_node_pos(string_key)[0]
def get_node_pos(self, string_key):
"""
Given a string key corresponding node in the hash ring is returned
along with it's position in the ring.
If the hash ring is empty, (None, None) is returned
"""
if not self.ring:
return None, None
key = self.gen_key(string_key)
nodes = self._sorted_keys
for i in xrange(0, len(nodes)):
node = nodes[i]
if key <= node:
return self.ring[node], i
return self.ring[nodes[0]], 0
def get_nodes(self, string_key):
"""
Given a string key it returns the nodes as a generator that can hold the key.
The generator is never ending and iterates through the ring starting at the correct position.
"""
if not self.ring:
yield None
node, pos = self.get_node_pos(string_key)
for key in self._sorted_keys:
yield self.ring[key]
def gen_key(self, key):
"""
Given a string key it return a long value,
this long value represent a place on the hash ring.
md5 is currently used because it mixes well.
"""
m = md5.new()
m.update(key)
return long(m.hexdigest(), 16)