在矩阵/位图中查找质量簇

This is in continuation with the question posted here: Finding the center of mass on a 2D bitmap which talked about finding the center of mass in a boolean matrix, as the example was given.

这是在这里发布的问题的继续:在2D位图上寻找质心,其中讨论了在布尔矩阵中找到质心,如给出的示例。

Suppose now we expand the matrix to this form:

假设现在我们将矩阵扩展为这种形式:

0 1 2 3 4 5 6 7 8 9
1 . X X . . . . . .
2 . X X X . . X . .
3 . . . . . X X X .
4 . . . . . . X . .
5 . X X . . . . . .
6 . X . . . . . . .
7 . X . . . . . . .
8 . . . . X X . . .
9 . . . . X X . . .

As you can see we now have 4 centers of mass, for 4 different clusters.

正如您所看到的,我们现在有4个质心,用于4个不同的簇。

We already know how to find a center of mass given that only one exists, if we run that algorithm on this matrix we'll get some point in the middle of the matrix which does not help us.

我们已经知道如何找到一个质心,因为只有一个存在,如果我们在这个矩阵上运行该算法,我们将得到矩阵中间的一些点,这对我们没有帮助。

What can be a good, correct and fast algorithm to find these clusters of mass?

什么是一个好的,正确的和快速的算法来找到这些质量集群?

4 个解决方案

#1

I think I would check each point in the matrix and figure out it's mass based on it's neighbours. The mass for points would fall with say the square of the distance. You could then pick the top four points with a minimum distance from each other.

我想我会检查矩阵中的每个点,并根据它的邻居计算它的质量。点的质量会随着距离的平方而下降。然后,您可以选择前四个点,彼此之间的距离最小。

Here's some Python code I whipped together to try to illustrate the approach for finding out the mass for each point. Some setup using your example matrix:

这里是一些Python代码,我一起试图说明找出每个点的质量的方法。使用示例矩阵进行一些设置:

matrix = [[1.0 if x == "X" else 0.0 for x in y] for y in """.XX......
.XXX..X..
.....XXX.
......X..
.XX......
.X.......
.X.......
....XX...
....XX...""".split("\n")]

HEIGHT = len(matrix)
WIDTH = len(matrix[0])
Y_RADIUS = HEIGHT / 2
X_RADIUS = WIDTH / 2

To calculate the mass for a given point:

要计算给定点的质量:

def distance(x1, y1, x2, y2):
  'Manhattan distance http://en.wikipedia.org/wiki/Manhattan_distance'
  return abs(y1 - y2) + abs(x1 - x2)

def mass(m, x, y):
  _mass = m[y][x]
  for _y in range(max(0, y - Y_RADIUS), min(HEIGHT, y + Y_RADIUS)):
    for _x in range(max(0, x - X_RADIUS), min(WIDTH, x + X_RADIUS)):
      d = max(1, distance(x, y, _x, _y))
      _mass += m[_y][_x] / (d * d)
  return _mass

Note: I'm using Manhattan distances (a k a Cityblock, a k a Taxicab Geometry) here because I don't think the added accuracy using Euclidian distances is worth the cost of calling sqrt().

注意:我在这里使用曼哈顿距离(k a Cityblock,k a Taxicab Geometry),因为我不认为使用Euclidian距离增加的准确性值得调用sqrt()的成本。

Iterating through our matrix and building up a list of tuples like (x, y, mass(x,y)):

迭代我们的矩阵并建立一个元组列表,如(x,y,mass(x,y)):

point_mass = []
for y in range(0, HEIGHT):
  for x in range(0, WIDTH):
    point_mass.append((x, y, mass(matrix, x, y)))

Sorting the list on the mass for each point:

对每个点的质量排序列表:

from operator import itemgetter
point_mass.sort(key=itemgetter(2), reverse=True)

Looking at the top 9 points in that sorted list:

查看排序列表中的前9个点:

(6, 2, 6.1580555555555554)
(2, 1, 5.4861111111111107)
(1, 1, 4.6736111111111107)
(1, 4, 4.5938888888888885)
(2, 0, 4.54)
(4, 7, 4.4480555555555554)
(1, 5, 4.4480555555555554)
(5, 7, 4.4059637188208614)
(4, 8, 4.3659637188208613)

If we would work from highest to lowest and filter away points that are too close to already seen points we'll get (I'm doing it manually since I've run out of time now to do it in code...):

如果我们将从最高到最低工作并过滤掉与已经看到的点太接近的点,我们将得到(我正在手动完成,因为我现在已经没时间在代码中执行它了...):

(6, 2, 6.1580555555555554)
(2, 1, 5.4861111111111107)
(1, 4, 4.5938888888888885)
(4, 7, 4.4480555555555554)

Which is a pretty intuitive result from just looking at your matrix (note that the coordinates are zero based when comparing with your example).

只需查看矩阵,这是非常直观的结果(请注意,与您的示例进行比较时,坐标为零)。

#2

You need a clustering algorithm, this is easy since you just have a 2 dimensional grid, and the entries are bordering each other. You can just use a floodfill algorithm. Once you have each cluster, you can find the center as in the 2D center of mass article..

您需要一个聚类算法,这很简单,因为您只有一个二维网格,并且条目相互接近。您可以使用Floodfill算法。拥有每个群集后,您可以在2D质心文章中找到该中心。

#3

Here's a similar question with a not so fast algorithm, and several other better ways to do it.

这是一个类似的问题,不是那么快的算法,还有其他几种更好的方法。

#4

My first thought would be to first find any cell with a non-zero value. From there do some flood-fill algorithm, and compute the center of mass of the cells found. Next zero out the found cells from the matrix, and start over from the top.

我的第一个想法是首先找到任何具有非零值的单元格。从那里做一些泛洪填充算法,并计算发现的细胞的质心。接下来将矩阵中找到的单元格清零,然后从顶部开始。

This would of course not scale as well as the method from Google, that tuinstoel linked, but would be easier to implement for smaller matrices.

这当然不会像谷歌那样与tuinstoel链接的方法一样扩展,但对于较小的矩阵更容易实现。

EDIT:

Disjoint sets (using path compression and union-by-rank) could be useful here. They have O(α(n)) time complexity for union and find-set, where

不相交的集合(使用路径压缩和按行级联合)在这里可能很有用。对于union和find-set,它们具有O(α(n))时间复杂度

α(n) = min { k : A_k(1) ≥ n }.

α(n)= min {k:Ak(1)≥n}。

A_k(n) is the Ackerman function, so α(n) will essentially be O(1) for any reasonable values. The only problem is that disjoint sets are a one-way mapping of item to set, but this won't matter if you are going trough all items.

Ak(n)是Ackerman函数,因此对于任何合理的值,α(n)基本上都是O(1)。唯一的问题是,不相交的集合是项目与集合的单向映射,但是如果您要通过所有项目,这将无关紧要。

Here is a simple python script for demonstration:

这是一个简单的python脚本用于演示:

from collections import defaultdict

class DisjointSets(object):
    def __init__(self):
        self.item_map = defaultdict(DisjointNode)

    def add(self,item):
        """Add item to the forest."""
        # It's gets initialized to a new node when
        # trying to access a non-existant item.
        return self.item_map[item]

    def __contains__(self,item):
        return (item in self.item_map)

    def __getitem__(self,item):
        if item not in self:
            raise KeyError
        return self.item_map[item]

    def __delitem__(self,item):
        del self.item_map[item]

    def __iter__(self):
        # sort all items into real sets
        all_sets = defaultdict(set)
        for item,node in self.item_map.iteritems():
            all_sets[node.find_set()].add(item)
        return all_sets.itervalues()

class DisjointNode(object):
    def __init__(self,parent=None,rank=0):
        if parent is None:
            self.parent = self
        else:
            self.parent = parent
        self.rank = rank

    def union(self,other):
        """Join two sets."""
        node1 = self.find_set()
        node2 = other.find_set()
        # union by rank
        if node1.rank > node2.rank:
            node2.parent = node1
        else:
            node1.parent = node2
            if node1.rank == node2.rank:
                node2.rank += 1
        return node1

    def find_set(self):
        """Finds the root node of this set."""
        node = self
        while node is not node.parent:
            node = node.parent
        # path compression
        root, node = node, self
        while node is not node.parent:
            node, node.parent = node.parent, root
        return root

def find_clusters(grid):
    disj = DisjointSets()
    for y,row in enumerate(grid):
        for x,cell in enumerate(row):
            if cell:
                node = disj.add((x,y))
                for dx,dy in ((-1,0),(-1,-1),(0,-1),(1,-1)):
                    if (x+dx,y+dy) in disj:
                        node.union(disj[x+dx,y+dy])
    for index,set_ in enumerate(disj):
        sum_x, sum_y, count = 0, 0, 0
        for x,y in set_:
            sum_x += x
            sum_y += y
            count += 1
        yield 1.0 * sum_x / count, 1.0 * sum_y / count

def main():
    grid = [[('.' != cell) for cell in row if not cell.isspace()] for row in (
        ". X X . . . . . .",
        ". X X X . . X . .",
        ". . . . . X X X .",
        ". . . . . . X . .",
        ". X X . . . . . .",
        ". X . . . . . . .",
        ". X . . . . . . .",
        ". . . . X X . . .",
        ". . . . X X . . .",
    )]
    coordinates = list(find_clusters(grid))
    centers = dict(((round(x),round(y)),i) for i,(x,y) in enumerate(coordinates))
    for y,row in enumerate(grid):
        for x,cell in enumerate(row):
            if (x,y) in centers:
                print centers[x,y]+1,
            elif cell:
                print 'X',
            else:
                print '.',
        print
    print
    print '%4s | %7s %7s' % ('i','x','y')
    print '-'*22
    for i,(x,y) in enumerate(coordinates):
        print '%4d | %7.4f %7.4f' % (i+1,x,y)

if __name__ == '__main__':
    main()

Output:

. X X . . . . . .
. X 3 X . . X . .
. . . . . X 4 X .
. . . . . . X . .
. X X . . . . . .
. 2 . . . . . . .
. X . . . . . . .
. . . . X X . . .
. . . . X 1 . . .

   i |       x       y
----------------------
   1 |  4.5000  7.5000
   2 |  1.2500  4.7500
   3 |  1.8000  0.6000
   4 |  6.0000  2.0000

The point of this was to demonstrate disjoint sets. The actual algorithm in find_clusters() could be upgraded to something more robust.

这一点的目的是证明不相交的集合。 find_clusters()中的实际算法可以升级到更健壮的东西。

References

Introduction to algorithms. 2nd ed. Cormen et.al.

算法简介。第2版。 Cormen et.al.

#1