I have a very large graph represented in a text file of size about 1TB with each edge as follows.
From-node to-node
I would like to split it into its weakly connected components. If it was smaller I could load it into networkx and run their component finding algorithms. For example http://networkx.github.io/documentation/latest/reference/generated/networkx.algorithms.components.connected.connected_components.html#networkx.algorithms.components.connected.connected_components
Is there any way to do this without loading the whole thing into memory?
3 Solutions
#1
1
If even the number of nodes is too large to fit in memory, you can divide and conquer and use external memory sorts to do most of the work for you (e.g. the sort command included with Windows and Unix can sort files much larger than memory):

1. Choose some threshold vertex k.
2. Read the original file and write each of its edges to one of 3 files:
   - To a if its maximum-numbered vertex is < k
   - To b if its minimum-numbered vertex is >= k
   - To c otherwise (i.e. if it has one vertex < k and one vertex >= k)
3. If a is small enough to solve (find connected components for) in memory (using e.g. Peter de Rivaz's algorithm) then do so, otherwise recurse to solve it. The solution should be a file whose lines each consist of two numbers x y and which is sorted by x. Each x is a vertex number and y is its representative: the lowest-numbered vertex in the same component as x.
4. Do likewise for b.
5. Sort edges in c by their smallest-numbered endpoint.
6. Go through each edge in c, renaming the endpoint that is < k (remember, there must be exactly one such endpoint) to its representative, found from the solution to the subproblem a. This can be done efficiently by using a linear-time merge algorithm to merge with the solution to the subproblem a. Call the resulting file d.
7. Sort edges in d by their largest-numbered endpoint. (The fact that we have already renamed the smallest-numbered endpoint doesn't make this unsafe, since renaming can never increase a vertex's number.)
8. Go through each edge in d, renaming the endpoint that is >= k to its representative, found from the solution to the subproblem b using a linear-time merge as before. Call the resulting file e.
9. Solve e. (As with a and b, do this directly in memory if possible, otherwise recurse. If you need to recurse, you will need to find a different way of partitioning the edges, since all the edges in e already "straddle" k. You could for example renumber vertices using a random permutation of vertex numbers, recurse to solve the resulting problem, then rename them back.) This step is necessary because there could be an edge (1, k), another edge (2, k+1) and a third edge (2, k), and this will mean that all vertices in the components 1, 2, k and k+1 need to be combined into a single component.
10. Go through each line in the solution for subproblem a, updating the representative for this vertex using the solution to subproblem e if necessary. This can be done efficiently using a linear-time merge. Write out the new list of representatives (which will already be sorted by vertex number due to the fact that we created it from a's solution) to a file f.
11. Do likewise for each line in the solution for subproblem b, creating file g.
12. Concatenate f and g to produce the final answer. (For better efficiency, just have step 11 append its results directly to f.)
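The partitioning pass (choosing k and writing each edge to a, b or c) could look roughly like the following. This is only a sketch, assuming vertices are non-negative integers stored as whitespace-separated pairs; the function name `partition_edges` and the output file names are made up for illustration. It also writes the smaller endpoint first, which is harmless for weakly connected components because edge direction is ignored.

```python
import os
import tempfile


def partition_edges(edge_path, k, out_dir):
    """Split edges into a.txt (both endpoints < k), b.txt (both >= k), c.txt (straddling k).

    Returns the number of edges written to each file.
    """
    names = {key: os.path.join(out_dir, key + ".txt") for key in "abc"}
    counts = dict.fromkeys("abc", 0)
    files = {key: open(path, "w") for key, path in names.items()}
    try:
        with open(edge_path) as src:
            for line in src:
                if not line.strip():
                    continue  # skip blank lines
                u, v = map(int, line.split())
                lo, hi = min(u, v), max(u, v)
                if hi < k:
                    key = "a"
                elif lo >= k:
                    key = "b"
                else:
                    key = "c"
                # Smaller endpoint goes first, so column 1 is always the minimum.
                files[key].write(f"{lo} {hi}\n")
                counts[key] += 1
    finally:
        for f in files.values():
            f.close()
    return counts
```

Only one line of the input is held in memory at a time, so the pass works on files far larger than RAM.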
All the linear-time merge operations used above can read directly from disk files, since they only ever access items from each list in increasing order (i.e. no slow random access is needed).
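As an illustration of such a merge, here is a minimal sketch of the renaming step. It assumes both inputs are already sorted numerically (not lexicographically) on the relevant vertex, and that a vertex missing from the representative list is its own representative; `rename_endpoint` is a hypothetical helper name, not part of any library.

```python
def rename_endpoint(sorted_edges, sorted_reps, column):
    """Yield edges with the endpoint in `column` replaced by its representative.

    sorted_edges: iterable of (u, v) tuples sorted by the endpoint in `column`.
    sorted_reps:  iterable of (vertex, representative) pairs sorted by vertex.
    Vertices absent from sorted_reps are left unchanged.
    """
    reps = iter(sorted_reps)
    current = next(reps, None)
    for edge in sorted_edges:
        target = edge[column]
        # Advance the representative stream until it catches up with this edge.
        while current is not None and current[0] < target:
            current = next(reps, None)
        if current is not None and current[0] == target:
            edge = list(edge)
            edge[column] = current[1]
        yield tuple(edge)


# Example: solution of a says 2 -> 1 and 4 -> 3; edges are sorted by column 0.
edges = [(2, 6), (4, 7), (4, 9)]
reps = [(1, 1), (2, 1), (3, 3), (4, 3)]
print(list(rename_endpoint(edges, reps, 0)))  # [(1, 6), (3, 7), (3, 9)]
```

Both arguments can be generators that parse lines from disk files, so neither list needs to be loaded in full. The same function would handle the second renaming pass by passing `column=1` with the edges sorted by their larger endpoint.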
#2
10
If you have few enough nodes (e.g. a few hundred million), then you could compute the connected components with a single pass through the text file by using a disjoint set forest stored in memory.
This data structure only stores the rank and parent pointer for each node so should fit in memory if you have few enough nodes.
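To put a rough number on that, here is a small sketch using Python's standard `array` module and a `bytearray` instead of plain lists. It assumes node IDs are integers from 0 to N-1 that fit in an unsigned int; the helper names are illustrative only.

```python
from array import array


def make_forest(n):
    """Return (parent, rank) stored compactly: one unsigned int plus one byte per node."""
    parent = array("I", range(n))  # typically 4 bytes per node
    rank = bytearray(n)            # 1 byte per node; ranks stay below log2(n)
    return parent, rank


def estimated_bytes(n):
    """Approximate payload size of the two tables for n nodes."""
    return n * (array("I").itemsize + 1)


# On a platform where an unsigned int is 4 bytes, 300 million nodes
# need about 1.5 GB for both tables.
print(estimated_bytes(300_000_000))
```

A plain Python list of integers would use several times more memory per node, so the compact containers matter at this scale.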
For a larger number of nodes, you could try the same idea but store the data structure on disk (possibly improved by using an in-memory cache for frequently used items).
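One possible way to try the on-disk variant is to keep the parent table in a memory-mapped file. The sketch below is an assumption-laden illustration, not a tuned implementation: the class name `DiskParents` is invented, the rank table is dropped for brevity, and the operating system's page cache is relied on as the in-memory cache. Random access will still be slow once the table is much larger than RAM.

```python
import mmap
import os
import struct
import tempfile

ITEM = struct.calcsize("I")  # size of a native unsigned int, typically 4 bytes


class DiskParents:
    """Parent table of a disjoint-set forest kept in a memory-mapped file."""

    def __init__(self, path, n):
        with open(path, "wb") as f:
            f.truncate(n * ITEM)           # create a file large enough for n entries
        self._file = open(path, "r+b")
        self._map = mmap.mmap(self._file.fileno(), 0)
        self.parent = memoryview(self._map).cast("I")
        for i in range(n):                 # every node starts as its own root
            self.parent[i] = i

    def find(self, x):
        parent = self.parent
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(self, x, y):
        x, y = self.find(x), self.find(y)
        if x != y:
            self.parent[y] = x             # simple linking, no rank table

    def close(self):
        self.parent.release()              # release the view before closing the map
        self._map.close()
        self._file.close()
```

Usage mirrors the in-memory version: create `DiskParents(path, N)`, call `union(fr, to)` for each edge while streaming the file, then call `find(n)` for each node and `close()` at the end.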
Here is some Python code that implements a simple in-memory version of disjoint set forests:
N = 7  # Number of nodes
rank = [0] * N
parent = list(range(N))  # list() is needed on Python 3, where range() is immutable


def Find(x):
    """Find representative of connected component"""
    if parent[x] != x:
        parent[x] = Find(parent[x])  # path compression
    return parent[x]


def Union(x, y):
    """Merge sets containing elements x and y"""
    x = Find(x)
    y = Find(y)
    if x == y:
        return
    # Union by rank: attach the shallower tree under the deeper one.
    if rank[x] < rank[y]:
        parent[x] = y
    elif rank[x] > rank[y]:
        parent[y] = x
    else:
        parent[y] = x
        rank[x] += 1


# Stream the edge list one line at a time.
with open("disjointset.txt", "r") as fd:
    for line in fd:
        fr, to = map(int, line.split())
        Union(fr, to)

for n in range(N):
    print(n, 'is in component', Find(n))
If you apply it to the text file called disjointset.txt containing:
1 2
3 4
4 5
0 5
it prints
0 is in component 3
1 is in component 1
2 is in component 1
3 is in component 3
4 is in component 3
5 is in component 3
6 is in component 6
You could save memory by not using the rank array, at the cost of potentially increased computation time.
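A rank-free variant might look like the sketch below. Instead of union by rank it always links the larger root under the smaller one, which is a substitution of my own for illustration; a side effect is that each component's representative is its lowest-numbered node. The function name `components_without_rank` is hypothetical.

```python
def components_without_rank(n, edges):
    """Return a list mapping each node to the lowest-numbered node in its component."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            # Always keep the smaller root, so each root is its component's minimum.
            parent[max(ru, rv)] = min(ru, rv)
    return [find(x) for x in range(n)]


print(components_without_rank(7, [(1, 2), (3, 4), (4, 5), (0, 5)]))
# [0, 1, 1, 0, 0, 0, 6]
```

Without ranks the trees can become deeper before path halving flattens them, which is where the extra computation time can come from.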
#3
1
External memory graph traversal is tricky to get performant. I advise against writing your own code: implementation details make the difference between a runtime of a few hours and a runtime of a few months. You should consider using an existing library such as STXXL. See here for a paper using it to compute connected components.