I have a DAG that looks like this: Example DAG
I want to extract all the paths constituted by 4 nodes in this graph.
My expected result should look like this:
N1 -> N2 -> N3 -> N4
N1 -> N2 -> N3 -> N5
N1 -> N3 -> N4 -> N5
N2 -> N3 -> N4 -> N5
My current attempt looks like this:
def path_finder(n1):
    paths = []
    if DAG.has_node(n1):
        for n2 in DAG.successors(n1):
            for n3 in DAG.successors(n2):
                for n4 in DAG.successors(n3):
                    paths.append([n1, n2, n3, n4])
    return paths
I'm calling this function for each node. DAG is a global variable, more specifically a networkx object (DAG = networkx.DiGraph()). This naive function is pathetically slow. Is there a more efficient strategy to do this?
I have looked at question 20262712, but it was self-solved by the author of the question in a rather obscure way.
Thanks
UPDATE:
Since I couldn't find any satisfactory algorithm to solve this, I ended up parallelizing the job, using my naive function as a worker and dumping all the data into a queue. I used pool.imap_unordered to launch the worker function and aggregated the results from the queue. It is still slow (a couple of hours for 5M nodes). I should also provide the average node degree of the graphs I'm dealing with, because that affects how fast the workers run, but I'll leave that out for now.
3 Answers
#1
1
Here is a function that returns all paths of a given length between all pairs of nodes in the graph. It iterates over every source/target pair and uses networkx.all_simple_paths to get the paths.
import networkx as nx

g = nx.DiGraph()
g.add_nodes_from(['A','B','C','D','E'])
# note: in networkx >= 2.0, g.add_path(...) was replaced by nx.add_path(g, ...)
nx.add_path(g, ['A','B','C','D'])
nx.add_path(g, ['A','B','C','E'])
nx.add_path(g, ['A','C','D','E'])
nx.add_path(g, ['B','C','D','E'])
def find_paths(graph, number_nodes=4):
    paths = []
    # nodes_iter() was networkx 1.x; in 2.x and later use nodes()
    for source in graph.nodes():
        for target in graph.nodes():
            if source != target:
                p_source_target = nx.all_simple_paths(graph,
                                                      source,
                                                      target,
                                                      cutoff=number_nodes-1)
                paths.extend(p for p in p_source_target if len(p) == number_nodes)
    return paths
find_paths(g)
# output:
[['B', 'C', 'D', 'E'],
['A', 'C', 'D', 'E'],
['A', 'B', 'C', 'E'],
['A', 'B', 'C', 'D']]
#2
0
Part of your problem could be that if you come across a node u as the second node in a path, you do all the calculations needed to find the length-3 paths from it. But if you then come across u as the second node again, you repeat all of those calculations.
So try to avoid this. We'll do it recursively, calculating all the length-3 paths first (which requires calculating the length-2 paths):
def get_paths(G, n):
    '''returns a dict, paths, such that paths[u] is a list of all paths
    of length n that start from u'''
    if n == 1:
        # base case: paths[u] is the list of all length-1 paths starting
        # from u -- a boring list.
        return {u: [[u]] for u in G.nodes()}
    # if we get here, n > 1 (unless the input was bad)
    subpath_dict = get_paths(G, n - 1)  # all length n-1 paths, indexed by first node
    path_dict = {}
    for u in G:
        path_dict[u] = []
        for v in G.successors(u):
            path_dict[u].extend([u] + subpath for subpath in subpath_dict[v])
    return path_dict
G = nx.DiGraph()
nx.add_path(G, [1, 2, 3, 4, 5, 6])  # G.add_path(...) in networkx 1.x
nx.add_path(G, [1, 3, 6, 8, 10])

path_dict = get_paths(G, 4)
path_list = []
for paths in path_dict.values():
    path_list.extend(paths)
#3
0
The number of sequences is of order |V|*d^3, where d is the average node out-degree. From how the graph is created, d is bounded. I suppose that d is not very small (smaller than, say, 5); that means for a 5M-node graph there are more than 1G paths.
Since finding one path is fast (the paths are short), it is not certain that a DP-like algorithm can help. DP-style algorithms try to take advantage of partially calculated data, so there is an overhead of storing and retrieving that data, and it may be larger than the cost of simply recalculating the needed partial results.
One idea is an algorithm that traverses the DAG in reverse topological order and does two things:
- for each node, keep all length-3 paths that start at that node;
- using the successors' length-3 paths, output all length-4 paths.
This method can use a lot of memory, but some of it can be freed for nodes that are no longer successors of any node on the traversal boundary.
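A minimal sketch of that reverse-topological idea, on a made-up toy graph (not the asker's data): since nx.topological_sort yields sources first, iterating it in reverse guarantees every successor's path table is already filled before its predecessors consume it.

```python
import networkx as nx

# toy DAG, chosen only for illustration
DAG = nx.DiGraph([(1, 2), (2, 3), (3, 4), (4, 5), (1, 3)])

paths2 = {}      # paths2[u]: all 2-node paths starting at u
paths3 = {}      # paths3[u]: all 3-node paths starting at u
four_node = []   # all 4-node paths in the graph

# reversed topological order: successors are processed before predecessors
for u in reversed(list(nx.topological_sort(DAG))):
    paths2[u] = [[u, v] for v in DAG.successors(u)]
    paths3[u] = [[u] + p for v in DAG.successors(u) for p in paths2[v]]
    four_node.extend([u] + p for v in DAG.successors(u) for p in paths3[v])

print(sorted(four_node))
# -> [[1, 2, 3, 4], [1, 3, 4, 5], [2, 3, 4, 5]]
```

To keep memory bounded on a huge graph, paths3[u] could be dropped (or written out) once all predecessors of u have been processed.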
The other idea is simply to optimize the naive algorithm. In your solution there are three for loops for each node, which means four nested loops over all paths, each looping through nodes. It is possible to fuse the first two loops by iterating over edges, since each path has to start with an edge:
for n1, n2 in DAG.edges():
    for n3 in DAG.successors(n2):
        for n4 in DAG.successors(n3):
            paths.append([n1, n2, n3, n4])
Or, even simpler, by first selecting the middle edge:
for n2, n3 in DAG.edges():
    for n1, n4 in itertools.product(DAG.predecessors(n2), DAG.successors(n3)):
        paths.append([n1, n2, n3, n4])
The outer loop can be optimized by skipping middle edges that start at a source node or end at a sink node, but product() detects those cases quickly anyway (a product with an empty sequence is empty). This optimization might still help by not sending unneeded data to the other processes.
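Putting the middle-edge version together as a self-contained snippet (again on a made-up toy graph; imports and the paths list were implicit above), one can check it finds the same 4-node paths as the nested-loop version:

```python
import itertools
import networkx as nx

# toy DAG for illustration
DAG = nx.DiGraph([(1, 2), (2, 3), (3, 4), (4, 5), (1, 3)])

paths = []
for n2, n3 in DAG.edges():
    # every 4-node path n1 -> n2 -> n3 -> n4 is determined by its middle
    # edge (n2, n3) plus a predecessor of n2 and a successor of n3
    for n1, n4 in itertools.product(DAG.predecessors(n2), DAG.successors(n3)):
        paths.append([n1, n2, n3, n4])

print(sorted(paths))
# -> [[1, 2, 3, 4], [1, 3, 4, 5], [2, 3, 4, 5]]
```

Note that node repetition cannot occur here because the graph is acyclic; on a general digraph this enumeration could emit walks like n1 == n4.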