I need to be able to manipulate a large (10^7 nodes) graph in python. The data corresponding to each node/edge is minimal, say, a small number of strings. What is the most efficient, in terms of memory and speed, way of doing this?
A dict of dicts is more flexible and simpler to implement, but I intuitively expect a list of lists to be faster. The list option would also require that I keep the data separate from the structure, while dicts would allow for something of the sort:
graph[I][J]["Property"] = "value"
What would you suggest?
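For concreteness, here is a minimal sketch of the two representations being weighed; the names are illustrative, not from any library:

```python
# dict of dicts: node labels as keys, per-edge attribute dicts stored inline
graph = {}
graph.setdefault("a", {})["b"] = {"Property": "value"}
graph.setdefault("b", {})  # make sure the target node exists too

# list of lists: integer node ids; structure and data kept separate,
# with edge attributes in a parallel mapping keyed by (u, v)
n = 3
adj = [[] for _ in range(n)]   # adjacency structure only
edge_data = {}                 # parallel store for edge attributes
adj[0].append(1)
edge_data[(0, 1)] = {"Property": "value"}
```

The dict version keeps structure and data together, at the cost of a hash lookup per hop; the list version indexes by integer but forces the bookkeeping split the question describes.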
Yes, I should have been a bit clearer on what I mean by efficiency. In this particular case I mean it in terms of random access retrieval.
Loading the data into memory isn't a huge problem. That's done once and for all. The time-consuming part is visiting the nodes so I can extract the information and measure the metrics I'm interested in.
I hadn't considered making each node a class (properties are the same for all nodes) but it seems like that would add an extra layer of overhead? I was hoping someone would have some direct experience with a similar case that they could share. After all, graphs are one of the most common abstractions in CS.
7 Answers
#1
51
I would strongly advocate you look at NetworkX. It's a battle-tested workhorse and the first tool most 'research' types reach for when they need to analyze network-based data. I have manipulated graphs with hundreds of thousands of edges without problem on a notebook. It's feature-rich and very easy to use. You will find yourself focusing more on the problem at hand rather than on the details of the underlying implementation.
Example of Erdős-Rényi random graph generation and analysis
"""
Create an G{n,m} random graph with n nodes and m edges
and report some properties.
This graph is sometimes called the Erd##[m~Qs-Rényi graph
but is different from G{n,p} or binomial_graph which is also
sometimes called the Erd##[m~Qs-Rényi graph.
"""
__author__ = """Aric Hagberg (hagberg@lanl.gov)"""
__credits__ = """"""
# Copyright (C) 2004-2006 by
# Aric Hagberg
# Dan Schult
# Pieter Swart
# Distributed under the terms of the GNU Lesser General Public License
# http://www.gnu.org/copyleft/lesser.html
from networkx import *
import sys
n=10 # 10 nodes
m=20 # 20 edges
G=gnm_random_graph(n,m)
# some properties
print "node degree clustering"
for v in nodes(G):
print v,degree(G,v),clustering(G,v)
# print the adjacency list to terminal
write_adjlist(G,sys.stdout)
Visualizations are also straightforward:
More visualization: http://jonschull.blogspot.com/2008/08/graph-visualization.html
#2
12
Even though this question is now quite old, I think it is worthwhile to mention my own Python module for graph manipulation, called graph-tool. It is very efficient, since the data structures and algorithms are implemented in C++ with template metaprogramming, using the Boost Graph Library. Therefore its performance (both in memory usage and runtime) is comparable to a pure C++ library, and can be orders of magnitude better than typical Python code, without sacrificing ease of use. I use it myself constantly to work with very large graphs.
#3
6
As already mentioned, NetworkX is very good, with another option being igraph. Both modules will have most (if not all) the analysis tools you're likely to need, and both libraries are routinely used with large networks.
#4
4
A dictionary may also carry overhead, depending on the actual implementation. A hashtable usually contains some prime number of available slots to begin with, even though you might only use a couple of them.
Judging by your example, "Property", would you be better off with a class approach for the final level, with real properties? Or are the names of the properties changing a lot from node to node?
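Since the questioner says the property names are the same for all nodes, one cheap way to sketch that "class for the final level" idea is a namedtuple, which stores values in a tuple rather than a per-instance dict; the field names below are made up for illustration:

```python
from collections import namedtuple

# Fixed, shared property names -> tuple-based storage instead of
# one attribute dict per node.
NodeProps = namedtuple("NodeProps", ["label", "color"])

props = {
    0: NodeProps(label="gene", color="red"),
    1: NodeProps(label="protein", color="blue"),
}
```

Access stays readable (props[0].label) while each node's properties cost roughly a tuple rather than a full dict.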
I'd say that what "efficient" means depends on a lot of things, like:
- speed of updates (insert, update, delete)
- speed of random access retrieval
- speed of sequential retrieval
- memory used
I think you'll find that a data structure that is speedy will generally consume more memory than one that is slow. This isn't always the case, but most data structures seem to follow this.
A dictionary might be easy to use and give you relatively uniform fast access, but it will most likely use more memory than, as you suggest, lists. Lists, however, generally tend to incur more overhead when you insert data into them, unless they preallocate X nodes, in which case they will again use more memory.
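A quick, rough way to check the memory claim on your own machine is sys.getsizeof, which reports the container's own overhead (not the objects it references):

```python
import sys

n = 1000
as_list = list(range(n))          # flat array of references
as_dict = {i: i for i in range(n)}  # hash table keyed by the same ints

# The dict's hash table typically costs several times the list's array.
print(sys.getsizeof(as_list), sys.getsizeof(as_dict))
```

Exact numbers vary by Python version and platform, but the dict is reliably the larger of the two.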
My suggestion, in general, would be to just use the method that seems the most natural to you, and then do a "stress test" of the system, adding a substantial amount of data to it and see if it becomes a problem.
You might also consider adding a layer of abstraction to your system, so that you don't have to change the programming interface if you later need to change the internal data structure.
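Such an abstraction layer can be as thin as a small wrapper class; callers only use its methods, so the internal representation (a dict of dicts in this sketch) can be swapped later without touching them. This is only one possible shape for it:

```python
class Graph:
    """Thin facade over the storage; the representation behind _adj
    can change without affecting callers."""

    def __init__(self):
        self._adj = {}  # node -> {neighbor: {attribute: value}}

    def add_edge(self, u, v, **attrs):
        self._adj.setdefault(u, {})[v] = dict(attrs)
        self._adj.setdefault(v, {})  # keep the endpoint queryable

    def neighbors(self, u):
        return list(self._adj[u])

    def edge(self, u, v):
        return self._adj[u][v]


g = Graph()
g.add_edge("a", "b", weight=3)
```

If profiling later shows the dict of dicts is too heavy, _adj can become an array-based structure while add_edge, neighbors, and edge keep their signatures.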
#5
3
As I understand it, random access is in constant time for both Python's dicts and lists; the difference is that with lists you can only do random access by integer index. Since I'm assuming you need to look up a node by its label, you want a dict of dicts.
However, on the performance front, loading it into memory may not be a problem, but if you use too much you'll end up swapping to disk, which will kill the performance of even Python's highly efficient dicts. Try to keep memory usage down as much as possible. Also, RAM is amazingly cheap right now; if you do this kind of thing a lot, there's no reason not to have at least 4GB.
If you'd like advice on keeping memory usage down, give some more information about the kind of information you're tracking for each node.
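One memory trick that fits this case (millions of nodes each carrying a few short strings): sys.intern makes repeated equal strings share a single object instead of storing millions of copies. A small sketch, with made-up labels:

```python
import sys

# Simulated per-node data: many repeats of a few distinct type strings.
raw = [("node_%d" % i, "TypeA" if i % 2 else "TypeB") for i in range(5)]

# Intern the repeated values so equal strings are one shared object;
# for 10^7 nodes this can save a large fraction of the string storage.
kinds = {name: sys.intern(kind) for name, kind in raw}
```

Interning pays off when the set of distinct strings is small relative to the number of nodes, which matches the "small number of strings per node" setup in the question.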
#6
2
Making a class-based structure would probably have more overhead than the dict-based structure, since Python classes actually use dicts themselves: by default, each instance's attributes are stored in a per-instance dict.
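That per-instance dict is easy to observe, and it is also the overhead that __slots__ exists to remove, which is worth knowing if the class route is taken after all:

```python
class NodeDict:
    """Default class: attributes live in a per-instance __dict__."""
    def __init__(self, label):
        self.label = label


class NodeSlots:
    """Fixed attribute set declared up front: no per-instance dict."""
    __slots__ = ("label",)

    def __init__(self, label):
        self.label = label


a = NodeDict("x")
b = NodeSlots("x")
```

With __slots__, attribute storage becomes fixed offsets in the instance, so a class-per-node design need not pay a dict per node.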
#7
1
No doubt NetworkX is the best-established graph library for Python so far. It comes with utilities such as helper functions, data structures and algorithms, random sequence generators, decorators, Cuthill-McKee ordering, and context managers.
NetworkX is great because it works for graphs, digraphs, and multigraphs. It can write graphs in multiple formats: Adjacency List, Multiline Adjacency List, Edge List, GEXF, GML. It also works with Pickle, GraphML, JSON, SparseGraph6, etc.
It has implementations of various ready-made algorithms, including: Approximation, Bipartite, Boundary, Centrality, Clique, Clustering, Coloring, Components, Connectivity, Cycles, Directed Acyclic Graphs, Distance Measures, Dominating Sets, Eulerian, Isomorphism, Link Analysis, Link Prediction, Matching, Minimum Spanning Tree, Rich Club, Shortest Paths, Traversal, Tree.