I need to analyze 1 TB+ of web access logs, and in particular I need to analyze statistics relating to requested URLs and subsets of the URLs (child branches). If possible, I want the queries to be fast over small subsets of the data (e.g. 10 million requests).
For example, given an access log with the following URLs being requested:
/ocp/about_us.html
/ocp/security/ed-209/patches/urgent.html
/ocp/security/rc/
/ocp/food/
/weyland-yutani/products/
I want to do queries such as:
- Count the number of requests for everything 'below' /ocp.
- Same as above, but only count requests for child nodes under /ocp/security.
- Return the top 5 most frequently requested URLs.
- Same as above, except grouped by an arbitrary depth.
For example, for the last query above, grouping the sample data at depth 2 would return:
2: /ocp/security/
1: /ocp/
1: /ocp/food/
1: /weyland-yutani/products/
I think the ideal approach would probably be to use a column DB and tokenize the URLs such that there is a column for each element in the URL. However, I would really like to find a way to do this with open source apps if possible. HBase is a possibility, but query performance seems too slow to be useful for real-time queries (also, I don't really want to be in the business of re-implementing SQL).
I'm aware there are commercial apps for doing this type of analytics, but for various reasons I want to implement this myself.
5 solutions
#1
13
Before investing too much time in designing a hierarchical data structure on top of a relational database, consider reading the "Naive Trees" section (starting at slide 48) of the excellent presentation SQL Anti-Patterns Strike Back by Bill Karwin. Bill outlines the following methods for modeling a hierarchy:
- Path enumeration (slide 55)
- Nested sets (slide 58)
- Closure table (slide 68)
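To give a feel for the last of these, here is a minimal closure table sketch using Python's built-in sqlite3 module. It is not taken from the slides: the table names (node, closure, hit) and the helper functions are made up for illustration, and it assumes one hit row per logged request.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, path TEXT UNIQUE);
    -- One row per (ancestor, descendant) pair; every node is its own ancestor.
    CREATE TABLE closure (ancestor INTEGER, descendant INTEGER,
                          PRIMARY KEY (ancestor, descendant));
    CREATE TABLE hit (node_id INTEGER);  -- assumed: one row per logged request
""")

def node_id(segments):
    """Return the id of the node for these path segments, creating it if needed."""
    path = "/" + "/".join(segments)
    row = conn.execute("SELECT id FROM node WHERE path = ?", (path,)).fetchone()
    if row:
        return row[0]
    new_id = conn.execute("INSERT INTO node (path) VALUES (?)", (path,)).lastrowid
    conn.execute("INSERT INTO closure VALUES (?, ?)", (new_id, new_id))
    if len(segments) > 1:
        parent_id = node_id(segments[:-1])
        # Every ancestor of the parent is also an ancestor of the new node.
        conn.execute(
            "INSERT INTO closure SELECT ancestor, ? FROM closure WHERE descendant = ?",
            (new_id, parent_id),
        )
    return new_id

def record_hit(url):
    """Log one request against the node for the full URL path."""
    segments = [s for s in url.split("/") if s]
    conn.execute("INSERT INTO hit VALUES (?)", (node_id(segments),))

def count_below(path):
    """Count requests for the node at path and everything beneath it."""
    sql = """
        SELECT COUNT(*)
        FROM hit h
        JOIN closure c ON c.descendant = h.node_id
        JOIN node n ON n.id = c.ancestor
        WHERE n.path = ?
    """
    return conn.execute(sql, (path,)).fetchone()[0]

for url in [
    "/ocp/about_us.html",
    "/ocp/security/ed-209/patches/urgent.html",
    "/ocp/security/rc/",
    "/ocp/food/",
    "/weyland-yutani/products/",
]:
    record_hit(url)

print(count_below("/ocp"), count_below("/ocp/security"))
```

The cost is extra rows (one per ancestor of every node), but a subtree count becomes a single join with no recursion.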
#2
3
Trees are generally not very efficient in databases. I mean: if you design the tree to be truly recursive, with items pointing to their parents, you'll need lots of queries to find all sub-nodes.
But you can optimize the tree, according to your needs.
Putting each part of the URL into its own column is not a bad idea. You need to limit the depth to a fixed number of sub-nodes. You can index any of these columns, which makes lookups very fast.
Queries on such a structure are very simple:
SELECT count(*) FROM hits WHERE node1 = 'ocp' AND node2 = 'security';
To produce an access statistic:
SELECT node1, node2, count(*) as "number of hits"
FROM hits
GROUP BY node1, node2
ORDER BY count(*) DESC
You'll get:
node1             node2        number of hits
'ocp'                          23345
'ocp'             'security'   1020
'ocp'             'food'       234
'weyland-yutani'  'products'   22
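The column-per-segment layout above can be tried out with a small sketch like the following, which uses Python's built-in sqlite3 module. The MAX_DEPTH limit, the tokenize helper, and the index name are assumptions made for illustration; only the hits table and its node1/node2 columns come from the queries above.

```python
import sqlite3

MAX_DEPTH = 5  # assumed limit; segments beyond this depth are dropped

def tokenize(url):
    """Split a URL path into exactly MAX_DEPTH values, padding with None."""
    parts = [p for p in url.split("/") if p][:MAX_DEPTH]
    return parts + [None] * (MAX_DEPTH - len(parts))

conn = sqlite3.connect(":memory:")
columns = ", ".join(f"node{i} TEXT" for i in range(1, MAX_DEPTH + 1))
conn.execute(f"CREATE TABLE hits ({columns})")
# A composite index lets prefix filters on (node1, node2) avoid a full scan.
conn.execute("CREATE INDEX hits_prefix ON hits (node1, node2)")

urls = [
    "/ocp/about_us.html",
    "/ocp/security/ed-209/patches/urgent.html",
    "/ocp/security/rc/",
    "/ocp/food/",
    "/weyland-yutani/products/",
]
placeholders = ", ".join("?" * MAX_DEPTH)
conn.executemany(
    f"INSERT INTO hits VALUES ({placeholders})", [tokenize(u) for u in urls]
)

# Count requests under /ocp/security.
below_security = conn.execute(
    "SELECT count(*) FROM hits WHERE node1 = ? AND node2 = ?",
    ("ocp", "security"),
).fetchone()[0]

# Group by the first two levels, most requested first.
by_depth2 = conn.execute(
    """
    SELECT node1, node2, count(*) AS n
    FROM hits
    GROUP BY node1, node2
    ORDER BY n DESC, node1, node2
    """
).fetchall()

print(below_security)
for row in by_depth2:
    print(row)
```

Note that in this sketch a file name such as about_us.html lands in a node column just like a directory would, so it forms its own group at depth 2.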
You could also store the URL as it is and filter using a regex. This is more flexible, but slower, because a regex filter generally cannot use an index. You only need to limit the total length of the URL, not the number of sub-nodes.
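As a rough illustration of the raw-URL variant, the sketch below keeps the URLs as plain strings in Python and scans them with the re module. The function names are hypothetical, and treating the last path element as a file name (so that it is ignored when grouping) is an assumption chosen to match the depth-2 example in the question.

```python
import re
from collections import Counter

urls = [
    "/ocp/about_us.html",
    "/ocp/security/ed-209/patches/urgent.html",
    "/ocp/security/rc/",
    "/ocp/food/",
    "/weyland-yutani/products/",
]

def count_matching(pattern, requests):
    """Count requests whose URL matches the regex; every URL is scanned."""
    rx = re.compile(pattern)
    return sum(1 for u in requests if rx.match(u))

def top_prefixes(requests, depth, n=5):
    """Group requests by their first `depth` directory levels, most frequent first."""
    counts = Counter()
    for u in requests:
        dirs = u.split("/")[1:-1]  # directory segments only; the last element is dropped
        counts["/" + "".join(d + "/" for d in dirs[:depth])] += 1
    return counts.most_common(n)

print(count_matching(r"^/ocp/", urls))
print(count_matching(r"^/ocp/security/", urls))
print(top_prefixes(urls, depth=2))
```

Because no depth limit is baked into a schema, the grouping depth can be any number, at the price of touching every stored URL on each query.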
I think you could do this with any database capable of storing large amounts of data, for instance MySQL.
#3
2
The book The Art of SQL by Stephane Faroult has an excellent chapter (7 - Dealing with Hierarchical Data) which explains and compares 3 methods for storing and querying trees using relational databases.
If you are doing a serious, industrial-strength implementation, studying the chapter will be time well spent.
#4
1
I think the most efficient way to store this type of data is in a parts explosion (or hierarchy) table.
A parts explosion table consists of three columns: an identity, a parent, and a description. For the example data, the table would look something like this:
Identity  Parent  Description
0         Null    ocp
1         0       about_us.html
2         0       security
3         2       ed-209
4         3       patches
5         4       urgent.html
6         2       rc
7         0       food
8         Null    weyland-yutani
9         8       products
As the URL (explosion) table is being populated, populate a table that records the leaf of each URL. From the example data:
Leaf ID
-------
1
5
6
7
9
I believe you can answer all your questions starting with these two tables.
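As a sketch of how those two tables might be queried, the example below loads the sample rows into SQLite through Python's sqlite3 module and walks a subtree with a recursive common table expression. The table names url_part and leaf, the count_below helper, and the idea of storing one leaf row per request are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE url_part (id INTEGER PRIMARY KEY, parent INTEGER, description TEXT);
    CREATE TABLE leaf (id INTEGER);  -- assumed: one row per request
    INSERT INTO url_part VALUES
        (0, NULL, 'ocp'), (1, 0, 'about_us.html'), (2, 0, 'security'),
        (3, 2, 'ed-209'), (4, 3, 'patches'), (5, 4, 'urgent.html'),
        (6, 2, 'rc'), (7, 0, 'food'),
        (8, NULL, 'weyland-yutani'), (9, 8, 'products');
    INSERT INTO leaf VALUES (1), (5), (6), (7), (9);
""")

def count_below(root_id):
    """Count leaf rows that sit at or beneath the node with the given id."""
    sql = """
        WITH RECURSIVE subtree(id) AS (
            SELECT ?
            UNION ALL
            SELECT p.id FROM url_part p JOIN subtree s ON p.parent = s.id
        )
        SELECT count(*) FROM leaf WHERE id IN (SELECT id FROM subtree)
    """
    return conn.execute(sql, (root_id,)).fetchone()[0]

print(count_below(0))  # everything below ocp
print(count_below(2))  # everything below ocp/security
```

The recursion runs inside the database, so a subtree count is one statement, though it still visits every node in the subtree on each call.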
#5
0
You might want to check out the HIERARCHYID datatype in SQL Server 2008 or its equivalent in Oracle.