What is a good way to structure a 100M-record table for fast ad-hoc queries?

Time: 2021-03-31 09:08:13

The scenario is quite simple, there are about 100M records in a table with 10 columns (kind of analytics data), and I need to be able to perform queries on any combination of those 10 columns. For example something like this:


  • how many records with a = 3 && b > 100 are there in the past 3 months?

Basically all of the queries are going to be a kind of how many records with attributes X are there in time interval Y, where X can be any combination of those 10 columns.

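These "count with attributes X in interval Y" queries can be sketched as a small helper that assembles a parameterized COUNT(*) statement from whatever filters are supplied. The table and column names below are invented for illustration, and SQLite stands in for whichever engine is actually used:

```python
import sqlite3

def adhoc_count(conn, filters, start, end):
    """Build 'SELECT COUNT(*) ... WHERE <filters> AND ts BETWEEN ? AND ?'
    from an arbitrary combination of (column, operator, value) filters."""
    allowed_cols = {"a", "b", "c"}                    # whitelist: avoid SQL injection
    allowed_ops = {"=", "<", ">", "<>", "<=", ">="}
    clauses, params = [], []
    for col, op, val in filters:
        if col not in allowed_cols or op not in allowed_ops:
            raise ValueError(f"bad filter: {col} {op}")
        clauses.append(f"{col} {op} ?")
        params.append(val)
    clauses.append("ts BETWEEN ? AND ?")              # the ever-present Y predicate
    params += [start, end]
    sql = "SELECT COUNT(*) FROM events WHERE " + " AND ".join(clauses)
    return conn.execute(sql, params).fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (a INTEGER, b INTEGER, c INTEGER, ts TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [(3, 150, 1, "2021-02-01"), (3, 50, 2, "2021-02-15"), (2, 200, 1, "2020-01-01")],
)

# "how many records with a = 3 and b > 100 in the past 3 months?"
n = adhoc_count(conn, [("a", "=", 3), ("b", ">", 100)], "2021-01-01", "2021-03-31")
print(n)  # 1
```

Only the WHERE clause changes between queries; the hard part, which the answers below address, is making that arbitrary WHERE clause fast over 100M rows.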

The data will keep coming in; it is not just a pre-given set of 100M records but one that grows over time.


Since the column selection can be completely random, creating indexes for popular combinations is most likely not possible.

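For a sense of scale (my own arithmetic, not from the question): 10 filterable columns already admit over a thousand filter subsets, and far more ordered composite indexes, so pre-building an index per combination is hopeless:

```python
from math import perm

# Non-empty subsets of 10 columns a query could filter on
subsets = 2**10 - 1
print(subsets)  # 1023

# Ordered composite indexes of length 1..10 are even more numerous
ordered = sum(perm(10, k) for k in range(1, 11))
print(ordered)  # 9864100
```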

The question has two parts:


  • How should I structure this in a SQL database to make the queries as fast as possible, and what are some general steps I can take to improve performance?
  • Is there any kind of NoSQL database that is optimized for this kind of search? I can think of only ElasticSearch, but I'm not sure it would perform very well on such a large data set.

6 solutions

#1


1  

Without indexes your options for tuning an RDBMS to support this kind of processing are severely limited. Basically you need massive parallelism and super-fast kit. But clearly you're not storing relational data, so an RDBMS is the wrong fit.

Pursuing the parallel route, the industry standard is Hadoop. You can still use SQL style queries through Hive.


Another noSQL option would be to consider a columnar database. These are an alternative way of organising data for analytics without using cubes. They are good at loading data fast. Vectorwise is the latest player in the arena. I haven't used it personally, but somebody at last night's LondonData meetup was raving to me about it. Check it out.


Of course, moving away from SQL databases - in whatever direction you go - will incur a steep learning curve.


#2


0  

You should build an SSAS cube and use MDX to query it.

The cube has "aggregations", which means results calculated ahead of time. Depending on how you configure your cube (and your aggregations), you can have a SUM attribute (A, for example) on a measure group, and every time you ask the cube how many records A has, it will just read the aggregation instead of reading the whole table and calculating it.

#3


0  

As far as Oracle is concerned, this would most likely be structured as an interval-partitioned table with local bitmap indexes on each column that you might query, with new data added either through a direct-path insert or a partition exchange.

Queries for popular combinations of columns could be optimised with a set of materialised views, possibly using rollup or cube queries.


#4


0  

To get these queries to run fast using SQL solutions, use these rules of thumb. There are lots of caveats with this though, and the actual SQL engine you are using will be very relevant to the solution.

I am assuming that your data is integers, dates or short scalars; long strings etc. change the game. I'm also assuming you are only using fixed comparisons (=, <, >, <>, etc.).

a) If time interval Y will be present in every query, make sure it is indexed, unless the Y predicate selects a large percentage of rows. Ensure rows are stored in "Y" order, so that they get packed on the disk next to each other. This will happen naturally anyway over time for new data. If the Y predicate is very tight (i.e. a few hundred rows), then this might be all you need to do.

b) Are you doing a "select *" or a "select count(*)"? If not "select *", then vertical partitioning MAY help, depending on the engine and other indexes present.

c) Create single-column indexes for each column where the values are widely distributed and don't have too many duplicates. Indexing YEAR_OF_BIRTH would generally be OK, but indexing FEMALE_OR_MALE is often not good - although this is highly database-engine specific.

d) If you have columns like FEMALE_OR_MALE and the "Y predicates" are wide, you have a different problem - selecting the count of females from most of the rows will be hard. You can try indexing, but it depends on the engine.

e) Try and make columns "NOT NULL" if possible - it typically saves 1 bit per row and can simplify internal optimiser operation.

f) Updates/inserts. Creating indexes often hurts insert performance, but if your rate is low enough it might not matter. With only 100M rows, I'll assume your insert rate is reasonably low.


g) Multi-segment keys would help, but you've already said they are no go.


h) Get high-speed disks (high RPM) - the problem for these types of queries is usually IO (TPC-H benchmarks are about IO, and you are sounding like an "H" problem).

There are lots more options, but it depends how much effort you want to expend "to make the queries as fast as possible". There are lots of No-SQL and other options to solve this, but I'll leave that part of the question to others.

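Point (a), indexing the time column, can be sketched in SQLite (standing in for whatever engine is in use; the table and index names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (a INTEGER, b INTEGER, ts TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(i % 5, i, f"2021-01-{i % 28 + 1:02d}") for i in range(1000)],
)

# Index the time column so the range predicate on Y seeks instead of scanning
conn.execute("CREATE INDEX idx_events_ts ON events (ts)")

query = ("SELECT COUNT(*) FROM events "
         "WHERE ts BETWEEN '2021-01-01' AND '2021-01-07'")
cnt = conn.execute(query).fetchone()[0]

# The plan's detail string names the index, confirming it is used for the range
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(cnt, plan[0][-1])
```

Whether this is enough on its own depends, as the answer says, on how tight the Y predicate is relative to the table.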

#5


0  

In addition to the above suggestions, consider just querying a materialized view that is kept updated. I think I would just create a "select ..., count(*) ... group by cube (...)" materialized view on the table.

This will give you a full cube to work with. Play around with this on a small test table to get the feel of how the cube rollups work. Check out Joe Celko's books for some examples or just hit your specific RDBMS documentation for examples.


You are a little stuck if you have to always be able to query the most up-to-the-microsecond data in your table. But if you can relax that requirement, you'll find a materialized view cube a pretty decent choice.


Are you absolutely certain that your users will hit all 10 columns in a uniform way? I have dinged myself with premature optimization in the past for this type of situation, only to find that users really used one or two columns for most of their reports and that rolling up to those one or two columns was 'good enough.'

#6


0  

If you can't create an OLAP cube from the data, can you instead create a summary table based on the unique combinations of X and Y? If the time period Y is of a sufficiently high granularity, your summary table could be reasonably small. Obviously it depends on the data.
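A minimal sketch of such a summary table, again using SQLite and invented names, grouping at month granularity so ad-hoc counts are answered by summing a few pre-aggregated rows instead of scanning the base table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (a INTEGER, b INTEGER, ts TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, 10, "2021-01-05"), (1, 20, "2021-01-20"),
     (1, 30, "2021-02-01"), (2, 40, "2021-02-10")],
)

# Summary table: one row per (a, month) combination with a pre-computed count
conn.execute(
    "CREATE TABLE events_summary AS "
    "SELECT a, substr(ts, 1, 7) AS month, COUNT(*) AS n "
    "FROM events GROUP BY a, substr(ts, 1, 7)"
)

# "How many records with a = 1 in Jan-Feb 2021?" answered from the summary
n = conn.execute(
    "SELECT COALESCE(SUM(n), 0) FROM events_summary "
    "WHERE a = 1 AND month BETWEEN '2021-01' AND '2021-02'"
).fetchone()[0]
print(n)  # 3
```

The trade-off is the one the answer notes: the summary only serves queries over the columns it was grouped by, and it lags the base table until the next refresh.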

Also, you should capture the queries that users run. It's generally the case that users say they want every possible combination, when in practice this rarely happens and most users' queries can be satisfied from pre-calculated results. The summary table would be an option here again; you'll get some data latency with this option, but it could work.

Other options if possible would be to look at hardware. I've had good results in the past using Solid State Drives such as Fusion-IO. This can reduce query time massively. This is not a replacement for good design, but with good design and the right hardware it works well.

