We have a table design that consists of 10,000,000 records and 200,000 columns.
The columns are a mixture of:
- Binary flags.
- Integers.
The queries need to perform AND/OR operations on 1-100 columns at a time and should complete in under 0.1 seconds, returning only a projection/subset of each matched row.
Around 10 new columns get added per day.
Around 1,000 new rows get added per day.
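To put the numbers above in perspective, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that every cell could be stored in a single bit (true only for the binary flags, so the result is a lower bound), and that the stated growth rates hold for a full year. The helper names are made up for this example.

```python
# Back-of-the-envelope sizing for the table described in the question.
ROWS, COLS = 10_000_000, 200_000
ROWS_PER_DAY, COLS_PER_DAY = 1_000, 10


def cells(rows: int, cols: int) -> int:
    """Total number of cells in a dense rows x cols table."""
    return rows * cols


def min_size_gb(rows: int, cols: int, bits_per_cell: int = 1) -> float:
    """Lower bound on raw size in decimal GB.

    ASSUMPTION: bits_per_cell=1 treats every column as a binary flag;
    integer columns would need more.
    """
    return rows * cols * bits_per_cell / 8 / 1e9


cells_now = cells(ROWS, COLS)

# Project one year ahead at the growth rates given in the question.
rows_1y = ROWS + 365 * ROWS_PER_DAY
cols_1y = COLS + 365 * COLS_PER_DAY
cells_1y = cells(rows_1y, cols_1y)

print(f"cells today:       {cells_now:,}")
print(f"min size today:    {min_size_gb(ROWS, COLS):.0f} GB")
print(f"cells in one year: {cells_1y:,}")
```

Even at one bit per cell, the dense table is on the order of hundreds of gigabytes, which is worth keeping in mind against the 0.1 second target.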
There are no joins.
Which DBMS is best suited for this?
Reason behind this approach:
The columns are materialized indexes from user-defined queries, which is why new columns get added each day (as more users come up with their own queries). The other option would be to not use materialized views and have the users' queries perform joins. The problem is that the queries could take any form, and in aggregate there would be a large number of very different execution plans across everyone's queries. Since the user defines the query, it's pretty much impossible to optimise a traditional SQL database using indexes, normalised tables, etc.
2 Answers
#1
First, I'd suggest measuring ad-hoc JOINs, and only doing further optimization if you find the performance lacking. I understand it could be difficult to measure every possible query, but you may be able to cover the most common/representative cases, and if they perform well enough, just stop there. There is a lot that can be done with good indexing!
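As a minimal sketch of what "measuring" could look like, the snippet below times one representative join before and after adding an index. It uses Python's built-in `sqlite3` module only because it needs no setup; the tables (`item`, `item_tag`), the index name, and the `time_query` helper are all hypothetical, and the timings on a real DBMS with real data will differ.

```python
import sqlite3
import time


def time_query(conn, sql, params=()):
    """Run a query once and return (rows, elapsed_seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    return rows, time.perf_counter() - start


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE item_tag (item_id INTEGER, tag TEXT);
""")
# Hypothetical sample data: 10,000 items, each tagged "even" or "odd".
conn.executemany("INSERT INTO item VALUES (?, ?)",
                 [(i, f"item{i}") for i in range(10_000)])
conn.executemany("INSERT INTO item_tag VALUES (?, ?)",
                 [(i, "even" if i % 2 == 0 else "odd") for i in range(10_000)])

# A stand-in for one representative ad-hoc user query.
QUERY = """
    SELECT item.id, item.name
    FROM item JOIN item_tag ON item_tag.item_id = item.id
    WHERE item_tag.tag = ?
"""

rows_before, t_before = time_query(conn, QUERY, ("even",))
conn.execute("CREATE INDEX idx_item_tag ON item_tag (tag, item_id)")
rows_after, t_after = time_query(conn, QUERY, ("even",))

print(f"no index:   {t_before:.4f}s")
print(f"with index: {t_after:.4f}s ({len(rows_after)} rows)")
```

In practice you would run each query several times against production-sized data and compare against your latency target rather than trusting a single measurement.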
Second, and only if the measurements above warrant it, create a new separate materialized view for each ad-hoc query.
- Some databases can maintain such views automatically for you [1], so if the "base" data changes, relevant results are automatically added to or removed from the materialized view (just as they would be from the "live" query result).
- Other databases may allow periodic refresh [2].
Be warned though: maintaining materialized views is not free, and having thousands of them (especially if they are constantly kept up to date, as opposed to periodically refreshed) will definitely impact the insert/update/delete performance on the base data!
[1] E.g. SQL Server indexed views.
[2] E.g. Oracle materialized views, although it looks like 12c can also do something close to SQL Server's immediate refresh.
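To illustrate the periodic-refresh trade-off, here is a small sketch that emulates a materialized view by hand. SQLite has no materialized views, so this is a swapped-in technique (an ordinary table rebuilt on demand), not how SQL Server or Oracle implement theirs. The table names and the `refresh_mv` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE base (id INTEGER PRIMARY KEY, score INTEGER);
    -- Stand-in for a materialized view of one user-defined query.
    CREATE TABLE mv_high_score (id INTEGER PRIMARY KEY);
""")


def refresh_mv(conn):
    """Periodic refresh: rebuild the stored result from the base data."""
    with conn:  # commit both statements together
        conn.execute("DELETE FROM mv_high_score")
        conn.execute(
            "INSERT INTO mv_high_score SELECT id FROM base WHERE score > 50"
        )


def read_mv(conn):
    """Return the ids currently stored in the emulated view."""
    return [row[0] for row in
            conn.execute("SELECT id FROM mv_high_score ORDER BY id")]


conn.executemany("INSERT INTO base VALUES (?, ?)", [(1, 10), (2, 60), (3, 90)])
refresh_mv(conn)
fresh = read_mv(conn)

# New base data arrives; the stored result is stale until the next refresh.
conn.execute("INSERT INTO base VALUES (4, 75)")
stale = read_mv(conn)

refresh_mv(conn)
refreshed = read_mv(conn)

print(fresh, stale, refreshed)
```

The stale read between refreshes is the price of cheap writes. An immediately maintained view avoids the staleness but pays the maintenance cost on every insert, update, and delete, which is the overhead the warning above refers to.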
#2
Setting aside why you want thousands of columns, you can look at the comparison below for databases that support an unlimited number of columns.
References: https://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems