Over the years I have read a lot of people's opinions on how to get better performance out of their SQL queries (Microsoft SQL Server, just so we are all on the same page...). However, they all seem to be tightly tied to either a high-performance OLTP setup or a data warehouse OLAP setup (cubes galore...). My situation today is somewhere in the middle of the two, hence my indecision.
I have a general DB structure of [Contacts], [Sites], [SiteContacts] (the junction table of [Sites] and [Contacts]), [SiteTraits], and [ContactTraits]. I have nearly 3 million contacts with about 50 fields (between [Contacts] and [ContactTraits]) relating to just the contact, and about 600 thousand sites with about 150 fields (between [Sites] and [SiteTraits]) relating to just the sites. Basically it’s a pretty big flattened table or view… Most of the columns are int, bit, char(3), or short varchars. My problem is that a good portion of these columns are available to the user for ad-hoc queries, which have to come back as quickly as possible because the main UI for this will be a website. I know the most common filters, but even with heavy indexing on them I think this will still be a beast… This data is read-only; it doesn’t change at all during the day, and the database will only be refreshed with the latest information during scheduled downtime. So I see this situation as an OLAP database with the read requirements of an OLTP database.
I see 3 options: 1. Break the table into smaller units and sub-query everything; 2. Make one flat table and really go to town on the indexing; 3. Create an OLAP cube and sub-query the rest based on whichever filter values I don’t put in as cube dimensions. I have not done much with OLAP cubes, so I frankly don’t even know if that is an option, but from what I’ve done with them in the past I think it might be. Also, just to clarify what I mean by “sub-query everything”: instead of having a WHERE clause on the outer select, there would be one (if applicable) for each table being brought into the query, and then the tables are INNER JOINed, to avoid a really large Cartesian product. As for the second option of one large table, I have heard and seen conflicting results with that approach: it will save on joins, but at the same time a table scan takes much longer.
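To make the "sub-query everything" idea concrete, here is a minimal sketch comparing the two query shapes. It uses Python's sqlite3 purely as a runnable stand-in for SQL Server, and the column names (Title, State) and sample rows are made up for illustration; only the three table names come from the post.

```python
import sqlite3

# SQLite stands in for SQL Server here; columns and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Contacts (ContactID INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE Sites (SiteID INTEGER PRIMARY KEY, State TEXT);
    CREATE TABLE SiteContacts (SiteID INTEGER, ContactID INTEGER);
    INSERT INTO Contacts VALUES (1, 'CEO'), (2, 'CFO'), (3, 'CEO');
    INSERT INTO Sites VALUES (10, 'NY'), (20, 'CA');
    INSERT INTO SiteContacts VALUES (10, 1), (10, 2), (20, 3);
""")

# Form A: join everything, then filter in the outer WHERE clause.
outer_where = """
    SELECT c.ContactID, s.SiteID
    FROM Contacts AS c
    INNER JOIN SiteContacts AS sc ON sc.ContactID = c.ContactID
    INNER JOIN Sites AS s ON s.SiteID = sc.SiteID
    WHERE c.Title = ? AND s.State = ?
    ORDER BY c.ContactID
"""

# Form B: filter each table in its own derived table, then join the results.
per_table = """
    SELECT c.ContactID, s.SiteID
    FROM (SELECT ContactID FROM Contacts WHERE Title = ?) AS c
    INNER JOIN SiteContacts AS sc ON sc.ContactID = c.ContactID
    INNER JOIN (SELECT SiteID FROM Sites WHERE State = ?) AS s
        ON s.SiteID = sc.SiteID
    ORDER BY c.ContactID
"""

params = ("CEO", "NY")
rows_outer = conn.execute(outer_where, params).fetchall()
rows_per_table = conn.execute(per_table, params).fetchall()
print(rows_outer, rows_per_table)
```

With inner joins the two forms are logically equivalent, and a query optimizer will often push the outer predicates down to the individual tables anyway, so it is probably worth comparing the actual execution plans on the real data before restructuring every query this way.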
Ideas anyone? Do I need to share what I’m smoking? I think this could turn into a pretty good discussion if everyone puts in their 2 cents. Oh, and feel free to tell me if I’m way off base with the OLAP cube idea if that’s the case, I’m new to that stuff too.
Thanks in advance to any and all opinions and help with this dilemma I’ve found myself in.
4 Solutions
#1
2
You may want to consider this as a relational data warehouse. You could design your relational database tables as a star schema (or, a snowflake schema). This design is very similar to the OLAP cube logical structure, but the physical structure is in the relational database.
In the star schema you would have one or more fact tables, which represent transactions of some sort and are usually associated with a date. I'm not sure what a transaction might be in this case, though. The fact may be the association of sites to contacts, i.e. the [SiteContacts] table.
The fact table would reference dimension tables, which describe the fact. Dimensions might be Sites and Contacts. A dimension contains attributes, such as contact name, contact address, etc. If you are familiar with the OLAP cube, then this will be a familiar logical architecture.
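As a rough sketch of that layout, the snippet below builds two dimension tables and a fact table that only records which contact belongs to which site, then runs a typical star-schema query. SQLite (via Python's sqlite3) is used only so the example runs on its own; the Dim/Fact table names, the columns, and the sample rows are all assumptions, not the poster's actual schema.

```python
import sqlite3

# SQLite stands in for SQL Server; every name and row below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one row per contact / site, holding descriptive attributes.
    CREATE TABLE DimContact (
        ContactKey INTEGER PRIMARY KEY, ContactName TEXT, Title TEXT);
    CREATE TABLE DimSite (
        SiteKey INTEGER PRIMARY KEY, SiteName TEXT, State TEXT);

    -- Fact table: one row per site/contact association, referencing both dimensions.
    CREATE TABLE FactSiteContact (
        SiteKey INTEGER NOT NULL REFERENCES DimSite (SiteKey),
        ContactKey INTEGER NOT NULL REFERENCES DimContact (ContactKey),
        PRIMARY KEY (SiteKey, ContactKey)
    );

    INSERT INTO DimContact VALUES
        (1, 'Ann', 'Manager'), (2, 'Bob', 'Clerk'), (3, 'Cy', 'Manager');
    INSERT INTO DimSite VALUES (1, 'HQ', 'NY'), (2, 'Depot', 'CA');
    INSERT INTO FactSiteContact VALUES (1, 1), (1, 2), (2, 3), (2, 1);
""")

# Typical star query: filter on dimension attributes, join through the fact table.
star_query = """
    SELECT s.State, COUNT(*) AS ContactCount
    FROM FactSiteContact AS f
    INNER JOIN DimSite AS s ON s.SiteKey = f.SiteKey
    INNER JOIN DimContact AS c ON c.ContactKey = f.ContactKey
    WHERE c.Title = 'Manager'
    GROUP BY s.State
    ORDER BY s.State
"""
managers_by_state = conn.execute(star_query).fetchall()
print(managers_by_state)  # [('CA', 2), ('NY', 1)]
```

The point of the shape is that every ad-hoc filter lands on a dimension attribute, so indexing effort can be concentrated on the dimension columns and on the two keys of the fact table.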
It wouldn't be a very big problem to add numerous indexes to your architecture. The database is mostly read only, except for the refresh time. You won't have to worry about read performance while indexes are being updated. So, the architecture can accommodate all indexes that are needed (as long as you can dedicate enough downtime to refresh the data).
#2
1
I agree with bobs' answer: throw an OLAP front end on it and query through the cube. The reason this will be a good thing is that cubes are highly efficient at querying (often precomputed) aggregates by multiple dimensions, and they store the data in a column-oriented format that is more efficient for data analysis.
The relational data underneath the cube will be great for detail drill-ins to find the individual facts that give a certain aggregate value. But querying the relational data directly will always be slow, because the aggregates users are interested in for analysis can only be produced by scanning large amounts of data. OLAP is just better at this.
#3
0
OLAP/SSAS is efficient for aggregate queries, not as much for granular data in my experience.
What are the most common queries? For single pieces of data or aggregates?
#4
0
If the granularity of SiteContacts is pretty close to that of Contacts (i.e. circa 3 million records - most contacts associated with only a single site), you may get the best performance out of a single table (with plenty of appropriate indexes, obviously; partitioning should also be considered).
On the other hand, if most contacts are associated with many sites, it might be better to stick with something close to your current schema.
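One way to find out which of these two cases applies is to measure it. The sketch below counts rows in both tables and the number of contacts linked to more than one site. It runs on SQLite through Python's sqlite3 only so that it is self-contained; the ContactID/SiteID column names and the sample rows are assumptions.

```python
import sqlite3

# SQLite stands in for SQL Server; column names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Contacts (ContactID INTEGER PRIMARY KEY);
    CREATE TABLE SiteContacts (SiteID INTEGER, ContactID INTEGER);
    INSERT INTO Contacts VALUES (1), (2), (3), (4);
    INSERT INTO SiteContacts VALUES (10, 1), (10, 2), (20, 3), (20, 4), (30, 4);
""")

# Row counts of both tables, plus how many contacts appear at more than one site.
contact_rows, link_rows, multi_site_contacts = conn.execute("""
    SELECT
        (SELECT COUNT(*) FROM Contacts),
        (SELECT COUNT(*) FROM SiteContacts),
        (SELECT COUNT(*) FROM (
            SELECT ContactID
            FROM SiteContacts
            GROUP BY ContactID
            HAVING COUNT(DISTINCT SiteID) > 1
        ) AS multi)
""").fetchone()

# A ratio near 1.0 means SiteContacts has roughly the granularity of Contacts.
ratio = link_rows / contact_rows
print(contact_rows, link_rows, multi_site_contacts, ratio)
```

A ratio close to 1 with few multi-site contacts would point towards the single flattened table; a much higher ratio would suggest keeping the junction table, since flattening would then repeat each contact's columns once per site.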
OLAP tends to produce the best results on aggregated data - it sounds as though there will be relatively little aggregation carried out on this data.
Star schemas consist of fact tables with dimensions hanging off them - depending on the relationship between Sites and Contacts, it sounds as though you either have one huge dimension table, or two large dimensions with a factless fact table (sounds like an oxymoron, but is covered in Kimball's methodology) linking them.