I have an app that cycles through a huge number of records in a database table and performs a number of SQL and .NET operations on records within that database (currently I am using Castle.ActiveRecord on PostgreSQL).
I added some basic btree indexes on a couple of the fields and, as you would expect, the performance of the SQL operations increased substantially. To make the most of DBMS performance, I want to make better-educated choices about what I should index on all my projects.
I understand that inserts take a performance hit (the database has to update the index as well as the data), but what suggestions and best practices should I consider when creating database indexes? How do I best select the fields, or combinations of fields, for a set of database indexes (rules of thumb)?
Also, how do I best select which index to use as a clustered index? And when it comes to the access method, under what conditions should I use a btree over a hash, a GiST, or a GIN index (and what are they, anyway)?
3 answers
#1
34
Some of my rules of thumb:
- Index ALL primary keys (I think most RDBMSs do this automatically when the table is created).
- Index ALL foreign key columns.
- Create more indexes ONLY if:
  - Queries are slow.
  - You know the data volume is going to increase significantly.
- Run statistics after loading a lot of data into tables.
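The first rules above can be sketched in a few lines. This is only an illustration, not the setup from the question: it uses Python's built-in `sqlite3` module with an in-memory SQLite database as a stand-in for PostgreSQL, and the `customers` and `orders` tables and the index name are hypothetical. The foreign key column gets an explicit index, and `ANALYZE` (a command PostgreSQL also provides) refreshes the planner statistics after the bulk load.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,                        -- primary key lookups need no extra index
        customer_id INTEGER REFERENCES customers(id),  -- foreign key: not indexed automatically
        total REAL
    );
    CREATE INDEX idx_orders_customer_id ON orders (customer_id);
""")

# Bulk load: 100 customers, 10,000 orders spread evenly across them.
conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(i, f"customer {i}") for i in range(1, 101)],
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100 + 1, float(i)) for i in range(10_000)],
)

# Refresh planner statistics after populating the tables.
conn.execute("ANALYZE")
stats = conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()

# Ask the planner how it would run a lookup by foreign key.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT total FROM orders WHERE customer_id = ?", (42,)
).fetchall()
plan_text = " | ".join(row[-1] for row in plan)  # last column holds the description
print(plan_text)
```

The printed plan should name `idx_orders_customer_id`, showing that the lookup goes through the foreign key index rather than scanning the whole table.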
If a query is slow, look at the execution plan and:
- If the query on a table uses only a few columns, put all of those columns into an index; the RDBMS can then answer the query from the index alone.
- Don't waste resources indexing tiny tables (hundreds of records).
- Index multiple columns in order from high cardinality to low: first the columns with more distinct values, followed by the columns with fewer distinct values.
- If a query needs to access more than 10% of the data, a full scan is normally better than an index.
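A rough sketch of the "look at the execution plan" workflow, again using Python's `sqlite3` as a stand-in for PostgreSQL (where you would read the output of `EXPLAIN` instead). The `events` table, its columns, and the `plan_for` helper are made up for the example. The query touches only two columns, so a two-column index, ordered with the higher-cardinality column first as suggested above, lets the planner answer it from the index alone.

```python
import sqlite3


def plan_for(conn, sql, params=()):
    """Return the planner's description of how it would run `sql` (hypothetical helper)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the description


# Assumption: in-memory SQLite stands in for PostgreSQL; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO events (user_id, status, payload) VALUES (?, ?, ?)",
    [(i % 1000, "open" if i % 3 else "closed", "x" * 50) for i in range(20_000)],
)

query = "SELECT user_id, status FROM events WHERE user_id = ? AND status = ?"

# Without a usable index the planner has to scan the whole table.
before = plan_for(conn, query, (7, "open"))

# user_id has about 1000 distinct values, status only 2, so user_id goes first.
conn.execute("CREATE INDEX idx_events_user_status ON events (user_id, status)")
conn.execute("ANALYZE")

# Every column the query needs is in the index, so the table itself is not read.
after = plan_for(conn, query, (7, "open"))

print("before:", before)
print("after: ", after)
```

SQLite reports this situation as a "covering index"; PostgreSQL calls the equivalent plan an index-only scan.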
#2
3
Here's a slightly simplistic overview: it's certainly true that there is an overhead to data modifications due to the presence of indexes, but you ought to consider the relative number of reads and writes to the data. In general the number of reads is far higher than the number of writes, and you should take that into account when defining an indexing strategy.
When it comes to which columns to index, I've always felt that the designer ought to know the business well enough to take a very good first pass at which columns are likely to benefit. Other than that, it really comes down to feedback from the programmers, full-scale testing, and system monitoring (preferably with extensive internal performance metrics to capture long-running operations).
#3
2
As @David Aldridge mentioned, the majority of databases perform many more reads than writes. In addition, appropriate indexes will often be used even when performing INSERTs (to determine the correct place to insert).
The critical indexes under an unknown production workload are often hard to guess or estimate, and a set of indexes should not be treated as set-and-forget. Indexes should be monitored and altered as workloads change (that new killer report, for instance).
Nothing beats profiling; if you guess your indexes, you will often miss the really important ones.
As a general rule, if I have little idea how the database will be queried, I create indexes on all foreign keys, profile under a workload (think UAT release), remove the indexes that are not being used, and create any important missing ones.
Also, make sure that a scheduled index maintenance plan is created.
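One way to picture the "profile, then remove what is not used" step is the sketch below. It is an approximation under stated assumptions, not a real profiler: it runs on SQLite through Python's `sqlite3`, asks the planner which index it would pick for each query in a small hand-written workload, and reports the indexes that are never chosen. The table, the index names, the workload, and both helper functions are hypothetical. On PostgreSQL you would more likely read the `idx_scan` counter in the `pg_stat_user_indexes` statistics view after running a real workload.

```python
import re
import sqlite3


def used_indexes(conn, workload):
    """Return the names of indexes the planner picks for any query in `workload`."""
    used = set()
    for sql, params in workload:
        for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params):
            # Plan lines look like "SEARCH orders USING INDEX idx_name (col=?)".
            match = re.search(r"INDEX (\w+)", row[-1])
            if match:
                used.add(match.group(1))
    return used


def all_indexes(conn):
    """Return the names of explicitly created indexes (internal ones have no SQL text)."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND sql IS NOT NULL"
    )
    return {name for (name,) in rows}


# Assumption: in-memory SQLite stands in for PostgreSQL; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        product_id INTEGER,
        note TEXT
    );
    CREATE INDEX idx_orders_customer ON orders (customer_id);
    CREATE INDEX idx_orders_product ON orders (product_id);
""")

# A stand-in for the queries captured while profiling (for example during UAT).
workload = [
    ("SELECT note FROM orders WHERE customer_id = ?", (1,)),
    ("SELECT note FROM orders WHERE id = ?", (1,)),
]

unused = all_indexes(conn) - used_indexes(conn, workload)
print("candidates for removal:", sorted(unused))
```

An index that shows up as unused here is only a candidate for removal; a workload that misses a monthly report would wrongly flag the index that report depends on.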