Indexing multiple columns in ActiveRecord

Time: 2022-02-13 04:18:45

In ActiveRecord there are two ways to declare indexes for multiple columns:

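# Approach 1: one composite index covering all three columns
add_index :classifications, [:species, :family, :trivial_names]

# Approach 2: three independent single-column indexes
add_index :classifications, :species
add_index :classifications, :family
add_index :classifications, :trivial_names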

Is there any difference between the first approach and the second one? If so, when should I use the first and when the second?

3 answers

#1


86  

You are comparing a composite index with a set of independent indices. They are just different.

Think of it this way: a composite index gives you rapid look-up of the first field in a nested set of fields, followed by rapid look-up of the second field within ONLY the records already selected by the first field, followed by rapid look-up of the third field, again only within the records selected by the previous two fields.

Let's take an example. If you are using an index, your database engine will take no more than 20 steps to locate a unique value within 1,000,000 records (log base 2 of 1,000,000 is just under 20). This is true whether you are using a composite or an independent index, but ONLY for the first field ("species" in your example, although I'd think you'd want Family, Species, and then Common Name).

Now, let's say that there are 100,000 matching records for this first field value. If you have only single indices, then any lookup within these records will take 100,000 steps: one for each record retrieved by the first index. This is because the second index will not be used (in most databases - this is a bit of a simplification) and a brute force match must be used.

If you have a composite index then your search is much faster because your second field search will have an index within the first set of values. In this case you'll need no more than 17 steps to get to your first matching value on field 2 within the 100,000 matches on field 1 (log base 2 of 100,000).

So: steps needed to find a unique record out of a database of 1,000,000 records using a composite index on 3 nested fields where the first retrieves 100,000 and the second retrieves 10,000 = 20 + 17 + 14 = 51 steps.

Steps needed under the same conditions with just independent indices = 20 + 100,000 + 10,000 = 110,020 steps.
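
The arithmetic above can be reproduced with a few lines of Ruby. This is only a sketch of the simplified "Database 101" model described in this answer; the method names (index_steps, composite_steps, independent_steps) are made up for illustration, and real engines use B-trees and cost-based planners rather than a plain binary search.

```ruby
# Worst-case binary-search steps to locate one value among n sorted entries.
def index_steps(n)
  Math.log2(n).ceil
end

# Composite index: every nesting level is an indexed (logarithmic) search.
def composite_steps(total_rows, first_matches, second_matches)
  index_steps(total_rows) + index_steps(first_matches) + index_steps(second_matches)
end

# Independent indexes (simplified): only the first lookup uses an index,
# the remaining conditions are checked row by row.
def independent_steps(total_rows, first_matches, second_matches)
  index_steps(total_rows) + first_matches + second_matches
end

puts composite_steps(1_000_000, 100_000, 10_000)
puts independent_steps(1_000_000, 100_000, 10_000)
```

With the numbers from the example this prints 51 and 110020.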

Big difference, eh?

Now, don't go nuts putting composite indices everywhere. First, they are expensive on inserts and updates. Second, they are only brought to bear if you are truly searching across nested data (for another example, I use them when pulling data for logins for a client over a given date range). Also, they are not worth it if you are working with relatively small data sets.

Finally, check your database documentation. Databases have grown extremely sophisticated in the ability to deploy indices these days and the Database 101 scenario I described above may not hold for some (although I always develop as if it does just so I know what I am getting).

#2


10  

The two approaches are different. The first creates a single index on three attributes, the second creates three single-attribute indices. Storage requirements will be different, although without knowing the data distributions it's not possible to say which would be larger.

An index on three columns [A, B, C] works well when you need to look up rows by A, by A+B, or by A+B+C. It won't be any good if your query (or find conditions or whatever) doesn't reference A.
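
One way to picture this "leading columns" rule is a small Ruby helper. It is a simplified model rather than how any particular DBMS decides: the method name usable_prefix is hypothetical, it assumes equality conditions only, and some optimizers have extra tricks that it ignores.

```ruby
# Returns the leading columns of a composite index that a query can use,
# given the columns the query filters on (equality conditions assumed).
def usable_prefix(index_columns, query_columns)
  index_columns.take_while { |column| query_columns.include?(column) }
end

index = [:a, :b, :c]

p usable_prefix(index, [:a])         # [:a]
p usable_prefix(index, [:a, :b])     # [:a, :b]
p usable_prefix(index, [:a, :c])     # [:a]  (:c cannot be used without :b)
p usable_prefix(index, [:b, :c])     # []    (query does not reference :a)
```

An empty result means that, in this model, the composite index does not help the query at all.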

When A, B and C are indexed separately, some DBMS query optimizers will consider combining two or more indices (subject to the optimizer's estimate of efficiency) to give a similar result to a single multi-column index.
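
As a rough illustration of what "combining" two single-column indices means, the sketch below builds two toy indexes as Ruby hashes and intersects the row ids they return. The data, the constant ROWS and the methods build_index and ids_matching are all invented for this example; an actual optimizer works with bitmaps or sorted row pointers and decides from cost estimates whether the merge is worthwhile.

```ruby
ROWS = [
  { id: 1, a: "x", b: 10 },
  { id: 2, a: "x", b: 20 },
  { id: 3, a: "y", b: 10 },
  { id: 4, a: "x", b: 10 }
].freeze

# Toy single-column index: column value => array of matching row ids.
def build_index(rows, column)
  rows.group_by { |row| row[column] }
      .transform_values { |matches| matches.map { |row| row[:id] } }
end

# Look up each index separately, then intersect the two id lists.
def ids_matching(index_a, a_value, index_b, b_value)
  index_a.fetch(a_value, []) & index_b.fetch(b_value, [])
end

index_a = build_index(ROWS, :a)
index_b = build_index(ROWS, :b)

p ids_matching(index_a, "x", index_b, 10)   # [1, 4]
```

The intersection gives the same rows a composite index on [a, b] would find, at the cost of reading both single-column indexes first.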

Suppose you have an e-commerce system. You want to query orders by purchase_date, by customer_id, and sometimes by both. I'd start by creating two indices: one for each attribute.

On the other hand, if you always specify purchase_date and customer_id, then a single index on both columns would probably be most efficient. The order is significant: if you also wanted to query orders for all dates for a customer, then make the customer_id the first column in the index.
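
To see why the column order matters, here is a small sketch that treats a composite index as a sorted list of [customer_id, purchase_date] pairs. The sample data and the names ORDERS, CUSTOMER_FIRST, DATE_FIRST, orders_for_customer and positions_of_customer are assumptions made up for the illustration; a real index is a B-tree, but the ordering argument is the same.

```ruby
# Sample orders as [customer_id, purchase_date] pairs (ISO dates sort as strings).
ORDERS = [
  [7, "2022-01-05"], [3, "2022-01-05"], [7, "2022-02-01"],
  [5, "2022-01-20"], [3, "2022-03-11"]
].freeze

# "Index" sorted by customer_id first, then purchase_date.
CUSTOMER_FIRST = ORDERS.sort.freeze
# "Index" sorted by purchase_date first, then customer_id.
DATE_FIRST = ORDERS.map(&:reverse).sort.freeze

# With customer_id leading, one binary search finds the start of a
# contiguous run holding all of that customer's orders.
def orders_for_customer(index, customer_id)
  start = index.bsearch_index { |(cid, _date)| cid >= customer_id }
  return [] if start.nil?

  index.drop(start).take_while { |(cid, _date)| cid == customer_id }
end

# Positions of a customer's entries in a date-first index.
def positions_of_customer(date_first_index, customer_id)
  date_first_index.each_index.select { |i| date_first_index[i][1] == customer_id }
end

p orders_for_customer(CUSTOMER_FIRST, 7)    # [[7, "2022-01-05"], [7, "2022-02-01"]]
p positions_of_customer(DATE_FIRST, 7)      # [1, 3]
```

In the customer-first ordering the customer's orders sit next to each other, while in the date-first ordering they are scattered, so finding them means scanning the whole index.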

#3


1  

From the docs:

When creating an index on multiple columns, the first column is used as a name for the index. For example, when you specify an index on two columns [:first, :last], the DBMS creates an index for both columns as well as an index for the first column :first. Using just the first name for this index makes sense, because you will never have to create a singular index with this name.

Use the first method when creating a compound index, and the second when creating indexes on single attributes.

There are some good points here on when to use compound indexes, but the gist is that they are good when a where clause filters on multiple attributes. Note that they should be used alongside other indexes (always index your foreign keys) - not as a replacement.
