I am building a forward index on a wiki using MySQL. I am running into performance problems with queries and I am hoping for some help optimising either my schema or my queries.
The database is around 1GB and it has three tables:
- fi_page is the table of 800k wiki pages
- fi_keyword is a table of 70k keywords
CREATE TABLE `fi_keyword` ( `id` int(11) NOT NULL AUTO_INCREMENT, `keyword` varchar(100) NOT NULL, PRIMARY KEY (`id`), UNIQUE KEY `keyword` (`keyword`) );
- fi_titlekeywordlink is a table with 6 million entries linking keywords to wiki pages
CREATE TABLE `fi_titlekeywordlink` ( `id` int(11) NOT NULL AUTO_INCREMENT, `keyword_id` int(11) NOT NULL, `page_id` int(11) NOT NULL, PRIMARY KEY (`id`), KEY `fi_titlekeywordlink_a6434082` (`keyword_id`), KEY `fi_titlekeywordlink_c2d3d2bb` (`page_id`), CONSTRAINT `keyword_id_refs_id_67197756` FOREIGN KEY (`keyword_id`) REFERENCES `fi_keyword` (`id`), CONSTRAINT `paper_id_refs_id_705ddf03` FOREIGN KEY (`page_id`) REFERENCES `fi_page` (`id`) );
I am translating a search for 'search terms galore' into an SQL query such as:
select p.*
from
fi_keyword as k0, fi_titlekeywordlink as l0,
fi_keyword as k1, fi_titlekeywordlink as l1,
fi_keyword as k2, fi_titlekeywordlink as l2,
fi_keyword as k3, fi_titlekeywordlink as l3,
fi_page as p
where
k0.keyword = e and k0.id = l0.keyword_id and p.id = l0.page_id
and k1.keyword = 'search' and k1.id = l1.keyword_id and p.id = l1.page_id
and k2.keyword = 'terms' and k2.id = l2.keyword_id and p.id = l2.page_id
and k3.keyword = 'galore' and k3.id = l3.keyword_id and p.id = l3.page_id
limit 1,10
However, this takes around half a second to run on my MBP. Do you have any suggestions on how to speed up this sort of operation, either by changing the schema or the query? I cannot use a separate search server in this case; the forward index must run on MySQL. Thank you.
1 solution
#1
At the cost of insertion performance, you could delete the surrogate id primary key columns from both tables and make the keyword column the primary key of fi_keyword and (keyword_id, page_id) the primary key of fi_titlekeywordlink.
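If you are using InnoDB, the primary key is the clustered index: rows are stored in primary key order, so a lookup by primary key reaches the row data directly instead of going through a secondary index first.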
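As a rough sketch of the link-table half of that change, assuming InnoDB: the table name fi_titlekeywordlink_v2, the index name idx_page and both constraint names below are hypothetical, chosen only for illustration. The sketch leaves fi_keyword alone, because in the schema shown the link table references fi_keyword.id.

```sql
-- Hypothetical replacement for fi_titlekeywordlink without the surrogate id.
-- Table, index and constraint names are illustrative, not from the original schema.
CREATE TABLE `fi_titlekeywordlink_v2` (
  `keyword_id` int(11) NOT NULL,
  `page_id` int(11) NOT NULL,
  -- Clustered on (keyword_id, page_id): all pages for one keyword are stored together.
  PRIMARY KEY (`keyword_id`, `page_id`),
  -- Still needed for lookups that start from page_id and for the page foreign key.
  KEY `idx_page` (`page_id`),
  CONSTRAINT `fk_link_v2_keyword` FOREIGN KEY (`keyword_id`) REFERENCES `fi_keyword` (`id`),
  CONSTRAINT `fk_link_v2_page` FOREIGN KEY (`page_id`) REFERENCES `fi_page` (`id`)
) ENGINE=InnoDB;

-- Copy the existing links. DISTINCT guards against duplicate pairs,
-- which the new primary key would reject.
INSERT INTO `fi_titlekeywordlink_v2` (`keyword_id`, `page_id`)
SELECT DISTINCT `keyword_id`, `page_id`
FROM `fi_titlekeywordlink`;
```

Building a new table and copying into it keeps the original intact while you compare query timings.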
Even if you don't make this change, a compound (multi-column) index on (keyword_id, page_id) in fi_titlekeywordlink would improve performance because it would act as a covering index (MySQL wouldn't have to visit the table data). This assumes that your MySQL server has enough RAM to fit all indexes in memory and that you've configured it to use that much RAM (the relevant configuration variables differ between MyISAM and InnoDB).
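A minimal sketch of adding that index and checking whether it is used, assuming the schema from the question; the index name idx_keyword_page is made up for this example. If the optimizer treats the index as covering, EXPLAIN reports "Using index" in the Extra column for the link table.

```sql
-- Compound index on the link table; the index name is illustrative.
ALTER TABLE `fi_titlekeywordlink`
  ADD INDEX `idx_keyword_page` (`keyword_id`, `page_id`);

-- Look at the Extra column of the row for l: "Using index" means the
-- page ids were read from the index without touching the table rows.
EXPLAIN
SELECT l.page_id
FROM fi_titlekeywordlink AS l
JOIN fi_keyword AS k ON k.id = l.keyword_id
WHERE k.keyword = 'search';

-- Memory settings worth checking: the first applies to InnoDB,
-- the second to MyISAM index blocks.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'key_buffer_size';
```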
Sometimes, an implicit JOIN can get too complex for MySQL to properly optimize. You should also consider rewriting the query with explicit ANSI standard joins using JOIN and ON.
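For illustration, here is one way the search could look with explicit joins. It is a sketch only: it uses the three terms from 'search terms galore' and the page_id column from the schema in the question, and it is not guaranteed to produce a different execution plan.

```sql
-- One link/keyword pair per search term; a page must match all three.
SELECT p.*
FROM fi_page AS p
JOIN fi_titlekeywordlink AS l1 ON l1.page_id = p.id
JOIN fi_keyword AS k1 ON k1.id = l1.keyword_id AND k1.keyword = 'search'
JOIN fi_titlekeywordlink AS l2 ON l2.page_id = p.id
JOIN fi_keyword AS k2 ON k2.id = l2.keyword_id AND k2.keyword = 'terms'
JOIN fi_titlekeywordlink AS l3 ON l3.page_id = p.id
JOIN fi_keyword AS k3 ON k3.id = l3.keyword_id AND k3.keyword = 'galore'
LIMIT 0, 10;
```

Running EXPLAIN on both the implicit and the explicit form shows whether MySQL picks a different join order for them.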
You probably just wrote SELECT p.* for brevity, but be sure to select only the columns that you require; returning unneeded data adds to the work load.
Also, the offset in a LIMIT clause is zero-based, so LIMIT 1, 10 skips the first row. Use LIMIT 0, 10 to get the first 10 rows.
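A short sketch of the LIMIT forms side by side. It selects only p.id, the one fi_page column that appears in the question; without an ORDER BY, which rows count as "first" is up to the server.

```sql
-- These three forms are equivalent and return up to 10 rows starting at offset 0.
SELECT p.id FROM fi_page AS p LIMIT 0, 10;
SELECT p.id FROM fi_page AS p LIMIT 10;
SELECT p.id FROM fi_page AS p LIMIT 10 OFFSET 0;

-- This form skips one row and returns the 2nd through 11th rows.
SELECT p.id FROM fi_page AS p LIMIT 1, 10;
```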