I'm trying a little experiment at pushing a data set which is not geo-spatial but fits it quite well and am finding the results somewhat unsettling. The data set is genomic data e.g. the Human Genome where we have a region of DNA where elements like genes occupy specific start and stop coordinates (our X axis). We have multiple regions of DNA (chromosomes) which occupy the Y axis. The goal is to bring back all the items which intersect two X coordinates along a single Y coordinate e.g. LineString(START 1, END 2).
我正在做一个小实验来推动一个数据集,这个数据集不是地理空间的,但是非常适合它,我发现结果有些令人不安。数据集是基因组数据,例如人类基因组,我们有一个DNA区域,像基因这样的元素占据特定的开始和停止坐标(X轴)。我们有多个位于Y轴的DNA(染色体)区域。目标是将所有与两个X坐标相交的项沿着一个Y坐标返回,例如LineString(开始1,结束2)。
The theory seemed sound so I pushed it into an existing MySQL based genome project and came up with a table structure like:
这个理论听起来很合理,所以我把它推进了一个现有的基于MySQL的基因组项目,并提出了一个表结构,比如:
CREATE TABLE `spatial_feature` (
`spatial_feature_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`external_id` int(10) unsigned NOT NULL,
`external_type` int(3) unsigned NOT NULL,
`location` geometry NOT NULL,
PRIMARY KEY (`spatial_feature_id`),
SPATIAL KEY `sf_location_idx` (`location`)
) ENGINE=MyISAM;
external_id
represents the identifier of the entity we have encoded into this table & external_type
encodes the source of this. Everything looked good and I pushed in some preliminary data (30,000 rows) which seemed to work well. When this increased past the 3 million row mark MySQL refused to use the spatial index and was slower when it was forced to use it (40 seconds vs. 5 seconds using a full table scan). When more data was added the index started to be used but the performance penalty persisted. Forcing the index off brought the query down to 8 seconds. The query I'm using looks like:
external_id表示已编码到此表中的实体的标识符,而external_type对其源代码进行编码。一切看起来都很好,我推了一些初步数据(3万行),看起来效果不错。当超过300万行标记时,MySQL拒绝使用空间索引,并且在*使用时速度更慢(40秒vs. 5秒使用全表扫描)。当添加更多的数据时,索引开始被使用,但是性能惩罚仍然存在。强制关闭索引将查询降低到8秒。我使用的查询如下:
select count(*)
from spatial_feature
where MBRIntersects(GeomFromText('LineString(7420023 1, 7420023 1)'), location);
The data going into this is be very dense along the Y dimensions (think of it like you've recorded the position of every building, telephone box, post box and pigeon on a very long road). I've done tests of how R-Indexes behave with this data in Java as well as others in the field have applied them to flat-file formats with success. However no one has applied them to databases AFAIK which is the goal of this test.
这里面的数据在Y维上非常密集(想象一下,你已经记录了每一栋楼,电话亭,邮筒和鸽子在很长的道路上的位置)。我已经对Java中的r - index如何使用这些数据进行了测试,以及该领域的其他人成功地将它们应用到平面文件格式中。然而,还没有人将它们应用到数据库AFAIK中,这是这次测试的目标。
Has anyone out there seen a similar behaviour when adding large quantities of data to a spatial model which is not very disparate along a particular axis? The problem persists if I reverse the coordinate usage. I'm running the following setup if that's a cause
在空间模型中添加大量数据时,是否有人看到过类似的行为,这些数据沿着一个特定的轴不是非常不同?如果我改变坐标的用法,问题就会继续存在。如果这是原因,我将运行以下设置
- MacOS 10.6.6
- MacOS 10.6.6
- MySQL 5.1.46
- MySQL 5.1.46
Help!
的帮助!
Also bringing in explain plan in
并引入解释计划。
+----+-------------+-----------------+------+-----------------+------+---------+------+---------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------------+------+-----------------+------+---------+------+---------+----------+-------------+
| 1 | SIMPLE | spatial_feature | ALL | sf_location_idx | NULL | NULL | NULL | 3636060 | 33.33 | Using where |
+----+-------------+-----------------+------+-----------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
The re-written SQL looks like the following
重写的SQL如下所示
select count(0) AS `count(*)` from `arabidopsis_thaliana_core_7_60_9`.`spatial_feature` where intersects(geometryfromtext('LineString(7420023 1, 7420023 1)'),`arabidopsis_thaliana_core_7_60_9`.`spatial_feature`.`location`)
Still not highlighting why this query's performance is so poor
仍然没有强调为什么这个查询的性能如此糟糕。
After reading the article posted by @Fraser from rickonrails it seems like the problem is all to do with the index not being in memory. If I apply similar techniques to those mentioned in the article (making key buffer very big indeed) and I then force the query to use the index query times plumet. We still see a lag between querying a region & then searching for a subset of the region but it's all pointing to getting the load of the indexes correct.
在阅读了@Fraser从rickonrails上发布的文章后,似乎所有的问题都与索引不在内存中有关。如果我将类似的技术应用于本文中提到的技术(使键缓冲区非常大),然后强制查询使用索引查询次数plumet。我们仍然看到在查询一个区域和搜索该区域的一个子集之间存在差距,但这一切都指向正确地加载索引。
What's the moral of the story? R-Indexes in MySQL have quite poor performance until they are in memory and then they have excellent performance. Not really a good solution for what I wanted to do wit them but still it provides an interesting angle on MySQL.
这个故事的寓意是什么?MySQL中的r -索引在内存中之前性能都很差,而且性能也很好。这并不是一个很好的解决方案,但它提供了一个有趣的角度。
Thanks for all the help people.
谢谢大家的帮助。
4 个解决方案
#1
1
From the EXPLAIN planwe see that although the spatial might be used for the query ('possible_keys' column) , it is not used (NULL in 'key' column). I am not sure why it is not selected automatically, but you may explicitly instruct MySql to use the index by specifying it in the query using a 'force index' clause:
从EXPLAIN planwe可以看到,尽管空间可能用于查询('possible_keys'列),但它没有被使用(' key'列为NULL)。我不知道为什么没有自动选择索引,但是您可以使用“force index”子句在查询中指定索引,明确指示MySql使用索引:
select count(*)
from spatial_feature
force index (sf_location_idx) -- <== this is the 'force index' clause
where MBRIntersects(GeomFromText('LineString(7420023 1, 7420023 1)'), location);
#2
0
The goal is to bring back all the items which intersect two X coordinates along a single Y coordinate
我们的目标是把所有与两个X坐标相交的项带回来,沿着一个Y坐标
Have you considered using an index wit multiple fields? Like:
你有没有考虑过使用一个索引?如:
CREATE INDEX spacial_search ON spatial_feature(y, x)
If you are working with a limited set of y
values this is the way to.
如果你用的是有限的y值,这是一种方法。
#3
0
I have a degree in Genetics and I am a programmer, you don't need be using an X and a Y as your nomenclature it'll get far too faffy... you need a start and a stop position (not an "axis") and a chromosome number. You index by the chromosome number first then the position and then you index the position then the chromosome. (Question: Are you dealing in eukaryotes or chromosomes that can have two reading frames?)
我有遗传学的学位,我是一名程序员,你不需要用X和Y作为你的命名法,这会让你觉得很奇怪……你需要一个开始和停止位置(不是“轴”)和一个染色体号码。首先是染色体数目,然后是位置然后是染色体。(问题:你是研究真核生物还是有两个阅读框的染色体?)
EG: (where "x" = position and "y" = chromosome)
(x =位置,y =染色体)
CREATE INDEX spatial_index_1 ON spatial_feature(chromosome, position);
CREATE INDEX spatial_index_2 ON spatial_feature(position, chromosome);
Incidentally Chromosomes are very long strings (just like data) you can (to speed things up dump them as such as blobs (i.e. coding genes and junk DNA)
顺便说一下,染色体是非常长的字符串(就像数据一样),你可以(为了加快数据的速度,把它们像blob一样(即编码基因和垃圾DNA)
#4
0
Are you sure a relational database is the way to go? If I were you I'd look at pushing your datasets to Solr or Elastic Search (probably storing the master datasets elsewhere). These engines are built for indexing, you will notice the difference in response times.
您确定要使用关系数据库吗?如果我是你,我会考虑将数据集推送到Solr或弹性搜索(可能将主数据集存储在其他地方)。这些引擎是为索引而构建的,您将注意到响应时间的差异。
#1
1
From the EXPLAIN planwe see that although the spatial might be used for the query ('possible_keys' column) , it is not used (NULL in 'key' column). I am not sure why it is not selected automatically, but you may explicitly instruct MySql to use the index by specifying it in the query using a 'force index' clause:
从EXPLAIN planwe可以看到,尽管空间可能用于查询('possible_keys'列),但它没有被使用(' key'列为NULL)。我不知道为什么没有自动选择索引,但是您可以使用“force index”子句在查询中指定索引,明确指示MySql使用索引:
select count(*)
from spatial_feature
force index (sf_location_idx) -- <== this is the 'force index' clause
where MBRIntersects(GeomFromText('LineString(7420023 1, 7420023 1)'), location);
#2
0
The goal is to bring back all the items which intersect two X coordinates along a single Y coordinate
我们的目标是把所有与两个X坐标相交的项带回来,沿着一个Y坐标
Have you considered using an index wit multiple fields? Like:
你有没有考虑过使用一个索引?如:
CREATE INDEX spacial_search ON spatial_feature(y, x)
If you are working with a limited set of y
values this is the way to.
如果你用的是有限的y值,这是一种方法。
#3
0
I have a degree in Genetics and I am a programmer, you don't need be using an X and a Y as your nomenclature it'll get far too faffy... you need a start and a stop position (not an "axis") and a chromosome number. You index by the chromosome number first then the position and then you index the position then the chromosome. (Question: Are you dealing in eukaryotes or chromosomes that can have two reading frames?)
我有遗传学的学位,我是一名程序员,你不需要用X和Y作为你的命名法,这会让你觉得很奇怪……你需要一个开始和停止位置(不是“轴”)和一个染色体号码。首先是染色体数目,然后是位置然后是染色体。(问题:你是研究真核生物还是有两个阅读框的染色体?)
EG: (where "x" = position and "y" = chromosome)
(x =位置,y =染色体)
CREATE INDEX spatial_index_1 ON spatial_feature(chromosome, position);
CREATE INDEX spatial_index_2 ON spatial_feature(position, chromosome);
Incidentally Chromosomes are very long strings (just like data) you can (to speed things up dump them as such as blobs (i.e. coding genes and junk DNA)
顺便说一下,染色体是非常长的字符串(就像数据一样),你可以(为了加快数据的速度,把它们像blob一样(即编码基因和垃圾DNA)
#4
0
Are you sure a relational database is the way to go? If I were you I'd look at pushing your datasets to Solr or Elastic Search (probably storing the master datasets elsewhere). These engines are built for indexing, you will notice the difference in response times.
您确定要使用关系数据库吗?如果我是你,我会考虑将数据集推送到Solr或弹性搜索(可能将主数据集存储在其他地方)。这些引擎是为索引而构建的,您将注意到响应时间的差异。