I'm looking for advice on either optimizing multiple self-joins, or a better table/DB design.
One of the tables looks as follows (relevant cols only):
CREATE TABLE IF NOT EXISTS CountryData (
    countryDataID INT PRIMARY KEY AUTO_INCREMENT,
    dataID INT NOT NULL REFERENCES DataSources (dataID),
    dataCode VARCHAR(30) NULL,
    countryID INT NOT NULL REFERENCES Countries (countryID),
    year INT NOT NULL,
    data DEC(20,4) NULL,
    INDEX countryDataYear (dataID, countryID, year));
The data column has values for a few hundred indicators, 90 countries, and 30 years, for about 1 million rows in total. A standard query selects N indicators for a particular year and C countries, yielding a C x N table with at most 90 rows.
With all values in a single column, self-joins seemed like the way to go. So I have experimented with various suggestions to speed those up, including indexing and creating new (temp) tables. At 9 self-joins, the query takes a little under 1 min. Beyond that, it spins forever.
The new table that the self-joins run against has only about 1,000 rows and is indexed on what seem to be the relevant columns. Creating it takes about 0.5 sec:
CREATE TABLE Growth
    SELECT dataID, countryID, year, data
    FROM CountryData
    WHERE dataID > 522 AND year = 2017;

CREATE INDEX growth_ix
    ON Growth (dataID, countryID);
The SELECT query then arranges up to XX indicators in the results table, with XX unfortunately limited to fewer than 10:
SELECT
    Countries.countryName AS Country,
    em01.em,
    em02.em,
    em03.em,
    ...
    emXX.em
FROM
    (SELECT
        em1.data AS em,
        em1.countryID
    FROM Growth AS em1
    WHERE
        em1.dataID = 523) AS em01
JOIN
    (SELECT
        em2.data AS em,
        em2.countryID
    FROM Growth AS em2
    WHERE
        em2.dataID = 524) AS em02
    USING (countryID)
JOIN
    (SELECT
        em3.data AS em,
        em3.countryID
    FROM Growth AS em3
    WHERE
        em3.dataID = 525) AS em03
    USING (countryID)
...
JOIN
    (SELECT
        emX.data AS em,
        emX.countryID
    FROM Growth AS emX
    WHERE
        emX.dataID = 527) AS emXX
    USING (countryID)
JOIN Countries
    USING (countryID)
I'd actually like to retrieve a few more variables, plus potentially join other tables. Now I'm wondering whether there's a way to run this more efficiently, or whether I should take an altogether different approach, such as using wide tables with indicators in different columns to avoid self-joins.
1 solution
Is the dataID unique for a given countryID and year, or can a dataID appear multiple times with different values? If it is unique, you could try something like this:
SELECT countryID, year
,MAX( CASE WHEN dataID = 523 THEN data ELSE NULL END ) AS em0
,MAX( CASE WHEN dataID = 524 THEN data ELSE NULL END ) AS em1
,...
FROM CountryData
GROUP BY countryID, year
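As written, the query above aggregates the whole table. A minimal sketch of how it might be narrowed to the question's standard case (one year, a handful of indicators, country names attached) follows. The first statement checks the uniqueness assumption; the column aliases em523, em524, em525 and the table aliases d and c are illustrative names, not from the original schema, and the year and dataID values are simply the ones used in the question.

```sql
-- Check the uniqueness assumption: any row returned here means a
-- (dataID, countryID, year) combination occurs more than once,
-- in which case MAX() below would silently pick one of the values.
SELECT dataID, countryID, year, COUNT(*) AS n
FROM CountryData
GROUP BY dataID, countryID, year
HAVING COUNT(*) > 1;

-- Pivot three indicators for one year: one row per country,
-- one column per indicator (conditional aggregation).
SELECT
    c.countryName AS Country,
    MAX(CASE WHEN d.dataID = 523 THEN d.data END) AS em523,
    MAX(CASE WHEN d.dataID = 524 THEN d.data END) AS em524,
    MAX(CASE WHEN d.dataID = 525 THEN d.data END) AS em525
FROM CountryData AS d
JOIN Countries AS c ON c.countryID = d.countryID
WHERE d.year = 2017
  AND d.dataID IN (523, 524, 525)
GROUP BY c.countryID, c.countryName;
```

One difference from the self-join version is worth noting: a country that is missing one of the indicators still appears here, with NULL in that column, whereas the chain of inner joins would drop it entirely. Adding an indicator only adds one more CASE line and one more value in the IN list, so the cost should grow roughly linearly rather than with each extra join.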
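If the self-join shape has to stay, it may be worth trying the joins directly against Growth rather than through derived tables. This is a sketch under the assumption that the slowdown comes from the subqueries: older MySQL versions materialize each derived table without an index, so growth_ix cannot help the joins between them. Running EXPLAIN on both forms would confirm whether that is actually what happens on your server. The aliases c, em1 to em3, and em523 to em525 are illustrative.

```sql
-- Direct self-joins: each alias of Growth is looked up through
-- growth_ix (dataID, countryID) instead of a materialized subquery.
SELECT
    c.countryName AS Country,
    em1.data AS em523,
    em2.data AS em524,
    em3.data AS em525
FROM Countries AS c
JOIN Growth AS em1
    ON em1.countryID = c.countryID AND em1.dataID = 523
JOIN Growth AS em2
    ON em2.countryID = c.countryID AND em2.dataID = 524
JOIN Growth AS em3
    ON em3.countryID = c.countryID AND em3.dataID = 525;
```

Like the original query, this returns only countries that have a row for every requested indicator. Giving each column its own alias also avoids several result columns all being named em.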