Optimize multiple self-JOINs or redesign the DB?

Time: 2023-01-03 00:15:09

I'm looking for advice on either optimizing multiple self-joins, or a better table/DB design.

One of the tables looks as follows (relevant columns only):

CREATE TABLE IF NOT EXISTS CountryData (
  countryDataID INT PRIMARY KEY AUTO_INCREMENT,
  dataID INT NOT NULL REFERENCES DataSources (dataID),
  dataCode VARCHAR(30) NULL,
  countryID INT NOT NULL REFERENCES Countries (countryID),
  year INT NOT NULL,
  data DEC(20,4) NULL,
  INDEX countryDataYear (dataID, countryID, year));

The data column holds values for a few hundred indicators, 90 countries, and 30 years, about 1 million rows in total. A standard query selects N indicators for a particular year and C countries, yielding a C x N table of at most 90 rows.

With all values in a single column, self-joins seemed like the way to go, so I have experimented with various suggestions to speed them up, including indexing and creating new (temporary) tables. With 9 self-joins, the query takes a little under 1 minute. Beyond that, it runs indefinitely.

The new table that the self-joins run against has only about 1,000 rows and is indexed on what seem to be the relevant columns. Creating it takes about 0.5 seconds:

CREATE TABLE Growth
    SELECT dataID, countryID, year, data
    FROM CountryData
    WHERE dataID > 522 AND year = 2017;

CREATE INDEX growth_ix 
    ON Growth (dataID, countryID);

The SELECT query then arranges up to XX indicators as columns in the result table, where XX is unfortunately limited to fewer than 10:

SELECT 
    Countries.countryName AS Country,   
    em01.em,
    em02.em,
    em03.em,
    ...
    emXX.em
FROM    
    (SELECT
        em1.data AS em,
        em1.countryID
    FROM Growth AS em1
    WHERE
    em1.dataID = 523) as em01
    JOIN 
    (SELECT
        em2.data AS em,
        em2.countryID
    FROM Growth AS em2
    WHERE
    em2.dataID = 524) as em02
    USING (countryID)
    JOIN
    (SELECT
        em3.data AS em,
        em3.countryID
    FROM Growth AS em3
    WHERE
    em3.dataID = 525) as em03
    USING (countryID)
    ...
    JOIN
    (SELECT
        emX.data AS em,
        emX.countryID
    FROM Growth AS emX
    WHERE
    emX.dataID = 527) as emXX
    USING (countryID)
    JOIN Countries 
    USING (countryID)

I'd actually like to retrieve a few more variables, plus potentially join other tables. Now I'm wondering whether there's a way to run this more efficiently, or whether I should take an altogether different approach, such as using wide tables with indicators in different columns to avoid self-joins.
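
One thing that may be worth trying before redesigning the schema, sketched under the assumption that the slowdown comes from the derived tables: older MySQL versions materialize each subquery in FROM as a temporary table without indexes, in which case growth_ix is not used by the joins. Joining Growth directly and moving each dataID filter into the ON clause avoids that. The aliases c and em01 to em03 are illustrative, and only three indicators are shown.

```sql
-- Sketch only: same result shape as the query above, but each self-join
-- reads Growth directly so growth_ix (dataID, countryID) can be used.
SELECT
    c.countryName AS Country,
    em01.data AS em01,
    em02.data AS em02,
    em03.data AS em03
FROM Countries AS c
JOIN Growth AS em01
    ON em01.countryID = c.countryID AND em01.dataID = 523
JOIN Growth AS em02
    ON em02.countryID = c.countryID AND em02.dataID = 524
JOIN Growth AS em03
    ON em03.countryID = c.countryID AND em03.dataID = 525;
```

As with the original inner joins, a country that lacks any one of the indicators drops out of the result; LEFT JOIN would keep it with NULL in the missing column.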

1 solution

#1

Is the dataID unique for a given countryID and year, or can a dataID appear multiple times with different values? If it is unique, you could try conditional aggregation, something like this:

SELECT countryID, year
    ,MAX( CASE WHEN dataID = 523 THEN data ELSE NULL END ) AS em0 
    ,MAX( CASE WHEN dataID = 524 THEN data ELSE NULL END ) AS em1 
    ,...
FROM CountryData
GROUP BY countryID, year 
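
The query above aggregates every country and year in the table. A narrower sketch follows, assuming the year and indicator IDs from the question (2017 and 523 to 525) and the Countries.countryName column used there. The first statement checks the uniqueness assumption and should return no rows if each (dataID, countryID, year) combination appears once. The second filters before pivoting, so only the requested indicators and year are aggregated. The column aliases em523 to em525 are illustrative.

```sql
-- Check the assumption: list any (dataID, countryID, year) that occurs more than once.
SELECT dataID, countryID, year, COUNT(*) AS n
FROM CountryData
GROUP BY dataID, countryID, year
HAVING COUNT(*) > 1;

-- Pivot the selected indicators into columns for a single year.
SELECT
    c.countryName AS Country,
    MAX(CASE WHEN d.dataID = 523 THEN d.data END) AS em523,
    MAX(CASE WHEN d.dataID = 524 THEN d.data END) AS em524,
    MAX(CASE WHEN d.dataID = 525 THEN d.data END) AS em525
FROM CountryData AS d
JOIN Countries AS c
    ON c.countryID = d.countryID
WHERE d.year = 2017
  AND d.dataID IN (523, 524, 525)
GROUP BY c.countryID, c.countryName;
```

With this shape, each additional indicator adds one MAX(CASE ...) column and one value in the IN list rather than another join.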
