Can I split a query into multiple queries, or create parallelism, to speed up a query?

I have a table avl_pool and a function that finds the link on the map nearest to a given (x, y) position.

The runtime of this select scales linearly: the function takes ~8 ms per call, so running the select over 1,000 rows takes about 8 seconds. Or, as this sample shows, 20,000 rows take about 162 seconds.

SELECT avl_id, x, y, azimuth, map.get_near_link(X, Y, AZIMUTH)
FROM avl_db.avl_pool         
WHERE avl_id between 1 AND 20000

"Index Scan using avl_pool_pkey on avl_pool  (cost=0.43..11524.76 rows=19143 width=28) (actual time=8.793..162805.384 rows=20000 loops=1)"
"  Index Cond: ((avl_id >= 1) AND (avl_id <= 20000))"
"  Buffers: shared hit=19879838"
"Planning time: 0.328 ms"
"Execution time: 162812.113 ms"

Using pgAdmin, I found that if I run each half of the range in a separate window at the same time, the execution time is actually cut in half. So it looks like the server can handle multiple requests against the same table/function without a problem.

-- window 1
SELECT avl_id, x, y, azimuth, map.get_near_link(X, Y, AZIMUTH)
FROM avl_db.avl_pool         
WHERE avl_id between 1 AND 10000 

Total query runtime: 83792 ms.

-- window 2
SELECT avl_id, x, y, azimuth, map.get_near_link(X, Y, AZIMUTH)
FROM avl_db.avl_pool         
WHERE avl_id between 10001 AND 20000

Total query runtime: 84047 ms.

So how should I approach this scenario to improve performance?

From the C# side, I guess I could create multiple threads, have each one send a portion of the range, and then join all the data in the client. So instead of one query with 20,000 rows taking 162 seconds, I could send 10 queries of 2,000 rows each and finish in ~16 seconds. Of course there may be some overhead in the join, but it shouldn't be big compared with the 160 seconds.

Or is there a different approach I should consider? A pure SQL solution would be even better.

@PeterRing I don't think the function code is relevant, but here it is anyway.

CREATE OR REPLACE FUNCTION map.get_near_link(
    x NUMERIC,
    y NUMERIC,
    azim NUMERIC)
  RETURNS map.get_near_link AS
$BODY$
DECLARE
    strPoint TEXT;
    sRow map.get_near_link;
  BEGIN
    strPoint = 'POINT('|| X || ' ' || Y || ')';
    RAISE DEBUG 'GetLink strPoint % -- Azim %', strPoint, Azim;

    WITH index_query AS (
        SELECT --Seg_ID,
               Link_ID,
               azimuth,
               TRUNC(ST_Distance(ST_GeomFromText(strPoint,4326), geom  )*100000)::INTEGER AS distance,
               sentido,
               --ST_AsText(geom),
               geom
        FROM map.vzla_seg S
        WHERE
            ABS(Azim - S.azimuth) < 30 OR
            ABS(Azim - S.azimuth) > 330
        ORDER BY
            geom <-> ST_GeomFromText(strPoint, 4326)
        LIMIT 101
    )
    SELECT i.Link_ID, i.Distance, i.Sentido, v.geom INTO sRow
    FROM
        index_query i INNER JOIN
        map.vzla_rto v ON i.link_id = v.link_id
    ORDER BY
        distance LIMIT 1;

    RAISE DEBUG 'GetLink distance % ', sRow.distance;
    IF sRow.distance > 50 THEN
        sRow.link_id = -1;
    END IF;

    RETURN sRow;
  END;
$BODY$
  LANGUAGE plpgsql IMMUTABLE
  COST 100;
ALTER FUNCTION map.get_near_link(NUMERIC, NUMERIC, NUMERIC)
  OWNER TO postgres;

3 Answers

#1

Consider marking your map.get_near_link function as PARALLEL SAFE. This tells the database engine that it is allowed to try to generate a parallel plan for queries that call the function:

PARALLEL UNSAFE indicates that the function can't be executed in parallel mode and the presence of such a function in an SQL statement forces a serial execution plan. This is the default. PARALLEL RESTRICTED indicates that the function can be executed in parallel mode, but the execution is restricted to parallel group leader. PARALLEL SAFE indicates that the function is safe to run in parallel mode without restriction.

There are several settings that can stop the query planner from generating a parallel query plan under any circumstances. See this documentation:

15.4. Parallel Safety

15.2. When Can Parallel Query Be Used?

On my reading, you may be able to get a parallel plan if you refactor your function like this:

CREATE OR REPLACE FUNCTION map.get_near_link(
    x NUMERIC,
    y NUMERIC,
    azim NUMERIC)
RETURNS TABLE
(Link_ID INTEGER, Distance INTEGER, Sentido TEXT, Geom GEOGRAPHY)
AS
$$
        SELECT 
               S.Link_ID,
               TRUNC(ST_Distance(ST_GeomFromText('POINT('|| X || ' ' || Y || ')',4326), S.geom) * 100000)::INTEGER AS distance,
               S.sentido,
               v.geom
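        FROM (
          -- The outer alias S is not visible inside this subquery,
          -- so the inner table gets its own alias.
          SELECT *
          FROM map.vzla_seg seg
          WHERE ABS(Azim - seg.azimuth) NOT BETWEEN 30 AND 330
        ) S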
          INNER JOIN map.vzla_rto v
            ON S.link_id = v.link_id
        WHERE
            ST_Distance(ST_GeomFromText('POINT('|| X || ' ' || Y || ')',4326), S.geom) * 100000 < 50
        ORDER BY
            S.geom <-> ST_GeomFromText('POINT('|| X || ' ' || Y || ')', 4326)
        LIMIT 1
$$
LANGUAGE SQL
PARALLEL SAFE -- Include this parameter
;

If the query optimiser generates a parallel plan for queries that call this function, you won't need to implement your own parallelisation logic.

#2

I have done things like this. It works relatively well. Note that each connection can handle exactly one query at a time so for each partition of your query, you have to have a separate connection. Now, in C# you could use threads to interact with each connection.
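
A minimal sketch of the one-connection-per-partition idea, written in Python rather than C# purely for illustration. The names split_range, run_partition and run_parallel are hypothetical, and run_partition is a stand-in that returns dummy rows instead of opening a real connection; a real version would run the SELECT with its own BETWEEN bounds there.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(first_id, last_id, parts):
    """Split the inclusive range [first_id, last_id] into contiguous sub-ranges."""
    total = last_id - first_id + 1
    base, extra = divmod(total, parts)
    ranges = []
    start = first_id
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first parts
        if size == 0:
            continue
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def run_partition(bounds):
    """Hypothetical stand-in for one query running on its own connection.

    A real version would open a separate connection here and execute
    SELECT ... WHERE avl_id BETWEEN low AND high, returning the rows.
    """
    low, high = bounds
    return [(avl_id, None) for avl_id in range(low, high + 1)]


def run_parallel(first_id, last_id, parts):
    """Run each sub-range on its own worker thread and merge the results in order."""
    ranges = split_range(first_id, last_id, parts)
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        chunks = list(pool.map(run_partition, ranges))  # map preserves input order
    return [row for chunk in chunks for row in chunk]


rows = run_parallel(1, 20000, 10)
print(len(rows), split_range(1, 20000, 10)[:2])
```

Because the sub-ranges are contiguous and the results are merged in order, the client-side "join" is just a concatenation.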

But another option would be to use asynchronous queries and have a single thread manage and poll your entire connection pool (this sometimes simplifies data manipulation on the application side). Note that in this case you should make sure there is a sleep or other yield point after every poll cycle.
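
A rough sketch of the single-threaded variant, again in Python for illustration only. Here the asyncio event loop takes the place of a hand-written poll loop, and each await is the yield point. fake_query is a hypothetical stand-in that sleeps instead of talking to a database; a real version would await an asynchronous driver call on its own connection.

```python
import asyncio


async def fake_query(low, high, delay=0.01):
    """Hypothetical stand-in for an asynchronous query on its own connection."""
    await asyncio.sleep(delay)  # yield point: other queries progress while this one waits
    return list(range(low, high + 1))


async def run_all(ranges):
    """Start one query per range and wait for all of them on a single thread."""
    tasks = [asyncio.create_task(fake_query(low, high)) for low, high in ranges]
    results = await asyncio.gather(*tasks)  # results come back in task order
    return [avl_id for chunk in results for avl_id in chunk]


ids = asyncio.run(run_all([(1, 10000), (10001, 20000)]))
print(len(ids))
```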

Note further that the extent to which this speeds up the query depends on your disk I/O subsystem and your CPU parallelism. So you cannot just split a query into more pieces and expect a speed-up.
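
A back-of-the-envelope sketch of that limit, in Python. It assumes the work is CPU-bound, divides evenly across connections, and that no more queries run truly at once than there are CPU cores. The function name and the 4-core figure are made up for illustration; the 8 ms per row and the 20,000 rows come from the question.

```python
def estimated_seconds(rows, ms_per_row, connections, cpu_cores):
    """Rough runtime estimate for a CPU-bound, evenly divisible workload (an assumption)."""
    effective = max(1, min(connections, cpu_cores))  # extra connections beyond the cores do not help
    return rows * ms_per_row / 1000.0 / effective


for connections in (1, 2, 10):
    print(connections, estimated_seconds(20000, 8, connections, cpu_cores=4))
```

Under these assumptions, going from 2 to 10 connections on a 4-core server only halves the runtime again instead of cutting it to a fifth. Disk I/O contention is not modelled at all.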

#3

I've done this with SSIS by creating a script that buckets each server into one of 7 different "@Mode" values. (In my case, each of the many servers is assigned a @Mode based on the last three digits of its IP, which creates fairly evenly distributed buckets.)

 (CONVERT(int, RIGHT(dbserver, 3)) % @stages) + 1 AS Mode
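
To see how the expression above spreads servers across buckets, here is the same arithmetic as a small Python sketch. The function name mode_for and the server names are hypothetical; the only assumption carried over is that each name ends in three digits.

```python
from collections import Counter


def mode_for(dbserver, stages=7):
    """Same arithmetic as the SQL expression: last three characters as an integer, modulo stages, plus 1."""
    return int(dbserver[-3:]) % stages + 1


# Hypothetical server names ending in three digits.
servers = ["10.0.0.%03d" % n for n in range(100, 121)]
print(Counter(mode_for(s) for s in servers))
```

With consecutive suffixes the buckets come out perfectly even; with real addresses the spread depends on how the last digits are distributed.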

In SSIS, I have 7 sets of the same 14 large queries running. Each set is assigned a different @Mode number that is passed to the stored procedure.

Essentially, this allows 7 simultaneous queries that never run on the same server, effectively cutting the runtime by approximately 85%.

So, create an SSIS package whose first step refreshes the @Mode table.

Then create a container that contains 7 containers. Within each of those 7 containers, execute your SQL queries with Parameter Mapping to @Mode. I point everything to stored procs, so in my case the SQLStatement field reads something like: EXEC StoredProc ?. The ? then picks up the Parameter Mapping you created for @Mode.

Finally, in the SQL query, make sure @Mode is used as the variable that selects which server to run the query against.
