如何克服sql查询中IN cause的局限性

时间:2022-06-01 21:31:26

I have written an sql query like :

我编写了一个sql查询,如:

select field1, field2 from table_name;

The problem is this query will return 1 million records/ or more than 100k records. I have a directory in which I have input files (around 20,000 to 50,000 records) that contain field1 . This is the main data I am concerned with. Using perl script, I am extracting from the directory. But , if I write a query like :

问题是这个查询将返回100万条记录/或超过10万条记录。我有一个目录,其中有包含field1的输入文件(大约20,000到50,000条记录)。这是我所关心的主要数据。我使用perl脚本从目录中提取。但是,如果我写这样的查询:

select field1 , field2 from table_name 
where field1 in (need to write a query to take field1 from directory);

If I use IN cause then it has limitation of processing 1000 entries, then how should I overcome the limitation of IN cause?

如果我在cause中使用,那么它有处理1000个条目的限制,那么我该如何克服cause的限制呢?

4 个解决方案

#1


4  

In any DBMS, I would insert them into a temporary table and perform a JOIN to workaround the IN clause limitation on the size of the list.

在任何DBMS中,我都会将它们插入到临时表中,并执行一个连接来解决列表大小的In子句限制。

E.g.

如。

CREATE TABLE #idList
(
    ID INT
)
INSERT INTO #idList VALUES(1)
INSERT INTO #idList VALUES(2)
INSERT INTO #idList VALUES(3)

SELECT * 
FROM 
    MyTable m 
    JOIN #idList AS t 
    ON m.id = t.id

In SQL Server 2005, in one of our previous projects, we used to convert this list of values that are a result of querying another data store (lucene index) into XML and pass it as XML variable in the SQL query and convert it into a table using the nodes() function on XML data types and perform a JOIN with that.

在SQL Server 2005中,在我们之前的项目之一,我们使用这个列表的值转换的结果查询另一个数据存储(lucene索引)为XML并将其传递给SQL查询XML变量并将其转换为一个表使用XML数据类型的节点()函数并执行连接。

DECLARE @IdList XML
SELECT @idList = '
<Requests>
    <Request id="1" />
    <Request id="2" />
    <Request id="3" />
</Requests>'

SELECT * 
FROM 
    MyTable m 
    JOIN (
            SELECT id.value('(@id)[1]', 'INT') as 'id' 
            FROM @idList.nodes('/Requests/Request') as T(id)
         ) AS t 
    ON m.id = t.id

#2


1  

Vikdor is right, you shouldn't be querying this with an IN() clause, it's faster and more memory efficient to use a table to JOIN.

Vikdor是对的,您不应该使用IN()子句查询它,使用表进行连接更快、更节省内存。

Expanding on his answer I would recommend the following approach:

扩展他的回答,我建议采取以下方法:

  1. Get a list of all input files via Perl
  2. 通过Perl获取所有输入文件的列表
  3. Think of some clever way to compute a hash value for your list that is unique and based on all input files (I'd recommend the filenames or similar)
  4. 想一些聪明的方法来计算列表的哈希值,它是唯一的,并且基于所有的输入文件(我推荐文件名或类似的)
  5. This hash will serve as the name of the table that stores the input filenames (think of it as a quasi temporary table that gets discarded once the hash changes)
  6. 该散列将作为存储输入文件名的表的名称(将其视为一旦散列发生更改就被丢弃的准临时表)
  7. JOIN that table to return the correct records
  8. 加入该表以返回正确的记录

For step 2. you could either use a cronjob or compute whenever the query is actually needed (which would delay the response, though). To get this right you need to consider how likely it is that files are added/removed.

步骤2。您可以使用cronjob,也可以在需要查询时进行计算(不过这会延迟响应)。为了得到正确的结果,您需要考虑文件被添加/删除的可能性。

For step 3. you would need some logic that drops the previously generated tables once the current hash value differs from last execution, then recreate the table named after the current hash.

步骤3。当当前哈希值与上次执行不同时,您需要一些逻辑来删除先前生成的表,然后重新创建以当前哈希命名的表。

For the quasi temporary table names I'd recommend something along the lines of

对于类似的临时表名,我推荐一些类似于

input_files_XXX (.i.e. prefix_<hashvalue>)

which makes it easier to know what stale tables to drop.

这使得我们更容易知道哪些陈旧的表将被丢弃。

#3


0  

You could split your 50'000 ids in 50 lists of 1000 ids, do a query for each such list, and collect the result sets in perl.

您可以将50,000个id分成50个1000个id列表,对每个这样的列表进行查询,并使用perl收集结果集。

#4


0  

Oracle wise, the best solution with using a temporary table - which without indexing won't give you much performance is to use a nested tabled type.

Oracle wise,使用临时表的最佳解决方案是使用嵌套制表类型。

CREATE TYPE my_ntt is table of directory_rec;

创建类型my_ntt是directory_rec的表;

Then create a function f1 that returns a variable of my_ntt type and use in the query.

然后创建一个函数f1,该函数返回_myntt类型的变量并在查询中使用。

select field1 , field2 from table_name where field1 in table (cast (f1 as my_ntt));

从表中field1、field2所在的table_name中选择field1 (cast (f1为my_ntt);

#1


4  

In any DBMS, I would insert them into a temporary table and perform a JOIN to workaround the IN clause limitation on the size of the list.

在任何DBMS中,我都会将它们插入到临时表中,并执行一个连接来解决列表大小的In子句限制。

E.g.

如。

CREATE TABLE #idList
(
    ID INT
)
INSERT INTO #idList VALUES(1)
INSERT INTO #idList VALUES(2)
INSERT INTO #idList VALUES(3)

SELECT * 
FROM 
    MyTable m 
    JOIN #idList AS t 
    ON m.id = t.id

In SQL Server 2005, in one of our previous projects, we used to convert this list of values that are a result of querying another data store (lucene index) into XML and pass it as XML variable in the SQL query and convert it into a table using the nodes() function on XML data types and perform a JOIN with that.

在SQL Server 2005中,在我们之前的项目之一,我们使用这个列表的值转换的结果查询另一个数据存储(lucene索引)为XML并将其传递给SQL查询XML变量并将其转换为一个表使用XML数据类型的节点()函数并执行连接。

DECLARE @IdList XML
SELECT @idList = '
<Requests>
    <Request id="1" />
    <Request id="2" />
    <Request id="3" />
</Requests>'

SELECT * 
FROM 
    MyTable m 
    JOIN (
            SELECT id.value('(@id)[1]', 'INT') as 'id' 
            FROM @idList.nodes('/Requests/Request') as T(id)
         ) AS t 
    ON m.id = t.id

#2


1  

Vikdor is right, you shouldn't be querying this with an IN() clause, it's faster and more memory efficient to use a table to JOIN.

Vikdor是对的,您不应该使用IN()子句查询它,使用表进行连接更快、更节省内存。

Expanding on his answer I would recommend the following approach:

扩展他的回答,我建议采取以下方法:

  1. Get a list of all input files via Perl
  2. 通过Perl获取所有输入文件的列表
  3. Think of some clever way to compute a hash value for your list that is unique and based on all input files (I'd recommend the filenames or similar)
  4. 想一些聪明的方法来计算列表的哈希值,它是唯一的,并且基于所有的输入文件(我推荐文件名或类似的)
  5. This hash will serve as the name of the table that stores the input filenames (think of it as a quasi temporary table that gets discarded once the hash changes)
  6. 该散列将作为存储输入文件名的表的名称(将其视为一旦散列发生更改就被丢弃的准临时表)
  7. JOIN that table to return the correct records
  8. 加入该表以返回正确的记录

For step 2. you could either use a cronjob or compute whenever the query is actually needed (which would delay the response, though). To get this right you need to consider how likely it is that files are added/removed.

步骤2。您可以使用cronjob,也可以在需要查询时进行计算(不过这会延迟响应)。为了得到正确的结果,您需要考虑文件被添加/删除的可能性。

For step 3. you would need some logic that drops the previously generated tables once the current hash value differs from last execution, then recreate the table named after the current hash.

步骤3。当当前哈希值与上次执行不同时,您需要一些逻辑来删除先前生成的表,然后重新创建以当前哈希命名的表。

For the quasi temporary table names I'd recommend something along the lines of

对于类似的临时表名,我推荐一些类似于

input_files_XXX (.i.e. prefix_<hashvalue>)

which makes it easier to know what stale tables to drop.

这使得我们更容易知道哪些陈旧的表将被丢弃。

#3


0  

You could split your 50'000 ids in 50 lists of 1000 ids, do a query for each such list, and collect the result sets in perl.

您可以将50,000个id分成50个1000个id列表,对每个这样的列表进行查询,并使用perl收集结果集。

#4


0  

Oracle wise, the best solution with using a temporary table - which without indexing won't give you much performance is to use a nested tabled type.

Oracle wise,使用临时表的最佳解决方案是使用嵌套制表类型。

CREATE TYPE my_ntt is table of directory_rec;

创建类型my_ntt是directory_rec的表;

Then create a function f1 that returns a variable of my_ntt type and use in the query.

然后创建一个函数f1,该函数返回_myntt类型的变量并在查询中使用。

select field1 , field2 from table_name where field1 in table (cast (f1 as my_ntt));

从表中field1、field2所在的table_name中选择field1 (cast (f1为my_ntt);