Microsoft SQL Server
I need a query that returns all rows where the description column contains more than X commas. LIKE won't work, because the commas are spread out and the text between them varies. I'm not certain such a query exists at all.
Any insight or assistance with that would be greatly appreciated.
Thank you for your time.
2 solutions
#1
4
One way of counting the number of times a character appears in a string is to compare the string's length to the length of the string with this character removed.
So, e.g., assuming you want to find all the rows with 5 commas in col1:
SELECT *
FROM my_table
WHERE LEN(col1) - LEN(REPLACE(col1, ',', '')) = 5
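The question asks for rows with more than X commas rather than exactly 5, so the same expression can be compared with `>` instead of `=`. Here is a minimal sketch of that variant. The variable `@x` and the inline sample values (standing in for `my_table`) are assumptions made for illustration and are not part of the original answer.

```sql
-- Hypothetical threshold: return rows with more than @x commas.
DECLARE @x INT = 3;

SELECT v.col1,
       LEN(v.col1) - LEN(REPLACE(v.col1, ',', '')) AS comma_count
FROM (VALUES ('a,b'),                            -- 1 comma
             ('a,b,c,d,e'),                      -- 4 commas
             ('one,two,three,four,five,six')     -- 5 commas
     ) AS v(col1)                                -- sample data standing in for my_table
WHERE LEN(v.col1) - LEN(REPLACE(v.col1, ',', '')) > @x;
-- Returns the rows with 4 and 5 commas; 'a,b' is filtered out.
```

One caveat: LEN ignores trailing spaces, so a value such as 'a ,' would be over-counted once the comma is removed. For varchar columns, using DATALENGTH in place of LEN should avoid this. For nvarchar it returns bytes, so the difference would need to be divided by 2.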
#2
1
In case you are interested in performance, and would like to implement a slightly more complicated approach, I mocked up some data and did a relatively simple test:
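CREATE TABLE dbo.Test
(
    TestID INT NOT NULL CONSTRAINT PK_Test
        PRIMARY KEY CLUSTERED IDENTITY(1,1)
    , col1 VARCHAR(255) NOT NULL
    -- comma count = length of the string minus its length with commas removed
    , col1_comma_count AS LEN(col1) - LEN(REPLACE(col1, ',','')) PERSISTED
);
GO
-- GO n repeats the whole preceding batch n times, so the CREATE TABLE is kept in its own batch
INSERT INTO dbo.Test (col1) VALUES ('this, is, a, test');
GO 50000
INSERT INTO dbo.Test (col1) VALUES ('this, is, a, test, another, test');
GO 1500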
The statements above create a test table with a computed column that contains the number of commas in col1. The table then has 50,000 rows inserted where the comma count is 3, and 1,500 rows inserted where the comma count is 5.
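As a quick sanity check on the mock data, the computed column can be grouped and counted. This is a sketch that assumes the dbo.Test table and the inserts above ran as shown. The `row_count` alias is just an illustrative name.

```sql
-- Count how many rows fall under each comma count.
SELECT t.col1_comma_count,
       COUNT(*) AS row_count
FROM dbo.Test AS t
GROUP BY t.col1_comma_count
ORDER BY t.col1_comma_count;
```

If the inserts completed as described, this should report 50,000 rows with a comma count of 3 and 1,500 rows with a comma count of 5.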
I then executed the following query, with SET STATISTICS IO ON; SET STATISTICS TIME ON;:
SELECT COUNT(1)
FROM dbo.Test t
WHERE t.col1_comma_count = 5;
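For anyone reproducing the measurement, the whole run might look like the sketch below. It assumes the dbo.Test table from above. Switching the settings off again at the end is an addition here, not something the original answer shows.

```sql
-- Turn on I/O and timing statistics for this session.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(1)
FROM dbo.Test t
WHERE t.col1_comma_count = 5;

-- Switch the statistics off again so later queries are not affected.
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

In SQL Server Management Studio the logical reads and execution times appear on the Messages tab rather than in the result grid.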
The statistics output shows that 248 logical reads are needed to scan the entire table and count the rows that have 5 commas. As expected, the execution plan for this query shows SQL Server doing a clustered index scan of the entire table.
I then created an index on the persisted computed column, to show the difference:
CREATE INDEX IX_Test_col1_comma_count ON dbo.Test (col1_comma_count);
and re-ran the test query. With the index in place, the number of reads needed dropped to 6, or roughly 41 times fewer reads. On a busy system this will make a real difference. This time, the execution plan shows a much more efficient seek into the index.
If we drop both the index and the computed column from the table, we see a huge increase in time spent getting the results of the query:
DROP INDEX IX_Test_col1_comma_count ON dbo.Test;
ALTER TABLE Test DROP COLUMN col1_comma_count;
SELECT COUNT(1)
FROM dbo.Test t
WHERE LEN(col1) - LEN(REPLACE(col1, ',','')) = 5
On my computer (an Intel Core i7 at 3.4 GHz with 8 GB of RAM), STATISTICS TIME reports SQL Server Execution Times: CPU time = 15 ms, elapsed time = 24 ms.
With the index and the persisted computed column in place, the execution times are: CPU time = 0 ms, elapsed time = 2 ms.
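To restore the indexed setup after the drop above, or to apply the same idea to a table that already exists, the computed column and index can be added after the fact. This is a minimal sketch that reuses the names from this answer. On a real table the column and index names would differ, and adding a persisted column to a large table takes time because every row has to be computed and stored.

```sql
-- Add the persisted computed column to the existing table.
ALTER TABLE dbo.Test
    ADD col1_comma_count AS LEN(col1) - LEN(REPLACE(col1, ',', '')) PERSISTED;
GO

-- Index the computed column so that filters on the comma count can seek.
CREATE INDEX IX_Test_col1_comma_count ON dbo.Test (col1_comma_count);
GO
```

The two statements are kept in separate batches so that the new column exists before the index definition refers to it.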
Clearly, there is a price to pay for doing string manipulation in the WHERE clause.