Why do 6000 DB updates take several minutes?

Date: 2021-11-07 05:54:56

I am writing a test program with Ruby and ActiveRecord, and it reads a document that is about 6000 words long. I then tally up the words with:

recordWord = Word.find_by_s(word)
if recordWord.nil?
  recordWord = Word.new
  recordWord.s = word
end
if recordWord.count.nil?
  recordWord.count = 1
else
  recordWord.count += 1
end
recordWord.save

This part loops 6000 times, and it takes at least a few minutes to run using SQLite3. Is that normal? I was expecting it to finish within a couple of seconds. Could MySQL speed it up a lot?

7 Answers

#1


Take a look at AR:Extensions as well; it can handle the bulk insertions.

http://rubypond.com/articles/2008/06/18/bulk-insertion-of-data-with-activerecord/

#2


With 6000 calls to write to the database, you're going to see speed issues. I would save the various tallies in memory and save to the database once at the end, not 6000 times along the way.

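The in-memory tally can be sketched in plain Ruby: count into a Hash while reading, then write each distinct word once at the end. The commented-out `Word` calls mirror the question's model and are only illustrative.

```ruby
# Tally all occurrences in memory; the database is only touched afterwards.
counts = Hash.new(0)
text = "the cat sat on the mat"
text.scan(/[\w']+/) { |w| counts[w.downcase] += 1 }

counts  # one entry per distinct word, e.g. counts["the"] == 2

# Then one write per distinct word instead of one per occurrence:
# counts.each do |word, n|
#   recordWord = Word.find_by_s(word) || Word.new(s: word, count: 0)
#   recordWord.count += n
#   recordWord.save
# end
```

For 6000 words of ordinary text this collapses thousands of writes into a few hundred (one per distinct word).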
#3


I wrote up some quick Perl code that simply does the following:

  1. Create the database
  2. Insert a record that only contains a single integer
  3. Retrieve the most recent record and verify that it returns what it inserted

It does steps #2 and #3 6000 times. This is obviously a considerably lighter workload than going through an entire object/relational bridge. Even for this trivial case, SQLite still took 17 seconds to execute, so your desire to have it take "a couple of seconds" is not realistic on traditional hardware.

Using the monitor I verified that it was primarily disk activity that was slowing it down. Based on that, if for some reason you really do need the database to behave that quickly, I suggest one of two options:

  1. Do what people have suggested and find a way around the requirement
  2. Try buying some solid-state disks.

I think #1 is a good way to start :)

Code:

#!/usr/bin/perl

use warnings;
use strict;

use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=/tmp/dbfile', '', '');

create_database($dbh);
insert_data($dbh);

sub insert_data {
  my ($dbh) = @_;

  my $insert_sql = "INSERT INTO test_table (test_data) values (?)";
  my $retrieve_sql = "SELECT test_data FROM test_table WHERE test_data = ?";

  my $insert_sth = $dbh->prepare($insert_sql);
  my $retrieve_sth = $dbh->prepare($retrieve_sql);

  my $i = 0;
  while (++$i <= 6000) {   # <= so the loop really runs 6000 times
     $insert_sth->execute(($i));
     $retrieve_sth->execute(($i));

     my $hash_ref = $retrieve_sth->fetchrow_hashref;

     die "bad data!" unless $hash_ref->{'test_data'} == $i;
  }
}

sub create_database {
   my ($dbh) = @_;

   my $status = $dbh->do("DROP TABLE test_table");
   # warn if DROP failed (e.g. the table did not exist yet)
   if (!defined $status) {
     print "DROP TABLE failed\n";
   }

   my $create_statement = "CREATE TABLE test_table (id INTEGER PRIMARY KEY AUTOINCREMENT, \n";
   $create_statement .= "test_data varchar(255)\n";
   $create_statement .= ");";

   $status = $dbh->do($create_statement);

   # return error status if CREATE resulted in error
   if (!defined $status) {
     die "CREATE failed";
   }
}

#4


What kind of database connection are you using? Some databases allow you to connect 'directly' rather than using a TCP connection that goes through the network stack. In other words, if you're connecting over the network and sending data that way, it can slow things down.

Another way to boost performance of a database connection is to group SQL statements together in a single command.

For example, if you make a single 6,000-line SQL statement that looks like this

"update words set count = count + 1 where word = 'the'
update words set count = count + 1 where word = 'in'
...
update words set count = count + 1 where word = 'copacetic'" 

and run that as a single command, performance will be a lot better. By default, MySQL has a packet-size limit (max_allowed_packet) of 1 megabyte, but you can raise it in the my.ini file if you want.

Since you're abstracting away your database calls through ActiveRecord, you don't have much control over how the commands are issued, so it can be difficult to optimize your code.

Another thing you could do would be to keep the word counts in memory and then only insert the final totals into the database, rather than doing an update every time you come across a word. That will cut down the number of writes by a lot, because doing an update every time you come across the word 'the' is a huge waste. Words have a 'long tail' distribution, and the most common words are hugely more common than the obscure ones. The underlying SQL would then look more like this:

"update words set count = 300 where word = 'the'
update words set count = 250 where word = 'in'
...
update words set count = 1 where word = 'copacetic'" 

If you're worried about taking up too much memory, you could count words and periodically 'flush' the tallies: read a couple of megabytes of text, then spend a few seconds updating the totals, rather than updating each word every time you encounter it. If you want to improve performance even more, consider issuing the SQL commands in batches directly.
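
The batching idea can be sketched in Ruby by building one multi-statement command from an in-memory tally (`counts` is assumed to already hold the final totals; real code must also escape the words and enable multi-statement execution on the connection).

```ruby
# Final totals, as produced by an in-memory tally.
counts = { "the" => 300, "in" => 250, "copacetic" => 1 }

# One UPDATE per distinct word, joined into a single command string.
# NOTE: interpolating words directly is only safe for trusted input;
# use proper quoting/placeholders in real code.
sql = counts.map { |word, n|
  "update words set count = #{n} where word = '#{word}'"
}.join(";\n")

puts sql
```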

#5


Without knowing much about Ruby and SQLite, some general hints:

Create a unique index on Word.s (you did not state whether you have one).

Define a default for Word.count in the database (DEFAULT 1).

Optimize the assignment of count:

recordWord = Word.find_by_s(word)
if recordWord.nil?
  recordWord = Word.new
  recordWord.s = word
  recordWord.count = 1
else
  recordWord.count += 1
end
recordWord.save

#6


Use BEGIN TRANSACTION before your updates then COMMIT at the end.

#7


OK, I found some general rules:

1) Use a hash to keep the counts first, not the DB.
2) At the end, wrap all the inserts or updates in one transaction, so that it doesn't hit the DB 6000 times.
