I've developed an application for one of my clients. He already has an existing one, so I need to convert his current database (SQL Server) to the new one (MySQL).
Some of the SQL Server tables have more than 10,000,000 records. When I started developing this converter, I began with tables that have only a few records, so I would just fetch all records and save them to my new MySQL database. I'll show you some code for better understanding (this is just an example):
<?php

namespace Converter\Model;

class PostConverter extends AbstractConverter
{
    public function convert()
    {
        // this is the default connection, a MySQL database (new application)
        $em = $this->getEntityManager();

        // this returns an alternative connection to the SQL Server database (current application)
        $emAlternative = $this->getEntityManagerAlternative();

        // instance of Converter\Repository\Post
        $repository = $emAlternative->getRepository('Converter\Entity\Post');

        $posts = $repository->findAll();

        foreach ($posts as $object) {
            $post = new Post();
            $post->setTitle($object->getTitle());
            $em->persist($post);
        }

        $em->flush();
    }
}
Now let's suppose that the Post table has more than 10,000,000 records. I can't just fetch them all and iterate over the result, because I'd run out of RAM. So I did something like this.
Repository class:
<?php

namespace Converter\Repository;

class Posts extends \Doctrine\ORM\EntityRepository
{
    public function findPosts($limit, $offset)
    {
        $qb = $this->createQueryBuilder('Post');
        $qb->setMaxResults($limit);
        $qb->setFirstResult($offset);

        return $qb->getQuery()->getResult();
    }
}
Here I fetch only a few posts at a time, inside the while loop. But it's kind of slow, and I couldn't find a better solution to improve the performance:
<?php

namespace Converter\Model;

class PostConverter extends AbstractConverter
{
    public function convert()
    {
        $em = $this->getEntityManager();
        $emAlternative = $this->getEntityManagerAlternative();

        $repository = $emAlternative->getRepository('Converter\Entity\Post');

        $limit  = 1000;
        $offset = 0;

        while ($objects = $repository->findPosts($limit, $offset)) {
            foreach ($objects as $object) {
                $post = new Post();
                $post->setTitle($object->getTitle());
                $em->persist($post);
            }
            $offset += $limit;
        }

        $em->flush();
    }
}
I had never done anything like this before. Maybe I'm going about it the wrong way. I'd really appreciate it if some of you could point me in the right direction, so I can move on with this.
Thank you all
EDIT
I can't just dump one database into the other. What I posted here is just an example; in the conversion I have to process almost all of the data before inserting it into the new database. His current application was developed in 2005, and the database isn't even normalized.
4 Answers
#1
5
I'm currently building a data warehousing system with similar issues. Doctrine's own documentation correctly states:
An ORM tool is not primarily well-suited for mass inserts, updates or deletions. Every RDBMS has its own, most effective way of dealing with such operations and if the options outlined below are not sufficient for your purposes we recommend you use the tools for your particular RDBMS for these bulk operations.
This is how I would handle it:
- Create your empty MySQL database with Doctrine's tools.
- Make a list of all indexes and primary keys in the MySQL database and drop them. I would script this. This will remove the overhead of constant index updates until your data migration is complete.
- Write a script to copy the data. Loop through the SQL Server data in batches of a few thousand and insert into MySQL (a rough sketch follows after this list).
- Use PDO or native libraries. No Doctrine or query builders. Write the queries by hand.
- Open one connection to your SQL Server and one connection to MySQL. Keep them open for the duration of the script.
- Query in batches using LIMIT and primary key > last id. Querying using OFFSET is often slower.
- Prepare statements outside of loops to optimize query processing.
- Wrap each batch of inserts in one transaction to reduce transactional overhead.
- "Manually" check referential integrity if necessary. Your tables don't have primary keys yet.
- If you have many tables, segment your code into objects or functions so local variables can be cleared from memory and it'll be easier to debug.
- You might want to call gc_collect_cycles() periodically. If your code is broken into objects, this is an easy way to keep memory under control.
- Recreate the database indexes and primary keys. Bonus points if these were scripted from the beginning. Watch for any primary keys that can't be created due to mistakes with duplicate data.
- Test and test before opening your new MySQL database to production use. You don't want to write another script to fix data migrations later.
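Here is a rough sketch of what such a copy script could look like with plain PDO. Everything in it (DSNs, credentials, the posts(id, title) schema, the batch size) is a placeholder for illustration, not taken from the question; adjust it to the real tables:

<?php
// Sketch only: DSNs, credentials, table and column names are assumptions.
$src = new PDO('sqlsrv:Server=localhost;Database=legacy_app', 'user', 'pass');
$dst = new PDO('mysql:host=localhost;dbname=new_app;charset=utf8mb4', 'user', 'pass');
$src->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$dst->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$batchSize = 1000;
$lastId    = 0;

// Prepare both statements once, outside the loop.
// Keyset pagination ("primary key > last id") instead of OFFSET.
$select = $src->prepare(
    'SELECT TOP (' . $batchSize . ') id, title FROM posts WHERE id > :lastId ORDER BY id'
);
$insert = $dst->prepare('INSERT INTO posts (id, title) VALUES (:id, :title)');

do {
    $select->execute(['lastId' => $lastId]);
    $rows = $select->fetchAll(PDO::FETCH_ASSOC);
    if (!$rows) {
        break;
    }

    // One transaction per batch to reduce transactional overhead.
    $dst->beginTransaction();
    foreach ($rows as $row) {
        $insert->execute(['id' => $row['id'], 'title' => $row['title']]);
        $lastId = $row['id'];
    }
    $dst->commit();
} while (count($rows) === $batchSize);

Any per-record transformation the old schema needs can happen between the fetch and the insert, without ever holding more than one batch in memory.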
#2
1
If the schemas in the source database (MSSQL) and the target database (MySQL) are identical or similar, I would export the records from one database and then import them into the other using purely the database tools. Example:
- In MSSQL, for every table, export the records to CSV
- In MySQL, for every table, import the records from CSV
You can use a shell script to glue all this together and automate the process.
This export/import will be reasonably fast, as it happens at the database layer. It's also the fastest you can probably get.
Moving an entire database from the model layer is going to be slower, by definition: you're going to create a model object for every row. That said, using the model layer is a good approach when the source and target schemas diverge, because then you can use the programmatic model to adapt one schema to another.
In your specific example, you may see some improvement in performance if you unset($object) at the bottom of your while loop, though I doubt memory is the bottleneck. (I/O is.)
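Applied to the loop from the question, that is just one extra line at the end of each iteration. A minimal sketch of that part of convert(), keeping the question's structure (the batch variable is called $objects here):

while ($objects = $repository->findPosts($limit, $offset)) {
    foreach ($objects as $object) {
        $post = new Post();
        $post->setTitle($object->getTitle());
        $em->persist($post);
    }
    $offset += $limit;
    // release the batch that was just processed before fetching the next one
    unset($objects);
}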
#3
1
I've tried this approach before and from my experience, it is always faster to utilize the DBMS native data dumping and restore tools rather than process records through a framework like this.
I would suggest using a utility such as bcp (https://msdn.microsoft.com/en-us/library/aa337544.aspx) to dump the data out of SQL Server, and then using MySQL's LOAD DATA (http://dev.mysql.com/doc/refman/5.7/en/load-data.html) or mysqlimport to bring the data into MySQL.
If you need to re-structure the data before it's loaded into MySQL, you could do that by setting up the new data structure in MySQL and then manipulating the data to be imported with a utility that can search and replace, like sed.
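If the import step needs to be driven from PHP rather than the mysql command-line client, the LOAD DATA statement can also be issued through PDO. This is only a sketch, assuming a tab-separated file; the path, table, column list, and terminators have to match whatever bcp was told to produce:

<?php
// Sketch: file path, table, columns, and field/line terminators are assumptions.
$pdo = new PDO(
    'mysql:host=localhost;dbname=new_app;charset=utf8mb4',
    'user',
    'pass',
    [PDO::MYSQL_ATTR_LOCAL_INFILE => true] // LOCAL INFILE must also be allowed on the server
);

$rows = $pdo->exec(
    "LOAD DATA LOCAL INFILE '/tmp/posts.csv'
     INTO TABLE posts
     FIELDS TERMINATED BY '\t'
     LINES TERMINATED BY '\n'
     (id, title)"
);

echo "Imported $rows rows\n";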
#4
1
Prefer the DBMS's native data dumping and restore tools rather than processing records through a framework like this. Export the database in CSV format and import it into MySQL.