Consider the following table:
+-------------+-------------+------+-----+---------+----------------+
| Field       | Type        | Null | Key | Default | Extra          |
+-------------+-------------+------+-----+---------+----------------+
| id          | int(11)     | NO   | PRI | NULL    | auto_increment |
| date        | date        | NO   |     | NULL    |                |
| sku         | varchar(10) | NO   |     | NULL    |                |
| impressions | int(11)     | NO   |     | NULL    |                |
| sales       | int(11)     | NO   |     | NULL    |                |
+-------------+-------------+------+-----+---------+----------------+
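For reference, that DESCRIBE output corresponds to roughly the following DDL (a sketch; the table name is a placeholder, since the question never names the table):

CREATE TABLE orders (
    id          INT(11)     NOT NULL AUTO_INCREMENT,  -- surrogate key
    `date`      DATE        NOT NULL,                 -- reporting date
    sku         VARCHAR(10) NOT NULL,
    impressions INT(11)     NOT NULL,
    sales       INT(11)     NOT NULL,
    PRIMARY KEY (id)
);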
The table gets populated daily from a bulk download of the previous day's sales records.
Each day's download not only contains the previous day's sales data but also all data from the last 90 days (possibly 50k+ records).
However, the data for previous days may change after the original insert due to matters outside our control, e.g.:
Day 1.
Date: 2015-01-01
SKU: ABCD
Impressions: 100
Sales: 0
Day 2.
Date: 2015-01-01
SKU: ABCD
Impressions: 100
Sales: 3
Date: 2015-01-02
SKU: ABCD
Impressions: 105
Sales: 0
So for any given record from the data download, it could be:

a) Already seen and the same as before - ignore
b) New - add to the database
c) Already seen but with new data - update
Arguably this could be trivially solved by checking each row like so:
while (!$file->eof()) {
    $row = trim($file->fgets());
    if ($row === '') {
        continue;
    }
    list($sku, $date, $impressions, $sales) = explode("\t", $row);

    $order = $em->getRepository('Orders')->findOneBy(['sku' => $sku, 'date' => $date]);

    if (!$order) {
        // b) New - create a model for it
        $order = new Orders();
        $order->setSku($sku);
        $order->setDate(new \DateTime($date));
    } elseif ($order->getImpressions() == $impressions && $order->getSales() == $sales) {
        // a) Already seen and the same as before - ignore
        continue;
    }

    // b) and c) New or changed - store the latest figures
    $order->setImpressions($impressions);
    $order->setSales($sales);
    $em->persist($order);
}
$em->flush();
However, the rows with updated data will be minimal, and doing a SELECT for each and every row would make this job incredibly slow due to the sheer number of rows.
So my question is: what patterns could be used to solve this problem as efficiently as possible?
Any ideas welcome.
2 Answers
#1
I would suggest you completely replace the previous 90 days' data with the newly downloaded data.
The reasoning is simple:
- The processing time to do this will be trivial. 50,000 rows is tiny in database terms. I would probably do this even if it were a million rows.
- Trying to replace only the changed rows is complicated and could introduce errors.
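A minimal sketch of this wipe-and-reload approach with plain PDO (the connection details, file path, and table name orders are placeholders, and the tab-separated column order sku, date, impressions, sales is taken from the question's own snippet):

<?php
// Sketch: wipe and reload the last 90 days inside a single transaction.
// Placeholders (not from the question): DSN/credentials, file path, table name.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pdo->beginTransaction();
try {
    // Drop everything the download is about to re-supply.
    $pdo->exec('DELETE FROM orders WHERE `date` >= CURDATE() - INTERVAL 90 DAY');

    $insert = $pdo->prepare(
        'INSERT INTO orders (sku, `date`, impressions, sales) VALUES (?, ?, ?, ?)'
    );

    $file = new SplFileObject('/path/to/download.tsv');
    while (!$file->eof()) {
        $row = rtrim($file->fgets(), "\r\n");
        if ($row === '') {
            continue; // skip blank trailing lines
        }
        list($sku, $date, $impressions, $sales) = explode("\t", $row);
        $insert->execute([$sku, $date, $impressions, $sales]);
    }

    $pdo->commit();
} catch (Exception $e) {
    $pdo->rollBack(); // never leave the table half-loaded
    throw $e;
}

Doing the DELETE and the re-inserts in one transaction means readers never see a half-loaded table; a multi-row INSERT or LOAD DATA INFILE would speed the reload up further if needed.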
#2
When you say "same as before" it seems like the keys are date and sku (combined), and sales and impressions are the fields that could be updated. If that's correct, then the most efficient way to do this in MySQL is to use an INSERT ... ON DUPLICATE KEY UPDATE ... query:
1. Create a unique key on the date and sku columns (a DDL sketch follows the example query below).
2. In your PHP script, pre-parse all data from the file (or do it in batches if you'd like).
3. Run a query similar to this (substitute actual data from the values parsed in step 2):
INSERT INTO mytable (`date`, sku, impressions, sales)
VALUES
    ('2015-01-01', 'ABCD', 100, 3),
    ('2015-01-02', 'ABCD', 100, 3),
    ...
ON DUPLICATE KEY UPDATE
    impressions = VALUES(impressions),
    sales = VALUES(sales)
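For step 1, the unique key mentioned above could be added with something like this (the index name is arbitrary, and mytable stands in for the real table name as in the query above):

ALTER TABLE mytable
    ADD UNIQUE KEY uniq_date_sku (`date`, sku);
-- Note: this fails if the table already holds duplicate (date, sku) pairs,
-- so deduplicate existing rows first.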
A couple of notes:
- check out the documentation for this syntax
- if the next day's data update containing a previous date's record were supplementary, you could do sales = sales + VALUES(sales), but I don't think that's the case for you
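A hedged sketch of steps 2 and 3 in PHP, building the multi-row upsert in batches with bound placeholders (the PDO connection details, file path, table name mytable, and the 1,000-row batch size are all my own assumptions):

<?php
// Sketch: batched INSERT ... ON DUPLICATE KEY UPDATE via PDO.
// Assumed (not from the answer): connection details, table name `mytable`,
// file path, a 1,000-row batch size, and the file column order
// sku, date, impressions, sales from the question's snippet.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8', 'user', 'pass',
    [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

function flushBatch(PDO $pdo, array $batch)
{
    if (empty($batch)) {
        return;
    }
    // One "(?, ?, ?, ?)" group per buffered row.
    $placeholders = implode(', ', array_fill(0, count($batch), '(?, ?, ?, ?)'));
    $sql = "INSERT INTO mytable (`date`, sku, impressions, sales)
            VALUES $placeholders
            ON DUPLICATE KEY UPDATE
                impressions = VALUES(impressions),
                sales = VALUES(sales)";
    // Flatten [[date, sku, impressions, sales], ...] into one flat parameter list.
    $pdo->prepare($sql)->execute(array_merge(...$batch));
}

$file  = new SplFileObject('/path/to/download.tsv');
$batch = [];
while (!$file->eof()) {
    $row = rtrim($file->fgets(), "\r\n");
    if ($row === '') {
        continue;
    }
    list($sku, $date, $impressions, $sales) = explode("\t", $row);
    $batch[] = [$date, $sku, $impressions, $sales];

    if (count($batch) >= 1000) {
        flushBatch($pdo, $batch);
        $batch = [];
    }
}
flushBatch($pdo, $batch); // flush whatever is left over

Batching keeps each statement comfortably within MySQL's max_allowed_packet limit while still cutting the round trips from 50k+ single-row statements to a few dozen.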