We have some customer data which started out in a separate data-store. I have a consolidation script to standardize it and migrate it into our core DB. There are somewhere around 60,000-70,000 records being migrated.
Naturally, there was a little bug, and it failed around row 9k.
My next trick is to make the script able to pick up where it left off when it is run again.
FYI:
The source records are pretty icky, and split over 5 tables by which brand they purchased, i.e.:
CREATE TABLE `brand1_custs` (`id` int(9), `company_name` varchar(112), etc...)
CREATE TABLE `brand2_custs` (`id` int(9), `company_name` varchar(112), etc...)
Of course, a given company name can (and does) exist in multiple source tables.
Anyhow ... I used the ParseCSV lib for logging, and each row gets logged if successfully migrated (some rows get skipped if they are just too ugly to parse programmatically). When opening the log back up with ParseCSV, it comes in looking like:
array(
0 => array( 'row_id' => '1',
'company_name' => 'Cust A',
'blah' => 'blah',
'source_tbl' => 'brand1_cust'
),
1 => array( 'row_id' => '2',
'company_name' => 'customer B',
'blah' => 'blah',
'source_tbl' => 'brand1_cust'
),
2 => array( 'row_id' => '1',
'company_name' => 'Cust A',
'blah' => 'blah',
'source_tbl' => 'brand2_cust'
),
etc...
)
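For reference, re-reading the log back into that array is along these lines (a sketch; the filename migration_log.csv is made up for illustration):

$csv = new parseCSV();
$csv->auto('migration_log.csv'); // parse the file; the header row becomes the array keys
$log = $csv->data;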
My current workflow is along the lines of:
foreach( $source_table AS $src ){
    $results = /* get all rows from $src */;
    foreach( $results AS $row ){
        // heavy lifting
    }
}
My plan is to check the $row->id and $src->tbl combination for a match in the $log[?x?]['row_id'] and $log[?x?]['source_tbl'] combination.
In order to achieve that, I would have to run a foreach($log AS $xyz) loop inside the foreach($results AS $row) loop, and skip any rows which are found to have already been migrated (otherwise, they would get duplicated). That seems like a LOT of looping to me. What about when we get up around record #40 or 50 thousand? That would be 50k x 50k loops!!
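In other words, something along these lines (a rough sketch of the approach I'm trying to avoid, following the $src->tbl notation above):

foreach( $results AS $row ){
    $already_migrated = false;
    foreach( $log AS $entry ){
        if( $entry['row_id'] == $row->id && $entry['source_tbl'] == $src->tbl ){
            $already_migrated = true;
            break;
        }
    }
    if( $already_migrated ){
        continue; // skip rows that were already migrated
    }
    // heavy lifting
}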
Question:
Is there a better way for me to check if a sub-array has a "row_id" and "source_tbl" match other than looping each time?
NOTE: as always, if there's a completely different way I should be thinking about this, I'm open to any and all suggestions :)
1 Answer
#1
I think you should do a preprocessing pass on the log, building a hash (or composite key) from row_id and source_tbl and storing it in a hashmap; then, for each row, construct the key the same way and check whether it is already defined in the hashmap.
I am telling you to use a hashed set because you can search it in O(1) average time; otherwise it would be the same as what you are proposing, only with cleaner code.
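A minimal sketch of that idea in PHP (assuming $log is the parsed log array from your question, and that $src->tbl and $row->id are available inside your migration loop as shown):

// One-time preprocessing: build a lookup set keyed on source_tbl + row_id.
$migrated = array();
foreach( $log AS $entry ){
    $migrated[ $entry['source_tbl'] . ':' . $entry['row_id'] ] = true;
}

// Inside the migration loop, the check becomes a single hash lookup.
foreach( $results AS $row ){
    $key = $src->tbl . ':' . $row->id;
    if( isset($migrated[$key]) ){
        continue; // already migrated, skip it
    }
    // heavy lifting
}

PHP associative arrays are hash tables under the hood, so the isset() check is effectively constant time: roughly 50k lookups instead of 50k x 50k comparisons.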