The Ultimate MySQL Legacy Database Nightmare

Date: 2023-02-05 14:11:28

Table1: Everything including the kitchen sink. Dates in the wrong format (year last, so you cannot sort on that column), numbers stored as VARCHAR, complete addresses in the 'street' column, firstname and lastname in the firstname column, city in the lastname column, incomplete addresses, rows that update preceding rows by moving data from one field to another based on some set of rules that has changed over the years, duplicate records, incomplete records, garbage records... you name it... oh, and of course not a TIMESTAMP or PRIMARY KEY column in sight.

Table2: Any hope of normalization went out the window upon cracking this baby open. We have a row for each entry AND update of rows in table one. So duplicates like there is no tomorrow (800MB worth), and columns like Phone1 Phone2 Phone3 Phone4 ... Phone15 (they are not actually called phone; I use this for illustration). The foreign key is... well, take a guess. There are three candidates, depending on what kind of data was in the row in table1.

Table3: Can it get any worse? Oh yes. The "foreign key" is a VARCHAR column combination of dashes, dots, numbers and letters! If that doesn't provide the match (which it often doesn't), then a second column of similar product code should. Columns that have names that bear NO correlation to the data within them, and the obligatory Phone1 Phone2 Phone3 Phone4... Phone15. There are columns duplicated from Table1, and not a TIMESTAMP or PRIMARY KEY column in sight.

Table4: was described as a work in progress and subject to change at any moment. It is essentially similar to the others.

At close to one million rows this is a BIG mess. Luckily it is not my big mess. Unluckily, I have to pull out of it a composite record for each "customer".

Initially I devised a four-step translation of Table1, adding a PRIMARY KEY and converting all the dates into a sortable format. Then came a couple more steps of queries that returned filtered data, until I had Table1 to where I could use it to pull from the other tables and form the composite. After weeks of work I got this down to one step using some tricks. So now I can point my app at the mess and pull out a nice clean table of composited data. Luckily I only need one of the phone numbers for my purposes, so normalizing my table is not an issue.
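
For illustration, the first of those translation steps might look something like this (table and column names here are made up, not the real schema):

```sql
-- Hypothetical staging pass: copy Table1, bolt on a surrogate PRIMARY KEY,
-- and add a DATE column that actually sorts.
CREATE TABLE table1_staged AS SELECT * FROM table1;

ALTER TABLE table1_staged
    ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST,
    ADD COLUMN entry_date DATE NULL;

-- Assumes the raw year-last dates look like '31-12-2008' (day-month-year).
UPDATE table1_staged
SET entry_date = STR_TO_DATE(entry_date_raw, '%d-%m-%Y');
```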

However this is where the real task begins, because every day hundreds of employees add/update/delete this database in ways you don't want to imagine and every night I must retrieve the new rows.

Since existing rows in any of the tables can be changed, and since there are no TIMESTAMP ON UPDATE columns, I will have to resort to the logs to know what has happened. Of course this assumes that there is a binary log, which there is not!
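
(For the record, if they ever come around, enabling the binary log is just a couple of lines in my.cnf plus a server restart; the setting below is standard MySQL, the log base name is theirs to choose:)

```ini
# my.cnf on their server: log every INSERT/UPDATE/DELETE so that
# the day's changes can be replayed or inspected elsewhere.
[mysqld]
log-bin = mysql-bin
```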

Introducing the concept went down like a lead balloon. I might as well have told them that their children were going to have to undergo experimental surgery. They are not exactly hi-tech... in case you hadn't gathered...

The situation is a little delicate as they have some valuable information that my company wants badly. I have been sent down by senior management of a large corporation (you know how they are) to "make it happen".

I can't think of any other way to handle the nightly updates than parsing the bin log file with yet another application, to figure out what they have done to that database during the day and then composite my table accordingly. I really only need to look at their table1 to figure out what to do to my table. The other tables just provide fields to flesh out the record. (Using MASTER/SLAVE replication won't help, because I would just have a duplicate of the mess.)
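
At least the parsing wouldn't have to start entirely from scratch: MySQL ships a mysqlbinlog utility that decodes a binary log into plain SQL text, which the "yet another application" could then filter for table1. A sketch, with an invented log file name and date range:

```sh
# Dump one day's worth of changes as readable SQL; a downstream
# script can then pick out the statements that touch table1.
mysqlbinlog --start-datetime="2009-06-01 00:00:00" \
            --stop-datetime="2009-06-02 00:00:00" \
            mysql-bin.000042 > todays-changes.sql
```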

The alternative is to create a unique hash for every row of their table1 and build a hash table. Then I would go through the ENTIRE database every night, checking to see if the hashes match. If they do not, then I would read that record and check whether it exists in my database; if it does, then I would update it in my database; if it doesn't, then it's a new record and I would INSERT it. This is ugly and not fast, but parsing a binary log file is not pretty either.
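
MySQL can compute such a row hash server-side, which at least keeps the nightly scan out of application code. Something along these lines, with a hypothetical column list (CONCAT_WS is used because it skips NULLs instead of nulling out the whole hash):

```sql
-- Fingerprint every row of their table1; compare tonight's hashes
-- against last night's to spot new and changed rows.
SELECT MD5(CONCAT_WS('|', entry_date, street, firstname, lastname,
                     phone1, phone15)) AS row_hash
FROM table1;
```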

I have written this to help get clear about the problem. Often telling it to someone else helps clarify the problem, making a solution more obvious. In this case I just have a bigger headache!

Your thoughts would be greatly appreciated.

4 Answers

#1

The Log Files (binary logs) were my first thought too. If you knew how they did things, you would shudder. For every row there are many, many entries in the log as pieces are added and changed. It's just HUGE! For now I settled upon the Hash approach. With some clever file memory paging this is quite fast.

#2

I am not a MySQL person, so this is coming out of left field.

But I think the log files might be the answer.

Thankfully, you really only need to know 2 things from the log.

You need the record/rowid, and you need the operation.

In most DBs, and I assume MySQL as well, there's an implicit column on each row, like a rowid or recordid, or whatever. It's the internal row number used by the database. This is your "free" primary key.

Next, you need the operation: namely, whether it's an insert, update, or delete operation on the row.

You consolidate all of this information, in time order, and then run through it.

For each insert/update, you select the row from your original DB, and insert/update that row in your destination DB. If it's a delete, then you delete the row.
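
In MySQL the insert and update cases can even collapse into one statement, assuming the destination table carries the key. A hedged sketch with hypothetical table and column names:

```sql
-- Upsert the source row into the destination (insert case and update
-- case in one). Assumes dest_table1 has `id` as its PRIMARY KEY.
INSERT INTO dest_table1 (id, entry_date, street, firstname, lastname)
SELECT id, entry_date, street, firstname, lastname
FROM src_table1 WHERE id = 12345
ON DUPLICATE KEY UPDATE
    entry_date = VALUES(entry_date),
    street     = VALUES(street),
    firstname  = VALUES(firstname),
    lastname   = VALUES(lastname);

-- Delete case.
DELETE FROM dest_table1 WHERE id = 12345;
```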

You don't care about field values; they're just not important. Do the whole row.

You hopefully shouldn't have to "parse" binary log files yourself; MySQL must already have routines to do that, you just need to find them and figure out how to use them (there may even be some handy "dump log" utility you could use).

This lets you keep the system pretty simple, and it should only depend on your actual activity during the day, rather than the total DB size. Finally, you could later optimize it by making it "smarter". For example, perhaps they insert a row, then update it, then delete it. You would know you can just ignore that row completely in your replay.

Obviously this takes a bit of arcane knowledge in order to actually read the log files, but the rest should be straightforward. I would like to think that the log files are timestamped as well, so you can know to work on rows "from today", or whatever date range you want.

#3

Can't you use the existing code which accesses this database and adapt it to your needs? Of course the code must be horrible, but it might handle the database structure for you, no? Then you could hopefully concentrate on getting your work done instead of playing archaeologist.

#4

You might be able to use maatkit's mk-table-sync tool to synchronise a staging database (your database is only very small, after all). This will "duplicate the mess".
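
Something like the following invocation, if I remember the maatkit DSN syntax right (host, database, and table names are placeholders; check the mk-table-sync docs before running anything):

```sh
# Make the local staging copy of table1 match the legacy server's copy.
mk-table-sync --execute \
    h=legacy-server,D=legacydb,t=table1 \
    h=localhost,D=staging
```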

You could then write something that, after the sync, does various queries to generate a set of more sane tables that you can then report off.
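
For instance (again with purely illustrative names), one cleanup pass could fix the dates and collapse duplicates in a single statement:

```sql
-- Build a saner customers table off the synced copy: real DATEs,
-- one row per (firstname, lastname, street) combination.
CREATE TABLE sane_customers AS
SELECT MIN(STR_TO_DATE(entry_date_raw, '%d-%m-%Y')) AS first_seen,
       firstname, lastname, street
FROM staging_table1
GROUP BY firstname, lastname, street;
```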

I imagine that this could be done on a daily basis without a performance problem.

Doing it all off a different server will avoid impacting the original database.

The only problem I can see is if some of the tables don't have primary keys.
