How can I keep a complete history of a text field in Django?

Time: 2022-03-25 13:54:03

I'd like to have the full history of a large text field edited by users, stored using Django.

I've seen the projects:

I've a special use-case that probably falls outside the scope of what these projects provide. Further, I'm wary of how well documented, tested and updated these projects are. In any event, here's the problem I face:

I have a model, like so:

from django.db import models

class Document(models.Model):
    text_field = models.TextField()

This text field may be large (over 40k), and I would like an autosave feature that saves the field every 30 seconds or so. Obviously, this could make the database unwieldy if there are a lot of saves at 40k each (probably still 10k if zipped). The best solution I can think of is to store only the difference between the most recently saved version and the new version.

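To get a feel for the storage argument, here is a rough sketch that compares the size of a full copy, a zipped copy and a line-based diff for one small edit. The document content, the word list and the `unified_delta` helper are made up for illustration; only `difflib` and `zlib` from the standard library are used, and real numbers will depend on the text and on how much changes between autosaves.

```python
import difflib
import random
import zlib


def unified_delta(old: str, new: str) -> str:
    """Return a unified diff (as text) that turns `old` into `new`."""
    return "".join(
        difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            n=0,  # no context lines: smallest delta, but less robust when patching
        )
    )


# Build a document of roughly 40k characters (stand-in for user text).
rng = random.Random(0)
words = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta"]
lines = [" ".join(rng.choice(words) for _ in range(12)) + "\n" for _ in range(600)]
old_text = "".join(lines)

# Simulate one autosave in which the user edited a single line.
lines[300] = "this line was edited by the user\n"
new_text = "".join(lines)

delta = unified_delta(old_text, new_text)
full_size = len(new_text.encode("utf-8"))
zipped_size = len(zlib.compress(new_text.encode("utf-8")))
delta_size = len(delta.encode("utf-8"))

print(f"full: {full_size} B, zipped: {zipped_size} B, delta: {delta_size} B")
```

With `n=0` the diff carries no context lines, so it is as small as it can be but can only be applied safely to exactly the text it was computed from. That is the property the race conditions below put at risk.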

However, I'm concerned about race conditions involving parallel updates. There are two distinct race conditions that come to mind (the second much more serious than the first):

  1. HTTP transaction race condition: User A and User B request document X0 and make changes individually, producing Xa and Xb. Xa is saved, the difference between X0 and Xa being "Xa-0" ("a less nought"), and Xa is now stored as the official version in the database. If Xb subsequently saves, it overwrites Xa, the diff being Xb-a ("b less a").

    While not ideal, I'm not overly concerned with this behaviour. The documents are overwriting each other, and users A and B may have been unaware of each other (each having started with document X0), but the history retains integrity.

  2. Database read/update race condition: The problematic race condition is when Xa and Xb save at the same time over X0. There will be (pseudo-)code something like:

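     def save_history(orig_doc, new_doc):
         # diff() and save_diff() are placeholders for a text-diff routine
         # and a database write; orig_doc is whatever was read before saving.
         text_field_diff = diff(orig_doc.text_field, new_doc.text_field)
         save_diff(text_field_diff)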

    If Xa and Xb both read X0 from the database (i.e. orig_doc is X0), their differences will become Xa-0 and Xb-0 (as opposed to the serialized Xa-0 then Xb-a, or equivalently Xb-0 then Xa-b). When you try to patch the diffs together to produce the history, it will fail on either patch Xa-0 or Xb-0 (which both apply to X0). The integrity of the history has been compromised (or has it?).

    One possible solution is an automatic reconciliation algorithm, that detects these problems ex-post. If rebuilding the history fails, one may assume that a race condition has occurred, and so apply the failed patch to prior versions of the history until it succeeds.

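One way the reconciliation idea could be sketched: tag every diff with a hash of the text it was computed against, and when rebuilding, apply each diff to the version it names rather than to whatever came last. Everything below (`digest`, `make_delta`, `apply_delta`, `rebuild`, and the delta format itself) is a hypothetical illustration built on the standard library, not something the projects mentioned above provide. Character-level `SequenceMatcher` is also likely too slow for 40k texts, so treat it as a stand-in for a real diff routine.

```python
import difflib
import hashlib


def digest(text: str) -> str:
    """Short content hash used to identify a version."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def make_delta(base: str, new: str) -> dict:
    """Delta from `base` to `new`, tagged with the hash of the base it applies to."""
    ops = []
    matcher = difflib.SequenceMatcher(a=base, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))          # reuse base[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("insert", new[j1:j2]))    # text that is not in the base
        # "delete": nothing to emit, the base span is simply skipped
    return {"base": digest(base), "ops": ops}


def apply_delta(base: str, delta: dict) -> str:
    """Apply a delta, refusing to patch a text it was not computed against."""
    if digest(base) != delta["base"]:
        raise ValueError("delta was computed against a different version")
    out = []
    for op in delta["ops"]:
        out.append(base[op[1]:op[2]] if op[0] == "copy" else op[1])
    return "".join(out)


def rebuild(initial: str, deltas: list) -> dict:
    """Replay deltas in arrival order; return {hash: text} for every version.

    A delta whose base is not the latest version is applied to the earlier
    version it names, so concurrent saves become branches instead of errors.
    """
    versions = {digest(initial): initial}
    for delta in deltas:
        base_text = versions[delta["base"]]  # KeyError means an orphaned delta
        new_text = apply_delta(base_text, delta)
        versions[digest(new_text)] = new_text
    return versions


x0 = "The quick brown fox."
xa = "The quick red fox."
xb = "The quick brown fox jumps."

# Both users diffed against X0, which is the race described above.
history = rebuild(x0, [make_delta(x0, xa), make_delta(x0, xb)])
```

Here both Xa and Xb are recovered as children of X0. Which of the two counts as the "current" document, and whether they should be merged, is still a separate decision that this sketch does not make.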

I'd be delighted to have some feedback and suggestions on how to go about tackling this problem.

Incidentally, insofar as it's a useful way out, I've noted that Django atomicity is discussed here:

Thank you kindly.

5 answers

#1


3  

Here's what I've done to save an object's history:

For Django application History:

history/__init__.py:

"""
history/__init__.py
"""
# The standard-library json module replaces django.utils.simplejson,
# which was removed in Django 1.5.
import json

from django.core import serializers
from django.db.models.signals import pre_save, post_save

# from http://code.google.com/p/google-diff-match-patch/
from contrib.diff_match_patch import diff_match_patch

# Note: on current Django versions, importing models from an app's
# __init__.py typically raises AppRegistryNotReady at startup; this code
# would then need to live in a module imported from AppConfig.ready().
from history.models import History


def register_history(M):
    """
    Register Django model M for keeping its history

    e.g. register_history(Document) - every time Document is saved,
    its history (i.e. the differences) is saved.
    """
    pre_save.connect(_pre_handler, sender=M)
    post_save.connect(_post_handler, sender=M)


def _pre_handler(signal, sender, instance, **kwargs):
    """
    Save objects that have been changed.
    """
    if not instance.pk:
        return

    # there must be a before, if there's a pk, since
    # this is before the saving of this object.
    before = sender.objects.get(pk=instance.pk)

    _save_history(instance, _serialize(before).get('fields'))


def _post_handler(signal, sender, instance, created, **kwargs):
    """
    Save objects that are being created (otherwise we wouldn't have a pk!)
    """
    if not created:
        return

    _save_history(instance, {})


def _serialize(instance):
    """
    Given a Django model instance, return it as serialized data
    """
    return serializers.serialize("python", [instance])[0]


def _save_history(instance, before):
    """
    Save the difference between two serialized objects
    """
    after = _serialize(instance).get('fields', {})

    # All fields.
    fields = set(before.keys()) | set(after.keys())

    dmp = diff_match_patch()

    diff = {}

    for field in fields:
        field_before = before.get(field, False)
        field_after = after.get(field, False)

        if field_before != field_after:
            # str covers both of the Python 2 types (unicode, str)
            if isinstance(field_before, str):
                # text field: store a diff
                diff[field] = dmp.diff_main(field_before, field_after)
            else:
                # anything else: store the previous value
                diff[field] = field_before

    history = History(history_for=instance, diff=json.dumps(diff))
    history.save()

history/models.py

"""
history/models.py
"""

from django.db import models

from django.contrib.contenttypes.models import ContentType
# Location since Django 1.7; older versions used django.contrib.contenttypes.generic
from django.contrib.contenttypes.fields import GenericForeignKey


class History(models.Model):
    """
    Retain the history of generic objects, e.g. documents, people, etc..
    """

    # on_delete is required from Django 2.0 onwards
    content_type = models.ForeignKey(ContentType, null=True, on_delete=models.CASCADE)

    object_id = models.PositiveIntegerField(null=True)

    history_for = GenericForeignKey('content_type', 'object_id')

    diff = models.TextField()

    def __str__(self):
        # %s rather than %d, since object_id and pk may be None
        return "<History (%s:%s):%s>" % (self.content_type, self.object_id, self.pk)

Hope that helps someone, and comments would be appreciated.

Note that this does not address the race condition of my greatest concern. If, in _pre_handler, "before = sender.objects.get(pk=instance.pk)" runs after another instance has recorded its history but before that instance has saved, and the present instance then saves first, the result is a 'broken history' (i.e. out-of-order). Thankfully diff_match_patch attempts to gracefully handle "non-fatal" breaks, but there's no guarantee of success.

One solution is atomicity. I'm not sure how to go about making the above race condition (i.e. _pre_handler) an atomic operation across all instances of Django, though. A HistoryLock table, or a shared hash in memory (memcached?) would be fine - suggestions?

The other solution, as mentioned, is a reconciliation algorithm. However, concurrent saves may have "genuine" conflicts and require user intervention to determine the correct reconciliation.

Obviously, piecing the history back together isn't part of the above snippets.

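For what it's worth, reading a stored entry back is simple, because a diff_match_patch diff is a list of (op, text) pairs with op being -1 (delete), 0 (equal) or 1 (insert). The sketch below works on the JSON produced by the snippet above without importing the library; `text_before`, `text_after` and `field_history` are hypothetical helper names, and the sample entry is hand-written rather than real output.

```python
import json

# Operation codes used by diff_match_patch. After json.dumps/json.loads the
# (op, text) pairs come back as two-element lists, which unpack the same way.
DIFF_DELETE, DIFF_EQUAL, DIFF_INSERT = -1, 0, 1


def text_before(diff):
    """Text the diff was computed from: everything except insertions."""
    return "".join(text for op, text in diff if op != DIFF_INSERT)


def text_after(diff):
    """Text the diff produces: everything except deletions."""
    return "".join(text for op, text in diff if op != DIFF_DELETE)


def field_history(stored_diffs, field):
    """Yield (before, after) pairs for one field, oldest first.

    `stored_diffs` is assumed to be a list of History.diff JSON strings
    ordered by primary key.
    """
    for raw in stored_diffs:
        entry = json.loads(raw)
        # Non-text fields (and newly created objects) store a plain value.
        if field in entry and isinstance(entry[field], list):
            yield text_before(entry[field]), text_after(entry[field])


# Hand-written sample in the same shape that diff_main() returns.
sample = json.dumps({"text_field": [(0, "Hello "), (-1, "world"), (1, "Django")]})
pairs = list(field_history([sample], "text_field"))
```

One consequence worth noting: because the equal segments are stored in full, each entry written this way is at least as large as the text itself. If space is the goal, the library's patch_make() and patch_toText() produce compact patches instead, at the cost of needing the previous version to apply them.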

#2


2  

The storage issue: I think you should only store the diffs of two consecutive valid versions of the document. As you point out, the problem becomes getting a valid version when concurrent edits take place.

The concurrency issue:

  1. Could you avoid them altogether, like Jeff suggests, or by locking the document?
  2. If not, I think you're ultimately in the paradigm of online collaborative real-time editors like Google Docs.

To get an illustrated view of the can of worms you are opening, catch this Google tech talk at 9m21s (it's about Eclipse's collaborative real-time editing).

Alternatively, the Wikipedia article on collaborative real-time editors lists a couple of patents that detail ways of dealing with this kind of concurrency.

#3


1  

For managing the diffs, you would probably want to investigate Python's difflib.

Regarding atomicity, I would probably handle it the same as the Wikis (Trac, etc.). If the content has changed since the user last retrieved it, request that they override with the new version. If you're storing the text and diffs in the same record, it shouldn't be difficult to avoid database race conditions using the techniques in the links you posted.


#4


1  

Your auto save, I assume, saves a draft version before the user actually presses the save button, right?

If so, you don't have to keep the draft saves. Simply dispose of them after the user decides to save for real, and only keep a history of the real/explicit saves.

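As a rough illustration of that split, the sketch below keeps a single overwritable draft per user and only appends to the history on an explicit save. `DocumentStore` and its attributes are hypothetical in-memory stand-ins for whatever tables you would actually use.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentStore:
    """In-memory stand-in for the database tables (names are hypothetical)."""
    text: str = ""
    history: list = field(default_factory=list)   # one entry per explicit save
    drafts: dict = field(default_factory=dict)    # user -> latest autosaved text

    def autosave(self, user: str, text: str) -> None:
        """Overwrite the user's single draft; nothing is added to history."""
        self.drafts[user] = text

    def save(self, user: str, text: str) -> None:
        """Explicit save: record the old text, publish the new, drop the draft."""
        self.history.append(self.text)
        self.text = text
        self.drafts.pop(user, None)


doc = DocumentStore(text="X0")
for i in range(100):                 # 100 autosaves at 30 s is about 50 minutes
    doc.autosave("alice", f"draft {i}")
doc.save("alice", "X1")
```

After the loop the history holds one entry rather than a hundred, so the storage concern mostly disappears and the race conditions only matter for explicit saves.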

#5


1  

I've since discovered django-reversion as well, which seems to work well and be actively maintained, though it doesn't store diffs, so small changes to large pieces of text are not stored efficiently.
