I'm developing a Facebook application that uses SimpleDB to store its data, but I've realized Amazon does not provide a way to back up that data (at least not one that I know of).
And SimpleDB is slow. You can get about 4 lists per second, each list containing 100 records. That's not a good way to back up tons of records.
I found some services on the web that offer to do the backup for you, but I'm not comfortable giving them my AWS credentials.
So I thought about using threads. The problem is that if you do a select for all the keys in the domain, you need to wait for the next_token value of the first page in order to process the second page, and so on.
A solution I was thinking of was to add a new attribute based on the last 2 digits of the Facebook id. Then I'd start one thread with a select for "00", another for "01", and so on, potentially running 100 threads and doing the backup much faster (at least in theory). A related solution would be to split that domain into 100 domains (so I could back up each one individually), but that would break some of the selects I need to do. Another solution, probably more PHP friendly, would be to use a cron job to back up, let's say, 10,000 records and save the next_token, then have the next job start at that next_token, and so on (see the sketch below).
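To make the cron idea concrete, here's roughly what I have in mind, sketched in Python with boto 2.x rather than PHP (the domain, token file, batch size and output file are just placeholders):

# cron_backup.py - rough sketch of "back up a batch, save next_token, resume next run"
import json
import os
import boto

DOMAIN = "YOUR_DOMAIN"
TOKEN_FILE = "next_token.txt"   # persists the paging position between cron runs
BATCH_LIMIT = 10000             # roughly how many records to pull per run

def main():
    con = boto.connect_sdb()    # credentials from the environment / boto config
    next_token = None
    if os.path.exists(TOKEN_FILE):
        next_token = open(TOKEN_FILE).read().strip() or None

    fetched = 0
    query = "select * from `%s`" % DOMAIN
    with open("backup.jsonl", "a") as backup:
        while fetched < BATCH_LIMIT:
            rs = con.select(DOMAIN, query, next_token=next_token)
            for item in rs:
                backup.write(json.dumps({"itemName": item.name,
                                         "attributes": dict(item)}) + "\n")
                fetched += 1
            next_token = rs.next_token
            if not next_token:
                break           # reached the end of the domain

    # save the token so the next cron run picks up where this one stopped
    open(TOKEN_FILE, "w").write(next_token or "")

if __name__ == "__main__":
    main()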
Does anyone have a better solution for this? A PHP solution would be great, but if it involves something else it's welcome anyway.
PS: before you mention it, as far as I know, PHP is still not thread safe. And I'm aware that unless I stop writes during the backup there will be some consistency problems, but I'm not too worried about that in this particular case.
2 Answers
#1
The approach of creating a proxy shard attribute certainly works, in my experience.
Alternatively, what we have done in the past is break the backup down into a two-step process, in order to get as much potential for multi-processing as possible (though this is in Java, and for writes to the backup file we can rely on synchronization to ensure write safety; I'm not sure what the situation is on the PHP side).
Basically we have one thread which does a select across the data within a domain, but rather than "SELECT * FROM ...", it is just "SELECT itemName FROM ..." to get the keys of the entries needing backing up. These are then dropped into a queue of item keys which a pool of threads reads with the GetItem API, writing to the backup file in a thread-safe manner, roughly along the lines of the sketch below.
This gave us better throughput on a single domain than spinning on a single thread.
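Very roughly, the shape of it is something like this (our version is in Java; this is an untested Python/boto equivalent just to show the structure, with illustrative names):

# Sketch of the two-step backup: one key-lister thread, a pool of item fetchers.
import json
import threading
import Queue  # 'queue' on Python 3
import boto

DOMAIN = "YOUR_DOMAIN"
NUM_WORKERS = 10
DONE = object()  # sentinel telling workers to stop

def list_keys(domain, key_queue):
    # Step 1: stream just the item names into the queue.
    for item in domain.select("select itemName() from `%s`" % DOMAIN):
        key_queue.put(item.name)
    for _ in range(NUM_WORKERS):
        key_queue.put(DONE)

def fetch_items(domain, key_queue, outfile, write_lock):
    # Step 2: each worker pulls keys and fetches the full item with GetItem.
    while True:
        name = key_queue.get()
        if name is DONE:
            break
        item = domain.get_item(name, consistent_read=True)
        if item is None:
            continue  # deleted between listing and fetching
        line = json.dumps({"itemName": name, "attributes": dict(item)})
        with write_lock:  # serialize writes so the backup file stays well-formed
            outfile.write(line + "\n")

def main():
    con = boto.connect_sdb()  # credentials from environment / boto config
    domain = con.get_domain(DOMAIN)
    key_queue = Queue.Queue(maxsize=1000)
    write_lock = threading.Lock()
    with open("backup.jsonl", "w") as outfile:
        workers = [threading.Thread(target=fetch_items,
                                    args=(domain, key_queue, outfile, write_lock))
                   for _ in range(NUM_WORKERS)]
        for w in workers:
            w.start()
        list_keys(domain, key_queue)  # the producer runs in the main thread here
        for w in workers:
            w.join()

if __name__ == "__main__":
    main()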
Ultimately though, with numerous domains in our nightly backup, we ended up reverting to doing each domain backup on a single thread with the "SELECT * FROM domain" type model, mainly because we already had a shedload of threads going and the thread overhead was starting to become an issue on the backup processor, but also because the backup program was starting to get dangerously complex.
#2
I've researched this problem as of October 2012. Three major issues seem to govern the choice:
- There is no 'native' way to ensure a consistent export or import with SimpleDB. It is your responsibility to understand and manage the implications of this w.r.t. your application code.
- No managed backup solution is available from Amazon, but a variety of third-party companies offer something in this space (typically with "backup to S3" as an option).
- At some volume of data, you'll need to consider a multi-threaded approach which, again, has important implications re: consistency.
If all you need is to dump data from a single domain and your data volumes are low enough that a single-threaded export makes sense, here is some Python code I wrote which works great for me. No warranty is expressed or implied; only use this if you understand it:
#simpledb2json.py

import boto
import simplejson as json

AWS_KEY = "YOUR_KEY"
AWS_SECRET = "YOUR_SECRET"
DOMAIN = "YOUR_DOMAIN"

def fetch_items(boto_dom, dom_name, offset=None, limit=300):
    offset_predicate = ""
    if offset:
        offset_predicate = " and itemName() > '" + offset + "'"
    query = "select * from " \
        + "`" + dom_name + "`" \
        + " where itemName() is not null" \
        + offset_predicate \
        + " order by itemName() asc limit " + str(limit)
    rs = boto_dom.select(query)
    # by default, boto does not include the simpledb 'key' or 'name' in the
    # dict, it is a separate property. so we add it:
    result = []
    for r in rs:
        r['_itemName'] = r.name
        result.append(r)
    return result

def _main():
    con = boto.connect_sdb(aws_access_key_id=AWS_KEY, aws_secret_access_key=AWS_SECRET)
    dom = con.get_domain(DOMAIN)

    all_items = []
    offset = None
    while True:
        items = fetch_items(dom, DOMAIN, offset=offset)
        if not items:
            break
        all_items += items
        offset = all_items[-1].name

    print json.dumps(all_items, sort_keys=True, indent=4)

if __name__ == "__main__":
    _main()
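To run it (assuming boto and simplejson are installed), just redirect stdout to a file:

python simpledb2json.py > YOUR_DOMAIN.json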