I am building an application that includes a feature to bulk tag millions of records, more or less interactively. The user interaction is very similar to Gmail, where users can tag individual emails or bulk tag large numbers of emails. I also need quick read access to these tag memberships, and the read pattern is more or less random.
Right now we're using MySQL and inserting one row for every tag-document pair. Writing millions of rows to MySQL takes a while (high I/O), even with bulk insertions and heavy optimization. We need this to be an interactive process, not a batch process.
For the data that we're storing and reading, consistency and availability of the data are not as important as performance and scalability. So in the event of system failure while the writes are occurring, I can deal with some data loss. However, the data definitely needs to be persisted to secondary storage at some point.
So, to sum up, here are the requirements:
- Low latency bulk writes of potentially tens of millions of records
- Data needs to be persisted in some way
- Low latency random reads
- Durable writes not required
- Eventual consistency is okay
Here are some solutions I've looked at:
- Write-behind caches (Terracotta, Gigaspaces, Coherence), where records are written to memory and drained to the database asynchronously. These scare me a little because they appear to add a certain amount of complexity to the app that I'd want to avoid.
- Highly scalable key-value stores, like MongoDB, HBase, Tokyo Tyrant
4 Solutions
#1
2
If you have the budget to use Coherence for this, I highly recommend doing so. Coherence directly supports write-behind, eventually consistent behavior, and it survives both a database outage and Coherence cluster node outages well (if you use >= 3 Coherence nodes on separate JVMs, preferably on separate hosts). I implemented this for high-volume CRM on a Fortune 100 company's e-commerce site and it works fantastically.
One of the best aspects of this architecture is that you write your Java application code as if none of the write-behind behavior were taking place, and then plug in the Coherence topology and configuration that makes it happen. If you need to change the behavior or topology of Coherence later, no change in your application is required. I know there are probably a handful of reasonable ways to do this, but this behavior is directly supported in Coherence rather than having to invent or hand-roll a way of doing it.
To make a really fine point - your worry about adding application complexity is a good one. With Coherence, you simply write updates to the cache (or, if you're using Hibernate, Coherence can be the L2 cache provider). Depending on your Coherence configuration and topology, you have the option to deploy your application with write-behind, distributed caches. So your application is no more complex because of the cache's features and is, frankly, unaware of them.
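As a minimal sketch of that point, the application code below depends only on `java.util.Map`. Coherence's `NamedCache` implements `java.util.Map`, so in a real deployment the map would presumably come from Coherence (for example via `CacheFactory.getCache(...)`), and write-behind would be switched on in the cache configuration rather than in this code. The class name `TagService` and the layout (tag name mapped to a set of document ids) are assumptions made for illustration, not part of the original setup.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical service: it sees only a Map and knows nothing about write-behind. */
final class TagService {
    private final Map<String, Set<Long>> cache;

    TagService(Map<String, Set<Long>> cache) {
        this.cache = cache;
    }

    /** Adds every document id to the tag with a single put of the merged set. */
    void tagAll(String tag, Collection<Long> documentIds) {
        Set<Long> existing = cache.get(tag);
        Set<Long> merged = (existing == null) ? new HashSet<>() : new HashSet<>(existing);
        merged.addAll(documentIds);
        // With a write-behind cache plugged in, the database write happens later.
        cache.put(tag, merged);
    }

    /** Random read of a single tag membership. */
    boolean isTagged(String tag, long documentId) {
        Set<Long> members = cache.get(tag);
        return members != null && members.contains(documentId);
    }
}
```

Here a plain `ConcurrentHashMap` can stand in for the cache during testing. Note that the get-then-put in `tagAll` is not atomic, so concurrent writers to the same tag would need locking or an entry processor in a real cluster.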
Finally, I implemented the solution mentioned above from 2005-2007 when Coherence was made by Tangosol and they had the best possible support. I'm not sure how things are now under Oracle - hopefully still good.
#2
1
I've worked on a large project that used asynchronous writes, although in that case it was just hand-written using background threads. You could also implement something like that by offloading the DB write process to a JMS queue.
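A hand-rolled version of that idea might look roughly like the sketch below: callers enqueue records and return immediately, while a single background thread drains the queue in batches and hands each batch to a sink. The class name `AsyncBatchWriter`, the `sink` callback, and the 50 ms poll interval are all assumptions for illustration; this is not the code from the project mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical write-behind buffer: callers enqueue, one worker thread drains in batches. */
final class AsyncBatchWriter<T> implements AutoCloseable {
    private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();
    private final Consumer<List<T>> sink; // e.g. a method that performs the database write
    private final int batchSize;
    private final Thread worker;
    private volatile boolean running = true;

    AsyncBatchWriter(Consumer<List<T>> sink, int batchSize) {
        this.sink = sink;
        this.batchSize = batchSize;
        this.worker = new Thread(this::drainLoop, "async-batch-writer");
        this.worker.setDaemon(true);
        this.worker.start();
    }

    /** Returns immediately; the record is persisted later by the worker thread. */
    void write(T record) {
        queue.add(record);
    }

    private void drainLoop() {
        List<T> batch = new ArrayList<>(batchSize);
        try {
            // Keep going until close() has been called and the queue is empty.
            while (running || !queue.isEmpty()) {
                T first = queue.poll(50, TimeUnit.MILLISECONDS);
                if (first == null) {
                    continue;
                }
                batch.add(first);
                queue.drainTo(batch, batchSize - 1); // top up to at most batchSize
                sink.accept(new ArrayList<>(batch));
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // anything still queued is lost
        }
    }

    /** Signals the worker to finish and waits until the queue has been flushed. */
    @Override
    public void close() {
        running = false;
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Records still sitting in the queue are lost if the process dies, which matches the stated tolerance for some data loss. An unbounded queue is used here for brevity; a bounded one would add back-pressure when the database falls behind.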
One thing that will certainly speed up db writes is to do them in batches. JDBC batch updates can be orders of magnitude faster than individual writes, and if you're doing them asynchronously you can just write them 500 at a time.
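To illustrate the batching described above, here is a minimal sketch using the standard JDBC `addBatch()` / `executeBatch()` calls, flushing every 500 rows. The table name `document_tags`, its columns, and the class name `TagBatchInserter` are assumptions invented for the example; the original post does not give a schema.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

/** Hypothetical helper; the table and column names are assumptions, not from the post. */
final class TagBatchInserter {
    static final int BATCH_SIZE = 500;
    private static final String SQL =
            "INSERT INTO document_tags (tag_id, document_id) VALUES (?, ?)";

    /** Inserts one row per document id, sending rows to the database 500 at a time. */
    static void insertTags(Connection conn, long tagId, List<Long> documentIds)
            throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // commit once at the end rather than per row
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            int pending = 0;
            for (long documentId : documentIds) {
                ps.setLong(1, tagId);
                ps.setLong(2, documentId);
                ps.addBatch();
                if (++pending == BATCH_SIZE) {
                    ps.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch(); // flush the final partial batch
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```

How much this helps depends on the driver. With MySQL Connector/J, the `rewriteBatchedStatements=true` connection property is worth checking, since it lets the driver rewrite batched inserts into multi-row statements.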
#3
0
Depending on how your data is organized, perhaps you would be able to use sharding. If the read latency isn't low enough, you can also try adding caching. Memcached is one popular solution.
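As a rough sketch of the sharding part, the router below maps a document id to one of a fixed number of shards by hashing. The class name `ShardRouter` and the choice to shard by document id (rather than by tag or by user) are assumptions; which key is appropriate depends on how the data is actually read.

```java
/** Hypothetical shard router: maps a document id to one of N database shards. */
final class ShardRouter {
    private final int shardCount;

    ShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    /** The same id always lands on the same shard; floorMod keeps the index non-negative. */
    int shardFor(long documentId) {
        return Math.floorMod(Long.hashCode(documentId), shardCount);
    }
}
```

Simple modulo routing like this forces most keys to move when the shard count changes, so a scheme such as consistent hashing may be preferable if shards are added often.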
#4
0
Berkeley DB has a very high performance disk-based hash table that supports transactions, and integrates with a Java EE environment if you need that. If you're able to model the data as key/value pairs, this can be a very scalable solution.
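One possible way to model tag memberships as key/value pairs is to make each (tag, document) pair its own key, with an empty or trivial value. The sketch below only shows the key encoding in plain Java; the class name `TagKey` and the 16-byte layout are assumptions. With Berkeley DB Java Edition, the resulting byte arrays would typically be wrapped in `DatabaseEntry` objects before being stored.

```java
import java.nio.ByteBuffer;

/** Hypothetical key codec: one key per (tag, document) pair; the value can be empty. */
final class TagKey {
    static final int LENGTH = 2 * Long.BYTES;

    /** Big-endian tag id followed by big-endian document id. */
    static byte[] encode(long tagId, long documentId) {
        return ByteBuffer.allocate(LENGTH).putLong(tagId).putLong(documentId).array();
    }

    /** Reads the tag id back from the first 8 bytes of the key. */
    static long tagId(byte[] key) {
        return ByteBuffer.wrap(key).getLong(0);
    }

    /** Reads the document id back from the last 8 bytes of the key. */
    static long documentId(byte[] key) {
        return ByteBuffer.wrap(key).getLong(Long.BYTES);
    }
}
```

Checking whether a document carries a tag then becomes a single key lookup, which fits the random-read requirement.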
http://www.oracle.com/technology/products/berkeley-db/je/index.html
(Note: Oracle bought Berkeley DB about 5-10 years ago; the original product has been around for 15-20 years.)