Data allocation in a distributed database

How can data allocation in a distributed database be optimized?

Are there any software products for solving this problem?

For example:

The distributed database runs on a number of connected servers. Each server is simultaneously a client of the database.

The database has many tables.

We have statistics on the queries from each client to each table.

Each server has a known price for data storage. There is also a known transfer price for each server-client pair.

Objective: allocate all tables (or parts of tables) to the servers in the best possible way.
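To make the objective concrete, here is a minimal sketch of one possible cost model. It rests on assumptions that are not stated above: each table is stored whole on exactly one server (no replication or partitioning), storage is priced per unit of table size, transfer is priced per query, and servers have no capacity limit. All names and numbers are hypothetical.

```python
def allocation_cost(allocation, table_size, storage_price, transfer_price, queries):
    """Storage cost plus transfer cost of an allocation {table: server}."""
    total = 0.0
    for table, server in allocation.items():
        # Assumed: storage is charged per unit of table size.
        total += table_size[table] * storage_price[server]
        # Assumed: transfer is charged per query, by (server, client) pair.
        for client, count in queries[table].items():
            total += count * transfer_price[server][client]
    return total


def cheapest_server_per_table(table_size, storage_price, transfer_price, queries):
    """Without capacity limits the tables are independent of each other,
    so each one can simply go to its individually cheapest server."""
    return {
        table: min(
            storage_price,
            key=lambda server: allocation_cost(
                {table: server}, table_size, storage_price, transfer_price, queries
            ),
        )
        for table in table_size
    }


# Hypothetical example data: three servers that are also the clients.
storage_price = {"A": 1.0, "B": 2.0, "C": 1.5}
transfer_price = {
    "A": {"A": 0, "B": 1, "C": 2},
    "B": {"A": 1, "B": 0, "C": 1},
    "C": {"A": 2, "B": 1, "C": 0},
}
table_size = {"orders": 10, "users": 2}
queries = {
    "orders": {"A": 5, "B": 100, "C": 10},
    "users": {"A": 50, "B": 5, "C": 5},
}

best = cheapest_server_per_table(table_size, storage_price, transfer_price, queries)
total = allocation_cost(best, table_size, storage_price, transfer_price, queries)
print(best, total)
```

Under these simplifications the problem is easy. It becomes hard once capacity limits, replication, or partial tables couple the placement decisions together.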

To solve this problem we can apply a variety of heuristic algorithms: genetic algorithms, evolution strategies, ant colony algorithms, etc.
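As an illustration of the kind of heuristic meant here, the following is a minimal sketch of a (1+1)-style evolutionary local search: one parent allocation, one mutated child per step, and the child replaces the parent when it is feasible and no worse. It assumes one copy per table and a per-server capacity limit. The function names, the capacity model, and the data are all hypothetical, and this is not a tuned or complete optimizer.

```python
import random


def cost(alloc, size, store_price, move_price, queries):
    """Storage plus transfer cost of an allocation {table: server}."""
    return sum(
        size[t] * store_price[s]
        + sum(n * move_price[s][c] for c, n in queries[t].items())
        for t, s in alloc.items()
    )


def feasible(alloc, size, capacity):
    """True if no server holds more data than its (assumed) capacity."""
    used = {s: 0 for s in capacity}
    for t, s in alloc.items():
        used[s] += size[t]
    return all(used[s] <= capacity[s] for s in capacity)


def evolve(start, size, store_price, move_price, queries, capacity,
           steps=2000, seed=0):
    """Mutate one table placement per step; keep the child if it is
    feasible and not more expensive than the current parent."""
    rng = random.Random(seed)
    servers, tables = list(capacity), list(size)
    best = dict(start)  # the start allocation is assumed to be feasible
    best_cost = cost(best, size, store_price, move_price, queries)
    for _ in range(steps):
        child = dict(best)
        child[rng.choice(tables)] = rng.choice(servers)  # mutation
        if not feasible(child, size, capacity):
            continue
        child_cost = cost(child, size, store_price, move_price, queries)
        if child_cost <= best_cost:
            best, best_cost = child, child_cost
    return best, best_cost


# Hypothetical example data.
store_price = {"A": 1.0, "B": 2.0, "C": 1.5}
move_price = {
    "A": {"A": 0, "B": 1, "C": 2},
    "B": {"A": 1, "B": 0, "C": 1},
    "C": {"A": 2, "B": 1, "C": 0},
}
size = {"orders": 10, "users": 2}
queries = {
    "orders": {"A": 5, "B": 100, "C": 10},
    "users": {"A": 50, "B": 5, "C": 5},
}
capacity = {"A": 12, "B": 10, "C": 12}
start = {"orders": "A", "users": "C"}

result, result_cost = evolve(start, size, store_price, move_price, queries, capacity)
print(result, result_cost)
```

A genetic algorithm or ant colony method would replace the single-parent loop with a population or a pheromone model, but the cost and feasibility functions could stay the same.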

But I could not find any ready-made software tools that implement these algorithms.

Are there any tools to solve this problem for distributed databases (Oracle or others)?

Does anybody care about it?

And does anybody have examples of systems in which a distributed database has been optimized in this way on the basis of query statistics?

Thanks!

3 answers

#1

I've looked for something similar, but the sad truth is that there aren't any off-the-shelf tools for doing this kind of analysis for databases. You can find a lot of information, though, in various research projects, university papers, and so on.

As an alternative, the problem could be modelled with off-the-shelf mathematical optimization tools, to optimize the data localization/correlation to specific clients.
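One way such a model might look is sketched below as an integer program. The symbols are all assumptions introduced for illustration: $T$ is the set of tables, $S$ the set of servers (which are also the clients), $z_t$ the size of table $t$, $p_s$ the storage price per unit on server $s$, $q_{tc}$ the number of queries from client $c$ to table $t$, $r_{sc}$ the transfer price between server $s$ and client $c$, $K_s$ an optional capacity of server $s$, and $x_{ts} = 1$ if table $t$ is stored on server $s$. Replication and partial tables are left out.

```latex
\begin{aligned}
\min_{x} \quad & \sum_{t \in T} \sum_{s \in S} x_{ts}
  \Bigl( z_t \, p_s + \sum_{c \in S} q_{tc} \, r_{sc} \Bigr) \\
\text{s.t.} \quad & \sum_{s \in S} x_{ts} = 1 && \forall t \in T, \\
& \sum_{t \in T} z_t \, x_{ts} \le K_s && \forall s \in S, \\
& x_{ts} \in \{0, 1\} && \forall t \in T,\ s \in S.
\end{aligned}
```

In this form it is a generalized assignment problem, which a general-purpose mixed-integer programming solver can handle for moderate numbers of tables and servers.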

#2

I think it is a lot easier to just store the data in a centralized database and configure a cache for the various locations. Because the different locations are unlikely to be in the same grid, the cache should be synchronous: with an asynchronous cache, the order of updates in the database might differ from the order in which the updates were applied. The cache will remove a lot of query network traffic and improve performance for the remote locations, compared with accessing the database directly.

The Oracle In-Memory Cache Database Option could be worth investigating. It works with databases from 10.2.0.4 onward, using the 11.2.1.8 version of what was formerly called TimesTen. A great option.

The algorithms you asked about are effectively caching algorithms: make sure that frequently used data is close to the consumer, at the best possible price. If you can spend more on memory, more data fits in. LRU eviction will take care of removing less frequently used data from the cache.
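To illustrate the LRU idea mentioned above, here is a minimal sketch of a read-through LRU cache. It is a generic illustration, not how the Oracle product is implemented. The `load` callback stands in for a query against the central database and is hypothetical, and write handling (the synchronous part) is not modelled.

```python
from collections import OrderedDict


class LRUCache:
    """Keeps at most `capacity` entries; evicts the least recently used."""

    def __init__(self, capacity, load):
        self.capacity = capacity
        self.load = load            # hypothetical fetch from the central database
        self.items = OrderedDict()  # oldest entry first, newest last
        self.misses = 0

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)  # mark as most recently used
            return self.items[key]
        self.misses += 1                 # cache miss: go to the central database
        value = self.load(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # drop the least recently used entry
        return value


cache = LRUCache(capacity=2, load=lambda key: f"row-{key}")
for key in (1, 2, 1, 3):
    cache.get(key)
print(list(cache.items), cache.misses)
```

A larger `capacity` corresponds to spending more on memory at the remote location: more data fits in, and fewer requests travel to the central database.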

#3

An example of a distributed database that solves this problem is Clustrix, which is the only database that has independent index distribution. Clustrix is a database built from the ground up to be a distributed MySQL replacement.

More on how Clustrix does data distribution and the distributed evaluation model
