I have a data store with about 150,000 entities in it. When I query the store using filters, my queries are REALLY slow. My structure is completely flat, i.e. every entity is a sibling of every other.
1: Is it better to use GQL instead of filters?
2: Is this not the best use-case for Data Store, and should I use a SQL database instead?
Here's an example of my code:
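// Look for a buy opportunity: entities for one date with -1.0 <= score <= 10.0.
// Classes come from com.google.appengine.api.datastore; dt, stockKey,
// availableSlots and datastore are defined elsewhere.
Filter dateFilter = new FilterPredicate("date", FilterOperator.EQUAL, dt);
Filter scoreFilter = new FilterPredicate("score", FilterOperator.LESS_THAN_OR_EQUAL, 10.0);
Filter safetyFilter = new FilterPredicate("score", FilterOperator.GREATER_THAN_OR_EQUAL, -1.0);
Filter mainFilter = CompositeFilterOperator.and(dateFilter, scoreFilter, safetyFilter);
// Ancestor query on stockKey, lowest scores first, capped at availableSlots results
Query q = new Query("StockEntity", stockKey).setFilter(mainFilter);
q.addSort("score", Query.SortDirection.ASCENDING);
List<Entity> stocks = datastore.prepare(q).asList(FetchOptions.Builder.withLimit(availableSlots));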
Some more details:
- About 150,000 records, divided among 500 stocks: roughly 300 records per stock, one for each day in a date range.
- A query like the one above takes upwards of 30 seconds to complete on my development machine. A specific date is passed in, the 500 stocks are effectively filtered on a 'score', and between 10 and 20 records are returned.
Haven't tried pushing to production yet, but I guess I will try that next -- I figured that there wouldn't be a huge difference. My dev machine is quite a high spec iMac.
2 Solutions
#1
https://developers.google.com/appengine/docs/java/datastore/queries#Java_Restrictions_on_queries
Inequality filters are limited to at most one property
To avoid having to scan the entire index table, the query mechanism relies on all of a query's potential results being adjacent to one another in the index. To satisfy this constraint, a single query may not use inequality comparisons (LESS_THAN, LESS_THAN_OR_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, NOT_EQUAL) on more than one property across all of its filters. For example, a query is valid when both of its inequality filters apply to the same property.
The short answer is that you can't quite do what you want with the Datastore.
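To make the adjacency constraint quoted above more concrete, here is a minimal sketch in plain Java. It is a toy model, not the Datastore API: the class IndexScanSketch, its Row type, and the sample data are all hypothetical. It models an index sorted by date, then score, and shows that an equality filter on date plus a range on score corresponds to one contiguous slice of that index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Toy model of a sorted index; hypothetical names, not the Datastore API.
final class IndexScanSketch {
    /** One index row, ordered by date, then score, then entity key. */
    static final class Row {
        final String date;
        final double score;
        final String entityKey;

        Row(String date, double score, String entityKey) {
            this.date = date;
            this.score = score;
            this.entityKey = entityKey;
        }
    }

    private static final Comparator<Row> ORDER = Comparator
            .comparing((Row r) -> r.date)
            .thenComparingDouble(r -> r.score)
            .thenComparing(r -> r.entityKey);

    private final TreeSet<Row> index = new TreeSet<>(ORDER);

    public void put(String date, double score, String entityKey) {
        index.add(new Row(date, score, entityKey));
    }

    /** Equality on date plus a range on score: seek once, then read adjacent rows. */
    public List<String> query(String date, double minScore, double maxScore, int limit) {
        Row start = new Row(date, minScore, ""); // first possible row in the range
        List<String> keys = new ArrayList<>();
        for (Row r : index.tailSet(start, true)) {
            // Stop as soon as we leave the contiguous range or reach the limit.
            if (!r.date.equals(date) || r.score > maxScore || keys.size() == limit) {
                break;
            }
            keys.add(r.entityKey);
        }
        return keys;
    }

    public static void main(String[] args) {
        IndexScanSketch idx = new IndexScanSketch();
        idx.put("2013-01-02", 5.0, "AAPL");
        idx.put("2013-01-02", -3.0, "MSFT");
        idx.put("2013-01-02", 0.5, "GOOG");
        idx.put("2013-01-02", 12.0, "IBM");
        idx.put("2013-01-03", 1.0, "AAPL");
        System.out.println(idx.query("2013-01-02", -1.0, 10.0, 20)); // prints [GOOG, AAPL]
    }
}
```

A range on a second property would break this picture: the matching rows would no longer sit next to each other, so the scan could not stop early.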
#2
First up, that query will run faster on the actual Datastore.
- Using GQL or Filters is basically the same.
- When using the Datastore you should first define the functionality you need. For example, you want to show a list of stocks with a specific order and filters. Now look at any other views of the same data that your app needs. Then decide how the data should be structured.
This is very different from an RDBMS where the database can often accommodate most functionality without changing the data model and the data is modeled in a more 'generic' way (normalization).
In general, the Datastore's read performance is best when you know the key of the entity you want to read, and worst when you run queries, since a query always requires an index 'scan'.
Knowing this, I tend to use the ancestor relationship often. Requesting the 'children' of an ancestor seems to perform better and is strongly consistent. For example, I use a query like:
SELECT * WHERE ANCESTOR IS {key}
Here {key} is the key of the ancestor (or 'parent'). This query returns the ancestor entity and all entities that have this ancestor key in their paths. On rare occasions I use one of the filter values as a parent 'value' to group objects, but be careful: a key cannot be changed once the entity is written (you can write the entity under a new key, but that results in a copy).
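As an illustration of what "has this ancestor key in its path" means, here is a small sketch in plain Java. It does not use the Datastore API; the class AncestorQuerySketch and the "Kind:name/Kind:name" path format are assumptions made up for this example. Entities are stored under their full key path, and an ancestor query becomes a prefix scan over the sorted paths.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of key paths; hypothetical names, not the Datastore API.
final class AncestorQuerySketch {
    // Entities keyed by full key path, e.g. "Stock:AAPL/Day:2013-01-02".
    private final NavigableMap<String, String> entities = new TreeMap<>();

    public void put(String keyPath, String value) {
        entities.put(keyPath, value);
    }

    /** Returns the ancestor itself plus every entity whose path starts with it. */
    public List<String> withAncestor(String ancestorPath) {
        List<String> paths = new ArrayList<>();
        for (String path : entities.tailMap(ancestorPath, true).keySet()) {
            if (!path.startsWith(ancestorPath)) {
                break; // sorted order: nothing further can share the prefix
            }
            // Skip siblings such as "Stock:AAPLX" that merely share leading characters.
            if (path.equals(ancestorPath) || path.startsWith(ancestorPath + "/")) {
                paths.add(path);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        AncestorQuerySketch store = new AncestorQuerySketch();
        store.put("Stock:AAPL", "parent");
        store.put("Stock:AAPL/Day:2013-01-02", "score=5.0");
        store.put("Stock:AAPL/Day:2013-01-03", "score=1.0");
        store.put("Stock:AAPLX", "unrelated");
        store.put("Stock:MSFT/Day:2013-01-02", "score=-3.0");
        System.out.println(store.withAncestor("Stock:AAPL"));
    }
}
```

Because the path is part of the key, moving an entity to a different parent means writing it under a new path, which is the "copy" mentioned above.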
Also, if you know the average size of a 'set' (for example, the Orderlines that belong to an Order), you could choose to keep track of each Orderline key somewhere. Requesting the first 20 keys in a batched read is a fast operation. This is basically the same as indexing, except that the ordering and filtering can be done at 'write time', so your list only contains keys that match your filters.
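Applied to the stock example from the question, the write-time list could look roughly like the sketch below. This is plain Java with in-memory maps standing in for the Datastore; BuyListSketch, the "date/stock" key format, and the reuse of the question's score bounds of -1.0 and 10.0 are assumptions for illustration only. Each write updates a per-date list of qualifying keys, kept in score order, so a read only has to take the first N keys.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a list maintained at write time; hypothetical names, not the Datastore API.
final class BuyListSketch {
    private static final double MIN_SCORE = -1.0;
    private static final double MAX_SCORE = 10.0;

    // Stand-in for the stored entities: key -> score.
    private final Map<String, Double> scores = new HashMap<>();
    // Per-date list of keys that pass the filter, sorted by ascending score.
    private final Map<String, List<String>> buyListByDate = new HashMap<>();

    /** Filtering and ordering happen here, once, when the record is written. */
    public void write(String date, String stock, double score) {
        String key = date + "/" + stock;
        scores.put(key, score);
        List<String> keys = buyListByDate.computeIfAbsent(date, d -> new ArrayList<>());
        keys.remove(key); // drop a stale entry if this record is being rewritten
        if (score >= MIN_SCORE && score <= MAX_SCORE) {
            keys.add(key);
            keys.sort(Comparator.comparingDouble((String k) -> scores.get(k)));
        }
    }

    /** Reading is just "take the first N keys"; no filtering is needed. */
    public List<String> firstKeys(String date, int limit) {
        List<String> keys = buyListByDate.getOrDefault(date, Collections.emptyList());
        return new ArrayList<>(keys.subList(0, Math.min(limit, keys.size())));
    }

    public static void main(String[] args) {
        BuyListSketch store = new BuyListSketch();
        store.write("2013-01-02", "AAPL", 5.0);
        store.write("2013-01-02", "MSFT", -3.0);
        store.write("2013-01-02", "GOOG", 0.5);
        store.write("2013-01-02", "IBM", 12.0);
        // prints [2013-01-02/GOOG, 2013-01-02/AAPL]
        System.out.println(store.firstKeys("2013-01-02", 20));
    }
}
```

In a real app the returned keys would then be fetched with a batched get by key. The trade-off is extra work on every write in exchange for cheap reads.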
Avoid creating views that allow users to 'dynamically' select filters.
How to optimize further: 1. Use denormalization to minimize the number of lookups or queries. 2. Cache (Memcache) where you can.
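The caching suggestion can be sketched as a read-through cache. The example below is plain Java with an in-process HashMap standing in for Memcache; it does not use the App Engine Memcache API, and ReadThroughCacheSketch and its loader function are hypothetical names. The idea is that the expensive query runs only on a cache miss.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// In-process stand-in for a cache such as Memcache; hypothetical, not thread-safe.
final class ReadThroughCacheSketch<K, V> {
    private final Map<K, V> cache = new HashMap<>();
    private final Function<K, V> loader; // the slow path, e.g. a Datastore query
    private int loads = 0;

    public ReadThroughCacheSketch(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, running the loader only on a miss. */
    public V get(K key) {
        V hit = cache.get(key);
        if (hit != null) {
            return hit;
        }
        loads++;
        V value = loader.apply(key);
        cache.put(key, value);
        return value;
    }

    /** Call this when the underlying data changes so the next read reloads it. */
    public void invalidate(K key) {
        cache.remove(key);
    }

    public int loadCount() {
        return loads;
    }

    public static void main(String[] args) {
        ReadThroughCacheSketch<String, String> cache =
                new ReadThroughCacheSketch<>(date -> "results for " + date);
        cache.get("2013-01-02");
        cache.get("2013-01-02");
        System.out.println(cache.loadCount()); // prints 1
    }
}
```

Since the data in the question is one record per stock per day, results for past dates rarely change, which makes them good candidates for caching.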