如何使用CouchDB构建“标记”支持?

时间:2022-01-28 12:50:22

I'm using the following view function to iterate over all items in the database (in order to find a tag), but I think the performance is very poor if the dataset is large. Any other approach?

我正在使用以下视图函数来迭代数据库中的所有项目(以便查找标记),但我认为如果数据集很大,性能非常差。还有其他方法吗?

def by_tag(tag):
return  '''
        function(doc) {
            if (doc.tags.length > 0) {
                for (var tag in doc.tags) {
                    if (doc.tags[tag] == "%s") {
                        emit(doc.published, doc)
                    }
                }
            }
        };
        ''' % tag

4 个解决方案

#1


7  

Disclaimer: I didn't test this and don't know if it can perform better.

免责声明:我没有对此进行测试,也不知道它是否能够表现更好。

Create a single perm view:

创建一个perm视图:

function(doc) {
  for (var tag in doc.tags) {
    emit([tag, doc.published], doc)
  }
};

And query with _view/your_view/all?startkey=['your_tag_here']&endkey=['your_tag_here', {}]

并使用_view / your_view / all查询?startkey = ['your_tag_here']&endkey = ['your_tag_here',{}]

Resulting JSON structure will be slightly different but you will still get the publish date sorting.

生成的JSON结构略有不同,但您仍将获得发布日期排序。

#2


3  

You can define a single permanent view, as Bahadir suggests. when doing this sort of indexing, though, don't output the doc for each key. Instead, emit([tag, doc.published], null). In current release versions you'd then have to do a separate lookup for each doc, but SVN trunk now has support for specifying "include_docs=True" in the query string and CouchDB will automatically merge the docs into your view for you, without the space overhead.

正如Bahadir建议的那样,您可以定义单个永久视图。但是,在进行此类索引时,请不要为每个键输出doc。相反,发出([tag,doc.published],null)。在当前版本中,您必须为每个文档单独查找,但SVN主干现在支持在查询字符串中指定“include_docs = True”,CouchDB会自动将文档合并到您的视图中,而不是空间开销。

#3


1  

You are very much on the right track with the view. A list of thoughts though:

你对这个观点非常正确。一系列的想法:

View generation is incremental. If you're read traffic is greater than you're write traffic, then your views won't cause an issue at all. People that are concerned about this generally shouldn't be. Frame of reference, you should be worried if you're dumping hundreds of records into the view without an update.

视图生成是增量的。如果您的阅读流量大于您的写入流量,那么您的观点根本不会导致问题。关注这一点的人通常不应该这样。参考框架,如果您在没有更新的情况下将数百条记录转储到视图中,您应该担心。

Emitting an entire document will slow things down. You should only emit what is necessary for use of the view.

发送整个文档会减慢速度。您应该只发出使用视图所需的内容。

Not sure what the val == "%s" performance would be, but you shouldn't over think things. If there's a tag array you should emit the tags. Granted if you expect a tags array that will contain non-strings, then ignore this.

不确定val ==“%s”的性能是什么,但你不应该过度思考。如果有标签数组,您应该发出标签。如果你期望一个包含非字符串的标签数组,那么请忽略它。

#4


0  

# Works on CouchDB 0.8.0
from couchdb import Server # http://code.google.com/p/couchdb-python/

byTag = """
function(doc) {
if (doc.type == 'post' && doc.tags) {
    doc.tags.forEach(function(tag) {
        emit(tag, doc);
    });
}
}
"""

def findPostsByTag(self, tag):
    server = Server("http://localhost:1234")
    db = server['my_table']
    return [row for row in db.query(byTag, key = tag)]

The byTag map function returns the data with each unique tag in the "key", then each post with that tag in value, so when you grab key = "mytag", it will retrieve all posts with the tag "mytag".

byTag map函数返回带有“key”中每个唯一标记的数据,然后每个帖子都带有该标记的值,所以当你抓住key =“mytag”时,它会检索所有标记为“mytag”的帖子。

I've tested it against about 10 entries and it seems to take about 0.0025 seconds per query, not sure how efficient it is with large data sets..

我已经针对大约10个条目测试了它,并且每个查询似乎需要大约0.0025秒,不确定大数据集的效率如何。

#1


7  

Disclaimer: I didn't test this and don't know if it can perform better.

免责声明:我没有对此进行测试,也不知道它是否能够表现更好。

Create a single perm view:

创建一个perm视图:

function(doc) {
  for (var tag in doc.tags) {
    emit([tag, doc.published], doc)
  }
};

And query with _view/your_view/all?startkey=['your_tag_here']&endkey=['your_tag_here', {}]

并使用_view / your_view / all查询?startkey = ['your_tag_here']&endkey = ['your_tag_here',{}]

Resulting JSON structure will be slightly different but you will still get the publish date sorting.

生成的JSON结构略有不同,但您仍将获得发布日期排序。

#2


3  

You can define a single permanent view, as Bahadir suggests. when doing this sort of indexing, though, don't output the doc for each key. Instead, emit([tag, doc.published], null). In current release versions you'd then have to do a separate lookup for each doc, but SVN trunk now has support for specifying "include_docs=True" in the query string and CouchDB will automatically merge the docs into your view for you, without the space overhead.

正如Bahadir建议的那样,您可以定义单个永久视图。但是,在进行此类索引时,请不要为每个键输出doc。相反,发出([tag,doc.published],null)。在当前版本中,您必须为每个文档单独查找,但SVN主干现在支持在查询字符串中指定“include_docs = True”,CouchDB会自动将文档合并到您的视图中,而不是空间开销。

#3


1  

You are very much on the right track with the view. A list of thoughts though:

你对这个观点非常正确。一系列的想法:

View generation is incremental. If you're read traffic is greater than you're write traffic, then your views won't cause an issue at all. People that are concerned about this generally shouldn't be. Frame of reference, you should be worried if you're dumping hundreds of records into the view without an update.

视图生成是增量的。如果您的阅读流量大于您的写入流量,那么您的观点根本不会导致问题。关注这一点的人通常不应该这样。参考框架,如果您在没有更新的情况下将数百条记录转储到视图中,您应该担心。

Emitting an entire document will slow things down. You should only emit what is necessary for use of the view.

发送整个文档会减慢速度。您应该只发出使用视图所需的内容。

Not sure what the val == "%s" performance would be, but you shouldn't over think things. If there's a tag array you should emit the tags. Granted if you expect a tags array that will contain non-strings, then ignore this.

不确定val ==“%s”的性能是什么,但你不应该过度思考。如果有标签数组,您应该发出标签。如果你期望一个包含非字符串的标签数组,那么请忽略它。

#4


0  

# Works on CouchDB 0.8.0
from couchdb import Server # http://code.google.com/p/couchdb-python/

byTag = """
function(doc) {
if (doc.type == 'post' && doc.tags) {
    doc.tags.forEach(function(tag) {
        emit(tag, doc);
    });
}
}
"""

def findPostsByTag(self, tag):
    server = Server("http://localhost:1234")
    db = server['my_table']
    return [row for row in db.query(byTag, key = tag)]

The byTag map function returns the data with each unique tag in the "key", then each post with that tag in value, so when you grab key = "mytag", it will retrieve all posts with the tag "mytag".

byTag map函数返回带有“key”中每个唯一标记的数据,然后每个帖子都带有该标记的值,所以当你抓住key =“mytag”时,它会检索所有标记为“mytag”的帖子。

I've tested it against about 10 entries and it seems to take about 0.0025 seconds per query, not sure how efficient it is with large data sets..

我已经针对大约10个条目测试了它,并且每个查询似乎需要大约0.0025秒,不确定大数据集的效率如何。