I'm trying to speed up a query that needs DISTINCT because it filters on a many-to-many (M2M) field. At this point I'm not sure whether the slowness comes from how my DB server is configured or from my queryset.
My question: What is the fastest queryset, and can I also improve the speed by changing my PostgreSQL settings?
PostgreSQL Server Information
Instance: EC2 m1.xlarge
PostgreSQL Version: 9.1
Article Records: 240,695
Total Memory: 14980 MB
shared_buffers: 3617MB
effective_cache_size: 8000MB
work_mem: 40MB
checkpoint_segments: 10
maintenance_work_mem: 64MB
Related Models
class AuthorsModelMixin(models.Model):
    authors = models.ManyToManyField('people.Person', blank=True)
    nonstaff_authors = models.CharField(
        verbose_name='Non-staff authors', max_length=255, blank=True,
        help_text="Used for the name of the author for non-staff members.")
    byline_title = models.CharField(
        max_length=255, blank=True,
        help_text="Often contains an organization. Title of the person, or "
                  "entity associated with the byline and a specified person "
                  "(i.e. Associated Press).")

    class Meta:
        abstract = True

class TaxonomyModelMixin(models.Model):
    sections = models.ManyToManyField(Section, blank=True)
    tags = TaggableManager(
        blank=True, help_text='A comma-separated list of tags (i.e. '
                              'Outdoors, Election, My Great News Topic).')

    class Meta:
        abstract = True

class PublishModelMixin(models.Model):
    status_choices = (
        ('D', 'Draft'),
        ('P', 'Published'),
        ('T', 'Trash'),
    )
    comment_choices = (
        ('enabled', 'Enabled'),
        ('disabled', 'Disabled'),
    )
    sites = models.ManyToManyField(Site, default=[1])
    status = models.CharField(
        max_length=1, default='P', db_index=True, choices=status_choices,
        help_text='Only published items will appear on the site')
    published = models.DateTimeField(
        default=timezone.now, db_index=True,
        help_text='Select the date you want the content to be published.')
    is_premium = models.BooleanField(
        choices=((True, 'Yes'), (False, 'No')),
        verbose_name='Premium Content', default=True)
    comments = models.CharField(
        max_length=30, default='enabled',
        choices=comment_choices, help_text='Enable or disable comments.')
    created = models.DateTimeField(auto_now_add=True)
    modified = models.DateTimeField(auto_now=True)
    objects = PublishedManager()

    class Meta:
        abstract = True

class Article(AuthorsModelMixin, TaxonomyModelMixin, PublishModelMixin):
    title = models.CharField(max_length=255)
    slug = SlugModelField(max_length=255)
    lead_photo = models.ForeignKey('media.Photo', blank=True, null=True)
    summary = models.TextField(blank=True)
    body = models.TextField()
Querysets I've Tried
Queryset 1
Query time: (76 ms)
Pros: Fast and no chance published articles won't be displayed
Cons: If a higher id has an older pub date then the article list will be out of order
queryset = Article.objects \
    .published() \
    .filter(sections__full_slug__startswith=section.full_slug) \
    .prefetch_related('lead_photo', 'authors') \
    .order_by('-id') \
    .distinct('id')
Queryset 2
Query time: (76 ms)
Pros: Articles are in order all the time
Cons: If two articles have the same pub date and time, only one will be listed
queryset = Article.objects \
    .published() \
    .filter(sections__full_slug__startswith=section.full_slug) \
    .prefetch_related('lead_photo', 'authors') \
    .order_by('-published') \
    .distinct('published')
Queryset 3
Query time: (1007 ms)
Pros: Articles are in order all the time and no chance of articles not being listed
Cons: Much slower!
queryset = Article.objects \
    .published() \
    .filter(sections__full_slug__startswith=section.full_slug) \
    .prefetch_related('lead_photo', 'authors') \
    .order_by('-id', '-published') \
    .distinct('id')
Queryset 4
Query time: (4797.85 ms)
Pros: Not much, but not using DISTINCT ON means it works on other databases, such as SQLite for tests
Cons: Much slower!!!
queryset = Article.objects \
    .published() \
    .filter(sections__full_slug__startswith=section.full_slug) \
    .prefetch_related('lead_photo', 'authors') \
    .order_by('-published') \
    .distinct()
1 Answer
Can you try a performance test on this query? As you haven't posted your models, please adapt any field names.
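The idea is to break it into two parts: an inner query that collects the matching Article ids from the intermediary (through) table, and an outer query that fetches those articles by id. Because the outer query never joins to sections, each article appears only once and no DISTINCT is needed.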
queryset = Article.objects \
    .published() \
    .filter(id__in=Article.sections.through.objects
            .filter(section__full_slug__startswith=section.full_slug)
            .values_list('article_id', flat=True)) \
    .prefetch_related('lead_photo', 'authors') \
    .order_by('-published', '-id')
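
To see why the subquery form needs no DISTINCT, here is a minimal sketch that uses Python's built-in sqlite3 module instead of Django and PostgreSQL. The table names, slugs, and rows are made up for illustration and only mimic the shape of the Article/Section M2M relation; it says nothing about timings on the real data.

```python
import sqlite3

# Hypothetical, simplified schema: just enough to mimic Article <-> Section.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE article (id INTEGER PRIMARY KEY, published TEXT);
    CREATE TABLE section (id INTEGER PRIMARY KEY, full_slug TEXT);
    CREATE TABLE article_sections (article_id INTEGER, section_id INTEGER);

    INSERT INTO article VALUES
        (1, '2013-01-02 10:00'),
        (2, '2013-01-01 10:00'),
        (3, '2013-01-02 10:00');
    INSERT INTO section VALUES (1, 'news/'), (2, 'news/local/'), (3, 'sports/');
    INSERT INTO article_sections VALUES (1, 1), (1, 2), (2, 2), (3, 1), (3, 3);
""")

# Filtering through a join: article 1 sits in two matching sections,
# so it comes back twice unless DISTINCT is applied.
joined_ids = [row[0] for row in conn.execute("""
    SELECT a.id
    FROM article a
    JOIN article_sections s ON s.article_id = a.id
    JOIN section sec ON sec.id = s.section_id
    WHERE sec.full_slug LIKE 'news/%'
    ORDER BY a.published DESC, a.id DESC
""")]

# Filtering with id IN (subquery): the outer query reads only the article
# table, so each article can appear at most once.
subquery_ids = [row[0] for row in conn.execute("""
    SELECT a.id
    FROM article a
    WHERE a.id IN (
        SELECT s.article_id
        FROM article_sections s
        JOIN section sec ON sec.id = s.section_id
        WHERE sec.full_slug LIKE 'news/%'
    )
    ORDER BY a.published DESC, a.id DESC
""")]

print(joined_ids)    # [3, 1, 1, 2]
print(subquery_ids)  # [3, 1, 2]
```

Articles 1 and 3 share a publication timestamp, yet both are kept, and the secondary sort on id keeps their relative order stable.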