Improving query speed in a many-to-many relationship

Date: 2022-10-04 16:15:17

In an attempt to teach myself how to program, I'm making a little web app (Flask, SQLAlchemy, Jinja) to display all the books I've ever ordered from Amazon.

In the "barest bones" possible way, I'm trying to learn how to replicate http://pinboard.in—that's my paragon; Maciej Cegłowski is a straight G ... I have no idea how his site runs so goddamn fast: I can load 160 bookmark entries—all with associated tags—in, I dunno, 500 ms? ... which is why I know I am doing something terribly, terribly wrong, as is discussed below. (If I could, I would just pay him to tutor me. lulz.)

In any case, I created a many-to-many relationship between my books Class and my tag Class such that a user can (1) click on a book and see all its tags, as well as (2) click on a tag and see all associated books. Here is my table architecture:

[Entity relationship diagram]

Here is the code for the relationship between the two Classes:

assoc = db.Table('assoc',
    db.Column('book_id', db.Integer, db.ForeignKey('books.book_id')),
    db.Column('tag_id', db.Integer, db.ForeignKey('tags.tag_id'))
)

class Book(db.Model):
    __tablename__ = 'books'
    book_id = db.Column(db.Integer, primary_key=True)
    title = db.Column(db.String(120), unique=True)
    auth = db.Column(db.String(120), unique=True)
    comment = db.Column(db.String(120), unique=True)
    date_read = db.Column(db.DateTime)
    era = db.Column(db.String(36))
    url = db.Column(db.String(120))
    notable = db.Column(db.String(1))

    tagged = db.relationship('Tag', secondary=assoc, backref=db.backref('thebooks',lazy='dynamic'))

    def __init__(self, title, auth, comment, date_read, era, url, notable):
        self.title = title
        self.auth = auth
        self.comment = comment
        self.date_read = date_read
        self.era = era
        self.url = url
        self.notable = notable

class Tag(db.Model):
    __tablename__ = 'tags'
    tag_id = db.Column(db.Integer, primary_key=True)
    tag_name = db.Column(db.String(120))

the problem

If I iterate through the books table only (~400 rows), the query runs and renders to the browser in lightning speed. No problem there.

{% for i in book_query %}
    <li>
      {{i.notable}}{{i.notable}}
      <a href="{{i.url}}">{{i.title}}</a>, {{i.auth}}
      <a href="/era/{{i.era}}">{{i.era}}</a> {{i.date_read}}
        {% if i.comment %}
          <p>{{i.comment}}</p>
        {% else %}
          <!-- print nothing -->
        {% endif %}
    </li>
{% endfor %}

If, however, I want to show any and all tags associated with a book, I change the code by nesting a for loop as follows:

{% for i in book_query %}
    <li>
      {{i.notable}}{{i.notable}}
      <a href="{{i.url}}">{{i.title}}</a>, {{i.auth}}
      <a href="/era/{{i.era}}">{{i.era}}</a>
        {% for ii in i.tagged %}
            <a href="/tag/{{ii.tag_name}}">{{ii.tag_name}}</a>
        {% endfor %}
      {{i.date_read}}
        {% if i.comment %}
          <p>{{i.comment}}</p>
        {% else %}
          <!-- print nothing -->
        {% endif %}
    </li>
  {% endfor %}

The query slows down significantly (takes about 20 seconds). My understanding is that this is happening because for every row in the book table, my code is iterating through the entire assoc table (i.e., "full table scan").

discussion (or, "what i think is happening")

Obviously, I am a complete noob—I've been programming for ~3 months. It's motivating just to get things working, but I realize I have large gaps in my knowledge base that I am trying to fill as I go along.

Right off the bat, I can appreciate that it's incredibly inefficient that, with each new book, the code is iterating through the entire association table (if that's indeed what's happening, which I believe it is). I think I need to cluster(?) or sort(?) the assoc table in such a way that once I retrieve all tags for the book with book_id == 1, I never again "check" the rows with book_id == 1 in the assoc table.

In other words, what I think is happening is this (in computerspeak):

  • Oh, he wants to know how the book with book_id == 1 in books table has been tagged

  • Okay, let me go to the assoc table

  • Row #1 ... Is book_id in assoc table equal to 1?

  • Okay, it is; then what is the tag_id for Row #1? ... [then computer goes to tag table to get tag_name, and returns it to the browser]

  • Row #2 ... is book_id in assoc table equal to 1?

  • Oh, no, it isn't ... okay, go to Row #3

  • Hmmmm, because my programmer is stupid and didn't make this table sorted or indexed in some way, I'm going to have to go through the entire assoc table looking for book_id == 1 when perhaps there aren't any more ...

Then, once we get to book_id == 2 in the books table the computer gets really mad:

  • Okay, he wants to know all the tags that go with book_id == 2

  • Okay, let me go to the assoc table

  • Row #1 ... wait a second ... didn't I check this one already?? Holy sh#t, I have to do this all over again??

  • Dammit ... okay ... Row #1 ... is book_id == 2? (I know it isn't! But I have to check anyway because my programmer is a dum-dum ...)

questions

So the question is, can I (1) sort(?) or cluster(?) the assoc table in some way that ensures more "intelligent" traversal through the assoc table, or, as a friend of mine suggested, do I (2) "learn to write good SQL queries"? (Note, I've never learned SQL since I've been handling everything with SQLAlchemy ... damn Alchemists ... enshrouding their magics in secret and whatnot.)

final words

Thanks for any input. If you have any suggestions to help me improve how I ask questions on * (this is my first post!) please let me know.

3 answers

#1


1  

Most of the answer is in the question.

In the first example, a single SQL query is executed when you iterate through the books table. In the second example, a separate query against assoc is executed for every Book, so about 400 additional SQL queries run, which is quite time-consuming. You can view them in your app's debug log if you set the SQLALCHEMY_ECHO config parameter:

app.config['SQLALCHEMY_ECHO'] = True

Or you can install Flask-DebugToolbar and look at these queries in web interface.
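To make the difference in query counts concrete without a Flask app, here is a minimal sketch that uses Python's built-in sqlite3 module in place of SQLAlchemy. The table layout follows the question, but the sample rows, the helper names (build_demo_db, count_selects, lazy_style, joined_style) and the exact SQL are assumptions for illustration; the statements SQLAlchemy actually emits will look different.

```python
import sqlite3


def build_demo_db():
    """In-memory database shaped like the question's schema (sample rows are made up)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE books (book_id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, tag_name TEXT);
        CREATE TABLE assoc (book_id INTEGER, tag_id INTEGER);
    """)
    conn.executemany("INSERT INTO books VALUES (?, ?)",
                     [(i, f"Book {i}") for i in range(1, 401)])
    conn.executemany("INSERT INTO tags VALUES (?, ?)",
                     [(1, "fiction"), (2, "history")])
    conn.executemany("INSERT INTO assoc VALUES (?, ?)",
                     [(i, 1 + i % 2) for i in range(1, 401)])
    conn.commit()
    return conn


def count_selects(conn, work):
    """Run work(conn) and return how many SELECT statements it executed."""
    statements = []
    conn.set_trace_callback(statements.append)
    work(conn)
    conn.set_trace_callback(None)
    return sum(1 for s in statements if s.lstrip().upper().startswith("SELECT"))


def lazy_style(conn):
    """One query for the books, then one more query per book for its tags."""
    for (book_id,) in conn.execute("SELECT book_id FROM books").fetchall():
        conn.execute(
            "SELECT tags.tag_name FROM tags "
            "JOIN assoc ON assoc.tag_id = tags.tag_id "
            "WHERE assoc.book_id = ?", (book_id,)).fetchall()


def joined_style(conn):
    """A single query that returns every (book, tag) pair."""
    conn.execute(
        "SELECT books.book_id, tags.tag_name FROM books "
        "JOIN assoc ON assoc.book_id = books.book_id "
        "JOIN tags ON assoc.tag_id = tags.tag_id").fetchall()


conn = build_demo_db()
print("per-book lookups:", count_selects(conn, lazy_style), "SELECTs")
print("single join:", count_selects(conn, joined_style), "SELECT")
```

With 400 books the first style issues one query for the list plus one per book, which is the pattern the echo log shows for the nested template loop.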

The best way to handle this problem is to learn the basics of SQL; you will need them anyway as your applications grow larger. Try to write a more optimized query in plain SQL. For your case it might look like this:

SELECT books.*, tags.tag_name FROM books
JOIN assoc ON assoc.book_id = books.book_id
JOIN tags ON assoc.tag_id = tags.tag_id
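The join above can be tried directly against a throwaway SQLite database. This is only a sketch: the three sample books, the two tags and the books_with_tags helper are invented for illustration. It also shows one thing worth knowing before porting the query: a plain JOIN drops books that have no tags, while a LEFT JOIN keeps them with an empty tag list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (book_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, tag_name TEXT);
    CREATE TABLE assoc (book_id INTEGER, tag_id INTEGER);
    INSERT INTO books VALUES (1, 'Dune'), (2, 'Emma'), (3, 'Untagged');
    INSERT INTO tags VALUES (1, 'scifi'), (2, 'classic');
    INSERT INTO assoc VALUES (1, 1), (1, 2), (2, 2);
""")


def books_with_tags(conn, join="JOIN"):
    """Run the join once and group the flat rows by book.

    `join` is a fixed keyword chosen by the caller ("JOIN" or "LEFT JOIN"),
    never user input, so formatting it into the SQL is acceptable here.
    """
    sql = f"""
        SELECT books.book_id, books.title, tags.tag_name FROM books
        {join} assoc ON assoc.book_id = books.book_id
        {join} tags ON assoc.tag_id = tags.tag_id
        ORDER BY books.book_id, tags.tag_name
    """
    grouped = {}
    for book_id, title, tag_name in conn.execute(sql):
        entry = grouped.setdefault(book_id, {"title": title, "tags": []})
        if tag_name is not None:  # LEFT JOIN yields NULL for untagged books
            entry["tags"].append(tag_name)
    return grouped


print(books_with_tags(conn))               # untagged book is missing
print(books_with_tags(conn, "LEFT JOIN"))  # untagged book has an empty list
```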

Then try to rewrite it in SQLAlchemy code, and group the rows by book before passing them to the HTML renderer:

# Single query to get all books and their tags
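# (join on the relationship attribute; string names such as 'tagged' are rejected by SQLAlchemy 2.0)
query = db.session.query(Book, Tag.tag_name).join(Book.tagged)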
# Dictionary of data to be passed to renderer
books = {}
for book, tag_name in query:
    book_data = books.setdefault(book.book_id, {'book': book, 'tags': []})
    book_data['tags'].append(tag_name)
# Rendering HTML
return render_template('yourtemplate.html', books=books)

Template code will look like this:

{% for book in books.values() %}
<li>
  {{ book.book.notable }}{{ book.book.notable }}
  <a href="{{ book.book.url }}">{{ book.book.title }}</a>, {{ book.book.auth }}
  <a href="/era/{{ book.book.era }}">{{ book.book.era }}</a>
  {% for tag in book.tags %}
    &nbsp;<a href="/tag/{{ tag }}" class="tag-link">{{ tag }}</a>&nbsp;
  {% endfor %}
  {{ book.book.date_read }}
    {% if book.book.comment %}
      <p>{{ book.book.comment }}</p>
    {% else %}
      <!-- print nothing -->
    {% endif %}
</li>
{% endfor %}

Another approach

If your database is PostgreSQL, you can write a query like this:

SELECT books.title, books.auth (...), array_agg(tags.tag_name) as book_tags FROM books
JOIN assoc ON assoc.book_id = books.book_id
JOIN tags ON assoc.tag_id = tags.tag_id
GROUP BY books.title, books.auth (...)

In this case you get each book's data with its tags already aggregated into an array. SQLAlchemy lets you build such a query:

from sqlalchemy import func

books = db.session.query(Book, func.array_agg(Tag.tag_name)).\
    join(Book.tagged).group_by(Book).all()
return render_template('yourtemplate.html', books=books)

And template has the following structure:

{% for book, tags in books %}
<li>
  {{ book.notable }}{{ book.notable }}
  <a href="{{ book.url }}">{{ book.title }}</a>, {{ book.auth }}
  <a href="/era/{{ book.era }}">{{ book.era }}</a>
  {% for tag in tags %}
    &nbsp;<a href="/tag/{{ tag }}" class="tag-link">{{ tag }}</a>&nbsp;
  {% endfor %}
  {{ book.date_read }}
    {% if book.comment %}
      <p>{{ book.comment }}</p>
    {% else %}
      <!-- print nothing -->
    {% endif %}
</li>
{% endfor %}
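For readers who are not on PostgreSQL, the aggregation idea from "Another approach" can be tried with SQLite's group_concat. To be clear, this swaps in a different function: group_concat returns a delimited string rather than an array, so the sketch splits it in Python, and it would break on tag names that contain a comma. The sample data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (book_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, tag_name TEXT);
    CREATE TABLE assoc (book_id INTEGER, tag_id INTEGER);
    INSERT INTO books VALUES (1, 'Dune'), (2, 'Emma');
    INSERT INTO tags VALUES (1, 'scifi'), (2, 'classic');
    INSERT INTO assoc VALUES (1, 1), (1, 2), (2, 2);
""")

# One row per book; the database does the grouping instead of Python.
rows = conn.execute("""
    SELECT books.title, group_concat(tags.tag_name, ',') AS book_tags FROM books
    JOIN assoc ON assoc.book_id = books.book_id
    JOIN tags ON assoc.tag_id = tags.tag_id
    GROUP BY books.book_id, books.title
    ORDER BY books.book_id
""").fetchall()

# group_concat does not guarantee an order, so sort after splitting.
tags_by_title = {title: sorted(book_tags.split(","))
                 for title, book_tags in rows}
print(tags_by_title)
```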

#2


1  

The following implementation, adapted from @Sergey-Shubin, was a workable solution to this question:

classes & table association declaration

assoc = db.Table('assoc',
    db.Column('book_id', db.Integer, db.ForeignKey('books.book_id')),
    db.Column('tag_id', db.Integer, db.ForeignKey('tags.tag_id'))
    )

class Book(db.Model):
    __tablename__ = 'books'
    book_id = db.Column(db.Integer, primary_key=True)
    title = db.Column(db.String(120), unique=True)
    auth = db.Column(db.String(120), unique=True)
    comment = db.Column(db.String(120), unique=True)
    date_read = db.Column(db.DateTime)
    era = db.Column(db.String(36))
    url = db.Column(db.String(120))
    notable = db.Column(db.String(1))    

    tagged = db.relationship('Tag', secondary=assoc, backref=db.backref('thebooks',lazy='dynamic'))

class Tag(db.Model):
    __tablename__ = 'tags'
    tag_id = db.Column(db.Integer, primary_key=True)
    tag_name = db.Column(db.String(120))

def construct_dict(query):
    # Each row is a (Book, Tag) pair, one row per association, so a book with
    # several tags appears several times; collect the tags per book.
    books_dict = {}
    for book, tag in query:
        book_data = books_dict.setdefault(book.book_id, {'bookkey': book, 'tagkey': []})
        if tag is not None:  # the outer join yields None for books without tags
            book_data['tagkey'].append(tag)
    return books_dict

route, sql-alchemy query

@app.route('/query')
def query():
    query = db.session.query(Book, Tag).outerjoin(Book.tagged) # query to get all books and their tags
    books_dict = construct_dict(query)

    return render_template("query.html", query=query, books_dict=books_dict)

#3


0  

If your query has a lot of books, fetching the tags for each book one by one in a separate SQL statement will kill your response time in network I/O.

One way to optimize that, if you know you will always need the tags for this query, is to tell SQLAlchemy to fetch all the dependent tags up front, via either a join or a subquery, instead of one query per book.

I don't see your query, but my guess is a subquery load would work best for your use case:

from sqlalchemy.orm import subqueryload

session.query(Book).options(subqueryload(Book.tagged)).filter(...).all()
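As a rough picture of what eager loading with a second query buys you, the sketch below loads the books with one statement and all of their tags with a second one, then stitches the two together in Python. It uses sqlite3 rather than SQLAlchemy, the data and the load_books_with_tags helper are invented, and the SQL is not what subqueryload generates (as far as I know, SQLAlchemy embeds the original book query as a subquery rather than sending an IN list). The point is only that the number of queries stays at two no matter how many books there are.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (book_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, tag_name TEXT);
    CREATE TABLE assoc (book_id INTEGER, tag_id INTEGER);
    INSERT INTO books VALUES (1, 'Dune'), (2, 'Emma'), (3, 'Untagged');
    INSERT INTO tags VALUES (1, 'scifi'), (2, 'classic');
    INSERT INTO assoc VALUES (1, 1), (1, 2), (2, 2);
""")


def load_books_with_tags(conn):
    """Two queries in total: one for the books, one for all of their tags."""
    books = {
        book_id: {"title": title, "tags": []}
        for book_id, title in conn.execute(
            "SELECT book_id, title FROM books ORDER BY book_id")
    }
    if not books:
        return books

    # One placeholder per loaded book id; fine for a demo, but very large
    # lists can exceed SQLite's limit on bound parameters.
    placeholders = ",".join("?" * len(books))
    tag_rows = conn.execute(
        "SELECT assoc.book_id, tags.tag_name FROM assoc "
        "JOIN tags ON tags.tag_id = assoc.tag_id "
        f"WHERE assoc.book_id IN ({placeholders}) "
        "ORDER BY assoc.book_id, tags.tag_name",
        list(books),
    )
    for book_id, tag_name in tag_rows:
        books[book_id]["tags"].append(tag_name)
    return books


print(load_books_with_tags(conn))
```

Unlike the inner join, this keeps books without tags, because the book list comes from its own query.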
