I've got a problem that keeps coming up with normalized databases, and I'm looking for the best solution.
Suppose I've got an album information database. I want to set up the schema in a normalized fashion, so I set up two tables: albums, which has one row for each album, and songs, which lists all the songs contained in those albums.
albums
------
aid
name

songs
-----
aid
sid
length
This setup is good for storing the data in a normalized fashion, as an album can contain any number of songs. However, accessing the data in an intuitive manner has now become a lot more difficult. A query which only grabs the information on a single album is simple, but how do you grab multiple albums at once in a single query?
Thus far, the best answer I have come up with is grouping by aid and converting the song information into arrays. For example, the result would look something like this:
aid, sids, lengths
1, [1, 2], [1:04, 5:45]
2, [3, 4, 5], [3:30, 4:30, 5:30]
When I want to work with the data, I have to then parse the sids and lengths, which seems a pointless exercise: I'm making the db concatenate a bunch of values just to separate them later.
My question: What is the best way to access a database with this sort of schema? Am I stuck with multiple arrays? Should I store the entirety of a song's information in an object and then put those songs into a single array, instead of having multiple arrays? Or is there a way of adding an arbitrary number of columns to the result set (a sort of infinite join) to accommodate N songs? I'm open to any ideas on how best to access the data.
I'm also concerned about efficiency, as these queries will be run often.
If it makes any difference, I'm using a PostgreSQL db along with a PHP front-end.
6 answers
#1
I have difficulty seeing your point. What exactly do you mean by "how do you grab multiple albums at once in a single query"? What exactly do you have difficulties with?
Intuitively I would say:
SELECT
  a.aid    AS album_id,
  a.name   AS album_name,
  s.sid    AS song_id,
  s.name   AS song_name,   /* assumes songs also has a name column */
  s.length AS song_length
FROM
  albums a
  INNER JOIN songs s ON a.aid = s.aid
WHERE
  a.aid IN (1, 2, 3)
and
SELECT
  a.aid         AS album_id,
  a.name        AS album_name,
  COUNT(s.sid)  AS count_songs,
  /* assuming you store an integer seconds value here, not a string containing '3:18' or such */
  SUM(s.length) AS sum_length
FROM
  albums a
  INNER JOIN songs s ON a.aid = s.aid
WHERE
  a.aid IN (1, 2, 3)
GROUP BY
  a.aid,
  a.name
Which one you use depends on what you want to know or display. Either you query the database for aggregate information, or you calculate it yourself in your app from the result of the first query.
Depending on how much data is cached in your app and how long the queries take, one strategy can be faster than the other. I would recommend querying the DB, though. DBs are made for this kind of stuff.
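If the goal is one object per album with its songs nested inside, the flat rows from the first query can be folded together in the app. The following is only a minimal sketch of that idea, not a definitive implementation: it uses Python with an in-memory SQLite database as a stand-in for PostgreSQL and PHP, the function name `load_albums` is made up for illustration, and the sample rows (with lengths stored as integer seconds) are invented.

```python
import sqlite3
from itertools import groupby
from operator import itemgetter


def load_albums(conn, album_ids):
    """Run one JOIN and fold the flat rows into one dict per album."""
    placeholders = ",".join("?" * len(album_ids))
    rows = conn.execute(
        f"""
        SELECT a.aid, a.name, s.sid, s.length
        FROM albums a
        INNER JOIN songs s ON a.aid = s.aid
        WHERE a.aid IN ({placeholders})
        ORDER BY a.aid, s.sid
        """,
        list(album_ids),
    ).fetchall()

    albums = []
    # groupby only merges adjacent rows, hence the ORDER BY a.aid above.
    for (aid, name), group in groupby(rows, key=itemgetter(0, 1)):
        songs = [{"sid": sid, "length": length} for _, _, sid, length in group]
        albums.append({"aid": aid, "name": name, "songs": songs})
    return albums


# Invented demo data; SQLite is only a stand-in for PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE albums (aid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE songs (sid INTEGER PRIMARY KEY, aid INTEGER, length INTEGER);
    INSERT INTO albums VALUES (1, 'First'), (2, 'Second');
    INSERT INTO songs VALUES (1, 1, 64), (2, 1, 345),
                             (3, 2, 210), (4, 2, 270), (5, 2, 330);
""")
albums = load_albums(conn, [1, 2])
```

The album name still travels on every row, but the app only ever sees it once per album, and nothing has to be concatenated and parsed again.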
#2
I see your point, but I have issues with the first query, because you end up with a lot of repeated data - the album name is repeated many times. I'm trying to have my cake and eat it, too - I want the data to be as compact as possible, but that's not realistic without aggregates.
Ah, I understand your question now. You're asking how best to micro-optimize something that's actually not very expensive for most cases. And the solution you're toying with is actually going to be significantly less efficient than the "problem" it's trying to solve.
My advice would be to join the tables and return the columns you need. For anything fewer than 10,000 records returned, you won't notice any significant wire-time penalty for handing back that AlbumName with each Song record.
If you notice it performing slowly in the field, then optimize it. But keep in mind that a lot of smart people have spent about 50 years of research making the "join the tables & return what you need" solution fast. I doubt you'll beat it with your home-rolled string concatenation/de-concatenation strategy.
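To put a rough number on the repeated album name, here is a back-of-the-envelope sketch. Every figure in it is a hypothetical assumption (10,000 rows, 20-byte names, a 100 Mbit/s link), and it ignores protocol overhead and compression, so treat it as an order-of-magnitude estimate only.

```python
def repeated_bytes(rows, name_bytes):
    """Extra bytes sent because the album name rides along on every song row."""
    return rows * name_bytes


def transfer_ms(n_bytes, mbit_per_s):
    """Time in milliseconds to push n_bytes over a link of the given speed."""
    return n_bytes * 8 / (mbit_per_s * 1_000_000) * 1000


# Hypothetical numbers: 10,000 song rows, 20-byte album names, 100 Mbit/s link.
extra = repeated_bytes(10_000, 20)
millis = transfer_ms(extra, 100)
print(f"{extra} extra bytes, roughly {millis:.0f} ms on the wire")
```

Under those assumptions the duplication costs about 200 KB and a few tens of milliseconds at most, which is why it rarely matters until the result sets get much bigger.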
#3
I agree with Jason Kester insofar as I think this is unlikely to really be a performance bottleneck in practice, even if you have 10 columns with repeated data. However, if you're bent on cutting out that repeated data, then I'd suggest using two queries:
Query #1:
  SELECT sid, length -- And whatever other per-song fields you want
  FROM songs
  ORDER BY aid
Query #2:
  SELECT aid, a.name, COUNT(*)
  FROM albums a
  JOIN songs s USING (aid)
  GROUP BY aid, a.name
  ORDER BY aid, a.name
The second query enables you to break up the output of the first query into segments appropriately. Note that this will only work reliably if you can assume that no changes will be made to the table between these two queries -- otherwise you'll need a transaction with SET TRANSACTION ISOLATION LEVEL SERIALIZABLE.
Again, the mere fact that you're using two separate queries is likely to make this slower overall, as in most cases the doubled network latency + query parsing + query planning is likely to swamp the effective increase in network throughput. But at least you won't have that nasty horrible feeling of sending repeated data... :)
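For what it's worth, the segmenting step could look something like the sketch below. It is Python with in-memory SQLite standing in for PostgreSQL, with invented sample data. It assumes every song's aid matches an existing album (otherwise the counts and the song rows drift apart), it adds sid as a secondary sort key so the order within an album is stable, and it skips the transaction handling mentioned above.

```python
import sqlite3
from itertools import islice

# Invented demo data; SQLite is only a stand-in for PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE albums (aid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE songs (sid INTEGER PRIMARY KEY, aid INTEGER, length INTEGER);
    INSERT INTO albums VALUES (1, 'First'), (2, 'Second');
    INSERT INTO songs VALUES (1, 1, 64), (2, 1, 345),
                             (3, 2, 210), (4, 2, 270), (5, 2, 330);
""")

# Query #1: per-song fields only, ordered so each album's songs are contiguous.
songs = conn.execute(
    "SELECT sid, length FROM songs ORDER BY aid, sid"
).fetchall()

# Query #2: one row per album, carrying the number of songs it owns.
counts = conn.execute("""
    SELECT a.aid, a.name, COUNT(*)
    FROM albums a
    JOIN songs s USING (aid)
    GROUP BY a.aid, a.name
    ORDER BY a.aid
""").fetchall()

# Walk the song rows once, taking the next n rows for each album in turn.
song_iter = iter(songs)
albums = [
    {"aid": aid, "name": name, "songs": list(islice(song_iter, n))}
    for aid, name, n in counts
]
```

Both result sets must be ordered by aid for the slices to line up, which is the whole reason the two queries share that ORDER BY.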
#4
The join queries ask the database to put the tables together, match the ids, and return a single table. That way the data can be dynamically configured for the current task, something that non-normalized databases cannot do.
#5
SELECT aid,GROUP_CONCAT(sid) FROM songs GROUP BY aid;
+-----+-------------------+
| aid | GROUP_CONCAT(sid) |
+-----+-------------------+
|   3 | 5,6,7             |
+-----+-------------------+
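Note that GROUP_CONCAT is the MySQL spelling (SQLite has it too); PostgreSQL's built-in counterparts are string_agg and array_agg. Whichever one is used, the app still has to take the string apart again, which is exactly the step the question complains about. A minimal sketch of that round trip, using Python with in-memory SQLite only because it ships with Python and understands GROUP_CONCAT; the sample rows are invented:

```python
import sqlite3

# Invented demo data matching the output shown above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE songs (sid INTEGER PRIMARY KEY, aid INTEGER, length INTEGER);
    INSERT INTO songs VALUES (5, 3, 200), (6, 3, 180), (7, 3, 240);
""")

rows = conn.execute(
    "SELECT aid, GROUP_CONCAT(sid) FROM songs GROUP BY aid"
).fetchall()

# Each row carries one comma-separated string per album, so the app must
# split it and convert every piece back to an int. The result is sorted
# because this query does not ask for any particular order inside the string.
sids_by_album = {
    aid: sorted(int(piece) for piece in joined.split(","))
    for aid, joined in rows
}
```

This is fine for ids, but it gets fragile as soon as the concatenated values can themselves contain the separator.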
#6
I wouldn't break your normalisation for that. Leave the tables normalised and then query them using the approach from this question: "How to concatenate strings of a string field in a PostgreSQL 'group by' query?"