I am building an app for a company that needs to control who sees which reports, by project and role. A report belongs to one project and can be seen by many roles (employee roles).
When a report is submitted, it is tagged with a project and a set of roles, like "project1" and {"manager","seller"}, so, for example, employees who work on project1 and are managers can see this report. The way I do it now depends heavily on arrays. This is what I have:
reports table:
project (string)
roles (array of strings)
employees table:
projects (array of strings) // all the projects the employee working/worked on
roles (array of strings) // employees can have many roles
When querying the reports an employee can see, I do something like this:
select reports.*
from reports
join employees on employees.roles && reports.roles
              and reports.project = any (employees.projects)
-- plus a WHERE clause that picks the current employee's row
I use PostgreSQL.
The problem is that I suspect this will not perform well (I'm not sure).
The only way I know to speed up this query is to create a GIN index on the reports (roles) column, to make the overlap check faster.
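For reference, the index I have in mind would look roughly like this. It is only a sketch: the index name is made up, and it assumes roles is a text[] column as described above.

```sql
-- Hypothetical index name; assumes reports.roles is text[].
-- The default GIN operator class for arrays supports the && (overlap) operator.
CREATE INDEX reports_roles_gin ON reports USING gin (roles);
```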
Besides performance, this tip made me worry:
Tip: Arrays are not sets; searching for specific array elements can be a sign of database misdesign. Consider using a separate table with a row for each item that would be an array element. This will be easier to search, and is likely to scale better for a large number of elements.
So is there a much better design for this, or will this work just fine?
1 Solution
#1
1
Short answer: What you're doing is reasonably sane, but consider using int arrays rather than strings, as they're faster to compare, and mind the caveats.
Personally, I'd normalize it: add a user_roles table, along with role2report and user2role. Performance-wise, the optimal case in my own experience is to pre-compute the current user's role_ids in your app, and then query with an IN clause on those role ids. This means:
select from reports join role2report ...
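A minimal sketch of what that normalized layout and two-step query could look like. The column names, the reports.id key, the user id 42 and the role ids 3 and 7 are all assumptions for illustration; only the table names user_roles, user2role and role2report come from the text above.

```sql
-- Hypothetical columns; adjust to your actual schema.
CREATE TABLE user_roles (
    id   serial PRIMARY KEY,
    name text NOT NULL UNIQUE
);
CREATE TABLE user2role (
    user_id integer NOT NULL,
    role_id integer NOT NULL REFERENCES user_roles (id),
    PRIMARY KEY (user_id, role_id)
);
CREATE TABLE role2report (
    role_id   integer NOT NULL REFERENCES user_roles (id),
    report_id integer NOT NULL,  -- assumed to point at reports.id
    PRIMARY KEY (role_id, report_id)
);

-- Step 1 (run once by the app): fetch the current user's role ids.
SELECT role_id FROM user2role WHERE user_id = 42;

-- Step 2: query reports with the pre-computed ids in an IN clause.
-- EXISTS is used instead of a plain join so a report matched by
-- several roles is returned only once.
SELECT r.*
FROM reports r
WHERE EXISTS (
    SELECT 1
    FROM role2report rr
    WHERE rr.report_id = r.id
      AND rr.role_id IN (3, 7)
);
```

The point of the split is that the second query only ever touches role2report with literal ids, so the planner never has to work through the user-to-role resolution.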
The same in triggers and such: the key is to compute the role_ids (or perm_ids), and then query. You do NOT, under any circumstance, want:
select from reports join role2report join crazy_user2role_role2role_rec_view
The biggest optimization from there involves caching a user's roles for convenience, using an int array or memcached or whatever. This avoids constantly hitting a crazy user2role joined with a recursive role2role view definition, and whatever other types of craziness your specs' edge cases lead you to. Mind cache invalidation.
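One way the int array variant of that cache might look, as a sketch only. It assumes employees has an id column and that a user2role (user_id, role_id) table exists; the role_ids column name is made up. In a real setup the subquery would be whatever expensive lookup (for example the recursive view) you are trying to avoid repeating.

```sql
-- Hypothetical cache column on the employees table.
ALTER TABLE employees ADD COLUMN role_ids integer[] NOT NULL DEFAULT '{}';

-- Rebuild the cache for one employee (id 42 is a placeholder).
-- This has to run whenever that employee's roles change,
-- which is the cache invalidation part.
UPDATE employees e
SET role_ids = COALESCE(
    (SELECT array_agg(ur.role_id ORDER BY ur.role_id)
     FROM user2role ur
     WHERE ur.user_id = e.id),
    '{}')
WHERE e.id = 42;
```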
Caching the access lists is much trickier in my experience: should you cache who can read? Write? Both? Are some objects public? Can non-logged-in guests access them too? It's a deluge of questions.
If you do cache that, use an int array as well. Toss in e.g. -1 to stand for public/guest access, and 0 to stand for registered/user access. Then use array overlaps in your queries (with registered users automatically matching rows tagged 0 and -1). Optimize your arrays accordingly to keep them small: if an array contains -1, that should be its only value; else the same for 0; else list the role ids that grant access.
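A rough sketch of that scheme, assuming a hypothetical read_acl column on reports and made-up role ids 3 and 7 for the logged-in user:

```sql
-- Hypothetical cached access list: -1 = public/guest, 0 = any registered user,
-- positive values = role ids that grant read access.
ALTER TABLE reports ADD COLUMN read_acl integer[] NOT NULL DEFAULT '{}';
CREATE INDEX reports_read_acl_gin ON reports USING gin (read_acl);

-- Guest: only rows marked public.
SELECT * FROM reports WHERE read_acl && ARRAY[-1];

-- Registered user holding roles 3 and 7: public rows, registered-only rows,
-- and rows granted to either role.
SELECT * FROM reports WHERE read_acl && ARRAY[-1, 0, 3, 7];
```

Keeping a public row's array to just {-1} means the overlap check has a single element to compare for the most common case.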
One caveat of using arrays, btw: at least until a recent version of Postgres (not sure now), no stats were collected on an array's contents. This made arrays sub-optimal for data sets in which a certain role_id can access most rows: in that case Postgres should ignore the GIN index, but without stats it cannot know that. That's a real performance killer right there, because it means PG will basically fetch the entire table to return the top 10 rows with appropriate perms, instead of index scanning it with a filter.
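If you want to check which plan you actually get on your version, something along these lines should show it. The read_acl and created_at columns and the role ids are assumptions for illustration, not part of the schema in the question.

```sql
-- Refresh planner statistics first, so the plan reflects current data.
ANALYZE reports;

-- Look at whether the plan does a bitmap scan on the GIN index over most of
-- the table, or walks an ordered index on created_at and filters as it goes.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM reports
WHERE read_acl && ARRAY[-1, 0, 3, 7]
ORDER BY created_at DESC
LIMIT 10;
```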