What is the best way to store event data in Redshift?

Time: 2022-07-10 13:49:37

I'm new to Redshift and am looking at the best way to store event data. The data consists of an identifier, time and JSON metadata about the current state.

I'm considering three approaches:

  1. Create a table for each event type with a column for each piece of data.
  2. Create a single table for events and store metadata as a JSON field.
  3. Create a single table with a column for every possible piece of data I might want to store.

The advantage of #1 is that I can filter on all data fields and the solution is more flexible. The disadvantage is that every time I want to add a new event type I have to create a new table.

The advantage of #2 is that I can put all types of events into a single table. The disadvantage is that to filter on anything in the metadata I would need to run a JSON function on every row.

The advantage of #3 is I can easily access all the fields without running a function and don't have to create a new table for each type. The disadvantage is whoever is using the data needs to remember which columns to ignore.

Is one of these ways better than the others or am I missing something entirely?

1 Answer

#1


This is a classic dilemma. After thinking about it for a while, at my company we ended up keeping the common properties of events in separate columns and the unique properties in a JSON field. Examples of common properties:

  • event type, timestamp (every event has them)
  • URL (missing for backend events and mobile app events, but present for all frontend events and worth having in its own column)
  • client properties: device, browser, OS (missing for backend events, but present for mobile app events and frontend events)

Examples of unique properties (no such properties in other events):

  • test name and variant in an A/B test event
  • product name or ID in a purchase event
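
As a rough illustration of this split, here is a minimal Python sketch of what the loading code might do before writing a row. The column names other than event_type and event_json (which appear in the query further down) are hypothetical, and so is the choice of which properties count as "common"; treat it as one possible arrangement, not the exact pipeline described here.

```python
import json

# Hypothetical list of "common" properties that get their own columns.
COMMON_COLUMNS = ("event_type", "event_time", "url", "device", "browser", "os")


def split_event(event):
    """Split a raw event dict into common column values plus a JSON string."""
    # Common properties missing from this event simply become NULL (None).
    row = {name: event.get(name) for name in COMMON_COLUMNS}
    # Everything else is event-specific and goes into the JSON column.
    unique = {k: v for k, v in event.items() if k not in COMMON_COLUMNS}
    row["event_json"] = json.dumps(unique, separators=(",", ":"))
    return row


raw = {
    "event_type": "Purchase",
    "event_time": "2022-07-10 13:49:37",
    "device": "iPhone",
    "os": "iOS",
    "product_id": 42,
}
row = split_event(raw)
```

Here the mobile purchase event has no URL, so that column stays empty, while product_id ends up inside event_json.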

Where to draw the line between a common and a unique property is your own judgement call, based on how many events share the property and how often it will be used in analytics queries to filter or group data. If a property is just "nice-to-have" and is not involved in regular analysis use cases (yeah, we all love to store anything that is trackable just in case), the burden of maintaining a separate column is overkill.

Also, if you have a unique property that you use extensively in queries, there is a hacky way to optimize. You can place this property at the beginning of your JSON column (yes, JSON object keys are formally unordered, but in Redshift the column is just a string, so you can fix the key order when you write it) and use LIKE with a wildcard only at the end of the pattern:

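select *
from event_table
where event_type = 'Start experiment'
  -- prefix match; requires "test_name" to be the first key, serialized without whitespace
  -- (note: "_" in a LIKE pattern also matches any single character)
  and event_json like '{"test_name":"my_awesome_test"%'
  -- instead of:
  -- and json_extract_path_text(event_json, 'test_name') = 'my_awesome_test'
;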

Used this way, LIKE is much faster than the JSON lookup (2-3x faster), because for each row it doesn't need to decode the JSON, find the key and check the value; it just checks whether the string starts with a given substring, which is a much cheaper operation.
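
The trick only works if the writer always emits the hot key first and with exactly the same spacing as the pattern. A minimal Python sketch of that, assuming the JSON is produced with the standard json module (both helper names are hypothetical): note that json.dumps puts a space after each colon by default, which would not match the pattern above, so compact separators are passed explicitly.

```python
import json


def serialize_with_hot_key(props, hot_key):
    """Serialize props as compact JSON with hot_key as the first key."""
    # Python dicts keep insertion order (3.7+), so insert the hot key first.
    ordered = {hot_key: props[hot_key]}
    ordered.update((k, v) for k, v in props.items() if k != hot_key)
    # Compact separators: no spaces after "," or ":".
    return json.dumps(ordered, separators=(",", ":"))


def like_prefix(hot_key, value):
    """Build a LIKE pattern for rows whose JSON starts with hot_key: value."""
    # Assumption: "_" and "%" inside key or value are NOT escaped here.
    return "{" + json.dumps(hot_key) + ":" + json.dumps(value) + "%"


event_json = serialize_with_hot_key(
    {"variant": "B", "test_name": "my_awesome_test"}, "test_name"
)
pattern = like_prefix("test_name", "my_awesome_test")
```

With these inputs the stored string begins with the same prefix the query's LIKE pattern looks for, regardless of the order in which the properties arrived.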
