Trending-tag implementation based on MySQL JSON

Date: 2021-07-31 15:52:17

I am trying to identify trending tags (based on maximum hits) in a time series using the MySQL JSON feature. Below is my table:

CREATE TABLE TAG_COUNTER (
    account       varchar(36) NOT NULL,
    time_id       INT NOT NULL,
    counters      JSON,
    PRIMARY KEY   (account, time_id)
)

On every web API request I get multiple different tags per account, and based on the number of tags I prepare an INSERT ... ON DUPLICATE KEY UPDATE query. The example below shows an insert with two tags.

INSERT INTO `TAG_COUNTER`
  (`account`, `time_id`, `counters`)
VALUES
  ('google', '2018061023', '{"tag1": 1, "tag2": 1}')
ON DUPLICATE KEY UPDATE `counters` =
  JSON_SET(`counters`,
           '$."tag1"',
           IFNULL(JSON_EXTRACT(`counters`,
                        '$."tag1"'), 0) + 1,
           '$."tag2"',
           IFNULL(JSON_EXTRACT(`counters`,
                        '$."tag2"'), 0) + 1
  );
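Since the number of tags varies per request, the statement above has to be generated dynamically. A minimal Python sketch of such a builder (`build_upsert` is a hypothetical helper; a real driver such as mysql-connector would execute it with bound parameters, and the tag names must already be trusted because the JSON paths are inlined):

```python
import json

def build_upsert(account, time_id, tags):
    """Build an INSERT ... ON DUPLICATE KEY UPDATE statement for a tag list.

    The account, time_id, and JSON document are bind parameters; the JSON
    paths are inlined, so `tags` must be sanitized names.
    """
    counters = json.dumps({tag: 1 for tag in tags})
    # One  '$."tag"', IFNULL(JSON_EXTRACT(counters, '$."tag"'), 0) + 1  pair per tag.
    pairs = ", ".join(
        '\'$."{0}"\', IFNULL(JSON_EXTRACT(counters, \'$."{0}"\'), 0) + 1'.format(tag)
        for tag in tags
    )
    sql = (
        "INSERT INTO TAG_COUNTER (account, time_id, counters) "
        "VALUES (%s, %s, %s) "
        "ON DUPLICATE KEY UPDATE counters = JSON_SET(counters, " + pairs + ")"
    )
    return sql, (account, time_id, counters)

sql, params = build_upsert("google", 2018061023, ["tag1", "tag2"])
```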

time_id is in yyyyMMddhh format, so each row is an hourly aggregation.
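Because time_id is yyyyMMddhh, the day and month buckets are just integer prefixes of it. A small Python sketch (hypothetical helper name) of deriving all three from a timestamp:

```python
from datetime import datetime

def time_buckets(ts):
    """Derive the hourly time_id and its day/month prefixes."""
    hour_id = int(ts.strftime("%Y%m%d%H"))  # yyyyMMddhh
    return {
        "hour": hour_id,            # e.g. 2018061023
        "day": hour_id // 100,      # e.g. 20180610 (drop the hour)
        "month": hour_id // 10000,  # e.g. 201806   (drop day and hour)
    }
```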

Now my problem is the retrieval of trending tags. The query below gives me the aggregation for tag1, but we do not know the tag names before issuing the query.

SELECT
SUBSTRING(time_id, 1, 6) AS month,
SUM(counters->>'$.tag1')
FROM TAG_COUNTER
WHERE counters->>'$.tag1' > 0
GROUP BY month;

So I need a generic GROUP BY query, along with an ORDER BY, that returns the trending tags per hour/day/month.

A sample of the expected output is:

Time(hour/day/month)  Tag_name  Tag_count_value(total hits)

When I searched the web, everywhere the structure mentioned was {"tag_name": "tag1", "tag_count": 1} instead of the direct {"tag1": 1}, and tag_name was used in the GROUP BY.

Q1) Is it always mandatory to have a commonly known JSON key in order to perform a GROUP BY?

Q2) If I have to go that way, what changes are needed in my INSERT ... ON DUPLICATE KEY UPDATE query for this new JSON label/value structure? The counter has to be created when it does not exist and incremented by one when it does.

Q3) Do I have to maintain an array of objects

[
 {"tag_name": "tag1", "tag_count": 2},
 {"tag_name": "tag2", "tag_count": 3}
]

or an object of objects like below?

{
 {"tag_name": "tag1", "tag_count": 2},
 {"tag_name": "tag2", "tag_count": 3}
}

Which of the above JSON structures is better in terms of INSERT and retrieval of the trending counts?

Q4) Can I keep the existing {"key": "value"} format instead of {"key_label": key, "value_label": "value"} and still extract the trends? I am thinking that {"key": "value"} is very straightforward and better performance-wise.

Q5) While retrieving, I am using SUBSTRING(time_id, 1, 6) AS month. Will that be able to use an index?

Or do I need to create multiple columns such as time_hour (2018061023), time_day (20180610), and time_month (201806) and query those specific columns?

Or can I use MySQL date-time functions? Would those use an index for faster retrieval?

Please help.

2 Answers

#1 (4 votes)

I don't see a good reason why you use JSON here. It's also not clear why you believe that a "NoSQL schema" within MySQL would do anything better.

What you probably need is something like this:

CREATE TABLE TAG_COUNTER (
    account       varchar(36) NOT NULL,
    time_id       INT NOT NULL,
    tag_name      varchar(50) NOT NULL,
    counter       INT UNSIGNED NOT NULL,
    PRIMARY KEY   (account, time_id, tag_name)
);

This will simplify your queries. The INSERT statement would look like:

INSERT INTO TAG_COUNTER
  (account, time_id, tag_name, counter)
VALUES
  ('google', 2018061023, 'tag1', 1),
  ('google', 2018061023, 'tag2', 1)
ON DUPLICATE KEY UPDATE counter = counter + VALUES(counter);

The SELECT statement might be something like this:

SELECT
    SUBSTRING(time_id, 1, 6) AS month,
    tag_name,
    SUM(counter) AS counter_agg
FROM TAG_COUNTER
GROUP BY month, tag_name
ORDER BY month, counter_agg DESC;

Note that I didn't try to optimize the table/schema for data size and performance; that would be a different question. But you can see that the queries are much simpler now.

#2 (1 vote)

As I said in the comments, I think moving away from JSON is the way to go. However, if you want to keep using JSON, the function below (a direct copy of the one in my answer to a related question, where its workings are explained) and procedure will do what you want.

DELIMITER //
DROP FUNCTION IF EXISTS json_sum_merge //
CREATE FUNCTION json_sum_merge(j1 JSON, total JSON) RETURNS JSON
DETERMINISTIC
BEGIN
  DECLARE knum INT DEFAULT 0;
  DECLARE jkeys JSON DEFAULT JSON_KEYS(j1);
  DECLARE kpath VARCHAR(30);
  DECLARE v INT;
  DECLARE l INT DEFAULT JSON_LENGTH(jkeys);
  kloop: LOOP
    IF knum >= l THEN
      LEAVE kloop;
    END IF;
    SET kpath = CONCAT('$.', JSON_EXTRACT(jkeys, CONCAT('$[', knum, ']')));
    SET v = JSON_EXTRACT(j1, kpath);
    IF JSON_CONTAINS_PATH(total, 'one', kpath) THEN
      SET total = JSON_REPLACE(total, kpath, JSON_EXTRACT(total, kpath) + v);
    ELSE
      SET total = JSON_SET(total, kpath, v);
    END IF;
    SET knum = knum + 1;
  END LOOP kloop;
  RETURN total;
END //
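In plain terms, the function merges j1 into total, summing values whenever a key already exists. A Python equivalent of that behavior (a sketch for illustration only):

```python
def json_sum_merge(j1, total):
    """Mirror the MySQL function: add each counter from j1 into total,
    creating the key when it is missing (JSON_SET) or summing when it is
    already present (JSON_REPLACE)."""
    merged = dict(total)
    for key, value in j1.items():
        merged[key] = merged.get(key, 0) + value
    return merged
```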

The procedure is similar to the one in my other answer, in that it finds all the distinct tags associated with a given substring of time_id (specified as a parameter) and sums the values associated with each tag. The individual tags and counts are then written to a temporary table, from which a selection is then made grouping by time period and tag name.

DELIMITER //
DROP PROCEDURE IF EXISTS count_tags //
CREATE PROCEDURE count_tags(IN period VARCHAR(50))
BEGIN
  DECLARE finished INT DEFAULT 0;
  DECLARE timeval VARCHAR(20);
  DECLARE knum, l INT;
  DECLARE jkeys JSON;
  DECLARE time_cursor CURSOR FOR SELECT DISTINCT time_id FROM tag_counter;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET finished=1;
  CREATE TEMPORARY TABLE tag_counts (Time VARCHAR(20), Tag_Name VARCHAR(30), Tag_count_value INT, INDEX(Time, Tag_Name));
  OPEN time_cursor;
  time_loop: LOOP
    FETCH time_cursor INTO timeval;
    IF finished=1 THEN
      LEAVE time_loop;
    END IF;
    SET @total = '{}';
    SET @query = CONCAT("SELECT MIN(@total:=json_sum_merge(counters, @total)) INTO @json FROM TAG_COUNTER WHERE time_id='", timeval, "'");
    PREPARE stmt FROM @query;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
    SET @query = CONCAT('INSERT INTO tag_counts VALUES(', period, ', ?, ?)');
    PREPARE stmt FROM @query;
    SET @timeval = timeval;
    SET l = JSON_LENGTH(@total);
    SET jkeys = JSON_KEYS(@total);
    SET knum = 0;
    key_loop: LOOP
      IF knum >= l THEN
        LEAVE key_loop;
      END IF;
      SET @k = JSON_EXTRACT(jkeys, CONCAT('$[', knum, ']'));
      SET @t = JSON_EXTRACT(@total, CONCAT('$.', @k));
      EXECUTE stmt USING @k, @t;
      SET knum = knum + 1;
    END LOOP key_loop;
    DEALLOCATE PREPARE stmt;
  END LOOP time_loop;
  SELECT Time, Tag_Name, SUM(Tag_count_value) AS Tag_count_value FROM tag_counts GROUP BY Time, Tag_Name;
  DROP TABLE tag_counts;
END //
DELIMITER ;

A couple of examples based on some limited sample data from your prior question. In these examples @timeval is equivalent to the time_id column. Input data:

account     time_id     counters
google      20180510    {"gmail_page_viewed": 2, "search_page_viewed": 51}
google      20180511    {"gmail_page_viewed": 3, "search_page_viewed": 102}
apple       20180511    {"apple_page_viewed": 5, "search_page_viewed": 16}

CALL count_tags('@timeval'):

Time        Tag_Name                Tag_count_value
20180510    "gmail_page_viewed"     2
20180510    "search_page_viewed"    51
20180511    "apple_page_viewed"     5
20180511    "gmail_page_viewed"     3
20180511    "search_page_viewed"    118

CALL count_tags('SUBSTRING(@timeval, 1, 6)'):

Time    Tag_Name                Tag_count_value
201805  "apple_page_viewed"     5
201805  "gmail_page_viewed"     5
201805  "search_page_viewed"    169
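As a sanity check, the per-period aggregation the procedure performs can be reproduced outside MySQL. A Python sketch (illustrative names; `rows` mirrors the input data above, and `period` plays the role of the procedure's parameter):

```python
from collections import defaultdict

def count_tags(rows, period):
    """Group counters by period(time_id) and sum each tag's value."""
    totals = defaultdict(int)
    for time_id, counters in rows:
        for tag, count in counters.items():
            totals[(period(time_id), tag)] += count
    return dict(totals)

rows = [
    ("20180510", {"gmail_page_viewed": 2, "search_page_viewed": 51}),
    ("20180511", {"gmail_page_viewed": 3, "search_page_viewed": 102}),
    ("20180511", {"apple_page_viewed": 5, "search_page_viewed": 16}),
]

hourly = count_tags(rows, period=lambda t: t)       # like CALL count_tags('@timeval')
monthly = count_tags(rows, period=lambda t: t[:6])  # like SUBSTRING(@timeval, 1, 6)
```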

Note that you can also use json_sum_merge to simplify your INSERT query, e.g.:

INSERT INTO `TAG_COUNTER`
  (`account`, `time_id`, `counters`)
VALUES
  ('apple', '20180511', '{"apple_page_viewed": 9, "itunes_page_viewed": 4}')
ON DUPLICATE KEY UPDATE `counters` = json_sum_merge(VALUES(counters), counters)

Result:

account     time_id     counters
apple       20180511    {"apple_page_viewed": 14, "itunes_page_viewed": 4, "search_page_viewed": 16}

Regarding the specific questions in your post:

  1. No. This answer shows it can be done with your existing data format.
  2. Not applicable.
  3. Not applicable.
  4. Yes, you can stick with the existing {"key": "value"} format.
  5. Since we have to go through every entry of tag_counter to get the list of tags, an index is not beneficial for that part. For the temporary table I have included an index on the Time and Tag_Name columns, which should help speed since they are used directly in the GROUP BY clause.

If you were to maintain a list of keys (e.g. in a separate table, maintained by a trigger on insert/update/delete to tag_counter) this code could be made a lot simpler and more efficient. But that is for another question.
