1410 google 2006-05-01 21:40:54 1 http://www.google.com
2005 google 2006-03-24 21:25:10 1 http://www.google.com
2005 google 2006-03-26 21:58:12
2178 google 2006-03-27 20:58:44 1 http://www.google.com
2178 google 2006-04-11 11:06:20
2178 google 2006-04-11 11:06:41
2178 google 2006-05-16 10:54:39 1 http://www.google.com
2421 google 2006-05-04 15:39:25 1 http://www.google.com
2421 google 2006-05-04 21:14:33 1 http://www.google.com
2421 google 2006-05-05 16:16:01
2722 google 2006-04-12 15:18:12 1 http://www.google.com
2722 google 2006-05-02 09:09:19 1 http://www.google.com
2722 google 2006-05-25 15:42:26 1 http://www.google.com
2722 google 2006-05-25 15:42:26 1 http://www.google.com
6497 google 2006-04-06 22:47:10 1 http://www.google.com
6497 google 2006-04-06 23:05:58 1 http://www.google.com
9777 google 2006-03-11 23:25:57 1 http://www.google.com
9844 google 2006-03-19 10:31:09
9844 google 2006-03-19 10:31:12 1 http://www.google.com
12404 google 2006-03-04 00:42:26 1 http://www.google.com
12404 google 2006-03-13 21:17:22 1 http://www.google.com
12404 google 2006-03-13 21:17:22 1 http://www.google.com
12404 google 2006-03-13 21:17:22 1 http://www.google.com
12404 google 2006-03-13 21:17:22 1 http://www.google.com
12404 google 2006-03-13 21:47:04 1 http://www.google.com
12404 google 2006-03-13 21:47:04 1 http://www.google.com
12404 google 2006-03-22 16:57:44 1 http://www.google.com
12404 google 2006-03-23 22:07:33 1 http://www.google.com
12404 google 2006-03-23 22:07:33 1 http://www.google.com
12404 google 2006-03-23 22:07:33 1 http://www.google.com
Considering the search query log excerpt above, I would like to do two things: first, randomly select a user (based on the id), and second, extract the first and last timestamp of that user. I came across a similar answer that uses the following regex:
private static final Pattern LINE_REGEX = Pattern.compile(
"[0-9]+" // user id
+ "\\s+" // space after user id
+ "(.*?[^\\s])" // user name (group 1)
+ "\\s+" // space after user name
+ "([0-9]+-.{14})" // timestamp (group 2)
+ "\\s+" //space after timestamp
+ "[0-9]*" // random int
+ "\\s+" //space after random int
+ "(.*[^\\s])" // user action (group 3)
);
try(Stream<String> stream = Files.lines(Paths.get("file name"))) {
result = stream.map(LINE_REGEX::matcher)
// filter out any lines without an Action
.filter(Matcher::matches)
// group by User
.collect(Collectors.groupingBy((Matcher m) -> m.group(1),
Collectors.collectingAndThen(
// compare Timestamp (min for earliest, max for latest)
Collectors.maxBy(Comparator.comparing((Matcher m) -> m.group(2))),
// extract Action
(Optional<Matcher> m) -> m.get().group(3))));
}
But there are two problems. First, it groups (in my case) by the keyword and not by the user's id. Second, if I use .minBy(), it returns the first timestamp of some other random user, not of the same user as .maxBy(). Any idea how to fix this?
1 solution
#1
You are currently grouping the users by the USERNAME (or keyword), not the USER ID. Since the username is always "google", all lines are in one group.
Put the first part of the regular expression (user ID) in parentheses; and either remove the group parentheses around the username part or increase the group indexes for timestamp and action.
private static final Pattern LINE_REGEX = Pattern.compile(
"([0-9]+)" // user id (group 1) <- parentheses go here
+ "\\s+" // space after user id
+ ".*?[^\\s]" // user name (no longer a capturing group) <- not here
+ "\\s+" // space after user name
+ "([0-9]+-.{14})" // timestamp (group 2)
+ "\\s+" //space after timestamp
+ "[0-9]*" // random int
+ "\\s+" //space after random int
+ "(.*[^\\s])" // user action (group 3)
);
If you need the keyword/username, you could keep it as a group as well, and then ask the matcher for keyword/username + action afterwards.
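If shifting group indexes becomes error-prone, named capturing groups are one way around it. The following is only a sketch: the group names (id, keyword, timestamp, number, action) and the class name are made up for illustration, and the pattern is the same as the one above apart from the names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupsSketch {
    // Same line layout as above; the group names are assumptions of this sketch.
    static final Pattern LINE = Pattern.compile(
            "(?<id>[0-9]+)\\s+"                  // user id
          + "(?<keyword>.*?[^\\s])\\s+"          // user name / keyword
          + "(?<timestamp>[0-9]+-.{14})\\s+"     // timestamp
          + "(?<number>[0-9]*)\\s+"              // random int
          + "(?<action>.*[^\\s])");              // user action

    // Returns a short summary of the line, or "no match" for lines without an action.
    static String describe(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return "no match";
        }
        return m.group("id") + " searched '" + m.group("keyword") + "' -> " + m.group("action");
    }

    public static void main(String[] args) {
        // prints: 2005 searched 'google' -> http://www.google.com
        System.out.println(describe("2005 google 2006-03-24 21:25:10 1 http://www.google.com"));
        // prints: no match
        System.out.println(describe("2005 google 2006-03-26 21:58:12"));
    }
}
```

With names, adding or removing a group does not change how the other groups are addressed.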
Also keep in mind that you currently order the timestamps by String comparison (which works with this specific format, though, because the fields are zero-padded and run from most to least significant).
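If the timestamp format ever changes, comparing parsed values is safer than comparing strings. A minimal sketch, assuming the format shown in the log lines ("2006-05-01 21:40:54"); the class and method names are invented for this example:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class TimestampSketch {
    // Assumed format, taken from the sample lines in the question.
    static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss");

    static LocalDateTime parse(String timestamp) {
        return LocalDateTime.parse(timestamp, FORMAT);
    }

    // Returns the latest timestamp string, compared as a date-time rather than as text.
    static String latest(List<String> timestamps) {
        return timestamps.stream()
                .max(Comparator.comparing(TimestampSketch::parse))
                .orElseThrow(() -> new IllegalArgumentException("no timestamps"));
    }

    public static void main(String[] args) {
        List<String> stamps = Arrays.asList(
                "2006-03-27 20:58:44", "2006-05-16 10:54:39", "2006-04-11 11:06:20");
        // prints: 2006-05-16 10:54:39
        System.out.println(latest(stamps));
    }
}
```

In the stream from the question, the same idea would mean passing a comparator that parses the timestamp group instead of comparing the raw group text.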
Edit: I'll add a complete example.
Pattern LINE_REGEX = Pattern.compile("([0-9]+)" // user id (group 1)
+ "\\s+" // space after user id
+ "(.*?[^\\s])" // user name (group 2)
+ "\\s+" // space after user name
+ "([0-9]+-.{14})" // timestamp (group 3)
+ "\\s+" // space after timestamp
+ "([0-9]*)" // random int (group 4)
+ "\\s+" // space after random int
+ "(.*[^\\s])" // user action (group 5)
);
Stream<String> lines = Stream.of("1410 google 2006-05-01 21:40:54 1 http://www.google.com",
"2005 google 2006-03-24 21:25:10 1 http://www.google.com", "2005 google 2006-03-26 21:58:12",
"2178 google 2006-03-27 20:58:44 1 http://www.google.com", "2178 google 2006-04-11 11:06:20",
"2178 google 2006-04-11 11:06:41", "2178 google 2006-05-16 10:54:39 1 http://www.google.com",
"2421 google 2006-05-04 15:39:25 1 http://www.google.com", "2421 google 2006-05-04 21:14:33 1 http://www.google.com",
"2421 google 2006-05-05 16:16:01", "2722 google 2006-04-12 15:18:12 1 http://www.google.com",
"2722 google 2006-05-02 09:09:19 1 http://www.google.com", "2722 google 2006-05-25 15:42:26 1 http://www.google.com",
"2722 google 2006-05-25 15:42:26 1 http://www.google.com", "6497 google 2006-04-06 22:47:10 1 http://www.google.com",
"6497 google 2006-04-06 23:05:58 1 http://www.google.com", "9777 google 2006-03-11 23:25:57 1 http://www.google.com",
"9844 google 2006-03-19 10:31:09", "9844 google 2006-03-19 10:31:12 1 http://www.google.com",
"12404 google 2006-03-04 00:42:26 1 http://www.google.com", "12404 google 2006-03-13 21:17:22 1 http://www.google.com",
"12404 google 2006-03-13 21:17:22 1 http://www.google.com", "12404 google 2006-03-13 21:17:22 1 http://www.google.com",
"12404 google 2006-03-13 21:17:22 1 http://www.google.com", "12404 google 2006-03-13 21:47:04 1 http://www.google.com",
"12404 google 2006-03-13 21:47:04 1 http://www.google.com", "12404 google 2006-03-22 16:57:44 1 http://www.google.com",
"12404 google 2006-03-23 22:07:33 1 http://www.google.com", "12404 google 2006-03-23 22:07:33 1 http://www.google.com",
"12404 google 2006-03-23 22:07:33 1 http://www.google.com");
Map<String, Matcher> result =
lines.map(LINE_REGEX::matcher)
.filter(Matcher::matches)
.collect(Collectors.groupingBy((Matcher m) -> m.group(1),
// the timestamp is group 3 in this pattern (group 2 is the user name)
Collectors.collectingAndThen(Collectors.maxBy(Comparator.comparing((Matcher m) -> m.group(3))),
Optional<Matcher>::get)));
// print the user id (the map key), then the five captured parts of that user's latest matching line
result.forEach((k, v) -> System.out.println(k + ": " + v.group(1) + " " + v.group(2) + " "
+ v.group(3) + " " + v.group(4) + " " + v.group(5)));
--------- Output ---------------
2722: 2722 google 2006-05-25 15:42:26 1 http://www.google.com
9777: 9777 google 2006-03-11 23:25:57 1 http://www.google.com
2005: 2005 google 2006-03-24 21:25:10 1 http://www.google.com
9844: 9844 google 2006-03-19 10:31:12 1 http://www.google.com
6497: 6497 google 2006-04-06 23:05:58 1 http://www.google.com
1410: 1410 google 2006-05-01 21:40:54 1 http://www.google.com
2421: 2421 google 2006-05-04 21:14:33 1 http://www.google.com
2178: 2178 google 2006-05-16 10:54:39 1 http://www.google.com
12404: 12404 google 2006-03-23 22:07:33 1 http://www.google.com
I have now wrapped every part of the line in a capturing group (read the Pattern documentation for more info). This way, you can access each part at the end by asking the matcher for the groups.
The groups are numbered in the order their opening parentheses appear in the regex, starting with 1. Asking for matcher.group(0) returns the whole match, which here is the whole line.
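To get both the first and the last timestamp of the same user, and then pick one user at random, something along these lines might work. It is a sketch under a few assumptions: it uses a simplified pattern that captures only the id and the timestamp (so lines without an action are kept too), it relies on the string ordering of the timestamps as discussed above, and the class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class FirstLastSketch {
    // Simplified pattern (an assumption of this sketch): group 1 = user id, group 2 = timestamp.
    static final Pattern ID_AND_TIME = Pattern.compile(
            "([0-9]+)\\s+.*?\\s+([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2})");

    // Maps each user id to a two-element array: {first timestamp, last timestamp}.
    static Map<String, String[]> firstAndLast(List<String> lines) {
        return lines.stream()
                .map(ID_AND_TIME::matcher)
                .filter(Matcher::lookingAt) // match from the start; the rest of the line is optional
                .collect(Collectors.toMap(
                        m -> m.group(1),
                        m -> new String[] {m.group(2), m.group(2)},
                        // merge two ranges of the same user: keep the smaller start and the larger end
                        (a, b) -> new String[] {
                                a[0].compareTo(b[0]) <= 0 ? a[0] : b[0],
                                a[1].compareTo(b[1]) >= 0 ? a[1] : b[1]}));
    }

    // Picks one user id uniformly at random from the map's keys.
    static String randomUser(Map<String, String[]> byUser, Random random) {
        List<String> ids = new ArrayList<>(byUser.keySet());
        return ids.get(random.nextInt(ids.size()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "2178 google 2006-03-27 20:58:44 1 http://www.google.com",
                "2178 google 2006-04-11 11:06:20",
                "2178 google 2006-05-16 10:54:39 1 http://www.google.com",
                "9777 google 2006-03-11 23:25:57 1 http://www.google.com");
        Map<String, String[]> byUser = firstAndLast(lines);
        String id = randomUser(byUser, new Random());
        String[] range = byUser.get(id);
        System.out.println(id + ": first=" + range[0] + ", last=" + range[1]);
    }
}
```

Because the grouping key is the user id, the first and last timestamps always belong to the same user, which addresses the second problem from the question.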