
Time: 2022-09-26 12:44:29
1410    google  2006-05-01 21:40:54 1   http://www.google.com
2005    google  2006-03-24 21:25:10 1   http://www.google.com
2005    google  2006-03-26 21:58:12
2178    google  2006-03-27 20:58:44 1   http://www.google.com
2178    google  2006-04-11 11:06:20
2178    google  2006-04-11 11:06:41
2178    google  2006-05-16 10:54:39 1   http://www.google.com
2421    google  2006-05-04 15:39:25 1   http://www.google.com
2421    google  2006-05-04 21:14:33 1   http://www.google.com
2421    google  2006-05-05 16:16:01
2722    google  2006-04-12 15:18:12 1   http://www.google.com
2722    google  2006-05-02 09:09:19 1   http://www.google.com
2722    google  2006-05-25 15:42:26 1   http://www.google.com
2722    google  2006-05-25 15:42:26 1   http://www.google.com
6497    google  2006-04-06 22:47:10 1   http://www.google.com
6497    google  2006-04-06 23:05:58 1   http://www.google.com
9777    google  2006-03-11 23:25:57 1   http://www.google.com
9844    google  2006-03-19 10:31:09
9844    google  2006-03-19 10:31:12 1   http://www.google.com
12404   google  2006-03-04 00:42:26 1   http://www.google.com
12404   google  2006-03-13 21:17:22 1   http://www.google.com
12404   google  2006-03-13 21:17:22 1   http://www.google.com
12404   google  2006-03-13 21:17:22 1   http://www.google.com
12404   google  2006-03-13 21:17:22 1   http://www.google.com
12404   google  2006-03-13 21:47:04 1   http://www.google.com
12404   google  2006-03-13 21:47:04 1   http://www.google.com
12404   google  2006-03-22 16:57:44 1   http://www.google.com
12404   google  2006-03-23 22:07:33 1   http://www.google.com
12404   google  2006-03-23 22:07:33 1   http://www.google.com
12404   google  2006-03-23 22:07:33 1   http://www.google.com

Given the search query log excerpt above, I would like to do two things: first, randomly select a user (based on the ID), and second, extract the first and last timestamp of that user. I came across a similar answer that uses the following regex:


private static final Pattern LINE_REGEX = Pattern.compile(
    "[0-9]+" // user id
    + "\\s+" // space after user id
    + "(.*?[^\\s])" // user name (group 1)
    + "\\s+" // space after user name
    + "([0-9]+-.{14})" // timestamp (group 2)
    + "\\s+" // space after timestamp
    + "[0-9]*" // random int
    + "\\s+" // space after random int
    + "(.*[^\\s])"); // user action (group 3)


try (Stream<String> stream = Files.lines(Paths.get("file name"))) {
    result = stream.map(LINE_REGEX::matcher)
        // filter out any lines without an Action
        .filter(Matcher::matches)
        // group by User
        .collect(Collectors.groupingBy((Matcher m) -> m.group(1),
            Collectors.collectingAndThen(
                // compare Timestamp (min for earliest, max for latest)
                Collectors.maxBy(Comparator.comparing((Matcher m) -> m.group(2))),
                // extract Action
                (Optional<Matcher> m) -> m.get().group(3))));
}


But there are two problems. First, it groups (in my case) by the keyword and not by the user's ID. Second, if I use .minBy(), it returns the first timestamp of some other user, not of the same user that .maxBy() returned. Any idea how to fix this?


1 Solution



You are currently grouping the users by the USERNAME (or keyword), not the USER ID. Since the username is always "google", all lines are in one group.


Put the first part of the regular expression (the user ID) in parentheses, and either remove the group parentheses around the username part or increase the group indexes for timestamp and action.


private static final Pattern LINE_REGEX = Pattern.compile(
    "([0-9]+)" // user id (group 1)            <- parentheses go here
    + "\\s+" // space after user id
    + ".*?[^\\s]" // user name (no group)      <- not here
    + "\\s+" // space after user name
    + "([0-9]+-.{14})" // timestamp (group 2)
    + "\\s+" // space after timestamp
    + "[0-9]*" // random int
    + "\\s+" // space after random int
    + "(.*[^\\s])"); // user action (group 3)

If you need the keyword/username, you could keep it as a group as well, and then ask the matcher for keyword/username + action afterwards.
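As an alternative to counting group indexes, here is a minimal sketch using Java's named capturing groups, so adding or removing a group does not shift the others. The class name NamedGroups and the group names id, name, timestamp and action are my own choices for illustration; they are not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroups {
    // Same line layout as above, but each captured part gets a name instead of a number.
    static final Pattern LINE_REGEX = Pattern.compile(
        "(?<id>[0-9]+)"                 // user id
        + "\\s+"                        // space after user id
        + "(?<name>.*?[^\\s])"          // user name / keyword
        + "\\s+"                        // space after user name
        + "(?<timestamp>[0-9]+-.{14})"  // timestamp
        + "\\s+"                        // space after timestamp
        + "[0-9]*"                      // random int (not captured)
        + "\\s+"                        // space after random int
        + "(?<action>.*[^\\s])");       // user action

    public static void main(String[] args) {
        Matcher m = LINE_REGEX.matcher("1410    google  2006-05-01 21:40:54 1   http://www.google.com");
        if (m.matches()) {
            // Ask for the parts by name rather than by position.
            System.out.println(m.group("id") + " " + m.group("name") + " "
                + m.group("timestamp") + " " + m.group("action"));
        }
    }
}
```

Grouping by user would then read m.group("id"), and the comparator would read m.group("timestamp"), regardless of how many other groups the pattern contains.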


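Also keep in mind that you currently order the timestamps by String comparison. That does work with this specific format, because yyyy-MM-dd HH:mm:ss strings sort lexicographically in chronological order, but it would break with other date formats.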
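If you would rather not rely on String ordering, one option is to parse the timestamps with java.time before comparing them. This is only a sketch: the class name TimestampParsing and the helper parse are hypothetical, and I am assuming every timestamp has exactly the yyyy-MM-dd HH:mm:ss layout shown in the sample data.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class TimestampParsing {
    // Formatter for timestamps such as "2006-05-01 21:40:54" (assumed layout).
    static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Converts the captured timestamp text into a comparable date-time value.
    static LocalDateTime parse(String timestamp) {
        return LocalDateTime.parse(timestamp, FORMAT);
    }

    public static void main(String[] args) {
        LocalDateTime first = parse("2006-03-27 20:58:44");
        LocalDateTime last = parse("2006-05-16 10:54:39");
        System.out.println(first.isBefore(last)); // prints true
    }
}
```

The comparator could then be built on the parsed value, for example Comparator.comparing((Matcher m) -> TimestampParsing.parse(m.group(2))) with the three-group regex above.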


Edit: I'll add a complete example.


Pattern LINE_REGEX = Pattern.compile("([0-9]+)"       // user id (group 1)
                                   + "\\s+"           // space after user id
                                   + "(.*?[^\\s])"    // user name (group 2)
                                   + "\\s+"           // space after user name
                                   + "([0-9]+-.{14})" // timestamp (group 3)
                                   + "\\s+"           // space after timestamp
                                   + "([0-9]*)"       // random int (group 4)
                                   + "\\s+"           // space after random int
                                   + "(.*[^\\s])");   // user action (group 5)
Stream<String> lines = Stream.of("1410    google  2006-05-01 21:40:54 1   http://www.google.com",
        "2005    google  2006-03-24 21:25:10 1   http://www.google.com", "2005    google  2006-03-26 21:58:12",
        "2178    google  2006-03-27 20:58:44 1   http://www.google.com", "2178    google  2006-04-11 11:06:20",
        "2178    google  2006-04-11 11:06:41", "2178    google  2006-05-16 10:54:39 1   http://www.google.com",
        "2421    google  2006-05-04 15:39:25 1   http://www.google.com", "2421    google  2006-05-04 21:14:33 1   http://www.google.com",
        "2421    google  2006-05-05 16:16:01", "2722    google  2006-04-12 15:18:12 1   http://www.google.com",
        "2722    google  2006-05-02 09:09:19 1   http://www.google.com", "2722    google  2006-05-25 15:42:26 1   http://www.google.com",
        "2722    google  2006-05-25 15:42:26 1   http://www.google.com", "6497    google  2006-04-06 22:47:10 1   http://www.google.com",
        "6497    google  2006-04-06 23:05:58 1   http://www.google.com", "9777    google  2006-03-11 23:25:57 1   http://www.google.com",
        "9844    google  2006-03-19 10:31:09", "9844    google  2006-03-19 10:31:12 1   http://www.google.com",
        "12404   google  2006-03-04 00:42:26 1   http://www.google.com", "12404   google  2006-03-13 21:17:22 1   http://www.google.com",
        "12404   google  2006-03-13 21:17:22 1   http://www.google.com", "12404   google  2006-03-13 21:17:22 1   http://www.google.com",
        "12404   google  2006-03-13 21:17:22 1   http://www.google.com", "12404   google  2006-03-13 21:47:04 1   http://www.google.com",
        "12404   google  2006-03-13 21:47:04 1   http://www.google.com", "12404   google  2006-03-22 16:57:44 1   http://www.google.com",
        "12404   google  2006-03-23 22:07:33 1   http://www.google.com", "12404   google  2006-03-23 22:07:33 1   http://www.google.com",
        "12404   google  2006-03-23 22:07:33 1   http://www.google.com");

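Map<String, Matcher> result = lines.map(LINE_REGEX::matcher)
         .filter(Matcher::matches) // runs the match and drops lines without an action
         .collect(Collectors.groupingBy((Matcher m) -> m.group(1), // group by user id
                                        Collectors.collectingAndThen(
                                                // timestamp is group 3 here; minBy = earliest (as in the output), maxBy = latest
                                                Collectors.minBy(Comparator.comparing((Matcher m) -> m.group(3))), Optional::get)));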

result.forEach((k, v) -> System.out.println(k + ": " + v.group(1) + " " + v.group(2) + " "
                                            + v.group(3) + " " + v.group(4) + " " + v.group(5)));

--------- Output ---------------

2722: 2722 google 2006-04-12 15:18:12 1 http://www.google.com
9777: 9777 google 2006-03-11 23:25:57 1 http://www.google.com
2005: 2005 google 2006-03-24 21:25:10 1 http://www.google.com
9844: 9844 google 2006-03-19 10:31:12 1 http://www.google.com
6497: 6497 google 2006-04-06 22:47:10 1 http://www.google.com
1410: 1410 google 2006-05-01 21:40:54 1 http://www.google.com
2421: 2421 google 2006-05-04 15:39:25 1 http://www.google.com
2178: 2178 google 2006-03-27 20:58:44 1 http://www.google.com
12404: 12404 google 2006-03-04 00:42:26 1 http://www.google.com

I have now wrapped every part of the line in a capturing group (see the Pattern documentation for more information). This way, you can access each part at the end by asking the matcher for its groups.


The groups are numbered in the order in which their opening parentheses appear in the regex, starting with 1. Asking for matcher.group(0) returns the entire match, which here is the whole line.
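To get both the first and the last timestamp for the same user, and then pick one user at random, something along these lines might work. It is a sketch rather than a definitive solution: the class FirstAndLastPerUser and the methods firstAndLast and randomUser are names I made up, it reuses the five-group regex from the complete example, and like that example it skips lines that have no action.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class FirstAndLastPerUser {
    // Same regex as in the complete example: id (1), name (2), timestamp (3), int (4), action (5).
    static final Pattern LINE_REGEX = Pattern.compile(
        "([0-9]+)\\s+(.*?[^\\s])\\s+([0-9]+-.{14})\\s+([0-9]*)\\s+(.*[^\\s])");

    /** Maps each user id to a two-element array {first timestamp, last timestamp}. */
    static Map<String, String[]> firstAndLast(List<String> lines) {
        return lines.stream()
            .map(LINE_REGEX::matcher)
            .filter(Matcher::matches) // lines without an action do not match and are skipped
            .collect(Collectors.toMap(
                m -> m.group(1),                             // key: user id
                m -> new String[] {m.group(3), m.group(3)},  // a single line: first == last
                (a, b) -> new String[] {                     // merge two ranges of the same user
                    a[0].compareTo(b[0]) <= 0 ? a[0] : b[0],
                    a[1].compareTo(b[1]) >= 0 ? a[1] : b[1]}));
    }

    /** Picks one user id at random; passing a seeded Random makes runs repeatable. */
    static String randomUser(Map<String, String[]> ranges, Random random) {
        List<String> ids = new ArrayList<>(ranges.keySet());
        return ids.get(random.nextInt(ids.size()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "2178    google  2006-03-27 20:58:44 1   http://www.google.com",
            "2178    google  2006-04-11 11:06:20",
            "2178    google  2006-05-16 10:54:39 1   http://www.google.com",
            "9777    google  2006-03-11 23:25:57 1   http://www.google.com");
        Map<String, String[]> ranges = firstAndLast(lines);
        String id = randomUser(ranges, new Random());
        System.out.println(id + ": first=" + ranges.get(id)[0] + ", last=" + ranges.get(id)[1]);
    }
}
```

Because the minimum and maximum are tracked inside the same map entry, both values always belong to the same user ID, which avoids the mismatch between separate minBy and maxBy runs.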




