I have a log file written to by several instances of a cgi script. I need to extract certain information, with the following typical workflow:
1. Search for the first occurrence of RequestString.
2. Extract the PID from that log line.
3. Search backwards for the first occurrence of PID<separator>ConnectionString, to identify the client that initiated the request.
4. Do something with ConnectionString and repeat the search from after RequestString.
What is the best way to do this? I was thinking of writing a Perl script that caches the last N lines and then matches through those lines to perform step 3.
Is there any better way to do this? Like extended regex that would do exactly this?
Sample with line numbers for reference -- not part of the file:
1 date pid1 ConnectionString1
2 date pid2 ConnectionString2
3 date pid3 ConnectionString3
4 date pid2 SomeOutput2
5 date pid2 SomeOutput2
6 date pid4 ConnectionString4
7 date pid3 SomeOutput3
8 date pid4 RequestString4
9 date pid1 SomeOutput1
10 date pid1 ConnectionString1
11 date pid1 RequestString1
12 date pid5 RequestString5
When I grep through this sample file, I wish for the following to match:
- line 8, paired with line 6
- line 11, paired with line 10 (and not with line 1)
Specifically, the following shouldn't be matched:
- line 12, because no matching ConnectionString with that pid is found (pid5)
- line 1, because there is a new ConnectionString for that pid (line 10) before the next RequestString for that pid. (Imagine that the first connection attempt failed before logging the RequestString.)
- any of the lines from pid2/pid3, because they don't have a RequestString logged.
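The pairing described above can also be produced in a single forward pass that remembers the most recent ConnectionString seen for each PID. The following is only a sketch under assumptions not stated in the question: fields are whitespace-separated as "date pid message" (the date being a single token, and without the reference line numbers), and `pair_requests` is a hypothetical helper name.

```shell
#!/bin/bash
# pair_requests (hypothetical helper): read log lines of the assumed form
# "date pid message" on stdin and print
# "request_line connection_line pid connection"
# for every RequestString whose PID logged a ConnectionString earlier.
pair_requests() {
  awk '
    # A newer ConnectionString for a PID overwrites the older one.
    $3 ~ /^ConnectionString/ { conn[$2] = $3; line[$2] = NR; next }
    # Report a request only if its PID has a remembered connection.
    $3 ~ /^RequestString/ && ($2 in conn) { print NR, line[$2], $2, conn[$2] }
  '
}

# Demo on the sample from the question (reference line numbers stripped).
pair_requests <<'EOF'
date pid1 ConnectionString1
date pid2 ConnectionString2
date pid3 ConnectionString3
date pid2 SomeOutput2
date pid2 SomeOutput2
date pid4 ConnectionString4
date pid3 SomeOutput3
date pid4 RequestString4
date pid1 SomeOutput1
date pid1 ConnectionString1
date pid1 RequestString1
date pid5 RequestString5
EOF
# prints:
# 8 6 pid4 ConnectionString4
# 11 10 pid1 ConnectionString1
```

With this approach the file is read once and nothing is searched backwards. Memory grows with the number of distinct PIDs rather than with the number of ConnectionString lines.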
I could imagine writing a regex with the option for . to match \n:
((pid\d)\s*(ConnectionString\d))(?!\1).*\2\s*RequestString\d
and then use \3 to identify the client.
However, there are disproportionately more ConnectionStrings than RequestStrings (perhaps between 1000 and 10000 times more), so my intuition was to first go for the RequestString and then backtrack.
I guess I could play with (?<=...) for lookbehind, but the distances between ConnectionStrings and RequestStrings are essentially arbitrary. Will that work well?
1 Answer
Something along these lines:
#!/bin/bash
# Find and number all RequestStrings, then loop through them
grep -n RequestString file | while IFS=":" read -r n string; do
    echo "$n,$string"   # Debug
    # Reverse the lines up to the match and print the nearest preceding Connection
    head -n "$n" file | tail -r | grep -m1 Connection
done
Output
4,RequestString 1
6189:Connection
7,RequestString 2
7230:Connection
9,RequestString 3
8280:Connection
with this input file
6189:Connection
RequestString 1
7229:Connection
7230:Connection
RequestString 2
8280:Connection
RequestString 3
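The loop above takes the nearest preceding Connection line regardless of which process wrote it. For the log layout in the question, a PID-aware variant might look like the sketch below. It assumes whitespace-separated "date pid message" lines with a single-token date, and `find_clients` is a hypothetical name. Rather than reversing the file, it scans forward with awk and keeps the last hit, which gives the same line as a backward search.

```shell
#!/bin/bash
# find_clients LOGFILE (hypothetical helper): for each RequestString line,
# print "request_line connection_line pid connection" for the nearest
# earlier ConnectionString line written by the same PID.
find_clients() {
  local log=$1 n rest pid
  grep -n 'RequestString' "$log" | while IFS=: read -r n rest; do
    # Assumed layout of the matched line: "date pid message".
    read -r _ pid _ <<<"$rest"
    # Scan only the lines before the request; the last hit is the nearest one.
    awk -v pid="$pid" -v n="$n" '
      NR >= n { exit }
      $2 == pid && $3 ~ /^ConnectionString/ { hit = $3; at = NR }
      END { if (at) print n, at, pid, hit }
    ' "$log"
  done
}

# Demo on a shortened version of the question's sample.
sample=$(mktemp)
printf '%s\n' \
  'date pid1 ConnectionString1' \
  'date pid4 ConnectionString4' \
  'date pid4 RequestString4' \
  'date pid1 ConnectionString1' \
  'date pid1 RequestString1' \
  'date pid5 RequestString5' > "$sample"
find_clients "$sample"
rm -f "$sample"
# prints:
# 3 2 pid4 ConnectionString4
# 5 4 pid1 ConnectionString1
```

Like the original loop, this rereads the file once per RequestString, so it suits logs where RequestStrings are rare.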
Note: I used tail -r because OSX lacks tac, which I would have preferred.
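To run the same script on systems with or without tac, a small wrapper can pick whichever tool is installed. This is a sketch: `reverse_lines` is a hypothetical helper name, and it assumes that a system lacking tac has a tail that supports -r (as on OSX/BSD).

```shell
#!/bin/bash
# reverse_lines (hypothetical helper): print stdin with its lines in reverse
# order, preferring tac and falling back to BSD-style "tail -r".
reverse_lines() {
  if command -v tac >/dev/null 2>&1; then
    tac
  else
    tail -r
  fi
}

# Example: nearest Connection line at or before line 5 of a small input.
printf '%s\n' '6189:Connection' 'RequestString 1' '7229:Connection' \
  '7230:Connection' 'RequestString 2' |
  head -n 5 | reverse_lines | grep -m1 Connection
# prints: 7230:Connection
```

In the loop from the answer, `tail -r` could then be replaced by `reverse_lines`.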