My data looks like this:
text textext text a 111.222.222.111(123) -> 22.222.111.111(7895)
txt txt txxt text b 22.111.22.222(8153) -> 153.33.233.111(195)
text text txt txt c 222.30.233.121 -> 44.233.111.111
txt text txt text d 22.111.22.222 -> 153.33.233.111
I want to capture a, b, and c along with the two IPs on that line. I do not want the numbers in parentheses that are attached to some of the IPs.
I want my output to look something like this:
a 111.222.222.111 22.222.111.111
b 22.111.22.222 153.33.233.111
c 222.30.233.121 44.233.111.111
What the code looks like:
import gzip
import re

f = gzip.open(path + Fname, 'rb')
for line in f:
    IP_info = re.findall(r'(a|b|c)\s+([0-9]+(?:\.[0-9]+){3})+[ -> ]+([0-9]+(?:\.[0-9]+){3})', line)
    print IP_info
f.close()
What my output actually looks like:
[('a', '111.222.222.111', '2.222.111.111')]
[('b', '22.111.22.222', '3.33.233.111')]
The two biggest problems I'm having:
1) The second IP in the output is incomplete. Its leading digits have been truncated.
2) I am not capturing information for "c".
1 solution
Here is a regex that you can use:
\b([abcd])\s+([0-9]+(?:\.[0-9]+){3})(?:\(\d+\))? +-> +([0-9]+(?:\.[0-9]+){3})
See regex demo
There are several points of interest here:
- I replaced your `[ -> ]+` with ` +-> +` since you meant to match a sequence of characters, not single characters in any order. Note that the hyphen in `[ -> ]` sits between a space and `>`, so it created a range from space to `>`, and that range includes special symbols, punctuation, and digits too. That is why your IPs were partially "eaten".
- Since there are optional numbers in parentheses after an IP, I added an optional non-capturing group `(?:\(\d+\))?` after the first IP.
- You did not match `d` in the first capturing group. I turned that group into a character class since I see only single letters; if these are "placeholders", please revert to a group with alternatives: `(a|b|c|d)`.
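To make the character class point concrete, here is a small sketch that compares the two separators on their own. The names `broken_sep`, `fixed_sep`, and `sample` are made up for illustration; only the two patterns come from the question and the answer.

```python
import re

# Inside [...], " -> " is read as: the range from ' ' (0x20) to '>' (0x3E),
# plus one more space. It is not the literal arrow.
broken_sep = re.compile(r'[ -> ]+')
# Outside a class, the same characters are matched literally and in order.
fixed_sep = re.compile(r' +-> +')

sample = " -> 22.222"

# The class swallows the arrow AND the digits and dots that follow it.
print(repr(broken_sep.match(sample).group()))
# The literal version stops right after the arrow.
print(repr(fixed_sep.match(sample).group()))

# List every ASCII character the class accepts.
accepted = ''.join(chr(c) for c in range(128) if broken_sep.fullmatch(chr(c)))
print(accepted)
```

Because the class also accepts digits and dots, `[ -> ]+` first grabs the whole second IP and then backtracks only as far as needed for the last group to match, which is why the captured IP loses its leading digits.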
See Python demo:
import re
p = re.compile(r'\b([abcd])\s+([0-9]+(?:\.[0-9]+){3})(?:\(\d+\))? +-> +([0-9]+(?:\.[0-9]+){3})')
test_str = "text textext text a 111.222.222.111(123) -> 22.222.111.111(7895)\ntxt txt txxt text b 22.111.22.222(8153) -> 153.33.233.111(195)\ntext text txt txt c 222.30.233.121 -> 44.233.111.111\ntxt text txt text d 22.111.22.222 -> 153.33.233.111"
for x in test_str.split("\n"):
    print(re.findall(p, x))
Output:
[('a', '111.222.222.111', '22.222.111.111')]
[('b', '22.111.22.222', '153.33.233.111')]
[('c', '222.30.233.121', '44.233.111.111')]
[('d', '22.111.22.222', '153.33.233.111')]
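If you want the space-separated output from the question rather than lists of tuples, something like the following might work. This is only a sketch under a few assumptions: it targets Python 3, the helper `extract` and the file name `sample.log.gz` are hypothetical, and the gzip file is written to a temporary directory just so the example runs on its own.

```python
import gzip
import os
import re
import tempfile

PATTERN = re.compile(
    r'\b([abcd])\s+([0-9]+(?:\.[0-9]+){3})(?:\(\d+\))? +-> +([0-9]+(?:\.[0-9]+){3})'
)

def extract(lines):
    """Yield (label, first_ip, second_ip) for every line that matches."""
    for line in lines:
        m = PATTERN.search(line)
        if m:
            yield m.groups()

# Build a tiny gzip file to read back (stand-in for the real log file).
sample = (
    "text textext text a 111.222.222.111(123) -> 22.222.111.111(7895)\n"
    "text text txt txt c 222.30.233.121 -> 44.233.111.111\n"
)
path = os.path.join(tempfile.mkdtemp(), "sample.log.gz")
with gzip.open(path, "wt") as f:
    f.write(sample)

# "rt" opens the file in text mode, so each line is a str, not bytes.
# The with-block closes the file even if an error occurs.
with gzip.open(path, "rt") as f:
    rows = list(extract(f))

for row in rows:
    print(" ".join(row))
```

Using `search` instead of `findall` assumes there is at most one label and IP pair per line, as in the sample data.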