Parsing a file: using regular expressions in Python

Date: 2023-01-29 15:47:24

I want to export a list of URLs from one txt file to a new txt file. The first txt file looks like this:

http://pastebin.com/raw/10hvUbTi Emails: 631 Keywords: 0.0

http://pastebin.com/raw/5f0bnCq9 Emails: 61 Keywords: 0.0

I am trying to create a list that will look like this:

URL

URL

I am not getting any output in PyCharm.

Can someone help please?

import re
import urllib2
filename = 'C:\\file.txt'
pattern = ('^\S*')
with open(filename) as f:
    for line in f:
        if pattern in line:
            print line

2 solutions

#1


1  

You could go for:

import re

rx = re.compile(r'^(?P<email>[^|\n]+)', re.MULTILINE)
with open("emails.txt") as f:
    raw_data = f.read()
    emails = [match.group('email') for match in rx.finditer(raw_data)]
    print emails

The file name emails.txt needs to be adjusted to point at your own input file.
See a demo on regex101.com.
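
With the sample lines exactly as shown in the question (no `|` separators), the pattern above would capture each whole line rather than just the URL. A minimal variant, assuming the URL is always the first whitespace-delimited token on a line, could look like the sketch below. It is written for Python 3, and the group name `url` and the inline `raw_data` string are illustrative stand-ins for the real file contents.

```python
import re

# Assumption: the URL is the first run of non-whitespace characters on a line.
# re.MULTILINE makes ^ match at the start of every line, not only the string.
rx = re.compile(r'^(?P<url>\S+)', re.MULTILINE)

# Stand-in for f.read(); the lines are copied from the question.
raw_data = (
    "http://pastebin.com/raw/10hvUbTi Emails: 631 Keywords: 0.0\n"
    "\n"
    "http://pastebin.com/raw/5f0bnCq9 Emails: 61 Keywords: 0.0\n"
)

# Blank lines yield no match because \S+ needs at least one character.
urls = [match.group('url') for match in rx.finditer(raw_data)]
print(urls)
```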

#2


0  

You did not use a regular expression at all. You merely tested whether the pattern string occurs literally in the line. To use a regex, compile the pattern first:

pattern = re.compile(r'^\S*')

Notice the r before the pattern string: it marks a raw string literal, in which backslashes are not treated as escape characters, and it is very important when writing regexes.
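
To illustrate why the raw-string prefix matters, here is a small Python 3 sketch using `\b`, which means "backspace" in an ordinary string literal but "word boundary" in a regex. The sample text and variable names are made up for the illustration.

```python
import re

plain = '\b'    # ordinary literal: a single backspace character
raw = r'\b'     # raw literal: a backslash followed by the letter b

print(len(plain), len(raw))

text = 'a cat sat'

# The raw pattern reaches the regex engine as \bcat\b (word boundaries).
with_raw = re.search(r'\bcat\b', text)

# Without the r prefix the pattern contains literal backspace characters,
# which do not occur in the text, so nothing is found.
without_raw = re.search('\bcat\b', text)

print(with_raw is not None, without_raw is not None)
```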

To search for a pattern in a particular line, use

pattern.search(line)

It will return a MatchObject if a match is found, or None if nothing is found. More reference material on Python regular expressions can be found in the documentation.
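
Putting these pieces together, a line-by-line version might look like the following sketch. It is written for Python 3 and swaps `\S*` for `\S+` so that blank lines produce no match instead of an empty one. The helper names `extract_urls` and `export_urls` are hypothetical, as are any file paths passed to them.

```python
import re

# Same idea as the pattern above, but + instead of * to skip blank lines.
pattern = re.compile(r'^\S+')

def extract_urls(lines):
    """Return the first whitespace-delimited token of each matching line."""
    urls = []
    for line in lines:
        match = pattern.search(line)   # match object, or None
        if match:
            urls.append(match.group(0))
    return urls

def export_urls(in_path, out_path):
    """Write one extracted URL per line to out_path (paths are up to the caller)."""
    with open(in_path) as src, open(out_path, 'w') as dst:
        for url in extract_urls(src):
            dst.write(url + '\n')

# Lines copied from the question, plus a blank line in between.
sample = [
    "http://pastebin.com/raw/10hvUbTi Emails: 631 Keywords: 0.0\n",
    "\n",
    "http://pastebin.com/raw/5f0bnCq9 Emails: 61 Keywords: 0.0\n",
]
print(extract_urls(sample))
```

Calling `export_urls('C:\\file.txt', 'urls.txt')` would then write the extracted list to a new file; the output name `urls.txt` is only an example.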

