I'm writing a Python script to replace strings in each text file in a directory that has a specific extension (.seq). Only strings on the second line of each file should be replaced, and the output goes to a new subdirectory (call it clean) with the same file names as the original files, but with a *.clean suffix. Each output file contains exactly the same text as the original, but with the strings replaced. I need to replace all these strings: 'K', 'Y', 'W', 'M', 'R', 'S' with 'N'.
This is what I've come up with after googling. It's very messy (2nd week of programming), and it stops at copying the files into the clean directory without replacing anything. I'd really appreciate any help.
Thanks in advance!
    import os, shutil

    os.mkdir('clean')
    for file in os.listdir(os.getcwd()):
        if file.find('.seq') != -1:
            shutil.copy(file, 'clean')
    os.chdir('clean')
    for subdir, dirs, files in os.walk(os.getcwd()):
        for file in files:
            f = open(file, 'r')
            for line in f.read():
                if line.__contains__('>'): #indicator for the first line. the first line always starts with '>'. It's a FASTA file, if you've worked with dna/protein before.
                    pass
                else:
                    line.replace('M', 'N')
                    line.replace('K', 'N')
                    line.replace('Y', 'N')
                    line.replace('W', 'N')
                    line.replace('R', 'N')
                    line.replace('S', 'N')
5 Solutions
#1
6
Some notes:
- string.replace and re.sub are not in-place, so you should assign the return value back to your variable.
- glob.glob is better for finding files in a directory matching a defined pattern...
- Maybe you should check whether the directory already exists before creating it (I just assumed this; it might not be your desired behavior).
- The with statement takes care of closing the file in a safe way. If you don't want to use it, you have to use try/finally.
- In your example you were forgetting to add the suffix *.clean ;)
- You were not actually writing the files. You could do it like I did in my example, or use the fileinput module (which until today I did not know).
Here's my example:

    import re
    import os
    import glob

    source_dir = os.getcwd()
    target_dir = "clean"
    source_files = [fname for fname in glob.glob(os.path.join(source_dir, "*.seq"))]

    # check if target directory exists... if not, create it.
    if not os.path.exists(target_dir):
        os.makedirs(target_dir)

    for source_file in source_files:
        target_file = os.path.join(target_dir, os.path.basename(source_file) + ".clean")
        with open(source_file, 'r') as sfile:
            with open(target_file, 'w') as tfile:
                lines = sfile.readlines()
                # do the replacement in the second line.
                # (remember that lists are zero indexed)
                lines[1] = re.sub("K|Y|W|M|R|S", 'N', lines[1])
                tfile.writelines(lines)

    print("DONE")
Hope it helps.
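For the try/finally alternative to the with statement mentioned in the notes, here is a minimal sketch. The helper names read_lines and write_lines are made up for illustration and are not part of the answer's code.

```python
def read_lines(path):
    """Return all lines of a text file, closing it even if reading fails."""
    f = open(path, "r")
    try:
        return f.readlines()
    finally:
        f.close()  # runs on success and when an exception is raised


def write_lines(path, lines):
    """Write lines to a text file, closing it even if writing fails."""
    f = open(path, "w")
    try:
        f.writelines(lines)
    finally:
        f.close()
```

The with form does the same cleanup with less code, which is why it is usually preferred.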
#2
5
You should replace line.replace('M', 'N') with line = line.replace('M', 'N'). replace returns a copy of the original string with the relevant substrings replaced.
An even better way (IMO) is to use re.

    import re

    line = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    line = re.sub("K|Y|W|M|R|S", 'N', line)
    print(line)
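Since every pattern here is a single character, the same substitution can also be written with a regex character class or with str.translate. This is only a sketch assuming Python 3 (str.maketrans is used); the names AMBIGUITY_CODES, mask_with_regex and mask_with_translate are invented for this example.

```python
import re

AMBIGUITY_CODES = "KYWMRS"  # the six codes listed in the question


def mask_with_regex(seq):
    # A character class matches any single one of the listed characters.
    return re.sub("[" + AMBIGUITY_CODES + "]", "N", seq)


# Map each code to "N" in a translation table built once.
MASK_TABLE = str.maketrans(AMBIGUITY_CODES, "N" * len(AMBIGUITY_CODES))


def mask_with_translate(seq):
    # translate() applies the whole mapping in a single pass over the string.
    return seq.translate(MASK_TABLE)


line = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print(mask_with_regex(line))      # ABCDEFGHIJNLNNOPQNNTUVNXNZ
print(mask_with_translate(line))  # ABCDEFGHIJNLNNOPQNNTUVNXNZ
```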
#3
4
Here are some general hints:

- Don't use find for checking the file extension (e.g., this would also match "file1.seqdata.xls"). At least use file.endswith('.seq'), or, better yet, os.path.splitext(file)[1].

- Actually, don't do that at all. This is what you want:

    import glob
    seq_files = glob.glob("*.seq")

- Don't copy the files; it's much easier to use just one loop:

    for filename in seq_files:
        in_file = open(filename)
        out_file = open(os.path.join("clean", filename), "w")
        # now read lines from in_file and write lines to out_file

- Don't use line.__contains__('>'). What you mean is

    if '>' in line:

  (which will call __contains__ internally). But actually, you want to know whether the line starts with a ">", not whether there's one somewhere within the line. So the better way would be this:

    if line.startswith(">"):

  I'm not familiar with your file type; if the ">" check really is just for determining the first line, there are better ways to do that.

- You don't need the if block (you just pass). It's cleaner to write

    if not something:
        do_things()
    other_stuff()

  instead of

    if something:
        pass
    else:
        do_things()
    other_stuff()

Have fun learning Python!
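Putting the hints above together, the single-loop approach might look roughly like the sketch below. It assumes Python 3 and that the line to change is always the second line of each file; the function names clean_second_line and clean_directory and their parameters are made up for illustration.

```python
import glob
import os


def clean_second_line(in_path, out_path, codes="KYWMRS"):
    """Copy in_path to out_path, replacing `codes` with 'N' on line 2 only."""
    with open(in_path) as in_file, open(out_path, "w") as out_file:
        for index, line in enumerate(in_file):
            if index == 1:  # second line (enumerate counts from zero)
                for code in codes:
                    line = line.replace(code, "N")
            out_file.write(line)  # every line is written, changed or not


def clean_directory(source_dir=".", target_name="clean"):
    """Write <name>.seq.clean into source_dir/target_name for every *.seq file."""
    target_dir = os.path.join(source_dir, target_name)
    os.makedirs(target_dir, exist_ok=True)  # no error if it already exists
    for in_path in glob.glob(os.path.join(source_dir, "*.seq")):
        out_name = os.path.basename(in_path) + ".clean"
        clean_second_line(in_path, os.path.join(target_dir, out_name))
```

Calling clean_directory() with no arguments processes the current working directory. Counting lines with enumerate avoids relying on the ">" check to find the header.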
#4
3
You need to assign the result of the replacement back to the "line" variable:

    line = line.replace('M', 'N')

You can also use the fileinput module for in-place editing:

    import os, shutil, fileinput

    if not os.path.exists('clean'):
        os.mkdir('clean')
    for file in os.listdir("."):
        if file.endswith(".seq"):
            shutil.copy(file, 'clean')
    os.chdir('clean')
    for subdir, dirs, files in os.walk("."):
        for file in files:
            f = fileinput.FileInput(file, inplace=0)
            for n, line in enumerate(f):
                if line.lstrip().startswith('>'):
                    pass
                elif n == 1:  # replace 2nd line
                    for repl in ["M", "K", "Y", "W", "R", "S"]:
                        line = line.replace(repl, 'N')
                # with inplace=1, print writes to the file instead of the console
                print(line.rstrip())
            f.close()

Change inplace=0 to inplace=1 to edit your files in place.
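As a condensed illustration of the in-place mode described above, here is a sketch assuming Python 3, where fileinput.input can be used as a context manager. The function name mask_second_line_in_place is hypothetical.

```python
import fileinput


def mask_second_line_in_place(path, codes="KYWMRS"):
    """Rewrite the file at `path`, replacing `codes` with 'N' on its second line."""
    # With inplace=True, standard output is redirected into the file,
    # so each print() call writes one line back to it.
    with fileinput.input(files=[path], inplace=True) as f:
        for line in f:
            if f.filelineno() == 2:  # filelineno() counts from 1
                for code in codes:
                    line = line.replace(code, "N")
            print(line, end="")  # the line already ends with its newline
```

Because this overwrites the file, it fits the copy-then-edit workflow in the answer above, where the files being edited are the copies inside clean.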
#5
0
line.replace is not a mutator; it leaves the original string unchanged and returns a new string with the replacements made. You'll need to change your code to line = line.replace('R', 'N'), etc.
I think you also want to add a break statement at the end of your else clause, so that you don't iterate over the entire file, but stop after having processed line 2.
Lastly, you'll need to actually write out a file containing your changes. So far, you are just reading the file and updating the line in your program variable 'line'. You need to create an output file as well, to which you will write the modified lines.
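The point about replace not being a mutator can be checked directly. This is a small sketch with a made-up sequence string:

```python
line = "ACGTRYKM"

line.replace("R", "N")         # the returned string is discarded
unchanged = line               # still "ACGTRYKM"

masked = line
for code in "KYWMRS":
    masked = masked.replace(code, "N")  # rebind the name to the new string

print(unchanged)  # ACGTRYKM
print(masked)     # ACGTNNNN
```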