Basic File Reading and Writing with Simple Stop-Word Removal

Date: 2021-04-08 12:28:39
  1. Background
  2. Features and Implementation
  3. Summary

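    1. Background
    Background: I have recently been using C++ and MFC for string processing and ran into several drawbacks: ① type conversions between MFC code and standard C++; ② MFC's lack of cross-platform support; ③ the complexity of string handling in C/C++ itself.
    With these points in mind, I installed Python and started a new journey in string processing.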
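    2. Features and Implementation
    Overview of the modules:
    I. Convert a file to lowercase and remove specific stop words.
    II. Traverse all files in a given folder (excluding subfolders).
    III. Create a new file holding the processed result.
    Note: the documents to be processed are plain-text (.txt) files.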
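The first step in the overview (lowercasing plus stop-word removal) can be tried on a single string before any file is involved. The following is a minimal sketch; the helper name `filter_line` and the sample sentence are hypothetical and do not come from the author's scripts.

```python
def filter_line(line, stop_words):
    """Lowercase a line and drop every whitespace-separated token found in stop_words."""
    tokens = line.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)


# Hypothetical sample input, not taken from the author's data
stop_words = {"2009", "is", "we", "are", "i", "i'll"}
result = filter_line("We ARE in Beijing and I'll stay", stop_words)
print(result)  # in beijing and stay
```

Because tokens are split on whitespace only, a word with attached punctuation such as "is," is not the same token as "is" and would survive the filter. Stripping punctuation first would be a separate step.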

I. Convert a file to lowercase and remove specific stop words.

#!/usr/bin/python
# -*- coding: utf-8 -*-
# File: pythonreadfile.py
# Written by youfuwen
# Date: 2016.03.25
# E-mail: yfwen@bjtu.edu.cn
def writeFillForUsing(path, path1, stopList=None):
    """Lowercase each line of path, drop stop words, and write the result to path1."""
    if stopList is None:
        stopList = set()  # no stop words given: only lowercase the text
    # "with" closes both files even if an error occurs
    with open(path, 'r') as f, open(path1, 'w') as f1:
        for eachline in f:
            # 1. change uppercase to lowercase
            eachline = eachline.lower()
            # 2. keep only the words that are not in the stop list
            myList = [word for word in eachline.split() if word not in stopList]
            # 3. write the remaining words back as one line
            f1.write(" ".join(myList))
            f1.write('\n')
    return 1

if __name__ == '__main__':
    path = 'C:/Users/wen/Desktop/fortesting/amanda_all.txt'  # source file path
    path1 = 'C:/Users/wen/Desktop/fortesting/amanda_all_copy.txt'  # output file path
    stopList = {'2009', 'is', 'we', 'are', 'i', 'i\'ll'}  # stop-word vocabulary
    a = writeFillForUsing(path, path1, stopList)
    if a == 1:
        print('write ok')
    else:
        print('No')
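To try the same idea without the hard-coded Windows paths, the sketch below runs a line-by-line clean-up on a throwaway file created with `tempfile`. The function name `clean_file`, the file names, and the sample text are assumptions made for illustration only.

```python
import os
import tempfile


def clean_file(src, dst, stop_words):
    """Write a lowercased, stop-word-free copy of src to dst, line by line."""
    with open(src, "r") as fin, open(dst, "w") as fout:
        for line in fin:
            kept = [w for w in line.lower().split() if w not in stop_words]
            fout.write(" ".join(kept) + "\n")


# Build a small throwaway input file (hypothetical content)
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "sample.txt")
dst = os.path.join(workdir, "sample_copy.txt")
with open(src, "w") as f:
    f.write("We ARE here\nIt IS 2009 today\n")

clean_file(src, dst, {"2009", "is", "we", "are", "i", "i'll"})
with open(dst) as f:
    output = f.read()
print(output)
```

Opening both files in one `with` statement means they are closed even if an exception interrupts the loop, so no explicit `close()` call is needed.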

II. Traverse all files in a given folder (excluding subfolders)
The code is as follows:

#!/usr/bin/python
# -*- coding: cp936 -*-
# revised by youfuwen
# copied from bokeyuan

import os
from pythonreadfile import writeFillForUsing

def printPath(level, path):
    '''
    Return the full paths of all files directly under a directory.
    '''
    # all folders; the first field is the level of this directory
    dirList = []
    # all files
    fileList = []
    # os.listdir returns a list of the names of the entries in the directory
    files = os.listdir(path)
    # add the directory level first
    dirList.append(str(level))
    for f in files:
        if os.path.isdir(path + '/' + f):
            # skip hidden folders, because there are too many of them
            if f[0] == '.':
                pass
            else:
                # add the non-hidden folder
                dirList.append(f)
        if os.path.isfile(path + '/' + f):
            # add the file
            fileList.append(f)
    # build the full path of every file
    fileListTemp = []
    for fl in fileList:
        fileListTemp.append(path + '/' + str(fl))
    return fileListTemp

if __name__ == '__main__':
    fullPath = printPath(1, 'C:/Users/wen/Desktop/fortesting')
    # 1. for testing function printPath
    # print(fullPath)
    # print(len(fullPath))
    # 2. process every file with our own function
    stopList = {'2009', 'is', 'we', 'are', 'i', 'i\'ll'}  # stop-word vocabulary
    for listEle in fullPath:
        writeFillForUsing(listEle, str(listEle + 'processing.txt'), stopList)
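The non-recursive listing can also be checked on its own against a temporary directory. This is a minimal sketch, assuming a hypothetical helper `list_files` that joins paths with `os.path.join` instead of string concatenation; the file and folder names are made up for the demonstration.

```python
import os
import tempfile


def list_files(folder):
    """Return full paths of regular files directly inside folder (no recursion)."""
    paths = []
    for name in sorted(os.listdir(folder)):
        full = os.path.join(folder, name)
        if os.path.isfile(full):
            paths.append(full)
    return paths


# Throwaway directory with two files and one subfolder (hypothetical names)
root = tempfile.mkdtemp()
for name in ("a.txt", "b.txt"):
    with open(os.path.join(root, name), "w") as f:
        f.write("Hello\n")
os.mkdir(os.path.join(root, "sub"))
with open(os.path.join(root, "sub", "c.txt"), "w") as f:
    f.write("nested\n")

found = [os.path.basename(p) for p in list_files(root)]
print(found)  # ['a.txt', 'b.txt']
```

One thing to keep in mind with the script above: its output files are written into the folder being scanned, so a second run would also pick up the earlier "...processing.txt" results. Writing the output to a separate folder is one way to avoid that.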

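3. Summary
I. Python has an advantage over most general-purpose languages when it comes to string processing. In work and in life we need to think more and ask more: "Without accumulating small steps, one cannot travel a thousand li; without gathering small streams, no river or sea can form."
II. Let us work hard together; tomorrow will be better!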