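CelebA is an open dataset released by the Chinese University of Hong Kong. It contains 202,599 images of 10,177 celebrity identities, all annotated with attributes, which makes it a very convenient dataset for face-related training. (cloud-drive download link)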
The data consists of three folders and one description document.
The img folder contains two archives: img_align_celeba.zip and img_align_celeba_png.7z.
I chose to download img_align_celeba.zip.
Once extracted, it contains 202,599 images.
The Anno folder contains a file named identity_CelebA. An excerpt:
000001.jpg 2880
000002.jpg 2937
000003.jpg 8692
000004.jpg 5805
000005.jpg 9295
000006.jpg 4153
000007.jpg 9040
000008.jpg 6369
000009.jpg 3332
000010.jpg 612
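This file holds the identity labels for the 10,177 celebrities: the number after each image name is the label (identity ID) of that image.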
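To make the file format concrete, here is a minimal sketch that parses lines of this form into a dictionary and counts how many images each identity has. The function name `parse_identity_lines` is hypothetical, and the sample lines are the first three from the excerpt above.

```python
from collections import Counter


def parse_identity_lines(lines):
    """Map each image file name to its integer identity label."""
    labels = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, identity = parts
        labels[name] = int(identity)
    return labels


# Sample lines copied from the excerpt of identity_CelebA
sample = ["000001.jpg 2880", "000002.jpg 2937", "000003.jpg 8692"]
labels = parse_identity_lines(sample)
per_identity = Counter(labels.values())  # identity -> number of images
```

On the real file, the same function could be fed the open file object directly, since iterating over a file yields its lines.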
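Next, we use the extracted images and this identity file to process the dataset.
First, we run face detection with the dlib library, crop each detected face, and save it. The code is as follows: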
import os

import cv2
import dlib

ROOT = '/home/zy/PycharmProjects/CelebA'


def read_txt_file(file):
    # Collect the image file names (first column) from the identity file.
    inde = []
    with open(file, 'r') as f:
        for line in f:
            items = line.split()
            if items:
                inde.append(items[0])
    return inde


def face_path(path):
    # List the image paths, sorted numerically by file name (000001.jpg, ...).
    file_path = os.listdir(path)
    file_path.sort(key=lambda x: int(x[:-4]))
    return [os.path.join(path, files) for files in file_path]


def face_detction():
    inde = set(read_txt_file(ROOT + '/identity_CelebA.txt'))
    file_path = face_path(ROOT + '/img_align_celeba')
    out_dir = ROOT + '/cropdata'
    os.makedirs(out_dir, exist_ok=True)

    # Build the detector once instead of once per image.
    detector = dlib.get_frontal_face_detector()

    for f in file_path:
        name = os.path.basename(f)
        if name not in inde:
            continue  # no identity label for this image
        img = cv2.imread(f, cv2.IMREAD_COLOR)
        if img is None:
            print('could not read', f)
            continue
        # dlib treats arrays as RGB, while OpenCV loads images as BGR.
        img2 = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        dets = detector(img2, 1)
        print("{}: number of faces detected: {}".format(name, len(dets)))
        if len(dets) == 0:
            continue
        # Keep only the largest face so that each image yields one crop.
        face = max(dets, key=lambda d: d.width() * d.height())
        # The box can extend past the image border; clamp it before slicing.
        height, width = img.shape[:2]
        left, top = max(face.left(), 0), max(face.top(), 0)
        right, bottom = min(face.right(), width), min(face.bottom(), height)
        if right <= left or bottom <= top:
            continue
        imgs = img[top:bottom, left:right]
        # Save under the original file name so it still matches its label.
        cv2.imwrite(os.path.join(out_dir, name), imgs)


face_detction()
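One detail worth handling when cropping: the rectangle returned by dlib's detector can extend slightly past the image border, so `left` or `top` may be negative. A negative index in a NumPy slice counts from the opposite end, which produces an empty or wrong crop. Below is a minimal sketch of clamping the box to the image bounds; `clamp_box` is a hypothetical helper, not part of dlib or OpenCV, and the 178x218 size is that of the aligned CelebA images.

```python
def clamp_box(left, top, right, bottom, width, height):
    """Clip a face rectangle to the image bounds.

    Returns (left, top, right, bottom), or None if nothing remains.
    """
    left = max(0, left)
    top = max(0, top)
    right = min(width, right)
    bottom = min(height, bottom)
    if right <= left or bottom <= top:
        return None
    return left, top, right, bottom


# Aligned CelebA images are 178 pixels wide and 218 pixels high.
box = clamp_box(-12, 30, 150, 200, width=178, height=218)
```

The clamped values can then be used directly in a slice such as `img[top:bottom, left:right]`.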
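After face detection you will find that some faces could not be detected. We therefore need to use the identity_CelebA file to build a new file that lists each remaining image path together with its label. The code is as follows: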
import os

img_path = '/home/zy/PycharmProjects/CelebA/cropdata'
text_file = '/home/zy/PycharmProjects/CelebA/identity_CelebA.txt'

file_path = os.listdir(img_path)
file_path.sort(key=lambda x: int(x[:-4]))


def train_path():
    # Read the labels once into a dict (file name -> identity) instead of
    # rescanning the whole identity file for every image.
    labels = {}
    with open(text_file, 'r') as f:
        for line in f:
            items = line.split()
            if len(items) == 2:
                labels[items[0]] = items[1]
    inde = []
    for i in file_path:
        if i in labels:
            inde.append(img_path + '/' + i + ' ' + labels[i])
    return inde


data_set = train_path()
with open('trainggg_text', 'w') as f:
    for entry in data_set:
        # One "path label" pair per line.
        f.write(entry + '\n')
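To see which images the detector dropped, the names in the identity file can be compared against the names in the crop directory. The sketch below works on plain lists of names; `missing_crops` is a hypothetical helper, and in practice `cropped_names` would come from something like `os.listdir` on the cropdata directory.

```python
def missing_crops(label_names, cropped_names):
    """Return labelled file names that have no cropped face, sorted."""
    cropped = set(cropped_names)  # set lookup keeps this linear in size
    return sorted(name for name in label_names if name not in cropped)


# Toy example: the detector found no face in 000002.jpg
all_names = ["000001.jpg", "000002.jpg", "000003.jpg"]
kept = ["000001.jpg", "000003.jpg"]
dropped = missing_crops(all_names, kept)
```

The length of the returned list gives the number of images lost at the detection step.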
If you want to reorganize the dataset so that each folder holds the images of a single person, you can use the following code:
import os
import shutil

paths = []
inde = []
with open('./trainggg_text', 'r') as f:
    for line in f:
        item = line.split()
        if len(item) != 2:
            continue  # skip blank or malformed lines
        paths.append(item[0])
        inde.append(item[1])

# One pass over the list instead of rescanning it for every identity.
count = {}  # identity -> number of images copied so far
for path, element in zip(paths, inde):
    j = int(element)
    l = count.get(j, 0) + 1
    count[j] = l
    # Layout: ./ace/<identity>/0/zy<n>.jpg; folders are created on demand,
    # so identities with no remaining images get no empty folder.
    out_dir = './ace/' + str(j) + '/' + str(0)
    os.makedirs(out_dir, exist_ok=True)
    # Copy the file as-is rather than decoding and re-encoding the JPEG.
    shutil.copyfile(path, out_dir + '/' + 'zy' + str(l) + '.jpg')