Read multiple .xlsx files from a directory into separate Pandas DataFrames based on file name

Time: 2021-07-09 04:36:24

I want to load multiple xlsx files with varying structures from a directory and assign each one its own data frame based on the file name. I have 30+ files with differing structures, but for brevity please consider the following:

3 Excel files [wild_animals.xlsx, farm_animals.xlsx, domestic_animals.xlsx]

I want to assign each file its own data frame: if the file name contains 'wild' it is assigned to wild_df, if 'farm' then farm_df, and if 'domestic' then dom_df. This is just the first step in a process, as the actual files contain a lot of 'noise' that needs to be cleaned depending on file type. The file names will also change on a weekly basis, with only a few key markers staying the same.

My assumption is that the glob module is the best way to start, but when it comes to taking specific parts of the file name and using them to assign each file to a specific df I become a bit lost, so any help is appreciated.

I asked a similar question a while back but it was part of a wider question most of which I have now solved.

3 Answers

#1


2  

I would parse them into a dictionary of DataFrames:

import os
import glob
import pandas as pd

files = glob.glob('/path/to/*.xlsx')
dfs = {}

for f in files:
    dfs[os.path.splitext(os.path.basename(f))[0]] = pd.read_excel(f)

Then you can access them as normal dictionary elements:

dfs['wild_animals']
dfs['domestic_animals']

etc.
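
Since the file names change weekly and only a few markers stay the same, the dictionary keys could be derived from those markers instead of the full file name. The following is only a sketch under stated assumptions: `KEYWORDS`, `label_for` and `load_frames` are hypothetical names, and the matching is a simple case-insensitive substring test on the file stem.

```python
from pathlib import Path

# Hypothetical mapping from a stable marker in the file name to a label
KEYWORDS = {"wild": "wild_df", "farm": "farm_df", "domestic": "dom_df"}


def label_for(filename, keywords=KEYWORDS):
    """Return the label of the first marker found in the file stem, else None."""
    stem = Path(filename).stem.lower()
    for keyword, label in keywords.items():
        if keyword in stem:
            return label
    return None


def load_frames(directory, keywords=KEYWORDS):
    """Read every matching .xlsx file in `directory` into a dict of DataFrames."""
    # Imported here so that label_for can be used without pandas installed
    import pandas as pd

    frames = {}
    for path in sorted(Path(directory).glob("*.xlsx")):
        label = label_for(path.name, keywords)
        if label is not None:
            # If several files share a marker, the last one read wins
            frames[label] = pd.read_excel(path)
    return frames
```

With this, `load_frames('/path/to')['wild_df']` would hold the data from whichever file currently contains 'wild' in its name, assuming pandas and an Excel engine such as openpyxl are installed.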

#2


2  

You need to get all the xlsx files; then, using a dict comprehension, you can access any element:

import pandas as pd
import os
import glob

path = 'Your_path'
extension = 'xlsx'
os.chdir(path)
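# glob.glob already returns a list, so no extra comprehension is needed
result = glob.glob('*.{}'.format(extension))

# Map each file name to a pd.ExcelFile (sheets are parsed only on request)
workbooks = {elm: pd.ExcelFile(elm) for elm in result}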
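
One possible variation avoids `os.chdir`, which changes the working directory for the whole process. This is a minimal sketch, not the answer's own code: `index_workbooks` and its `opener` parameter are hypothetical names, and the keys are file stems rather than full file names.

```python
from pathlib import Path


def index_workbooks(directory, opener):
    """Map the stem of each .xlsx file in `directory` to opener(path)."""
    directory = Path(directory)
    # Path.glob yields paths that include the directory, so os.chdir is not needed
    return {p.stem: opener(p) for p in sorted(directory.glob("*.xlsx"))}


# Usage (assumes pandas and an Excel engine such as openpyxl are installed):
#   import pandas as pd
#   workbooks = index_workbooks("Your_path", pd.ExcelFile)
#   workbooks["wild_animals"].sheet_names   # list the sheets
#   workbooks["wild_animals"].parse(0)      # read the first sheet as a DataFrame
```

Passing the opener as an argument keeps the file discovery separate from the choice between `pd.ExcelFile` and `pd.read_excel`.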

#3


1  

For completeness, I wanted to show the solution I ended up using. It is very close to Khelili's suggestion, with a few tweaks to suit my particular code, including not creating a DataFrame at this stage:

import os
import pandas as pd
import openpyxl as excel
import glob



#setting up path

path = 'data_inputs'
extension = 'xlsx'
os.chdir(path)
files = glob.glob('*.{}'.format(extension))

# Grouping files - brings multiple files of the same type together in a list

wild_groups = [s for s in files if "wild" in s]
domestic_groups = [s for s in files if "domestic" in s]

# Sets up a dictionary associated with the file groupings, to be called in another module
file_names = {"WILD": wild_groups, "DOMESTIC": domestic_groups}
...
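
With 30+ files, the per-marker list comprehensions above could be folded into one helper. This is a sketch only: `group_files` is a hypothetical name, and unlike the code above it matches case-insensitively.

```python
def group_files(files, markers):
    """Return {MARKER: [file names containing that marker]} for each marker."""
    return {
        marker.upper(): [f for f in files if marker.lower() in f.lower()]
        for marker in markers
    }


# Example with made-up file names
example = group_files(
    ["wild_a.xlsx", "domestic_b.xlsx", "Wild_c.xlsx"],
    ["wild", "domestic"],
)
```

Adding a new file type then only requires adding its marker to the list.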
