Classifying filenames (exported to Excel) by name/type

As part of my job, we compile a comprehensive list of all the files a user has on their drive. The users have to decide, per file, whether to archive it or not (indicated by Y or N). As a service to them, we fill this in manually.

We export these files to a long list in Excel, which shows each file by its full path, e.g. X:\4. Economics\10. xxxxxxxx\04. xxxxxxxxx\04. xxxxxxxxxx\filexyz.pdf

I'd argue that we can easily automate this, as standard naming conventions make it easy to decide which files to keep and which to delete. For example, a file with the string "CAB" in the filename should be kept. However, I have no idea how or where to start. Can someone point me in the right direction?

1 solution

#1

I would suggest the following general steps:

Get the raw data

You can read the Excel file into a pandas DataFrame in Python. Ideally, you will end up with a raw DataFrame that looks something like this:

     Filename                           Keep
0    X:\4. Economics ...\filexyz.pdf    0
1    X:\4. Economics ...\fileabc.pdf    1
2    X:\3. Finance   ...\filetef.pdf    1
3    X:\3. Finance   ...\file123.pdf    0
4    G:\2. Philosophy ..\file285.pdf    0
                   ....
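
As a rough sketch of this step, the snippet below loads the sheet with `pandas.read_excel` and turns the Y/N answers from the question into the 0/1 `Keep` column shown above. The column names `Filename` and `Keep`, the helper names, and the shortened demo paths are assumptions for illustration, not something the export is known to contain.

```python
import pandas as pd


def encode_keep(df, keep_col="Keep"):
    """Return a copy of df with Y/N in keep_col mapped to 1/0."""
    out = df.copy()
    # Normalise stray whitespace and case before mapping; unknown values become NaN.
    out[keep_col] = out[keep_col].str.strip().str.upper().map({"Y": 1, "N": 0})
    return out


def load_file_list(path, keep_col="Keep"):
    """Read the exported sheet (hypothetical layout) and encode the labels."""
    # Reading .xlsx files requires an Excel engine such as openpyxl.
    return encode_keep(pd.read_excel(path), keep_col)


# In-memory stand-in for the exported sheet (made-up, shortened paths).
demo = pd.DataFrame(
    {
        "Filename": [
            r"X:\4. Economics\filexyz.pdf",
            r"X:\4. Economics\fileabc.pdf",
            r"X:\3. Finance\filetef.pdf",
        ],
        "Keep": ["N", "Y", " y "],
    }
)
encoded = encode_keep(demo)
print(encoded)
```

Rows that have not been labelled yet would end up as NaN in `Keep`, so they can be separated from the training rows later.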

Preprocess/clean

This part is largely up to you. For example, you could remove all special characters and numbers, which would leave only the letters, as follows:

     Filename                     Keep
0    "X Economics filexyz pdf"    0
1    "X Economics fileabc pdf"    1
2    "X Finance filetef pdf"      1
3    "X Finance file pdf"         0
4    "G Philosophy file pdf"      0
                ....
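
One possible way to do this kind of cleaning is a single regular expression that collapses every run of non-letter characters into a space. This is only a sketch: the function name `clean_path` and the shortened example path are made up, and you may well want to keep digits or other characters if they carry meaning in your naming conventions.

```python
import re


def clean_path(path: str) -> str:
    """Replace every run of non-letter characters with a single space."""
    # [\W\d_]+ matches punctuation, whitespace, digits and underscores,
    # so only letters (including non-ASCII ones) survive.
    return re.sub(r"[\W\d_]+", " ", path).strip()


example = r"X:\4. Economics\filexyz.pdf"  # shortened, hypothetical path
cleaned = clean_path(example)
print(cleaned)  # X Economics filexyz pdf
```

With pandas, this could be applied to a whole column via `df["Filename"].map(clean_path)`.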

Vectorize your strings

For an algorithm to work with your text data, you typically vectorize it. This means you turn it into numbers that the algorithm can process. An easy way to do this is with tf-idf in scikit-learn. After this, your DataFrame might look something like this:

     Filename                               Keep
0    [0.6461,  0.3816 ...  0.01,  0.38]     0
1    [0.,      0.4816 ...  0.25,  0.31]     1
2    [0.61,    0.1663 ...  0.11,  0.35]     1
                       ....
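
A minimal sketch of the tf-idf step with scikit-learn's `TfidfVectorizer`, using three of the cleaned strings from the previous step as input. The default settings are an assumption here; note that they lowercase the text and ignore one-character tokens, so the drive letter does not become a feature.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "X Economics filexyz pdf",
    "X Economics fileabc pdf",
    "X Finance filetef pdf",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per filename

# Each column corresponds to one word of the learned vocabulary.
print(sorted(vectorizer.vocabulary_))
print(X.toarray().round(2))
```

If the drive letter matters for the decision, the tokenization would need to be adjusted, for example through the `token_pattern` parameter.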

Train a classifier

Now that you have nice numbers for the algorithms to work with, you can train a classifier with scikit-learn. Simply search for "scikit learn classification example" and you will find plenty.

Once you have a trained classifier, you can compare its predictions with the true labels on test data that it has not seen before. That way you get a feel for its accuracy.
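
To make the training and evaluation steps concrete, here is a small sketch that chains tf-idf and a classifier in a scikit-learn pipeline and scores it on a held-out split. The choice of `LogisticRegression` is just one option among many, and the data is entirely made up: it encodes the "CAB means keep" rule from the question, with "draft" as a hypothetical marker for files to discard.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy data (fabricated for illustration): cleaned filenames and 0/1 labels.
folders = ["Economics", "Finance", "Philosophy"]
texts, labels = [], []
for i in range(30):
    keep = i % 2
    tag = "CAB" if keep else "draft"
    texts.append(f"X {folders[i % 3]} {tag} report pdf")
    labels.append(keep)

# Hold back 30% of the rows so the model is scored on data it was not fit on.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=0, stratify=labels
)

# The pipeline vectorizes the strings and trains the classifier in one go.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy on held-out data: {accuracy:.2f}")
```

The toy data is highly repetitive, so the score here says nothing about real performance. On the actual export, the rows that have already been filled in manually would take the place of `texts` and `labels`.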

Hopefully that is enough to get you started!
