My boss has a habit of running queries against our databases that return tens of thousands of rows and saving them into Excel files. I, being the intern, constantly have to write scripts that work with the information from these files. So far I've tried VBScript and PowerShell for my scripting needs. Both can take several minutes to perform even the simplest of tasks, which means the finished script would take most of an 8-hour day to run.
My workaround right now is simply to write a PowerShell script that removes all of the commas and newline characters from an .xlsx file, saves the .xlsx files as .csv, lets a Java program handle the data gathering and output, and then has the script clean up the .csv files when finished. This runs in a matter of seconds for my current project, but I can't help wondering whether there's a more elegant alternative for my next one. Any suggestions?
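For context, a minimal sketch of the Java side of that workaround, assuming the pre-cleaned file is named report.csv (the file name and the plain split(",") are only illustrative; the split is safe only because the commas inside cells were already stripped in the PowerShell step):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class CsvGather {
    public static void main(String[] args) throws IOException {
        // report.csv is the file produced by the PowerShell pre-processing step.
        List<String> lines = Files.readAllLines(Paths.get("report.csv"));
        for (String line : lines) {
            // Splitting on "," works only because embedded commas were removed earlier.
            String[] fields = line.split(",", -1);
            // ...gather/aggregate the fields here...
            System.out.println(fields.length + " fields");
        }
    }
}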
10 Answers
#1
5
I kept getting all kinds of weird errors when working with .xlsx files.
Here's a simple example of using Apache POI to traverse an .xlsx file. See also Upgrading to POI 3.5, including converting existing HSSF Usermodel code to SS Usermodel (for XSSF and HSSF).
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DateUtil;
import org.apache.poi.ss.usermodel.FormulaEvaluator;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class XlsxReader {

    public static void main(String[] args) throws IOException {
        InputStream myxls = new FileInputStream("test.xlsx");
        Workbook book = new XSSFWorkbook(myxls);
        FormulaEvaluator eval = book.getCreationHelper().createFormulaEvaluator();
        // Walk every cell of the first sheet and print it, separated by "; ".
        Sheet sheet = book.getSheetAt(0);
        for (Row row : sheet) {
            for (Cell cell : row) {
                printCell(cell, eval);
                System.out.print("; ");
            }
            System.out.println();
        }
        myxls.close();
    }

    // Print a cell according to its type (POI 3.x CELL_TYPE_* constants).
    private static void printCell(Cell cell, FormulaEvaluator eval) {
        switch (cell.getCellType()) {
            case Cell.CELL_TYPE_BLANK:
                System.out.print("EMPTY");
                break;
            case Cell.CELL_TYPE_STRING:
                System.out.print(cell.getStringCellValue());
                break;
            case Cell.CELL_TYPE_NUMERIC:
                if (DateUtil.isCellDateFormatted(cell)) {
                    System.out.print(cell.getDateCellValue());
                } else {
                    System.out.print(cell.getNumericCellValue());
                }
                break;
            case Cell.CELL_TYPE_BOOLEAN:
                System.out.print(cell.getBooleanCellValue());
                break;
            case Cell.CELL_TYPE_FORMULA:
                System.out.print(cell.getCellFormula());
                break;
            default:
                System.out.print("DEFAULT");
        }
    }
}
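As a side note, the Cell.CELL_TYPE_* int constants above belong to the older POI 3.x API; in POI 4.x and later, getCellType() returns the CellType enum instead. A roughly equivalent printCell sketch under that assumption (the unused FormulaEvaluator parameter is dropped):

// Assumes POI 4.x+: getCellType() returns org.apache.poi.ss.usermodel.CellType.
private static void printCell(Cell cell) {
    switch (cell.getCellType()) {
        case BLANK:   System.out.print("EMPTY"); break;
        case STRING:  System.out.print(cell.getStringCellValue()); break;
        case NUMERIC:
            if (DateUtil.isCellDateFormatted(cell)) {
                System.out.print(cell.getDateCellValue());
            } else {
                System.out.print(cell.getNumericCellValue());
            }
            break;
        case BOOLEAN: System.out.print(cell.getBooleanCellValue()); break;
        case FORMULA: System.out.print(cell.getCellFormula()); break;
        default:      System.out.print("DEFAULT");
    }
}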
#2
3
Your goal is to do "data transformation" on your Excel files.
To solve this, I would use a dedicated ETL tool (Extract Transform Load), such as Talend Open Studio.
You just have to put together an "Excel Input" component, a "data transform" component, and a "CSV Output" component. Talend ETL will convert this functional description of your problem into Java code. Finally, you just have to execute this program...
#3
2
I personally would use Python for this. I have found that it runs fast enough to not be a noticeable problem.
If you don't want to worry about a new language, why not just use Java for the entire thing? Removing commas and newlines is pretty trivial in Java and it would save you a step.
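For what it's worth, a minimal sketch of that cleanup step in plain Java (replacing the characters with a space rather than deleting them outright is just one possible choice):

// Strip the characters that would break a naive comma-separated format.
// Example: sanitize("a,b\nc") returns "a b c".
static String sanitize(String cellText) {
    return cellText.replace(",", " ")
                   .replace("\r", " ")
                   .replace("\n", " ");
}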
#4
2
You should always think about the future of your code...
Who will maintain your script in the future? Does your company have any other developers that are familiar with PowerShell/VBScript?
I would have to say that you should stick to one language that fits your (and your company's) needs. As Nathan suggested, Python would be a great choice for creating fast scripts.
And one more thing - if you can control the SQL statements your boss runs, you can have him produce output that will ease your parsers' development and make them much simpler.
Good luck!
Tal.
#5
2
In addition to trashgod's answer, for large files I'd suggest POI SXSSF (since POI 3.8 beta3, http://poi.apache.org/spreadsheet/). With SXSSF you can handle large files as streams, which helps avoid out-of-memory errors.
A link to the SXSSF details: http://poi.apache.org/spreadsheet/how-to.html#sxssf
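For what it's worth, a minimal SXSSF sketch of the writing side, which keeps only a small window of rows in memory and flushes the rest to temporary files (the row/column counts and the file name are arbitrary):

import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class SxssfWriteDemo {
    public static void main(String[] args) throws IOException {
        // Keep at most 100 rows in memory; older rows are flushed to a temp file.
        SXSSFWorkbook wb = new SXSSFWorkbook(100);
        Sheet sheet = wb.createSheet("big");
        for (int r = 0; r < 100000; r++) {
            Row row = sheet.createRow(r);
            for (int c = 0; c < 10; c++) {
                row.createCell(c).setCellValue("r" + r + "c" + c);
            }
        }
        try (FileOutputStream out = new FileOutputStream("big.xlsx")) {
            wb.write(out);
        }
        wb.dispose(); // delete the temporary backing files
    }
}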
#7
1
If you save the file as a CSV, you can use any language you want to parse it.
#8
0
You can import the data into an embedded database, e.g., Apache Derby or HSQLDB (http://hsqldb.org/). Depending on the nature of your queries, it can be a little faster, and it will certainly save you a lot of time if your boss requests new features often: you simply write most of your new functionality in SQL.
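A rough sketch of that approach with an in-memory HSQLDB instance (the JDBC URL, table name, and columns are just placeholders, and the hsqldb jar has to be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class EmbeddedDbDemo {
    public static void main(String[] args) throws SQLException {
        // jdbc:hsqldb:mem:... is purely in-memory; use jdbc:hsqldb:file:... to persist.
        try (Connection con = DriverManager.getConnection("jdbc:hsqldb:mem:report", "SA", "")) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE report_data (id INT, name VARCHAR(255), amount DOUBLE)");
            }
            // Load the rows parsed from the spreadsheet/CSV...
            try (PreparedStatement ps =
                    con.prepareStatement("INSERT INTO report_data VALUES (?, ?, ?)")) {
                ps.setInt(1, 1);
                ps.setString(2, "example");
                ps.setDouble(3, 42.0);
                ps.executeUpdate();
            }
            // ...then express the "data gathering" as plain SQL.
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT name, SUM(amount) FROM report_data GROUP BY name")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
                }
            }
        }
    }
}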
#9
0
If you need ADVANCED analysis -- beyond grouping, joining, and filtering -- go for free data-mining tools such as Weka, RapidMiner (based on Weka but with a nicer GUI), or KNIME. These tools have very nice interfaces and provide operators to read CSV files. You can also run the RapidMiner and Weka libraries inside your Java program. If not, go for an embedded database as I proposed before.
Using Apache POI is not a bad idea, but I personally prefer to use it only to read the Excel file before uploading it into, e.g., a database.
Regarding the language: the best language I have found for ad hoc tasks is Groovy. It is a scripting language on top of Java, so you can use all the Java libraries (POI, JDBC drivers, ...a very long list) and mix Groovy classes with Java classes.
#10
0
There are two options for parsing Excel (.xlsx or .xls) files. 1. You can use the Apache POI API to extract the data from it; Apache POI has improved and is fast now.
2. Convert the Excel file to Office Open XML and then write an XSLT transform. I think that should work even for a long Excel file.
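For option 2, once the worksheet XML has been pulled out of the .xlsx (which is just a zip archive), applying a hand-written XSLT from Java is straightforward with javax.xml.transform; a minimal sketch (the file names are placeholders):

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class SheetXmlToCsv {
    public static void main(String[] args) throws Exception {
        // sheet1.xml: worksheet XML extracted from the .xlsx archive (xl/worksheets/).
        // sheet-to-csv.xsl: the hand-written stylesheet producing CSV text.
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("sheet-to-csv.xsl"));
        t.transform(new StreamSource("sheet1.xml"), new StreamResult("out.csv"));
    }
}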