Improving the performance of writing query results to CSV in Java

Time: 2020-12-11 07:41:43

I have the following code, which executes a query and writes the rows directly to a string buffer, which is then dumped to a CSV file. I need to write a large number of records (up to a million). For a million records this works, but it takes about half an hour to produce a file of around 200 MB, which seems like a lot of time to me; I am not sure this is the best approach. Please recommend better ways, even if they involve other jars or DB connection utilities.

....
eventNamePrepared = con.prepareStatement(gettingStats + 
    filterOptionsRowNum + filterOptions);
ResultSet rs = eventNamePrepared.executeQuery(); 
int i=0;
try{
......
FileWriter fstream = new FileWriter(realPath + 
    "performanceCollectorDumpAll.csv");
BufferedWriter out = new BufferedWriter(fstream);
StringBuffer partialCSV = new StringBuffer();


while (rs.next()) { 
  i++;
  if (current_appl_id_col_display) 
      partialCSV.append(rs.getString("current_appl_id") + ",");
  if (event_name_col_display) 
      partialCSV.append(rs.getString("event_name") + ",");
  if (generic_method_name_col_display) 
      partialCSV.append(rs.getString("generic_method_name") + ",");
  ..... // 23 more columns to be copied same way to buffer
  partialCSV.append(" \r\n");
  // Writing to file after 10000 records to prevent partialCSV 
  // from going too big and consuming lots of memory
  if (i % 10000 == 0){
      out.append(partialCSV);
      partialCSV = new StringBuffer();
  }
}               
con.close();
out.append(partialCSV);
out.close();

Thanks,

Tam

7 Answers

#1


Just write to the BufferedWriter directly instead of constructing the StringBuffer.

Also note that you should likely use StringBuilder instead of StringBuffer... StringBuffer has an internal lock, which is usually not necessary.

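To make this concrete, here is a minimal, self-contained sketch of appending fields straight to the BufferedWriter. The two hard-coded rows stand in for the ResultSet, and the StringWriter stands in for the FileWriter from the question; in the real code each field would come from rs.getString(...).

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;

public class DirectWrite {
    public static void main(String[] args) throws IOException {
        // Stand-in for the rows coming back from the ResultSet
        String[][] rows = {
            {"1001", "login", "doLogin"},
            {"1002", "logout", "doLogout"},
        };
        StringWriter target = new StringWriter(); // would be a FileWriter in the real code
        BufferedWriter out = new BufferedWriter(target);
        for (String[] row : rows) {
            // Append each field straight to the BufferedWriter;
            // its internal buffer batches the underlying writes for us,
            // so no intermediate StringBuffer is needed.
            for (String field : row) {
                out.write(field);
                out.write(',');
            }
            out.write("\r\n");
        }
        out.flush();
        System.out.print(target.toString());
    }
}
```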
#2


Profiling is generally the only sure-fire way to know why something's slow. However, in this example I would suggest two things that are low-hanging fruit:

  1. Write directly to the buffered writer instead of creating your own buffering with the StringBuilder.

  2. Refer to the columns in the result set by integer ordinal. Some drivers can be slow when resolving column names.
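As a sketch of the second point, the idea is to resolve each column name to its 1-based ordinal once, before the row loop, and fetch by integer inside it. With a real ResultSet the lookup would be rs.findColumn("event_name"); the hard-coded header below stands in for that metadata.

```java
import java.util.HashMap;
import java.util.Map;

public class OrdinalLookup {
    public static void main(String[] args) {
        // Resolve each column name to its ordinal once, before the loop.
        // With a real ResultSet this would be rs.findColumn("event_name") etc.
        String[] header = {"current_appl_id", "event_name", "generic_method_name"};
        Map<String, Integer> ordinal = new HashMap<>();
        for (int i = 0; i < header.length; i++) {
            ordinal.put(header[i], i + 1); // JDBC ordinals are 1-based
        }
        int eventNameCol = ordinal.get("event_name");
        // Inside the row loop you would then call rs.getString(eventNameCol)
        // instead of rs.getString("event_name") on every iteration.
        System.out.println("event_name -> " + eventNameCol);
    }
}
```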

#3


You could tweak various things, but for a real improvement I would try using the native tool of whatever database you are using to generate the file. If it is SQL Server, this would be bcp which can take a query string and generate the file directly. If you need to call it from Java you can spawn it as a process.

By way of example, I have just run this...

bcp "select * from trading..bar_db" queryout bar_db.txt -c -t, -Uuser -Ppassword -Sserver

...this generated a 170MB file containing 2 million rows in 10 seconds.

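If you do spawn bcp from Java, ProcessBuilder is the usual route. The sketch below only builds the argument list (the query, file name, credentials, and server are placeholders matching the example above); the commented line shows how it would actually be launched.

```java
import java.util.ArrayList;
import java.util.List;

public class BcpCommand {
    // Builds the argument list for spawning bcp from Java; the query,
    // credentials, and server here are placeholders, not real values.
    static List<String> buildBcpCommand(String query, String outFile,
                                        String user, String password, String server) {
        List<String> cmd = new ArrayList<>();
        cmd.add("bcp");
        cmd.add(query);
        cmd.add("queryout");
        cmd.add(outFile);
        cmd.add("-c");      // character-mode data file
        cmd.add("-t,");     // comma as the field terminator
        cmd.add("-U" + user);
        cmd.add("-P" + password);
        cmd.add("-S" + server);
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = buildBcpCommand(
            "select * from trading..bar_db", "bar_db.txt", "user", "password", "server");
        // To actually run it:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
        System.out.println(String.join(" ", cmd));
    }
}
```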
#4


I just wanted to add sample code for Jared Oberhaus's suggestion:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CSVExport {
    public static void main(String[] args) throws Exception {
    String table = "CUSTOMER";
    int batch = 100;

    Class.forName("oracle.jdbc.driver.OracleDriver");
    Connection conn = DriverManager.getConnection(
        "jdbc:oracle:thin:@server:orcl", "user", "pass");
    PreparedStatement pstmt = conn.prepareStatement(
        "SELECT /*+FIRST_ROWS(" + batch + ") */ * FROM " + table);
    ResultSet rs = pstmt.executeQuery();
    rs.setFetchSize(batch);
    ResultSetMetaData rsm = rs.getMetaData();
    File output = new File("result.csv");
    PrintWriter out = new PrintWriter(new BufferedWriter(
        new OutputStreamWriter(
        new FileOutputStream(output), "UTF-8")), false);
    Set<String> columns = new HashSet<String>(
        Arrays.asList("COL1", "COL3", "COL5")
    );
    while (rs.next()) {
        int k = 0;
        for (int i = 1; i <= rsm.getColumnCount(); i++) {
            if (columns.contains(rsm.getColumnName(i).toUpperCase())) {
                if (k > 0) {
                    out.print(",");
                }
                String s = rs.getString(i);
                out.print("\"");
                // Double embedded quotes, per the usual CSV escaping convention
                out.print(s != null ? s.replace("\"", "\"\"") : "");
                out.print("\"");
                k++;
            }
        }
        out.println();
    }
    out.flush();
    out.close();
    rs.close();
    pstmt.close();
    conn.close();
    }
}

#5


I have two quick thoughts. The first is, are you sure writing to disk is the problem? Could you actually be spending most of your time waiting on data from the DB?

The second is to try removing all the + "," string concatenations and using separate .append() calls instead. Considering how often those run, it may help.

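A minimal illustration of that second point: both styles produce the same CSV fragment, but the chained appends avoid creating a temporary String per column. The column values are placeholders based on the question's code.

```java
public class AppendStyle {
    public static void main(String[] args) {
        String currentApplId = "1001";
        String eventName = "login";

        // Concatenation inside append builds a throwaway String per column:
        StringBuilder a = new StringBuilder();
        a.append(currentApplId + ",");
        a.append(eventName + ",");

        // Separate append calls write into the same buffer with no temporaries:
        StringBuilder b = new StringBuilder();
        b.append(currentApplId).append(',');
        b.append(eventName).append(',');

        System.out.println(a.toString().equals(b.toString())); // same output either way
    }
}
```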
#6


You mentioned that you are using Oracle. You may want to investigate using the Oracle External Table feature or Oracle Data Pump depending on exactly what you are trying to do.

See http://www.orafaq.com/node/848 (Unloading data into an external file...)

Another option could be connecting via sqlplus and running "spool" prior to the query.

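For reference, a minimal SQL*Plus spool script might look like the following. This is a sketch: the output file name, columns, and table are placeholders taken from the question, not tested settings.

```sql
SET HEADING OFF
SET FEEDBACK OFF
SET PAGESIZE 0
SET COLSEP ','
SPOOL performanceCollectorDumpAll.csv
SELECT current_appl_id, event_name FROM performance_events;
SPOOL OFF
```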
#7


Writing to a buffered writer is normally "fast enough". If it isn't for you, then something else is slowing it down.

The easiest way to profile it is to use jvisualvm available in the latest JDK.
