I have the following design issue that I hope to get your help resolving. Below is a simplified view of what the code looks like:
class DataProcessor {
    public List<Record> processData(DataFile file) {
        List<Record> recordsList = new ArrayList<Record>();
        for (String line : file.getLines()) {
            String processedData = processData(line);
            recordsList.add(new Record(processedData));
        }
        return recordsList;
    }

    private String processData(String rawLine) {
        // code to process line
    }
}
class DatabaseManager {
    public void saveRecords(List<Record> recordsList) {
        // code to insert Record objects into the database
    }
}
class Manager {
    public static void main(String[] args) {
        DatabaseManager dbManager = new DatabaseManager("e:\\databasefile.db");
        DataFile dataFile = new DataFile("e:\\hugeRawFile.csv");
        DataProcessor dataProcessor = new DataProcessor();
        dbManager.saveRecords(dataProcessor.processData(dataFile));
    }
}
As you can see, the "processData" method of class "DataProcessor" takes a DataFile object, processes the whole file, creates a Record object for each line, and then returns a list of "Record" objects.
My problem with the "processData" method: when the raw file is really huge, the "List of Record" objects takes a lot of memory and sometimes the program fails. I need to change the current design so that memory usage is minimized. "DataProcessor" should not have direct access to "DatabaseManager". I was thinking of passing a queue to the "processData" method, where one thread runs "processData" to insert Record objects into the queue, while another thread removes Record objects from the queue and inserts them into the database. But I'm not sure about the performance implications of this.
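For reference, the queue idea you describe is the classic producer-consumer pattern, and a bounded `BlockingQueue` is the usual way to cap memory. Below is a minimal, self-contained sketch of it; `Record` is a simplified stand-in for your class, the loop over generated strings stands in for `file.getLines()`, and an in-memory list stands in for the database insert:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueSketch {
    // Simplified stand-in for the Record class in the question.
    static class Record {
        final String data;
        Record(String data) { this.data = data; }
    }

    // Poison pill: a sentinel object marking the end of the stream.
    private static final Record END = new Record(null);

    public static void main(String[] args) throws InterruptedException {
        // Bounded queue: the producer blocks when the consumer falls behind,
        // so at most 1024 records sit in memory at once.
        BlockingQueue<Record> queue = new ArrayBlockingQueue<>(1024);
        List<Record> saved = new ArrayList<>(); // stand-in for the database

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 10_000; i++) {      // stand-in for file.getLines()
                    queue.put(new Record("line-" + i)); // blocks while the queue is full
                }
                queue.put(END);                          // signal "no more records"
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (Record r = queue.take(); r != END; r = queue.take()) {
                    saved.add(r);                        // stand-in for the db insert
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
        System.out.println(saved.size()); // 10000
    }
}
```

Whether this beats single-threaded batching depends on whether the database insert is slow enough to overlap usefully with parsing.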
2 Answers
#1 (1 vote)
Put the responsibility for driving the process into the most constrained resource (in your case the DataProcessor): this will make sure the constraints are actually obeyed rather than forced to the breaking point.
Note: don't even think of multithreading; it is not going to do you any good for processing files. Threads are a solution when your data comes over the wire, when you don't know when your next data chunk is going to arrive and perhaps you have better things to do with your CPU time than to wait "until the cows come home to roost" (grin). But with files? You know the job has a start and an end, so get on with it as fast as possible.
class DataProcessor {
    public List<Record> processData(DataFile file) {
        List<Record> recordsList = new ArrayList<Record>();
        for (String line : file.getLines()) {
            String processedData = processData(line);
            recordsList.add(new Record(processedData));
        }
        return recordsList;
    }

    private String processData(String rawLine) {
        // code to process line
    }

    public void processAndSaveData(DataFile dataFile, DatabaseManager db) {
        int maxBuffSize = 1024;
        ArrayList<Record> buff = new ArrayList<Record>(maxBuffSize);
        for (String line : dataFile.getLines()) {
            String processedData = processData(line);
            buff.add(new Record(processedData));
            if (buff.size() == maxBuffSize) {
                db.saveRecords(buff);
                buff.clear();
            }
        }
        // some may still be unsaved here, fewer than maxBuffSize
        if (buff.size() > 0) {
            db.saveRecords(buff);
            // help the GC, let it recycle the records without
            // needing to check "is buff still reachable?"
            buff.clear();
        }
    }
}
class Manager {
    public static void main(String[] args) {
        DatabaseManager dbManager = new DatabaseManager("e:\\databasefile.db");
        DataFile dataFile = new DataFile("e:\\hugeRawFile.csv");
        DataProcessor dataProcessor = new DataProcessor();
        // So... do we need another stupid manager to tell us what to do?
        // dbManager.saveRecords(dataProcessor.processData(dataFile));
        // Hell, no, the most constrained resource knows better
        // how to deal with the job!
        dataProcessor.processAndSaveData(dataFile, dbManager);
    }
}
[edit] Addressing the "but we already settled on what and how, and now you're telling us we need to write extra code?" objection:
Build an AbstractProcessor class and ask your mates just to derive from it.
abstract class AbstractProcessor {
    // sorry, needs to be protected so the base class can call it
    abstract protected Record processData(String rawLine);
    abstract protected Class<? extends Record> getRecordClass();

    public void processAndSaveData(DataFile dataFile, DatabaseManager db) {
        Class<? extends Record> recordType = this.getRecordClass();
        if (recordType.equals(MyRecord1.class)) {
            // buffered read and save MyRecord1 types specifically
        }
        else if (recordType.equals(YourRecord.class)) {
            // buffered read and save YourRecord types specifically
        }
        // etc...
    }
}
Now, all they need to do is write `extends AbstractProcessor`, make their processData(String) protected, and write a trivial method declaring their record type (which might as well be an enum). It's not a huge effort to ask of them, and it turns what would have been a costly (or, for a TB input file, even impossible) operation into an as-fast-as-possible one.
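To make the derive-from-it idea concrete, here is a runnable sketch of the same template-method shape: the base class owns the buffered loop, and a hypothetical subclass (UpperCaseProcessor, my invention, not from the answer) only supplies the per-line transformation. `Record` is simplified, and a `Consumer<List<Record>>` stands in for the DatabaseManager so the sketch is self-contained:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class TemplateSketch {
    static class Record {
        final String data;
        Record(String data) { this.data = data; }
    }

    // Simplified AbstractProcessor: the base class drives the buffered loop;
    // subclasses only say how one raw line becomes a Record.
    static abstract class AbstractProcessor {
        protected abstract Record processData(String rawLine);

        public void processAndSaveData(Iterable<String> lines, Consumer<List<Record>> saver) {
            final int maxBuffSize = 1024;
            List<Record> buff = new ArrayList<>(maxBuffSize);
            for (String line : lines) {
                buff.add(processData(line));
                if (buff.size() == maxBuffSize) {
                    saver.accept(buff);   // flush a full batch
                    buff.clear();
                }
            }
            if (!buff.isEmpty()) saver.accept(buff); // flush the remainder
        }
    }

    // Hypothetical subclass: "processing" here is just uppercasing the line.
    static class UpperCaseProcessor extends AbstractProcessor {
        @Override protected Record processData(String rawLine) {
            return new Record(rawLine.toUpperCase());
        }
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 2500; i++) lines.add("row" + i);

        List<Integer> batchSizes = new ArrayList<>(); // record each flush size
        new UpperCaseProcessor().processAndSaveData(lines, batch -> batchSizes.add(batch.size()));

        System.out.println(batchSizes); // [1024, 1024, 452]
    }
}
```

Note this version dispatches via an abstract method instead of the `getRecordClass()` if/else chain in the answer; that keeps the base class closed to modification when new record types are added.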
#2 (0 votes)
You should be able to use streaming to do this in one thread, holding one record at a time in memory. The implementation depends on the technology your DatabaseManager is using.
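As an illustration of reading a file one line at a time rather than loading it whole, here is a minimal sketch using `BufferedReader`. It writes its own throwaway temp file (standing in for hugeRawFile.csv) and just counts lines where the real code would call the database, so memory stays bounded regardless of file size:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class StreamingSketch {
    public static void main(String[] args) throws IOException {
        // Create a throwaway input file standing in for hugeRawFile.csv.
        Path csv = Files.createTempFile("raw", ".csv");
        try (BufferedWriter w = Files.newBufferedWriter(csv)) {
            for (int i = 0; i < 5000; i++) {
                w.write("value-" + i);
                w.newLine();
            }
        }

        // Stream the file: only one line is held in memory at a time.
        long saved = 0;
        try (BufferedReader r = Files.newBufferedReader(csv)) {
            String line;
            while ((line = r.readLine()) != null) {
                // In the real code this would be something like
                // db.saveRecord(new Record(processData(line)));
                // here we only count, to keep the sketch runnable.
                saved++;
            }
        }
        Files.delete(csv);
        System.out.println(saved); // 5000
    }
}
```

With JDBC, the natural counterpart on the write side is a `PreparedStatement` with `addBatch()`/`executeBatch()`, flushed every N lines.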