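Document Conversion Performance Test
The finance system uses two PDF conversion components.
The first is com.artofsolving, the component the system adopted first: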
<!-- https://mvnrepository.com/artifact/com.artofsolving/jodconverter-->
<dependency>
<groupId>com.artofsolving</groupId>
<artifactId>jodconverter</artifactId>
<version>2.2.1</version>
</dependency>
The second is org.artofsolving, which the system adopted later for the upload component:
<!-- https://mvnrepository.com/artifact/org.artofsolving.jodconverter/jodconverter-core -->
<dependency>
<groupId>org.artofsolving.jodconverter</groupId>
<artifactId>jodconverter-core</artifactId>
<version>3.0-beta-4</version>
</dependency>
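The two components behaved differently during development and testing.
First, the OpenOffice version in use is 4.1.2.
Recommended system requirements:
* Microsoft Windows XP, Vista, Windows 7, or Windows 8
* Pentium III or later processor
* 256 MB RAM (512 MB recommended)
* Up to 1.5 GB of available disk space
* 1024x768 resolution (higher recommended), at least 256 colors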
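Dropzone: configuration in use
Maximum size of a single uploaded file: 100 MB
No limit on the number of uploaded files
Number of files uploaded in parallel: 3
The rest of this test focuses on the efficiency of PDF conversion after upload.
Test files: test.doc, test.ppt, test.xls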
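From the comparison above, org supports more formats but has no speed advantage, and the text in its converted PDFs is slightly less sharp than com's.
Notably, when a file name contains the characters ( or ), the uploaded file cannot be read on Linux.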
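One possible workaround for the file name problem is to normalize the name before the file is handed to the converter. The sketch below is only an illustration: the class SafeFileNames and its sanitize method are hypothetical and are not part of JODConverter or of the system under test. It assumes that replacing every run of characters other than letters, digits, '.', '_' and '-' with a single underscore is acceptable for stored attachments.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of JODConverter or of the system under test. */
final class SafeFileNames {
    // One or more characters that are not a Unicode letter, a digit, '.', '_' or '-'.
    private static final Pattern UNSAFE = Pattern.compile("[^\\p{L}\\p{N}._-]+");

    private SafeFileNames() {}

    /** Replaces each run of unsafe characters, such as "(" and ")", with one underscore. */
    static String sanitize(String fileName) {
        return UNSAFE.matcher(fileName).replaceAll("_");
    }
}
```

With this rule, report(1).doc becomes report_1_.doc, while test.doc is left unchanged. Unicode letters are kept, so Chinese file names survive the rewrite.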
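Why do the supported file types differ? In the com.artofsolving source, DocumentFormatRegistry is an interface with several implementations, and the documentFormats list of the default registry contains no MS Office 2007 formats: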
public class DefaultDocumentFormatRegistry extends BasicDocumentFormatRegistry {
    public DefaultDocumentFormatRegistry() {
        final DocumentFormat pdf = new DocumentFormat("Portable Document Format", "application/pdf", "pdf");
        pdf.setExportFilter(DocumentFamily.DRAWING, "draw_pdf_Export");
        pdf.setExportFilter(DocumentFamily.PRESENTATION, "impress_pdf_Export");
        pdf.setExportFilter(DocumentFamily.SPREADSHEET, "calc_pdf_Export");
        pdf.setExportFilter(DocumentFamily.TEXT, "writer_pdf_Export");
        addDocumentFormat(pdf);

        final DocumentFormat swf = new DocumentFormat("Macromedia Flash", "application/x-shockwave-flash", "swf");
        swf.setExportFilter(DocumentFamily.DRAWING, "draw_flash_Export");
        swf.setExportFilter(DocumentFamily.PRESENTATION, "impress_flash_Export");
        addDocumentFormat(swf);

        final DocumentFormat xhtml = new DocumentFormat("XHTML", "application/xhtml+xml", "xhtml");
        xhtml.setExportFilter(DocumentFamily.PRESENTATION, "XHTML Impress File");
        xhtml.setExportFilter(DocumentFamily.SPREADSHEET, "XHTML Calc File");
        xhtml.setExportFilter(DocumentFamily.TEXT, "XHTML Writer File");
        addDocumentFormat(xhtml);

        // HTML is treated as Text when supplied as input, but as an output it is also
        // available for exporting Spreadsheet and Presentation formats
        final DocumentFormat html = new DocumentFormat("HTML", DocumentFamily.TEXT, "text/html", "html");
        html.setExportFilter(DocumentFamily.PRESENTATION, "impress_html_Export");
        html.setExportFilter(DocumentFamily.SPREADSHEET, "HTML (StarCalc)");
        html.setExportFilter(DocumentFamily.TEXT, "HTML (StarWriter)");
        addDocumentFormat(html);

        final DocumentFormat odt = new DocumentFormat("OpenDocument Text", DocumentFamily.TEXT, "application/vnd.oasis.opendocument.text", "odt");
        odt.setExportFilter(DocumentFamily.TEXT, "writer8");
        addDocumentFormat(odt);

        final DocumentFormat sxw = new DocumentFormat("OpenOffice.org 1.0 Text Document", DocumentFamily.TEXT, "application/vnd.sun.xml.writer", "sxw");
        sxw.setExportFilter(DocumentFamily.TEXT, "StarOffice XML (Writer)");
        addDocumentFormat(sxw);

        final DocumentFormat doc = new DocumentFormat("Microsoft Word", DocumentFamily.TEXT, "application/msword", "doc");
        doc.setExportFilter(DocumentFamily.TEXT, "MS Word 97");
        addDocumentFormat(doc);

        final DocumentFormat rtf = new DocumentFormat("Rich Text Format", DocumentFamily.TEXT, "text/rtf", "rtf");
        rtf.setExportFilter(DocumentFamily.TEXT, "Rich Text Format");
        addDocumentFormat(rtf);

        final DocumentFormat wpd = new DocumentFormat("WordPerfect", DocumentFamily.TEXT, "application/wordperfect", "wpd");
        addDocumentFormat(wpd);

        final DocumentFormat txt = new DocumentFormat("Plain Text", DocumentFamily.TEXT, "text/plain", "txt");
        // set FilterName to "Text" to prevent OOo from trying to display the "ASCII Filter Options" dialog
        // alternatively FilterName could be "Text (encoded)" and FilterOptions used to set encoding if needed
        txt.setImportOption("FilterName", "Text");
        txt.setExportFilter(DocumentFamily.TEXT, "Text");
        addDocumentFormat(txt);

        final DocumentFormat wikitext = new DocumentFormat("MediaWiki wikitext", "text/x-wiki", "wiki");
        wikitext.setExportFilter(DocumentFamily.TEXT, "MediaWiki");
        addDocumentFormat(wikitext);

        final DocumentFormat ods = new DocumentFormat("OpenDocument Spreadsheet", DocumentFamily.SPREADSHEET, "application/vnd.oasis.opendocument.spreadsheet", "ods");
        ods.setExportFilter(DocumentFamily.SPREADSHEET, "calc8");
        addDocumentFormat(ods);

        final DocumentFormat sxc = new DocumentFormat("OpenOffice.org 1.0 Spreadsheet", DocumentFamily.SPREADSHEET, "application/vnd.sun.xml.calc", "sxc");
        sxc.setExportFilter(DocumentFamily.SPREADSHEET, "StarOffice XML (Calc)");
        addDocumentFormat(sxc);

        final DocumentFormat xls = new DocumentFormat("Microsoft Excel", DocumentFamily.SPREADSHEET, "application/vnd.ms-excel", "xls");
        xls.setExportFilter(DocumentFamily.SPREADSHEET, "MS Excel 97");
        addDocumentFormat(xls);

        final DocumentFormat csv = new DocumentFormat("CSV", DocumentFamily.SPREADSHEET, "text/csv", "csv");
        csv.setImportOption("FilterName", "Text - txt - csv (StarCalc)");
        csv.setImportOption("FilterOptions", "44,34,0"); // Field Separator: ','; Text Delimiter: '"'
        csv.setExportFilter(DocumentFamily.SPREADSHEET, "Text - txt - csv (StarCalc)");
        csv.setExportOption(DocumentFamily.SPREADSHEET, "FilterOptions", "44,34,0");
        addDocumentFormat(csv);

        final DocumentFormat tsv = new DocumentFormat("Tab-separated Values", DocumentFamily.SPREADSHEET, "text/tab-separated-values", "tsv");
        tsv.setImportOption("FilterName", "Text - txt - csv (StarCalc)");
        tsv.setImportOption("FilterOptions", "9,34,0"); // Field Separator: '\t'; Text Delimiter: '"'
        tsv.setExportFilter(DocumentFamily.SPREADSHEET, "Text - txt - csv (StarCalc)");
        tsv.setExportOption(DocumentFamily.SPREADSHEET, "FilterOptions", "9,34,0");
        addDocumentFormat(tsv);

        final DocumentFormat odp = new DocumentFormat("OpenDocument Presentation", DocumentFamily.PRESENTATION, "application/vnd.oasis.opendocument.presentation", "odp");
        odp.setExportFilter(DocumentFamily.PRESENTATION, "impress8");
        addDocumentFormat(odp);

        final DocumentFormat sxi = new DocumentFormat("OpenOffice.org 1.0 Presentation", DocumentFamily.PRESENTATION, "application/vnd.sun.xml.impress", "sxi");
        sxi.setExportFilter(DocumentFamily.PRESENTATION, "StarOffice XML (Impress)");
        addDocumentFormat(sxi);

        final DocumentFormat ppt = new DocumentFormat("Microsoft PowerPoint", DocumentFamily.PRESENTATION, "application/vnd.ms-powerpoint", "ppt");
        ppt.setExportFilter(DocumentFamily.PRESENTATION, "MS PowerPoint 97");
        addDocumentFormat(ppt);

        final DocumentFormat odg = new DocumentFormat("OpenDocument Drawing", DocumentFamily.DRAWING, "application/vnd.oasis.opendocument.graphics", "odg");
        odg.setExportFilter(DocumentFamily.DRAWING, "draw8");
        addDocumentFormat(odg);

        final DocumentFormat svg = new DocumentFormat("Scalable Vector Graphics", "image/svg+xml", "svg");
        svg.setExportFilter(DocumentFamily.DRAWING, "draw_svg_Export");
        addDocumentFormat(svg);
    }
}
The corresponding org source, by contrast, also registers docx, xlsx, and pptx:
public class DefaultDocumentFormatRegistry extends SimpleDocumentFormatRegistry {
    public DefaultDocumentFormatRegistry() {
        DocumentFormat pdf = new DocumentFormat("Portable Document Format", "pdf", "application/pdf");
        pdf.setStoreProperties(DocumentFamily.TEXT, Collections.singletonMap("FilterName", "writer_pdf_Export"));
        pdf.setStoreProperties(DocumentFamily.SPREADSHEET, Collections.singletonMap("FilterName", "calc_pdf_Export"));
        pdf.setStoreProperties(DocumentFamily.PRESENTATION, Collections.singletonMap("FilterName", "impress_pdf_Export"));
        pdf.setStoreProperties(DocumentFamily.DRAWING, Collections.singletonMap("FilterName", "draw_pdf_Export"));
        this.addFormat(pdf);

        DocumentFormat swf = new DocumentFormat("Macromedia Flash", "swf", "application/x-shockwave-flash");
        swf.setStoreProperties(DocumentFamily.PRESENTATION, Collections.singletonMap("FilterName", "impress_flash_Export"));
        swf.setStoreProperties(DocumentFamily.DRAWING, Collections.singletonMap("FilterName", "draw_flash_Export"));
        this.addFormat(swf);

        DocumentFormat html = new DocumentFormat("HTML", "html", "text/html");
        html.setInputFamily(DocumentFamily.TEXT);
        html.setStoreProperties(DocumentFamily.TEXT, Collections.singletonMap("FilterName", "HTML (StarWriter)"));
        html.setStoreProperties(DocumentFamily.SPREADSHEET, Collections.singletonMap("FilterName", "HTML (StarCalc)"));
        html.setStoreProperties(DocumentFamily.PRESENTATION, Collections.singletonMap("FilterName", "impress_html_Export"));
        this.addFormat(html);

        DocumentFormat odt = new DocumentFormat("OpenDocument Text", "odt", "application/vnd.oasis.opendocument.text");
        odt.setInputFamily(DocumentFamily.TEXT);
        odt.setStoreProperties(DocumentFamily.TEXT, Collections.singletonMap("FilterName", "writer8"));
        this.addFormat(odt);

        DocumentFormat sxw = new DocumentFormat("OpenOffice.org 1.0 Text Document", "sxw", "application/vnd.sun.xml.writer");
        sxw.setInputFamily(DocumentFamily.TEXT);
        sxw.setStoreProperties(DocumentFamily.TEXT, Collections.singletonMap("FilterName", "StarOffice XML (Writer)"));
        this.addFormat(sxw);

        DocumentFormat doc = new DocumentFormat("Microsoft Word", "doc", "application/msword");
        doc.setInputFamily(DocumentFamily.TEXT);
        doc.setStoreProperties(DocumentFamily.TEXT, Collections.singletonMap("FilterName", "MS Word 97"));
        this.addFormat(doc);

        DocumentFormat docx = new DocumentFormat("Microsoft Word 2007 XML", "docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
        docx.setInputFamily(DocumentFamily.TEXT);
        this.addFormat(docx);

        DocumentFormat rtf = new DocumentFormat("Rich Text Format", "rtf", "text/rtf");
        rtf.setInputFamily(DocumentFamily.TEXT);
        rtf.setStoreProperties(DocumentFamily.TEXT, Collections.singletonMap("FilterName", "Rich Text Format"));
        this.addFormat(rtf);

        DocumentFormat wpd = new DocumentFormat("WordPerfect", "wpd", "application/wordperfect");
        wpd.setInputFamily(DocumentFamily.TEXT);
        this.addFormat(wpd);

        DocumentFormat txt = new DocumentFormat("Plain Text", "txt", "text/plain");
        txt.setInputFamily(DocumentFamily.TEXT);
        LinkedHashMap txtLoadAndStoreProperties = new LinkedHashMap();
        txtLoadAndStoreProperties.put("FilterName", "Text (encoded)");
        txtLoadAndStoreProperties.put("FilterOptions", "utf8");
        txt.setLoadProperties(txtLoadAndStoreProperties);
        txt.setStoreProperties(DocumentFamily.TEXT, txtLoadAndStoreProperties);
        this.addFormat(txt);

        DocumentFormat wikitext = new DocumentFormat("MediaWiki wikitext", "wiki", "text/x-wiki");
        wikitext.setStoreProperties(DocumentFamily.TEXT, Collections.singletonMap("FilterName", "MediaWiki"));

        DocumentFormat ods = new DocumentFormat("OpenDocument Spreadsheet", "ods", "application/vnd.oasis.opendocument.spreadsheet");
        ods.setInputFamily(DocumentFamily.SPREADSHEET);
        ods.setStoreProperties(DocumentFamily.SPREADSHEET, Collections.singletonMap("FilterName", "calc8"));
        this.addFormat(ods);

        DocumentFormat sxc = new DocumentFormat("OpenOffice.org 1.0 Spreadsheet", "sxc", "application/vnd.sun.xml.calc");
        sxc.setInputFamily(DocumentFamily.SPREADSHEET);
        sxc.setStoreProperties(DocumentFamily.SPREADSHEET, Collections.singletonMap("FilterName", "StarOffice XML (Calc)"));
        this.addFormat(sxc);

        DocumentFormat xls = new DocumentFormat("Microsoft Excel", "xls", "application/vnd.ms-excel");
        xls.setInputFamily(DocumentFamily.SPREADSHEET);
        xls.setStoreProperties(DocumentFamily.SPREADSHEET, Collections.singletonMap("FilterName", "MS Excel 97"));
        this.addFormat(xls);

        DocumentFormat xlsx = new DocumentFormat("Microsoft Excel 2007 XML", "xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
        xlsx.setInputFamily(DocumentFamily.SPREADSHEET);
        this.addFormat(xlsx);

        DocumentFormat csv = new DocumentFormat("Comma Separated Values", "csv", "text/csv");
        csv.setInputFamily(DocumentFamily.SPREADSHEET);
        LinkedHashMap csvLoadAndStoreProperties = new LinkedHashMap();
        csvLoadAndStoreProperties.put("FilterName", "Text - txt - csv (StarCalc)");
        csvLoadAndStoreProperties.put("FilterOptions", "44,34,0");
        csv.setLoadProperties(csvLoadAndStoreProperties);
        csv.setStoreProperties(DocumentFamily.SPREADSHEET, csvLoadAndStoreProperties);
        this.addFormat(csv);

        DocumentFormat tsv = new DocumentFormat("Tab Separated Values", "tsv", "text/tab-separated-values");
        tsv.setInputFamily(DocumentFamily.SPREADSHEET);
        LinkedHashMap tsvLoadAndStoreProperties = new LinkedHashMap();
        tsvLoadAndStoreProperties.put("FilterName", "Text - txt - csv (StarCalc)");
        tsvLoadAndStoreProperties.put("FilterOptions", "9,34,0");
        tsv.setLoadProperties(tsvLoadAndStoreProperties);
        tsv.setStoreProperties(DocumentFamily.SPREADSHEET, tsvLoadAndStoreProperties);
        this.addFormat(tsv);

        DocumentFormat odp = new DocumentFormat("OpenDocument Presentation", "odp", "application/vnd.oasis.opendocument.presentation");
        odp.setInputFamily(DocumentFamily.PRESENTATION);
        odp.setStoreProperties(DocumentFamily.PRESENTATION, Collections.singletonMap("FilterName", "impress8"));
        this.addFormat(odp);

        DocumentFormat sxi = new DocumentFormat("OpenOffice.org 1.0 Presentation", "sxi", "application/vnd.sun.xml.impress");
        sxi.setInputFamily(DocumentFamily.PRESENTATION);
        sxi.setStoreProperties(DocumentFamily.PRESENTATION, Collections.singletonMap("FilterName", "StarOffice XML (Impress)"));
        this.addFormat(sxi);

        DocumentFormat ppt = new DocumentFormat("Microsoft PowerPoint", "ppt", "application/vnd.ms-powerpoint");
        ppt.setInputFamily(DocumentFamily.PRESENTATION);
        ppt.setStoreProperties(DocumentFamily.PRESENTATION, Collections.singletonMap("FilterName", "MS PowerPoint 97"));
        this.addFormat(ppt);

        DocumentFormat pptx = new DocumentFormat("Microsoft PowerPoint 2007 XML", "pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation");
        pptx.setInputFamily(DocumentFamily.PRESENTATION);
        this.addFormat(pptx);

        DocumentFormat odg = new DocumentFormat("OpenDocument Drawing", "odg", "application/vnd.oasis.opendocument.graphics");
        odg.setInputFamily(DocumentFamily.DRAWING);
        odg.setStoreProperties(DocumentFamily.DRAWING, Collections.singletonMap("FilterName", "draw8"));
        this.addFormat(odg);

        DocumentFormat svg = new DocumentFormat("Scalable Vector Graphics", "svg", "image/svg+xml");
        svg.setStoreProperties(DocumentFamily.DRAWING, Collections.singletonMap("FilterName", "draw_svg_Export"));
        this.addFormat(svg);
    }
}
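The two implementations work on essentially the same principle, so com could probably be customized to support the additional file formats.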
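The customization idea can be illustrated with a toy model. The classes below are hypothetical and deliberately do not use the JODConverter API: they only show that a registry is, in essence, a lookup from file extension to format, so an extension that was never registered (docx in com's default registry) resolves to nothing, and a subclass that registers it fixes the lookup.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Simplified stand-in for a format registry: maps a file extension to a MIME type. */
class ToyFormatRegistry {
    private final Map<String, String> mimeTypeByExtension = new LinkedHashMap<>();

    ToyFormatRegistry() {
        // A few of the entries that both default registries quoted above define.
        addFormat("doc", "application/msword");
        addFormat("xls", "application/vnd.ms-excel");
        addFormat("ppt", "application/vnd.ms-powerpoint");
    }

    final void addFormat(String extension, String mimeType) {
        mimeTypeByExtension.put(extension.toLowerCase(Locale.ROOT), mimeType);
    }

    /** Returns the MIME type, or null when the extension was never registered. */
    String mimeTypeFor(String extension) {
        return mimeTypeByExtension.get(extension.toLowerCase(Locale.ROOT));
    }
}

/** Customized registry: the defaults plus the MS Office 2007 formats. */
class ToyMs2007Registry extends ToyFormatRegistry {
    ToyMs2007Registry() {
        addFormat("docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
        addFormat("xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
        addFormat("pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation");
    }
}
```

In the real com component the analogous step would presumably be a subclass of DefaultDocumentFormatRegistry that registers docx, xlsx, and pptx through addDocumentFormat, as the quoted source does for the other formats. That has not been verified here, and the installed OpenOffice must also be able to load those files.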
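The time taken varies with the number and size of the test files. With many medium-sized files, serial PDF conversion inevitably takes a long time; switching to parallel processing can speed this up.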
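A minimal sketch of the parallel approach, using only the JDK's ExecutorService. The class ParallelConversion and the convertOne function are hypothetical placeholders: convertOne stands in for whatever call performs a single document-to-PDF conversion. The sketch also assumes that the conversions can actually run side by side, which in practice likely requires more than one OpenOffice process behind the converter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper that fans conversions out over a fixed pool of worker threads. */
final class ParallelConversion {
    private ParallelConversion() {}

    /** Applies convertOne to every input on up to `workers` threads; results keep input order. */
    static <I, O> List<O> convertAll(List<I> inputs, Function<I, O> convertOne, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<O>> pending = new ArrayList<>();
            for (I input : inputs) {
                Callable<O> task = () -> convertOne.apply(input);
                pending.add(pool.submit(task));
            }
            List<O> results = new ArrayList<>();
            for (Future<O> future : pending) {
                results.add(future.get()); // blocks until this conversion has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for conversions", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a conversion failed", e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A worker count of 3 would match the number of files Dropzone uploads in parallel, though the best value depends on the memory and CPU available to OpenOffice.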
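Notably, there are two ways to create a converter, one that does not support MS 2007 formats and one (from org) that does.
Without MS 2007 support (StreamOpenOfficeDocumentConverter, a com.artofsolving class):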
DocumentConverter converter = new StreamOpenOfficeDocumentConverter(connection);
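However, according to reports online, this converter can resolve the exception
com.artofsolving.jodconverter.openoffice.connection.OpenOfficeException: conversion failed: could not load input document, which stems from how file paths are resolved on Linux, presumably because it streams the document to OpenOffice instead of passing a file path.
With MS 2007 support: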
OfficeManager officeManager = getOfficeManager();
// Connect to OpenOffice
OfficeDocumentConverter converter = new OfficeDocumentConverter(officeManager);
How com creates its converter:
connection.connect();
DocumentConverter converter = new OpenOfficeDocumentConverter(connection);
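com can also create a converter through StreamOpenOfficeDocumentConverter, but this system does not use that approach.
In summary, when the uploaded files average no more than 5 MB each and there are no more than 5 of them, the system can finish processing within 10 seconds.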
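In later tests, files larger than 10 MB combined with frequent conversions exhausted system resources: conversions could not complete, and tasks submitted afterwards were no longer accepted into the component's task queue. There is also a performance problem: converting large files (around 20 MB) sometimes times out. The source sets the execution time limit for a single PDF conversion task to 120 s; on timeout the component reports an error, reconnects, and moves on to the next task.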
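The timeout behavior described above can be approximated at the application level. The following is a sketch, not the component's own implementation: TimedConversion and runWithTimeout are hypothetical names, and the task passed in stands for one conversion call. It runs the task on a separate thread, waits up to the given limit (120 s in the component's source), and gives up on the task if the limit is exceeded so that the caller can move on to the next file.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical wrapper that bounds how long a single conversion may run. */
final class TimedConversion {
    private TimedConversion() {}

    /** Returns the task's result, or an empty Optional if it did not finish in time. */
    static <T> Optional<T> runWithTimeout(Callable<T> task, long timeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        Future<T> future = worker.submit(task);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker thread running the slow task
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            throw new IllegalStateException("conversion failed", e.getCause());
        } finally {
            worker.shutdownNow();
        }
    }
}
```

Interrupting the Java thread does not by itself stop work already running inside the OpenOffice process, so a real deployment would probably still need the reconnect step the component performs after a timeout.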
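Development of the attachment upload feature ran into many pitfalls:
- The input file cannot be read: caused by the port being occupied; fixed by restarting.
- Special characters in file names cannot be parsed: this is related to Alibaba Cloud file upload.
- The port is occupied, so conversion of other small files cannot continue.
  Fix (code level): change the OpenOffice connection code to first connect to an already-running OpenOffice service, and only otherwise start a new conversion service and connect to it.
- docx and other newer MS Office formats are not supported (resolved by adding the org component).
- No concurrent processing and no large-file conversion.
  After reading the source, no optimization proved possible: the components ship with almost no source code, and what is visible is only decompiled output.
- OpenOffice behaves somewhat differently on Windows and Linux, mainly in conversion time and in how file formats, file names, and file types are parsed.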
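The "connect to a running service first" fix for the occupied-port problem can be sketched with a plain TCP probe. PortProbe is a hypothetical helper, not the system's actual code, and it assumes OpenOffice is listening on a known host and port; an open port does not prove the listener is OpenOffice, so the result is only a hint for deciding whether to start a new process.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper that checks whether something is already listening on a port. */
final class PortProbe {
    private PortProbe() {}

    /** Returns true when a TCP connection to host:port succeeds within the timeout. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // refused, unreachable, or timed out
        }
    }
}
```

A caller would check, for example, PortProbe.isListening("127.0.0.1", 8100, 500), where 8100 is an assumed port, reuse the existing service when the result is true, and start a new OpenOffice process only when it is false.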