I have been using POI to parse XLS and XLSX files successfully. However, I am unable to correctly extract special characters, such as UTF-8 encoded characters like Chinese or Japanese, from an Excel spreadsheet. I have figured out how to extract data from a UTF-8 encoded csv or tab delimited file, but no luck with the Excel file. Can anyone help?
(Edit: Code snippet from comments)
HSSFSheet sheet = workbook.getSheet(worksheet);
HSSFEvaluationWorkbook ewb = HSSFEvaluationWorkbook.create(workbook);
while (rowCtr <= lastRow && !rowBreakOut)
{
    Row row = sheet.getRow(rowCtr);
    for (int col = firstCell; col < lastCell && !breakOut; col++) {
        // With RETURN_BLANK_AS_NULL, blank cells come back as null, so guard before use.
        Cell cell = row.getCell(col, Row.RETURN_BLANK_AS_NULL);
        if (cell == null) continue;
        int ctype = cell.getCellType();
        if (ctype == Cell.CELL_TYPE_STRING) {
            sValue = cell.getStringCellValue();
            log.warn("String value = " + sValue);
            // URL-encoding shows the UTF-8 bytes as ASCII, independent of the log's own encoding.
            String encoded = URLEncoder.encode(sValue, "UTF-8");
            log.warn("URL-encoded with UTF-8: " + encoded);
            ....
4 solutions
#1
9
I had the same problem while extracting Persian text from an Excel file. I was using Eclipse, and simply going to Project -> Properties and changing the "text file encoding" to UTF-8 solved the problem.
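A possible explanation for why an encoding setting fixes this: a Java String already holds Unicode, so the text is usually damaged only when it is converted to bytes in a charset that cannot represent it. The sketch below uses only the standard library (no POI) and a hypothetical helper name, roundTrip, to show the effect. The Persian text is written with Unicode escapes so the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    /** Encodes text to bytes in the given charset, then decodes those bytes back. */
    public static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String persian = "\u0633\u0644\u0627\u0645"; // the word "salam" in Persian script

        // UTF-8 can represent every character, so the text survives unchanged.
        System.out.println(roundTrip(persian, StandardCharsets.UTF_8).equals(persian)); // true

        // ISO-8859-1 has no Persian letters; getBytes substitutes '?' for each one.
        System.out.println(roundTrip(persian, StandardCharsets.ISO_8859_1)); // ????
    }
}
```

If the value read from the spreadsheet is correct in memory but shows up as question marks in the console or a log, the lossy step is likely the output encoding rather than POI.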
#2
3
In POI you can set the character set on the font, like this:
Workbook wb = new HSSFWorkbook();
Sheet sheet = wb.createSheet("new sheet");
// Create a row and put some cells in it. Rows are 0 based.
Row row = sheet.createRow(1);
// Create a new font and alter it.
Font font = wb.createFont();
font.setCharSet(FontCharset.ARABIC.getValue());
font.setFontHeightInPoints((short)24);
font.setFontName("B Nazanin");
font.setItalic(true);
font.setStrikeout(true);
// Fonts are set into a style so create a new one to use.
CellStyle style = wb.createCellStyle();
style.setFont(font);
// Create a cell and put a value in it.
Cell cell = row.createCell(1);
cell.setCellValue("سلام");
cell.setCellStyle(style);
// Write the output to a file
FileOutputStream fileOut = new FileOutputStream("workbook.xls");
wb.write(fileOut);
fileOut.close();
You can also use any of the other charsets defined in FontCharset.
#3
1
The solution is simple. To read cell string values that contain non-English characters, use the following call:
sValue = cell.getRichStringCellValue().getString();
instead of:
sValue = cell.getStringCellValue();
This applies to UTF-8 encoded characters like Chinese, Arabic or Japanese.
P.S. If anybody is using the command-line utility nullpunkt/excel-to-json, which uses the Apache POI library, modify the file converter/ExcelToJsonConverter.java by replacing the occurrences of "getStringCellValue()" with "getRichStringCellValue().getString()" to avoid reading non-English characters as "???".
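To tell whether the "???" is really in the string or only appears when it is displayed, one option is to print the code points as ASCII. This is a minimal sketch using only the standard library; the class and method names (CodePointDump, dump) are made up for illustration, and the value passed in would be whatever the cell returned.

```java
import java.util.stream.Collectors;

class CodePointDump {
    /** Renders each code point of the string as U+XXXX, separated by spaces. */
    public static String dump(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Correctly read Chinese text shows its real code points.
        System.out.println(dump("\u4e2d\u6587")); // U+4E2D U+6587

        // Text that was already replaced by question marks shows U+003F.
        System.out.println(dump("??")); // U+003F U+003F
    }
}
```

If the dump shows the expected code points, the string was read correctly and only the console or log encoding needs fixing.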
#4
0
Get the bytes encoded as UTF-8, as follows:
cell.getStringCellValue().getBytes(Charset.forName("UTF-8"));
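The bytes only help if whatever writes them out also uses UTF-8. As a rough sketch under that assumption, the example below wraps an output stream in a PrintStream with an explicit "UTF-8" encoding. The class and method names are hypothetical, and a ByteArrayOutputStream stands in for System.out or a file so the result can be inspected.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8Output {
    /** Prints text through a PrintStream that encodes as UTF-8 and returns the bytes written. */
    public static byte[] printAsUtf8(String text) throws UnsupportedEncodingException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // The third argument fixes the encoding instead of relying on the platform default.
        PrintStream out = new PrintStream(sink, true, "UTF-8");
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] bytes = printAsUtf8("\u4e2d\u6587"); // two Chinese characters
        System.out.println(bytes.length); // 6, since each character takes 3 bytes in UTF-8
    }
}
```

The same constructor can wrap System.out, for example new PrintStream(System.out, true, "UTF-8"), when the terminal itself is set to UTF-8.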