Tuesday, July 20, 2010

Writing large excel files (Excel 2007)

I recently was working on an assignment of exporting data from database to excel-2007. This made me hunt for an API that would take care of low-level excel format handling. There are probably many Java API's available but I am aware of JExcel and Apache POI. I had chosen JExcel earlier to generate and download ".xls" documents on a portal. There was no particular reason I chose JExcel over Apache POI, but I was curious about Apache POI and this was my chance.

Apache POI works very well and has a huge array of functionalities to write sophisticated xlsx documents. Apache POI permits writing xls documents as well using the HSSF Suite of packages. There is a prominent difference in the internal structure of xls(Excel-97 - 2003) documents. To demonstrate this: Create two documents and save one with ".xls" extension and the other with ".xlsx" extension. Some pre-requisite to doing this is: you must have Office 2007 to create xlsx documents and you must have some Zip/UnZip software like WinRar, Zip etc installed and working. Right Click each of the documents and try to open with Zip software you have. ".xlsx" opens up and shows directory structure of what looks like, whereas the ".xls" document refuses to open.:

Structure

/
- ..
- _rels
- .rels
- docProps
- app.xml
- core.xml
- xl
- _rels
- worksheets
- sheet1.xml
- sheet2.xml
- sharedStrings.xml
- styles.xml
- workbook.xml
- [Content_Types].xml

So much for the introduction to .xlsx document. Now getting back to the writing some code to generate excel'07 docs. Check out this link for getting started quickly: http://poi.apache.org/spreadsheet/quick-guide.html.

Within some time I was able to generate excel reports by spitting the data fetched from database. However, for large reports having several thousands of rows and tens of columns I started facing fatal "OutOfMemory Error". Our server hosts 8 GB RAM and I was testing this on my 2 GB laptop. Probably if I'd run the same on server it would not have thrown up. But I did not want to take chances with multiple users trying to generate reports simultaneously which could exhaust our server

I see Apache POI has API that is capable of opening streams from xlsx documents. So I tried this approach of writing a few rows...closing the stream and then reopening it and start re-writing it. I was hoping that all data would be flushed from physical memory....but it did not go as expected. Here is the output of my code:


Please wait while data is being written...
TOTAL COLS:13
Rows updated = 2000
WorkSheet name REPORT3_VIEW
Rows updated = 4000
Connection OFF
Connection Count: 0
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at org.apache.xmlbeans.impl.store.CharUtil.allocate(CharUtil.java:397)
at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:441)...


I would not suggest this approach for two reasons:
* it is slow
* it still causes java to retain document in the memory causing heap space to exhaust. I tried with smaller value for MAX_ROWS_UPDATED_AT_ONCE and it perhaps met the same fate...see below:



Headers written..
Please wait while data is being written...
TOTAL COLS:13
Rows updated = 200
WorkSheet nameAFP_REPORT3_VIEW
Rows updated = 400
WorkSheet nameAFP_REPORT3_VIEW
Rows updated = 600
WorkSheet nameAFP_REPORT3_VIEW
Rows updated = 800
WorkSheet nameAFP_REPORT3_VIEW
Rows updated = 1000
WorkSheet nameAFP_REPORT3_VIEW
Rows updated = 1200
WorkSheet nameAFP_REPORT3_VIEW
Rows updated = 1400
WorkSheet nameAFP_REPORT3_VIEW
Rows updated = 1600
WorkSheet nameAFP_REPORT3_VIEW
Rows updated = 1800
WorkSheet nameAFP_REPORT3_VIEW
Rows updated = 2000
WorkSheet nameAFP_REPORT3_VIEW
Rows updated = 2200
WorkSheet nameAFP_REPORT3_VIEW
Rows updated = 2400
WorkSheet nameAFP_REPORT3_VIEW
Rows updated = 2600
WorkSheet nameAFP_REPORT3_VIEW
Rows updated = 2800
WorkSheet nameAFP_REPORT3_VIEW
Rows updated = 3000
WorkSheet nameAFP_REPORT3_VIEW
Rows updated = 3200
WorkSheet nameAFP_REPORT3_VIEW
Rows updated = 3400
WorkSheet nameAFP_REPORT3_VIEW
Rows updated = 3600
Connection OFF
Connection Count: 0
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space



Later I came across a solution from Yegor Koslov which mitigates the error above. In brief, his solution works like this: (1) A template xlsx doc is created which indicates the number of sheets to be written and the sheet names etc (2) All the data is dumped into an xml file (3) Build the zip file with the structured described above. This zip file is nothing but .xlsx doc. Details can be found here:http://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/xssf/usermodel/examples/BigGridDemo.java

Extending his logic to add more sheets in the xlsx document:

Disclaimer: Please use it at your own risk.


Yegor's solution worked very well for me and I chose to customize it a little bit. Any feedback would be helpful.

package ....common.util.excel;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.Writer;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Enumeration;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

import ....util.excel.BigGridDemo.SpreadsheetWriter;
import ....util.excel.SheetNotFoundException;
import org.apache.poi.ss.usermodel.BorderStyle;
import org.apache.poi.ss.usermodel.IndexedColors;
import org.apache.poi.xssf.usermodel.XSSFCellStyle;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import ...common.util.db.ExportDBView;
import ...logging.Logger;



public class Excel07WorkbookWriter {
private XSSFWorkbook xwb;
private String xlsxFileName;
private Map styles;
private LinkedHashMap sheets;
private LinkedHashMap sheetData ;
private HashMap sheetSheetRefRelation;
private static Logger logg = Logger.getInstance(Excel07WorkbookWriter.class);

/*
* @ Border selection will be same across all 4 cell borders
* @ alignment
*/
private XSSFCellStyle createDataCellStyle(XSSFCellStyle style, BorderStyle borderStyle, short horAlign, short verAlign,
boolean wrapText, boolean even){
style.setBorderBottom(borderStyle);
style.setBorderLeft(borderStyle);
style.setBorderRight(borderStyle);
style.setBorderTop(borderStyle);
style.setAlignment(horAlign);
style.setVerticalAlignment(verAlign);
style.setWrapText(wrapText);
if (even){
//style.setFillForegroundColor(this.DATA_FOREGROUND_COLOR_EVEN);
style.setFillForegroundColor(IndexedColors.LIGHT_CORNFLOWER_BLUE.getIndex());
}else{
style.setFillForegroundColor(IndexedColors.LIGHT_YELLOW.getIndex());
}
style.setFillPattern(XSSFCellStyle.SOLID_FOREGROUND);
return style;
}

/*
* @ Border selection will be same across all 4 cell borders
* @ alignment
*/
private XSSFCellStyle createHeadingCellStyle(XSSFCellStyle style, BorderStyle borderStyle, short horAlign, short verAlign,
boolean wrapText){
style.setBorderBottom(borderStyle);
style.setBorderLeft(borderStyle);
style.setBorderRight(borderStyle);
style.setBorderTop(borderStyle);
style.setAlignment(horAlign);
style.setVerticalAlignment(verAlign);
style.setWrapText(wrapText);
style.setFillForegroundColor(IndexedColors.GREY_25_PERCENT.getIndex());
style.setFillPattern(XSSFCellStyle.SOLID_FOREGROUND);
return style;
}

/*
* Create styles
*
*/
private Map getDefaultStyles(XSSFWorkbook wb){
Map styles = new HashMap();
//XSSFDataFormat fmt = wb.createDataFormat();
XSSFCellStyle headerCellStyle = wb.createCellStyle();
headerCellStyle = createHeadingCellStyle(headerCellStyle,BorderStyle.THIN,
XSSFCellStyle.ALIGN_GENERAL, XSSFCellStyle.VERTICAL_TOP, false);
styles.put("headerRowStyle", headerCellStyle);

XSSFCellStyle dataCellEvenRowStyle = wb.createCellStyle();
dataCellEvenRowStyle = createDataCellStyle(dataCellEvenRowStyle,BorderStyle.THIN,
XSSFCellStyle.ALIGN_GENERAL, XSSFCellStyle.VERTICAL_TOP, true,true);
styles.put("dataRowEvenStyle", dataCellEvenRowStyle);

XSSFCellStyle dataCellOddRowStyle = wb.createCellStyle();
dataCellOddRowStyle = createDataCellStyle(dataCellOddRowStyle,BorderStyle.THIN,
XSSFCellStyle.ALIGN_GENERAL, XSSFCellStyle.VERTICAL_TOP, true,false);
styles.put("dataRowOddStyle", dataCellOddRowStyle);

return styles;
}

/*
*
* @xlsxfilename with or without ".xlsx"
* @LinkedHashMap stores sheetnamea and sheet Obj
* @Map Map containing styles. Can be left NULL
*/

public void init(String xlsxfilename, LinkedHashMap sheets, Map styles) throws NullPointerException, IOException{
logg.info("Start init() for "+xlsxfilename);
this.xlsxFileName = xlsxfilename;
xwb = new XSSFWorkbook();
if ((null == sheets) ||(sheets.isEmpty())) throw new NullPointerException("LinkedHashMap sheets NULL!");


Iterator sheetIter = sheets.keySet().iterator();
while(sheetIter.hasNext()){
String sheetName = (String)sheetIter.next();
XSSFSheet asheet = xwb.createSheet(sheetName);
sheets.put(sheetName, asheet);
logg.info("Adding sheet:"+sheetName);
}
// create template xlsx
FileOutputStream os = new FileOutputStream("template.xlsx");
xwb.write(os);
os.close();

if (null == styles) styles = getDefaultStyles(xwb);

// holds mapping between sheetName and sheetref
sheetSheetRefRelation = new HashMap();

// populate Sheet Data with sheet reference
LinkedHashMap sheetData = new LinkedHashMap();
sheetIter = sheets.keySet().iterator();
while(sheetIter.hasNext()){
String _sheetName = sheetIter.next();
String _sheetNameRef = sheets.get(_sheetName).getPackagePart().getPartName().getName();
// substring to remove leading "/" from /xl/worksheets/sheet1.xml
sheetData.put(_sheetNameRef.substring(1), null);
//substring to remove leading "/" from /xl/worksheets/sheet1.xml
sheetSheetRefRelation.put(_sheetName, _sheetNameRef.substring(1));
//logg.info("sheetSheetRefRelation:"+_sheetName +":"+_sheetNameRef.substring(1));
}

this.sheets = new LinkedHashMap();
this.sheets = sheets;
this.sheetData = new LinkedHashMap();
this.sheetData = sheetData;
logg.info("End init(): sheetSheetRefRelation{}:"+sheetSheetRefRelation);

}

private String checkExceptions(String inpStr){
// if null then return ""
if (null == inpStr) return "";

return inpStr.replace("&", "&");
}

/*
This method does the actual hardwork of writing data as xml.
I have used my styles (color and format).you can ignore this or replace what suits you.
I have added a checkExceptions method which takes care of wrapping xml illegal characters that are not illegal in db.
For instance: if you have "&" in your data and you remove checkExceptions() below...the xlsx document will be generate
....however...the sheet containing this "&" will not be shown up even if the data was present."&" is not acceptable in xml
and since we are dealing with xmls here (zipped into xlsx) we need to take care of wrapping it up...I may not have covered all
illegal patterns...but that can be added into checkExceptions() as required.

*/

private File writeSheet(short headerStartRow, short numOfColumns, LinkedHashMap columnNames, ResultSet data) throws IOException, SQLException{
File tmp = File.createTempFile("sheet", ".xml");
Writer fw = new FileWriter(tmp);
SpreadsheetWriter sw = new SpreadsheetWriter(fw);
sw.beginSheet();

// insert header row
sw.insertRow(headerStartRow);
int styleIndex = -1;
if (null != styles) styleIndex = styles.get("headerRowStyle").getIndex();

Iterator lhm_iter = columnNames.keySet().iterator();
for(int columnIndex = 0; columnIndex <>
String nxtval = (String)lhm_iter.next();
System.out.println(""+columnIndex+":"+ nxtval);
sw.createCell(columnIndex,nxtval, styleIndex);
}
sw.endRow();
// insert header row

// insert data rows
while(data.next()){
sw.insertRow(++headerStartRow);
for(int columnIndex = 0; columnIndex <>
//String nwStr = (String)data.getString(columnIndex+1);
//if (null == nwStr) nwStr = "";
String nwStr = checkExceptions((String)data.getString(columnIndex+1));
//System.out.println("nwStr: "+nwStr);
if ((headerStartRow %2) == 0){// even # row
if (null != styles) styleIndex = styles.get("dataRowEvenStyle").getIndex();
sw.createCell(columnIndex, nwStr, styleIndex);
}else{ // odd # row
if (null != styles) styleIndex = styles.get("dataRowOddStyle").getIndex();
sw.createCell(columnIndex, nwStr,styleIndex);
}
}
sw.endRow();
}
// end insert data rows

sw.endSheet();

fw.close();
return tmp;
}

/*
Fetches

*/
public void importViewData(String viewName, short startSheetRowNumber) throws NullPointerException, SQLException, SheetNotFoundException, IOException{
logg.info("Start:importViewData(): "+viewName);
/// This is my standalone class that returns resultset and does the connection maintenance work.
ExportDBView eview = new ExportDBView(viewName);

//@throws NullPointer and SQLException
ResultSet rs = eview.initExport();
LinkedHashMap columnNames = eview.getResultSetMetaData();
short col_sz = (short)columnNames.size();
logg.info(viewName+": columnsize: "+col_sz);


//@throws IOException
File tmp = writeSheet((short)startSheetRowNumber, col_sz, columnNames, rs);

// fetch sheetref using viewName
String _sheetref = sheetSheetRefRelation.get(viewName);
if (null == _sheetref) {
eview.cleanup();// cleanup anyway
throw new SheetNotFoundException("viewName does not exist as a Sheet in this workbook");
}
// put the sheet data alongside _sheetref
logg.info("Adding sheetdata for sheetref:"+_sheetref);
this.sheetData.put(_sheetref, tmp);
eview.cleanup();
logg.info("End:importViewData(): "+viewName);
}

private static void copyStream(InputStream in, OutputStream out) throws IOException {
byte[] chunk = new byte[1024];
int count;
while ((count = in.read(chunk)) >=0 ) {
out.write(chunk,0,count);
}
}


/*
Adapted Yegor's substitute() method to add more than 1 sheets created dynamically into workbook
*/
public void flushDataIntoWorkbook() throws ZipException, IOException{
logg.info("Start:flushDataIntoWorkbook(): started writing into workbook...");
OutputStream out = new FileOutputStream(this.xlsxFileName);
ZipFile zip = new ZipFile(new File("template.xlsx"));

ZipOutputStream zos = new ZipOutputStream(out);

@SuppressWarnings("unchecked")
Enumeration en = (Enumeration) zip.entries();
while (en.hasMoreElements()) {
ZipEntry ze = en.nextElement();
Iterator sheetIter = this.sheetData.keySet().iterator();
boolean entryMatches = false;
while(sheetIter.hasNext()){
String entry = sheetIter.next();
entryMatches = ze.getName().equals(entry);
if (entryMatches == true) break;
}
if(!entryMatches){
zos.putNextEntry(new ZipEntry(ze.getName()));
InputStream is = zip.getInputStream(ze);
copyStream(is, zos);
is.close();
}
}

// add all sheets
Iterator sheetIter = this.sheetData.keySet().iterator();
//InputStream is = null;
while(sheetIter.hasNext()){
String key = (String)sheetIter.next();
logg.info("Writing sheet: "+key);
zos.putNextEntry(new ZipEntry(key));
InputStream is = new FileInputStream(this.sheetData.get(key));
copyStream(is, zos);
is.close();
zos.closeEntry();
}
zos.close();
logg.info("End:flushDataIntoWorkbook(): Finished writing...cleaning up resources");
}


/*
* Sample test:
* In my set up I have Oracle 9i setup and 3 views as mentioned below. I extract data from these 3 views and
* and write it to excel sheet. Here is a simple stored procedure I use to export any view given the name:
* ///////////////////////ORACLE STORED PROCEDURE/////////////////////////////////////
* CREATE OR REPLACE PROCEDURE "MYSCHEMANAME"."EXPORT_VIEW"(in_table IN VARCHAR2,
p_cursor OUT SYS_REFCURSOR)
is
begin
open p_cursor FOR
'select * from ' || in_table;
end EXPORT_VIEW;

*
* ///////////////////////////////////////////////////////////////////////////////////
* Note !!: MYSCHEMANAME is the default schema when you log in. EXPORT_VIEW is my procedure name.
* ExportView (see imports above) is a standalone class I use to fetch data as resultset. I used ojdbc14.jar
* for JDBC support.
*
*/
public static void main(String []args){
Excel07WorkbookWriter eww = new Excel07WorkbookWriter();
LinkedHashMap sheets = new LinkedHashMap();
sheets.put("REPORT1_VIEW", null);
sheets.put("REPORT2_VIEW", null);
sheets.put("REPORT3_VIEW", null);
try {
eww.init("myxls.xlsx", sheets, null);
} catch (NullPointerException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}

try {
eww.importViewData("REPORT1_VIEW",(short)0);// the view names should match the view names in sheet.put(...) above
eww.importViewData("REPORT2_VIEW",(short)0);
eww.importViewData("REPORT3_VIEW",(short)0);
} catch (NullPointerException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (SQLException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (SheetNotFoundException e) { // Custom Exception class written. It extends Exception
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}

try {
eww.flushDataIntoWorkbook();
} catch (ZipException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}

}
}






7 comments:

SeƱor Pedrito said...

I just write to congratulate you for such a beautiful article.

I´ve been working on it to make a 62000 lines excel, and it worked just fine.


Thanks a lot!!!!

Unknown said...

Hey Senor...thanks a ton!...I do not check my blog often....so sorry for the late reply.

Your comment has given me encouragement to write more about some common issues I face at work and take a lot of time to research via Google.

Well I owe a lot to Yegor Kozlov from Apache foundation ..who wrote that introductory article on which I built upon...and of course Google for making searches so fruitful.

Thanks again..wish you good luck!

~Vikram

Yuvraj said...

Thank a lot for this gr8 article. It is very helpful to me. I was able to create excel file with 100k rows. Now I am facing some issue with formatting. Styles.xml is not generating correct entries. I am using your styles for formatting but my excel throws error "Excel Found unreadable content in 'test.xlsx'. Do you want to recover...." while opening file and when I click "Yes", file opens with no formatting.

I don't what’s wrong with it. I spent 2 days on it but no luck. I tried to post comment on your blog but it I not allowing me.

Your help will be greatly appreciated.

Java Lover said...

Your article is very helpfule for my project. However, i have problem exporting the excel file over SSL. Does anyone know how to fix it?

Thanks a bunch!!!

srk said...

Hi What about xls file.
Will this code work for xls format too

srk said...

Hi What about xls file.
Will this code work for xls format too

Vitasko said...

Cannot see how to write a huge sharedstrings.xml file. Excel 2007+ allows to write a million rows on one sheet, and if that million rows contains text data, you need to save them to sharedstrings.xml checking each cell for uniqueness. How to do this efficiently?