JavaRanch Home
 

Accessing File Formats   



How do I access the XYZ file format in Java?

Specifications for many file formats can be found at Wotsit.

Marco Schmidt maintains several very useful lists of links about processing a multitude of document formats.

An interesting article about Microsoft's binary file formats, especially DOC and XLS, is "Why are the Microsoft Office file formats so complicated? (And some workarounds)". It also mentions some alternatives to dealing with those formats directly.

Access

  • JDBC/ODBC bridge - JDBC driver for ODBC databases, comes as part of the JDK
  • Jackcess - library to read and write MDB files
  • HXTT Access - commercial pure Java JDBC driver for MS Access

CGM

  • cgmva - an applet to display CGM files; comes with source code

Excel

  • Ostermiller Utils, CSVObjects and CSVBeans - libraries to read and write CSV files. CSV is harder to read and write than it first looks: once the special cases are handled (quoted fields, commas and line breaks inside fields, escaped quotes), one might as well use a library.
  • POI - library to read and write XLS files
  • JExcelAPI - library to read and write XLS files
  • Opinions on jExcelApi vs. POI: here and here
  • jXLS - library for writing XLS files based on templates
  • Java2Excel - library for creating Excel files based on Collections
  • It is possible to use JDBC to read Excel files
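
To illustrate why CSV has more special cases than it first appears, here is a minimal sketch of a splitter for a single CSV record, using only the JDK. The class and method names are hypothetical and not taken from any of the libraries listed above. It handles quoted fields, commas inside quotes and doubled quotes, but it does not handle line breaks inside quoted fields or other separators, which is one reason a library is usually the better choice.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of a CSV record splitter; names are hypothetical. */
class CsvLineSketch {

    /** Splits one CSV record into fields, honoring double-quoted fields. */
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote stands for a literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas inside quotes are data
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        // Three fields: a number, a name containing a comma, text containing quotes
        for (String field : parseLine("1,\"Smith, John\",\"said \"\"hi\"\"\"")) {
            System.out.println("[" + field + "]");
        }
    }
}
```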

HDF (Hierarchical Data Format)

Image files

Matlab

OpenDocument (ODF)

  • basic Java code for reading ODF files is here
  • Odf4j is a Java library for accessing ODF files. It is "currently incubating", and has a blog.
  • Another project also called odf4j is at SourceForge, but nothing much seems to be happening there.
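
An ODF document is a ZIP archive whose main content is stored in an entry named content.xml, so a basic reader can be written with the JDK alone. The following is a minimal sketch under that assumption, not the code linked above; the class and method names are hypothetical. It returns the text of each text:p paragraph element and ignores headings, styles and everything else.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Minimal sketch of reading paragraph text from an ODF file; names are hypothetical. */
class OdfTextSketch {

    // Namespace of the ODF "text" elements such as text:p
    private static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    /** Returns the text of every text:p element found in content.xml. */
    static List<String> paragraphs(InputStream odfStream) throws Exception {
        ZipInputStream zip = new ZipInputStream(odfStream);
        try {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"content.xml".equals(entry.getName())) {
                    continue;
                }
                // Copy the entry into memory so the XML parser cannot close the ZIP stream
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                Document doc = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(buffer.toByteArray()));
                NodeList nodes = doc.getElementsByTagNameNS(TEXT_NS, "p");
                List<String> result = new ArrayList<String>();
                for (int i = 0; i < nodes.getLength(); i++) {
                    result.add(nodes.item(i).getTextContent());
                }
                return result;
            }
        } finally {
            zip.close();
        }
        throw new IOException("no content.xml entry found");
    }
}
```

A caller would typically pass a FileInputStream opened on an .odt file. For anything beyond plain text extraction, one of the libraries listed above is likely the better starting point.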

Office Open XML

  • These are the new XML-based Microsoft Office formats.
  • OpenXML4J
  • docx4j - create and edit docx documents using a JAXB content model matching the WordML schema
  • Apache POI has some prerelease code in its scratchpad source code area

OpenOffice Java API

  • OpenOffice can read a number of file formats, and makes them accessible through its API. A starting point might be this article and of course the OO developer site
  • Some introductory information about the OO file format can be found here and here
  • oooview is an OO Viewer written in Java.
  • JODConverter is a Java library that uses the OO Java API to perform document conversions between any formats supported by OO

Outlook MSG

  • The Jakarta POI project developed some code that can read the textual contents of Outlook's MSG files. This page talks about that.

PDF

  • PDF is a hard format to read. Usually the best one can do is to extract the text contained in a PDF file.
  • iText - library to create PDFs
  • FOP - library to create PDFs (and other formats) from XML by using XSL-FO transformations
  • FlyingSaucer - library to convert CSS-styled XHTML to PDF
  • PDFBox - library to create PDFs; can also extract text
  • JPedal - library to extract text from PDFs
  • PDFTextStream - commercial library to extract text from PDFs
  • Adobe AcrobatViewer for JavaBean - freeware library to display and print PDFs (introductory article). This library hasn't been updated in a long time and has problems displaying files that were created with recent PDF versions.
  • PDF Renderer is a more up-to-date PDF viewer that renders using Java2D.

PowerPoint

  • The Jakarta POI project developed some code that can open and (to a limited extent) edit PPT files. This page talks about it, and the code is also part of the POI 3.0 release.

Project

  • The MPXJ library can work with several Project file formats.

QIF (used by Microsoft Money and Quicken)

  • Buddi and Eurobudget are Java applications that can import and export QIF files (and thus contain code you may be able to use in your application). Both are licensed under the GPL.

RTF

  • iText - library to create RTFs
  • JavaCC - a parser generator for which an RTF grammar is available. An RTF reader can be constructed from that grammar.

Visio

  • The Jakarta POI project developed some code that can read Visio files. This page talks about that.

Word

  • POI - library to read and write DOC files. Note that according to the POI-Dev mailing list this is unmaintained code: if it works for you, great; if not, it will likely not be fixed soon. Some limited progress has been made with the POI 3.0 alpha, and it can be used for extracting the text of a document.
  • docx4j - for docx files (as opposed to doc files)
  • WordApi.exe is a native Windows component with a Java interface, which lets you create Word documents and alter Word templates. Some impressions about it can be found here.

Something else?

If you encounter an obscure format for which no library is available, it may be feasible to create a reader for it if you have a file format description (which may be available on Wotsit, see link above). Several tools, so-called lexer and parser generators, are available that help in creating a reader, especially if the file format is ASCII rather than binary. You will need knowledge of regular expressions, though. Some file formats that have been tackled using this approach include RTF, CSV, HPGL and PBM/PGM/PPM. Lexers are easier to start with, but parsers can do more of the work for you. All these have ready-to-use examples on their web sites.
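
As a rough illustration of the lexer approach for an ASCII format, the following sketch reads a plain-text PGM image (the "P2" variant of the PBM/PGM/PPM family mentioned above). Instead of one of the lexer tools, it uses java.io.StreamTokenizer from the JDK as a stand-in tokenizer. The class and method names are hypothetical, and the sketch assumes a well-formed file: magic word, width, height, maximum gray value, then the pixel values, with '#' starting a comment.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;

/** Minimal sketch of a reader for ASCII PGM ("P2") images; names are hypothetical. */
class PgmReaderSketch {

    /** Returns the gray values as pixels[row][column]. */
    static int[][] read(Reader in) throws IOException {
        StreamTokenizer tok = new StreamTokenizer(in);
        tok.commentChar('#'); // PGM comments run from '#' to the end of the line

        // The magic word "P2" arrives as a single word token
        if (tok.nextToken() != StreamTokenizer.TT_WORD || !"P2".equals(tok.sval)) {
            throw new IOException("not an ASCII PGM (P2) file");
        }
        int width = nextInt(tok);
        int height = nextInt(tok);
        int maxValue = nextInt(tok);

        int[][] pixels = new int[height][width];
        for (int row = 0; row < height; row++) {
            for (int col = 0; col < width; col++) {
                int value = nextInt(tok);
                if (value < 0 || value > maxValue) {
                    throw new IOException("pixel value out of range at line " + tok.lineno());
                }
                pixels[row][col] = value;
            }
        }
        return pixels;
    }

    /** Reads the next token and insists that it is a number. */
    private static int nextInt(StreamTokenizer tok) throws IOException {
        if (tok.nextToken() != StreamTokenizer.TT_NUMBER) {
            throw new IOException("expected a number at line " + tok.lineno());
        }
        return (int) tok.nval;
    }

    public static void main(String[] args) throws IOException {
        String sample = "P2\n# tiny test image\n3 2\n255\n0 10 20\n30 40 255\n";
        int[][] pixels = read(new StringReader(sample));
        System.out.println(pixels.length + " rows, " + pixels[0].length + " columns");
    }
}
```

For formats with nested structure, such as RTF, a tokenizer like this only covers the first step; a parser built on top of the tokens does the rest.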



CategoryHowTo
Last Edited: 04 May 2008
 
Copyright © 1998-2008 Paul Wheaton