Overview of the XML file formats in Office 2010
Published: May 16, 2012
The Open XML file formats simplify the exchange of data between Office applications and enterprise business systems. Based on open standards, these XML file formats enable the rapid creation of documents from different data sources and speed up document assembly, data mining, and content reuse.
The 2007 Office system supports the ECMA-376 Office Open XML Formats standard, which was later submitted to ISO/IEC and was published in late 2008 as the ISO/IEC 29500 Office Open XML Formats standard. Office 2010 provides read support for ECMA-376, read/write support for ISO/IEC 29500 Transitional, and read support for ISO/IEC 29500 Strict.
Documentation for the ISO/IEC 29500 Office Open XML Formats is available from ISO/IEC, and documentation for ECMA-376 is available from Ecma International. For detailed information about how these formats are supported in Office 2010 and the 2007 Office system, see Microsoft Office File Format Documents (http://go.microsoft.com/fwlink/p/?LinkId=191143) on MSDN.
Benefits of the Open XML Formats
The Open XML Formats provide several benefits for developers, IT professionals, and users. These benefits include the following:
Compact file format: Documents are automatically compressed and can be up to 75 percent smaller.
Improved damaged file recovery: Modular data storage enables files to open even if a component within the file, such as a chart or table, is damaged.
Safer documents: Embedded code, such as OLE objects or Microsoft Visual Basic for Applications (VBA) code, is stored in a separate section within the file so that it can easily be identified for special processing. IT administrators can block documents that contain unwanted macros or controls. This helps make documents safer for users when they are opened.
Easier integration: Developers have direct access to specific contents within the file, such as charts, comments, and document metadata.
Transparency and improved information security: Documents can be shared confidentially because personally identifiable information and business-sensitive information, such as user names, comments, tracked changes, and file paths, can easily be identified and removed.
Compatibility: By installing the Microsoft Office Compatibility Pack, users of Microsoft Office 2000, Microsoft Office XP, and Microsoft Office 2003 editions can open, edit, and save documents in one of the new XML formats.
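To illustrate the "Safer documents" benefit, the following minimal sketch checks whether a package contains a VBA project part. It assumes the conventional part name vbaProject.bin that macro-enabled Office files use; the function name contains_vba_project and the demo packages are hypothetical, and the demo packages are placeholders rather than valid Office documents. A production check would more likely inspect the declared content types than rely on part names.

```python
import io
import zipfile


def contains_vba_project(package_bytes: bytes) -> bool:
    """Return True if any part in the ZIP package is named vbaProject.bin."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return any(
            name.lower().endswith("vbaproject.bin") for name in package.namelist()
        )


def make_demo_package(part_names) -> bytes:
    """Build an in-memory ZIP with placeholder parts (not a real document)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        for name in part_names:
            package.writestr(name, b"placeholder")
    return buffer.getvalue()


plain = make_demo_package(["[Content_Types].xml", "word/document.xml"])
with_macros = make_demo_package(
    ["[Content_Types].xml", "word/document.xml", "word/vbaProject.bin"]
)

print(contains_vba_project(plain))        # False
print(contains_vba_project(with_macros))  # True
```

Because the embedded code lives in its own part, the check does not need to parse any document content.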
Structure of the Open XML Formats
The basic structure of the Open XML Formats consists of the following five elements, which are described in more detail in the sections that follow.
Start part: The highest-order part in the hierarchy.
XML parts: Files or folders consisting of XML that make up the content of the file.
Non-XML parts: Parts that are not XML, generally images or OLE objects.
Relationship part: A type of part that generally points to other parts to define the relational hierarchy of the part structure.
ZIP package: Bundles the parts into a single file.
The start part is an XML relationship part that sits at the top of the part hierarchy and determines the file type. For example, if the name of the core container is WordDoc, the file name extension is .docx.
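As a rough illustration of how a tool might locate the start part, the sketch below reads the package-level relationship part and returns the target of the officeDocument relationship. It assumes the Open Packaging Conventions layout (a _rels/.rels part at the package root) and the relationship type URI used by Transitional packages; Strict packages use a different URI. The helper names build_demo_package and find_start_part are hypothetical, and the demo package is a placeholder, not a valid Word document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
# Relationship type for the main document part (Transitional packages).
OFFICE_DOC_REL = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
)

ROOT_RELS = (
    f'<Relationships xmlns="{RELS_NS}">'
    f'<Relationship Id="rId1" Type="{OFFICE_DOC_REL}" Target="word/document.xml"/>'
    "</Relationships>"
)


def build_demo_package() -> bytes:
    """Build a tiny in-memory package with only the parts this sketch needs."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("_rels/.rels", ROOT_RELS)
        package.writestr("word/document.xml", "<placeholder/>")
    return buffer.getvalue()


def find_start_part(package_bytes: bytes):
    """Return the target of the officeDocument relationship, or None."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read("_rels/.rels"))
    for relationship in root.findall(f"{{{RELS_NS}}}Relationship"):
        if relationship.get("Type") == OFFICE_DOC_REL:
            return relationship.get("Target")
    return None


print(find_start_part(build_demo_package()))  # word/document.xml
```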
When an Office XML formatted file is saved in Office 2010 or the 2007 Office system, the file is divided into a set of logical parts that describes the entire file. For Microsoft Word, dividing the file into these parts enables the file to be easily queried or modified outside of the original Office application.
For example, it is easier for a developer to remove document properties from a file because the properties are stored in a single part, which can simply be deleted from the document container. With WordprocessingML (provided as an optional XML file format in Office 2003), removing comments involved parsing the entire file to find and remove the XML that represented each comment. With the new file format, feature-related data is divided into parts: comments, links, headers, footers, and other data are in separate parts that can be removed without parsing the entire Word document.
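The property-removal scenario can be sketched as follows. This is a minimal illustration, not a complete tool: it assumes the document properties live in a part named docProps/core.xml that is referenced only from the package-level relationship part (_rels/.rels) and from [Content_Types].xml. The names remove_part, _drop_children, and build_demo_package are hypothetical. Because Python's zipfile module cannot delete entries in place, the sketch writes a new package that omits the part and its references.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
TYPES_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
CORE_PROPS = "docProps/core.xml"  # assumed location of the properties part


def build_demo_package() -> bytes:
    """Build a placeholder package (not a valid Word document)."""
    types = (
        f'<Types xmlns="{TYPES_NS}">'
        '<Default Extension="xml" ContentType="application/xml"/>'
        '<Override PartName="/docProps/core.xml" ContentType='
        '"application/vnd.openxmlformats-package.core-properties+xml"/>'
        "</Types>"
    )
    rels = (
        f'<Relationships xmlns="{RELS_NS}">'
        '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
        'officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
        '<Relationship Id="rId2" Type="http://schemas.openxmlformats.org/'
        'package/2006/relationships/metadata/core-properties" Target="docProps/core.xml"/>'
        "</Relationships>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", types)
        package.writestr("_rels/.rels", rels)
        package.writestr("word/document.xml", "<placeholder/>")
        package.writestr(CORE_PROPS, "<placeholder/>")
    return buffer.getvalue()


def _drop_children(xml_bytes, namespace, tag, attribute, part_name):
    """Remove child elements whose attribute refers to part_name."""
    ET.register_namespace("", namespace)  # keep the default namespace on output
    root = ET.fromstring(xml_bytes)
    for child in list(root):
        refers = child.get(attribute, "").lstrip("/") == part_name
        if child.tag == f"{{{namespace}}}{tag}" and refers:
            root.remove(child)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def remove_part(package_bytes: bytes, part_name: str) -> bytes:
    """Return a copy of the package without part_name or its references."""
    out = io.BytesIO()
    source = zipfile.ZipFile(io.BytesIO(package_bytes))
    with source, zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as target:
        for item in source.infolist():
            if item.filename == part_name:
                continue  # skip the part itself
            data = source.read(item.filename)
            if item.filename == "_rels/.rels":
                data = _drop_children(data, RELS_NS, "Relationship", "Target", part_name)
            elif item.filename == "[Content_Types].xml":
                data = _drop_children(data, TYPES_NS, "Override", "PartName", part_name)
            target.writestr(item.filename, data)
    return out.getvalue()


cleaned = remove_part(build_demo_package(), CORE_PROPS)
with zipfile.ZipFile(io.BytesIO(cleaned)) as package:
    print(sorted(package.namelist()))
# ['[Content_Types].xml', '_rels/.rels', 'word/document.xml']
```

Parts such as comments are referenced from a part-level relationship file and from the document body, so removing them would need additional steps that this sketch does not cover.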
Non-XML parts are generally images and OLE objects. Any file type that uses binary content or does not use XML is identified as non-XML. A non-XML part is usually a file attached to or embedded within a document. The Word XML format schema documentation explains the literal relationship and schema hierarchy used by Word for files of this type.
A relationship part is an XML part that points to other parts and defines the relational hierarchy of the parts. Most high-level XML parts are relationship parts. XML parts that contain data and do not point to other parts are also known as primitives, and usually have a content type of application/xml.
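To make the content type remark concrete, the sketch below resolves a part's content type. It assumes the Open Packaging Conventions rule that a package declares content types in a [Content_Types].xml part, where an Override entry for a specific part name takes precedence over a Default entry for a file extension. The sample declarations, the part names, and the function content_type_of are illustrative assumptions, not taken from a real document.

```python
import xml.etree.ElementTree as ET

TYPES_NS = "http://schemas.openxmlformats.org/package/2006/content-types"

# Hypothetical content type declarations for a small Word package.
CONTENT_TYPES_XML = f"""<Types xmlns="{TYPES_NS}">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Default Extension="png" ContentType="image/png"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""


def content_type_of(part_name, types_xml=CONTENT_TYPES_XML):
    """Resolve a part's content type: Override first, then Default by extension."""
    root = ET.fromstring(types_xml)
    for override in root.findall(f"{{{TYPES_NS}}}Override"):
        if override.get("PartName") == "/" + part_name:
            return override.get("ContentType")
    extension = part_name.rsplit(".", 1)[-1].lower()
    for default in root.findall(f"{{{TYPES_NS}}}Default"):
        if default.get("Extension", "").lower() == extension:
            return default.get("ContentType")
    return None


for name in ["_rels/.rels", "word/document.xml", "customXml/item1.xml",
             "word/media/image1.png"]:
    print(name, "->", content_type_of(name))
```

In this sample, the plain data part customXml/item1.xml falls through to the Default entry and resolves to application/xml, while the main document part gets its more specific type from the Override entry.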
Using a ZIP package provides the following benefits in all applications:
Open standard: The ZIP compression algorithm is a well-defined open standard.
Reduced file size: Files are generally smaller than an equivalent binary file. On average, Word Open XML files are 75 percent smaller than their binary counterparts, depending on the number of images.
Increased robustness: Files are more robust and less sensitive to potential errors in the file. Earlier binary formats required a file to be completely intact to function correctly.
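The effect of ZIP compression on XML markup can be seen in a small experiment. The sketch below stores the same synthetic XML payload once uncompressed and once with the Deflate method, then compares the resulting package sizes. The payload is an artificial, highly repetitive assumption, so it compresses far better than typical documents; the sketch shows the mechanism only and does not reproduce the 75 percent figure. The function name packaged_size is hypothetical.

```python
import io
import zipfile

# Synthetic, highly repetitive markup standing in for a document part.
xml_payload = (
    "<w:p><w:r><w:t>Sample paragraph text</w:t></w:r></w:p>" * 500
).encode("utf-8")


def packaged_size(data: bytes, compression: int) -> int:
    """Return the size in bytes of a one-part ZIP package holding data."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression) as package:
        package.writestr("word/document.xml", data)
    return len(buffer.getvalue())


stored = packaged_size(xml_payload, zipfile.ZIP_STORED)
deflated = packaged_size(xml_payload, zipfile.ZIP_DEFLATED)

print(f"stored:   {stored} bytes")
print(f"deflated: {deflated} bytes")
```

Images that are already compressed, such as JPEG or PNG files, gain little from Deflate, which is consistent with the size reduction depending on the number of images.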
Although a ZIP package is a binary file, the WinFX application programming interface (API) set, later released as the .NET Framework 3.0, provides native support for the package format in the System.IO.Packaging namespace. This enables developers to create tools that process the format and work directly against the logical model (the parts) without having to consider expansion or compression of the package.
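The same idea of working against parts rather than raw compressed bytes can be sketched outside .NET. The example below is a Python analog, not the System.IO.Packaging API: it uses the standard zipfile module to read one part and query its XML, and the library handles decompression. The demo document, the helper names build_demo_package and paragraph_texts, and the assumption that the main part is named word/document.xml are illustrative.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# A tiny WordprocessingML body used only for this demonstration.
DOCUMENT_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>First paragraph.</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second </w:t></w:r><w:r><w:t>paragraph.</w:t></w:r></w:p>"
    "</w:body></w:document>"
)


def build_demo_package() -> bytes:
    """Build an in-memory package holding only the main document part."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("word/document.xml", DOCUMENT_XML)
    return buffer.getvalue()


def paragraph_texts(package_bytes: bytes, part_name: str = "word/document.xml"):
    """Return the text of each paragraph in the given part."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        # read() returns the expanded bytes; no manual decompression is needed.
        root = ET.fromstring(package.read(part_name))
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        texts = paragraph.iter(f"{{{W_NS}}}t")
        paragraphs.append("".join(text.text or "" for text in texts))
    return paragraphs


print(paragraph_texts(build_demo_package()))
# ['First paragraph.', 'Second paragraph.']
```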