Overview of the XML file formats in the 2007 Office system
Updated: January 15, 2009
Applies To: Office Resource Kit
Topic Last Modified: 2009-01-07
The 2007 Microsoft Office system introduces new XML file formats that are robust and based on open standards. The new XML file formats enable rapid creation of documents from disparate data sources, accelerating document assembly, data mining, and content reuse. The formats simplify exchanging data between applications in the 2007 Office system and enterprise business systems.
You can create a document in the new XML formats with any standard tool and technology—the 2007 Office system is not required. Users can improve productivity by publishing, searching, and reusing information more quickly and accurately in the environment they choose.
The new XML formats are based on industry-standard XML and ZIP technologies, support full integration by any technology provider, and are available via a royalty-free license. The XML file format specification will be published and made available under the same royalty-free license that exists for the Microsoft Office 2003 Reference Schemas, and is openly offered and available for broad industry use.
The new XML formats introduce a number of benefits for developers, IT professionals, and users. These benefits include:
Compact file format. Documents are automatically compressed and can be up to 75 percent smaller than their binary equivalents.
Improved damaged file recovery. Modular data storage enables files to open even if a component within the file, such as a chart or table, is damaged.
Safer documents. Embedded code, such as OLE objects or Microsoft Visual Basic for Applications (VBA) code, is stored in a separate section within the file so that it is easily identified for special processing. IT administrators can block documents that contain unwanted macros or controls, making documents safer for users to open.
Easier integration. Developers have direct access to specific contents within the file, such as charts, comments, and document metadata.
Transparency and improved information security. Documents can be shared confidentially because personally identifiable information and business-sensitive information, such as user names, comments, tracked changes, and file paths, are easily identified and removed.
Compatibility. By installing a simple update, users of Microsoft Office 2000, Microsoft Office XP, and Office 2003 editions can open, edit, and save documents in one of the new XML formats.
The basic structure of all XML formats in the 2007 Office system consists of five elements:
Start part. The highest order part in the hierarchy.
XML parts. Files or folders consisting of XML that comprise the content of the file.
Non-XML parts. Parts that are not XML and generally are either images or OLE objects.
Relationship part. A type of part that generally points to other parts to define the relational hierarchy of the part structure.
ZIP package. Bundles parts into a single file.
The start part is an XML relationship part that sits at the top of the part hierarchy, and it determines the file type. For example, if the name of the core container is WordDoc, the file name extension is .docx.
When an Office XML formatted file is saved in the 2007 Office system, the file is divided into a set of logical parts that describes the entire file. For Office Word 2007, dividing the file into these parts enables the file to be easily queried or modified outside of the original Office application.
For example, it is easier for a developer to remove document properties from a file because the properties are placed in a single part, which can be deleted from the document container. With WordprocessingML (provided as an optional XML file format in Microsoft Office 2003), removing comments involved parsing the entire file to find and remove the XML representing the contents of each comment. With the new file format, feature-related data is divided into parts. Comments, links, headers, footers, and other data are in separate parts that can be removed individually, so you do not need to parse the entire Word document.
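As a rough illustration of removing a single part, the following Python sketch rewrites a ZIP package while leaving one entry out. It is not taken from the Office documentation: the helper names (remove_part, make_toy_package) and the part names in the toy package are hypothetical, and the package is a small in-memory stand-in rather than a real .docx file. A real tool would probably also need to remove any relationship or content-type entries that refer to the deleted part; that step is omitted here.

```python
import io
import zipfile


def remove_part(package_bytes: bytes, part_name: str) -> bytes:
    """Return a copy of the package with one ZIP entry left out.

    Only the named entry is dropped; references to it in other parts
    are NOT cleaned up in this sketch.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.filename != part_name:
                # Copy every other part unchanged into the new package.
                dst.writestr(info.filename, src.read(info.filename))
    return out.getvalue()


def make_toy_package() -> bytes:
    """Build a tiny stand-in package (hypothetical part names and content)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("word/document.xml", "<document/>")
        z.writestr("word/comments.xml", "<comments/>")
        z.writestr("docProps/core.xml", "<coreProperties/>")
    return buf.getvalue()


package = make_toy_package()
stripped = remove_part(package, "docProps/core.xml")
with zipfile.ZipFile(io.BytesIO(stripped)) as z:
    remaining = z.namelist()
print(remaining)
```

The point of the sketch is that no XML is parsed at all: the properties part is located and dropped by name, and the other parts are copied without being inspected.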
Non-XML parts are generally images and OLE objects. Any file type that uses binary content or does not use XML is identified as non-XML. A non-XML part is most commonly a file attached to or embedded within a document. The Office Word 2007 XML format schema documentation explains the literal relationship and schema hierarchy used by Word for files of this type.
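The split between XML and non-XML parts can be approximated by walking the entries of a package. The Python sketch below sorts entries by file name extension, which is only a heuristic chosen for illustration; the content types declared inside the package would be the more reliable source. The function name classify_parts and the part names in the toy package are hypothetical.

```python
import io
import zipfile

# Assumption for this sketch: these extensions mark XML parts.
XML_SUFFIXES = (".xml", ".rels")


def classify_parts(package_bytes: bytes):
    """Split the entries of a ZIP package into XML and non-XML part names."""
    xml_parts, non_xml_parts = [], []
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as z:
        for name in z.namelist():
            if name.endswith("/"):
                continue  # skip directory entries, which carry no content
            if name.lower().endswith(XML_SUFFIXES):
                xml_parts.append(name)
            else:
                non_xml_parts.append(name)
    return xml_parts, non_xml_parts


# Toy package with hypothetical part names and placeholder content.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("word/document.xml", "<document/>")
    z.writestr("word/_rels/document.xml.rels", "<Relationships/>")
    z.writestr("word/media/image1.png", b"\x89PNG placeholder bytes")
    z.writestr("word/embeddings/oleObject1.bin", b"\x00\x01\x02")

xml_parts, non_xml_parts = classify_parts(buf.getvalue())
print("XML parts:", xml_parts)
print("Non-XML parts:", non_xml_parts)
```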
A relationship part is an XML part that points to other parts and defines the relational hierarchy of the parts. Most high-level XML parts are relationship parts. XML parts that contain data and do not point to other parts are also referred to as primitives, and usually have a content type of application/xml.
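To make the idea of a relationship part concrete, the following Python sketch reads one and lists the parts it points to. Several details are assumptions that do not come from this topic: the part name _rels/.rels and the relationships namespace follow the Open Packaging Conventions as generally documented, the relationship Type values are made-up example.com placeholders, and the function name read_relationships is hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Assumed namespace for relationship parts (Open Packaging Conventions).
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def read_relationships(package_bytes: bytes, rels_part: str = "_rels/.rels"):
    """Return (Id, Type, Target) tuples from one relationship part."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as z:
        root = ET.fromstring(z.read(rels_part))
    return [
        (rel.get("Id"), rel.get("Type"), rel.get("Target"))
        for rel in root.findall(f"{{{RELS_NS}}}Relationship")
    ]


# Toy relationship part; the Type URIs are placeholders, not real values.
TOY_RELS = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    f'<Relationships xmlns="{RELS_NS}">'
    '<Relationship Id="rId1" Type="http://example.com/rel/main"'
    ' Target="word/document.xml"/>'
    '<Relationship Id="rId2" Type="http://example.com/rel/props"'
    ' Target="docProps/core.xml"/>'
    '</Relationships>'
)

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("_rels/.rels", TOY_RELS)
    z.writestr("word/document.xml", "<document/>")
    z.writestr("docProps/core.xml", "<coreProperties/>")

relationships = read_relationships(buf.getvalue())
for rel_id, rel_type, target in relationships:
    print(f"{rel_id}: {target} ({rel_type})")
```

Following the Target values from one relationship part to the next is how a tool could walk the part hierarchy without knowing the folder layout in advance.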
Using a ZIP package provides the following benefits in all applications:
Open standard. ZIP is a well-defined open standard for compressed archives.
Reduced file size. Files are generally smaller than an equivalent binary file. On average, Office Word 2007 files are 75 percent smaller than their binary counterparts, depending on the number of images.
Increased robustness. Files are more robust and less sensitive to potential errors in the file. Files in the previous binary formats had to be completely intact to function correctly.
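The size reduction can be observed per part, because a ZIP archive records both the original and the compressed size of each entry. The Python sketch below reports both figures for a toy package. The function name compression_report and the sample content are hypothetical, and the highly repetitive sample text compresses far better than a typical document would, so the resulting ratio says nothing about the figures quoted above.

```python
import io
import zipfile


def compression_report(package_bytes: bytes):
    """Map each part name to (uncompressed size, compressed size) in bytes."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as z:
        return {
            info.filename: (info.file_size, info.compress_size)
            for info in z.infolist()
        }


# Toy package: one XML part made of a repeated paragraph (placeholder text).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("word/document.xml", "<p>Hello, world.</p>" * 500)

report = compression_report(buf.getvalue())
for name, (raw, packed) in report.items():
    print(f"{name}: {raw} bytes -> {packed} bytes "
          f"({1 - packed / raw:.0%} smaller)")
```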
Although use of a ZIP package means the file is binary, the WinFX application programming interface (API) set provides native support for the package format in the System.IO.Packaging namespace. This enables developers to create tools that process the format and work directly against the logical model (the parts) without having to consider expansion or compression of the package.