I presented this topic at the LibreOffice Conference 2014. The presentation highlights problems in LibreOffice Writer's interoperability with the OOXML file format (e.g. DOCX files). I also share some tips and tricks for fixing interoperability issues. It should be useful to developers who want to contribute to the LibreOffice open source community.
2. 2
About Me
● Sr. Software Developer at Synerzip Softech India
● About 3 years of experience in C++ and OOXML
● Active contributor to LibreOffice product and community
● Member of TDF.
● Love to play and watch cricket
● Email: Sushil.shinde@synerzip.com
● IRC: #libreoffice-dev chat : sushils_
3. 3
Topics
● Interoperability
● OOXML and ECMA-376
● DOCX File Structure
● Challenges during 'File Import'
– File Crash
– Data Loss
● Challenges during 'File Export'
– File Corruption
– Data Loss
● LibreOffice Hang Issues
● Some Useful Tools
● Examples
4. 4
Interoperability
Many companies, government organizations, and individuals use MS Word file formats.
MS Word formats:
.doc (binary file)
.docx (OOXML file format)
5. 5
OOXML and ECMA-376
● Office Open XML (OOXML)
– Microsoft Office 2007 and later versions (such as 2010 and 2013) use the OOXML format.
● The ECMA-376 Standard
– This Standard defines OOXML's vocabularies and
document representation and packaging details.
– Specifications are freely available on the ECMA
website.
6. 6
DOCX File Structure
Docx File Package
● _rels
● docProps
● word
– _rels: A lookup for each item referenced in the document, header, or footer (e.g. images, sounds, headers, footers).
– document.xml: The text of the document. Contains links to other objects, retrieved via the lookup.
– header[n].xml, footer[n].xml: The text of the document's headers and footers. Also contain references to other objects (e.g. images used in a header or footer).
– styles.xml: Contains the definitions for the set of styles used by the document.
– media: Contains media files such as images, sounds, and video that are referenced in document.xml (e.g. image1.png).
– theme
– charts: Chart data folder (chart[n].xml and chart[n].xml.rels).
– ..
● [Content_Types].xml: Contains MIME type information for the parts of the package.
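Since a .docx file is an ordinary ZIP package, the structure above can be inspected with a few lines of script. The following is a minimal sketch using only the Python standard library; the function names and the tiny sample package are illustrative assumptions, not part of LibreOffice or of the original slides.

```python
import io
import zipfile


def list_parts(docx):
    """Return (part name, size in bytes) for every entry in a .docx package.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as package:
        return [(info.filename, info.file_size) for info in package.infolist()]


def build_sample_package():
    """Build a tiny in-memory package for demonstration (not a valid document)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("_rels/.rels", "<Relationships/>")
        package.writestr("word/document.xml", "<document/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name, size in list_parts(build_sample_package()):
        print(f"{size:8d}  {name}")
```

Passing the path of a real .docx file to `list_parts` lists its parts in the same way.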
7. 7
Challenges In 'File Import'
● LibreOffice crash
● Data loss
● LibreOffice hangs
8. 8
File Import – Crash issues
● Possible reasons:
– Programming mistakes
● Missing null pointer checks
● Memory leaks
– Issues in the import filters
● Some specific combinations of data
9. 9
Analyzing Crash
● Reduce the file to a minimal test case
– Check which MS Office version (2007/2010/2013) was used to create the file
– Use the “divide and conquer” method to cut the file down
– Try to get the file down to 1-2 pages with the minimum data that still reproduces the crash
● Identify the XML part that causes the error
● Try to identify the MS Office feature that causes the error
– If confirmed, try to create a .doc (binary version) file with the same feature and check whether that file works
● Locate the parsing and mapping of the XML elements in the import filters to identify the root cause
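The “divide and conquer” step can be sketched as a simple bisection. This is only an illustration under two assumptions: exactly one element (paragraph, shape, table, ...) triggers the crash, and any subset that contains it still crashes. The `triggers_crash` callback is hypothetical; in practice it would mean saving a reduced file and opening it in LibreOffice.

```python
def find_culprit(items, triggers_crash):
    """Bisect `items` down to the single element that triggers the crash.

    Assumes exactly one element is responsible and that any subset
    containing it still makes `triggers_crash(subset)` return True.
    Returns None when the full list does not reproduce the problem.
    """
    candidates = list(items)
    if not candidates or not triggers_crash(candidates):
        return None
    while len(candidates) > 1:
        middle = len(candidates) // 2
        first_half = candidates[:middle]
        # Keep whichever half still reproduces the problem.
        if triggers_crash(first_half):
            candidates = first_half
        else:
            candidates = candidates[middle:]
    return candidates[0]


if __name__ == "__main__":
    paragraphs = [f"paragraph {i}" for i in range(20)]
    # Stand-in for "save the reduced file and try to open it".
    crashes = lambda subset: "paragraph 13" in subset
    print(find_culprit(paragraphs, crashes))
```

When a crash depends on a combination of elements, this simple halving is not enough and the reduction has to be done more carefully by hand.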
10. 10
Crash - Example
Problematic XML area
fdo#79973
11. 11
Resolving Crash - Example
Code reference : https://gerrit.libreoffice.org/#/c/9840
12. 12
File Import – Types Of Data Loss
● Feature loss (e.g. text, shapes)
● Feature property loss (e.g. colors, line styles)
● Incorrect values (e.g. shape size, position)
13. 13
File Import – Reasons For Data Loss
● MS Office feature is not supported
– Implement feature support
– Or preserve the unsupported data in a grab-bag so that it survives export
● XML Nodes not handled
● XML elements not mapped properly
● Properties lost in shape conversions
(SwXShape → SwXTextFrame)
14. 14
File Import – How To Fix Data Loss
● Check the XML schema of the missing feature
● Check the ECMA-376 specs for the missing properties
● Check that the XML properties are available in model.xml
● Identify the LibreOffice UNO properties for the missing data
– Insert a similar feature in LibreOffice and check which properties represent the missing effects
– Create a .doc file with the same data
– Use the XRAY tool to check the properties
● Locate the handling of those XML properties in dmapper
● Check that the XML values are properly mapped to UNO properties
– Hard-code the UNO properties to verify quickly
15. 15
Data Loss Example - shape
● TextBox Background image loss
Original TextBox fill
LO rendered before FIX
LO rendered after fix
16. 16
Data Loss Example - shape
● Set proper UNO Property
– “FillBitmapURL” property for shape
– “BackGraphicURL” property for TextFrame
● Handle the “BackGraphicURL” property in export when the object is a TextFrame
Code Reference : https://gerrit.libreoffice.org/#/c/7259
17. 17
Data Loss Example - Table
Original table
Auto width
How LO rendered
LO Rendering After Fix LO : Export Before Fix After Fix
18. 18
Data Loss Example - Table
XML Comparison
Original LO Exported this.. Fixed
Code Reference : https://gerrit.libreoffice.org/#/c/7593/
https://gerrit.libreoffice.org/#/c/7594/
19. 19
Challenges In 'File Export'
● MS Office is not able to open the saved file
● Data loss
● LO crash
20. 20
File Export – Types Of Corruptions
● Invalid XML values exported
– XML values are not exported as per ECMA specs
ECMA specs: valid values for rotX are in the range [-90, 90]
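A range check like this is easy to automate on an exported chart part. The sketch below scans chart XML for `c:rotX` elements and reports values outside the range quoted above. The function name is an assumption for illustration, and elements without a `val` attribute are simply skipped.

```python
import xml.etree.ElementTree as ET

CHART_NS = "http://schemas.openxmlformats.org/drawingml/2006/chart"


def out_of_range_rotx(chart_xml, low=-90, high=90):
    """Return every c:rotX value in `chart_xml` that lies outside [low, high]."""
    root = ET.fromstring(chart_xml)
    bad = []
    for element in root.iter("{" + CHART_NS + "}rotX"):
        raw = element.get("val")
        if raw is None:
            continue  # nothing to check on this element
        value = int(raw)
        if not low <= value <= high:
            bad.append(value)
    return bad


if __name__ == "__main__":
    sample = (
        f'<c:chartSpace xmlns:c="{CHART_NS}"><c:chart><c:view3D>'
        '<c:rotX val="120"/></c:view3D></c:chart></c:chartSpace>'
    )
    print(out_of_range_rotx(sample))
```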
21. 21
File Export – Types Of Corruptions
● XML tag mismatch – start and end tags do not match
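Mismatched tags make a part ill-formed, so any XML parser will reject it. A minimal sketch, assuming only the Python standard library, that parses every XML part of an exported package and collects the parser errors; `make_package` is a demo helper, not a real tool.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def make_package(parts):
    """Build an in-memory package from {name: content} (demo helper only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        for name, content in parts.items():
            package.writestr(name, content)
    buffer.seek(0)
    return buffer


def malformed_parts(docx):
    """Return {part name: parser message} for XML parts that are not well-formed."""
    problems = {}
    with zipfile.ZipFile(docx) as package:
        for name in package.namelist():
            if not name.endswith((".xml", ".rels")):
                continue
            try:
                ET.fromstring(package.read(name))
            except ET.ParseError as error:
                problems[name] = str(error)
    return problems


if __name__ == "__main__":
    package = make_package({
        "word/document.xml": "<document><body></body></document>",
        "word/header1.xml": "<hdr><p></hdr>",  # start and end tags do not match
    })
    for name, message in malformed_parts(package).items():
        print(name, "->", message)
```

The parser message includes the line and column of the first error, which helps to locate the export code that wrote the unbalanced tag.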
22. 22
File Export – Types Of Corruptions
● Missing target relationship entry
● Missing relationship file (ex. header.xml.rels)
● Exported 0-byte files (mostly among the images/media folder contents)
A relationship is present in header.xml, but the header.xml.rels file is missing
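These three kinds of corruption can be detected mechanically. The following is a rough sketch, not a full validator: it reports 0-byte entries, parts that use relationship IDs without a readable .rels file, IDs with no entry in that file, and internal targets that are not in the package. The function and helper names are assumptions made for this example.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
DOC_REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def make_package(parts):
    """Build an in-memory package from {name: content} (demo helper only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        for name, content in parts.items():
            package.writestr(name, content)
    buffer.seek(0)
    return buffer


def _parse(package, name):
    """Parse one part; return None if it is not well-formed XML."""
    try:
        return ET.fromstring(package.read(name))
    except ET.ParseError:
        return None


def check_relationships(docx):
    """Return readable messages for broken relationships and 0-byte parts."""
    problems = []
    with zipfile.ZipFile(docx) as package:
        sizes = {info.filename: info.file_size for info in package.infolist()}
        for part, size in sorted(sizes.items()):
            if part.endswith("/"):
                continue  # directory entry
            if size == 0:
                problems.append(f"{part}: 0-byte file")
                continue
            root = _parse(package, part) if part.endswith(".xml") else None
            if root is None:
                continue
            # Collect the values of r:id, r:embed, r:link ... used by this part.
            used = sorted({
                value
                for element in root.iter()
                for key, value in element.attrib.items()
                if key.startswith("{" + DOC_REL_NS + "}")
            })
            if not used:
                continue
            folder, name = posixpath.split(part)
            rels = posixpath.join(folder, "_rels", name + ".rels")
            rels_root = _parse(package, rels) if rels in sizes else None
            if rels_root is None:
                problems.append(f"{part}: uses {used} but {rels} is missing or unreadable")
                continue
            entries = {
                rel.get("Id"): rel
                for rel in rels_root.iter("{" + PKG_REL_NS + "}Relationship")
            }
            for rel_id in used:
                rel = entries.get(rel_id)
                if rel is None:
                    problems.append(f"{part}: {rel_id} has no entry in {rels}")
                elif rel.get("TargetMode") != "External":
                    target = posixpath.normpath(
                        posixpath.join(folder, rel.get("Target", "")))
                    if target.lstrip("/") not in sizes:
                        problems.append(f"{part}: {rel_id} -> {target} not in package")
    return problems
```

An empty list means none of these specific problems were found; it does not prove that MS Office will open the file.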
23. 23
File Export – Types Of Corruptions
● Invalid hierarchy
– Text box exported inside another text box
Easy
Hack
24. 24
File Export – Corruption Issues
MS Office seems to have an internal limit of 4091 styles and refuses to load a “.docx” file with more styles.
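Whether an exported file runs into this limit can be checked by counting the `w:style` elements in styles.xml. A minimal sketch; the 4091 figure is the observation quoted above, not a documented constant, and the function name is an assumption.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
STYLE_LIMIT = 4091  # value observed in the slide above, not taken from a spec


def count_styles(styles_xml):
    """Return the number of w:style elements in the content of styles.xml."""
    root = ET.fromstring(styles_xml)
    return sum(1 for _ in root.iter("{" + W_NS + "}style"))


if __name__ == "__main__":
    styles = "".join(f'<w:style w:styleId="s{i}"/>' for i in range(5000))
    sample = f'<w:styles xmlns:w="{W_NS}">{styles}</w:styles>'
    total = count_styles(sample)
    if total > STYLE_LIMIT:
        print(f"{total} styles: above the observed limit of {STYLE_LIMIT}")
```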
25. 25
Analyzing File Corruption
● Validate the exported docx file
– Use the Open XML SDK tool to validate the file (Windows only)
● Compare the content of the exported file with the original file
– Use the OOXML tool to compare the files
● Check the ECMA specs for the invalid XML property
● Check that relationship IDs are exported properly
– The relationship target is present in the rels XML file
– The target file is available in the exported file
● Search the export files (e.g. docxattributeoutput, docxsdrexport) for the code that writes the invalid XML
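A first, coarse comparison of the original and exported files is to diff their part lists. The sketch below is an illustration only, with assumed function names. Part names can legitimately change during export (for example, renamed media files), so its output is a starting point for inspection rather than a verdict.

```python
import io
import zipfile


def make_package(names):
    """Build an in-memory package containing the given part names (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        for name in names:
            package.writestr(name, "<x/>")
    buffer.seek(0)
    return buffer


def part_names(docx):
    """Return the set of part names in a package, ignoring directory entries."""
    with zipfile.ZipFile(docx) as package:
        return {name for name in package.namelist() if not name.endswith("/")}


def compare_parts(original, exported):
    """Return (parts missing from the export, parts added by the export)."""
    before, after = part_names(original), part_names(exported)
    return sorted(before - after), sorted(after - before)


if __name__ == "__main__":
    original = make_package(
        ["word/document.xml", "word/header1.xml", "word/_rels/header1.xml.rels"])
    exported = make_package(["word/document.xml", "word/header1.xml"])
    missing, added = compare_parts(original, exported)
    print("missing:", missing)
    print("added:", added)
```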
26. 26
File Export – Reasons For Data Loss
● Features that are rendered properly are mostly preserved in export
● Possible reasons for data loss:
– Mapping of UNO properties to OOXML properties
● Invalid data conversion (from an LO property to a valid MSO XML value as per ECMA)
● e.g. rotation angle, dashed borders, etc.
– Required XML part is missing in the exported file
● e.g. fill properties from the shape XML schema
27. 27
File Export - How To Fix Data Loss
● Compare the exported and original files
– Verify that the XML for the missing feature, or for its missing properties, is exported
● Check the export code for the missing XML part
– Search the export classes for the XML token “XML_elementname”, e.g. XML_rot
– Check that the XML parts are written under the right parent elements
28. 28
Data Loss - Example
● Numbered list is not preserved
– Original XML - <w:lvlText w:val="%1" />
– Exported XML - <w:lvlText w:val="" />
Numbering.xml
Original data Before Fix After Fix
Code reference : https://gerrit.libreoffice.org/#/c/8768/
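A regression of this kind can be spotted by listing the levels in numbering.xml whose `w:lvlText` has an empty `w:val`. The following is a small sketch with an assumed function name. An empty value is not always wrong, so the result should be compared against the same check on the original file.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def empty_level_texts(numbering_xml):
    """Return (abstractNumId, ilvl) for each level whose w:lvlText w:val is empty."""
    root = ET.fromstring(numbering_xml)
    found = []
    for abstract in root.iter(W + "abstractNum"):
        for level in abstract.iter(W + "lvl"):
            text = level.find(W + "lvlText")
            if text is not None and text.get(W + "val") == "":
                found.append((abstract.get(W + "abstractNumId"), level.get(W + "ilvl")))
    return found


if __name__ == "__main__":
    sample = (
        '<w:numbering xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        '<w:abstractNum w:abstractNumId="0">'
        '<w:lvl w:ilvl="0"><w:lvlText w:val="%1"/></w:lvl>'
        '<w:lvl w:ilvl="1"><w:lvlText w:val=""/></w:lvl>'
        '</w:abstractNum></w:numbering>'
    )
    print(empty_level_texts(sample))
```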
29. 29
LibreOffice Hang Issues
● LibreOffice hangs while opening/saving a docx file
● Possible reasons:
– Required UNO properties were removed
● PROP_PARA_LINE_SPACING
● Code reference : https://gerrit.libreoffice.org/#/c/9560
– Some required XML attributes are not handled
● Code reference : https://gerrit.libreoffice.org/#/c/8632/
– Memory leaks
● Code Reference : https://gerrit.libreoffice.org/#/c/6850
30. 30
Some Useful Tools
● Xray Tool
● OOXML Tools (Chrome Browser plug-in)
● Open XML SDK Productivity Tool (Windows only)
35. 35
Chart
Wall color
● Wall color was missing from the exported file
Lost
Fixed
36. 36
Chart
Original XML for Chart Wall Color LO : Export before fix Export After Fix
Code References : https://gerrit.libreoffice.org/7739
https://gerrit.libreoffice.org/7792
37. 37
Doughnut chart
Original chart Before fix After fix
Code Reference : https://gerrit.libreoffice.org/#/c/6924
38. 38
Exploded Pie Chart
Original chart Before fix After fix
Code Reference : https://gerrit.libreoffice.org/#/c/6924
41. 41
Smart Art
Image fills in SmartArt are now exported properly.
Original File LO Export : Before Fix After Fix
Code reference : https://gerrit.libreoffice.org/#/c/9121
42. 42
Synerzip's Contribution
● ~250 patches submitted by Synerzip in the last year.
● 50+ scenarios of crash/corruption fixed.
● 270+ bugs filed on Bugzilla.
● 200+ bugs resolved.