Sviatoslav Razmyslov

Nov 27 2020

Tags:

#CSharp

Analyzing the code quality of Microsoft's Open XML SDK

Nov 27 2020

Author: Sviatoslav Razmyslov

Introduction
Why still Word API and not Open XML SDK?
Two values of string
On code compactness
The infamous always true/false
Other warnings
Conclusion

My first encounter with Open XML SDK took place when I was looking for a library that I could use to create some accounting documents in Word. After more than 7 years of working with Word API, I wanted to try something new and easier-to-use. That's how I learned that Microsoft offered an alternative solution. As tradition has it, before our team adopts any program or library, we check them with the PVS-Studio analyzer.

Introduction

Office Open XML, also known as OpenXML or OOXML, is an XML-based format for representing office documents, including text documents, spreadsheets, presentations, as well as charts, figures, and other types of graphical content. The specification was developed by Microsoft and approved by ECMA International in 2006. In June 2014, Microsoft released Open XML SDK as an open-source project. The source files are currently available on GitHub under the MIT license.

I scanned the library's source code with the static analyzer PVS-Studio. This is a tool for detecting software bugs and potential vulnerabilities in the source code of programs in C, C++, C#, and Java. The analyzer runs on 64-bit Windows, Linux, and macOS.

The project is fairly small, so the number of warnings is small too. But they were prominent enough to inspire my choice of the image for this post. You see, there are too many useless conditional statements in this project. I believe that refactoring all such spots would help make the code much shorter and therefore clearer.

Why still Word API and not Open XML SDK?

As you have guessed from this title, I'm still using Word API in my work. There are a lot of downsides to this approach:

The API is old and cumbersome;
You have to have Microsoft Office installed on your computer;
You have to ship the distribution with the Office libraries included;
Word API's operation depends on the system's locale settings;
Low performance.

There's a funny story concerning the locale in particular. Windows provides a dozen of regional settings. We found that one of our servers was for some reason using a mishmash of the USA and UK locales, which caused our Word documents to substitute the ruble sign for the dollar sign, while the pound sign wasn't displayed at all. We solved the problem by tweaking the system's settings.

Now as I'm telling you all this, I'm once again asking myself why I keep using it....

But no, I still like Word API more, and I'll tell you why.

Here's what OOXML format looks like:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<w:document ....>
  <w:body>
    <w:p w:rsidR="00E22EB6"
         w:rsidRDefault="00E22EB6">
      <w:r>
        <w:t>This is a paragraph.</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="00E22EB6"
         w:rsidRDefault="00E22EB6">
      <w:r>
        <w:t>This is another paragraph.</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:document>

Here, <w:r> (Word Run) is not a sentence or even a word – it's any block of text whose attributes are different from those of adjacent blocks.

This is programmed through code that looks something like this:

Paragraph para = body.AppendChild(new Paragraph());
Run run = para.AppendChild(new Run());
run.AppendChild(new Text(txt));

A document has its own special inner structure, and the same elements must be created in the code. In my opinion, the abstraction level of data access in Open XML SDK isn't deep enough. Creating a document using Word API is more comprehensible and takes less time – especially when you deal with spreadsheets and other complex data structures.

On the other hand, Open XML SDK helps solve a wide range of tasks. It can be used to create not only Word documents but Excel and PowerPoint documents as well. This library might well be a more preferable choice for some tasks, but I've decided to stick with Word API for now. We can't abandon Word altogether anyway since we are developing a plugin for Word for our corporate needs, and this task can be accomplished only using the Word API.

Two values of string

V3008 The '_rawOuterXml' variable is assigned values twice successively. Perhaps this is a mistake. Check lines: 164, 161. OpenXmlElement.cs 164

internal string RawOuterXml
{
    get => _rawOuterXml;

    set
    {
        if (string.IsNullOrEmpty(value))
        {
            _rawOuterXml = string.Empty;
        }

        _rawOuterXml = value;
    }
}

The string type can have two types of values: null and a text value. To use the latter is definitely a safer approach, but either is acceptable. In this particular project, the null value can't be used and the programmer overwrites it with string.Empty... at least, that was the idea. There's a mistake in RawOuterXml that makes it possible to assign the value null to the field and then get a NullReferenceException when attempting to access it.

V3022 Expression 'namespaceUri != null' is always true. OpenXmlElement.cs 497

public OpenXmlAttribute GetAttribute(string localName, string namespaceUri)
{
    ....
    if (namespaceUri == null)
    {
        // treat null string as empty.
        namespaceUri = string.Empty;
    }
    ....
    if (HasAttributes)
    {
        if (namespaceUri != null)  // <=
        {
            ....
        }
        ....
    }
    ....
}

The same approach is used in this snippet as well. It's not a severe mistake, but you can still smell the bad refactoring. I'm almost sure one of the checks can be safely removed – that would make the code narrower and therefore easier to read.

On code compactness

V3009 It's odd that this method always returns one and the same value of '".xml"'. CustomXmlPartTypeInfo.cs 31

internal static string GetTargetExtension(CustomXmlPartType partType)
{
    switch (partType)
    {
        case CustomXmlPartType.AdditionalCharacteristics:
            return ".xml";

        case CustomXmlPartType.Bibliography:
            return ".xml";

        case CustomXmlPartType.CustomXml:
            return ".xml";

        case CustomXmlPartType.InkContent:
            return ".xml";

        default:
            return ".xml";
    }
}

I'm not sure if the programmer made some typo or simply wrote what they believed to be "neat" code. If you ask me, it doesn't make much sense to return so many similar values and the code can be simplified quite a bit.

It's not the only warning of this type. Here are two more:

V3009 It's odd that this method always returns one and the same value of '".xml"'. CustomPropertyPartTypeInfo.cs 25
V3009 It's odd that this method always returns one and the same value of '".bin"'. EmbeddedControlPersistenceBinaryDataPartTypeInfo.cs 22

I wonder how the programmer would explain their decision to write the code that way.

V3139 Two or more case-branches perform the same actions. OpenXmlPartReader.cs 560

private void InnerSkip()
{
    Debug.Assert(_xmlReader != null);

    switch (_elementState)
    {
        case ElementState.Null:
            ThrowIfNull();
            break;

        case ElementState.EOF:
            return;

        case ElementState.Start:
            _xmlReader.Skip();
            _elementStack.Pop();
            GetElementInformation();
            return;

        case ElementState.End:
        case ElementState.MiscNode:
            // cursor is end element, pop stack
            _xmlReader.Skip();
            _elementStack.Pop();
            GetElementInformation();
            return;
        ....
    }
    ....
}

This snippet is less controversial than the previous one. I think the identical cases can be merged to make the code shorter and clearer.

Here are a few more issues of that kind:

V3139 Two or more case-branches perform the same actions. OpenXmlMiscNode.cs 312
V3139 Two or more case-branches perform the same actions. CustomPropertyPartTypeInfo.cs 30
V3139 Two or more case-branches perform the same actions. CustomXmlPartTypeInfo.cs 15
V3139 Two or more case-branches perform the same actions. OpenXmlElement.cs 1803

The infamous always true/false

We have finally reached the section covering examples that determined my choice of the picture for this article.

Warning 1

V3022 Expression 'Complete()' is always false. ParticleCollection.cs 243

private bool IsComplete => Current is null ||
                           Current == _collection._element.FirstChild;

public bool MoveNext()
{
    ....
    if (IsComplete)
    {
        return Complete();
    }

    if (....)
    {
        return Complete();
    }

    return IsComplete ? Complete() : true;
}

The IsComplete property is used twice, and it's clear from the code that the property's value won't change between the two checks. It means you can have the function simply return the second value of the ternary operator, i.e. true.

Warning 2

V3022 Expression '_elementStack.Count > 0' is always true. OpenXmlDomReader.cs 501

private readonly Stack<OpenXmlElement> _elementStack;

private bool MoveToNextSibling()
{
    ....
    if (_elementStack.Count == 0)
    {
        _elementState = ElementState.EOF;
        return false;
    }
    ....
    if (_elementStack.Count > 0) // <=
    {
        _elementState = ElementState.End;
    }
    else
    {
        // no more element, EOF
        _elementState = ElementState.EOF;
    }
    ....
}

If the number of elements on the _elementStack stack is different from 0, then it's obviously larger than 0. It means the code can be made at least 8 lines shorter.

Warning 3

V3022 Expression 'rootElement == null' is always false. OpenXmlPartReader.cs 746

private static OpenXmlElement CreateElement(string namespaceUri, string name)
{
    if (string.IsNullOrEmpty(name))
    {
        throw new ArgumentException(....);
    }

    if (NamespaceIdMap.TryGetNamespaceId(namespaceUri, out byte nsId)
        && ElementLookup.Parts.Create(nsId, name) is OpenXmlElement element)
    {
        return element;
    }

    return new OpenXmlUnknownElement();
}

private bool ReadRoot()
{
  ....
  var rootElement = CreateElement(....);

  if (rootElement == null) // <=
  {
      throw new InvalidDataException(....);
  }
  ....
}

The CreateElement function can't return null. If the company has adopted the rule that xml nods be created using methods that either return a valid object or throw an exception, users that employ these methods don't have to overuse additional checks.

Warning 4

V3022 Expression 'nameProvider' is always not null. The operator '?.' is excessive. OpenXmlSimpleTypeExtensions.cs 50

public static XmlQualifiedName GetSimpleTypeQualifiedName(....)
{
    foreach (var validator in validators)
    {
        if (validator is INameProvider nameProvider &&
            nameProvider?.QName is XmlQualifiedName qname) // <=
        {
            return qname;
        }
    }

    return type.GetSimpleTypeQualifiedName();
}

Here is the pattern of the is operator:

expr is type varname

If the result of the is expression is true, a nonnull reference will be written in varname. So its additional check for null is redundant.

Warning 5

V3022 Expression 'extension == ".xlsx" || extension == ".xlsm"' is always false. PresentationDocument.cs 246

public static PresentationDocument CreateFromTemplate(string path)
{
    ....
    string extension = Path.GetExtension(path);
    if (extension != ".pptx" && extension != ".pptm" &&
        extension != ".potx" && extension != ".potm")
    {
        throw new ArgumentException("...." + path, nameof(path));
    }

    using (PresentationDocument template = PresentationDocument.Open(....)
    {
        PresentationDocument document = (PresentationDocument)template.Clone();

        if (extension == ".xlsx" || extension == ".xlsm")
        {
            return document;
        }
        ....
    }
    ....
}

This is quite an interesting case. The programmer first filters off all documents whose extensions are different from .pptx, .pptm, .potx, and .potm, and then – just in case – decides to make sure there are no .xlsx and .xlsm documents left among those. The PresentationDocument function is definitely a victim of refactoring.

Warning 6

V3022 Expression 'OpenSettings.MarkupCompatibilityProcessSettings == null' is always false. OpenXmlPackage.cs 661

public MarkupCompatibilityProcessSettings MarkupCompatibilityProcessSettings
{
    get
    {
        if (_mcSettings is null)
        {
            _mcSettings = new MarkupCompatibilityProcessSettings(....);
        }

        return _mcSettings;
    }

    set
    {
        _mcSettings = value;
    }
}

public MarkupCompatibilityProcessSettings MarkupCompatibilityProcessSettings
{
    get
    {
        if (OpenSettings.MarkupCompatibilityProcessSettings == null) // <=
        {
            return new MarkupCompatibilityProcessSettings(....);
        }
        else
        {
            return OpenSettings.MarkupCompatibilityProcessSettings;
        }
    }
}

The MarkupCompatibilityProcessSettings property never returns null. If the getter finds that the class's field has the null value, the object will be overwritten with a new one. Also, note that this is not a recursive call of one and the same property but rather properties of the same name from different classes. This confusion may have caused the developer to add the extra checks.

Other warnings

Warning 1

V3080 Possible null dereference. Consider inspecting 'previousSibling'. OpenXmlCompositeElement.cs 380

public OpenXmlElement PreviousSibling()
{
    if (!(Parent is OpenXmlCompositeElement parent))
    {
        return null;
    }
    ....
}

public override T InsertBefore<T>(T newChild, OpenXmlElement referenceChild)
{
    ....
    OpenXmlElement previousSibling = nextNode.PreviousSibling();
    prevNode.Next = nextNode;
    previousSibling.Next = prevNode;    // <=
    ....
}

Contrary to the previous examples, this one does require an additional check. The PreviousSibling method can return the value null, and it will be used right away without any check.

Two more potential null dereferences:

V3080 Possible null dereference. Consider inspecting 'prevNode'. OpenXmlCompositeElement.cs 489
V3080 Possible null dereference. Consider inspecting 'prevNode'. OpenXmlCompositeElement.cs 497

Warning 2

V3093 The '&' operator evaluates both operands. Perhaps a short-circuit '&&' operator should be used instead. UniqueAttributeValueConstraint.cs 60

public override ValidationErrorInfo ValidateCore(ValidationContext context)
{
    ....
    foreach (var e in root.Descendants(....))
    {
        if (e != element & e.GetType() == elementType) // <=
        {
            var eValue = e.ParsedState.Attributes[_attribute];

            if (eValue.HasValue && _comparer.Equals(....))
            {
                return true;
            }
        }
    }
    ....
}

Some developers love applying the '&' operator to logical expressions without good reason. But whatever value its first operand evaluates to, the second operand will be evaluated anyway. In this particular case, it's not a critical mistake, but such careless code may start throwing NullReferenceExceptions after refactoring.

Warning 3

V3097 Possible exception: type marked by [Serializable] contains non-serializable members not marked by [NonSerialized]. OpenXmlPackageValidationEventArgs.cs 15

[Serializable]
[Obsolete(ObsoleteAttributeMessages.ObsoleteV1ValidationFunctionality, false)]
[EditorBrowsable(EditorBrowsableState.Never)]
public sealed class OpenXmlPackageValidationEventArgs : EventArgs
{
    private string _message;

    [NonSerialized]
    private readonly object _sender;

    [NonSerialized]
    private OpenXmlPart _subPart;

    [NonSerialized]
    private OpenXmlPart _part;

    ....

    internal DataPartReferenceRelationship
        DataPartReferenceRelationship { get; set; } // <=
}

Serialization of the OpenXmlPackageValidationEventArgs class may fail because one of the properties is not marked as serializable. Alternatively, this can be fixed by making the property's return type serializable; otherwise, you risk getting an exception at runtime.

Conclusion

We, PVS-Studio developers, are fans of Microsoft projects and technologies. We even have a separate section dedicated to Microsoft on our page listing all open-source projects checked with PVS-Studio. That section already includes 21 projects covered in 26 articles. This one is the 27th.

I bet you are wondering if Microsoft is our client. Yes, it is! But keep in mind it's a huge corporation operating all over the world. Some of its subdivisions surely use PVS-Studio in their work, but many more don't! As our experience with open-source projects shows, the latter are obviously in need of a good bug-detecting tool ;).

Those who follow news on analysis of C++, C#, and Java code may also be interested to know that we have recently added support of the OWASP standard and are actively covering it with our diagnostics.

#CSharp

Tags:

#CSharp

Fill out the form in 2 simple steps below:

Your contact information:

Desired license type:

Analyzing the code quality of Microsoft's Open XML SDK

Introduction

Why still Word API and not Open XML SDK?

Two values of string

On code compactness

The infamous always true/false

Other warnings

Conclusion

Comments (0)

Want to try PVS‑Studio for free?

Achievements

PVS-Studio

Licensing

Company