How do I avoid global state when using custom constructors in ruamel.yaml?

I am using ruamel.yaml to parse a complex YAML document where certain tagged nodes require special treatment. I inject my custom parsing logic using add_multi_constructor, as recommended by the published examples. The problem is that I need to change the injected logic dynamically depending on external states but the decoration methods like add_multi_constructor modify the global state which introduces unacceptable coupling between logically unrelated instances. Here is the MWE:

import ruamel.yaml

def get_loader(parameter):
    def construct_node(constructor: ruamel.yaml.Constructor, tag: str, node: ruamel.yaml.Node):
        return parameter(tag.lstrip("!"), str(node.value))

    loader = ruamel.yaml.YAML()
    loader.constructor.add_multi_constructor("", construct_node)
    return loader

foo = get_loader(lambda tag, node: f"foo: {tag}, {node}")
bar = get_loader(lambda tag, node: f"bar: {tag}, {node}")
print(foo.load("!abc 123"), bar.load("!xyz 456"), sep="n")

Output:

bar: abc, 123
bar: xyz, 456

Expected:

foo: abc, 123
bar: xyz, 456

I made the following workaround where I create new class instances dynamically to break the coupling:

def get_loader(parameter):
    def construct_node(constructor: ruamel.yaml.Constructor, tag: str, node: ruamel.yaml.Node):
        return parameter(tag.lstrip("!"), str(node.value))

    # Create a new class to prevent state sharing through class attributes.
    class ConstructorWrapper(ruamel.yaml.constructor.RoundTripConstructor):
        pass

    loader = ruamel.yaml.YAML()
    loader.Constructor = ConstructorWrapper
    loader.constructor.add_multi_constructor("", construct_node)
    return loader

My questions are:

  • Am I misusing the library? The global effects are a huge red flag which suggests that I am using the API incorrectly, but the library lacks any API documentation so I am not sure what would be the correct way.

  • Is it safe in the sense of API breakage? Since there is no documented API for this, I am not sure if this is safe to put into production.

Answer

IMO you are not misusing the library, just working around its current shortcomings/incompleteness.

Before ruamel.yaml got the API with the YAML() instance, it had the function based API of PyYAML with a few extensions, and other PyYAML’s problems had to be worked around in a similar unnatural way. E.g. I reverted to having classes whose instances could be called (using __call__()) on which methods could then be changed to just have access to YAML documents version parsed from a document (as ruamel.yaml supports YAML 1.2 and 1.1 and PyYAML only (partially) supports 1.1).

But underneath ruamel.yaml’s YAML() instance not all has improved. The code inherited from PyYAML stores the information for the various constructors in the class attributes as lookup tables (on yaml_constructor resp yaml_multi_constructor), and ruamel.yaml still does that (as the full old PyYAML-escque API is effectively still there, and only with version 0.17 has gotten a future deprecation warning).

Your approach is in so far interesting in that you do:

loader.constructor.add_multi_constructor("", construct_node)

instead of:

loader.Constructor.add_multi_constructor("", construct_node)

(you probably know that loader.constructor is a property that instantiates loader.Constructor if necessary, but other readers of this answer might not)

or even:

def get_loader(parameter):
    def construct_node(constructor: ruamel.yaml.Constructor, tag: str, node: ruamel.yaml.Node):
        return parameter(tag.lstrip("!"), str(node.value))

    # Create a new class to prevent state sharing through class attributes.
    class ConstructorWrapper(ruamel.yaml.constructor.RoundTripConstructor):
        pass

    ConstructorWrapper.add_multi_constructor("", construct_node)

    loader = ruamel.yaml.YAML()
    loader.Constructor = ConstructorWrapper
    return loader

That your code works, is because constructors are stored on the class attribute as .add_multi_constructor() is a class method.

So what you do is not entirely safe in the sense of API breakage. ruamel.yaml is not at version 1.0 yet, and (API) changes that potentially break your code could come with any minor version number change. You should set your version dependencies accordingly for your production code (e.g. ruamel.yaml<0.18 ), and update that minor number only after testing with a ruamel.yaml version with a new minor version number.


It is possible to transparently change the use of the class attributes by updating the classmethods add_constructor() and add_multi_constructor() to “normal” methods and have the initialisation of the lookup tables done in __init__(). Both your examples that call the instance:

loader.constructor.add_multi_constructor("", construct_node)

will get the desired result, but ruamel.yaml’s behaviour would not change when calling add_multi_constructor on the class using:

loader.Constructor.add_multi_constructor("", construct_node)

However changing classmethods add_constructor() and add_multi_constructor() in this way affects all code out there, that happens to provide an instance instead of the class (and said code being fine with the result).

It is more likely that two new instance methods will be added either to the Constructor class and to the YAML() instance , and that the class method will be either phased out or changed to check on a class and not an instance being passed in, after a deprecation period with warnings (as will the global functions add_constructor() and add_multi_constructor() inherited from PyYAML).

The main advice, apart from having your production code fixed on the minor version number, is to make sure your testing code displays PendingDeprecationWarning. If you are using pytest this is the case by default. That should give you ample time to adapt your code to what the warning recommends.

And if ruamel.yaml’s author stops being lazy, he might provide some documentation for such API additions/changes.

import ruamel.yaml
import types
import inspect


class MyConstructor(ruamel.yaml.constructor.RoundTripConstructor):
    _cls_yaml_constructors = {}
    _cls_yaml_multi_constructors = {}

    def __init__(self, *args, **kw):
        self._yaml_constructors = {
            'tag:yaml.org,2002:null': self.__class__.construct_yaml_null,
            'tag:yaml.org,2002:bool': self.__class__.construct_yaml_bool,
            'tag:yaml.org,2002:int': self.__class__.construct_yaml_int,
            'tag:yaml.org,2002:float': self.__class__.construct_yaml_float,
            'tag:yaml.org,2002:binary': self.__class__.construct_yaml_binary,
            'tag:yaml.org,2002:timestamp': self.__class__.construct_yaml_timestamp,
            'tag:yaml.org,2002:omap': self.__class__.construct_yaml_omap,
            'tag:yaml.org,2002:pairs': self.__class__.construct_yaml_pairs,
            'tag:yaml.org,2002:set': self.__class__.construct_yaml_set,
            'tag:yaml.org,2002:str': self.__class__.construct_yaml_str,
            'tag:yaml.org,2002:seq': self.__class__.construct_yaml_seq,
            'tag:yaml.org,2002:map': self.__class__.construct_yaml_map,
            None: self.__class__.construct_undefined
        }
        self._yaml_constructors.update(self._cls_yaml_constructors)
        self._yaml_multi_constructors = self._cls_yaml_multi_constructors.copy()
        super().__init__(*args, **kw)

    def construct_non_recursive_object(self, node, tag=None):
        # type: (Any, Optional[str]) -> Any
        constructor = None  # type: Any
        tag_suffix = None
        if tag is None:
            tag = node.tag
        if tag in self._yaml_constructors:
            constructor = self._yaml_constructors[tag]
        else:
            for tag_prefix in self._yaml_multi_constructors:
                if tag.startswith(tag_prefix):
                    tag_suffix = tag[len(tag_prefix) :]
                    constructor = self._yaml_multi_constructors[tag_prefix]
                    break
            else:
                if None in self._yaml_multi_constructors:
                    tag_suffix = tag
                    constructor = self._yaml_multi_constructors[None]
                elif None in self._yaml_constructors:
                    constructor = self._yaml_constructors[None]
                elif isinstance(node, ScalarNode):
                    constructor = self.__class__.construct_scalar
                elif isinstance(node, SequenceNode):
                    constructor = self.__class__.construct_sequence
                elif isinstance(node, MappingNode):
                    constructor = self.__class__.construct_mapping
        if tag_suffix is None:
            data = constructor(self, node)
        else:
            data = constructor(self, tag_suffix, node)
        if isinstance(data, types.GeneratorType):
            generator = data
            data = next(generator)
            if self.deep_construct:
                for _dummy in generator:
                    pass
            else:
                self.state_generators.append(generator)
        return data

    def get_args(*args, **kw):
        if kw:
            raise NotImplementedError('can currently only handle positional arguments')
        if len(args) == 2:
            return MyConstructor, args[0], args[1]
        else:
            return args[0], args[1], args[2]

    def add_constructor(self, tag, constructor):
        self, tag, constructor = MyConstructor.get_args(*args, **kw)
        if inspect.isclass(self):
            self._cls_yaml_constructors[tag] = constructor
            return
        self._yaml_constructors[tag] = constructor

    def add_multi_constructor(*args, **kw): # self, tag_prefix, multi_constructor):
        self, tag_prefix, multi_constructor = MyConstructor.get_args(*args, **kw)
        if inspect.isclass(self):
            self._cls_yaml_multi_constructors[tag_prefix] = multi_constructor
            return
        self._yaml_multi_constructors[tag_prefix] = multi_constructor

def get_loader_org(parameter):
    def construct_node(constructor: ruamel.yaml.Constructor, tag: str, node: ruamel.yaml.Node):
        return parameter(tag.lstrip("!"), str(node.value))

    loader = ruamel.yaml.YAML()
    loader.Constructor = MyConstructor
    loader.constructor.add_multi_constructor("", construct_node)
    return loader

foo = get_loader_org(lambda tag, node: f"foo: {tag}, {node}")
bar = get_loader_org(lambda tag, node: f"bar: {tag}, {node}")
print('>org<', foo.load("!abc 123"), bar.load("!xyz 456"), sep="n")


def get_loader_instance(parameter):
    def construct_node(constructor: ruamel.yaml.Constructor, tag: str, node: ruamel.yaml.Node):
        return parameter(tag.lstrip("!"), str(node.value))

    # Create a new class to prevent state sharing through class attributes.
    class ConstructorWrapper(MyConstructor):
        pass

    loader = ruamel.yaml.YAML()
    loader.Constructor = ConstructorWrapper
    loader.constructor.add_multi_constructor("", construct_node)
    return loader

foo = get_loader_instance(lambda tag, node: f"foo: {tag}, {node}")
bar = get_loader_instance(lambda tag, node: f"bar: {tag}, {node}")
print('>instance<', foo.load("!abc 123"), bar.load("!xyz 456"), sep="n")


def get_loader_cls(parameter):
    def construct_node(constructor: ruamel.yaml.Constructor, tag: str, node: ruamel.yaml.Node):
        return parameter(tag.lstrip("!"), str(node.value))

    # Create a new class to prevent state sharing through class attributes.
    class ConstructorWrapper(MyConstructor):
        pass

    loader = ruamel.yaml.YAML()
    loader.Constructor = ConstructorWrapper
    loader.Constructor.add_multi_constructor("", construct_node)
    #      ^ using the virtual class method
    return loader

foo = get_loader_cls(lambda tag, node: f"foo: {tag}, {node}")
bar = get_loader_cls(lambda tag, node: f"bar: {tag}, {node}")
print('>cls<', foo.load("!abc 123"), bar.load("!xyz 456"), sep="n")

which gives:

>org<
foo: abc, 123
bar: xyz, 456
>instance<
foo: abc, 123
bar: xyz, 456
>cls<
bar: abc, 123
bar: xyz, 456