bookmark_borderModernizing horde/text_diff

If you ever read a github pull request or similar extension proposal, you will likely have seen side by side comparisons of the original and the changed file. You may also have seen some text format that highlights only differences and a little context but hides the unchanged rest of the file. Both of these formats are called Diff, named after the popular diff and patch utilities dating back to ancient Unix times. The git diff command does something very similar. The horde/text_diff library and its ancestor, the pear/text_diff library, are tools to generate and format such difference information for different usage scenarios.

Apart from Horde’s internal usage in its repository viewer, horde/chora, and its wiki software, horde/wicked, the tool is also used by external parties. WebSVN maintainer Michael O. approached me because he wanted to use a PHP 8.2 ready version of horde/text_diff to substitute an older component which did not do the job. Michael has been very helpful in getting me started, pointing me to some issues to solve and also providing his own solutions in some parts. The result is a conservative update of horde/text_diff that will run in the upcoming versions of PHP without causing any trouble. But this is only where I started.

Breaking bad habits

A closer look at the internal structure of the library showed that it deserved a major overhaul. The solution was to refrain from a verbatim upgrade to namespaces and the likes but to actually change some things. This meant breaking backwards compatibility. I go to great lengths to keep a conservative drop-in version of everything I touch in the lib/ folder. Sometimes it is just an interface or wrapper, sometimes the new and old code do not really share a lot.

I began with adding type hints to most methods. Targetting PHP 8.1+ for the src/ path allows to use union types and intersection types. A lot of knowledge hidden in phpdoc comments is moved into actual code and makes it more robust.

Exploring the code for base classes and interfaces, I noticed that some things I did not like.

Method signatures did not add up

Some method signatures did not add up. Depending on the type of Diff Engine, the diff() method would take different types of arguments. The interface was mixing the specifics of how the diff engine is set up with the command to create a list of operations objects. Loading the engine is now separated from running it. The running method is now always called in the same way.

Internal dependency creation

The Diff, Threeway Diff and Mapped Diff utilities all created their diff engine internally. To do this, they needed a very flexible constructor that allowed passing whatever is needed to set up the actual engine. That was bad enough but they also did it in different ways. The Differs’ constructor now only accepted a pre-constructed engine. For convenience, I created a factory which would take over the responsibilities originally assigned to the differs’ constructor: Building a differ from input and if no specific differ was chosen, selecting one by some priority logic. In the end it turned out that the Differ does not need the engine at all but rather needs the product of the Engine: A set of operations to transform document A to document B. Born is the OperationList class. I did not want to just pass an array. I added a small static method as a named constructor. It frees the actual constructor of too many responsibilities and allows to keep the interface clean and strict.

More explicit type juggling

Creation if diffs contains some interesting math. The algorithms use a lot of short variable names and operations that make sense if you know the underlying theory but otherwise look like garbage. I added some explicit conversions between string and integer and made some changes to ensure a number zero or an empty string is not mistaken for a “false” or “null” value which would have another meaning.
Overall it is now much easier for static analyzers to spot any issues.

Dual stacks have a price tag

Essentially maintaining two different sets of the library comes with some cost. One must ensure that unit tests targeted at the newer platform are not run when testing for compatibility with the older platform. The conservative lib/ is ready to run PHP 7.4 code but the modern version in src/ must be transpiled to be run in PHP 7.4 – I do that on release but still it is another aspect that needs minding. As I also use automated tools for some upgrade tasks, I need to ensure I do not upgrade the lib/ path. The price is worth it as I cannot convert the code base at once and I want to provide a good development experience to all who are caught in between maintaining an older release or creating new code. I am in that spot myself. Essentially it allows me to run two conflicting major versions of some critical libraries and pick the right one for different sub systems. The need will go away as code is gradually migrating towards the newer implementations. At some point a next major version will drop the conservative path. Anybody interested is free to maintain the older major version and keep using it.

Upcoming work

I consider the external interface of the newer horde/text_diff implementation fairly stable by now. Internally, however, there is a lot of room for improvement. Some functionality should move out of the base classes and into separate traits – which the base classes will use. Some getters should be added and used, preparing to move some public variables to internal state in a next major release. The new OperationList gets unloaded to plain arrays in several places – It needs to learn some tricks without degrading into a glorified array. None of this should stop early adopters from using the new code base. None of this is supposed to break any user code.

Out of scope for now

There are some items which I decided to postpone for now. One thing which bothers me is the amount of dependencies. While a dependency on horde/exception makes sense, it pulls in horde/translation for no good reason. Horde\Util is pulled in but really only used in two places: A horde/string call which could be reduced to a direct call to an internal function and one call to a helper for handling temporary files. That helper should maybe live in its own library, nicely decoupled from unrelated utilities. There was a reason why they were packed together but it is no longer relevant.

Also, some functionality is missing in the Xdiff-based engine. Most distributions do not even offer php-xdiff, including my own development platform. I will add that feature once I get it into CI and into the development setup. I do not want to delay other items to do that right now. Patches welcome 🙂