AMD's Display Core difficulties

LWN.net needs you! Without subscribers, LWN would simply not exist. Please consider signing up for a subscription and helping to keep LWN publishing

Back in 2007, the announcement that AMD intended to reverse its longstanding position and create an upstream driver for its graphics processors was joyfully received by Linux users worldwide. As 2017 approaches, an attempt by AMD to merge a driver for an upcoming graphics chip has been rejected by the kernel's graphics subsystem maintainer — a decision that engendered rather less joy. A look at this discussion reveals a pattern seen many times before; the positions and decisions taken can seem arbitrary to the wider world but they are not without their reasons and will, hopefully, lead to a better kernel in the long run.

A quick timeline

Back in February, Harry Wentland posted a patch set adding the new "DAL display driver" to the AMD GPU subsystem; this driver, he said, would "allow us going forward to bring display features on the open amdgpu driver (mostly) on par with the Catalyst driver." It was not a small patch, adding 279 new source files containing nearly 94,000 lines of code. That code saw little public discussion and was never merged, though it has become clear that some private discussion took place.

In March, Alex Deucher proposed that DAL should be merged, saying that it was to be the core of the new display stack; his goal was to get it into the 4.7 release. Graphics maintainer Dave Airlie made it clear that this was not going to happen, saying that: "I think people are focusing on the minor comments and concerns and possibly deliberately ignoring the bigger concern that this code base is pretty much unmergeable as-is." His biggest complaint had to do with the overall design, which involved a lot of abstraction code that tried to hide the details of working with the kernel from the rest of the code. Others echoed his concerns and, indeed, the code was not merged for 4.7 or any other kernel released since then.

The current chapter began on December 7, when Wentland posted an RFC note saying that this code (now going by the name "display core") was needed to support the next generation GPU coming out of AMD. The company, he said, has based all of its efforts on the display core code, using it as the foundation for all of its quality-assurance work, its OEM preloads, and more. And, he noted: "DC behavior mirrors what we do for other OSes". That last point is important; the display core code helps the company maintain the driver across multiple operating systems by hiding a lot of system-specific interfaces.

This time, Daniel Vetter complained about the abstraction layers in the code and described why they were not acceptable in the mainline kernel. Airlie responded more strongly, saying that this code would not be merged in its current form:

Given the choice between maintaining Linus' trust that I won't merge 100,000 lines of abstracted HAL code and merging 100,000 lines of abstracted HAL code I'll give you one guess where my loyalties lie.

As one might expect, a certain amount of back-and-forth resulted; the AMD developers were not pleased by this response. It can only have felt like a slap to a group of developers who were trying to do the right thing by getting support for their hardware into the mainline kernel. Even so, they stayed with the discussion, which remained almost entirely civil, and which, in the end, seems to be leading to a viable path forward.

The trouble with midlayers

There are a number of complaints with the AMD driver code; it is not often that tens of thousands of lines of new code are free of problems. But the biggest issue has to do with the midlayer architecture. A midlayer, as its name suggests, sits between two other layers of code, typically with the purpose of hiding those outer layers from each other. In this case, for example, the display core code tries to hide the details of low-level hardware access, allowing the upper-layer driver to run on any operating system.

The kernel community has a long experience with midlayers, and that experience is generally seen as being bad. For an extensive discussion of the midlayer pattern in the kernel, this 2009 article from Neil Brown is recommended reading. A midlayer architecture can bring a whole raft of long-term maintainability issues in general; the graphics developers are also concerned about some more specific issues.

The idea behind AMD's midlayer is to abstract out the Linux-specific details in the driver. That may be desirable for somebody trying to maintain a cross-platform driver; it also helps the AMD developers get the Linux driver working before the hardware engineers move on to the next product and stop answering questions. But code structured in this way is painful for people trying to maintain the Linux kernel. Understanding higher-level code becomes harder when that code does not follow the normal patterns used by graphics drivers; that can be hard for maintenance in general, but it can become a severe impediment to any sort of refactoring work. As Airlie put it:

If I'd given in and merged every vendor coded driver as-is we'd never have progressed to having atomic modesetting, there would have been too many vendor HALs and abstractions that would have blocked forward progression. Merging one HAL or abstraction is going to cause pain, but setting a precedent to merge more would be just downright stupid maintainership.

A hardware abstraction layer must be maintained to meet the needs of code for other operating systems — code that the Linux kernel developers cannot see (and probably don't want to). In effect, that means that nobody outside of the originating company can touch the midlayer code, making community maintenance impossible. If members of the community do try to patch the midlayer — often to remove code that, from the kernel's perspective, is redundant — they will run afoul of the driver maintainers, who may well try to block the work. If they are successful in doing so, the result is code in the community kernel that is essentially off-limits for community developers.

Functionality placed in a midlayer, rather than in common code, has a high likelihood of developing its own behavioral quirks. As a result, drivers using the midlayer will behave differently from other drivers for similar hardware, often in subtle ways. That creates pain for application developers, who no longer have a single driver interface to work with.

A midlayer will also tend to isolate its developers from the common core code. The midlayer will be fixed and improved, often to work around shortcomings in the common layer, rather than improving the common layer itself. Kernel developers would rather see that effort going into the common code, where it benefits all users rather than one driver in particular. Developers who work on this support code have a say in the direction it takes, while developers who work on a midlayer generally do not. So things tend to diverge further over time, with the driver developers feeling that the core is not developed with their needs in mind.

Finally, midlayer-based code has a certain tendency to get stuck on older kernel versions; indeed, the current display core code is still based on 4.7. That makes it hard to maintain as the kernel goes forward. In this case, Vetter summarized this aspect of the problem by saying: "I think you don't just need to demidlayer DAL/DC, you also need to demidlayer your development process." Code intended for the mainline needs to be developed and tested against the current mainline, or it will inevitably fall behind.

The way forward

Regardless of how one views the odds of seeing the Year of the Linux Desktop in the near future, it seems certain that those odds can only be worse in the lack of AMD GPU drivers. The blocking of such a driver — otherwise on track to be in mainline before the hardware ships — thus looks like a step backward for a subsystem that has already struggled to gain support for current hardware.

Chances are, though, that this standoff will be resolved more quickly than people might expect. The AMD developers were not unaware of the problems and, it seems, not unwilling to fix them. Deucher said:

What I really wanted to say was that this was an RFC, basically saying this is how far we've come, this is what we still need to do, and here's what we'd like to do. This was not a request to merge now or an ultimatum.

Some work has indeed been done since the early posting of this patch set; and, it is said, about one-third of the midlayer code is gone. Vetter made it clear that this work had been seen and appreciated:

I guess things went a bit sideways by me and Dave only talking about the midlayer, so let me first state that the DC stuff has massively improved through replacing all the backend services that reimplemented Linux helper libraries with their native equivalent. That's some serious work, and it shows that AMD is committed to doing the right thing.

The code that had been removed so far is, naturally enough, the easiest third to take care of; getting rid of the rest of the midlayer will require some significant refactoring of the code. Vetter provided a roadmap for how that work could be done; Wentland and AMD developer Tony Cheng agreed that the path seemed workable. Wentland acknowledged that things could have been done better at AMD, saying: "We really should've started DC development in public and probably would do that if we had to start anew." Actually getting all that work done may take a while; it must compete with other small tasks like making the driver actually work for existing and upcoming hardware.

One might conclude that what we are really seeing here is a negotiation over just how much of this work must be done before the code is deemed good enough that the rest of the fixes can be done in the mainline. Maintainers tend to worry about giving way in such situations because, once they have merged the code, they have given up their strongest leverage and can find that the developers become surprisingly unmotivated to finish the job. Arguably, that is a relatively unlikely outcome here; AMD has been trying to improve its upstream development for nearly a decade and its developers know what the rules are.

The most likely outcome, thus, is that this driver is delayed for perhaps a few more development cycles while the worst problems are taken care of and some sort of convincing story exists for the rest. Then it will be welcomed into the kernel as a welcome addition to mainline support for current hardware, and users worldwide will have something to celebrate. The Year of the Linux Desktop, unfortunately, may be a little slower in coming.