# Fix nv_drm_revoke_modeset_permission kernel WARNING

This note documents a fix for the kernel `WARN_ON` in [`nv_drm_revoke_modeset_permission`][nv_revoke_modeset] on the
nvidia-drm 550 branch with Debian kernel 6.12. Two patches are needed: one fixes the conftest detection (the primary
cause of the visible symptom), and one fixes error-handling bugs in the driver (secondary, latent bugs).

## Error Symptom

```
WARNING at nvidia-drm/nvidia-drm-drv.c:1221 nv_drm_revoke_modeset_permission+0x327/0x340
```
Triggered via `close() -> __fput ->` [`drm_release`][drm_release] `->` [`drm_file_free`][drm_file_free] -- i.e.,
whenever a DRM file descriptor is closed (by processes like `glxtest`, `kioworker`, `kscreenlocker_g`).

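The WARN instruction (`ud2`) is at offset 0x327 in a 0x340-byte function. Compilers typically move the cold `WARN_ON`
trap out of line, toward the end of the function, so the offset alone does not place the assertion in the
[`DRM_MODESET_LOCK_ALL_END`][DRM_MODESET_LOCK_ALL] macro expansion. The reported source line, 1221, falls within the
connector iteration block.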
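
The `symbol+offset/size` suffix in the WARNING line can be split mechanically when comparing traces. The following is
a minimal userspace sketch, not part of the driver or the patches; `struct sym_loc`, `parse_sym_loc`, and
`bytes_to_end` are hypothetical names introduced here for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper type: one "symbol+0xOFF/0xSIZE" token from a trace. */
struct sym_loc {
    char name[64];
    unsigned long offset;   /* byte offset of the reported instruction */
    unsigned long size;     /* total size of the function in bytes */
};

/* Returns 0 on success, -1 if the token does not have the expected shape. */
static int parse_sym_loc(const char *token, struct sym_loc *out)
{
    assert(token != NULL && out != NULL);
    if (sscanf(token, "%63[^+]+%lx/%lx", out->name, &out->offset, &out->size) != 3)
        return -1;
    return out->offset < out->size ? 0 : -1;
}

/* Bytes between the reported instruction and the end of the function. */
static unsigned long bytes_to_end(const struct sym_loc *loc)
{
    return loc->size - loc->offset;
}
```

For the trace above, `parse_sym_loc("nv_drm_revoke_modeset_permission+0x327/0x340", &loc)` fills in offset 0x327 and
size 0x340, and `bytes_to_end(&loc)` yields 0x19.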

## Root Cause: Conftest Failure (Primary -- causes the visible WARN)

The WARNING is primarily caused by a **conftest failure** that activates a buggy fallback code path:

### 1. Conftest for `drm_connector_list_iter` fails

The compile test for [`struct drm_connector_list_iter`][drm_connector_list_iter] in [`conftest.sh`][conftest_sh]
fails on Debian kernel 6.12.90. The test includes [`<drm/drm_connector.h>`][drm_connector.h] directly, but that header
depends on types from [`<drm/drm_device.h>`][drm_device.h] and [`<drm/drm_mode_config.h>`][drm_mode_config.h] that are
not transitively available.

This is a **latent upstream bug** (the conftest was always fragile) **triggered by the Debian
`use-kbuild-flags.patch`**, which prepends `KBUILD_CFLAGS` to the conftest compile flags. Under the stricter flags
(or changed include order), `drm/drm_connector.h` cannot compile standalone.

Evidence in the build output:
```
# conftest/compile-tests/drm_connector_list_iter.h (BEFORE fix)
#undef NV_DRM_CONNECTOR_LIST_ITER_PRESENT       ← types conftest fails (false negative: the type exists)
#define NV_DRM_CONNECTOR_LIST_ITER_BEGIN_PRESENT  ← functions conftest false positive (inverted logic)
```
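
The two directives above follow from how the result of a test compile is turned into a `#define` or an `#undef`. The
sketch below is a userspace model of that mapping as described in this section, not code from `conftest.sh`. The enum
and function names are hypothetical, and the inverted handling of the functions category is an assumption taken from
the annotation above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two conftest categories discussed above. */
enum conftest_category { CONFTEST_TYPES, CONFTEST_FUNCTIONS };

/*
 * Maps "did the test snippet compile?" to "is the *_PRESENT macro defined?".
 * ASSUMPTION: types tests define the macro when the snippet compiles, while
 * functions tests define it when the snippet fails to compile (inverted).
 */
static bool conftest_defines_macro(enum conftest_category cat, bool compiled)
{
    assert(cat == CONFTEST_TYPES || cat == CONFTEST_FUNCTIONS);
    return (cat == CONFTEST_FUNCTIONS) ? !compiled : compiled;
}
```

Under this model, a header that fails to compile for unrelated reasons (`compiled == false`) yields an undefined
macro for the types test and a defined macro for the functions test, which matches the pair of directives shown in
the build output.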

### 2. Fallback macro fires bogus WARN_ON

Without `NV_DRM_CONNECTOR_LIST_ITER_PRESENT`, the code uses the fallback [`nv_drm_for_each_connector`][nv_for_each_conn]
macro in [`nvidia-drm-helper.h`][nv_drm_helper_h]:

```c
#define nv_drm_for_each_connector(connector, conn_iter, dev) \
    WARN_ON(!mutex_is_locked(&dev->mode_config.mutex));      \
    list_for_each_entry(connector, &(dev)->mode_config.connector_list, head)  /* kernel list.h */
```

### 3. WARN_ON fires because mode_config.mutex is not held

The call chain in [`nv_drm_revoke_modeset_permission`][nv_revoke_modeset] is:

```
DRM_MODESET_LOCK_ALL_BEGIN(dev, ctx, flags, ret)
    nv_drm_for_each_connector(connector, conn_iter, dev)   ← WARN_ON here
        ...
DRM_MODESET_LOCK_ALL_END(dev, ctx, ret)
```

[`DRM_MODESET_LOCK_ALL_BEGIN`][DRM_MODESET_LOCK_ALL] (the variant paired with the 3-argument
`DRM_MODESET_LOCK_ALL_END`, as used here) expands roughly to:

```c
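if (!drm_drv_uses_atomic_modeset(dev))               /* legacy (non-atomic) drivers only */
    mutex_lock(&dev->mode_config.mutex);
drm_modeset_acquire_init(&ctx, flags);               /* kernel: drm_modeset_lock.c */
modeset_lock_retry:
ret = drm_modeset_lock_all_ctx(dev, &ctx);           /* kernel: drm_modeset_lock.c */
if (ret)
    goto modeset_lock_fail;                          /* label is in DRM_MODESET_LOCK_ALL_END */
/* user code -- the connector iteration goes here */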
```

[`drm_modeset_lock_all_ctx`][drm_modeset_lock_all_ctx] acquires locks in this order:
1. `dev->mode_config.connection_mutex` -- protects the connector list and connector properties
2. Each CRTC's `crtc->mutex`
3. Each plane's `plane->mutex`

Crucially, it does **not** acquire `dev->mode_config.mutex`. That mutex is taken only on legacy paths: by
[`drm_modeset_lock_all()`][drm_modeset_lock_all] (the non-`_ctx` variant) and, within the macro, only when
[`drm_drv_uses_atomic_modeset(dev)`][drm_drv_uses_atomic] returns false. The nvidia-drm driver is an atomic driver
(the check returns true), so it correctly uses the `_ctx` path, which relies on `connection_mutex` to serialize
connector list access.

The fallback [`nv_drm_for_each_connector`][nv_for_each_conn] macro checks
`WARN_ON(!mutex_is_locked(&dev->mode_config.mutex))`, but this is the **wrong mutex** to assert on for atomic
drivers. The connector list is properly protected by `connection_mutex` (which *is* held), making this WARN_ON a
false alarm. The proper [`drm_connector_list_iter`][drm_connector_list_iter]-based code path (used when the conftest
succeeds) doesn't have this incorrect assertion at all -- it uses [refcounted iteration][drm_connector_list_iter_begin]
that is safe under `connection_mutex`.
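
To make the mismatch concrete, the following userspace sketch models the two locks as flags. It is not kernel code:
`struct model_mode_config` and the `model_*` functions are hypothetical stand-ins, and the only behavior modelled is
which lock each path takes and which lock the fallback macro asserts on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the two locks; field names mirror the kernel ones. */
struct model_mode_config {
    bool mutex_held;             /* models dev->mode_config.mutex */
    bool connection_mutex_held;  /* models dev->mode_config.connection_mutex */
    int warnings;                /* how many times the modelled WARN_ON fired */
};

/* Models drm_modeset_lock_all_ctx(): takes connection_mutex, never mutex. */
static void model_lock_all_ctx(struct model_mode_config *cfg)
{
    assert(cfg != NULL);
    cfg->connection_mutex_held = true;
}

/* Models the legacy drm_modeset_lock_all(): takes mutex, then the rest. */
static void model_lock_all_legacy(struct model_mode_config *cfg)
{
    assert(cfg != NULL);
    cfg->mutex_held = true;
    cfg->connection_mutex_held = true;
}

/* Models the fallback macro: assert on mode_config.mutex, then walk the list. */
static int model_fallback_iterate(struct model_mode_config *cfg, size_t n_connectors)
{
    int visited = 0;

    assert(cfg != NULL);
    if (!cfg->mutex_held)
        cfg->warnings++;         /* the spurious WARN_ON */
    for (size_t i = 0; i < n_connectors; i++)
        visited++;               /* the walk itself is unaffected */
    return visited;
}
```

After `model_lock_all_ctx`, the iteration still visits every connector but records one warning, even though
`connection_mutex_held` is true. After `model_lock_all_legacy`, no warning is recorded.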

## Secondary Bugs in [`nvidia-drm-drv.c`][nv_drm_drv_c] (Latent)

Two additional bugs exist in [`nv_drm_revoke_modeset_permission`][nv_revoke_modeset] which patch 0084 fixes:

### Bug 1: Connector list iterator leak via `goto done`

When [`nv_drm_atomic_disable_connector`][nv_atomic_disable] returns an error, `goto done` skips
[`nv_drm_connector_list_iter_end`][nv_iter_end], leaking the iterator's internal reference.

### Bug 2: NULL state dereference on alloc failure

If [`drm_atomic_state_alloc`][drm_atomic_state_alloc] fails, `goto done` leads to
[`drm_atomic_state_put(NULL)`][drm_atomic_state_put] / `drm_atomic_state_free(NULL)`, which dereferences a NULL
pointer.

These bugs exist in all versions from 550 through 580+. In 580+, the wrapper macro removal hides the WARN symptom,
but the iterator leak persists. In 610+, the code was fully refactored to use
[`drm_connector_list_iter`][drm_connector_list_iter] APIs directly without conftests or wrappers.

## Fix: Two Debian Patches

### Patch 0085: Fix conftest (primary fix)

File: `debian/patches/module/0085-fix-drm_connector_list_iter-conftest.patch`

Adds prerequisite DRM headers (guarded by their own conftest defines) before `#include <drm/drm_connector.h>` in the
`drm_connector_list_iter` conftest, and adds a missing `return 0;` to avoid `-Werror=return-type`:

```sh
CODE="
#if defined(NV_DRM_DRM_DEVICE_H_PRESENT)
#include <drm/drm_device.h>
#endif
#if defined(NV_DRM_DRM_MODE_CONFIG_H_PRESENT)
#include <drm/drm_mode_config.h>
#endif
#include <drm/drm_connector.h>
int conftest_drm_connector_list_iter(void) {
    struct drm_connector_list_iter conn_iter;
    return 0;
}"
```

With this fix, the conftest correctly detects `struct drm_connector_list_iter` as present, the proper
[`nv_drm_connector_list_iter_begin`][nv_iter_begin]/[`end`][nv_iter_end] API path is used, and the fallback macro
with the bogus `WARN_ON` is never reached.

### Patch 0084: Fix error handling (secondary fix)

File: `debian/patches/module/0084-fix-nv_drm_revoke_modeset_permission-error-handling.patch`

Fixes two bugs in [`nv_drm_revoke_modeset_permission`][nv_revoke_modeset]:

- Replace `goto done` with `break` inside the connector loop, so the iterator end call always runs (Bug 1)
- Add `if (ret < 0) goto done;` after the iterator end call, so an error still skips the remaining work
- Guard `drm_atomic_state_put`/`drm_atomic_state_free` with `if (state != NULL)` (Bug 2)
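
The control-flow change can be sketched in userspace as follows. This is a model under stated assumptions, not the
patch itself: `model_revoke` and `struct model_result` are hypothetical, counters replace kernel objects, and the
`patched` flag switches between the original and the fixed flow so the two can be compared.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping that stands in for kernel objects. */
struct model_result {
    int ret;        /* 0 on success, negative errno on error */
    int iter_refs;  /* outstanding iterator references; should end at 0 */
    int null_puts;  /* times a NULL state reached the release call */
    int commits;    /* times the work after the loop ran */
};

/*
 * alloc_ok == false models the state allocation returning NULL.
 * fail_at is the index of the connector whose disable step fails (-1: none).
 */
static struct model_result model_revoke(bool patched, bool alloc_ok,
                                        int n_connectors, int fail_at)
{
    struct model_result r = {0};
    int dummy_state = 0;
    int *state = alloc_ok ? &dummy_state : NULL;

    assert(n_connectors >= 0);
    if (state == NULL) {
        r.ret = -ENOMEM;
        goto done;
    }

    r.iter_refs++;                      /* iterator begin */
    for (int i = 0; i < n_connectors; i++) {
        if (i == fail_at) {
            r.ret = -EINVAL;            /* arbitrary error chosen for the model */
            if (!patched)
                goto done;              /* original flow: skips iterator end */
            break;                      /* fixed flow: falls through to iterator end */
        }
    }
    r.iter_refs--;                      /* iterator end */

    if (r.ret < 0)
        goto done;                      /* fixed flow: error skips the remaining work */
    r.commits++;

done:
    if (state == NULL && !patched)
        r.null_puts++;                  /* original flow: release call sees NULL */
    return r;
}
```

In this model, `model_revoke(false, true, 3, 1)` returns with `iter_refs == 1` (the leak), while
`model_revoke(true, true, 3, 1)` returns with `iter_refs == 0`. Likewise, `null_puts` is 1 for the original flow and 0
for the fixed flow when `alloc_ok` is false.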

## Files Modified

In the [nvidia-graphics-drivers](https://salsa.debian.org/nvidia-team/nvidia-graphics-drivers) repo:

- `debian/patches/module/0084-fix-nv_drm_revoke_modeset_permission-error-handling.patch` (created)
- `debian/patches/module/0085-fix-drm_connector_list_iter-conftest.patch` (created)
- `debian/patches/module/series` (updated to include both patches)

## Verification

After rebuilding the DKMS module with both patches:

```bash
# Verify conftest now succeeds:
grep NV_DRM_CONNECTOR_LIST_ITER /var/lib/dkms/nvidia-current/550.163.01/build/conftest/types.h
# Expected: #define NV_DRM_CONNECTOR_LIST_ITER_PRESENT

# Check the version of the installed module file on disk (useful if the version was bumped):
modinfo nvidia-drm | grep version

# Verify WARNING no longer appears in dmesg:
dmesg | grep nv_drm_revoke_modeset_permission
# Expected: no output
```

## Important: DKMS Rebuild Gotcha

Simply copying patched files into the DKMS build directory is not enough. A full `dkms remove` + `dkms install` cycle
is required, followed by rebuilding the initramfs (`update-initramfs -u` or equivalent), to ensure the old module is
fully replaced. Without this, the old (unpatched) module may persist in the initramfs and continue to be loaded at
boot.

## Source References

### NVIDIA driver (open-gpu-kernel-modules, tag 550.163.01)

[nv_revoke_modeset]:   https://github.com/NVIDIA/open-gpu-kernel-modules/blob/550.163.01/kernel-open/nvidia-drm/nvidia-drm-drv.c#L1182
[nv_atomic_disable]:   https://github.com/NVIDIA/open-gpu-kernel-modules/blob/550.163.01/kernel-open/nvidia-drm/nvidia-drm-drv.c#L1152
[nv_for_each_conn]:    https://github.com/NVIDIA/open-gpu-kernel-modules/blob/550.163.01/kernel-open/nvidia-drm/nvidia-drm-helper.h#L157
[nv_drm_helper_h]:     https://github.com/NVIDIA/open-gpu-kernel-modules/blob/550.163.01/kernel-open/nvidia-drm/nvidia-drm-helper.h
[nv_drm_drv_c]:        https://github.com/NVIDIA/open-gpu-kernel-modules/blob/550.163.01/kernel-open/nvidia-drm/nvidia-drm-drv.c
[nv_iter_begin]:       https://github.com/NVIDIA/open-gpu-kernel-modules/blob/550.163.01/kernel-open/nvidia-drm/nvidia-drm-helper.h#L520
[nv_iter_end]:         https://github.com/NVIDIA/open-gpu-kernel-modules/blob/550.163.01/kernel-open/nvidia-drm/nvidia-drm-helper.h#L531
[conftest_sh]:         https://github.com/NVIDIA/open-gpu-kernel-modules/blob/550.163.01/kernel-open/conftest.sh#L4272

### Linux kernel (v6.12 on Elixir/Bootlin)

[drm_file_free]:                  https://elixir.bootlin.com/linux/v6.12/source/drivers/gpu/drm/drm_file.c#L223
[drm_release]:                    https://elixir.bootlin.com/linux/v6.12/source/drivers/gpu/drm/drm_file.c#L410
[DRM_MODESET_LOCK_ALL]:           https://elixir.bootlin.com/linux/v6.12/source/include/drm/drm_modeset_lock.h#L175
[drm_modeset_lock_all_ctx]:       https://elixir.bootlin.com/linux/v6.12/source/drivers/gpu/drm/drm_modeset_lock.c#L449
[drm_modeset_lock_all]:           https://elixir.bootlin.com/linux/v6.12/source/drivers/gpu/drm/drm_modeset_lock.c#L143
[drm_drv_uses_atomic]:            https://elixir.bootlin.com/linux/v6.12/source/include/drm/drm_drv.h#L533
[drm_connector_list_iter]:        https://elixir.bootlin.com/linux/v6.12/source/include/drm/drm_connector.h#L2336
[drm_connector_list_iter_begin]:  https://elixir.bootlin.com/linux/v6.12/source/drivers/gpu/drm/drm_connector.c#L878
[drm_connector.h]:                https://elixir.bootlin.com/linux/v6.12/source/include/drm/drm_connector.h
[drm_device.h]:                   https://elixir.bootlin.com/linux/v6.12/source/include/drm/drm_device.h
[drm_mode_config.h]:              https://elixir.bootlin.com/linux/v6.12/source/include/drm/drm_mode_config.h
[drm_atomic_state_alloc]:         https://elixir.bootlin.com/linux/v6.12/source/drivers/gpu/drm/drm_atomic.c#L166
[drm_atomic_state_put]:           https://elixir.bootlin.com/linux/v6.12/source/include/drm/drm_atomic.h#L536

---

*Document written with assistance from Claude Opus 4.6 (Anthropic) via Cursor IDE.*
