I noticed that, for any screen scaling other than 1x, my instance of the OpenLoco game looks blurry on my Arch Linux w/ Sway system. This has the footprint of an XWayland issue; in fact, if I tab over to my Discord client right now then all of the text is blurry because it's (regrettably) using X11. For OpenLoco, though, this seems to be different because the issue only occurs in fullscreen (regardless of borderless).
I don't want to just trust my eyes, so let's compare a windowed screenshot (good, left) with a fullscreen screenshot (bad, right), both taken with grim via grimshot.
Curious! The screenshots don't show the problem that I am seeing on my screen in front of me. What about OBS?
Well, okay, I can't easily show you so you'll need to trust me here. As soon as I open OBS Studio, my fullscreen OpenLoco window immediately becomes clear, negating the whole problem.
This debug has a very simple solution (which you could figure out without much system-level knowledge), but there are still a couple of things that make it tricky:
- This is a bit of a Heisenbug in that it temporarily goes away every time I try to observe it via screen capture.
- There are a lot of different components that could be involved here. I don't know who I'd even talk to to get help with this!
In my head, I went through some of the different components:
- OpenLoco, the game itself. The OpenLoco people are mostly involved in the business of reverse engineering and reimplementing Chris Sawyer's Locomotion, and are unlikely to be able to help with specific Linux graphics stack issues.
- More importantly, it's unlikely that OpenLoco is doing anything wrong unless it's doing something weird like implementing its own resolution detection.
- SDL2, the windowing system used by OpenLoco. I noted this down but didn't think super far into it as I haven't worked with SDL before.
- PipeWire, the A/V system used by OBS.
- Sway, my Wayland compositor.
First I considered asking the PipeWire folks (or even the OBS folks!) if they knew anything about why opening OBS might bypass the issue. I then checked how grim gets its screenshots, and found that it uses Wayland protocols to obtain the screen buffer (not PipeWire). This all suggests that there is some undesired scaling going on during "presentation".
With this in mind, I put aside grim + OBS and considered a couple of possibilities:
- Sway is doing something wrong.
- OpenLoco is doing some wrong resolution math that is causing avoidable scaling.
Unlike X11-based desktop environments that rely on the Xorg server, Wayland compositors do quite a lot of work. There is a common library to provide DE-agnostic functionality, wlroots, but not every compositor uses it. The variety in Wayland implementations makes it worthwhile to see which compositors an issue reproduces with. After all, weird compositor-specific issues have happened.
I installed my favorite DE that has a floating window manager: KDE Plasma. KDE Plasma automatically used a sensible display scaling (around 1.4). I booted up OpenLoco, went into fullscreen, and it was crystal clear! I also noticed that the game UI was smaller, revealing that OpenLoco is using my display scaling on Sway but not Plasma. This is perfectly fine, as OpenLoco has its own scaling built-in, so I don't need the compositor to try and be smart about it.
Now this smelled even more strongly of an XWayland issue. I popped up xeyes on both systems, and sure enough, OpenLoco was using X11 on Sway but not Plasma. For many intents and purposes, problem solved! I can do what the ArchWiki says to force SDL to use Wayland. But why the discrepancy?
At this point, my full focus is on SDL. I'm not confident as to why XWayland behaved the way it did, but I no longer care as we don't need XWayland here.
First, let's make sure that there's no environment variable set on Plasma forcing SDL:
$ env | grep SDL
Nope.
OpenLoco uses SDL2, so I pulled up that source tree. In absence of the SDL_VIDEODRIVER environment variable, it uses symbol-based heuristics to check whether X11 needs to be forced: (source code reference)
if (dlsym(global_symbols, "glxewInit") != NULL) { /* GLEW (e.g. Frogatto, SLUDGE) */
force_x11 = SDL_TRUE;
} else if (dlsym(global_symbols, "cgGLEnableProgramProfiles") != NULL) { /* NVIDIA Cg (e.g. Awesomenauts, Braid) */
force_x11 = SDL_TRUE;
} else if (dlsym(global_symbols, "_Z7ssgInitv") != NULL) { /* ::ssgInit(void) in plib (e.g. crrcsim) */
force_x11 = SDL_TRUE;
}
I can't recreate this test outside of OpenLoco as it depends on the dynamically linked libraries, so I pivoted to using a debugger. There is practically no setup involved on Arch, as GDB with Debuginfod automatically downloads symbols and source code. GDB's UI will display the "canonical" path, e.g. /usr/src/debug/sdl2-compat/sdl2-compat-2.32.60/src/dynapi/SDL_dynapi_procs.h, while the actual source file is sitting in your home directory:
(gdb) info source
Current source file is /usr/src/debug/sdl2-compat/sdl2-compat-2.32.60/src/dynapi/SDL_dynapi_procs.h
Compilation directory is /usr/src/debug/sdl2-compat/build
Located in /home/koopa/.cache/debuginfod_client/cdcb869bcde00ba2cba4b4f099d1dd2513710eba/source-d54b1408-#usr#src#debug#sdl2-compat#sdl2-compat-2.32.60#src#dynapi#SDL_dynapi_procs.h
Contains 971 lines.
Source language is c.
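As an aside on the force_x11 check quoted earlier: the probe itself is easy to try in isolation, even though its answer for OpenLoco depends on what OpenLoco links against. Here is a minimal sketch of the mechanism, assuming the handle comes from dlopen(NULL, ...) as the global_symbols name suggests. The helper process_exports_symbol is hypothetical and is not part of SDL.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper, not an SDL function.
 * Returns 1 if `name` resolves in the running process (the executable plus
 * the shared libraries loaded at startup), 0 otherwise. */
int process_exports_symbol(const char *name)
{
    void *global_symbols;
    int found;

    assert(name != NULL);

    /* A NULL path asks the dynamic linker for a handle to the main program. */
    global_symbols = dlopen(NULL, RTLD_LOCAL | RTLD_NOW);
    if (global_symbols == NULL) {
        return 0;
    }

    /* dlsym() searches the main program and its startup dependencies. */
    found = dlsym(global_symbols, name) != NULL;
    dlclose(global_symbols);
    return found;
}
```

Calling process_exports_symbol("glxewInit") from a program that does not link GLEW should report that the symbol is absent, which is why the result only means something when asked from inside the real process. On older glibc versions this may need -ldl at link time.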
Let's set a breakpoint for the function this force_x11 check is in!
(gdb) break SDL_VideoInit
Breakpoint 1 at 0x7ffff7f5faf0: file /usr/src/debug/sdl2-compat/sdl2-compat-2.32.60/src/dynapi/SDL_dynapi_procs.h, line 523.
This is the point at which I learned that, although OpenLoco as a project is using the SDL2 API, my system is not using SDL2. Instead, it's using the sdl2-compat translation layer.
This knowledge still isn't enough, however. The SDL_VideoInit breakpoint is never hit! I opted to instead break on an SDL entrypoint that I know for sure is executed:
(gdb) run
Starting program: /usr/bin/OpenLoco
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[INF] OpenLoco, 25.12 ( on )
[INF] Linux (x86-64)
Breakpoint 2.2, SDL_Init (a=32) at /usr/src/debug/sdl2-compat/sdl2-compat-2.32.60/src/dynapi/SDL_dynapi_procs.h:81
81 SDL_DYNAPI_PROC(int,SDL_Init,(Uint32 a),(a),return)
(gdb) info breakpoints
2 breakpoint keep y <MULTIPLE>
breakpoint already hit 1 time
2.1 y 0x00007ffff7041c20 in SDL_Init at /usr/src/debug/sdl3/SDL3-3.2.28/src/dynapi/SDL_dynapi_procs.h:638
2.2 y 0x00007ffff7f5def0 in SDL_Init at /usr/src/debug/sdl2-compat/sdl2-compat-2.32.60/src/dynapi/SDL_dynapi_procs.h:81
I entered GDB's TUI (layout src + focus cmd) and then used a combination of next, step and finish to follow SDL_Init's execution:
- sdl2-compat: InitSubsystemInternal
- SDL3: SDL3_InitSubSystem
It doesn't take very long to enter SDL3 land. Taking a look at this output:
(gdb) s
SDL_InitSubSystem (a=32) at /usr/src/debug/sdl3/SDL3-3.2.28/src/dynapi/SDL_dynapi_procs.h:640
(gdb) s
SDL_InitSubSystem_REAL (flags=32) at /usr/src/debug/sdl3/SDL3-3.2.28/src/SDL.c:307
(gdb) bt
#0 SDL_InitSubSystem_REAL (flags=32) at /usr/src/debug/sdl3/SDL3-3.2.28/src/SDL.c:307
#1 0x00007ffff7f50224 in InitSubsystemInternal (flags=32) at /usr/src/debug/sdl2-compat/sdl2-compat-2.32.60/src/sdl2_compat.c:7042
#2 0x00005555555856e7 in OpenLoco::main(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&&) ()
#3 0x0000555555583d72 in main ()
We see that, as a result of SDL's dynamic API, the function implementations all have _REAL appended to their names. This seems to explain why our SDL_VideoInit breakpoint wasn't working.
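To illustrate why a breakpoint on an exported wrapper can stay silent, here is a rough sketch of the general jump-table pattern. It is not SDL's code: every demo_ name and both counters are hypothetical, and I am assuming that calls made inside the library go straight to the _REAL implementation.

```c
#include <assert.h>
#include <stddef.h>

/* All names here are hypothetical stand-ins for the pattern. */
int public_video_init_calls = 0; /* hits on the exported wrapper */
int real_video_init_calls = 0;   /* hits on the implementation */

/* Implementations carry the _REAL suffix. */
int demo_VideoInit_REAL(void)
{
    real_video_init_calls++;
    return 0;
}

int demo_InitSubSystem_REAL(unsigned int flags)
{
    (void)flags;
    /* Library-internal call: goes directly to the implementation. */
    return demo_VideoInit_REAL();
}

/* Table of implementations that the exported wrappers dispatch through. */
struct demo_jump_table {
    int (*VideoInit)(void);
    int (*InitSubSystem)(unsigned int);
};

static const struct demo_jump_table jump_table = {
    demo_VideoInit_REAL,
    demo_InitSubSystem_REAL,
};

/* Exported wrappers: the only symbols an application calls. */
int demo_VideoInit(void)
{
    assert(jump_table.VideoInit != NULL);
    public_video_init_calls++;
    return jump_table.VideoInit();
}

int demo_InitSubSystem(unsigned int flags)
{
    assert(jump_table.InitSubSystem != NULL);
    return jump_table.InitSubSystem(flags);
}
```

After a single demo_InitSubSystem(0x20) call, the implementation counter goes up while the wrapper counter for demo_VideoInit stays at zero. A breakpoint placed on the wrapper would behave the same way: it only fires when the application itself calls that entry point.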
Finally, the sdl2-compat revelation helps explain why OpenLoco on Plasma is unscaled: (code source)
// Pretend Wayland doesn't have fractional scaling by default.
// This is more compatible with applications that have only been tested under X11 without high DPI support.
// For apps that support high DPI on Wayland, add a SDL_HINT_VIDEO_WAYLAND_SCALE_TO_DISPLAY=0 quirk for them.
// Full discussion is here: https://github.com/libsdl-org/SDL/issues/12158
SDL3_SetHint(SDL_HINT_VIDEO_WAYLAND_SCALE_TO_DISPLAY, "1");
You might now believe that we can break on SDL_VideoInit_REAL to get to the code we want to check. Except that doesn't get hit either! The best way I've found to get to this function is to break on SDL_InitSubSystem_REAL and manually step your way to the SDL_VideoInit() call.
Even when we get there, though, we are constantly bouncing around between functions due to optimizations. As this was becoming way too complicated, I realized that I can simply add a breakpoint for those force_x11 lines.
It turns out that, in SDL3, that force_x11 variable doesn't even exist. Oops!
I turned my eyes to the logic in the (correct) SDL_VideoInit (code link). I was somehow able to break within the video logic. I believe I did this by setting a breakpoint on just the right source line, but I am having a hard time replicating it. What I do know is that I worked in this direction with strategic breakpoints on X11_CreateDevice and Wayland_CreateDevice, allowing me to look at the backtrace.
This is the key loop:
for (i = 0; bootstrap[i]; ++i) {
video = bootstrap[i]->create();
if (video) {
break;
}
}
I immediately tried to inspect this bootstrap array, but it shows what is obviously the wrong object:
(gdb) print bootstrap
$25 = {0x7ffff72953e0 <V4L2_bootstrap>, 0x7ffff72953c0 <PIPEWIRECAMERA_bootstrap>, 0x7ffff72953a0 <DUMMYCAMERA_bootstrap>, 0x0}
A somewhat more obscure piece of GDB syntax, which scopes the symbol to its source file, shows that this is because of inlining:
(gdb) print 'SDL_video.c'::bootstrap
$20 = <optimized out>
(gdb) print 'SDL_video.c'::_this
$21 = <optimized out>
I determined that, on emerging from this loop, Sway has an x11 video and Plasma has a wayland video. This was unintuitive, considering:
- both environments should be following the same list of video driver bootstrap functions
- this code is trying each environment in order, starting with Wayland. What would cause the Wayland backend to fail to initialize on Sway?
It turns out that the order is: (code source)
- Wayland "preferred"
- X11
- Wayland
where "preferred" fails if the preferred protocols are not found. Currently, the only preferred protocol is fifo-v1, which is where Sway and KDE Plasma diverge. Problem solved!
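The ordering above can be sketched as a small simulation. This is only a model under stated assumptions: the demo_session fields, the create_* functions and demo_pick_video_driver are hypothetical names of mine, and the real backends do far more than check three booleans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of what a desktop session offers. */
struct demo_session {
    bool has_wayland; /* a Wayland compositor is reachable */
    bool has_fifo_v1; /* the compositor advertises fifo-v1 */
    bool has_x11;     /* an X server (e.g. XWayland) is reachable */
};

struct demo_bootstrap {
    const char *name;
    bool (*create)(const struct demo_session *);
};

/* "Preferred" Wayland only succeeds when fifo-v1 is available. */
static bool create_wayland_preferred(const struct demo_session *s)
{
    return s->has_wayland && s->has_fifo_v1;
}

static bool create_x11(const struct demo_session *s)
{
    return s->has_x11;
}

static bool create_wayland(const struct demo_session *s)
{
    return s->has_wayland;
}

static const struct demo_bootstrap wayland_preferred = { "wayland", create_wayland_preferred };
static const struct demo_bootstrap x11 = { "x11", create_x11 };
static const struct demo_bootstrap wayland_fallback = { "wayland", create_wayland };

/* Same order as the list above: preferred Wayland, X11, plain Wayland. */
static const struct demo_bootstrap *const bootstrap[] = {
    &wayland_preferred, &x11, &wayland_fallback, NULL
};

/* Returns the name of the first backend that initializes, or NULL. */
const char *demo_pick_video_driver(const struct demo_session *session)
{
    size_t i;

    assert(session != NULL);
    for (i = 0; bootstrap[i] != NULL; ++i) {
        if (bootstrap[i]->create(session)) {
            return bootstrap[i]->name;
        }
    }
    return NULL;
}
```

In this model a session with Wayland and XWayland but no fifo-v1 (my Sway setup) ends up on x11, while one that advertises fifo-v1 (Plasma) stops at the first entry. Plain Wayland is only reached when X11 is unavailable.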
Things that would have saved time:
- Checking for XWayland sooner.
- Understanding the sdl2-compat situation sooner by looking more closely at the dependencies of the OpenLoco AUR package I use.
- Spending less time with top-down debugging (break at SDL_Init and step from there). Instead, reach for functions that are too-specific (like X11_CreateDevice) and work bottom-up to find break-able leads in the middle.
- Being able to see SDL_LogInfo messages. This has a log message that could have avoided this entire mess!
Stepping through the optimized code was rather painful, but I didn't feel like compiling a debug build of OpenLoco and SDL.