Opened 8 months ago

Closed 7 months ago

#14638 closed defect (fixed)

SWORD1: Crashes due to data races (multi-threading)

Reported by: PushmePullyu Owned by: AndywinXp
Priority: normal Component: Engine: Sword1
Version: Keywords:
Cc: PushmePullyu Game: Broken Sword 1

Description

Using Fedora Linux Workstation 37 x86_64 and the GOG.com/Windows/English version of Broken Sword @ 9de51fda8a871422c580481ef6f172b64cc12751

While testing the menu rewrite (see PR #5332 at https://github.com/scummvm/scummvm/pull/5332) I observed these crashes:

This sometimes happens when moving the mouse to the inventory/manipulating it and simultaneously pressing ESC to open the main menu, but it is not consistent
(line numbers will differ due to added debugging code):

ERROR: SDL_BlitSurface failed: Surfaces must not be locked during blit!

#0  0x00007ffff7e08620 in SDL_GetError () at /lib64/libSDL2-2.0.so.0
#1  0x00000000006d4951 in SurfaceSdlGraphicsManager::drawMouse() (this=0x166c1a0) at backends/graphics/surfacesdl/surfacesdl-graphics.cpp:2394
#2  0x00000000006d0cb1 in SurfaceSdlGraphicsManager::internUpdateScreen() (this=0x166c1a0) at backends/graphics/surfacesdl/surfacesdl-graphics.cpp:1401
#3  0x00000000006cfe39 in SurfaceSdlGraphicsManager::updateScreen() (this=0x166c1a0) at backends/graphics/surfacesdl/surfacesdl-graphics.cpp:1142
#4  0x00000000006af286 in ModularGraphicsBackend::updateScreen() (this=0x14b2560) at backends/modular-backend.cpp:187
#5  0x00000000005c015a in Sword1::SwordEngine::pollInput(unsigned int) (this=0x27c8070, delay=0) at engines/sword1/sword1.cpp:910
#6  0x00000000005bff75 in Sword1::SwordEngine::waitForFade() (this=0x27c8070) at engines/sword1/sword1.cpp:866
#7  0x00000000005c1063 in Sword1::Control::getPlayerOptions() (this=0x2b16960) at engines/sword1/control.cpp:244
#8  0x00000000005bfdc8 in Sword1::SwordEngine::mainLoop() (this=0x27c8070) at engines/sword1/sword1.cpp:834
#9  0x00000000005bf5ac in Sword1::SwordEngine::go() (this=0x27c8070) at engines/sword1/sword1.cpp:683
#10 0x00000000005c0608 in Sword1::SwordEngine::run() (this=0x27c8070) at ./engines/sword1/sword1.h:127
#11 0x0000000000447fdf in runGame(Plugin const*, Plugin const*, OSystem&, Common::String const&) (plugin=0x15f3620, enginePlugin=0x158e410, system=..., debugLevels=...) at base/main.cpp:324
#12 0x0000000000449f71 in scummvm_main(int, char const* const*) (argc=1, argv=0x7fffffffde78) at base/main.cpp:764
#13 0x0000000000445370 in main(int, char**) (argc=1, argv=0x7fffffffde78) at backends/platform/sdl/posix/posix-main.cpp:44

In cases where the above procedure did not lead to an immediate crash, it sometimes caused crashes later (on quitting the game or when restoring), e.g.:

Thread 1 "scummvm" received signal SIGSEGV, Segmentation fault.
in Sword1::MemMan::flush (engines/sword1/memman.cpp):
free(_memListFreeEnd->data);

I can not reproduce these bugs consistently, but suspect they are caused by unsynchronized access from multiple threads to MemMan and other objects:

The game sets up a timer in engines/sword1/sword1.cpp:640:

Common::Error SwordEngine::go() {
	...
	installTimerRoutines();
	...
}

This causes regular calls to vblCallback (which can happen from another thread):
engines/sword1/sword1.cpp:988:

void SwordEngine::installTimerRoutines() {
	debug(2, "SwordEngine::installTimerRoutines(): Installing timers...");
	getTimerManager()->installTimerProc(&vblCallback, 1000000 / TIMER_RATE, this, "AILTimer");
}

The callback receives a pointer to the SwordEngine and uses it to call several methods e.g. Sword1::SwordEngine::updateTopMenu().
This can eventually cause access to MemMan or different surfaces. However, the main thread will also use these (e.g. from getPlayerOptions()).

Since there is no synchronization, this can lead to corrupt MemMan data structures or attempts to simultaneously blit to the same surface.

To make it easier to catch this in a debugger, I addded a delayMillis(300) call to MemMan::alloc. This backtrace then shows both threads running MemMan::alloc():

(gdb) i threads
  Id   Target Id                                           Frame 
* 1    Thread 0x7ffff671a880 (LWP 317181) "scummvm"        Sword1::MemMan::alloc (this=0x1b633a0, bsMem=0x1b71ba8, pSize=2164, pCond=2) at engines/sword1/memman.cpp:41
...
  21   Thread 0x7fffd8bfa6c0 (LWP 317240) "SDLTimer"       0x00007ffff72f5045 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7fffd8bf9340, 
    rem=rem@entry=0x7fffd8bf9350) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
...

(gdb) bt
#0  Sword1::MemMan::alloc(Sword1::MemHandle*, unsigned int, unsigned short) (this=0x1b633a0, bsMem=0x1b71ba8, pSize=2164, pCond=2) at engines/sword1/memman.cpp:41
#1  0x00000000005d8431 in Sword1::ResMan::resOpen(unsigned int) (this=0x1ff3020, id=67305483) at engines/sword1/resman.cpp:257
#2  0x00000000005d81b0 in Sword1::ResMan::openFetchRes(unsigned int) (this=0x1ff3020, id=67305483) at engines/sword1/resman.cpp:202
#3  0x00000000005e5ca2 in Sword1::Screen::showFrame(unsigned short, unsigned short, unsigned int, unsigned int, unsigned char const*, signed char)
    (this=0x1988030, x=0, y=0, resId=67305483, frameNo=1, fadeMask=0x1287f40 <Sword1::Menu::_fadeEffectTop> "\001\a\005\003\002\004\006", fadeStatus=0 '\000')
    at engines/sword1/screen.cpp:1385
#4  0x00000000005d2cea in Sword1::MenuIcon::draw(unsigned char const*, signed char)
    (this=0x19e35e0, fadeMask=0x1287f40 <Sword1::Menu::_fadeEffectTop> "\001\a\005\003\002\004\006", fadeStatus=0 '\000') at engines/sword1/menu.cpp:88
#5  0x00000000005d361c in Sword1::Menu::refresh(unsigned char) (this=0x1c21410, menuType=0 '\000') at engines/sword1/menu.cpp:238
#6  0x00000000005d097d in Sword1::Logic::fnWipeHands(Sword1::Object*, int, int, int, int, int, int, int) (this=0x1b9da40, cpt=0x0, id=0, c=0, d=0, e=0, f=0, z=0, x=0)
    at engines/sword1/logic.cpp:1127
#7  0x00000000005c0f92 in Sword1::Control::getPlayerOptions() (this=0x28e6ea0) at engines/sword1/control.cpp:232
#8  0x00000000005bfdc8 in Sword1::SwordEngine::mainLoop() (this=0x2848ab0) at engines/sword1/sword1.cpp:834
#9  0x00000000005bf5ac in Sword1::SwordEngine::go() (this=0x2848ab0) at engines/sword1/sword1.cpp:683
#10 0x00000000005c0608 in Sword1::SwordEngine::run() (this=0x2848ab0) at ./engines/sword1/sword1.h:127
#11 0x0000000000447fdf in runGame(Plugin const*, Plugin const*, OSystem&, Common::String const&) (plugin=0x15f4620, enginePlugin=0x2d30990, system=..., debugLevels=...) at base/main.cpp:324
#12 0x0000000000449f71 in scummvm_main(int, char const* const*) (argc=1, argv=0x7fffffffde78) at base/main.cpp:764
#13 0x0000000000445370 in main(int, char**) (argc=1, argv=0x7fffffffde78) at backends/platform/sdl/posix/posix-main.cpp:44

(gdb) thread 21
[Switching to thread 21 (Thread 0x7fffd8bfa6c0 (LWP 317240))]
#0  0x00007ffff72f5045 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7fffd8bf9340, rem=rem@entry=0x7fffd8bf9350)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
48	  r = INTERNAL_SYSCALL_CANCEL (clock_nanosleep_time64, clock_id, flags, req,

(gdb) bt
#0  0x00007ffff72f5045 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7fffd8bf9340, rem=rem@entry=0x7fffd8bf9350)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1  0x00007ffff72f97f7 in __GI___nanosleep (req=req@entry=0x7fffd8bf9340, rem=rem@entry=0x7fffd8bf9350) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2  0x00007ffff7f09a01 in SDL_Delay_REAL (ms=<optimized out>) at /usr/src/debug/SDL2-2.26.3-1.fc37.x86_64/src/timer/unix/SDL_systimer.c:219
#3  0x00000000004409cd in OSystem_SDL::delayMillis(unsigned int) (this=0x14b3550, msecs=300) at backends/platform/sdl/sdl.cpp:733
#4  0x00000000005ed332 in Sword1::MemMan::alloc(Sword1::MemHandle*, unsigned int, unsigned short) (this=0x1b633a0, bsMem=0x1b71ba8, pSize=2164, pCond=2) at engines/sword1/memman.cpp:41
#5  0x00000000005d8431 in Sword1::ResMan::resOpen(unsigned int) (this=0x1ff3020, id=67305483) at engines/sword1/resman.cpp:257
#6  0x00000000005d81b0 in Sword1::ResMan::openFetchRes(unsigned int) (this=0x1ff3020, id=67305483) at engines/sword1/resman.cpp:202
#7  0x00000000005e5ca2 in Sword1::Screen::showFrame(unsigned short, unsigned short, unsigned int, unsigned int, unsigned char const*, signed char)
    (this=0x1988030, x=0, y=0, resId=67305483, frameNo=1, fadeMask=0x1287f40 <Sword1::Menu::_fadeEffectTop> "\001\a\005\003\002\004\006", fadeStatus=0 '\000')
    at engines/sword1/screen.cpp:1385
#8  0x00000000005d2cea in Sword1::MenuIcon::draw(unsigned char const*, signed char)
    (this=0x19e35e0, fadeMask=0x1287f40 <Sword1::Menu::_fadeEffectTop> "\001\a\005\003\002\004\006", fadeStatus=0 '\000') at engines/sword1/menu.cpp:88
#9  0x00000000005d361c in Sword1::Menu::refresh(unsigned char) (this=0x1c21410, menuType=0 '\000') at engines/sword1/menu.cpp:238
#10 0x00000000005c02d8 in Sword1::SwordEngine::updateTopMenu() (this=0x2848ab0) at engines/sword1/sword1.cpp:936
#11 0x00000000005c0412 in Sword1::vblCallback(void*) (refCon=0x2848ab0) at engines/sword1/sword1.cpp:969
#12 0x00000000006e75eb in DefaultTimerManager::handler() (this=0x18875d0) at backends/timer/default/default-timer.cpp:110
#13 0x00000000006d7f61 in timer_handler(Uint32, void*) (interval=10, param=0x18875d0) at backends/timer/sdl/sdl-timer.cpp:32
#14 0x00007ffff7e5e595 in SDL_TimerThread (_data=0x7ffff7f981e0 <SDL_timer_data>) at /usr/src/debug/SDL2-2.26.3-1.fc37.x86_64/src/timer/SDL_timer.c:168
#15 0x00007ffff7f07687 in SDL_RunThread (thread=0x18433d0) at /usr/src/debug/SDL2-2.26.3-1.fc37.x86_64/src/thread/SDL_thread.c:305
#16 RunThread (data=0x18433d0) at /usr/src/debug/SDL2-2.26.3-1.fc37.x86_64/src/thread/pthread/SDL_systhread.c:77
#17 0x00007ffff72ae15d in start_thread (arg=<optimized out>) at pthread_create.c:442
#18 0x00007ffff732fc00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

The installTimerProc() call was introduced in commit e21e0ed089c60378cccb5dadda3327d679524871, so earlier versions should be unaffected. This also means the menu rewrite is not the cause.

Change History (16)

comment:1 by AndywinXp, 8 months ago

Thank you for this very valuable report! My bad for this issue, I'll attempt a fix in the next hours (I guess I'll have to introduce locks?)

comment:2 by AndywinXp, 7 months ago

Hi! I have attempted a fix to master; can you please tell me if you can reproduce it now?

comment:3 by AndywinXp, 7 months ago

Owner: set to AndywinXp
Resolution: fixed
Status: newpending

comment:4 by PushmePullyu, 7 months ago

Just tested 2ab1385183bada86a72577ed67b2e82574717be1, sorry for the delay.

Unfortunately I still get this crash:
"ERROR: SDL_BlitSurface failed: Surfaces must not be locked during blit!"

The locking happens from the SDLTimer thread in blitCursor() (backends/graphics/surfacesdl/surfacesdl-graphics.cpp:2284):

SDL_LockSurface(_mouseSurface);

Backtrace:
Thread 21 "SDLTimer"

#0  SDL_LockSurface_REAL (surface=0x2bc45c0) at /usr/src/debug/SDL2-2.26.3-1.fc37.x86_64/src/video/SDL_surface.c:1121
#1  0x00000000006d56de in SurfaceSdlGraphicsManager::blitCursor() (this=0x1823910) at backends/graphics/surfacesdl/surfacesdl-graphics.cpp:2284
#2  0x00000000006d3985 in SurfaceSdlGraphicsManager::setPalette(unsigned char const*, unsigned int, unsigned int) (this=0x1823910, colors=0x7fffd8bf9870 "", start=0, num=256)
    at backends/graphics/surfacesdl/surfacesdl-graphics.cpp:1803
#3  0x00000000005e2395 in Sword1::Screen::fadePalette() (this=0x2abbfe0) at engines/sword1/screen.cpp:284
#4  0x00000000005c0fc5 in Sword1::SwordEngine::fadePaletteStep() (this=0x27c79c0) at engines/sword1/sword1.cpp:1187
#5  0x00000000005c112c in Sword1::vblCallback(void*) (refCon=0x27c79c0) at engines/sword1/sword1.cpp:1223
#6  0x00000000006e8b43 in DefaultTimerManager::handler() (this=0x1814fc0) at backends/timer/default/default-timer.cpp:110
#7  0x00000000006d94b9 in timer_handler(Uint32, void*) (interval=10, param=0x1814fc0) at backends/timer/sdl/sdl-timer.cpp:32
#8  0x00007ffff7e5e595 in SDL_TimerThread (_data=0x7ffff7f981e0 <SDL_timer_data>) at /usr/src/debug/SDL2-2.26.3-1.fc37.x86_64/src/timer/SDL_timer.c:168
#9  0x00007ffff7f07687 in SDL_RunThread (thread=0x16d4d10) at /usr/src/debug/SDL2-2.26.3-1.fc37.x86_64/src/thread/SDL_thread.c:305
#10 RunThread (data=0x16d4d10) at /usr/src/debug/SDL2-2.26.3-1.fc37.x86_64/src/thread/pthread/SDL_systhread.c:77
#11 0x00007ffff72ae15d in start_thread (arg=<optimized out>) at pthread_create.c:442
#12 0x00007ffff732fc00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

The error occurs here:
Thread 1 "scummvm"

...
#14 0x00000000009a735c in error(char const*, ...) (s=0x12f7637 "SDL_BlitSurface failed: %s") at common/textconsole.cpp:100
#15 0x00000000006d60a7 in SurfaceSdlGraphicsManager::drawMouse() (this=0x1823910) at backends/graphics/surfacesdl/surfacesdl-graphics.cpp:2395
#16 0x00000000006d23cf in SurfaceSdlGraphicsManager::internUpdateScreen() (this=0x1823910) at backends/graphics/surfacesdl/surfacesdl-graphics.cpp:1401
#17 0x00000000006d1557 in SurfaceSdlGraphicsManager::updateScreen() (this=0x1823910) at backends/graphics/surfacesdl/surfacesdl-graphics.cpp:1142
#18 0x00000000006b09a4 in ModularGraphicsBackend::updateScreen() (this=0x14b4560) at backends/modular-backend.cpp:187
#19 0x00000000005e2fab in Sword1::Screen::updateScreen() (this=0x2abbfe0) at engines/sword1/screen.cpp:433
#20 0x00000000005c092d in Sword1::SwordEngine::mainLoop() (this=0x27c79c0) at engines/sword1/sword1.cpp:1057
#21 0x00000000005bf800 in Sword1::SwordEngine::go() (this=0x27c79c0) at engines/sword1/sword1.cpp:738
#22 0x00000000005c12ce in Sword1::SwordEngine::run() (this=0x27c79c0) at ./engines/sword1/sword1.h:138
#23 0x0000000000447fdf in runGame(Plugin const*, Plugin const*, OSystem&, Common::String const&) (plugin=0x15f58a0, enginePlugin=0x1590410, system=..., debugLevels=...) at base/main.cpp:324
#24 0x0000000000449f71 in scummvm_main(int, char const* const*) (argc=1, argv=0x7fffffffde78) at base/main.cpp:764
#25 0x0000000000445370 in main(int, char**) (argc=1, argv=0x7fffffffde78) at backends/platform/sdl/posix/posix-main.cpp:44

This should help in reproducing the bug:

index a18ef063519..22f0e960335 100644
--- a/backends/graphics/surfacesdl/surfacesdl-graphics.cpp
+++ b/backends/graphics/surfacesdl/surfacesdl-graphics.cpp
@@ -2282,7 +2282,7 @@ void SurfaceSdlGraphicsManager::blitCursor() {
 
        SDL_LockSurface(_mouseOrigSurface);
        SDL_LockSurface(_mouseSurface);
-
+       g_system->delayMillis(10);
        // If possible, use the same scaler for the cursor as for the rest of
        // the game. This only works well with the non-blurring scalers so we
        // otherwise use the Normal scaler

comment:5 by AndywinXp, 7 months ago

Thanks! I tried a more aggressive strategy: https://github.com/scummvm/scummvm/commit/dcc37481dcb4655f93a0614fc09f5ca26455bd4b

This will prevent any simultaneous access to sensitive blitting operations; let me know if this one works out well.

comment:6 by PushmePullyu, 7 months ago

Thanks for the quick response! I just tested da5c3b8cd833a7de31d0e946248f1a7ab6a039ea

It is still crashing.

I wonder if the goal of improving timer accuracy will still be achieved
after adding all the necessary locks and if it is worth the added complexity.
It might be safer to move back to a single thread and rework the main loop instead.

Info on the crashes:
All crashes happened while repeatedly opening and closing the menu and simultaneously moving the mouse around / over the inventory.

Crash 1 (line numbers will be slightly off due to added debugging code):
Thread 1 "scummvm" received signal SIGSEGV, Segmentation fault.

#0  SurfaceSdlGraphicsManager::SDL_SetColors(SDL_Surface*, SDL_Color*, int, int) (this=0x1658050, surface=0x2c2f810, colors=0x16670b0, firstcolor=0, ncolors=256)
    at backends/graphics/surfacesdl/surfacesdl-graphics.cpp:2884
#1  0x00000000006d59f6 in SurfaceSdlGraphicsManager::blitCursor() (this=0x1658050) at backends/graphics/surfacesdl/surfacesdl-graphics.cpp:2289
#2  0x00000000006d5627 in SurfaceSdlGraphicsManager::setMouseCursor(void const*, unsigned int, unsigned int, int, int, unsigned int, bool, Graphics::PixelFormat const*, unsigned char const*) (this=0x1658050, buf=0x2bdbf20, w=13, h=16, hotspotX=1, hotspotY=1, keyColor=255, dontScale=false, format=0x279f8e0, mask=0x0) at backends/graphics/surfacesdl/surfacesdl-graphics.cpp:2211
#3  0x00000000006b110a in ModularGraphicsBackend::setMouseCursor(void const*, unsigned int, unsigned int, int, int, unsigned int, bool, Graphics::PixelFormat const*, unsigned char const*)
    (this=0x14b5560, buf=0x2bdbf20, w=13, h=16, hotspotX=1, hotspotY=1, keycolor=255, dontScale=false, format=0x279f8e0, mask=0x0) at backends/modular-backend.cpp:268
#4  0x0000000000757cfb in Graphics::CursorManager::replaceCursor(Graphics::Surface const&, int, int, unsigned int, bool, unsigned char const*)
    (this=0x272bf10, surf=..., hotspotX=1, hotspotY=1, keycolor=255, dontScale=false, mask=0x0) at graphics/cursorman.cpp:184
#5  0x00000000007579d5 in Graphics::CursorManager::replaceCursor(void const*, unsigned int, unsigned int, int, int, unsigned int, bool, Graphics::PixelFormat const*, unsigned char const*)
    (this=0x272bf10, buf=0x7fffac002e1a, w=13, h=16, hotspotX=1, hotspotY=1, keycolor=255, dontScale=false, format=0x0, mask=0x0) at graphics/cursorman.cpp:138
#6  0x00000000005d652e in Sword1::Mouse::animate() (this=0x27c1400) at engines/sword1/mouse.cpp:312
#7  0x00000000005d637b in Sword1::Mouse::setPointer(unsigned int, unsigned int) (this=0x27c1400, resId=67174400, rate=0) at engines/sword1/mouse.cpp:289
#8  0x00000000005d2164 in Sword1::Logic::fnSetMousePointer(Sword1::Object*, int, int, int, int, int, int, int) (this=0x277b4b0, cpt=0x0, id=65536, tag=67174400, rate=0, e=0, f=0, z=0, x=0)
    at engines/sword1/logic.cpp:1343
#9  0x00000000005cfbc4 in Sword1::Logic::interpretScript(Sword1::Object*, int, Sword1::Header*, int, int)
    (this=0x277b4b0, compact=0x0, id=65536, scriptModule=0x2800670, scriptBase=6, scriptNum=6) at engines/sword1/logic.cpp:522
#10 0x00000000005cf81e in Sword1::Logic::runMouseScript(Sword1::Object*, int) (this=0x277b4b0, cpt=0x0, scriptId=6) at engines/sword1/logic.cpp:477
#11 0x00000000005d5b29 in Sword1::Mouse::engine(unsigned short, unsigned short, unsigned short) (this=0x27c1400, x=284, y=258, eventFlags=0) at engines/sword1/mouse.cpp:156
#12 0x00000000005c0a3f in Sword1::SwordEngine::mainLoop() (this=0x27c4910) at engines/sword1/sword1.cpp:1072
#13 0x00000000005bf800 in Sword1::SwordEngine::go() (this=0x27c4910) at engines/sword1/sword1.cpp:738

2883	int SurfaceSdlGraphicsManager::SDL_SetColors(SDL_Surface *surface, SDL_Color *colors, int firstcolor, int ncolors) {
2884		if (surface->format->palette) {
2885			return !SDL_SetPaletteColors(surface->format->palette, colors, firstcolor, ncolors) ? 1 : 0;
2886		} else {
2887			return 0;
2888		}

(gdb) p surface
$1 = (SDL_Surface *) 0x2c2f810
(gdb) p surface->format
$2 = (SDL_PixelFormat *) 0xb81b45331bbaf1d3

The following was captured later, but might have caused the crash above:
Note: The timer thread locks Screen::_screenAccessMutex, but the main thread does not attempt to acquire it here.

Thread 1 "scummvm" hit Breakpoint 2, SurfaceSdlGraphicsManager::setMouseCursor (this=0x17bf350, buf=0x1e3e410, w=22, h=22, hotspotX=7, hotspotY=7, keyColor=255, dontScale=false, 
    format=0x1e4f900, mask=0x0) at backends/graphics/surfacesdl/surfacesdl-graphics.cpp:2001

Main thread:
#0  SurfaceSdlGraphicsManager::setMouseCursor(void const*, unsigned int, unsigned int, int, int, unsigned int, bool, Graphics::PixelFormat const*, unsigned char const*)
    (this=0x17bf350, buf=0x1e3e410, w=22, h=22, hotspotX=7, hotspotY=7, keyColor=255, dontScale=false, format=0x1e4f900, mask=0x0)
    at backends/graphics/surfacesdl/surfacesdl-graphics.cpp:2001
#1  0x00000000006b110a in ModularGraphicsBackend::setMouseCursor(void const*, unsigned int, unsigned int, int, int, unsigned int, bool, Graphics::PixelFormat const*, unsigned char const*)
    (this=0x14b5560, buf=0x1e3e410, w=22, h=22, hotspotX=7, hotspotY=7, keycolor=255, dontScale=false, format=0x1e4f900, mask=0x0) at backends/modular-backend.cpp:268
#2  0x0000000000757cfb in Graphics::CursorManager::replaceCursor(Graphics::Surface const&, int, int, unsigned int, bool, unsigned char const*)
    (this=0x272e400, surf=..., hotspotX=7, hotspotY=7, keycolor=255, dontScale=false, mask=0x0) at graphics/cursorman.cpp:184
#3  0x00000000007579d5 in Graphics::CursorManager::replaceCursor(void const*, unsigned int, unsigned int, int, int, unsigned int, bool, Graphics::PixelFormat const*, unsigned char const*)
    (this=0x272e400, buf=0x1e3fd8a, w=22, h=22, hotspotX=7, hotspotY=7, keycolor=255, dontScale=false, format=0x0, mask=0x0) at graphics/cursorman.cpp:138
#4  0x00000000005d652e in Sword1::Mouse::animate() (this=0x16bfac0) at engines/sword1/mouse.cpp:312
#5  0x00000000005d637b in Sword1::Mouse::setPointer(unsigned int, unsigned int) (this=0x16bfac0, resId=67174403, rate=0) at engines/sword1/mouse.cpp:289
#6  0x00000000005d2164 in Sword1::Logic::fnSetMousePointer(Sword1::Object*, int, int, int, int, int, int, int)
    (this=0x192c0d0, cpt=0x199d388, id=65546, tag=67174403, rate=0, e=0, f=0, z=0, x=0) at engines/sword1/logic.cpp:1343
#7  0x00000000005cfbc4 in Sword1::Logic::interpretScript(Sword1::Object*, int, Sword1::Header*, int, int)
    (this=0x192c0d0, compact=0x199d388, id=65546, scriptModule=0x19d8de0, scriptBase=5, scriptNum=5) at engines/sword1/logic.cpp:522
#8  0x00000000005cf81e in Sword1::Logic::runMouseScript(Sword1::Object*, int) (this=0x192c0d0, cpt=0x199d388, scriptId=5) at engines/sword1/logic.cpp:477
#9  0x00000000005d5bb5 in Sword1::Mouse::engine(unsigned short, unsigned short, unsigned short) (this=0x16bfac0, x=174, y=159, eventFlags=0) at engines/sword1/mouse.cpp:161
#10 0x00000000005c0a3f in Sword1::SwordEngine::mainLoop() (this=0x282e630) at engines/sword1/sword1.cpp:1072
#11 0x00000000005bf800 in Sword1::SwordEngine::go() (this=0x282e630) at engines/sword1/sword1.cpp:738

Timer thread:
#0  SurfaceSdlGraphicsManager::blitCursor() (this=0x17bf350) at backends/graphics/surfacesdl/surfacesdl-graphics.cpp:2217
#1  0x00000000006d3bfd in SurfaceSdlGraphicsManager::setPalette(unsigned char const*, unsigned int, unsigned int) (this=0x17bf350, colors=0x7fffd8bf9870 "", start=0, num=256)
    at backends/graphics/surfacesdl/surfacesdl-graphics.cpp:1803
#2  0x00000000005e2538 in Sword1::Screen::fadePalette() (this=0x192c320) at engines/sword1/screen.cpp:294
#3  0x00000000005c0feb in Sword1::SwordEngine::fadePaletteStep() (this=0x282e630) at engines/sword1/sword1.cpp:1189
#4  0x00000000005c1152 in Sword1::vblCallback(void*) (refCon=0x282e630) at engines/sword1/sword1.cpp:1225

Crash 2:
WARNING: fetchRes:: resource 0 out of bounds!
Segmentation fault (core dumped)

Crash 3 (Not actually a crash, but the freelist was corrupted):
WARNING: addToFreeList: mem block is already in freeList!
On exit:
WARNING: MemMan::flush(): _memListFreeEnd is nullptr!
WARNING: MemMan::flush: Something's wrong: still 3722627 bytes alloced!

The following shows both threads accessing the freelist simultaneously, which could have caused the corruption
(line numbers for memman.cpp and memman.h will be slightly off):

  Id   Target Id                                           Frame 
* 1    Thread 0x7ffff671a880 (LWP 428055) "scummvm"        Sword1::MemMan::addToFreeList (this=0x1b2f340, bsMem=0x19d3b60) at engines/sword1/memman.cpp:111
  21   Thread 0x7fffd8bfa6c0 (LWP 428076) "SDLTimer"       Sword1::MemMan::addToFreeList (this=0x1b2f340, bsMem=0x1b4dd60) at engines/sword1/memman.cpp:111

Main thread:
#0  Sword1::MemMan::addToFreeList(Sword1::MemHandle*) (this=0x1b2f340, bsMem=0x19d3b60) at engines/sword1/memman.cpp:111
#1  0x00000000005eebf7 in Sword1::MemMan::setCondition(Sword1::MemHandle*, unsigned short) (this=0x1b2f340, bsMem=0x19d3b60, pCond=1) at engines/sword1/memman.cpp:73
#2  0x00000000005d9cdc in Sword1::ResMan::resClose(unsigned int) (this=0x19d4a90, id=16842752) at engines/sword1/resman.cpp:284
#3  0x00000000005d99ca in Sword1::ResMan::unlockScript(unsigned int) (this=0x19d4a90, scrID=8388609) at engines/sword1/resman.cpp:237
#4  0x00000000005cf745 in Sword1::Logic::scriptManager(Sword1::Object*, unsigned int) (this=0x1b7a380, compact=0x1baba20, id=8388609) at engines/sword1/logic.cpp:455
#5  0x00000000005cea78 in Sword1::Logic::processLogic(Sword1::Object*, unsigned int) (this=0x1b7a380, compact=0x1baba20, id=8388609) at engines/sword1/logic.cpp:225
#6  0x00000000005ce855 in Sword1::Logic::engine() (this=0x1b7a380) at engines/sword1/logic.cpp:163
#7  0x00000000005c07a2 in Sword1::SwordEngine::mainLoop() (this=0x27dcb10) at engines/sword1/sword1.cpp:1034
#8  0x00000000005bf800 in Sword1::SwordEngine::go() (this=0x27dcb10) at engines/sword1/sword1.cpp:738

Timer thread:
#0  Sword1::MemMan::addToFreeList(Sword1::MemHandle*) (this=0x1b2f340, bsMem=0x1b4dd60) at engines/sword1/memman.cpp:111
#1  0x00000000005eebf7 in Sword1::MemMan::setCondition(Sword1::MemHandle*, unsigned short) (this=0x1b2f340, bsMem=0x1b4dd60, pCond=1) at engines/sword1/memman.cpp:73
#2  0x00000000005d9cdc in Sword1::ResMan::resClose(unsigned int) (this=0x19d4a90, id=67305474) at engines/sword1/resman.cpp:284
#3  0x00000000005e77ca in Sword1::Screen::showFrame(unsigned short, unsigned short, unsigned int, unsigned int, unsigned char const*, signed char)
    (this=0x1b7a5d0, x=0, y=0, resId=67305474, frameNo=1, fadeMask=0x1289960 <Sword1::Menu::_fadeEffectTop> "\001\a\005\003\002\004\006", fadeStatus=6 '\006')
    at engines/sword1/screen.cpp:1452
#4  0x00000000005d42e6 in Sword1::MenuIcon::draw(unsigned char const*, signed char)
    (this=0x2a30520, fadeMask=0x1289960 <Sword1::Menu::_fadeEffectTop> "\001\a\005\003\002\004\006", fadeStatus=6 '\006') at engines/sword1/menu.cpp:88
#5  0x00000000005d4c18 in Sword1::Menu::refresh(unsigned char) (this=0x1b3f5c0, menuType=0 '\000') at engines/sword1/menu.cpp:238
#6  0x00000000005c0fa0 in Sword1::SwordEngine::updateTopMenu() (this=0x27dcb10) at engines/sword1/sword1.cpp:1181
#7  0x00000000005c10f1 in Sword1::vblCallback(void*) (refCon=0x27dcb10) at engines/sword1/sword1.cpp:1215

comment:7 by AndywinXp, 7 months ago

Thanks for the detailed response! Unfortunately I have tried in my previous experiments force the palette fades to be single threaded, but I found it to be next to impossible, given how the main loop is structured (for instance, by design it is full of busy waits).

As for timing accuracy, I understand the concern; but for now it is important that the game doesn't crash while running code which is supposed to be there. I will definitely begin comparing timings between DOS and ScummVM once all the changes I've made are perfectly stable.

So I'll give a crack at ensuring thread safeness by fixing all these crash reports :-)

comment:8 by AndywinXp, 7 months ago

One other thing, since I'm developing and running on Windows 10; in order to replicate the crashes, would a Linux VM suffice? Do I have to run ScummVM with any specific arguments or build it with any flags?

Last edited 7 months ago by AndywinXp (previous) (diff)

comment:9 by PushmePullyu, 7 months ago

It should work in a VM (I used real hardware with a 2+2HT core Pentium Gold 4417U).

Build config was just:
./configure --disable-all-engines --enable-debug --disable-optimizations --enable-engine=sword1

One could also use --enable-tsan (or use valgrind instead).

I observed more crashes with video mode set to SDL Surface, but that could have been random.

Valgrind with helgrind should prove helpful in debugging this, e.g. something like:
valgrind --tool=helgrind --log-file=~/sword1_%p.log ./scummvm

useful valgrind options:
--scheduling-quantum=<number> (default is 100000)
--free-is-write=yes (default is no)
If you get too many bogus warnings you can create a suppressions file:
--gen-suppressions=<yes|no|all>
--suppressions=<filename>

Just ping if you need more testing done.

Last edited 7 months ago by PushmePullyu (previous) (diff)

comment:10 by AndywinXp, 7 months ago

Thanks for the advice! Using TSan I was able to knock out more data races connected to possible concurrent screen accesses. Could you please check how it is now?

comment:11 by PushmePullyu, 7 months ago

Preliminary results for c99265cf9725b8ec32698b83088fe2c8f304749f

I observed no concurrent blitting or memMan access anymore and only saw a single crash, which I could not reproduce so far.
It happened while spam clicking and moving the mouse in and out of the inventory bar:

(Line numbers will be different).

Thread 21 "SDLTimer" received signal SIGSEGV, Segmentation fault.

0x000000000048887d in READ_UINT32 (ptr=0x14) at ./common/endian.h:214
214			return ((const Unaligned32 *)ptr)->val;

#0  0x000000000048887d in READ_UINT32(void const*) (ptr=0x14) at ./common/endian.h:214
#1  Sword1::ResMan::fetchFrame(void*, unsigned int) (this=0x17d0900, resourceData=0x0, frameNo=3640146302) at engines/sword1/resman.cpp:298
#2  0x00000000004961ba in Sword1::Screen::showFrame(unsigned short, unsigned short, unsigned int, unsigned int, unsigned char const*, signed char)
    (this=0x19770e0, x=3600, y=440, resId=0, frameNo=3640146302, fadeMask=0x10ae1e0 <Sword1::Menu::_fadeEffectTop> "\001\a\005\003\002\004\006", fadeStatus=2 '\002')
    at engines/sword1/screen.cpp:1451
#3  0x0000000000482a10 in Sword1::MenuIcon::draw(unsigned char const*, signed char)
    (this=0x2812800, fadeMask=0x10ae1e0 <Sword1::Menu::_fadeEffectTop> "\001\a\005\003\002\004\006", fadeStatus=2 '\002') at engines/sword1/menu.cpp:88
#4  0x00000000004833ca in Sword1::Menu::refresh(unsigned char) (this=0x193b410, menuType=0 '\000') at engines/sword1/menu.cpp:243
#5  0x000000000046f14a in Sword1::SwordEngine::updateTopMenu() (this=0x25d9280) at engines/sword1/sword1.cpp:1186
#6  0x000000000046f2ae in Sword1::vblCallback(void*) (refCon=0x25d9280) at engines/sword1/sword1.cpp:1221

Possible cause:
Both threads accessing Sword1::Menu::_objects[]

Main thread
#0  Sword1::Menu::buildMenu() (this=0x21d39c0) at engines/sword1/menu.cpp:321
#1  0x0000000000483a16 in Sword1::Menu::fnStartMenu() (this=0x21d39c0) at engines/sword1/menu.cpp:344
#2  0x00000000004840a2 in Sword1::Mouse::engine(unsigned short, unsigned short, unsigned short) (this=0x25bc5c0, x=132, y=16, eventFlags=2) at engines/sword1/mouse.cpp:123
#3  0x000000000046ebe9 in Sword1::SwordEngine::mainLoop() (this=0x25bee30) at engines/sword1/sword1.cpp:1077
#4  0x000000000046d988 in Sword1::SwordEngine::go() (this=0x25bee30) at engines/sword1/sword1.cpp:739

Timer thread
#0  Sword1::Menu::refresh(unsigned char) (this=0x21d39c0, menuType=0 '\000') at engines/sword1/menu.cpp:234
#1  0x000000000046f14a in Sword1::SwordEngine::updateTopMenu() (this=0x25bee30) at engines/sword1/sword1.cpp:1186
#2  0x000000000046f2ae in Sword1::vblCallback(void*) (refCon=0x25bee30) at engines/sword1/sword1.cpp:1221

I'll see if I can find anything else.

comment:12 by AndywinXp, 7 months ago

Thanks, menu related bugs should also be fixed now

comment:13 by AndywinXp, 7 months ago

Hi, are there any new issues with this?

comment:14 by PushmePullyu, 7 months ago

I saw zero crashes with abd2a3d630805e1fbb62d0583b0e7fb13b127d3e, so I think the ticket can be closed now.

Many thanks for you work on this, much appreciated!

comment:15 by AndywinXp, 7 months ago

Thanks for your patience and your help with finding the issues! :) closing!

comment:16 by AndywinXp, 7 months ago

Status: pendingclosed
Note: See TracTickets for help on using tickets.