Opened 15 years ago
Closed 14 years ago
#4828 closed defect (fixed)
BS2: Game lockup in British Museum
Reported by: | digitall | Owned by: | eriktorbjorn |
---|---|---|---|
Priority: | normal | Component: | Engine: Sword2 |
Version: | Keywords: | ||
Cc: | Game: | Broken Sword 2 |
Description
Playing through BS2, when Nico talks to the Attendant in the British Museum, causes a interactive lockup.
ScummVM is not frozen and the debug console is available, but BS2 will not respond i.e. this looks like a script bug.
This is likely to be the same or similar issue to bug 2607089.
ScummVM 1.1.0pre48374 (Mar 23 2010 23:16:52) Features compiled in: Vorbis FLAC MP3 ALSA RGB zLib FluidSynth built on Linux x86_32 2.6.31 by GCC 4.3.4
Ticket imported from: #2976008. Ticket imported from: bugs/4828.
Attachments (20)
Change History (42)
by , 15 years ago
Attachment: | sword2.010 added |
---|
by , 15 years ago
Attachment: | scummvm00000.jpg added |
---|
Screen capture of lockup with debug overlay information activated
comment:1 by , 15 years ago
Have attached : * latest savegame prior to the lockup occurring, * a capture of the console output which like bug 2607089 shows a lot of "WARNING: No free slot in FX queue!", but I'm not sure that this is related to the issue seen, and * a screen capture of the lockup. Since the lockup did not affect the engine, I used the debug console to enable debug output information, which is shown.
I think the line "Script waiting for attendant_71 (3029)" is the issue at hand.
comment:2 by , 15 years ago
Summary: | Game lockup in British Museum → BS2: Game lockup in British Museum |
---|
comment:4 by , 15 years ago
I was probably not clear about the phrase "Playing through BS2".
I have also been unable to replicate this by loading the savegame, but this occurred when I was playtesting BS2 and had played through almost the entire game from the start. I had been saving and I think I loaded from a savegame fairly near the beginning, but otherwise it was a straight play run.
This bug looks like it might be caused by a resource leak of some kind, maybe related to "WARNING: No free slot in FX queue!" which then causes a script to get hung up, but replication is likely to require playing through the game from the start with a debugger attached to try to get the same resource exhaustion...
comment:5 by , 15 years ago
The strange thing is that I'm pretty sure the FX queue is cleared every time you enter a new room, so I don't know why it would fill up here. The only thing I can think of is that the resource manager (if memory serves me) expires resources when it runs out of memory. I guess it's possible, in some rare case, that it expires something that it shouldn't have, and that this would cause it to forget that a sound has already been added to the queue, or something silly like that. (I don't think there's any safeguard against adding the same sound more than once, and for all I know there may be cases where it's sensible to allow it.)
I've added an "fxq" debugger command to the trunk, which prints the contents of the FX queue. Maybe that will come in use some day.
comment:6 by , 15 years ago
eriktorbjorn : Thanks for the "fxq" command. Will try playing through from start and see if I can replicate and see what "fxq" then gives. Will also give valgrind a try. Suspect this is going to be a slow one to get replication, but will try.
comment:7 by , 15 years ago
Owner: | set to |
---|
by , 15 years ago
Attachment: | scummvm00002.jpg added |
---|
Screen capture of lockup with debug output overlay Run #2
comment:8 by , 15 years ago
Have played through again from beginning with : ScummVM 1.2.0svn48496 (Apr 4 2010 00:27:34) Features compiled in: Vorbis FLAC MP3 ALSA RGB zLib FluidSynth
And reliably hit the bug.
The "WARNING: No free slot in FX queue!" are the cause. The fxq command shows that the FX queue has been exahusted of slots as upon entering the British Museum (As opposed to starting to talk to the attendant), three background effects are meant to be loaded.
For some reason, when you play from the start, these three are loaded repeatedly i.e about 10+ times exhausting the FX queue and causing the script problems afterwards.
Loading a savegame in the British Museum or at Ketch's Landing in the previous scene, does not trigger this problem.. so replication still requires playing from a new game state.
I have attached copies of the fxq and reslist output with the bug and fxq without.
comment:9 by , 15 years ago
I'm going to think out loud for a while...
A game resource is whatever game data that's loaded from disk. It could be graphics, sound, scripts, etc.
The engine is full of cases where it does something like "open resource, manipulate resource, close resource". However, closing a resource is not the same as removing it from memory. It simply tells the resource manager that it's done with the resource for now. The resource manager is expected to cache the resource for later use. In fact, some resources are stateful (e.g. script resources contain local variables) which means that if they were thrown away on closing and reloaded on opening, information would be lost. Incidentally, configuring the resource manager to not cache causes "No free slot in FX queue" warnings at the very beginning of the game.
But obviously, resources cannot be cached forever or we'd run out of memory. The resource manager deals with this by keeping track of how much memory is allocated, which resources are open at the moment, and when it exceeds its memory quota it expires what it thinks are the oldest resources that aren't open at the time.
So maybe the resource manager is expiring some stateful resource that belongs to the current room, and this causes things to go haywire. In that case, I can think of two possibilities:
1) There's something wrong with the "expire the oldest resources" logic. However, I can't see anything obviously wrong at the moment.
2) The game leaks resources, i.e. it opens some resource that it should have closed but doesn't. (There are resources that it legitimately should never close, of course. The global variables resource, for instance.) When I tried a random savegame, I had to enter three rooms before it started expiring anything, so normally it shouldn't expire anything museum-related while you're still there. But maybe if you play the game from the beginning, the museum is the point where the leak has accumulated to where it can no longer hold an entire room's resources in what memory remains?
I'll have to play around a bit with the latter assumption. Not right now, though.
comment:10 by , 15 years ago
After playing through the early parts of the game, I didn't see any leaks in the list of loaded resources. The current implementation doesn't always evict the least recently opened resource first. Actually, it evicts the least recently *closed* resource, but I think that should still be ok because we can assume that resources are used while they're held open. (I believe it's primarily sound effects that are held open for long periods of time, but I think the sounds from the previous room are still thrown out before anything from the current room.)
I'm not sure what to look for next. Perhaps checking if the memory usage is calculated properly. If it's not, and it thinks it's using more memory than it is, it will of course start throwing out resources much faster than before. But I don't see how that could happen...
comment:11 by , 15 years ago
Have run through to this bug with: ScummVM 1.2.0svn48496 (Apr 4 2010 00:27:34) Features compiled in: Vorbis FLAC MP3 ALSA RGB zLib FluidSynth
running under Valgrind.
This has revealed two memory errors, which should be corrected.
The first is : ==8765== Conditional jump or move depends on uninitialised value(s) ==8765== at 0x8138334: Sword2::Logic::fnSetFrame(int*) (function.cpp:406) ==8765== by 0x813A328: Sword2::Logic::fnSpeechProcess(int*) (function.cpp:1356) ==8765== by 0x813C7F7: Sword2::Logic::runScript2(unsigned char*, unsigned char*, unsigned char*) (interpreter.cpp:620) ==8765== by 0x813E59D: Sword2::Logic::processSession() (logic.cpp:130) ==8765== by 0x81247A0: Sword2::Sword2Engine::gameCycle() (sword2.cpp:717) ==8765== by 0x8125E20: Sword2::Sword2Engine::run() (sword2.cpp:536) ==8765== by 0x8056470: runGame(PluginSubclass<MetaEngine> const*, OSystem&, Common::String const&) (main.cpp:217) ==8765== by 0x8056E19: scummvm_main (main.cpp:390) ==8765== by 0x80532C5: main (main.cpp:65)
And occurs in Montfaulcon at the Cafe, during the "Meanwhile" non-interactive cutscene, with Nico. To replicate, load the attached savegame. Talk to the waiter about Labino etc. which triggers Labino to arrive. Talk to Labino, especially about the Pot, until the "meanwhile" cutscene is triggered. Note : Sorry, can't get a closer savegame as these sections during conversations are non-saveable points.
by , 15 years ago
Attachment: | sword2.002 added |
---|
Savegame for replication of 1st valgrind error - "Meanwhile" cutscene
comment:12 by , 15 years ago
The second is : ==24377== Use of uninitialised value of size 4 ==24377== at 0x816FCD4: Sword2::Screen::scaleImageGood(unsigned char*, unsigned short, unsigned short, unsigned short, unsigned char*, unsigned short, unsigned short, unsigned short, unsigned char*) (render.cpp:171) ==24377== by 0x8182170: Sword2::Screen::drawSprite(Sword2::SpriteInfo*) (sprite.cpp:704) ==24377== by 0x817B2EC: Sword2::Screen::processImage(Sword2::BuildUnit*) (screen.cpp:673) ==24377== by 0x817B84E: Sword2::Screen::drawSortFrames(unsigned char*) (screen.cpp:482) ==24377== by 0x817D3EC: Sword2::Screen::buildDisplay() (screen.cpp:315) ==24377== by 0x814B7AB: Sword2::Sword2Engine::run() (sword2.cpp:556) ==24377== by 0x80565CC: runGame(PluginSubclass<MetaEngine> const*, OSystem&, Common::String const&) (main.cpp:217) ==24377== by 0x8056F75: scummvm_main (main.cpp:390) ==24377== by 0x8053421: main (main.cpp:65)
This error repeats 11 times, with lines in render.cpp as follows : 171-3, 175-7, 163-5 and 167-9
And occurs when you enter the Mining Company in Quaramonte. To replicate, load the attached savegame and enter the stairs to the mining company.
by , 15 years ago
Attachment: | sword2.004 added |
---|
Savegame for replication of 2nd valgrind error - Mining Company, Quaramonte
by , 15 years ago
Attachment: | quaramonte-valgrind-debug.patch added |
---|
Debug Patch for 2nd valgrind error - Mining Company, Quaramonte
comment:13 by , 15 years ago
I have taken a look at the second valgrind error in the Mining Company in Quaramonte. I have traced the source of the uninitalised usage to the backPtr accessing outside the screen area, when y = 161 and rd.top = 278. My debugging patch for this is attached. If run with valgrind, the output should be aligned. Note, this will only manifest with the highest image quality setting in the menu i.e. this triggers the use of scaleImageGood, rather than scaleImageFast which does not use backBuf.
eriktorbjorn: I could clamp backBuf accesses, but I'm not sure that is the best solution for this.
comment:14 by , 15 years ago
Sorry about the delay. I've committed a fix (hopefully) for the scaler Valgrind warning, but I doubt that's the cause of this bug. From what I understand, it was only reading data where it shouldn't, not writing. (Even the bad reads may very well have been within the allocated buffer.)
I haven't looked at the other warning yet.
comment:15 by , 15 years ago
Have checked the fix with valgrind and the test savegame with no issues. Thanks for this.
I agree that neither valgrind issue is likely to be the cause of this, but still best to rule out memory corruption issues first.
I am attaching a debug log from the first "Meanwhile" cutscene valgrind error. I have localised it to the animation when Nico has her mouth taped by the General.
If this can be fixed, I will play though again and see if the British Museum issue still occurs, thus ruling out these issues as cause.
I will then look at getting a dump of the engine state variables prior to it occuring to compare to the values from a save-load version to determine the trigger.
by , 15 years ago
Attachment: | meanwhile-error-log.txt added |
---|
Debug Output log from "Meanwhile" cutscene error
comment:16 by , 14 years ago
I just committed what I believe is the proper fix for the "Meanwhile" Valgrind warning, though I don't have high hopes that it's related to this bug.
Apparently, when a script calls an "mcode" function (a function defined in the game engine, not the script; what we usually call "opcodes"), it doesn't have to provide all the parameters the function can accept. In this case, the fnTheyDo() function can take seven parameters, bus is frequently called with two or four. This is not a bug, as long as the undefined parameters aren't used. If they are used, it depends on how they're used I guess.
To keep things nice and predictable, I've made sure that any parameter not provided by the script is initialised to 0. If that causes anything else to break, I'm going to be very disappointed.
comment:17 by , 14 years ago
Have traced problem to script resource 3625 (background_sfx_71) which loads these three sounds in the British Museum.
In the "no bug" situation, this is executed multiple times, with the pc being changed in the script object cached in memory. However, in the "bug" situation, the resman mem cache exceeds the upper limit of 8 Meg, triggering the checkMemUsage() function to forcibly remove resources, including resource 3625.
This has the side effect of effectively resetting the pc, causing an infinite loop of sound fx reloads.
I have attached copies of the output of the console "mem" command listing the cache contents prior to the "bug" and "no bug" situations. It seems likely that one or more resources is added to the cache and never expired earlier in the game causing the cache overflow, OOM killer and issue shown.
A brute force solution could be acheived by increasing the cache size or using the equivalent of the "nuke" command prior to entering The British Museum, but I think a more finesse solution should be possible here if the offending resources hogging the cache can be identified and a workaround to remove them added.
comment:18 by , 14 years ago
Btw, kudos to erkitorbjorn for being right about the likely cause.
It might also be a good idea to add a warning to checkMemUsage() to output a warning when the cache has become overfull and has to force expiration of resources, since the normal expiration mechanism should ensure that resources are added and removed in a timely manner and thus this never happens.
comment:19 by , 14 years ago
Unless I completely misremember, there's only one mechanism for expiring resources so there's no difference between "force expiration" and "normal expiration". Also, I don't think the cache size itself should be a problem because someone decided long ago that on PalmOS, half the current cache size (i.e. 4 MB instead of 8 MB) was enough, and I assume that someone tested it.
So it seems more likely to me that either there's a resource leak that builds up over time, or there's something wrong with the mechanism that determines which resources to expire.
I made a quick hack to my local copy of the source code to keep track of when a resource was loaded from disk to see if any resources remained in memory when they should have been removed. I just arrived at Ketch's Landing, and right now there does indeed appear to be a small leak. There are 8 resources which were loaded from disk more than five minutes ago. Three of them are expected: They're resource 1 ("Global Variables"), 8 ("george") and 341 ("SpchFont"). 1 and 8 have refCount 1, and are supposed to be alive throughout the game, and if 341 is what it sounds like, it's used all the time (I do have subtitles turned on).
The others are more mysterious. They're resources 539 ("Wav fxM133"), 550 ("Wav fxM135"), 1655 ("Wav fxM137"), 2231 ("Wav fxM139") and 3685 ("Wav fxM232"), and they all have refCount 1 which means they won't be expired. The first three are from warehous.clu (the warehouse area in Marseille), the fourth and fifth are from quaramon.clu (Quaramonte City). They're all fairly large (approximately 500-600 KB each), so they occupy a not insignificant portion of the cache.
So it looks like I'll need to take a closer look at the sound handling...
comment:20 by , 14 years ago
Resources 539 and 550 are the lead-in/lead-out sounds for the cutscene where George and Nico escape from the warehouse, so apparently they aren't properly freed. I should have thought of that.
I've rewritten the playMovieSound() function to close the resource. Hopefully, that should finally put a stop to this bug.
comment:22 by , 14 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Savegame prior to bug