Opened 10 years ago

Closed 10 years ago

Last modified 13 months ago

#9070 closed patch

Improve endian.h and stream.h

Reported by: SF/nolange Owned by: fingolfin
Priority: normal Component: --Other--
Keywords: Cc:
Game:

Description

I improved endian.h and stream.h in various ways, resulting in a notable size-reduction for the scummvm executeable.
x86 architectures likely have the biggest improvment, due having a bswap instruction and good support under gcc (other archtectures just have a generic software-fallback for the byteswap-builtin instead using optimized versions).

On another note, some code assumes that Stream::readByte does return 0 in case of an error, even if the comments are very clear that shouldnt be expected. (xmplparser class for example). Thus the patch still emulates that behaviour.

endian.h:
*) dont use additions if an or is sufficient - this eases up optimization work for the compiler (SWAP_BYTES_16 is now 1 instruction instead of 5 on x86)
*) use byteswap-builtin if compiler supports it. (SWAP_BYTES_32 is now 1 instruction instead of 13 on x86).
*) use compiler-support for unaligned loads, this should reduce instruction count on CPUs which have special support for this (MIPS)
stream.h:
*) use funtions from endian.h where possible
*) avoid multiple virtual-calls in the read/write stream funtions, compilers cant inline them
*) made _bigEndian constant and private in MemoryReadStreamEndian and SeekableSubReadStreamEndian
engines/saga/animation.cpp:
*) fixed an access to MemoryReadStreamEndian::_bigEndian

Ticket imported from: #2838562. Ticket imported from: patches/1175.

Attachments (6)

endianstream.patch (17.3 KB ) - added by SF/nolange 10 years ago.
adds described changes to endian.h and stream.h
endian2.patch (20.4 KB ) - added by SF/nolange 10 years ago.
add special support for PSP and CONSTANT functions
gccbswap.patch (560 bytes ) - added by SF/nolange 10 years ago.
patch for the PSPToolchains GCC to add support for builtin_bswap
endian3.patch (21.3 KB ) - added by SF/nolange 10 years ago.
make FORCEINLINE work under GCC
endian4.patch (22.0 KB ) - added by SF/nolange 10 years ago.
use special byteswap instruction for PSP
tfmx.patch (2.0 KB ) - added by SF/nolange 10 years ago.
Example using the CONSTANT_* macro

Download all attachments as: .zip

Change History (18)

by SF/nolange, 10 years ago

Attachment: endianstream.patch added

adds described changes to endian.h and stream.h

comment:1 by lordhoto, 10 years ago

Nice patch!

Some comments:

+ byte b = 0; // FIXME: remove initialisation

Removing the "= 0;" shouldn't help much, C++ defaults to zero initialization, when no initialization is done. Of course I think only g++ 4.4 onwards is implementing that properly IIRC the changelog correctly. Next MSVC does warn about uninitialized variables by our default configuration AFAIK.

About the GCC use for unaligned memory access, is that anywhere described that g++ 4.0 introduced this feature for *all* backends out there? I checked the gcc 4.0.0 changelog and didn't found any mention, that this version should introduce CPU use for unaligned memory access.

Seeing that READ_UINT## and WRITE_UINT## are supporting unaligned memory access. Is there any reason not to use those in READ_BE_UINT## and WRITE_BE_UIN## when "SCUMM_NEED_ALIGNMENT" is defined?

Seeing that you use compiler specific use for SWAP_BYTES_32, is there any reason not to use compiler specific versions of SWAP_BYTES_16 also?

comment:2 by SF/nolange, 10 years ago

> Removing the "= 0;" shouldn't help much, C++ defaults to zero
> initialization, when no initialization is done. Of course I think only g++
> 4.4 onwards is implementing that properly IIRC the changelog correctly.
> Next MSVC does warn about uninitialized variables by our default
> configuration AFAIK.

removing initialization strips off ~4KB (all my tests are done with GCc4.3.4 on AMD64) and breaks code like the XMLParser. That alone should be proof enough that the variable isnt getting default initialised (that only happens if the variable is a classmember and in context of the default-constructor). I get also no warnings in MSVC (with the scummvm-projects from svn).

Removing initialization would also kinda force new code to comply with the rules.

> About the GCC use for unaligned memory access, is that anywhere described
> that g++ 4.0 introduced this feature for *all* backends out there? I
> checked the gcc 4.0.0 changelog and didn't found any mention, that this
> version should introduce CPU use for unaligned memory access.

Per definition any member of "packed" structs need to be accesses as it would be unaligned.
Support for the packed attribute started with an earlier version (3.x I think), but I found alot bugreports with versions < 4.0.
So I`d rather be safe than sorry.

> Seeing that READ_UINT## and WRITE_UINT## are supporting unaligned memory
> access. Is there any reason not to use those in READ_BE_UINT## and
> WRITE_BE_UIN## when "SCUMM_NEED_ALIGNMENT" is defined?

The "native" reads already do, the "inverse" case doesnt
Worstcase would be that the compiler emmits code for loading all bytes, shifting and or`ing them together (ie same thing as the fallback).
And then having to mask out the bytes from the 32bit integer AGAIN. its way more efficent just to load them directly.
So unless you know that unaligned loads are only a few instructions (!defined(SCUMM_NEED_ALIGNMENT) or special instructions like MIPS) its better this way.
We would need tests for architecture (again, I only can think of MIPS where this would help) to determine which way is most efficient.

comment:3 by lordhoto, 10 years ago

> removing initialization strips off ~4KB (all my tests are done with
> GCc4.3.4 on AMD64)

See you're using GCC 4.3.4, not 4.4.0 or 4.4.1 ;-). I just checked again, it seems only value-initialization was added.

Anyway I made some tests with -O, thus not size optimized. Trunk with all engines enabled:

Without patch:
non-stripped binary: 48364545 bytes
stripped: 14240200 bytes

With patch:
non-stripped binary 48351989 bytes
stripped binary: 1429912 bytes

this is 228 bytes difference in stripped versions. So not too much :-/.

Tests with -Os

non-patched:
non-stripped: 48258383 bytes
stripped: 9625088 bytes

patched:
non-stripped: 48185423 bytes
stripped: 9626016 bytes

that is a change of -928 bytes.

comment:4 by lordhoto, 10 years ago

O2 gives the following results for me:

non-patched:
non-stripped: 57190060 bytes
stripped: 12111888 bytes

patched:
non-stripped: 55355491 bytes
striped: 11845648 bytes

that's really a difference of 266240 bytes, of course the code is still larger than -Os.

I checked the Nintendo DS build files. It seems it uses two different settings. One using -O3 for speed optimizations and one using -Os -mthumb for size optimizations. I guess most of the code, except graphics / sound code, which is probably the code which needs speed, are using the size optimization. Of course one might want to check that in depth. But I guess there's not much size saved on NDS then.

That is of course no reason not to use this patch, but for real size optimization we might need to tackle a different place of the code.

by SF/nolange, 10 years ago

Attachment: endian2.patch added

add special support for PSP and CONSTANT functions

comment:5 by SF/nolange, 10 years ago

I finally got around to fix my installation of the PSP-Toolchain. Support for the bswap builtin is not existing for MIPS, and the 16bit bswap aint getting optimized right either.

And to answer a previous question: there is no 16bit bswap builtin as gcc is supposed to be smart enough to detect the code-sequence as a 16bit rotate, This genereally works on x86 and likely other architectures which have 16 bit operations. It doesnt work for MIPS as gcc treats everything as 32bit.

So the original patch didnt do its full magic, I managed to locally patch support for 32bit bswaps into gcc, but it would likely take ages till this patch ends up in the toolchain everywhere.

I added a second version of the patch:
*) added assembly-versions for PSP
*) added CONSTANT functions to use with compile-time constants
*) added some more comments describing the functions
*) replaced FORCEINLINE with inline, this might help with -Os compiles. (only using FORCEINLINE for extremely simple functions)

by SF/nolange, 10 years ago

Attachment: gccbswap.patch added

patch for the PSPToolchains GCC to add support for builtin_bswap

comment:6 by SF/nolange, 10 years ago

default compile of PSP-trunk (-O3) gives me for filesizes:

original - 7527630
endian2 - 7375438

so its ~150KB smaller
and I added the GCC Patch, it helps for the "endianstream", but is not needed for "endian2" patch

by SF/nolange, 10 years ago

Attachment: endian3.patch added

make FORCEINLINE work under GCC

by SF/nolange, 10 years ago

Attachment: endian4.patch added

use special byteswap instruction for PSP

comment:7 by fingolfin, 10 years ago

Ah, never noticed this patch before. Overall, looks like promising work that we should try to integrate soon. Some remarks:

* I fail to see the usefulness of the *CONSTANT_* macros. Yeah, I read the comments explaining them, but am not aware of a single example in our code where those would actually be useful... Know of any?

* Somewhat independently of this patch, we should get rid of FORCEINLINE and replace it by inline. The compiler usually knows much better what to inline anyway. To convince me to keep it, somebody would have to give me provable evidence that it actually has any noticeable beneficial effect.

* I really wonder how safe resp. brittle those "__attribute__ ((__packed__)" based function variants for GCC 4.x are... stuff like that tends to be broken subtly on some less widespread systems / targets.

* In a regular (non-release, hence non-optimized) build, the resulting binary for me actually was bigger than without the patch. I'll redo this with -O2 later.

comment:8 by fingolfin, 10 years ago

Very nice: With -O2 the binary size in bytes on Mac OS X 10.5, Intel, using Apple's GCC 4.2 went from
13103544 (stripped: 11241276)
down to
12823748 (stripped: 10961968)
So that's ~270kb saved. Neat.

comment:9 by SF/nolange, 10 years ago

*) byteswaps are self inverse, so if you have an expression like
int32 a;
if (TO_LE_32(a) == 0x123456)
....
you can apply a BS on both sides and with the CONSTANT_* macro move the work from runtime to compiletime. I added a small patch demonstrating this.

*) Sometimes its necessary to ensure a function gets inlined. For this patch the functions containing __builtin_constant_p(a) would be an example, if not inlined this returns always 1.
Im aware that compilers can do a good job deciding what to inline nowadays (and thus FORCEINLINE should be used very rarely), but especially gcc only works on a per-module basis and thus doesn't know how often a function is referenced (and compilers know nothing about runtime behavior unless you profile). I can cook up some examples where this might be useful.
And from a ethic perspective I would prefer simple FORCEINLINE functions over stuff like
#define FROM_BE_32(a) SWAP_BYTES_32(a)

*) Cant say that, but I doubt theres much backend-specific code necessary, loads are either flagged as aligned or aligned and code is then emitted (on targets that aren't maintained unaligned loads would likely just use a generic software fallback). The way I use it is also rather simplistic, I seen bugreports for the GCC 3.x line but those typically used this feature in more debatable ways =)

*) code is a bit more in the stream-methods and O0 inlines nothing. The patch obviously doesnt focus on this case.

by SF/nolange, 10 years ago

Attachment: tfmx.patch added

Example using the CONSTANT_* macro

comment:10 by fingolfin, 10 years ago

re CONSTANT_* macros: I see.

re FORCEINLINE: I am still not quite convinced. We already ensure that a function is inlined by specifying "inline". If the compiler decides to ignore it, so be it, it will have its reasons. Then it'll optimize away the now useless __builtin_constant_p invocation, too. If you really want to force __builtin_constant_p being used, a macro could be used, too.
Anyway, as long as we restrict it to "system" headers (i.e. stuff in common/), and carefully limit its usage to things we are sure it could have a positive effect, well, we could keep it.
Note that in a test with GCC 4.2 and -O2, changing all FORCEINLINE to inline had almost no effect (a few hundred bytes difference in binary size).

I've commited the patch with some minor tweaks.

Thanks for this excellent contribution :).

comment:11 by fingolfin, 10 years ago

Owner: set to fingolfin
Status: newclosed

comment:12 by digitall, 13 months ago

Component: --Other--
Note: See TracTickets for help on using tickets.