Opened 12 years ago

Closed 12 years ago

Last modified 7 months ago

#8656 closed patch

WINCE: ARM version of screen rotation fn for SDL lib

Reported by: SF/robinwatts Owned by: SF/knakos
Priority: normal Component: Port: WinCE
Keywords: Cc:


The WinCE port of ScummVM relies on a port of SDL, and in particular on the video interface it provides, which is implemented using GAPI.

The central routine that blits to the screen is GAPI_UpdateRects. This patch implements the two rotated cases of that blit in ARM code, in the most cache-friendly way I know.

Screen memory is uncached generally, so the key to getting performance out should be to make use of the write buffer. At the same time, we have to be sure that we don't blow through the data cache as we read through source lines.

The ARM code routine enclosed reads and writes 4 pixels at a time in such a way that the write buffer should merge writes for the most efficient operation possible. Reading should use 4 cache lines at a time, so should make nice use of the cache too.
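The access pattern described above can be sketched in C. This is only an illustration, not the attached ARM routine: the function name, the parameters, and the 16-bit pixel format are assumptions. It walks four source rows in parallel (so roughly four cache lines are live at once) and emits four adjacent destination pixels per step, which is the kind of store sequence a write buffer can merge.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): rotate a 16-bit image 90 degrees
 * clockwise. src is src_w x src_h pixels, dst is src_h x src_w pixels.
 * Pitches are given in pixels, not bytes. */
void rotate_right_16bpp(const uint16_t *src, size_t src_pitch,
                        uint16_t *dst, size_t dst_pitch,
                        size_t src_w, size_t src_h)
{
    size_t y0;

    /* Main loop: take four source rows at a time. */
    for (y0 = 0; y0 + 4 <= src_h; y0 += 4) {
        const uint16_t *r0 = src + (y0 + 0) * src_pitch;
        const uint16_t *r1 = src + (y0 + 1) * src_pitch;
        const uint16_t *r2 = src + (y0 + 2) * src_pitch;
        const uint16_t *r3 = src + (y0 + 3) * src_pitch;
        /* Source row y lands in destination column (src_h - 1 - y). */
        uint16_t *out = dst + (src_h - 4 - y0);

        for (size_t x = 0; x < src_w; x++) {
            /* Four reads from four different rows, four adjacent writes. */
            out[0] = r3[x];
            out[1] = r2[x];
            out[2] = r1[x];
            out[3] = r0[x];
            out += dst_pitch;   /* next destination row */
        }
    }

    /* Leftover rows when src_h is not a multiple of 4. */
    for (size_t y = y0; y < src_h; y++) {
        for (size_t x = 0; x < src_w; x++) {
            dst[x * dst_pitch + (src_h - 1 - y)] = src[y * src_pitch + x];
        }
    }
}
```

The anticlockwise case would follow the same shape with the row order and destination stepping reversed.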

My next job is to do some timings to prove that this patch is genuinely faster, but I thought I'd upload it here in advance for safekeeping.

Ticket imported from: #1714659. Ticket imported from: patches/761.

Attachments (1)

diff (7.8 KB) - added by SF/robinwatts 12 years ago.
ARM version of rotation functions, plus C changes to call it

Change History (10)

Changed 12 years ago by SF/robinwatts

Attachment: diff added

ARM version of rotation functions, plus C changes to call it

comment:1 Changed 12 years ago by SF/knakos

First of all, thank you for your contribution, I didn't think anyone would have the patience to read through the Monster function :-)

Some random comments:

"Screen memory is uncached generally": All I/O ranges are uncached generally, but I do not know if this applies to screen mem for a fact.

Cache starvation is a problem, yes, especially for mixed data/instruction caches.

Is the write buffer available on most ARM devices? AFAIK, mobile devices use either XScale or one-chip solutions like OMAP.

Principle of operation comments:
In my routines, I try to rely less on the "helper" aids of the processor, like the write buffer and the cache which should always be present on a decent arm processor. This is because I do not have the resources to see what configurations for caches, cache sizes and whatever else features a "typical" ce device has. Further, system-on-chip solutions including ARM cores which *should* be the typical case in wince devices have crippled implementation of the cores in, as far as I'm concerned, random ways (from extremely low AHB bus speeds/widths to small caches etc).

Since I cannot check these things, my design is this: absolutely minimize the bus accesses, while keeping the code a little bit understandable. This is difficult to do because of rotation, but I end up copying a 2x2 pixel block in the worst case. Some adverse effects of this include not-so-tight loops and, as you mention, cache starvation (due to cache decimation from reading a couple of source lines, in conjunction with the larger loops).

So in your code, loading and storing of halfwords is a little weird initially, like a step back for me, but I fully understand how this may play out to be faster. I just don't know at this point, how this may run across a broad spectrum of devices.

If you have any comments on this please share them. Also, I'm too freakin' busy this week :-(

comment:2 Changed 12 years ago by SF/knakos

Owner: set to SF/knakos

comment:3 Changed 12 years ago by SF/robinwatts

I've never seen an ARM machine where the screen is cached by default. I've got a patch here for the RISC PC that *makes* the screen cached, but it's primarily interesting only to see what the effects are. It results in stuff taking much longer to be flushed back to the screen - pixels written don't get updated on the screen until the processor 'flushes' them out of the cache by doing something else.

The only alternative to having an uncached screen is flushing the cache every frame, and that's a painful process. So I think we're safe to assume it's uncached.

In general, the smallest cache you'll meet on any ARM these days is (I believe) an 8K unified cache, with cache associativities ranging from 4 to 32ish. More often they have 8K+ code and 8K+ data caches. All ARMs from 6 upwards have a write buffer (including both XScale and StrongARM).

I understand the attraction of loading/storing words, rather than half words, but I believe that the time taken to manipulate the data from one form to another offsets any savings that might be made.

In general, if we are loading from cache, then a word load and a half word load both take 1 cycle (ignoring latencies, which can be hidden by a smart compiler/piece of assembler). Storing (with the write buffer) can largely be assumed to take 1 cycle too.

Therefore, loading and storing words rather than halfwords is a win, only if we can convert the input word into the appropriate output word in a single cycle.

This is possible for 180 degree rotations (LDR, MOV ROR#16, STR), but not (I believe) for 90 degree ones (LDR, LDR, AND, BIC, ORR, AND, BIC, ORR, STR, STR). It may be possible to match the half word ones in some cases, but I don't think they can be bettered - hence 'simpler' code wins for me.
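The difference between the two cases can be illustrated in C. This is a sketch under assumptions: the function names are invented here, pixels are taken to be 16 bits packed two to a 32-bit word, and which half holds which pixel depends on endianness and rotation direction. The 180 degree case is a single rotate of the word, while the 90 degree case has to split two words and recombine the halves, which is where the extra AND/BIC/ORR-style work comes from.

```c
#include <stdint.h>

/* 180 degrees: swap the two 16-bit pixels inside one word.
 * On ARM this maps to a single rotate (ROR #16). */
static inline uint32_t swap_pixel_pair(uint32_t w)
{
    return (w >> 16) | (w << 16);
}

/* 90 degrees: a and b are words from two adjacent source rows, each holding
 * two pixels. Each output word pairs the pixels that share a source column,
 * so both inputs must be masked/shifted and merged before storing. */
static inline void regroup_2x2(uint32_t a, uint32_t b,
                               uint32_t *out0, uint32_t *out1)
{
    *out0 = (a & 0x0000FFFFu) | (b << 16);          /* low halves of a and b  */
    *out1 = (a >> 16)         | (b & 0xFFFF0000u);  /* high halves of a and b */
}
```

With halfword loads and stores, the 90 degree case needs no merging at all, which is the trade-off argued above.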

Current plan for evaluating the speed: make both the C and the ARM versions blit the rectangles 100 times rather than just once. Play the intro movie to Broken Sword and see how long they take with each type of code.

comment:4 Changed 12 years ago by SF/knakos

OK, I can see why the screen mem has to (in all probability) be uncached.

You're also correct about the load latency - processing race. It has been my intent, since I cannot foretell how slow the external mem bus is, to assume that bus accesses are too slow and even make no assumptions about the cache's performance. Hence, I load up 32bit words, as many as needed, perform local (0 wait state) processing and then store the result (i.e. minimize bus access). If the load latency is comparable to or even less than the time taken to rotate pixels (which should play out with the sequence of instructions you state), then we're better off simplifying the code to eventually deal only with halfwords, as you have done.

So I think you'll see better performance with your routine (probably by a small amount, but never a negligible one when every cycle counts :-)).

Still I for one cannot predict the difference in performance across a lot of devices (we might be overlooking some architectural aspect). If this test turns out to be as I expect (or even better :-) ), I'll surely include the routine. Looking forward to those results!

On a side note, I hate for only ScummVM to benefit from all those great changes to the CE port's SDL, and I should probably clean up the code in the future and submit it as a proper patch to SDL. But I have to be sure first that we're compatible with most devices.

comment:5 Changed 12 years ago by SF/robinwatts

OK. I've only timed the "RIGHT" turn case, not the "LEFT" case, but I'm gonna assume it's similar.

The Broken Sword intro movie should take 2 minutes 39 seconds. If I nobble the code to do the update 100 times, then the ARM version plays the whole movie in 28 minutes 35 seconds. The C version takes 40 minutes 9 seconds.
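For a rough sense of what these timings imply, the arithmetic can be sketched as below. The helper names are made up, and the model is an assumption rather than anything measured in this ticket: it treats the nominal 2:39 running time as fixed overhead common to both runs and attributes the rest to the repeated updates. Under that assumption the extra time is 2250 seconds for C against 1556 seconds for ARM, so the C blit costs roughly 1.4 times as much as the ARM one.

```c
/* Convert a minutes:seconds reading to seconds. */
static unsigned to_seconds(unsigned minutes, unsigned seconds)
{
    return minutes * 60u + seconds;
}

/* Ratio of C blit time to ARM blit time, scaled by 1000 to stay in integers.
 * ASSUMPTION: everything except the repeated updates costs `baseline`
 * seconds in both runs; the ticket does not measure this directly. */
static unsigned blit_ratio_permille(unsigned baseline,
                                    unsigned c_total,
                                    unsigned arm_total)
{
    return (c_total - baseline) * 1000u / (arm_total - baseline);
}
```

Calling blit_ratio_permille(to_seconds(2, 39), to_seconds(40, 9), to_seconds(28, 35)) gives the scaled ratio for the figures quoted above; the unadjusted ratio of total times, 2409 to 1715 seconds, comes out close to the same value.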

comment:6 Changed 12 years ago by SF/knakos

Excellent performance. There's probably no way this is going to be smeared in more cut-down versions of ARMs. This is going in the port, soon.

comment:7 Changed 12 years ago by SF/knakos

Status: new → closed

comment:8 Changed 12 years ago by SF/knakos

Committed to internal SDL rev. 29. Thanks very much.

comment:9 Changed 7 months ago by digitall

Component: Port: WinCE