What does one do when certain global events forces them to stay inside*? Well assembly language optimization of course! Today I’m going to tell you about how I made the slowest routine in the game 25% faster. This is about to get very technical so break out your assemblers and hang on to your hat.
*For those of you reading this from the future, this is because of the Coronavirus pandemic of 2020
The biggest bottleneck in the Chuck Jones: Space Cop of the Future is, as it is in most VGA era MS-DOS games, drawing sprites. The reasons for this are twofold. The first is that the CGA, EGA and VGA cards used in MS-DOS PCs did not contain any hardware to handle the drawing of sprites, like dedicates consoles of the time did, as a result the CPU needs to transfer every byte to the screen all by itself. The second reason is video memory wait states, essentially this means that video memory cannot be accessed by the CPU and the video hardware at the same time. Because of this the CPU needs to wait while the video hardware access memory before it can draw any more pixels, unfortunately the timing of these accesses are essentially random and varies between cards. Essentially this just means video memory is very slow, and how slow it is depends on the video card.
One way to speed up drawing sprites is to use repeated x86 string instructions, this way memory accesses can happen as quickly as possible. Unfortunately using these instructions is not always possible when sprites can be partially transparent, because a check must be done before plotting each pixel. There are several ways to speed up transparent sprites as well, the first being compiled sprites and the second being RLE sprites. Neither of these are really viable in CJ:SCotF, compiled sprites take way too much memory given that the game has many unique sprites and my use of planar graphics in Mode X make RLE sprites inefficient. So I’m essentially stuck with making the basic algorithm faster in assembly language. So lets try that!
Some C style code for drawing transparent sprites might look like this, while I don’t suggest anyone actually use this code, it illustrates the idea pretty well.
while(numlines--)
{
while(numpixels--)
{
char pixel = sprite[image_index];
if(pixel != 0) //if the pixel is non transparent
{
screen[screen_index] = pixel; //draw it
}
//go to the next pixel
screen_index++;
image_index++;
}
//add the remaining distance to the end of the line
screen_index += screen_diff;
image_index += image_diff;
}
A basic assembly language implementation
So lets look at a pretty reasonable assembly implementation
; bx = num rows
; dx = num colums
; ds:si = sprite data
; es:di = screen data
cld
rowLoop :
mov cx, dx ; reload pixel counter
lodsb ; load 1 pixel
test al, al
jnz storePixel ; if the pixel != 0 then goto store
skipPixel:
inc di ; increment the screen pointer
dec cx
jz finishRow ; if no more pixels then goto end
loadPixel:
lodsb ; load 1 pixel
test al, al
jz skipPixel ; if the pixel != 0 then goto skip
storePixel:
stosb ; store the pixel
loop loadPixel ; if cx != 0 load next pixel
finishRow:
add si, ss:lineDiff
add di, ss:screenDiff
dec bx
jnz rowLoop ; if bx != 0 goto next row
This has a few advantages over the C version, the code can be arranged so that the skipPixel case falls through into the next iteration of the loop. This guarantees that only one branch is taken per pixel, this is especially good because on x86 processors branches that are not taken are much less expensive than branches that are. Additionally this version uses very compact string instructions such as lodsb & stosb which can be loaded very quickly. Now of course we can do better.
Unrolling loops
Lets apply a simple optimization, namely unrolling loops. This way we can do two pixels per loop instead of one so that we have fewer branches
; bx = num rows
; dx = num colums
; ds:si = sprite data
; es:di = screen data
rowLoop :
mov cx, dx ; reload pixel counter
shr cx, 1 ; divide by two
jz lastPixel ; if theres nothing left do the last pixel
lodsw ; load 2 pixels
test al, al
jnz storePixel1 ; if the first pixel != 0 then store it
skipPixel1:
inc di ; increment the screen pointer
or al, ah ; al == 0, this sets al = ah & sets flags in one go
jnz storePixel2 ; if the second pixel != 0 then store it
skipPixel2:
inc di ; increment the screen pointer
dec cx
jz lastPixel ; if no more pixels then goto end
loadPixel:
lodsw ; load 2 pixels
test al, al
jz skipPixel1 ; if the first pixel != 0 then skip it
storePixel1:
stosb ; store the first pixel
test ah, ah
jz skipPixel2 ; if the first pixel != 0 then skip it
mov al, ah
storePixel2:
stosb ; store the second pixel
loop loadPixel
lastPixel:
test dl, 1 ; if the number of pixels was even, otherwise do 1 more
jz finishRow ; then jump to the end
lodsb ; load the final pixel
test al, al
jz skipLast ; if the last pixel == 0 then skip it
stosb ; store the last pixel
dec di
skipLast:
inc di ; increment the screen pointer
finishRow:
add si, ss:lineDiff
add di, ss:screenDiff
dec bx
jnz rowLoop ; if bx != 0 goto next row
How much better is this? Roughly ~22% better* then the standard assembly implementation above! That’s pretty good especially given that this code is a huge bottleneck in the game. I’ll admit that this isn’t a simple naive unroll, but it the unrolling allows us to use lodsw to load two pixels at a time which is a big win too.
However for even this code could be better. This code checks whether we have an even or odd number of pixels on each row, for tall and skinny sprites like characters this could be a significant waste of time given that we know whether the rows are even or odd outside of the row loop.
*timed on a 10 MHz 286 w/ VGA using the PIT counter for timing
Specialized routines
We can instead write 3 different routines. One for even widths, one for odd widths and one for a width of exactly one and then move these checks outside of the loop.
if(bytes_per_line == 1)
{
doTransparentBlitOne(source.h, screenOffset, bitmapPtr, lineDiff, screenDiff);
}
else if(bytes_per_line & 1)
{
doTransparentBlitOdd(source.h, screenOffset, bitmapPtr, bytes_per_line, lineDiff, screenDiff);
}
else
{
doTransparentBlitEven(source.h, screenOffset, bitmapPtr, bytes_per_line, lineDiff, screenDiff);
}
The actual assembly code is a bit verbose so I won’t actually show it here. But with that optimization we get a further ~3% improvement on large sprites and even better with small sprites, definitely worth it!
Conclusion
Now I’m pretty sure that this could be faster, for one thing this version uses a lot of branches which is always bad, but right now I’ve down about all the optimizing I can handle. I actually did write a version without branching but it was actually slower than the initial implementation by about 30%! If anyone is interested I have the ‘final’ version of the code here, formatted as inline code for OpenWatcom C/C++. If anyone reading this comes up with some further improvements or alternative methods, I would be very interested to hear! Until next time, I promise you all one day, hopefully soon, I will actually finish this game…