I've been hacking on the rom disk driver for the dougg3 rom simm, and implemented support for compressed disks using liblzg. This lets me cram a 50% larger rom disk image into the same size simm. The compressor is a regular Windows/Mac/Linux program, and the decompressor is a hand-written 68000 assembly routine from the lzg authors. It works great, but decompression is kind of slow, even though I specifically chose lzg for its simplicity and fast decompression. On a Mac IIsi or IIci it produces about 600K/sec of decompressed data, so decompressing the whole rom disk takes around 5-20 seconds depending on its size and the Mac's CPU speed.
The meat of the 68000 decompression routine isn't too long. It's a fairly simple Lempel-Ziv algorithm that encodes repeated data as (distance,length) references back to an earlier appearance of the same data. There's a brief summary of lzg's specific algorithm here. Does anyone see any obvious ways to substantially optimize this code? It was written for a vanilla 68000, but for the rom simm it'll always be running on a 68020 or '030. Maybe there are some '030-specific instructions that could be used to help speed it up? Or some kind of cache prefetch? My knowledge of such things is weak. There's also some bounds-checking code that could be removed, though the liblzg web site says removing it provides only a ~12% improvement.
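To make the copy semantics concrete, here's a minimal C sketch of decoding one (distance,length) back-reference. The function name and layout are my own illustration, not liblzg's actual code; the key point is that the copy must run byte-by-byte from front to back, because the source and destination ranges overlap whenever length exceeds distance (distance = 1, for example, encodes a run of the previous byte):

```c
#include <stddef.h>

/* Hypothetical sketch (not liblzg's real decoder): expand one
 * (distance, length) back-reference at output position `pos`.
 * A forward byte-by-byte copy handles overlapping ranges correctly,
 * which is exactly what the move.b (a4)+,(a1)+ / dbf loop does. */
static void copy_match(unsigned char *dst, size_t pos,
                       size_t distance, size_t length)
{
    unsigned char *out = dst + pos;      /* next free output byte */
    unsigned char *src = out - distance; /* earlier copy of the data */
    while (length--)
        *out++ = *src++;
}
```

For instance, with "ab" already in the buffer, a reference of distance 2 and length 4 extends it to "ababab" — the copy reads bytes it wrote moments earlier.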
The meat-of-the-meat where data gets copied from source to dest is a two-instruction dbf loop:
_loop1: move.b (a4)+,(a1)+
        dbf    d1,_loop1
If any '030-specific tricks could improve that, it would help the most. Here's the entirety of the decompressor's main loop:
I haven't played with the code yet, but a quick glance makes me wonder if some of the bytewise move instructions (move.b), especially the one in _loop1, could be optimized by using move.l instructions instead, reducing the number of loop iterations and memory accesses by 75%. This would require the amount of data to be copied to be a multiple of 4, or you'd have to add special handling for the leftover bytes.
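A rough C sketch of that idea, with one caveat I'd watch out for: a 4-bytes-at-a-time copy is only safe when the source and destination don't overlap within the chunk, i.e. for literal runs or for matches whose distance is at least 4. An overlapping match with a smaller distance (like a one-byte run) still needs the byte loop to replicate correctly. The function is my own illustration, not liblzg's code:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: copy `length` bytes in 4-byte chunks with a
 * byte-wise tail, as a move.l-based _loop1 might.  `distance` is the
 * back-reference distance (use >= 4, or any large value, for
 * non-overlapping literal copies). */
static void copy_fast(unsigned char *dst, const unsigned char *src,
                      size_t length, size_t distance)
{
    if (distance >= 4) {
        while (length >= 4) {      /* the move.l (a4)+,(a1)+ case */
            memcpy(dst, src, 4);   /* 4-byte chunk, alignment-safe */
            dst += 4; src += 4; length -= 4;
        }
    }
    while (length--)               /* remainder (or overlap): move.b */
        *dst++ = *src++;
}
```

One further wrinkle for the real assembly version: the 68020/030 tolerate misaligned longword accesses, but they cost extra bus cycles, so a move.l loop pays off most when the pointers happen to be aligned.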
Not that I'm encouraging you to start from scratch, but have you considered LZ4 instead? It has a reputation for being extremely fast while still being fairly efficient (it's often used for ZFS filesystem compression, among other things). The algorithm was written specifically to be small and fast.