Decoding bitstreams for fun and profit

Printed: 2018-12-01 11:25

by lft

This article describes a technique for extracting bitfields from a long sequence of bytes stored in RAM.

As an example application, consider a scroller where the text is a string of 5-bit character codes. The entire text could then be stored as a bitstream, from which you read five bits at a time. But you might save some space if you represent, say, the eight most common characters as 0xxx, and all other characters as 1xxxxx. (This would also give you 40 different characters, rather than 32.) In that case, you'd first want to read a 1-bit field, to differentiate between the two cases. Then you'd read either a 3-bit field or a 5-bit field.

We will discuss how to do this efficiently and elegantly on the 6502. In particular, we will look at a technique that performs the two-stage procedure described above, and even navigates arbitrary decision trees, as part of its normal operation.

The schoolbook application for this kind of routine would be a Lempel-Ziv-Welch decruncher or a Huffman decoder. But anything is possible! For instance, you could use it to parse note events in a playroutine, instructions in a virtual machine, or entropy encoded sound samples.

We will start with a simple design, and then add complexity step by step, also optimising it to the point where the complete decoder is quite devilish to follow.

From bytes to bits

At the heart of the bitfield decoder is the shifter. This is essentially a mini-buffer of pending bits, represented as a single byte in the zero-page. As we shift out bits, the buffer occasionally becomes empty, at which time a new byte is loaded into it.

The clever part is how we represent the shifter. This is an old established technique, but it can be rather baffling when you see it for the first time. The idea is that the shifter contains (from left to right) 0 to 7 bits of pending data, followed by a single 1-bit that we'll refer to as the token, followed by zeros.

So, the following shifter contains three bits of data (1, 0, 1):

    %10110000

At program start, the shifter is initialised to $80. Here is a first attempt at a getbit routine:

    getbit
            asl     shifter
            bne     done
            jsr     getbyte
            sta     shifter
            sec
            rol     shifter
    done
            ; The bit is now in C.
            rts

In order to read a bit from the shifter, we first perform an ASL. Normally, this puts one bit of data in the carry flag, while also preparing the shifter for the next bit. But if the Z flag was set by the ASL, the buffer was in fact empty, and we shifted out the token bit. In that case, we grab a new byte, store it in the shifter, and then ROL to get the first data bit from the new byte. The ROL will also shift in a new token bit.

In practice, it would be slow to call a subroutine in order to fetch new bytes. After all, this will happen for every eighth bit, which is quite often. Instead we'll use some self-modifying code, and keep a pointer to the next byte inside an instruction operand, like this:

    getbit
            asl     shifter
            bne     done
    mod_source
            ldx     buffer
            inc     mod_source+1
            bne     no_new_page
            inc     mod_source+2
    no_new_page
            stx     shifter
            rol     shifter
    done
            ; The bit is now in C.
            rts

We're using the X register because we're going to need A for something else soon. Note that the SEC has now been removed, because carry is already set from the previous token bit. If you want to get philosophical about it, you might say that it's the same token bit that gets re-used over and over.

Next, we will rearrange the code to reduce the number of branch-taken penalty cycles. From now on, we must make sure to CLC before calling getbit.

    getbyte
    mod_source
            ldx     buffer
            inc     mod_source+1
            bne     no_new_page
            inc     mod_source+2
    no_new_page
            stx     shifter
    getbit
            rol     shifter
            beq     getbyte
            ; The bit is now in C.
            rts

From bits to fields

So now we can read individual bits from the stream. Let's pack them together into bitfields! We could of course call the getbit routine from a loop:

    getfield
            ; Y contains the requested number of bits
            lda     #0
            clc
    field_loop
            jsr     getbit
            rol
            dey
            bne     field_loop
            ; The bitfield is now in A.
            rts

(This is why we had to preserve the A register during getbit/getbyte.)

But again, subroutine calls are costly, so we'll merge getfield and getbit into a single routine. However, getting a single bit is now slower, because we have to treat it as a field of size one.

    getbit
            ldy     #1
    getfield
            ; Y contains the requested number of bits
            lda     #0
            clc
            jmp     field_loop
    getbyte
    mod_source
            ldx     buffer
            inc     mod_source+1
            bne     no_new_page
            inc     mod_source+2
    no_new_page
            stx     shifter
    field_loop
            rol     shifter
            beq     getbyte
            rol
            dey
            bne     field_loop
            ; C is clear
            rts

Note that, because we clear A at the beginning, the ROL of the accumulator always leaves carry clear, so we don't have to CLC before looping back to field_loop.

But we can do better than this! Instead of representing the requested number of bits as an integer in the Y register, we can represent it as a single 1-bit in the accumulator. As we shift new data into the accumulator, the 1-bit gets closer and closer to the MSB, and when it finally falls off the edge, we terminate the loop:

    getbit
            lda     #%10000000
    getfield
            ; Position of 1-bit in A represents requested number of bits
            clc
            jmp     field_loop
    getbyte
    mod_source
            ldx     buffer
            inc     mod_source+1
            bne     no_new_page
            inc     mod_source+2
    no_new_page
            stx     shifter
    field_loop
            rol     shifter
            beq     getbyte
            rol
            bcc     field_loop
            rts

This preserves Y and saves two cycles per bit (DEY).
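To see the token bit and the 1-bit-in-the-accumulator trick working together, here is a minimal Python model of the routine above. It is a sketch for experimentation rather than a translation of the author's code: the class name BitReader, the method names and the pos index (standing in for the self-modified operand) are all assumptions made for this illustration.

```python
class BitReader:
    """Python model of the 8-bit buffer with a token bit (names are illustrative)."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0          # stands in for the self-modified source pointer
        self.shifter = 0x80   # empty buffer: just the token bit

    def _rol_shifter(self, carry: int) -> int:
        """Rotate the buffer left through carry; return the bit shifted out."""
        value = (self.shifter << 1) | carry
        self.shifter = value & 0xFF
        return value >> 8

    def getfield(self, spec: int) -> int:
        """spec holds n-1 zero bits, then a 1-bit, to request an n-bit field."""
        a = spec
        carry = 0                              # models the CLC on entry
        while True:
            carry = self._rol_shifter(carry)
            if self.shifter == 0:              # the token was shifted out
                self.shifter = self.data[self.pos]
                self.pos += 1
                carry = self._rol_shifter(carry)   # carry is 1 here: new token
            a = (a << 1) | carry               # ROL on the accumulator
            carry = a >> 8
            a &= 0xFF
            if carry:                          # the 1-bit fell off the edge
                return a

    def getbit(self) -> int:
        return self.getfield(0b10000000)


reader = BitReader(bytes([0b10110010, 0b11000000]))
three = reader.getfield(0b00100000)   # 3-bit field
one = reader.getbit()                 # 1-bit field
five = reader.getfield(0b00001000)    # 5-bit field, crosses a byte boundary
```

The last call consumes the final four bits of the first byte and the first bit of the second one, so it exercises the refill path.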

Two-stage fields

Given the above routine, we are now in a position to implement the scroller scenario described in the introduction. Here is some code to fetch a new character from the bitstream:

    getchar
            jsr     getbit
            bne     large
            ; 3-bit character code
            lda     #%00100000
            jmp     getfield
    large
            ; 5-bit character code
            lda     #%00001000
            jsr     getfield
            clc
            adc     #8
            rts

Actually, we can shave off a byte and a pair of cycles by recognising that getfield always returns with carry set: We can safely omit the CLC and do ADC #7 instead.

In more complex scenarios, such as decrunchers, we often need to distinguish between more than two cases. Perhaps we read two bits in order to select between four differently-sized encodings:

    Value range   Coded as     Value offset (what to add to x)
    0-1           00x          0
    2-5           01xx         2
    6-21          10xxxx       6
    22-149        11xxxxxxx    22

Rather than spelling out these four cases as different paths through the code, we can use a table-based approach. This helps keep down the size of the decruncher, which is often very important. It will also enable some more optimisations further down the rabbit hole.

We will use one table for the field widths, and one table for the value offsets.

    getvalue
            lda     #%01000000      ; Get two bits.
            jsr     getfield
            tay
            lda     fields,y
            jsr     getfield
            clc
            adc     offsets,y
            ; 9-bit value returned in A and C.
            rts

    fields
            .byt    %10000000       ; Get one more bit.
            .byt    %01000000       ; Get two more bits.
            .byt    %00010000       ; Get four more bits.
            .byt    %00000010       ; Get seven more bits.
    offsets
            .byt    0
            .byt    2
            .byt    6
            .byt    22

Note that in the example, the maximum value returned is 149. Therefore, rather than saying that the result is a 9-bit value, we could simply say that the routine returns with carry undefined, and with an 8-bit result in A. We could then eliminate the CLC, and compensate by subtracting one from each value in the offset table. The reason why we can't do this for 9-bit values is that the first entry in the offset table would become $ff, and this would cause values 0 and 1 to instead come out as 256 and 257.

Decoding with arbitrary decision trees

Consider again our scroller example. Suppose we wish to encode a particularly common character (such as space) using a single bit. We might decide on the following encoding scheme:

    Value range   Coded as     Value offset (what to add to x)
    0             0            0
    1-8           10xxx        1
    9-40          11xxxxx      9

To fetch a character now, we start by getting a single bit. Based on the value of this bit, we're either done or we fetch one more bit. Based on this bit, we then either fetch three or five bits. This algorithm is essentially a tree of decisions, as illustrated by the following flowchart:

    [Flowchart: decision tree for the encoding above]


We will refer to the rhombus-shaped nodes as branch nodes and the rounded-rectangle nodes as return nodes.

Such decision trees are usually implemented explicitly, as code. But for large trees, the decoder becomes unwieldy. Next up, we'll see how we can represent decision trees more compactly using tables.

In each node of the flowchart above, we first fetch a bitfield (possibly of size zero), and then either:

  - Branch to a different node, or
  - Add a constant and return.

It is time to introduce another decoding trick! So far, the field specifiers (what we put in A prior to calling getfield) have consisted of a number of zeros followed by a single set bit. But the remaining bits have no purpose yet, and they will be available in A when getfield returns, shifted into a position immediately to the left of the fetched bits.

So, if we call getfield with A set to 001ttttt (t is for tag), we'll get tttttxxx back, where x is the fetched bitfield. The most significant bit of the tag will also be in the sign bit of the status register.

Some decrunchers, e.g. Doynamite, use this to determine whether the value returned is complete, or whether it's just the high-byte of a larger value. In the latter case, the low-byte can be grabbed very quickly straight from the byte stream.

Essentially, one tag bit is used to differentiate between two cases. However, in the present technique, we wish to encode a generic decision tree, and for this we'll have to use more tag bits. (In the following, the word branch will refer to branches in the flowchart, not 6502 branch instructions!)
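The effect of the tag bits can be checked with a few lines of Python. This is only a sketch of the accumulator arithmetic: the function name rol_field and the use of a plain list of bits in place of the byte stream are assumptions made for the illustration.

```python
def rol_field(spec, bits):
    """Shift stream bits into an 8-bit accumulator until the marker bit falls off.

    spec is laid out as: leading zeros, one marker bit, then the tag bits.
    """
    a = spec
    for bit in bits:
        a = (a << 1) | bit
        if a & 0x100:            # the marker bit left the accumulator
            return a & 0xFF
    raise ValueError("ran out of bits before the field was complete")


# Specifier 001ttttt with tag 10110, followed by the stream bits 1, 0, 1.
result = rol_field(0b00110110, [1, 0, 1])
sign_bit_set = bool(result & 0x80)   # what BMI would test on the 6502
```

Here result is the tag 10110 followed by the three fetched bits, and sign_bit_set mirrors the most significant tag bit, which is what a return-node test relies on.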

Suppose we put a number on each node in the flowchart. The current node number will be kept in the Y register. From this number, we can deduce (using a lookup table) how many bits to fetch, whether we should branch or return after fetching, and, in case of a branch node, what range of nodes we should branch to. All of this information can be encoded as a single byte, and placed in the accumulator before calling getfield.

As we have already seen, the number of leading zeros determines the field width. They are followed by a single set bit and a tag. We will use the most significant tag bit to keep track of what kind of node we're in. If this bit is clear, we're in a branch node, in which case the remaining tag bits will be used to encode the range of branch targets.

A separate lookup table, also indexed by the current node number in Y, will be used to hold the constants that are added in return nodes.

    decode
            ldy     #4              ; Start at node 4, the last node in the table.
    decode_loop
            ; Y represents the current node, and is an index into the field and
            ; offset tables.
            lda     fields,y
            ; In A, we now have:
            ;   a number of zero bits, controlling how many bits to fetch
            ;   a one bit
            ;   if we are in a return node:
            ;     a one bit (tag MSB)
            ;     fill up with zeros
            ;   if we are in a branch node:
            ;     a zero bit (tag MSB)
            ;     tag bits --> first target node (after shift)
            ; Special exception to the above:
            ;   If we're going to fetch a zero-length field, A is zero.
            ;   Handle that now.
            beq     done
            ; Otherwise, fetch the field.
            jsr     getfield
            ; In A, we now have:
            ;   a bit indicating whether we are in a branch or return node
            ;   more tag bits (all zero in case of a return node)
            ;   the field we just fetched
            ; Are we in a return node?
            bmi     done
            ; No, this was a branch node. The branch target is in A.
            ; Note that the target has been constructed automatically by
            ; concatenating the tag with the fetched bits: a tag of ttttt
            ; followed by the fetched bits 101 gives node ttttt101.
            tay
            jmp     decode_loop
    done
            ; Add constant and return.
            clc
            adc     offsets,y
            rts

    fields
            .byt    %00000000       ; Node 0: Fetch no more bits.
            .byt    %10000001       ; Node 1: Fetch 1 bit, then branch to node 2/3.
            .byt    %00110000       ; Node 2: Fetch 3 bits, then return.
            .byt    %00001100       ; Node 3: Fetch 5 bits, then return.
            .byt    %10000000       ; Node 4: Fetch 1 bit, then branch to node 0/1.
    offsets
            .byt    0               ; Add constant to obtain range 0-0.
            .byt    0               ; Unused (branch node).
            .byt    $80+1           ; Add constant to obtain range 1-8.
            .byt    $80+9           ; Add constant to obtain range 9-40.
            .byt    0               ; Unused (branch node).

A subtlety is that when we return without fetching anything (node 0), the accumulator will be zero before adding the constant. Otherwise, the tag bit leaves $80 in the accumulator on top of the fetched field, and we have to compensate accordingly in the offset table.

The above code was organised for clarity. However, we can rearrange the loop to eliminate the JMP instruction. There's also no need to start by setting up a constant Y, as we could just as well load A directly. Since the first node is always a branch node, we won't be using Y after the fetch, so we can leave it uninitialised. Hence:

    decode
            lda     #%10000000      ; Fetch 1 bit, then branch to node 0/1.
    decode_loop
            jsr     getfield
            bmi     done
            tay
            lda     fields,y
            bne     decode_loop
    done
            clc
            adc     offsets,y
            rts

    fields
            .byt    %00000000       ; Node 0: Fetch no more bits.
            .byt    %10000001       ; Node 1: Fetch 1 bit, then branch to node 2/3.
            .byt    %00110000       ; Node 2: Fetch 3 bits, then return.
            .byt    %00001100       ; Node 3: Fetch 5 bits, then return.
    offsets
            .byt    0               ; Add constant to obtain range 0-0.
            .byt    0               ; Unused (branch node).
            .byt    $80+1           ; Add constant to obtain range 1-8.
            .byt    $80+9           ; Add constant to obtain range 9-40.

The CLC at done can be removed if we adjust the offset table: We subtract one from each table entry that is reached with carry set, that is, each return node where a non-zero-sized field was fetched. (A zero-sized return node must then arrive with carry clear; the next version takes care of that with a CLC before branching back.)

Putting it all together

Cramming an arbitrary decision tree into the field table is all very nifty, and it keeps down the size of the decoder considerably. But what about performance? Surely, putting a flowchart in a table can't be faster than simply coding it with explicit branch instructions?

But as a consequence of the table-driven design, there is now a great optimisation opportunity staring us in the face: We're down to a single call to the getfield routine, and that means we can inline it!

    decode
            lda     #%10000000      ; Fetch 1 bit, then branch to node 0/1.
            clc
            jmp     decode_loop
    getbyte
    mod_source
            ldx     buffer
            inc     mod_source+1
            bne     no_new_page
            inc     mod_source+2
    no_new_page
            stx     shifter
    decode_loop
            rol     shifter
            beq     getbyte
            rol
            bcc     decode_loop
            bmi     done
            tay
            lda     fields,y
            clc
            bne     decode_loop
            ; Carry will be set if we got here via the BMI, i.e. after fetching a
            ; non-zero-sized field. Compensate in the table.
    done
            adc     offsets,y
            rts

    fields
            .byt    %00000000       ; Node 0: Fetch no more bits.
            .byt    %10000001       ; Node 1: Fetch 1 bit, then branch to node 2/3.
            .byt    %00110000       ; Node 2: Fetch 3 bits, then return.
            .byt    %00001100       ; Node 3: Fetch 5 bits, then return.
    offsets
            .byt    0               ; Add constant to obtain range 0-0 (Carry clear).
            .byt    0               ; Unused (branch node).
            .byt    $7f+1           ; Add constant to obtain range 1-8 (Carry set).
            .byt    $7f+9           ; Add constant to obtain range 9-40 (Carry set).

Indeed, with such a flexible routine, one might even be able to drive all decoding from a single call site, and thus to inline the call to the decoder itself. For a real-world example of this, please have a look at the decruncher in Spindle 2.1.

A final touch

The code is already looking rather streamlined, but let's top it off with one more optimisation: We can get rid of two cycles for each step through the decision tree, by eliminating the CLC right before branching back to decode_loop.

The following trick is only possible if, for each node, the number in the field table is either zero (for a zero-size fetch) or strictly larger than the node number. Many decision trees have this property, because node numbers are small integers, while numbers in the field table tend to be large. If not, it may be possible to fix it by rearranging the node numbers.

The idea is to access the table a little differently: Instead of simply loading from it, we perform an ADC. Naturally, we then have to compensate in the table, by subtracting from each element the node number (which happens to be in A at the time of the addition) and 1 (for the carry flag, which is set). Because every non-zero entry is strictly larger than its node number, the addition never overflows, so carry comes out clear exactly when we branch back. For a zero entry the sum wraps around to zero and leaves carry set.
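The table compensation can be checked mechanically. The sketch below models a 6502 ADC in Python and applies it to the four-node example tree. The specifier values in DESIRED are derived from the encoding rules in this article (leading zeros, marker bit, tag) and should be read as an assumption, as should the helper name adc.

```python
def adc(a, operand, carry):
    """8-bit add with carry, as on the 6502 (binary mode). Returns (result, carry)."""
    total = a + operand + carry
    return total & 0xFF, total >> 8


# Field specifiers we want in A for nodes 0 to 3 of the example tree (assumed values).
DESIRED = [0b00000000, 0b10000001, 0b00110000, 0b00001100]

# Compensated table: subtract the node number and 1 for the carry, modulo 256.
ADJUSTED = [(spec - node - 1) & 0xFF for node, spec in enumerate(DESIRED)]

checks = []
for node, spec in enumerate(DESIRED):
    # At the time of the ADC, A holds the node number and carry is set.
    a, carry = adc(node, ADJUSTED[node], 1)
    checks.append((a, carry))
```

Each entry of checks pairs the value that ends up in A with the resulting carry flag: non-zero specifiers come back unchanged with carry clear, while the zero specifier comes back as zero with carry set, which is the state the final ADC of the offset expects.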

With that, we are ready for the final version of the decoder. It is listed below in the form of a subroutine, but, as mentioned earlier, it should be inlined for maximum performance.

    decode
            lda     #%10000000      ; Fetch 1 bit, then branch to node 0/1.
            clc
            jmp     decode_loop
    getbyte
    mod_source
            ldx     buffer
            inc     mod_source+1
            bne     no_new_page
            inc     mod_source+2
    no_new_page
            stx     shifter
    decode_loop
            rol     shifter
            beq     getbyte
            rol
            bcc     decode_loop
            bmi     done
            tay
            adc     fields,y
            bne     decode_loop     ; Carry is clear when branching.
                                    ; Carry is set.
    done
            adc     offsets,y
            rts

    fields
            ; Each entry is the field specifier minus the node number minus 1.
            .byt    %11111111       ; Node 0: Fetch no more bits.
            .byt    %01111111       ; Node 1: Fetch 1 bit, then branch to node 2/3.
            .byt    %00101101       ; Node 2: Fetch 3 bits, then return.
            .byt    %00001000       ; Node 3: Fetch 5 bits, then return.
    offsets
            .byt    $ff             ; Add constant to obtain range 0-0.
            .byt    0               ; Unused (branch node).
            .byt    $7f+1           ; Add constant to obtain range 1-8.
            .byt    $7f+9           ; Add constant to obtain range 9-40.
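For readers who want to experiment with their own decision trees before committing them to assembly, the table walk can be prototyped in Python. This is a simplified model, not the author's routine: it reads bits from a Python iterator instead of the byte buffer, and it uses plain tables without the carry and $80 compensation. The names FIELDS, OFFSETS and decode, and the table values, are assumptions derived from the encoding rules described in this article.

```python
# Field specifiers per node: leading zeros (width - 1), marker bit, then tag.
# Tag MSB set means "return node"; otherwise the tag is the branch target prefix.
FIELDS = [
    0b00000000,  # node 0: fetch nothing, return
    0b10000001,  # node 1: fetch 1 bit, branch to node 2 or 3
    0b00110000,  # node 2: fetch 3 bits, return
    0b00001100,  # node 3: fetch 5 bits, return
]
OFFSETS = [0, 0, 1, 9]   # constant added in return nodes (unused for node 1)


def decode(bits):
    """Decode one value from an iterator of 0/1 bits by walking the node tables."""
    node = None
    a = 0b10000000                    # root: fetch 1 bit, branch to node 0 or 1
    while True:
        while True:                   # shift bits in until the marker falls off
            a = (a << 1) | next(bits)
            if a & 0x100:
                a &= 0xFF
                break
        if a & 0x80:                  # return node: strip the tag, add the offset
            return (a & 0x7F) + OFFSETS[node]
        node = a                      # branch node: tag plus fetched bits
        a = FIELDS[node]
        if a == 0:                    # zero-length field: return straight away
            return OFFSETS[node]


stream = iter([0,  1, 0, 1, 1, 1,  1, 1, 0, 0, 0, 0, 0])
values = [decode(stream) for _ in range(3)]
```

The stream above holds the three codes 0, 10111 and 1100000 back to back, so values collects one result from each of the three return nodes.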


Conclusion

We have seen how to extract bitfields from byte sequences stored in RAM, using a highly efficient technique that is capable of navigating arbitrary decision trees as part of the decoding process.

From: Codebase 64 wiki
