Wednesday, March 1, 2023

Linear Sweeping vs Recursive Disassembly

Objdump's linear sweep

Objdump's linear algorithm makes it fast, but there are tradeoffs. A linear sweep decodes every byte of a section in order, so any data that ends up in an executable section is decoded as if it were code. If we construct a Linux executable, we find we can plant strings in it which objdump will, to no surprise, blindly misinterpret. For example, we can use GCC's inline __asm__ statement to create a binary whose .text section contains the string zzz. The string is irrelevant to the program, but objdump will parse it as machine instructions anyway:

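#include <stdio.h>

void main() {
    __asm__(
        /* Data planted in .text: "zzz" plus its NUL is 7a 7a 7a 00. */
        ".section .text\n"
        "1: .string \"zzz\";"
        ".section .data\n"
        "message: .string \"Transference..\\n\"\n"
        ".section .text\n"
        /* write(1, message, 14) through the legacy int $0x80 interface;
           a length of 14 leaves out the trailing newline. */
        "mov $4, %rax\n"
        "mov $1, %rbx\n"
        "mov $message, %rcx\n"
        "mov $14, %rdx\n"
        "int $0x80\n"
        /* exit(0) */
        "mov $1, %rax\n"
        "xor %rbx, %rbx\n"
        "int $0x80\n"
    );
}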

This has to be compiled without the standard library and without position independence so gcc doesn't complain: gcc -nostdlib -no-pie -o golf golf.c. Then we can run our binary and see that it does in fact execute without crashing:

$ ./golf
Transference..

Now observe the way objdump handles the string in the .text section. Running objdump -D -j .text golf against our binary gives:

Disassembly of section .text:

0000000000401000 <main>:
  401000:  55                    push   %rbp
  401001:  48 89 e5              mov    %rsp,%rbp
  401004:  7a 7a                 jp     401080
  401006:  7a 00                 jp     401008
  401008:  48 c7 c0 04 00 00 00  mov    $0x4,%rax
  40100f:  48 c7 c3 01 00 00 00  mov    $0x1,%rbx
  401016:  48 c7 c1 00 30 40 00  mov    $0x403000,%rcx
  40101d:  48 c7 c2 0e 00 00 00  mov    $0xe,%rdx
  401024:  cd 80                 int    $0x80
  401026:  48 c7 c0 01 00 00 00  mov    $0x1,%rax
  40102d:  48 31 db              xor    %rbx,%rbx
  401030:  cd 80                 int    $0x80
  401032:  90                    nop
  401033:  5d                    pop    %rbp
  401034:  c3                    ret

And here we see the bytes of our string "zzz" (7a 7a 7a, plus the terminating 00) interpreted as two jp instructions. I believe this trick also works with various other program headers. We could use it to create deliberately misleading binaries, or perhaps worse. And even when it happens inadvertently, such behavior could mislead an analyst.
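To make the failure mode concrete, here is a minimal sketch of a linear sweep in Python. It is a toy and not how objdump is implemented: decode is a hypothetical helper that only knows the handful of encodings appearing in the listing above, and TEXT is simply those bytes copied out by hand. The point is that the loop never asks whether a byte is code or data; it just decodes whatever comes next.

```python
BASE = 0x401000  # address of .text in the listing above

# The 53 bytes of .text, copied by hand from the objdump listing.
TEXT = bytes.fromhex(
    "55 4889e5 7a7a 7a00 "
    "48c7c004000000 48c7c301000000 48c7c100304000 48c7c20e000000 "
    "cd80 48c7c001000000 4831db cd80 90 5d c3"
)

# Fixed encodings the toy decoder recognizes verbatim.
FIXED = {
    bytes.fromhex("55"): "push %rbp",
    bytes.fromhex("4889e5"): "mov %rsp,%rbp",
    bytes.fromhex("4831db"): "xor %rbx,%rbx",
    bytes.fromhex("90"): "nop",
    bytes.fromhex("5d"): "pop %rbp",
    bytes.fromhex("c3"): "ret",
}
REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode(buf, off):
    """Toy decoder: return (length, text) for the instruction at buf[off]."""
    for pattern, text in FIXED.items():
        if buf.startswith(pattern, off):
            return len(pattern), text
    op = buf[off]
    if op == 0x7a:  # jp rel8: target is relative to the next instruction
        rel = int.from_bytes(buf[off + 1:off + 2], "little", signed=True)
        return 2, f"jp {BASE + off + 2 + rel:x}"
    if op == 0xcd:  # int imm8
        return 2, f"int $0x{buf[off + 1]:x}"
    if buf.startswith(b"\x48\xc7", off) and buf[off + 2] >> 3 == 0b11000:
        # mov imm32 into the register named by the low three ModRM bits
        imm = int.from_bytes(buf[off + 3:off + 7], "little")
        return 7, f"mov $0x{imm:x},%{REGS[buf[off + 2] & 7]}"
    raise ValueError(f"toy decoder does not know byte {op:#x} at offset {off}")


def linear_sweep(buf):
    """Decode buf front to back, one instruction after another."""
    rows, off = [], 0
    while off < len(buf):
        length, text = decode(buf, off)
        rows.append((BASE + off, text))
        off += length  # no control-flow knowledge: just move to the next byte
    return rows


if __name__ == "__main__":
    for addr, text in linear_sweep(TEXT):
        print(f"{addr:x}: {text}")
```

The string bytes at 0x401004 and 0x401006 come out as jp 401080 and jp 401008, exactly as they do in objdump's output.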

r2's recursive disassembly

Recursive disassembly solves some of the problems that linear disassembly gives us, because it follows control flow from the entry point instead of decoding every byte in order. But even radare2 decodes our meaningless "zzz" sequence in the .text section as jp instructions, even though it's just a string that is irrelevant to the program. The string sits directly in the fall-through path after the function prologue, so following control flow reaches it anyway. It's slightly better here, though, because radare2 provides some context: it annotates each jp with its true and false targets and records a CODE XREF at 0x401008, which points us to where the real instructions begin:

[0x0040102d]> pdr@entry0
  ;-- section..text:
  ;-- segment.LOAD1:
  ;-- main:
  ;-- rip:
┌ 53: entry0 ();
│ bp: 0 (vars 0, args 0)
│ sp: 0 (vars 0, args 0)
│ rg: 0 (vars 0, args 0)
│ 0x00401000      55             push rbp              ; [02] -r-x section size 53 named .text                                                                                  
│ 0x00401001      4889e5         mov rbp, rsp
│ 0x00401004      7a7a           jp 0x401080
| // true: 0x00401080  false: 0x00401006
│ 0x00401006      7a00           jp 0x401008
| // true: 0x00401008  false: 0x00401008
│ ; CODE XREF from entry0 @ 0x401006
│ 0x00401008      48c7c0040000.  mov rax, 4
│ 0x0040100f      48c7c3010000.  mov rbx, 1
│ 0x00401016      48c7c1003040.  mov rcx, loc.message  ; 0x403000 ; "Transference..\n"                                                                                          
│ 0x0040101d      48c7c20e0000.  mov rdx, 0xe          ; 14
│ 0x00401024      cd80           int 0x80
│ 0x00401026      48c7c0010000.  mov rax, 1
│ 0x0040102d      4831db         xor rbx, rbx
│ 0x00401030      cd80           int 0x80
│ 0x00401032      90             nop
│ 0x00401033      5d             pop rbp
└ 0x00401034      c3             ret
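
Why does a recursive disassembler still fall for the string here? A rough sketch of recursive descent suggests the answer. This is a toy under stated assumptions, not radare2's actual analysis: insn is a hypothetical helper covering only the encodings in the listing above, and TEXT is those bytes copied by hand. Because the string bytes sit right after mov rbp, rsp with no branch in between, following control flow from the entry point walks straight into them.

```python
BASE = 0x401000  # address of .text in the listing above

# The 53 bytes of .text, copied by hand from the radare2 listing.
TEXT = bytes.fromhex(
    "55 4889e5 7a7a 7a00 "
    "48c7c004000000 48c7c301000000 48c7c100304000 48c7c20e000000 "
    "cd80 48c7c001000000 4831db cd80 90 5d c3"
)


def insn(buf, off):
    """Toy decoder: return (length, kind, branch_target) for buf[off]."""
    b = buf[off]
    if b in (0x55, 0x5d, 0x90):  # push %rbp, pop %rbp, nop
        return 1, "fall", None
    if b == 0xc3:  # ret ends the current path
        return 1, "stop", None
    if b == 0xcd:  # int imm8
        return 2, "fall", None
    if b == 0x7a:  # jp rel8: conditional, so it has two successors
        rel = int.from_bytes(buf[off + 1:off + 2], "little", signed=True)
        return 2, "branch", BASE + off + 2 + rel
    if b == 0x48 and buf[off + 1] in (0x89, 0x31):  # mov/xor reg,reg
        return 3, "fall", None
    if b == 0x48 and buf[off + 1] == 0xc7:  # mov imm32,reg
        return 7, "fall", None
    raise ValueError(f"toy decoder does not know byte {b:#x} at offset {off}")


def recursive_descent(buf, entry=BASE):
    """Follow control flow from entry; return (decoded, unresolved) addresses."""
    seen, outside, work = set(), set(), [entry]
    while work:
        addr = work.pop()
        if addr in seen:
            continue
        off = addr - BASE
        if not 0 <= off < len(buf):
            outside.add(addr)  # branch target that lies beyond the section
            continue
        seen.add(addr)
        length, kind, target = insn(buf, off)
        if kind == "branch":
            work.append(target)  # the "true" edge
        if kind != "stop":
            work.append(addr + length)  # the fall-through ("false") edge
    return sorted(seen), sorted(outside)


if __name__ == "__main__":
    seen, outside = recursive_descent(TEXT)
    print([hex(a) for a in seen])
    print([hex(a) for a in outside])
```

Both 0x401004 and 0x401006 end up in the decoded set, and the only trace that something is off is the unresolved target 0x401080, which lies past the end of the 53-byte section.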

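But what if the string we leave in our .text section doesn't decode to a jump? What if, instead, we leave the string hell there? Well, then it becomes a bit more ambiguous, even in radare2. The string's bytes are 68 65 6c 6c 00: the leading 68 (h) is the opcode for push with a 32-bit immediate, and the remaining 65 6c 6c 00 are read as that immediate in little-endian order, so the whole string shows up as a single push 0x6c6c65: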

[0x00401000]> pdr@entry0
  ;-- section..text:
  ;-- segment.LOAD1:
  ;-- entry0:
  ;-- rip:
┌ 54: int main (int argc, char **argv, char **envp);
│ 0x00401000      55             push rbp                ; [02] -r-x section size 54 named .text                                            
│ 0x00401001      4889e5         mov rbp, rsp
│ 0x00401004      68656c6c00     push 0x6c6c65           ; 'ell'                                                                            
│ 0x00401009      48c7c0040000.  mov rax, 4
│ 0x00401010      48c7c3010000.  mov rbx, 1
│ 0x00401017      48c7c1003040.  mov rcx, loc.message    ; 0x403000 ; "Transference..\n"                                                    
│ 0x0040101e      48c7c20e0000.  mov rdx, 0xe            ; 14
│ 0x00401025      cd80           int 0x80
│ 0x00401027      48c7c0010000.  mov rax, 1
│ 0x0040102e      4831db         xor rbx, rbx
│ 0x00401031      cd80           int 0x80
│ 0x00401033      90             nop
│ 0x00401034      5d             pop rbp
└ 0x00401035      c3             ret
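
The arithmetic behind that push is easy to check by hand. The snippet below is a small sketch; read_as_push_imm32 is a hypothetical helper name, and it only handles the one encoding discussed here (opcode 0x68 followed by a 32-bit immediate).

```python
def read_as_push_imm32(raw):
    """Decode raw the way a disassembler would when it starts with 0x68."""
    if len(raw) < 5 or raw[0] != 0x68:
        raise ValueError("not a push imm32 encoding")
    imm = int.from_bytes(raw[1:5], "little")  # x86 immediates are little-endian
    return f"push {imm:#x}"


if __name__ == "__main__":
    raw = b"hell\x00"  # what .string "hell" emits, NUL terminator included
    print(raw.hex(" "))             # 68 65 6c 6c 00
    print(read_as_push_imm32(raw))  # push 0x6c6c65
```

The 'h' is consumed as the opcode, which is why radare2's comment on that line only shows 'ell'.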

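A cleaner way to inspect the raw bytes of a Linux executable (and to glean potential strings from it) is the readelf utility, which dumps a section as hex next to its ASCII rendering without trying to decode any instructions: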

$ readelf -x .text golf

Hex dump of section '.text':
  0x00401000 554889e5 68656c6c 0048c7c0 04000000 UH..hell.H......
  0x00401010 48c7c301 00000048 c7c10030 400048c7 H......H...0@.H.
  0x00401020 c20e0000 00cd8048 c7c00100 00004831 .......H......H1
  0x00401030 dbcd8090 5dc3                       ....].
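
If the goal is just to pull candidate strings out of those bytes, the kind of scan a strings-style tool performs can be approximated in a few lines. This is a sketch under assumptions: printable_runs is a hypothetical helper, TEXT is the hex dump above copied by hand, and the minimum run length of 4 is simply chosen to filter out short coincidences such as "UH" and "0@".

```python
BASE = 0x401000  # address of the first byte in the hex dump above

# The 54 bytes of .text, copied by hand from the readelf hex dump.
TEXT = bytes.fromhex(
    "554889e5 68656c6c 0048c7c0 04000000 "
    "48c7c301 00000048 c7c10030 400048c7 "
    "c20e0000 00cd8048 c7c00100 00004831 "
    "dbcd8090 5dc3"
)


def printable_runs(buf, base=BASE, min_len=4):
    """Return (address, text) for each run of printable ASCII in buf."""
    runs, start = [], None
    for i, b in enumerate(buf + b"\x00"):  # sentinel flushes a trailing run
        if 0x20 <= b <= 0x7e:
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_len:
                runs.append((base + start, buf[start:i].decode("ascii")))
            start = None
    return runs


if __name__ == "__main__":
    for addr, text in printable_runs(TEXT):
        print(f"{addr:#x}: {text}")
```

On this section the only run that survives the filter is hell at 0x401004, the same four bytes radare2 turned into a push.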
