Latin1 vs UTF8

Latin1 was the early default character set for encoding documents delivered via HTTP for MIME types beginning with text/. Today, only around 1.1% of websites on the internet use the encoding, along with some older applications. However, it is still the most popular single-byte character encoding scheme in use today. A funny thing about Latin1 encoding is that it maps every byte from 0 to 255 to a valid character. This means that literally any sequence of bytes can be interpreted as a valid string. The main drawback is that it only supports characters from Western European languages. UTF8 is different on both counts. Unlike Latin1, UTF8 supports a vastly broader range of characters from different languages and scripts. But as a consequence, not every byte sequence is valid. This is due to UTF8's added complexity: it uses multi-byte sequences for characters beyond the ASCII range. This is also why you can't just throw any sequence of bytes at it and ex...
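To make the difference concrete, here is a minimal sketch in C. The function names latin1_to_codepoint and is_valid_utf8 are hypothetical, made up for this illustration. The validator follows the well-formed byte ranges from RFC 3629 as I understand them, and is meant as a demonstration rather than a hardened implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Latin1: every byte value 0x00..0xFF is the code point with the same number,
   so decoding can never fail. */
uint32_t latin1_to_codepoint(uint8_t b) {
    return b;
}

/* UTF8: returns true only if the whole buffer is well-formed. */
bool is_valid_utf8(const uint8_t *s, size_t len) {
    size_t i = 0;
    while (i < len) {
        uint8_t b = s[i];
        size_t need;                   /* continuation bytes expected after the lead byte */
        uint8_t lo = 0x80, hi = 0xBF;  /* allowed range for the first continuation byte */

        if (b <= 0x7F) { i++; continue; }  /* plain ASCII */
        else if (b >= 0xC2 && b <= 0xDF) { need = 1; }
        else if (b >= 0xE0 && b <= 0xEF) {
            need = 2;
            if (b == 0xE0) lo = 0xA0;  /* reject overlong 3-byte forms */
            if (b == 0xED) hi = 0x9F;  /* reject UTF-16 surrogates */
        }
        else if (b >= 0xF0 && b <= 0xF4) {
            need = 3;
            if (b == 0xF0) lo = 0x90;  /* reject overlong 4-byte forms */
            if (b == 0xF4) hi = 0x8F;  /* reject values above U+10FFFF */
        }
        else { return false; }         /* 0x80..0xC1 and 0xF5..0xFF cannot start a sequence */

        if (len - i <= need) return false;  /* sequence is cut off */
        if (s[i + 1] < lo || s[i + 1] > hi) return false;
        for (size_t k = 2; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false;
        }
        i += need + 1;
    }
    return true;
}
```

For instance, the single byte 0xE9 is é in Latin1, but is_valid_utf8 rejects it, because 0xE9 announces a three-byte sequence that never arrives. The UTF8 encoding of é is the two bytes 0xC3 0xA9.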

Linear Sweeping vs Recursive Disassembly

Objdump's linear sweep

While objdump's linear sweep makes it fast, there are tradeoffs. For example, if we construct a Linux executable, we can insert strings into various sections, which objdump will, to no surprise, blindly misinterpret. Here we use inline __asm__ to build a binary whose .text section contains the string zzz. It is just an irrelevant string, but objdump will parse it as machine instructions anyway:

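#include <stdio.h>

void main() {
    __asm__(
        /* Irrelevant string data emitted straight into .text */
        ".section .text\n"
        "1: .string \"zzz\";"
        /* The message we actually print lives in .data */
        ".section .data\n"
        "message: .string \"Transference..\\n\"\n"
        ".section .text\n"
        /* write(1, message, 14) through the legacy int 0x80 interface */
        "mov $4, %rax\n"
        "mov $1, %rbx\n"
        "mov $message, %rcx\n"
        "mov $14, %rdx\n"
        "int $0x80\n"
        /* exit(0) */
        "mov $1, %rax\n"
        "xor %rbx, %rbx\n"
        "int $0x80\n"
    );
}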

This has to be compiled without the standard library and as a non-position-independent executable so gcc doesn't complain: gcc -nostdlib -no-pie -o golf golf.c. Then we can run our binary and see that it does in fact execute without crashing:

$ ./golf
Transference..

And now observe the way objdump handles strings in the .text section. Running objdump against our binary with objdump -D -j .text golf:

Disassembly of section .text:

0000000000401000 <main>:
  401000:  55                    push   %rbp
  401001:  48 89 e5              mov    %rsp,%rbp
  401004:  7a 7a                 jp     401080
  401006:  7a 00                 jp     401008
  401008:  48 c7 c0 04 00 00 00  mov    $0x4,%rax
  40100f:  48 c7 c3 01 00 00 00  mov    $0x1,%rbx
  401016:  48 c7 c1 00 30 40 00  mov    $0x403000,%rcx
  40101d:  48 c7 c2 0e 00 00 00  mov    $0xe,%rdx
  401024:  cd 80                 int    $0x80
  401026:  48 c7 c0 01 00 00 00  mov    $0x1,%rax
  40102d:  48 31 db              xor    %rbx,%rbx
  401030:  cd 80                 int    $0x80
  401032:  90                    nop
  401033:  5d                    pop    %rbp
  401034:  c3                    ret

And here we see our dead bytes "zzz" (7a, 7a, 7a, plus the terminating 00) interpreted as machine instructions: two jp opcodes. I believe this trick also works in various other sections, too. We could use this to create deliberately misleading binaries, or perhaps worse. Or such behavior could inadvertently mislead an analyst.
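To see where the two jp targets in the listing above come from, here is a minimal sketch in C of what a linear sweep does with those four bytes. It only understands the short jp form (opcode 7a followed by a signed 8-bit displacement), and the names short_jump_target and sweep_jp are hypothetical. This is an illustration of the idea, not how objdump is actually implemented.

```c
#include <stddef.h>
#include <stdint.h>

/* 0x7a is the short form of jp: one opcode byte plus one signed 8-bit displacement. */
#define OPCODE_JP_SHORT 0x7a

/* Target of a 2-byte short jump located at addr:
   address of the next instruction plus the sign-extended displacement. */
uint64_t short_jump_target(uint64_t addr, uint8_t rel8) {
    return addr + 2 + (uint64_t)(int64_t)(int8_t)rel8;
}

/* Walk the bytes front to back, treating each 0x7a as a jp and the byte after it
   as the displacement. Stops at the first byte that is not 0x7a.
   Writes at most max targets and returns how many were decoded. */
size_t sweep_jp(const uint8_t *code, size_t len, uint64_t base,
                uint64_t *targets, size_t max) {
    size_t off = 0, n = 0;
    while (off + 1 < len && n < max && code[off] == OPCODE_JP_SHORT) {
        targets[n++] = short_jump_target(base + off, code[off + 1]);
        off += 2;  /* a linear sweep simply moves on to the next byte after this one */
    }
    return n;
}
```

Feeding it the bytes 7a 7a 7a 00 with a base of 0x401004 yields two "instructions" with targets 0x401080 and 0x401008, the same values objdump printed. The sweep has no way of knowing that the bytes were a NUL-terminated string.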

r2's recursive disassembly

So, recursive disassembly solves some of the problems that linear disassembly gives us. But even radare2 decodes our meaningless "zzz" sequence in the .text section, which of course gets translated to jp opcodes, even though it's just a string that is irrelevant to the program. It's slightly better here, though, because radare2 follows the jp branches and provides some context: it annotates the true and false targets of each jump and marks 0x401008 with a CODE XREF from 0x401006, which makes it easier to spot that 0x401008, not 0x00401004 or 0x00401006, is where the real code begins:

[0x0040102d]> pdr@entry0
  ;-- section..text:
  ;-- segment.LOAD1:
  ;-- main:
  ;-- rip:
┌ 53: entry0 ();
│ bp: 0 (vars 0, args 0)
│ sp: 0 (vars 0, args 0)
│ rg: 0 (vars 0, args 0)
│ 0x00401000      55             push rbp              ; [02] -r-x section size 53 named .text                                                                                  
│ 0x00401001      4889e5         mov rbp, rsp
│ 0x00401004      7a7a           jp 0x401080
| // true: 0x00401080  false: 0x00401006
│ 0x00401006      7a00           jp 0x401008
| // true: 0x00401008  false: 0x00401008
│ ; CODE XREF from entry0 @ 0x401006
│ 0x00401008      48c7c0040000.  mov rax, 4
│ 0x0040100f      48c7c3010000.  mov rbx, 1
│ 0x00401016      48c7c1003040.  mov rcx, loc.message  ; 0x403000 ; "Transference..\n"                                                                                          
│ 0x0040101d      48c7c20e0000.  mov rdx, 0xe          ; 14
│ 0x00401024      cd80           int 0x80
│ 0x00401026      48c7c0010000.  mov rax, 1
│ 0x0040102d      4831db         xor rbx, rbx
│ 0x00401030      cd80           int 0x80
│ 0x00401032      90             nop
│ 0x00401033      5d             pop rbp
└ 0x00401034      c3             ret

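But what if the string we leave in our .text section doesn't decode to a jump? What if, instead, we leave the string hell in the .text section? Well, then it becomes a bit more ambiguous, even in radare2. We of course see our dead bytes, 68, 65, 6c, 6c (plus the terminating 00), but radare2 folds them into a single push: 68 is read as the push imm32 opcode, and 65 6c 6c 00 becomes its little-endian immediate, 0x6c6c65.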

[0x00401000]> pdr@entry0
  ;-- section..text:
  ;-- segment.LOAD1:
  ;-- entry0:
  ;-- rip:
┌ 54: int main (int argc, char **argv, char **envp);
│ 0x00401000      55             push rbp                ; [02] -r-x section size 54 named .text                                            
│ 0x00401001      4889e5         mov rbp, rsp
│ 0x00401004      68656c6c00     push 0x6c6c65           ; 'ell'                                                                            
│ 0x00401009      48c7c0040000.  mov rax, 4
│ 0x00401010      48c7c3010000.  mov rbx, 1
│ 0x00401017      48c7c1003040.  mov rcx, loc.message    ; 0x403000 ; "Transference..\n"                                                    
│ 0x0040101e      48c7c20e0000.  mov rdx, 0xe            ; 14
│ 0x00401025      cd80           int 0x80
│ 0x00401027      48c7c0010000.  mov rax, 1
│ 0x0040102e      4831db         xor rbx, rbx
│ 0x00401031      cd80           int 0x80
│ 0x00401033      90             nop
│ 0x00401034      5d             pop rbp
└ 0x00401035      c3             ret
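The push in the listing above can be reproduced by hand. Here is a minimal sketch in C, assuming only the push imm32 encoding (opcode 68 followed by a 4-byte little-endian immediate). The names read_le32 and decode_push_imm32 are hypothetical helpers for this illustration, not radare2 internals.

```c
#include <stddef.h>
#include <stdint.h>

/* Assemble a 32-bit value from four bytes stored least significant byte first. */
uint32_t read_le32(const uint8_t *p) {
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* If code starts with opcode 0x68 (push imm32), store the immediate in *imm
   and return the instruction length (5). Otherwise return 0. */
int decode_push_imm32(const uint8_t *code, size_t len, uint32_t *imm) {
    if (len < 5 || code[0] != 0x68) {
        return 0;
    }
    *imm = read_le32(code + 1);
    return 5;
}
```

Given the bytes 68 65 6c 6c 00, the 'h' is consumed as the opcode and the remaining "ell" plus the NUL terminator become the immediate 0x6c6c65, which is why radare2's comment only shows 'ell'.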

A cleaner, better way to analyze (and glean potential strings from) a Linux ELF executable is via the readelf utility:

$ readelf -x .text golf

Hex dump of section '.text':
  0x00401000 554889e5 68656c6c 0048c7c0 04000000 UH..hell.H......
  0x00401010 48c7c301 00000048 c7c10030 400048c7 H......H...0@.H.
  0x00401020 c20e0000 00cd8048 c7c00100 00004831 .......H......H1
  0x00401030 dbcd8090 5dc3                       ....].
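The ASCII column in the dump above is what makes hell jump out. As a rough sketch of the same idea in C, the hypothetical helper find_ascii_run below scans a buffer for the first run of printable ASCII bytes of at least a given length. TEXT_SAMPLE is just the first 16 bytes of the dump copied by hand; the 4-byte minimum used in the note afterwards is an arbitrary assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* First 16 bytes of .text, copied from the readelf hex dump. */
const uint8_t TEXT_SAMPLE[16] = {
    0x55, 0x48, 0x89, 0xe5, 0x68, 0x65, 0x6c, 0x6c,
    0x00, 0x48, 0xc7, 0xc0, 0x04, 0x00, 0x00, 0x00
};

/* Return the offset of the first run of at least min_len printable ASCII bytes
   (0x20..0x7e) and store its length in *run_len. Returns -1 if there is none. */
long find_ascii_run(const uint8_t *buf, size_t len, size_t min_len, size_t *run_len) {
    size_t i = 0;
    while (i < len) {
        size_t start = i;
        while (i < len && buf[i] >= 0x20 && buf[i] <= 0x7e) {
            i++;  /* extend the current printable run */
        }
        if (i - start >= min_len) {
            *run_len = i - start;
            return (long)start;
        }
        if (i == start) {
            i++;  /* current byte is not printable, step over it */
        }
    }
    return -1;
}
```

With a minimum length of 4, the first match in TEXT_SAMPLE is at offset 4 with length 4, which is our hell string. The two-byte "UH" at the start is just the push rbp / REX prefix bytes happening to be printable, and gets filtered out.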
