Latin1 vs UTF8

Latin1 was the early default character set for documents delivered via HTTP with MIME types beginning with text/. Today, only around 1.1% of websites on the internet use the encoding, along with some older applications. However, it is still the most popular single-byte character encoding scheme in use today. The main drawback is that it only supports characters from Western European languages. A funny thing about Latin1 encoding is that it maps every byte from 0 to 255 to a valid character. This means that literally any sequence of bytes can be interpreted as a valid string. The same is not true for UTF8. Unlike Latin1, UTF8 supports a vastly broader range of characters from different languages and scripts. But as a consequence, not every byte sequence is valid. This is because UTF8 uses multi-byte sequences for characters beyond the ASCII range, and those sequences must follow a fixed pattern. This is also why you can't just throw any sequence of bytes at it and ex...
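
The difference can be sketched in a few lines of portable C. This is only an illustration under stated assumptions, not code from any particular library: latin1_decode and utf8_is_valid are hypothetical names chosen for this example. The first can never fail, because every byte is its own code point; the second has to check lead bytes, continuation bytes, overlong forms, surrogates, and the U+10FFFF upper limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Latin1: each byte value 0..255 is itself the code point, so decoding
   a byte cannot fail. (Hypothetical helper for illustration.) */
uint32_t latin1_decode(unsigned char byte) {
    return (uint32_t)byte;
}

/* Returns true if buf[0..len) is well-formed UTF-8.
   (Hypothetical helper for illustration.) */
bool utf8_is_valid(const unsigned char *buf, size_t len) {
    assert(buf != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t extra;   /* continuation bytes expected after the lead byte */
        uint32_t cp;    /* code point being assembled */
        uint32_t min;   /* smallest code point allowed for this length */

        if (b < 0x80)                { extra = 0; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return false;  /* stray continuation byte or invalid lead byte */

        if (extra > len - i - 1) return false;  /* sequence is cut short */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80) return false;  /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min) return false;                      /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  /* UTF-16 surrogate */
        if (cp > 0x10FFFF) return false;                 /* beyond Unicode */

        i += extra + 1;
    }
    return true;
}
```

For example, the Latin1 byte 0xE9 ("é") decodes fine with latin1_decode, but the same lone byte is rejected by utf8_is_valid, because in UTF8 0xE9 announces a three-byte sequence that never arrives.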

Windows

From "user space and system space":

Windows gives each user-mode application a block of virtual addresses. This is known as the user space of that application. The other large block of addresses, known as system space or kernel space, cannot be directly accessed by the application.

More or less everything in user space talks to NTDLL.DLL, which makes the appropriate system calls to hand work off to the Windows kernel, effectively switching from user mode to kernel mode. Some other calls are routed through libraries such as:

  • MSVCRT.DLL: the standard C library
  • MSVCP*.DLL: the standard C++ library
  • CRTDLL.DLL: a legacy C runtime library with multithreaded support

All code that runs in kernel mode shares a single virtual address space. Therefore, a kernel-mode driver isn't isolated from other drivers and the operating system itself. If a kernel-mode driver accidentally writes to the wrong virtual address, data that belongs to the operating system or another driver could be compromised. If a kernel-mode driver crashes, the entire operating system crashes.

    Windows Architecture Overview


    User Space:
    • System Processes
      • Session manager
      • LSASS
      • Winlogon
    • Services
      • Service control manager
      • SvcHost.exe
      • WinMgmt.exe
      • SpoolSv.exe
      • Services.exe
    • Applications
      • Task Manager
      • Explorer
      • User apps
      • Subsystem DLLs
    • Environment Subsystems
      • Win32
      • POSIX
      • OS/2
    Kernel Space:
    • Kernel Mode
      • Kernel mode drivers
      • Hardware Abstraction Layer (HAL)
    • System Threads
      • System Service Dispatcher
      • Virtual Memory
      • Processes and Threads
    • Security
      • Security Reference Monitor
    • Device & File Systems
      • Device & File System cache
      • Kernel Drivers
      • I/O manager
      • Plug and play manager
      • Local procedure call
      • Graphics drivers
    • Hardware Abstraction Layer (HAL)
      • Hardware interfaces

    Windows Ecosystem

    OK, so of course the real question is, how do we interact with the Windows ecosystem to actually do things? Like other software ecosystems, Windows gives us a set of libraries whose functions we can call. Consider the CreateFileA API. Per Microsoft's documentation, here is the prototype for this interface:

    HANDLE CreateFileA(
      [in]           LPCSTR                lpFileName,
      [in]           DWORD                 dwDesiredAccess,
      [in]           DWORD                 dwShareMode,
      [in, optional] LPSECURITY_ATTRIBUTES lpSecurityAttributes,
      [in]           DWORD                 dwCreationDisposition,
      [in]           DWORD                 dwFlagsAndAttributes,
      [in, optional] HANDLE                hTemplateFile
    );

    It takes a file name, access, share mode, security attributes (optional), a disposition, flags, and a template (optional). We'll also use printf and scanf to prompt for and read some inputs. First we'll get a directory path, and then a name for our new file. We'll concatenate the two into a full path, pass it to CreateFileA, and store the returned handle in hFile. We'll also define a constant that points to the content we wish to write to our text file.

    We'll use FormatMessageA, as listed in Microsoft's documentation, to obtain a readable error message if file creation fails. We'll also check the result of the WriteFile API with our if (!WriteFile(...)) statement: if the write fails, we report the failure, close our handle, and return a failure status. Otherwise, once the file has been written, we close our handle, print a message with the joined fullPath of our file, and exit cleanly with 0:

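    #include <stdio.h>
    #include <string.h>   // strlen
    #include <windows.h>
    
    int main(void) {
        char path[MAX_PATH];
        char filename[MAX_PATH];
        char fullPath[MAX_PATH];
        HANDLE hFile;
        DWORD bytesWritten;
    
        // Get user input for path and filename. The field width of 259
        // (MAX_PATH - 1) keeps scanf from overflowing the buffers. Note that
        // %s stops at whitespace, so paths containing spaces are not supported.
        printf("Enter the path: ");
        if (scanf("%259s", path) != 1) {
            printf("Failed to read the path.\n");
            return 1;
        }
    
        printf("Enter the filename: ");
        if (scanf("%259s", filename) != 1) {
            printf("Failed to read the filename.\n");
            return 1;
        }
    
        // Join the two parts and make sure the result was not truncated
        int needed = snprintf(fullPath, sizeof(fullPath), "%s\\%s", path, filename);
        if (needed < 0 || (size_t)needed >= sizeof(fullPath)) {
            printf("The full path is too long.\n");
            return 1;
        }
    
        // CREATE_NEW fails if the file already exists
        hFile = CreateFileA(fullPath, GENERIC_WRITE, 0, NULL, CREATE_NEW, FILE_ATTRIBUTE_NORMAL, NULL);
    
        if (hFile == INVALID_HANDLE_VALUE) {
            DWORD error = GetLastError();
            LPSTR errorMsg = NULL;
            // With FORMAT_MESSAGE_ALLOCATE_BUFFER, the system allocates the
            // buffer and stores its address in errorMsg
            DWORD len = FormatMessageA(
                FORMAT_MESSAGE_ALLOCATE_BUFFER | FORMAT_MESSAGE_FROM_SYSTEM |
                    FORMAT_MESSAGE_IGNORE_INSERTS,
                NULL,
                error,
                0, // Default language
                (LPSTR)&errorMsg,
                0,
                NULL
            );
            if (len != 0) {
                printf("Failed to create the file: %s\n", errorMsg);
                LocalFree(errorMsg);
            } else {
                printf("Failed to create the file (error %lu).\n", error);
            }
            return 1;
        }
    
        const char* content = "Noted";
        if (!WriteFile(hFile, content, (DWORD)strlen(content), &bytesWritten, NULL)) {
            printf("Failed to write to the file.\n");
            CloseHandle(hFile);
            return 1;
        }
    
        CloseHandle(hFile);
    
        printf("File created successfully: %s\n", fullPath);
    
        return 0;
    }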
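
Joining the two inputs is the one step in this program that doesn't depend on the Windows API, so it can be isolated and tested on any platform. Here is a minimal sketch, assuming a hypothetical helper named join_path (not part of the program above or of any Windows library), that uses snprintf's return value to detect when the combined path doesn't fit in the buffer:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes "dir\name" into out.
   Returns false, leaving out as an empty string, if the result does
   not fit in out_size bytes including the terminating NUL. */
bool join_path(char *out, size_t out_size, const char *dir, const char *name) {
    assert(out != NULL && dir != NULL && name != NULL);
    if (out_size == 0) {
        return false;
    }

    /* snprintf returns the length the full result would have had,
       so a value >= out_size means the output was truncated. */
    int needed = snprintf(out, out_size, "%s\\%s", dir, name);
    if (needed < 0 || (size_t)needed >= out_size) {
        out[0] = '\0';
        return false;
    }
    return true;
}
```

With a helper like this, a too-long path is reported up front, rather than being cut short and handed to CreateFileA as a different path than the user intended.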
    

    Just a Prologue

    After compiling, we can run the program and make some observations about Windows system behavior: the prologue, stack unwinding, and its use of undocumented calls, all of which are abstracted and hidden away from the user.

    C:\Users\User\Downloads>.\createfile.exe
    Enter the path: C:\Users\User
    Enter the filename: ts
    File created successfully: C:\Users\User\ts

    For example, when we first run our program, we immediately observe calls into NTDLL, which sets up a thread and begins the work of executing our file. We can see this here:

    0, ntdll.dll!RtlUserThreadStart

    After hitting the first return, we can pull a stack trace to see that our thread has now unwound a bit, and we've reached KERNEL32.DLL. Despite its name, KERNEL32.DLL is a user-mode library that exposes much of the base Win32 API; it is not the kernel itself.

    0, ntdll.dll!NtWaitForWorkViaWorkerFactory+0x14
    1, ntdll.dll!RtlClearThreadWorkOnBehalfTicket+0x35e
    2, kernel32.dll!BaseThreadInitThunk+0x1d
    3, ntdll.dll!RtlUserThreadStart+0x28

    During this time, we see multiple calls to LdrpInitializeProcess, which initializes the structures in our process. Then we see BaseThreadInitThunk, a user-mode thread start routine in KERNEL32.DLL that plays a role similar to LdrInitializeThunk in NTDLL, and a call to RtlImageNtHeader to get the image headers for our process.

    Skipping forward a bit, when we enter our path and filename, those values are moved into registers, as shown here. Following this, many cmp comparisons are made to check that the path is OK:

    mov rbx,qword ptr ss:[rsp+70] | __pioinfo
    mov rsi,qword ptr ss:[rsp+78] | Users\\User\n\n 

    After a very long dance handling the file path, we finally see instructions involving our filename emerge. The filename is effectively loaded into a register like so:

    push rbx                        | rbx:&"ts\n\nsers\\User\n\n"
    sub rsp,20                      |
    mov rbx,rcx                     | rbx:&"ts\n\nsers\\User\n\n"
    lea rcx,qword ptr ds:[<_iob>]   | 
    cmp rbx,rcx                     | 
    jb msvcrt.7FFF040306F5          |
    lea rax,qword ptr ds:[7FFF04088 |
    cmp rbx,rax                     | rbx:&"ts\n\nsers\\User\n\n"
    ja msvcrt.7FFF040306F5          |

    Much later on, when our file is created, we see a reference to EtwEventWriteTransfer, which suggests that this file creation could have been logged by Event Tracing for Windows (ETW).

    call createfile.7FF60B1C6D00    |
    jmp createfile.7FF60B1C860C     |
    sub r10d,2                      |
    mov rcx,qword ptr ds:[r13]      | rcx:"ts", [r13]:"ts"
    lea rbx,qword ptr ds:[r13+8]    | [r13+8]:EtwEventWriteTransfer+260

    Many instructions later, we finally see a lea (load effective address) instruction load the address of the message we're writing to the text file, "Noted":

    call rax                        |
    mov eax,1                       |
    jmp createfile.7FF60B1C1743     |
    lea rax,qword ptr ds:[7FF60B1D1 | 00007FF60B1D104F:"Noted"
    mov qword ptr ss:[rbp+300],rax  |
    mov rax,qword ptr ss:[rbp+300]  |

    And the syscall stub for NtWriteFile, which copies rcx into r10, loads the system service number into eax, and executes syscall:

    mov r10,rcx                     | NtWriteFile
    mov eax,8                       |
    test byte ptr ds:[7FFE0308],1   |
    jne ntdll.7FFF055AEE55          |
    syscall                         |
    ret                             |

    And lastly, our call to CloseHandle:

    mov rax,qword ptr ds:[<CloseHandle>] | rax:CloseHandle
    call rax                             | rax:CloseHandle

    Much more happens along the way, but this is the gist of it.

    Most of the Microsoft API is well documented, and some of the code is even partially compatible with Unix systems. Other parts of the Microsoft ecosystem, however, are not officially documented. Microsoft gives us public APIs, some of which are just wrappers that call undocumented functions under the hood. In a future post, we'll use an undocumented API to talk to the Windows kernel.
