The primary executable file format on macOS is the Mach-O file format. Almost every program that runs on a Mac is a Mach-O file, including applications downloaded from the App Store. Today we will learn how these files are organized.
Like so many other file formats, Mach-O files start with a header. This header contains information about the file, including the type of processor, the number of load commands, and the total size of the load commands. The format of this header is typical, except for one fact: the byte order matches the byte order of the target processor. That means that the integers are little-endian on Intel processors and big-endian on PowerPC processors. This includes the magic number as well. The structure of the header for 32-bit architectures, in C syntax, is:
// from loader.h in the MacOSX Software Development Kit
struct mach_header {
    uint32_t magic;      /* mach magic number identifier */
    int32_t  cputype;    /* cpu specifier */
    int32_t  cpusubtype; /* machine specifier */
    uint32_t filetype;   /* type of file */
    uint32_t ncmds;      /* number of load commands */
    uint32_t sizeofcmds; /* the size of all the load commands */
    uint32_t flags;      /* flags */
};
On 64-bit systems the header is almost the same, except for an extra integer that is reserved for future use. 32-bit and 64-bit Mach-O files can be differentiated by the magic number, magic. When the first four bytes of the file are read as a big-endian integer, 32-bit big-endian Mach-O files have the magic number 0xfeedface in hexadecimal, 32-bit little-endian Mach-O files have 0xcefaedfe, 64-bit big-endian files have 0xfeedfacf, and 64-bit little-endian files have 0xcffaedfe. Just like in Universal Binary files, cputype determines the processor architecture and cpusubtype the exact processor.
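The four magic numbers above are enough to decide how the rest of the header has to be read. Here is a minimal sketch of that decision. The struct macho_kind and the function classify_magic are names I made up for this example; they are not part of loader.h. The sketch also ignores Universal Binary files, which have their own magic number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for this sketch (not from loader.h). */
struct macho_kind {
    bool is_64bit;      /* true if the file uses the 64-bit header */
    bool little_endian; /* true if the file's integers are little-endian */
};

/* Classifies the first four bytes of a file. The bytes are combined as a
   big-endian integer, so the result does not depend on the host CPU.
   Returns false if the bytes are not one of the four magic numbers. */
static bool classify_magic(const unsigned char b[4], struct macho_kind *out)
{
    assert(b != NULL && out != NULL);
    uint32_t magic = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
                     ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
    switch (magic) {
    case 0xfeedfaceu: out->is_64bit = false; out->little_endian = false; return true;
    case 0xcefaedfeu: out->is_64bit = false; out->little_endian = true;  return true;
    case 0xfeedfacfu: out->is_64bit = true;  out->little_endian = false; return true;
    case 0xcffaedfeu: out->is_64bit = true;  out->little_endian = true;  return true;
    default:          return false;
    }
}
```

A typical 64-bit Intel executable starts with the bytes cf fa ed fe, which this function reports as 64-bit and little-endian.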
The value of filetype indicates what kind of file this is. There are fifteen possible values, defined as constants, among them MH_EXECUTE, which means an ordinary executable file, and MH_DYLIB, which means a dynamic library file.
ncmds is the number of load commands in the file, and sizeofcmds is the total size of the load commands. Load commands are an important and deep topic that we will discuss shortly.
The last member of the header is flags, a bitfield that controls the behavior of the loader. Among other things, these flags control whether the program is loaded at a fixed memory address or at a random one, and how symbols are bound.
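To make filetype and flags a little more concrete, here is a small sketch that checks whether a header describes a position-independent executable, that is, a program loaded at a random address. The constant values are copied from loader.h so that the sketch builds without the SDK. The function is_pie_executable is a name I made up for this example, and it assumes the header has already been converted to the host's byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Values copied from loader.h so this sketch builds without the SDK. */
#define MH_EXECUTE 0x2u      /* filetype: ordinary executable */
#define MH_DYLIB   0x6u      /* filetype: dynamic library */
#define MH_PIE     0x200000u /* flag: load the executable at a random address */

struct mach_header {
    uint32_t magic;
    int32_t  cputype;
    int32_t  cpusubtype;
    uint32_t filetype;
    uint32_t ncmds;
    uint32_t sizeofcmds;
    uint32_t flags;
};

/* Hypothetical helper: true if the header describes an executable
   that asks to be loaded at a random address. */
static bool is_pie_executable(const struct mach_header *h)
{
    assert(h != NULL);
    return h->filetype == MH_EXECUTE && (h->flags & MH_PIE) != 0;
}
```

Because flags is a bitfield, each setting is tested with a bitwise AND against its constant rather than with an equality comparison.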
After the header come the load commands. These commands are the most important part of the file. Among other things, their job is to load the machine instructions and other data. There are many different commands, 54 in total, and almost every command has a different function.
Every command begins with a small header that contains the type of the command and the size of the command in bytes.
// from loader.h in the MacOSX Software Development Kit
struct load_command {
    uint32_t cmd;     /* type of load command */
    uint32_t cmdsize; /* total size of command in bytes */
};
cmd has to be one of the 54 constants in the file loader.h, such as LC_SEGMENT. This file also contains the constant LC_REQ_DYLD. Every load command constant added after Mac OS X 10.1 that the dynamic linker must understand for the program to run correctly is bitwise OR'd with this constant. If the dynamic linker sees an unknown load command that has this bit set, it reports an error and refuses to load the file; unknown load commands without this bit are simply ignored. cmdsize is the total size of the load command, including this header. The rest of the load command follows directly after this header and depends on the type of load command.
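Because every command carries its own size, a reader can step through the load commands without understanding each one. The following sketch walks a buffer of load commands and counts how many have the LC_REQ_DYLD bit set. The function count_required_commands is a name I made up for this example. It assumes the buffer holds the sizeofcmds bytes that follow the header and that the integers are already in the host's byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LC_REQ_DYLD 0x80000000u /* value copied from loader.h */

struct load_command {
    uint32_t cmd;     /* type of load command */
    uint32_t cmdsize; /* total size of command in bytes */
};

/* Hypothetical helper: counts the commands that carry the LC_REQ_DYLD bit.
   Returns -1 if a command is truncated or its cmdsize is implausible. */
static int count_required_commands(const unsigned char *buf, size_t sizeofcmds,
                                   uint32_t ncmds)
{
    assert(buf != NULL);
    size_t off = 0;
    int required = 0;
    for (uint32_t i = 0; i < ncmds; i++) {
        struct load_command lc;
        if (sizeofcmds - off < sizeof lc)
            return -1;                     /* no room for the small header */
        memcpy(&lc, buf + off, sizeof lc); /* avoids unaligned access */
        if (lc.cmdsize < sizeof lc || lc.cmdsize > sizeofcmds - off)
            return -1;                     /* would loop forever or overrun */
        if (lc.cmd & LC_REQ_DYLD)
            required++;
        off += lc.cmdsize;                 /* cmdsize includes the header */
    }
    return required;
}
```

The two checks on cmdsize matter in practice: a size smaller than the header would stall the loop, and a size larger than the remaining bytes would read past the buffer.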
Let's take a closer look at a few load command examples.
Perhaps the most important load commands are LC_SEGMENT and LC_SEGMENT_64. LC_SEGMENT is for 32-bit architectures, and LC_SEGMENT_64 is for 64-bit architectures. These commands indicate that a part of the file should be mapped directly into the address space of the process. They correspond roughly to segments in ELF or PE files. The structures of these commands are similar:
// from loader.h in the MacOSX Software Development Kit
struct segment_command { /* for 32-bit architectures */
    uint32_t cmd;         /* LC_SEGMENT */
    uint32_t cmdsize;     /* includes sizeof section structs */
    char     segname[16]; /* segment name */
    uint32_t vmaddr;      /* memory address of this segment */
    uint32_t vmsize;      /* memory size of this segment */
    uint32_t fileoff;     /* file offset of this segment */
    uint32_t filesize;    /* amount to map from the file */
    int32_t  maxprot;     /* maximum VM protection */
    int32_t  initprot;    /* initial VM protection */
    uint32_t nsects;      /* number of sections in segment */
    uint32_t flags;       /* flags */
};
struct segment_command_64 { /* for 64-bit architectures */
    uint32_t cmd;         /* LC_SEGMENT_64 */
    uint32_t cmdsize;     /* includes sizeof section_64 structs */
    char     segname[16]; /* segment name */
    uint64_t vmaddr;      /* memory address of this segment */
    uint64_t vmsize;      /* memory size of this segment */
    uint64_t fileoff;     /* file offset of this segment */
    uint64_t filesize;    /* amount to map from the file */
    int32_t  maxprot;     /* maximum VM protection */
    int32_t  initprot;    /* initial VM protection */
    uint32_t nsects;      /* number of sections in segment */
    uint32_t flags;       /* flags */
};
The structure begins with the header that we just discussed. Next comes segname, a simple character string that contains the name of the segment. segname must contain sixteen or fewer characters. After the name come vmaddr and vmsize, which together determine the area of memory where this segment should be loaded. fileoff and filesize determine where in the file this segment is loaded from.
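One practical consequence of the sixteen-character limit: a name that uses all sixteen bytes leaves no room for a terminating NUL, so the field should not be passed directly to functions like strlen or printf("%s"). Here is a minimal sketch of a safe copy. The function copy_name16 is a name I made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies a 16-byte Mach-O name field into a 17-byte
   buffer and guarantees that the result is NUL-terminated. */
static void copy_name16(const char field[16], char out[17])
{
    assert(field != NULL && out != NULL);
    memcpy(out, field, 16); /* shorter names are already NUL-padded */
    out[16] = '\0';         /* covers names that fill all 16 bytes */
}
```

Shorter names such as __TEXT are padded with zero bytes, so the copy yields the expected string in both cases.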
initprot and maxprot give the initial and maximum memory protections that should be used. The second-to-last member of the structure is nsects, which gives the number of section structures that follow this structure. Last is flags, which controls various settings of this segment.
The 64-bit version of this structure has the same members. The only difference is that vmaddr, vmsize, fileoff, and filesize are 64-bit.
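Since a segment ties a range of memory to a range of the file, the four members vmaddr, vmsize, fileoff, and filesize are enough to translate a memory address back to a file offset. The sketch below shows the idea. The struct seg_range and the function vmaddr_to_fileoff are names I made up for this example. It assumes that when vmsize is larger than filesize, the extra memory is zero-filled and has no counterpart in the file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of segment_command_64 with only the mapping fields. */
struct seg_range {
    uint64_t vmaddr;   /* memory address of this segment */
    uint64_t vmsize;   /* memory size of this segment */
    uint64_t fileoff;  /* file offset of this segment */
    uint64_t filesize; /* amount to map from the file */
};

/* Translates a memory address inside the segment to a file offset.
   Returns false if the address is outside the segment or lies in the
   part of the segment that is not backed by file contents. */
static bool vmaddr_to_fileoff(const struct seg_range *s, uint64_t addr,
                              uint64_t *out)
{
    assert(s != NULL && out != NULL);
    if (addr < s->vmaddr)
        return false;
    uint64_t delta = addr - s->vmaddr;
    if (delta >= s->vmsize)
        return false; /* past the end of the segment */
    if (delta >= s->filesize)
        return false; /* assumed zero-filled, not stored in the file */
    *out = s->fileoff + delta;
    return true;
}
```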
Directly after the segment structure come nsects section structures:
// from loader.h in the MacOSX Software Development Kit
struct section { /* for 32-bit architectures */
    char     sectname[16]; /* name of this section */
    char     segname[16];  /* segment this section goes in */
    uint32_t addr;         /* memory address of this section */
    uint32_t size;         /* size in bytes of this section */
    uint32_t offset;       /* file offset of this section */
    uint32_t align;        /* section alignment (power of 2) */
    uint32_t reloff;       /* file offset of relocation entries */
    uint32_t nreloc;       /* number of relocation entries */
    uint32_t flags;        /* flags (section type and attributes) */
    uint32_t reserved1;    /* reserved (for offset or index) */
    uint32_t reserved2;    /* reserved (for count or sizeof) */
};
struct section_64 { /* for 64-bit architectures */
    char     sectname[16]; /* name of this section */
    char     segname[16];  /* segment this section goes in */
    uint64_t addr;         /* memory address of this section */
    uint64_t size;         /* size in bytes of this section */
    uint32_t offset;       /* file offset of this section */
    uint32_t align;        /* section alignment (power of 2) */
    uint32_t reloff;       /* file offset of relocation entries */
    uint32_t nreloc;       /* number of relocation entries */
    uint32_t flags;        /* flags (section type and attributes) */
    uint32_t reserved1;    /* reserved (for offset or index) */
    uint32_t reserved2;    /* reserved (for count or sizeof) */
    uint32_t reserved3;    /* reserved */
};
These structures describe the sections that make up the segment, and they have many members. First, there is sectname, a simple character string of at most sixteen characters that contains the name of the section. segname is the same as the segname from the segment_command or segment_command_64 structure.
addr and size are the memory address and size of this section, similar to the vmaddr and vmsize members of the segment structure. The same story applies to offset: this member is similar to fileoff in the segment structure, but there is no filesize member in the section structure.
Something new comes next: align. align is the alignment of the section in memory, stored as a power of two: a value of n means that the section must be aligned to a multiple of 2^n bytes when it is loaded.
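Since align stores an exponent rather than a byte count, it has to be converted before it can be used. Here is a minimal sketch that rounds an address up to the alignment a section asks for. The function align_up is a name I made up for this example, and it does not guard against overflow for addresses near the top of the 64-bit range.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: rounds addr up to the next multiple of 2^align,
   where align is the exponent stored in the section structure. */
static uint64_t align_up(uint64_t addr, uint32_t align)
{
    assert(align < 64); /* a larger shift would be undefined behavior */
    uint64_t alignment = (uint64_t)1 << align;
    return (addr + alignment - 1) & ~(alignment - 1);
}
```

For example, an align value of 4 means 16-byte alignment, so the address 0x1001 is rounded up to 0x1010.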
The next two members deal with relocation. reloff is another offset into the file, which gives the position of the relocation entries. nreloc gives the number of relocation entries this section has. These relocation entries are used to adjust position-dependent machine instructions to the actual memory address. Relocations are rather complex and are outside the scope of this blog post.
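Then there is flags, which contains the flags for this section and controls various settings for this section. The remaining two members, reserved1 and reserved2, are reserved; as the comments in the structure indicate, their meaning depends on the type of section.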
The 64-bit structure has 64-bit unsigned integers for addr and size instead of 32-bit, and an additional reserved integer at the end.
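Because the section structures sit directly behind the segment structure, finding the n-th section is simple arithmetic. The sketch below computes the byte offset of a section relative to the start of its LC_SEGMENT_64 command and checks that it still fits inside cmdsize. The function section_offset is a name I made up for this example, and only the 64-bit layout is covered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct segment_command_64 {
    uint32_t cmd;
    uint32_t cmdsize;
    char     segname[16];
    uint64_t vmaddr;
    uint64_t vmsize;
    uint64_t fileoff;
    uint64_t filesize;
    int32_t  maxprot;
    int32_t  initprot;
    uint32_t nsects;
    uint32_t flags;
};

struct section_64 {
    char     sectname[16];
    char     segname[16];
    uint64_t addr;
    uint64_t size;
    uint32_t offset;
    uint32_t align;
    uint32_t reloff;
    uint32_t nreloc;
    uint32_t flags;
    uint32_t reserved1;
    uint32_t reserved2;
    uint32_t reserved3;
};

/* Hypothetical helper: computes where section number `index` starts,
   measured in bytes from the start of the segment command. Returns false
   if the index is out of range or the section would not fit in cmdsize. */
static bool section_offset(const struct segment_command_64 *seg,
                           uint32_t index, uint64_t *out)
{
    assert(seg != NULL && out != NULL);
    if (index >= seg->nsects)
        return false;
    uint64_t off = sizeof *seg + (uint64_t)index * sizeof(struct section_64);
    if (off + sizeof(struct section_64) > seg->cmdsize)
        return false; /* nsects claims more sections than cmdsize holds */
    *out = off;
    return true;
}
```

Checking against cmdsize guards against a file whose nsects value claims more sections than the command actually contains.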
This example shows how load commands usually function: every load command has one or more structures that contain the necessary data to carry out the command.
But sometimes a load command is not a command at all. Some commands, like LC_UUID, only exist to contain data. The structure of the LC_UUID command is extremely simple:
// from loader.h in the MacOSX Software Development Kit
struct uuid_command {
    uint32_t cmd;      /* LC_UUID */
    uint32_t cmdsize;  /* sizeof(struct uuid_command) */
    uint8_t  uuid[16]; /* the 128-bit uuid */
};
The first two members are simply the header of the load command. The last member, uuid, is a UUID: a "Universally Unique Identifier". This identifier uniquely identifies the file, which can be helpful for loading the correct version of the file.
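The sixteen bytes of uuid are usually shown to humans in the familiar 8-4-4-4-12 hexadecimal form. Here is a minimal sketch of that conversion. The function format_uuid is a name I made up for this example, and the choice of uppercase digits is just a convention I picked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: formats 16 raw UUID bytes as 8-4-4-4-12 uppercase
   hexadecimal. out must have room for 36 characters plus the NUL. */
static void format_uuid(const uint8_t uuid[16], char out[37])
{
    static const char hex[] = "0123456789ABCDEF";
    assert(uuid != NULL && out != NULL);
    size_t pos = 0;
    for (size_t i = 0; i < 16; i++) {
        if (i == 4 || i == 6 || i == 8 || i == 10)
            out[pos++] = '-';            /* group separators */
        out[pos++] = hex[uuid[i] >> 4];  /* high nibble */
        out[pos++] = hex[uuid[i] & 0x0f]; /* low nibble */
    }
    out[pos] = '\0';
}
```

The bytes are written in the order they appear in the load command, so no byte swapping is involved.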
With the example of LC_UUID, this blog post comes to a close. Mach-O files are very deep, and you can spend many hours studying the details. I, and by extension this blog post, have only scratched the surface; only three of the fifty-four load commands were discussed! Mach-O files can be not only executable files, but also object files or "DyLib" ("Dynamic Library") files!
Mach-O files are fascinating to me because they are so different from ELF and PE files. ELF and PE files rely on tables with equally sized entries, while Mach-O files have these differently sized load commands. But which format is better? Does it even matter? Maybe, maybe not.