Reverse Engineering Malware – Foundations – Part 1


Disassembly

Before we dive into code analysis and how it helps us understand malware and its characteristics, let’s take a basic look at the architecture of a computer, starting with the levels of abstraction.

Levels of Abstraction

  • A computer architecture can be represented as several levels of abstraction.
  • The Windows operating system can be installed on many different types of hardware because the hardware is abstracted (separated) from the operating system.
  • Malware authors write programs in a high-level language and use compilers to generate machine code to be executed by the CPU.
  • Malware analysts and reverse engineers work at the low-level language level.
  • Disassemblers are used to generate assembly code that we can read and analyse.
  • Computer systems are described with the following six levels of abstraction, listed here from lowest to highest.
  • The lower the level, the less portable it is across computer systems.

Hardware

The hardware level is the only physical level. It consists of electrical circuits that implement complex combinations of logical operators such as XOR, AND, OR, and NOT gates, known as digital logic.
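To make the idea of combining logic gates concrete, here is a minimal sketch of a half adder, a circuit that adds two single bits using one XOR gate and one AND gate. The function names are illustrative only, and real hardware does this with circuits rather than software.

```python
def xor_gate(a, b):
    """XOR gate: outputs 1 when exactly one input is 1."""
    return a ^ b


def and_gate(a, b):
    """AND gate: outputs 1 only when both inputs are 1."""
    return a & b


def half_adder(a, b):
    """Add two single bits and return (sum_bit, carry_bit)."""
    return xor_gate(a, b), and_gate(a, b)


# Print the full truth table of the half adder.
for a in (0, 1):
    for b in (0, 1):
        sum_bit, carry_bit = half_adder(a, b)
        print(f"{a} + {b} -> sum={sum_bit} carry={carry_bit}")
```

Chaining adders like this one is how wider additions can be built up from simple gates.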

Microcode (Firmware)

Microcode operates only on the exact circuitry it was designed for. It contains microinstructions that translate from the higher machine-code level and provide a way to interface with the hardware.

Machine code

This consists of opcodes, hexadecimal digits that tell the processor what you want it to do. Machine code is typically implemented with several microcode instructions so that the underlying hardware can execute the code. Machine code is created when a computer program written in a high-level language is compiled.
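As an illustration, the sketch below looks at six bytes of 32-bit x86 machine code. The byte sequence is chosen for this example and is not taken from any real sample: 0xB8 is the opcode for "mov eax, imm32", the next four bytes are its little-endian operand, and 0xC3 is the opcode for "ret".

```python
import struct

# Six bytes of 32-bit x86 machine code, written as hexadecimal digits.
machine_code = bytes.fromhex("b82a000000c3")

# First byte: the opcode 0xB8, which means "mov eax, imm32".
opcode = machine_code[0]

# Next four bytes: the 32-bit immediate operand, stored little-endian.
(immediate,) = struct.unpack_from("<I", machine_code, 1)

# Last byte: the opcode 0xC3, which means "ret".
ret_opcode = machine_code[5]

print(machine_code.hex(" "))
print(f"opcode 0x{opcode:02X} with immediate {immediate}")
print(f"opcode 0x{ret_opcode:02X}")
```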

Low-level Language

A low-level language is a human-readable version of a computer architecture’s instruction set. The most common low-level language is assembly language. Assembly is the highest-level language that can be reliably and consistently recovered from machine code when high-level language source code is not available.

High-level Language

Most computer programmers operate at the level of high-level languages. High-level languages provide strong abstraction from the machine level and make it easy to use programming logic and flow-control mechanisms.

Interpreted language

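Interpreted languages such as C#, Perl, Java and the other .NET languages sit at the top level. Their code is not compiled into machine code; instead it is translated into bytecode, an intermediate representation that is executed by an interpreter.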
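Python is not one of the languages listed above, but it follows the same model and makes the idea easy to see, so it is used here as a stand-in. This minimal sketch uses the standard `dis` module to show the bytecode that the interpreter runs for a small, hypothetical function. The exact instruction names vary between Python versions.

```python
import dis


def add(a, b):
    """A tiny function whose bytecode we want to inspect."""
    return a + b


# The raw bytecode: a sequence of bytes for the interpreter, not for the CPU.
raw_bytecode = add.__code__.co_code

# The same bytecode decoded into readable instruction names.
opnames = [instruction.opname for instruction in dis.get_instructions(add)]

print(raw_bytecode.hex())
print(opnames)
```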

Reverse Engineering

  • Malware stored on disk is in binary form, at the machine code level.
  • When we disassemble malware or any other code, we take the binary as input and generate assembly code as output.
  • Different microprocessor families have different assembly dialects.
  • Examples of microprocessor families or architectures are as follows:
    • x86
    • x64
    • SPARC
    • PowerPC
    • MIPS
    • ARM
  • Most 32-bit PCs are x86, also known as Intel IA-32, and modern 32-bit versions of Windows are designed to run on the x86 architecture.
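The "binary in, assembly out" step described above can be sketched with a toy disassembler. This is only an illustration that assumes 32-bit x86 and understands three encodings (nop, ret, and mov of a 32-bit immediate into a register). Real disassemblers handle the full instruction set, and the input bytes here are made up for the example.

```python
import struct

# Register order used by the x86 "mov r32, imm32" opcodes 0xB8 through 0xBF.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def disassemble(code):
    """Translate a few 32-bit x86 encodings into assembly text."""
    lines = []
    i = 0
    while i < len(code):
        op = code[i]
        if op == 0x90:
            lines.append("nop")
            i += 1
        elif op == 0xC3:
            lines.append("ret")
            i += 1
        elif 0xB8 <= op <= 0xBF and i + 5 <= len(code):
            # One opcode byte followed by a little-endian 32-bit immediate.
            (imm,) = struct.unpack_from("<I", code, i + 1)
            lines.append(f"mov {REG32[op - 0xB8]}, 0x{imm:x}")
            i += 5
        else:
            # Anything this toy does not understand is emitted as a raw byte.
            lines.append(f"db 0x{op:02x}")
            i += 1
    return lines


sample = bytes.fromhex("90b82a000000bb01000000c3")
for line in disassemble(sample):
    print(line)
```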

The x86 Architecture

Most modern computer architectures (including x86) follow the von Neumann architecture.

Ref: https://www.geeksforgeeks.org/computer-organization-von-neumann-architecture/

The Central Processing Unit (CPU)

The CPU contains several components:

  • The Control Unit (CU): Gets the instructions to execute from RAM using a register (the instruction pointer), which stores the address of the instruction to execute.
  • Registers: The CPU’s basic data storage units, often used to save time so that the CPU doesn’t need to access RAM.
  • The Arithmetic Logic Unit (ALU): Executes an instruction fetched from RAM and places the results in registers or memory.

The process of fetching and executing instructions is repeated as a program runs.

Registers (we will look into these in more detail shortly)

  • Accumulator: Stores the results of calculations made by ALU.
  • Program Counter (PC): Keeps track of the memory location of the next instruction to be dealt with. The PC then passes this next address to the Memory Address Register (MAR).
  • Memory Address Register (MAR): It stores the memory address of the instruction or data that needs to be fetched from memory or written to memory.
  • Memory Data Register (MDR): It stores instructions fetched from memory or any data that is to be transferred to, and stored in, memory.
  • Current Instruction Register (CIR): It stores the most recently fetched instruction while it is waiting to be decoded and executed.
  • Instruction Buffer Register (IBR): It holds an instruction that is not to be executed immediately.
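The way these registers cooperate during the fetch and execute cycle can be sketched with a toy machine. Everything about it is an assumption made for illustration: the four-instruction set (LOAD, ADD, STORE, HALT), the tuple encoding, and the memory contents are hypothetical and do not correspond to x86. Note that instructions and data share the same memory.

```python
def run(memory):
    """Run a toy program and return the final accumulator value."""
    pc = 0   # Program Counter: address of the next instruction
    acc = 0  # Accumulator: holds results produced by the ALU

    while True:
        # Fetch: the PC passes the address to the MAR, the instruction
        # arrives in the MDR, and is then copied into the CIR.
        mar = pc
        mdr = memory[mar]
        cir = mdr
        pc += 1

        # Decode: split the current instruction into opcode and operand.
        opcode, operand = cir

        # Execute: the operand is a memory address, so it goes into the MAR.
        if opcode == "HALT":
            break
        mar = operand
        if opcode == "LOAD":
            mdr = memory[mar]
            acc = mdr
        elif opcode == "ADD":
            mdr = memory[mar]
            acc = acc + mdr  # the ALU performs the addition
        elif opcode == "STORE":
            mdr = acc
            memory[mar] = mdr
        else:
            raise ValueError(f"unknown opcode: {opcode}")

    return acc


# Addresses 0-3 hold instructions, addresses 4-6 hold data.
memory = [("LOAD", 4), ("ADD", 5), ("STORE", 6), ("HALT", 0), 40, 2, 0]
result = run(memory)
print(result, memory[6])
```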

Input/Output Devices

Programs and data are read into main memory from an input device or secondary storage under the control of CPU input instructions. Output devices present information from the computer to the user: once results have been computed and stored in the computer, an output device is used to show them.

Buses

Buses transmit data from one part of a computer to another, connecting all major internal components to the CPU and memory. There are three types:

  • Data Bus: It carries data among the memory unit, the I/O devices, and the processor.
  • Address Bus: It carries the address of data (not the actual data) between memory and processor.
  • Control Bus: It carries control commands from the CPU (and status signals from other devices) in order to control and coordinate all the activities within the computer.

Main Memory

The layout of the memory for a program is divided into four sections:

  1. Data
  2. Code
  3. Heap
  4. Stack

Data

  • The data section is a specific section of memory that contains values put in place when a program is initially loaded.
  • These values are sometimes called static values because they may not change while the program is running, or global values because they are available to any part of the program.

Code

  • The code section includes the instructions fetched by the CPU to execute the program’s tasks.

Heap

  • The heap is used for dynamic memory during program execution, to create (allocate) new values and eliminate (free) values that the program no longer needs.
  • The heap is also called dynamic memory because its contents can change frequently while the program is running.

Stack

  • The stack is used for local variables and function parameters, and to help control program flow.
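The role the stack plays in function calls can be sketched with a toy model. The class, the frame layout, and the addresses below are all hypothetical. A real x86 stack is a region of memory managed through registers, but the last in, first out behaviour is the same.

```python
class CallStack:
    """A toy call stack: each frame holds a return address, parameters, and locals."""

    def __init__(self):
        self.frames = []

    def call(self, return_address, params, local_vars):
        # Calling a function pushes a new frame on top of the stack.
        self.frames.append(
            {"return_address": return_address, "params": params, "locals": local_vars}
        )

    def ret(self):
        # Returning pops the most recent frame (last in, first out) and
        # hands back the address where execution should resume.
        return self.frames.pop()["return_address"]


stack = CallStack()
stack.call(return_address=0x401005, params={"argc": 1}, local_vars={"total": 0})
stack.call(return_address=0x401020, params={"x": 5}, local_vars={"tmp": 10})

# The innermost call returns first, so its return address comes back first.
resume_at = stack.ret()
print(hex(resume_at), len(stack.frames))
```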

Key areas of a function’s description include:

  1. Its purpose
  2. Its inputs (parameters)
  3. Its outputs (return value)

Two naming conventions are also worth knowing when reading Windows API function names:

  • An “Ex” suffix is added to the end of a function’s name when the function is updated and the updated version is not compatible with the old one.
  • Functions that take 8-bit character strings end in “A”, and functions that take 16-bit character strings end in “W”. The 16-bit characters are a 2-byte character representation, specifically UTF-16, which is a Unicode character encoding.
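The difference between the 8-bit and 16-bit character representations is easy to see at the byte level. In this minimal sketch the string is an arbitrary example, and ASCII stands in for the 8-bit encoding an “A” function would receive.

```python
name = "cmd.exe"

# 8-bit characters: one byte per character, as an "A" function expects.
ansi_bytes = name.encode("ascii")

# 16-bit characters: two bytes per character (UTF-16, little-endian),
# as a "W" function expects.
wide_bytes = name.encode("utf-16-le")

print(len(ansi_bytes), ansi_bytes.hex(" "))
print(len(wide_bytes), wide_bytes.hex(" "))
```

For plain ASCII text, the wide form is the same characters with a zero byte after each one, which is why strings in a binary are worth searching for in both forms.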