Computer architecture and design, Study notes of Computer System Design and Architecture

IFT 212 | computer architecture and design | Dr. Fortune

Typology: Study notes

2025/2026

Uploaded on 04/04/2026

oyetunji-gideon
IFT 202 — Computer Architecture and Organization Notes Page 1 of 38
IFT 202
Introduction to Computer Programming II
─────────────────────────────────────
Computer Architecture and Organization
COMPREHENSIVE COURSE NOTES WITH DIAGRAMS
Topics Covered:
1. Memory System
2. Memory Addressing
3. Hardware Control & Micro Program Control
4. Multi Program Control
5. Fault Tolerant Computing
Academic Year 2025 / 2026
IFT 202 | Introduction to Computer Programming II | Academic Year 2025/2026

TOPIC 1: MEMORY SYSTEM

1.1 Introduction to Memory Systems

The memory system is one of the most fundamental components of any digital computer. It is responsible for storing data, instructions, and intermediate results that the processor needs during execution. Without a well-designed memory system, even the most powerful processor would be unable to function effectively. The memory system is the foundation of the stored-program concept: the revolutionary idea, proposed by John von Neumann in 1945, that both program instructions and data can be stored in the same memory and processed by the same hardware.

Modern computer memory is not a single, uniform storage medium. Instead, it is a carefully engineered hierarchy of different storage technologies, each occupying a distinct position in terms of speed, capacity, cost per bit, and volatility. The engineering challenge is to give the processor the illusion of a single, large, fast, and cheap memory, which cannot exist in practice. The memory hierarchy solves this by placing small amounts of very fast memory near the CPU and progressively larger, slower, and cheaper memory further away.

Understanding the memory system is essential for every computer science and engineering student because virtually every aspect of computer performance, from program execution speed to data throughput, is influenced by memory behavior. Cache misses, page faults, and memory bandwidth bottlenecks are among the most common performance limiters in real systems.

1.2 The Memory Hierarchy

The memory hierarchy exploits two important empirical observations about program behavior, collectively known as the Principle of Locality:

  • Temporal Locality: If a memory location is accessed, it is likely to be accessed again in the near future. Loops, frequently called functions, and repeatedly used variables exhibit temporal locality.
  • Spatial Locality: If a memory location is accessed, nearby memory locations are likely to be accessed soon. Arrays, sequential instruction execution, and data structures stored contiguously exhibit spatial locality.

By keeping recently and frequently accessed data in fast, expensive memory (cache), and less frequently needed data in slow, cheap memory (disk), the hierarchy achieves near-cache speeds at near-disk costs for most workloads.

Figure: Memory Hierarchy — from fastest/smallest (Registers) to slowest/largest (Tertiary Storage)

The levels of the hierarchy, from top to bottom, are:

  1. CPU Registers: Built directly into the processor. Extremely fast (sub-nanosecond), extremely limited (typically 16–64 general-purpose registers). Operations on registers require no memory bus activity.
  2. L1 Cache: Per-core, split into instruction cache (I-cache) and data cache (D-cache). Typically 32–64 KB. Access time 1–3 clock cycles (~0.3–1 ns at 3 GHz).
  3. L2 Cache: Per-core unified cache. Typically 256 KB – 1 MB. Access time 5–12 cycles.
  4. L3 Cache: Shared among all cores on the chip. Typically 8–64 MB. Access time roughly 30 cycles or more.
  5. Main Memory (DRAM): Several GB to TB in servers. Access time ~100 ns (about 300 cycles at 3 GHz). The primary working memory of the system.
  6. Secondary Storage (SSD/HDD): Non-volatile, TB capacity. SSD: 50–100 μs; HDD: 5–20 ms.
  7. Tertiary/Offline Storage: Magnetic tape, cloud storage. Used for backups and archives.

Level | Technology | Typical Size | Access Time | Volatile? | Cost/GB
Registers | SRAM (flip-flops) | Bytes (< 1 KB) | < 1 ns | Yes | Extreme
L1/L2 Cache | SRAM | 32 KB – 1 MB | 1–12 ns | Yes | Very High
L3 Cache | SRAM | 8–64 MB | 30–40 ns | Yes | High
Main Memory | DRAM | 4 GB – 512 GB | ~100 ns | Yes | ~$5–
SSD | NAND Flash | 128 GB – 8 TB | 50–100 μs | No | ~$0.
HDD | Magnetic disk | 500 GB – 20 | 5–20 ms | No | ~$0.

  • Flash Memory: A type of EEPROM that is erased in large blocks rather than byte-by-byte. Much faster erasure and higher density. Used in USB drives, SSDs, SD cards, and smartphones.

1.4 Cache Memory — Detailed Study

Cache memory is the critical bridge between the fast CPU and the relatively slow main memory. Every modern processor has multiple levels of cache. Understanding how cache works is essential for writing high-performance software.

Figure: Cache Hit/Miss Operation — showing decision flow and data path

1.4.1 Cache Hit and Miss

When the CPU needs to read data from memory, it first checks the cache. If the data is present (a cache hit), it is returned immediately at cache speed. If not (a cache miss), the CPU must wait for the data to be fetched from the next level of the hierarchy (L2, L3, or main memory), and a copy of the data is placed in the cache for future access.

Hit Ratio (h): Fraction of all memory references that result in a cache hit. Typical values: 0.90–0.99.

Average Access Time: T_avg = h × T_cache + (1 − h) × T_main. For h = 0.95, T_cache = 2 ns, T_main = 100 ns: T_avg = 0.95 × 2 + 0.05 × 100 = 1.9 + 5 = 6.9 ns.
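The average access time formula is easy to experiment with. The following is a minimal sketch in Python; the function name `average_access_time` is illustrative and not part of the course material. It reproduces the worked example and shows how sensitive the result is to the hit ratio.

```python
def average_access_time(hit_ratio, t_cache_ns, t_main_ns):
    """T_avg = h * T_cache + (1 - h) * T_main, with all times in nanoseconds."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * t_cache_ns + (1.0 - hit_ratio) * t_main_ns


# Values from the worked example: h = 0.95, T_cache = 2 ns, T_main = 100 ns.
t_avg = average_access_time(0.95, 2.0, 100.0)
print(f"h = 0.95 -> {t_avg:.2f} ns")  # h = 0.95 -> 6.90 ns

# Raising the hit ratio to 0.99 cuts the average by more than half.
t_avg_99 = average_access_time(0.99, 2.0, 100.0)
print(f"h = 0.99 -> {t_avg_99:.2f} ns")  # h = 0.99 -> 2.98 ns
```

Because the miss term is weighted by the much larger T_main, a few percentage points of hit ratio dominate the result.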

1.4.2 Cache Mapping Strategies

  • Direct Mapped Cache: Each main memory block maps to exactly one cache line (determined by block_addr mod num_cache_lines). Simple and fast to implement, but can cause conflict misses.
  • Fully Associative Cache: Any memory block can be stored in any cache line. Highest flexibility, lowest miss rate, but expensive to implement (requires searching all lines simultaneously).
  • Set-Associative Cache (N-way): Cache is divided into sets, each containing N lines. A block maps to a specific set but can occupy any of the N lines within that set. Best balance of cost and performance. Most CPUs use 4-way, 8-way, or 16-way set-associative caches.
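The three mapping strategies differ only in how many sets the cache is divided into. The sketch below assumes a hypothetical cache with 64-byte blocks and 8 lines (both values chosen for illustration) and splits a byte address into tag, set index, and block offset.

```python
def split_address(addr, block_size, num_lines, ways=1):
    """Return (tag, set_index, block_offset) for a byte address.

    ways=1 models a direct-mapped cache; ways=num_lines models a fully
    associative cache (a single set, so the index is always 0).
    """
    num_sets = num_lines // ways
    block_addr = addr // block_size      # which memory block the byte is in
    offset = addr % block_size           # position of the byte in the block
    set_index = block_addr % num_sets    # block_addr mod number of sets
    tag = block_addr // num_sets         # remaining high-order bits
    return tag, set_index, offset


BLOCK, LINES = 64, 8                     # hypothetical cache geometry
addr = 4852                              # block 75, offset 52

direct = split_address(addr, BLOCK, LINES, ways=1)
two_way = split_address(addr, BLOCK, LINES, ways=2)
fully = split_address(addr, BLOCK, LINES, ways=LINES)
print(direct, two_way, fully)            # (9, 3, 52) (18, 3, 52) (75, 0, 52)

# An address exactly 8 blocks further on lands on the same direct-mapped
# line with a different tag, which is how conflict misses arise.
conflict = split_address(addr + LINES * BLOCK, BLOCK, LINES, ways=1)
print(conflict)                          # (10, 3, 52)
```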

1.4.3 Cache Replacement Policies

  • LRU (Least Recently Used): Evict the line that was accessed least recently. A good practical approximation of the optimal policy, which would evict the line whose next use lies furthest in the future; widely used.
  • FIFO (First In, First Out): Evict the oldest cached line regardless of access frequency.
  • LFU (Least Frequently Used): Evict the line that has been accessed the fewest times.
  • Random Replacement: Randomly select a line to evict. Simple; performance is often surprisingly close to LRU.
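To compare policies on a concrete reference string, the following sketch simulates a small fully associative cache under LRU and FIFO. The function name, the trace, and the capacity of 3 lines are all illustrative assumptions.

```python
from collections import OrderedDict


def simulate(references, capacity, policy="lru"):
    """Count (hits, misses) for a fully associative cache of `capacity` lines."""
    cache = OrderedDict()                 # front of the dict = next victim
    hits = misses = 0
    for block in references:
        if block in cache:
            hits += 1
            if policy == "lru":
                cache.move_to_end(block)  # a hit refreshes recency under LRU
            # FIFO ignores hits: insertion order alone decides eviction
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False) # evict the entry at the front
            cache[block] = True
    return hits, misses


trace = [1, 2, 3, 1, 4, 1, 5]
lru_result = simulate(trace, capacity=3, policy="lru")
fifo_result = simulate(trace, capacity=3, policy="fifo")
print(lru_result)   # (2, 5)
print(fifo_result)  # (1, 6)
```

On this trace LRU keeps block 1 because it was just reused, while FIFO evicts it as the oldest arrival and pays an extra miss.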

1.4.4 Cache Write Policies

  • Write-Through: Every write to cache is also written immediately to main memory. Simple, always consistent, but generates high memory bus traffic.
  • Write-Back (Copy-Back): Writes go only to cache initially; the modified line is written to main memory only when it is evicted. Reduces bus traffic significantly but requires a 'dirty bit' per cache line.
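The difference in bus traffic can be illustrated with a deliberately simplified model. It assumes that every written line stays in the cache until one final flush, so a write-back cache writes each dirty line to memory exactly once; the function name and the trace are hypothetical.

```python
def memory_writes(store_lines, policy):
    """Count main-memory writes for a sequence of stores to cache lines.

    Simplifying assumption: no line is evicted until a final flush.
    """
    if policy == "write-through":
        return len(store_lines)          # every store also goes to memory
    if policy == "write-back":
        dirty = set()
        for line in store_lines:
            dirty.add(line)              # the store sets the line's dirty bit
        return len(dirty)                # each dirty line is written once
    raise ValueError(f"unknown policy: {policy}")


stores = [0, 0, 0, 1, 0, 1]              # six stores touching two lines
wt = memory_writes(stores, "write-through")
wb = memory_writes(stores, "write-back")
print(wt, wb)                            # 6 2
```

Real caches evict lines before a program ends, so actual write-back traffic falls somewhere between these two figures.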

1.5 Memory Interleaving

Memory interleaving improves effective memory bandwidth by dividing main memory into multiple independent banks and distributing consecutive addresses across these banks. When the CPU accesses sequential addresses (as in array traversal), different banks service the requests simultaneously, pipelining the memory accesses. In a k-bank interleaved system, address A goes to bank (A mod k). The controller initiates a new bank access every (cycle_time / k) time units, potentially multiplying memory bandwidth by k. Interleaving is particularly effective for streaming workloads and DMA transfers.
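The address-to-bank rule can be checked directly. This sketch assumes low-order interleaving with k = 4 banks (an illustrative choice) and shows why sequential accesses spread across banks while a stride equal to the bank count does not.

```python
def bank_and_row(address, num_banks):
    """Low-order interleaving: bank = A mod k, row within the bank = A div k."""
    return address % num_banks, address // num_banks


def banks_touched(addresses, num_banks):
    """List the bank serving each address, in access order."""
    return [bank_and_row(a, num_banks)[0] for a in addresses]


sequential = list(range(8))              # e.g. walking an array
strided = [0, 4, 8, 12]                  # stride equal to the bank count

seq_banks = banks_touched(sequential, 4)
stride_banks = banks_touched(strided, 4)
print(seq_banks)                         # [0, 1, 2, 3, 0, 1, 2, 3]
print(stride_banks)                      # [0, 0, 0, 0]
```

The strided pattern hits bank 0 every time, so the accesses serialize and the k-fold bandwidth gain disappears.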

1.6 Theory Questions and Answers — Memory System

Q1. What is the memory hierarchy? Why is it necessary, and what principle justifies its effectiveness?

Answer: The memory hierarchy is a structured arrangement of different memory technologies, organized from fastest/smallest/most-expensive (registers) to slowest/largest/cheapest (tape). It is necessary because no single memory technology can simultaneously provide high speed, large capacity, and low cost; these properties are fundamentally in tension. The hierarchy works because of the Principle of Locality: programs exhibit temporal locality (recently accessed data is likely to be accessed again soon) and spatial locality (data near recently accessed locations is likely to be accessed next). By keeping recently accessed data in fast cache memory, the system achieves near-cache performance for most accesses while the bulk of data resides in cheap, slow storage.

TOPIC 2: MEMORY ADDRESSING

2.1 Introduction to Memory Addressing

Memory addressing is the mechanism by which a processor identifies and accesses specific locations in memory. Every byte in a computer's addressable memory space has a unique numerical address. Instructions must specify both the operation to be performed and the operand(s) on which to operate. The addressing mode specifies how the operand address is to be computed from the information in the instruction.

The design of addressing modes is a critical aspect of Instruction Set Architecture (ISA) design. Rich addressing modes can reduce the number of instructions needed to express a computation, but they also increase hardware complexity and may require more bits in each instruction. RISC architectures tend to use simple addressing modes (immediate, register, register + offset) while CISC architectures (like x86) support many complex modes.

The address bus width determines the maximum addressable memory space. With byte addressing, a 32-bit address bus gives 2^32 bytes = 4 GB of addressable memory, while a 64-bit address bus gives 2^64 bytes = 16 EB (exabytes). In practice, current x86-64 processors implement fewer bits: 48 or 57 bits of virtual address and at most 52 bits of physical address.
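The relationship between address width and addressable space is a power of two, which a few lines of Python make concrete. This is a sketch assuming byte-addressable memory; the helper names are illustrative, and sizes are printed in binary units (GiB, EiB), which is what the "4 GB" and "16 EB" figures above mean.

```python
def addressable_bytes(address_bits):
    """Number of distinct byte addresses reachable with the given width."""
    return 2 ** address_bits


def human(num_bytes):
    """Format a power-of-two byte count using binary units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]
    index = 0
    while num_bytes >= 1024 and index < len(units) - 1:
        num_bytes //= 1024
        index += 1
    return f"{num_bytes} {units[index]}"


for bits in (32, 48, 64):
    print(bits, "bits ->", human(addressable_bytes(bits)))
# 32 bits -> 4 GiB
# 48 bits -> 256 TiB
# 64 bits -> 16 EiB
```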

2.2 Addressing Modes — Comprehensive Study

Figure: Overview of CPU Addressing Modes — showing syntax, mechanism, and memory access count

2.2.1 Immediate Addressing

The operand value is embedded directly within the instruction itself. No memory access is needed to retrieve the operand — the value is extracted from the instruction during decoding. The operand field in the instruction word contains the actual data value, not a reference to it.

Mechanism: Operand = Value field in the instruction | No memory access needed | Fastest mode

  • Syntax Example (assembly): MOV R1, #42 — Load the constant value 42 into register R1.
  • Advantage: Zero memory accesses; maximum speed for loading constants.
  • Limitation: The value must be known at assembly time; the range is limited by the instruction field width (e.g., 8-bit field = values −128 to 127 or 0 to 255).
  • Common use: Initializing counters, comparing registers to constants, loading small fixed values.

2.2.2 Direct (Absolute) Addressing

The instruction contains the full memory address of the operand. The processor uses this address as-is to access main memory. The effective address (EA) equals the address field in the instruction.

Mechanism: EA = Address field in instruction | 1 memory access | Simple but address field must be wide

  • Syntax Example: LOAD R1, 2000 — Load the value stored at memory address 2000 into R1.
  • Advantage: Conceptually simple; easy to understand and implement in hardware.
  • Limitation: The address field must be large enough to hold a complete memory address (e.g., 32 bits for a 4 GB address space), consuming significant instruction space. Programs loaded at different addresses must be reassembled (not position-independent).

2.2.3 Indirect Addressing

The instruction contains the address of a memory location that holds the effective address of the operand. This introduces one level of indirection: a 'pointer to the operand.' Two memory accesses are required: first to fetch the effective address, then to fetch the actual operand.

Mechanism: EA = Memory[Address field] | 2 memory accesses | Supports pointer operations

  • Syntax Example: LOAD R1, (2000) — Read the value at address 2000 to get the pointer, then load the data at that pointed-to address.
  • Advantage: Supports pointer-based programming and dynamic data structures (linked lists, trees).
  • Limitation: Slower due to two memory accesses. Requires the pointer to be correctly set up before use.

2.2.4 Register Addressing

The operand is in a CPU register. The instruction specifies a register number (typically 3–5 bits, enough to select among 8–32 registers). Register access is the fastest possible operand retrieval because registers are inside the CPU, with no memory bus latency.

Mechanism: Operand = Register[reg_field] | 0 memory accesses | Fastest for in-CPU data

  • Example: ADD R1, R2 — Add the contents of register R2 to register R1.
  • Limitation: Limited number of registers (typically 8–32 general-purpose registers). Programmers and compilers must carefully manage register allocation.
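The four modes described in this section can be compared side by side with a toy operand fetcher. This is only a sketch: memory and the register file are modeled as Python dictionaries, and the contents (address 2000 holding the pointer 3000, register R2 holding 5) are made-up values.

```python
def fetch_operand(mode, field, memory, registers):
    """Return (operand, memory_accesses) for one operand specifier."""
    if mode == "immediate":
        return field, 0                  # the field is the operand itself
    if mode == "register":
        return registers[field], 0       # operand comes from the register file
    if mode == "direct":
        return memory[field], 1          # EA = address field
    if mode == "indirect":
        pointer = memory[field]          # first access: fetch the EA
        return memory[pointer], 2        # second access: fetch the operand
    raise ValueError(f"unknown mode: {mode}")


memory = {2000: 3000, 3000: 77}          # address 2000 holds a pointer to 3000
registers = {"R2": 5}

imm = fetch_operand("immediate", 42, memory, registers)
reg = fetch_operand("register", "R2", memory, registers)
direct_op = fetch_operand("direct", 2000, memory, registers)
indirect_op = fetch_operand("indirect", 2000, memory, registers)
print(imm, reg, direct_op, indirect_op)  # (42, 0) (5, 0) (3000, 1) (77, 2)
```

The second element of each result matches the memory access counts given in the Mechanism lines: 0 for immediate and register, 1 for direct, 2 for indirect.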

2.3 Physical vs. Logical Addresses and Virtual Memory

A logical (virtual) address is generated by the CPU during program execution. It refers to the program's virtual address space: a contiguous range of addresses that the program believes it has to itself. A physical address is the actual location in the RAM hardware. The Memory Management Unit (MMU) is a hardware component that translates logical addresses to physical addresses transparently, every time memory is accessed. This translation is described by a page table (for paging systems) or a segment table (for segmentation systems).

Figure: Paging-based virtual to physical address translation showing the MMU and Page Table

2.3.1 Why Virtual Memory Matters

  • Process Isolation: Each process operates in its own virtual address space. Process A cannot access Process B's memory even if their virtual addresses overlap, because they map to different physical frames.
  • Running Programs Larger than Physical RAM: Pages of a process that are not currently needed can be swapped to disk (swap space). When the process touches a page that is not resident, the access raises a page fault and the OS brings the page back on demand (demand paging).
  • Simplified Programming: Programmers write programs as if they have access to the full virtual address space; the OS and MMU handle the physical reality.
  • Shared Libraries: Multiple processes can share the same physical page frames for read-only code pages (e.g., the C runtime library), saving memory.

2.3.2 Paging

Paging divides virtual memory into fixed-size pages (typically 4 KB) and physical memory into equally-sized frames. The page table maps virtual page numbers (VPN) to physical frame numbers (PFN). The virtual address is split into [VPN | Page Offset]. The VPN is looked up in the page table to get the PFN; the PFN is concatenated with the unchanged page offset to form the physical address.

To speed up page table lookups, modern CPUs include a Translation Lookaside Buffer (TLB) — a small fully-associative cache of recently used VPN→PFN translations. TLB hit rates of 99%+ are typical for most workloads.
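The VPN/offset split and the TLB shortcut can be sketched in a few lines. The example assumes 4 KB pages and models both the page table and the TLB as plain dictionaries; the mappings (VPN 1 to frame 9, and so on) are invented for illustration, and a real MMU does this in hardware.

```python
PAGE_SIZE = 4096                         # 4 KB pages
OFFSET_BITS = 12                         # log2(4096)


def translate(vaddr, page_table, tlb):
    """Return (physical_address, 'tlb hit' or 'tlb miss') for a virtual address."""
    vpn = vaddr >> OFFSET_BITS           # high-order bits: virtual page number
    offset = vaddr & (PAGE_SIZE - 1)     # low-order bits pass through unchanged
    if vpn in tlb:
        pfn, source = tlb[vpn], "tlb hit"
    elif vpn in page_table:
        pfn, source = page_table[vpn], "tlb miss"
        tlb[vpn] = pfn                   # cache the translation for next time
    else:
        raise LookupError(f"page fault: VPN {vpn} is not resident")
    return (pfn << OFFSET_BITS) | offset, source


page_table = {0: 5, 1: 9}                # hypothetical VPN -> PFN mappings
tlb = {}

first = translate(0x1ABC, page_table, tlb)
second = translate(0x1ABC, page_table, tlb)
print(hex(first[0]), first[1])           # 0x9abc tlb miss
print(hex(second[0]), second[1])         # 0x9abc tlb hit
```

Accessing an unmapped page, such as `translate(0x5000, page_table, tlb)`, raises `LookupError` here, standing in for the page fault the OS would handle.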

2.3.3 Segmentation

Segmentation divides the virtual address space into variable-length segments corresponding to logical program units (code segment, data segment, stack segment, heap segment). Each segment has a base address and a length limit. The virtual address is [segment number | offset]. The segment table maps segment numbers to base addresses; the hardware checks that the offset is within the segment's limit.

Many systems combine both approaches, translating a segmented address and then paging the result; this is called paged segmentation or segmented paging. x86-64 keeps a small number of segments, but in 64-bit mode the bases of most of them are treated as zero, which effectively disables traditional segmentation and leaves the paging system to do the work.
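The base-plus-offset translation with a limit check can be sketched as follows. The segment table contents and the function name are hypothetical, and the limit check that hardware performs is modeled with a Python exception.

```python
def translate_segmented(segment, offset, segment_table):
    """Translate [segment number | offset] to a physical address."""
    base, limit = segment_table[segment]
    if not 0 <= offset < limit:          # the hardware limit check
        raise ValueError(
            f"segment {segment}: offset {offset:#x} exceeds limit {limit:#x}"
        )
    return base + offset


# Hypothetical table: segment number -> (base address, length limit)
segment_table = {
    0: (0x1000, 0x400),                  # e.g. a code segment of 1 KB
    1: (0x8000, 0x100),                  # e.g. a stack segment of 256 bytes
}

code_addr = translate_segmented(0, 0x10, segment_table)
print(hex(code_addr))                    # 0x1010
```

An out-of-range access such as `translate_segmented(1, 0x100, segment_table)` raises `ValueError`, which is the software analogue of a segmentation fault.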

2.4 Theory Questions and Answers — Memory Addressing

Q1. What is an addressing mode? Why do processors support multiple addressing modes?

Answer: An addressing mode is the method by which an instruction specifies the location of its operand. Different addressing modes offer different trade-offs between instruction size, flexibility, and speed. Processors support multiple modes because different programming constructs require different access patterns: loading a constant uses immediate mode; traversing an array uses indexed mode; dereferencing a pointer uses register indirect mode; implementing branches uses relative mode. Supporting multiple modes allows the instruction set to express a wide range of operations efficiently, reducing both code size and execution time compared to a single rigid addressing scheme.

Q2. Distinguish between direct addressing and indirect addressing. Give an example of each.

Answer: In direct addressing, the instruction contains the actual memory address of the operand. Example: LOAD R1, 5000 fetches data from address 5000 (one memory access). In indirect addressing, the instruction contains the address of a memory location that holds the effective address of the operand. Example: LOAD R1, (5000) first reads address 5000 to get (say) 8000, then reads 8000 to get the data (two memory accesses). Indirect addressing is slower but enables pointer-based programming: the address can be changed dynamically at runtime by updating the pointer at address 5000.

Q3. Explain virtual memory. How does the MMU translate a virtual address to a physical address in a paging system?

Answer: Virtual memory is a memory management abstraction that gives each process the illusion of its own large, contiguous address space, regardless of physical RAM availability. In a paging system, the MMU splits the virtual address into a Virtual Page Number (VPN) and a Page Offset. The VPN is used as an index into the process's page table, which stores the mapping from VPN to Physical Frame Number (PFN). The physical address is constructed as [PFN | Page Offset]. To avoid accessing the page table in memory on every reference (which would double all memory accesses), the TLB caches recent VPN→PFN translations. On a TLB hit, the translation completes in one cycle; on a TLB miss, the page table is consulted.

Q4. A computer has a 20-bit virtual address space, a page size of 1 KB, and uses 4 bytes per page table entry. Calculate the size of the page table.
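The calculation Q4 asks for can be worked through as follows. This is a sketch that assumes a single-level page table with one entry per virtual page; the function name is illustrative.

```python
def page_table_size(virtual_address_bits, page_size_bytes, entry_bytes):
    """Return (number_of_entries, table_size_in_bytes) for a flat page table."""
    offset_bits = page_size_bytes.bit_length() - 1   # log2 of a power of two
    vpn_bits = virtual_address_bits - offset_bits    # bits left for the VPN
    entries = 2 ** vpn_bits                          # one entry per virtual page
    return entries, entries * entry_bytes


# Q4: 20-bit virtual addresses, 1 KB pages, 4-byte entries.
entries, size_bytes = page_table_size(20, 1024, 4)
print(entries, size_bytes)               # 1024 4096
```

A 1 KB page needs 10 offset bits, which leaves 10 bits for the VPN, so there are 2^10 = 1024 entries. At 4 bytes each the table occupies 4096 bytes, or 4 KB.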

TOPIC 3: HARDWARE CONTROL AND MICRO PROGRAM CONTROL

3.1 The Control Unit — Role and Function

The Control Unit (CU) is the 'conductor' of the CPU orchestra. It does not process data itself; instead, it interprets decoded machine instructions and generates precisely timed electrical signals that direct every other component in the processor — the ALU, registers, memory interface, and I/O controllers — to perform the correct sequence of operations. Every instruction in a program must pass through the control unit for interpretation. The process follows the instruction cycle (also called the fetch-decode-execute cycle or FDX cycle), which is the fundamental rhythm of all stored-program computers:

  1. FETCH: The Program Counter (PC) holds the address of the next instruction. The CU places this address on the address bus, reads the instruction from memory into the Instruction Register (IR), and increments PC.
  2. DECODE: The CU examines the opcode field of the IR. It determines what operation is required, what addressing mode is used, and what resources (registers, memory, ALU) are needed.
  3. EXECUTE: The CU generates a sequence of control signals to perform the operation: routing data through the ALU, initiating memory reads/writes, or updating registers.
  4. WRITE BACK: Results are stored in the destination register or memory location.

The control unit must generate the correct control signals at each step and in the correct temporal order. This is the fundamental design challenge that differentiates hardwired from microprogrammed control.
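The instruction cycle can be illustrated with a toy interpreter. The three-instruction machine below (LOADI, ADD, HALT) is invented for this sketch and does not correspond to any real ISA; instructions are stored as tuples rather than encoded bit patterns.

```python
def run(program):
    """Execute a toy program and return the final register contents."""
    registers = {"R1": 0, "R2": 0}
    pc = 0
    while True:
        ir = program[pc]                 # FETCH: read instruction at PC into IR
        pc += 1                          # ... and increment PC
        opcode, *operands = ir           # DECODE: split opcode from operands
        if opcode == "LOADI":            # EXECUTE and WRITE BACK
            dest, value = operands
            registers[dest] = value
        elif opcode == "ADD":
            dest, src = operands
            registers[dest] = registers[dest] + registers[src]
        elif opcode == "HALT":
            return registers
        else:
            raise ValueError(f"unknown opcode: {opcode}")


program = [
    ("LOADI", "R1", 40),                 # R1 <- 40
    ("LOADI", "R2", 2),                  # R2 <- 2
    ("ADD", "R1", "R2"),                 # R1 <- R1 + R2
    ("HALT",),
]
final = run(program)
print(final)                             # {'R1': 42, 'R2': 2}
```

In hardware the work done by each `if` branch is carried out by control signals, which is where the hardwired and microprogrammed designs diverge.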

3.2 Hardwired Control (Hardware Control)

In a hardwired control unit, the control logic is implemented entirely using combinational and sequential digital circuits — logic gates, flip-flops, decoders, and multiplexers. For each instruction opcode and each step in the instruction cycle, the combinational logic directly asserts the appropriate control signal outputs.

3.2.1 Structure

A hardwired control unit consists of the following components:

  • Instruction Register (IR): Holds the current instruction. The opcode field is extracted and fed to the instruction decoder.
  • Instruction Decoder: A combinational circuit that decodes the opcode into one-hot activation signals — each instruction type activates exactly one signal.
  • Timing / Step Counter: A counter (or state machine) that tracks the current step within the instruction cycle. Each instruction may require 3–10 or more steps.
  • Control Matrix (PLA or ROM): Combines the decoded instruction signals and the current timing step to produce the specific combination of control signals for each micro-operation.
  • Condition Code Flags: Status signals from the ALU (Zero, Carry, Overflow, Sign/Negative) that allow conditional control flow.
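Conceptually, the control matrix is a pure function of the decoded opcode, the timing step, and the condition flags. The sketch below models that function in Python for a hypothetical three-instruction machine; the opcode names, step numbering, and signal names are all assumptions made for illustration.

```python
def control_signals(opcode, step, zero_flag=False):
    """Map (opcode, timing step, flags) to the set of asserted control signals."""
    if step == 0:                        # fetch, identical for every opcode
        return {"PC_TO_MAR", "MEM_READ"}
    if step == 1:
        return {"MDR_TO_IR", "PC_INC"}
    if opcode == "ADD" and step == 2:
        return {"ALU_ADD", "REG_WRITE"}
    if opcode == "LOAD" and step == 2:
        return {"ADDR_TO_MAR", "MEM_READ"}
    if opcode == "LOAD" and step == 3:
        return {"MDR_TO_REG", "REG_WRITE"}
    if opcode == "BZ" and step == 2:     # branch if zero: uses a condition flag
        return {"PC_LOAD"} if zero_flag else set()
    raise ValueError(f"no control word for {opcode} at step {step}")


add_step = control_signals("ADD", 2)
taken = control_signals("BZ", 2, zero_flag=True)
not_taken = control_signals("BZ", 2, zero_flag=False)
print(sorted(add_step), sorted(taken), sorted(not_taken))
# ['ALU_ADD', 'REG_WRITE'] ['PC_LOAD'] []
```

Each `if` corresponds to a product term in the PLA; nothing is stored or fetched, which is why changing the behavior requires changing the circuit.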

Figure: Side-by-side comparison of Hardwired (left) and Microprogrammed (right) control unit structures

3.2.2 Advantages of Hardwired Control

  • Speed: Control signals are generated by propagation through logic gates only — typically 1–3 gate delays. There is no memory access involved in generating the next control state. This is why RISC processors, which prioritize clock speed, use hardwired control.
  • Determinism: The exact timing of all control signals is known at design time, simplifying formal verification and timing analysis.
  • Low power: No ROM access overhead for common instructions.

3.2.3 Disadvantages of Hardwired Control

  • Inflexibility: If a bug is found in the instruction behavior after chip fabrication, the only fix is a new chip revision — an expensive and time-consuming process.
  • Design complexity scales poorly: For complex instruction sets (CISC with 200+ instructions), the control logic becomes an enormous, difficult-to-verify combinational network.
  • Difficult to extend: Adding a new instruction requires modifying and re-verifying the entire control logic.

3.3 Microprogrammed Control

Microprogrammed control, introduced by Maurice Wilkes at Cambridge University in 1951, is one of the most elegant ideas in computer architecture. Instead of implementing control logic in fixed hardware, the control behavior is stored as a program — a microprogram — in a fast internal memory called the control store. The key insight is that the control unit's job is to generate a specific pattern of binary signals at each step of each instruction's execution. This pattern can be represented as a binary word (a microinstruction) stored in memory. Reading microinstructions from memory and using their bits to drive control lines is equivalent to hardwired logic — but far more flexible.

  3. This address is loaded into the CAR (MPC).
  4. The control store is read at address CAR → the microinstruction is placed in CDR.
  5. The bit fields of the CDR are decoded to generate control signals for the datapath.
  6. The sequencer logic computes the next CAR value (sequential, branch, or next instruction start).
  7. Repeat from step 4 until the microprogram for the current instruction completes.
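The sequencing loop in these steps can be sketched as a small simulation. The control store contents, the signal names, and the opcode-to-start-address table are hypothetical, and the "next address" field is simplified to either an address or an end marker.

```python
# Each control store word: the signals to assert plus the next CAR value
# (None marks the end of the microprogram for the current instruction).
CONTROL_STORE = [
    {"signals": {"ALU_ADD"}, "next": 1},                     # 0: ADD, first step
    {"signals": {"REG_WRITE"}, "next": None},                # 1: ADD, last step
    {"signals": {"ADDR_TO_MAR", "MEM_READ"}, "next": 3},     # 2: LOAD, first step
    {"signals": {"MDR_TO_REG", "REG_WRITE"}, "next": None},  # 3: LOAD, last step
]
START_ADDRESS = {"ADD": 0, "LOAD": 2}    # assumed opcode -> start address map


def run_microprogram(opcode):
    """Return the control signals asserted in each micro-step, in order."""
    trace = []
    car = START_ADDRESS[opcode]          # start address is loaded into the CAR
    while car is not None:
        cdr = CONTROL_STORE[car]         # control store read: word goes to CDR
        trace.append(sorted(cdr["signals"]))  # CDR fields drive the datapath
        car = cdr["next"]                # sequencer computes the next CAR value
    return trace


load_trace = run_microprogram("LOAD")
print(load_trace)
# [['ADDR_TO_MAR', 'MEM_READ'], ['MDR_TO_REG', 'REG_WRITE']]
```

Changing an instruction's behavior here means editing `CONTROL_STORE`, not the loop, which mirrors the flexibility argument for microprogramming.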

3.3.3 Horizontal Microprogramming

In a horizontal microinstruction format, each bit (or small group of bits) directly controls one specific control line in the CPU datapath. A wide microinstruction word (e.g., 100–200 bits) can specify many simultaneous micro-operations. The key advantage is maximum parallelism: all non-conflicting micro-operations in one clock cycle can be expressed in a single microinstruction.

Example 100-bit microinstruction: Bit 0: ALU_ADD | Bit 1: ALU_SUB | Bits 5-8: Source Reg | Bits 9-12: Dest Reg | Bit 20: MEM_READ | Bit 21: MEM_WRITE | ...

Disadvantage: Most bits in most microinstructions are zero (unused), making the control store very wide and somewhat wasteful.

3.3.4 Vertical Microprogramming

In a vertical microinstruction format, control signals are encoded into compact operation codes within the microinstruction. A decoder expands these encoded fields into actual control lines. For example, instead of one bit per possible ALU operation, a 4-bit ALU opcode field encodes one of 16 ALU operations. This results in narrower microinstructions and smaller control stores, but requires additional decoder hardware and limits the number of micro-operations that can be specified simultaneously (reducing parallelism).
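The contrast between the two formats comes down to how a field is decoded. The sketch below uses four hypothetical control lines: the horizontal form spends one bit per line, while the vertical form spends two bits to name a single line.

```python
CONTROL_LINES = ["PC_INC", "MEM_READ", "REG_WRITE", "ALU_ADD"]  # hypothetical


def decode_horizontal(word):
    """One bit per line: bit i set means CONTROL_LINES[i] is asserted."""
    return [
        line for i, line in enumerate(CONTROL_LINES) if (word >> i) & 1
    ]


def decode_vertical(code):
    """Encoded field: a 2-bit code selects exactly one of the four lines."""
    return [CONTROL_LINES[code]]


horizontal = decode_horizontal(0b0011)   # 4-bit field, two lines at once
vertical = decode_vertical(0b01)         # 2-bit field, one line only
print(horizontal)                        # ['PC_INC', 'MEM_READ']
print(vertical)                          # ['MEM_READ']
```

The horizontal word asserts two lines in the same cycle with no decoding step, whereas the vertical field is half as wide but needs the lookup and can name only one line per cycle.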

3.3.5 Advantages of Microprogrammed Control

  • Flexibility: Control behavior can be patched after fabrication by updating a writable control store. Intel and AMD, for example, have shipped microcode updates to correct processor errata and as part of the Spectre mitigations.
  • Complex instruction support: CISC instructions like x86 string operations, complex addressing modes, and BCD arithmetic are straightforwardly implemented as microprograms.
  • Design simplicity: Verification is easier — testing a microprogram is like testing software rather than verifying combinational hardware.
  • Emulation: One CPU can be made to execute the instruction set of another by loading appropriate microprograms (microprogram-level emulation).

3.3.6 Disadvantages of Microprogrammed Control

  • Speed penalty: Every machine instruction requires fetching one or more microinstructions from the control store — additional memory accesses not present in hardwired designs.
  • Control store latency: Even with fast SRAM, the control store adds cycle time overhead.
  • Modern hybrid approach: Contemporary x86 processors (Intel, AMD) use a hybrid — common simple instructions are 'cracked' into direct micro-operations (essentially hardwired), while rare complex instructions use a microcode sequencer. This gives near-RISC speed for common instructions with CISC flexibility for the rest.

3.4 Theory Questions and Answers — Hardware and Micro Program Control

Q1. Describe the fetch-decode-execute cycle in detail. What role does the control unit play at each stage?

Answer: Fetch: The CU asserts a memory read control signal with the PC value as the address, reads the instruction into IR, and increments PC. Decode: The CU's instruction decoder interprets the opcode in IR to determine the instruction type, addressing mode, and required resources. Execute: The CU generates a precisely timed sequence of control signals. For a LOAD instruction, for example, it sends the operand address to MAR, asserts MEM_READ, waits for data to appear in MDR, then asserts REG_WRITE to transfer MDR contents to the destination register. Write-back: For instructions that compute a result (e.g., ADD), the CU routes the ALU output to the destination register via a write control signal. The CU coordinates all these activities without processing any data itself.

Q2. What is a microinstruction? Give a concrete example of what its bit fields might control.

Answer: A microinstruction is a single binary word stored in the control store that specifies the control signals to be asserted during one clock cycle of instruction execution. For example, a 32-bit microinstruction might be structured as: bits [3:0] = ALU operation code (0000=ADD, 0001=SUB, 0010=AND, etc.); bits [7:4] = source register A; bits [11:8] = source register B; bits [15:12] = destination register; bit [16] = MEM_READ; bit [17] = MEM_WRITE; bit [18] = REG_WRITE; bits [28:19] = next microinstruction address; bits [30:29] = branch condition; bit [31] = end of microprogram. Each clock cycle, one microinstruction is fetched and its fields decoded to drive the datapath.

Q3. Compare hardwired and microprogrammed control. Under what circumstances would you choose each?

Answer: Hardwired control: Implements control logic as fixed combinational/sequential circuits. Extremely fast (signals generated in 1–3 gate delays), no memory access overhead. Inflexible: bugs require hardware revision, and complex ISAs lead to unmanageable circuit complexity. Best for RISC architectures with small, regular instruction sets where maximum clock speed is the priority. Microprogrammed control: Stores control logic as microprograms in a control store ROM. Slower (one memory access per micro-step), but extremely flexible: instruction behavior can be modified by updating firmware. Best for CISC architectures with complex, irregular instruction sets, or any system where field-upgradable control logic is valuable. Modern processors use a hybrid: fast hardwired paths for common instructions, microcode fallback for rare complex ones.

Q4. Explain the difference between horizontal and vertical microprogramming with respect to parallelism, control store width, and decoding overhead.

Answer: Horizontal microprogramming: Each bit in the microinstruction directly controls one control line. If the CPU has N control signals, each microinstruction is N bits wide. Multiple control signals can be asserted simultaneously in one microinstruction, maximizing datapath parallelism. The control store is very wide (potentially 100–200 bits per word) but no decoder is needed between the CDR and the control lines. Vertical microprogramming: Control signals are encoded into compact opcode fields. A microinstruction might use 4 bits to specify one of 16 ALU operations rather than 16 separate bits. Microinstructions are narrow (20–32 bits), the control store is smaller, but a decoder is needed to expand the encoded fields, adding a small delay. Parallelism is limited: each encoded field can specify only one operation per cycle. A compromise, nano-programming, uses a two-level scheme: short microinstructions address a nanoinstruction table that holds wide horizontal words.