Introduction to LLVM Compiler System, Lecture notes of Compiler Design

An overview of the LLVM Compiler Infrastructure and its components for building compilers. It explains how LLVM reduces the time and cost of building a new compiler and supports building different kinds of compilers. The document also discusses the LLVM Compiler Framework and its end-to-end compilers, describes the LLVM Optimizer and its series of passes, and includes source code examples and visualizations of the LLVM Compiler System. Useful for students studying compilers and programming languages.

Typology: Lecture notes

2022/2023

Uploaded on 05/11/2023

Carnegie Mellon

Lecture 2

Overview of the LLVM Compiler

Dominic Chen

Thanks to: Vikram Adve, Jonathan Burket, Deby Katz, David Koes, Chris Lattner, Gennady Pekhimenko, and Olatunji Ruwase, for their slides

LLVM Compiler System

The LLVM Compiler Infrastructure

  • Provides reusable components for building compilers
  • Reduce the time/cost to build a new compiler
  • Build different kinds of compilers
      • Our homework assignments focus on static compilers
      • There are also JITs, trace-based optimizers, etc.

The LLVM Compiler Framework

  • End-to-end compilers using the LLVM infrastructure
  • Support for C and C++ is robust and aggressive
      • Java, Scheme and others are in development
  • Emit C code or native code for x86, SPARC, PowerPC

Visualizing the LLVM Compiler System

Three-Phase Design:

Source Code (C, C++, Java) -> Clang (Front End) -> Intermediate Form (LLVM IR) -> LLVM Optimizer -> Back End -> Object Code (x86, ARM, Sparc)

The LLVM Optimizer is a series of “passes”

  • Analysis and optimization passes, run one after another
  • Analysis passes do not change code, optimization passes do

LLVM Intermediate Form is a Virtual Instruction Set

  • Language- and target-independent form
      • Used to perform the same passes for all source and target languages
  • Internal Representation (IR) and external (persistent) representation

LLVM: From Source to Binary

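Stages, from more language-specific to more architecture-specific:

C Source Code -> Clang AST -> LLVM IR -> SelectionDAG -> MachineInst -> MCInst / Assembly

  • Clang Frontend (clang): C Source Code -> Clang AST -> LLVM IR
  • Optimizer (opt): LLVM IR -> LLVM IR
  • Target-Independent Code Generator (llc): LLVM IR -> SelectionDAG -> MachineInst -> MCInst / Assembly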

C Source Code

int main() {
    int a = 5;
    int b = 3;
    return a - b;
}

Read “Life of an instruction in LLVM”:

http://eli.thegreenplace.net/2012/11/24/life-of-an-instruction-in-llvm

Clang AST

TranslationUnitDecl 0xd8185a0 <<invalid sloc>> <invalid sloc>
|-TypedefDecl 0xd818870 <<invalid sloc>> <invalid sloc> implicit __builtin_va_list 'char *'
`-FunctionDecl 0xd8188e0 <example.c:1:1, line:5:1> line:1:5 main 'int ()'
  `-CompoundStmt 0xd818a90 <col:12, line:5:1>
    |-DeclStmt 0xd818998 <line:2:5, col:14>
    | `-VarDecl 0xd818950 <col:5, col:13> col:9 used a 'int' cinit
    |   `-IntegerLiteral 0xd818980 <col:13> 'int' 5
    |-DeclStmt 0xd818a08 <line:3:5, col:14>
    | `-VarDecl 0xd8189c0 <col:5, col:13> col:9 used b 'int' cinit
    |   `-IntegerLiteral 0xd8189f0 <col:13> 'int' 3
    `-ReturnStmt 0xd818a80 <line:4:5, col:16>
      `-BinaryOperator 0xd818a68 <col:12, col:16> 'int' '-'
        |-ImplicitCastExpr 0xd818a48 <col:12> 'int'
        | `-DeclRefExpr 0xd818a18 <col:12> 'int' lvalue Var 0xd818950 'a' 'int'
        `-ImplicitCastExpr 0xd818a58 <col:16> 'int'
          `-DeclRefExpr 0xd818a30 <col:16> 'int' lvalue Var 0xd8189c0 'b' 'int'

Clang AST

CompoundStmt
  DeclStmt
    IntegerLiteral
  DeclStmt
    IntegerLiteral
  ReturnStmt
    BinaryOperator
      ImplicitCastExpr
        DeclRefExpr
      ImplicitCastExpr
        DeclRefExpr

LLVM IR

LLVM IR exists in three forms:

  • In-Memory Data Structure
  • Bitcode (.bc files)
  • Text Format (.ll files)

Text Format (.ll files), excerpt:

define i32 @main() #0 {
entry:
  %retval = alloca i32, align 4
  %a = alloca i32, align 4

Bitcode (.bc files), excerpt:

42 43 C0 DE 21 0C 00 00
06 10 32 39 92 01 84 0C
0A 32 44 24 48 0A 90 21
18 00 00 00 98 00 00 00
E6 C6 21 1D E6 A1 1C DA

llvm-dis converts bitcode to the text format; llvm-as converts the text format to bitcode.

Bitcode files and LLVM IR text files are lossless serialization formats!
We can pause optimization and come back later.
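The bitcode excerpt begins with the bytes 42 43 C0 DE, which is the magic number of a raw bitcode file ('B', 'C', 0xC0, 0xDE). As a minimal sketch, the check below tests whether a byte buffer starts with that magic. The helper name looksLikeBitcode is hypothetical and is not part of LLVM, and the sketch assumes the raw format only (it does not handle wrapped bitcode files).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an LLVM API): returns true when the buffer starts
// with the raw bitcode magic bytes 'B', 'C', 0xC0, 0xDE.
bool looksLikeBitcode(const std::vector<std::uint8_t>& bytes) {
    static const std::array<std::uint8_t, 4> kMagic = {0x42, 0x43, 0xC0, 0xDE};
    if (bytes.size() < kMagic.size()) {
        return false;  // too short to contain the magic number
    }
    for (std::size_t i = 0; i < kMagic.size(); ++i) {
        if (bytes[i] != kMagic[i]) {
            return false;
        }
    }
    return true;
}
```

A .ll text file would fail this check, since it starts with printable IR text rather than the magic bytes.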

Linking and Link-Time Optimization

.o file, .o file -> LLVM Linker (performs link-time optimizations) -> LLVM Backend -> Native Executable

The LLVM Linker can also produce a bitcode file for JIT.

Goals of LLVM Intermediate Representation (IR)

  • Easy to produce, understand, and define
  • Language- and Target-Independent
  • One IR for analysis and optimization
  • Supports high- and low-level optimization
  • Optimize as much as possible, as early as possible

LLVM Instruction Set Overview

  • Low-level and target-independent semantics
      • RISC-like three address code
      • Infinite virtual register set in SSA form
      • Simple, low-level control flow constructs
      • Load/store instructions with typed-pointers

for (i = 0; i < N; i++)
    Sum(&A[i], &P);

loop:                                   ; preds = %bb0, %loop
  %i.1 = phi i32 [ 0, %bb0 ], [ %i.2, %loop ]
  %AiAddr = getelementptr float* %A, i32 %i.1
  call void @Sum(float* %AiAddr, %pair* %P)
  %i.2 = add i32 %i.1, 1
  %exitcond = icmp eq i32 %i.1, %N
  br i1 %exitcond, label %outloop, label %loop

LLVM Instruction Set Overview (continued)

  • High-level information exposed in the code
      • Explicit dataflow through SSA form (more later in the class)
      • Explicit control-flow graph (even for exceptions)
      • Explicit language-independent type-information
      • Explicit typed pointer arithmetic
          • Preserves array subscript and structure indexing
  • Nice syntax for calls is preserved, as in the call to @Sum in the loop example
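The getelementptr instruction in the loop example computes the address of A[i] without touching memory: the base pointer plus the index scaled by the element size. A minimal C++ sketch of that arithmetic follows. The function name elementAddress is hypothetical, and the integer address arithmetic assumes a flat address space, as on common desktop and server platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of `getelementptr float* %A, i32 %i`: the result is the
// base address plus the index scaled by sizeof(float). No memory is read.
std::uintptr_t elementAddress(const float* base, std::size_t index) {
    return reinterpret_cast<std::uintptr_t>(base) + index * sizeof(float);
}
```

Because the element type is part of the instruction, the scaling factor is known from the type alone, which is what "typed pointer arithmetic" refers to.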

Lowering Source-Level Types to LLVM

  • Source language types are lowered:
      • Rich type systems expanded to simple types
      • Implicit & abstract types are made explicit & concrete
  • Examples of lowering:
      • References turn into pointers: T& -> T*
      • Complex numbers: complex float -> {float, float}
      • Bitfields: struct X { int Y:4; int Z:2; } -> { i32 }
  • The entire type system consists of:
      • Primitives: label, void, float, integer, …
          • Arbitrary bitwidth integers (i1, i32, i64, i1942652)
      • Derived: pointer, array, structure, function (unions get turned into casts)
      • No high-level types
  • Type system allows arbitrary casts
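The lowering examples can be observed from the C++ side. The sketch below is illustrative only: it assumes a common ABI in which int is 32 bits and the bitfield struct packs into a single int-sized word (this is implementation-defined), and the names LoweredComplex, addressThroughReference, bitfieldStructSize, and complexMatchesPair are hypothetical.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Bitfield struct from the slide; on common ABIs it occupies one int-sized word.
struct X { int Y : 4; int Z : 2; };

// Explicit stand-in for the lowered form of `complex float`: {float, float}.
struct LoweredComplex { float re; float im; };

// A reference parameter designates the same object a pointer would point to.
const int* addressThroughReference(const int& ref) { return &ref; }

// Size of the bitfield struct (implementation-defined, typically sizeof(int)).
std::size_t bitfieldStructSize() { return sizeof(X); }

// True when std::complex<float> has the same size as a pair of floats.
bool complexMatchesPair() {
    return sizeof(std::complex<float>) == sizeof(LoweredComplex);
}
```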

Example Function in LLVM IR

int main() {
    int a = 5;
    int b = 3;
    return a - b;
}

clang produces:

define i32 @main() #0 {
entry:
  %retval = alloca i32, align 4
  %a = alloca i32, align 4
  %b = alloca i32, align 4
  store i32 0, i32* %retval
  store i32 5, i32* %a, align 4
  store i32 3, i32* %b, align 4
  %0 = load i32* %a, align 4
  %1 = load i32* %b, align 4
  %sub = sub nsw i32 %0, %1
  ret i32 %sub
}

Note the explicit stack allocation (alloca), the explicit loads and stores, and the explicit types.

Example Function in LLVM IR

Before mem2reg:

define i32 @main() #0 {
entry:
  %retval = alloca i32, align 4
  %a = alloca i32, align 4
  %b = alloca i32, align 4
  store i32 0, i32* %retval
  store i32 5, i32* %a, align 4
  store i32 3, i32* %b, align 4
  %0 = load i32* %a, align 4
  %1 = load i32* %b, align 4
  %sub = sub nsw i32 %0, %1
  ret i32 %sub
}

After mem2reg:

define i32 @main() #0 {
entry:
  %sub = sub nsw i32 5, 3
  ret i32 %sub
}

Not always possible: sometimes stack operations are too complex.
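To make the before/after pair concrete, here is a toy version of the idea behind mem2reg, restricted to a single straight-line block. It is a sketch under strong assumptions, not LLVM's algorithm: the real pass handles multiple blocks and inserts phi nodes, while this one only forwards the last stored value to later loads. The Inst type and promoteSlots function are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy instruction. For "store", args = {value, slot}; for "load", args = {slot}.
struct Inst {
    std::string op;
    std::string dest;               // result name, empty if none
    std::vector<std::string> args;  // operand names or constants
};

// Removes allocas and the loads/stores on them within one straight-line block,
// forwarding each stored value to the instructions that used the loaded result.
std::vector<Inst> promoteSlots(const std::vector<Inst>& block) {
    std::map<std::string, std::string> slotValue;  // slot -> last stored value
    std::map<std::string, std::string> forwarded;  // load result -> value
    std::vector<Inst> out;
    for (const Inst& in : block) {
        if (in.op == "alloca") {
            slotValue[in.dest] = "undef";  // slot exists but holds no value yet
            continue;
        }
        Inst copy = in;
        for (std::string& arg : copy.args) {  // rewrite uses of removed loads
            auto it = forwarded.find(arg);
            if (it != forwarded.end()) arg = it->second;
        }
        if (copy.op == "store" && copy.args.size() == 2 &&
            slotValue.count(copy.args[1])) {
            slotValue[copy.args[1]] = copy.args[0];
            continue;
        }
        if (copy.op == "load" && copy.args.size() == 1 &&
            slotValue.count(copy.args[0])) {
            forwarded[copy.dest] = slotValue[copy.args[0]];
            continue;
        }
        out.push_back(copy);  // everything else is kept
    }
    return out;
}
```

Applied to the ten instructions of the example function, this leaves only the sub (with operands 5 and 3) and the ret. Once a slot's address escapes or control flow merges, this simple forwarding is no longer enough, which is one reading of "stack operations are too complex".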

LLVM Program Structure

  • Module contains Functions and GlobalVariables
      • Module is a unit of analysis, compilation, and optimization
  • Function contains BasicBlocks and Arguments
      • Functions roughly correspond to functions in C
  • BasicBlock contains a list of Instructions
      • Each block ends in a control flow instruction
  • Instruction is an opcode + vector of operands

Module -> Function, Function, Function
Function -> Basic Block, Basic Block, Basic Block
Basic Block -> Instruction, Instruction, Instruction

  • Traversal of the LLVM IR data structure usually occurs through doubly-linked lists
  • LLVM also supports the Visitor Pattern (more next time)
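The containment hierarchy and its list-based traversal can be modeled with standard containers. This is a toy sketch, not the LLVM classes: the struct names mirror the slide, std::list stands in for LLVM's own doubly-linked lists, and the set of terminator opcodes checked is only a subset chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

// Toy stand-ins for the IR hierarchy (not the LLVM classes).
struct Instruction { std::string opcode; std::vector<std::string> operands; };
struct BasicBlock  { std::string name; std::list<Instruction> instructions; };
struct Function    { std::string name; std::vector<std::string> arguments;
                     std::list<BasicBlock> blocks; };
struct Module      { std::list<Function> functions;
                     std::vector<std::string> globals; };

// Walks Module -> Function -> BasicBlock -> Instruction via the linked lists.
std::size_t countInstructions(const Module& m) {
    std::size_t n = 0;
    for (const Function& f : m.functions)
        for (const BasicBlock& bb : f.blocks)
            n += bb.instructions.size();
    return n;
}

// Checks that a block ends in a control flow instruction (subset of opcodes).
bool endsInTerminator(const BasicBlock& bb) {
    if (bb.instructions.empty()) return false;
    const std::string& op = bb.instructions.back().opcode;
    return op == "ret" || op == "br" || op == "switch" || op == "unreachable";
}
```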

LLVM Pass Manager

  • Compiler is organized as a series of “passes”:
      • Each pass is an analysis or transformation
      • Each pass can depend on results from previous passes
  • Six useful types of passes:
      • BasicBlockPass: iterate over basic blocks, in no particular order
      • CallGraphSCCPass: iterate over SCCs, in bottom-up call graph order
      • FunctionPass: iterate over functions, in no particular order
      • LoopPass: iterate over loops, in reverse nested order
      • ModulePass: general interprocedural pass over a program
      • RegionPass: iterate over single-entry/exit regions, in reverse nested order
  • Passes have different constraints (e.g. FunctionPass):
      • FunctionPass can only look at the “current function”
      • Cannot maintain state across functions
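As a rough model of how a pass manager drives function passes, the sketch below runs each pass on one function at a time and reports whether anything changed. All names (ToyFunction, ToyFunctionPass, RemoveNops, CountInstructions, ToyPassManager) are hypothetical; the only convention borrowed from LLVM is that a pass's run method returns true when it modified the code, so an analysis pass always returns false.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct ToyFunction { std::string name; std::vector<std::string> instructions; };

struct ToyFunctionPass {
    virtual ~ToyFunctionPass() = default;
    // Sees only the current function; returns true if it modified it.
    virtual bool runOnFunction(ToyFunction& f) = 0;
};

// Transformation pass: deletes "nop" instructions.
struct RemoveNops : ToyFunctionPass {
    bool runOnFunction(ToyFunction& f) override {
        std::size_t before = f.instructions.size();
        f.instructions.erase(
            std::remove(f.instructions.begin(), f.instructions.end(),
                        std::string("nop")),
            f.instructions.end());
        return f.instructions.size() != before;
    }
};

// Analysis pass: result describes the current function only, code is untouched.
struct CountInstructions : ToyFunctionPass {
    std::size_t lastCount = 0;
    bool runOnFunction(ToyFunction& f) override {
        lastCount = f.instructions.size();  // overwritten for each function
        return false;
    }
};

struct ToyPassManager {
    std::vector<ToyFunctionPass*> passes;  // non-owning, run in order
    bool run(std::vector<ToyFunction>& functions) {
        bool changed = false;
        for (ToyFunction& f : functions)
            for (ToyFunctionPass* p : passes)
                changed |= p->runOnFunction(f);
        return changed;
    }
};
```

Placing CountInstructions after RemoveNops means the analysis sees the already transformed function, which illustrates a pass depending on the results of a previous one.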

LLVM Tools

  • Basic LLVM Tools
      • llvm-dis: Convert from .bc (IR binary) to .ll (human-readable IR text)
      • llvm-as: Convert from .ll (human-readable IR text) to .bc (IR binary)
      • opt: LLVM optimizer
      • llc: LLVM static compiler
      • lli: LLVM bitcode interpreter
      • llvm-link: LLVM bitcode linker
      • llvm-ar: LLVM archiver
  • Some Additional Tools
      • bugpoint: automatic test case reduction tool
      • llvm-extract: extract a function from an LLVM module
      • llvm-bcanalyzer: LLVM bitcode analyzer
      • FileCheck: flexible pattern matching file verifier
      • tblgen: Target Description To C++ Code Generator

opt: LLVM modular optimizer

  • Invoke an arbitrary sequence of passes:
      • Completely control PassManager from command line
      • Supports loading passes as plugins from *.so files

opt -load foo.so -pass1 -pass2 -pass3 x.bc -o y.bc

  • Passes “register” themselves:
      • When you write a pass, you must write the registration

RegisterPass<FunctionInfo> X("function-info",
                             "15745: Function Information");
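The self-registration idea can be sketched without LLVM: a static object whose constructor records the pass in a global registry, so the pass becomes known by name before main() runs. The names toyPassRegistry and ToyRegisterPass are hypothetical and only imitate the spirit of the registration above; LLVM's real registry stores more than a description.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical registry (not LLVM's): command-line name -> description.
// A function-local static avoids static initialization order problems.
std::map<std::string, std::string>& toyPassRegistry() {
    static std::map<std::string, std::string> registry;
    return registry;
}

// Constructing one of these records the pass under its command-line name.
struct ToyRegisterPass {
    ToyRegisterPass(const std::string& arg, const std::string& description) {
        toyPassRegistry()[arg] = description;
    }
};

// A static object at namespace scope registers the pass during startup.
static ToyRegisterPass X("function-info", "15745: Function Information");
```

A driver in the style of opt could then look up "function-info" in the registry when it sees the matching command-line flag.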