I recently gave a short presentation on the workings of the GHCJS linker. This post is a summary of the content.
Therefore, when we link a Haskell program, we generate a
|compiled/linked Haskell code|
|list of foreign calls from |
|source code size origin statistics for |
|non-Haskell code, from |
|generated part of RTS (apply functions and similarly repetitive things)|
|single line just starts |
|complete runnable program, created by combining |
Most of the work done by the linker is producing
out.js, and that's what we'll be focusing on in the next sections.
The linker builds
out.js by collecting all code reachable from
main (and a few other symbols required by the RTS) and generating the required initialization code for all top-level data. The code is found in object files. These object files have the following structure:
|Header||version number and offsets of other sections|
|String table||shared string table, referred to by |
|Dependencies||Dependency data, internally between binding groups and externally to symbols in other object files|
The object files contain binding groups of mutually dependent bindings. These are the smallest units of code that can be linked. Each binding group has some associated metadata required for initialization of the heap objects in the group. The metadata contains for example constructor tags (e.g. 1 for
Nothing, 2 for
Just), the arity of functions and static reference tables.
From a high level, the procedure that the linker follows is this:
|Read object files from dependencies into memory|
|Decode dependency part of all object files in dependencies (includes reading the string tables)|
|Using dependency data, find all code reachable from |
|Decode reachable binding groups|
|Construct initializers from metadata|
We avoid decoding (deserializing) the binding groups that do end up in the linked result to keep the memory consumption lower. Still the linker requires a lot of memory for larger programs, so we may need to make more improvements in the future.
The compactor is an optional link-time transformation step that reduces code size. It consists of a lightweight (i.e. no expensive operations like dataflow analysis) rewrite of the code contained in the object files. The compactor is disabled when linking with the
-debug flag. There are a few steps involved.
Renaming private symbols
Haskell names are quite long by default: they need to be globally unique, hence they contain their defining unit-id and module name. For example:
mtl-2.2.2-somehash-Control.Monad.State.Lazy.execState_go1 (special characters would be z-encoded but it isn't shown here).
Without the compactor, the linker generates an
h$initObj initialization call (or
h$o) call for each global Haskell heap value. The code for this can get quite big. The compactor collects all heap objects to be initialized in a single large array and encodes the metadata in a string. This makes the initialization code much more compact.
An optional step in the compactor is deduplication of code. When deduplication is enabled with the
As subsequent Template Haskell expressions are evaluated in the same process, there is no need to load already loaded dependencies (including the RTS) again and it is much more efficient to avoid doing so. Therefore the linker keeps track of which dependencies have already been linked and each subsequent TH expression is only linked against dependencies that are not already loaded in the evaluator process.
It's also possible for users to use this functionality directly, with the
-generate-base to create a "linker state" file along with the regular
jsexe files. Another program can then be linked with
-use-base=state_file, resulting in a program which leaves out everything already present in the first program.
Memory consumption is the biggest problem in the linker at the moment. Possible ways to achieve this are compression, more efficient representation of the data structures or more incremental loading of the parts from the object files that we need.