Biweekly update from the GHC DevX team at IOG.
GHC DevX Update 2023-01-26
This is the second biweekly update of the IOG GHC DevX team. You can find the previous one here.
One step forward, an easier interoperability between Rust and Haskell
TL;DR: This blog post sums up the why and how of `cargo-cabal` and `hs-bindgen`. If you're looking for usage walkthroughs and code examples, check out the project READMEs on GitHub!
N.B. Quoted paragraphs in this article give straightforward motivation for some basic systems programming concepts. Feel free to skip them if you're already comfortable with those ;)
Using GHC's JavaScript Backend in the Browser
In a previous post we introduced GHC's new JavaScript backend, which allows the compilation of Haskell code into JavaScript. This is the first tutorial in a new series about the JavaScript backend. In this post, we'll build GHC as a JavaScript cross-compiler and run a trivial Haskell program in the browser.
We plan to write more of these blog posts in the coming weeks and months as we add new features (e.g. support for "foreign exports" that will allow JavaScript code to call into Haskell code, support for Template Haskell, etc.). For now, using the backend relies on our "insider" knowledge (e.g. how the FFI works) that isn't well documented elsewhere. We do plan to add a chapter about the JavaScript backend to GHC's user guide, but for now your best bet is to look at GHCJS's documentation or at the source code.
Please note: this is a technology preview of the in-development JavaScript backend for GHC. Not all Haskell features are implemented, and bugs are expected. It is currently rather complicated for JavaScript code to call into Haskell code ("foreign exports" aren't implemented). GHC isn't a multi-target compiler yet, so a GHC executable built for a native platform (Linux/x86-64, Windows/x86-64, Darwin/AArch64...) as currently distributed (via ghcup, Stack, binary distributions, etc.) won't be able to produce JavaScript. Official prebuilt binary distributions are likely to remain unavailable until GHC gains multi-target support, so the JavaScript backend will need to be built from source even after it matures. That's why we start this post with the steps required to build yourself a GHC compiler capable of producing JavaScript.
GHC DevX Update 2023-01-12
Starting in 2023 we–the IOG GHC DevX team–are going to provide biweekly updates about our work. This is the first edition.
JavaScript backend merged into GHC
A new JavaScript backend was merged into GHC on November 30th, 2022! This means that the next release of GHC will be able to emit code that runs in web browsers without requiring any extra tools, enabling Haskell for both front-end and back-end web applications.
In this post, we, the GHC DevX team at IOG, describe the challenges we faced bringing GHCJS to GHC, how we overcame those challenges, and what's left to do. This post is rather long so we've provided these links in case you would like to skip ahead:
Take me to the future of GHCJS
Tell me what to expect
Show me the product roadmap
Tell me how I can help
Just show me how to hello world! (Skip to build instructions)
Why JavaScript? Or, the Big Picture.
To put it simply, the number of users on the internet is as low as it will ever be right now, and it is almost guaranteed that those users use JavaScript. At the time of writing, JavaScript holds 97.3% of client-side programming market share (not to mention the market share of front-end technologies). Furthermore, JavaScript is not going to disappear anytime soon. As more and more interactivity is pushed onto the internet, JavaScript will become more entrenched because of backwards compatibility, network effects, and the amount of capital already devoted to it. JavaScript, like C and COBOL, will be with us for the foreseeable future. This makes JavaScript an attractive target; it provides portability, allows us to capitalize on the massive investments in the language and platform, and essentially eliminates the risk that we build our technology atop a disappearing or deprecated foundation.
WebAssembly is a promising target as well, and Tweag has just merged a WebAssembly backend into GHC (great work and congrats!). WebAssembly is not as ubiquitous as JavaScript yet, and has a harder time interacting with JavaScript directly. Hence, we believe that the WebAssembly and JavaScript backends provide different strengths, and it is to the Haskell community's benefit to have and support both code generation paths in GHC for different use cases and requirements.
Model-Based Testing with QuickCheck
quickcheck-dynamic is a library jointly developed by Quviq and IOG whose purpose is to leverage QuickCheck to test stateful programs against a model. In other words, it's a Model-Based Testing tool. This article is a gentle introduction to the use of quickcheck-dynamic for Model-Based Testing. It describes the overall approach, how the library works, and how it's being applied within IOG to improve the reach of our testing efforts.
GHCJS heap representation
Introduction
I recently gave a short presentation about heap object representation in GHCJS, and hence in the upcoming JS backend for GHC. This post is a summary of the content.
Heap objects
GHC implements Haskell code evaluation by using graph reduction. As such, Haskell programs compiled by GHC use the heap to store nodes of the graph to be reduced, as well as utility nodes participating in graph reduction. These nodes are:
- FUN: functions with their free variables as payload
- THUNK: suspensions with their free variables as payload
- PAP: partial application to a FUN. FUN closure and already applied arguments as payload.
- IND: indirection to another heap object
- BLACKHOLE: used to overwrite a THUNK when it is being evaluated
The heap is also used to store other values:
- CON: boxed values (saturated constructor applications) with field values as payload
- Other unlifted values: TSO, BCO, arrays, MutVar#, MVar#, TVar#, stacks, stack frames...
GHCJS FFI system in the JS Backend
- The Design Space
- GHCJS’s FFI
- Lightweight safety checks
- Returning multiple values
- Changes in the FFI System for the JS Backend
Users of GHCJS enjoyed a rich FFI system for foreign JavaScript imports. However, this has changed during our adaptation of GHCJS to GHC 9.x. This short post goes over the GHCJS FFI system, the motivation for these changes and what the changes are. First, we must consider the design space of an FFI system.
The Design Space
FFI code is typically employed in high performance scenarios. Additionally, users of the FFI do not want to deal with the object language the compiler is compiling to. Instead, users want a simple way to call functions from the object language and use them in their own code as normal Haskell functions. However, users of the FFI system do tend to be power users, and so as a design principle we want to expose the tools they need to achieve their performance needs, whatever those needs may be. We can summarize these constraints as follows:
- The FFI must abstract the JavaScript backend's infidelities away as much as possible. That is, users of the FFI should not need to worry about the `Int64#` representation; they should be able to simply follow the standard patterns we have written in `base`.
- The FFI must provide tools to achieve high performance code, even if those tools require up-front knowledge of the runtime system to use. However, these tools should not be in the path of least resistance to using the FFI system.
- The FFI must provide a lightweight specification that users program against, allowing the JS backend to optimize the imported function and give good error messages.
GHCJS’s FFI sets a high (qualitative) benchmark on these three constraints. Let’s inspect them each in detail, in no particular order.
GHCJS’s FFI
In GHCJS, a user could take advantage of JavaScript functions in their Haskell code using GHCJS's FFI. However, the syntax was unique to GHCJS, with placeholder variables like one might see in Perl, Nix, or Bash. For example, here is a foreign import from the `base` library for `st_size`:
-- base/System/Posix/Internal.hs
-- the JS FFI version
foreign import javascript unsafe "$r1 = h$base_st_size($1_1,$1_2); $r2 = h$ret1;"
st_size :: Ptr CStat -> IO Int64
The syntax is different from what we know and love in the normal Haskell world, but the grammar is straightforward. We declare a foreign import from `javascript`, state that the import is `unsafe` or `interruptible`, and then provide a string, `h$base_st_size(...)`, for the code generator to use when compiling. Compare this with the C version:
-- base/System/Posix/Internal.hs
-- the C FFI version
foreign import ccall unsafe "HsBase.h __hscore_st_size"
st_size :: Ptr CStat -> IO Int64
And we see that they are similar. The only difference is the strange `$n` symbols in the referent string; the C version simply declares a name.
These symbols are placeholder variables with special meaning in GHCJS. There are two reasons for the placeholder patterns. First, we require these patterns to work around the limitations of JavaScript as a backend (1). For example, consider the case where we need to return an `Int64#` from an imported foreign function. In C and Haskell this is not a problem because both can represent `Int64#` natively; however, JavaScript only has native support for 32-bit values. Thus, to be able to return an `Int64#` we need a method to return two 32-bit numbers. Similarly, in order to apply a function to an `Int64#`, that function must take at least two arguments, one for the high bits and one for the low. Second, the referent string is untyped and can contain arbitrary JavaScript code, so placeholder patterns provide a simple and lightweight way to perform safety checks and eliminate classes of untyped, hard-to-understand errors. For example, consider an arity mismatch between a function definition and a call site. When this happens, JavaScript happily continues processing with the return value from the function application defined as `NaN` (of course). Such arity conflicts can easily occur, especially when dealing with 64-bit values, which require function arity assumptions.
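As a concrete illustration of the two-register convention for 64-bit values, here is a hypothetical pair of helpers (the names are invented for this post and are not GHCJS code; `BigInt` is used only to model an `Int64#` value):

```javascript
// Hypothetical helpers (not GHCJS code) showing the two-register
// convention for 64-bit values: one signed 32-bit number for the high
// bits, one for the low bits.
function splitInt64(big) {
  var hi = Number(BigInt.asIntN(32, big >> 32n));   // high 32 bits, signed
  var lo = Number(BigInt.asUintN(32, big)) | 0;     // low 32 bits, as a signed 32-bit value
  return [hi, lo];
}

function joinInt64(hi, lo) {
  // Reassemble: the low half must be reinterpreted as unsigned first.
  return (BigInt(hi) << 32n) | BigInt(lo >>> 0);
}
```

An imported function taking one `Int64#` argument would thus have JavaScript arity 2, which is exactly the arity assumption the placeholder syntax makes explicit.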
Lightweight safety checks
Lightweight safety checks (3) are done by GHCJS by parsing the names of the placeholder variables, each of which follows a specific naming convention. This convention is:
- Argument types:
  - `$n`: used for unary arguments, i.e., arguments which require only a single register.
  - `$n_n`: used for binary arguments, i.e., arguments which require two registers.
  - `$c`: a continuation argument, only valid for `interruptible` foreign functions.
- Return types:
  - `$r`: a unary return.
  - `$r1`, `$r2`: a binary return.
  - `$r1`, `$r2`, `$r3_1`, `$r3_2`: an unboxed tuple return.
- Top-level patterns:
  - `"&value"`: simply emitted as `value` by the code generator.
  - `"someFunction"`: emitted as `ret = someFunction(...)`, i.e., map the FFI to the result of the function call.
  - `"$r = $1.f($2)"`: emitted as `r1 = a1.f(a2)`, i.e., a combination of a function call and a property access.
With this standard, GHCJS then parses the FFI referent string to ensure that it conforms. If it does not, GHCJS can at least respond to the user with an ill-formatted FFI message and say precisely where the issue is. For example, it could respond that only half of an `Int64#` is returned, based on the referent string and the function type.
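A toy version of such a check (not GHCJS's actual parser) could scan the referent string for placeholder-shaped tokens and flag any that don't follow the convention:

```javascript
// Toy sketch of the lightweight check (not the real GHCJS parser):
// collect `$`-tokens that are not part of an h$-style identifier and
// report the ones that don't match the placeholder naming convention.
function badPlaceholders(src) {
  var wellFormed = /^\$(c|r[0-9]*(_[0-9]+)?|[0-9]+(_[0-9]+)?)$/;
  // A token counts as a placeholder only when its `$` isn't preceded by
  // an identifier character (so `h$base_st_size` is left alone).
  var tokens = src.match(/(?<![A-Za-z0-9_$])\$[A-Za-z0-9_]+/g) || [];
  return tokens.filter(function (t) { return !wellFormed.test(t); });
}
```

Running it on the `st_size` import above reports no problems, while a typo like `$rr` would be flagged with its exact spelling, which is roughly the kind of precise error message GHCJS can give.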
Returning multiple values
But what of performant code? GHCJS achieves a performant FFI by not trying to abstract away from the runtime system. Instead, an advantage of GHCJS's FFI is that we can specify exactly which registers the foreign function should dump its results into, or even use arbitrary global variables. This places more burden on the user of the FFI in specific scenarios, but crucially allows the FFI system to get out of the user's way. The FFI system also exploits this capability to return multiple values from a single function call, which is a common need when compiling to JavaScript. For example, in the above code `st_size` is declared to return an `IO Int64`; the JavaScript handler `h$base_st_size` returns the `Int64` using two registers, `$r1` and `$r2`, but does so through the use of a special-purpose global variable called `h$ret1`:
function h$base_st_size(stat, stat_off) {
h$ret1 = (stat.i3[(stat_off>>2)+2]);
return (stat.i3[(stat_off>>2)+1]);
}
The function takes a pointer and an offset. Pointers in GHCJS are simply pointers to ByteArrays, so the function indexes into the ByteArray, storing the lower 32 bits in `h$ret1` and returning the higher 32 bits directly. These results are picked up by the FFI code, which sets `$r1` to the result of the function call (the higher 32 bits) and `$r2` to the value of `h$ret1` (the lower 32 bits). Crucially, the runtime system needs to do nothing: the registers are already populated, ready to be consumed by whatever the caller of the foreign function will do.
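The convention can be demonstrated with a self-contained toy (the function name here is invented; only `h$ret1` mirrors the real global):

```javascript
// Self-contained sketch of the h$ret1 convention: the first result comes
// back as the return value, the second through a special-purpose global.
var h$ret1 = 0;

function toyReturnsInt64(hi, lo) {
  h$ret1 = lo;   // low 32 bits go into the global
  return hi;     // high 32 bits are the ordinary return value
}

// What generated FFI code effectively does for
// "$r1 = toyReturnsInt64(...); $r2 = h$ret1;"
var r1 = toyReturnsInt64(7, 42);
var r2 = h$ret1;
```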
One might consider using a simpler design, which trades register juggling for a more straightforward representation, such as a ByteArray which stores the `Int64#`. However, such a design would trade speed for implementation simplicity. If we passed ByteArrays, then each foreign function would spend time wrapping and unwrapping the array to get the payload: clearly an undesirable outcome for high performance code.
Changes in the FFI System for the JS Backend
So we see that GHCJS's FFI system actually performs quite well in the design space. Power users are well supported and can leverage enough unsafety to bind global variables like `h$ret1` and specific registers such as `$r1`. The system provides some lightweight checking through parsing. The nuances of the JavaScript platform are generally abstracted over, and the FFI system is tuned for performance-critical scenarios. So why change it?
The short answer is to hit deadlines. By skipping the FFI parsing, the JS backend team was able to produce a working (it can output "Hello World!" and compile GHC's boot libraries), integrated JS backend in GHC faster than if we had finished the FFI system first.
For the time being, we have opted to replace each foreign function call with a JavaScript fat arrow, for example:
foreign import javascript unsafe "(($1_1,$1_2) => { return h$base_st_size($1_1,$1_2); })"
st_size :: Ptr CStat -> IO Int64
Of course, this situation is untenable: as argued above, FFI code is assumed to be used in performance-critical code, and thus any extra overhead, such as a function closure and the consequent indirection, must be avoided. But fear not! In the near future we'll be overhauling the FFI system and returning it to its former glory.
GHC DevX July 2022 Update
This is the July 2022 monthly update from the GHC DevX team at IOG.
JavaScript Backend for GHC
For a few months we have been merging GHCJS (Haskell to JavaScript compiler) into GHC. We set our first milestone to be the ability to compile and to run the usual "Hello World" program. This month we finally reached it!
We are now focusing on:
- fixing failing tests in GHC's testsuite (~2800 unexpected failures). To do that, we have to implement new primops, fix bugs we introduced while porting the code from GHCJS, etc.
- implementing support for the "js-sources" Cabal stanza in Hadrian. Currently the JS backend finds the JS sources required for the RTS and for base in explicitly defined locations. This was only a stop-gap measure, and we now need to implement proper support for user-provided JavaScript files.
- documenting and refactoring the source code and making it similar to other GHC modules. As an example, GHCJS used the text package, which isn't a boot package. Hence we first switched to GHC's ShortText implementation, and have now switched to a FastString-based implementation.
- adding back GHCJS's features that we haven't ported yet for various reasons (e.g. the compactor, TH, etc.).
You can follow our progress on our development branch here.
Blog posts
For the time being, we will focus blog post topics on GHCJS internals and related topics. A few of these blog posts are currently under review and should be published shortly.
The GHCJS Linker
Introduction
I recently gave a short presentation on the workings of the GHCJS linker. This post is a summary of the content.
JavaScript "executables"
The task of a linker is collecting and organizing object files and resources into a loadable library or executable program. JavaScript can be run in various environments, for example the browser or node.js, and the concept of an executable doesn't make sense in all of them.
Therefore, when we link a Haskell program, we generate a `jsexe` directory filled with various files that allow us to run the JavaScript result:
| File | Description |
|---|---|
| `out.js` | compiled/linked Haskell code |
| `out.frefs.*` | list of foreign calls from `out.js` |
| `out.stats` | source code size origin statistics for `out.js` |
| `lib.js` | non-Haskell code, from `js-sources` in packages and the RTS, possibly preprocessed |
| `rts.js` | generated part of the RTS (apply functions and similarly repetitive things) |
| `runmain.js` | a single line that just starts `main` |
| `all.js` | complete runnable program, created by combining `out.js`, `lib.js`, `rts.js`, and `runmain.js` |
Most of the work done by the linker is producing `out.js`, and that's what we'll be focusing on in the next sections.
Building out.js
The linker builds `out.js` by collecting all code reachable from `main` (and a few other symbols required by the RTS) and generating the required initialization code for all top-level data. The code is found in object files. These object files have the following structure:
| Section | Description |
|---|---|
| Header | version number and offsets of the other sections |
| String table | shared string table, referred to by Dependencies and Code, to avoid duplication in file and memory |
| Dependencies | dependency data, internally between binding groups and externally to symbols in other object files |
| Code | compiled Haskell code stored as a serialized JavaScript AST plus metadata, organized in binding groups |
The object files contain binding groups of mutually dependent bindings. These are the smallest units of code that can be linked. Each binding group has some associated metadata required for initialization of the heap objects in the group. The metadata contains, for example, constructor tags (e.g. 1 for `Nothing`, 2 for `Just`), the arity of functions, and static reference tables.
From a high level, the procedure that the linker follows is this:
1. Read object files from dependencies into memory.
2. Decode the dependency part of all object files (this includes reading the string tables).
3. Using the dependency data, find all code reachable from `main`.
4. Decode the reachable binding groups.
5. Render the AST to JavaScript.
6. Construct initializers from the metadata.
We avoid decoding (deserializing) the binding groups that don't end up in the linked result, to keep memory consumption lower. Still, the linker requires a lot of memory for larger programs, so we may need to make more improvements in the future.
The Compactor
The compactor is an optional link-time transformation step that reduces code size. It consists of a lightweight (i.e. no expensive operations like dataflow analysis) rewrite of the code contained in the object files. The compactor is disabled when linking with the `-debug` flag. There are a few steps involved.
Renaming private symbols
Haskell names are quite long by default: they need to be globally unique, hence they contain their defining unit-id and module name. For example: `mtl-2.2.2-somehash-Control.Monad.State.Lazy.execState_go1` (special characters would be z-encoded, but that isn't shown here).
Private symbols are only referred to from within the same module. It doesn't matter which JavaScript name we pick for them, as long as there is no overlap between the names from different modules. The compactor renames all the private symbols using a global sequence to ensure short names that do not overlap.
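The core idea can be sketched in a few lines (this is an illustration, not the compactor's actual implementation):

```javascript
// Sketch of private-symbol renaming: a single global counter guarantees
// that the short names handed out never overlap across modules.
function mkRenamer() {
  var counter = 0;
  var table = new Map();
  return function rename(longName) {
    if (!table.has(longName)) {
      table.set(longName, "h$$" + (counter++).toString(36)); // short base-36 names
    }
    return table.get(longName);
  };
}
```

A name like `mtl-2.2.2-somehash-Control.Monad.State.Lazy.execState_go1` would come out as something like `h$$0`, and every later occurrence of the same symbol maps to the same short name.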
Block Initializer
Without the compactor, the linker generates an `h$initObj` initialization call (or `h$o` call) for each global Haskell heap value. The code for this can get quite big. The compactor collects all heap objects to be initialized in a single large array and encodes the metadata in a string. This makes the initialization code much more compact.
Deduplication
An optional step in the compactor is deduplication of code. When deduplication is enabled with the `-dedupe` flag, the compactor looks for functionally equivalent pieces of JavaScript in the output and merges them. This can result in a significant reduction of code size.
Incremental Linking
The linker supports building programs that are loaded incrementally. This is used for example for Template Haskell. The process that runs the Template Haskell stays alive during compilation of a whole module. When the first Template Haskell expression is compiled, it is linked against all its dependencies (including the RTS) and the resulting JavaScript code is sent over to be run in the evaluator process.
As subsequent Template Haskell expressions are evaluated in the same process, there is no need to load already loaded dependencies (including the RTS) again and it is much more efficient to avoid doing so. Therefore the linker keeps track of which dependencies have already been linked and each subsequent TH expression is only linked against dependencies that are not already loaded in the evaluator process.
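The bookkeeping amounts to a set of already-loaded units. A toy model of the idea (not the linker's real data structures):

```javascript
// Toy model of incremental linking: only units not already loaded in the
// evaluator process are linked and sent over again.
function mkIncrementalLinker() {
  var loaded = new Set();
  return function link(units) {
    var fresh = units.filter(function (u) { return !loaded.has(u); });
    fresh.forEach(function (u) { loaded.add(u); });
    return fresh;   // what actually needs to be shipped to the evaluator
  };
}
```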
It's also possible for users to use this functionality directly, with the `-generate-base` flag, to create a "linker state" file along with the regular `jsexe` files. Another program can then be linked with `-use-base=state_file`, resulting in a program which leaves out everything already present in the first program.
Future Improvements
Memory consumption is the biggest problem in the linker at the moment. Possible ways to reduce it are compression, more efficient representation of the data structures, or more incremental loading of the parts of the object files that we need.
In terms of functionality, we don't take advantage of JavaScript modules yet. It would be good if we could improve the linker to support linking a library as a JavaScript module. We should also consider making use of `foreign export javascript` for this purpose.
Primitive Type Representation in GHC's upcoming JS-backend
One of the key challenges in any novel backend is representing GHC primitive types in the new backend. For JavaScript, this is especially tricky, as JavaScript only has 8 primitive types, and some of those types, such as `number`, do not directly map to any Haskell primitive type, such as `Int8#`. This post walks through the most important GHC primitives and describes our implementation for each in the JavaScript backend. It is intended to be explanation-oriented: light on details, but with just enough to understand how the system works.
GHC Primitives
There are 36 `primtype`s that GHC defines in `primops.txt.pp`:
- `Char#`
- `Int8#`, `Int16#`, `Int32#`, `Int64#`, `Int#`
- `Word8#`, `Word16#`, `Word32#`, `Word64#`, `Word#`
- `Double#`, `Float#`
- `Array#`, `MutableArray#`, `SmallArray#`, `SmallMutableArray#`
- `ByteArray#`, `MutableByteArray#`
- `Addr#`
- `MutVar#`, `TVar#`, `MVar#`, `IOPort#`, `State#`, `RealWorld`, `ThreadId#`
- `Weak#`, `StablePtr#`, `StableName#`, `Compact#`, `BCO`, `Fun`, `Proxy#`
- `StackSnapshot#`
- `VECTOR`
Some of these are unsupported in the JS backend, such as `VECTOR`, or lower priority, such as `StackSnapshot#`. We'll begin with the easy cases.
The Easy Cases
The easy cases are the cases that are implemented as JavaScript objects. In general, this is the big hammer used when nothing else will do. We’ll expand on the use of objects—especially representing heap objects—in a future post, but for the majority of cases we mimic the STG-machine behavior for GHC heap objects using JavaScript heap objects. For example,
var someConstructor =
  { f  = ... // entry function of the datacon worker
  , m  = 0   // garbage collector mark
  , d1 = ... // first data field of the constructor
  , d2 = ... // if arity == 2: the second field
             // if arity > 2: an object { d1, d2, ... } with the remaining args (its d1 is the second arg!)
  }
This is the general recipe: we define a JavaScript object whose properties correspond to what the heap object needs; in this case that is the entry function `f` for a constructor, some metadata for garbage collection `m`, and pointers to the fields of the constructor or whatever else the heap object might need. Using JavaScript objects allows straightforward translations of several GHC types. For example, `TVar`s and `MVar`s:
// stg.js.pp
/** @constructor */
function h$TVar(v) {
TRACE_STM("creating TVar, value: " + h$collectProps(v));
this.val = v; // current value
this.blocked = new h$Set(); // threads that get woken up if this TVar is updated
this.invariants = null; // invariants that use this TVar (h$Set)
this.m = 0; // gc mark
this._key = ++h$TVarN; // for storing in h$Map/h$Set
#ifdef GHCJS_DEBUG_ALLOC
h$debugAlloc_notifyAlloc(this);
#endif
}
// stm.js.pp
function h$MVar() {
TRACE_SCHEDULER("h$MVar constructor");
this.val = null;
this.readers = new h$Queue();
this.writers = new h$Queue();
this.waiters = null; // waiting for a value in the MVar with ReadMVar
this.m = 0; // gc mark
this.id = ++h$mvarId;
#ifdef GHCJS_DEBUG_ALLOC
h$debugAlloc_notifyAlloc(this);
#endif
}
Notice that both implementations define properties specific to the semantics of the Haskell type. JavaScript functions which create these objects follow the naming convention `h$<something>` and reside in shim files. Shim files are JavaScript files, written in pure JavaScript, that the JS backend links against. This allows us to save some compile time by not generating code which doesn't change, and to decompose the backend into JavaScript modules.
This strategy is also how functions are implemented in the JS backend. Function objects are generated by `StgToJS.Expr.genExpr` and `StgToJS.Apply.genApp` and follow this recipe:
var myFUN =
{ f = <function itself>
, m = <garbage collector mark>
, d1 = free variable 1
, d2 = free variable 2
}
To summarize: for most cases we write custom JavaScript objects which hold whatever machinery is needed as properties to satisfy the expected semantics of the Haskell type. This is the strategy that implements `TVar`, `MVar`, `MutVar`, and `Fun`.
ByteArray#, MutableByteArray#, SmallArray#, MutableSmallArray#
`ByteArray#` and friends map to JavaScript's `ArrayBuffer` object. The `ArrayBuffer` object provides a fixed-length, raw binary data buffer. To index into the `ArrayBuffer` we need to know the type of data the buffer is expected to hold, so we make an engineering tradeoff: we allocate typed views of the buffer payload once, at buffer allocation time. This prevents allocating views later, when we might be handling the buffer in a hot loop, at the cost of slower initialization. For example, consider the `mem.js.pp` shim, which defines `ByteArray#`:
// mem.js.pp
function h$newByteArray(len) {
var len0 = Math.max(h$roundUpToMultipleOf(len, 8), 8);
var buf = new ArrayBuffer(len0);
return { buf: buf
, len: len
, i3: new Int32Array(buf)
, u8: new Uint8Array(buf)
, u1: new Uint16Array(buf)
, f3: new Float32Array(buf)
, f6: new Float64Array(buf)
, dv: new DataView(buf)
, m: 0
}
}
`buf` is the payload of the `ByteArray#`, and `len` is its length. `i3` through `dv` are views of the payload; each view is an object which interprets the raw data in `buf` differently according to its type. For example, `i3` interprets `buf` as holding `Int32` values, while `dv` interprets `buf` as a `DataView`, and so on. The final property, `m`, is the garbage collector marker.
Addr# and StablePtr#
`Addr#` and `StablePtr#` are implemented as a pair of a `ByteArray#` and an `Int#` offset into the array. We'll focus on `Addr#` because `StablePtr#` has the same implementation, with the exception that a `StablePtr#` is tracked in the global variable `h$stablePtrBuf`. `Addr#`s do not have an explicit constructor; rather, they are implicitly constructed. For example, consider `h$rts_mkPtr`, which creates a `Ptr` that contains an `Addr#`:
function h$rts_mkPtr(x) {
var buf, off = 0;
if(typeof x == 'string') {
buf = h$encodeUtf8(x);
off = 0;
} else if(typeof x == 'object' &&
typeof x.len == 'number' &&
x.buf instanceof ArrayBuffer) {
buf = x;
off = 0;
} else if(x.isView) {
buf = h$wrapBuffer(x.buffer, true, 0, x.buffer.byteLength);
off = x.byteOffset;
} else {
buf = h$wrapBuffer(x, true, 0, x.byteLength);
off = 0;
}
return (h$c2(h$baseZCGHCziPtrziPtr_con_e, (buf), (off)));
}
The function does some type inspection to check for the special case on `string`. If we do not have a string, then a `Ptr`, which contains an `Addr#`, is returned. The `Addr#` is implicitly constructed by allocating a new `ArrayBuffer` and an offset into that buffer. The `object` case is an idempotence check: if the input is already such a `Ptr`, then just return the input. The cases which do the work are those that call `h$wrapBuffer`:
// mem.js.pp
function h$wrapBuffer(buf, unalignedOk, offset, length) {
if(!unalignedOk && offset && offset % 8 !== 0) {
throw ("h$wrapBuffer: offset not aligned:" + offset);
}
if(!buf || !(buf instanceof ArrayBuffer))
throw "h$wrapBuffer: not an ArrayBuffer"
if(!offset) { offset = 0; }
if(!length || length < 0) { length = buf.byteLength - offset; }
return { buf: buf
, len: length
, i3: (offset%4) ? null : new Int32Array(buf, offset, length >> 2)
, u8: new Uint8Array(buf, offset, length)
, u1: (offset%2) ? null : new Uint16Array(buf, offset, length >> 1)
, f3: (offset%4) ? null : new Float32Array(buf, offset, length >> 2)
, f6: (offset%8) ? null : new Float64Array(buf, offset, length >> 3)
, dv: new DataView(buf, offset, length)
};
}
`h$wrapBuffer` is a utility function that does some offset checks and performs the allocation of the typed views, as described above.
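The pair representation is worth internalizing: pointer arithmetic only ever touches the offset, never the buffer. A toy model (the names are hypothetical, not RTS code):

```javascript
// Toy model of Addr#: a (wrapped ByteArray#, Int# offset) pair.
// Advancing the pointer changes the offset; the buffer is shared.
function mkAddr(byteArray, offset) { return { arr: byteArray, off: offset }; }
function plusAddr(addr, n)        { return { arr: addr.arr, off: addr.off + n }; }

var arr = { buf: new ArrayBuffer(16), len: 16 };
var p = mkAddr(arr, 0);
var q = plusAddr(p, 8);   // q points into the same buffer, 8 bytes further
```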
Numbers: The Involved Case
Translating numbers has three issues. First, JavaScript has no concept of fixed-precision 64-bit types such as `Int64#` and `Word64#`. Second, JavaScript bitwise operators only support signed 32-bit values (except the unsigned right shift operator, of course). Third, numbers are atomic types and do not require any special properties for correct semantics, so using wrapping objects gains us nothing at the cost of indirection.
Working with 64-bit Types
To express 64-bit numerics, we simply use two 32-bit numbers, one for the high bits and one for the low bits. For example, consider comparing two `Int64#` values:
// arith.js.pp
function h$hs_ltInt64(h1,l1,h2,l2) {
if(h1 === h2) {
var l1s = l1 >>> 1;
var l2s = l2 >>> 1;
return (l1s < l2s || (l1s === l2s && ((l1&1) < (l2&1)))) ? 1 : 0;
} else {
return (h1 < h2) ? 1 : 0;
}
}
The less-than comparison function expects four inputs, two for each `Int64#` in Haskell. The first number is represented by `h1` and `l1` (high and low bits), and similarly the second number by `h2` and `l2`. The comparison is straightforward: we check the high bits for equality; if they are equal, then we compare the lower bits while being careful with signedness. No surprises here.
For the bitwise operators, we store both `Word32#` and `Word#` as 32-bit signed values, mapping any value greater than or equal to `2^31` to the corresponding negative value. This way we stay within the signed 32-bit range, even though in Haskell these types only support nonnegative values.
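Concretely, this mapping comes for free from JavaScript's two coercion operators: `| 0` forces the signed 32-bit representation and `>>> 0` recovers the nonnegative value (the helper names here are illustrative only):

```javascript
// Sketch of the Word# storage convention: values >= 2^31 are stored as
// their negative two's-complement counterparts and recovered on read.
function toWordRep(x)   { return x | 0; }     // store: signed 32-bit view
function fromWordRep(x) { return x >>> 0; }   // read: unsigned interpretation
```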
Unwrapped Number Optimization
The JS backend uses JavaScript values to represent both Haskell heap objects and
unboxed values (note that this isn't the only possible implementation, see
1). As such, it doesn't require that all heap objects have the same
representation (e.g. a JavaScript object with a "tag" field indicating its type)
because we can rely on JS introspection for the same purpose (especially
typeof
). Hence this optimization consists of using a more efficient JavaScript type to represent heap objects when possible, falling back on the generic representation otherwise.
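The kind of introspection this relies on can be sketched in a few lines (illustrative only, not the actual RTS code):

```javascript
// Sketch: a value is either an unwrapped number (the optimized
// representation) or a generic heap object such as a thunk.
function isUnwrappedNumber(v) {
  return typeof v === 'number';
}

var anInt8 = 42;                          // unwrapped representation
var aThunk = { f: () => 42, d1: null };   // generic object representation

console.log(isUnwrappedNumber(anInt8));   // true
console.log(isUnwrappedNumber(aThunk));   // false
```

Because typeof reliably distinguishes primitive numbers from objects, no extra tag field is needed on the unwrapped values.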
This optimization particularly applies to Boxed
numeric values (Int
, Word
,
Int8
, etc.) which can be directly represented with a JavaScript number,
similarly to how unboxed Int#
, Word#
, Int8#
, etc. values are represented.
Pros:
- Fewer allocations and indirections: instead of one JavaScript object with a field containing a number value, we directly have the number value.
Cons:
- More complex code to deal with heap objects that can have different representations
The optimization is applicable when:
- We have a single data type with a single data constructor.
- The constructor holds a single field that can only be a particular type.
If these invariants hold, then we remove the wrapping object and instead refer
to the value held by the constructor directly. Int8
is the simplest case for
this optimization. In Haskell we have:
data Int8 = Int8 Int8#
Notice that this definition satisfies the requirements. A direct translation in the JS backend would be:
// An Int8 thunk represented as an object with an entry function, f,
// and payload, d1.
var anInt8 = { d1: <Int8# payload>
             , f: <entry function that would scrutinize the payload>
             };
We can operationally distinguish between a Thunk and an Int8 because they are given separate types in the StgToJS pass of GHC and have distinct runtime representations (object vs number). In contrast, in Haskell an Int8 may actually be a Thunk until it is scrutinized, at which point it becomes the Int8 payload (i.e., call-by-need). This means we always know when we have an Int8 rather than a Thunk, so we can omit the wrapper object and convert this code to just:
// no object, just payload
var anInt8 = <Int8# payload>
For the interested reader, this optimization takes place in the JavaScript code
generator module GHC.StgToJS.Arg
, specifically the functions allocConStatic
,
isUnboxableCon
, and primRepVt
.
But what about the other stuff!
- Char#: represented by a number, i.e., the code point.
- Float#/Double#: both represented as a JavaScript double. This means that Float# has excess precision and thus we do not generate exactly the same answers as other platforms which are IEEE 754 compliant. Full emulation of single-precision floats does not seem to be worth the effort as of writing. Our implementation does, however, store each Float# in 4 bytes when it is kept in a ByteArray#, which means the precision is reduced to a 32-bit float in that case.
- An alternative approach would be to use some JS ArrayBuffers as memory blocks into which Haskell values and heap objects would be allocated. As an example this is the approach used by the Asterius compiler. The RTS would then need to be much more similar to the C RTS and the optimization presented in this section wouldn't apply because we couldn't rely on introspection of JS values.↩
Lightweight Haskell Threads on JavaScript
Introduction
I recently gave a short presentation on the topic of threads in GHCJS to the GHC team at IOG. This blog post is a summary of the content.
JavaScript and Threads
JavaScript is fundamentally single threaded. There are ways to share specific data between tasks but it's not possible to run multiple threads that have access to a shared memory space of JavaScript data.
The single JavaScript thread is often responsible for multiple tasks. For example a node.js server handles multiple simultaneous connections and a web application may be dealing with user input while downloading new data in the background.
This means that any single task should take care never to block execution of the other tasks. JavaScript's canonical answer is to use asynchronous programming. A function reading a file returns immediately without waiting for the file data to be loaded in memory. When the data is ready, a user-supplied callback is called to continue processing the data.
Haskell Threads
Concurrent Haskell supports lightweight threads through forkIO
. These threads are scheduled on top of one or more operating system threads. A blocking foreign call blocks an OS thread but other lightweight threads can still run on other OS threads if available.
There is no built-in support for foreign calls with a callback in the style of JavaScript. Functions imported with foreign import ccall interruptible
can be interrupted by sending an asynchronous exception to the corresponding lightweight thread.
Lightweight Threads in JavaScript
GHCJS implements lightweight threads on top of the single JavaScript thread. The scheduler switches between threads and handles synchronization through MVar
and STM
as expected from other Haskell platforms.
Foreign calls that don't block can be handled in the usual way. We extend the foreign function interface with a new type foreign import javascript interruptible
that conveniently supports the callback mechanism used by JavaScript frameworks. The foreign call is supplied with an additional argument $c
representing a callback to be called with the result when ready. From the Haskell side the corresponding lightweight thread is blocked until $c
is called. This type of foreign call can be interrupted with an asynchronous exception to the lightweight Haskell thread.
By default, Haskell threads in the JS environment run asynchronously. A call to h$run returns immediately and starts the thread in the background. This works for tasks that do not require an immediate result. For situations that require more immediate action, such as dealing with event handler propagation, there is h$runSync. This starts a synchronous thread that is not interleaved with other tasks. If possible, the thread runs to completion before the call to h$runSync returns. If the thread blocks for any reason, such as waiting for an MVar or a foreign import javascript interruptible call, synchronous execution cannot complete. The blocking task is then either interrupted with an exception or the thread is "demoted" to a regular asynchronous thread.
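On the JavaScript side, a function used with foreign import javascript interruptible simply receives an extra continuation argument playing the role of $c. A hand-written sketch (the function and its data are ours, purely for illustration):

```javascript
// Sketch: a callback-style "interruptible" foreign function.
// The last argument plays the role of $c: it is invoked with the
// result once available, which unblocks the waiting Haskell thread.
function readConfig(key, c) {
  const fakeStore = { retries: 3 };
  // A real implementation would typically be asynchronous (e.g. a
  // timer or network callback); we invoke the continuation directly
  // here for brevity.
  c(fakeStore[key]);
}

readConfig('retries', (result) => console.log(result)); // 3
```

From Haskell's point of view the call looks synchronous: the lightweight thread is suspended until the continuation fires, while the single JavaScript thread remains free to run other work.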
Black Holes
When a Haskell value is evaluated, its heap object is overwritten by a black hole. This black hole marks the value as being evaluated and prevents other threads from doing the same. "black holing" can be done either immediately or "lazily", when the garbage collector is run. GHCJS implements immediate blackholing.
Black holes give rise to an interesting problem in the presence of synchronous and asynchronous threads. Typically, if we use h$runSync, we want some guarantee that at least part of the task will run successfully without blocking. For the most part it's fairly clear which parts of our task depend on potentially blocking IO or thread synchronization. But black holes throw a spanner in the works: suddenly any "pure" data structure can be a source of blocking if it is under evaluation by another thread.
To regain some predictability and usability of synchronous threads, the h$runSync scheduler can run other Haskell threads in order to "clear" a black hole. The process ends when all black holes have been cleared, or when any of the black holes is impossible to clear because of a blocking situation.
This all happens transparently to the caller of h$runSync: if the black holes could be cleared, it appears as if they were never there.
Conclusion
We have lightweight Haskell threads in the single-threaded JavaScript environment and have extended the foreign function interface to easily support foreign calls that depend on an asynchronous callback. This way, only the Haskell lightweight thread blocks.
By default, Haskell threads are asynchronous and run in the background: The scheduler interleaves the tasks and synchronization between threads. For situations that require immediate results or actions there are synchronous threads. Synchronous threads cannot block and are not interleaved with other tasks except when a black hole is encountered.
GHC DevX June 2022 Update
This is the June 2022 monthly update from the GHC DevX team at IOG.
JavaScript Backend for GHC
For a few months we have been merging GHCJS (Haskell to JavaScript compiler) into GHC. We set our first milestone to be the ability to compile and to run the usual "Hello World" program. It turned out to be much more involved than we initially thought (requiring FFI support, etc.), but we should be getting there soon.
This month we have made the following progress:
- Linking: GHCJS requires some functions to be directly implemented in JavaScript (e.g. the RTS, some low-level functions in base). We have added support for linking .js files. We've also added support for a preprocessing pass with CPP for .js.pp files.
- js-sources: there is some ongoing work to load these external JavaScript files from installed libraries. Cabal provides a js-sources stanza for this; we need to adapt Hadrian to make use of it.
- Binary vs Objectable: GHCJS used its own ByteString-based Objectable type-class; we replaced it with GHC's similar Binary type-class. Josh has published a blog post about their differences.
- 64-bit primops: we've added support for 64-bit primops (Word64# and Int64# types). In GHCJS (GHC 8.10), these were still implemented as foreign function calls. This is no longer the case on GHC head.
- base library: added CPP as required to support the JS backend. Ported and converted FFI imports from GHCJS to use JavaScript fat arrows (we haven't implemented GHCJS's fancy import syntax yet).
Now we can compile and link the "HelloWorld" program. To reach the first milestone we only have to fix the remaining runtime errors.
You can follow our progress on our development branch here. We now rebase this branch every Friday to avoid lagging too much behind GHC head.
Haskell Optimization Handbook
The "Haskell Optimization Handbook" is an accepted proposal of the Haskell Foundation. Jeff has been steadily writing some initial material as per the project plan.
GHC DevX May 2022 Update
This is the May 2022 monthly update from the GHC DevX team at IOG.
JavaScript Backend for GHC
For a few months we have been merging GHCJS (Haskell to JavaScript compiler) into GHC. We set our first milestone to be the ability to compile and to run the usual "Hello World" program. It turned out to be much more involved than we initially thought (requiring FFI support, etc.), but we should be getting there soon.
This month we have made the following progress:
- RTS: we have modified Hadrian and rts.cabal in order to build a valid native rts unit that GHC can use, in particular containing appropriate header files.
- Linker: the JS linker has been hooked up with GHC's driver. We fixed several panics in the linker due to erroneous symbol generation code. These bugs were introduced while porting the code from the old 8.10 pretty-printing infrastructure to the newer one.
- Boot libraries: the JS backend can now build and link all the boot libraries. Note that we are not claiming that they are all usable yet. In particular, complete FFI support is lacking, but the JS backend Hadrian build completes and so we can start using the produced JS cross-compiler.
- Levity polymorphism: building ghc-prim uncovered a lurking bug related to levity polymorphism. It wasn't noticed in GHCJS 8.10 because it is also related to the BoxedRep proposal that introduced a constructor application in a commonly used RuntimeRep.
- Sized literals: support for new sized literals has been added to the code generator.
Now that we have achieved a build process that actually produces a JS cross-compiler, we are confronting and fixing issues in the produced JavaScript code, such as adding, managing, and debugging CPP conditional compilation blocks in JS shim files. You can follow our progress on our development branch here.
External Static Plugins
GHC doesn't support plugins in cross-compilers #14335. Some time ago, we came up with a solution called "external static plugins" !7377. These are plugins that are loaded directly from shared libraries, bypassing the issue with usual plugins.
Our colleague Shea Levy confirmed that the approach works, backported it to GHC 8.10, and has been working on making it work in stage1 cross-compilers for Windows. Kudos for this work, Shea.
As the current user interface based on environment variables isn't convenient, we have been working on adding new command-line flags to GHC instead. We expect to propose this for integration into GHC once the new interface is fully implemented.
Blog posts
Inspired by our friends and colleagues at Well-Typed and Tweag, we have started writing blog posts for IOG's engineering blog. They will mostly be about things we are working on or are interested in. Feel free to send us feedback about these posts and to suggest topics you would like to read about.
- https://engineering.iog.io/2022-04-28-on-the-inlining-of-integer-and-natural-operations
- https://engineering.iog.io/2022-05-02-setup-ext-stg-interp
- https://engineering.iog.io/2022-05-17-javascript-template-haskell-external-interpreter
Haskell Optimization Handbook
The "Haskell Optimization Handbook" is an accepted proposal of the Haskell Foundation. Jeff has been working behind the scene to make this proposal concrete. More about this in the upcoming months.
Objectable vs GHC Binary
As part of the integration of GHCJS into GHC as a cross-compilation backend, we've converted the binary serialisation that GHCJS previously used, which was via its Objectable
typeclass, into GHC's internal Binary
typeclass representation. In doing this, we gain access to instances for serialising many of GHC's internal data types, and, importantly, we can reuse GHC's mechanism for serialising its Name
and FastString
types, which are written to lookup tables in order to maintain identity, as well as allowing for space savings on disk.
In this post, we will explain how the GHC Binary
and GHCJS Objectable
approaches work, and compare their tradeoffs.
How GHC Binary Works
Internally, GHC uses the Name
data type to track the uniqueness of objects during compilation. Amongst information relating to the definition of a Name
within the Haskell source, a Name
also contains a Unique
integer (the value of which is provided by the compilation environment monad). Using this Unique
integer, which is unpacked in Name
's definition, we can make O(1) equality comparisons without following further memory references - allowing for this operation to be very quick, which will have a large effect on compilation performance given how often it is used.
FastString
is used within GHC to store short, string-like data, and, similarly to Name
, FastString
uses a unique integer to allow for very fast equality comparisons. Primarily, FastString
is used to represent variables and other definitions, and is used both in Name
as the string-representation of a name with extra information attached, as well as directly, representing names that don't require this extra information, such as local variables.
In GHC's .hi
interface files, Name
and FastString
are serialised differently compared to other data structures. They are written in the main data structure payload as indices into tables, and these tables contain the actual string-like data of these types. So, an interface file might resemble:
- Header
  - Magic number for recognising interface files
  - Pointer to Name symbol table
  - Pointer to FastString dictionary
- Main data structure payload
- Name symbol table
- FastString dictionary
Importantly, the FastString
dictionary must be written after the Name
symbol table, because Name
s contain FastString
s, so writing the symbol table will expand the dictionary. Additionally, because we only have one buffer, and we don't know the size of the payload until it's written, the tables cannot be written in the header, and instead serialisation code must reserve space for the table pointers and jump back to write the pointers once the table locations are known.
During serialisation, GHC uses mutable data structures to store both the serialised binary buffer, as well as these tables:
data BinHandle
= BinMem { -- binary data stored in an unboxed array
bh_usr :: UserData, -- sigh, need parameterized modules :-)
_off_r :: !FastMutInt, -- the current offset
_sz_r :: !FastMutInt, -- size of the array (cached)
_arr_r :: !(IORef BinArray) -- the array (bounds: (0,size-1))
}
data UserData =
UserData {
-- for *deserialising* only:
ud_get_name :: BinHandle -> IO Name,
ud_get_fs :: BinHandle -> IO FastString,
-- for *serialising* only:
ud_put_nonbinding_name :: BinHandle -> Name -> IO (),
-- ^ serialize a non-binding 'Name' (e.g. a reference to another
-- binding).
ud_put_binding_name :: BinHandle -> Name -> IO (),
-- ^ serialize a binding 'Name' (e.g. the name of an IfaceDecl)
ud_put_fs :: BinHandle -> FastString -> IO ()
}
Here, we see that various functions are stored in the handle structure, to be later referenced by their respective types in their GHC.Utils.Binary.Binary
typeclass instances. Notice that the instance of Binary Name
references ud_put_nonbinding_name
and ud_get_name
. Similarly, the Binary FastString
instance uses ud_put_fs
and ud_get_fs
.
class Binary a where
put_ :: BinHandle -> a -> IO ()
put :: BinHandle -> a -> IO (Bin a)
get :: BinHandle -> IO a
instance Binary FastString where
put_ bh f =
case getUserData bh of
UserData { ud_put_fs = put_fs } -> put_fs bh f
get bh =
case getUserData bh of
UserData { ud_get_fs = get_fs } -> get_fs bh
instance Binary Name where
put_ bh name =
case getUserData bh of
UserData{ ud_put_nonbinding_name = put_name } -> put_name bh name
get bh =
case getUserData bh of
UserData { ud_get_name = get_name } -> get_name bh
In GHC.Iface.Binary
, helper types and functions are defined to store the Name
symbol table and FastString
dictionary in a mutable data structure. Here, putFastString is intended to be partially applied - passing it an appropriately initialised BinDictionary so that the resulting function can be stored in the ud_put_fs field of the UserData. allocateFastString does the low-level work here, incrementing the index and modifying the mutable map (stored as a UniqFM, which is a map keyed on types that contain Uniques - recall that these are used for fast equality comparisons):
data BinDictionary = BinDictionary {
bin_dict_next :: !FastMutInt, -- The next index to use
bin_dict_map :: !(IORef (UniqFM FastString (Int,FastString)))
-- indexed by FastString
}
putFastString :: BinDictionary -> BinHandle -> FastString -> IO ()
putFastString dict bh fs = allocateFastString dict fs >>= put_ bh
allocateFastString :: BinDictionary -> FastString -> IO Word32
allocateFastString BinDictionary { bin_dict_next = j_r,
bin_dict_map = out_r} f = do
out <- readIORef out_r
let !uniq = getUnique f
case lookupUFM_Directly out uniq of
Just (j, _) -> return (fromIntegral j :: Word32)
Nothing -> do
j <- readFastMutInt j_r
writeFastMutInt j_r (j + 1)
writeIORef out_r $! addToUFM_Directly out uniq (j, f)
return (fromIntegral j :: Word32)
Later, in GHC.Iface.Binary
, getWithUserData
and putWithUserData
will structure the header, and initialise the UserData
functions to write to/read from mutable tables. Notice that we must first reserve header space for pointers to the lookup tables, as well as initialise the mutable tables, write these initialised structures to the UserData
(for example, we see the previous putFastString
partially applied here), then write the main payload, then write the lookup tables (Name
symbol table first, because writing this can add to the FastString
dictionary), and finally jump back to fill in the pointers to these tables:
putWithUserData :: Binary a => TraceBinIFace -> BinHandle -> a -> IO ()
putWithUserData traceBinIface bh payload = do
-- Remember where the dictionary pointer will go
dict_p_p <- tellBin bh
-- Placeholder for ptr to dictionary
put_ bh dict_p_p
-- Remember where the symbol table pointer will go
symtab_p_p <- tellBin bh
put_ bh symtab_p_p
-- Make some initial state
symtab_next <- newFastMutInt 0
symtab_map <- newIORef emptyUFM
let bin_symtab = BinSymbolTable {
bin_symtab_next = symtab_next,
bin_symtab_map = symtab_map }
dict_next_ref <- newFastMutInt 0
dict_map_ref <- newIORef emptyUFM
let bin_dict = BinDictionary {
bin_dict_next = dict_next_ref,
bin_dict_map = dict_map_ref }
-- Put the main thing,
bh <- return $ setUserData bh $ newWriteState (putName bin_dict bin_symtab)
(putName bin_dict bin_symtab)
(putFastString bin_dict)
put_ bh payload
-- Write the symtab pointer at the front of the file
symtab_p <- tellBin bh -- This is where the symtab will start
putAt bh symtab_p_p symtab_p -- Fill in the placeholder
seekBin bh symtab_p -- Seek back to the end of the file
-- Write the symbol table itself
symtab_next <- readFastMutInt symtab_next
symtab_map <- readIORef symtab_map
putSymbolTable bh symtab_next symtab_map
case traceBinIface of
QuietBinIFace -> return ()
TraceBinIFace printer ->
printer (text "writeBinIface:" <+> int symtab_next
<+> text "Names")
-- NB. write the dictionary after the symbol table, because
-- writing the symbol table may create more dictionary entries.
-- Write the dictionary pointer at the front of the file
dict_p <- tellBin bh -- This is where the dictionary will start
putAt bh dict_p_p dict_p -- Fill in the placeholder
seekBin bh dict_p -- Seek back to the end of the file
-- Write the dictionary itself
dict_next <- readFastMutInt dict_next_ref
dict_map <- readIORef dict_map_ref
putDictionary bh dict_next dict_map
case traceBinIface of
QuietBinIFace -> return ()
TraceBinIFace printer ->
printer (text "writeBinIface:" <+> int dict_next
<+> text "dict entries")
In summary, we see a number of structural characteristics of code using GHC's Binary implementation:
- Use of a single buffer means that the lookup tables can't be written in the header, so we have to reserve space for table pointers in the header, and jump back once we know where they will be located in order to write the pointers to the buffer. Essentially, an ordering of file sections is enforced by the data dependencies of the payload containing
Name
s andFastString
s, andName
s containingFastString
s - which means these must be written in this order, but reading must be done in the reverse order, causing the need for pointers in the header. - Jumping around in binary buffers results in weakly enforced types and fiddly, code that Haskell's type system isn't able to help us debug
- We must carry around read/write functions for the lookup table types (Name and FastString), which are undefined during the opposite serialisation stage and are hard-coded into the handle, reducing extensibility.
How Objectable Works
In comparison, GHCJS previously involved using instances of the Objectable
typeclass to serialise its interface files:
import qualified Data.Binary as DB
data SymbolTable
= SymbolTable !Int !(Map ShortText Int)
deriving (Show)
type PutSM = St.StateT SymbolTable DB.PutM -- FIXME: StateT isn't strict enough apparently
type PutS = PutSM ()
type GetS = ReaderT ObjEnv DB.Get
class Objectable a where
put :: a -> PutS
get :: GetS a
putList :: [a] -> PutS
putList = putListOf put
getList :: GetS [a]
getList = getListOf get
Here we see that GHCJS has opted for a different approach that avoids the mutable buffer by instead using Data.Binary
instances that work via concatenating lazy ByteString
s. Additionally, the mutable tables are replaced with a State
monad that holds the symbol table as a Map
structure.
Because Data.Binary
forms lazy ByteString
s, it's trivial to serialise the individual parts of the interface file and later concatenate these using ByteString
's monoid instance - allowing for all of the sections of the file to be defined declaratively at the top-level of the function in order of their appearance within the file.
object'
:: ModuleName -- ^ module
-> SymbolTable -- ^ final symbol table
-> Deps -- ^ dependencies
-> [([ShortText],ByteString)] -- ^ serialized units and their exported symbols, the first unit is module-global
-> ByteString
object' mod_name st0 deps0 os = hdr <> symbs <> deps1 <> idx <> mconcat (map snd os)
where
hdr = putHeader (Header (moduleNameTag mod_name) (bl symbs) (bl deps1) (bl idx))
bl = fromIntegral . B.length
deps1 = putDepsSection deps0
(sti, idx) = putIndex st0 os
symbs = putSymbolTable sti
In summary, the use of multiple ByteString
sections that are later concatenated offer several different structural characteristics compared to the use of a single mutable buffer:
- The final ordering of the sections is flexible, because they are serialised separately, so data dependencies don't force an ordering on the file - which we see in the where clause of object'.
- Types are more strongly enforced because imperative seekBin instructions aren't required. However, each section is still deserialised by taking a substring of the file to be read as that section type. Of course, all serialisation eventually results in raw binary, but the simplification of concatenating the sections into the final file without jumping around limits the places that bugs can hide.
- Visually, the ordering of the sections within the final file is very clear - we see in object' that every section is simply listed in order on one line, concatenated together.
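The same shape can be sketched with plain node Buffers (illustrative only, not GHCJS code): each section is built independently, the header records the section lengths, and the final file is a single concatenation.

```javascript
// Sketch: build sections independently, record their lengths in a
// header, then concatenate - mirroring the structure of object'.
function section(str) { return Buffer.from(str, 'utf8'); }

const symbs = section('SYMS');
const deps  = section('DEPENDENCIES');
const idx   = section('IDX');

// Header: three 32-bit big-endian section lengths.
const header = Buffer.alloc(12);
header.writeUInt32BE(symbs.length, 0);
header.writeUInt32BE(deps.length, 4);
header.writeUInt32BE(idx.length, 8);

const file = Buffer.concat([header, symbs, deps, idx]);
console.log(file.length); // 12 + 4 + 12 + 3 = 31
```

A reader can slice each section back out using only the lengths in the header, so no pointers need to be backpatched after the fact.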
Conclusion
Making use of GHC's existing infrastructure lets the GHCJS backend make use of the FastString and Name data types, as well as allowing for the removal of a significant amount of now-redundant code.
Additionally, interface file generation using GHC's Binary
appears to be very fast - for example, attempts to hide the handle behind a reader monad significantly reduce the compiler's performance as measured by CI. Speculatively, looking at the generated core, this could be because the optimiser has a much better time with the style of IO code that is used - rather than being a limitation of more abstacted approaches.
The comparison with GHCJS's old approach makes it clear that GHC's Binary implementation, while very useful, has potential to be improved in both readability and extensibility. However, because CI has shown that serialisation performance has a significant effect on overall compilation performance, this tradeoff must be considered when making any changes. Potentially, these readability shortfalls in GHC's implementation are just the result of legacy code, and so benchmarks of other approaches, such as Data.Binary, should be used to guide future work in improving the readability and flexibility of GHC's serialisation without sacrificing performance.
JavaScript, Template Haskell and the External Interpreter
Introduction
At IOG DevX we have been working on integrating various bits of GHCJS into GHC, with the goal of having a fully working JavaScript backend for the 9.6 release. For some parts this has mostly consisted of an update of the code to use the newer GHC API and dependencies. Other bits, like the Template Haskell runner, need more work.
This post gives an overview of the existing approaches for running Template Haskell in GHC based cross compilers and our plan for the JavaScript backend. Hopefully we can revisit this topic once all the work has been done, and see what exactly we ended up with.
The GHCJS Template Haskell Runner
When I first worked on Template Haskell (TH) support for GHCJS, there was no mechanism to combine Template Haskell with cross compilation in GHC.
Normally, Template Haskell is run by loading library code directly into the GHC process and using the bytecode interpreter for the current module. Template Haskell can directly access GHC data structures through the Q
monad. Clearly this would not be possible for GHCJS: We only have JavaScript code available for the libraries and the organization of the JavaScript data structures is very different from what GHC uses internally.
So I had to look for an alternative. Running Template Haskell consists of two parts:
- loading/executing the TH code
- handling compiler queries from the TH code, for example looking up names or types
Running the TH code can be done by first compiling the Haskell to JavaScript and then using the JavaScript eval
feature.
Template Haskell code can query the compiler using the Quasi
typeclass. I noticed that none of the methods required passing around functions or complicated data structures, so it would be possible to serialize each request and response and send it to another process.
So I went ahead and implemented this approach with a script thrunner.js
to load and start the code in a node.js server, a message type with serialization, and a new instance of the Quasi
typeclass to handle the communication with the compiler via the messages. This is still what's in use by GHCJS to this day. Every time GHCJS encounters Template Haskell, it starts a thrunner
process and the compiler communicates with it over a pipe.
After starting thrunner.js
GHCJS sends the Haskell parts of the Template Haskell runner to the script. This includes the runtime system and the implementation of the Quasi
typeclass and communication protocol. After that, the TH session starts. A typical TH session looks as follows:
| Compiler | thrunner |
|---|---|
| RunTH THExp <js code> <source location> | |
| | LookupName (Just <name-string>) |
| LookupName' (Just <name>) | |
| | Reify <name> |
| Reify' <name-info> | |
| | RunTH' <result> |
| RunTH THDec <js code> <source location> | |
| | AddTopDecls <declarations> |
| AddTopDecls' | |
| | RunTH' <result> |
| FinishTH True | |
| | FinishTH' <memory-consumption> |
Each message is followed up by a corresponding reply. For example, a LookupName'
response follows a LookupName
request and a RunTH
message will eventually generate a RunTH'
result. The first RunTH
message contains the compiled JavaScript for the Template Haskell code, along with its dependencies. Each subsequent RunTH
only includes dependencies that have not already been sent.
The thrunner
process stays alive during the compilation of at least an entire module, allowing for persistent state (putQ
/getQ
).
The GHC External Interpreter
If we build a Haskell program with (cost centre) profiling, the layout of our data structures changes to include bookkeeping of cost centre information. This means that we need a special profiling runtime system to run this code.
What can we do if we want to run our profiled build in GHCi or Template Haskell? We cannot load compiled profiling libraries into GHC directly; its runtime system expects non-profiled code. We could use a profiled version of the compiler itself, but this would make all compilation very slow. Or we could somehow separate the profiled code of our own program from the non-profiled code in the compiler.
This was Simon Marlow's motivation for adapting the GHCJS thrunner
approach, integrating it into GHC and extending it to support GHCi and bytecode. This functionality can be activated with the -fexternal-interpreter
flag and has been available since GHC version 8.0.1. When the external interpreter is activated, GHC starts a separate process, iserv
(customizable with the -pgmi
flag) which has the role analogous to the thrunner
script for GHCJS.
Over time, the iserv
code has evolved with GHC and has been extended to include more operations. By now, there are quite a few differences in features:
Feature | thrunner | iserv |
---|---|---|
Template Haskell support | yes | yes |
GHCi | no | yes |
Debugger | no | yes |
Bytecode | no | yes |
Object code | through pipe | from file |
Object code linking | compiler | iserv process |
thrunner
is not quite as complete as iserv
: It lacks GHCi and the debugger, and there is no bytecode support. But these features are not essential for basic Template Haskell.
Proxies and Bytecodes
We have now seen two systems for running Template Haskell code outside the compiler process: The original GHCJS thrunner
and the extended GHC iserv
.
Clearly it isn't ideal to have multiple "external interpreter" systems in GHC, therefore we plan to switch from thrunner
to iserv
for the upcoming JavaScript GHC backend. We don't need the debugger or GHCi support yet, but we do need to adapt to other changes in the infrastructure. So what does this mean in practice?
The biggest change is that we have to rework the linker: thrunner
does not contain any linking logic by itself: GHCJS compiles everything to JavaScript and sends compiled code to the thrunner
process, ready to be executed. In contrast, iserv
has a loader for object and archive files. When dependencies need to be loaded into the interpreter, GHC just gives it the file name.
Another change is using the updated message types. In the thrunner
session example above we could see that each message is paired with a response. For example a RunTH'
response always follows a RunTH
message, with possibly other messages in between. iserv
has an interesting approach for the Message
datatype: Instead of having pairs of data constructors for each message and its response, iserv
has a GADT Message a
, where the a
type parameter indicates the expected response payload for each data constructor.
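To illustrate the idea, here is a small sketch with invented constructors (not GHC's actual definitions): the GADT's type parameter names the payload the sender expects back, so a handler cannot reply with the wrong type.

```haskell
{-# LANGUAGE GADTs #-}

-- Simplified sketch of a GADT message type: the `a` parameter of
-- `Message a` is the type of the expected response payload.
data Message a where
  RunTH :: String -> Message String  -- run a splice, expect a result
  Ping  :: Message ()                -- expect a bare acknowledgement

-- A well-typed handler: the result type is forced by the constructor
-- that was matched, so responses can't be mixed up.
handle :: Message a -> a
handle (RunTH src) = "spliced: " ++ src
handle Ping        = ()
```

Compared with separate request/response constructor pairs, this rules out mismatched replies at the type level, but any untyped proxy sitting between the two processes has to remember which constructor it last forwarded in order to decode the reply.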
During development of the thrunner
program it turned out to be very useful to save and replay Template Haskell sessions for debugging purposes. We'd like to do this again, but now saving the message in a readable/writable format. Since we're dealing with JavaScript, JSON appears to be the obvious choice.
Our plan is to have an iserv
implementation that consists of a JavaScript part that runs in node.js and a proxy process to handle communication with GHC. The proxy process converts the messages between GHC's own (binary
based) serialization format and JSON. The proxy process is relatively simple, but it does reveal one downside of the new GADT based message types: A proxy is stateful. We must always know which message we have sent to convert the response back from JSON to binary
.
It's not yet known whether we will implement a full bytecode interpreter. We expect it to become clear during implementation whether we can get away without one early on.
Conclusion
We have seen how Template Haskell and GHCi code can be run outside the GHC process for profiling or cross compiling, with both the thrunner
approach in GHCJS and the newer iserv
in GHC.
We at IOG DevX are working on switching to the iserv
infrastructure for the upcoming GHC JavaScript backend, which involves a substantial rewrite, mainly because of differences in linking. This is a work in progress, and we intend to revisit this topic in another blog post once the final design has been implemented.
GHC April 2022 Update
Welcome to the (rather late) April 2022 monthly update from the GHC DevX team at IOG. Since the last update we've continued work on the upcoming JavaScript backend for GHC. Unfortunately, we have nothing to show quite yet, but that doesn't mean nothing has happened! On the contrary, we've made great progress and are close to that crucial first milestone: hello world
. Besides our work on the JavaScript backend, we were pleased to finally push through the Modularizing GHC paper that Sylvain has been working on for 2+ years! It caused quite a splash on the Haskell Discourse and Reddit; we recommend reading it if you haven't already (links below). Alright, enough introduction; let's get into the update.
JavaScript Backend
We have made the following progress in the implementation of a JavaScript backend for GHC (adapted from GHCJS):
linker: ported GHCJS's linker code into GHC. A lot of code was duplicated from GHC and slightly modified for GHCJS's needs, making the process far from trivial.
testsuite: fixed Hadrian to run GHC's testsuite with cross-compilers !7850. There are remaining issues though (see #21292).
build system: fixes for GHC's configure script were ported (e.g. support for the "ghcjs" target in
config.sub
). GHCJS's custom build script was integrated into configure.ac
. We can now configure the build with: ./configure --target=js-unknown-ghcjs
TH: we have conducted some experiments to find the best way to bridge GHCJS's TH runner and GHC's external interpreter. This will be described in detail in a future blog post.
FFI: basic support for JavaScript FFI has been ported from GHCJS to GHC. We haven't ported the JavaScript parser, so we have dropped the fancy import syntax (e.g. "$1.xyz"). It should be enough to build boot libraries and we will add JS parsing support later.
At this stage, we are working on building boot libraries and on supporting linking with the JS RTS.
Development happens in the following branch: https://gitlab.haskell.org/ghc/ghc/-/tree/wip/js-staging
Modularity paper
Sylvain, Jeffrey, and John Ericson (from Obsidian Systems) wrote a paper about "modularizing GHC" using domain-driven design.
- Announce blog post: https://hsyl20.fr/home/posts/2022-05-03-modularizing-ghc-paper.html
- Paper: https://hsyl20.fr/home/files/papers/2022-ghc-modularity.pdf
- Reddit: https://www.reddit.com/r/haskell/comments/uhdu4l/modularizing_ghc_paper/
- Discourse: https://discourse.haskell.org/t/modularizing-ghc-paper/4471
We've got a lot of great feedback about it (expect a first revision soon). We also got a GHC contribution directly inspired by the paper (see !8160) which was very welcome!
Setting up Csaba's External STG Interpreter
Table of Contents
- Making sense of the project
- Building a working external STG interpreter
- Building the external-stg-interpreter
- Linking the external-stg-interpreter
- The whole setup process on a demo
- Summary
Haskell is a great language camouflaged by lackluster tooling. This situation has led to well-known problems (who could forget Cabal hell?). A less discussed problem is what I will call the “Black-box syndrome”: it is hard to know exactly what the memory representation and runtime performance of my Haskell programs are¹. Now, black-box syndrome is not only a problem; it is also one of the nice features of the language, since like all good abstractions it elides things I’d rather not care about, at least most of the time. In other words, I am happy I don’t have to do manual memory manipulation!
However, when I have my optimization hat on, I run face first into black-box syndrome. The crux of the problem is a tension between the need for observation during performance engineering and optimization, and the need to ship fast code. During development we want to be able to open up a system, see exactly how it is working, make tweaks, package it back up and test again. I want to be able to answer questions like “Why is my executable this size?”, “Which code is a hot loop?”, or “When does my code do direct, known or unknown function calls?”.
In order to answer these questions we need the ability to observe every part of that system as the machine experiences it. Without this ability we have no way to make progress other than to test, change some code, compile, and test again in an ad-hoc manner. And therein lies the problem: most Haskell tooling is insufficient to provide the observability that we would like; instead, the tooling often expects and requires us to make source code changes to our program, or even to recompile all of our libraries and code for a profiling way. This leads to the idea, and the expectation in the Haskell community, that Haskell programs are hard to optimize, because the barrier to entry for optimization has been artificially increased.
Csaba Hruska has recently been making headway in this area with his work on the GRIN compiler and an external STG interpreter. His STG interpreter (and patched GHC) solves exactly these problems: he has demonstrated dumping the entire call graph of large Haskell projects, filtering it to hot loops, and finding unknown function calls in these graphs. If you haven’t seen his demo be sure to watch it; it is well worth your time.
This post is the first in a new blog series. In this blog series we’re going to kick the tires on the external STG interpreter, see what it can do, and see what we can uncover in some popular libraries by using it. In particular, I’m interested in running it on projects I’ve previously optimized—such as ghc itself, containers, unordered-containers—using the standard methods: ticky-ticky profiling, prof, flamegraphs, heap profiling, ghc-debug, cachegrind, etc. This post, however, will be focused on setting up the patched ghc and interpreter on a NixOS system. My goals are threefold:
- Give an overview of the project and project layout to lower barrier to entry for the system.
- Give step by step instructions on setting up the interpreter on a nix-based system and provide a forked github repo for nix users. This should allow nix users to just
git clone foo
and nix-build
(spoiler: it won’t be that easy, but still not hard).
- Popularize Csaba’s project! It is a refreshing take on Haskell optimization and compilation.
Making sense of the project
The external STG interpreter is part of the GRIN compiler project. We are not doing anything with the GRIN compiler (yet!) and so we are only interested in the GHC whole program compiler project. The whole-program-compiler project has several sub-projects that we’ll be building and using directly:
- external-stg: This subproject provides utilities we’ll be using, in particular
mkfullpak
- external-stg-interpreter: This is the actual STG interpreter. The good news is that this is independent of the rest of the project and can be built just like a regular Haskell executable
- ghc-wpc: This is a fork of
ghc-8.10.x
(I’m not sure exactly which version it forks to be honest) which we must build in order to use the external STG interpreter. Ghc-wpc serves as a frontend for the external-stg-interpreter.
Building a working external STG interpreter
The external STG interpreter can be built like any regular Haskell executable. But in order to use the interpreter we have to build ghc-wpc
. ghc-wpc
is necessary because it serves as a frontend for the STG interpreter. It compiles a Haskell program like normal and then dumps an enriched STG IR to file. This file is then run through a utility gen-exe
(gen-exe is an executable built in the external-stg-compiler sub-project) which picks up the compilation pipeline from the STG IR and creates an executable like we would expect from a normal compilation pipeline.
The major difference between this process and the usual compiler pipeline is that ghc-wpc
leaves enough compiler information on disk for the rest of the tooling to consume, namely, in files with a *.o_stgbin
(this is STG IR generated at compile time), and *.o_stgapp
(project linker and dependency information) extension. Thus, once we build this custom ghc version we can use it to build the source code we wish to analyze and begin our optimization work.
For the rest of this tutorial I’ll be referencing my fork of the ghc-whole-program-compiler-project
that includes everything you need if you want to follow along, including .nix
files for creating a nix-shell
which will prepare a suitable environment to run the entire toolchain.
ghc.nix
The usual way to build ghc using a nix based system is with the ghc.nix project. Ghc.nix provides a default.nix
with a suitable environment to run hadrian and build ghc. For ghc-wpc
we’ll need some special packages, and we need our boot compiler to be exactly ghc-8.3.3
. The custom ghc.nix
file is included in my fork; I’ve taken the liberty of pinning nixpkgs to the right version for ghc-8.3.3
. So let’s begin:
Clone the forked repo:
$ git clone https://github.com/doyougnu/ghc-whole-program-compiler-project.git
$ cd ghc-whole-program-compiler-project
$ tree -L 1
.
├── dist-newstyle
├── external-stg
├── external-stg-compiler
├── external-stg-interpreter
├── ghc.nix.wpc
├── ghc-wpc
├── lambda
├── mod-pak
├── README.md
├── shell.nix
├── stack.yaml
└── stack.yaml.lock
You’ll find the patched ghc.nix
included (ghc.nix.wpc
) and a shell.nix
for a nix-shell
. The shell.nix
file simply references ghc.nix.wpc/default.nix
with the appropriate options:
$ cat shell.nix
import (./ghc.nix.wpc/default.nix) {
useClang = true;
withHadrianDeps = true;
withIde = false;
withLlvm = true;
}
Building ghc-wpc
Now we can enter a nix-shell and build ghc-wpc
:
$ pwd
/home/doyougnu/programming/haskell/ghc-whole-program-compiler-project
$ nix-shell shell.nix # or just nix-shell
trace: checking if /home/doyougnu/programming/haskell/ghc-whole-program-compiler-project/hadrian/hadrian.cabal is present: no
Recommended ./configure arguments (found in $CONFIGURE_ARGS:
or use the configure_ghc command):
--with-gmp-includes=/nix/store/sznfxigwvrvn6ar3nz3f0652zsld9xqj-gmp-6.2.0-dev/include
--with-gmp-libraries=/nix/store/447im4mh8gmw85dkrvz3facg1jsbn6c7-gmp-6.2.0/lib
--with-curses-includes=/nix/store/84g84bg47xxg01ba3nv0h418v5v3969n-ncurses-6.1-20190112-dev/include
--with-curses-libraries=/nix/store/xhhkr936b9q5sz88jp4l29wljbbcg39k-ncurses-6.1-20190112/lib
--with-libnuma-includes=/nix/store/bfrcskjspk9a179xqqf1q9xqafq5s8d2-numactl-2.0.13/include
--with-libnuma-libraries=/nix/store/bfrcskjspk9a179xqqf1q9xqafq5s8d2-numactl-2.0.13/lib
--with-libdw-includes=/nix/store/sv6f05ngaarba50ybr6fdfc7cciv6nbv-elfutils-0.176/include
--with-libdw-libraries=/nix/store/sv6f05ngaarba50ybr6fdfc7cciv6nbv-elfutils-0.176/lib
--enable-dwarf-unwind
[nix-shell:~/programming/haskell/ghc-whole-program-compiler-project]$
Now we need to cd
into ghc-wpc
and tweak the hadrian build.
MAJOR CONSTRAINT: You must build ghc-wpc with hadrian/build-stack. If you build any other way you’ll run into shared object errors; see this ticket for details.
So in order to build ghc-wpc
with stack we’ll have to tweak the stack.yaml
file. You must do this since it is not included in the fork:
Quick side note: To make the formatting nicer I truncate
nix-shell:~/foo/bar/baz/ghc-whole-program-compiler-project
to just ...
, so
nix-shell:.../ghc-wpc
is equivalent to
~/path/to/ghc-whole-program-compiler-project/ghc-wpc
.
[nix-shell:...]$ cd ghc-wpc/hadrian/
[nix-shell:.../ghc-wpc/hadrian]$ cat stack.yaml
resolver: lts-15.5
packages:
- '.'
- 'GHC-Cabal'
system-ghc: true
nix:
enable: true
shell-file: ../../shell.nix
The changes are: (1) tell stack
we are using nix
, and (2) reference the shell.nix
file which points to ghc.wpc.nix
at the root of the project, i.e., ghc-whole-program-compiler-project/shell.nix
.
Now we should be able to begin our build, return to the root of ghc-wpc
and run the following:
[nix-shell:.../ghc-wpc/hadrian]$ cd ..
[nix-shell:.../ghc-wpc]$ ./boot && ./configure
[nix-shell:.../ghc-wpc]$ hadrian/build-stack -j
and go get some coffee since this will take some time. Once it finishes you should have the ghc-wpc
binary in _build/stage1/bin
[nix-shell:.../ghc-wpc]$ ls -l _build/stage1/bin/
total 8592
-rwxr-xr-x 1 doyougnu users 1843752 Apr 29 23:01 ghc
-rw-r--r-- 1 doyougnu users 11082 Apr 29 23:01 ghc.dyn_o_ghc_stgapp
-rwxr-xr-x 1 doyougnu users 660128 Apr 29 22:50 ghc-pkg
-rw-r--r-- 1 doyougnu users 9977 Apr 29 22:50 ghc-pkg.dyn_o_ghc_stgapp
-rwxr-xr-x 1 doyougnu users 4624680 Apr 29 23:01 haddock
-rw-r--r-- 1 doyougnu users 16883 Apr 29 23:01 haddock.dyn_o_ghc_stgapp
-rwxr-xr-x 1 doyougnu users 49344 Apr 29 22:25 hp2ps
-rw-r--r-- 1 doyougnu users 2504 Apr 29 22:25 hp2ps.dyn_o_ghc_stgapp
-rwxr-xr-x 1 doyougnu users 716440 Apr 29 22:35 hpc
-rw-r--r-- 1 doyougnu users 9959 Apr 29 22:35 hpc.dyn_o_ghc_stgapp
-rwxr-xr-x 1 doyougnu users 738544 Apr 29 22:35 hsc2hs
-rw-r--r-- 1 doyougnu users 10264 Apr 29 22:35 hsc2hs.dyn_o_ghc_stgapp
-rwxr-xr-x 1 doyougnu users 58384 Apr 29 22:34 runghc
-rw-r--r-- 1 doyougnu users 8864 Apr 29 22:34 runghc.dyn_o_ghc_stgapp
Notice that this build dumped *.<way>_o_ghc_stgapp
files!
Building the stg tooling
Now that we have a working ghc-wpc
we need to build the rest of the project by pointing stack
to the ghc-wpc
binary in ghc-wpc/_build/stage1/bin
. That is, we must change the ghc-whole-program-compiler-project/stack.yaml
file:
[nix-shell:~/programming/haskell/ghc-whole-program-compiler-project]$ cat stack.yaml
resolver: lts-16.13
allow-newer: true
packages:
- 'external-stg-compiler'
- 'external-stg'
ghc-options:
"$everything": -fno-stgbin -fno-stgapp -optcxx-std=c++17
extra-deps:
- async-pool-0.9.1@sha256:4015140f896c3f1652b06a679b0ade2717d05557970c283ea2c372a71be2a6a1,1605
- souffle-haskell-1.1.0
- zip-1.7.0
# use custom ext-stg whole program compiler GHC
compiler: ghc-8.11.0
skip-ghc-check: true
nix:
enable: false
# use local GHC (for development)
system-ghc: true
extra-path:
- /home/doyougnu/programming/haskell/ghc-whole-program-compiler-project/ghc-wpc/_build/stage1/bin
# DEBUG INFO
#dump-logs: all
#build:
# keep-tmp-files: true
# cabal-verbose: true
The changes are: (1) set compiler: ghc-8.11.0
(the ghc-wpc
fork), (2) set skip-ghc-check: true
so that stack doesn’t complain about the ghc version, (3) set nix.enable: false
(confusingly, if you leave this as true then stack will try to use nixpkgs
to get a ghc binary, but we want it to use our local binary, so we disable this even though we’ll still be in our original nix-shell), (4) set system-ghc: true
to tell stack we will be using a ghc we have on our system, and finally (5) set extra-path: <path-to-ghc-wpc-binary>
.
Now we can run stack and install the stg tooling:
[nix-shell:...]$ stack --stack-root `pwd`/.stack-root install
Trouble loading CompilerPaths cache: UnliftIO.Exception.throwString called with:
Compiler file metadata mismatch, ignoring cache
Called from:
throwString (src/Stack/Storage/User.hs:277:8 in stack-2.7.5-9Yv1tjrmAU3JiZWCo86ldN:Stack.Storage.User)
WARNING: Ignoring tagged's bounds on template-haskell (>=2.8 && <2.17); using template-haskell-2.17.0.0.
Reason: allow-newer enabled.
WARNING: Ignoring aeson's bounds on template-haskell (>=2.9.0.0 && <2.17); using template-haskell-2.17.0.0.
Reason: allow-newer enabled.
WARNING: Ignoring th-abstraction's bounds on template-haskell (>=2.5 && <2.17); using template-haskell-2.17.0.0.
Reason: allow-newer enabled.
WARNING: Ignoring unliftio-core's bounds on base (>=4.5 && <4.14); using base-4.14.0.0.
Reason: allow-newer enabled.
WARNING: Ignoring souffle-haskell's bounds on megaparsec (>=7.0.5 && <8); using megaparsec-8.0.0.
stack --stack-root `pwd`/.stack-root install
... # bunch of output
...
...
Copied executables to /home/doyougnu/.local/bin:
- dce-fullpak
- ext-stg
- fullpak
- gen-exe
- gen-exe2
- gen-obj
- gen-obj2
- mkfullpak
- show-ghc-stg
Warning: Installation path /home/doyougnu/.local/bin not found on the PATH environment variable.
You can add ~/.local/bin
to your PATH
if you want, I’ll just be directly referencing these binaries as we go.
Building the external-stg-interpreter
We are almost done; all that is left is to build the external-stg-interpreter and run a small script that links everything together into a shared object for the interpreter. So:
[nix-shell:...]$ cd external-stg-interpreter/
[nix-shell:.../external-stg-interpreter]$ stack install
... # bunch of output
...
Copied executables to /home/doyougnu/.local/bin:
- ext-stg
- ext-stg-interpreter
- fullpak
- mkfullpak
Warning: Installation path /home/doyougnu/.local/bin not found on the PATH environment variable.
Now we have our ext-stg-interpreter
built! There are a few caveats I want to point out here. I’ve modified ghc-whole-program-compiler-project/external-stg-interpreter/stack.yaml
to load the right packages and use nix:
[nix-shell:.../external-stg-interpreter]$ cat stack.yaml
resolver: lts-16.13
packages:
- '.'
- 'external-stg'
extra-deps:
- souffle-haskell-2.1.0
- primitive-0.7.1.0
- zip-1.7.0
nix:
enable: true
packages: [ zlib, libffi, pkg-config, bzip2 ]
Notice the nix:
block. We could have just as easily built this using nix
directly or using our shell.nix
file.
Linking the external-stg-interpreter
The only task left is to link everything into a shared object library called
libHSbase-4.14.0.0.cbits.so
. To do that we need to use the script called c
in ghc-whole-program-compiler-project/external-stg-interpreter/data
. This
script is a bit of a hack: it generates the shared object file so that we can link the symbols requested by the C
FFI in base
, but it populates those functions with our replacements, which do absolutely nothing. For example, we supply a fake garbage collector:
// in .../external-stg-interpreter/data/cbits.so-script/c-src/fake_rts.c
...
void performGC(void) {
}
void performMajorGC(void) {
}
...
This works because we won't be using the runtime system at all; we'll be using the external STG interpreter instead. However, we still need to provide these symbols in order to link. **MAJOR NOTE: this file must be next to any *.fullpak file you’ll be running the interpreter on** or else you’ll get an undefined symbol error during linking, for example:
[nix-shell:.../external-stg-interpreter/data]$ ls
cbits.so-script ghc-rts-base.fullpak minigame-strict.fullpak
### notice no .so file
[nix-shell:.../external-stg-interpreter/data]$ ~/.local/bin/ext-stg-interpreter ghc-rts-base.fullpak
ext-stg-interpreter: user error (dlopen: ./libHSbase-4.14.0.0.cbits.so: cannot open shared object file: No such file or directory)
## we error'd out because it was missing, also
## if you get this error then you have an old cbits.so file and need to rerun the c script
[nix-shell:.../external-stg-interpreter/data]$ ~/.local/bin/ext-stg-interpreter ghc-rts-base.fullpak
ext-stg-interpreter: user error (dlopen: ./libHSbase-4.14.0.0.cbits.so: undefined symbol: getProcessElapsedTime)
To link the interpreter we need to run c
in the data/cbits.so-script
sub-folder:
[nix-shell:.../external-stg-interpreter]$ cd data/cbits.so-script/
[nix-shell:.../external-stg-interpreter/data/cbits.so-script]$ ls
ar c cbits-rts.dyn_o c-src libHSbase-4.14.0.0.cbits.so stub-base.dyn_o
[nix-shell:.../external-stg-interpreter/data/cbits.so-script]$ ./c
++ ls ar/libHSbase-4.14.0.0-ghc8.11.0.20210220.dyn_o_cbits.a ar/libHSbindings-GLFW-3.3.2.0-Jg9TvsfYUZwD0ViIP0H2Tz-ghc8.11.0.20210306.dyn_o_cbits.a ar/libHSbytestring-0.10.9.0-ghc8.11.0.20210306.dyn_o_cbits.a ar/libHScriterion-measurement-0.1.2.0-73BCI2Fnk7qE8QjjTa1xNa-ghc8.11.0.20210324.dyn_o_cbits.a ar/libHSghc-8.11.0.20210306-ghc8.11.0.20210306.dyn_o_cbits.a ar/libHSGLUT-2.7.0.15-1pzTWDEZBcYHcS36qZ2lpp-ghc8.11.0.20201112.dyn_o_cbits.a ar/libHSGLUT-2.7.0.15-1pzTWDEZBcYHcS36qZ2lpp-ghc8.11.0.20210324.dyn_o_stubs.a ar/libHShashable-1.3.0.0-Kn7aNSFvzgo2qY16wYzuCX-ghc8.11.0.20210306.dyn_o_cbits.a ar/libHSinteger-gmp-1.0.3.0-ghc8.11.0.20210220.dyn_o_cbits.a ar/libHSlambdacube-quake3-engine-0.1.0.0-7CKLP3Rqgq0PR81lhlwlR-ghc8.11.0.20210306.dyn_o_cbits.a ar/libHSmersenne-random-pure64-0.2.2.0-ExYg8DmthtrLG9JevQbt2m-ghc8.11.0.20210306.dyn_o_cbits.a ar/libHSOpenGLRaw-3.3.4.0-5vXBlmbOM3AIT7GRYfpE3o-ghc8.11.0.20201112.dyn_o_cbits.a ar/libHSprimitive-0.7.0.1-2k3g9qX0zz16vEv34R307m-ghc8.11.0.20210306.dyn_o_cbits.a ar/libHSprocess-1.6.8.2-ghc8.11.0.20210220.dyn_o_cbits.a ar/libHStext-1.2.4.0-ghc8.11.0.20210220.dyn_o_cbits.a ar/libHSunix-2.7.2.2-ghc8.11.0.20210220.dyn_o_cbits.a ar/libHSunix-2.7.2.2-ghc8.11.0.20210220.dyn_o_stubs.a ar/libHSzlib-0.6.2.1-1I6DmfbLEyTBgDZI7SbZfW-ghc8.11.0.20210306.dyn_o_stubs.a
++ ls stub-base.dyn_o/Blank_stub.dyn_o stub-base.dyn_o/ClockGetTime_stub.dyn_o stub-base.dyn_o/Internals_stub.dyn_o stub-base.dyn_o/RUsage_stub.dyn_o
++ ls cbits-rts.dyn_o/StgPrimFloat.dyn_o cbits-rts.dyn_o/TTY.dyn_o
++ ls c-src/fake_rts.c c-src/hack.c c-src/hschooks.c
+ gcc -o libHSbase-4.14.0.0.cbits.so -shared -Wl,--whole-archive ar/libHSbase-4.14.0.0-ghc8.11.0.20210220.dyn_o_cbits.a ar/libHSbindings-GLFW-3.3.2.0-Jg9TvsfYUZwD0ViIP0H2Tz-ghc8.11.0.20210306.dyn_o_cbits.a ar/libHSbytestring-0.10.9.0-ghc8.11.0.20210306.dyn_o_cbits.a ar/libHScriterion-measurement-0.1.2.0-73BCI2Fnk7qE8QjjTa1xNa-ghc8.11.0.20210324.dyn_o_cbits.a ar/libHSghc-8.11.0.20210306-ghc8.11.0.20210306.dyn_o_cbits.a ar/libHSGLUT-2.7.0.15-1pzTWDEZBcYHcS36qZ2lpp-ghc8.11.0.20201112.dyn_o_cbits.a ar/libHSGLUT-2.7.0.15-1pzTWDEZBcYHcS36qZ2lpp-ghc8.11.0.20210324.dyn_o_stubs.a ar/libHShashable-1.3.0.0-Kn7aNSFvzgo2qY16wYzuCX-ghc8.11.0.20210306.dyn_o_cbits.a ar/libHSinteger-gmp-1.0.3.0-ghc8.11.0.20210220.dyn_o_cbits.a ar/libHSlambdacube-quake3-engine-0.1.0.0-7CKLP3Rqgq0PR81lhlwlR-ghc8.11.0.20210306.dyn_o_cbits.a ar/libHSmersenne-random-pure64-0.2.2.0-ExYg8DmthtrLG9JevQbt2m-ghc8.11.0.20210306.dyn_o_cbits.a ar/libHSOpenGLRaw-3.3.4.0-5vXBlmbOM3AIT7GRYfpE3o-ghc8.11.0.20201112.dyn_o_cbits.a ar/libHSprimitive-0.7.0.1-2k3g9qX0zz16vEv34R307m-ghc8.11.0.20210306.dyn_o_cbits.a ar/libHSprocess-1.6.8.2-ghc8.11.0.20210220.dyn_o_cbits.a ar/libHStext-1.2.4.0-ghc8.11.0.20210220.dyn_o_cbits.a ar/libHSunix-2.7.2.2-ghc8.11.0.20210220.dyn_o_cbits.a ar/libHSunix-2.7.2.2-ghc8.11.0.20210220.dyn_o_stubs.a ar/libHSzlib-0.6.2.1-1I6DmfbLEyTBgDZI7SbZfW-ghc8.11.0.20210306.dyn_o_stubs.a -Wl,--no-whole-archive stub-base.dyn_o/Blank_stub.dyn_o stub-base.dyn_o/ClockGetTime_stub.dyn_o stub-base.dyn_o/Internals_stub.dyn_o stub-base.dyn_o/RUsage_stub.dyn_o cbits-rts.dyn_o/StgPrimFloat.dyn_o cbits-rts.dyn_o/TTY.dyn_o -fPIC c-src/fake_rts.c c-src/hack.c c-src/hschooks.c -lm -lgmp -ltinfo -lGL -lX11 -lXi -lXrandr -lXxf86vm -lXcursor -lXinerama -lpthread
This will produce libHSbase-4.14.0.0.cbits.so
in the immediate directory:
[nix-shell:.../external-stg-interpreter/data/cbits.so-script]$ ls -l
total 984
drwxr-xr-x 2 doyougnu users 4096 Apr 27 14:10 ar
-rwxr-xr-x 1 doyougnu users 300 Apr 27 14:10 c
drwxr-xr-x 2 doyougnu users 4096 Apr 27 14:10 cbits-rts.dyn_o
drwxr-xr-x 2 doyougnu users 4096 Apr 27 14:10 c-src
-rwxr-xr-x 1 doyougnu users 986008 Apr 30 11:50 libHSbase-4.14.0.0.cbits.so ## <----- new
drwxr-xr-x 2 doyougnu users 4096 Apr 27 14:10 stub-base.dyn_o
Now we can test our interpreter by running it on the *.fullpak
files in external-stg-interpreter/data
:
[nix-shell:.../external-stg-interpreter/data/cbits.so-script]$ cd ..
[nix-shell:.../external-stg-interpreter/data]$ ls
cbits.so-script ghc-rts-base-call-graph-summary ghc-rts-base-call-graph.tsv ghc-rts-base.fullpak libHSbase-4.14.0.0.cbits.so minigame-strict.fullpak
## remove the old .so file
[nix-shell:.../external-stg-interpreter/data]$ rm libHSbase-4.14.0.0.cbits.so
## soft-link to the one we just built
[nix-shell:.../external-stg-interpreter/data]$ ln -s cbits.so-script/libHSbase-4.14.0.0.cbits.so libHSbase-4.14.0.0.cbits.so
[nix-shell:.../external-stg-interpreter/data]$ ls -l
total 79220
drwxr-xr-x 6 doyougnu users 4096 Apr 30 11:50 cbits.so-script
-rw-r--r-- 1 doyougnu users 48 Apr 30 11:47 ghc-rts-base-call-graph-summary
-rw-r--r-- 1 doyougnu users 28238 Apr 30 11:47 ghc-rts-base-call-graph.tsv
-rw-r--r-- 1 doyougnu users 22450708 Apr 27 14:10 ghc-rts-base.fullpak
lrwxrwxrwx 1 doyougnu users 43 Apr 30 11:55 libHSbase-4.14.0.0.cbits.so -> cbits.so-script/libHSbase-4.14.0.0.cbits.so ### <---- new
-rw-r--r-- 1 doyougnu users 58630129 Apr 27 14:10 minigame-strict.fullpak
[nix-shell:.../external-stg-interpreter/data]$ ~/.local/bin/ext-stg-interpreter ghc-rts-base.fullpak
hello
hello
ssHeapStartAddress: 53522
ssTotalLNECount: 69
ssClosureCallCounter: 360
executed closure id count: 114
call graph size: 150
[nix-shell:.../external-stg-interpreter/data]$ ls -l
total 79220
drwxr-xr-x 6 doyougnu users 4096 Apr 30 11:50 cbits.so-script
-rw-r--r-- 1 doyougnu users 48 Apr 30 11:56 ghc-rts-base-call-graph-summary ### <---- interpreter output
-rw-r--r-- 1 doyougnu users 28238 Apr 30 11:56 ghc-rts-base-call-graph.tsv ### <---- interpreter output
-rw-r--r-- 1 doyougnu users 22450708 Apr 27 14:10 ghc-rts-base.fullpak
lrwxrwxrwx 1 doyougnu users 43 Apr 30 11:55 libHSbase-4.14.0.0.cbits.so -> cbits.so-script/libHSbase-4.14.0.0.cbits.so
-rw-r--r-- 1 doyougnu users 58630129 Apr 27 14:10 minigame-strict.fullpak
And it works! We have two new files, <foo>-call-graph-summary
and <foo>-call-graph.tsv
which we can analyze to inspect the behavior of our program (more on this later).
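As a taste of the kind of analysis to come, a few lines of Haskell are enough to aggregate a tab-separated edge list like the one the interpreter emits; note that the column layout assumed here (caller in the first column, callee in the second) is an illustration, not the tool's documented format:

```haskell
import Data.List (foldl', sortOn)
import qualified Data.Map.Strict as M

-- Split a row on tab characters.
splitTabs :: String -> [String]
splitTabs s = case break (== '\t') s of
  (w, [])       -> [w]
  (w, _ : rest) -> w : splitTabs rest

-- Count outgoing edges per caller and sort hottest-first.
edgeCounts :: [String] -> [(String, Int)]
edgeCounts rows =
  sortOn (negate . snd) (M.toList (foldl' bump M.empty rows))
  where
    bump m row = case splitTabs row of
      (caller : _ : _) -> M.insertWith (+) caller 1 m
      _                -> m  -- skip malformed rows
```

For real work you would load the `.tsv` into a proper graph tool, but a quick fold like this is handy for spotting the busiest call sites from the command line.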
The whole setup process on a demo
That was a rather involved example. To make clear the dependencies and steps required to run this on your own code, the rest of this tutorial will run the interpreter on two of Csaba’s demos from his Skillshare talk. First let’s grab the code:
$ pwd
/home/doyougnu/programming/haskell
$ git clone https://github.com/grin-compiler/ext-stg-interpreter-presentation-demos.git
$ ls
ext-stg-interpreter-presentation-demos ghc-whole-program-compiler-project ..
Now we’ll run the first demo, which is a simple fold over a list:
$ nix-shell ghc-whole-program-compiler-project/shell.nix
trace: checking if /home/doyougnu/programming/haskell/hadrian/hadrian.cabal is present: no
Recommended ./configure arguments (found in $CONFIGURE_ARGS:
or use the configure_ghc command):
--with-gmp-includes=/nix/store/sznfxigwvrvn6ar3nz3f0652zsld9xqj-gmp-6.2.0-dev/include
--with-gmp-libraries=/nix/store/447im4mh8gmw85dkrvz3facg1jsbn6c7-gmp-6.2.0/lib
--with-curses-includes=/nix/store/84g84bg47xxg01ba3nv0h418v5v3969n-ncurses-6.1-20190112-dev/include
--with-curses-libraries=/nix/store/xhhkr936b9q5sz88jp4l29wljbbcg39k-ncurses-6.1-20190112/lib
--with-libnuma-includes=/nix/store/bfrcskjspk9a179xqqf1q9xqafq5s8d2-numactl-2.0.13/include
--with-libnuma-libraries=/nix/store/bfrcskjspk9a179xqqf1q9xqafq5s8d2-numactl-2.0.13/lib
--with-libdw-includes=/nix/store/sv6f05ngaarba50ybr6fdfc7cciv6nbv-elfutils-0.176/include
--with-libdw-libraries=/nix/store/sv6f05ngaarba50ybr6fdfc7cciv6nbv-elfutils-0.176/lib
--enable-dwarf-unwind
[nix-shell:~/programming/haskell]$ cd ext-stg-interpreter-presentation-demos/demo-01-tsumupto/
[nix-shell:~/programming/haskell/ext-stg-interpreter-presentation-demos/demo-01-tsumupto]$ ../../ghc-whole-program-compiler-project/ghc-wpc/_build/stage1/bin/ghc -O2 tsumupto.hs
[1 of 1] Compiling Main ( tsumupto.hs, tsumupto.o )
Linking tsumupto ...
$ cd ext-stg-interpreter-presentation-demos/demo-01-tsumupto
$ ls
tsumupto tsumupto.hi tsumupto.hs tsumupto.o tsumupto.o_ghc_stgapp tsumupto.o_modpak
Note that we have two new files: *.o_ghc_stgapp
and .o_modpak
as a result of building with ghc-wpc
. If you try to run this from outside the nix-shell you’ll get an error about missing mkmodpak
:
$ ../../ghc-whole-program-compiler-project/ghc-wpc/_build/stage1/bin/ghc -O2 tsumupto.hs
[1 of 1] Compiling Main ( tsumupto.hs, tsumupto.o )
ghc: could not execute: mkmodpak
Now that we have those files we can run the interpreter, but first we need to make a *.fullpak
file from the *.o_ghc_stgapp
file and create a symbolic link to libHSbase-4.14.0.0.cbits.so
:
## make the fullpack file
$ ~/.local/bin/mkfullpak tsumupto.o_ghc_stgapp
all modules: 259
app modules: 113
app dependencies:
... # bunch of output
...
main Main
creating tsumupto.fullpak
## create the link to the shared object file
$ ln -s ../../ghc-whole-program-compiler-project/external-stg-interpreter/data/cbits.so-script/libHSbase-4.14.0.0.cbits.so libHSbase-4.14.0.0.cbits.so
## the final directory should look like this
$ ls
libHSbase-4.14.0.0.cbits.so tsumupto tsumupto.fullpak tsumupto.hi tsumupto.hs tsumupto.o tsumupto.o_ghc_stgapp tsumupto.o_modpak
And now we can run the interpreter:
$ ~/.local/bin/ext-stg-interpreter tsumupto.fullpak
50005000
ssHeapStartAddress: 44082
ssTotalLNECount: 43
ssClosureCallCounter: 30275
executed closure id count: 112
call graph size: 146
The first line is the output of the program and the rest are diagnostics that the interpreter outputs. More importantly, we should have a call-graph summary file and a tab-separated call-graph file in our local directory after running the interpreter:
$ ls -l
total 23876
lrwxrwxrwx 1 doyougnu users 114 Apr 30 12:21 libHSbase-4.14.0.0.cbits.so -> ../../ghc-whole-program-compiler-project/external-stg-interpreter/data/cbits.so-script/libHSbase-4.14.0.0.cbits.so
-rwxr-xr-x 1 doyougnu users 9442648 Apr 30 12:12 tsumupto
-rw-r--r-- 1 doyougnu users 53 Apr 30 12:23 tsumupto-call-graph-summary ### <---- interpreter output
-rw-r--r-- 1 doyougnu users 27490 Apr 30 12:23 tsumupto-call-graph.tsv ### <---- interpreter output
-rw------- 1 doyougnu users 14922366 Apr 30 12:19 tsumupto.fullpak
-rw-r--r-- 1 doyougnu users 1769 Apr 30 12:12 tsumupto.hi
-rw-r--r-- 1 doyougnu users 207 Apr 28 22:56 tsumupto.hs
-rw-r--r-- 1 doyougnu users 4488 Apr 30 12:12 tsumupto.o
-rw-r--r-- 1 doyougnu users 8817 Apr 30 12:12 tsumupto.o_ghc_stgapp
-rw------- 1 doyougnu users 9803 Apr 30 12:12 tsumupto.o_modpak
These files can be loaded into gephi for closer inspection of our program's call graph. Be sure to watch the rest of the demo in Csaba's talk for this part! We'll go over using gephi with these files in the next blog post in this series, so stay tuned!
Summary
File Descriptions
- `foo.modpak`: a zip file which contains the Core, STG, CMM, source code, and assembly for the module `foo`.
- `foo.fullpak`: a zip file which contains the same information as the `modpak`, but for every module of the program rather than just module `foo`.
- `foo.o_ghc_stgapp`: a YAML-like file that contains:
  - the module's dependencies, including package dependencies
  - a bunch of file paths for the shared objects of the libraries
  - the flags the module was built with
- `libHSbase-4.14.0.0.cbits.so`: a shared object file created by `ext-stg-interpreter/data/cbits.so-script.c`. It must be present in the directory in which `ext-stg-interpreter` will be invoked.
Step-by-Step guide for running the interpreter on your code
1. Build your project with `ghc-wpc/_build/stage1/bin`, either by directly invoking that `ghc` (as I did in the demo-01 project), by pointing stack to it with `system-ghc` and `extra-path` in `stack.yaml`, or by passing `-w <path-to-ghc-wpc-binary>` to cabal.
2. Generate the `foo.fullpak` file with `mkfullpak foo.o_ghc_stgapp`.
3. Soft-link to `libHSbase-4.14.0.0.cbits.so` in the directory in which you will run the interpreter. This file must be present when you run the interpreter!
4. Run the interpreter on `foo.fullpak`.
5. Analyze `foo-call-graph-summary` and `foo-call-graph.tsv` with whatever tools make sense to you.
Footnotes
- This isn't completely true: the `RuntimeRep` type controls exactly this, together with the levity polymorphism work by Richard Eisenberg. See this video for examples of using these features. We do plan to include a more thorough, real-world example of using levity polymorphism for better performance in the Haskell Optimization Handbook.↩
On the inlining of Integer and Natural operations
In this post I discuss the inlining of Integer and Natural operations in Haskell. It's promising performance work I was conducting six months ago; it was blocked by an independent issue, but I will likely resume it soon, as that issue has since been fixed.
To follow this post, you need to know that `Natural` numbers are represented as follows in `ghc-bignum`:
-- | Natural number
--
-- Invariant: numbers <= WORD_MAXBOUND use the `NS` constructor
data Natural
= NS !Word#
| NB !BigNat#
Small naturals are represented with a `Word#` and large ones with a `BigNat#` (a `ByteArray#`).
Now consider the following simple example using Natural:
-- | Add 2 to a Word. Use Natural to avoid Word overflow
foo :: Word -> Natural
foo x = fromIntegral x + 2
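To make the overflow concern behind `foo` concrete, here is a small self-contained sketch (not from the original post) contrasting plain `Word` addition, which wraps around, with going through `Natural`, which keeps the exact result:

```haskell
import Numeric.Natural (Natural)

main :: IO ()
main = do
  let w = maxBound :: Word
  -- Word addition wraps around on overflow: maxBound + 2 gives 1.
  print (w + 2)
  -- Going through Natural keeps the mathematically exact result.
  print (fromIntegral w + 2 :: Natural)
```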
There are only small naturals involved: `fromIntegral x` is small because `x` is a `Word`, and `2` is small. We could hope that GHC would use `Word#` primops to implement this and would allocate a `Natural` heap object only for the result. However, that's not what happens currently, even in GHC HEAD. In the following STG dump, we can see that a `Natural` heap object is allocated for `x` before calling `naturalAdd` (`let` bindings in STG reflect heap allocations):
foo1 = NS! [2##];
foo =
\r [x_sXn]
case x_sXn of {
W# x#_sXp ->
let { sat_sXq = NS! [x#_sXp]; } in naturalAdd sat_sXq foo1;
};
Let's look at `naturalAdd`:
-- | Add two naturals
naturalAdd :: Natural -> Natural -> Natural
{-# NOINLINE naturalAdd #-}
naturalAdd (NS x) (NB y) = NB (bigNatAddWord# y x)
naturalAdd (NB x) (NS y) = NB (bigNatAddWord# x y)
naturalAdd (NB x) (NB y) = NB (bigNatAdd x y)
naturalAdd (NS x) (NS y) =
case addWordC# x y of
(# l,0# #) -> NS l
(# l,c #) -> NB (bigNatFromWord2# (int2Word# c) l)
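In the `NS`/`NS` case, `addWordC#` returns both the low word of the sum and a carry flag. The same check can be sketched in boxed Haskell (a hypothetical helper for illustration, not part of `ghc-bignum`), using the fact that for unsigned addition the wrapped sum is smaller than an operand exactly when a carry occurred:

```haskell
-- Hypothetical boxed analogue of the NS/NS case of naturalAdd.
-- For unsigned types, x + y overflows iff the wrapped sum is < x.
addSmall :: Word -> Word -> (Word, Bool)  -- (low word, carry?)
addSmall x y = (s, s < x)
  where s = x + y

main :: IO ()
main = do
  print (addSmall 1 2)         -- no carry: (3,False)
  print (addSmall maxBound 2)  -- carry: wrapped low word is 1
```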
We are clearly in the last case, where both arguments are small. It seems beneficial to allow this function to be inlined. If we did, we would get:
foo =
\r [x_s158]
case x_s158 of {
W# x#_s15a ->
case addWordC# [x#_s15a 2##] of {
(#,#) l_s15c ds_s15d ->
case ds_s15d<TagProper> of ds1_s15e {
__DEFAULT ->
case int2Word# [ds1_s15e] of sat_s15f {
__DEFAULT ->
case bigNatFromWord2# sat_s15f l_s15c of ds2_s15g {
__DEFAULT -> NB [ds2_s15g];
};
};
0# -> NS [l_s15c];
};
};
};
which produces much better assembly code, especially if there is no carry:
addq $2,%rax ; add 2 to a machine word
setc %bl ; test the carry.
movzbl %bl,%ebx ; it could be done
testq %rbx,%rbx ; more efficiently
jne _blk_c17c ; with "jc"
_blk_c17i:
movq $NS_con_info,-8(%r12) ; alloc NS datacon value
movq %rax,(%r12) ; with the addition result as payload.
leaq -7(%r12),%rbx ; make it the first argument
addq $8,%rbp ; and then
jmp *(%rbp) ; call continuation
...
So why aren't we always inlining `naturalAdd`? We even explicitly disallow it with a `NOINLINE` pragma. The reason is that `naturalAdd` and friends are involved in constant-folding rules.
For example, consider:
bar :: Natural -> Natural
bar x = x + 2
baz = bar 0x12345678913245678912345679123456798
Currently we get the following Core:
bar1 = NS 2##
bar = \ x_aHU -> naturalAdd x_aHU bar1
baz = NB 99114423092485377935703335253042771879834
You can see that `baz` is a constant thanks to constant-folding. However, if we let `naturalAdd` inline, we get:
baz
= case bigNatAddWord# 99114423092485377935703335253042771879832 2##
of ds_d11H
{ __DEFAULT ->
NB ds_d11H
}
`baz` is no longer a constant.
A solution would be to add constant-folding rules for `BigNat#` functions, such as `bigNatAddWord#`. This is exactly what we have started doing in #20361. Our new plan is:
1. Make `BigNat#` operations `NOINLINE` and add constant-folding rules for them.
2. Make Integer/Natural operations `INLINEABLE` (i.e. expose their unfoldings).
3. Hence rely on constant-folding for `Word#`/`Int#`/`BigNat#` to provide constant folding for `Integer` and `Natural`.
The good consequences of this plan are:
- Fewer allocations when bignum operations are inlined and some of the arguments are known to be small/big or fully known (constant).
- `Integer` and `Natural` are less magical: you can implement your own similar types and expect the same performance without having to add new rewrite rules.
There were some unforeseen difficulties with this plan though:
- Some of the rewrite rules we need involve unboxed values such as `BigNat#` and `Word#`, and these weren't supported. Luckily, this has recently been fixed (#19313) by removing the “app invariant” (#20554). Thanks, Joachim! That's the reason why we can resume this work now.
- Some unfoldings (RHSs) become bigger due to the inlining of bignum operations. Hence they may not themselves be inlined further, due to inlining thresholds, even if that would be beneficial. A better inlining heuristic would fix this (see #20516). It will likely be the topic of the next post.