7 posts tagged with "javascript"

· 9 min read

In a previous post we introduced GHC's new JavaScript backend, which allows the compilation of Haskell code into JavaScript. This is the first tutorial in a new series about the JavaScript backend. In this post, we'll build GHC as a JavaScript cross-compiler and run a trivial Haskell program in the browser.

We plan to write more of these blog posts in the coming weeks and months as we add new features (e.g. support for "foreign exports" that will allow JavaScript code to call into Haskell code, support for Template Haskell, etc.). For now the series relies on our "insider" knowledge (e.g. of how the FFI works) that isn't well documented elsewhere. We do plan to add a chapter about the JavaScript backend to GHC's user guide, but for now your best bet is to look at GHCJS's documentation or at the source code.

Please note: this is a technology preview of the in-development JavaScript backend for GHC. Not all Haskell features are implemented, and bugs are expected. It is currently rather complicated for JavaScript code to call into Haskell code ("foreign exports" aren't implemented). GHC isn't a multi-target compiler yet, so a GHC executable built for a native platform (Linux/x86-64, Windows/x86-64, Darwin/AArch64...) as currently distributed (via ghcup, Stack, binary distributions, etc.) won't be able to produce JavaScript. Official prebuilt binary distributions are likely to remain unavailable until GHC gains multi-target support - requiring the JavaScript backend to be built from source even after the backend matures. That's why we start this post with the required steps to build yourself a GHC compiler capable of producing JavaScript.

Building GHC as a Cross Compiler to JavaScript

Installing Dependencies

First we need to install all the typical dependencies for GHC plus Emscripten, so our final list is:

  • GHC version 9.2 or later
  • Cabal
  • Alex
  • Happy
  • Emscripten, used during the configure step
  • (Optional) NodeJS to run JavaScript locally

Let's take these in order. A standard GHC distribution with Cabal is needed so we can boot our new compiler. We recommend using GHCup (https://www.haskell.org/ghcup/install/) or your system's package manager to install these.

We need Alex and Happy to build GHC; these can be installed through Cabal:

cabal install alex happy -j

We need Emscripten during the configure step of the build. Emscripten should be available in most package managers, but you can also build and install it from source:

git clone https://github.com/emscripten-core/emsdk.git
cd emsdk
./emsdk install latest
./emsdk activate latest
source ./emsdk_env.sh

After installing Emscripten, emconfigure should be available on your system path. Use which emconfigure to check that it is on your $PATH. If you built from source, then the output should point to a location within the emsdk git project like so:

$ which emconfigure
/path/to/emsdk/upstream/emscripten/emconfigure

For more detailed installation instructions, see https://emscripten.org/docs/getting_started/downloads.html.

That's all we need to build GHC as a cross compiler. NodeJS can be installed via your system's package manager if you want to run the JavaScript programs locally. We'll assume it's in your $PATH for the rest of the blog post.

Building GHC

With all the dependencies installed, we can clone GHC HEAD and build the cross compiler:

git clone https://gitlab.haskell.org/ghc/ghc.git --recursive

You should notice quite a few submodules being cloned as well as the main repo; expect this to take a while. Once this has completed, navigate to the ghc directory and run the following configuration commands:

cd ghc
./boot
emconfigure ./configure --target=js-unknown-ghcjs

emconfigure ./configure --target=js-unknown-ghcjs will finish by outputting a screen that looks like:

----------------------------------------------------------------------
Configure completed successfully.

Building GHC version : 9.5.20221219
Git commit id : 761c1f49f55afc9a9f290fafb48885c2033069ed

Build platform : x86_64-unknown-linux
Host platform : x86_64-unknown-linux
Target platform : js-unknown-ghcjs

Bootstrapping using : /home/josh/.ghcup/bin/ghc
which is version : 9.4.2
with threaded RTS? : YES

Using (for bootstrapping) : gcc
Using clang : /home/josh/emsdk/upstream/emscripten/emcc
which is version : 15.0.0
linker options :
Building a cross compiler : YES
Unregisterised : NO
TablesNextToCode : YES
Build GMP in tree : NO
hs-cpp : /home/josh/emsdk/upstream/emscripten/emcc
hs-cpp-flags : -E -undef -traditional -Wno-invalid-pp-token -Wno-unicode -Wno-trigraphs
ar : /home/josh/emsdk/upstream/emscripten/emar
ld : /home/josh/emsdk/upstream/emscripten/emcc
nm : /home/josh/emsdk/upstream/bin/llvm-nm
objdump : /usr/bin/objdump
ranlib : /home/josh/emsdk/upstream/emscripten/emranlib
otool : otool
install_name_tool : install_name_tool
windres :
dllwrap :
genlib :
Happy : /home/josh/.cabal/bin/happy (1.20.0)
Alex : /home/josh/.cabal/bin/alex (3.2.7.1)
sphinx-build :
xelatex :
makeinfo :
git : /usr/bin/git
cabal-install : /home/josh/.cabal/bin/cabal

Using LLVM tools
clang : clang
llc : llc-14
opt : opt-14

HsColour was not found; documentation will not contain source links

Tools to build Sphinx HTML documentation available: NO
Tools to build Sphinx PDF documentation available: NO
Tools to build Sphinx INFO documentation available: NO
----------------------------------------------------------------------

If everything is correct, you'll see that the Target platform is set to js-unknown-ghcjs, and the build tools will be set to their Emscripten counterparts: ar becomes emar, nm becomes llvm-nm, etc.

Finally, to build GHC:

./hadrian/build --bignum=native -j --docs=none

Expect this to take around a half hour or longer. If all goes well you should see:

/--------------------------------------------------------\
| Successfully built library 'ghc' (Stage1, way p). |
| Library: _build/stage1/compiler/build/libHSghc-9.5_p.a |
| Library synopsis: The GHC API. |
\--------------------------------------------------------/
| Copy package 'ghc'
# cabal-copy (for _build/stage1/lib/package.conf.d/ghc-9.5.conf)
| Run GhcPkg Recache Stage1: none => none
| Copy file: _build/stage0/bin/js-unknown-ghcjs-ghc => _build/stage1/bin/js-unknown-ghcjs-ghc
Build completed in 1h00m

Take note of the _build/stage1/bin/js-unknown-ghcjs-ghc path. This is the GHC executable that we'll use to compile to JavaScript. To make life easier on ourselves we can alias it:

alias ghc-js=`pwd`/_build/stage1/bin/js-unknown-ghcjs-ghc

First Haskell to JavaScript Program

Now that we have a version of GHC that can output JavaScript, let's compile a Haskell program and run it with NodeJS. Make a file named "HelloJS.hs", with the following contents:

-- HelloJS.hs
module Main where

main :: IO ()
main = putStrLn "Hello, JavaScript!"

Now we can compile it using the alias we defined earlier:

ghc-js HelloJS.hs

You should see the following output, and a HelloJS executable.

[1 of 2] Compiling Main             ( HelloJS.hs, HelloJS.o )
[2 of 2] Linking HelloJS.jsexe

If NodeJS is on your $PATH, then this executable can be run just like any other command line program:

./HelloJS
Hello, JavaScript!

Notice that a directory called HelloJS.jsexe was created. This directory contains all the final JavaScript code, including a file named all.js, and a minimal index.html HTML file that wraps all.js. For now, we'll only care about all.js and return to index.html later. all.js is the payload of our HelloJS executable. The executable is simply a copy of all.js, with a call to node added to the top. We could have equivalently run our program with:

node HelloJS.jsexe/all.js

Haskell in the Browser

We saw in the previous example that GHC's JavaScript backend allows us to write Haskell and run the output JavaScript with NodeJS. This produces a portable executable, but otherwise doesn't enable anything we couldn't do before; GHC can already compile Haskell to run on most platforms! So let's do something novel, and run Haskell in the browser.

In this example, we'll use Haskell to draw a simple SVG circle to our browser window. Put the following code in a file named HelloBrowser.hs:

-- HelloBrowser.hs
module Main where

import Foreign.C.String

foreign import javascript "((arr,offset) => document.body.innerHTML = h$decodeUtf8z(arr,offset))"
setInnerHtml :: CString -> IO ()

circle :: String
circle = "<svg width=300 height=300><circle cx=50% cy=50% r=50%></circle></svg>"

main :: IO ()
main = withCString circle setInnerHtml

Notice that we've encountered a Haskell feature that's only available in the JavaScript backend: JavaScript foreign imports. This feature allows our Haskell program to call JavaScript functions. In our example we use this feature to call a JavaScript arrow function that updates the body of the page with our HTML snippet containing a drawing of a circle. Alternatively, we could have set the foreign import to a function symbol like so:

foreign import javascript "setInnerHTML"
setInnerHtml :: CString -> IO ()

where setInnerHTML is defined in a .js file that is then loaded by passing the JavaScript file to GHC along with the Haskell sources.
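
For illustration, here is what a hypothetical foo.js providing that symbol could look like. It simply mirrors the inline arrow-function version above, reusing the same h$decodeUtf8z helper to decode the CString, which is passed as an (array, offset) pair:

// foo.js (hypothetical): definition of the symbol named in the foreign import
function setInnerHTML(arr, offset) {
  document.body.innerHTML = h$decodeUtf8z(arr, offset);
}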

Next, we can compile our program to JavaScript, again with our built GHC:

ghc-js HelloBrowser.hs

Or ghc-js HelloBrowser.hs foo.js if setInnerHTML is defined in foo.js.

Recall the index.html file inside the HelloBrowser.jsexe directory. This HTML file has our compiled JavaScript already included, so if you open it in your browser, you'll find it loads our SVG circle in the top-left of the page!

Example webpage screenshot

index.html contains the minimal HTML code required to load the generated JavaScript code. It simply loads the all.js file mentioned above with the following script tag that you can reuse in your own HTML files:

<script language="javascript" src="all.js" defer></script>

As the JS backend still lacks support for some FFI features (foreign exports, foreign "wrapper" imports, etc.), JavaScript code can't easily interact with Haskell code. This limits the number of advanced/interesting examples we can present for now. We'll publish new blog posts illustrating these features when they are implemented.

Conclusion

In this post, we've seen how to build a first Haskell program to run in the browser using a preview of GHC's in-development JavaScript backend. This program used "foreign imports" to make a JavaScript function available within the Haskell code, which allows a limited interaction between Haskell and the browser. We also saw the structure of the outputs of the JavaScript backend, in the .jsexe directory, and how this allows our Haskell program to be invoked by a custom HTML wrapper. This was all enabled by building a version of GHC from source, with the build process having been configured with Emscripten to produce a GHC executable that targets JavaScript.

· 20 min read

A new JavaScript backend was merged into GHC on November 30th, 2022! This means that the next release of GHC will be able to emit code that runs in web browsers without requiring any extra tools, enabling Haskell for both front-end and back-end web applications.

In this post, we, the GHC DevX team at IOG, describe the challenges we faced bringing GHCJS to GHC, how we overcame those challenges, and what's left to do. This post is rather long so we've provided these links in case you would like to skip ahead:

Take me to the future of GHCJS
Tell me what to expect
Show me the product roadmap
Tell me how I can help
Just show me how to hello world! (Skip to build instructions)

Why JavaScript? Or, the Big Picture.

To put it simply, the number of users on the internet is as low as it will ever be right now, and it is almost guaranteed that those users use JavaScript. At the time of writing, JavaScript holds 97.3% of client-side programming market share (not to mention market share of front-end technologies). Furthermore, JavaScript is not going to disappear anytime soon. As more and more interactivity is pushed onto the internet, JavaScript will become more entrenched because of backwards compatibility, network effects and the amount of capital already devoted to it. JavaScript, like C and COBOL, will be with us for the foreseeable future. This makes JavaScript an attractive target; it provides portability, allows us to capitalize on the massive investments in the language and platform, and essentially eliminates the risk that we build our technology atop a disappearing or deprecating foundation.

WebAssembly is a promising target as well, and Tweag has just merged a WebAssembly backend into GHC (great work and congrats!). WebAssembly is not as ubiquitous as JavaScript yet, and has a harder time interacting with JavaScript directly. Hence, we believe that the WebAssembly and JavaScript backends provide different strengths, and it is to the Haskell community's benefit to have and support both code generation paths in GHC for different use cases and requirements.

Why Haskell?

JavaScript has many problems ranging from the downstream effects of early design decisions (that inhibit programmer productivity and are subtle bug generators), to ecosystem security issues, to fundamental issues with asynchronous and concurrent programming.

These issues are problematic for our product domain. At IOG, a central engineering requirement is to create a code base that has a high degree of correctness. Haskell makes this easy; or, to get a little technical, its strong static Hindley-Milner-based typing allows us to write performant, correct, and maintainable code. In addition to this, many of the problems that occur in JavaScript are simply not expressible because of Haskell's type system and concurrency offerings.

There are, of course, competitors: PureScript targets Javascript and provides a programmer experience close to Haskell's. The benefit of using Haskell instead is code sharing: we can write the front-end of a web app in Haskell that compiles to JavaScript and the back-end in Haskell that compiles to machine code. In particular, the (de)serialization code (e.g. from/to JSON) is shared and cannot get out of sync between the front-end and the back-end.

Why a GHC backend?

Haskell is a language driven by its implementation in GHC. GHC development is very active and GHC does not define a stable interface for compiler backends that are independently maintained, which means that maintaining an out-of-tree backend is costly.

The maintenance burden is not hypothetical; our teammate Luite Stegeman has been developing a fork of GHC that emits JavaScript, called GHCJS, for close to 10 years and has experienced the pain first hand. Any changes to upstream GHC had to be adapted to the customized fork or GHCJS would fall behind. And fall behind it did: at the time of writing, GHCJS has stuck to using GHC 8.10, lagging behind by three major releases and counting.

Similarly, the Eta compiler (which targeted the JVM) faced the same issues and appears to be discontinued (compatibility with GHC 7.10.3's Haskell from 2015 is mentioned).

Compounding the issue, the normal Haskell toolchain was not designed for an edge case like GHCJS. So GHCJS required that the normal tooling, e.g., Cabal and Stack, could distinguish between GHC and GHCJS compilers. This meant that the GHCJS developers had to maintain the GHC fork, develop GHCJS, and patch or contribute to Cabal and Stack. Simply put, the maintenance burden was much too high per developer. Examples of differences between GHCJS and GHC:

  • GHCJS had a double version (its own version and the version of GHC it was based on), and build tools had to deal with both
  • GHCJS used non-standard file extensions (e.g. .js_o and .js_a for objects and static libraries respectively) and custom file formats (still true for .o but no longer true for .a)

So instead of spending engineering time and energy responding to ecosystem changes and maintenance, the DevX team decided the best course of action was to enhance GHC's cross-compilation support and add a proper JavaScript backend based on GHCJS. We feel that this adds value to the entire Haskell ecosystem, keeps the JavaScript backend in sync with GHC, provides a better user experience for all, reduces maintenance costs, and greatly improves the backends in GHC in general. By implementing support for a JavaScript backend in GHC, we also improve GHC's support for cross-compilation (and testing cross-compilers), which is directly applicable to the WebAssembly, iOS, and Android backends in GHC.

Is GHCJS Dead?

Not yet! As it stands, the JavaScript backend doesn't provide all the features provided by GHCJS. In particular it doesn't support Template Haskell and we've removed the extended GHCJS FFI syntax to refine its design. See our roadmap below for more details.

Nevertheless GHCJS is unlikely to be updated to use a GHC version more recent than 8.10.x. So from our point of view it is in maintenance mode until the JavaScript backend totally subsumes its features. New maintainers who want to continue the development of GHCJS until its feature set has been fully subsumed by mainline GHC are of course welcome.

What is Missing From GHCJS?

The JavaScript backend borrows a lot of code from GHCJS, but not all of it. Here are the main differences between GHCJS and the JavaScript backend:

  1. GHCJS was stuck on GHC version 8.10 while the JavaScript backend follows GHC HEAD.

  2. GHCJS's incremental linking support ("base" bundles) hasn't been ported. This feature required too many changes (such as adding new command-line flags) and would have been backend-specific. This might be implemented in the future if it proves to be useful for the newer Template Haskell implementation, for example.

  3. GHCJS's JavaScript code optimizer hasn't been ported. The code was trying to do too much all at once and consequently was fragile and slow. We plan to work on an intermediate representation between STG and JavaScript to perform the same optimizations with better performance, maintainability, and reliability.

  4. GHCJS's compactor (link time optimizations) code hasn't been ported. Some optimizations have been reimplemented (e.g. global renaming of local identifiers), but some others are lacking (e.g. compacting initialization code). We plan to work on this as part of a larger effort on refactoring the code generator, the linker, and some aspects of the runtime system. More details are available in GHC issue #22352.

  5. GHCJS's hacky support for plugins hasn't been ported. Instead we implemented a new way to load plugins from shared libraries that works in any GHC cross-compiler. See #20964 and !7377.

    The common and convenient approach to load plugins still isn't supported by GHC when it is used as a cross-compiler (see #14335 for more details).

  6. GHCJS's support for Template Haskell hasn't been ported. GHCJS had its own implementation of an external interpreter (THRunner), which was used as an inspiration to implement GHC's external interpreter (IServ). While serving the same purpose, IServ is quite different from THRunner and can't be directly used as a substitute for it. Retrofitting THRunner into IServ is our next priority. More details are available at https://engineering.iog.io/2022-05-17-javascript-template-haskell-external-interpreter

  7. GHCJS supported an extended FFI import syntax allowing JavaScript code to be inlined (the FFI import string supported templates of JavaScript code with placeholders for arguments). This hasn't been ported because adding a JavaScript parser to GHC was difficult and complex, and the imported code made no safety guarantees whatsoever. For now, only JavaScript function calls are supported.

  8. Any command-line flag introduced by GHCJS has not been ported. We didn't make any change to GHC's command line in this work except for adding a -ddump-js flag. Other options will be added later as needed.

  9. The JavaScript backend itself hasn't been optimized and we even removed some undocumented uses of NFData from GHCJS's code. We intend to optimize the JavaScript backend in a principled way (e.g. by first gathering evidence with profiling).

What's on the JS Backend's Roadmap?

Our top priorities are:

  • Implementing Template Haskell support.
  • Reducing generated JavaScript code size.
  • Modernizing the generated JavaScript code. The code generator adapted from GHCJS does not use more modern JavaScript features such as fat-arrows (=>), symbols and let bindings. We aim for the JavaScript backend to emit JavaScript that comports with ECMA-262.
  • Enhancing the run-time performance of the generated code

What has Improved Compared to GHCJS?

Or, why did it take you so long to port a stripped-down GHCJS into GHC? While it may seem like such a task should be relatively quick (especially in a language with as good a refactoring story as Haskell), there were numerous roadblocks that we needed to remove before adding the backend. In particular, here were the troublesome bits:

Removing the Use of External Libraries

GHCJS used libraries that aren't already dependencies of GHC, such as text, lens, attoparsec, and aeson. As we didn't want to add new dependencies to GHC, we've refactored the code to avoid them. Examples:

  • we've replaced Text with GHC's ShortText (which provides a similar API) and finally with GHC's FastString in most cases (which is usually more performant).
  • we've replaced a lot of lens-heavy code with its non-lens equivalents, because GHC does not use lenses itself, and a design requirement was to stay within existing code conventions.
  • we've replaced pretty with GHC's pretty-printer (SDoc, etc.).
  • we've replaced binary with GHC's Binary instances.

GHCJS used to provide its own base and prim libraries: ghcjs-base and ghcjs-prim. We've merged those into the existing base and ghc-prim libraries.

Reusing GHC's Build System: Hadrian

GHCJS has a reputation for being complex to build. It relied on custom build scripts to deal with the GHC fork it used. The JavaScript backend, however, is as easy to build as any other GHC. It doesn't require any wrapper script, only the emconfigure tool provided by the Emscripten project.

With a fresh checkout of the GHC source tree, you can now build a GHC with the JavaScript backend with just these commands:

> ./boot
> emconfigure ./configure --target=js-unknown-ghcjs
> ./hadrian/build --bignum=native -j

Note that if this doesn't work, up-to-date instructions and troubleshooting tips can be found at https://gitlab.haskell.org/ghc/ghc/-/wikis/javascript-backend

The Hadrian build system has been adapted to support Cabal's js-sources stanzas, which are used to include user-provided .js files. Both the rts and base packages required this feature.

Support for Running GHC's Test Suite

GHC's entire test suite can now run and check the JavaScript backend! We had to tweak Hadrian to make this possible (to make Hadrian cross-compiler aware), but the test suite has already found some bugs that we have since fixed.

However, in order to merge for the GHC 9.6 release we had to disable many tests because of missing features (Template Haskell, Haskell Program Coverage (HPC), compact regions, etc.) or because the generated code would time out (not surprising given the missing optimizer and compactor).

But in the process of disabling those tests we've laid a good path forward. We've added more precise properties to the test suite, which indicate the required features to run each test. So when we implement some feature, it will be painless to re-enable all its tests. In addition, failing tests now have proper tickets in GHC's issue tracker.

We've spent some time trying to run the test suite on CI but this work wasn't ready in time to be included in the initial commit with the rest of the backend. For now, only some basic testing is done on CI: compiling a non-trivial program that uses the GHC library into JavaScript and executing it. Nevertheless, we have a merge request in the works so that future contributions should be properly validated by running the test suite on CI soon.

For the time being, the following command will run the test suite locally:

./hadrian/build --bignum=native -j2 test

We use -j2 to avoid running too many tests in parallel as this could allocate too much memory and fail, which isn't surprising as the JavaScript backend hasn't been optimized for memory usage yet.

Upgrading from GHC 8.10 to GHC 9.6

The latest version of GHCJS is based on a fork of GHC 8.10.7. We spent a significant amount of time adapting the code generator to support GHC HEAD. In practice this meant:

  • Adding support for new primops, especially sized primitives.
  • Adapting to ghc-bignum changes.
  • Adapting to internal changes.
  • Fixing support for polymorphism in kinds.
  • Fixing support for unlifted newtypes.
  • Fixing support for unboxed sums.
  • Many other fixes...

Fixing Some Performance Issues

As we haven't ported GHCJS's Compactor, output size was, predictably, incredibly large. So we've spent time re-implementing a crucial piece of the Compactor (renaming and shortening of local variables) using a different approach. Our new approach ended up being faster than GHCJS's compactor. For the GHC devs out there: since we had already replaced the Text type with the FastString type, the newer Compactor can now replace a FastString-based identifier with a new identifier derived from the FastString's Unique in constant time.

Removal of Custom File Extensions and Support for JavaScript Pragmas

GHCJS used the .js.pp file extension to identify JavaScript files that needed to be passed through CPP before being valid JavaScript. Adding support for this extension in both Hadrian and GHC proved to be more work than just adding support for JavaScript pragmas. So we decided to do the latter; similarly to Haskell extension pragmas, you can now write //#OPTIONS: CPP in your JavaScript files to enable the CPP pass, and the file extension is always .js.
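
As a sketch, a JavaScript source file opting into the CPP pass could look like this (the DEBUG macro and the h$myTrace helper are made up for the example):

//#OPTIONS: CPP
// Because of the pragma above, this file is run through CPP before linking.
#ifdef DEBUG
function h$myTrace(msg) { console.log(msg); }
#else
function h$myTrace(msg) { /* tracing disabled */ }
#endif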

While we're on the topic of file extensions, technically .js files don't have to be compiled into .o files (unlike C/C++/Haskell/etc. files) at all. However, build systems (Hadrian, Cabal...) and compilers (GHC) expect this. So for consistency with other backends, we've added a fake compilation pass for .js files too. They are now renamed into .o files with a //JAVASCRIPT header added to distinguish them from object files produced by the JavaScript backend (and from Emscripten, in the future).

Cleanup and Documentation

GHC provides some utilities (pretty-printer, binary serialization, string interning, etc.) that GHCJS did not make use of. So we adapted the GHCJS code to exploit these utilities, to keep the JavaScript backend similar to the other backends, and to improve performance.

Three of us (out of four) were totally new to GHCJS's code base. We strived to grok the code and to make it understandable by adding a lot of comments and refactoring. Throughout this process we logged what we learned on our engineering blog to explain some (sadly not all) of the technical details about GHCJS's internals.

Plugin Support in Cross-Compilers

GHC doesn't support plugins when built as a cross-compiler (cf #14335). This is because it cannot yet support two environments at once: one for the target code (JavaScript code here) and one for the host (e.g. native x86 or AArch64 code for the plugin). We've spent a lot of time making it more modular (see the Modularizing GHC white paper we published earlier this year and Sylvain's lightning talk at HIW 2022) but there is a lot more to do to achieve this (cf #17957).

GHCJS used a fragile hack to support plugins: at plugin loading time it would substitute the plugin unit with another corresponding one from another package database (For the non-GHC devs out there interested in GHC Units see this note). This was fragile because it could violate GHC's single environment assumptions.

GHCJS's hack did not get ported. Nevertheless we have implemented a new way for GHC to load plugins directly from libraries instead of packages (#20964/!7377). This method doesn't require GHC to load module interfaces for the plugin and its dependencies, hence it works around GHC's limitations.

What About Libraries Using C Sources?

Libraries that use C sources (c-sources Cabal stanza) aren't supported by the JavaScript backend. In the future we plan to use Emscripten to compile C sources and then to generate some adapter code for them, but this isn't done yet.

For now, there are two ways to fix libraries that use C sources. The C code can either be rewritten in Javascript, or it can be rewritten in Haskell. Then it is possible to use Cabal predicates (e.g. arch(js)) to select between the different versions.

We do have a preference for writing pure Haskell versions because it is more future proof. For example if someone adds some new backends for Lua, Java, CLR, etc. then the Haskell version can be directly compiled by that backend and there is no extra work. In contrast, if the C source is rewritten in JavaScript, then it would need to be rewritten for each backend.

That is the approach we took when we wrote the ghc-bignum library: ghc-bignum provides a "native" implementation written in Haskell that is functionally equivalent to the GMP-based implementation. Of course, besides being more future proof, the Haskell version is just more pleasant to write than the JavaScript version.

Note that GHCJS came with a "shim" library where a shim is JavaScript source code specifically for some package. For example, GHCJS provided shims for packages like text, process, and hashable. We do not intend the JavaScript backend to provide shims so these JavaScript sources will have to be upstreamed or reimplemented in Haskell.

Note that the linking behavior is different due to the interpreted nature of Javascript. In the JavaScript backend, we can link with libraries using foreign imports even if the imported functions don't exist. Instead of failing at link time (which is what usually happens with native code) a JavaScript exception is raised only when and if the imported function is called.

How to Help?

We have now reached our first milestone; anyone can easily build and test the JavaScript backend, and anyone can open bug reports or offer patches for the JavaScript backend on GHC's GitLab.

For those who offered their help this year: thank you! Until now it was difficult to split the work into independent tasks (one fix led to a new failure, which led to an architectural issue, etc.) and it was difficult to coordinate with people outside of our team. However, we're now in a much better position to discuss suggestions and to test/review patches in the spirit of open source.

tl;dr Just Tell Me How to Say Hello World

You need:

  • Emscripten version 3.14 or better. Be sure that your emscripten is bundled with either LLVM 15 or an up to date, patched LLVM 14.
  • Nodejs, latest stable version. Only if you want to run the compiled JavaScript with node.

Most Linux distributions will have the necessary LLVM patches. If you're on NixOS, you'll need to use llvm_git and hope for the best. This fork of ghc.nix will also be useful to you.

Check out the GHC source

git clone --recurse-submodules https://gitlab.haskell.org/ghc/ghc.git
cd ghc # ensure you are in the ghc source tree for the following commands

Update the submodules

git submodule update --init --recursive

Boot and Configure for JavaScript

./boot && emconfigure ./configure --target=js-unknown-ghcjs

You should see configure finish and report something similar:

----------------------------------------------------------------------
Configure completed successfully.

Building GHC version : 9.5.20220819
Git commit id : 08c3c4783c72d3173d79ccda2ac282e2d3e04e34

Build platform : x86_64-unknown-linux
Host platform : x86_64-unknown-linux
Target platform : js-unknown-ghcjs

Bootstrapping using : /nix/store/4bkmkc7c98m4qyszsshnw9iclzzmdn4n-ghc-9.2.3-with-packages/bin/ghc
which is version : 9.2.3
with threaded RTS? : YES

Using (for bootstrapping) : /nix/store/yzs8390walgk2rwl6i5li2g672hdn0kv-gcc-wrapper-11.3.0/bin/cc
Using clang : /nix/store/p894nlicv53firllwgrfxfi51jzckh5l-emscripten-3.1.15/bin/emcc
which is version : 15.0.0
linker options :
Building a cross compiler : YES
Unregisterised : NO
TablesNextToCode : YES
Build GMP in tree : NO
hs-cpp : /nix/store/p894nlicv53firllwgrfxfi51jzckh5l-emscripten-3.1.15/bin/emcc
hs-cpp-flags : -E -undef -traditional -Wno-invalid-pp-token -Wno-unicode -Wno-trigraphs
ar : /nix/store/p894nlicv53firllwgrfxfi51jzckh5l-emscripten-3.1.15/bin/emar
ld : /nix/store/p894nlicv53firllwgrfxfi51jzckh5l-emscripten-3.1.15/bin/emcc
nm : /nix/store/0dp0bfg9sncg7bjy389zwyg2gskknm6b-emscripten-llvm-3.1.15/bin/llvm-nm
objdump : /nix/store/zgvxnf9047rdd8g8kq2zxxm9k6kfqf8b-binutils-2.38/bin/objdump
ranlib : /nix/store/p894nlicv53firllwgrfxfi51jzckh5l-emscripten-3.1.15/bin/emranlib
otool : otool
install_name_tool : install_name_tool
windres :
dllwrap :
genlib :
Happy : /nix/store/ijdmyaj6i6hgx5ll0lxxgcm9b0xn8nma-happy-1.20.0/bin/happy (1.20.0)
Alex : /nix/store/qzgm2m7p7xc0fnyj4vy3jcmz8pvbg9p7-alex-3.2.6/bin/alex (3.2.6)
sphinx-build : /nix/store/27dk5i52465a4azjr2dqmrhyc0m4lpf2-python3.9-sphinx-4.5.0/bin/sphinx-build
xelatex : /nix/store/8jc2258h4nqzqjy303zzkssd3ip675pf-texlive-combined-2021/bin/xelatex
makeinfo : /run/current-system/sw/bin/makeinfo
git : /nix/store/vsr2cn15h7cbwd5vqsam2ab2jzwfbyf9-git-2.36.0/bin/git
cabal-install : /nix/store/cjmd2qv1b5pdw4lxh1aw4xwwy4ibnb2p-cabal-install-3.6.2.0/bin/cabal

Using LLVM tools
clang : clang
llc : llc
opt : opt

HsColour was not found; documentation will not contain source links

Tools to build Sphinx HTML documentation available: YES
Tools to build Sphinx PDF documentation available: YES
Tools to build Sphinx INFO documentation available: YES
----------------------------------------------------------------------

Be sure to verify that ar, ld, nm and friends point to the emscripten versions, i.e., the output shows <tool> : <some-path>-emscripten-<tool>.

Build the JavaScript backend

./hadrian/build --bignum=native -j

Now Compile Hello World

module Main where

main :: IO ()
main = putStrLn "Hello JS!"

$ <path-to-ghc-root-dir>/_build/ghc-stage1 -fforce-recomp Main.hs
$ ./Main
Hello JS!

Under the hood, Main is just a JavaScript program written as a script with nodejs as the interpreter. This means you can treat the compiled program like any other JavaScript program: you can load it into JavaScript tooling or hack on it by hand. This also means that all compiled programs, such as Main, are human-readable; for example, here are the first ten lines:

$ head Main
#!/usr/bin/env node
var h$currentThread = null;
var h$stack = null;
var h$sp = 0;
var h$initStatic = [];
var h$staticThunks = {};
var h$staticThunksArr = [];
var h$CAFs = [];
var h$CAFsReset = [];
var h$regs = [];

The program begins with a shebang instructing the operating system to send the rest of the file to nodejs. The remaining lines are our actual program, which starts with global variables that the runtime system, garbage collector, and scheduler need. Now human-readable is not the same as easy to understand, for example here is the logic that implements a Maybe:

function h$baseZCGHCziMaybeziJust_con_e() { return h$rs() };
function h$baseZCGHCziMaybeziJust_e() {
  var h$$13be2042 = h$r2;
  h$r1 = h$c1(h$baseZCGHCziMaybeziJust_con_e, h$$13be2042);
  return h$rs();
};
function h$baseZCGHCziMaybeziNothing_con_e() { return h$rs() };

If you would like to understand this code and how the JavaScript backend works in general please see our other blog posts. In any case, we invite you to try it out, hack, and be merry!

Acknowledgements

We want to thank Jan Hrcek and David Thrane Christiansen for their time, labor, comments, and suggestions on drafts of this blog post.

· 9 min read

Introduction

I recently gave a short presentation about heap object representation in GHCJS, and hence in the upcoming JS backend for GHC. This post is a summary of the content.

Heap objects

GHC implements Haskell code evaluation by using graph reduction. As such Haskell programs compiled by GHC use the heap to store nodes of the graph to be reduced and utility nodes participating in graph reduction. These nodes are:

  • FUN: functions with their free variables as payload
  • THUNK: suspensions with their free variables as payload
  • PAP: partial application to a FUN. FUN closure and already applied arguments as payload.
  • IND: indirection to another heap object
  • BLACKHOLE: used to overwrite a THUNK when it is being evaluated

The heap is also used to store other values:

  • CON: boxed values (saturated constructor applications) with field values as payload
  • Other unlifted values: TSO, BCO, arrays, MutVar#, MVar#, TVar#, stacks, stack frames...

Info tables

Many heap objects share the same properties: e.g. all Int CON objects are exactly the same except for their payload (the Int# value), which may differ. Hence heap objects are split into two parts to allow sharing of common properties:

  • info table: statically known properties (at compilation time) that can be shared by several heap objects
  • heap object itself: dynamically allocated in the heap

Heap objects always have the same layout in the native code generated by GHC. They are composed of:

  • a pointer to an info table
  • some words of payload

Heap traversal is done by following the info table pointer of every heap object and querying the info table for the layout of the heap object's payload.

Info tables contain a pointer to a function called "entry code" that can be specific to each info table. This code is mainly used to apply a node to some arguments. Note that with the tables-next-to-code optimisation enabled, to avoid an indirection the info table pointer is actually a pointer to this entry code, and the info table itself is stored in the words preceding the entry code.

Heap objects in JavaScript

GHCJS represents most heap objects with a JavaScript object having the following fields:

{ f, d1, d2, m, cc }

One question I had was: why don't we use a JS array instead of a JS object? Arrays should be faster than objects (i.e. hashmaps), no? It turns out that objects like this are optimised by JS engines using "hidden classes" (see https://v8.dev/blog/fast-properties for an explanation). That's why they are usually more efficient than arrays, for which bounds checking must be performed. Also, arrays are larger in memory because they need to store their size.

Let's now discuss the fields of the heap objects.

f field

"f" is the equivalent of the info table pointer. It contains a JavaScript function that is the entry code for the heap object.

Similar to the tables-next-to-code optimisation discussed above, we use the fact that JS functions are objects which have properties to store the info table fields as properties of the function itself.

Example of an info table / entry function:

[Function: h$entry_function_xyz]
{ t // (Int) object type
, size // (Int) number of fields in payload (-1 if variable layout)
, i // (Array) fields layout (empty if variable layout)
, n // (String) object name for debug
, a // (Int) function arity or constructor tag
, r // (Int) arity in number of JS variables
, s // (Array) static refs that must be kept alive (SRT)
, m // GC mark
}

d1, d2 fields

The d1 and d2 fields contain the payload of the heap object: constructor fields, function free variables, etc.

Payloads can be composed of zero, one, or many fields. A naive solution would be to have one JS object field (d1, d2, d3...) per payload field. However it would be bad for two reasons:

  • performance: JS engine hidden classes optimisation mentioned above needs objects to have the same field structure.

  • genericity: we couldn't write generic functions (e.g. to copy a closure) without dynamically querying the number of fields composing the payload.

Another solution would be to use a single field to store the whole payload. It would fulfill the genericity constraint. However performance may not be good because of the extra allocation of the object containing the payload and the indirection to access its fields.

Instead GHCJS uses a middle ground approach: it always uses only two JS object fields to store any number of payload fields. The following encoding is used to stash any number of payload fields into two JS fields:

Payload         d1     d2
[]              null   null
[a]             a      null
[a,b]           a      b
[a,b,c]         a      {d1=b, d2=c}
[a,b,c,d...]    a      {d1=b, d2=c, d3=d...}

It still fulfills the genericity constraint and small objects (up to two fields of payload) don't pay for an extra allocation/indirection. The price to pay is that two fields of payload are always allocated, even for objects with only one field of payload.
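
To make the encoding concrete, here is a rough sketch of heap objects carrying one, two, and three payload fields (someEntryFunction and the field values are placeholders, and the cc field is omitted as in non-profiling mode):

// [a]     : d1 = a, d2 = null
var con1 = { f: someEntryFunction, d1: a, d2: null, m: 0 };
// [a,b]   : d1 = a, d2 = b
var con2 = { f: someEntryFunction, d1: a, d2: b, m: 0 };
// [a,b,c] : d1 = a, d2 = an extra object {d1: b, d2: c}
var con3 = { f: someEntryFunction, d1: a, d2: { d1: b, d2: c }, m: 0 };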

It would be interesting to benchmark the performance of the different payload representations.

m field

The "m" field is used both for reachability checking (~ garbage collection) and to implement the "stable names" feature.

GHCJS can't rely on the JS engine to know when a heap object is collected. So it implements its own heap traversal algorithm for this. The "m" field is used as a marker for this algorithm (it will be the topic of a future blog post). In this case, the "m" field is a number (a GC mark).

When a StableName is created for an object, the "m" field of the object is updated to point to the StableName object:

[h$StableName]
{ m // GC mark
, s // stable name unique id
, ...
}

The "m" field of the StableName object is then used in place of the mark of the object.
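
A rough sketch of that relationship (the constructor call and field updates are only illustrative, not the exact runtime code):

// a generic heap object with a plain numeric GC mark
var obj = { f: someEntryFunction, d1: x, d2: null, m: 0 };
// creating a StableName for obj: the StableName carries the mark from now on,
// and the object's m field points at the StableName instead
var sn = new h$StableName(/* ... */);  // hypothetical construction
sn.m = obj.m;
obj.m = sn;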

cc field

The "cc" field is the cost center associated with the heap object. This field is only present when profiling mode is enabled. Cost centers are entered (pushed on the cost center stack of the current thread) before the evaluation of thunks and function applications.

Cost centers are allocated with the h$CC function.

Other heap object representation

The generic heap object representation presented above is only used for some objects: those involved in graph reduction (e.g. updatable objects) and values that don't have a fixed layout (e.g. CON objects have different layouts depending on which constructor they represent). The object layout allows generic access to the infotable and to the payload, and the infotable describes the object type and the payload layout.

Several other objects don't need this machinery: they always have the same layout and are never the result of a reduction (they are unlifted values). These objects are represented as JS objects with any fields they need (i.e. not using the d1/d2 encoding above). To determine the type of such heap objects, instead of using the "type" field of an infotable the code uses the instanceof operator. For example a TSO is represented as a h$Thread object.

Note that we could be tempted to give every heap object a different object name and to always use instanceof instead of the infotable "type" properties. It would mean adding h$Con, h$Thunk, h$Fun, h$Pap, h$Blackhole, and h$StackFrame objects. Then all the heap objects could be treated in the same way. However the issue is that these objects need to be overwritable in place: a Thunk becomes a Fun/Con/Pap/Blackhole, etc. As far as I know we can't update the "instance" of an object, so all these objects have to be instances of the same JS object.

Also note that the JS backend doesn't need INDirection nodes because it can always overwrite the fields of a JS object with the fields of another to update a closure. For the record, indirection nodes are needed in backends that layout closures as a chunk of bytes/words and when the size of the closure to update is smaller than the size of the updatee closure.
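
A minimal sketch of what such an in-place update amounts to (a hypothetical helper, not the actual runtime code):

// Overwrite the updatee's fields with those of the result closure: every
// existing reference to 'updatee' now sees the new value, so no IND node
// is needed.
function h$updateInPlace(updatee, result) {
  updatee.f  = result.f;
  updatee.d1 = result.d1;
  updatee.d2 = result.d2;
}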

Automatic unboxing

Sometimes the generic heap object representation is unnecessary. For example, a boxed Int would be represented as a CON heap object with the Int# in its payload, represented as a JavaScript number value. The only thing we can do with this heap object is to pass it around and to extract its payload. As such, it is more memory efficient to directly pass the payload (a JS number).

GHCJS provides an optimisation that consists in automatically unboxing some CON heap objects. For example, Haskell booleans (True and False datacons) are directly mapped to JavaScript booleans, boxed numbers (Float, Double, Int, Word, Int8, etc.) are directly mapped to JavaScript numbers.

We can do this because JavaScript already provides some boxing of its own: we can use the typeof operator on a heap object to know if it is a JS object, a JS number, a JS boolean, etc. It makes it possible to distinguish between heap object representations. In comparison, we can't do this with the native (non-JS) backend when we only have a pointer to a heap object: the pointer doesn't carry the kind of value it points to, hence the pointed memory location must be generic enough for this introspection to be performed (e.g. using infotable pointers).
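
A small sketch of the kind of discrimination this enables (h$describeValue is a hypothetical helper, not part of the runtime):

function h$describeValue(x) {
  if (typeof x === 'number')  return 'unboxed numeric value';     // e.g. Int, Double
  if (typeof x === 'boolean') return 'unboxed Bool (True/False)';
  return 'generic heap object: ' + x.f.n;  // .f.n is the info table's debug name
}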

Summary

Heap objects can be represented as JS values (number, boolean) because of the automatic unboxing, or as JS objects: discrimination is done with the typeof operator.

Heap objects represented as JS objects come in two flavours:

  • unlifted objects are represented with specific JS objects, discriminated with the instanceof operator
  • other objects use the following generic and updatable structure:
{ f, d1, d2, m, [cc] }

· 8 min read
  1. The Design Space
  2. GHCJS's FFI
  3. Lightweight safety checks
  4. Returning multiple values
  5. Changes in the FFI System for the JS Backend

Users of GHCJS enjoyed a rich FFI system for foreign JavaScript imports. However, this has changed during our adaptation of GHCJS to GHC 9.x. This short post goes over the GHCJS FFI system, the motivation for these changes and what the changes are. First, we must consider the design space of an FFI system.

The Design Space

FFI code is typically employed in high performance scenarios. Additionally, users of the FFI do not want to deal with the object language the compiler is compiling to. Instead, users want a simple way to call functions from the object language and use them in their own code as normal Haskell functions. However, users of the FFI system do tend to be power users, and so as a design principle we want to expose the tools they need to achieve their performance needs, whatever those needs may be. We can summarize these constraints as follows:

  1. The FFI must abstract the JavaScript backend's infidelities away as much as possible. That is, users of the FFI should not need to worry about the Int64# representation, but should be able to simply follow standard patterns we have written in base.
  2. The FFI must provide tools to achieve high performance code, even if those tools require up front knowledge of the runtime system to use. However, these tools should not be in the path of least resistance to use the FFI system.
  3. The FFI must provide a lightweight specification that users program against, both so that the JS backend can optimize the imported function and so that users get good error messages.

GHCJS's FFI sets a high (qualitative) benchmark on these three constraints. Let's inspect them each in detail, in no particular order.

GHCJS's FFI

In GHCJS, a user could take advantage of JavaScript functions in their Haskell code using GHCJS's FFI. However, the syntax was unique to GHCJS, with placeholder variables like one might see in Perl, Nix, or Bash. For example, here is a foreign import from the base library for st_size:

-- base/System/Posix/Internal.hs
-- the JS FFI version
foreign import javascript unsafe "$r1 = h$base_st_size($1_1,$1_2); $r2 = h$ret1;"
st_size :: Ptr CStat -> IO Int64

The syntax is different from what we know and love in the normal Haskell world but the grammar is straightforward. We declare a foreign import from javascript, state that the import is unsafe or interruptible, and then provide a string, h$base_st_size(...), for the code generator to use when compiling. Compare this with the C version:

-- base/System/Posix/Internal.hs
-- the C FFI version
foreign import ccall unsafe "HsBase.h __hscore_st_size"
st_size :: Ptr CStat -> IO Int64

And we see that they are similar. The only difference is the strange $n symbols in the referent string. Contrast this with the C version, which simply declares a name.

These symbols are placeholder variables with special meaning in GHCJS. There are two intractable reasons for the placeholder patterns. First, we require these patterns to work around the limitations of JavaScript as a backend (1). For example, consider the case where we need to return an Int64# from an imported foreign function. In C and Haskell this is not a problem because both can represent Int64# natively, but JavaScript only has native support for 32-bit values. Thus, to be able to return an Int64# we need a method to return two 32-bit numbers. Similarly, in order to apply a function to an Int64#, that function must take at least two arguments, one for the high bits and one for the low. Second, the referent string is untyped and can contain arbitrary JavaScript code. So placeholder patterns provide a simple and lightweight way to perform safety checks and eliminate classes of untyped, hard-to-understand errors. For example, consider an arity mismatch error between a function definition and call site. When this happens, JavaScript happily continues processing with the return value from the function application defined as NaN (of course). Such arity conflicts can easily occur, especially when dealing with 64-bit values which require function arity assumptions.

Lightweight safety checks

Lightweight safety checks (3) are done by GHCJS by parsing the names of the placeholder variables, each of which follows a specific naming convention. This convention is:

  • Argument types:
    • $n: Used for unary arguments, i.e., arguments which require only a single register.
    • $n_n: Used for binary arguments, i.e., arguments which require two registers.
    • $c: A continuation argument, only valid for interruptible foreign functions.
  • Return types:
    • $r: a unary return
    • $r1, $r2: a binary return
    • $r1, $r2, $r3_1, $r3_2: unboxed tuple return
  • Top level patterns:
    • "&value": simply emitted as value by the code generator
    • "someFunction": emitted as ret = someFunction(...), i.e., map the FFI to the result of the function call.
    • "$r = $1.f($2)": emitted as r1 = a1.f(a2), i.e., a combination of a function call and a property access.

With this standard, GHCJS then parses the FFI referent string to ensure that it conforms to the convention. If it doesn't, then GHCJS can at least respond to the user with an ill-formatted FFI message and say precisely where the issue is. For example, it could respond that only half of an Int64# is returned, based on the referent string and the function type.

Returning multiple values

But what of performant code? GHCJS achieves a performant FFI by not trying to abstract away from the runtime system. Instead, an advantage of GHCJS's FFI is that we can specify exactly which registers the foreign function should dump its results to, or even arbitrary global variables. This places more burden on the user of the FFI in specific scenarios, but crucially allows the FFI system to get out of the way of the user. The FFI system also exploits this capability to return multiple values from a single function call, which is a common need when compiling to JavaScript. For example, in the above code st_size is declared to return an IO Int64; the JavaScript handler h$base_st_size returns the Int64 using two registers, $r1 and $r2, but does so through the use of a special-purpose global variable called h$ret1:

function h$base_st_size(stat, stat_off) {
  h$ret1 = (stat.i3[(stat_off>>2)+2]);
  return (stat.i3[(stat_off>>2)+1]);
}

The function inputs a pointer and an offset. Pointers in GHCJS are simply pointers to ByteArrays, so the function indexes into the ByteArray, retrieves and stores the lower 32 bits in h$ret1, then returns the higher 32 bits directly. These results are picked up by the FFI code, which performs assignments to set $r1 to the result of the function call (the higher 32 bits) and $r2 to the value of h$ret1 (the lower 32 bits). Crucially, the runtime system needs to do nothing: the registers are already handled, ready to be consumed by whatever the caller of the foreign function will do.
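
In other words, the code generated for the call site is roughly equivalent to the following sketch (not the actual generated output):

// $r1 receives the function result (the higher 32 bits);
// $r2 receives the value left in the global h$ret1 (the lower 32 bits)
h$r1 = h$base_st_size(stat, stat_off);
h$r2 = h$ret1;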

One might consider using a simpler design, which trades register juggling for a more straightforward representation such as a ByteArray which stores the Int64#. However, such a design would trade speed for implementation simplicity. If we passed ByteArrays then each foreign function would spend time wrapping and unwrapping the array to get the payload; clearly an undesirable outcome for high performance code.

Changes in the FFI System for the JS Backend

So we see that GHCJS's FFI system actually performs quite well in the design space. Power users are well supported and can leverage enough unsafety to bind global variables like h$ret1 and specific registers such as $r1. The system provides some lightweight checking through parsing. The nuances of the JavaScript platform are generally abstracted over and the FFI system is tuned for performance critical scenarios. So why change it?

The short answer is to hit deadlines. By skipping the FFI parsing, the JS backend team was able to produce a working (it can output "Hello World!" and compile GHC's boot libraries), integrated JS backend in GHC faster than if we had finished the FFI system.

For the time being, we have opted to replace each foreign function call with a JavaScript fat arrow, for example:

foreign import javascript unsafe "(($1_1,$1_2) => { return h$base_st_size($1_1,$1_2); })"
st_size :: Ptr CStat -> IO Int64

Of course, this situation is untenable; as argued above, FFI code is assumed to be used in performance-critical code, and thus any extra overhead, such as a function closure and the consequent indirection, must be avoided. But fear not! In the near future we'll be overhauling the FFI system and returning it to its former glory.

· 6 min read

Introduction

I recently gave a short presentation on the workings of the GHCJS linker. This post is a summary of the content.

JavaScript "executables"

The task of a linker is collecting and organizing object files and resources into a loadable library or executable program. JavaScript can be run in various environments, for example the browser or node.js, and the concept of an executable doesn't make sense in all of them.

Therefore, when we link a Haskell program, we generate a jsexe directory filled with various files that allow us to run the JavaScript result:

File          Description
out.js        compiled/linked Haskell code
out.frefs.*   list of foreign calls from out.js
out.stats     source code size origin statistics for out.js
lib.js        non-Haskell code, from js-sources in packages and the RTS; possibly preprocessed
rts.js        generated part of the RTS (apply functions and similarly repetitive things)
runmain.js    a single line that just starts main
all.js        complete runnable program, created by combining out.js, lib.js, rts.js and runmain.js

Most of the work done by the linker is producing out.js, and that's what we'll be focusing on in the next sections.

Building out.js

The linker builds out.js by collecting all code reachable from main (and a few other symbols required by the RTS) and generating the required initialization code for all top-level data. The code is found in object files. These object files have the following structure:

Section        Description
Header         version number and offsets of the other sections
String table   shared string table, referred to by Dependencies and Code, to avoid duplication in file and memory
Dependencies   dependency data, internally between binding groups and externally to symbols in other object files
Code           compiled Haskell code stored as serialized JavaScript AST and metadata; code is organized in binding groups

The object files contain binding groups of mutually dependent bindings. These are the smallest units of code that can be linked. Each binding group has some associated metadata required for initialization of the heap objects in the group. The metadata contains for example constructor tags (e.g. 1 for Nothing, 2 for Just), the arity of functions and static reference tables.

From a high level, the procedure that the linker follows is this:

1. Read object files from dependencies into memory
2. Decode the dependency part of all object files (this includes reading the string tables)
3. Using the dependency data, find all code reachable from main (a sketch of this step follows the list)
4. Decode the reachable binding groups
5. Render the AST to JavaScript
6. Construct initializers from the metadata
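To illustrate step 3 only: the reachability computation is essentially a worklist-driven transitive closure over the dependency data. The sketch below uses deliberately simplified types (a plain map from a symbol to the symbols it depends on) and is not the linker's actual code:

-- Hypothetical sketch of the reachability step, with simplified types.
import qualified Data.Map.Strict as M
import qualified Data.Set as S

-- | All symbols transitively reachable from the roots (main plus the
-- symbols required by the RTS), given a map from symbols to their dependencies.
reachable :: Ord sym => M.Map sym [sym] -> [sym] -> S.Set sym
reachable deps = go S.empty
  where
    go seen []            = seen
    go seen (s:rest)
      | s `S.member` seen = go seen rest
      | otherwise         = go (S.insert s seen)
                               (M.findWithDefault [] s deps ++ rest)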

We avoid decoding (deserializing) the binding groups that do not end up in the linked result, to keep memory consumption lower. Still, the linker requires a lot of memory for larger programs, so we may need to make more improvements in the future.

The Compactorโ€‹

The compactor is an optional link-time transformation step that reduces code size. It consists of a lightweight (i.e. no expensive operations like dataflow analysis) rewrite of the code contained in the object files. The compactor is disabled when linking with the -debug flag. There are a few steps involved.

Renaming private symbolsโ€‹

Haskell names are quite long by default: they need to be globally unique, hence they contain their defining unit-id and module name. For example: mtl-2.2.2-somehash-Control.Monad.State.Lazy.execState_go1 (special characters are z-encoded, but that isn't shown here).

Private symbols are only referred to from within the same module. It doesn't matter which JavaScript name we pick for them, as long as there is no overlap between the names from different modules. The compactor renames all the private symbols using a global sequence to ensure short names that do not overlap.
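For illustration, here is a before/after sketch of the renaming. The identifiers are made up; the compactor's actual naming scheme may differ:

// Before compaction: a private symbol keeps its long, globally unique name
// (unit-id + module + binding, z-encoded in the real output).
var h$mtl_ControlziMonadziStateziLazy_execState_go1 = 0;

// After compaction: the same binding under a short name drawn from a global sequence.
var h$$a1 = 0;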

Block Initializerโ€‹

Without the compactor, the linker generates an h$initObj (or h$o) initialization call for each global Haskell heap value. The code for this can get quite big. The compactor collects all heap objects to be initialized in a single large array and encodes the metadata in a string. This makes the initialization code much more compact.

Deduplicationโ€‹

An optional step in the compactor is deduplication of code. When deduplication is enabled with the -dedupe flag, the compactor looks for functionally equivalent pieces of JavaScript in the output and merges them. This can result in a significant reduction of code size.

Incremental Linkingโ€‹

The linker supports building programs that are loaded incrementally. This is used for example for Template Haskell. The process that runs the Template Haskell stays alive during compilation of a whole module. When the first Template Haskell expression is compiled, it is linked against all its dependencies (including the RTS) and the resulting JavaScript code is sent over to be run in the evaluator process.

As subsequent Template Haskell expressions are evaluated in the same process, there is no need to load already loaded dependencies (including the RTS) again and it is much more efficient to avoid doing so. Therefore the linker keeps track of which dependencies have already been linked and each subsequent TH expression is only linked against dependencies that are not already loaded in the evaluator process.

It's also possible for users to use this functionality directly, with the -generate-base flag to create a "linker state" file along with the regular jsexe files. Another program can then be linked with -use-base=state_file, resulting in a program which leaves out everything already present in the first program.

Future Improvementsโ€‹

Memory consumption is the biggest problem in the linker at the moment. Possible ways to reduce it are compression, more efficient representation of the data structures, or more incremental loading of the parts of the object files that we need.

In terms of functionality, we don't take advantage of JavaScript modules yet. It would be good if we could improve the linker to support linking a library as a JavaScript module. We should also consider making use of foreign export javascript for this purpose.

ยท 11 min read
  1. GHC Primitives
    1. The Easy Cases
    2. ByteArray#, MutableByteArray#, SmallArray#, MutableSmallArray#,
    3. Addr# and StablePtr#
    4. Numbers: The Involved Case
      1. Working with 64-bit Types
      2. Unwrapped Number Optimization
    5. But what about the other stuff!

One of the key challenges in any novel backend is representing GHC primitive types in the new target. For JavaScript this is especially tricky, as JavaScript only has 8 primitive types, and some of those types, such as number, do not directly map to any Haskell primitive type, such as Int8#. This post walks through the most important GHC primitives and describes our implementation of each in the JavaScript backend. It is intended to be explanation-oriented: light on details, but with just enough to understand how the system works.

GHC Primitives

There are 36 primtypes that GHC defines in primops.txt.pp:

  1. Char#
  2. Int8#, Int16#, Int32#, Int64#, Int#
  3. Word8#, Word16#, Word32#, Word64#, Word#
  4. Double#, Float#
  5. Array#, MutableArray#, SmallArray#, SmallMutableArray#
  6. ByteArray#, MutableByteArray#
  7. Addr#
  8. MutVar#, TVar#, MVar#
  9. IOPort#, State#, RealWorld, ThreadId#
  10. Weak#, StablePtr#, StableName#, Compact#, BCO
  11. Fun, Proxy#
  12. StackSnapshot#
  13. VECTOR

Some of these are unsupported in the JS backend, such as VECTOR, or are lower priority, such as StackSnapshot#. We'll begin with the easy cases.

The Easy Casesโ€‹

The easy cases are the ones that are implemented as JavaScript objects. In general, this is the big hammer used when nothing else will do. We'll expand on the use of objects (especially for representing heap objects) in a future post, but for the majority of cases we mimic the STG-machine behavior for GHC heap objects using JavaScript objects. For example:

var someConstructor =
  { f  = // entry function of the data constructor worker
  , m  = 0 // garbage collector mark
  , d1 = // first data field of the constructor
  , d2 = // arity == 2: the second field
         // arity  > 2: an object { d1, d2, ... } holding the remaining fields
         //             (numbering restarts, so d2.d1 is the constructor's second argument)
  }

This is the general recipe; we define a JavaScript object whose properties hold whatever the heap object needs: the entry function (f, here the entry function of the constructor), some metadata for the garbage collector (m), and pointers to the fields of the constructor, or whatever else the heap object might need. Using JavaScript objects allows straightforward translations of several GHC types. For example TVars and MVars:

// stg.js.pp
/** @constructor */
function h$TVar(v) {
TRACE_STM("creating TVar, value: " + h$collectProps(v));
this.val = v; // current value
this.blocked = new h$Set(); // threads that get woken up if this TVar is updated
this.invariants = null; // invariants that use this TVar (h$Set)
this.m = 0; // gc mark
this._key = ++h$TVarN; // for storing in h$Map/h$Set
#ifdef GHCJS_DEBUG_ALLOC
h$debugAlloc_notifyAlloc(this);
#endif
}

// stm.js.pp
function h$MVar() {
TRACE_SCHEDULER("h$MVar constructor");
this.val = null;
this.readers = new h$Queue();
this.writers = new h$Queue();
this.waiters = null; // waiting for a value in the MVar with ReadMVar
this.m = 0; // gc mark
this.id = ++h$mvarId;
#ifdef GHCJS_DEBUG_ALLOC
h$debugAlloc_notifyAlloc(this);
#endif
}

Notice that both implementations define properties specific to the semantics of the Haskell type. JavaScript functions which create these objects follow the naming convention h$<something> and reside in shim files. Shim files are JavaScript files that the JS backend links against and are written in pure JavaScript. This allows us to save some compile time by not generating code which doesn't change, and to decompose the backend into JavaScript modules.

This strategy is also how functions are implemented in the JS-backend. Function objects are generated by StgToJS.Expr.genExpr and StgToJS.Apply.genApp but follow this recipe:

var myFUN =
  { f  = <the function itself>
  , m  = <garbage collector mark>
  , d1 = <free variable 1>
  , d2 = <free variable 2>
  }

To summarize: for most cases we write custom JavaScript objects which hold, as properties, whatever machinery is needed to satisfy the expected semantics of the Haskell type. This is the strategy that implements TVar, MVar, MutVar and Fun.

ByteArray#, MutableByteArray#, SmallArray#, MutableSmallArray#,โ€‹

ByteArray# and friends map to JavaScript's ArrayBuffer object. The ArrayBuffer object provides a fixed-length, raw binary data buffer. To index into the ArrayBuffer we need to know the type of data the buffer is expected to hold. So we make an engineering tradeoff: we allocate typed views of the buffer payload once, at buffer allocation time. This avoids allocating views later, when we might be handling the buffer in a hot loop, at the cost of slower initialization. For example, consider the mem.js.pp shim, which defines ByteArray#:

// mem.js.pp
function h$newByteArray(len) {
  var len0 = Math.max(h$roundUpToMultipleOf(len, 8), 8);
  var buf = new ArrayBuffer(len0);
  return { buf: buf
         , len: len
         , i3: new Int32Array(buf)
         , u8: new Uint8Array(buf)
         , u1: new Uint16Array(buf)
         , f3: new Float32Array(buf)
         , f6: new Float64Array(buf)
         , dv: new DataView(buf)
         , m: 0
         }
}

buf is the payload of the ByteArray#, len is the length of the ByteArray#. i3 to dv are the views of the payload; each view is an object which interprets the raw data in buf differently according to type. For example, i3 interprets buf as holding Int32, while dv interprets buf as a DataView and so on. The final property, m, is the garbage collector marker.
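As a usage sketch (assuming the shims above are loaded), the pre-allocated views let code read and write the same buffer at different types without creating new view objects:

var ba = h$newByteArray(16);             // 16-byte buffer plus its pre-allocated views
ba.i3[0] = 42;                           // write an Int32 at byte offset 0
ba.dv.setFloat64(8, 1.5, true);          // write a Double at byte offset 8, little-endian
console.log(ba.i3[0]);                   // 42
console.log(ba.dv.getFloat64(8, true));  // 1.5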

Addr# and StablePtr#โ€‹

Addr# and StablePtr# are implemented as a pair of a ByteArray# and an Int# offset into the array. We'll focus on Addr#, because StablePtr# uses the same implementation, except that StablePtr#s are tracked in the global variable h$stablePtrBuf. Addr#s do not have an explicit constructor; rather, they are implicitly constructed. For example, consider h$rts_mkPtr, which creates a Ptr that contains an Addr#:

function h$rts_mkPtr(x) {
  var buf, off = 0;
  if(typeof x == 'string') {
    // a JavaScript string: UTF-8 encode it into a fresh buffer
    buf = h$encodeUtf8(x);
    off = 0;
  } else if(typeof x == 'object' &&
            typeof x.len == 'number' &&
            x.buf instanceof ArrayBuffer) {
    // already a wrapped buffer (a ByteArray#-style object): reuse it as-is
    buf = x;
    off = 0;
  } else if(x.isView) {
    // an array view: wrap the underlying buffer and keep the view's byte offset
    buf = h$wrapBuffer(x.buffer, true, 0, x.buffer.byteLength);
    off = x.byteOffset;
  } else {
    // a raw ArrayBuffer: wrap it directly
    buf = h$wrapBuffer(x, true, 0, x.byteLength);
    off = 0;
  }
  return (h$c2(h$baseZCGHCziPtrziPtr_con_e, (buf), (off)));
}

The function does some type inspection to check for the special case of a string. If we do not have a string, then a Ptr, which contains an Addr#, is returned. The Addr# is implicitly constructed as a wrapped buffer together with an offset into that buffer. The object case is effectively idempotent: if the input is already a wrapped buffer, it is reused as-is. The cases which do the work are the ones that call h$wrapBuffer:

// mem.js.pp
function h$wrapBuffer(buf, unalignedOk, offset, length) {
if(!unalignedOk && offset && offset % 8 !== 0) {
throw ("h$wrapBuffer: offset not aligned:" + offset);
}
if(!buf || !(buf instanceof ArrayBuffer))
throw "h$wrapBuffer: not an ArrayBuffer"
if(!offset) { offset = 0; }
if(!length || length < 0) { length = buf.byteLength - offset; }
return { buf: buf
, len: length
, i3: (offset%4) ? null : new Int32Array(buf, offset, length >> 2)
, u8: new Uint8Array(buf, offset, length)
, u1: (offset%2) ? null : new Uint16Array(buf, offset, length >> 1)
, f3: (offset%4) ? null : new Float32Array(buf, offset, length >> 2)
, f6: (offset%8) ? null : new Float64Array(buf, offset, length >> 3)
, dv: new DataView(buf, offset, length)
};
}

h$wrapBuffer is a utility function that does some offset checks and performs the allocation for the typed views as described above.
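As a usage sketch (assuming the relevant shims are loaded; the d1/d2 field names below follow the general heap-object recipe described earlier, so treat them as an assumption rather than a stable API):

// Build a Ptr from a JavaScript string; the string is UTF-8 encoded into a fresh buffer.
var p = h$rts_mkPtr("hello");
console.log(p.d2);       // the offset, 0 in this case
console.log(p.d1.len);   // the length of the wrapped, encoded buffer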

Numbers: The Involved Caseโ€‹

Translating numbers has three issues. First, JavaScript has no concept of fixed-precision 64-bit types such as Int64# and Word64#. Second, JavaScript bitwise operators only support signed 32-bit values (except the unsigned right shift operator, of course). Third, numbers are atomic values and do not require any special properties for correct semantics, so wrapping them in objects gains us nothing at the cost of an indirection.

Working with 64-bit Typesโ€‹

To express 64-bit numerics, we simply use two 32-bit numbers, one to express the high bits, one for the low bits. For example, consider comparing two Int64#:

// arith.js.pp
function h$hs_ltInt64(h1,l1,h2,l2) {
if(h1 === h2) {
var l1s = l1 >>> 1;
var l2s = l2 >>> 1;
return (l1s < l2s || (l1s === l2s && ((l1&1) < (l2&1)))) ? 1 : 0;
} else {
return (h1 < h2) ? 1 : 0;
}
}

The less-than comparison function expects four inputs, two for each Int64# in Haskell. The first number is represented by h1 and l1 (high and low bits), and similarly the second number is represented by h2 and l2. The comparison is straightforward: we check the high bits for equality; if they are equal, we compare the low bits while being careful with signedness. No surprises here.
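For intuition, the (high, low) pair encodes the value high * 2^32 + low-interpreted-as-unsigned. The hypothetical helper below (not part of the backend) makes that relationship explicit; it is only exact while the value fits in JavaScript's 2^53 integer range:

// Hypothetical helper, for illustration only: combine a (high, low) pair back
// into a single JavaScript number. The high word is signed, the low word unsigned.
function exampleInt64ToNumber(h, l) {
  return h * 4294967296 + (l >>> 0);
}

console.log(exampleInt64ToNumber(0, 7));    // 7
console.log(exampleInt64ToNumber(-1, -1));  // -1 (all 64 bits set)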

For the bitwise operators we store both Word32# and Word# as 32-bit signed values, mapping any value greater than or equal to 2^31 to a negative value. This way we stay within the signed 32-bit range, even though in Haskell these types only contain nonnegative values.
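This mapping falls out of plain JavaScript semantics; the snippet below only demonstrates the language behaviour the representation relies on, not code generated by the backend:

var w = 0x80000000;            // the Word# value 2^31
var stored = w | 0;            // -2147483648: the same 32 bits, viewed as signed
var recovered = stored >>> 0;  // 2147483648: the unsigned view, recovered when needed
console.log(stored, recovered);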

Unwrapped Number Optimizationโ€‹

The JS backend uses JavaScript values to represent both Haskell heap objects and unboxed values (note that this isn't the only possible implementation, see 1). As such, it doesn't require that all heap objects have the same representation (e.g. a JavaScript object with a "tag" field indicating its type) because we can rely on JS introspection for the same purpose (especially typeof). Hence this optimization consists in using a more efficient JavaScript type to represent heap objects when possible, and falling back on the generic representation otherwise.

This optimization particularly applies to Boxed numeric values (Int, Word, Int8, etc.) which can be directly represented with a JavaScript number, similarly to how unboxed Int#, Word#, Int8#, etc. values are represented.

Pros:

  • Fewer allocations and indirections: instead of one JavaScript object with a field containing a number value, we directly have the number value.

Cons:

  • More complex code to deal with heap objects that can have different representations

The optimization is applicable when:

  1. We have a single data type with a single data constructor.
  2. The constructor holds a single field that can only be a particular type.

If these invariants hold, then we remove the wrapping object and instead refer directly to the value held by the constructor. Int8 is the simplest case for this optimization. In Haskell we have:

data Int8 = Int8 Int8#

Notice that this definition satisfies the requirements. A direct translation in the JS backend would be:

// An Int8 thunk represented as an object with an entry function, f,
// and payload, d1.
var anInt8 = { d1 = <Int8# payload>
             , f  = <entry function which would scrutinize the payload>
             }

We can operationally distinguish between a Thunk and an Int8 because they are given distinct types in the StgToJS pass of GHC and have distinct runtime representations (object vs number). In contrast, in Haskell an Int8 may actually be a Thunk until it is scrutinized, at which point it becomes the Int8 payload (i.e., call-by-need). This means that we always know when we have an Int8 rather than a Thunk, and therefore we can omit the wrapper object and convert this code to just:

// no wrapper object, just the payload
var anInt8 = <Int8# payload>

For the interested reader, this optimization takes place in the JavaScript code generator module GHC.StgToJS.Arg, specifically the functions allocConStatic, isUnboxableCon, and primRepVt.
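The following is a hypothetical sketch (names invented for this post, not generated code) of how runtime code can tell the two representations apart using typeof, as described above:

// Hypothetical illustration only: an unwrapped Int8 is a plain number,
// while a thunk is still a heap object.
function exampleIsEvaluatedInt8(x) {
  return typeof x === 'number';   // number => already the Int8# payload
                                  // object => a thunk (or other heap object)
}

console.log(exampleIsEvaluatedInt8(42));                 // true
console.log(exampleIsEvaluatedInt8({ f: null, m: 0 }));  // false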

But what about the other stuff!โ€‹

  • Char#: represented by a number, i.e., the code point
  • Float#/Double#: both represented as a JavaScript double. This means that Float# carries excess precision, so we do not generate exactly the same answers as other platforms that are IEEE 754 compliant; full emulation of single-precision floats does not seem to be worth the effort as of writing. When Float# values are stored in a ByteArray#, each one takes 4 bytes, so at that point the precision is reduced to a genuine 32-bit float.

  1. An alternative approach would be to use some JS ArrayBuffers as memory blocks into which Haskell values and heap objects would be allocated. As an example, this is the approach used by the Asterius compiler. The RTS would then need to be much more similar to the C RTS, and the optimization presented in this section wouldn't apply because we couldn't rely on introspection of JS values.

ยท 4 min read

Introductionโ€‹

I recently gave a short presentation on the topic of threads in GHCJS to the GHC team at IOG. This blog post is a summary of the content.

JavaScript and Threadsโ€‹

JavaScript is fundamentally single threaded. There are ways to share specific data between tasks but it's not possible to run multiple threads that have access to a shared memory space of JavaScript data.

The single JavaScript thread is often responsible for multiple tasks. For example, a node.js server handles multiple simultaneous connections, and a web application may be dealing with user input while downloading new data in the background.

This means that any single task should take care never to block execution of the other tasks. JavaScript's canonical answer is asynchronous programming: a function reading a file returns immediately, without waiting for the file data to be loaded into memory; when the data is ready, a user-supplied callback is called to continue processing it.
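For example, in ordinary Node.js code (plain JavaScript, nothing backend-specific):

const fs = require('fs');

fs.readFile('input.txt', (err, data) => {        // returns immediately
  if (err) throw err;
  console.log('loaded', data.length, 'bytes');   // runs later, once the data is ready
});

console.log('this line runs before the file has been read');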

Haskell Threadsโ€‹

Concurrent Haskell supports lightweight threads through forkIO. These threads are scheduled on top of one or more operating system threads. A blocking foreign call blocks an OS thread, but other lightweight threads can still run on other OS threads if available.

There is no built-in support for foreign calls with a callback in the style of JavaScript. Functions imported with foreign import ccall interruptible can be interrupted by sending an asynchronous exception to the corresponding lightweight thread.

Lightweight Threads in JavaScriptโ€‹

GHCJS implements lightweight threads on top of the single JavaScript thread. The scheduler switches between threads and handles synchronization through MVar and STM as expected from other Haskell platforms.

Foreign calls that don't block can be handled in the usual way. We extend the foreign function interface with a new type foreign import javascript interruptible that conveniently supports the callback mechanism used by JavaScript frameworks. The foreign call is supplied with an additional argument $c representing a callback to be called with the result when ready. From the Haskell side the corresponding lightweight thread is blocked until $c is called. This type of foreign call can be interrupted with an asynchronous exception to the lightweight Haskell thread.
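As a small, hedged example of the idiom (this particular import is written for this post, not taken from a library; it assumes setTimeout is available in the JavaScript environment):

-- Hypothetical example import: block the calling lightweight thread for the given
-- number of milliseconds; the thread resumes when JavaScript invokes $c.
foreign import javascript interruptible
  "setTimeout($c, $1);"
  js_delayMillis :: Int -> IO ()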

By default, Haskell threads in the JS environment run asynchronously. A call to h$run returns immediately and starts the thread in the background. This works for tasks that do not require immediate action. For situations that require more immediate action, such as dealing with event handler propagation, there is h$runSync. This starts a synchronous thread that is not interleaved with other tasks. If possible, the thread runs to completion before the call to h$runSync returns. If the thread blocks for any reason, such as waiting for an MVar or a foreign import javascript interruptible call, synchronous execution cannot complete. The blocking task is then either interrupted with an exception or the thread is "demoted" to a regular asynchronous thread.

Black Holesโ€‹

When a Haskell value is evaluated, its heap object is overwritten by a black hole. This black hole marks the value as being evaluated and prevents other threads from doing the same. "black holing" can be done either immediately or "lazily", when the garbage collector is run. GHCJS implements immediate blackholing.

Black holes give rise to an interesting problem in the presence of synchronous and asynchronous threads. Typically, if we use h$runSync, we want some guarantee that at least part of the task will run successfully without blocking. For the most part it's fairly clear which parts of our task depend on potentially blocking IO or thread synchronization. But black holes throw a spanner in the works: suddenly any "pure" data structure can be a source of blocking if it is under evaluation by another thread.

To regain some predictability and usability of synchronous threads, the h$runSync scheduler can run other Haskell threads in order to "clear" a black hole. The process ends when all black holes have been cleared, or when any of them is impossible to clear because of a blocking situation.

This all happens transparently to the caller of h$runSync: if the black holes could be cleared, it appears as if they were never there.

Conclusionโ€‹

We have lightweight Haskell threads in the single-threaded JavaScript environment and extend the foreign function interface to easily support foreign calls that depend on an asynchronous callback. This way, only the Haskell lightweight thread blocks.

By default, Haskell threads are asynchronous and run in the background: the scheduler interleaves the tasks and handles synchronization between threads. For situations that require immediate results or actions there are synchronous threads. Synchronous threads cannot block and are not interleaved with other tasks, except when a black hole is encountered.