%!TEX root = ../Main.tex
\section{System-F2 and its encodings}
Coming from a Haskell background, we assert that System-F2 with algebraic data is (or should be) the foundation of all modern compilers. By ``System-F2'' we mean the \emph{second order polymorphic lambda calculus}, or System-F$_\omega$ with higher kinds but without type operator abstraction~\cite{Pierce:TAPL}. We reject type operator abstraction because type checking a language including it requires evaluation at the type level. Evaluation at the type level is a common feature of dependently proof languages like Gallina~\cite{Coq:manual} and Twelf~\cite{Pfenning:Twelf}, but not present in traditional general purpose languages.
System-F2 is a sweet spot in the design space of explicitly typed functional languages: easy to define, easy to transform, and easy to type check, with the compilation down to object code being well understood \cite{Marlow:push-enter}. One visible client is the Glasgow Haskell Compiler (GHC), whose current intermediate language, System-Fc \cite{Sulzmann:SystemFc}, is a conservative extension of System-F2. The FLINT~\cite{Shao:FLINT} compiler is based on a simpler variation of it.
% -----------------------------------------------------------------------------
\subsection{System-F2 with algebraic data}
The language we formalize is shown in Fig~\ref{f:SystemF2}. The ``ambient calculus'' of \cite{System-F2} is standard and we refer the reader to \cite{Pierce:TAPL} for a more complete exposition. The addition of algebraic data is less commonly presented, though \CITEIT{System-Fc} contains the bare typing rules. Fig~\ref{f:SystemF2} is essentially the language used in Santos's thesis~\cite{Santos:compilation}, though we use a call-by-value evaluation semantics instead of call-by-need because it is closer to our own line of work \cite{Lippmeier:witnessing}. We follow the presentation of Santos \cite{Santos:compilation} because it is the closest to ``classic GHC core'', before it was extended into \mbox{System-Fc} to support GADTs and other features. Formalizing full \mbox{System-Fc} is out of scope for this paper.
\TODO{more description}
We write $T$ for type constructors. We write $K_{m,n}~ \ov{\tau}^{\;m}~ \ov{t}^{\;n}$ for data constructors with $m$ type arguments and $n$ term arguments; write $prim_n~ \ov{t}^{\;n}$ for monomorphic primitive operations (primops), and $lit$ for primitive literals. Following GHC we use a @#@ suffix to mark the names of primitive operators and literals. As in typical compiler implementations, data constructors and primops are kept fully applied (saturated) so there is a regular way to generate code for them \cite{Marlow:push-enter}. For example, given a primop @addInt#@ which performs unboxed addition, if it is fully applied we can generate a single machine instruction for it, and there is no risk that the program will try to store something like ~@addInt# : Int# -> Int# -> Int#@~ in say, the head of a list. We write @Nat#@ to mean the type of primitive, unboxed natural numbers, and @#@ to mean any of @0#@, @1#@, @2#@ ...
Importantly, although Santos's work also requires the arguments to functions, data constructors and primops to be atoms (variables or values), we lift this restriction to gain a more general language. In a compiler implementation it is helpful to allow nested applications as an intermediate state, only converting to a-normal form occasionally. However, by lifting this restriction we introduce new evaluation contexts, which cause a problem in the formalization that we will see in the next section. We discuss a subset of the typing rules informally over the next few sections, though leave the full set to \REF and \REF where we present the concrete formal version.
\begin{figure}
\begin{tabbing}
M \= MMMMMMMMM \= MMx \= MM \= MMMMMMMMMMMMMMM \= \kill
\> (names)
\> $\tt{a}$ \> $\to$ \> $<$type variables$>$
\\
\> \> $\tt{v}$ \> $\to$ \> $<$expression variables$>$
\\
\> \> $\tt{T}$ \> $\to$ \> $<$type constructors$>$
\\
\> \> $\tt{K}$ \> $\to$ \> $<$data constructors$>$
\\
\\
\> (kinds)
\> $\tt{k}$ \> ::= \> $* \bar \tt{k} \to \tt{k}$
\\[1ex]
\> (types)
\> $\tt{t}$ \> ::=
\> $ \tt{a}
\bar \forall \tt{a} : \tt{k}.~ \tt{t}
\bar \tt{t}~ \tt{t}$
\\
\> \> \> $\bar$
\> $ \tt{T}
\bar (\to)
\bar \tt{Nat\#}
\bar \tt{Bool\#}
\bar ...$
\\[1ex]
\> (expressions)
\> $\tt{x}$ \> ::=
\> $ \tt{v}
\bar \Lambda \tt{a} : \tt{k}.~ \tt{x}
\bar \tt{x}~ \tt{t}
\bar \lambda \tt{v} : \tt{t}.~ \tt{x}
\bar \tt{x}~ \tt{x}$
\\
\> \> \> $\bar$
\> $ \tt{K}_{m,n}~ \ov{\tt{t}}^{\;m}~ \ov{\tt{x}}^{\;n}
\bar \tt{case}~ \tt{x}~ \tt{of}~ \ov{\tt{alt}}$
\\
\> \> \> $\bar$
\> $ \tt{prim}_n~ \ov{\tt{x}}^{\;n}
\bar \tt{lit}$
\\[1ex]
\> (alternatives)
\> $\tt{alt}$ \> ::=
\> $ \tt{K}_n~ \ov{\tt{x}}^{\;n} \To \tt{t}$
\\
\> (primitives)
\> $\tt{prim}$ \> ::=
\> $ \tt{addInt\#}_2
\bar \tt{isZero\#}_1
\bar ...$
\\
\> (literals)
\> $\tt{lit}$ \> ::=
\> $ \tt{\#}
\bar \tt{tt\#}
\bar \tt{ff\#}
\bar ...$
\end{tabbing}
\caption{System-F2 with algebraic data}
\label{f:SystemF2}
\end{figure}
% -----------------------------------------------------------------------------
\begin{figure}
@(WfT) @ $@ke@ \vdash @t@~~ \tbf{wf}$\\
Under kind environment @ke@, type @t@ is well formed (closed).
\flushleft
@(KIND) @ $\tt{ke} \vdash_k \tt{t} :: \tt{k}$\\
With kind environment @ke@, type @t@ has kind @k@.
\flushleft
@(WfX) @ $@ke@ ~|~ @te@ \vdash @x@~~ \tbf{wf}$\\
Under kind environment @ke@ and type environment @te@, type @t@ is well formed (closed).
\flushleft
@(DEFSOK) @ $@ds@ \vdash \tbf{ok}$ \\
Data type definitions @ds@ are valid.
\flushleft
@(TYPE) @ $@ds@ ~|~ @ke@ ~|~ @te@ \vdash_t @x@ :: @t@$\\
With data type definitions @ds@, kind environment @ke@ and type environment @te@,
expression @x@ has type @t@.
\flushleft
@(STEP) @ $@x@ \longrightarrow @x'@$\\
Expression @x@ takes a single evaluation step
and produces a new expression @x'@
\flushleft
@(STEPS) @ $@x@ \stackrel{*}\longrightarrow @x'@$\\
Expression @x@ takes zero or more evaluation steps
and produces a new expression @x'@.
\flushleft
@(STEPSL) @ $@x@ \stackrel{\tiny{\trm{left}} *}\longrightarrow @x'@$\\
Expression @x@ takes zero or more evaluation steps
and produces a new expression @x'@. \\
(left linearized version).
\flushleft
@(EVAL) @ $@x@ \downarrow @x'@$\\
Expression @x@ evaluates to expression @x'@.
\flushleft
@(EVALS) @ $\ov{@x@} \Downarrow \ov{@x'@}$\\
The list of expressions $\ov{@x@}$ evaluate to the corresponding list of expresions $\ov{@x'@}$.
\flushleft
@(WnfX) @ $\tbf{wnf}~ @x@$\\
Expression @x@ is in weak normal form.
\caption{Main Judgements}
\label{f:Judgements}
\end{figure}
% -----------------------------------------------------------------------------
\subsection{Main judgements and theorems}
The main judgements are shown in Fig~\ref{f:Judgements}, along with the name we use in the Coq formalization. The definitions of \TODO{this one} and \TODO{this one} are in Figures \TODO{add figure} respectively. See the full Coq formalization for the others. The data definitions @ds@ map the types of data constructors to their associated data type. The @DEFSOK@ judgement enforces well formedness conditions, like the types of all data constructors must be closed, and there must be at least one data constructor for each data type. We provide small step semantics via @STEP@, @STEPS@ and @STEPSL@ as well as big step semantics via @EVAL@ and @EVALS@. The @STEPSL@ judgement is a left linearised version of @STEPS@ and is used when converting a reduction expressed in small step semantics to big step.
The Coq formalisation provides a statement of syntactic soundness, expressed as the following progress and preservation theorems after \cite{Fellisen}.
\begin{tabbing}
MM \= MMMMM \= MMMM \kill
If \> $@ds@ \vdash \tbf{ok}$ \\
and \> $@ds@ ~|~ \emptyset ~|~ \emptyset \vdash_t @x@ :: @t@$ \\
then \> $x \vdash \tbf{value}$ ~~or~~ $@x@ \longrightarrow @x'@$
\end{tabbing}
% -----------------------------------------------------------------------------
\subsection{Saturated applications}
Although common in compiler implementations and informal presentations, requiring data constructors and primitive operators to be fully saturated complicates the formal encoding of the language, as well as proofs about it. The problem is with the evaluation contexts it introduces, with the full specification for the language in Fig~\ref{f:SystemF2}
\subsection{Encoding of case-expressions}
Show case/brn form. When you use this encoding rather than a list of alternatives, then can't use existing lemmas about lists etc. Give examples of proofs that reimplement these in a non-reuable way.
AURA proof uses stepmatchesctr to collect arguments from an applied data constructor. Say we abstract this out into joint evaluation contexts. \TODO{Check step-matches-fully-applied}.
\begin{small}
\begin{code}
Lemma steps_in_XCon
: forall xs ts vs dc
, Forall2 STEPS xs vs
-> Forall wnfX vs
-> STEPS (XCon dc ts xs) (XCon dc ts vs).
\end{code}
\end{small}
As we intend the formalization to be used as the basis for a compiler, we use de Bruijn indices for binders due to their concreteness.
\TODO{Mention other compilers that use these, Isabelle, Agda?, Accelerate, DDC}.
Using a list of alternatives requires indirect mutual recursion.
Show rule with simultaneous substitution from System-Fc paper.
Doing this eliminates \emph{indirect mutual recursion} in the language definition via the @list@ type, while giving up the ability to reuse polymorphic list utility functions.
% -----------------------------------------------------------------------------
\subsection{Evaluation Contexts}
Reduction of evaluation contexts. Show evaluation contexts of data constructors with fixed arity. Needs notion of joint evaluation context. If you represent this directly as a list of expressions, then can use resable library to find the expression to step.
% -----------------------------------------------------------------------------
\begin{itemize}
\item Encoding of case-expressions.
\item Simultaneous substution.
\end{itemize}