Monday, January 6, 2014

Grogix for programming language error-reporting


[This is an actual case study, heavily disguised.]
The social structure


Let's imagine a programming language called Yazyka. 

Let's say the Yazyka compiler is not built to give the user detailed assistance or instruction in the use of the Yazyka language. Essentially, its error messages aren't useful to a human being. This is a common problem among production language compilers. 

Say that, because of incredible external pressures on the compiler team, the compiler will never be able to generate this human-friendly instruction itself.

So it might be best to build a separate 'check-module', which the system can use before or during compilation. It would play a role something like ‘lint’, but with far more sophisticated analysis and pervasive user-friendliness. Like ‘lint’, it would also help to flesh out documentation and instructional materials for the Yazyka language.

One state-of-the-art solution is to build this check-module with grogix.

Grogix is an unusual computer language, but it is well suited to this particular kind of work. 

Explanation

Say there is a type of statement in Yazyka whose form is:

(A)       notify [x] with [y] on [z] ;
           
This can look like, say:

            notify alertPort with “$D” on ALERT_STRINGS;

... and an infinite set of other similar cases with different specifics.

But there is also an infinite set of incorrect cases (a larger infinite set, actually) which do not fit this statement's form (I would hesitate to even call it syntax), any of which could be generated by the user while attempting to write a correct statement of this form.

Let’s look at a tiny variant from the correct form.

*          notify alertPort with $D on ALERT_STRINGS;

In this example, the user has used a notation ($D) which turns ‘D’ into a string. But even if ‘D’ is otherwise correct, it turns out that ‘$D’ must be used inside of quotation marks.

The existing Yazyka compiler is ‘aware’ that this is incorrect. But, in its analysis, the "reasons" it has for rejection of this case are not intelligible to a human, and cannot help a human to uncover the mistake that has been made.

For this simple mistake, the current compiler should provide helpful feedback like:

The $ operator is only for use within quotation marks, e.g.: "$D"

or:

In the notify statement, the stream ALERT_STRINGS requires a string 
in the with clause. Found $D instead.

or:

“With” should be followed by a string

… et cetera ...
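
To make the first of these messages concrete, here is a minimal Python sketch of a check that could produce it. It is not grogix and not part of the Yazyka compiler; the function names are hypothetical, and it assumes plain ASCII double quotes and that a ‘$’ name is made of letters, digits and underscores.

```python
def find_unquoted_dollar(line):
    """Return the column of the first '$' outside double quotes, or None."""
    in_quotes = False
    for column, char in enumerate(line):
        if char == '"':
            # Assumption: quotes are plain ASCII and are never escaped.
            in_quotes = not in_quotes
        elif char == "$" and not in_quotes:
            return column
    return None


def check_dollar_usage(line):
    """Return a friendly message if '$' appears outside quotes, else None."""
    column = find_unquoted_dollar(line)
    if column is None:
        return None
    # Collect the name that follows '$' so the hint can echo it back.
    end = column + 1
    while end < len(line) and (line[end].isalnum() or line[end] == "_"):
        end += 1
    name = line[column:end]
    return (
        "The $ operator is only for use within quotation marks, "
        'e.g.: "' + name + '"'
    )
```

Given the incorrect line above, check_dollar_usage returns the first suggested message; given the correct form with "$D" in quotes, it returns None.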

But, instead, the Yazyka compiler responds with these kinds of errors:

Multiple markers at this line
            - unexpected keyword: with
            - unexpected keyword: (with) in expression
            - unexpected keyword: notify
            - unexpected keyword: (notify) in action
            - unexpected keyword: on
            - unexpected keyword: (on) in action
            - constructor $ should have 1 arguments

Of course there are valid internal reasons for these messages, related to other work the Yazyka compiler needs to do. The compiler is optimized to build fast, reliable code, not to teach people the Yazyka language. This is also true for most other compilers.

This is why we advocate a language check-module. It’s intended to serve two purposes:

1. maintain an independent, human-verified, approachable formal language definition
2. make use of this definition as the basis of a grogix program that provides user-friendly error messages

What is grogix, how is it helpful, and why is it special?

Grogix is the prototype for a new class of formal languages. A grogix program explicitly represents computational operation, in a concise way, through a simple, coherent, tree-like gradient of operational importance. We call it an “operational grammar”.

A grogix program has a uniform structural description: a cascading, deductive block of statements, all of a single statement type (described below). This uniformity compels the programmer to “push out” uninteresting implementation details from the operational description of the program. What remains is a “structural essence”: something that looks like a mere outline of operation, but is actually a tight hierarchical structure that handles all cases.

Because it is ‘syntax-centric’, a grogix program can more easily provide consistency checking and targeted human error reporting. Providing this, for all incorrect statements of the type (A) above, is accomplished by this small snippet of code, which demonstrates grogix’s brevity:

. statement(*starts_with(input,‘notify’)) -> notify_statement
. notify_statement -> notify recipient_identifier with expression
.. on stream_identifier($recipient_identifier,$expression) semicolon
. notify(*starts_with(input,‘notify’)) -> ‘notify’
. notify -> *error(‘not a “notify” statement’)
. recipient_identifier -> identifier
. with(*next(input, ‘with’)) -> ‘with’
. with -> *error(‘missing “with” ’)
. expression -> *valid_type(expr,$1,$2)
. expression -> *error(‘missing expression after “with”’)
. on(*next(input,‘on’)) -> ‘on’
. on -> *error(‘missing “on” indicator for stream name’)
. stream_identifier -> *validate($expression,$recipient_identifier,identifier)
. semicolon(*next(input,‘;’)) -> ‘;’
. semicolon -> *error(‘missing semicolon’)

Now to explain.

There is only one kind of statement in a grogix program: the conditional production. Its general form is:

. (condition) ->
.. [combination of parameterized terminals, non-terminals and actions]

Note that every statement is preceded by a ‘.’ or continued by a ‘..’. Any other line is a comment: this is a nod to Knuth’s Literate Programming initiative.
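
The line convention is simple enough to sketch in a few lines of Python. This is only an illustration of the rule just described, not the grogix reader itself: the function name is hypothetical, and it assumes the ‘.’ or ‘..’ marker sits at the very start of the line and is followed by a space.

```python
def extract_statements(source):
    """Collect the statements of a literate source, skipping commentary."""
    statements = []
    for line in source.splitlines():
        if line.startswith(".. "):
            # Continuation: glue onto the statement opened by the last '.' line.
            if statements:
                statements[-1] += " " + line[3:].strip()
        elif line.startswith(". "):
            # A new statement begins.
            statements.append(line[2:].strip())
        # Every other line is commentary and is ignored.
    return statements
```

Checking for ‘..’ before ‘.’ matters here, and a prose line that happens to open with an ellipsis matches neither prefix, so it is treated as commentary.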

Now let’s look at the same program, with explanatory comments added:

. statement(*starts_with(input,‘notify’)) -> notify_statement
We assume that there are more kinds of statements besides notify,
which will be inserted here.

. notify_statement -> notify recipient_identifier with expression
.. on stream_identifier($recipient_identifier,$expression) semicolon
This is the structure of a notify statement. In grogix, all the expressions on the right-hand side of the ‘->’ will be evaluated, in order, before this statement can return a value upwards.

Note that we’ve passed the return value of two of the non-terminals (recipient_identifier and expression) as parameters (using $) to a third non-terminal (stream_identifier) in order to check type agreement.

. notify(*starts_with(input,‘notify’)) -> ‘notify’
. notify -> *error(‘not a “notify” statement’)
This is redundant! But I wanted to give you an early flavor of the conditional production. If the line begins with ‘notify’, the first notify production is invoked and returns a string. Otherwise, the non-terminal definition falls through, in top-to-bottom order, to the second conditional production, which is the ‘default’. The first conditional production whose condition evaluates positively is the one that ‘runs’ (i.e. the one whose right-hand side is further evaluated).

Also note that the *error and *starts_with operations are both external references. Everything else in a grogix program is an internal reference: that is, defined within the grogix program. This means the grogix program represents only the operational structure, or ‘essence’ of the program … everything else is pushed out of this structure, or pulled in, via these * operations.
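
The fall-through behaviour of the two notify productions can be imitated in Python. What follows is a rough sketch, not how grogix is implemented: PRODUCTIONS, evaluate and CheckError are hypothetical names, and starts_with and error are simple stand-ins for the external *starts_with and *error operations.

```python
class CheckError(Exception):
    """Raised by the stand-in for the external *error operation."""


def starts_with(text, word):
    # Stand-in for *starts_with: true if the input begins with the keyword.
    return text.lstrip().startswith(word)


def error(message):
    # Stand-in for *error: report a user-facing problem.
    raise CheckError(message)


# Each non-terminal maps to an ordered list of (condition, action) pairs.
# A condition of None marks the unconditional 'default' production.
PRODUCTIONS = {
    "notify": [
        (lambda text: starts_with(text, "notify"), lambda text: "notify"),
        (None, lambda text: error('not a "notify" statement')),
    ],
}


def evaluate(non_terminal, text):
    """Run the first production, top to bottom, whose condition holds."""
    for condition, action in PRODUCTIONS[non_terminal]:
        if condition is None or condition(text):
            return action(text)
    raise CheckError("no production matched for " + non_terminal)
```

The table holds only the structure; the two helper functions are where the ‘pushed out’ detail lives, which mirrors the split between internal and external references.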

. recipient_identifier -> identifier
‘identifier’ will be defined another time. Also, in this consistency checker, there will be binding considerations for the different kinds of identifiers (we have to check that the streams and variables exist, for example).

. with(*next(input, ‘with’)) -> ‘with’
. with -> *error(‘missing “with” ’)
Here we look at the next token at the first level, and, if it is missing at this point in the evaluation, we issue a specific, user-friendly error (I’ll leave the error format discussion to another time.)

. expression -> *valid_type(expr,$1,$2)
. expression -> *error(‘missing expression after “with”’)
If there is an expression, it does a check against the type declared elsewhere, returning the mismatch if found. If there’s no expression at all, it reports the problem.

. on(*next(input,‘on’)) -> ‘on’
. on -> *error(‘missing “on” indicator for stream name’)
Another simple use of the conditional production to identify missing structure.

. stream_identifier -> *validate($expression,$recipient_identifier,identifier)
‘identifier’ will be defined another time. Notice that we call an external function *validate, which we assume has access to appropriate tables, to validate passed values of other non-terminals (see notify_statement above), and potentially return an error here if there is a problem. For ease of exposition, we’re presenting the case where the other non-terminals have already been evaluated. A different technique is available if the agreement is with non-terminals whose values are not yet available, but our users can usually rewrite the productions to make agreement-evaluation easy.
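
As a rough idea of what an external operation like *validate might do, here is a small Python sketch. Everything in it is an assumption made for illustration: the two tables, their entries, the parameter meanings and the message wording are invented, and the real operation may look quite different.

```python
# Hypothetical tables; the names and entries are invented for this sketch.
STREAM_TYPES = {"ALERT_STRINGS": "string"}
KNOWN_RECIPIENTS = {"alertPort"}


def validate(expression_type, recipient, stream):
    """Return None if the three parts agree, else a user-facing message."""
    if recipient not in KNOWN_RECIPIENTS:
        return "unknown recipient: " + recipient
    if stream not in STREAM_TYPES:
        return "unknown stream: " + stream
    required = STREAM_TYPES[stream]
    if expression_type != required:
        # Agreement check between the 'with' expression and the stream.
        return (
            "In the notify statement, the stream " + stream
            + " requires a " + required + " in the with clause."
        )
    return None
```

Because the values of the other non-terminals arrive as plain arguments, the agreement check stays outside the grammar and the production itself remains a one-liner.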

. semicolon(*next(input,‘;’)) -> ‘;’
. semicolon -> *error(‘missing semicolon’)
Another simple use of the conditional production to identify missing structure.

Conclusion

The position of the proposed check-module is architecturally flexible: it can be invoked either after the Yazyka compiler encounters a problem, to provide better output to the user, or beforehand, to ensure better input to the compiler.
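
The two placements can be sketched as two small Python wrappers. This is only an outline under stated assumptions: the function names are hypothetical, check is assumed to return a list of friendly messages (empty when it finds nothing), and compile_source is assumed to return a pair of a success flag and the compiler's own messages.

```python
def check_then_compile(source, check, compile_source):
    """Run the check-module first; only clean input reaches the compiler."""
    messages = check(source)
    if messages:
        return False, messages
    return compile_source(source)


def compile_then_check(source, check, compile_source):
    """Compile first; on failure, prefer the check-module's messages."""
    ok, raw_messages = compile_source(source)
    if ok:
        return True, raw_messages
    friendly = check(source)
    # Fall back to the compiler's own output if the checker finds nothing.
    return False, friendly or raw_messages
```

Either way the compiler itself is untouched, which is the point of keeping the check-module separate.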

Socially, the proposed check-module provides an independent validation of assumptions regarding the Yazyka language. Although this module and the compiler may seem like a “divergence risk” because of “two different language definitions” in two different system modules, in fact their jobs are complementary, both helping to socialize the actual language definition. It is the grogix team's job to ensure that there are no conflicts, and that the check-module provides a world-class level of Yazyka training and support to the user.