Turk2white.htm

Turkish Demitasse (Turk/2 or T2), A Much Stronger Brew Than Java

T.Pittman

This paper describes the essential features of the programming language "Turkish Demitasse" (Turk/2 or T2), initially a dialect of Java, which like the respective beverages is stronger than the corresponding namesake. Strength in a programming language measures either the power of the language to accomplish a range of programming tasks -- T2 is designed to be a systems programming language, in addition to which you need nothing else at all for all systems programming tasks, including its own compiler and library code and indeed the whole operating system -- or the restrictions imposed by the data type system of the language, which also protects the programmer from errors that the compiler can thus catch at compile time.

One of the explicit design considerations in T2 is the realization that it is not the programmer's job to optimize code in ways a well-designed compiler can do better. It is common knowledge that C was intended as a high-level assembler for the PDP-11 computer, and the programmer was given a lot of control over such optimizations as register allocation, array indexing, and common subexpression elimination. These low-level code tricks are of course increasingly inappropriate for target architectures differing from the original PDP-11, so much so that hand-coded C often performs worse than machine generated C code derived from better languages like Modula-2 using the same C compiler.

The specification of the Java programming language went a long way toward eliminating the type weaknesses in C/C++, but it left too many problems and introduced a few new ones. Here I propose a much stronger typed language which also eliminates the rest of the problems. The result is mostly a subset of Java with some (mostly syntactic) enhancements for readability, plus a few significant extensions as noted below. A T2 program can be mechanically translated to a Java program that is semantically equivalent, and with a little more effort, into a semantically equivalent C++ program. Except as noted in the following sections, all of the Java syntax and semantics has been preserved in T2 for maximum code and skill portability.

One of the aims of T2 is to implement a systems programming language such that its compiler and all library code -- indeed a whole operating system -- can be efficiently written in T2 without falling back on other languages. To this end we relax the strong type system in controlled ways by the introduction of the pseudo-class "DANGEROUS", which the compiler knows about and can generate efficient in-line code for. These improvements have been largely inspired by the small and powerful language Modula-2, which attained all these advantages in a somewhat more readable Pascal-like syntax; the "2" in the name of T2 is a nod of acknowledgement for our dear departed friend.

Deletions

The most significant type failure that Java inherited from C is that any type is assignment-compatible with type void. In other words, you can often with impunity discard expression results. Unless you confuse statements and expressions (as C does) there is no valid reason for discarding expression values; values should either be assigned to destination variables (or parameters), or used in some larger expressions, or else used directly in a control structure like if or switch. Although it is not absolutely preventable, we want to discourage the practice of letting expressions have side effects. Side effects make the program less readable and therefore less reliable. T2 does not disallow side effects, but it disallows the most egregious of them by giving assignment statements -- indeed all statements -- an implicit type void. This syntactically prevents the common programming error (which in Java is perfectly legal but has unexpected results when a and b are type boolean):

if (a=b) DoSomething();

We incidentally also disallow the confusing multiple assignment construct,

a=b=c;

but the same result is easily achieved without the confusing loss of readability by making two assignment statements:

b=c;
a=b;

Notice that this restriction does not reduce what can be programmed, it only reduces what can be programmed in an unreadable and dangerous manner. The restricted language is also still syntactically correct Java.

The other significant restriction to the language introduced by T2 is the requirement that non-empty switch cases explicitly be coded with break (or some other form of goto, such as return or continue or throw) at the end of each case, so that it is no longer possible for one case to fall into the next one. You can achieve the same effect by duplicating code, and a smart compiler can optimize the duplications back out. It is not the programmer's responsibility to do that job, and it is too dangerous to allow it. A much better solution would be the case statement of Pascal or Modula-2 (which needs no such gotos), but the design decision here is to preserve syntactic compatibility with Java as much as possible. It is sufficient for the compiler to reject ill-formed switch cases. We except empty cases, which are effectively only multiple labels on a single case.

We removed statement labels; break and continue apply only to the immediately enclosing structure (loop or switch), without the option of exiting multiple structures at once. The extra capability was judged not useful enough to justify the additional syntactic burden. A smaller language is easier to read and maintain.

In the spirit of smaller language -- although that is not our primary goal, it contributes to robustness -- we also removed the operator+assignment combinations. If programmers properly eschew side effects in their expressions, then repeating the destination variable on the right side of the assignment has no effect on performance (because the compiler can do the necessary common-subexpression elimination automatically) and enhances readability. In the same spirit we also removed the prefix and postfix "++" and "--" operators. An extra assignment statement is cheap code, and programmers should not be writing code that depends on such increments or decrements occurring in the middle of expressions. It is apparently these mid-expression increments and decrements that are the primary purpose for the prefix and postfix incrementation operators, and it is precisely those mid-expression increments and decrements that yield surprising (and therefore error-prone) results [see Note 1]. Nevertheless, for those programmers too lazy to type the variable name twice, we expect the integrated development environment editor to accept "++" and "--" operators as shortcuts for the T2 spelled-out versions, which are then inserted into the code. This can easily be extended to all of the C/Java composite assignment operators at no cost to execution time, while preserving the inherent readability of the full spelling.

Finally, we have tightened case sensitivity in identifiers. T2 identifiers are essentially not case sensitive, in that we disallow the declaration in the same scope of identifiers differing only by capitalization, but we also warn the programmer who changes the capitalization on an identifier. This prevents the C/Java-habituated programmer from declaring a new identifier that hides another with different capitalization in an outer scope, then attempting to use them both in the same context. Programs meeting this restriction are still perfectly good Java, with the additional advantage of better readability.

Additions

With a few exceptions, most of the additions to Java that went into T2 are in the nature of "syntactic sugar" which can be relatively painlessly converted back to pure Java in a mechanical manner.

The most important addition is the promotion of arrays to be honest types. Arrays in C are second-class citizens; Java made significant improvements to this, but they are still not honest types as evidenced by the option to put empty array brackets on each identifier in a variable declaration independently. In T2 all types, including arrays, can be named and used in a consistent manner. Array types are not and cannot be syntactically distinguished from other types by placing a part of the distinguishing syntax in some other place. This disallows the optional Java syntax,

int x, y[], z;

To get the same effect the T2 programmer must make a separate declaration:

int x, z;
int[] y;

Because T2 is intended for systems programming, we further enhance the array type to permit statically declared arrays in the obvious manner:

int[10] x, y;

Assignment to x as a whole must be from a compatible array value, namely an array initializer containing exactly ten integers, or else from another array of the same type, such as y. The compiler is then free to allocate x statically or in the local stack frame. This is semantically equivalent to the following pure Java code:

int[] x = new int[10]; int[] y = new int[10];

except that T2 disallows any subsequent explicit reallocation by new or assignment to an array of different length. Dynamic arrays as in Java are still available in T2, with run-time subscript bounds checking and the ability to resize them from time to time, along with the corresponding performance hit. Unlike C and Java, because T2 arrays are an honest data type, not a sugar-coated pointer, assigning a static array to another (similarly typed) static array copies the whole array.

Because T2 is strongly typed, we still require array subscripts to be range-checked, but a smart compiler can do that fairly inexpensively with static arrays (as above) and with a couple improvements to the language to enable the compiler more easily to infer in-bounds limits on subscripts without checking every access. One is the addition of a subrange type, which we require to be named:

type decimal = 0..9;

Then any value assigned to a variable of type decimal is range-checked upon assignment, and thereafter can be used without further checking as a subscript to access an array dimensioned int[10]. To convert this usage into pure Java involves nothing more than deleting the type statement and substituting int (or byte) for all occurrences of "decimal" in its scope. A slightly more ambitious use of subrange types permits the type name to be used as the dimension inside the brackets of an array type declaration; converting this back to pure Java resubstitutes there the range's upper bound +1. Note that T2 considers the built-in types byte and short to be appropriate subranges of integer instead of separate types. Thus adding two maximal bytes in T2 does not overflow until you try to assign it back into a byte subrange variable or parameter. Java has this same behavior without the explicit type safety, resulting in some surprises.

The char data type has been promoted to its own type, incompatible with integers. You must use explicit type cast functions (see below) to do arithmetic on characters or to convert numbers to characters.

Named enumerations are a particularly useful strong type that C never really had and Java didn't even attempt. We do this with the same type keyword:

type color = {red, orange, yellow, green, blue, violet};

Converting this to the weaker-typed pure Java is as simple as creating constant declarations for each identifier in the list, and replacing the type name with int as we also did for subranges:

final int red=0, orange=1, yellow=2, green=3, blue=4, violet=5;

The same new type keyword can be used to rename another type, or to give a name to an array type:

type bigArray = int[1000];

Record types already exist in Java; the keyword is merely spelled class. T2 permits the keyword type to be used in place of final static class for methodless classes (that is, record types) to clarify the programmer's intent in such cases; this also invites the compiler to optimize their instance variables to the local stack frame or host object, eliminating the performance hit from heap allocation and garbage collection.

Operator precedence in T2 has been modified slightly from the C/Java convention to be more natural in view of the stronger type checking. The T2 bitwise logical operators cannot be used on boolean operands, so their precedence was exchanged with the relational operators. Thus a>b&c means a>(b&c) where b and c must be integer, while a>b&&c means (a>b)&&c where c must be boolean; in pure Java the bitwise operators can be applied to boolean operands with results equivalent to the corresponding boolean operators, so a>b&c produces a compile-time error in Java if b and c are integer. Conversion between correct pure Java and a correct T2 program using unparenthesized expressions like this must insert parentheses.

Similarly, the T2 "+" operator cannot be used for string concatenation as in Java because a+3 gives surprising results depending on whether a is string or integer; the new operator "#" is used in T2 for concatenation. The "#" operator has a precedence just lower than the bitwise operators (which operate on integers only), so that automatic promotion to string does not come between them and the arithmetic operators. As with the bitwise operators, conversion to pure Java may need to insert parentheses when substituting "+" for "#".

The other feature that facilitates strong array bounds checking at minimal cost is a properly structured for loop. Unlike C and Java, where the for loop is so flexible as to discourage any compiler optimization, we permit only one control variable, which must be assigned an initial value, checked for termination, and updated, and which the programmer is further forbidden to alter within the loop. Thus the T2 compiler can easily prove (termination and) the range of the control variable, and so eliminate unnecessary range checking code within the body of the loop. To draw attention to the restriction, the T2 syntax is somewhat different from Java:

for i=start,stop,step do doSomething();

This form is just syntactic sugar for the pure Java form, which we can now disallow in T2:

for (int i=start; i<=stop; i=i+step) doSomething();

or else replace it with the (more obviously less efficient) equivalent while loop. Like its Pascal inspiration, the T2 for-loop works as easily for decremented loops by setting the (optional) step value negative, which also inverts the termination test. The new syntax also eliminates the common programming error of an inappropriate termination condition (from confustion over whether to use <= or just <). Programmers familiar only with the C/Java for loop notation may initially find this Pascal/Ada syntax obscure, but the reverse is also true; we believe the proposed notational difference enhances the recognition of different semantics and does not significantly interfere with usability; in fact the reduced redundancy of the T2 syntax is easier to write and somewhat easier to read. We have for the most part eschewed changes to the language that entail the proliferation of new reserved words.

Object-Oriented Programming Systems (OOPS) are fine and dandy for writing elegant software that runs slowly, but performance is a major concern in system software, and T2 is aimed squarely at that market. Like Java, the T2 compiler is expected to statically bind methods to final classes, so that static method calls in these cases perform no differently from ordinary function calls in a non-OOPS language. We further offer a syntactic sugaring that fields and methods may be declared outside a class scope as a reminder that this is system software, not OOPS code nor appropriate to the OOPS paradigm; they are collected together into an implicit "final static Global__Stuff" class when converting T2 to pure Java.

Java offers two different syntactic forms for calling a function/method. One is used for static methods, where all parameters are explicit in the parameter list the way we have always called functions in pre-OOPS languages; the other is used for normal class methods, where there is always one parameter hidden in the call, sometimes explicitly in the "objectName." prefix and sometimes only implicitly the current (this) object. This becomes somewhat confusing when trying to call utility functions such as the String class methods. There is no inherent primary string object when comparing two strings for equality or ordering, yet the programmer is forced to think in terms of sending a "comareTo" message to only one of them; similarly, sending a "concat" message to a string object does not modify nor otherwise distinguish that string object, it only treats both operands as parameters to form a new string which is returned. Notice that the Math library routines are more sensibly declared static and are passed all parameters explicitly. In T2 we favor library routines using this latter, more orthogonal function-calling protocol, and to that end we also provide implicit wrappers for the more useful String class methods to make them easier to read. The T2 compiler does not actually generate nor call extra wrapper code, but both kinds of invocations compile to the same machine code call. Converting a T2 program to pure Java can either supply these wrappers explicitly in a library class, or else restructure (and rename) the method call to match the standard Sring class methods. Note that T2 is not different from Java in available procedure calling syntax, only that the additional T2 library routines favor the more readable non-OOPS form where that makes sense.

Finally, T2 removes the Java type-casting syntax inherited from C, and replaces it with explicit system functions. The advantage of this is that the dangerous casts -- and we do need them -- can now be flagged by forcing them to be imported from class DANGEROUS. Most important among these casts are address conversions, needed for pointer arithmetic, which must be done to write a memory allocator (and other such system functions) in T2. The compiler is of course fully aware of the library functions exported by class DANGEROUS, so they can be coded inline -- indeed the inline code for a true typecast (as distinguished from the conversions improperly called typecasts in C and Java) is no code at all! Obviously system code depending on such typecasts cannot function correctly as pure Java, but the alternative conversion to C++ remains viable. The reasonably safe casts and coersions (for example, int to float) can be pervasive functions and/or remain implicit as in Java.

Future T2 Extensions

This section deals with significant extensions to Java. Perhaps we should refer to the foregoing language as Turk/1, and these improvements as version 2. While still feasible, it is somewhat harder to mechanically translate the improvements in this section directly to Java -- perhaps about as difficult as translating them to C++. We consider two classes of modification: Well-ordered strings, and extending the type system to include procedure types.

For T2 to be as easy to use as Basic (a reasonable goal), strings should be implicitly unique and well-ordered (though perhaps not really), so "str1<str2" actually compares the characters as expected. We also add chunk expression functions and one magical (unprintable, that is, length 0) string StringEnd, which tells string-based iterators that the end of the data has been reached. For example,

for ix=1, 999 do {
astring = GetChunk(BigString,delim,ix);
if (astring==StringEnd) break;
... }

Some suggested chunk expression functions: GetLine, GetWord, GetChunk, GetChunks, ReplaceLine, ReplaceWord, ReplaceChunk, ReplaceChunks, SortLines, SortLinesByKey (takes a substring function as param; see below), SortChunksBy (takes a boolean compare function as param), ChunkIndexOf, ReplaceAll, Grep(takes reg.exp to specify search and extraction criteria). For an example, see below.

If we extend the type keyword to include procedure types, for example: "type void(char,int);" then we can declare procedure variables and parameters. An important benefit of honest procedure types is that it makes possible strongly-typed call-back functions in system calls, which reduces the code-bloat in handling system interaction in the user interface. Another benefit is the ability to require full declare-before-use such as with C-style prototypes. Also we can define lambda expressions (to be a procedure type), which gives us the power of block-structured code and thunks and iterators, etc., as in: "proctype xyz = (ch,in){... return..}" Procedure types would necessarily be structure-compatible. Here is an example of a library routine with a specified ptocedure type as a call-back parameter, which is supplied by an anonymous in-line function:

SortLinesByKey(BigString,(aline){return GetChunks(aline,tab,3,5);})

In this case the function makes no reference to free variables (defined outside its own braces), but that is certainly possible, giving the effect of block-structured languages. There is a hazard in allowing free variable access, in that if the function is assigned to a non-local procedure variable, then when its containing function returns, the function might still appear callable despite that its stack frame is gone. Lisp solves this by "closure" (keeping a copy of the environment, including its stack frame), but that gives unacceptably poor performance. We prefer rather to require that such assignments are forbidden; the compiler can check this statically (by refusing any procedure variable assignment outside its own block, or else at least outside the nearest block containing referenced variables. Copying a procedure variable not locally defined can either be explicitly forbidden, or else dynamically tested for safety (the procedure variable needs both a stack frame and a code pointer; the stack frame can contain a fixed level number, which is compared to the stack level of the receiving variable), perhaps as a compiler option.

New T2 Library Routines

(TBA)

Formal T2 Syntax

Example T2 Code

(TBA)

Notes

1. Out of a class of 14 senior computer science majors, three (more than 20%) were unable to use the ++ operator correctly and/or recognize correct usage in a test environment. Two of them attempted to pass an incremented parameter value (i+1) in a recursive method call by writing i++; the other criticized as an error the syntactically correct for(i=0;i<n;++i). [Back]

Rev. 2003 May 28