On the Choice of Programming Languages

— abridged version —

This is the abridged version, meant for those who already have some programming experience.
If you have never programmed, you should read the full version.

Assumptions and Goals

To be as concise as possible, I assume that you have already done some computer programming and that you have at least some knowledge of several programming languages.  I will therefore not explain programming jargon.

It is my goal to show that languages currently in use suffer severely from historic design decisions for which the rationale is no longer valid.  While no syntax is perfect, and several different ones are needed to fulfill the needs of different application domains, there certainly are alternatives to the designs of the 1960s.

Most of the code of today's applications is concerned with user interaction, with graphics and with steering the vast number of functions supplied by the operating system.  Moreover, developers want their applications to run on multiple platforms:  from the desktop and laptop to mobile devices and smartphones.  It should therefore be no surprise that this essay favours the LiveCode development system, which to my current knowledge best satisfies the criteria set forth here.

A little history

When electronic computers were first built at the end of World War II, they were not only large and slow compared to today's machines, but they had very little memory and their programmers, obviously, had very little experience.

The problems solved usually had to do with physics and mathematics.  Programs were put into the machines by hand, setting the bits of the instructions into memory one by one.  This sounds extremely tedious, but software did not interact with the computer's users:  programs (or "apps" if you want) mainly did calculations and produced lists of numbers.  The purpose of an application was also extremely well known, and the programs were short.  The computer was just a much, much faster calculator, driven in an automatic way.
And memory was really small and expensive.

Programming languages came into being only a decade later, in the 1950s.  By that time computers had acquired more memory, and there were more users who wanted to do more complex calculations.  These users needed a better way to put their programs into the machine;  they needed to express their algorithms in a notation, a "language", that was closer to what they used in their field of expertise.

Indeed, computer users of this new generation were experts in subjects other than computers, and they did not want to waste their time on details irrelevant to the application.  This is one of the major goals of LiveCode:  to provide access to programming without having to learn too many details unrelated to the problem.

The first successful computer programming language was FORTRAN, which is short for "FORmula TRANslation system".

The basic idea of FORTRAN was to let physicists and mathematicians write formulae more or less the way they were used to.  For example, if they had an equation

y = x² + 21.5x - 354

they could type it in somewhat like this:

Y = X**2 + 21.5*X - 354

which was not exactly the same, but close enough.  The differences arose mainly from the need to use card punch machines which had a limited set of characters.  Note also that the mathematical form is an equation, but the computer form is an assignment statement:  it gives the operations to perform with the value of X and then tells the machine to store the value so obtained into a variable called Y.

The FORTRAN assignment statement was still not something the computer could understand directly, but it could be programmed to translate these formulae into its own instructions.  The above example might have given a sequence of machine instructions similar to this:

LOAD X
LOAD X
MULTIPLY
LOAD 21.5
LOAD X
MULTIPLY
ADD
LOAD 354
SUBTRACT
STORE Y

The translation process from the typed formula to the sequence of machine instructions is done mechanically by a program called a compiler.

Compilers can do more than produce machine code;  they can also detect programming errors:  for example, 354 would pass as a number, but 35(4 should be rejected, and the compiler can tell the programmer about the error.

However, typing a + instead of a - would not be detected.

35(4 is not valid number syntax, and that error is therefore called a syntax error.  But using the + instead of the - cannot easily be detected by any compiler:  it would have to know the programmer's intention or even understand the problem to be solved by the program.  Such errors are called semantic errors, and they are the real stuff of debugging.

Compiling code in the 1950s and 1960s took time and bug correction was tedious.  Today compiling is so fast that one does not notice it happening.  Debugging is still difficult and slow.

But let us note an important decision that was taken by the designers of the FORTRAN programming language:

to let physicists and mathematicians write formulae more or less the way they were used to

That well-meant decision introduced the first quite bad aspect of programming languages that remains with us today:  the syntax of the assignment statement.

The formula above defining Y as a function of X looks quite OK, but when counting in loops through all the elements of an array, programmers had to write statements like the following to manipulate the index:

I = I + 1

Obviously that is a nonsensical mathematical formula.  No value can be equal to itself plus one.  The = sign in FORTRAN does not mean equality at all!  It does not mean that the left-hand side is equal to the right-hand side of the formula, as it does in mathematics.  It is not symmetric;  it has a definite direction:  I=I+1 effectively means "take the value of memory location I, add 1 to it, and deposit the resulting value into memory location I again".  It would have been much better to write

I + 1 → I

which would also have shown the sequence in which the computer does its operations:  from left to right.  And there would have been no confusion with equality.

LiveCode does in fact write that statement in the better way:

put I+1 into I

Or in this particular case even more simply and concisely:

add 1 to I
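
To see these statements in context, here is a minimal sketch of a counting loop in LiveCode;  the container names are purely illustrative:

put 0 into I
put empty into tOutput
repeat until I >= 10
   add 1 to I                     -- the assignment discussed above
   put I & return after tOutput   -- collects the numbers 1 to 10
end repeat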

The main reason for the bad FORTRAN syntax was to try not to frighten the mathematicians with something unfamiliar;  another was that the → symbol did not occur on typewriters and card punch machines.

The first reason might seem a psychologically good one, but all early books on programming invariably started out by explaining that I=I+1 does not mean what the mathematician expects, and they went to great lengths pointing out the difference!  It is not clear whether the mathematicians of the time would have been any more confused by a simple explanation of the I + 1 → I syntax, but I am willing to bet that they would have accepted the arrow notation easily and gotten on with the job.

Unfortunately, most programming languages have kept the FORTRAN syntax for assigning new values to variables.  Some have attempted to make the difference clearer:  Algol used

I := I + 1

which at least shows that the assignment operation is not symmetric.

An even worse consequence of choosing = for the assignment is that you now have to think of a different symbol for testing equality.  FORTRAN used a quite ugly solution:  if you wanted to make a decision depending on the equality of two values A and B, you would write something like:

IF (A .EQ. B) GOTO 56

That would test the values of A and B for equality and skip to the statement labelled 56 if they were equal.  To separate the test from the action that followed, parentheses were used:  another early choice that is still with us in many programming languages.  In LiveCode the above decision would be written as:

if A=B then …

C-like languages (javascript, php, perl, java, …) use == :

if ( A == B ) …

However, as these languages also allow assignment as an expression returning a value, they will allow you to write

if ( A = B ) …

which is not an error, but means something entirely different:  take the value of B, put it into A, which changes the value of A;  the result of that assignment operation is the value of A, and that is then tested for being equal to zero or not.  That is completely different from testing whether A is equal to B.

Because the compiler has to accept both forms, making the mistake of using a single equal sign leads to very obscure bugs.

Books teaching programming in C-like languages always point out that typing = instead of == is one of the most common mistakes made.

Note also that these language design decisions were made in the 1950s and 1960s and are still in use, whereas no-one today would consider buying computer hardware designed in those days.

On Referencing

You may have noticed that when we write

put X+1 into X

we are in fact using X with two different meanings. The first X should be read as "the value stored in the container X" whereas the second X means "the container X". Specialists of natural and computer languages know this phenomenon as the problem of referencing.

In most programming languages referencing is not an issue, as the second type of reference only occurs in assignment statements.

Nevertheless, some programming languages have very confusing types of referencing.  In Ada, a language designed for robust programming, an attempt was made to ease the programmer's work by changing some types of referencing into other types automatically (especially pointer dereferencing, but we don't need to go into details here).  The motivation was much like that of FORTRAN's attempt not to frighten the mathematicians.  And as in FORTRAN's case, the resulting evil was probably worse than the perceived problem it tried to solve.

AppleScript has a big problem with referencing.  Say we have a number of operations to do on an object X, and let's generally write this as operations-on-X.

There are many cases in which the statement

operations-on-X

gives different results from the sequence

set Y to X
operations-on-Y

and one of the two ways may raise errors while the other would not.

AppleScript syntax has roots similar to those of LiveCode, but the problem of confusing references makes it quite difficult to write working AppleScripts;  it leads to a large number of questions posted on forums.

LiveCode does not have this problem, partially because its underlying data type is the character string;  more about that in the next article.

On Weak Typing

Consider these expressions and their results:

123 + "456" 123 & "456" 123.456 div 3

The first attempts to add a character string to an integer.  That should be an error since it makes no sense.  But since "456" is actually convertible to a number, some languages produce code to do just that.

Such behaviour is called weak typing, because no errors are signalled for using different data types in one expression.  Strong typing by contrast forbids implicit conversions and requires that the programmer specify explicitly where a conversion from one type to another is needed and also how it should be done.

The expression 123 + "abc" would of course fail in either case.  In weakly typed languages this error is often signalled only at run time, when it could have been caught and the programmer warned at compile time, before any debugging becomes necessary.

The second expression attempts to make a string of characters from a number and a string.  Again that should be an error, but again many languages allow it.

The last expression tries to compute the quotient of a division by an integer, which should produce a whole number and hence should not allow a non-integer first argument.

The case in favour of weak typing is that most experienced programmers are well aware of data types and make few type mixing errors but do use the facility of automatic conversions.

The case in favour of strong typing uses a more subtle argument:  while experienced programmers indeed make few type mixing errors, it is precisely those errors that are hardest to track down but the easiest to detect at compile time in a strongly typed language.

Strong typing is unfortunately all but absent from today's most popular languages, and that is the reason for many delays in programming projects, subtle bugs, and probably a lot of remaining errors.  Programmers suffer from a language's inability to catch data type errors.

LiveCode essentially has only one type of data, the character string, and thus strong typing would not give much of an advantage, though it might help.
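
For illustration, here is a sketch of how the three expressions above behave in LiveCode, assuming its usual automatic conversions:

put 123 + "456" into tSum      -- "456" is converted to a number, giving 579
put 123 & "456" into tText     -- the number is treated as text, giving "123456"
put 123.456 div 3 into tWhole  -- integer division, giving 41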

Languages that do have multiple types of data should do type checking to avoid unnecessary transformations and to help the programmer find unintended ones.

I will not go into the deeper aspects of strong typing, but in my experience I have been much helped by compilers that insisted on explicit conversions by refusing to accept my errors rather than trying to guess at conversions and producing unintended output, leading to long debugging sessions.


A Common Worry that should not be one

We discussed above a number of syntactic problems that have a very early origin.  Most of them were attempts to relieve a difficulty that the language designers thought the programmers using the language would encounter.

Very often a second motivation was to minimise typing by the programmers.  This was especially the case for C-like languages, where it was in fact the main motivation, but the referencing problems of Ada and the assignment statement of FORTRAN also suffer from this worry about having to type extra characters.

Any programmer knows that there is a phase of writing program code, followed by a usually much longer phase of testing and debugging.  This second phase may be shortened and made less painful in a number of ways.  Writing good documentation of what the program statements are intended to achieve, using meaningful identifiers, and placing comments at difficult statements are the three most productive remedies.

But all three of those remedies imply typing more characters, not fewer!

During the debugging phase the programmer looks at the code a lot, changing things here and there, but most of the time is spent reading the code that is already written.

This was observed very early on:  computing theoreticians and practitioners like Dijkstra, Wirth and others pointed out decades ago that much more time is spent reading code than writing it.  They used this fact to argue that a good language design could profitably reduce the debugging time even if it meant a lot more code had to be written first.  Their proposal was to favour readability of the code, even at the cost of typing more characters, rather than typing less and being left with programs that are difficult to read and understand.

One of LiveCode's goals is to make the language easy to read, avoiding the need for lots of comments.
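
As a small illustration of that goal, a couple of lines of LiveCode read almost like the comments one would otherwise have to write (the field name is just an example):

sort lines of field "Names" ascending text
put the number of lines of field "Names" into tNameCount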

You should not worry about how much you have to type, but about how often you have to read your code to understand it. 

Letting the Language do the Work

Several languages that followed FORTRAN tried to put "readability over writability" into practice.

Some, like COBOL, became very verbose.  Others, like Pascal, were clear yet concise but had other problems.

No programming language is perfect, but we should perhaps here spend some time on the various attempts that were made to reduce debugging time.

The two major features that help the programmer are required declarations and strict typing.  Both have drawbacks, though.

Declarations

Required declarations mean that the programmer must list all the variables to be used at the start of the code. Using a name without declaring it at the start leads to a syntax error and the compiler will simply not let it pass. Usually it's not just a list of identifiers: each declaration also states the type of the variable.

The drawback of required declarations is that over the development of a program the use of variables changes, and some may be left unused while others may change their type.

LiveCode does allow declarations to be required, but does not make them mandatory.
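
As a sketch of what that looks like in practice:  with LiveCode's optional variable checking switched on, every name must be declared before it is used.  The handler and variable names here are hypothetical:

local tCount, tTotal   -- script-local variables, declared once at the top

on addValue pValue
   add 1 to tCount
   add pValue to tTotal
end addValue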

Strict Typing

One advantage of declarations is that there is a well-defined place to look at the definitions, but another one is that the compiler can use the definitions to check that variables are used in the intended way. For example, if I declare X to be a whole number, but then later on I write a line of code that puts a character value into it, the compiler can tell me that I probably did not intend to do that.

The drawback of strict typing is that some operations become very verbose.  Say I have a whole number to count with, i, and I want to produce the i-th multiple of a number, say 3.45.  In a strictly typed language I cannot simply write

X := i*3.45

because i is an integer and 3.45 is not;  it is a real number.  What should the result be, integer or real?  I then need to specify some conversion, most probably by using a function to convert the integer to a real number before the multiplication:

X := ConvertToReal(i)*3.45

This is ridiculously complicated.  For this reason most strictly typed languages allow automatic "widening" of types.  That is, since all integers are also real numbers, it is OK to let the compiler do the conversion instead of having to write the call to a function.  The inverse is usually not allowed, but once again we do not need to go into detail here.

A General Principle — the First Criterion

In the 1970s, when I designed P+, a language for process control, I tried to let the compiler do as much checking as it possibly could before delivering a program to be run (and then debugged).

The motivation was that anything about the program that the compiler could understand it could also check, and point out potential errors to the programmer.

Ideally, that meant that the program specification itself should be used by the compiler. If we lived in an ideal world, with good artificial intelligence, then we would not have to write programs at all. Just telling the compiler a specification would be sufficient. I could write:

Print a list of the first 10'000 prime numbers

and the compiler would be able to write the program for that.

Anything that I need to produce but that the compiler cannot read and understand is called "documentation".  That includes diagrams, comments, specifications, proofs, assertions and so on.  In the ideal world there should be none of it:  the program code itself should be its own documentation.  This is the goal of declarations, strict typing, long identifiers, and other syntax constructs, but they replace only a part of the documentation.

While we do not live in that ideal world of self-documenting program code, it is still possible to design the programming language so that it is eminently readable to humans. That indeed will help a lot: if the language is so clean and clear that I do not have to write much documentation, then I can concentrate on the program's functions and be productive.

The most important criterion for a good programming language perhaps is:

The syntax should minimise the need for documentation

The Second Criterion

We will have to read our code a lot, poring over it to find bugs.  The smallest distraction will waste time and effort.  We will not achieve speed-reading, as we may when reading the prose of a novel, but we may perhaps design the language so that it resembles prose more than it resembles gobbledygook.  I personally find that the use of special signs and punctuation hinders the flow of my reading.  Punctuation of course is actually meant to interrupt:  it signals the end of a sentence, a pause inside a sentence and so on.  But too much use of quotes, commas and parentheses (and especially nested parentheses) is disruptive.  If we can do without them, so much the better for our reading.

Nested parentheses are such a nuisance that most modern program editors will highlight matching pairs automatically as you type code.  That must be an indication that our eyes and brains are not suited to finding matching parentheses.  I would therefore conjecture that it is easier to read

if b is not zero and a/b<c then …

than it is to read

if ( ($b!=0)&&($a/$b<$c)) { …

Note that the second line uses ((()))!&&$$$$ as characters from the top row of the keyboard, whereas the first line uses none.

Those top-row characters are also the ones used in cartoons to indicate swearing or unprintable words.

My second criterion for a good programming language is then:

The fewer characters from the top row of the keyboard are needed, the better.

The Third Criterion

When we read code we need to know what it means. If I cannot remember the syntax of a certain kind of loop construct then I must look it up in the language reference manual.

My first programming language was PL/I.  It was a laudable attempt by IBM to produce a better language, one that would combine the virtues of FORTRAN and COBOL.  It was the end of the 1960s and computers were just about getting big and fast enough to begin to explore human-machine interaction.  PL/I was aimed at both the professional and the casual programmer.  But in its zeal to cover everything, its syntax had so many possible meanings depending on context that it was impossible to write even the smallest program without having the language manual open beside you.  Constant lookup of features was necessary to avoid unintended effects.

The design of Pascal was a strong reaction to that phenomenon.  Niklaus Wirth's first motivation for his design of Pascal was to be able to keep the entire syntax and semantics of the language in one's head, i.e. without needing to refer to a manual.  It was a huge success, and it died only because, unlike FORTRAN, it did not evolve fast enough.  While it lasted, it was certainly a pleasure to work with that design.  This experience leads me to my third criterion for a good language:

A good language should minimise the need for looking in the language reference manual

What will happen?

Let us come back to the decision statement

if b is not zero and a/b<c then …

Suppose b is zero. Then if we try to divide a by b, we will divide by zero, and that will crash the program.

But if b is zero, then it does not matter what the value of a/b is, because the whole expression will be false anyway. This is always so because

p and q

is false if either one of them is known to be false. It's not necessary to know them both. It's only if p is true that we also need to look at q.

In other words, the and operator should simply skip computing its second part if the first one is already false.

LiveCode and some other languages do explicitly define the and operator to skip in the way discussed, and hence there will be no problem dividing by zero if b is zero, because it will not be attempted in the first place.

A similar effect is true for the or operator: p or q is true as soon as p is true and we do not have to look at q.

The operation of the switch statement is more involved, but it suffers similarly if what happens is not well defined.
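
For comparison, a well-defined switch as LiveCode spells it out might look like the sketch below;  execution falls through from one case to the next unless a break is given (the values are made up):

switch tAnswer
   case "yes"
      put "proceeding" into tStatus
      break
   case "no"
      put "stopping" into tStatus
      break
   default
      put "please answer yes or no" into tStatus
end switch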

Many language implementations do not define precisely what these operators and statements do: you have to find out what the compiler does by trying it out. This type of "experimental programming" is a huge waste of time, and it also makes it difficult to port programs from one system to another even if they use the same language.

Similarly, the results of all functions should be well-defined for all arguments.

My fourth criterion says this:

The result of all syntactic constructs should be well-defined

Happily LiveCode satisfies this criterion extremely well, but many C-like languages, especially the interpreted ones, do not.

Last but not Least

We observe that as programmers we read code a lot. But which code? If today I make a web site, I will need knowledge of html for the content, of css for the styling, probably of php and javascript for actions on the server and client side, and perhaps even sql if a database is involved. Worse, there may be processing of data to be done off-line, for which any language might be needed to write an application. And I am not even talking about scripting the operating system or some scriptable applications.

That means not only a lot of reading of code, but potentially switching between six or more different programming languages!

The problem is compounded by the fact that some syntactic constructs are required in one language and forbidden in another. The use of the semicolon or the $-sign comes to mind.

Then there are the errors that are more subtle, where syntax checking does not help.  I once lost an entire afternoon chasing what looked like a very strange bug:  I got results that were sometimes the ones I expected, and sometimes completely wrong.  After inspecting the code (remember:  reading and reading and reading code) I thought the problem resided in the data.  But after lengthy examinations that appeared not to be so.  Finally I spotted my error:  instead of "." I had typed "+" for the concatenation of two strings.  Php uses "." but javascript uses "+", and the operation I wanted sat in a piece of code that was written in php.  The php interpreter simply assumed I wanted to add, converted the strings to numbers, and performed the addition.  By coincidence the strings to be concatenated were indeed often composed of digits.  I would not have lost that afternoon if both languages had used the same operator.  My error would also have been detected with strong typing, which would have refused the addition of strings at the syntax check.

Given that I need many different pieces of cooperating code, I will also have to write a lot of documentation.

Why can I not use the same programming language throughout? After all, I will most likely write the entire documentation across all cooperating code in one natural language: English.

What if I could use an English-like language to do almost all of the programming too?

It is not impossible: LiveCode cannot be used for html and css, but it can be used for all the rest. That is an enormous advantage: as a programmer I can now concentrate entirely on the algorithms to get the job done, and there are no bugs that remain obscure to me due to my imperfect knowledge of several different languages, or simply a coincidental confusion. There is no longer the need for experimental programming when it is impossible to find a complete definition of what a construct or function does. I no longer need to invoke tons of weird text processing functions: I can use the chunk syntax of LiveCode, and everyone knows that a lot of text processing goes on in today's code for the web.
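
To give an idea of that chunk syntax, here are a few lines of the kind of text processing that would otherwise need a handful of library functions;  the variable names are illustrative:

put word 2 of line 3 of tPage into tName                        -- pick a word out of a line
put the number of items of line 1 of tTable into tColumnCount   -- count the comma-separated columns
delete char 1 to 5 of tText                                     -- drop a fixed-width prefix
replace tab with comma in tTable                                -- change the delimiter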

That leads to my fifth and final criterion for a good programming language:

The more areas of application it covers, the better it is.