@andrewsg
Last active June 1, 2024 20:32
ADA talk on programming languages (notes for first half of lecture)

Programming as communication

Programming is about communication. It's communication between a human and a computer, but it's also communication between humans and other humans, and between humans and their own future selves. To program, you have to fully elucidate your ideas and record them so that both computers and humans can understand you.

Languages facilitate this kind of communication. Programming languages have to be designed with the computer in mind, but they should also be designed to accommodate humans.

Of course, using comments and out-of-code documentation, you can and should communicate to humans independently from your code. But the program code also needs to be its own documentation, both because actual code and its documentation frequently disagree and because sometimes program code is itself the most elegant way to express an idea.

Languages themselves are software, and they're made up of specifications and implementations. Ideally, the specification defines the syntax and the semantics of a language, and that should be enough both for the programmers who want to write an implementation of the language itself and for the programmers who want to write code in the language. The specification should describe "what" the language is, but not "how" the language is actually converted to machine code and run on hardware.

The implementation is the actual software that reads the language and converts it into machine code or another language. There can be, and usually are, multiple implementations of a language.

In practice, specifications and implementations often differ. Also, implementations often have to be specific where specifications are vague, so sometimes multiple implementations adhere to the spec but differ from each other.

Specifications can be thought of as very formal documentation for a language, and implementations are of course program code... and the problems with the specification/implementation dichotomy illustrate why code ultimately has to serve as its own documentation: when the formal description and the running software disagree, it's the running software that actually determines what happens.

How language implementations work

You've all been learning Ruby so far, and my hope is that this talk will put Ruby itself in context. Ruby is a very smart and expressive language -- it does a lot of grunt work for you, which makes it easier to communicate complex ideas cleanly. And that's really the story of the industry: building layers of tools that take the grunt work out of programming, enabling you to take on harder problems. But Ruby itself is constructed from lower-level concepts, and those are constructed from still lower-level concepts. You don't necessarily need to study those in depth to be a good software developer, but having an understanding of the context you're working in will help you down the line, especially with debugging.

Machine code

Ultimately, no matter how you write your programs, they all end up in the same place -- some computer's processor. And a given CPU generally only speaks one language, known as "machine code". The words used in machine code vary depending on the CPU architecture -- for instance, Intel CPUs speak a totally different language from, say, the CPUs in your cell phones, which are mostly ARM. But in general all machine code languages are very similar, because they all have similar constraints and similar objectives. The idea is that the CPU can do a small number of very simple things like addition, multiplication, comparison of two numbers for equality or greater-than/less-than, and so forth. Everything a computer does is based on some combination of an extremely large number of elemental operations like basic arithmetic.
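To make that concrete, here's a toy sketch in Ruby -- not real machine code, and the instruction names are invented for illustration -- of a "CPU" that does nothing but execute a list of elemental operations, one at a time:

```ruby
# Toy illustration (not real machine code): a "CPU" with registers that
# executes very simple instructions one at a time.
def run(program)
  registers = Hash.new(0)
  program.each do |op, a, b|
    case op
    when :set then registers[a] = b                 # a = constant b
    when :add then registers[a] += registers[b]     # a = a + b
    when :mul then registers[a] *= registers[b]     # a = a * b
    end
  end
  registers
end

# Computing (2 + 3) * 4 out of nothing but elemental operations:
result = run([
  [:set, :x, 2],
  [:set, :y, 3],
  [:add, :x, :y],   # x is now 5
  [:set, :y, 4],
  [:mul, :x, :y]    # x is now 20
])
puts result[:x]  # prints 20
```

Everything more complicated -- strings, graphics, web servers -- is built up from enormous numbers of steps like these.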

Here's an example instruction set for a fictional CPU called the DCPU-16. I'm using it because it was designed to prioritize simplicity over practicality, so the full specification fits on a single page. This is literally all of the information you need in order to program on this architecture. Essentially, this document is a machine language specification.

http://web.archive.org/web/20120419171949/http://www.0x10c.com/doc/dcpu-16.txt

The Ruby spec is longer by orders of magnitude.

The reason we're all programming in Ruby or languages like it, instead of in machine code, is that it's excruciatingly tedious to express all but the simplest concepts in machine code. Sure, the Ruby documentation might run a few thousand pages, but you could easily write a few thousand pages of machine code to do as much work as a single page of Ruby. Ruby lets you express high-level concepts, like "find the first occurrence of 'hello' in this string and discard everything before that", very easily. For that matter, Ruby also lets you express "find the first occurrence of '你好' in this string", which, because it involves multi-byte characters, is extremely complex at the machine level.
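Both of those examples fit in a couple of lines of Ruby (the sample strings here are made up for illustration):

```ruby
# "Find the first occurrence of 'hello' and discard everything before it."
str = "say hello to everyone"
idx = str.index("hello")     # position of the first match, or nil
puts str[idx..-1] if idx     # => "hello to everyone"

# The same code handles multi-byte text, because Ruby strings work in
# characters, not bytes -- the machine-level complexity is hidden:
zh = "我说：你好，世界"
puts zh[zh.index("你好")..-1]  # => "你好，世界"
```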

Compilers

Because programming in machine code is so tedious, error-prone and difficult, programmers use software to make writing software easier. A type of program called a "compiler" makes programming easier by taking one language and turning it into another -- typically, taking a language that's a little more comprehensible to humans, and turning it into machine code that the CPU can read directly.
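The core idea -- read one language, emit another -- can be sketched in a few lines. This toy "compiler" (written in Ruby, with an invented assembly-like target language; real compilers are enormously more involved) translates one-line assignments into lower-level instructions:

```ruby
# Toy "compiler": translate a tiny source language (lines like "x = 2 + 3")
# into assembly-like instructions for an imaginary register machine.
def compile_line(line)
  dest, lhs, op, rhs = line.match(/(\w+) = (\w+) ([+*]) (\w+)/).captures
  opcode = { "+" => "ADD", "*" => "MUL" }.fetch(op)
  ["SET R0, #{lhs}",        # load the first operand into a register
   "#{opcode} R0, #{rhs}",  # perform the arithmetic
   "STORE #{dest}, R0"]     # write the result back to the variable
end

puts compile_line("x = 2 + 3")
# SET R0, 2
# ADD R0, 3
# STORE x, R0
```

One readable line of source becomes several lower-level steps -- and that ratio only grows as the source language gets higher-level.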

Here's a representative example of C code compiled into DCPU-16 machine code.

http://www.dcpu16apps.com/Home/App/43

You'll notice in this example that the C code is still quite complex. This is because the DCPU-16 doesn't intrinsically know how to make something show up on the monitor -- that is, it doesn't know what the "print" command means. In this case, "print" means copying single-byte characters into a reserved section of memory, which the video chip will then pick up and display independently of the processor.

But the "print" command only needs to be written once, and after that, you can print as much text as you want. The function can be reused. Reducing duplication of effort like that is one of the most important things a language can do to make programming easier.
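Here's a sketch of that memory-mapped "print" idea in Ruby (the buffer size and layout are invented for illustration): printing means copying character codes into a reserved chunk of memory that the video hardware reads on its own, and the helper only has to be written once.

```ruby
# Pretend this array is the reserved video memory the display chip reads.
VIDEO_RAM = Array.new(80, 0)

# Written once, reused forever: copy each character's code into the buffer.
def print_at(text, offset)
  text.each_char.with_index do |ch, i|
    VIDEO_RAM[offset + i] = ch.ord   # one character code per cell
  end
end

print_at("HELLO", 0)
print_at("WORLD", 6)
puts VIDEO_RAM[0, 11].map { |b| b.zero? ? " " : b.chr }.join
# HELLO WORLD
```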

As an aside, since compilers need to exist before programming languages can be used, you'd think that all compilers would be written in machine code... but modern compilers are extremely complex, and it's just as tedious to write a compiler in machine code as anything else. The very first compilers were necessarily written in machine code, but since then, most compilers have been written in some higher language, and then compiled into executable code via some older compiler. So nowadays it's compilers all the way down -- for instance, most C compilers are actually written in C, and every new version is just compiled using the previous version of that very same compiler. (Or a competitor's compiler, if you secretly think theirs is better.)

Languages that are written by a human and then compiled directly into machine code are usually known as low-level languages, because they are only one level removed from "the metal" (actual hardware, i.e. machine code). But low-level languages have a lot of limitations because they're still very close to the hardware. The closer you are to the hardware, the less effort it takes to get the computer to understand the code, but the more the hardware's quirks and limitations make it painful to program. So there are strategies to combat these limitations, very similar to how low-level languages exist to combat the limitations of machine code.

Virtual machines

Most computer languages exist on a spectrum of complexity, somewhere between machine code and Ruby. Languages close to machine code are "low-level" and languages around Ruby's level are "high-level". The higher you get, by and large the easier it is to communicate complex ideas and the less you have to worry about the hardware.

Some languages use software called 'virtual machines' to distance themselves from often quirky and inflexible hardware. For instance, Java is at least theoretically 'write once, run anywhere': you compile not from Java to machine code, but from Java to virtual machine code, known as 'bytecode'. The disadvantage is that the virtual machine still has to run in machine code, and the translation between virtual machine code and machine code makes execution slower. These can be classified as medium-level languages, although there is room for disagreement on language taxonomy here.
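A virtual machine is, at heart, just a loop that fetches invented instructions and dispatches on them. Here's a toy stack-based VM in Ruby (the bytecode format is made up for illustration; real VMs like the JVM are vastly more complex):

```ruby
# Toy stack-based virtual machine. The "bytecode" is a list of
# [opcode, argument] pairs; results are pushed to and popped from a stack.
def run_vm(bytecode)
  stack = []
  bytecode.each do |op, arg|
    case op
    when :push then stack.push(arg)
    when :add  then stack.push(stack.pop + stack.pop)
    when :mul  then stack.push(stack.pop * stack.pop)
    end
  end
  stack.pop
end

# The same bytecode runs anywhere this VM runs -- that's the
# "write once, run anywhere" idea, paid for with the extra
# translation step at execution time.
puts run_vm([[:push, 2], [:push, 3], [:add], [:push, 4], [:mul]])  # prints 20
```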

Interpreters

Some languages go even further and do away with the compiler altogether. Instead they have an interpreter, which is like a compiler except that instead of the compiler doing all the hard, slow work in advance, the interpreter does it at execution time, every time the program is run.

The advantage to this is that you have essentially no limits on how flexible or expressive your language can be. Compiled languages often need to know how big objects in memory will be ahead of time in order to write machine code, which necessarily embeds those assumptions in the code. They need to know a lot about what code is in a library or a function at compile time, so upgrading those libraries can sometimes force you to run the compiler again. You can't pause the execution of a compiled program in the middle, change your code, and restart it. And you can't have a shell like 'irb' because the entire program has to be known and compiled in advance.

Interpreters work around these limitations. The disadvantage is that they are extremely slow, essentially doing the work of a compiler for every line of code, every time the code is executed.
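A minimal interpreter, sketched in Ruby with an invented two-command source language, shows the difference: there is no compile step at all -- each line is parsed and executed fresh, every time the program runs. Change the source and the behavior changes immediately, no recompilation required.

```ruby
# Toy interpreter for a made-up language with two statements:
#   NAME = NUMBER   (assignment)
#   print NAME      (output)
# Each line is analyzed at execution time, every run.
def interpret(source)
  env = {}
  source.each_line do |line|
    case line
    when /(\w+) = (\d+)/ then env[$1] = $2.to_i
    when /print (\w+)/   then puts env[$1]
    end
  end
  env
end

program = <<~SRC
  x = 40
  y = 2
  print x
SRC
interpret(program)  # prints 40
```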

The interpreter itself generally has to be executed in machine code, so it is usually written in a low-level language and compiled. For instance, the interpreter you are all using to run Ruby, MRI, is written in C.

How languages differ

So just in looking at how languages work, we can plot languages along an axis of complexity, or distance from hardware. Since there are advantages and disadvantages to being at any point on the plot, different programmers work at different levels for different tasks. No language is perfect. (Except Python.)
