Code:
@120E
NEXT PREV LINE CODE
10 J=0 121B 0000 00 10 33 D1 7E 3A 00 00 FF : (J=D1)
20 GOSUB 100 1225 120E 00 20 31 01 00 FF
30 END 122D 121B 00 30 34 FF v-"100"
100 IF J>100 GOTO 130 123F 1225 01 00 2E 00 6E D1 6E 3A 00 64 FE 01 30 FF
110 FOR I=1 TO 3 1250 122D 01 10 29 C9 40 3A 00 01 FE 3A 00 03 FF
115 J=J+I 125D 123F 01 15 33 D1 7E D1 4E C9 FF
116 NEXT I 127B 1250 01 16 2A C9 40 00 ..<x18>.. 00 FF
120 GOSUB 100 1285 125D 01 20 31 01 00 FF
130 RETURN 0000 127B 01 30 26 FF
So that is actual IBM 5110 "tokenised" BASIC code? As far as I can tell that is not "tokenised" in the sense that MS-BASIC is, but instead seems to be some intermediate form, though it would be a lot clearer if you annotated it. Here's the beginning of a reverse-engineered annotation of the first few lines:
Code:
@120E
             │ NEXT │ PREV │ LINE │ CODE
10 J=0       │ 121B │ 0000 │ 0010 │ 33 D1 7E 3A 00 00 FF : (J=D1)
             │      │      │      │ '3 D1 '~ ': ?  ?  Ω
20 GOSUB 100 │ 1225 │ 120E │ 0020 │ 31 01 00 FF
             │      │      │      │ τS 100   Ω
I've marked ASCII characters with ' followed by the character (though these bytes may not actually represent ASCII characters in the source), what appear to be line terminators with Ω, and keywords with τ followed by a letter, e.g., τS for GOSUB.
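As a quick sanity check on that layout, here's a small Python sketch that takes the first five lines of the dump and checks that each NEXT pointer equals the line's own address plus the size of its record. It assumes (my guess, nothing documented) that NEXT and PREV take two bytes each and that the line number is stored as packed decimal digits; the names are made up for illustration.

```python
# Rows copied from the dump above: (address, NEXT, PREV, "LINE + CODE bytes").
# Assumption: NEXT and PREV occupy two bytes each in front of the listed bytes.
DUMP = [
    (0x120E, 0x121B, 0x0000, "00 10 33 D1 7E 3A 00 00 FF"),
    (0x121B, 0x1225, 0x120E, "00 20 31 01 00 FF"),
    (0x1225, 0x122D, 0x121B, "00 30 34 FF"),
    (0x122D, 0x123F, 0x1225, "01 00 2E 00 6E D1 6E 3A 00 64 FE 01 30 FF"),
    (0x123F, 0x1250, 0x122D, "01 10 29 C9 40 3A 00 01 FE 3A 00 03 FF"),
]

POINTER_BYTES = 4  # assumed: 2 bytes NEXT + 2 bytes PREV


def record_size(hex_bytes: str) -> int:
    """Size of one stored line: the two pointers plus the listed bytes."""
    return POINTER_BYTES + len(bytes.fromhex(hex_bytes))


def check_links(rows) -> bool:
    """Return True if every NEXT/PREV pointer agrees with the record sizes."""
    prev_addr = 0x0000
    for addr, nxt, prev, hex_bytes in rows:
        if prev != prev_addr or addr + record_size(hex_bytes) != nxt:
            return False
        prev_addr = addr
    return True


def line_number(hex_bytes: str) -> int:
    """Read the first two bytes as packed decimal digits (an assumption)."""
    return int(bytes.fromhex(hex_bytes)[:2].hex())


print(check_links(DUMP))                       # True
print([line_number(row[3]) for row in DUMP])   # [10, 20, 30, 100, 110]
```

So at least the pointer arithmetic is consistent with a doubly-linked list of variable-length records, whatever the CODE bytes turn out to mean.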
It's pretty clear from this that it is not BASIC tokenisation in the MS-BASIC sense of the term: it's doing significant additional processing that involves not just removing spaces and the like (which is one step towards an AST); it also seems to be renaming variables and, well, I've not investigated far enough to see what's really going on there. If you can produce a small, simple translator between the two forms and post it here, that would explain a lot more.
In MS-BASIC, as I mentioned before, the tokenisation is a very simple substitution of tokens for certain strings and back again: both are source code, just expressed in a slightly different form. (This is true even in the later BASICs where numbers are tokenised; there's a bit more work because the tokenised form of a number depends on what comes before it, e.g. a GOTO gets a line number instead of a float, but that doesn't change the core of what it's doing.)
If you want to examine how that works you can have a look at my MSX-BASIC de-/re-tokeniser on a branch in r8format. That includes the code itself, plenty of test data (under programs/) that's easy to examine, de- and re-tokenisation command-line tools, and a program to hexdump a tokenised MS-BASIC program in a more readable format than a regular hexdump.
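To make "simple substitution and back again" concrete, here's a toy sketch in Python. It is not the r8format code and not a faithful MS-BASIC tokeniser: the keyword table is tiny, only 0xEF for = comes from the MS-BASIC dump in this thread, the other token values are placeholders, digits are left as ASCII, and string literals and REM are ignored.

```python
# Toy keyword table. Only 0xEF for "=" is taken from the MS-BASIC dump in this
# thread; the other values are placeholders chosen for the example.
TOKENS = {"GOSUB": 0x8D, "GOTO": 0x89, "END": 0x81, "=": 0xEF}
KEYWORDS = {value: key for key, value in TOKENS.items()}


def tokenise(text: str) -> bytes:
    """Replace each known keyword with its one-byte token; copy the rest."""
    out = bytearray()
    i = 0
    while i < len(text):
        for word, token in TOKENS.items():
            if text.startswith(word, i):
                out.append(token)
                i += len(word)
                break
        else:
            out.append(ord(text[i]))  # spaces, names, digits stay as ASCII
            i += 1
    out.append(0x00)  # end-of-line marker
    return bytes(out)


def detokenise(data: bytes) -> str:
    """Inverse of tokenise: expand tokens back into their keywords."""
    return "".join(KEYWORDS.get(b, chr(b)) for b in data[:-1])


line = "GOSUB 100"
packed = tokenise(line)
print(packed.hex(" "))              # 8d 20 31 30 30 00
print(detokenise(packed) == line)   # True
```

The point is that nothing is reordered or restructured: the bytes come out in the same order as the source text, and the reverse mapping gets the original line back.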
Another system could technically "run" that same p-code (and maybe a CPU could even be microcoded to run it as its native instruction set, just it would be hugely inefficient to do so).
I think you may not be clear on how much difference there is between an interpreter that can run this kind of code and an "interpreter" that runs standard machine code. For example, CPUs do not normally have "allocate this variable name as a storage location in a heap and maintain the mapping" as an instruction. (And nor should they, most people would argue.)
The BASIC line number is not the memory address - but, like you said, the tokenized form (which is stored at some "system decided address", as indicated by the next/prev linked list addresses prior to the executive tokens) also contains the original user-supplied line number. That's how I meant "abstracted away" to the end user (on the specifics of what actual physical address the program tokens are stored).
Yes, but that's a trivial abstraction and, in earlier versions of MS-BASIC, that abstraction doesn't even exist. Every time the interpreter sees a GOTO 100 statement it doesn't "turn that into another number," but instead just searches its linked list for line number 100. (This is why it's good to put frequently called subroutines as early as possible in your program.)
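A minimal sketch of that lookup in Python, assuming the usual MS-BASIC record layout (two-byte little-endian pointer to the next line, two-byte little-endian line number, the tokenised text, a zero byte, and a zero pointer at the end of the program). The load address and the function names are arbitrary choices of mine, and the line bodies are just placeholder bytes.

```python
import struct

BASE = 0x8001  # arbitrary load address for the example


def build_program(lines, base=BASE):
    """Lay out (line_number, body_bytes) pairs as a linked list of records."""
    image = bytearray()
    for number, body in lines:
        next_addr = base + len(image) + 4 + len(body) + 1
        image += struct.pack("<HH", next_addr, number) + body + b"\x00"
    image += b"\x00\x00"  # zero link pointer: end of program
    return bytes(image)


def find_line(image, target, base=BASE):
    """Scan from the first record; return (address, records_visited)."""
    addr, visited = base, 0
    while True:
        (next_addr,) = struct.unpack_from("<H", image, addr - base)
        if next_addr == 0:
            return None, visited
        (number,) = struct.unpack_from("<H", image, addr - base + 2)
        visited += 1
        if number == target:
            return addr, visited
        addr = next_addr


# Ten placeholder lines numbered 10, 20, ..., 100.
PROGRAM = build_program([(n, b"REM") for n in range(10, 110, 10)])

print(find_line(PROGRAM, 10))    # (32769, 1): found at the first record
print(find_line(PROGRAM, 100))   # (32841, 10): ten records visited
```

The second result is the cost being described: the further down the program a target line sits, the more records get walked on every GOTO or GOSUB to it.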
To support much larger programs or combinations of programs, where you need to start using CHAIN, or for other reasons, there's no guarantee your program will always get tokenized to these same addresses in the general case (for tiny examples like this they likely will get the same address).
Right. And in MS-BASIC, that's completely unimportant as well, since it uses the line numbers directly as the addresses; it cares not at all at what physical addresses the lines are stored. You can even freeze program execution, remove the entire first line (let's assume it's a REM or whatever), shift everything else down, patch up all the linked list pointers, and carry on, and everything will be fine. (This may depend on what addresses the subroutine stack is holding, though, so let's assume we're not in a subroutine.)
In other words, addresses in an MS-BASIC program are the line numbers, not the locations in memory.
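Here's a rough Python sketch of that "remove the first line, shift everything down, patch the pointers" operation, again assuming the usual MS-BASIC record layout (little-endian link pointer, little-endian line number, body, zero byte, zero link at the end). The load address, the function names and the untokenised line bodies are all just for illustration.

```python
import struct

BASE = 0x8001  # arbitrary load address for the example


def make_image(lines, base=BASE):
    """Build link-pointer / line-number / body / 0x00 records at `base`."""
    image = bytearray()
    for number, body in lines:
        next_addr = base + len(image) + 4 + len(body) + 1
        image += struct.pack("<HH", next_addr, number) + body + b"\x00"
    return bytes(image + b"\x00\x00")


def drop_first_line(image, base=BASE):
    """Remove the first record, slide the rest down, and fix the pointers."""
    (first_next,) = struct.unpack_from("<H", image, 0)
    shift = first_next - base          # size of the record being removed
    rest = bytearray(image[shift:])    # everything after it, moved down
    offset = 0
    while True:
        (next_addr,) = struct.unpack_from("<H", rest, offset)
        if next_addr == 0:             # end-of-program marker
            break
        struct.pack_into("<H", rest, offset, next_addr - shift)
        offset = next_addr - shift - base
    return bytes(rest)


def lookup(image, target, base=BASE):
    """Follow the chain and return the body of line `target`, or None."""
    offset = 0
    while True:
        (next_addr,) = struct.unpack_from("<H", image, offset)
        if next_addr == 0:
            return None
        (number,) = struct.unpack_from("<H", image, offset + 2)
        if number == target:
            return image[offset + 4:next_addr - base - 1]
        offset = next_addr - base


LINES = [(10, b"REM"), (20, b"GOSUB 100"), (100, b"RETURN")]
before = make_image(LINES)
after = drop_first_line(before)

print(lookup(before, 100), lookup(after, 100))   # b'RETURN' b'RETURN'
print(lookup(after, 10))                         # None
```

Every record now lives at a different memory address, but "line 100" still means the same thing, because the GOSUB refers to the line number rather than to a location.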
But from the above token, you can see some trends: <33> is their code for assignment (=), D1 is associated with the variable J...
What are the $7E and $3A in that first line?
Anyway, even here we suddenly see a major difference between this and MS-BASIC's tokenisation of the same line:
Code:
         │ NEXT │ LINE  │ CODE
10 J = 0 │ .... │ 0A 00 │ 4A 20 EF 20 00
         │      │       │ 'J sp =  sp Ω
You'll note here that MS tokenisation is simply doing direct replacement of one symbol for another: there is no syntax change of the kind the IBM 5110 system above is making. (The IBM system is, from the looks of it, at least starting to build an AST, if $33 represents '='.) That's a major program transformation step, and generally the first one towards interpreting or compiling code. MS-BASIC does not do this when it tokenises code; MS-BASIC's tokenisation is really just a form of compression that also happens to make the next step of interpretation a bit easier.
...the similarity may be more to an instruction set than "assembly language" per se - the lines there get blurred to me (yeah, blasphemy to some; I do recognize that assembly generally does need two-passes to sort out the branch distances - I recall Gary Kildall writing some early 70's ones needing three passes - so I'm not trying to trivialize the specifics of an assembler).
But you can "trivialise" the specifics of an assembler compared to this, because an assembler doesn't actually need an AST (except perhaps for expressions in an operand field). The passes are irrelevant here; they're needed only to resolve forward references. Plenty of assembly-language programs can be assembled in a single pass.
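To illustrate how little structure an assembler needs, here's a toy one-pass assembler sketch in Python. The two-instruction "CPU" is invented for the example (the opcode bytes are borrowed from the 6502's NOP and JMP absolute only to have something concrete), and remembering unresolved operands in a fix-up list is just one common way to deal with forward references.

```python
def assemble(source, origin=0):
    """Single pass over the source; forward references are patched afterwards."""
    code = bytearray()
    labels = {}    # label name -> address
    fixups = []    # (offset of operand in code, label name)
    for line in source:
        line = line.strip()
        if line.endswith(":"):
            labels[line[:-1]] = origin + len(code)
        elif line == "NOP":
            code.append(0xEA)
        elif line.startswith("JMP "):
            code.append(0x4C)
            fixups.append((len(code), line[4:].strip()))
            code += b"\x00\x00"        # placeholder operand
        else:
            raise ValueError(f"unknown statement: {line!r}")
    for offset, name in fixups:        # fill in the remembered operands
        code[offset:offset + 2] = labels[name].to_bytes(2, "little")
    return bytes(code)


program = ["start:", "JMP end", "NOP", "end:", "JMP start"]
print(assemble(program, origin=0x1000).hex(" "))   # 4c 04 10 ea 4c 00 10
```

Each source line maps straight to bytes; at no point is there a tree of anything, only a table of label addresses and a list of holes to fill in.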
But the similarity I meant is just that principle of <opcode> <operands> getting interpreted (and ok, BASIC isn't unique in that - but its particular approach could fit in 4KB ROMs....
Well, yes. Building an AST isn't terribly hard and in fact, for Lisp, it's so easy that it will take up less space in ROM than for almost any other language. (This is because Lisp S-expression syntax is already an expression of an AST; you simply need to convert from using parens to indicate tree nodes to an actual tree data structure with pointers.)
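As a rough illustration of how small that conversion is, here's an S-expression reader sketched in Python. It uses nested Python lists in place of cons cells and pointers, handles only symbols and integers, and the function name is my own.

```python
def read_sexpr(text):
    """Turn S-expression text into nested Python lists (the tree)."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            node = []
            while tokens[pos] != ")":
                node.append(read())
            pos += 1  # skip the closing paren
            return node
        return int(token) if token.lstrip("-").isdigit() else token

    return read()


print(read_sexpr("(* (+ I 1) (+ J 2))"))
# ['*', ['+', 'I', 1], ['+', 'J', 2]]
```

There are no precedence rules and no statement grammar to deal with; every open paren starts a node and every close paren ends one.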
The code already exists to parse your sample into that fashion of <opcode> <operand> and to interpret it: it's nearly any ROM BASIC you can find.
Well, yes, that code has to exist for any language. But it's not "<opcode> <operand>"; it's an AST. Have a look at what your 5110 BASIC does for K = (I+1)*(J+2) for an example.
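For reference, this is the sort of tree I mean. The following Python sketch parses that statement with a small recursive-descent parser and prints the result as nested tuples. It says nothing about the bytes the 5110 actually stores (I don't know them); it only shows that the natural output of parsing is a tree, not a flat opcode/operand pair.

```python
import re


def parse_assignment(text):
    """Parse 'NAME = expression' into a nested-tuple tree."""
    tokens = re.findall(r"[A-Za-z]\w*|\d+|[-+*/()=]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def factor():
        token = take()
        if token == "(":
            node = expr()
            take()  # closing paren
            return node
        return int(token) if token.isdigit() else token

    def term():
        node = factor()
        while peek() in ("*", "/"):
            op = take()
            node = (op, node, factor())
        return node

    def expr():
        node = term()
        while peek() in ("+", "-"):
            op = take()
            node = (op, node, term())
        return node

    name = take()
    take()  # the "=" sign
    return ("=", name, expr())


print(parse_assignment("K = (I+1)*(J+2)"))
# ('=', 'K', ('*', ('+', 'I', 1), ('+', 'J', 2)))
```

The multiplication node has two sub-trees as its operands, each of which is itself an operator with operands, which is exactly what a single <opcode> <operand> pair can't express.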
Wang going so far as to even implement their BASIC using TTL chips.
I very much doubt that happened; I'd be interested to see what led you to that conclusion. I expect that Wang implemented a CPU in TTL, but had a software BASIC interpreter or compiler like everybody else.
APL struggled (IMO) just since finding a "normal" keyboard at all (early/mid 70s) was a challenge, let alone dealing with an extra translation of "funny symbols".
I certainly believe that was part of it. Another part would simply be the more mathematical approach. For whatever reason, people are fine with using rather sophisticated mathematical concepts if they learned them in elementary school (look up the history of zero and the equals sign if you like), but not with equally or, in some cases, less sophisticated stuff that is taught only later on.
From what I've seen, most of the "micros" didn't bother supporting the CHAIN keyword in BASIC.
I'm not really seeing how that's relevant.