Posted by skrrtww 5 days ago
On hardware as fixed as a game console, and with the accompanying anti-piracy/anti-cheating/emulation efforts of that industry, I'd expect it to be. From the history of emulating previous consoles, we know that any deterministic difference can and will be exploited, either to determine whether the hardware is authentic, or incidentally as a result of unintentional bugs.
This reminds me of the Z80, where two undefined flags resisted analysis for several decades; a 2-year-old set of slides on the state of that here: https://archive.fosdem.org/2022/schedule/event/z80/attachmen...
Famous last words:
https://tcrf.net/Cars_2_(PlayStation_3,_Xbox_360,_Windows,_W...
More context on how this value affects (at least one) DS game: see the post from December 27th, 2019.
Though it's not clear to me where the corruption after the second MLA instruction comes from, because the second block of three instructions should produce the same output as the first. It is possible that it was copied/pasted incorrectly.
I remember from when I used to disassemble compiled ARM code (not on the NDS though) that it was common to see CMP, followed by a bunch of instructions with one condition predicate, followed by a bunch of instructions with the opposite predicate.
In this case, it's subtly wrong to use that pattern, but only on older versions of ARM. That could reflect a very sneaky attempt to break emulators… but it could also just be a compiler bug.
That said, I too don't understand how corruption could be produced unless there was a copy/paste mistake.
Don't really understand this reaction. Why not? Seems to make for a nice regular design that the PC is just another register.
Reads from the PC return the address of the next instruction to be executed, so a simple exchange between two registers performs the branch, and supplies the return address. (I did end up special-casing the add instruction so that when adding to the PC the return address ends up in the source register.)
ADD r0, r15, #200
LDR r1, [r15, #-100]
etc
eor pc, pc, pc
If you can use it as an operand, it has a register number, so you can use it as a result, unless you special-case one or the other, which ARM didn't do because it was supposed to be simple. They could have ignored it by omitting some write decode circuitry, but why?
Yeah, I think AArch64 special-cases it? I'm not too familiar with their encoding or how they achieved it. My guess as to why is that it allows you to use more helpful registers (e.g. a zero register) in data processing instructions.
I think I can see your point though - from the perspective of ARMv4T's design, which was to be a simple yet effective CPU, making the PC a GPR does its job. Nowadays the standards are different, but I can see why it made sense at the time.
It's a design choice that makes sense in the classic RISC world where pipeline details are leaked to make implementation simpler. Delay slots in other RISCs work the same way. But it causes a lot of pain as implementations evolve past the original design, and it's why AArch64 junked a lot of AArch32's quirks.
The only downside was that it exposed internal details of the pipelining IIRC. In the ARM2, a read of the PC would give the current instruction's location + 8, rather than its actual location, because by the time the instruction 'took place' the PC had moved on. So if/when you change the pipelining for future processors, you either make older code break, or have to special case the current behaviour of returning +8.
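For emulator writers, the "+8" behaviour described above usually ends up as a special case in the register read path. A minimal sketch, assuming ARM state (4-byte instructions) and ignoring other corner cases; the names `read_reg`, `PC_INDEX` and `PIPELINE_OFFSET` are made up for illustration and not taken from any particular emulator:

```python
PC_INDEX = 15          # R15 is the program counter
PIPELINE_OFFSET = 8    # two 4-byte instructions ahead of the one executing


def read_reg(regs, index, current_instruction_addr):
    """Return the value an executing instruction sees when it reads a register.

    regs is a list of 16 integers; current_instruction_addr is the address of
    the instruction being executed (an assumption of this sketch: the emulator
    tracks that address separately from the visible PC value).
    """
    if index == PC_INDEX:
        # Reads of the PC expose the prefetch depth of the original pipeline.
        return (current_instruction_addr + PIPELINE_OFFSET) & 0xFFFFFFFF
    return regs[index]
```

So an instruction at 0x1000 that reads R15 sees 0x1008, regardless of how deep the real pipeline of a later implementation is.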
Anyway, I don't like their reaction. What they mean is 'this decision makes writing an emulator more tricky' but the author decides that this makes the chip designers stupid. If the author's reaction to problems is 'the chip designers were stupid and wrong, I'll write a blog post insulting them' then the problem is with the author.
But no, I really think that making the program counter a GPR isn't a good design decision - there are pretty good reasons why no modern arches do things that way anymore. I admittedly was originally in the same boat when I first heard of ARMv4T - I thought putting the PC as a GPR was quite clean, but I soon realized it just wastes instruction space, makes branch prediction slightly more complex, and decreases the number of available registers (increasing register pressure), all while providing marginal benefit to the programmer.
It's a good article though, the explanation of how multiplies work is nicely written.
You know this, but background for anyone else:
ARM's subroutine calling convention places the return address in a register, LR (which is itself a general purpose register, numbered R14). To save memory cycles - ARM1 was designed to take advantage of page mode DRAM - the processor features store-multiple and load-multiple instructions, which have a 16-bit bitfield to indicate which registers to store or load, and can be set to increment or decrement before or after each register is stored or loaded.
The easy way to set up a stack frame (the way mandated by many calling conventions that need to unwind the stack) is to use the Store Multiple, Decrement Before instruction, STMDB. Say you need to preserve R8, R9, R10:
STMDB SP!, {R8-R10, LR}
At the end of the function you can clean up the stack and return in a single instruction, Load Multiple with Increment After:
LDMIA SP!, {R8-R10, PC}
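To make the store/load pair concrete, here is a rough Python model of what the two instructions do with the 16-bit register list. It is only a sketch: memory is a dict keyed by word address, the function names are hypothetical, and it assumes the base register (SP, R13) is not itself in the list.

```python
def selected_registers(reg_list):
    """Decode the 16-bit bitfield: bit n set means Rn takes part."""
    return [r for r in range(16) if reg_list & (1 << r)]


def stmdb(regs, memory, base, reg_list):
    """Store Multiple, Decrement Before, with base writeback."""
    selected = selected_registers(reg_list)
    start = regs[base] - 4 * len(selected)
    addr = start
    for r in selected:            # lowest-numbered register, lowest address
        memory[addr] = regs[r]
        addr += 4
    regs[base] = start            # writeback: the stack has grown downwards


def ldmia(regs, memory, base, reg_list):
    """Load Multiple, Increment After, with base writeback."""
    addr = regs[base]
    for r in selected_registers(reg_list):
        regs[r] = memory[addr]    # loading R15 here is the function return
        addr += 4
    regs[base] = addr


SP, LR, PC = 13, 14, 15
PUSH_LIST = (1 << 8) | (1 << 9) | (1 << 10) | (1 << LR)
POP_LIST = (1 << 8) | (1 << 9) | (1 << 10) | (1 << PC)
```

Pushing with `PUSH_LIST` and popping with `POP_LIST` restores R8-R10 and drops the saved LR value straight into the PC, which is the single-instruction return described above.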
This seemed like a good decision to a team producing their first ever processor, on a minimal budget, needing to fit into 25,000 transistors and to keep the thermal design power cool enough to use a plastic package, because a ceramic package would have blown their budget.
Branch prediction wasn't a consideration as it didn't have branch prediction, and register pressure wasn't likely a consideration for a team going from the 3-register 6502, where the registers are far from orthogonal.
Also, it doesn't waste instruction space: you already need 4 bits to encode 14 registers, and it means that you don't need a 'branch indirect' instruction (you just do MOV PC,Rn) nor 'return' (MOV PC,LR if there's no stack frame to restore).
There is a branch instruction, but only so that it can accommodate a 24-bit immediate (implicitly left-shifted by 2 bits so that it actually addresses a 26-bit range, which was enough for the original 26-bit address space). The MOV immediate instruction can only manage up to 12 bits (14 if doing a shift-left with the barrel shifter), so I can see why Branch was included.
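For anyone who wants to see the arithmetic, a small sketch of how the branch target falls out of the 24-bit field. Assumptions: the offset is sign-extended, shifted left by 2, and added to the instruction's address plus 8 (the pipeline offset mentioned elsewhere in the thread); the result is masked to 32 bits here rather than to the original 26-bit space. `branch_target` is a made-up name.

```python
def branch_target(instruction, instruction_addr):
    """Compute the destination of an ARM B/BL instruction word."""
    offset = instruction & 0x00FFFFFF      # low 24 bits of the encoding
    if offset & 0x00800000:                # sign bit of the 24-bit field
        offset -= 0x01000000               # sign-extend
    # Word offset -> byte offset, relative to the PC as seen by the
    # instruction (its own address + 8).
    return (instruction_addr + 8 + (offset << 2)) & 0xFFFFFFFF
```

A signed 24-bit word offset shifted left by 2 covers a 26-bit byte range, i.e. roughly 32 MB in either direction.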
Indeed, mentioning the original 26-bit address space: this was because the processor status flags and mode bits were also available to read or write through R15, along with the program counter. A return (e.g. MOV PC,LR) has an additional bit indicating whether to restore the flags and processor state, indicated by an S suffix. If you were returning from an interrupt it was necessary to write "MOVS PC, LR" to ensure that the processor mode and flags were restored.
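A rough sketch of the combined R15 on the 26-bit parts, in case the layout isn't obvious: flags in the top four bits, the two interrupt-disable bits below them, the word-aligned PC in bits 25..2, and the mode in the bottom two bits. The helper names are hypothetical, and this deliberately ignores the restriction that user-mode code cannot change the mode and interrupt bits.

```python
PC_MASK = 0x03FFFFFC     # bits 25..2: word-aligned program counter
WORD_MASK = 0xFFFFFFFF


def split_r15(r15):
    """Break a 26-bit-style R15 into its fields."""
    return {
        "nzcv": (r15 >> 28) & 0xF,          # N, Z, C, V flags
        "irq_fiq_disable": (r15 >> 26) & 0x3,
        "pc": r15 & PC_MASK,
        "mode": r15 & 0x3,
    }


def write_r15(old_r15, value, restore_psr):
    """Model MOV PC,x (restore_psr=False) versus MOVS PC,x (True)."""
    if restore_psr:
        # MOVS: flags, interrupt bits and mode all come from the new value.
        return value & WORD_MASK
    # MOV: only the PC field changes; status bits are preserved.
    return (old_r15 & ~PC_MASK & WORD_MASK) | (value & PC_MASK)
```

With this model, returning from an interrupt without the S suffix would leave the processor in the handler's mode with the handler's flags, which is why "MOVS PC, LR" was needed.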
# It was acceptable in the 80s', It was acceptable at the time... #
Ken Shirriff has a great article "Reverse engineering the ARM1" at https://www.righto.com/2015/12/reverse-engineering-arm1-ance....
Getting back to multipliers:
ARM1 didn't have a multiply instruction at all, but experimenting with the ARM Evaluation System (an expansion for the BBC Micro) revealed that multiplying in software was just too slow.
ARM2 added the multiply and multiply-accumulate instructions to the instruction set. The implementation just used the Booth recoding, performing the additions through the ALU, and took up to 16 cycles to execute. In other words it only performed one Booth chunk per clock cycle, with early exit if there was no more work to do. And as in your article, it used the carry flag as an additional bit.
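To illustrate the "one Booth chunk per cycle, with early exit" idea, here is a small software model of a radix-4 (2 bits per step) Booth multiply. It is a sketch of the general technique, not a description of the actual ARM2 datapath: the cycle count is just the number of loop iterations, the `prev_bit` variable stands in for the extra bit carried between chunks, and the carry-flag side effects are not modelled. `booth_multiply` is a made-up name.

```python
MASK32 = 0xFFFFFFFF


def booth_multiply(rm, rs):
    """Return (low 32 bits of rm * rs, number of 2-bit steps taken)."""
    multiplier = rs & MASK32
    acc = 0
    prev_bit = 0      # bit just below the current chunk, initially 0
    shift = 0
    cycles = 0
    while cycles < 16:
        chunk = multiplier & 0x3
        # Radix-4 Booth digit in {-2, -1, 0, 1, 2}:
        #   digit = b0 + prev_bit - 2 * b1
        digit = (chunk & 1) + prev_bit - 2 * (chunk >> 1)
        acc = (acc + digit * (rm << shift)) & MASK32
        prev_bit = chunk >> 1
        multiplier >>= 2
        shift += 2
        cycles += 1
        if multiplier == 0 and prev_bit == 0:
            break     # early exit: every remaining digit would be zero
    return acc, cycles
```

For example, `booth_multiply(7, 6)` finishes in 2 steps, while a multiplier with its top bits set, such as 0xFFFFFFFF, runs all 16 steps.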
I suspect the documentation says 'the carry is unreliable' because the carry behaviour could be different between the ARM2 implementation and ARM7TDMI, when given the same operands. Or indeed between implementations of ARMv4, because the fast multiplier was an optional component if I recall correctly. The 'M' in ARM7TDMI indicates the presence of the fast multiplier.
No, it's exactly backwards: supporting PC as a GPR requires special circuitry, especially in original ARM where PC was not even fully a part of the register file. Stephen Furber in his "VLSI RISC Architecture and Organization" describes in section 4.1 "Instruction Set and Datapath Definition" that quite a lot of additional activity happens when PC is involved (which may affect the instruction timings and require additional memory cycles).
From a CPU emulator writer's perspective this isn't all that strange. For instance on Z80 the immediate jump instruction `JP nnnn` is loading a 16-bit immediate value into the internal PC register, which is the same thing as loading a 16-bit value into a regular register pair (e.g. 'LD HL,nnnn') - e.g. the mnemonics for the jump instruction could just as well be `LD PC,nnnn` ;)
A relative jump (which does a signed-add of an 8-bit offset value to the 16-bit address in PC) is the same math as the Z80 indexed addressing mode (IX+d) and (IY+d) (I don't know though if the same transistors are used).
A RET (load 16-bit value from stack into PC) is the same operation as a POP (load 16-bit value from stack into a regular register pair).
...so it's almost surprising that the program counter isn't exposed as a regular register in most (traditional) CPUs. I guess in modern CPUs it's not so simple because of the internal pipelining though.
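The three Z80 parallels above can be sketched by treating PC as just another 16-bit attribute. This is a toy model with made-up names (`Z80Sketch`, `load16`, `add_displacement`, `pop16`), not how any real emulator is structured, and it says nothing about which transistors the actual chip shares:

```python
def sign_extend_8(value):
    """Interpret the low 8 bits of value as a signed displacement."""
    value &= 0xFF
    return value - 0x100 if value & 0x80 else value


class Z80Sketch:
    def __init__(self):
        self.pc = 0
        self.sp = 0
        self.hl = 0
        self.memory = bytearray(0x10000)

    def load16(self, reg, value):
        """LD HL,nn when reg == 'hl'; JP nn when reg == 'pc'."""
        setattr(self, reg, value & 0xFFFF)

    def add_displacement(self, base, d):
        """Same signed 8-bit add for (IX+d) addressing and for JR e.

        For JR, base is the address of the following instruction.
        """
        return (base + sign_extend_8(d)) & 0xFFFF

    def pop16(self, reg):
        """POP HL when reg == 'hl'; RET when reg == 'pc'."""
        low = self.memory[self.sp]
        high = self.memory[(self.sp + 1) & 0xFFFF]
        self.sp = (self.sp + 2) & 0xFFFF
        setattr(self, reg, (high << 8) | low)
```

In this model `cpu.load16("pc", 0x1234)` is a jump and `cpu.pop16("pc")` is a return, with no code specific to the PC.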
There are operations required for the PC that are not needed for regular registers, e.g. conditional add (a.k.a. conditional relative jump), add-and-store and load-and-store (a.k.a. procedure call).
On the other hand, there are many operations that are needed for regular registers and which are useless for the PC, e.g. logical operations, shift/rotate, multiplication and division and others.
Because of this, encoding the PC as a regular register is pointless and wasteful of the instruction encoding space.
Moreover, when the ISA has an implicit stack pointer, which is also the only register that can be used as a stack pointer, like the x86 ISA, the set of operations that are used with the SP is a very small subset of the operations available for the regular registers, so encoding the SP as a regular register is also wasteful. Especially in 32-bit x86, where the number of architectural registers was very small, it would have been better if the SP had not been encoded as a regular register, wasting a register number.
I'd add to that: what you give is the reason it's /okay/ to expose the PC as a special register instead of a GPR. The reason that it's /important/ to is that the PC is accessed on every instruction fetch, so if it's part of a uniform register file, it basically eats up an entire read port of that register file. Register file size scales badly with port count (much worse than it does with register count), so this ends up adding quite a bit of area. (You can hack around this by having a single dedicated read port for just the PC register, but then you're halfway to an SPR.)