Work and discussion on RAMdisk errors and ECC correction
last update Feb 12 9:49PM EDT Herb Johnson

On Jan 28, 2024, Herbert Johnson wrote:
 
I know you are busy, so I'll be brief and I have no requests or
questions at the moment. Just wanted to update you.
 
1) again, my RTS problem was resolved, it was a bad serial-dongle. RTS
now operates between Novasaur and TeraTerm with proper connections and I
transferred the test CRC program file without failures, watched RTS toggle.
 
2) I updated my Web page to complete the
problem and solution description with that result. So there's no
dangling "what broke?" on my Web page. The fault was mine but I cast it
as a learning experience.
 
3) there are no links to my Web page from my Web site [at the present time]. So it won't be
found anyway by most search engines, to confuse anyone. I'll link to it
later when it's better edited. ... But I'll keep most of the diagnostic and
explanatory content, because it's my practice to show how such work is
done on vintage microcomputers - and your work qualifies.
 
I'll add your lower-level descriptions of the serial microcode at a
later time. They are pretty good. But I did add to my Web page your
conclusion: that if RTS is ignored, your microcode will
wrap around the buffer, in effect missing a block (128 bytes) while
doing so. That's the fault I produced, so I'm glad you made that observation.
 
Thanks for your correspondence and descriptions. I'll edit and add your
notes to my Web page. And of course, they are notes for your own
documentation, perhaps my edits will be informative to you. - Regards, Herb

X-Pm-Date: Sat, 03 Feb 2024 21:54:22 +0000 
To: "Herbert Johnson"   
From : "Alastair Hewitt"
Subject: Re: assembling Rev 9 - serial transfer, resolved 

Hi Herb,

I've read through your webpage and it is very detailed and informative. Thanks for documenting this! I'll be doing more of my own documentation eventually, but it's good to see what kind of issues you've run into and the details you're looking for.

I went down a bit of a rabbit hole on the disk this week, but have successfully come out the other side. The RAM drive can now recover any single-byte error (per record) and will identify and return an error to CP/M if there is more than one byte in error per record. As mentioned earlier, there is a scenario where the machine might write a random byte to memory when powered off. I can reproduce this by rapidly cycling the power button for several seconds. It is less likely to happen under normal operation and should be fully recoverable now.

This is the update: The A: drive has been reduced slightly from 254k to 250k (2000 records) freeing up 32 records for ECC codes. A 2-byte ECC is calculated and stored when a record is written to the RAM disk. The 2-byte ECC is recalculated from the record when read from disk and compared to the stored value. If there's a mismatch then the ECC is used to identify and fix one byte. If only one byte is in error then the record is fixed and a recalculated ECC will now match the stored value. If there is still a mismatch then there is an unrecoverable error in the record and a 1 is returned in the CP/M READ BIOS call.
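
A minimal sketch of the write/read/correct flow described above, in Python rather than Novasaur code. The 2-byte code used here is an assumption: a Reed-Solomon-style pair of check bytes over GF(2^8), chosen because it can locate and fix one bad byte in a 128-byte record. It is not claimed to be the code Alastair implemented. The names ecc2, put_record, get_record and the dictionary-based disk are hypothetical.

```python
RECORD_SIZE = 128  # CP/M record length in bytes


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reduction polynomial 0x11D."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


# ALPHA_POW[i] is 2**i in GF(2^8); the first 128 powers are all distinct.
ALPHA_POW = [1]
for _ in range(RECORD_SIZE - 1):
    ALPHA_POW.append(gf_mul(ALPHA_POW[-1], 2))


def ecc2(record):
    """Return a 2-byte code (assumed scheme): plain XOR and position-weighted XOR."""
    s0 = 0
    s1 = 0
    for i, byte in enumerate(record):
        s0 ^= byte
        s1 ^= gf_mul(byte, ALPHA_POW[i])
    return s0, s1


def put_record(disk, ecc_table, n, record):
    """Write a record and store its freshly calculated ECC."""
    disk[n] = bytearray(record)
    ecc_table[n] = ecc2(record)


def get_record(disk, ecc_table, n):
    """Return (status, data): status 0 is good, 1 is unrecoverable."""
    record = disk[n]
    stored = ecc_table[n]
    s0, s1 = ecc2(record)
    if (s0, s1) == stored:
        return 0, bytes(record)
    # A single bad byte at position i with error value d0 gives d1 = d0 * 2**i.
    d0 = s0 ^ stored[0]
    d1 = s1 ^ stored[1]
    for i in range(RECORD_SIZE):
        if d0 and gf_mul(d0, ALPHA_POW[i]) == d1:
            record[i] ^= d0  # fix the record in place
            break
    # Recalculate; any remaining mismatch is reported as unrecoverable.
    if ecc2(record) == stored:
        return 0, bytes(record)
    return 1, bytes(record)
```

For example, after put_record(disk, ecc_table, 5, data), flipping bits in any one byte of disk[5] still lets get_record(disk, ecc_table, 5) return status 0 with the original data.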

You get a lot of errors the first time you power up or after replacing the battery. This is understandable since all the ECC codes are wrong. You have to press enter a few times to bring up the A: prompt. I might be able to interrupt this and switch to the B: drive. The errors go away once you format the drive.

There's an additional disk check command I wouldn't mind adding that I could use via another program to verify the disk. This could also be used in the background to monitor and repair the disk. It would be good to add an ECC to the records holding the ECC codes as well. I'll probably hold off on that for now since I want to revisit the XFER command and add support for XMODEM.

Thanks, Alastair

Herb, sent midnight Feb 3rd-4th

Thanks for your positive responses on my Web page. That's exactly what I hoped to accomplish from your point of view. Of course tell me of any egregious errors; mild errors aren't critical. I'll add your previous notes on serial transfer in the near future.

On your disk ECC work: I suppose it's a rabbit hole, but this was the problem in the 1970's with floppy disks, preceding the development of single-chip FDCs (early microcontrollers, actually). Once the FDCs became standard, these bit/flux level issues and track/sector formatting were buried in the FDC. Previous hard-sectored controllers did this work at the processor level, but I don't believe any of these used ECC codes. Checksums detected errors and that was sufficient.

But you have processing power, and the cost in storage was only about 1.6% (32 records out of 2032), not awful. The most important feature is detection, in my opinion. When your microcode read-sector encounters a 1-byte correctable error, I have two questions. 1) What is the performance hit, the time needed to calculate twice (once to detect, once to correct)? 2) Do you report a "soft error" back upstream? If you report no-error, soft-error, hard-error, then higher level routines can notify the user in some way.
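
The three-level report suggested here could look something like the following Python sketch. The names ReadStatus, classify and report, and the numeric values, are hypothetical; only "0 is good, 1 is unrecoverable" comes from the READ call described earlier.

```python
from enum import IntEnum


class ReadStatus(IntEnum):
    OK = 0          # stored and recalculated ECC agree
    HARD_ERROR = 1  # mismatch remains after the correction attempt
    SOFT_ERROR = 2  # mismatch found, one byte corrected (hypothetical value)


def classify(mismatch_before, mismatch_after):
    """Map the two ECC comparisons made during a read to a status."""
    if not mismatch_before:
        return ReadStatus.OK
    if mismatch_after:
        return ReadStatus.HARD_ERROR
    return ReadStatus.SOFT_ERROR


def report(record_number, status):
    """Message a higher-level routine might show; empty when nothing is wrong."""
    if status is ReadStatus.OK:
        return ""
    kind = "corrected" if status is ReadStatus.SOFT_ERROR else "UNRECOVERABLE"
    return f"record {record_number}: {kind} read error"
```

A caller that only tests for "nonzero means failure" would need to treat SOFT_ERROR as success, which is one reason a real implementation might pick different values.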

On your other points:

> You get a lot of errors the first time you power up

Well, formatting a known-damaged RAMdisk is always a good idea; that clears out the directory tables. Otherwise the garbage confuses the "find an available sector" code.

> There's an additional disk check command I wouldn't mind adding 

Running a checkdisk 8080 program (read every sector, trigger ECC correction) will resolve errors. It could also report prior errors and access that bad-sector table. And of course you need that bad-sector table to inform format (another program that triggers ECC correction).

It might not be wise to combine FORMAT and CHECKDISK as one program, that is an oops-able situation! But use of most of the same code is a good thing.
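
As a rough illustration of the checkdisk idea, here is a Python sketch of the scan loop. The function check_disk and the read_record callable are hypothetical; the only assumption taken from the discussion is that a read returns nonzero for an unrecoverable record and that reading a record triggers any possible correction.

```python
def check_disk(read_record, record_count):
    """Read every record once and return the numbers of those that failed."""
    bad_records = []
    for n in range(record_count):
        status = read_record(n)  # reading is what triggers ECC correction
        if status != 0:          # nonzero: the record could not be repaired
            bad_records.append(n)
    return bad_records


# Usage with a stand-in reader that pretends records 3 and 1999 are bad.
failed = check_disk(lambda n: 1 if n in (3, 1999) else 0, 2000)
```

The returned list is a simple form of the bad-sector table mentioned above, which a separate format program could consult.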

> It would be good to add an ECC to the records holding the ECC codes as well.

...but you can checksum the records, detect any error. Rebuilding the ECC tables would be another feature of CHECKDISK, if not another program. There may be lots of little programs like these.

Meanwhile, there's that lurking RAM error problem, which necessitated some of the actions we are talking about here. Results will be diagnostic for the degree of the problem. - Regards Herb

Feb 4, Alastair Hewitt

In theory you should never have an issue with the RAM getting corrupted. The RAM chip enable is released before the voltage regulator gets the disable signal and powers off. I've been intentionally trying to break things and see what happens. It's one of those things that can happen and I've always wanted some kind of protection against it. There is still an issue where I've been able to corrupt one of the ECC codes because they are always written/read when a record is sent/received from the disk. If the ECC value is corrupted then the related record on the disk is then "fixed" accordingly. Adding error correction to the ECC values is kind of a circular problem so I might just leave it.

I'm not returning any soft errors in the current ECC code. It basically returns a non-zero value in H if there was an unrecoverable error. I only have 256 bytes of code available and am trying to keep it as simple as possible.

Feb 4, Herb Johnson

[I responded with a discussion about the diagnostic value of returning soft and hard error information.]

On Feb 12, Alastair Hewitt wrote:

I initially developed some code to do a checksum on the records containing the ECC codes. It turned out I was not able to fit this in the 256 bytes I have available for the disk code. I went back to using the ECC code to check the records containing the ECC codes. Here I'm reusing the same code so it's a lot more efficient... except there's a recursive self-referential problem.
 
This is basically what kept me busy. The good news is I figured things out this morning after sleeping on it. My original plan for handling this didn't work, but there was a more efficient way of solving the problem that does.

I think I can safely say the disk code is done. I also got it to fit in just 235 bytes, so over 20 bytes to spare! I can now get back to figuring out XMODEM. 

Alastair added later:

This was the inspiration for the ECC implementation - https://stackoverflow.com/a/16331272

This is only used with the RAM disk. The RAM-disk code handles a GET record and PUT record command. The PUT will copy a record to the RAM disk, calculate the ECC, and write that to the drive ECC record. This is checked when the record is read back using the GET command. If the ECC code doesn't match then it is used to attempt a correction of the record.

As soon as a record is written to the drive the ECC code is updated in the ECC record. I have an ECC on this ECC record, but it is now invalid since the record was changed. I mark the record as "dirty" to indicate it needs to be recalculated. I don't recalculate it right away since that adds a lot of overhead. You don't recalculate the one record since its ECC is in another ECC record that will also need to be recalculated. This is the recursive problem. The resolution is to leave the last ECC code outside of the ECC records so it does not change any records and force a recalculation.

There's a CHK command that will recalculate the ECC on these ECC records if they are marked "dirty". If the records are clean then it checks the ECC and will attempt to fix any single-byte errors. If there is an unrecoverable error then the ECC records are marked with a status of "Bad" and the CHK command does not attempt to check, repair, or update the ECC records unless forced. The FORMAT command will force the CHK command to run and recalculate the ECC records since it is a known good starting point even if it was in the unrecoverable state.
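
The clean/dirty/bad behaviour described above can be sketched as a small state machine in Python. This is only a model under stated assumptions: the class EccRecords, its method names, and the compute_code and try_repair callables are hypothetical stand-ins for the real ECC routines, and keeping the "bad" state sticky across later writes is a guess, not something stated above.

```python
CLEAN, DIRTY, BAD = "clean", "dirty", "bad"


class EccRecords:
    """Hypothetical model of the ECC records and the code that guards them."""

    def __init__(self, data, compute_code, try_repair):
        self.data = bytearray(data)
        self.compute_code = compute_code  # stand-in for the 2-byte ECC
        self.try_repair = try_repair      # returns True if the data was fixed
        # The guard code lives outside the ECC records, so updating it
        # never dirties another record (this breaks the recursion).
        self.guard = compute_code(self.data)
        self.state = CLEAN

    def update(self, offset, value):
        """A record write changed an ECC entry: defer the recalculation."""
        self.data[offset] = value
        if self.state == CLEAN:
            self.state = DIRTY

    def chk(self, force=False):
        """Model of the CHK command; FORMAT would call chk(force=True)."""
        if self.state == BAD and not force:
            return self.state  # leave it alone unless forced
        if self.state == DIRTY or force:
            self.guard = self.compute_code(self.data)
            self.state = CLEAN
        elif self.compute_code(self.data) != self.guard:
            repaired = self.try_repair(self.data, self.guard)
            self.state = CLEAN if repaired else BAD
        return self.state
```

The point of the design is that writes stay cheap (they only set a flag), while the expensive recalculation happens once, when CHK runs.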