It would be nice if the intro had a brief explanation of why a disk needs to be ...

masklinn · on July 27, 2020

> It would be nice if the intro had a brief explanation of why a disk needs to be divided into blocks.

One reason is that HDDs simply don't have a byte-wise resolution, so there's little point talking to HDDs in sub-sector units. Sectors are usually 512 bytes to 4k.

A second reason is being able to simply address the drive. Using 32b indices, if you index bytewise you're limited to 4GB which was available in the early 90s. With 512 bytes blocks, you get an addressing capacity of 2TB, and with 4k blocks (the AF format), you get 16TB. In fact I remember 'round the late 90s / early aught we'd reformat our drives using higher-size blocks because the base couldn't see the entire thing.

jagged-chisel · on July 27, 2020

> HDDs simply don't have a byte-wise resolution

Sufficient explanation for code. Now why is it that disks lack byte-wise resolution?

cesarb · on July 27, 2020

> Now why is it that disks lack byte-wise resolution?

Overhead.

Each disk sector is not 512 bytes (or 4096 bytes in modern disks), it's a bit more. It needs at least the sector number (to detect when the head is flying over the correct sector), a CRC or error-correcting code, a gap of several hundred bits before the next sector (so writing to one sector won't overwrite the next one), and a preamble with an alternating pattern (to synchronize the analog to digital conversion). All this overhead would be a huge waste if the addressable unit were a single byte.

avianlyric · on July 27, 2020

Cost, more resolution means more transistors, address lines, and protocol overhead (your memory address take up space too).

You could build a HDD with byte-wise resolution, which did a whole load of clever things internally to handle error correction (modern HDDs have far more complex error correction that just block wise checksums, they’re a piece of technological magic and analogue electronics).

But sure a device would cost far more, and offer no real benefits. Especially when you consider that reading a single byte in isolation is a very rare task, and it takes so long (compared to accessing a CPU cache) to find the data and transfer it that you might as well grab a whole load of bytes at the same time.

Byte-wise resolution in a HDD is like ordering rice by the grain in individual envelopes. Sure you could do it, but why the hell would you?

maccam94 · on July 27, 2020

I believe for performance and error correction features for spinning hard drives. Check out the tables and graphic in the overview section here[1]. Any write to a sector requires updating the checksum, which means the whole sector has to be read before the drive can complete the write. Since the sector is probably in the kernel disk cache already, the whole sector can be flushed to disk at once and the drive can compute the checksum on the fly (instead of flushing 1 byte, making the drive do a read of the sector, recompute the checksum, and then do a write).

1: https://en.wikipedia.org/wiki/Advanced_Format

pkaye · on July 27, 2020

Because of the sector based error correction. I remember some earlier 8-bit disks has 128 byte sectors. Some newer flash SSDs have 1k-4k physical sector size for ecc coding efficiency so in those cases they will internally do a read-modify-write to handle a 512 byte logical sector size.

tinus_hn · on July 27, 2020

Try designing the format keeing the required features in mind and you might seen why they have.

The hardware has to be able to read and write data from an arbitrary location and do some kind of checksum for error detection and they have to be able to efficiently transfer this data to the cpu.

e12e · on July 27, 2020

.. And does it hold true for ram disks?

avianlyric · on July 27, 2020

Yes it would, but you would be working in cache lines rather than disk blocks.

When working with RAM your PC will transfer a cache line of memory from RAM to your CPU caches. Again this is a feature of hardware limitations and trading off granularity against speed.

Your CPU operates many time faster than RAM so it makes sense to request multiple bytes at once, then your RAM can access those bytes, and line them up on it’s output pins ready for your CPU to come back and read them into cache. This give you better bandwidth. (It’s a little more complicated than this because the bytes are transferred in serial, rather than in parallel, but digging into the details here is tad tricky).

On the granularity point, the more granular your memory access pattern is, the more bits you need to address that memory. In RAM that either means more physical pins and motherboard traces, or more time to communicate that address. It also means more transistors are needed to hold that address in memory while you’re looking it up. All of those things mean more money.

And before we start looking at memory map units, that let you do magical things like map a 64 bit address space into only 16GB of physical RAM. Again the greater the granularity, the more transistors your MMU needs to store the mapping, and thus more cost.

So really the reason we don’t have bit level addressing granularity is ultimately cost. There’s no reason you could build a machine that did that, it would just cost a fortune and wouldn’t provide any practical benefits. So engineers made trade off to build something more useful.

yencabulator · on July 28, 2020

For what it's worth, this is slowly starting to change. https://venturebeat.com/2019/09/20/optane-persistent-memory-...

tl;dr This new tech might do to SSDs what SSDs did to hard drives. Might.

monadic2 · on July 27, 2020

Well, one benefit would be a relaxation of constraints on filesystems. This might be worth performance loss.

baruch · on July 27, 2020

A drive (be it HDD or SSD) is a block device, all drives are divided into blocks/sectors since they are a rather lossy medium. Data written to a bit may not necessarily be read accurately and so the solution is to have an Error-Correcting-Code that will allow to recover from some number of badly read bits (above that and you get a medium error from the drive). Since an ECC for a byte is a bit excessive the drives use blocks of 512 bytes or 4096 bytes, one reason for the move from 512 to 4096 is that the ECC becomes more efficient.

HDDs also have small "gaps" that have headers to locate the sectors (in the distant past you could do a low-level format to correct these gaps as well).

SSDs do not have these gaps but they do need the ECC.

im3w1l · on July 27, 2020

In the bad old days we stored data on spinning-rust harddrives. In order to find the data you wanted you had to position the reader head. This took several milliseconds. You could read a lot of data in that time. So basically if you took the trouble to seek somewhere you better read a bunch of data to make it worthwhile.

People would even run defragmenters, that would rearrange file blocks so they were stored continously and the whole file could be read without seeeking.

unethical_ban · on July 27, 2020

The disk / inodes need to know where to start looking for a file's contents, like the address in memory for RAM. Or like the mail: We subdivide by city, then ZIP, then street, then address.

So the inode says "The data for my file starts at block 72 and is 3 blocks long" (or something like that). The disk then goes there, and reads blocks 72,73,74.

Each block is 4KiB large often, so if you have a 10KiB file, you still take up ceiling(file size/block size) blocks.

That's why there is a difference between "File size" and "Size on disk" when you look at disk usage summaries.

saurik · on July 27, 2020

This doesn't explain why blocks are valuable as you could use byte addressing. The reason why blocks are valuable is for similar reasons to why memory is divided into pages. (Which is all I am going to write, as I don't have the time to answer this well today. But "you need to know where things are on disk" isn't an answer.)

sedatk · on July 27, 2020

This isn’t the reason about block addressing at all since it existed before paging was a thing. Addressing storage content in blocks is closely aligned with how direct access storage is organized in “sectors”, usually 512-byte blocks. Disks don’t have byte access interfaces, so it doesn’t make sense to use one.

It also lets you to address much larger data structures with limited sized variables. With block addressing, you can access terabytes of storage with a 32-bit integer.

saurik · on July 27, 2020

(I am awake now, so I can provide a more useful answer to this question.)

> This isn’t the reason about block addressing at all since it existed before paging was a thing.

I didn't say "blocks exist because pages exist", I said "blocks exist for similar reasons to why pages exist". How you thought that required temporal causality is beyond me. :/

> Addressing storage content in blocks is closely aligned with how direct access storage is organized in “sectors”, usually 512-byte blocks.

OK, great: now you have to answer why, as you are just punting the answer. The answer, FWIW, is similar to the answer for why memory is divided into pages.

> Disks don’t have byte access interfaces, so it doesn’t make sense to use one.

But, of course, they could have a byte-access interface; in fact, that was in some sense the entire question: why don't they? Note, BTW, that RAM also doesn't have a byte-access interface: you have to access it by page. (And hell: many modern disks are just giant flash memories!)

The super high-level API for working with memory provided by the CPU as part of instructions lets you access it by the byte, but again: that's also true of disks and blocks, where the super high-level API for working with files absolutely lets you work with it accessing specific bytes.

> It also lets you to address much larger data structures with limited sized variables. With block addressing, you can access terabytes of storage with a 32-bit integer.

This is true, and is a partial answer to the original question. Essentially, a filesystem is serving the same problem space as a page table, but instead of mapping virtual pages of processes to physical pages it is mapping block-sized regions of files to blocks on disk.

In both cases, it is thereby beneficial to save some bits in that map. However, as pages of memory are usually 4k and blocks on disk are usually 512 bytes--though you see larger for both of these (64k pages are becoming popular)--that really isn't that many bits you are saving.

The real reason is that there is a cost to random access: the more entries in your page-table / filesystem block list the more space it takes and the harder it is to find the one you want; and, while you can "compress" ranges, that also comes with its own cost on random access.

On disks made out of spinning rust, this random access cost is particularly ridiculous, as it might require rotating a heavy piece of metal and physically adjusting magnets to achieve the specific offset you are looking for. You thereby never want to load a single byte at a time: you want to pull a full block.

There is also another interesting cost to random access on writes, which applies to some kinds of memory and applies to most kinds of disks, which is that you just don't have the precision required to write a single byte at a time: think about trying to isolate one byte using a magnet!

So the way these things work is that they have to read large subsets of data at one time and then write it back. If you can align your filesystem to match the blocks of a file along these write boundaries, you thereby decrease the amount of work the disk has to do.

(And, again: this is very similar to the work that memory managers are doing with respect to attempting to align pages of large allocations to make the virtual memory areas efficient in the page table, which, again, is pretty much a filesystem for memory with processes as files.)

You thereby also often find yourself attempting to tune your block/page size to that of the underlying hardware, and if you build higher-level data structures (such as databases using B-trees or dictionaries using RB-trees) you will want to align those at the higher-level as well.

(And it is thereby then also fascinating that it took as long as it did for people to realize that if you want to implement an in-memory tree that you should always prefer a B-tree--a data structure designed around disk blocks--over an RB-tree, for all of the same reasons.)

Blocks (and pages) thereby exist to help all layers of the system most efficiently take advantage of linear access patterns to get efficient parallel loading and caching to solve a problem that is otherwise subject to high random access costs and potentially-ridiculous fragmentation issues.

sedatk · on July 28, 2020

Yeah, I read your answer as if disk block addressing was directly related to paging. My bad.