| News | Features | Interviews |
| Blog | Contact | Editorials |
| 'No, ZFS Really Doesn't Need a fsck' |
| By Thom Holwerda, submitted by poundsmack on 2009-11-06 23:42:54 |
| "There is a discussion at osnews.com about a simple question: "Should ZFS Have a fsck Tool?". The answer is simple: No. I could stop now, as this answer is pretty obvious when you work a while with ZFS, but i want to explain my position. And i want to ask a different question at the end." |
| Contradictory post... |
| By diegocg on 2009-11-07 01:15:28 |
|
To allow ZFS to be crash proof, certain really basic mechanisms must be implemented in a way that adheres to specifications and standards. That doesn't always happen in the real world, be it because the devices are buggy, because the devices are great but have a firmware bug, or because they are old and have stopped working properly. So there will be a very, very small, but existent, group of users who will have problems. The cause of the problem is not ZFS' fault, it's the hardware's fault, but the lack of a tool to fix the filesystem or recover data from it is the filesystem's fault, not the hardware's.

The whole post explains very well why ZFS reliability depends a tiny bit on hardware behaviour, which is equivalent to saying that ZFS doesn't rely entirely on its own design to avoid every kind of problem. However, the "ZFS doesn't need fsck" attitude assumes that the ZFS design can avoid every kind of problem... that's somewhat contradictory.

The need for helpers to fix things is clearly there, just take a look at the last month on the ZFS lists. Here's http://mail.opensolaris.org/pipe... one example from 6 days ago. And also the second paragraph of http://mail.opensolaris.org/pipe... this email, which pretty much says what I just wrote but in fewer words: "I've no objection to deciding how much recovery tools are needed based on experience rather than wide-eyed kool-aid ranting or presumptions from earlier filesystems, but so far experience says the recovery work was really needed"

BTW: Linux has similar problems with problematic hard- and software, like components not honoring write barriers. But it has a fsck, which makes Linux (and Solaris UFS) users think "hey, something can get corrupted due to a bad disk that doesn't handle the sync cache commands correctly, but if I hit the problem at least I can try to fix it".

> Some may call the results of PSARC 2009/479 something like an fsck tool, but it isn't.

The disk state is inconsistent and the tool fixes it - it's a fsck. The uberblock is part of the filesystem metadata, so a wrong uberblock is a filesystem inconsistency. Just because that tool is transaction based doesn't mean it isn't fixing something. Besides, the kind of corruption that you can hit with bad hardware is not necessarily an uberblock that points to something that hasn't been written to the disk due to bad cache handling - it can be other things. When hardware fails, the resulting behaviour is undefined.

> At first you should throw the sub-sub-substandard hardware in the next available trash bin after copying the data to a storage subsystem of better quality and wiping the old disks.

Well, one of the most common ZFS catchphrases is that you can do reliable storage with very cheap disks - so it's quite probable that users and enterprises will do exactly that, don't you think?

Edit: Trying to fix links... Edited 2009-11-07 01:34 UTC |
| I think what you are really saying ... |
| By JoeBuck on 2009-11-07 01:29:02 |
| ... is that ZFS needs a recovery tool, but that rather than try to repair damage, it should instead find the last valid snapshot and make that snapshot be the current state. However, to do that the tool needs to be able to do enough consistency checks to determine which snapshots are valid. So it does file system checking, and it does recovery. Sounds like fsck. The only distinction is that it doesn't try to rewrite a bad state into a consistent state. |
| RE: Contradictory post... |
| By Dryhte on 2009-11-07 05:54:14 |
|
> At first you should throw the sub-sub-substandard hardware in the next available trash bin after copying the data to a storage subsystem of better quality and wiping the old disks. Well, one of the most common ZFS catchphrases is that you can do reliable storage with very cheap disks - so it's quite probable that users and enterprises will do exactly that, don't you think?

What would also be interesting is a sort of HCL, or a set of criteria with which the average user can decide which of his hard disks he should not use ZFS on. It's not like there are that many hard disk manufacturers, so how can we decide which of their hard disks we can trust our data to? |
| It sounds like... |
| By Lennie on 2009-11-07 10:45:25 |
|
It sounds like this situation asks for a per-blockdevice consistent state that can be rolled back. So transactional at the blockdevice level, not the filesystem level. That way the system can detect that one blockdevice (hard disk, partition, SSD, USB, whatever) didn't save the latest state, and can roll back to the previous transaction that all blockdevices have, or ask the user if that is what they want. Maybe it already has that; I've not yet used ZFS and don't know its internals. From the article and comments I think you could conclude something similar might be going on, but the author isn't a filesystem designer. I'm not such a person either. ;-) |
| As someone wrote: |
| By Kebabbert on 2009-11-07 12:19:40 |
|
"One: The user has never tried another filesystem that tests for end-to-end data integrity, so ZFS notices more problems, and sooner. Two: If you lost data with another filesystem, you may have overlooked it and blamed the OS or the application, instead of the inexpensive hardware." |
| RE: It sounds like... |
| By c0t0d0s0 on 2009-11-07 13:51:26 |
|
The state isn't set back per block device, it's set back per pool. This recovery is valid for *all* filesystems and emulated volumes in a dataset. Don't think of ZFS as a regular filesystem. It's a combination of a volume manager and a filesystem. The filesystem is just one view onto a data pool via the ZFS POSIX Layer. But emulated block devices can exist in the same pool via the ZFS Emulated Volume Layer. You have to think differently in the context of ZFS. I'm not a ZFS developer, but I've been working with ZFS since 2005 ... so I have some knowledge about it ;) |
| RE[2]: Contradictory post... |
| By c0t0d0s0 on 2009-11-07 13:55:26 |
|
The point is: you shouldn't use such devices with other filesystems either. Just say NO to such disks. With ZFS you simply notice those errors. Since I run regular scrubs over the datasets on my home fileservers, I'm pretty disappointed by the quality of SOHO drives. BTW: when you are using disks directly attached via SATA or SAS, you won't see such problems. Those disks are reasonably free of major bugs. The problems start when you have some cheap SATA/PATA to Firewire or USB converters. |
| RE: I think what you are really saying ... |
| By c0t0d0s0 on 2009-11-07 13:59:54 |
| It doesn't check. But you can. That's a different feature of ZFS, though: the scrubbing. Such a check would take a long time. It falls back to another transaction group number and tries to import the pool. In essence it does the same thing as an import of the pool via the latest uberblock, just with an earlier one. |
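The fallback described in this comment can be modelled in a few lines. This is only a toy sketch and not ZFS code: the `Uberblock` class, the `pick_importable` function, and the use of SHA-256 over a single byte string as a stand-in for "the tree this uberblock points to" are all assumptions made for illustration. The one idea it tries to show is that nothing gets rewritten: the import simply starts from the newest transaction group whose root still verifies.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, Optional


def digest(data: bytes) -> str:
    """Checksum helper (SHA-256 is an arbitrary choice for this sketch)."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class Uberblock:
    """Hypothetical, heavily simplified uberblock."""
    txg: int        # transaction group number
    root: bytes     # stand-in for the block tree reachable from this uberblock
    checksum: str   # checksum of `root` recorded when the txg was committed


def make_uberblock(txg: int, root: bytes) -> Uberblock:
    return Uberblock(txg, root, digest(root))


def pick_importable(uberblocks: Iterable[Uberblock]) -> Optional[Uberblock]:
    """Return the newest uberblock whose root matches its checksum.

    Nothing is repaired here: older transaction groups are tried in
    descending order until one verifies, or None if none does.
    """
    for ub in sorted(uberblocks, key=lambda u: u.txg, reverse=True):
        if digest(ub.root) == ub.checksum:
            return ub
    return None


# Two transaction groups were committed cleanly. For txg 102 the device
# claimed the write was done, but the data never reached the disk.
good = [make_uberblock(t, f"tree@{t}".encode()) for t in (100, 101)]
torn = Uberblock(102, b"garbage", digest(b"tree@102"))

chosen = pick_importable(good + [torn])
print(chosen.txg)  # 101
```

In this model the changes from txg 102 are lost, which matches the trade-off discussed in the thread: you get an older but consistent state instead of a patched-up newer one.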
| RE: Contradictory post... |
| By c0t0d0s0 on 2009-11-07 15:32:29 |
|
Many concepts in ZFS are pretty different from those of any other filesystem. I think this is the problem when people talk about ZFS and try to impose concepts of other filesystems on it. For example, the transaction rollback doesn't check and doesn't fix anything. It just imports the pool at a different transaction group number. That's pretty much the complete story. If you are still paranoid, you can scrub your pool afterwards and check whether your data is correct. But you don't have to. Both are quite different from the concept of a fsck. The transaction rollback does nothing that a fsck would do, and the scrub goes much further than a fsck, as it checks the checksums of all blocks. Of course you could call it fsck, but it has nothing in common with a fsck for ext4 or xfs.

Regarding the "cleaning up after bugs": I'm not sure the fsck is the correct place for such logic; perhaps it's better to integrate code that is able to live with the buggy state and rewrites it correctly as soon as the data changes. The other interesting point: what if the state is correct on disk, but it's read incorrectly? How do you repair such a problem with fsck? As the logic of the fsck is similar to the code that reads the data, the same problem would obviously exist in both parts.

For further explanation I'll just cite the ZFS FAQ:

> "Why doesn't ZFS have an fsck-like utility? There are two basic reasons to have an fsck-like utility:
> * Verify file system integrity - Many times, administrators simply want to make sure that there is no on-disk corruption within their file systems. With most file systems, this involves running fsck while the file system is offline. This can be time consuming and expensive. Instead, ZFS provides the ability to 'scrub' all data within a pool while the system is live, finding and repairing any bad data in the process. There are future plans to enhance this to enable background scrubbing.
> * Repair on-disk state - If a machine crashes, the on-disk state of some file systems will be inconsistent. The addition of journalling has solved some of these problems, but failure to roll the log may still result in a file system that needs to be repaired. In this case, there are well known pathologies of errors, such as creating a directory entry before updating the parent link, which can be reliably repaired. ZFS does not suffer from this problem because data is always consistent on disk.
>
> A more insidious problem occurs with faulty hardware or software. Even file systems or volume managers that have per-block checksums are vulnerable to a variety of other pathologies that result in valid but corrupt data. In this case, the failure mode is essentially random, and most file systems will panic (if it was metadata) or silently return bad data to the application. In either case, an fsck utility will be of little benefit. Since the corruption matches no known pathology, it will likely be unrepairable. With ZFS, these errors will be (statistically) nonexistent in a redundant configuration. In a non-redundant config, these errors are correctly detected, but will result in an I/O error when trying to read the block. It is theoretically possible to write a tool to repair such corruption, though any such attempt would likely be a one-off special tool. Of course, ZFS is equally vulnerable to software bugs, but the bugs would have to result in a consistent pattern of corruption to be repaired by a generic tool. During the 5 years of ZFS development, no such pattern has been seen."

For almost all failure modes ZFS protects the data; there is just one left: components lying about the sequence and state of write operations. No filesystem can work against such problems. The advantage of ZFS in conjunction with the mentioned PSARC putback is that at least you can jump back to a state that's consistent and has validated integrity. From my point of view, that's much more important than pressing the data into a form that's expected by the filesystem, where some blocks are old, some are new, and some are deleted after a fsck. In the end the data is the important stuff, not the filesystem. The filesystem is just a helper construct. |
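The scrub behaviour mentioned in this comment and in the FAQ excerpt (verify the checksum of every block, repair from redundancy where possible, report an error otherwise) can be sketched roughly like this. It is a toy model under stated assumptions, not how ZFS is implemented: the `scrub` function, the representation of mirrored disks as Python lists, and the choice of SHA-256 are all hypothetical.

```python
import hashlib
from typing import List, Tuple


def digest(data: bytes) -> str:
    """Checksum helper (SHA-256 is an arbitrary choice for this sketch)."""
    return hashlib.sha256(data).hexdigest()


def scrub(mirrors: List[List[bytes]],
          checksums: List[str]) -> Tuple[List[int], List[int]]:
    """Check every block on every mirror copy against its stored checksum.

    mirrors   -- one list of blocks per disk, all of the same length
    checksums -- the expected checksum of each block, stored separately
    Returns (repaired, unrecoverable) as lists of block indices.
    A bad copy is overwritten in place when another copy verifies.
    """
    repaired: List[int] = []
    unrecoverable: List[int] = []
    for i, expected in enumerate(checksums):
        # Look for any copy of block i that still matches its checksum.
        good = next((d[i] for d in mirrors if digest(d[i]) == expected), None)
        if good is None:
            # Detected but not fixable: a read would end in an I/O error.
            unrecoverable.append(i)
            continue
        for disk in mirrors:
            if digest(disk[i]) != expected:
                disk[i] = good
                if i not in repaired:
                    repaired.append(i)
    return repaired, unrecoverable


blocks = [b"alpha", b"beta", b"gamma"]
sums = [digest(b) for b in blocks]

disk0 = list(blocks)
disk1 = list(blocks)
disk1[1] = b"bit rot"        # one bad copy: repairable from disk0
disk0[2] = b"lost"           # both copies bad: detectable only
disk1[2] = b"also lost"

result = scrub([disk0, disk1], sums)
print(result)  # ([1], [2])
```

With a single disk in `mirrors`, every damaged block ends up in the unrecoverable list, which corresponds to the FAQ's point that a non-redundant configuration detects the error but cannot repair it.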
| RE[2]: Contradictory post... |
| By Kebabbert on 2009-11-07 17:14:57 |
| fsck only checks the metadata, but it doesn't check the actual data, right? |