Hacker News | new | past | comments | ask | show | jobs | submit | login

An SSD is internally a RAID0 that can be up to 16 wide (that's how they get their performance).

Erm. Actually they get their performance because they write to NAND instead of spinning platters. SSD controllers are also quite a bit smarter than plain RAID0.

RAID0 works very well with good quality drives.

A RAID0 over three disks has about 1/3 the MTBF of a single disk.

That can still be a worthwhile trade-off if you need the extra capacity, but if you're mostly after performance and reliability then a pair of SSDs, or even a single SSD, is the better choice.



You'll probably find this interesting reading on the internal architecture of SSDs:

     http://www.denali.com/wordpress/index.php/dmr/2010/02/02/ssd-interfaces-and-performance-effects
Also, while RAID0 reduces the MTBF, it's not linear. Drive life is not magically shortened as a result of the drive being in a RAID array (if you take care to isolate synchronous vibration). The life of the array is equal to the shortest drive life. In other words, if a drive would have failed after 25,000 hours in standalone operation, it will still fail in 25,000 hours in an array. The other drives may run to 100,000 hours, but it's a "weakest link" failure mode.
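The "weakest link" model described above can be sketched in a few lines of Python (the drive lifetimes are hypothetical numbers for illustration):

```python
# Weakest-link failure model for RAID0: the array survives only as long
# as every member drive survives, so the array's life is the minimum
# of the individual drive lives.

def raid0_array_life(drive_lives_hours):
    """Array life is bounded by the first drive to fail."""
    return min(drive_lives_hours)

# Hypothetical lifetimes: one drive dies at 25,000 h; the others would
# have run to 100,000 h and 60,000 h on their own.
lives = [25_000, 100_000, 60_000]
print(raid0_array_life(lives))  # 25000
```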


Also, while RAID0 reduces the MTBF, it's not linear.

Well, it is inversely proportional to the number of drives.

The life of the array is equal to the shortest drive life.

Erm. To be clear: Your risk of having a RAID0-set (over 3 disks) fail during a given timespan is 3 times higher than having a single-disk-"set" fail in the same timespan.
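For small per-drive failure probabilities the "three times higher" claim is a good approximation, since 1 - (1 - p)^3 ≈ 3p when p is small. A quick sketch with an illustrative probability:

```python
# Probability that a 3-disk RAID0 loses data in some timespan, given
# per-drive failure probability p for that span: at least one drive
# must fail, i.e. 1 - (1 - p)**n. For small p this is close to n*p.

def raid0_failure_prob(p, n=3):
    """Probability that at least one of n drives fails."""
    return 1 - (1 - p) ** n

p = 0.001  # hypothetical per-drive failure probability for the span
print(raid0_failure_prob(p))  # ~0.002997, very close to 3 * 0.001
print(3 * p)                  # the linear approximation
```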

In other words, if a drive would have failed after 25,000 hours in standalone operation, it will still fail in 25,000 hours in an array.

That calculation makes no sense. If you have a single drive then that will fail, on average, after 25k hours. If you stripe over three of these drives then your array will, on average, fail after 8333 hours.
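Under the usual assumption of exponentially distributed (constant-rate) failures, the mean life of the stripe is the mean of the minimum of three exponentials, which is exactly one third of the single-drive mean. A Monte Carlo sketch using the 25,000-hour figure from the thread:

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

DRIVE_MTTF = 25_000   # mean single-drive life in hours (figure from the thread)
N_DRIVES = 3
TRIALS = 200_000

# Each trial: draw an exponential lifetime per drive; the RAID0 array
# fails as soon as the first drive fails.
total = 0.0
for _ in range(TRIALS):
    total += min(random.expovariate(1 / DRIVE_MTTF) for _ in range(N_DRIVES))

mean_array_life = total / TRIALS
print(round(mean_array_life))  # close to 25_000 / 3, i.e. about 8333
```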


While the probability of failure grows roughly with the number of drives, the MTBF/MTTF calculations do not work that way.

For example, if there were a probability of 5% that the disk would fail within three years, in a three disk RAID0 array, that probability of failure would be:

P = 1 - (1 - 0.05)^3 = 0.142625

In other words, 14.3% probability of failure within three years. That doesn't mean it will fail in that time frame. It means if you have a large population of that configuration, that is the rate you would be dealing with for drive replacement planning.
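The arithmetic above checks out; a one-liner confirms it and shows the result is slightly below the naive linear estimate of 3 × 5% = 15%:

```python
# Probability that a 3-disk RAID0 sees at least one drive failure within
# the interval, given a 5% per-drive failure probability.
p_drive = 0.05
p_array = 1 - (1 - p_drive) ** 3
print(round(p_array, 6))  # 0.142625, a bit under the linear 0.15
```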

The MTBF and MTTF calculations apply to populations of drives (e.g. a given model) not to a given drive. The values provide no predictability for the failure of any specific drive. Using the values for that purpose is a common misapplication. A drive with a MTTF of 1,000,000 power-on hours can fail in 15 minutes or never during its useful life.

As a result, a three drive array will have a higher probability of failure over a given interval, but the MTTF/MTBF of the drives is essentially unchanged.

Think of it this way... The probability of winning the lottery is one in 20,000,000. The probability that someone (anyone) will win the lottery in a given week may be one out of ten - 10%. In other words, some person wins the lottery, on average, one time in ten weeks. That doesn't mean that your probability of winning the lottery is 10%. It also doesn't mean that the average probability of winning the lottery is 10%. It also doesn't change the probability of winning the lottery; it's still one in 20,000,000, even if three people win in a 10 week interval.


Hm. Thanks for repeating what I just said, I guess. But what was your point again?


tl;dr: For RAID0 arrays there is a non-linear increase in the probability of failure, but the MTTF/MTBF doesn't change much.


Could it be you're just arguing for arguments sake?

My original point was: A RAID0 over 3 disks is about 3 times more likely to fail than a single disk running standalone. Fail means "total data loss". You confirm that point with your own math, yet still seem to be trying to argue that there was no difference. Sorry, that makes no sense to me.


Your statement was:

"A RAID0 over three disks has about 1/3 the MTBF of a single disk."

This is incorrect; the MTTF and MTBF are not significantly changed. Assuming you meant failure probability, my issue is with the linear relationship you imply.

If the relationship were linear, a RAID array composed of drives with a 5% failure probability would reach certainty of failure (1.00 probability) within the interval at just 20 drives. In actuality the probability only approaches 1.00 asymptotically; it takes about 225 drives before it becomes indistinguishable from certainty.
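The asymptotic behaviour is easy to verify numerically: at 5% per drive, the failure probability 1 - 0.95^n never actually reaches 1.00, though it gets vanishingly close. A sketch:

```python
# Array failure probability for n striped drives, 5% per-drive probability.
def p_fail(n, p=0.05):
    return 1 - (1 - p) ** n

print(p_fail(20))   # ~0.64, not the 1.00 a linear model (20 * 5%) predicts
print(p_fail(225))  # ~0.99999, effectively certain but never exactly 1.0
```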

The difference is a real world consideration for capacity management. What it means is that RAID0 arrays are not as failure prone as people think they are.


> "A RAID0 over three disks has about 1/3 the MTBF of a single disk." This is incorrect, the MTTF and MTBF are not significantly changed.

Wikipedia disagrees; http://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_0_fai...

array_MTTF = avg(drive_MTTF) / number_of_drives


Which is at odds with the (correct) definition of MTTF as a rate-based calculation:

http://en.wikipedia.org/wiki/Failure_rate

The person that wrote the Wikipedia article you referenced read the same mythology you did; repeating it doesn't make it true. The plural of anecdote is not fact.

Think about it yourself for a moment. If two cars are traveling 50mph, does that make their average speed 25mph (50/2)? Applying a divisor to a failure rate based on the number of devices is nonsensical.


If you are so convinced then why don't you correct the wikipedia article?

Perhaps also call up LSI and Adaptec, who use the same formula in their documentation.

http://storageadvisors.adaptec.com/2005/11/01/raid-reliabili...

But what do they know, they only build raid controllers...


You're right, there's no reason to try to correct the 20% of the population that believes the Sun revolves around the Earth. It's a lost cause; you win.


You're right, there's no reason to try to correct the 20% of the population

Erm wait, didn't I just suggest the exact opposite?

If you really think everyone has been wrong about this all the time then please, by all means, correct wikipedia or write a blog post about the matter.

This "false" formula has been out there for quite some time and you find it in pretty much every write-up on the topic, including those from RAID-vendors who (I'd hope) have spent some thought on these things.

On the flip-side I haven't found a single source to support your thesis. Thus I'd say the burden of proof is on you.



