New Hard Drive Failure Statistics Released

John Honovich

•Nov 12, 2013

IPVM

Last year, IPVM released the surveillance industry's first survey results on hard drive failure / duration, showing a consensus around a 3 to 4 year average life span. Now, new statistics from a cloud storage provider put their projected average at ~6 years.

Let's review key details and differences:

The storage provider has deployed 25,000 drives over the last 5 years. They say these are 'consumer-grade' drives but do not cite specific make/models. The drives are obviously deployed in climate controlled data centers.
They found that 80% make it through 4 years, as shown by this chart:

Breaking it down by time, they found:

For the first 1.5 years, drives fail at 5.1% per year.
For the next 1.5 years, drives fail LESS, at about 1.4% per year.
After 3 years though, failure rates skyrocket to 11.8% per year.
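
As a rough cross-check, the three rates above can be compounded to see whether they agree with the 80% four-year survival figure. The sketch below is our own arithmetic, not the provider's method: it assumes each annual rate applies uniformly within its period and that the third period covers only year 3 to year 4. The names `PHASES` and `survival_fraction` are made up for illustration.

```python
# Piecewise annual failure rates quoted above: (duration in years, annual failure rate).
# Assumption: the 11.8% phase is applied for one year only (year 3 to year 4).
PHASES = [(1.5, 0.051), (1.5, 0.014), (1.0, 0.118)]


def survival_fraction(phases):
    """Return the fraction of drives still alive after all phases.

    Each phase compounds its annual survival rate over its duration.
    """
    alive = 1.0
    for years, annual_rate in phases:
        alive *= (1.0 - annual_rate) ** years
    return alive


if __name__ == "__main__":
    print(f"Estimated share surviving 4 years: {survival_fraction(PHASES):.1%}")
```

Under these assumptions the result lands at roughly 80%, which is consistent with the survival figure the provider reports.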

Their Stats Vs Security Integrator Survey Results

We are not surprised that this cloud storage provider is finding longer average storage duration than what security integrators are reporting. The primary difference, and a major one, is where hard drives are frequently deployed in surveillance systems: often in non-climate-controlled, poorly ventilated areas, such as recorders used as footstools in guard shacks...

Carl Lindgren

•Nov 12, 2013

In proper conditions, our experience points to even longer drive life. We are just in the process of replacing our complete system, including servers and RAIDs. The storage contains approximately 700 WD RE 500GB drives. At the 7-year mark, I estimate we still have over 80% of our original hard drives in continuous production. Aside from a hiccup at approximately 1-1/2 years due to a bad batch of drives, failure rates have remained pretty constant at approximately 1 drive per month.

Marc Pichaud

•Nov 12, 2013

I imagine datacenters serving CCTV cloud solutions mainly record on alarm at low resolution, whereas local NVR/VMS systems record 24/7 at higher resolution, better quality and more fps to satisfy legislation.

Here it's difficult to get more than 800 Kbit/s upload on a SOHO ADSL line, and 2 to 8 Mbit/s upload on SDSL if you are lucky and close to fiber.

So all cloud solutions just store low res (yes, VGA is still alive) at a few fps for short pre- and post-alarm clips. That explains why disks used that way last longer. Another reason: on a LAN, most of the time your RAID5 is based on 4 or 5 disks, so everybody is working hard, with no load balancing. In a large datacenter you can have 10 or 20 disks in a RAID5 cluster, so everybody works less of the time... (time sharing :-))

Karim Cassar

•Nov 18, 2013

Hello, good morning. So what warranty do system integrators normally offer their customers?

John Honovich

•Nov 18, 2013

IPVM

In reply to Karim Cassar

See: Surveillance Warranty Terms Reviewed

Karim Cassar

•Nov 18, 2013

Since drives might last around four years, do you suggest we replace the hard drives every four years?

Marc Pichaud

•Nov 18, 2013

In reply to Karim Cassar

I don't know about others, but here that's what we suggest to customers: full preventive replacement at 4 years. Best is to have a second (redundant) NAS to guarantee minimal delay during the switch (the VMS rights and ports should be configured for the second unit as well).

In real life, system integrators rarely propose it, thinking RAID5+ or RAID6+ is more than enough :-)

Alastair McLeod

•Nov 18, 2013

[Up front disclosure : My company manufactures a storage product].

Carl : 1 drive per month out of 700 drives is an AFR (Annualised Failure Rate) of about 1.7%, which pretty much concurs with the Backblaze analysis, and I'm sure your systems have the same high-quality conditions that they will have in their data centres.
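
For anyone wanting to check that AFR figure, here is a minimal sketch of the arithmetic. It assumes the drive population stays at roughly 700 because failed drives are replaced, and that "1 drive per month" is a steady long-run average. The function name `annualized_failure_rate` is just an illustrative label.

```python
def annualized_failure_rate(failures_per_month, drive_count):
    """Return AFR as a fraction: yearly failures divided by drives in service.

    Assumes a roughly constant population (failed drives are replaced).
    """
    return failures_per_month * 12 / drive_count


if __name__ == "__main__":
    afr = annualized_failure_rate(1, 700)
    print(f"AFR: {afr:.1%}")  # about 1.7%

    # Over the 7 years Carl describes, the same rate implies about 84 failures,
    # i.e. 12% of 700 drives, in line with "over 80%" of originals remaining.
    failures_over_7_years = 1 * 12 * 7
    print(f"Share failed over 7 years: {failures_over_7_years / 700:.0%}")
```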

The key to long hard drive life is mentioned several times in the articles and discussions above, by referring to "the right conditions", but that doesn't just mean in controlled data centre conditions. There are three key factors which affect expected hard drive life : Temperature; Vibration; Wear.

If you reduce the operating temperatures, minimise or eliminate vibration (from other disks, from fans etc.) and minimise the duty cycle (wear rate) of the disks, then the disk lifetimes will increase considerably. I recently spent time at Seagate HQ discussing these factors with some key people and they confirmed exactly these points. Conversely, if you run the disks hot, have them in a high-vibration environment, and use them at 100% duty cycle, then their useful lifetime will be considerably shorter.

For surveillance applications specifically, there are ways in which to dramatically increase the reliability and lifetime of disks and at the same time reduce power consumption and maintenance costs, using non-RAID approaches, which involve reducing the operating temperatures (by not heating the drives up in the first place), eliminating vibration using a sequential filing system and very low power fans (or no fans) and also by switching drives off when they are not needed. These techniques are also highly suited to ever-increasing disk capacities and long retention times.

I don't want to be accused of plugging a specific product, so suffice to say that alternative approaches exist for surveillance which can deliver very long disk lifetimes with very low failure rates (less than 0.1% AFR).

Carl Lindgren

•Nov 18, 2013

Alastair,

Still, my point is that during the entire 7-year-long span we had the system in production, the "bell curve" has not been in evidence. Except for a hiccup at the 1-1/2 year mark due to a bad batch of drives, we have only replaced approximately 70 drives and the failure rate immediately prior to the system's retirement has been essentially level.

That begs the question: when would we have expected the drive failure rate to increase? Seven years is nearly double what John, Karim and Marc suggest. 11.8% at the 3-year mark?

I might conclude that the use of ring buffers by our previous system had some bearing on drive longevity but I would assume products like Coldstore are unable to accommodate that form of recording.

Jerome Humery

•Jan 01, 2014

The Backblaze findings that consumer drives might be as reliable as enterprise drives really surprised me. Now, I understand that there are many other variables that may refute that statement, but it nevertheless got me thinking. I remember reading an older IPVM article that discussed RAID usage in surveillance video projects. I was also a bit surprised at the results of that article's IPVM poll, where responding integrators answered that they did not use RAID in 55% of their projects, due mostly to budgetary issues. Additionally, it appears that most of the time the alternate option to RAID is nothing at all.

So my question is the following:

In a small to medium project with tight budgetary constraints, is it better to use the more expensive and more reliable enterprise drives and not be able to afford any redundancy or back-up solution? Or to use cheaper, perhaps less reliable consumer-grade drives, but have a budget for RAID?

Carl Lindgren

•Jan 01, 2014

This might not reflect on current drive production but the first iteration of our system installed in 2003 used WD 250GB PATA consumer grade drives. We experienced a much higher rate of drive failures with that system. In 2006, we replaced all of our storage and used 500GB Enterprise SATA drives and upgraded from RAID5 to RAID6. We lost four RAID groups due to drive failures on the first system in three years and none on the newer system in 7 years.

Granted, RAID6 is far more tolerant of drive failures than RAID5, but even the drive failure rate itself was much lower. Based on our experiences, I would disagree with both the 4-year life expectancy and with the statement that consumer grade drives exhibit similar failure rates in RAID systems.

Another caveat is that Enterprise drives typically use TLER - Time-Limited Error Recovery (Hitachi calls it "command completion time limit" (CCTL)). This is a timer in the drive firmware that limits the time spent correcting detected errors before advising the array controller of a failed operation. RAID systems for video recording must be fine tuned to avoid potential conflicts between TLER/CCTL and controller drive checks due to the continued writing causing slow response to controller drive checks. Conflicts can cause perfectly good drives to be marked bad by the controller and taken offline.