Barry, we had an interesting discussion earlier this year about using 4TB hard drives - consensus was fairly skeptical about the benefits of doing so in surveillance.
I get your hint ;) We would like to do server/storage/hard drive testing in the future but I won't commit to exactly when just yet.
However, we will (Sarit) connect Exacq and get their input on the impact of 3TB/4TB drives with their appliances.
There is a limit on how many drives should be incorporated into any RAID Group. At 16, you are asking for trouble: the more drives in a group, the more likely it is that multiple drives will fail, or at least have failures discovered, simultaneously. Typical best practice is 8+1 to 10+1 in RAID5 and 8+2 to 10+2 in RAID6.
Beyond that, it's not a question of if you'll encounter UREs (unrecoverable read errors) during a rebuild, but when...
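The "when, not if" point can be put in rough numbers. A back-of-the-envelope sketch, assuming the commonly quoted consumer-drive spec of one URE per 1e14 bits read (an assumption for illustration - real rates vary by model, and enterprise drives are typically rated an order of magnitude better):

```python
# Rough odds of hitting at least one URE during a RAID5 rebuild,
# which must read every surviving drive end to end. The 1e-14
# per-bit error rate is an assumed consumer-drive spec figure.

def ure_probability(drives_read, capacity_tb, ure_rate=1e-14):
    """P(at least one URE) while reading drives_read full drives."""
    bits_read = drives_read * capacity_tb * 1e12 * 8  # TB -> bits
    return 1 - (1 - ure_rate) ** bits_read

# 8+1 RAID5 of 4TB drives: a rebuild reads the 8 surviving drives.
print(f"{ure_probability(8, 4.0):.0%}")  # roughly 92% under these assumptions
```

With those assumptions, a full rebuild of an 8+1 group of 4TB drives is more likely than not to trip at least one URE, which is exactly why 10+2 RAID6 is preferred at these capacities.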
I would guess that is due to poor RAID system design. While our earliest SCSI/PATA RAIDs had the same problem of kicking out perfectly good drives, newer systems, when set up properly, have not.
And that is key. A number of storage manufacturers who claim to "know" video recording actually don't. One key problem we discovered is too short a "drive check period" - how long the RAID controller waits for the drive to respond when it checks the drive's SMART status. The problem with short drive check periods (the default is often 7 seconds) is that drives can be slow to respond, especially when they are writing continuously. The net effect is that the controller kicks out drives that are perfectly good.
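The interplay can be sketched as a toy model. The 7-second default comes from the post above; the response times below are illustrative assumptions:

```python
# Toy model of the "drive check period" problem: the controller
# marks a drive failed if it doesn't answer within the check
# period, even when the drive is healthy and merely busy with a
# long write burst or internal error recovery.

def controller_verdict(drive_response_secs, check_period_secs=7.0):
    """What the RAID controller concludes about a slow-to-answer drive."""
    return "failed" if drive_response_secs > check_period_secs else "ok"

print(controller_verdict(12.0))        # healthy but busy drive gets kicked
print(controller_verdict(12.0, 30.0))  # a longer check period keeps it in
```

The fix is on both sides: a longer controller check period, and/or drives whose internal error recovery is capped (e.g. WD's TLER feature) so they answer before the controller gives up.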
One of the "fixes" offered to me was to remove and re-install drives that fail the first time. If they come back up as normal, that cures the issue. The problem I have with that is twofold:
- The RAID still has to perform a rebuild, during which time it is running in degraded mode.
- Not all drive failure modes are easily discovered, especially in storage systems that are under heavy use (like ours).
I may be a bit gun-shy but around a year after our last system was installed, the "bell curve" of drive failures spiked up. We pulled and reinserted a number of drives as instructed. Subsequent drive failures snowballed and we lost a number of RAID groups. After contacting the drive manufacturer (WD), we learned that they had a batch of drives that had "head bonding" issues - essentially, the read/write heads were coming off the actuators.
Now, I've been told that some of the newest systems are able to either verify-after-write or at least automatically scan the media. I can't confirm that is possible when a system is writing intensively and continuously, as is the case here. In any case, if a drive is under warranty, I prefer to let the drive manufacturer deal with it and will return any drive the system kicks out.
As far as taking a system down to let it rebuild undisturbed, that is not an option in our case.
There is a clear trend here. The bigger hard drives get, the longer the rebuild time. If RAID systems are used for surveillance, then systems designers must calculate the maximum bit rates and throughput based on the array performance during rebuild. Surveillance systems CANNOT be taken offline during a rebuild - they must record as close to 100% of the time as possible.
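That trend is easy to quantify. A rough sketch, where both rebuild rates are assumptions for illustration (on the order of 100 MB/s for an otherwise-idle array, far less while the array keeps ingesting video):

```python
# Rough rebuild-time estimate as a function of drive capacity.
# Both throughput figures are assumptions, not vendor specs:
# ~100 MB/s when the array is idle, much less under recording load.

def rebuild_hours(capacity_tb, rebuild_mb_per_s):
    """Hours to rewrite one replacement drive end to end."""
    return capacity_tb * 1e6 / rebuild_mb_per_s / 3600  # TB -> MB

for tb in (1, 2, 3, 4):
    idle = rebuild_hours(tb, 100)  # array otherwise idle
    busy = rebuild_hours(tb, 20)   # array still recording video
    print(f"{tb}TB: {idle:5.1f} h idle, {busy:6.1f} h under load")
```

Under these assumptions a 4TB drive rebuilds in about 11 hours on an idle array but well over two days on a busy one - and the group runs degraded the whole time.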
Therefore the conclusion must be that as disks become larger, RAID solutions become less and less suitable for high-capacity surveillance storage. There are other solutions out there. Disclosure: My own company manufactures a surveillance storage array which takes a different approach, so I'll not say anything more on that as I'm not trying to plug it, just discuss alternative technologies and approaches.
There are a number of manufacturers taking alternative approaches to storage arrays that are specifically designed for surveillance. Some of them have additional benefits too, such as higher disk reliability, lower power consumption and the ability to remove individual disks for evidential purposes. The key point is that some of these systems get better with higher drive capacities, not worse.
I think that as disks become larger and larger in capacity, we'll see an increasing trend away from conventional RAID onto other, more surveillance application-specific designs, which are ideally suited to big disks. Accompanying that will be a trend towards longer retention times, since power consumption and total capacity become less of an issue with such designs.
I'm not so sure. I don't see many alternative storage technologies being incorporated into very-large-scale surveillance recording systems. During the course of our year-long RFI and evaluations, not one NVR/VMS manufacturer or Integrator even mentioned RAID alternatives (other than the one company who still was pushing their RAID+tape system (shudder)). Since I would have seriously considered any proposed technology, at least enough to allow them to demonstrate their product(s), I could only assume they are incompatible with our needs. Either that or the word is not getting out - perhaps manufacturers are not looking at the casino vertical?
The only even somewhat unique storage system proposed was DDN (DataDirect Networks) and their product is still basically traditional RAID, but with a couple of twists: Their system can perform partial rebuilds after automatically power-cycling and testing "failed" hard disks (Journaled Drive Rebuilds and "SATAssure Real-Time Error Detection & Correction") and they have very dense systems (up to 60 drives per chassis makes 240TB fit in 4RU).
In fact, the system we chose will be using Dell 60-bay storage, but I insisted (despite protests by vendors) that it have additional redundancy built in: failover controllers and dual data paths, a maximum RAID Group size of 10+2, and a hot spare drive for every 30 drives or fewer. It remains to be seen if Dell has a handle on Surveillance storage, but I'll find out before the system is commissioned.
I really think the most game changing technology will be SSD. Although it is not quite ready for large-scale Surveillance storage, I believe it has the potential to resolve many current storage issues, including providing a vast reduction in power and cooling requirements, drive rebuilds in a fraction of the current time and substantial increases in system reliability.
"Surveillance systems CANNOT be taken offline during a rebuild - they must record as close to 100% of the time as possible."
So, how many DVRs/NVRs are using integrated RAIDs? We're using iSCSI-connected RAIDs with Vigil recorders; the recorders have their own internal HDDs configured as "alternate" storage that is used only if the primary storage isn't available. This means if I have to take the RAID offline for a rebuild, the DVR just continues recording internally. Even on the most heavily-used system I have in use now, I can get a solid week out of a single 2TB internal drive...
That does solve one issue but is not ideal if fast, reliable read access is also a requirement.
I believe that is also an issue with products like Coldstore that "spin down" hard disks between record cycles. The point has been made that such products only provide data loss protection during the write process. If a drive fails to come back online when needed, the data on it could be lost.
By the way, your export drive appears to be nearly full ;-o
What's not "fast and reliable" about failover recording to an internal drive? With Vigil at least, it's completely seamless... the only "drawback" as such is that video on the RAID isn't searchable/playable during the rebuild, but that's a temporary condition.
Exactly my point. I said "fast, reliable read access"...
Even offline, I would guess that rebuilds of 3TB drives would take a substantial time - >24 hours?
This turned into a very interesting and informative discussion!
My own viewpoint is that the solution to handling ultra large storage arrays could be to address the common issue which everyone brought up, ie that even the best RAID controller and hard drive array has problems doing its job when pounded constantly with intense amounts of data. Rebuilding a 30TB RAID in the background under these conditions is a problem, as you guys pointed out, and during these long rebuilds, the unit is at risk of further drive failures which could bring the group down completely.
Looking forward we plan to look more to archiving, and Exacq's rollout of v5.6 addresses this admirably. By keeping the local storage of an NVR or server/VMS down to a reasonable 6-8TB and then auto-archiving to larger RAID unit/s during a suitable lull in the activity (ie at night for retail/commercial systems using motion), I believe we can in some way mitigate the issues. True, the archive units will be subjected to periods of intense data writes, but should the RAID be in a rebuild mode following a drive change, the archive can be postponed without interrupting live recording. Also, by keeping the arrays at a more manageable size and using several smaller economical units (DAS for instance), individual rebuild times will be kept shorter.
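The local buffer can be sized with simple arithmetic. A sketch - the camera count and per-camera bitrate below are illustrative assumptions, not figures from this thread:

```python
# How long a local NVR buffer lasts at a given aggregate camera
# bitrate, to size the buffer that rides out a postponed archive
# window. The example bitrates are assumptions for illustration.

def retention_days(capacity_tb, aggregate_mbps):
    """Days of continuous recording before capacity_tb fills."""
    bytes_per_day = aggregate_mbps / 8 * 1e6 * 86400  # Mbit/s -> bytes/day
    return capacity_tb * 1e12 / bytes_per_day

# e.g. 16 cameras at 4 Mbit/s each = 64 Mbit/s aggregate
print(f"{retention_days(8, 64):.1f} days on 8TB")
```

Under those assumptions an 8TB local store buys around eleven days of continuous recording, so skipping a night or two of archiving while a RAID rebuilds is comfortably within the margin.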
Of course all this will become moot once the 2, 3 & 4TB Solid State drives start rolling out, because they won't ever fail, will they? :)
"Exactly my point. I said "fast, reliable read access"..."
As long as things are behaving the way you expect them to, I'd classify that as "reliable". If you intentionally take the RAID offline for any kind of maintenance procedure, then you EXPECT to not have access to the data; hence that doesn't make it "unreliable".
It sure doesn't qualify as fast either.