VMS Catastrophic Database Errors?

Fairly regularly I hear stories about VMS database errors (across a variety of vendors). Does anyone have personal experiences?

I am a little confused about why this would happen (or at least why it seems to happen as often as it does). Relational database management systems are basically a solved problem and have been for years. Almost every web application uses a database (Postgres, MySQL, SQLite, etc.), yet instances of catastrophic database failure are few and far between. The same goes for DVRs: hard drives fail, power supplies fail, etc., but database issues?

Anyone with experiences or insights to share here?


I see the odd corrupted database in Vigil systems, usually due to a power loss at just the wrong time, or a failing hard drive... the sort of things you'd expect to cause database issues.

It's a nuisance, but I wouldn't call it "catastrophic". It just means the video isn't searchable (the system is still recording the whole time, just not updating the search database), and most times it's easily solved by running the database rebuild utility (usually anywhere from two to eight hours to complete, depending on the amount of video). Worst case, it means running a script to completely drop and re-create the database, followed by the rebuild utility.

The Geutebruck systems we deploy use a proprietary database system. I'm not certain of its structure, but I've never had a problem with them except once, a while back, when they under-spec'd the required system memory. That caused some issues searching video because the index didn't have enough memory, but it was fixed with a simple memory upgrade and a reboot.

You'd think a flat file system of storing video files would be more resilient, but I remember having to fix video index files every now and then on old Dedicated Micros systems.

Overall I have not seen many database "corruptions" in a long time.

John, this is a very important topic. I just finished writing a column about it for Security Technology Executive magazine.

One client of mine experienced such a problem earlier this year. The system was on an APC Smart-UPS 1500, and the server was running the PowerChute software set to perform an orderly shutdown on the server. However, the storage array was a separate computing box (iSCSI across a network switch) and was not set for orderly shutdown.

When the power loss exceeded the 2-hour battery time (hard drives can take a lot of power), the storage array shut down abruptly, with some resulting disk corruption.

During the service call, the integrator's tech said that there wasn't a way to shut down the storage array. "We contacted the vendor and you can't do it." I contacted the same vendor, who sent me their application note on how to do it. It's a 5-minute configuration action. They have built-in support for that particular UPS.

The integrator was no longer under contract due to other deficiencies on the job, and so the new integrator will take care of it.

In another case this year, with a Genetec system, an Archiver server (recording server) drive failed and the index database got corrupted, and some of the video could not be found. Well, the video was still on the drive, in a folder structure by year, month and day, with files that have the timestamp in the filename. Genetec has a utility that finds these "orphaned" video files and brings them back into the database.
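
To make the idea concrete, here is a minimal Python sketch of what an "orphan" scan over that kind of folder structure could look like. It is not Genetec's utility and does not use any Genetec API. The `root/YYYY/MM/DD` layout, the `<camera>_<YYYYMMDD>T<HHMMSS>.<ext>` filename pattern, and the function names are all assumptions made up for the illustration.

```python
import re
from datetime import datetime
from pathlib import Path

# Hypothetical filename pattern: <camera>_<YYYYMMDD>T<HHMMSS>.<ext>
NAME_RE = re.compile(r"^(?P<camera>.+)_(?P<ts>\d{8}T\d{6})\.\w+$")


def parse_video_name(name):
    """Return (camera, start_time) parsed from a filename, or None if it does not match."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    start = datetime.strptime(match.group("ts"), "%Y%m%dT%H%M%S")
    return match.group("camera"), start


def find_orphans(root, indexed_paths):
    """Walk root/YYYY/MM/DD and yield (path, camera, start_time) for files
    that exist on disk but are missing from the index."""
    indexed = {Path(p).resolve() for p in indexed_paths}
    for path in sorted(Path(root).glob("*/*/*/*")):
        if not path.is_file() or path.resolve() in indexed:
            continue
        parsed = parse_video_name(path.name)
        if parsed is not None:
            yield path, parsed[0], parsed[1]
```

A real recovery tool would then write the recovered entries back into the index database. The point is only that when the timestamp lives in the path and filename, the index can be rebuilt from the files themselves.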

I know of several VMS systems that store the video data in separate files outside of the indexing/search/alarm correlation database.

From other experience as well, I'd say that "power loss with no backup power" and "hard drive loss" are the two primary causes of database corruption. But an experienced integrator can usually recover the data for a VMS system with a well-designed database and video file structure.

There is no excuse for losing video due to power failure. If it is important enough to record and retain the video for some period of time, it's important enough to spend $900 or so on a good UPS.

COMPLEX STORAGE SYSTEMS

But if you have a video SAN or a complex storage array, you have to make sure that all the elements will perform an orderly shutdown and an orderly startup in the right sequence. Some systems will require you to shut down the application, then SQL server (which can take a few minutes), then the operating system, then the storage array or video SAN. Startup is usually in the reverse order, but you have to test, because the startup times for various elements sometimes differ from their shutdown times. If a RAID array has to rebuild on startup, that can slow down the overall startup quite a bit, and you may think there is another problem to address, but it is likely a self-healing one.
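
The sequencing above can be sketched in a few lines of Python. This is only an outline under assumptions: the element names are placeholders, `action` and `is_done` are hypothetical callbacks you would supply for your own equipment, and the orchestration is assumed to run on something that stays powered throughout (for example a management host on the UPS), since a server cannot shut down the storage array after it has shut itself down.

```python
import time

# Order taken from the post above: application, SQL server, OS, then storage.
SHUTDOWN_ORDER = ["vms_application", "sql_server", "operating_system", "storage_array"]


def run_sequence(names, action, is_done, timeout_s=300, poll_s=1.0, sleep=time.sleep):
    """Call action(name) for each element in order, and wait until
    is_done(name) is true before moving on to the next element."""
    completed = []
    for name in names:
        action(name)
        waited = 0.0
        while not is_done(name):
            if waited >= timeout_s:
                raise TimeoutError(
                    f"{name} did not finish within {timeout_s}s; completed so far: {completed}"
                )
            sleep(poll_s)
            waited += poll_s
        completed.append(name)
    return completed


def startup_order(shutdown_order):
    """Startup is usually the reverse of shutdown (verify this on the real system)."""
    return list(reversed(shutdown_order))
```

Waiting on `is_done` rather than on a fixed delay is deliberate: the SQL server step in particular can take minutes, and moving on too early is exactly how the storage gets pulled out from under a database that is still flushing.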

You have to make sure that you use the UPS software to shut down at the point where you have plenty of time left for all shutdown actions. You have to have the Video SAN elements (switch included) on the UPS as well, and this can significantly shorten the battery uptime from what you might think. Accurate power calculations are important, especially when a Video SAN is involved.
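
As a back-of-the-envelope illustration of that calculation, here is a small Python sketch. The formula is a simplification: real UPS runtime is not linear in load, so the vendor's runtime chart for the specific model is the thing to trust. The 0.9 inverter efficiency, the safety factor of 2, and all the example numbers are assumptions for illustration, not figures for any particular UPS.

```python
def estimated_runtime_min(battery_wh, load_w, inverter_efficiency=0.9):
    """Rough runtime estimate in minutes: usable energy divided by load.
    Simplified model; check the vendor's runtime chart for real planning."""
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    return battery_wh * inverter_efficiency / load_w * 60.0


def shutdown_trigger_min(step_durations_min, safety_factor=2.0):
    """Remaining-runtime level (minutes) at which the shutdown must begin:
    the sum of all measured shutdown steps, padded by a safety factor."""
    return sum(step_durations_min) * safety_factor


def max_ride_through_min(runtime_min, trigger_min):
    """How long the system can stay up on battery before shutdown must start."""
    return max(0.0, runtime_min - trigger_min)


# Made-up example: 400 Wh of battery, 600 W total load (server, switch and storage),
# and measured shutdown steps of 2, 5, 3 and 4 minutes.
runtime = estimated_runtime_min(400, 600)        # about 36 minutes
trigger = shutdown_trigger_min([2, 5, 3, 4])     # 28 minutes
ride_through = max_ride_through_min(runtime, trigger)  # about 8 minutes
```

Note how adding the switch and storage to the load figure shrinks the runtime, and how little ride-through time is left once a realistic shutdown budget is reserved.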

All this has to be tested out before you can walk away from it.

We usually shut down the video system immediately upon power outage, using the UPS only to support an immediate orderly shutdown. Unless the network and cameras are all on emergency power, there is no sense having the server up, unless you need emergency access to recorded video, in which case you can manually start it up for a short session.

MULTIPLE OUTAGE VULNERABILITY

The charging time for a UPS battery can be quite long, and is longer the bigger the UPS battery capacity is. So if you use up most of the battery power and then shut down, you can be vulnerable if a second outage occurs before the recharge can take place. You may not have enough time for the second orderly shutdown. This is mostly a consideration with large and complex storage systems. But even small systems can have this vulnerability if their battery capacity is small.
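
A quick way to reason about this is to model the state of charge between outages. The Python sketch below assumes a linear recharge, which is a simplification (real charge curves taper as the battery approaches full), and every number in the example is made up.

```python
def charge_after_recharge(charge_fraction, hours_on_mains, full_recharge_hours):
    """State of charge (0.0 to 1.0) after some hours back on mains power.
    Assumes linear recharge, which is a simplification."""
    gained = hours_on_mains / full_recharge_hours
    return min(1.0, charge_fraction + gained)


def can_survive_second_outage(charge_fraction, full_runtime_min, shutdown_needed_min):
    """True if the remaining battery runtime covers a full orderly shutdown."""
    return charge_fraction * full_runtime_min >= shutdown_needed_min


# Made-up example: first outage left 10% charge, full recharge takes 8 hours,
# full-charge runtime is 36 minutes, and an orderly shutdown needs 14 minutes.
after_2h = charge_after_recharge(0.10, 2, 8)   # about 0.35
after_4h = charge_after_recharge(0.10, 4, 8)   # about 0.60
print(can_survive_second_outage(after_2h, 36, 14))  # False: roughly 12.6 minutes available
print(can_survive_second_outage(after_4h, 36, 14))  # True: roughly 21.6 minutes available
```

In this example a second outage two hours after the first would not leave enough battery for an orderly shutdown, which is the argument for shutting down early rather than draining the battery first.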

It pays to have a good understanding of the history, nature, and duration of power outages when designing battery backup power. Power loss causes problems with NVRs and DVRs as well as with server-based recording.

I think it's an issue that is ignored in many video system projects, especially small ones or NVR/DVR projects, where the customers think of the units as self-contained systems. VCRs could lose power without losing video on a tape. So it's natural for that consideration to carry over into DVR and NVR deployment.

Many integrators install VMS systems with no attention to backup power. You'd be surprised how often this is the case.

Our system (not sure if I can name it here, I'm a new member (and manufacturer) and have yet to read all the rules) uses a dual server setup for redundancy. We use MySQL for our database and have not seen any problems when managing 1000+ camera installs. We periodically back up the primary database to the secondary server in case a major issue were to happen. The only problems we've seen so far are unexpected power outages right at the wrong moment, or hard drives failing in the server. The database itself seems to be very reliable.

Undisclosed, well, the only other problem is that you use a dual server setup :) I am sure it makes a difference; it's just more expensive and harder to justify for smaller systems.

Ray, Matt, so is hard power cycling the main cause of this?

Hard to say what the main cause is for sure, because we don't usually get the call for it until days or weeks later, when someone tries to search some footage and gets a negative response, and at that point, nobody knows for sure what happened a while back ("Oh, we had a power surge... might have been a couple months ago, I don't remember exactly...").

From general experience, an "unclean" shutdown (power loss, RESET button, etc.) is *A* fairly common cause of database corruption. A failing drive is an obvious one, naturally. Bad RAM is also a possibility that manifests itself in all sorts of strange ways. And of course, if your database resides on an external volume (NAS/SAN, eSATA, USB drive, etc.), any interruption in the data connection could create issues.

The Windows logs should contain restart records and disk error records among other things. Network monitoring and server monitoring with alerting can be set up to notify you of restart events, disk errors and so on. Paessler provides networking freeware including a free version of PRTG -- an excellent monitoring product that can monitor servers and network equipment, tell you how much free disk space there is, and so on. I wouldn't want to set up a video system without some kind of health monitoring and alerting. This can prevent situations of discovering that a camera hasn't been getting recorded for some number of months.
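
Even without a full monitoring product, the most basic check (is new video actually landing on disk?) can be scripted. The Python sketch below is only an illustration: it assumes recordings are written as files under a single folder, the 24-hour threshold is an arbitrary choice, and how you deliver the warning (email, SNMP trap, and so on) is left out.

```python
import time
from pathlib import Path


def newest_mtime(root):
    """Return the most recent modification time of any file under root, or None."""
    mtimes = [p.stat().st_mtime for p in Path(root).rglob("*") if p.is_file()]
    return max(mtimes) if mtimes else None


def recording_warnings(video_root, max_age_s=24 * 3600, now=None):
    """Return a list of warning strings if no recent video has been written."""
    now = time.time() if now is None else now
    newest = newest_mtime(video_root)
    if newest is None:
        return ["no video files found"]
    age_s = now - newest
    if age_s > max_age_s:
        return [f"newest video file is {age_s / 3600:.1f} hours old"]
    return []
```

Run from a scheduled task, a check like this turns "nobody noticed for months" into an alert on the first day.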

Often customer IT departments find it a simple thing to include the video cameras, switches and servers in their monitoring plan. If that's not an option, you can do it yourself. Are we security technology professionals, or not?

"The Windows logs should contain restart records and disk error records among other things."

They do, but those don't always tell you if those things are related to the cause of the database errors. A restart logged in the system only tells you the system restarted, not whether it was caused by a power outage, by someone hitting the Reset button, or if it was done by Windows shutting down normally. Logged disk errors, similarly, only tell you that there ARE disk errors, not what software or data those errors are affecting.

Application logs MAY show database errors being generated, and it may be possible to correlate those to the occurrence of disk errors or restarts, but that's not much more than circumstantial evidence.
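
For what it's worth, that kind of correlation is easy to script once the timestamps are extracted from the logs. The Python sketch below assumes you have already parsed the logs into timestamps yourself (the parsing is not shown and depends on the product), and the 10-minute window is an arbitrary assumption. As said above, a match is circumstantial evidence, not proof of cause.

```python
from datetime import datetime, timedelta


def errors_near_events(db_errors, system_events, window=timedelta(minutes=10)):
    """For each database error time, list the system events (time, description)
    that occurred at or before it, within the given window."""
    pairs = []
    for err in sorted(db_errors):
        nearby = [
            (when, what)
            for when, what in system_events
            if timedelta(0) <= err - when <= window
        ]
        pairs.append((err, nearby))
    return pairs


# Made-up example data.
events = [
    (datetime(2013, 5, 17, 2, 14), "unexpected restart"),
    (datetime(2013, 5, 20, 9, 0), "disk error"),
]
errors = [datetime(2013, 5, 17, 2, 16), datetime(2013, 6, 1, 12, 0)]
for err, nearby in errors_near_events(errors, events):
    print(err, nearby or "no system event in window")
```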

"I wouldn't want to set up a video system without some kind of health monitoring and alerting. This can prevent situations of discovering that a camera hasn't been getting recorded for some number of months."

It's a wonderful thing when you can get it... unfortunately we have customers whose only network connections in their retail fuel sites are their corporate LANs, where they're not allowed to connect the DVRs to the network (not that that would provide them internet access anyway), and not allowed to install their own internet connections. Of course, these are the same sites where a DVR can be beeping away for weeks at a time with a warning box that it's recording to backup drives, or hasn't recorded any new footage in the past 24 hours, and it's just ignored until someone needs to look up some video.