Learning From the Mistakes of Other DBAs

I have been a judge for the Exceptional DBA of the Year Awards ever since its inception, and one of the questions that is asked on the application form is “What is the biggest mistake you have made as a DBA, and how did you deal with it”. This question produces some interesting answers, many of which provide important lessons for all DBAs. I have collected some of those responses from past years, not for the current 2011 Exceptional DBA of Year Awards entries, which I think you will find educational. I have edited these entries for readability and to remove any potential confidential information. Hopefully you will learn from these DBAs’ mistakes, and not make the same ones yourself.

  • I revoked permissions as sysadmin for myself.
  • Instead of logging out of the SQL Server, I shut it down instead, right in the middle of the day when it was heavily being used.
  • I was intending to drop a database from a test server, but didn’t realize I was actually on the production server, and as you have probably guessed, I deleted a production database.
  • I setup a SQL Server Agent job to delete a single record each night. But I forgot to include the WHERE clause and all the data in the table was deleted when the job ran.
  • I had noticed that a SQL Server Agent job running a SSIS package failed. Even though I was not familiar with what the SSIS package did, I manually re-ran the job, and it worked this time. Unfortunately, the SSIS package dropped a critical production database.
  • I didn’t bother backing up some seemingly unimportant databases. When one of these databases got corrupted, there was no backup. I had to recreate the database and manually repopulate it myself.
  • I neglected to pay attention to how fast our databases were growing and ran out of space without knowing it was going to happen.
  • I ran a script given by me by our auditors to generate some reports, which caused the database to go into a suspect mode. Now I don’t run any scripts unless I review them first.
  • I accidentally restored an older backup over a production database. I had intended to restore the backup over a test database.
  • I setup log shipping backwards.
  • I restarted the SQL Server service during the middle of a schema change that resulted in dropping of a production table.
  • A user accidently deleted some rows on a table and I was asked to restore the database from the most recent backup. Unfortunately, I had never tested any of the backups, and none of them would restore.
  • I restarted a SQL Server instance. When it came back online, one of the databases was in suspect mode. Not knowing what to do, I detached and then tried to reattach it, which made matters worse. Now I know that detaching a suspect database is about the worst thing you can do.
  • A large server was experiencing some disk-related errors, so it was decided that the fix to the problem we needed to essentially rebuild the server. Of course, before doing so, I backed up all of the databases on the local machine. So we performed the hardware repairs, reformatted all the arrays, reinstalled the OS and SQL Server. It was only then that I realized that I had forgotten to move the backups off the server we had just reformatted. to another location. Fortunately, older backups were available, but not the most recent backups.
  • My biggest mistake happened when I needed to take down a server to upgrade the disks that held the SQL Server data files. The plan was simple: take the SQL server down, move all the files off the RAID 1 array  that held the .mdf files, replace the disks in the array (I think we were going from 73Gb HDDs to 146Gb HDDs), let the RAID 1 mirror rebuild, then put the files back, and go home.  Very quickly we hit a snag (which turned out later to be a firmware bug) in that the new larger disks weren’t recognized by the RAID controller. We then put the old drives back, and they weren’t recognized as well. My biggest mistake was that I hadn’t backed up the databases or detached them. I’d simply moved the files after shutting the SQL Server down – so when SQL Server was started it would be looking for data files which simply wasn’t there! The only option was to get the server fixed, and luckily we had 4 hour support on the server, however it was rapidly approaching midnight and I was due to attend my brother’s wedding the following morning!  I quickly returned home to get enough things packed in a bag so that I could go straight to the wedding once the work was complete, while we waited for the engineer and replacement RAID controller card to arrive. Four hours later we had the server back up and running – and watching that partition initialize was excruciating!  I managed to get home and catch a couple hours of sleep, even though I was still buzzing from adrenaline and caffeine!  The lessons were hard learned from this, and I have never since planned ANY upgrade be it hardware or software, major or minor, without an escape plan after every step.
  • While upgrading a large database from SQL Server 2000 to SQL Server 2005, one that held important financial, it was decided by the business users that a large portion of the data in the database would be archived and then deleted from the production database. The end users worked with the vendor to get a script to handle this operation, and in testing it, the process took over 30 hours, which was way beyond out maintenance window.  So after some review, and discussion with the vendor’s support and development staff, I realized that it would be faster to create a new copy of the table, insert the rows to be kept into it, drop the old table, rename the new one, and rebuild all the constraints and indexes.  This took the process down to about 1/2 hours in testing. The problem came when I loaded the script on production to run the purge process.  In testing, I had performed the operation as an explicit transaction, and initially I had left the SET IDENTITY INSERT statements out of the batch by mistake.  It wasn’t a problem on development because I had blocked the code in an explicit transaction, so when the inserts failed because I didn’t SET IDENTITY_INSERT ON, I issued a rollback, waited, and then fixed the problem.  For some reason, I never saved the script.  When it came time for the scheduled production downtime to perform the upgrade, I loaded the script, ran it, and the table disappeared in mere seconds. The first thing I did when I realized what I had just done was to notify my manager of the problem and let him know that the system wasn’t going to be available as fast as expected, allowing him to decide how best to notify the end users if necessary. Then I took a walk around the building to think about what was going to be required to fix this, and in the shortest time possible. This two minute walk gave me the time to relax for a minute and plan what I was going to do, rather than trying to shoot from the hip. The good thing was I had backups and they had been validated numerous times during testing of the upgrade, and the database was fully logged, so I was able to backup the transaction log from production, and restore the database side-by-side, applying log backups to roll the database forward to the point in time after the downtime began, and before I executed the script.  Then a few changes to the original script allowed me to move the data that was supposed to be retained from the restored copy into the production database, and the system was able to come back online with zero data loss.

If you have any mistakes you would like to share, anonymously or not, please do so below.

Advertisements

9 thoughts on “Learning From the Mistakes of Other DBAs

  1. Myself, another (very skilled) DBA, and a SysAdmin, were expanding the NetApp LUN for a data warehouse database. We didn’t have a proper backup (we relied on a NetApp procedure that wasn’t in place).

    Well, we neglected to stop the SQL Services before we expanded the LUN … and the Data File just disappeared. Gone. Never to return again.

    It took us 2 weeks to restore the DW to its original place…

  2. I needed to update the data in one field for a select group of rows. This is in a table used to measure the company’s compliance to some federal requirements. I fat fingered it and ended up changing the value for everyone. I try to roll it back and find we have no transaction tracking. So we contact the appropriate people to restore the data from backup, only to find that the backup had never been tested and hadn’t run for two years.

  3. Pingback: Wednesday Weekly #sqlserver Links for 2011-28 | sqlmashup

  4. An application was grossly misrepresented by a vendor and the short story is, a database with roughly 1,000 consistent concurrent connections was using the “Max-ID-Plus-One” strategy from a “counter” table to add new rows. (Flash back to the 80’s desktop DB development) As you can imagine, the “counter” table suffered all sorts of grief. The first time we encountered the problem, I had the vendor on a speaker phone in an engineer’s cube and was surrounded by close to a dozen people and two levels of management. Thank heaven for speaker phone or I would have been cooked. The guy advises me to update the “counter” table and set the number to something higher than the current one. I’m clueless about the database but no big deal – I’m in a query window and read the statement to him. Twice. “UPDATE table SET column = number”. Right? “yeah.” Are you [sure] this is what I need to run? “yes”. OK. (I should have known better…) Just as I hit the execute button he chimes in with “Oh, you’ll need to use a WHERE clause.” Kaboom. Updated all records in the table. Pucker factor = 10. Thanks-for-the-help-pal-but-I’m-going-to-be-busy-for-a-while-ok-bye. I set the database offline to pull the plug on the app servers, got the system time, dumped the log to a file, restored last night’s backup and all tlogs since then, loaded my last log dump with a STOPAT parameter for a minute before that “special” moment and then ended up using the UPDATE with the WHERE clause. Oh, I forgot to mention I’d been with the company for about three weeks when this happened.

  5. Pingback: Something for the Weekend – SQL Server Links 15/07/11

  6. I think update mistakenly without where clouse is the most mistakes in database….One of my mistake was i droped a database instead of dropping drop diagram in sql 2000 bcz both option is close….

  7. Not so recently I decided to move temp db to a different physical location on the server. I made the mistake of naming the mdf data file tempdb.mdf the same as the transaction log file tempdb.mdf . When SQL Server Starts and creates TempDB this little mistake will fail the start. Eventually instead of a complete re-install of SQL Server I found the registry keys responsible for TempDB file names and changed them but only after an hour and a half of sweating it out.

  8. Myself, another (very skilled) DBA, and a SysAdmin, were expanding the NetApp LUN for a data warehouse database. We didn’t have a proper backup (we relied on a NetApp procedure that wasn’t in place).

    Well, we neglected to stop the SQL Services before we expanded the LUN … and the Data File just disappeared. Gone. Never to return again.

    It took us 2 weeks to restore the DW to its original place…

    a corollary to this: I’d requested a NetApp LUN be shared by 2 servers. Caused data corruption at an accelerated pace. Lost several database being migrated and lots of supporting files.

Comments are closed.