Major System Outage - Resolved
Posted by: Darryn Fehr | Oct 28 2018
Last updated on Oct 30 2018
I received a message from our hosting provider that one or more sectors on our server's disk had begun to fail. This seemed off: we only use SSDs in our servers, so the disk shouldn't experience these issues, and the device was only a couple of months old. I began to investigate and found that the drive could indeed be starting to fail, so I made a complete backup of our files.
Files would not compress using the built-in backup tools on our cPanel server. I ended up manually writing them into a .zip file and exporting it to a remote server. I then backed up all of the database stuff. Didn't get the email information, derp. I ran an update through our hosting panel to see if the server was just out of date. The updater ran fine and when it finished, some 500 internal errors were fixed. Yay.
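For anyone curious what "manually writing them into a .zip file" can look like, here is a minimal Python sketch. It is not the exact procedure we used; the function name `zip_tree` and the idea of skipping unreadable files are assumptions made for illustration. Each file is read fully into memory before it is added, so a read error on a failing disk skips that file instead of leaving a half-written entry in the archive.

```python
import os
import zipfile


def zip_tree(src_dir, zip_path):
    """Archive src_dir into zip_path, skipping files that cannot be read.

    Returns the list of paths that had to be skipped.
    Note: each file is read fully into memory, which is fine for web
    files but not for very large ones.
    """
    skipped = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Store paths relative to src_dir so the archive is portable.
                rel_path = os.path.relpath(full_path, src_dir)
                try:
                    with open(full_path, "rb") as handle:
                        data = handle.read()
                except OSError:
                    # A failing sector typically surfaces as an I/O error.
                    skipped.append(full_path)
                    continue
                archive.writestr(rel_path, data)
    return skipped
```

The returned list tells you which files need to be recovered some other way, for example from an older backup.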
I received an email saying one of the processes was offline and to try and reboot the process or restart the server. It had been 18 or so days since we rebooted the server, so we figured a soft reboot of the system wouldn't hurt anything. Proceeded to reboot server.
The server is still offline. Maybe it's tired? Maybe it has some stuff on its mind? Who knows.
Still no response from the server. Starting to get a little worried now. Hook up the monitor to the server and try and check some stuff out. There it is. A giant error message showing the boot had failed. Oh no, what have we done...
After many failed reboots, tried to switch to another backup kernel. None of the 6 other backups would run. Welp, that's not good. Connected our liveCD to boot CentOS into recovery mode and see what's going on. Was able to mount the volume and see all of the files, so we know it's working. However, the write sector was unavailable and one of the memory chips had failed, so decompressing the existing kernels or launching the new base kernel would fail because no swap memory was available. This took quite some time to figure out.
Finished backing up all of the files on the server into a tar file and went to bed. Didn't have an extra disk on hand so had to wait until the morning to replace the failed RAID disk. Testing on the board shows the RAID controller is faulty and may have burnt this SSD out so we will look into replacing that as well.
Tyler and I went and got a new Samsung 860 EVO V-NAND SSD to replace the dead drive with. This new drive is almost twice as fast as the HP 500 GB we were using before and since it has 3D V-NAND, it was a good upgrade to do either way.
Pulled the `s1` server offline to use it as a chassis for transferring the files from the dead node. Reinstalled CentOS on the new drive and began an install of cPanel on the newest stable kernel.
cPanel was installed successfully and the CentOS kernel was updated. Now we can start the process of bringing all the old files from the last drive onto the new drive.
The first of many frustrations. The old drive won't mount directly because its partition is an LVM2_member (an LVM physical volume), which means we can't transfer files and accounts as easily as we would like. We ended up having to recreate all of the user accounts on the system, and then import the files from the old drive for the public_html folder of each user. Of course this meant the email accounts, domains, and SQL data would not be transferred and had to be manually updated as well. Thank goodness for backups!
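The per-user import of public_html described above can be sketched roughly like this in Python. This is an illustration rather than what we actually typed: the function name `import_public_html` and the directory layout (one home directory per user under a common root) are assumptions, and `dirs_exist_ok` needs Python 3.8 or newer.

```python
import os
import shutil


def import_public_html(old_home_root, new_home_root, users):
    """Copy each user's public_html from the old drive into the new home.

    Returns the list of users whose files were copied.
    """
    copied = []
    for user in users:
        src = os.path.join(old_home_root, user, "public_html")
        dst = os.path.join(new_home_root, user, "public_html")
        if not os.path.isdir(src):
            continue  # nothing to import for this account
        # dirs_exist_ok lets us merge into the public_html folder that
        # already exists once the account has been recreated.
        shutil.copytree(src, dst, dirs_exist_ok=True)
        copied.append(user)
    return copied
```

Note that a copy run as root leaves the new files owned by root, which is exactly the problem we ran into next.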
All files were transferred to the new drive. Logging into each account through root showed all of the files on each account. Transferred the new drive back into the server unit at the host and reinitialized the `s1` server as well. We should be done now. Sadly that is not the case.
After a lot of trial and error, I found out that new files could be created and edited, along with folders, but because of how the files were transferred (via command line as the root user), the owner of each file had been set to root. This wasn't going to work because each user (commcentre included) didn't own their own files in the server's view anymore. After discussing this issue with cPanel, we ran a few commands to rebuild the file owner permissions for each account, and then re-enabled the file security structure. Bam! Everything works.
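The commands cPanel gave us aren't reproduced here, but the underlying idea (hand every file under an account's home back to that account) can be sketched in Python. The function name `chown_tree` is an assumption for illustration, and this is not a substitute for cPanel's own tooling. It is Unix-only and needs root when changing files to a different owner.

```python
import os


def chown_tree(top, uid, gid):
    """Recursively set the owner and group of everything under top.

    Uses lchown so that symlinks themselves are changed rather than
    whatever they point to. Returns the number of entries changed.
    """
    os.lchown(top, uid, gid)
    changed = 1
    # os.walk does not follow symlinked directories by default.
    for root, dirs, files in os.walk(top):
        for name in dirs + files:
            os.lchown(os.path.join(root, name), uid, gid)
            changed += 1
    return changed
```

In practice the uid and gid for each account could be looked up with `pwd.getpwnam(username)` from the standard library before calling the function on that user's home directory.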
The Commcentre website, along with all database structures, is back online and chugging along like normal. Thanks for reading!