I fixed it. I fixed the network problems we’ve been having at school for the past week. And if I had just stayed an extra hour yesterday, I could’ve fixed it then. It turned out to be a Windows Update that corrupted our trusted root certificates, which prevented anyone from connecting to our wireless server. Let me warn you right now, I’m going to get technical because I want to write this down both for posterity’s sake and to see if I can understand all the things that just went wrong. Let me start from the beginning.
Last Monday, I came to work to teachers telling me they couldn’t log on to the wireless network. I sat down at the teacher computers, and sure enough, I couldn’t connect to the wireless on any of them except one. But I could connect to the internet if I plugged into the Ethernet, and both my iPhone and MacBook Air could connect to the wireless network. I ran tests. I ran a lot of tests, and I couldn’t figure out why this was happening. I thought maybe something I did the week before (something like the roaming profiles and folder redirection I was so excited to finally implement) had caused it, so I decided to restore our two main domain controllers from backup to a point before I made these changes. Turns out, that had nothing to do with it, and I just made things worse.
Once I recovered the first of the two domain controllers from our backups, I inadvertently disjoined our NAS (Network Attached Storage) drive from the domain. Our network could no longer see it, and I was getting reports from teachers that their students couldn’t access their documents on our network. That was an issue I didn’t discover until days later, and it was one of many issues that came up during this dark period. Our two domain controllers are virtual computers, meaning they don’t exist physically. Each is a version of Windows running inside another version of Windows. Once I recovered the second domain controller, I went to “boot it up,” except the Hyper-V manager that runs this OS gave me errors saying it couldn’t start it. This happens fairly often, and the way to fix it is to send it a command through the command prompt. But! I needed the network for this command to work, and I didn’t have the network since our two domain controllers couldn’t talk to each other. We have two just in case one goes down, and they’re set up to replicate to each other every 15 minutes or so. So I decided to reboot the whole host computer in hopes that it would fix something. It didn’t.
By rebooting this computer, I somehow crashed the whole network. Nobody could get on the internet anymore, and I was panicking. Both the wireless and our physical connections were down. Our domain controllers couldn’t talk to each other even though they both existed and could contact every other server on the network. What the hell happened? I think that by rebooting the computer, which is something these machines aren’t really meant to do that often, one of the NICs (Network Interface Cards) blew out or was damaged somehow. These are the cards with the Ethernet port, and that was why we weren’t getting any internet activity on my primary domain controller. Once I switched ports, the internet came back up, and with our primary domain controller up, our secondary domain controller could finally see it and replicate with it. The physical connection to the internet was back, but the wireless was still down. What happened?
At the beginning of the week, I looked at the error logs on the server responsible for our wireless network. But from the 16th to the 21st, there weren’t any messages in the log. I checked the logs on the 19th and 20th, and since I didn’t see any messages, I ruled out this server as the source of our problems. The reason I wasn’t getting any error messages was that the event log was full, and once I logged on and saw that message, errors finally started getting logged. I investigated these errors on Friday, but I couldn’t figure them out by the time I had to leave. I came back on Sunday, did some research, and discovered the problems Windows Update had caused. There was an update designed just for Windows clients that replaced their trusted root certificates with newer versions. All good, right? Well, this update wasn’t meant for Windows Server 2003, which is what the server in charge of the wireless ran. Windows Server 2003 has a small limit on how many certificates it can store, and since this update installed more than it could handle, Windows Server 2003 deleted many certificates to make room for the new ones, including the valid certificate in charge of authenticating users who wanted to access our wireless network.
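To convince myself of how a size limit could knock out a perfectly good certificate, here’s a toy model in Python. It only models the behavior as I described it above, not how Windows actually manages its certificate store. The class name, the capacity of 5, and every certificate name are made up for illustration.

```python
from collections import OrderedDict


class BoundedCertStore:
    """Toy certificate store holding at most `capacity` entries (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._certs = OrderedDict()  # name -> purpose, oldest entry first

    def install(self, name, purpose):
        """Add a certificate and return the names evicted to make room."""
        # Re-installing an existing certificate moves it to the newest slot.
        if name in self._certs:
            self._certs.pop(name)
        self._certs[name] = purpose
        evicted = []
        # Drop the oldest entries until the store fits its limit again.
        while len(self._certs) > self.capacity:
            old_name, _ = self._certs.popitem(last=False)
            evicted.append(old_name)
        return evicted

    def has(self, name):
        return name in self._certs


def simulate_update():
    """Fill a small store, then apply an 'update' that overflows it."""
    store = BoundedCertStore(capacity=5)  # assumed limit, for illustration only
    store.install("school-wireless-auth", "wireless authentication")
    for i in range(3):
        store.install(f"existing-root-{i}", "trusted root")
    evicted = []
    # The update brings more roots than the store has room for.
    for i in range(4):
        evicted += store.install(f"updated-root-{i}", "trusted root")
    return store, evicted


if __name__ == "__main__":
    store, evicted = simulate_update()
    print("evicted:", evicted)
    print("wireless cert still present:", store.has("school-wireless-auth"))
```

In this sketch the wireless certificate is the oldest entry, so it’s the first thing to go once the update pushes the store past its limit.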
The way enterprise authentication works is this: one server runs both our RADIUS server and an Internet Authentication Service server. The Internet Authentication Service (IAS) and RADIUS work together. RADIUS connects to our primary domain controller, which runs our Active Directory, which has all our user accounts, with their user names and passwords. IAS stores the encryption and certificates that RADIUS uses to authenticate users who want to log on. Windows Update corrupted these certificates. Since the certificates on our authentication server didn’t match anything, it denied access to everyone. I had to create an entirely new certificate, add it to our RADIUS server, make sure IAS and our primary domain controller accepted it, and then finally wireless would work for everyone. And it did.
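Here’s that flow boiled down to a few lines of Python, mostly so I can see why a bad certificate locks out everyone, even people with the right password. This is a simplified sketch of the description above, not how IAS or RADIUS are really implemented. `Directory`, `AuthServer`, and the certificate names are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class Directory:
    """Stands in for Active Directory: user names mapped to passwords."""
    users: dict = field(default_factory=dict)

    def check(self, username, password):
        return username in self.users and self.users[username] == password


@dataclass
class AuthServer:
    """Stands in for the wireless authentication server (hypothetical)."""
    directory: Directory
    server_cert: str  # certificate the server uses for authentication
    cert_store: set = field(default_factory=set)  # certificates it still holds

    def authenticate(self, username, password):
        # The certificate check comes first, so a missing certificate
        # blocks every user before credentials are even looked at.
        if self.server_cert not in self.cert_store:
            return "denied: server certificate missing or invalid"
        if not self.directory.check(username, password):
            return "denied: bad credentials"
        return "granted"


def demo():
    directory = Directory({"teacher": "correct-password"})
    server = AuthServer(directory, "wireless-cert-old", {"wireless-cert-old"})
    before = server.authenticate("teacher", "correct-password")

    # The update wipes out the certificate the server depended on.
    server.cert_store.clear()
    broken = server.authenticate("teacher", "correct-password")

    # The fix: issue a new certificate and point the server at it.
    server.server_cert = "wireless-cert-new"
    server.cert_store.add("wireless-cert-new")
    after = server.authenticate("teacher", "correct-password")
    return before, broken, after


if __name__ == "__main__":
    print(demo())
```

The point of putting the certificate check ahead of the password check is that it matches what I saw: nobody’s credentials were wrong, and still nobody got in.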
There’s a lot more I didn’t talk about, and that’s because I spent most of last week explaining more of the details. But that’s what happened in a nutshell.[1] I’m glad this is over, and I’m glad I can finally get on with my life and start doing my job. This week really set me back, and now I have to play catch-up. Gotta admit, this was pretty fun. I love mysteries, and this one both frustrated me and satisfied me. Now it seems like there’s nothing I can’t handle, and that’s awesome. I no longer feel insecure at work. I got this.
[1] A very big nutshell.