For me, a sysadmin is a jack of all trades and master of a few. A good sysadmin can play multiple roles effortlessly, in his professional as well as his personal life. He has only one condition: once he is debugging an issue, don’t disturb him.
I agree that it’s difficult to document how a sysadmin should troubleshoot the issue at hand. This is a humble effort to do so, based on Brendan Gregg’s Master Class videos, available on YouTube.
When you have an issue at hand, what is the first step you would take?
How did you come to know about the issue? Did your customer or users report it to you verbally, or did your monitoring system report it? Let me digress a bit here. If the issue was not reported by your monitoring system, it is your responsibility to identify the root cause and add monitoring for it, so that next time you are alerted before the user notices.
If the issue was reported by a user, don’t take the report at face value. You must understand the issue from a sysadmin’s point of view. For example, if a user claims that the server is down, don’t simply reboot the server. Try to understand why he thinks the server is down. Most probably his website is down, which may require a web server restart, or it could even be his database server. Before restarting any services, make sure that his public IP is not blocked in the server firewall.
You should reproduce the issue first, understand the root cause, and then solve it. You may need temporary solutions or workarounds along the way, but don’t settle for them; always go for a permanent solution.
You have reproduced the issue. What’s next?
You need the right tools to take the troubleshooting forward. Having the tools is not enough; knowing where, when and how to use them is just as important. In the case above, where the customer says the server is down, you reproduced the issue and found that his website is not working while his email server is. That’s when you need to know how the systems work.
Use dig, host or nslookup to see whether the website and the mail server point to the same server. You should have a mental flowchart based on elimination.
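The same comparison can be scripted. What follows is only a minimal sketch in Python, assuming you already know the two host names; the function names `resolve` and `share_address` are made up for illustration. It uses the system resolver, so it compares address records only and does not look up MX records the way dig can.

```python
import socket


def resolve(hostname):
    """Return the set of IP addresses the system resolver gives for hostname.

    An empty set means the name did not resolve at all.
    """
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr);
    # the first element of sockaddr is the IP address as a string.
    return {info[4][0] for info in infos}


def share_address(addrs_a, addrs_b):
    """True when the two address sets have at least one IP in common."""
    return bool(set(addrs_a) & set(addrs_b))
```

For example, `share_address(resolve("www.example.com"), resolve("mail.example.com"))` tells you whether the two (hypothetical) names land on at least one common IP. If they don’t, the website and the mail service are on different machines, and a working mail server tells you nothing about the web server.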
If they are on the same server, the next questions are whether you can ping it and then ssh to it. What if ssh fails? What is the error message: “Connection refused”, “Connection reset by peer” or “Connection timed out”? Each of these error messages should take you down a different path, and eventually you will be able to solve the issue.
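To make the three branches concrete, here is a rough sketch in Python that opens a TCP connection to the SSH port and reports which of the three failures occurred. The names `classify`, `probe` and `NEXT_STEP` are hypothetical, and the hints are only common causes in my experience, not an exhaustive list.

```python
import socket

# Where each failure usually points. These are common causes only.
NEXT_STEP = {
    "refused": "Host is reachable but nothing accepts the connection: "
               "check that sshd is running and which port it listens on.",
    "reset": "Connection was accepted and then dropped: "
             "check sshd logs and host-based access rules on the server.",
    "timeout": "No reply at all: check firewall rules that drop packets, "
               "routing, and whether the host is up.",
}


def classify(exc):
    """Map a socket exception to 'refused', 'reset', 'timeout' or 'other'."""
    if isinstance(exc, ConnectionRefusedError):
        return "refused"
    if isinstance(exc, ConnectionResetError):
        return "reset"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"
    return "other"


def probe(host, port=22, timeout=5.0):
    """Try to reach an SSH server; return 'ok' or a failure label."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            # An SSH server sends its version banner first ("SSH-2.0-...").
            # A reset usually shows up here rather than during connect.
            banner = conn.recv(64)
    except OSError as exc:
        return classify(exc)
    return "ok" if banner.startswith(b"SSH-") else "other"
```

Calling `probe("server.example.com")` (a placeholder name) and looking the result up in `NEXT_STEP` gives you the branch of the flowchart to follow.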
What could be the reasons, and how would I find them?
Always check the error logs. Almost all services have an option to set the log level, ranging from errors only to informational and debug messages. Whether the problem is in hardware or software, the errors are usually logged somewhere. All you have to do is find where they are logged and make sense of them.
Logging alone is not enough; the errors reported in those logs should be monitored on a regular basis. Without monitoring, you will be waiting for users to report the errors, and that hurts your reputation.
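As an illustration of what regular monitoring of a log can mean at its simplest, here is a small Python sketch that counts log lines by severity and decides whether someone should be alerted. The severity keywords, the function names and the threshold are all assumptions; real log formats differ between services, so the pattern would need adjusting.

```python
import re
from collections import Counter

# Assumed severity keywords; adjust to the format your service writes.
LEVEL = re.compile(
    r"\b(emerg|alert|crit|error|warn|notice|info|debug)\b", re.IGNORECASE
)

# Levels that should wake somebody up (an assumption, tune as needed).
SERIOUS = ("emerg", "alert", "crit", "error")


def count_levels(lines):
    """Count lines by the first severity keyword found in each line."""
    counts = Counter()
    for line in lines:
        match = LEVEL.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


def should_alert(counts, threshold=1):
    """True when the number of serious lines reaches the threshold."""
    return sum(counts[level] for level in SERIOUS) >= threshold
```

Run from cron against the lines added since the last check, something like this covers the gap until a proper monitoring system is in place. The point is only that the log gets read by something other than an angry user.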
I found the issue and solved it. What’s next?
First and foremost, ask yourself honestly whether the solution you have put together is temporary or permanent. It should be a permanent fix, and the issue shouldn’t recur. If you fear that it may happen again, the problem is not solved. Work towards a permanent solution that makes sure the issue won’t repeat, and if it ever does, you should be the first to know and solve it before it escalates to a critical level.
Document the issue and the solution. You may have solved it now, but what if the issue shows up in another office network, say, three years from now? You may vaguely remember having seen it somewhere, but you will have forgotten how you solved it, and you will have to reinvent the wheel. That is why documentation is important: you can come back at any time and solve issues faster.
This also means that a good sysadmin makes sure the issue doesn’t repeat on the same server, and catches it before it happens. Not every sysadmin does, because a sysadmin is lazy by nature. But the lazy yet genius sysadmin does. He documents everything so that, when he is not around, other sysadmins can solve the issue and get things moving without disturbing his sleep.
A good sysadmin believes that “prevention is better than cure” and that “a stitch in time saves nine!”