Troubleshooting System Load!
System load average is probably the most fundamental metric to start from when troubleshooting a sluggish system. One of the first commands commonly used when troubleshooting a slow system is uptime.
The three numbers that uptime prints after load average (2.03, 20.17, and 15.09 in this example) represent the 1-, 5-, and 15-minute load averages on the machine, respectively.
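As a rough illustration of where those three numbers sit, the sketch below pulls them out of uptime-style text. It assumes the Linux format, in which the field reads "load average:" followed by three comma-separated values. The function name parse_load_averages is made up for this example, and the input string is only the tail of an uptime line, built from the numbers quoted above.

```python
import re

# Matches the tail of Linux `uptime` output, e.g. "load average: 2.03, 20.17, 15.09".
# Assumption: a period is the decimal separator and a comma separates the fields.
LOAD_RE = re.compile(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)")


def parse_load_averages(uptime_text):
    """Return the (1-, 5-, 15-minute) load averages found in uptime-style text."""
    match = LOAD_RE.search(uptime_text)
    if match is None:
        raise ValueError("no 'load average:' field found")
    return tuple(float(value) for value in match.groups())


# Only the load-average portion of the line is shown; the rest of the
# uptime output is not needed for the parse.
print(parse_load_averages("load average: 2.03, 20.17, 15.09"))
# (2.03, 20.17, 15.09)
```

On Unix-like systems, Python's os.getloadavg() returns the same three values for the current machine without any parsing.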
A system load average is equal to the average number of processes in a runnable or uninterruptible state. Runnable processes are either currently using the CPU or waiting to do so, and uninterruptible processes are waiting for I/O.
On a single-CPU system, a load average of 1 means the single CPU is under constant load. If that single-CPU system has a load average of 4, there is four times as much load on the system as it can handle, so three out of four processes are waiting for resources.
The load average reported on a system is not adjusted for the number of CPUs you have, so if you have a two-CPU system with a load average of 1, one of your two CPUs is loaded at all times; that is, you are 50% loaded.
So a load of 1 on a single-CPU system is the same as a load of 4 on a four-CPU system in terms of the amount of available resources used.
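The arithmetic above amounts to dividing the reported load by the number of CPUs. Here is a minimal sketch of that calculation; load_per_cpu is a hypothetical helper name, and the inputs are the figures from the examples above.

```python
def load_per_cpu(load_average, cpu_count):
    """Divide a raw load average by the CPU count to get the load per CPU."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


print(load_per_cpu(1.0, 1))  # 1.0: a single CPU under constant load
print(load_per_cpu(4.0, 1))  # 4.0: four times what a single CPU can handle
print(load_per_cpu(1.0, 2))  # 0.5: a two-CPU system that is 50% loaded
print(load_per_cpu(4.0, 4))  # 1.0: same relative load as 1 on a single CPU
```

To apply this to a live machine, os.cpu_count() can supply the CPU count. It may return None when the count cannot be determined, so a fallback value is worth adding.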
The 1-, 5-, and 15-minute load averages describe the average amount of load over that respective period of time and are valuable when you try to determine the current state of a system.
The 1-minute load average gives you a good sense of what is currently happening on a system. In the previous example, the server had a load of 2 over the last minute, but the load had spiked over the last 5 minutes to an average of 20, and over the last 15 minutes it averaged 15. This tells us that the machine has been under high load for at least 15 minutes, that the load appeared to increase around 5 minutes ago, and that it now appears to have subsided.
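The same reading can be written as two comparisons: the 1-minute value against the 5-minute value, and the 5-minute value against the 15-minute value. The sketch below is one simple way to express that. The function name load_trend and the labels it returns are invented for this example.

```python
def load_trend(one_min, five_min, fifteen_min):
    """Compare adjacent load-average windows and label each comparison."""

    def direction(recent, older):
        # A shorter window above a longer one suggests load is climbing.
        if recent > older:
            return "rising"
        if recent < older:
            return "falling"
        return "steady"

    return {
        "1_min_vs_5_min": direction(one_min, five_min),
        "5_min_vs_15_min": direction(five_min, fifteen_min),
    }


# The load averages from the example above.
print(load_trend(2.03, 20.17, 15.09))
# {'1_min_vs_5_min': 'falling', '5_min_vs_15_min': 'rising'}
```

The result "falling" then "rising" matches the interpretation above: load climbed over the last several minutes and has dropped off in the most recent minute.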
What Exactly Does a High Load Average Mean?
“It depends on what is causing it.”
Because the load describes the average number of active processes that are using resources, a spike in load could mean a few things.
What is more important is to determine whether the load is CPU-bound (processes waiting on CPU resources), RAM-bound (specifically, high RAM usage that has moved into swap), or I/O-bound (processes fighting for disk or network I/O).
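One way to picture that three-way decision is a small classifier over the user CPU percentage, the I/O wait percentage, and the amount of swap in use. This is only a sketch: the function name and both thresholds are hypothetical values picked for illustration, and swap being in use is a weak hint on its own, not proof of active swapping.

```python
# Hypothetical thresholds, chosen only for illustration; tune them for your systems.
HIGH_IOWAIT_PCT = 10.0
HIGH_USER_PCT = 70.0


def classify_load(user_pct, iowait_pct, swap_used_kb):
    """Guess the likely bottleneck from CPU percentages and swap usage."""
    if iowait_pct >= HIGH_IOWAIT_PCT:
        # Heavy I/O wait together with swap in use hints at memory pressure.
        if swap_used_kb > 0:
            return "possibly RAM-bound (swapping)"
        return "possibly I/O-bound"
    if user_pct >= HIGH_USER_PCT:
        return "possibly CPU-bound"
    return "unclear"


print(classify_load(user_pct=90.0, iowait_pct=1.0, swap_used_kb=0))
print(classify_load(user_pct=5.0, iowait_pct=40.0, swap_used_kb=0))
print(classify_load(user_pct=5.0, iowait_pct=40.0, swap_used_kb=512000))
```

The labels are deliberately tentative. They suggest where to look first and do not replace checking the processes themselves.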
Diagnosing Load Problems with the top Command!
When you type top on the command line and press Enter, you will see a lot of system information all at once. This data will continually update so that you see live information on the system, including how long the system has been up, the load average, how many total processes are running on the system, how much memory you have (total, used, and free), and finally a list of processes on the system and how many resources they are using.
The top command sorts the processes according to how much CPU they use.
Diagnosing High User Time
A common and relatively simple high-load problem to diagnose is load caused by a high percentage of user CPU time. It is common because the services on your server are likely to account for the bulk of the system load, and they are user processes.
If you see high user CPU time but low I/O wait times, you simply need to identify which processes on the system are consuming the most CPU.
By default, top will sort all of the processes by their CPU usage.
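For a scripted version of the same idea, the sketch below sorts process rows by CPU usage, much as top does interactively. It assumes input in the shape produced by `ps -eo pid,pcpu,comm` (a header row, then PID, %CPU, and command name). The function name and the sample rows are invented for illustration and are not real measurements.

```python
# Invented sample in the shape of `ps -eo pid,pcpu,comm` output.
SAMPLE_PS_OUTPUT = """
  PID %CPU COMMAND
    1  0.0 init
  812 42.5 mysqld
  990 87.1 java
 1203  3.2 sshd
"""


def top_cpu_processes(ps_output, limit=3):
    """Parse ps-style text and return the busiest processes, highest CPU first."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        if not line.strip():
            continue
        pid, pcpu, command = line.split(None, 2)
        rows.append((int(pid), float(pcpu), command))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


print(top_cpu_processes(SAMPLE_PS_OUTPUT, limit=2))
# [(990, 87.1, 'java'), (812, 42.5, 'mysqld')]
```

On a live system the text could come from subprocess.run(["ps", "-eo", "pid,pcpu,comm"], capture_output=True, text=True).stdout. Note that the %CPU value ps reports is averaged over each process's lifetime, so it can differ from the live figures top shows.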
In the short term, you can kill (or possibly postpone) some processes until the load comes down, but in the long term, you might need to consider increasing the resources on the machine or splitting some of the functions across more than one server.
In the next part, we will discuss diagnosing out-of-memory issues and high I/O wait time!