Monitoring Hyper-V and ESXi—what should you do?

Over the years, I’ve found that building monitoring scripts and using them properly is a challenge. When I look back at my internal IT days using platforms like WhatsUp Gold, PRTG, or N‑central, the question was always the same: how can I monitor efficiently and get alerts that matter?
In this blog post, I thought I’d tackle something that challenges a lot of people: monitoring hypervisors. For the purposes of this piece, I’m going to focus on the two main ones: Hyper-V and ESXi.
I will look at this in terms of fairly general concepts, so whatever tool you use, you can check whether what I’m suggesting fits your needs and apply what is appropriate to your environment.
Monitoring Hyper-V
Let’s start with Hyper-V, as this is likely the simpler of the two. Most people I talk to who monitor Hyper-V focus on two areas:
- Physical health
- OS/Hypervisor health
On the physical health side, this is usually done by working with the host’s management layer. For example, Dell and HP servers have software installed in Windows that lets you get health data through an API or SNMP. This gives you the health of the power supplies and various modules (RAM, CPU, etc.), as well as RAID cards, drive health, and so on. Personally, I try to focus on the obvious things that usually cause major issues. These tend to be, in no particular order:
- Power supplies health (if any power supply is offline, in the case of redundancy, trigger an alert)
- RAID card (if the RAID state is degraded, in predicted failure, etc.)
- Drive health (any drive that is showing a SMART alert, or is in a failed state)
- System battery (is the system battery depleted or in a failed state? I’ve seen my fair share of “expanded” batteries in servers)
- Temperature status (a high temperature alert should definitely be taken seriously)
On top of that, if you have the ability, you can also monitor chassis intrusion, BIOS, memory health, network card status (teamed NICs), and more.
All of these can be done through SNMP. Most RMM platforms have some level of this pre-built, so I highly recommend that you take advantage of it. If yours doesn’t, you will need to use a tool like a MIB browser to check the OIDs, or look up which OIDs to pull.
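Whatever tool collects the data, the physical checks above boil down to a small set of "bad" states per component type. Here is a minimal sketch of that idea in Python. The state names are hypothetical normalized values, not what any vendor's SNMP agent returns, so real readings would need to be mapped onto them first.

```python
# Hypothetical normalized states. Real SNMP agents report vendor-specific
# values that would have to be translated into these names first.
ALERT_STATES = {
    "power_supply": {"offline", "failed"},
    "raid": {"degraded", "predicted_failure", "failed"},
    "drive": {"smart_alert", "failed"},
    "battery": {"depleted", "failed"},
    "temperature": {"high"},
}


def physical_health_alerts(readings):
    """Return one alert message per component whose state needs attention.

    `readings` is a list of (component_type, name, state) tuples.
    Component types not listed in ALERT_STATES never produce an alert.
    """
    alerts = []
    for component_type, name, state in readings:
        if state in ALERT_STATES.get(component_type, set()):
            alerts.append(f"{component_type} {name}: {state}")
    return alerts


if __name__ == "__main__":
    sample = [
        ("power_supply", "PSU1", "ok"),
        ("power_supply", "PSU2", "offline"),
        ("raid", "controller0", "degraded"),
        ("drive", "disk3", "ok"),
        ("chassis_intrusion", "lid", "open"),  # collected, but not alerted on
    ]
    for line in physical_health_alerts(sample):
        print(line)
```

Keeping the list of alerting states in one table makes it easy to see, and to change, exactly what will wake someone up.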
The next part is monitoring the health of the hypervisor itself, which is trickier. With Hyper-V, I tend to focus on the following:
- CPU usage (high CPU usage by the hypervisor)
- Partition health (if you’ve set up multiple partitions)
- Disk usage
- Guest health, including CPU usage, disk usage, and any guest alerts raised by the host (ideally, I’d put an agent on every guest and monitor each one directly to get the most information, but we’re focusing on the hypervisor for now)
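For a metric like hypervisor CPU usage, a single high reading is rarely worth an alert. One common approach, shown here as a sketch rather than a recommendation from any specific tool, is to fire only when several consecutive samples are over the limit. The limit of 90% and the window of three samples are assumptions chosen for illustration.

```python
from collections import deque


class SustainedThreshold:
    """Fire only when the last `samples` readings all exceed `limit`."""

    def __init__(self, limit, samples):
        self.limit = limit
        self.window = deque(maxlen=samples)  # keeps only the newest readings

    def update(self, value):
        """Record a reading and return True if the alert should fire."""
        self.window.append(value)
        window_full = len(self.window) == self.window.maxlen
        return window_full and all(v > self.limit for v in self.window)


if __name__ == "__main__":
    cpu = SustainedThreshold(limit=90.0, samples=3)  # illustrative values
    for reading in [95.0, 40.0, 92.0, 96.0, 99.0]:
        print(reading, cpu.update(reading))
```

With these sample readings, only the last one triggers the alert, because it is the first point at which three readings in a row are above 90.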
Monitoring ESXi
ESXi has great APIs (the CIM APIs) that allow you to monitor tons of metrics. The danger is that it’s easy to get lost in ESXi monitoring and report on anything and everything. So here is what I typically recommend focusing on, at least to start with, as far as alerting is concerned:
- Datastore monitoring (disk free alerts)
- Licensing status (alert if the server’s license has expired or is about to expire)
- Temperature alerting
- RAID card
- Power supplies
- Drive health
- System battery
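The first two items in this list are simple arithmetic once the numbers are in hand. The sketch below assumes you already have the datastore's free and total bytes and the license expiry date from your tool of choice. The 15% free-space floor and the 30-day warning window are placeholder values, not recommendations.

```python
from datetime import date


def datastore_alert(free_bytes, capacity_bytes, min_free_pct=15.0):
    """Return a message if free space is below min_free_pct, else None."""
    free_pct = 100.0 * free_bytes / capacity_bytes
    if free_pct < min_free_pct:
        return f"datastore free space low: {free_pct:.1f}%"
    return None


def license_alert(expires_on, today, warn_days=30):
    """Return a message if the license has expired or expires within warn_days.

    `expires_on` may be None for a license with no expiry date.
    """
    if expires_on is None:
        return None
    days_left = (expires_on - today).days
    if days_left < 0:
        return "license expired"
    if days_left <= warn_days:
        return f"license expires in {days_left} days"
    return None


if __name__ == "__main__":
    print(datastore_alert(free_bytes=10, capacity_bytes=100))
    print(license_alert(date(2024, 1, 31), today=date(2024, 1, 21)))
```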
On the guest front, it is similar to Hyper-V. I would focus, at minimum, on:
- CPU
- VMware tools status (are the guest tools installed and running?)
- Memory ballooning (ballooning happens when an ESXi host is running out of physical memory)
- Guest dropped packets (this can be a sign of networking issues on the host/guest OS)
- Disk usage
There is obviously more that you can get (and probably should), but this list should serve as a solid starting point.
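The guest-side checks above can be expressed as one small function per polling cycle. This is only a sketch: the `GuestSample` fields are hypothetical names for values you would fill in from whatever your monitoring tool exposes, and the 90% limits are placeholders.

```python
from dataclasses import dataclass


@dataclass
class GuestSample:
    """One polling result for a guest. Field names are hypothetical."""
    name: str
    tools_running: bool
    ballooned_mb: int
    dropped_packets: int  # packets dropped since the previous poll
    disk_used_pct: float
    cpu_pct: float


def guest_alerts(sample, cpu_limit=90.0, disk_limit=90.0):
    """Return a list of alert messages for one guest sample."""
    alerts = []
    if not sample.tools_running:
        alerts.append("VMware Tools not running")
    if sample.ballooned_mb > 0:
        alerts.append(f"memory ballooning: {sample.ballooned_mb} MB")
    if sample.dropped_packets > 0:
        alerts.append(f"dropped packets: {sample.dropped_packets}")
    if sample.disk_used_pct >= disk_limit:
        alerts.append(f"disk usage: {sample.disk_used_pct:.0f}%")
    if sample.cpu_pct >= cpu_limit:
        alerts.append(f"CPU usage: {sample.cpu_pct:.0f}%")
    return [f"{sample.name}: {message}" for message in alerts]


if __name__ == "__main__":
    vm = GuestSample("vm1", tools_running=False, ballooned_mb=512,
                     dropped_packets=0, disk_used_pct=50.0, cpu_pct=10.0)
    for line in guest_alerts(vm):
        print(line)
```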
Some people feel that “more is better” when it comes to monitoring, and it’s true for the most part. However, if you’re not going to react to a specific alert, then why alert on it? I’m all for capturing data for reporting purposes, but I would then suggest that you turn off alerting for services you don’t care about, or change the thresholds so they don’t alert. I see this a lot with services like chassis intrusion, page file consumption, etc.
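One way to picture the split between capturing data and alerting on it is a small policy table that every check result passes through. The check names and the policy below are made up for illustration, and treating unknown checks as record-only is an assumption, not a rule.

```python
ALERT = "alert"
RECORD = "record"

# Hypothetical policy: which checks notify someone and which are only stored.
POLICY = {
    "power_supply": ALERT,
    "raid": ALERT,
    "datastore_free": ALERT,
    "chassis_intrusion": RECORD,
    "page_file": RECORD,
}


def route(check, message, policy=POLICY):
    """Return (history_entries, notifications) for one failed check.

    Every result is kept for reporting; only checks marked ALERT notify.
    Checks missing from the policy default to RECORD (an assumption).
    """
    history = [(check, message)]
    notifications = [message] if policy.get(check, RECORD) == ALERT else []
    return history, notifications


if __name__ == "__main__":
    print(route("raid", "RAID degraded"))
    print(route("chassis_intrusion", "lid open"))
```

The data for chassis intrusion still lands in the history for reports, but nobody gets paged for it.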
Marc-Andre Tanguay is Head Automation Nerd at N‑able. You can follow him on Twitter at @automation_nerd.
© N‑able Solutions ULC and N‑able Technologies Ltd. All rights reserved.
This document is provided for informational purposes only and should not be relied upon as legal advice. N‑able makes no warranty, express or implied, and assumes no legal liability or responsibility for the accuracy, completeness, or usefulness of any information contained in this document.
N-ABLE, N-CENTRAL, and other N‑able trademarks and logos are the exclusive property of N‑able Solutions ULC and N‑able Technologies Ltd., and may be common law marks, registered, or pending registration with the U.S. Patent and Trademark Office or in other countries. All other trademarks mentioned in this document are used for identification purposes only and are trademarks (or registered trademarks) of their respective companies.