Monitoring and Improving


Our newest employee, Joe McMahon, recently wrote on the Reclaim blog about the excellent monitoring setup he's been working on using a product called Observium. Until now, a lot of email notifications for various services had been coming directly to me, and it got rather noisy. It also wasn't sustainable to rely on one person to notice when things were amiss. We needed to get better at anticipating issues, which is why we made better monitoring a priority when we began working with Joe. One of the great things about the new setup is that server issues now report to Slack instead of via email. Slack has proved itself invaluable to our company and really helps filter out the noise (not to mention it doubles as an excellent archive to search against).

By knowing about issues as they come up in Slack, we can proactively work on things long before someone notifies us asking if we're having problems. There's still work to be done to tune the notifications and make sure we know about everything that's happening, but we're making headway. Just today I added another integration that's going to be a huge improvement for us.

We use a product called R1Soft for off-site backups on our servers. It's been an excellent tool with low server overhead that quietly backs up all files and databases to a separate server every night. I can't tell you how many times folks are pleasantly surprised to learn that we can easily restore lost work automatically and get them back up and running. While in theory backups are the responsibility of all of us individually, we take the approach at Reclaim that when possible we'd rather this kind of stuff just happen automatically and that you shouldn't have to think about it. Luckily the solution has been affordable enough for us to absorb the cost and it's excellent peace of mind for us as well.

Occasionally, as with all software, R1Soft will have issues with a backup. Perhaps a firewall change on the server, a full disk, or any number of other factors will keep a server from backing up regularly. I would log in and see a screen similar to this:

Monitoring and Improving

Those red numbers are never good. To make it worse, if I wasn't checking every day, I could find out only after the fact that a server hadn't been backed up in quite a while. That's bad for everyone. We needed better notifications in order to know if things were going wrong and take action on them.

So today I worked on that issue. Building on the work Joe had done to write to a monitoring channel in Slack, I used their email integration and set up SMTP on the R1Soft servers to start sending us reports if any servers failed to back up the previous night.


Slack makes integrations like this incredibly easy. You do have to have a paid account in order to use email integrations, but setting one up is as simple as them handing you an email address and off you go. You can customize where the messages go and who they show up from, even the avatar of the user. No special scripts, just a simple email address to plug into whatever software you're working with.
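To make the idea concrete, here is a minimal Python sketch of what "plugging an email address in" can look like for software that does not send mail on its own. It is not how R1Soft produces its reports (R1Soft sends its own email once SMTP is configured). The address, sender, server names, and relay settings are all placeholders made up for illustration, and the relay is assumed to accept mail without authentication.

```python
import smtplib
from email.message import EmailMessage

# Placeholder address: Slack generates the real one when the integration is created.
SLACK_ADDRESS = "monitoring-channel@example.com"


def build_backup_report(failed_servers, sender="backups@example.com"):
    """Build a plain-text report listing servers whose nightly backup failed."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = SLACK_ADDRESS
    msg["Subject"] = f"Backup failures: {len(failed_servers)} server(s)"
    msg.set_content("\n".join(f"- {name}" for name in failed_servers))
    return msg


def send_report(msg, smtp_host="localhost", smtp_port=25):
    """Hand the message to an SMTP relay (assumed reachable without auth)."""
    with smtplib.SMTP(smtp_host, smtp_port) as smtp:
        smtp.send_message(msg)
```

Building the message separately from sending it makes the report easy to inspect before any mail leaves the server.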


The biggest lesson I'm learning as we continue to refine our monitoring processes is that you can't proactively make systems better until you have clear insight into what's going on. Too often I find myself in a reactive stance, putting out fires, when instead I want to know when things are starting to heat up. Turns out the age-old saying rings truer than ever: knowledge is power.

Reclaiming our Infrastructure With Observium

When you’re at a small, growing company, it’s easy to forget about the engine underneath the glitz and glam of all-star support, value-packed hosting services, and the Jim Groom cult of personality. Luckily for us (or for you), at Reclaim, we never forget. Maintaining our infrastructure to be as reliable, scalable, and secure as possible is one of our top priorities, and we recently got set up with the Observium monitoring system to get ahead of any infrastructure issues before they become customer-impacting.

Observium is a low-cost, SNMP-based, extensible, and (reasonably) easy-to-set-up platform that helps us by automatically collecting data from our virtual infrastructure and making it digestible and actionable. For example, to get some information about processor usage (a decent benchmark for how hard a system is working), let’s take a look at our processor reporting graph for our BYU server:

Screen Shot 2015-11-25 at 12.02.17 PM

The graph shows us some great info we can take action on (or not!). Right now, you can see that the processor utilization is fairly consistent for the past 48 hours, but if I were to zoom out on the graph over time (I can go as far back as I want), we might see the processor usage steadily increasing. At some point, the team will get an automatic alert saying “hey, your processor usage is over a certain amount,” which would signal to us that we might want to upgrade that server to accept more capacity, or dive deeper into what might be causing the increase if it’s not from additional users. Either way, the data is digestible and actionable.
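The two checks described above (an alert when usage crosses a set amount, and a slower upward drift you only see when zooming out) can be sketched in a few lines of Python. This is only an illustration of the reasoning, not Observium's alerting logic. The function names, the 80% threshold, and the window sizes are assumptions chosen for the example.

```python
from statistics import mean


def over_threshold(samples, threshold=80.0):
    """Return True when the most recent sample exceeds the alert threshold."""
    return bool(samples) and samples[-1] > threshold


def is_trending_up(samples, window=3, min_rise=5.0):
    """Compare the average of the newest samples with the average of the oldest.

    Returns True when the newest `window` samples average at least
    `min_rise` percentage points higher than the oldest `window` samples.
    """
    if len(samples) < 2 * window:
        return False  # not enough history to compare two windows
    return mean(samples[-window:]) - mean(samples[:window]) >= min_rise
```

A threshold catches the sudden problem, while the trend check is the kind of signal that suggests planning an upgrade before the threshold is ever reached.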

Here’s another, more practical example. This morning, we got a processor utilization alert on one of our shared hosting servers. Large numbers of users did not all just sign up at the same time for new hosting, so Tim thought some suspicious activity might be occurring. Here’s what the Observium graph showed us:

Screen Shot 2015-11-25 at 12.12.29 PM

The “spike” at the end is clearly an outlier in comparison to the rest of the time the server is available. Sure enough, Tim looked at our security logs and a suspicious computer was attempting to connect to the server thousands of times. Tim blocked the suspicious IP address and the utilization immediately went back down, as shown in the graph. Again, all of this happened without any impact to the customer’s service – we got the alert and were able to take action before it got so bad that support tickets came in.
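The log check in this story boils down to counting connection attempts per source address and flagging the outliers. Here is a rough Python sketch of that idea. The log line format, the regular expression, and the 1000-attempt cutoff are all assumptions for illustration, since the post does not say which logs or tools Tim actually used.

```python
import re
from collections import Counter

# Assumed log format: "<timestamp> connection attempt from <IPv4 address>"
IP_PATTERN = re.compile(r"from (\d{1,3}(?:\.\d{1,3}){3})")


def suspicious_ips(log_lines, max_attempts=1000):
    """Return {ip: attempts} for addresses seen more than `max_attempts` times."""
    counts = Counter()
    for line in log_lines:
        match = IP_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return {ip: n for ip, n in counts.items() if n > max_attempts}
```

Anything this returns is a candidate for a firewall block, which is the action that brought utilization back down in the graph.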

In helping us move things along a little quicker and making our alerts more actionable, I integrated the automated alert system with Slack, our collaboration/chat application. Instead of sending emails to a monitoring mailbox or someone’s inbox where they could easily be archived or ignored, the integration exists to alert the entire team at the same time so someone can take action quickly (I mean, that’s kind of the point of an alert, right?) In addition, as a general practice, it is really easy to get into “alert fatigue” in a monitoring system where you end up getting a ton of alerts for stuff that really doesn’t matter much – by customizing our alerts so they really only trigger when “the poo hits the fan,” we don’t fall into that trap.
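As a rough illustration of both points (only notify on what matters, and send it where the whole team sees it), here is a small Python sketch. It is not the Observium integration itself. The severity names and alert fields are made up for the example. The JSON body with a `text` field is the basic shape Slack incoming webhooks accept, but posting it is left out so the sketch stays self-contained.

```python
import json

# Assumed severity scale for this example; higher means more urgent.
SEVERITY_ORDER = {"info": 0, "warning": 1, "critical": 2}


def should_notify(alert, minimum="critical"):
    """Only pass along alerts at or above the minimum severity."""
    return SEVERITY_ORDER[alert["severity"]] >= SEVERITY_ORDER[minimum]


def slack_payload(alert):
    """Serialize an alert as a JSON body with a single `text` field."""
    text = f"[{alert['severity'].upper()}] {alert['host']}: {alert['message']}"
    return json.dumps({"text": text})
```

Raising or lowering `minimum` is the tuning knob: set it too low and the channel turns into the noisy inbox it was meant to replace.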

If a team member wants to get a more general overview on the health of a server, they can log in to the Observium web interface.

Screen Shot 2015-11-25 at 12.24.27 PM

Observium doesn’t do some things I would really like to see. For example, the alert suppression/escalation features (getting someone to “acknowledge” that an alert has triggered before it triggers again) are a little light as-is, and I would like to see the ability to turn system events (e.g., a server rebooted) into alerts. I would also like to see some of the automatic discovery capabilities for Linux hosts improved or built into the system. To work around this, I wrote a script in bash that automatically makes all of the snmp/firewall/ssh key/configuration changes required on a target server and successfully discovers the device in Observium. If you’d like a copy of the script, please let me know in a comment and I’d be happy to share. Other than these minor quibbles, I’m very happy with the solution the Observium team has put together, and you cannot beat the functionality for the price.
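For readers unfamiliar with the acknowledgement idea, here is a tiny Python sketch of the behavior being wished for: an alert notifies once, stays quiet on repeats, and can notify again only after someone acknowledges it. This is a hypothetical illustration, not an Observium feature or API. The class name and alert key format are made up.

```python
class AlertTracker:
    """Suppress repeat notifications until an alert has been acknowledged."""

    def __init__(self):
        self._pending = set()  # alerts that fired and await acknowledgement

    def trigger(self, key):
        """Return True if this trigger should notify the team."""
        if key in self._pending:
            return False  # already fired and nobody has acknowledged it yet
        self._pending.add(key)
        return True

    def acknowledge(self, key):
        """Mark an alert as seen so a later trigger can notify again."""
        self._pending.discard(key)
```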

Observium comes in two “flavors,” the “Community Edition” and the “Pro Edition,” and the “Pro” edition only costs £150 (about $225) per year, way, way cheaper than some of the other monitoring solutions available. If it’s something you’re interested in, note that the alerting functionality is only available in the Pro edition, so I highly recommend going down that route.

This platform is and will be a work-in-progress with more features to come, including application monitoring, more advanced alerts, and hopefully some Reclaim-specific customizations. We’ll keep tuning this Reclaim engine so you can go create amazing stuff!

Dr. Reclaimlove or: How I Learned to Stop Worrying and Love Devops.

One of the best things (besides the /giphy function in Slack) about getting some time each month to work for Reclaim Hosting is how it has put tasks at my “traditional” full-time IT job into perspective. It contrasts my full-time IT environment, which is pretty old-fashioned (physical stuff), with an environment that relies heavily on Devops, virtual IT, and cloud administration. Fundamentally, Reclaim is a model example of how to effectively run a lean startup, manage virtual IT, and stay mostly hands-off, and it’s been a good introspective experiment for someone like myself who still clings precariously to the edge of physical infrastructure and an old-school IT background.

Devops is kind of a contentious term for many “traditional” (read: mostly this means hands-on) IT people, because it represents a massive shift in the way IT work is defined and performed. Since people, especially IT people, are often prone to some degree of change-averseness (guilty) and paranoia (doubly guilty) about their precious hardware racks, “if my infrastructure goes away then my job will go away” is not an entirely unreasonable conclusion to arrive at. We are looking at a fairly unprecedented degree of change in our industry and at blazing speed. Our jobs, though not the same as they were 10 or 15 years ago, did not start becoming substantially different until about 3-4 years ago. Neckbeard the Elder would probably be a high performer at most existing IT generalist jobs in 2013 and 2014…maybe 2015, too. The next generation of IT generalists (and IT generalists will still exist) will not rely upon Neckbeard the Not-Quite-Elder (that’s us) unless we decide, right now, to acknowledge that these changes are happening and that we will perish if we don’t adapt.

So how do we embrace the changes if we’re in traditional IT and not Devops IT? First, we have to acknowledge what Devops actually means…and since the “real” definition of Devops is still up for some discussion, let’s try to define it in the context of traditional IT work:

Devops is a collection of hands-off methodologies designed to reduce the need for physical infrastructure in favor of virtual, managed infrastructure over a hosted medium.

I.e., “use the cloud, and write some scripts.” This is in comparison to the Wikipedia definition of Devops:

“DevOps is a software development method that stresses communication, collaboration, integration, automation, and measurement of cooperation between software developers and other information-technology (IT) professionals.”

“Whoa whoa whoa. I’m in IT. I’m not a software developer. I don’t want to have to deal with them.”

This sort of makes it sound like you have to be a software developer in order to be successful in IT, which is not entirely true, but I will stress this: if you want to be successful in IT in 2015+, you need to know something about how to code. Your code doesn’t have to be flashy, and you don’t need to be an expert, but it should be effective and reasonably efficient. As for which “code,” I recommend learning Bash, Python, or Powershell (if you are in a Windows-heavy environment). I dabble in all three of these languages, and though I am not terribly good, I understand some of the thought processes that developers go through when iterating on their previous code, and it helps me “get into the head” of a developer a bit. It’s also a huge opportunity for me, and it can be for you too.

If you’re like me, you are an overworked, overstressed IT admin. I have begun to embrace Devops because it gives me an avenue for working less…if I commit to the avenue. Basically, I don’t really want to do any work. I would rather be doing other things that are more fun, but I still need to have a job so I can do those other fun things. Some of those fun things are actually “work” but they’re not really “work” to me but I still get paid to do them? Anyway, I can turn this into a win-win using Devops, and I am actually going to use a very Windows-centric example because it’s the easiest to understand, and because most software developers I know do not use Windows and I am trying to keep something of a line there.

Setting up Active Directory and doing it well is difficult. Setting up AD well using Microsoft Azure is less difficult because I don’t have to worry about unscalable and unstable hardware, vendors, CALs, incredibly obtuse licensing, etc. So what is the opportunity? Setting up AD (in the cloud), integrating it with VMM (in the cloud), and creating a “developer hook” (could be as simple as a batch file run from the requestor’s desktop) so developers in the correct AD groups can request the creation of dev and staging machines (and having those machines created for them) without me having to really touch AD anymore, except also “literally” touch it because that hardware doesn’t exist in my universe. There is nothing for developers to break because if the OS gets completely destroyed somehow they can just request a new virtual machine. Microsoft does a lot of the “devops” work for you in this example because of their integration tools, but you could also Powershell a lot of that work away, intelligently, and then maybe you could even have a real lunch break! By the way, the work you do linking Azure (or AWS, or Google) VPN to your network? That will not be work software developers are going to be doing in the foreseeable future.
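The “developer hook” idea is really just a gate: check the requester’s groups, then either provision or refuse. A bare-bones Python sketch of that gate follows. It does not talk to AD, VMM, or Azure at all. The group names, environment names, and return values are hypothetical and exist only to show the decision being automated.

```python
# Hypothetical mapping of environment -> AD groups allowed to request it.
ALLOWED_GROUPS = {
    "dev": {"Developers"},
    "staging": {"Developers", "Release-Managers"},
}


def review_request(user_groups, environment):
    """Approve a machine request only if the requester belongs to an allowed group."""
    allowed = ALLOWED_GROUPS.get(environment)
    if allowed is None:
        return False, f"unknown environment: {environment}"
    if allowed.isdisjoint(user_groups):
        return False, "requester is not in an authorized group"
    return True, f"create {environment} machine"
```

In a real setup the group list would come from the directory and the approved branch would kick off the provisioning tooling, which is the part that removes the admin from the loop.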

I am not some Microsoft fanboi, but this is a simple example of how “thinking” in a Devops way can be hugely beneficial and not so scary at all, because it illustrates the need for traditional IT expertise alongside development and automation expertise. Few of these opportunities existed even 5 years ago. (OK, so that is a little scary.) If you are reading this blog, you might be thinking “well, he’s preaching to the choir a bit,” but I promise you, based on the things that I have seen and heard, I’m not. If you’re not convinced, look around at what some IT specialists are doing on LinkedIn. IT needs to “get real” on these things soon, and start getting their people trained to think in a way that fosters collaboration and automation.

Instead of being wary of Devops, make it work for you, as we are doing at Reclaim. I’m currently deploying a network and server monitoring solution in the cloud for Reclaim (also a “traditional IT” task) and am creating an opportunity for myself to script or program away the SNMP configuration of hosts I’d like to add to the monitoring solution, making it as close to “zero-touch” as possible. In a more advanced environment, you could do something like this using Chef or Puppet, but for this task in particular, I don’t even really need that solution. I am greatly expanding on my Bash/Shell skills (Dev) while incorporating my security, file transmission, service configuration, and permissions skills (Ops). When the prep work is done, the operator will be able to go through a simple series of conditionals that will copy the SNMP config file over to the machines to be monitored, with no additional input needed from “the IT guy.” This is not scary. Self service is good service. Eventually we may even get to the point of auto-discovery. But that’s TNG, we’re still in Star Trek. 🙂
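To give a feel for what a “simple series of conditionals” might look like, here is a short sketch. The real work described here is being done in Bash, so this Python version is a stand-in written for illustration, not the actual script. The host fields, step descriptions, community string, and addresses are all made up. The three config directives (`rocommunity`, `sysLocation`, `sysContact`) are standard Net-SNMP `snmpd.conf` settings, and SNMP queries use UDP port 161.

```python
def render_snmpd_conf(community, monitor_ip, location, contact):
    """Render a minimal read-only snmpd.conf for one monitored host."""
    lines = [
        f"rocommunity {community} {monitor_ip}",  # read-only access, monitor only
        f"sysLocation {location}",
        f"sysContact {contact}",
    ]
    return "\n".join(lines) + "\n"


def plan_steps(host):
    """Decide which preparation steps a host still needs before it can be monitored."""
    steps = []
    if not host.get("snmpd_installed"):
        steps.append("install snmpd")
    if not host.get("firewall_open"):
        steps.append("allow UDP 161 from the monitoring server")
    steps.append("copy snmpd.conf")
    steps.append("restart snmpd")
    return steps
```

Separating “what needs doing” from “doing it” keeps the risky part (touching a live server) small and makes the plan easy to review before anything runs.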

Jim and Tim have built the powerful engine of Reclaim Hosting using simple, powerful DevOps methodologies and thought processes. In doing so, they can focus on the customer and not let the hardware get in their way, and that is the essence of an effective business. “Think Devops” can be the essence of your IT infrastructure, support organization, and even your sanity, if you learn, as I have, how to embrace these changes and live by this mantra. If you can’t commit yet, just start with “think automation.”

I will post more in the coming weeks about the monitoring platform and what we are doing, and I will also post some sample config files either here or on GitHub that you can port right into your own Linux machines if you’d like to start experimenting with the SNMP daemon. Until then, happy reclaiming!