7/8/2018
I woke this morning early to my neighbor's car alarm. Since I couldn't fall back to sleep, I chose instead to think about a customer of mine who is having challenges around constant, unplanned changes in their technology environment. Traditionally, changes to a functioning technology infrastructure are an exception to the normal operations, but in this case, change occurs frequently, sometimes several times during a business day without any controls around those changes. Uncontrolled changes to any environment, especially technology, can result in instability, lack of availability of critical business systems, and the basic inability to continue to manage the environment due to the frequent changes. The current process consists of a notification email that a change will be taking place. To help with this situation, I'm going to provide this customer some tips on creating a change control process that looks to reduce the complexity in the environment.
I had the pleasure of participating in very organized change control with a previous employer, both as a change creator and implementer and as a participant in the change review process. That process was extremely cumbersome and time-consuming. It got the point across that changes were unique and their impact on the operational environment should be carefully considered. At the same time, the process also restricted the ability to implement emergency changes to the environment. We always had to consider project timeline-related changes, but these were easily dismissed as standard changes because, as we all know, failing to plan does not constitute an emergency. So I've described both extremes: a process that merely notifies parties that a change is going to take place, versus a time-consuming, board-managed official process for change. This customer can't afford the luxury of either extreme. Here is what I think could be a happy medium:
Changes to a functional environment should require approval from someone, preferably a group of peers and leaders. This means that stakeholders in the operation of the business have heard, understood and approved the change to their environment. The stakeholders can also prepare their direct reports for the change and bring awareness to the fact that something in the environment is going to be modified. The best way to do this is through the creation of a change control, change review, change management, change "insert an action here" board. I've heard them called many different things. Someone just needs to get together in a room or on the phone and consider the impact of the changes to the organization. The team should assess the need for the change, the possible negative impact to the organization, whether the change follows the manufacturer's or the organization's best practices, and what positive gains the change brings to the company.
Change requests should have a management sponsor. The manager is ultimately responsible for what their team does, and sometimes the manager may not have visibility into what technical changes their team has planned. Requiring a manager to change the status of a change record from "draft" to "up for consideration" ensures that someone besides the change requester believes that the change to the environment is necessary.
Establish a schedule for changes to be accepted and approved. Say, for instance, that standard changes are only considered on Tuesday at 2PM. Any standard changes submitted after that time are considered at the next meeting. Emergency changes can be considered through a different process (described in another bullet point).
Planned changes to an environment should be made during a scheduled, weekly outage window. Consistent change days allow the business to narrow down issues that occur as a result of those change days. Once or twice a week is the norm, but this could vary based on the availability requirements of the system being changed. Avoid weekend changes where possible. Many incidents naturally occur on Monday mornings (password resets, new employees, a system went down over the weekend because of an external change) and reverting a change at 8AM on a Monday is complicated. Some organizations run changes between 6PM and midnight on Tuesday and Thursday. Others allow changes to occur Monday through Friday after 7PM. And some have a once-monthly service window. Constantly making infrastructure changes to the environment all day long prevents technicians from recognizing which of their recent changes broke something. But if associates come in the morning after a scheduled change window and the web server does not display pages, for instance, the technical team can begin with the changes made during the outage window the night before. This allows changes to be reverted in a manner that restores business functions if necessary. That leads to:
There always needs to be a documented back-out plan (reversion). The technician creating the change should know how to reverse their changes to restore business to the way it was before the change. This could be through restoring from backup, uninstalling a program update, or removing a file. There have been times in my technical history where I had documented no back-out plan. That was a mistake, because there should ALWAYS be a way to reverse changes.
All standard or emergency changes should be accompanied by a registered incident or problem record. This is not true for project-related changes. Regardless, all changes should be documented in a service management system so that historical data is captured about a change: why it was necessary, the approval history, and what components were modified. This allows event correlation to be conducted and documented when an issue occurs, or when a change "feels familiar" because it resembles something done previously, whether that earlier change succeeded or failed.
Risk should always be assessed and documented for changes. This includes the priority of the change (low to high) and risk impact to the environment (low to high). The higher the priority and the higher the risk impact, the more scrutiny a change request should undergo. Low risk, low priority changes may only need to be implemented once per month to continue to reduce the number of changes to the environment (promotes stability).
The implementation plan should always be tested and validated. Implementing untested changes introduces an ENORMOUS amount of risk to the functionality of the environment. If there isn't a test environment, spend some money to build one. If you have $1 million in hardware, you shouldn't hesitate to spend $50 thousand to test the impact of changes to the organization. After all, it's pretty easy to sell the executive team on bringing stability, availability and manageability to the organization.
Establish an emergency change control process. Where I worked with change control, a titled infrastructure manager was required to approve an emergency change request. This wasn't a bad idea, because someone outside of the unit requesting or making the change took responsibility for its implementation. Examples of emergency changes to a technology environment include fixing software bugs that could expose customer information, closing a firewall hole that allows unwanted visitors into your environment, reverting a previous change that unexpectedly and negatively impacted your business, or applying a patch for a zero-day security vulnerability.
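To make a few of the points above a little more concrete, here is a minimal Python sketch of how a change record might enforce them. It is only an illustration under my own assumptions, not how any particular service management tool works: the names ChangeRequest, Level, next_review and scrutiny are hypothetical, the Tuesday 2PM meeting is the example from above, and the scoring thresholds are made up for the sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Level(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


REVIEW_WEEKDAY = 1  # Tuesday (Monday is 0); the example meeting day from the text
REVIEW_HOUR = 14    # 2PM


@dataclass
class ChangeRequest:
    """Hypothetical change record; field names are illustrative only."""
    title: str
    requester: str
    back_out_plan: str  # how to reverse the change
    priority: Level
    risk: Level
    status: str = "draft"
    sponsor: Optional[str] = None

    def submit(self, manager: str) -> None:
        """Move the record from 'draft' to 'up for consideration'."""
        if manager == self.requester:
            raise ValueError("sponsor must be someone other than the requester")
        if not self.back_out_plan.strip():
            raise ValueError("a documented back-out plan is required")
        self.sponsor = manager
        self.status = "up for consideration"


def next_review(submitted: datetime) -> datetime:
    """Return the first review meeting at or after the submission time."""
    days_ahead = (REVIEW_WEEKDAY - submitted.weekday()) % 7
    meeting = (submitted + timedelta(days=days_ahead)).replace(
        hour=REVIEW_HOUR, minute=0, second=0, microsecond=0
    )
    if meeting < submitted:  # this week's meeting has already happened
        meeting += timedelta(days=7)
    return meeting


def scrutiny(change: ChangeRequest) -> str:
    """Map priority plus risk to a review level (thresholds are assumptions)."""
    score = change.priority.value + change.risk.value  # ranges from 2 to 6
    if score == 2:
        return "batch into monthly window"
    if score >= 5:
        return "full board review"
    return "standard review"


change = ChangeRequest(
    title="Patch web server",
    requester="alice",
    back_out_plan="Uninstall the update",
    priority=Level.LOW,
    risk=Level.LOW,
)
change.submit(manager="bob")
print(change.status, "|", scrutiny(change))
print(next_review(datetime(2018, 7, 10, 15, 0)))
```

The idea is simply that the record refuses to leave "draft" without a sponsor and a back-out plan, that anything submitted after the meeting rolls to the following week, and that scrutiny grows with priority and risk. A real process would tune those thresholds to the organization.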
The Information Technology Infrastructure Library (ITIL) contains very formal standards for change control. The framework was created to establish standards for IT service management around how to handle reactive issues (incident management), long-term defects (problem management) and technology modifications (change management). A deeper dive on the official process for change control is located here: https://www.cherwell.com/products/it-service-management/itil-processes/essential-guide-to-itil-change-management
7/2/2018
I have had quite a few job titles in the technology industry over the past twenty years. Some have been self-explanatory, like "server support analyst," while others have been somewhat obscure, like "Agency support specialist." The first was related to my job supporting the 25,000 or so physical servers that made up the company infrastructure. The second was my job supporting the servers, workstations, applications and upgrades to those components within an agent's office. The underlying positions were Technical Analyst 2 or Technical Analyst 3, which didn't make a whole lot of sense.
Once I worked in consulting, the roles became more challenging to explain. I had roles such as Windows Infrastructure Practice Manager, Senior Implementation Storage and Virtualization Architect, and Technology Solutions Director-Converged Infrastructure. These jobs focused on supporting the sales and implementation of Microsoft Windows-based servers and desktops, server systems (think email) and disk data storage. Or the design and implementation of EMC (now Dell Technologies) disk data storage systems, VMware server virtualization, and Cisco server computing. Or serving as the lead engineer in charge of the disk data storage, virtualization and computing technologies that our company would partner with: figuring out how to implement them, sharing that knowledge with the rest of the engineering team, and helping salespeople sell those technologies.
All of these jobs sound like "yes, I will fix your computer." In actuality, that is not the case. (In all honesty, I do still help fix my kids' computers, and help my parents with their router and wireless, and exorcise the demons from our friends' kids' computers.) Now, as an Enterprise Solutions Consultant, I spend more time in front of customers helping them to solve their technology challenges while embracing changes in the technology field. These challenges can relate to networking, security, servers, data center moves and migrations, cloud computing, storage, hardware and software upgrades, and data storage solutions. In many cases, I can help implement these technologies, or at least help to lead the installation team. For you Googlers out there, these are some of the buzzwords that you can search to understand the areas in which I have become a trusted adviser to our customers:
Cisco Unified Computing System (UCS) servers, Nexus and Catalyst Switches, Multilayer Director Switches
Meraki networking
Dell Technologies XtremIO, Unity, Isilon, RecoverPoint, Avamar, Data Domain, ScaleIO, VxRack, VxBlock
VMware vCenter, ESXi, Horizon, Site Recovery Manager
Nutanix, VxRail, HyperFlex HyperConverged
Pure FlashArray and FlashBlade
I get to work in quoting tools, and work with our account managers to present the solutions that we have designed. Sometimes these presentations are formal, with documents, and pictures, and plans, and questions, and answers, and designs. Sometimes we submit just a formal quote. The important part of this role is to match the solutions with the challenges, while at the same time ensuring that the customer has the staff to support what we recommend. When the project is over, and our industry-leading implementation engineers are finished doing what they do (I can write a blog about that someday), we continue our relationship with our customers to ensure that they are trained to operate and maintain what we installed. In this software-defined world, they have to be able to upgrade their systems on a regular basis to keep everything operating the way it was designed.
There it is, the ESC role, family and friends. Hopefully this is a better explanation than my one-liners on Facebook or in casual conversation.