Sometimes a small change results in costly application downtime and an emergency rollback, followed by hours of troubleshooting and days of analysis and planning. One of the most common changes is updating and patching operating systems. These updates are typically driven by security concerns but may also introduce new functionality.
The first step toward successful patching is basic organization and planning. Tracking successful and failed changes on servers and applications is essential in IT management, and is often done with tools like Jira or ServiceNow. It’s also important to stay informed about known issues with the patches that are about to be installed and to understand which changes they will make. For example, a few years ago, a Windows update disabled SMB 1.0, and a subsequent update altered how Windows Features are activated.
Keeping application owners informed and involved in approvals is crucial to avoid unintended consequences. Their input is vital for assessing the risks and timing of updates, and having them approve maintenance windows helps minimize downtime and disruptions.
Implementing update rings and testing groups further enhances patching reliability by allowing updates to be tested and rolled out gradually. For example, starting with the test and development environments is one of the best ways to do this.
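To make the idea of update rings more concrete, here is a minimal Python sketch. It is not tied to any particular patching tool: the `Server` class, the `RING_ORDER` list, and the `patch` callback are all assumptions made for illustration. It groups servers by environment and stops the rollout as soon as any server in a ring fails, so later rings are never touched.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Server:
    name: str
    environment: str  # e.g. "test", "dev", "staging", "prod"


# Hypothetical ring order: lowest-risk environments are patched first.
RING_ORDER = ["test", "dev", "staging", "prod"]


def build_rings(servers: Iterable[Server]) -> list[list[Server]]:
    """Group servers into rings following RING_ORDER, skipping empty rings.

    Servers whose environment is not listed in RING_ORDER are left out.
    """
    servers = list(servers)
    rings = []
    for env in RING_ORDER:
        ring = [s for s in servers if s.environment == env]
        if ring:
            rings.append(ring)
    return rings


def roll_out(rings: list[list[Server]],
             patch: Callable[[Server], bool]) -> list[str]:
    """Patch ring by ring and return the names of patched servers.

    `patch` is an assumed callback that returns True on success.
    """
    patched = []
    for ring in rings:
        results = [(s, patch(s)) for s in ring]
        patched.extend(s.name for s, ok in results if ok)
        if not all(ok for _, ok in results):
            break  # Halt the rollout so later rings stay untouched.
    return patched
```

In practice, the `patch` callback would wrap whatever tool actually installs the updates, and a waiting period between rings would give problems time to surface before production is reached.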
Another way to ensure successful patches and updates is to use patch management tools like Azure Update Manager, WSUS or SCCM. These tools report on the latest patches, track the status and patch level of each server, and allow automated patching based on predefined conditions and grouping. However, be cautious with fully automated patching without prior testing, as it can lead to issues, such as the CrowdStrike update failure in July 2024 that caused widespread downtime across many businesses on a Friday. Additionally, make sure each server has enough free disk space, since updates can fail without it.
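A free disk space check is easy to automate as a pre-patch gate. The sketch below uses Python's standard `shutil.disk_usage`. The function name `has_enough_free_space` and the threshold are assumptions chosen for illustration; the right amount of free space depends on the operating system and the size of the updates.

```python
import shutil


def has_enough_free_space(path: str, required_gib: float) -> bool:
    """Return True if the volume containing `path` has at least
    `required_gib` GiB free."""
    usage = shutil.disk_usage(path)
    return usage.free >= required_gib * 1024**3


if __name__ == "__main__":
    # Assumed threshold of 10 GiB on the current volume, for illustration only.
    if has_enough_free_space(".", 10):
        print("Enough free space, safe to continue with patching.")
    else:
        print("Not enough free space, clean up before patching.")
```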
Another important step is to plan how to revert changes if something goes wrong. For this, I recommend automating pre-update snapshots and deleting them once the update is confirmed to be working.
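The snapshot workflow can be sketched as a small Python function. Everything passed into it (`take_snapshot`, `apply_patch`, `health_check`, `restore_snapshot`, `delete_snapshot`) is a hypothetical callback standing in for whatever your hypervisor or cloud platform provides; this shows the order of operations, not a real API.

```python
from typing import Callable


def patch_with_snapshot(
    server: str,
    take_snapshot: Callable[[str], str],
    apply_patch: Callable[[str], None],
    health_check: Callable[[str], bool],
    restore_snapshot: Callable[[str, str], None],
    delete_snapshot: Callable[[str], None],
) -> bool:
    """Snapshot, patch, verify, then clean up or roll back.

    All callbacks are assumed to be supplied by the caller.
    Returns True if the patch succeeded and the snapshot was deleted.
    """
    snapshot_id = take_snapshot(server)
    try:
        apply_patch(server)
        healthy = health_check(server)
    except Exception:
        # Treat any error during patching or checking as a failed update.
        healthy = False

    if healthy:
        delete_snapshot(snapshot_id)  # Everything went well, free the storage.
        return True

    # Roll back; the snapshot is kept so the failure can be analyzed later.
    restore_snapshot(server, snapshot_id)
    return False
```

Keeping the snapshot after a rollback is a design choice made here for troubleshooting; deleting it once the analysis is complete avoids filling up storage.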
By combining these best practices, you can significantly reduce the risks associated with patching and ensure a smoother, more reliable update process.