March 6, 2019
Avoiding Problems When the Clocks Change
The season is upon us when clocks change for Daylight Saving Time. Many distributed systems default to UTC as the basis for all time-related settings and leave any locale-specific adjustments to the client. However, this is not true for everyone, and many of us are responsible for systems that rely on timestamps in a locale that adjusts its clocks twice a year. Even for those using UTC, things can get complicated with leap years and the occasional leap second.
In all these cases, potential problems include parts of the system falling out of sync with other parts. Messages could be lost and communications disrupted. Certificates may incorrectly report as expired or invalid. Monitoring could be impacted by outages or by multiple data points, collected an hour apart, that appear to share the same timestamp. Data collected during the Fall change, when clocks roll back and repeat an hour, is sometimes lost entirely, as USA Today reported about medical records systems struggling during Daylight Saving Time changes.
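To make the repeated hour concrete, here is a minimal Python sketch. It assumes a server in the America/New_York zone that labels samples with local wall-clock time and no offset. The zone, the helper name `local_label`, and the sample instants are illustrative choices, not taken from any particular monitoring system. It relies on the standard `zoneinfo` module (Python 3.9 or later) and an available time zone database.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# Assumed server time zone, chosen only for illustration.
NEW_YORK = ZoneInfo("America/New_York")


def local_label(utc_instant: datetime) -> str:
    """Format a UTC instant as a local wall-clock string without an offset."""
    return utc_instant.astimezone(NEW_YORK).strftime("%Y-%m-%d %H:%M")


# Two samples taken exactly one hour apart, straddling the 2019 Fall change.
first = datetime(2019, 11, 3, 5, 30, tzinfo=timezone.utc)
second = first + timedelta(hours=1)

labels = [local_label(first), local_label(second)]
print(labels)                  # both entries read 2019-11-03 01:30
print(labels[0] == labels[1])  # True: the local labels collide

# The fold attribute tells the two occurrences apart (0 = first, 1 = second).
folds = [first.astimezone(NEW_YORK).fold, second.astimezone(NEW_YORK).fold]
print(folds)                   # [0, 1]
```

Storing the UTC instant, or the local time together with its UTC offset, would keep the two samples distinct in this scenario.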
Even in the best-case scenario, no system is perfect and no team of human maintainers is perfect. Instead, we must learn to find and mitigate potential problems before they grow into far more serious ones.
Engineers can use Gremlin’s Time Travel attack to simulate time changes ahead of the actual event and identify potential issues before they arise. By testing the impact of time changes on a system with a limited blast radius, engineers can uncover previously unknown problems and fix them before they have a wider impact. To further enhance reliability, they can gradually retest at wider and wider scales to cover as much of the system as possible.
Goals of Time Travel Chaos Experiments
- Test a system set to use a local time zone. Some local governments require their systems to be set up this way. In other cases, most of the people responsible for the servers live and work in a single time zone, and they find it easier to recognize that a resource spike every Saturday at 23:42 local time comes from a weekly automated backup than to convert mentally from UTC. In still other places, the setting is a policy decision encoded so long ago that the reasoning has been lost but the requirement remains. What happens when Daylight Saving Time starts or ends? During a leap year? When a leap second is added?
- Test a system set to one time zone with users in another. What happens when a system uses the time zone of the home office, but servers, employees, and users reside in other time zones? What happens in larger systems that span international borders, where the Daylight Saving Time change does not occur at the same moment everywhere? For example, in 2019 Daylight Saving Time begins in the United States and Canada on Sunday, March 10th at 2:00 AM local time, and in Europe and the United Kingdom on Sunday, March 31st at 1:00 AM UTC.
- Test a large, distributed system with many SREs. What happens in a large, distributed system with thousands of nodes and a large, distributed team of people with the authority to deploy? Can you always be certain that every node is configured in exactly the same manner? What if even one node has different settings? How will your system react? And has your software application been designed to handle time changes gracefully?
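The staggered start dates in the second goal mean that the gap between two offices is not constant through March. The following Python sketch illustrates this under stated assumptions: the zones Europe/London and America/New_York stand in for a hypothetical home office and remote site, and `hours_ahead` is a helper name introduced here for illustration. It uses the standard `zoneinfo` module (Python 3.9 or later).

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def hours_ahead(zone_a: str, zone_b: str, utc_instant: datetime) -> float:
    """Return how many hours zone_a's clock is ahead of zone_b's at an instant."""
    offset_a = utc_instant.astimezone(ZoneInfo(zone_a)).utcoffset()
    offset_b = utc_instant.astimezone(ZoneInfo(zone_b)).utcoffset()
    return (offset_a - offset_b).total_seconds() / 3600


# Noon UTC before, between, and after the two 2019 spring changes.
instants = [
    datetime(2019, 3, 9, 12, tzinfo=timezone.utc),   # before the US change
    datetime(2019, 3, 20, 12, tzinfo=timezone.utc),  # US changed, UK not yet
    datetime(2019, 4, 1, 12, tzinfo=timezone.utc),   # both changed
]

gaps = [hours_ahead("Europe/London", "America/New_York", t) for t in instants]
print(gaps)  # [5.0, 4.0, 5.0]
```

Any job, alert window, or hand-off that hard-codes a five-hour difference between these two sites would be off by an hour for roughly three weeks.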
Gremlin’s Time Travel attack is part of the full Gremlin product and is considered an intermediate-level attack. To get started with Chaos Engineering, we recommend beginning with Gremlin Free and moving up to our full product as you gain experience.