NTP at NIST Boulder Has Lost Power

(lists.nanog.org)

187 points | by lpage 5 hours ago

13 comments

  • arn3n 3 hours ago
    Wind gusts were reaching 125 MPH in Boulder County, if anyone’s curious. A lot of power was shut off preemptively to prevent downed power lines from starting wildfires. Energy providers gave warning to locals in advance. Shame that NIST’s backup generator failed, though.
    • IncreasePosts 3 minutes ago
      Notably, we had the Marshall Fire here 4 years ago, and recently Xcel settled for $680M for their role in it. So they're probably pretty keen not to be on the hook again.
    • Maxion 3 hours ago
      Somewhat interesting that they themselves don't have access to the site. You'd think there would have been a disaster plan put in place?
      • ssl-3 2 hours ago
        Maybe this is the disaster plan: There's not a smouldering hole where NIST's Boulder facility used to be, and it will be operational again soon enough.

        There's no present need for important hard-to-replace sciencey-dudes to go into the shop (which is probably cold and dark, and may have other problems that make it unsafe: it's deliberately closed) to futz around with the time machines.

        We still have other NTP clocks. Spooky-accurate clocks that the public can get to, even, like just up the road at NIST in Fort Collins (where WWVB lives, and which is currently up), and in Maryland.

        This is just one set.

        And beyond that, we've also got clocks in GPS satellites orbiting, and a whole world of low-stratum NTP servers that distribute that time on the network. (I have one such GPS-backed NTP server on the shelf behind me; there's not much to it.)

        And the orbital GPS clocks are steered to the US Naval Observatory's time scale, not NIST's.

        So there's redundancy in distribution, and also control, and some of the clocks aren't even on the Earth.

        Some people may be bitten by this if their systems rely on only one NTP server, or only on the subset of them that are down.

        And if we're following section 3.2 of RFC 8633 and using multiple diverse NTP sources for our important stuff, then this event (while certainly interesting!) is not presently an issue at all.
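        For anyone curious, the multiple-diverse-source setup RFC 8633 recommends is only a few lines of config. A sketch for chrony (server names are just common public examples; pick operators you actually trust):

```
# /etc/chrony/chrony.conf -- illustrative only
# Independent operators, so a single bad or unreachable source can be outvoted.
server time.nist.gov       iburst
pool   time.cloudflare.com iburst
pool   2.pool.ntp.org      iburst
# Don't trust the time until at least two sources agree.
minsources 2
```

        With a setup like this, the Boulder outage just drops one source from the selection; the clock keeps being steered by the survivors.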

      • TylerE 2 hours ago
        Step One of most disaster plans is not to create a second emergency.
        • amelius 2 hours ago
          But can't NTP server downtime cause a disaster?
          • Vosporos 1 hour ago
            One (amongst many) NTP server going down creates fewer issues than an NTP server spreading wrong time.
            • macintux 1 hour ago
              General rule of thumb: a misbehaving/slow server in any well-architected distributed system is vastly worse than a dead server.
            • PunchyHamster 52 minutes ago
              Technically, if you have 3 or more sources, that would be caught; the NTP protocol was designed for that eventuality.
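              A toy illustration of the idea (this is not NTP's actual selection algorithm, which intersects error intervals; the median just shows why an odd number of independent sources defeats a single falseticker):

```python
# With three or more independent offset measurements, one wildly-wrong
# source cannot steer the clock: the median ignores a single outlier.

def select_offset(offsets):
    """Pick a clock offset by taking the median of several sources."""
    s = sorted(offsets)
    return s[len(s) // 2]

# Two sources agree within a few milliseconds; one is off by a minute.
print(select_offset([0.002, -0.001, 60.0]))  # -> 0.002
```

              Real NTP (RFC 5905) goes further: it keeps per-source error bounds and intersects them, so it can also detect sources that merely disagree rather than fail outright.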
              • da_chicken 36 minutes ago
                Sure, but not needing a failure to cascade to yet another failsafe is still a good idea. After all, all software has bugs, and all networks have configuration errors.
  • themafia 3 hours ago
    > Facility operators anticipated needing to shutdown the heat-exchange infrastructure providing air cooling to many parts of the building, including some internal networking closets. As a result, many of these too were preemptively shutdown with the result that our group lacks much of the monitoring and control capabilities we ordinarily have

    Having a parallel low-bandwidth, low-power, low-waste-heat network infrastructure for this suddenly seems useful.

  • glkindlmann 3 hours ago
    Of the various internet .+P, NTP is one I never learned about as a student, so now I'm looking at its web page [1] by its creator David L. Mills (1938-2024). I've found one video of him giving a retrospective of his extensive internet work; he talks about NTP at 34:51 [2] and later at 56:26 [3].

    [1] https://www.eecis.udel.edu/~mills/ntp.html

    [2] https://youtu.be/08jBmCvxkv4?si=WXJCV_v0qlZQK3m4&t=2092

    [3] https://youtu.be/08jBmCvxkv4?si=K80ThtYZWcOAxUga&t=3386

    • ssl-3 3 hours ago
      HN discussion shortly after Dave Mills died, early in 2024: https://news.ycombinator.com/item?id=39051246
    • torcete 2 hours ago
      In [3] he mentions that one can use NTP to observe frequency deviations and use it as an early warning system for fire and AC failure. That really intrigues me. Can you actually? Has this ever been implemented?
  • Animats 4 hours ago
    NIST campus status: Due to elevated fire risk and a power outage for the Boulder area, the DOC Boulder Labs campus is CLOSED on December 19 for onsite business and no public access is permitted; previously approved accesses are revoked.[1]

    WWV still seems to be up, including voice phone access.

    NIST Boulder has a recorded phone number for site status, and it says that as of December 20, the site is closed with no access.

    NIST's main web site says they put status info on various social media accounts, but there's no announcement about this.

    [1] https://www.nist.gov/campus-status

  • gilrain 22 minutes ago
    It’d be a good idea to protect our infrastructure from the climate we created.

    It’s just a good idea, though, not a greedy one… so it won’t happen.

  • cdfuller 4 hours ago
    Can anybody expand on the implications of this?

    Being unfamiliar with it, it's hard to tell if this is a minor blip that happens all the time, or if it's potentially a major issue that could cause cascading errors equal to the hype of Y2K.

    • autarch 3 hours ago
      Time travel is extremely dangerous right now. I highly recommend deferring time travel plans except for extreme temporal emergencies.
      • jeffrallen 2 hours ago
        Would traveling to the past in order to put in place a preemptive fix for this outage be wise or dangerous?

        Asking for a friend.

        • JadeNB 18 minutes ago
          Safety not guaranteed.
        • ExoticPearTree 1 hour ago
          Tell your friend that this course of action failed, as we in the present are still experiencing issues.
          • autarch 1 hour ago
            Well, that's _this_ timeline. Other timelines never had an outage.
            • throwup238 1 hour ago
              Not with Terminator rules…
      • fuzztester 2 hours ago
        Same for database transaction roll back and roll forward actions.

        And most enterprises, including banks, use databases.

        So by bad luck, you may get a couple of transactions reversed in order of time, such as a $20 debit incorrectly happening before a $10 credit, when your bank balance was only $10 prior to both those transactions. So your balance temporarily goes negative.

        Now imagine if all those amounts were ten thousand times higher ...

      • yawpitch 3 hours ago
        Define “extreme”?
    • Animats 3 hours ago
      Google has their own fleet of atomic clocks and time servers. So does AWS. So does Microsoft. So does Ubuntu. They're not going to drift enough for months to cause trouble. So the Internet can ride through this, mostly.

      The main problem will be services that assume at least one of the NIST time servers is up. Somewhere, there's going to be something that won't work right when all the NIST NTP servers are down. But what?

      • axlee 12 minutes ago
        Can't they point these dns records to working servers meanwhile to avoid degradation?
        • creatonez 9 minutes ago
          My understanding is that people who connect specifically to the NIST ensemble in Boulder are doing so because they are running a scientific experiment that relies on that specific clock. When your use case is sensitive enough, it's not directly interchangeable with other clocks.

          Everyone else is already connecting to load balanced services that rotate through many servers, or have set up their own load balancing / fallbacks. The mistakenly hardcoded configurations should probably be shaken loose anyway.

      • guenthert 3 hours ago
        Ubuntu using atomic clocks would surprise me. Sure, they could, but it's not obvious to me why they would spend $$$$ on such. More plausible to me is that they would be using GPSDOs as reference clocks (in this context, about as good as your own atomic clock), iff they were running their own time servers. Google finds only that they are using servers from the NTP Pool Project, which will be using a variety of reference clocks.

        If you have information on what they actually are using internally, please share.

        • puzzlingcaptcha 2 hours ago
          I think people have the wrong idea of what a modern atomic clock looks like. These are readily available commercially; Microchip, for example, will happily sell you hydrogen, cesium, or rubidium atomic clocks. Hydrogen masers are rather unwieldy, but you can get a rubidium clock in a 1U format, and cesium ones are not much bigger. I think their cesium frequency standards come from a former HP business they acquired.

          Example: https://www.microchip.com/en-us/products/clock-and-timing/co...

          • xorcist 2 hours ago
            It is also important to realize that an atomic clock will only give you a steady pulse. It will count seconds for you, and do so very accurately, but that is not the same as knowing what time it is.

            If you get a rubidium clock for your garage, you can sync it up with GPS to get an accurate-enough clock for your hobby NTP project, but large research institutions and their expensive contraptions are more elaborate to set up.

      • genidoi 3 hours ago
        Atomic clock non-expert here, what does having a fleet of atomic clocks entail and why would the hyperscalers bother?
        • Gabrys1 3 hours ago
          Having clocks synchronized between your servers is extremely useful. For example, having a guarantee that the timestamp of arrival of a packet (measured by the clock on the destination) is ALWAYS bigger than the timestamp recorded by the sender is a huge win, especially for things like database scaling.

          For this, though, you need to go beyond NTP into PTP, which is still usually based on GPS time and atomic clocks.

        • synack 3 hours ago
          Spanner depends on having a time source with bounded error to maintain consistency. Google accomplishes this by having GPS and atomic clocks in several datacenters.

          https://static.googleusercontent.com/media/research.google.c...

          https://static.googleusercontent.com/media/research.google.c...

          • londons_explore 2 hours ago
            And more importantly, the tighter the time bound, the higher the performance, so more accurate clocks easily pay for themselves in other saved infrastructure costs to service the same number of users.
        • Youden 9 minutes ago
          There's a lot of focus in this thread on the atomic clocks, but in most datacenters they're not actually that important, and I'm dubious that the hyperscalers actually maintain a "fleet" of them, in the sense of hundreds or thousands of these clocks in their datacenters.

          The ultimate goal is usually to have a bunch of computers all around the world run synchronised to one clock, within some very small error bound. This enables fancy things like [0].

          Usually, this is achieved by having some master clock(s) for each datacenter, which distribute time to other servers using something like NTP or PTP. These clocks, like any other clock, need two things to be useful: an oscillator, to provide ticks, and something by which to set the clock.

          In standard off-the-shelf hardware, like the Intel E810 network card, you'll have an OCXO, like [1], with a GPS module. The OCXO provides the ticks; the GPS module provides a timestamp to set the clock with and a pulse for when to set it.

          As long as you have GPS reception, even this hardware is extremely accurate. The GPS module provides a new timestamp, potentially accurate to within single-digit nanoseconds ([2] datasheet), every second. These timestamps can be used to adjust the oscillator and/or how its ticks are interpreted, such that you maintain accuracy between the timestamps from GPS.

          The problem comes when you lose GPS. Once this happens, you become dependent on the accuracy of the oscillator. An OCXO like [1] can hold to within 1µs accuracy over 4 hours without any corrections, but if you need better than that (either more time below 1µs, or more accurate than 1µs over the same time), you need a better oscillator.

          The best oscillators are atomic oscillators. [3], for example, can maintain better than 200ns accuracy over 24h.

          So for a datacenter application, I think the main reason for an atomic clock is simply retaining extreme accuracy in the event of an outage. For quite reasonable accuracy, a more affordable OCXO works perfectly well.

          [0]: https://docs.cloud.google.com/spanner/docs/true-time-externa...

          [1]: https://www.microchip.com/en-us/product/OX-221

          [2]: https://www.u-blox.com/en/product/zed-f9t-module

          [3]: https://www.microchip.com/en-us/products/clock-and-timing/co...
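          Back-of-envelope, the holdover figures above pin down how good the oscillator's frequency has to be. A sketch (assumes a constant frequency offset; real oscillators also drift and age, so treat these as rough lower bounds):

```python
# Time error accumulated during holdover is roughly (frequency error) x
# (holdover duration), so the budget implies a required average
# fractional frequency error.

def required_frequency_error(time_error_s: float, holdover_s: float) -> float:
    """Average fractional frequency error needed to stay within a
    time-error budget over a holdover window."""
    return time_error_s / holdover_s

# OCXO class: 1 microsecond over 4 hours -> ~7e-11
print(required_frequency_error(1e-6, 4 * 3600))
# Atomic-oscillator class: 200 ns over 24 hours -> ~2e-12
print(required_frequency_error(200e-9, 86400))
```

          That factor of ~30 in required frequency stability is essentially what the extra money for an atomic oscillator buys.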

      • adastra22 3 hours ago
        I know this is HN, but the internet is pretty low on the list of things NIST time standards are important for.
        • 2snakes 28 minutes ago
          In a past job I set up at least 5 domain DNS servers pointing at NIST NTP…
        • willis936 3 hours ago
          But pretty high on the list that NIST NTP is important for (since it leaves the building through the internet).
          • adastra22 3 hours ago
            If NIST NTP goes down, the internet doesn’t go down. But atomic clocks drifting does upset many scientific experiments, which would effectively go down for the duration of the outage.
            • willis936 3 hours ago
              This is the reason GP listed out all the alternative robust NTP services that are GPS disciplined, freely available, and used as redundant sources by any responsible timekeeper.

              What atomic clocks are disciplined by NTP anyway? Local GPS disciplining is the standard. If you're using NTP you don't need precision or accuracy in your timekeeping.

        • _zoltan_ 3 hours ago
          could you list 3 things that you think are more important than the internet? (I know the internet is going to be fine; I just want to understand what you think ranks higher globally...)
          • adastra22 3 hours ago
            Mostly scientific stuff like astronomical observations — e.g. did this event observed at one telescope coincide with neutrinos detected at this other observatory.

            Note I didn’t say they are more important than the Internet. That’s a value judgement in any case. I said that NIST stratum 0 NTP servers are more important to these use cases than they are to the Internet.

            • misnome 2 hours ago
              All these use at least GPS for timing
              • adastra22 23 minutes ago
                No, they don’t. GPS is orders of magnitude less reliable than the most up-to-date metric time synchronization over fixed-topology fiber links.
          • Izmaki 3 hours ago
            The ability for humankind to communicate across the entire globe at nearly 1/4 of the speed of light has drastically accelerated our technological advancement. There is no doubt that the internet is a HUGE addition to society.

            It's not super important when compared to basic needs like plumbing, food, electricity, medical assistance and other silly things we take for granted but are heavily dependent on. We all saw what happened to hospitals during the early stages of the COVID pandemic; we had plenty of internet and electricity but were struggling on the medical part. That was quite bad... I'm not sure if it's any worse if an entire country/continent lost access to the Internet. Quite a lot of our core infrastructure components in society rely on this. And a fair bit of it relies on a common understanding of what time "now" is.

          • makeitdouble 3 hours ago
            I think it won't be affected by this, but off the top of my head:

            - GPS

            - industrial complexes that synchronize operations (we could include trains)

            - telecoms in general (so a level higher than the internet)

    • jhart99 2 hours ago
      NIST maintains several time standards. Gaithersburg MD is still up and I assume Hawaii is as well. Other than potential damage to equipment from loss of power (turbomolecular vacuum pumps and oil diffusion pumps might end up failing in interesting ways if not shut down properly), it will just take some time for the clocks to be recalibrated against the other NIST standards.
    • franklyworks 4 hours ago
      Time engineers are very paranoid. I expect large problems can't occur due to a single provider misbehaving.
    • ThrowawayTestr 3 hours ago
      If your computer was using it as its time server and you didn't have alternatives configured, your clock may have drifted a few seconds.
      • Roark66 1 hour ago
        I never checked it, but how much does a typical PC/server's clock actually drift over a week or a month? I always thought it was well under a second.
        • bhouston 0 minutes ago
          Clocks do drift. Seconds a week is definitely possible.
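          For scale, a sketch under the common rule-of-thumb assumption of a ~20 ppm quartz tolerance (real drift varies a lot with the part and with temperature):

```python
# Worst-case accumulated drift for an uncorrected oscillator that runs
# off-frequency by `ppm` parts per million.

def drift_seconds(ppm: float, days: float) -> float:
    return ppm * 1e-6 * days * 86400

print(round(drift_seconds(20, 7), 1))   # -> 12.1 seconds per week
print(round(drift_seconds(20, 30), 1))  # -> 51.8 seconds per month
```

          So "seconds a week" is entirely plausible for a PC that never syncs; NTP normally corrects this continuously, which is why nobody notices.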
  • DamonHD 2 hours ago
    So far I think I'm still seeing one of them in my peers list for my public-ish NTP server:

             remote           refid      st t when poll reach   delay   offset  jitter
        ==============================================================================
        +time-e-b.nist.g .NIST.           1 u  372 1024  377  125.260    1.314   0.280
    • DamonHD 2 hours ago
      ...and maybe it's gone:

          #time-e-b.nist.g .NIST.           1 u 1071 1024  377  125.260    1.314   0.280
  • amelius 2 hours ago
    This makes me wonder, if you take the average time of all wristwatches on the planet, accounting for timezones and throwing out outliers, how close would you get to NTP time?

    And how many randomly chosen wristwatches would you need to get anything reasonable?
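    A toy Monte Carlo for the question (every distribution here is made up purely for illustration; real watch errors are likely biased, per the sibling comments):

```python
# Estimate how the error of a trimmed average of n wristwatches shrinks
# with n, assuming each watch is off by a ~30 s zero-mean Gaussian.
import random
import statistics

def average_watch_error(n, trials=200):
    """Mean absolute error (seconds) of a ~10%-trimmed average of n watches."""
    errs = []
    for _ in range(trials):
        watches = sorted(random.gauss(0, 30) for _ in range(n))
        trim = n // 10  # drop ~10% outliers from each end
        trimmed = watches[trim:n - trim] if trim else watches
        errs.append(abs(statistics.mean(trimmed)))
    return statistics.mean(errs)

random.seed(0)
for n in (10, 100, 1000):
    print(n, round(average_watch_error(n), 2))
```

    Under these (optimistic, unbiased) assumptions the error falls off roughly as 1/sqrt(n), so a few hundred watches would get you within a second or so; any systematic bias, like watches deliberately set fast, would put a floor under that.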

    • nielsole 36 minutes ago
      I have a hunch my Casio wristwatch is designed to run a bit fast to make resetting the seconds easier. Your averaging assumes manufacturers try to make their watches as accurate as possible under average conditions.
      • amelius 22 minutes ago
        I think it runs quick to be on the safe side, so you never miss appointments, trains, etc. because of your watch.

        But yes, good point.

    • throwup238 1 hour ago
      You’re the person Douglas Adams warned us about.
    • varjag 1 hour ago
      Close, but unlikely to be precise in the metrological sense. There are unlikely to be even a billion wristwatches being worn.
    • b112 1 hour ago
      One. One watch. POTUS's watch. And in fact, that's why Boulder is currently shuttered... they disagreed.
  • crazydoggers 3 hours ago
    Status of NIST time servers:

    https://tf.nist.gov/tf-cgi/servers.cgi

  • keepamovin 1 hour ago
    For future reference of civilization: if a facility is critical, it must have a SMR.
  • lovich 3 hours ago
    This was a stratum 0 server, right? What is the actual fallback mechanism when that level of NTP server fails?

    This is some level of eldritch magic that I am aware of but not familiar with, and am interested in learning.

    • Maxious 2 hours ago
      There are two other sites for the time.nist.gov service, so it'll be okay.

      Probably more interesting is how you get a tier 0 site back in sync - NIST rents out these cyberpunk looking units you can use to get your local frequency standards up to scratch for ~$700/month https://www.nist.gov/programs-projects/frequency-measurement...

      • lovich 2 hours ago
        What happens in the event all the sites for time.nist.gov go down? Is it included in the spec?

        Also thank you for that link, this is exactly the kind of esoteric knowledge that I enjoy learning about

        • sdrmill 1 hour ago
          Most high-availability networks use pool.ntp.org or vendor-specific pools (e.g., time.cloudflare.com, time.google.com, time.windows.com). These systems would automatically switch to a surviving peer in the pool.

          Many data centers and telecom hubs use local GPS/GNSS-disciplined oscillators or atomic clocks and wouldn’t be affected.

          Most laptops, smartphones, tablets, etc. would stay accurate enough for days before drift affected anything.

          Kerberos typically requires clocks to be within 5 minutes to prevent replay attacks, so they’d probably be ok.

          Sysadmins would need to update hardcoded NTP configurations to point to secondary servers.

          If timestamps were REALLY off, TLS certificates might fail, but that’s highly unlikely.

          Databases could be corrupted due to failure of transaction ordering.

          Financial exchanges are often legally required to use time traceable to a national standard like UTC(NIST). A total failure of the NIST distribution layer could potentially trigger a suspension of electronic trading to maintain audit trail integrity.

          Modern power grids use Synchrophasors that require microsecond-level precision for frequency monitoring. Losing the NIST reference would degrade the grid's ability to respond to load fluctuations, increasing the risk of cascading outages.
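          Incidentally, checking any of these fallback servers takes only a minimal SNTP exchange. A sketch (the server name is just an example; error handling and the round-trip-delay correction from RFC 4330 are omitted):

```python
# Minimal SNTPv4 client: send a 48-byte request, read the server's
# transmit timestamp out of the reply.
import socket
import struct

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_UNIX_OFFSET = 2208988800

def build_request() -> bytes:
    # First byte 0x23: LI=0, version=4, mode=3 (client); rest zeroed.
    return b"\x23" + 47 * b"\x00"

def parse_transmit_time(reply: bytes) -> float:
    # Transmit timestamp is bytes 40-47: 32-bit seconds + 32-bit fraction.
    secs, frac = struct.unpack("!II", reply[40:48])
    return secs - NTP_UNIX_OFFSET + frac / 2**32

def query(server: str = "time.cloudflare.com", timeout: float = 2.0) -> float:
    """Return the server's transmit timestamp as Unix time (no delay correction)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(build_request(), (server, 123))
        reply, _ = s.recvfrom(48)
    return parse_transmit_time(reply)
```

          A real client would query several servers and compensate for network delay, which is exactly the machinery ntpd/chrony provide.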

          • neomantra 5 minutes ago
            Great list! Just double-checked the CAT timekeeping requirements [1] and the requirement is NIST sync. So a subset of all UTC.

            You don’t need to actually sync to NIST. I think most people PTP/PPS to a GPS-connected Grandmaster with high quality crystals.

            But one must report deviations from NIST time, so CAT Reporters must track it.

            I think you are right: if there is no NIST time signal, then there is no properly auditable trading and thus no trading. MiFID II has similar requirements, but I am unfamiliar with the details.

            One of my favorite nerd possessions is my hand-signed letter from Judah Levine with my NIST Authenticated NTP key.

            [1] https://www.finra.org/rules-guidance/rulebooks/finra-rules/6...

    • lambdaone 2 hours ago
      There are lots of Stratum 0 servers out there; basically anything with an atomic clock will do. They all count seconds independently from one another, all slowly diverging over time, with offset intervals being measured by mutual synchronization using a number of means (how this is done is interesting all by itself). Some atomic clocks are more accurate than others, and an ensemble of these is typically regarded as 'the' master clock.

      To quote the ITU: "UTC is based on about 450 atomic clocks, which are maintained in 85 national time laboratories around the world." https://www.itu.int/hub/2023/07/coordinated-universal-time-a...

      Beyond this, as other commenters have said, anyone who is really dependent on having exact time (such as telcos, broadcasters, and those running global synchronized databases) should have their own atomic clock fleets. There are thousands and thousands of atomic clocks in these fleets worldwide. Moreover, GPS time, used by many to act as their time reference, is distributed by yet other means.

      Nothing bad will happen, except to those who have deliberately made these specific Stratum 0 clocks their only reference time. Anyone who has either left their computer at its factory settings or has set up their NTP configuration in accordance with recommended settings will be unaffected by this.

  • qmarchi 4 hours ago
    Man, they're having a hell of a time up in Boulder.
  • renewiltord 4 hours ago
    Well, where did NTP at NIST last put it? Did they look there?
    • Y_Y 4 hours ago
      You misunderstand, there's been a coup
      • adastra22 3 hours ago
        Of course there is. Where else would they put the reference standard chickens?
      • renewiltord 4 hours ago
        We have to stop those knaves pushing PTP! NTP must prevail!