# Facebook global outage



## zirias@ (Oct 5, 2021)

As it was discussed (off-topic) in at least one other thread, I thought I'd just start a new thread about Facebook's clusterfuck of 2021-10-04.

I have a technical question about it. Word is now they messed up BGP, which sure explains the scope of the problem, but: When FB was down, I had a quick look at DNS using `drill -T`. I found all the domains they use have four nameservers. The interesting thing: trying to resolve the address of these nameservers failed as well. Therefore:

- Shouldn't resolving the nameservers still work because of glue records in the TLD's zones?
- Shouldn't you put at least one nameserver in an AS you don't administrate yourself?


----------



## SirDice (Oct 5, 2021)

As far as I understood it, that BGP mess they made caused their DNS servers to drop off the internet. So even if those glue records were fine, there was no way to contact those DNS servers, and thus no way to get any authoritative answers. Various DNS servers may have had some data cached for some period, but those caches are going to expire at some point. And I'm pretty sure the root DNS servers aren't caching their _internal_ DNS structures.
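The cache-expiry part can be sketched with a toy TTL cache (the name and TTL are illustrative; the address is one of the `a.ns.facebook.com` addresses shown in the `drill` output later in this thread):

```python
import time

class DnsCache:
    """Toy resolver cache: once the authoritative servers are
    unreachable, entries cannot be refreshed, so after the TTL
    elapses the cached answer is simply gone."""

    def __init__(self):
        self._store = {}  # name -> (address, expiry timestamp)

    def put(self, name, addr, ttl, now=None):
        now = time.time() if now is None else now
        self._store[name] = (addr, now + ttl)

    def get(self, name, now=None):
        now = time.time() if now is None else now
        entry = self._store.get(name)
        if entry is None or now > entry[1]:
            return None  # expired: must re-query upstream, which fails
        return entry[0]
```

With a 300-second TTL, a lookup at t=100 still succeeds from cache; at t=400 the entry is gone and the resolver has to go back to the (unreachable) authoritative servers.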

Apparently they used those DNS servers for their internal authentication and authorization too because I heard you couldn't even enter the buildings anymore because the access gates didn't accept their passes anymore. If you can't authenticate your own employees (even the ones trying to fix the mess) it would be virtually impossible to correct the mistakes too. Normally you would have some "master" password or key stored somewhere safe, to be able to access a local backup account. But perhaps they opted not to have that (if an attacker could get a hold of that master key or account/password they could potentially login on everything). Or maybe they did have something in place but it just took a long time to get a hold of those stored keys/passwords (things can get a bit messy if you can't even enter the building or safety box to access it; a bit of a chicken and egg situation). 



Zirias said:


> Shouldn't you put at least one nameserver in an AS you don't administrate yourself?


Sure, but that also means you have to trust that third party where it's hosted. That same nameserver could also be abused to gain access so maybe they opted not to have that risk and keep everything on their own premises.

Whatever actually caused this mess you can be sure they will be having quite a few really long meetings trying to figure out a way to prevent this from ever happening again. And those aren't going to be fun meetings.


----------



## zirias@ (Oct 5, 2021)

SirDice said:


> So even if those glue records were fine there was no way to contact those DNS servers and thus you will not be able to get any authoritative answers.


Okay, so … trying to resolve the nameserver itself with drill(1) won't give me any result, even _if_ the glue records are fine, as long as it can't be verified by also asking an authoritative nameserver?

BTW, I already read about the rest of the story, but thanks for posting the "executive summary" again. It's just kind of hilarious.



SirDice said:


> Sure, but that also means you have to trust that third party where it's hosted. That same nameserver could also be abused to gain access so maybe they opted not to have that risk and keep everything on their own premises.


Hm yes, that _might_ make sense. But it kind of contradicts the redundancy of DNS…


----------



## SirDice (Oct 5, 2021)

Zirias said:


> Okay, so … trying to resolve the nameserver itself with drill(1) won't give me any result, even _if_ the glue records are fine, as long as it can't be verified by also asking an authoritative nameserver?


The root DNS servers just have an entry that says the facebook.com domain is that way. If that way leads to a dead end, you can't resolve anything from the facebook.com domain. I'm pretty sure they have a couple more domains, but the same applies to those too if you just get directed to what's essentially a blackhole.



Zirias said:


> Hm yes, that _might_ make sense. But kind of contradicts the redundancy of DNS…


I'm sure they've spread their DNS servers across different ASs in different datacenters for redundancy. All those ASs are just all owned by facebook and that BGP mess just took everything out in one go.


----------



## zirias@ (Oct 5, 2021)

Yep, but that wasn't what I asked. I had a look yesterday; all the facebook-related domains have 4 nameservers in subdomains, e.g. `a.ns.whatsapp.net` etc. for whatsapp.com. So I'd expect an `A` (and probably `AAAA`) record to be present in the `net.` zone. Of course that's not authoritative. My question was: does `drill -T` just return nothing if it _does_ find the glue record, but can't ask one of the authoritative nameservers? I would have somehow expected it to show the non-authoritative answer before showing the final error…


SirDice said:


> I'm sure they've spread their DNS servers across different ASs in different datacenters for redundancy. All those ASs are just all owned by facebook and that BGP mess just took everything out in one go.


That, of course, makes sense as well. Kind of a bummer that a single error (whatever it was) takes everything down.


----------



## ralphbsz (Oct 6, 2021)

SirDice said:


> Apparently they used those DNS servers for their internal authentication and authorization too because I heard you couldn't even enter the buildings anymore because the access gates didn't accept their passes anymore.


I heard the same story through the Silicon Valley rumor mill. The version I heard is that even the entrance door security system at Facebook's data centers uses Facebook's internal network and DNS. So when all of their networks went down, it became impossible to even get into the building that holds the servers which needed to be restarted or reconfigured. According to the rumors, the problem was eventually solved by using a sledgehammer on a door, and bringing someone and something inside.

The other part of the rumor is that the fix was done by starting with the data center nearest to Facebook's main engineering facilities (which are in Menlo Park); I've heard both Redwood City and Santa Clara mentioned. From there, it became possible to restart networking infrastructure at other sites remotely. That probably used dedicated network links; all the big hyper-scalers have their own fiber networks which they control end to end (not rented bandwidth, but unshared dark fiber). The funny thing about this is that I didn't know that any of the big hyper-scalers have data centers in Silicon Valley. With the insanely high cost of real estate and electricity here, there are few data centers in the immediate vicinity, and the few that exist are run by wholesale colocation operators (like Equinix), usually serving smaller customers. Companies large enough to build dedicated data centers (and Facebook is definitely in that class) typically place them in places where real estate, electricity and cooling are cheaper, but not so remote that labor is unavailable.



> Or maybe they did have something in place but it just took a long time to get a hold of those stored keys/passwords (things can get a bit messy if you can't even enter the building or safety box to access it; a bit of a chicken and egg situation).


I've heard stories that some of the most fundamental security keys (like the ultimate root password to all of Amazon AWS, just as a hypothetical example) are stored in a physical safe (a big steel box with thick walls) near the CEO's office, using a standalone security device. That safe uses traditional mechanical locks (the thing with a dial). I've also heard stories that some of those security devices rely on being unlocked by a pass phrase which is memorized by a small number of humans, but not recorded otherwise (not on a piece of hardware). Part of the long delay in getting Facebook back online might have been caused by the need for one of those humans to be brought to the correct location. If someone has some spare time, they could track what flights Facebook's corporate aircraft took yesterday; it might give us a clue.



> Whatever actually caused this mess you can be sure they will be having quite a few really long meetings trying to figure out a way to prevent this from ever happening again. And those aren't going to be fun meetings.


And everyone else in the industry will also have long meetings, to make sure that "this can't happen to us". Those meetings won't be quite as painful, but by no means amusing.


----------



## gpw928 (Oct 6, 2021)

As I observed to some colleagues earlier today, this will be a case study in IT risk management for the ages! For many, there will be some urgency to that.

The solution goes all the way back to the ancient Chinese warlords who knew all about "defense in depth".

But first, you need to understand what needs defending.  BGP has been a potential single point of failure for a lot of Autonomous Systems for a long time...

The best analysis I found was by Celso Martinho and Tom Strickx at Cloudflare.


----------



## zirias@ (Oct 6, 2021)

gpw928 said:


> The best analysis I found was by Celso Martinho and Tom Strickx at Cloudflare.


I can confirm that; that's what I read yesterday as well. It's in-depth and makes sense (while facebook's own blog post is less detailed).

Still I wonder why drill(1) didn't show me the glue records. Not that it's that relevant for this incident, but I think it would be helpful for DNS debugging?


----------



## sko (Oct 6, 2021)

SirDice said:


> I'm sure they've spread their DNS servers across different ASs in different datacenters for redundancy. All those ASs are just all owned by facebook and that BGP mess just took everything out in one go.


Nope, they didn't. All their authoritative NS are in the same AS (32934), which not only disappeared from the map due to lost BGP peerings; they withdrew routes to all (or at least most) prefixes in that AS before the peerings went down.

They pretty impressively demonstrated how 'single point of failure' works on a large scale. Plus some other seemingly bad design decisions in their infrastructure, like binding each and every system, such as access control and authentication, to the same infrastructure without any fallbacks...
This incident is yet another great example to bring up if some PHB wants to hardwire everything "to the cloud".
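The single-AS point can be illustrated with a toy routing table: all four nameserver addresses from the `drill` output later in the thread fall under prefixes originated by AS 32934, so withdrawing that AS's announcements takes every server out in one go. The /23 prefixes below are an assumption for illustration, chosen to cover those addresses:

```python
import ipaddress

# Toy BGP view: every nameserver prefix is originated by the same AS.
routes = {
    ipaddress.ip_network("129.134.30.0/23"): 32934,
    ipaddress.ip_network("185.89.218.0/23"): 32934,
}

nameservers = ["129.134.30.12", "129.134.31.12",
               "185.89.218.12", "185.89.219.12"]

def reachable(addr, table):
    """An address is reachable if any announced prefix covers it."""
    ip = ipaddress.ip_address(addr)
    return any(ip in prefix for prefix in table)

# Withdrawing everything originated by AS 32934 in one go:
after_withdrawal = {p: asn for p, asn in routes.items() if asn != 32934}
```

Before the withdrawal every nameserver is covered by some announced prefix; afterwards none is, which is exactly the large-scale single point of failure described above.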


edit: I found this screenshot I took of reddit when this all unfolded, where someone from FB was posting some information shortly before their account was deleted and the messages disappeared:



(archived: https://archive.is/QvdmH )

The sentence of the year for me is "There are people now trying to gain access to the peering routers to implement fixes, but the people with physical access is separate from the people with knowledge of how to actually authenticate to the systems and people who know what to actually do". That's textbook clown college right there.


----------



## Geezer (Oct 6, 2021)

For as long as the outage was, however long that was, the world was a better place.


----------



## SirDice (Oct 6, 2021)

sko said:


> The sentence of the year for me is "There are people now trying to gain access to the peering routers to implement fixes, but the people with physical access is separate from the people with knowledge of how to actually authenticate to the systems and people who know what to actually do". That's textbook clown-college there


Not that uncommon though. We have people working in our datacenters; all they can do is keep the hardware itself in order and replace things _we_ tell them to replace. The people working in the datacenters can't log in on anything at all. I have access to the systems via various accounts (even the root accounts), but I'm not allowed to go into the datacenters and actually touch the hardware.


----------



## Crivens (Oct 6, 2021)

This is a good case study of why infrastructure needs to be cold startable.


----------



## gpw928 (Oct 6, 2021)

Crivens said:


> This is a good case study of why infrastructure needs to be cold startable.


I have been involved in two cold starts of large data centres in my career.  Neither was planned.
The problem is that such plans are very hard to test unless you have fully duplicated infrastructure.

Much later, in a grand push to virtualise everything, I do remember some clod asserting that "Microsoft says that there's nothing you can't virtualise".

I politely enquired if he saw any issues booting the SAN and server infrastructure without functioning name and time servers.


----------



## eternal_noob (Oct 6, 2021)

Geezer said:


> For as long as the outage was, however long that was, the world was a better place.


I didn't even notice the outage.

What do you guys think about "A Bad Solar Storm Could Cause an 'Internet Apocalypse'". A real threat or just FUD?


----------



## Crivens (Oct 6, 2021)

eternal_noob said:


> I didn't even notice the outage.
> 
> What do you guys think about "A Bad Solar Storm Could Cause an 'Internet Apocalypse'". A real threat or just FUD?


Not FUD. Search for "Carrington Event". 
Today, we would be so deep in the sticky that we would need an industrial-grade depth gauge just to know which way is up. There is evidence that solar flares some orders of magnitude larger have hit Earth before; one even left carbon-14 from its interaction with the atmosphere. And I recently read that our power grid is not cold-startable, so you may need to get a diesel generator to the transformer stations to jump-start them. And maybe you need to acquire the transportation device from some Amish...


----------



## sko (Oct 6, 2021)

Crivens said:


> This is a good case study of why infrastructure needs to be cold startable.


The problem with that is that even on a smaller scale, a lot of software higher up in the stack, and especially proprietary appliances, fails in the most stupid and/or spectacular ways if it doesn't find its surroundings *exactly* as expected. So you will be hunting weird race conditions if some service or part of the network isn't up yet or routes are not fully established. The higher in the stack you get, the deeper this rabbit hole goes.
Just take the millions of dumb phone apps that tried to (aggressively) reach the FB servers and significantly increased the load on the root DNS servers by doing so.

You might be able to reboot your whole network and bring up L2 and even L3 without any problems, but some fallout higher up in the stack is almost inevitable, and it might even cause chain reactions or overload parts of your network.


----------



## eternal_noob (Oct 6, 2021)

Crivens said:


> Carrington Event


Thanks. Interesting read.
Another reason why mankind should not make itself dependent on IT.


----------



## zirias@ (Oct 6, 2021)

sko: I've seen this problem on a ridiculously small scale in a student-operated "cafe" on the campus of my old university. They had mainly two Linux servers (on horribly old consumer hardware), and neither of them booted correctly when the other one was down. Lots of manual work to get that crap "up" again.


----------



## Crivens (Oct 6, 2021)

_View: https://m.youtube.com/watch?v=hESunUuFrzk_


Someone mentioned that the high rate of preppers among military/police/rescue is due to them knowing the plans, and the chances, should things go sideways, of any government action being of any use. Who can make fire from nothing? Who has an EMP-resistant watch?


----------



## sko (Oct 6, 2021)

Zirias said:


> sko I've seen this problem on a ridiculously small scale in a student-operated "cafe" on the campus of my old university. They had mainly two Linux servers (on horribly old consumer hardware) and none of them booted correctly when the other one was down. Lot of manual work to get that crap "up" again



When rebooting a bunch of switches with STP-managed redundant links, you can watch the network go down every time another switch comes up and, if it has a higher priority, takes over as STP root, briefly interrupting a bunch of those redundant links for a few seconds and often making other switches and everything above L2 freak out... It gets even funnier with stacked switches that take considerably different times to boot, so sometimes the whole stack needs to reboot to elect a new master...
This alone can easily add up to an hour or more until even L2 can be considered 'up' and stable. Now add a fleet of servers that barf their souls throughout the network every time their links come up again during that phase...

Regarding dependent servers/servers that cause a deadlock: you don't even need 2 separate machines. Just put your DHCP server on a VM and let the hypervisor get its IP via DHCP...


----------



## D-FENS (Oct 6, 2021)

Should've built the whole thing Peer to Peer. 
I don't use Facebook myself but I feel your pain. Even at large companies it happens that someone flips the wrong switch occasionally. Especially at very centralized ones it might have global impact.

For me the amazing thing is that this happens so rarely. I would expect it once or twice a year, because people make mistakes. It's a law of nature.


----------



## SirDice (Oct 6, 2021)

sko said:


> When rebooting a bunch of switches with STP-managed redundant links, you can watch the network go down every time another switch comes up and, if it has a higher priority, takes over as STP root, briefly interrupting a bunch of those redundant links for a few seconds and often making other switches and everything above L2 freak out... It gets even funnier with stacked switches that take considerably different times to boot, so sometimes the whole stack needs to reboot to elect a new master...


Only if you haven't configured STP correctly. You specifically set your core switches to be the root bridge; you don't let it figure this out by itself, or you will definitely end up in a situation like the one you're describing. Without properly configured STP, even connecting a blank-configured switch anywhere on your network could take the whole network down for several minutes.
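The election being described boils down to comparing bridge IDs: each bridge advertises a (priority, MAC) pair and the numerically lowest wins, so pinning a low priority on the core switch makes the outcome deterministic. A toy sketch (the priorities and MAC addresses are made up):

```python
# Each bridge ID is a (priority, MAC address) pair; in STP the
# numerically lowest bridge ID wins the root election, comparing
# priority first and falling back to the MAC as a tie-breaker.
bridges = [
    (32768, "00:1a:2b:3c:4d:02"),  # access switch, default priority
    (32768, "00:1a:2b:3c:4d:01"),  # access switch, default priority
    (4096,  "00:1a:2b:3c:4d:03"),  # core switch, priority pinned low
]

def elect_root(bridge_ids):
    """Return the bridge ID that wins the STP root election."""
    return min(bridge_ids)
```

With default priorities everywhere, the tie-breaker is the MAC address, i.e. effectively whichever box happens to have the lowest address wins; that is the "let it figure it out by itself" failure mode.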


----------



## sko (Oct 6, 2021)

SirDice said:


> Only if you haven't configured STP correctly. You specifically set your core switches to be the root bridge; you don't let it figure this out by itself, or you will definitely end up in a situation like the one you're describing. Without properly configured STP, even connecting a blank-configured switch anywhere on your network could take the whole network down for several minutes.


Bridge IDs are configured manually everywhere so there is one distinct root bridge, and if we rebooted the root bridge AND the secondary at the same time, the remaining ones would - as expected/intended - still hand off the role of root bridge according to their bridge IDs (which are never configured identically on different switches, of course). But I once had to reboot all 3 switches/stacks in our main building and chose to just issue the 'reload' on all of them at the same time. One of the remaining 2 at that site took over as root, and thanks to Murphy the 3 switches came up in the opposite order of their bridge IDs - so each time the next one finally came up, it took over as root, triggering an STP reconvergence. Once they were all up and the root bridge with the lowest ID had taken over, everything ran fine - but until then the network was completely useless and I just sat there for ~25 minutes watching the dust settle. This was in my weekly Monday evening maintenance window, so no screaming was involved...
We have since replaced this configuration with just one stack, so this is no longer an issue.


----------



## astyle (Oct 6, 2021)

ralphbsz said:


> If someone has some spare time, they could track what flights Facebook's corporate aircraft took yesterday, it might give us a clue.


That takes knowing the FAA-registered tail numbers of those aircraft. They're not gonna wear bright honu livery like ANA.


----------



## SirDice (Oct 6, 2021)

sko said:


> But I once had to reboot all 3 switches/stacks in our main building and chose to just issue the 'reload' on all of them at the same time.


I'm betting that was a big "Oops" moment. You typically realize this the second you let go of the enter key. 



sko said:


> Once they are all up and the root bridge with the lowest ID has taken over, everything was running fine - but until then the network was completely useless and I just sat there for ~25 minutes watching the dust settle.


STP is nice but it does have quite a few drawbacks. Once it starts recalculating there's really nothing else you can do but lean back and watch it happen.


----------



## sko (Oct 6, 2021)

SirDice said:


> I'm betting that was a big "Oops" moment.


It was more like "I know this stuff works, so how bad can this be?". I have a 2-hour maintenance window (or longer, but then my evening is ruined...) and this was right at the beginning, so plenty of time to watch an "edge case" happening IRL.


----------



## astyle (Oct 6, 2021)

sko said:


> It was more like a "I know this stuff works, so how bad can this be?". I have a 2-hour maintenance window (or longer, but then my evening is ruined..) and this was right at the beginning, so plenty of time to watch an "edge-case" happening IRL.


Yeah, this points to the need to do your homework, and to have a way to go back if you realize you made a mistake. Something I've adopted as my MO lately.


----------



## Jose (Oct 6, 2021)

ralphbsz said:


> The funny thing about this is that I didn't know that any of the big hyper-scalers have data centers in Silicon Valley. With the insanely high cost of real estate and electricity here...


And the likelihood of earthquakes!



ralphbsz said:


> I've heard stories that some of the most fundamental security keys (like the ultimate root password to all of Amazon AWS, just as a hypothetical example) are stored in a physical safe (a big steel box with thick walls) near the CEOs office, using a standalone security device. That safe uses traditional mechanical locks (the thing with a dial). I've also heard stories that some of those security devices rely on being unlocked by a pass phrase which is memorized by a small number of humans, but not recorded otherwise (not on a piece of hardware). Part of the long delay in getting Facebook back online might have been caused by the need for one of those humans to be brought to the correct location. If someone has some spare time, they could track what flights Facebook's corporate aircraft took yesterday, it might give us a clue.


There are software versions of this scheme too:




[Shamir's Secret Sharing - Wikipedia](https://en.wikipedia.org/wiki/Shamir%27s_secret_sharing)
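A minimal sketch of how such a scheme works: split a secret into n shares so that any k of them reconstruct it, while k-1 reveal nothing. The prime and parameters below are illustrative, not anything a real deployment would use as-is:

```python
import random

PRIME = 2**127 - 1  # Mersenne prime; all arithmetic is in GF(PRIME)

def split_secret(secret, n_shares, threshold):
    """Evaluate a random polynomial of degree threshold-1 (with the
    secret as constant term) at x = 1..n_shares."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(threshold - 1)]

    def poly(x):
        acc = 0
        for c in reversed(coeffs):  # Horner's method
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, poly(x)) for x in range(1, n_shares + 1)]

def recover_secret(shares):
    """Lagrange interpolation at x = 0 over GF(PRIME)."""
    secret = 0
    for xi, yi in shares:
        num = den = 1
        for xj, _ in shares:
            if xj != xi:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

Any 3 of 5 shares recover the secret, but with only 2 shares every candidate secret is equally consistent, which is why the scheme suits "a few humans each memorize one part" key escrow.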


ralphbsz said:


> And everyone else in the industry will also have long meetings, to make sure that "this can't happen to us". Those meetings won't be quite as painful, but no means amusing.


There was some gloating at $WORK, which made me fear the jinx. Hubris comes before the fall.



Crivens said:


> _View: https://m.youtube.com/watch?v=hESunUuFrzk_
> 
> 
> Someone mentioned that the high rate of preppers among military/police/rescue is due to them knowing the plans and chances should things get sideways for any gouvernment action to be of any use. Who can make fire from nothing? Who has an EMP resistent watch?







A Super Solar Flare (science.nasa.gov): In September 1859, a solar flare erupted so intense that the explosion itself was visible to the human eye. A ferocious geomagnetic storm ensued in which Northern Lights descended as far south as Cuba, the Bahamas and Hawaii. Meanwhile, telegraph engineers disconnected their batteries and...

Near Miss: The Solar Superstorm of July 2012 (science.nasa.gov): Two years ago today, a historic solar storm narrowly missed Earth, prompting forecasters to revise the odds of future impacts.

Don't panic.


----------



## zirias@ (Oct 6, 2021)

I have to state it: A thread about an outage of facebook(!) quickly turned into discussing the end of civilization. Dafuq?


----------



## sko (Oct 6, 2021)

astyle said:


> yeah, this points to the need to do your homework, and have a way to go back if you realize you made a mistake. Something I adopted as my MO lately.


All the switch configs are in revision control and these switches are 30 seconds away from my desk; so yes - if this really had gone bad I had a pretty straightforward backup plan: just revert the config change and reboot them one by one. As I said: I had plenty of time, knew this 'should' work, and wanted to see how it would handle this edge case (e.g. in case of a long-term power outage our UPS won't handle).


----------



## Jose (Oct 6, 2021)

Zirias said:


> Yep, but that wasn't what I asked. I had a look yesterday, all the facebook-related domains have 4 nameservers in subdomains, e.g. `a.ns.whatsapp.net` etc. for whatsapp.com. So I'd expect an `A` (and probably `AAAA`) record to be present in the `net.` zone.


The root servers return an NS record, which is a host name, not an IP. They will also return a glue record if the host name is under the domain that is being queried, which is very common. (Apparently this is called in-bailiwick.)
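As a toy model of that referral: the parent (TLD) servers don't answer for facebook.com themselves; they hand back the NS *names* plus glue A records, because those names are in-bailiwick and could not be resolved otherwise. The addresses are the ones from the `drill` output earlier in the thread; the dict layout is purely illustrative:

```python
# Toy model of a .com referral with in-bailiwick glue.
COM_DELEGATIONS = {
    "facebook.com.": {
        "NS": ["a.ns.facebook.com.", "b.ns.facebook.com."],
        "glue": {
            "a.ns.facebook.com.": "129.134.30.12",
            "b.ns.facebook.com.": "129.134.31.12",
        },
    },
}

def referral(qname):
    """Return (ns_names, glue) the way a TLD server would for qname."""
    zone = qname
    while zone not in COM_DELEGATIONS:
        _, _, zone = zone.partition(".")  # strip the leftmost label
        if not zone:
            raise KeyError(qname)
    deleg = COM_DELEGATIONS[zone]
    return deleg["NS"], deleg["glue"]
```

A resolver asking for anything under facebook.com gets the NS names and, crucially, the glue addresses in the same response, so it can contact the delegated servers at all.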


----------



## zirias@ (Oct 6, 2021)

Jose said:


> The root servers return an NS record, which is a host name, not an IP. They will also return a glue record if the host name is under the domain that is being queried, which is very common. (Apparently this is called in-bailiwick.)


Exactly. And my question still is: doesn't drill(1) show these when tracing?


----------



## Jose (Oct 6, 2021)

Zirias said:


> Exactly. And my question still is: doesn't drill(1) show these when tracing?


`drill ns` does show them for me.

```
drill ns facebook.com                                      
;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 61759
;; flags: qr rd ra ; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 8 
;; QUESTION SECTION:
;; facebook.com.    IN    NS

;; ANSWER SECTION:
facebook.com.    9510    IN    NS    c.ns.facebook.com.
facebook.com.    9510    IN    NS    b.ns.facebook.com.
facebook.com.    9510    IN    NS    a.ns.facebook.com.
facebook.com.    9510    IN    NS    d.ns.facebook.com.

;; AUTHORITY SECTION:

;; ADDITIONAL SECTION:
a.ns.facebook.com.    9510    IN    A    129.134.30.12
b.ns.facebook.com.    9510    IN    A    129.134.31.12
c.ns.facebook.com.    9510    IN    A    185.89.218.12
d.ns.facebook.com.    9510    IN    A    185.89.219.12
a.ns.facebook.com.    9510    IN    AAAA    2a03:2880:f0fc:c:face:b00c:0:35
b.ns.facebook.com.    9510    IN    AAAA    2a03:2880:f0fd:c:face:b00c:0:35
c.ns.facebook.com.    9510    IN    AAAA    2a03:2880:f1fc:c:face:b00c:0:35
d.ns.facebook.com.    9510    IN    AAAA    2a03:2880:f1fd:c:face:b00c:0:35

;; Query time: 0 msec
;; SERVER: 172.16.1.4
;; WHEN: Wed Oct  6 08:12:26 2021
;; MSG SIZE  rcvd: 285
```
They're the A records in the additional section.


----------



## zirias@ (Oct 6, 2021)

Ah! OK, so I have a way to check for them. I guess `drill -T a.ns.facebook.com` doesn't show them, even if they are used to query the authoritative nameservers and that fails. That's somewhat confusing, as I first thought facebook might have withdrawn their glue records.


----------



## Jose (Oct 6, 2021)

Zirias said:


> Ah! OK, so I have a way to check for them. I guess `drill -T a.ns.facebook.com` doesn't show them, even if they are used to query the authoritative nameservers and that fails. That's somewhat confusing, as I first thought facebook might have withdrawn their glue records.


If you think about it, the glue records don't live on the Facebook DNS servers; they'd be useless there, since the address of those servers is precisely what you're trying to find. How would you query the Facebook DNS servers for a glue record when you can't locate them in the first place?

I've only set up small-time domains, many, many orders of magnitude smaller than Facebook, but FWIW, in those cases the glue records are hosted at my registrar.

Thought experiment. Suppose I think glue records are an Evil Hack, and I've come up with this scheme to work around them. I have two domains, example.com and example.net. The name server for example.com is ns.example.net, and the name server for example.net is ns.example.com. Why won't that work?
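The thought experiment can be played out in a toy resolver: each zone's name server lives in the other zone and no parent carries glue, so every resolution step just spawns another one. The hostnames come from the thought experiment; the depth limit stands in for a real resolver giving up:

```python
# ns.example.net serves example.com, ns.example.com serves example.net,
# and no parent zone carries a glue record for either name.
NS_OF = {"example.com.": "ns.example.net.",
         "example.net.": "ns.example.com."}

def zone_of(host):
    # crude: treat the last two labels as the registered domain
    return ".".join(host.rstrip(".").split(".")[-2:]) + "."

def resolve(host, depth=0, limit=8):
    """Finding host's address needs the address of its zone's name
    server; without glue, that name must itself be resolved first."""
    if depth > limit:
        raise RecursionError("circular delegation and no glue to break it")
    nameserver = NS_OF[zone_of(host)]
    return resolve(nameserver, depth + 1, limit)
```

`resolve("www.example.com.")` never terminates; a glue A record for either NS name, published in its parent zone, is exactly what breaks the cycle.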


----------



## zirias@ (Oct 6, 2021)

Jose said:


> If you think about it, the glue records don't live on the Facebook DNS servers; they'd be useless there, since the address of those servers is precisely what you're trying to find. How would you query the Facebook DNS servers for a glue record when you can't locate them in the first place?


Although you kind of answered my question (gotta remember how to "see" glue records with drill(1)), you still don't understand me. The trace option (`-T`) of drill is specifically for debugging purposes. So I just _assumed_ it _would_ show glue records it finds on the way. After all, these are needed to be _able_ to ask facebook's nameservers in the first place. There's no doubt the whole operation will fail if none of these authoritative nameservers can be contacted; that was never my question.


----------



## astyle (Oct 6, 2021)

Jose said:


> Thought experiment. Suppose I think glue records are an Evil Hack, and I've come up with this scheme to work around them. I have two domains, example.com and example.net. The name server for example.com is ns.example.net, and the name server for example.net is ns.example.com. Why won't that work?


Circular references.


----------



## zirias@ (Oct 6, 2021)

Just to get that straight: you can mess up glue records. If you request none in a domain update, you get none. And yes, there are use cases where you don't need any (if your authoritative nameservers are in different domains that _are_ "glued").

`drill -T` not showing anything surprised me and led me to the (false) assumption facebook might have messed up their glue records with erroneous domain updates. That's it…


----------



## Jose (Oct 6, 2021)

This appears to be an implementation detail of drill(1). dig(1) with `+trace` doesn't show them either, but it actually queries one of the servers it must have got through a glue record!

```
dig facebook.com ns +trace

; <<>> DiG 9.16.4 <<>> facebook.com ns +trace
;; global options: +cmd
.            37065    IN    NS    k.root-servers.net.
.            37065    IN    NS    b.root-servers.net.
.            37065    IN    NS    i.root-servers.net.
.            37065    IN    NS    h.root-servers.net.
...
facebook.com.        172800    IN    NS    c.ns.facebook.com.
facebook.com.        172800    IN    NS    d.ns.facebook.com.
facebook.com.        172800    IN    NS    b.ns.facebook.com.
facebook.com.        172800    IN    NS    a.ns.facebook.com.
;; Received 284 bytes from 129.134.30.12#53(a.ns.facebook.com) in 0 ms
```

It's also much faster than drill, for some reason.


----------



## zirias@ (Oct 6, 2021)

Yep, gotta remember that: To see glue records with drill(1), do an explicit `NS` query.


----------



## Crivens (Oct 6, 2021)

Zirias said:


> I have to state it: A thread about an outage of facebook(!) quickly turned into discussing the end of civilization. Dafuq?


I don't know about the network stuff involved, and I could hardly care less about FB being gone, but I do worry about the increasing dependence on fragile infrastructure and how interwoven some of that fragile stuff is. Texas snowstorm, anyone? Pipeline shutdown?


----------



## Sevendogsbsd (Oct 6, 2021)

Crivens: exactly. I am in Texas and the idiots running the local government have a private corporation in control of the power grid. The state is not on the national grid for some stupid reason. In the interest of saving $, they did not winterize the equipment and we all know the outcome of that.


----------



## jbo (Oct 6, 2021)

Sevendogsbsd said:


> Crivens: exactly. I am in Texas and the idiots running the local government have a private corporation in control of the power grid. The state is not on the national grid for some stupid reason. In the interest of saving $, they did not winterize the equipment and we all know the outcome of that.


Wait... I hope I misunderstood this. Are you saying that the state of Texas is not connected to the U.S. national power grid? i.e. Texas' power grid is decoupled/isolated/separated from it?
If so, are there physical links that are simply deactivated "when not needed", or is it really, truly an isolated grid?


----------



## Crivens (Oct 6, 2021)

jbodenmann https://www.texastribune.org/2011/02/08/texplainer-why-does-texas-have-its-own-power-grid/

Oh, and ctrl-f mexico 

But back on topic, any news about the face book outage?


----------



## Sevendogsbsd (Oct 6, 2021)

jbodenmann said:


> Wait... I hope I misunderstood this. Are you saying that the state of Texas is not connected to the U.S. national power grid? i.e. Texas' power grid is decoupled/isolated/separated from it?
> If so, are there physical links that are simply deactivated "when not needed", or is it really, truly an isolated grid?


That is correct. It is quite idiotic actually. The state government wants independence from the federal government, but they will sure take federal money when it is offered…

As for the mechanics of the separation, I am not sure of the details.


----------



## jbo (Oct 6, 2021)

The United States are a really scary place...


----------



## Vull (Oct 6, 2021)

jbodenmann said:


> The United States are a really scary place...


Acknowledged


----------



## astyle (Oct 6, 2021)

jbodenmann said:


> The United States are a really scary place...


Especially Texas


----------



## Beastie7 (Oct 6, 2021)

I would love to go back to the early 00's when all this garbage never existed, and people actually went outside. I hope Facebook, etc. dies a good death.


----------



## Sevendogsbsd (Oct 6, 2021)

Agree. I had a fleeting interest in social media but then I saw what it was/is. I have had a Twitter account 2x and rage quit both times because frankly I get obsessed with it and there are so many trolls. At least I had the good sense to only make fake burner accounts.


----------



## SirDice (Oct 6, 2021)

I have a whole bunch of social media accounts. Don't really use them though, I just created them so my "name" is reserved and nobody could impersonate me. I do use Facebook but mostly to keep in touch with my parents and some friends. The only thing I've been recently posting are pictures of my cats (you're supposed to post those on the internet  ). And I couldn't post new ones for 6 whole hours! My whole world collapsed! Seriously though, I barely noticed it. Did read a lot of news stories about it, that was hard to miss, it got plastered all over.


----------



## teo (Oct 6, 2021)

The worst thing is that we knowingly keep using applications and systems that are instruments of mass espionage (Facebook, WhatsApp, Instagram, Google, Microsoft), which collect personal data and trade it at will. It is time to get our relatives onto more secure applications such as Telegram or Signal, to safeguard at least something and not be at the mercy of hackers - because the oligopolies don't give us a penny on the dollar.


----------



## ralphbsz (Oct 7, 2021)

jbodenmann said:


> The United States are a really scary place...


It's not 100% clear that Texas is really in the United States. I mean, legally it technically is ... but emotionally, it doesn't quite feel like it.

It's somewhat like Bavaria and Germany: When in Bavaria, you know that you're primarily in the "Free state of Bavaria" (Freistaat Bayern), and only secondarily in Germany.

Now, seriously about the power grid: People from Europe have to recognize that some of the isolation is also simply due to the distances involved: From Houston, Texas to New York is about as far as from Madrid, Spain to Warsaw, Poland. Also about the same direction (northeast).


----------



## eternal_noob (Oct 7, 2021)

ralphbsz said:


> It's not 100% clear that Texas is really in the United States. I mean, legally it technically is ... but emotionally, it doesn't quite feel like it.


Yup.


SirDice said:


> I have a whole bunch of social media accounts. Don't really use them though, I just created them so my "name" is reserved and nobody could impersonate me.


I use a different name on every platform i am active on (3 forums, 1 Github). So that one can't link my identity to my interests.

On topic: I find it hilarious / alarming / disturbing that Mr. Zuckerberg lost $7 billion just because his services weren't working for 6 hours. Society is rotten.


----------



## Geezer (Oct 7, 2021)

eternal_noob said:


> I use a different name on every platform...  So that one can't link my identity to my interests.



How many hybrid duck-squid do you think there are in the world?


----------



## eternal_noob (Oct 7, 2021)

I am unique. In every way.

s/squid/Cthulhu/g


----------



## jbo (Oct 7, 2021)

ralphbsz said:


> Now, seriously about the power grid: People from Europe have to recognize that some of the isolation is also simply due to the distances involved: From Houston, Texas to New York is about as far as from Madrid, Spain to Warsaw, Poland. Also about the same direction (northeast).


I appreciate that input. I do tend to remind myself of that but whenever I read something like this internally I still go "uuuh". I live in a small land-locked country in central Europe. I believe the longest border-to-border distance is something just above 300km.



eternal_noob said:


> I use a different name on every platform i am active on (3 forums, 1 Github). So that one can't link my identity to my interests.


Given that you're a FreeBSD user I take it that you know that profiling does not stop at comparing user names


----------



## Deleted member 30996 (Oct 7, 2021)

Crivens said:


> Not FUD. Search for "Carrington Event".


If it happened now it would take years to replace all the transformers it would fry. It's supposed to be something that happens in a cycle, and that wouldn't have been the first time it occurred:

> "We will have lost nearly 40% of our magnetic field strength by 2030. If earth continues to lose 5%+ per decade of magnetic field strength during this solar maximum, it will be less able to hold energy from a large solar CME away from the surface, mantle and core of earth. Things will melt. A Carrington sized solar CME (coronal mass ejection), which melted the electric wires and components of 1859, will have an exponentially greater impact on earth with our currently depleted, still weakening field strength.
>
> The Carrington event was not the largest ever recorded and is in the middle range of historic extreme CME events. The Carrington event is a documented, once in every 150 to 200 year cyclic solar event, of that large size, or magnitude for you science terminology fans. Do the math as we approach the 200 year mark. There is currently no developed theoretical or practical shielding to protect electrical components, if solar electrical forcing is powerful enough to reach into earth's core, sending a wave of energy back through the mantle into the crust outward."








Link: "Solar Kill Shot? Cyclic 200 Year Super Flare, To Hit Our Exhausted Magnetic Field" (www.publish0x.com)
Col. Ed Dames, a remote viewer for the CIA, predicted "The Killshot" in 2012. Then new age noodle noggins got involved and started talking nonsense. Ed Snowden said the CIA was aware of the cycle.

NASA and NOAA Space Weather Satellite Data Updated Daily:
Solar Storm Monitor


----------



## hitest (Oct 7, 2021)

Sevendogsbsd said:


> Agree. I had a fleeting interest in social media but then I saw what it was/is. I have had a Twitter account 2x and rage quit both times because frankly I get obsessed with it and there are so many trolls. At least I had the good sense to only make fake burner accounts.


At the moment I have no social media accounts.  I have a love/hate relationship with social media (I mostly hate it).  I find that when I use social media, like you, I become too obsessive.  I deleted my FB account 2 weeks ago, it was the last social media platform to go.  I deleted FB once before and then foolishly re-engaged.  I don't miss the white noise of social media.


----------



## astyle (Oct 7, 2021)

I don't have a Facebook account, and I'm proud of that. I do have a Twitter account, but that was acquired back before 2010, when I was researching how to jailbreak an iPod Touch 2nd gen. I haven't touched that account since, so I would not be surprised if it got weeded out for being inactive.

But man, Facebook is more fuss than it's worth. Not only is there pressure to curate what you've got; if you do something stupid, it can snowball into something beyond your control in a hurry. Even a simple like on Heineken's page can, after a while, get you labeled an alcoholic, which will pull in devel/violence as a dependency, and then armed police will show up at your doorstep, charge you with a crime committed in a country you've never even heard of, and good luck trying to completely untangle all that in court. All from a simple like on Heineken's page.


----------



## Beastie7 (Oct 7, 2021)

Wasn't Facebook originally a DARPA funded project called LifeLog?


----------



## eternal_noob (Oct 7, 2021)

Nope. Zuckerberg stole it from the Winklevoss twins.


----------



## astyle (Oct 7, 2021)

Beastie7 said:


> Wasn't Facebook originally a DARPA funded project called LifeLog?


I didn't even know LifeLog existed until Beastie7 mentioned it here. But reading about it on Wikipedia, one would think that Zuck stole the idea of such extensive tracking/logging from DARPA rather than from the Winklevoss twins. The Winklevoss lawsuit centered on source code and misrepresenting the company valuation during negotiations, which amounted to securities fraud. DARPA's ambitions were fulfilled by Zuck perfectly, to a T, with the money the Winklevoss twins were allegedly cheated out of. That would be my tongue-in-cheek conclusion.


----------



## jammied (Oct 10, 2021)

One thing I do wonder is, would it not make sense when operating a project as large as Facebook to use static routes to some degree to make it harder for routing failures to occur? Thing is, dynamic protocols such as BGP are innately more at risk of failure. I feel like the whole outage is evidence of some degree of poor network management. Not that I am greatly surprised for a company whose product is in fact written in PHP.


----------



## zirias@ (Oct 10, 2021)

jammied said:


> One thing I do wonder is, would it not make sense when operating a project as large as Facebook to use static routes to some degree to make it harder for routing failures to occur? Thing is, dynamic protocols such as BGP are innately more at risk of failure. I feel like the whole outage is evidence of some degree of poor network management. Not that I am greatly surprised for a company whose product is in fact written in PHP.


AFAIK, there's no way _not_ to use BGP. As I understand it, you can configure routes statically that are announced by BGP, but if you mess up these routes when changing something in your infrastructure, you will end up with the same problem.
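As a sketch of what "statically configured but announced by BGP" can look like - a hypothetical FRRouting-style snippet with made-up ASNs and RFC 5737 documentation prefixes - the static route is redistributed as-is, so a typo in it is exactly what the world gets told:

```
! Anchor the prefix locally (Null0 just keeps it in the routing table)
ip route 203.0.113.0/24 Null0
!
router bgp 65001
 neighbor 192.0.2.1 remote-as 65000
 address-family ipv4 unicast
  redistribute static
 exit-address-family
```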


----------



## jammied (Oct 10, 2021)

Zirias said:


> AFAIK, there's no way _not_ to use BGP. As I understand it, you can configure routes statically that are announced by BGP, but if you mess up these routes when changing something in your infrastructure, you will end up with the same problem.


Yes and no. To make your routes globally available, BGP has to be involved somewhere along the line. Key points are:


- Individual ISPs could set up static routes to your network without relying on BGP. This is relevant as it would provide some level of backup in the event of a routing failure.
- You certainly do not need to use BGP on your internal networks at all. You can use other routing protocols (of which there are many), rely entirely on static routes, or have a mix of dynamically routed IPs and static routes. It is worth noting that it is fairly straightforward for BGP to pick up route information that is either declared by static routes or discovered by other routing protocols.
- In some respects, the most important role of BGP is discovering where other IP blocks are on the internet, as opposed to announcing your own routes. You could, if you wanted, implement a structure where your network never announces BGP routes to other peers, but instead relies on the carriers you use to announce on your behalf that they have routes to your network. That is not to say you would actually want to do this, but it is technically possible.

On the last point, I think some techniques would make sense for adding a backup in case of BGP failures:

- Seek to have a small handful of ISPs set up static routes to a restricted portion of your address space, so that if your BGP does fail, you have some form of failover.
- Have some network equipment with external IP addresses assigned by carriers rather than by yourself.
- Set up VPN access to network equipment with external IP addresses assigned by other carriers from their own IP blocks, and use the VPN as an alternative way of getting external access to your full address block in the event of BGP failure.
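The first of those failover ideas (an ISP pinning a static route to a customer's small fallback prefix) can be quite mundane. A minimal sketch in FreeBSD rc.conf(5) syntax, with made-up RFC 5737 addresses standing in for the prefix and the hand-off next hop:

```
# Static route to the customer's fallback prefix, independent of BGP
# (addresses are documentation-space placeholders)
static_routes="customer_fallback"
route_customer_fallback="-net 203.0.113.0/28 192.0.2.10"
```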


----------



## zirias@ (Oct 10, 2021)

jammied said:


> So, to make your routes globally available, BGP has to be involved somewhere along the line.


And that's the main point. And we have seen in the past how fragile this can be 


jammied said:


> Individual ISPs could setup static routes to your network without relying on BGP. This is relevant as it would provide some level of backup in event of routing failure.


Sure. I have no experience with that, but would they even be willing to do so?


jammied said:


> You certainly do not need to use BGP on your internal networks, at all.


Of course not. And I have no idea how facebook handled that…


jammied said:


> The most important role for BGP in some respects is to discover where other IP blocks are on the internet as opposed to annoucing your own routes. You could if you wanted implement a structure where your network never announces BGP routes to other peers but where you rely on the carriers you use to annouce for you that they have routes to your network.


Well again, I kind of doubt your peers would be willing to do that. But again, I don't have any experience in that field…


----------



## jammied (Oct 10, 2021)

Zirias said:


> And that's the main point. And we have seen in the past how fragile this can be
> 
> Sure. I have no experience with that, but would they even be willing to do so?
> 
> ...


For a company as large as Facebook, I think the peers would have a strong incentive to agree to that. Also, I want to highlight that with the notion I suggested of getting IP addresses assigned from carrier IP blocks, you wouldn't need to be a big company to get that - I could easily get that from the broadband provider on my home internet package.


----------



## sko (Oct 11, 2021)

You couldn't reliably set up e.g. multiple uplinks with static routes. If one goes down the route would still be there and the ISP would try to send traffic over that (inactive) link. The whole point of BGP is to make routing dynamic and resilient - there is (usually) never only a single path to a network, but multiple routes over different paths. So if one goes down, others are still available - they might be slower, longer or have a lower preference set, but they are still there, so the prefixes remain reachable.

BGP works; it has been there since the dawn of time and has proven to be reliable and has been refined and hardened over the years (ROA/RPKI). Human error will always haunt _every_ service, no matter how well designed the protocol/service is; this has nothing to do with BGP. With properly configured BGP it is much harder to bring a network down by accident than by mangling static routes and (huge) routing tables by hand or somehow automated/scripted. (been there, had to do that for a crappy uplink with LTE-backup, would never want to touch such bullsh*t again)
AWS and Microsoft brought down their clouds (multiple times) without any involvement of BGP, purely by other bad design decisions... (e.g. MS placing all servers for an essential background service in a single DC...)

In the case of FB, they assumed their backbone from their "edge datacenters" to the main DC would never go down. To prevent malfunctioning servers from being reachable from the world, it seems they thought it best to have them revoke their own routes when they can't reach the backend - but as the backbone went dark, all DNS servers revoked their routes, effectively killing all DNS for their zones. Hence everything else that relies on DNS (hint: almost everything) also stopped working, and even more routes were revoked.
Some static routes *might* have worked here to some extent, but static routes are a nightmare to maintain even on smaller networks, let alone in multi-homed AS or huge networks consisting of multiple AS. You absolutely have to use a dynamic routing protocol here, and BGP is the best and agreed-upon standard for inter-network routing.
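That chain reaction can be sketched abstractly. This is hypothetical logic with RFC 5737 documentation prefixes, not Facebook's real automation or address blocks: each edge DNS site withdraws its prefix when its backend health check fails, so a backbone outage makes every site withdraw at once.

```python
# Toy model of "withdraw the route when the health check fails".

def announced_prefixes(edge_sites):
    """Prefixes still announced after every site runs its health check."""
    return [site["prefix"] for site in edge_sites if site["backend_ok"]]

sites = [{"prefix": p, "backend_ok": True}
         for p in ("198.51.100.0/24", "203.0.113.0/24")]

# Normal operation: every edge site reaches the backbone, all announced.
print(announced_prefixes(sites))   # -> ['198.51.100.0/24', '203.0.113.0/24']

# Backbone outage: the same check fails at every site simultaneously,
# so every DNS prefix is withdrawn at once - a total authoritative outage.
for site in sites:
    site["backend_ok"] = False
print(announced_prefixes(sites))   # -> []
```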

Internally you can always set up multiple routing protocols; e.g. OSPF; and you can even exchange routing information between different protocols - but then you again start to make it complex and errors or unforeseeable events will propagate and might again cause a chain reaction.


----------



## jammied (Oct 11, 2021)

sko said:


> You couldn't reliably set up e.g. multiple uplinks with static routes. If one goes down the route would still be there and the ISP would try to send traffic over that (inactive) link. The whole point of BGP is to make routing dynamic and resilient


Well, you could. It would just be questionable how much point there would be in doing it through anything other than BGP directly, except for a small part of your address space intended to receive limited traffic. Essentially, you just need the connection to peers to be made in a way that results in the route dropping from the foreign peer's routing table if the connection is lost. An obvious way of doing this would be maintaining the connection through CHAP, in such a way that the existence of the route in the routing table is inherently tied to the existence of the CHAP session.

Either way, if you know a certain connection is always going to provide access to a certain set of routes, you can simply ask the remote peer to create static routes to your network on their edge routers, on the basis that the remote peer's internal routing protocols will then propagate those routes across their network. Essentially, the existence of those routes on the remote peer's network is not tied to the state of your own routing protocols, only to that of the remote peer's network. It wouldn't be difficult to maintain a small number of static routes in such a manner. If, however, you try to set up static routes on every router then yes, I would agree, that would become more of a nuisance.


----------

