# uptime



## Argentum (Oct 8, 2021)

I think I have discovered the root cause of FreeBSD *unpopularity*:

```
~> uptime
 8:30PM  up 247 days,  3:38, 6 users, load averages: 0.90, 0.96, 0.93
```

*Nothing really happens...*


```
PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
 2635 root         17  20    0  2082M  1922M kqread   6 5213.0  56.71% bhyve
 1707 root         17  20    0  2082M  1908M kqread   2 245.5H   1.37% bhyve
19103 root         13  20    0  2073M  1781M kqread   2 201:57   0.53% bhyve
 2409 root         13  20    0  2073M  1791M kqread   6  46.1H   0.44% bhyve
```

_This is not a record, this is just an average..._


----------



## zirias@ (Oct 8, 2021)

So, your kernel has unpatched security issues? Nice.


----------



## Argentum (Oct 8, 2021)

Zirias said:


> So, your kernel has unpatched security issues? Nice.


I know, but I cannot reboot it right now.

Fortunately it has a custom-built kernel with most unnecessary stuff removed, and the bhyve kernels are even more tuned...

And there is no urgent need to reconfigure:


```
# zpool status; zpool iostat -v
  pool: zroot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 03:44:47 with 0 errors on Fri Oct  1 06:44:48 2021
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada0p3  ONLINE       0     0     0
            ada1p3  ONLINE       0     0     0
            ada2p3  ONLINE       0     0     0
            ada3p3  ONLINE       0     0     0

errors: No known data errors
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zroot       1.82T  34.4T     10    200   608K  13.4M
  raidz1    1.82T  34.4T     10    200   608K  13.4M
    ada0p3      -      -      2     36   209K  4.66M
    ada1p3      -      -      2     34   204K  4.63M
    ada2p3      -      -      2     36   209K  4.66M
    ada3p3      -      -      2     34   204K  4.63M
----------  -----  -----  -----  -----  -----  -----
```


----------



## mer (Oct 8, 2021)

I'm not the only one, I see.
With "Windows", the number is related to how many days between reboots.

As for unpatched security stuff:
I always look at what the CVE says.  If it is something that does not apply to me (say "inbound web PHP malformed traffic" when I don't have any webserver) I simply ignore it until one pops up I care about.
So yes, if whatever Argentum is running does not have the attack vector exposed, it should be a non-issue.


----------



## Argentum (Oct 8, 2021)

mer said:


> I always look at what the CVE says.  If it is something that does not apply to me (say "inbound web PHP malformed traffic" when I don't have any webserver) I simply ignore it until one pops up I care about.


And I do not have to reboot it to upgrade the webserver...
And the webservers are in the bhyve guests, not on the hypervisor.

And after all of this painstaking agony and boredom I type at the terminal:

```
~# fuck
No fucks given

~# pkg info thefuck
thefuck-3.31
Name           : thefuck
Version        : 3.31
Installed on   : Fri Oct  8 21:24:26 2021 EEST
Origin         : misc/thefuck
Architecture   : FreeBSD:12:*
Prefix         : /usr/local
Categories     : python misc
Licenses       : MIT
Maintainer     : ygy@FreeBSD.org
WWW            : https://github.com/nvbn/thefuck
Comment        : App that corrects your previous console command
Annotations    :
Flat size      : 659KiB
Description    :
Thefuck is a magnificent app which corrects your previous console command.
It tries to match a rule for the previous command, creates a new command
using the matched rule and runs it. Thefuck comes with a lot of predefined
rules, but you can create your own rules as well.

You should place this command in your shell config file:

eval $(thefuck --alias)

WWW: https://github.com/nvbn/thefuck
```


----------



## mer (Oct 8, 2021)

fantastic


----------



## Deleted member 30996 (Oct 13, 2021)

Argentum said:


> I think I have discovered the root cause of FreeBSD *unpopularity*:
> *Nothing really happens...*


No, it just keeps running and running. That's the way I like my .mp3 players.

306 days of uptime is my best screenshot record for a desktop, on my X61 .mp3 player.
The W520 that took over that job when the X61 fan died is currently at 172 days.


----------



## Argentum (Oct 13, 2021)

Trihexagonal said:


> No, it just keeps running and running. That's the way I like my .mp3 players.
> 
> 306 days uptime my best screenshot record for a desktop on my X61 .mp3 player.
> The W520 that took over that job when the X61 fan died currently at 172 days.


My personal server uptime record with FreeBSD is over *1000* days.


----------



## mer (Oct 13, 2021)

My best was above Trihexagonal and below Argentum.  It only got reset because of a power outage that lasted longer than the UPS.


----------



## Argentum (Oct 13, 2021)

mer said:


> My best was above Trihexagonal and below Argentum.  It only got reset because of a power outage that lasted longer than the UPS.


We had to move to another service provider.


----------



## SirDice (Oct 13, 2021)

Don't have a high uptime, I reboot often. 



Argentum said:


> Nothing really happens...


But yeah, nothing really happens. Even with the amount of tinkering I do. It borders on boring. And I mean 'boring' in a positive sense, it "Just Works™"


----------



## Argentum (Oct 13, 2021)

SirDice said:


> But yeah, nothing really happens. Even with the amount of tinkering I do. It borders on boring. And I mean 'boring' in a positive sense, it "Just Works™"


I tried to make a joke: people like systems that are not so boring and keep them busy...


----------



## mark_j (Oct 13, 2021)

SirDice said:


> Don't have a high uptime, I reboot often.



That's your Microsoft training kicking in.


----------



## zirias@ (Oct 13, 2021)

mer said:


> As for unpatched security stuff:
> I always look at what the CVE says. If it is something that does not apply to me (say "inbound web PHP malformed traffic" when I don't have any webserver) I simply ignore it until one pops up I care about.


That's what you _should_ do (and what I do as well). Still, a patchlevel-release with _none_ of the SAs affecting my system doesn't happen _that_ often.

It happens a bit more often that the kernel isn't affected, so, in theory, a reboot is not required. But then you'd have to make sure any service linking to, e.g., some lib from base is restarted. In practice, it's more reliable to just find a timeslot for a reboot.

Getting an uptime of several hundred days _without_ having any unpatched security issues would IMHO, if possible at all, require an extremely stripped-down system (many `WITHOUT_*` entries in /etc/src.conf, many `nodevice` lines et al. in your kernel config).
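For the userland-only case, here is a sketch of the restart-instead-of-reboot workflow. The commands, the sysutils/checkrestart port, and the `nginx` service name are illustrative assumptions, not a recipe from this thread:

```shell
# After a userland-only patch (no kernel SA), restart affected services
# instead of rebooting. sysutils/checkrestart (assumed available here)
# reports processes still running stale binaries or mapping libraries
# that changed on disk.
pkg install -y checkrestart

# List processes whose on-disk executable or libraries were updated.
checkrestart

# Restart each affected daemon; "nginx" is just an example service name.
service nginx restart

# Re-run to confirm nothing stale remains.
checkrestart
```

Whether restarting really covers everything depends on the SA in question; when in doubt, the reboot timeslot is the safe option.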


----------



## SirDice (Oct 13, 2021)

mark_j said:


> That's your Microsoft training kicking in.


Nah, uptimes are overrated. When I make drastic changes I like to reboot to make sure everything comes up correctly in case of a power failure or some other reason that may cause the system to restart. Some people just seem to have something against rebooting a system, I don't. I have no problems whatsoever to reboot a system.


----------



## zirias@ (Oct 13, 2021)

SirDice said:


> uptimes are overrated.


Very much! If you think about it, there's never a way to ensure you'll _never_ need a reboot – sooner or later, there _will_ be some issue (e.g. security) forcing you to do it. So, the only way to have something like a "zero downtime service" is to operate that service with redundant instances. But then, reboots don't hurt anyways.


----------



## mer (Oct 13, 2021)

Argentum said:


> We had to move to another service provider.


Well, all the different providers here put the power onto the same set of wires; when an ice storm brings down trees and snaps wires, it doesn't matter who is supplying the power.


----------



## SirDice (Oct 13, 2021)

Zirias said:


> So, the only way to have something like a "zero downtime service" is to operate that service with redundant instances.


Key difference here is that you can guarantee uptime of your _service_, not the _server_. The service you provide is important, the server that runs that service isn't.


----------



## kpedersen (Oct 13, 2021)

SirDice said:


> I don't. I have no problems whatsoever to reboot a system.


Agreed. On some installs (especially those that use X11 and the GPU and are public-facing) I have them reboot overnight from a cron job.

It may be excessive, but forcing a cold boot means that the next morning I know when something has gone wrong and can debug the faulty change, rather than a month (or more!) later, when so many things have changed that any of them could be causing the issue.
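A minimal /etc/crontab fragment for that kind of scheduled reboot; the time and warning message here are arbitrary assumptions:

```shell
# Hypothetical system crontab entry: reboot nightly at 04:30 with a
# five-minute warning, so faults introduced during the day surface the
# next morning instead of weeks later.
# min  hour  mday  month  wday  who   command
30     4     *     *      *     root  /sbin/shutdown -r +5 "nightly maintenance reboot"
```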

That said, this thread reminded me that I set up a tunnel into work re-using an old server just as COVID was hitting:


```
Last login: Mon Oct  4 15:12:02 2021 from x.x.x.x
OpenBSD 6.5 (GENERIC.MP) #3: Sat Apr 13 14:48:43 MDT 2019

Welcome to OpenBSD: The proactively secure Unix-like operating system.

path$ uptime
12:27PM  up 783 days, 20:55, 1 user, load averages: 0.01, 0.06, 0.07
path$
```

I wish I had used a Raspberry Pi, to be honest. Would have saved a heap of electricity, haha.


----------



## angry_vincent (Oct 13, 2021)

I am also not into uptime records, and reboot often, but from a historical point of view it is entertaining to know why such a myth exists.


----------



## Argentum (Oct 13, 2021)

angry_vincent said:


> I am also not into uptime records, and reboot often, but from a historical point of view it is entertaining to know why such a myth exists.


But sometimes the clients are using the service non-stop and there is no need to reboot...


----------



## mer (Oct 13, 2021)

Argentum said:


> But sometimes the clients are using the service non-stop and there is no need to reboot...


Ahh, the "service" vs. "server" argument. Telephony, with its "five nines" requirement: that applies to the service, not an individual server. So you have redundancy to give you high availability on the service, which lets you upgrade/fix/replace individual servers.

I like seeing a long uptime, but reboot when it's needed. Upgrades (including security patches) and power outages are really the only reasons I typically reboot. As SirDice points out, rebooting after major changes that could affect recovery from a power failure is a good thing. Make sure it's correct when you control it, rather than scream into the darkness that it doesn't work when you can't.


----------



## Argentum (Oct 13, 2021)

mer said:


> Ahh, the "service" vs. "server" argument. Telephony, with its "five nines" requirement: that applies to the service, not an individual server. So you have redundancy to give you high availability on the service, which lets you upgrade/fix/replace individual servers.
> 
> I like seeing a long uptime, but reboot when it's needed. Upgrades (including security patches) and power outages are really the only reasons I typically reboot. As SirDice points out, rebooting after major changes that could affect recovery from a power failure is a good thing. Make sure it's correct when you control it, rather than scream into the darkness that it doesn't work when you can't.


No need to criticize. This is just what happened. I did not have another server back then, and because physical access to the data center was also difficult, I just did not reboot for some time.





After that I moved it...


----------



## mer (Oct 13, 2021)

Argentum Sorry, no criticism was intended.  I was simply leveraging your post to point out that there is a distinction between a service and the server(s) it runs on.  If you only have a single, non-redundant server, of course not rebooting is often the correct thing.

CVEs, security patches: a good sysadmin will actually read, research and evaluate "does this actually affect me" instead of simply "OMG MUST PATCH NOW!!!".


----------



## D-FENS (Oct 13, 2021)

Zirias said:


> So, your kernel has unpatched security issues? Nice.



It is probably off the grid.


----------



## D-FENS (Oct 13, 2021)

Argentum said:


> I think I have discovered the root cause of FreeBSD *unpopularity*:
> 
> ```
> ~> uptime
> ...



 A NUC with Arch:

```
% uptime
 19:06:34 up 175 days, 6 min,  1 user,  load average: 0.39, 0.41, 0.38
```


----------



## grahamperrin@ (Oct 13, 2021)

Rarely more than a few days. I typically start a new boot environment before any package upgrade.


----------



## D-FENS (Oct 13, 2021)

grahamperrin said:


> Rarely more than a few days. I typically start a new boot environment before any package upgrade.


Did you include /usr in your BE? I don't use BEs but when I tried it out it did not include the userland by default.


----------



## grahamperrin@ (Oct 13, 2021)

I make basic use of bectl(8) as outlined at <https://forums.FreeBSD.org/posts/535005>.

There's the `-r` flag for a _recursive_ boot environment, but I don't understand this.


----------



## mer (Oct 13, 2021)

roccobaroccoSC said:


> Did you include /usr in your BE? I don't use BEs but when I tried it out it did not include the userland by default.


By default a boot environment includes things like /usr/local; basically everything that does not have its own dataset.
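A sketch of the bectl(8) workflow this implies, run as root on a ZFS-on-root system; the BE name is an arbitrary assumption:

```shell
# Create a boot environment before an upgrade. It clones the current
# root dataset, including /usr/local, but NOT datasets with their own
# mountpoints (e.g. zroot/usr/home) unless you pass -r (recursive).
bectl create pre-upgrade

# Show existing BEs, which is active now (N) and which boots next (R).
bectl list

# If the upgrade goes wrong, boot back into the old environment.
bectl activate pre-upgrade
```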


----------



## grahamperrin@ (Oct 13, 2021)

```
root@mowa219-gjp4-8570p-freebsd:~ # bectl mount n249988-2c614481fd5-a /tmp/cooee
Successfully mounted n249988-2c614481fd5-a at /tmp/cooee
root@mowa219-gjp4-8570p-freebsd:~ # chroot /tmp/cooee
root@mowa219-gjp4-8570p-freebsd:/ # ls /usr/home
root@mowa219-gjp4-8570p-freebsd:/ # ls /usr/local
FreeBSD_ARM64                   llvm10                          openoffice-4.2.1602022694
GNUstep                         llvm11                          openoffice-4.2.1619649022
ICAClient                       llvm12                          openssl
VirtualBox                      man                             poudriere
bin                             mono                            sbin
etc                             openjdk11                       share
false                           openjdk11-jre                   steam-utils
furybsd                         openjdk16                       true
include                         openjdk8                        var
lib                             openjfx14                       wine-proton
lib32                           openoffice-4.1.6                www
libdata                         openoffice-4.2.1579913986       x86_64-portbld-freebsd14.0
libexec                         openoffice-4.2.1589199787
root@mowa219-gjp4-8570p-freebsd:/ # exit
exit
root@mowa219-gjp4-8570p-freebsd:~ # bectl umount n249988-2c614481fd5-a
root@mowa219-gjp4-8570p-freebsd:~ # ls /usr/local
FreeBSD_ARM64                   llvm10                          openoffice-4.2.1602022694
GNUstep                         llvm11                          openoffice-4.2.1619649022
ICAClient                       llvm12                          openssl
VirtualBox                      man                             poudriere
bin                             mono                            sbin
etc                             openjdk11                       share
false                           openjdk11-jre                   steam-utils
furybsd                         openjdk16                       true
include                         openjdk8                        var
lib                             openjfx14                       wine-proton
lib32                           openoffice-4.1.6                www
libdata                         openoffice-4.2.1579913986       x86_64-portbld-freebsd14.0
libexec                         openoffice-4.2.1589199787
root@mowa219-gjp4-8570p-freebsd:~ #
```
Thanks mer


----------



## ralphbsz (Oct 14, 2021)

I'm theoretically a fan of rebooting often. That's simply the admission that no software (not even FreeBSD) is perfect, and everything has the risk of small resource leaks.

The problem with that theory is that my FreeBSD home server is also the network gateway. And if I reboot, the network will be out for maybe 2 or 3 minutes. My wife is a morning person, and gets up way earlier than I do. So when I get up is not a good time to reboot, she's probably busy doing work, or at least relying on her cellphone etc. functioning. On the other hand, our college-age son loves playing computer games with his friends late at night (and he does not get up early). So the time when I either reboot or run freebsd-update is usually a weekend morning, when my wife isn't working (she might be shopping, doing gardening), and our son is still asleep. So a typical uptime for my machine is a week or two. If it reaches a month, that means I've been too busy to check for updates and quickly reboot.

The upper limit is set by SSL certificates: Those need to be renewed every 3 months, and after that the web server needs to be restarted. I think it's silly to restart a major service and leave the rest running (you're disrupting one service, may as well do all of them), so I always use a reboot at that time.


----------



## astyle (Oct 21, 2021)

mark_j said:


> That's your Microsoft training kicking in.


I'm completely with SirDice on his response to that. Generally, with Windows systems, I am reboot-happy, only because a reboot helps the metal clear its throat (metaphorically speaking). Sometimes, it can be difficult to reboot a machine, because it's in the middle of something important that needs to complete first. But avoiding a reboot just for the heck of it???


astyle said:


> UNIX is not a religion, buddy.


Neither is Microsoft. Just reboot whenever you see the need. If it's difficult to safely reboot a machine - consider a different approach to the _service_ the server is providing.


----------



## Deleted member 30996 (Oct 21, 2021)

Argentum said:


> My personal server uptime record with FreeBSD is over *1000* days.


That's pretty impressive for a desktop. What kind of laptop is it?

Oh, that was at a data-center... I was talking about desktop uptime. I rebooted this one yesterday.

When I decided to use one of the Thinkpad W520s I had been using as a general-purpose FreeBSD desktop as my multimedia machine (I do more than just listen to .mp3s on it), I pulled the Ethernet cord from it, and could plug it back in now and go online for an update.

I never turn off gkrellm2, urxvt, xfe or audacious from boot till shutdown, and can leave Gimp open for days. I have a bamboo cutting board I got at Walmart that is the perfect size to set it on and not block the vent if I hold it on my lap and do graphic work. Which is most relaxing while listening to music in the recliner.

With a current total of 13,666 songs and beaucoup movies on USB, I'm in the Halloween spirit today, watching some Coffin Joe and listening to music. I have Gimp open for the shot, and it's shown running at full load at 181 days:


----------



## astyle (Oct 21, 2021)

Trihexagonal said:


> I have a bamboo cutting board I got at Walmart that is the perfect size to set it on and not block the vent if I hold it on my lap and do graphic work. Which is most relaxing while listening to music in the recliner.


Perfect laptop to take to a ski resort and do your Gimp work outside the chalet:  just hold it on your lap to keep warm


----------



## Deleted member 30996 (Oct 21, 2021)

That shows 57C so that's not hot. This one is showing 57C, too.


----------



## Argentum (Oct 21, 2021)

Trihexagonal said:


> That's pretty impressive for a desktop. What kind of laptop is it?


It was a server, not a desktop, with hardware RAID, ECC memory and dual power supplies. I must say the power supplies actually never failed.


----------



## Menelkir (Oct 21, 2021)

Trihexagonal said:


> That's pretty impressive for a desktop. What kind of laptop is it?
> 
> Oh, that was at a data-center... I was talking about desktop uptime. I rebooted this one yesterday.
> 
> ...


As an off-topic aside, there's a bar I used to go to in São Paulo, BR, where the owner was Mojica's son-in-law.


----------



## astyle (Oct 21, 2021)

Argentum said:


> dual power supplies. I must say the power supplies actually never failed.


Making me jealous. What's the brand of those power supplies? I swear by EVGA.


----------



## Argentum (Oct 21, 2021)

astyle said:


> Making me jealous. What's the brand of those power supplies? I swear by EVGA.


I do not remember; it was a few years ago that the hardware got retired. It was a 2U rack-mount unit that I built myself from a barebone, but that happened 10 years before that. I think it was a Gigabyte brand.


----------



## baaz (Dec 10, 2021)

Take a look at this... 17 years of uptime!


----------



## astyle (Dec 10, 2021)

baaz said:


> take a look at this......


I would encourage you to post a better discussion than this when posting links.... I did take a look, but I was asking myself, "Why should I look? How is that relevant to the topic?"


----------



## baaz (Dec 10, 2021)

astyle said:


> I would encourage you to post a better discussion than this when posting links.... I did take a look, but I was asking myself, "Why should I look? How is that relevant to the topic?"


sorry edited it


----------



## mark_j (Dec 10, 2021)

OpenVMS clusters are truly a thing of beauty, and they've had them for decades. Their Galaxy configurations are/were rather groundbreaking.


----------



## ralphbsz (Dec 10, 2021)

With modern VMS, is it still true that every machine has to be rebooted at least once per year?

If I remember right, I think the reason was that the hardware clock in the VAX only stored month, day, hour, minute and second, but didn't know what year it was. For that reason, the hardware clock was called the TOY clock: Time Of Year. The year is stored on the boot disk, but for some reason that storage is updated only when you boot. So if you don't reboot for over a year, on the next boot time will go backwards by one year.

Now, that does not take down a VAXcluster: every single host in the cluster has to reboot, but seen from the outside, the cluster remains perfectly operational.

There is a lot of hardware goodness that has been sacrificed on the altar of commodity hardware. For example, on PA-RISC HP-UX machines, you could lose a CPU (physical failure of the processor chip), and the machine would not shut down. Happened to us once: We noticed that we didn't get any mail for a day, and discovered that the sendmail master process had crashed. This made no sense to us, since sendmail was absolutely reliable (duh, obviously, it is an incredibly well-engineered piece of software). Investigation of the log files showed that sendmail was executing exactly at the moment that the hardware supervisor detected a failure of one CPU, and powered down that CPU. The only visible effect was that the process that had been running on the failing CPU was aborted. Our machine kept running as a 3-way SMP, but the sendmail process did not get automatically restarted. No problem, we manually restarted it and ordered a spare CPU from field service. A few days later, at a convenient time, we powered the machine down, replaced the CPU, and brought it back up.

On IBM mainframes, you can do a lot of hardware repairs without shutting the machine down, like replace memory, CPU, busses, disk interfaces, and so on. Everything is redundant, and components can be individually disabled and powered down while the machine itself keeps running. I think the Z series mainframes are the last computer on which such a thing is still possible. The same is true of high-end disk arrays (Shark, Symmetrix, Hitachi).


----------



## ccammack (Dec 10, 2021)

ralphbsz said:


> On IBM mainframes, you can do a lot of hardware repairs without shutting the machine down, like replace memory, CPU, busses, disk interfaces, and so on. Everything is redundant, and components can be individually disabled and powered down while the machine itself keeps running. I think the Z series mainframes are the last computer on which such a thing is still possible. The same is true of high-end disk arrays (Shark, Symmetrix, Hitachi).



Telecom switches too. Shut down half the machine and everything else keeps working. Swap out hardware, apply patches, test, change your mind and roll back. Amazing stuff.


----------



## mark_j (Dec 10, 2021)

ralphbsz said:


> With modern VMS, is it still true that every machine has to be rebooted at least once per year?


Never in my experience.
It could if a patch kit was to be applied that required a reboot, but then a rolling reboot around the cluster would ensure it was never down.



ralphbsz said:


> If I remember right, I think the reason was that the hardware clock in the VAX only stored month, day, hour, minute and second, but didn't know what year it was. For that reason, the hardware clock was called the TOY clock: Time Of Year. The year is stored on the boot disk, but for some reason that storage is updated only when you boot. So if you don't reboot for over a year, on the next boot time will go backwards by one year.



Yes, you're correct that it does not store the date as something like a unixtime, but the rest isn't quite right.
I think the TOY (battery-backed) stored around 500 or 600 days before it looped back.

The VAX (OpenVMS) stores the date in SYS$SYSTEM:SYS.EXE and uses it plus the TOY value to set $GQ_SYSTIME. The GQ_SYSTIME is the interval clock, and it's just updated by an interrupt (at which priority level I cannot recall). I don't believe the TOY is ever used again while the system is running. This means the time would not move forward if the machine was powered off for those 500/600 days, but it would not go backwards. Regardless, on reboot it would ask you to set the time anyway. It would never go backwards while running, nor require a reboot to reset it (SET TIME can be run at any time).

The time can skew, of course, if the power supply to the box is not 60Hz. 

I think VAX admins used to run set time periodically to reset the time using the TOY, but I am not sure if that's correct or not.

Of course, modern OpenVMS has had access to DTSS off decnet as well as ntp like all unix-like systems. Before the advent of ntp on OpenVMS, DTSS was the main time synchronisation across a cluster (as well as the cluster protocol itself).

Whether you keep an expensive piece of hardware powered off for a year I leave for debate to all the accountants.


----------



## meaw229a (Dec 11, 2021)

I came across this site a while ago: https://www.reddit.com/r/uptimeporn/   Looks like long uptimes can become
some sort of religion for some people.

For myself: if I make a change to my box or do a bigger upgrade, I reboot. If something then looks not right,
I can fix it on the spot. That's better than scratching my head a couple of months later, trying to remember what
I have done.


----------



## mark_j (Dec 11, 2021)

Perhaps uptime became a religion because often these 'ancient' operating systems took a very long time to reboot. Some even broke upon reboot because some disk started out of order.
Nowadays hardware is so much more resilient.
If it weren't, the endless reboots to install the latest Windows would surely blow the machine up.


----------



## Argentum (Dec 11, 2021)

mark_j said:


> Perhaps uptime became a religion because often these 'ancient' operating systems took a very long time to reboot. Some even broke upon reboot  because some disk started out of order.
> Nowadays hardware is so much more resilient.


With servers I have been in a situation where the physical access to the machine has been difficult and limited (drive to the DC, find the people with access). In several cases no KVM device connected, perhaps even not available. That means every reboot should be 100% sure. In such a situation one naturally does not want to reboot when the system is working and does its job...

Today I have one such box running:

```
$ date; uptime
Sat Dec 11 11:57:33 EET 2021
11:57AM  up 310 days, 20:05, 6 users, load averages: 1.68, 1.12, 0.91
```

_...and I know that I can temporarily order KVM service, but last time I couldn't get it working because the service provider's device is probably old and uses EOL Flash. I had no idea how to get Flash working on my FreeBSD desktop..._


----------



## mark_j (Dec 11, 2021)

Argentum said:


> With servers I have been in a situation where the physical access to the machine has been difficult and limited (drive to the DC, find the people with access). In several cases no KVM device connected, perhaps even not available. That means every reboot should be 100% sure. In such a situation one naturally does not want to reboot when the system is working and does its job...
> 
> Today I have one such box running:
> 
> ...


Very true. I managed servers that were remotely accessible only via a serial port, and yes, you have to be damn certain, otherwise...


----------



## ralphbsz (Dec 11, 2021)

mark_j said:


> Perhaps uptime became a religion because often these 'ancient' operating systems took a very long time to reboot.


Definitely used to be slow. In the 70s and early 80s, booting an IBM mainframe typically took 1.5 hours for cold boot (from powerup), 3/4 hour warm boot (if everything already had power). I remember one of the big advances of the IBM 43x1 was that it would boot faster; that was considered a requirement for the mid-range environment it was typically deployed into. At the same time, minicomputers (VAX, Data General, we used a Four Phase machine branded Philips) still needed 5-10 minutes to boot.

Today we actually have the opposite religion, with machines being automatically and regularly rebooted. That's sort of an admission that software (including operating systems) is imperfect, and cruft will accumulate.


----------



## mark_j (Dec 11, 2021)

ralphbsz said:


> Definitely used to be slow. In the 70s and early 80s, booting an IBM mainframe typically took 1.5 hours for cold boot (from powerup), 3/4 hour warm boot (if everything already had power). I remember one of the big advances of the IBM 43x1 was that it would boot faster; that was considered a requirement for the mid-range environment it was typically deployed into. At the same time, minicomputers (VAX, Data General, we used a Four Phase machine branded Philips) still needed 5-10 minutes to boot.


And you would sweat on every second hoping something wouldn't fail & you'd be down the entire day & looking for a new job the next.


----------



## ralphbsz (Dec 12, 2021)

On mainframes, things usually don't fail, and they don't crash. Or to be more accurate: Things fail all the time, but they keep running, perhaps a little slower than usual, or with less disk space or less memory. That is until a junior programmer by mistake manages to crash the OS from a user process (in Fortran, no assembly required). The third time it happened, one of the operators came out to my terminal, and asked me to "not do that again, and show us exactly what you did". I had been wondering why the machine crashed every time I wanted to compile/link/run my analysis program.

The answer, by the way, was a wonderfully subtle bug. When you program in Fortran 77, it so happens that the entry part of the program (not a subroutine, but the main body) is implicitly in a subroutine that is called "MAIN". You don't know that, it's not documented, it's just an internal convention. My problem was that I had defined a subroutine that was also called MAIN, which did most of the work of the program (I probably had other subroutines with names like SETUP, READINP and WRTRSLT). The problem happened because our system ran a combination of the Hitachi F77 compiler with the IBM linker. The compiler happily prepared a module called MAIN (with the main program), and another module called MAIN (with the subroutine). The linker would take the first one and copy it into the executable. It would then notice that a subroutine named MAIN is being called, and put another copy of the first one in (not the second one, due to an incompatibility between Hitachi and IBM). It would then notice that a subroutine called MAIN is being called, and put yet another copy of the first one in. The linker "knew" that Fortran programs can't be recursive, and it "knew" that object modules have unique names, and it was supposed to have a table of all modules it had already linked, but the incompatibility in naming convention between Hitachi and IBM broke that. So the linker would create the executable in memory, trying to copy infinitely many copies of MAIN into memory. Unfortunately, the linker is a system program, so it is exempt from things like memory quota, and it exhausted and overwrote all memory on the machine, causing an OS crash.

No problem, I would go have a snack. After a while, the machine comes back up, I remembered what I had been working on, and I restarted the compile/link/execute cycle, causing another snack break. And again. During the third time, the operators had noticed the number of the terminal that was running the last process before the machine went down, and they found me because I had my coffee mug and all my paperwork sitting there and always returned to the same place. So I explained it to them. They went and got some of the system programmers, who looked at my code, and looked at some IBM documentation, and looked at some logs of the machine, and after an hour or so they told me that I had rediscovered a bug that was known to IBM but not yet to customers. All I had to do to fix the program was to rename the subroutine to anything other than MAIN, and it worked perfectly.


----------



## Crivens (Dec 12, 2021)

On some old IBM systems you could name your binary the same as the syncer process that runs when the power is on emergency and about to fail. That process would not be interrupted, was not killable, and was highly annoying to operators. Together with another little-known feature of the batch processing, you could run your CPU-hogging code at any time, while the bean counters were looking at non-responding spreadsheets in the meantime.


----------



## Argentum (Nov 3, 2022)

Argentum said:


> With servers I have been in a situation where the physical access to the machine has been difficult and limited (drive to the DC, find the people with access). In several cases no KVM device connected, perhaps even not available. That means every reboot should be 100% sure. In such a situation one naturally does not want to reboot when the system is working and does its job...
> 
> Today I have one such box running:
> 
> ...


Today it looks like I am going to decommission that machine. I have transferred all the VMs to another machine, and one of the drives in the zpool is already giving errors...


```
# date; uptime
Thu Nov  3 21:10:17 EET 2022
 9:10PM  up 638 days,  5:18, 7 users, load averages: 1.61, 1.40, 1.28
```

But, yes, FreeBSD as a system seems stable.


----------

