# Cluster-aware cron system?



## vash (Sep 29, 2009)

Hey guys...

I've a slightly odd problem, one that has had me pulling out the dark roots for days now: a network/cluster-aware cron system.

Given a couple of load balancers, a half dozen application servers and a clustered database - all using basic components (apache, mysql, pen, vrrpd, squid) - I have one particular application that unfortunately uses cron jobs to update certain aspects of its data.

The trouble with the application servers running the cron jobs is that every one of them will fire, updating the data once per server every interval. By interleaving the crons, each application server could run one job every interleave/interval; however, this is still not ideal if I need to take one or more of the application servers offline.

So... I guess what I'm looking for is a distributed cron system that can maintain its job listing on multiple machines but run each job on only one machine (sorry, but I can't word that any easier!)

Suggestions?


----------



## aragon (Sep 30, 2009)

One machine running cron that sshs to the other machines?


----------



## vash (Sep 30, 2009)

That still puts overall control of the crons onto one (backend) machine - leading to a single point of failure.

Ideally the cron system has to run on a number of machines and has to be aware of those machines so that if one or more fails, then one of the remaining ones will still run the relevant job at the specified interval.

To try and put it into a better framework, consider the following:

@ Interval 1
[Machine 1] -- [Machine 2] -- [Machine 3]

At the given interval, all 3 machines are aware that the others are "up", and the job is delegated to a random machine in the pool.

@ Interval 2
[Machine 1] -- [OFFLINE] -- [Machine 3]

Now, the remaining two machines realize that a machine has been taken offline (or has failed), so the job is delegated to one of them.

@ Interval 3
[Machine 1] -- [Machine 2] -- [Machine 3]

The machines are aware that machine 2 has been brought back online; however, just in case it is flapping, the job is delegated to machine 1 or machine 3 in the interim.

@ Interval 4
[Machine 1] -- [Machine 2] -- [Machine 3]

All machines are online and have been stable for at least one interval, so the job can be delegated to any of the 3 machines in the pool.

If you consider that the cron job itself manipulates the back-end database, you can see that running it more than once per interval would result in data inconsistencies unless it is very carefully written.
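One way to get that at-most-once behaviour without new infrastructure is to let the (already clustered) database arbitrate: every machine's cron fires, but each first tries to claim the interval by inserting into a shared table with a UNIQUE key on (job, interval), and only the winner runs the job. A rough sketch of the idea - the shared table is simulated here with an in-memory set, and the table/job names are made up:

```python
# Sketch: "claim the interval" before running the job. In a real setup,
# every machine would INSERT (job_name, interval_id) into a shared table
# with a UNIQUE key on those two columns; a duplicate-key error means
# another machine already claimed this interval. The shared table is
# simulated here with a set so the logic is visible on its own.
claims = set()  # stands in for the shared claims table

def try_claim(job_name, interval_id):
    """Return True only for the first caller to claim this interval."""
    key = (job_name, interval_id)
    if key in claims:       # "duplicate key": another machine won
        return False
    claims.add(key)         # our "INSERT" succeeded; we run the job
    return True
```

Every machine can then keep an identical crontab; whichever machine's insert lands first runs the job, and a machine that is offline simply never competes for the claim.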


----------



## DutchDaemon (Sep 30, 2009)

Do _all_ of the machines have to run the job at least x times per day, or could you (in case all six servers are online for an indefinite amount of time) always run the cron job from e.g. machine1? In that case I would just put the cron job on machine1, and script the cron job on the other machines in a fixed order, with the script containing something like:

```
# machine1 runs the real cron job unconditionally; every other machine
# defers to the machines ahead of it (is_up being a reliable liveness
# check, run_job the actual job):

# machine2's wrapper:
is_up machine1 && exit 0; run_job

# machine3's wrapper:
{ is_up machine1 || is_up machine2; } && exit 0; run_job

# machine4's wrapper:
{ is_up machine1 || is_up machine2 || is_up machine3; } && exit 0; run_job
# ...and so on
```
This depends on having a surefire way of determining whether a machine is up and functioning (a simple ping may not suffice).


----------



## vash (Sep 30, 2009)

> Do all of the machines have to run the job at least x times per day...



Assuming the job is hourly, and there are 6 machines, then the job needs to be run 24 times per day - not 144 times (24x6).



> ...or could you (in case all six servers are online for an indefinite amount of time) always run the cron job from e.g. machine1?



That is what I currently do; however, from time to time I need to take machine1 offline (or indeed, it could suffer a failure), hence the request for a "cluster-aware" system.

I've considered your script and partially rejected it; however, I agree it may prove the best way forward. If I could turn this into a small C application running on each machine, with each one talking to a list of co-servers, then I think it might prove very reliable.

I think having communication between the machines is the key here. Okay, it's maybe not possible to know when a machine is going offline, but we do know when they come back online, so yes, pinging, or at least querying a port for information, i.e.: (*bad* pseudo conversation)


```
Machine1> Ask Machine2 when it came online
Machine2> [No reply]
Machine1> Bugger, okay, ask Machine3 when it came online
Machine3> Just a minute ago, still catching my breath.
Machine1> Okay, not to worry, I'll run the job
Machine2> [No response - still offline]
Machine3> Okilly Dokilly
```
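That conversation can be turned into a deterministic rule, one variant of the random delegation above: each machine asks every peer (itself included) how long it has been up, skips peers that don't answer or that came up less than one full interval ago (the flapping case), and the lowest-named machine left standing runs the job - so all peers agree on the choice without any further coordination. A self-contained sketch of that selection logic, with uptimes in seconds and None meaning no reply (names and thresholds are made up):

```python
def pick_runner(uptimes, interval_secs):
    """Given {machine: uptime_in_secs_or_None}, pick who runs this job.

    A machine is eligible only if it replied and has been up for at
    least one full interval, so a freshly rebooted (possibly flapping)
    machine is not trusted immediately. The lowest-named eligible
    machine wins, so every peer computes the same answer.
    """
    eligible = [m for m, up in sorted(uptimes.items())
                if up is not None and up >= interval_secs]
    return eligible[0] if eligible else None
```

For example, with a one-hour interval, a pool where machine2 is unreachable and machine3 rebooted a minute ago still resolves to machine1, matching the intervals sketched earlier in the thread.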

I'm just surprised this type of problem has not occurred before; surely load-balanced web servers have a need for some form of stable distributed cron system, even if only for garbage collection?


----------



## DutchDaemon (Sep 30, 2009)

Well, the simple fact is that cron is not a network(-aware) service (there's no crond listening on port xyz ..), so there are no cluster variations for it (that I know of). That's why clusters usually involve an external or master administrative system that coordinates and executes remote jobs from a central location, using e.g. key-based ssh. Maybe you can build something on top of, or incorporating, a tool like sysutils/heartbeat to instruct cron.


----------



## vash (Sep 30, 2009)

That looks rather promising .. I'm all in favor of wrapping and using existing technologies. Thanks.


----------

