graceful degradation is your friend
I am a strong believer that systems which need high uptime should degrade gracefully. Up and down need not be binary.
For example, imagine hosting 1,000 web sites across multiple servers. If we put all sites on all servers, a single bad failure might take them all down at once. But with N servers, if we put each site on exactly one server, losing a server takes down only about 1/N of the sites. That is much more manageable.
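Here is a minimal sketch of that arithmetic in Python. The function names, the round-robin placement, and the example.com site names are all made up for illustration; a real setup would more likely place sites by hash or by load.

```python
def assign_sites(sites, n_servers):
    """Place each site on exactly one server, round-robin (an assumed policy)."""
    placement = {server: [] for server in range(n_servers)}
    for index, site in enumerate(sites):
        placement[index % n_servers].append(site)
    return placement


def outage_fraction(placement, failed_server):
    """Fraction of all sites that go down when one server fails."""
    total = sum(len(hosted) for hosted in placement.values())
    return len(placement[failed_server]) / total


# Hypothetical data: 1,000 sites spread over 4 servers.
sites = [f"site{i}.example.com" for i in range(1000)]
placement = assign_sites(sites, n_servers=4)

print(outage_fraction(placement, failed_server=2))  # 0.25
```

With 4 servers, losing one takes out a quarter of the sites and leaves the other 750 untouched.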
In the example above we partitioned by client. We could partition by function too. The simplest case is read vs. write: it is often easy to stay up for reads even while you are down for writes. One way to do this is simply to put a CDN in front of the site and have it keep serving cached pages while you are down!
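To make the read side concrete, here is a toy sketch of a cache that keeps answering reads from its last good copy when the origin is down. It only imitates what a CDN does; FakeOrigin, OriginDown, and StaleCache are hypothetical names, and a real CDN would handle this through its own configuration, not code like this.

```python
class OriginDown(Exception):
    """Raised when the origin server cannot be reached."""


class FakeOrigin:
    """Stand-in for the real site; flip `up` to simulate an outage."""

    def __init__(self):
        self.up = True

    def fetch(self, path):
        if not self.up:
            raise OriginDown(path)
        return f"<html>{path}</html>"


class StaleCache:
    """Serve from the origin when possible, else from the last good copy."""

    def __init__(self, origin):
        self._origin = origin
        self._cache = {}

    def get(self, path):
        try:
            body = self._origin.fetch(path)
        except OriginDown:
            if path in self._cache:
                return self._cache[path]  # stale, but still a working read
            raise  # never seen this page, so nothing to fall back on
        self._cache[path] = body
        return body


origin = FakeOrigin()
cdn = StaleCache(origin)

cdn.get("/about")           # origin is up: fetched and cached
origin.up = False           # simulate the outage
print(cdn.get("/about"))    # still served, from the cached copy
```

Pages that were never cached still fail during the outage, so this degrades reads gracefully instead of making them bulletproof.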