Downtime - Apologies and what went wrong
Hi All,
As some of you may have realised, the planned upgrade sort of crashed everything, and we had our longest period of downtime since the site began.
This is partly because I had to go to sleep (thanks to a newborn and a job).
The good news is that the backup process worked! We've restored to seconds before the upgrade took the site offline.
The bad news is that federation is likely to be.. wonky.. for a little while. The site may also go up and down while I undo some of the fixes I tried.
Ultimately the issue came down to the upgrade failing (I am not sure why - will be digging into this now the priority is no longer getting the site up) and then the containers not talking to each other, so the UI wouldn't talk to lemmy, and lemmy wouldn't talk to the database.
I rebuilt the containers, restored the backup, restarted everything, and it's all come back up (admittedly not perfect right now).
Importantly, I want to issue an apology. This isn't what I want for Lemmy.zip, and I should've handled it way better. I'm always learning but this took way longer than it should've, and while I take some solace in the fact the backup process worked and has been proven to work in production, the delay in being able to get this back up is entirely my fault and frankly unacceptable.
I'll be working to document this outage, the steps it took to get it back up, and some form of repeatable plan so a repair can be replicated in the future if I'm not available.
In terms of upgrading to 0.19.11 - I will have to try again soon as it's got some security fixes we desperately need to implement.
Thanks
Demigodrick
Way I see it, family and mental health always come before internet randos. Thanks for working hard for everyone.
Lots of Internet randos have been very nice and supportive, so I feel a debt to the community to make this place the best it can be.
But thank you ❤️
I will try and reply to each comment - but you've all been really kind and that means so much ❤️
If you're interested, this graph will show you how far behind we are. We should eventually catch up, but things will likely be very delayed for up to 12 hours.
The status page did not work as expected - and I'll try and link a few more places where I post updates. If you haven't yet, definitely join the matrix space and you'll get minute by minute panic updates 🫠
That graph is really kind of neat, but it seems to only be synchronizing with a single instance at a time from what I can tell. I saw the world line has dropped significantly, but the other lines don't look like they've fallen yet.
Yes, the lemmy.world admins kindly manually reset the timer for their instance so it started updating straight away!
If an instance goes down, other instances slowly back off sending retries of activities so as not to waste effort sending them to dead instances.
You can use this tool to see this info. It links lemmy.world but you can search for any instance, and then look up lemmy.zip either under failed or lagging instances. You'll see on the far right the "next send try" time and date. Looks like a lot will try again around 9pm (although I'm not entirely sure on the timezone there) - so over the next few hours instances will send another try, see that lemmy.zip is back up, and then start federation with us again :)
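For anyone curious what "backing off" means in practice, here is a minimal sketch of exponential backoff. It is not Lemmy's actual code: the base delay, the one-day cap, and the function names are all assumptions made up for illustration. The idea is only that each consecutive failure doubles the wait before the next attempt, which is why a "next send try" time can end up hours away.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical values for illustration only, not taken from Lemmy.
BASE_DELAY_SECONDS = 10
MAX_DELAY_SECONDS = 24 * 60 * 60  # cap the wait at one day


def retry_delay(fail_count: int) -> timedelta:
    """Wait time after `fail_count` consecutive failed sends."""
    if fail_count <= 0:
        return timedelta(0)  # no failures yet, send straight away
    # Double the delay for every additional failure, up to the cap.
    seconds = min(BASE_DELAY_SECONDS * 2 ** (fail_count - 1), MAX_DELAY_SECONDS)
    return timedelta(seconds=seconds)


def next_send_try(last_attempt: datetime, fail_count: int) -> datetime:
    """Earliest time the sending instance would try again."""
    return last_attempt + retry_delay(fail_count)


if __name__ == "__main__":
    start = datetime(2025, 1, 1, tzinfo=timezone.utc)
    for fails in range(0, 15, 2):
        print(fails, next_send_try(start, fails).isoformat())
```

Resetting the failure count, as described above for lemmy.world, would correspond to calling `retry_delay(0)` in this sketch, so sending resumes immediately.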
It's cool to see that there's logic built-in that keeps instances from sending Federation requests to dead instances. But when an instance comes back online, they will re-synchronize themselves. An instance may drop out of the Federation, but when it comes back, it will get everything it missed. Eventually.
I've been there. But it is my honor to bestow upon you this award to commemorate the accomplishment
Ah yes. I still wear my 25 year old “deleted a prod database” badge with honor
It's a bittersweet honour to have. My personal fail was being too cocky updating a 'handful' of product descriptions.
Honestly the blast radius is pretty small compared to Cloudstruck
That's amazing, thank you. I might actually print that out and frame it 🤣
Thanks for all your hard work. We missed .zip while it was gone.
Thank you! Hopefully it stays back now!
Thanks for the update.
I was a bit worried for your mental health as the hours of downtime continued :)
Awesome that the backup restore procedure worked that well.
One thing I have been wondering is why status.lemmy.zip stayed all green during all of this.
Because it was technically working, it's just that "UI wouldn't talk to lemmy, and lemmy wouldn't talk to the database". So they were operating, but not communicating with each other.
Most of the day, when I was loading the lemmy.zip front page I got 502 Bad Gateway, so I was thinking of something like adding a check in the external status tool of whether the page actually loads.
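A rough sketch of that kind of check, using only the Python standard library. The function names are made up for this example, and it says nothing about how the existing status page works. It simply requests the front page the way a browser would and treats anything other than a 2xx response (including a 502 from the proxy, or no response at all) as down.

```python
import urllib.error
import urllib.request


def is_healthy(status_code: int) -> bool:
    """Only 2xx responses count as up; 502 and friends do not."""
    return 200 <= status_code < 300


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for `url`, or 0 if nothing answered."""
    request = urllib.request.Request(url, headers={"User-Agent": "status-check-sketch"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered with an error, e.g. 502
    except OSError:
        return 0  # DNS failure, refused connection, timeout, etc.


def front_page_is_up(url: str, fetch=fetch_status) -> bool:
    """End-to-end check: the page must actually load, not just the host respond."""
    return is_healthy(fetch(url))


if __name__ == "__main__":
    print(front_page_is_up("https://lemmy.zip/"))
```

The `fetch` parameter is there so the logic can be exercised without touching the network, for example `front_page_is_up("https://lemmy.zip/", fetch=lambda url: 502)`.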
You can join the Matrix server https://matrix.to/#/#lemmy.zip:matrix.org for real time updates.
I was a little worried that he got arrested by the UK POLICE due to the online privacy act
😆 I heard a siren the other day and thought "oh shit, they've found me!"
I was using https://www.lemmy-status.org/endpoints/_lemmy-zip to check
Something definitely went wrong with the status page - going to find a different solution!
Thank you, they're really kind words and they mean a lot ❤️
It is my first. It's like 2:30 in the morning for me right now, I'm trying to feed them while they try to roll off me and I've had a whole 3 hours sleep in the last day 🤣 I'm glad it gets easier!
You're providing a service out of your own time, pocket and energy; you don't owe anyone.
It's the other way around, we owe you.
So thank you.
Learn from your mistakes and carry on. 👍
Thank you!
Dude, you're being wayyyy too harsh on yourself!
You run this awesome instance for free while caring for a newborn, you don't owe anybody nothing.
Forget the delay, forget apologies and "unacceptable". Real life comes before social media, don't beat yourself up for the outage.
People who can't stand downtime should practice personal redundancy by creating backup accounts on other instances ;)
Thanks, I appreciate the kind words. I always feel guilty when it doesn't work. You have all put your trust in me to run an instance, so when it goes wrong I deffo feel the need to make it right.
Thank you for this post. Don't be so harsh on yourself, everyone can make a mistake!
Good to see Lemmy.zip back up!
Dude, keeping this running with a job and a newborn? You're headed for sainthood.
If you don't have one, you could start an out of band chat during updates, just in case you need some eyes on things or just some moral support. I'm sure we have at least a few subject matter experts around if you can stand us :)
You can join the Matrix server https://matrix.to/#/#lemmy.zip:matrix.org
Please do join the matrix chat, always posting live updates of what's happening and experience is always welcome there!
No need to apologize as you have been doing a stellar job. Your family needs to always take priority no matter what. I don't care if it is down for a week as your health and kid are far more important.
One thing I will say is that I think Lemmy.zip could really benefit from an external way of communicating announcements. It doesn't need to be complicated and you could reuse your existing mastodon account to post updates when things go wrong. It could also allow users to give advice on how to fix issues.
Thanks, yes I agree, I'll likely be adding something to mastodon and I'm planning to look at alternative status pages as this one failed the one time it was really needed.
Things gotta fuck up sometimes, tis how we figure shit out and learn things! You got this.
Definitely a learning experience 😅
No worries! Make sure you're getting enough rest!
Thank you! Gonna sleep a lot better tonight knowing the site is working again 😅
As a new parent myself, I'm stunned you managed to find the time to restore it at all. Good on ya, fella!
It's quite an experience isn't it! If I time it right, there's a two hour window at the moment where I can guarantee myself a break. Usually it's sleep, but I squeezed in a backup restore instead 🤣
Thanks for the update! I figured it must have had something to do with the baby and the busy life + the update not working as expected, so I was patient.
After 12hrs or so I did go on mastodon to look for an update (just an 'everything crashed, working on the backup' kinda message) so if this ever happens again that might be an idea?
Thanks for working so hard on getting everything back up and don't forget to rest!
You can join the Matrix server https://matrix.to/#/#lemmy.zip:matrix.org
I've been trying to get away from matrix since it leaks metadata like a sieve. And I've been using SimpleX instead. Plus, it's easier to manage the E2E keys and not fuck up and lose access to your messages.
Thanks! I have no idea what matrix is though, is it like an open source / decentralized discord?
Basically yes.
Awesome, I'll look into it
It's an attempt at reinventing IRC while neglecting to realize XMPP exists.
It's an open source and decentralized messaging protocol (matrix.org). It is messenger first and "Discord" second, but it can be used as a public server in a similar way.
I was on Matrix but then I got tired of how buggy and broken it was
Matrix is deffo the best place to get live updates but I'll be adding updates to mastodon and to the status page, when I find a working replacement!
Thank you for all your hard work, as an IT guy I know the feeling when production doesn't work as it should, and the feeling of relief when the backups are actually being restored and working.
Take care and make sure to take a break if you need to, we'll still be here.
Thank you!
I appreciate the transparency and frankly couldn't ask for more. Shit happens and this is a one-person operation. Thanks for all your effort!
Thank you! 😊
thanks for the effort and also explanation
Hey, no worries, an explanation is the least I can do!
Appreciate the honesty and transparency. Thank you for your hard work maintaining the site, and hopefully you're able to restore everything to a fully working state!
Thank you! Hopefully things should be back to normal now, but please let me know if anything doesn't look right 😊
Don't worry. Newborn is a trump card, but even without it you literally are a volunteer.
Anyone complaining about you volunteering your time, esp. with a newborn, is not a parent, and is ignoring that you're doing this for free. Thanks for your time and effort, and I'm just happy it's back up :).
Social media that can go down for a day or two is way better than a shit hole of advertising and manipulation that is Facebook, reddit, and all the rest.
Thank you! I feel very guilty when it doesn't work right but I do appreciate the kind words 😊
Hey you’ve done a ton for all of us and I can’t thank you enough for the work and dedication. Don’t be too hard on yourself, your child and well-being are both important, it’s fine. I’d rather some downtime than losing you as admin. Pretty sure most on the server would agree.
Thank you! 😊
We appreciate you very much! Take all the rest you need. 🫡
Thank you!
Thank you for the transparent announcement, and don't sweat it!
Thank you, appreciate that!
Hard agree with all of the other comments here. No apologies needed, you do a great job of keeping this instance going and the transparency is appreciated.
I temporarily switched over to an alt account and was back browsing Lemmy after figuring out .zip was offline, absolutely no big deal.
Thank you ❤️
Hey man, you've got absolutely nothing to worry about. The fact that you have this service for us at all is quite frankly amazing and we thank you for it. As another commenter said below, I'd rather have a day worth of downtime than to be on big corporate social media and have everything fixed quicker. Because I know that I'm not the product here. When it did not come back, I checked the status page and it said it was working. So I just figured something broke and decided I'd wait until it came back.
Actually... disregard everything I said above. I'm so fucking mad right now. I could bite holes in bricks. I mean, how dare you notice that there's a problem and not get it fixed absolutely immediately. /s
Status page definitely decided to pile on to the issues and screw me over further!
Glad people are very accepting but I certainly don't want to make a habit of downtime! 🤣
Don't fret about it, things happen. You run a great service for us. A job and a new family already add up to being more than two full-time obligations. Managing Zip along with all that is a lot. Thanks for doing it.
Thank you! 😊
Trial by fire. At least it was interesting(!)
Praise be to the backup strategy 🙂
I'd tested the backup strategy before but it's never the same as actually having to do it for real, the relief when the backup worked was immense 🤣
How dare you interrupt my ability to look at memes and see the same news article posted in 17 places at once!
Jokes aside I appreciate the work y'all do to keep this sorta thing running without any pay or thanks for the most part.
I am grateful.
Only 17 times?! I need to spin up some more communities 🤣
Thank you, I appreciate the kind words 😊
We so back
Thanks for your hard work! Remember your mental health always has priority though. Cheers mate.
Thank you ❤️
I think you handled it very well. Not sure how it could’ve been handled better tbh. I figured something didn’t go as planned and I didn’t have any problems waiting for you to find a solution. No apologies needed.
Thank you! I appreciate the kind words 😊
I didn't realize it was a single person keeping it all running! Tech sometimes goes wonky, good job getting it back online!
Thank you!
Been there, done that, with my Friendica instance. 2 days of downtime while rebuilding a corrupted database, while people are tapping their feet waiting for all to return. I'm with you in spirit, my friend.
Thanks for all your hard work keeping the dream alive! And for keeping good backups
You're good bro. Sort of assumed something went wrong with the upgrade.
No worries. You’re doing a great job even though things are hard from time to time.
Thanks for your efforts. ❤️
Thank you! ❤️
Thank you for this post. Don't be too harsh on yourself, everyone can make a mistake!
Glad to have lemmy.zip up and running again!
Thanks Blaze, appreciate the support as always!
Unfortunately these things can and do happen. I'm glad you were able to get things functional with a restoration. Best of luck troubleshooting and repairing the leftover gremlins.
Thanks for all you do to support Lemmy.
Congratulations on the baby. We should thank you for making us go touch grass.
lemmy.zip best instance lfg
Everyone has already said it, but don't even sweat it. I had a little time when I tried to check in on Lemmy and couldn't, so I simply set it aside to try again later. Things hum along smoothly here so much more often than not, and I wouldn't be surprised if a sizable portion of us have been in that same position ourselves of needing to straighten out a failed upgrade for some project or other. I know I certainly have
Praise be to containers \o/
I definitely felt that! Trying to check the feed and getting a 502 was not nice. Good thing I had an account with another instance for the interim. Anyway, no service is bulletproof and things absolutely go wrong when running a server, no matter how good or prepared we may be. Having working backups is instrumental and I'm glad you have that going.
Thank you Rose! I think when I saw that first 502 I took a year or two off my life with stress 🤣
Hey man it's all good. I understand living the working parent life. It ain't easy
Thank you for your work!
Thanks squirrel! 😊
Chill, you do a great job managing this. There is no need to blame yourself that way.
Get some rest, enjoy first stages of parenting and take your time updating.
Thanks a lot for lemmy.zip and take care of yourself.
No worries man. It's just social media. We survived!
One thing I was curious about is I didn't know where to go to look up info on what was going on? But you mentioned you were posting links, so I'll bookmark some of that info! Anyway thanks for all your hard work.
Yeah absolutely, I'll be putting more links up. I wasn't prepared for things to go quite so wrong and for so long, so other than the matrix chat there wasn't any other info.
I'll be using mastodon, matrix, and I've updated the status page to be one that should actually work now :)
This is FOSS, not a job. You are doing it for free + maybe some donations people give. This is social media, not some critical health thing that needs to be working 24/7. Thanks for caring, now you know for next time, but don't beat yourself up. We appreciate your efforts and the transparency you have been giving us by making these posts! Keep on going with the transparency, but take it easy on yourself.
The graph shows the Federation sync finished after four and a half hours.
Woo! We should be good now, although there might be a few lost activities if things don't look 100%
Great work and welcome back 🤗
Thanks for the hard work! Glad the server is back online.
A suggestion: Post a message on status.lemmy.zip when there is maintenance. That was where I thought to check when I found that the main site was not working. Though, it was reporting the site was fine when it was unavailable, this time.
Oh, and congratulations on the newborn!