We’re back online, sorry for the downtime – Here’s why it took so long

After Jorge had dockerized the entire platform and it had been running for a few months, I noticed some stats that didn’t make sense (more wallets than addresses with a positive balance, some wallets with a negative balance). I guess our updateBalanceTables code is still not perfect. This is really hard to get right, because it works directly on the databases in order to avoid having to recalculate all the balances every block (which takes about half an hour nowadays!), and there are quite a few edge cases. I was actually thinking about rewriting this part in Haskell, but now I’m not sure if BGE is still worth investing so much time in, with alternatives like BlockSci proliferating.
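To make the kind of inconsistency described above concrete, here is a minimal sketch of a sanity check over the two balance tables. It is not BGE code: the function name, the in-memory dictionaries and the `wallet_of` mapping are all assumptions made for illustration, and a real check would read from the databases instead.

```python
def check_balance_invariants(address_balances, wallet_balances, wallet_of):
    """Return a list of human-readable violations (empty if all is well).

    address_balances: {address: balance}   (hypothetical in-memory table)
    wallet_balances:  {wallet: balance}    (hypothetical in-memory table)
    wallet_of:        {address: wallet}    (hypothetical clustering result)
    """
    problems = []

    # No address or wallet should ever hold a negative balance.
    for name, table in (("address", address_balances), ("wallet", wallet_balances)):
        for key, balance in table.items():
            if balance < 0:
                problems.append(f"{name} {key} has negative balance {balance}")

    # A stored wallet balance should equal the sum of its addresses.
    expected = {}
    for address, balance in address_balances.items():
        wallet = wallet_of.get(address, address)  # unclustered: its own wallet
        expected[wallet] = expected.get(wallet, 0) + balance
    for wallet in set(expected) | set(wallet_balances):
        stored = wallet_balances.get(wallet, 0)
        wanted = expected.get(wallet, 0)
        if stored != wanted:
            problems.append(f"wallet {wallet}: stored {stored}, expected {wanted}")

    # A funded wallet needs at least one funded address, so there can never
    # be more funded wallets than funded addresses.
    funded_addresses = sum(1 for b in address_balances.values() if b > 0)
    funded_wallets = sum(1 for b in wallet_balances.values() if b > 0)
    if funded_wallets > funded_addresses:
        problems.append(
            f"{funded_wallets} funded wallets but only {funded_addresses} funded addresses"
        )
    return problems
```

Running something like this after each incremental update would be far cheaper than a full recalculation only if it were restricted to the addresses touched by the block, which this sketch does not attempt.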

Anyway, while trying to debug this I unfortunately noticed an unwanted side effect of Docker: the common way of killing Docker containers is absolutely brutal, and even though we had already spent considerable time making BGE pretty much safe to kill in any state, this brutality was just too much: our databases became corrupted.
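For context: `docker stop` sends SIGTERM first and only follows up with SIGKILL after a grace period (10 seconds by default, adjustable with `--time`), whereas `docker kill` sends SIGKILL right away. A process can therefore shut down cleanly if it handles SIGTERM and if the grace period is long enough for it to finish its current unit of work. Below is a minimal sketch of that pattern, not BGE code: `run`, `process_block` and `commit` are hypothetical names, and the assumption is that work can be committed block by block.

```python
import signal
import threading

# Set by the signal handler, polled by the main loop.
stop_requested = threading.Event()


def request_stop(signum, frame):
    """Signal handler: only set a flag, never touch the database here."""
    stop_requested.set()


def install_handlers():
    """Register the handler for SIGTERM (docker stop) and SIGINT (Ctrl-C).

    Must be called from the main thread.
    """
    signal.signal(signal.SIGTERM, request_stop)
    signal.signal(signal.SIGINT, request_stop)


def run(blocks, process_block, commit):
    """Process blocks one at a time and stop only between blocks.

    Returns the number of blocks that were fully processed and committed.
    """
    done = 0
    for block in blocks:
        if stop_requested.is_set():
            break  # leave cleanly; the previous block is already committed
        process_block(block)
        commit()
        done += 1
    return done
```

Two caveats apply. The signal only reaches the process if it is the container's main process; with the shell form of `ENTRYPOINT` or `CMD`, the wrapping shell typically does not forward it. And if a single block can take longer than the grace period, the container would need to be stopped with a larger `--time` value.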

That’s a pity, because rebuilding everything from scratch takes a few days at the moment, with the blockchain growing and growing. But what’s really frustrating is when you have just waited 3 days for FastBlockReader to finish and then an assert throws an exception, because somehow some blocks are missing!
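One way to fail after minutes rather than days would be a cheap pre-flight pass that reads only the block headers and checks that every block's parent is present. The following is a minimal sketch under that assumption, not part of BGE: `find_orphaned_blocks` is a hypothetical helper, and how the `(block_hash, prev_hash)` pairs are extracted from the block source is left out. The check is order-independent because blocks on disk are not guaranteed to be stored in chain order.

```python
# The genesis block's "previous hash" field is 32 zero bytes.
GENESIS_PREV = "00" * 32


def find_orphaned_blocks(headers, genesis_prev=GENESIS_PREV):
    """Return hashes of blocks whose parent is not among the given headers.

    headers: iterable of (block_hash, prev_hash) pairs, in any order.
    An empty result means every block except genesis has its parent present.
    """
    headers = list(headers)
    known = {block_hash for block_hash, _ in headers}
    return [
        block_hash
        for block_hash, prev_hash in headers
        if prev_hash != genesis_prev and prev_hash not in known
    ]
```

A non-empty result points at the gap directly: each returned hash is the first block after a missing one.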

Now, Jorge had similar issues in the winter when dockerizing and switching servers, and the culprit seems to be somewhere between bitcoind and BitcoinJ. BitcoinJ’s master version is still not segwit compatible, forcing us to use a rather unmaintained segwit branch. In addition, our assumption that it was safe to read bitcoind’s on-disk block files through a BitcoinJ loader seems to have been unwarranted. We’re still not sure why, but for now the only way to avoid this problem is to bootstrap with the PeerSource, which gives us the blocks one by one over the internal network. This process takes a bit longer, more like 5 days at the moment.

So we really ought to implement saner process killing in our Docker containers, completely rewrite BGE in a microservice architecture and finally get away from BitcoinJ. Or just try BlockSci, which is a little hard as long as our systems are running, because it needs 50 GB of RAM. 😉 What do you think?
