Friday, 5 October 2012

Can't hear myself!

This morning I turned on and decrypted my home computer, (like any usual morning.)
I kicked in screen and went to make a hot beverage, (but I pronounced it beveridge.)
I came back to my screen, unlocked it and found that my computer had failed to ssh into my development server, but HAD managed to ssh into my house server. I set up key exchange as soon as I built both servers, and it has been working ever since.

I tried, (http, https, ssh -vv) my development server, (over with the good people at Bytemark Hosting) but nothing, (I never thought that the problem that I'm about to report was at their end - Bytemark are so nice that no one would want to DDoS them, and too /good/ to have a partial outage.)

Then a little application called WhatPulse told me that it could not get to its server, (possibly not using SSH) and I opened Pidgin while thinking about the problem. I was seeing a partial outage, (which is a real pain because reporting "my ssh does not work but most websites are fine" to any ISP other than aa.net.uk is going to be frustrating.)

My first reaction was: My dev server! I've spent two weeks rebuilding that from scratch for a new project and I did not back up before I went to bed, (I'm kidding of course [alot] - backup is encrypted and automatic.) No, what I really thought was, "I wonder what I did to make that crash? It was up last night - I'd better check the KVM-console". But alarms were going off in my head. Pidgin claimed to authenticate, (using XMPP) with Google Talk, (talk.google.com:5222) but Pidgin reported an error authenticating with MSN, (well it might not have got to authentication - I just knew at this point that it had not worked.)

By now I had a browser open, (Google Chrome) and tried to Google "can't ssh or msn but https works", and it "did", (with the EFF add-on Chrome defaults to https if it can; which it did.) Nothing jumped out at me, as I looked for outage reports and other similar descriptions.

So what was happening? Had the firewall.uk broken? http://www.bbc.co.uk/ worked, but again that was just one protocol.

I set nmap going against my own remote server: All scanned ports on dev-null.alexx.net are filtered.
hmm. Maybe my local software firewall updated last night - I check and even flush it: still nothing.
The next gate is obviously my local router - I log in and check the 'firewall' part of it, and then turn that off as well. Still http, (and even youtube is working) but the website of my hosting company is not available to me - but http://www.downforeveryone.com claims that it is up, (as are all of my sites. So that is reassuring.)
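
A note on that nmap result: "filtered" means the probes got no answer at all, (or an ICMP unreachable) which is different from "closed", where the far end actively refuses the connection. A rough sketch of the same check without nmap, assuming plain TCP connects are good enough; the function name `probe` and the 3 second timeout are made up for illustration and are not what nmap does internally:

```python
import errno
import socket


def probe(host, port, timeout=3.0):
    """Classify a TCP port as 'open', 'closed' or 'filtered'.

    Hypothetical helper: a full TCP connect, not nmap's SYN scan.
    """
    try:
        # A completed handshake means something is listening.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, nothing listening.
        return "closed"
    except socket.timeout:
        # No answer at all: packets are being dropped somewhere on the path.
        return "filtered"
    except OSError as exc:
        # An ICMP unreachable also counts as filtered in nmap's vocabulary.
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            return "filtered"
        raise  # DNS failures and anything else are a different problem
```

Running `probe("dev-null.alexx.net", port)` for ports 22, 80 and 443 would show the pattern described here: every port "filtered" points at something on the path dropping packets, rather than at a crashed sshd, (which would look "closed").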

So my end seems to be working, and from somewhere on the Internet the other end was working - traceroute time. I know that my connection, (today) is carried by completel.net, (so I check their website, and that works.) I then look at the exchange point between them and my hosting company. That shows up in the trace as raw IPv4, but a quick whois tells me, Neo Telecoms. hmm http://www.neotelecoms.com/ does _not_ work, (maybe that isn't their website - quick google shows that it probably is, but also a helpful tweet from 2011 that looks like an outage report.) https://twitter.com/NeoTelecoms has no such update this morning.

... and then my routing flips over to those-people-who-are-less-than-Level4 and the Internet is shiny and new.... all of my websites start working, I go to try ssh aaaaand then it flips back to "Neo Telecoms", (they don't feel much like "The One"). So, (feels like playing Cluedo) is it Mr http://www.lonap.net with the faulty switch, Miss https://www.euro-ix.net/ with the faulty router or is Neo the one, (sorry) responsible?
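
Spotting where the path changes between two traces can be done by eye, but a small sketch makes the idea concrete. It assumes the text comes from `traceroute -n`, (numeric output, one hop per line) and the names `parse_hops` and `first_divergence` are invented here for illustration:

```python
import re

# A hop line from `traceroute -n` starts with the hop number, then either
# an address or "*" when that hop did not answer.
HOP_RE = re.compile(r"^\s*(\d+)\s+(\S+)")


def parse_hops(trace_output):
    """Return the hop addresses in order, with None for silent hops."""
    hops = []
    for line in trace_output.splitlines():
        match = HOP_RE.match(line)
        if not match:
            continue  # the "traceroute to ..." header or a blank line
        addr = match.group(2)
        hops.append(None if addr == "*" else addr)
    return hops


def first_divergence(old_hops, new_hops):
    """Index of the first hop where two paths differ, or None if identical."""
    for index, (old, new) in enumerate(zip(old_hops, new_hops)):
        if old != new:
            return index
    if len(old_hops) != len(new_hops):
        # One path is a prefix of the other; they part where the shorter ends.
        return min(len(old_hops), len(new_hops))
    return None
```

Feeding it one trace captured while things work and one captured while they do not gives the hop number where the two carriers part company; a whois on the address at that hop is then the next step.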

Then I realise - just because I can debug this problem does not mean that I should. I should just report the outage, but I'm a friend of a customer of a reseller of completel.net.

All of this took me a few minutes to diagnose and about an hour to blog.

The routing, meanwhile, flip-flops between Neo and level3 like a dying fish, (or more likely a netops who is trying to solve the problem - which is reassuring - but having been there they may have cost of routing weighing down one side of the scales.)

Service resumes 79 minutes later.... (but I can quite imagine that someone has been up all night trying to fix this - if there is a netops sysadmin out there on their way to a well-deserved rest, "Thank you".)

Update 01: This works, (for me) as a lonap status page, (and shows that there wasn't a blip this morning). I'm still looking for the euro-ix and $other_providers equivalent - RIPE or .eu should have a status.eu that shows the latest version of these for each Internet exchange and another for each ISP that wants to offer it.

Update 02: franceix, (yes, still with a hideous '90s URL) is the sort of thing that I'm looking for. www.neotelecoms.com fails to provide useful information, (other than a glorified sales catalogue) and their /en/ is covered with French, but both are better than http://www.parix.net which is offline. PARIX seems to be owned by France Telecom, and in an odd twist they were bought by the mobile, (cell) phone company Orange.com... that do not offer graphs of their network, (which I understand, but parix.net _should_.)

