Author: M.Prince (25 May 08 9:42pm)
You'd think we were on a 486, wouldn't you....
We actually have pretty significant hardware spread across two world-class data centers. The problem is that our original schema hasn't been able to keep up with our increased load. Let me describe our current setup in order to explain the problem:
- We receive many millions of email messages each day (far more than we report on our statistics page).
- Those messages all arrive at autonomous mail servers that know nothing about the rest of the Project.
- At specified intervals, a central machine polls each mail server, collects the messages it has received, and processes them.
- Approximately 1 in 7 messages we receive is from an IP address we haven't seen before.
- Each new IP address results, effectively, in a new web page.
- At the same time, harvesters are banging away at honey pot pages around the web. Each time one of the honey pots is accessed, we need to respond with a new spam trap email address and record the IP that the spam trap was handed to.
- The value of the Project is, in part, the ability to relate events to one another (harvesters with spam servers with other IPs in the same net block), so each time an event occurs we do a lot of relating to see how it is connected with other events we've seen (see the sketch after this list).
- The scoring system we built for http:BL takes into account a number of factors including when the last activity for an IP address occurred. That means we need to traverse millions of IP records every time we build the http:BL zone files.
- Meanwhile, we're a magnet for the Google spider -- since our website is so massively interconnected and since we create so many new pages all the time, the Google spider hits us several million times per day. Literally. That's good because it gets data out and makes it more accessible. It's bad because... well, it can be overwhelming.
- The setup is fragile insofar as, when we begin to fall behind, the problem escalates out of control very quickly.
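To make that "relating" step concrete, here is a rough sketch of the idea in Python. The event roles, the in-memory index, and the /24 grouping are assumptions for illustration only, not our actual implementation:

    import ipaddress
    from collections import defaultdict

    # Hypothetical index of previously seen events, keyed by /24 net block.
    events_by_block = defaultdict(list)   # "157.252.10.0/24" -> [(ip, role), ...]

    def record_event(ip, role):
        """Store an event (e.g. role='harvester' or 'spam_server') and return
        every earlier event from the same /24 net block, so the new event can
        be related to them."""
        block = str(ipaddress.ip_network(ip + "/24", strict=False))
        related = list(events_by_block[block])        # everything seen nearby
        events_by_block[block].append((ip, role))
        return related

    record_event("157.252.10.17", "harvester")
    print(record_event("157.252.10.251", "spam_server"))
    # -> [('157.252.10.17', 'harvester')]

In reality the index lives in the database and holds millions of records, which is exactly why falling behind hurts so much: every new event triggers lookups against everything we have already seen.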
So we've had to choose what to prioritize. We decided to put the functioning of the honey pot network, the reception of email, and the distribution of http:BL ahead of the information displayed on the website. When the website goes into "Maintenance" mode, it is because our engineers have determined that the load from the website has reached a point where it threatens one of those higher priorities. We shut down the site for a while and let the databases catch up.
What are we doing to solve the problem? First, we have ordered and installed more hardware. Instead of relying on a single back-end database, we now have a master/multiple-slave setup. That helped a bit, but at the same time we installed more mail servers, which has increased the volume of spam we take in, so it hasn't helped enough.
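For those wondering what the master/multiple-slave change buys us, the pattern is roughly the one sketched below: all writes go to the master and reads are spread across the slave replicas. The class and connection handling here are placeholders, not our production code:

    import itertools

    class ReplicatedDB:
        def __init__(self, master, slaves):
            self.master = master                    # all writes go to the master
            self.slaves = itertools.cycle(slaves)   # reads rotate across slave replicas

        def write(self, sql, params=()):
            return self.master.execute(sql, params)

        def read(self, sql, params=()):
            # Slaves replicate asynchronously, so a read may lag the master slightly.
            return next(self.slaves).execute(sql, params)

The catch is that replication spreads the read load but does nothing for the write load, and more mail servers means more writes.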
We're also redesigning the database schema from the ground up. For example, we originally had different tables for different IP types (e.g., harvester, spam server, dictionary attacker, etc.). That made some sense because we thought of those as different things back in the day, but it meant that a query like "Show me the IPs near 157.252.10.251" had to hit multiple tables. Bad news. The new schema has a single table for IPs and then assigns characteristics to describe the IP in question. There are lots of other similar changes that should make things like generating the http:BL zone files MUCH faster.
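Here is a small sketch of what the single-table idea looks like, using SQLite and made-up table and column names (illustrative only; the real schema is more involved). Storing each IP as an integer means "the IPs near X" becomes one range query over one table instead of a crawl across several:

    import sqlite3, ipaddress

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE ips (
            ip_num  INTEGER PRIMARY KEY,  -- IP stored as an integer so range scans are cheap
            ip_text TEXT NOT NULL
        );
        CREATE TABLE ip_traits (          -- characteristics hang off the single IP row
            ip_num INTEGER REFERENCES ips(ip_num),
            trait  TEXT                   -- 'harvester', 'spam_server', 'dictionary_attacker', ...
        );
    """)

    def add_ip(ip, *traits):
        n = int(ipaddress.ip_address(ip))
        db.execute("INSERT OR IGNORE INTO ips VALUES (?, ?)", (n, ip))
        db.executemany("INSERT INTO ip_traits VALUES (?, ?)", [(n, t) for t in traits])

    add_ip("157.252.10.17", "harvester")
    add_ip("157.252.10.251", "spam_server")

    # "Show me the IPs near 157.252.10.251" hits one table with one index.
    block = ipaddress.ip_network("157.252.10.251/24", strict=False)
    lo, hi = int(block[0]), int(block[-1])
    for (ip_text,) in db.execute("SELECT ip_text FROM ips WHERE ip_num BETWEEN ? AND ?", (lo, hi)):
        print(ip_text)

The same idea helps the http:BL zone file build: walking one ordered table of IPs (with their characteristics and last-activity times) beats joining a handful of per-type tables.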
The good news is that through all these challenges the core functioning of the Project has remained basically online. The website goes down a lot. That frustrates me. We're working on a scheme that will keep it up and still allow us to provide as much useful information as possible.