August 02, 2010: Another busy weekend

Let's look at how I've handled data growth the past few weeks. I've been adding about a dozen realms every week during this, as well.

July 21 - Frustrated that I can't update realm 42 and select from realm 15 at the same time, I think that partitioning the MyISAM table by realm would help, perhaps only locking a single partition for updates instead of the whole table. I partition the Auction table and the Market Cache table.

July 24 - I realize that partitioning doesn't do a damn thing for me. Really, partitioning is best used for spanning across disks, and maybe some index assistance, or something. I coalesce the partitions and start converting those big tables from MyISAM to InnoDB.

July 25 - The InnoDB conversion does not go well, as I run out of space, and it's generally a bit mucky. End up doing an "emergency" move of the website and database from Slicehost to Linode, converting to InnoDB in the process. There were few casualties, including a lot of market cache data and about 4 days of auction data. This wasn't an error, just that I thought that data wasn't that necessary and I wanted to do a quick move.

July 30 - I have 88 A/H realms and 4 neutral realms in one table. Despite having plenty of space and using InnoDB for row locks and using transactions to group the writes to disk, I build and maintain a ~4.0 cpu load and 20-deep results queue. After fiddling with InnoDB memory buffer size and dumping the results queue to measure the changes, I blame disk thrashing and decide to (again, sooner than expected) split the DB into master and slaves. I shut down the whole site, as the DB can't digest all the data gathered by the crawlers, and I'm pretty sure the tables are a bit hosed anyway.

I had discussed how I wanted to load-balance the backend. In a nutshell, have a main DB server that manages static global data, and replicate that to slave databases which would handle 50 realms each. When a user runs a page specific to a realm, all the database queries go to that slave database that handles that realm. It should keep query time way down, and split the load so the site always feels lightweight, and allow for a decent plan for future realm inclusion instead of simply "add more and see what happens."

July 31 - I buy two more VPSes from Linode. I'm so glad they allow private IPs so I don't have to do DB traffic across an Internet IP, and get billed for that bandwidth. As I set up those VPSes, I start exporting the data from the big tables.. but even just getting that data out is taking forever. I decide to start fresh, especially since I know that replication can be fickle.

This process of the DB split was more involved than I initially imagined. Some issues encountered, in no particular order, were:
- Move the data-split load-balanced tables into their own schema. (I use the Oracle connotation of the word "schema" since it just gets confusing how MySQL calls them "databases" but then I consider the "database" to be the whole server.) Initially this was just so I could target only the main schema for replication, but replication in MySQL goes by connection more than by location of data, so that idea was out. I maintained the separate schema then just so I'd remember which tables hold split data and which are replicated, but then realized it was a good idea for permissions purposes. When talking to the master DB server, forbid writes to the split schema (since it'd then replicate them to all slaves, making the slaves hold data they don't need). When talking to a slave DB server, forbid writes to the replicated schema (since they'd force the table out of sync and not be recorded on the master or other slaves anyway).
- Update IP addresses in the 3 Linode boxes to include the internal IPs
- Update iptables to allow MySQL connections between database servers, along with the web server
- Install MySQL on the new Linode slaves and set them up for replication
- Assign permissions properly within MySQL on the master and slave DBs for replication
- Assign permissions properly within MySQL on the master and slave DBs for the web visitors and crawler scripts
- Stop all updates on the master DB, mysqldump it, and import it onto the slaves
- Do the dump and import again because it's very picky on the position in the binary log to start replication
- Update the script that receives the data from the crawlers to place that raw data into the proper queue for each server
- Create a script to feed that raw data from the queue to a slave DB ready to process it
- Update the script that actually inserts the new data to report the status back to the main server properly.
- Update PHP code so it can determine which slave DB to use once a realm is picked, and then use only that for future queries.
- Update PHP code so it now points to the right schema for each table.
- Update custom MySQL functions and views to point to the right schemas
- Fire up the crawlers, hope the data follows the convoluted path from Armory -> Crawler -> Main Server -> Slave Queue on Main Server -> Slave DB -> Table in Slave DB

All that was "completed" by late Saturday afternoon, and crawlers turned on after about 24 hours of being dormant, now feeding an empty database.

Aug 1 - I used to call this project like playing with a train set. These aren't "real" servers, they're a handful of VPSes, and I get them all to talk to each-other in interesting ways and push data around. One main server, and a bunch of tiny inconsequential dime-a-dozen crawlers. It's like playing sysadmin/DBA. I had a neato control panel to show me exactly how slow and dire things had become, and the types of bottlenecks I was encountering and had to resolve. Now, with multiple DB servers and replication, it's not a train set, it's an air traffic control tower. A lot more going on now that is almost too hard to keep in one's head all at once. And a lot of my dials that were pointed at the main DB said "it's all good" but I know there were slave DBs that I had no idea how they were doing. So on this day I set up a new script to report on server health as far as memory, CPU, disk, DB processes, crawlers running, parsers running, stuff like that. I already had something similar for the crawlers to tell me how often they were erroring out, how long they took, how many were running, etc.. but this was to record the health of the machines themselves. I now have an impressive page full of numbers and indicators telling me how things are going.

Aug 2 - And things are going pretty good. It's almost too quiet. Oh, there's activity on the slave DBs, but the crawlers are humming along, the parsers never have a queue, data goes where it's supposed to, load stays low. However, I know things are good now because they only have a couple days' worth of data in them. Wait until I start hitting day 14 and 15 and I start pruning old data off of the full databases, and we'll see how things go. Hopefully okay. I used to have my own "tuesday maintenance" where I would archive off and delete anything over 2 weeks old. Well, that just meant on Monday there were 6 days' worth of useless data sitting around, and I'd never really want that data after I archived it. So now it'll periodically delete old data every hour-ish instead of waiting for Tuesday, and I hope with that habit it'll never be a ton of data to remove, slowing things down.

In other news, the Armory was being flaky again this morning. The weekend was good, though; few errors. Some info on crawler handling of a flaky Armory and 2nd paragraph: current timeouts.

Category: Technical | Posted by: admin


No comments yet

Add Comment

This item is closed, it's not possible to add new comments to it or to vote on it