July 16, 2010: To answer some questions about data collection

There are a few small, cheap VPSes on which I have installed crawler scripts. Each crawler polls TheUndermineJournal.com every few minutes, asking whether a realm needs to be crawled (each realm comes due about once an hour) and supplying its current load information to assist in load balancing across crawlers. If a realm needs to be scanned, that realm is assigned to that crawler and battle.net credentials are supplied. The crawler then connects to the Armory via the same HTTP web interface that's open to the public and scans the auction house listings. Once the scan is complete, the crawler calculates the market price of every item seen and sends the raw auction listings, along with the market prices, back to the main site. The main site stores that data in a short-term file queue, and another process pops each data file off the queue and inserts it into the database.
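To make the polling and assignment step concrete, here is a minimal sketch of how the main site might decide what to hand a crawler when it polls. This is not the site's actual code: the `Coordinator` class, the `poll` method, the `max_load` cap, and the "most overdue realm first" rule are all assumptions made for illustration, and the credentials handoff is left out.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

SCAN_INTERVAL = 60 * 60  # a realm comes due roughly once an hour


@dataclass
class Coordinator:
    # realm name -> timestamp of its last assignment (0 means never scanned)
    last_assigned: dict = field(default_factory=dict)
    # hypothetical cap on concurrent scans per crawler
    max_load: int = 5

    def poll(self, crawler_load: int, now: Optional[float] = None) -> Optional[str]:
        """Return a realm for the polling crawler to scan, or None."""
        now = time.time() if now is None else now
        if crawler_load >= self.max_load:
            return None  # crawler is busy; leave the realm for another one
        due = [r for r, t in self.last_assigned.items()
               if now - t >= SCAN_INTERVAL]
        if not due:
            return None  # nothing needs crawling right now
        realm = min(due, key=lambda r: self.last_assigned[r])  # most overdue first
        self.last_assigned[realm] = now  # mark as assigned so it is not handed out twice
        return realm
```

A crawler reporting a load of 2 would call something like `coordinator.poll(2)` every few minutes and start a scan only when it gets a realm name back.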

Each cheap VPS has around 128MB of memory and can peak at about 10 realms being scanned at once. However, I'm trying to keep the number of running scans at about half that, to leave headroom for failover and for unexpected heavy demand (the Armory running slow that day, resuming after a shutdown, etc). Also, I don't want to flood the Armory with too many TCP connections from the same IP and risk that IP getting banned, so I doubt I'll push that scan number higher even if I get some VPSes with more memory. Hard to say, though.
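The "about half" figure can be checked with some quick arithmetic. The sketch below is only an illustration: the function name and the assumption that a failed crawler's scans are spread over the remaining ones are mine, and the peak of 10 is the rough number quoted above.

```python
PEAK_SCANS = 10                 # approximate per-VPS peak from the text
TARGET_SCANS = PEAK_SCANS // 2  # "about half that"


def survives_one_failure(crawlers: int,
                         scans_per_crawler: int = TARGET_SCANS) -> bool:
    """True if the remaining crawlers can absorb all scans when one goes down."""
    if crawlers < 2:
        return False  # nothing left to fail over to
    total_scans = crawlers * scans_per_crawler
    remaining_capacity = (crawlers - 1) * PEAK_SCANS
    return total_scans <= remaining_capacity
```

With two crawlers running 5 scans each, losing one leaves 10 scans for the survivor, which is exactly its peak. At 6 scans each, the same failure would overload it.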

