May 30, 2012: Deep History and the Diablo 3 Effect
Coincidentally, also around the middle of May, I was thinking again about how to record more auction history beyond 2 weeks. I've thought about this problem off and on, and often get stuck on how a deep history tends to mean a lower resolution, and I think a daily resolution for pricing stats is quite inferior to the hourly resolution we get from Blizzard. The trouble with hourly resolution is that you need hardware better than the typical VPS to update data and return queries in a reasonable time. Finally, I never came up with an algorithm that satisfied me to crunch 24 data points into 1.
Averages didn't look right. Why average the market price across all observations in a day, when perhaps you have gaps in your observations? Let's say you're getting data on a Tuesday. You get all of Tuesday early morning, then Tuesday maintenance hits. All realms are down, lots of auctions start expiring. Realms come back up in a couple hours, but let's suppose the API is down for another couple. Finally the API comes back, and you finish up with a few hours at the end of the day. A lot happened in the 12 hours or so, even while realms were down. Would averaging the price and quantity be an accurate representation? I don't think so, since the averages never actually appeared on the market.
I broke the problem down, and first thought only about the market price. How would I get one market price for an item for the entire day. I'd like to have a market price that actually existed for at least one snapshot. What should I do for the times when the item is sold out? Well, to extrapolate how it works out, the market price is the most accurate when there are the most items available. So, conveniently, that also gives me a quantity value for the day. So, to get a pair of values for an item for a day, just observe all snapshots as they come in. The market price and quantity of the first snapshot of the day is the value we record. Then, for each subsequent snapshot that day, if the quantity available is greater or equal to the one we already recorded, overwrite the quantity and market price with the one from the current snapshot. At the end of the day, you have a market price that actually existed, when the quantity was at its peak for the day.
I thought and struggled a bit over how to record the data. I had one table structure that made it quick and easy to update the data as new snapshots came in, but querying that data for display (and pruning it whenever that became necessary) would be pretty slow. Plus, it had lots of data duplication in storing the realm and item IDs for every row, which wasted space. I had another table structure that showed slower updates, but querying and pruning were very, very quick. In the end, I decided to use both tables. The first table was for the current day, and would get its quick updates as the day went on. At the beginning of the next day, I would pull all that data out of the "update" table and into the "storage" table. This data pull would take an hour or two, but I only had to do it once a day, and after that, selects from the "storage" table were always quick. Yay for the hybrid model! I observed it for a couple weeks, and the performance characteristics stayed the same, and drive space usage was relatively low. (I got 2 weeks of stats for all items on all alliance and horde realms on the US and EU in about 750MB.)
Since I wasn't even sure if it would work out, the deep history database and processing is on its own separate VPS. It should not affect the rest of the processes in The Undermine Journal, so if it is slow or processor-intensive or anything, no problem. Since it's a separate box, and it receives all the auction data, it was a good candidate to be a backup for the live auction stats that get used on the category and user controls pages. In fact, since it's geographically much closer to the web server than the main auction database, it's now the first place we go to get that data. (The hourly historical data still sits on the auction database server in my apartment.) The upside to all of this is that we have some better speed on those pages, and some redundancy. If we can't get to the main auction history database, we can still fall back onto the deep history database for some stats and managing market notifications. If we can't get to the deep history database, we skip the deep history stats and rely on the main auction history database like we always did in the past. Not that the site was ever down a lot, but this redundancy will give me more peace of mind that the site won't be down just because a database is inaccessible.
May 31, 2012, 12:14:06 Patrick wrote:
_, Very nice info and straight to the point. I am not sure if this is actually the best place to ask but do you guys have any thoughts on where to get some professional writers? Thank you
May 31, 2012, 13:34:29 Bayesian average wrote:
Averaging when you have different number of data points seems like a variation on the Bayesian Average, where you would use the previous market price and number of items available as priors, and the new market 'average' and number of data points to add them in:
You could also consider an exponential average model, so new data points are 'folded' in with previous ones in a smooth way. There are good ways to account for missing points this way too, but the math gets a little wonky.
June 01, 2012, 20:09:54 Anonymous wrote:
The first comment is spam =P