September 21, 2012: Captcha Update

You may be aware that if someone loads enough item pages in a 24-hour period, they are presented with a test to make sure they're not a script scraping our data. (It takes considerable resources to construct each item page, so I don't want to get hammered by scripts.) Such a test to distinguish humans from bots is called a CAPTCHA: "Completely Automated Public Turing test to tell Computers and Humans Apart". You know them as those twisted, illegible words you're asked to type. The industry-standard CAPTCHA is the Google-supplied reCAPTCHA, which is what I've used here.

Most visitors will never see the captcha, since the number of pages you need to load in a day is fairly high. However, there are a few power users who do reach it, and reCAPTCHA has been getting harder and harder for people to solve. I mitigated this somewhat by allowing logged-in users to skip the captcha, on the theory that anyone scraping the data wouldn't use a logged-in account that I could detect and disable. Still, some were very upset about how difficult the captcha was to solve.

I researched other captcha methods, and one increasingly popular approach is to show a grid of images and ask the user to select all of a certain type. For example, a grid of various objects is shown, and the user is asked to select all the clocks, or basketballs, or houses, or whatever. All the images are different, and the number of correct images varies. You can see an example of this system at PICATCHA.

I liked this method, and thought about using PICATCHA, but figured I could do better. It'd be cute to use stuff from WoW: it's audience-appropriate, and harder to defeat by matching against existing image libraries. At first I was going to use item icons ("click all the swords", "click all the bolts of cloth"), but there's a finite number of icons, they're easily categorized, and some are a bit ambiguous. So, using some methods from RealmPop and Transmog Fashion, I made a version sourced from character images on Battle.net.

Here are some examples so you get the idea:

[example captcha screenshots]

Hopefully that'll remain secure and be easier for visitors to answer. I made sure that none of the correct selections have helms on, so you can clearly see the race. It takes 5-10 seconds to pick some random toons and build a question and answer key, but that should rarely be required anyway.
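
If you're curious, the generation step boils down to something like this. This is just a sketch, not the site's actual code: the table, columns, and grid size are invented, and the slow part (pulling fresh renders via Battle.net) is assumed to have happened already.

```php
<?php
// Sketch of building one "click all the X" question from character renders
// already fetched from Battle.net. All names here are invented.
function buildRaceCaptcha(mysqli $db, $gridSize = 9) {
    // Pick the race the visitor will be asked to find, from helmless renders
    // only, so the race is clearly visible.
    $row = $db->query('SELECT race FROM character_renders
                       WHERE has_helm = 0 ORDER BY RAND() LIMIT 1')->fetch_row();
    $race = $row[0];

    // A variable number of correct images; the rest of the grid is filled
    // with characters of other races (helms allowed on decoys).
    $numCorrect = mt_rand(2, 4);
    $correct = $db->query(sprintf(
        "SELECT id, image_url FROM character_renders
         WHERE has_helm = 0 AND race = '%s' ORDER BY RAND() LIMIT %d",
        $db->real_escape_string($race), $numCorrect))->fetch_all(MYSQLI_ASSOC);
    $decoys = $db->query(sprintf(
        "SELECT id, image_url FROM character_renders
         WHERE race != '%s' ORDER BY RAND() LIMIT %d",
        $db->real_escape_string($race), $gridSize - $numCorrect))->fetch_all(MYSQLI_ASSOC);

    $grid = array_merge($correct, $decoys);
    shuffle($grid);

    return array(
        'question' => "Click all of the {$race} characters.",
        'grid'     => $grid,                         // rendered to the visitor
        'answers'  => array_column($correct, 'id'),  // answer key, kept server-side
    );
}
```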

Category: Technical | Posted by: admin |

August 17, 2012: Yesterday's Maintenance

It's been a while, so let me give you a quick overview of how this whole site serves up tasty data.

There's the main server, "Undermine." It's a VPS (virtual private server) hosted at Linode in New Jersey. It does a lot. Undermine hosts the sole web server, the sole mail server, and the primary database. Undermine's database is a replication master and holds all item tooltip data and game data (enchants, reagents, item classes... everything in-game except for auctions), all user data (subscriptions, messages, etc.), and the general state of things. Undermine also does all the talking to Battle.net to grab the auction data and package it for Kezan. Undermine sends the market notifications via email and XMPP. Undermine is backed up daily. It has an uptime of 251 days. Yeah.

There's the auction database server, "Kezan." Kezan is an actual hardware server in my apartment in Northeast Pennsylvania. Kezan is a replication slave to Undermine, through an SSH tunnel. This means Kezan has all the data in Undermine's database, but it also has (almost) all the auction data, too. Undermine packages the auction data from Battle.net, and Kezan continually grabs and parses it from Undermine. Kezan holds auction summary data (seen on category pages, in the Addon, and in the XML/CSV APIs) and detailed auction data (individual auction history for 2 weeks, which allows hourly-resolution graphs, sold stats, and some other stuff). Whenever you're on an item or seller page, most of those queries run on Kezan. Kezan has 16GB of memory and a beefy $600 enterprise-level SSD for quick data retrieval and pruning, but that SSD is only 60GB. Kezan has an uptime of 261 days.

Finally, there's the deep history server, "Zandalar." Zandalar is a new, separate VPS, also hosted at Linode in NJ; it was just added in May. It parses auction data from Undermine to assemble the daily item history stats beyond 2 weeks. Zandalar is also a replication slave to Undermine, so it has all the same data (items, users, etc.). Since it gets pretty much the same data as Kezan, that let me position Zandalar as a failover for Kezan, at least as far as summary data goes. (There's not enough room or performance for detailed hourly data.) Zandalar also lets me serve the rare item notifications, since it's closer to Undermine than Kezan is and it has that summary auction data. Zandalar has an uptime of 85 days.
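
To give a feel for the failover on the web side, the database selection amounts to something like this (a sketch with placeholder hostnames, ports, and credentials, not the real config):

```php
<?php
// Sketch of picking an auction database with failover. Kezan is preferred;
// Zandalar only has the summary data, so callers that need detailed hourly
// data check which host actually answered and leave those bits out otherwise.
function getAuctionDb() {
    $hosts = array(
        'kezan'    => array('127.0.0.1', 3307),        // Kezan, via the SSH tunnel
        'zandalar' => array('zandalar.internal', 3306), // fallback, summary data only
    );
    foreach ($hosts as $name => $hp) {
        $db = @new mysqli($hp[0], 'tuj_ro', 'secret', 'auctions', $hp[1]);
        if (!$db->connect_errno) {
            return array($name, $db);  // remember which host answered
        }
    }
    return array(null, null); // both down; show an error instead of auction data
}
```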

Okay, so. Lately the disk space usage on Kezan's SSD has been creeping upward. We continually insert new records into and prune old records out of the main auction table, and over time, even though space and indexes are reused, the table is no longer efficient. Unfortunately, we were nearing disk space limits. Two weeks of every auction ever seen on all US and all English-EU realms, about 312 million auctions, made that one table 44GB (including indexes). Normally you'd just optimize the table, which means creating a new copy in place and deleting the old one. However, there isn't enough space on the SSD for another copy of the table.

So, I stopped all parsing and queries on Kezan, and exported the table to the spinny HDD using mysqldump and gzip. Surprisingly, the gzipped insert statements came to only about 2.7GB. Then I dropped the table and recreated it from that export. It took about 30 minutes to dump, another couple of hours to refill, and then about 4 more hours to catch up on the backlog of roughly 5000 datasets pulled from Battle.net (a dataset being one realm faction's hourly parse). Thanks to Zandalar, I didn't have to take down the whole site for this maintenance. The site would fail over to Zandalar for most auction queries, and just not include the bits that absolutely required the full auction table. Zandalar is just a wee VPS, so it strained a bit under the load, but all things considered, it performed pretty well.

The end result? That 44GB table is down to 27GB, which gives us some breathing room for another bunch of months. I wanted to do it ASAP, because auction data is going to get pretty interesting with MoP and I didn't want downtime then.

Lesson learned: people don't read the news section. Although I mentioned there would be maintenance that day, lots of folks still asked why bits were missing. Sigh.

Category: Technical | Posted by: admin | 1 Comment

March 17, 2012: The ongoing saga of @UndermineJrnl

@UndermineJrnl on Twitter has a long and boring history that I'll share with you today.

It started out as @QuickArmory, for my old armory-clone site, which is no more. It would announce any news items and act as another way for my visitors to contact me. Quick Armory closed with WoW 4.0.1, shortly after The Undermine Journal started, so I just renamed the account.

@UndermineJrnl didn't do much for a while, until market notifications were released. Then I decided to send notifications via DM. This required participants to follow the account to receive DMs, so my follower count went up by hundreds. It still didn't have any public tweets (except rare news ones from me), and it worked pretty well, until I bumped into the 250 DM/day limit. Notifications via DM stopped, and now I had this Twitter account with a bunch of interested followers.

After a few months, I decided to tweet syndicated editorial links as they came in. Let's pretend there's a blog called "Awesome Gold Blog" for these examples. An article's tweet from @UndermineJrnl would look like:
"Finding Success on the Auction House" - Awesome Gold Blog http://tuj.me/_0000
Great. But I occasionally got replies from followers that were intended for the article's author, not for me. What to do? Well, as a gentle hint, I would find the Twitter account for each blog I followed, and use that as the credit instead of the blog name:
"Finding Success on the Auction House" - @AwesomeGold http://tuj.me/_0000
That kinda worked, but I still got some replies that would go nowhere. Sometimes they'd just address both of us. How to direct the replies to the author alone?

Retweets! When I retweet an author's article, it shows up in your feed as coming from the author. Sounds good. Many authors would tweet (either manually or automatically) when they had new blog posts, and I still wanted my process to be automatic. So, whenever I had something to tweet, I'd first search the author's Twitter feed for recent links and compare them to the link I had. If one matched, I'd just retweet it instead of building my own tweet.
RT @AwesomeGold Hey, check out my new blog post: "Finding Success on the Auction House" http://bit.ly/zxcvbnm
Good in theory, but often the links were different. Sometimes the author would tweet the URL that went directly to the article, while their RSS feed had extra FeedBurner stuff on the end of the URL that wasn't really necessary. To see if the links were the same, I'd just check whether the shorter of the two links was contained in the longer one. Example:
http://www.awesomegoldblog.com/2012/03/Finding_Success_on_the_Auction_House.html
is the same as
http://www.awesomegoldblog.com/2012/03/Finding_Success_on_the_Auction_House.html?feedburner=fromrss&otherstuff=wenthere
I also had to follow all redirects from Twitter to get to the final page, the one that wouldn't redirect any further.
http://t.co/ASdfgHJkl redirects to:
http://bit.ly/zxcvbnm redirects to:
http://www.awesomegoldblog.com/2012/03/Finding_Success_on_the_Auction_House.html
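
Stripped of the Twitter API plumbing, those two checks boil down to something like this (a sketch; the cURL approach and function names are mine, not necessarily what the bot actually ran):

```php
<?php
// Follow every redirect (t.co -> bit.ly -> blog) until the URL stops moving.
function resolveRedirects($url) {
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_NOBODY         => true,  // a HEAD-style request is enough
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 10,
        CURLOPT_RETURNTRANSFER => true,
    ));
    curl_exec($ch);
    $final = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // last URL reached
    curl_close($ch);
    return $final;
}

// "Same link" test: the shorter URL must appear inside the longer one,
// which treats FeedBurner junk on the end as irrelevant.
function linksMatch($a, $b) {
    if (strlen($a) > strlen($b)) {
        list($a, $b) = array($b, $a);
    }
    return strpos($b, $a) !== false;
}
```
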
I got all that done, but then someone posted a link to just their blog's front page (http://www.awesomegoldblog.com/), and since that short URL is contained in every article URL, the front-page tweet got retweeted instead of the article. Sometimes blog posts just wouldn't get picked up for retweets (other URL nonsense, usually), and a new tweet would be crafted from @UndermineJrnl. It seemed I couldn't automate this process reliably.

I rolled back to where I'd just include the author's twitter handle in the credit. I still get replies. Oh well.

Category: Technical | Posted by: admin |

March 01, 2012: This HTTPS thing... not as easy as I thought.

I know I mentioned it before, but I kept thinking about it. HTTPS everywhere. Technically it isn't necessary, but I liked the idea enough for a few reasons:
  1. Even though the money's handled through PayPal, we still have user accounts (and user cookies) that you wouldn't want spoofed.
  2. A lot of people use this site from work, and The Man doesn't need to hear about your auction activities (even if looking up the IP is still possible).
  3. I already had HTTPS for www, so a lot of the site had been coded to handle HTTPS since October 2010.
  4. It's something new and different to do.
Yesterday I picked up the wildcard SSL certificate and set about the installation. This required me to set up the certificate on the web server, set the web server to redirect http requests to https for the domain, and update all the code to point to https so future links don't need to redirect.

First, the certificate setup was a bit of a pain. I use lighttpd for the web server, and I'm very happy with it, but the certificate installation directions don't cover lighttpd, just major servers like Apache and IIS. No problem, I'll search around and see how I did it with the previous certificates. I had to argue with it for a bit so that the certificate chain would get sent to the browser correctly. I don't remember doing it this way before, but it works now, so that's cool. Then there's the issue of clients (browsers) accepting the wildcard certificate. Most clients understand *.foo.com by now, but a few don't, and one of them is an older version of wget, which says:
ERROR: certificate common name “*.theunderminejournal.com” doesn’t match requested host name “theunderminejournal.com”.
To connect to theunderminejournal.com insecurely, use ‘--no-check-certificate’.
So that's annoying. But recent versions of wget handle it correctly, so over time, that issue will go away. I just wish it worked out of the box for those users who pull addons and APIs automatically. There are probably other tools that haven't caught up with wildcard certificates yet.

Second, the automatic redirection. This required some fiddly regular expression work in the lighttpd config, but no big problem. I had to make sure the subdomains stuck on the redirect instead of everything getting pushed to the root.

Third, the code. There were about 30 files with "http" in them that needed to change to "https." Just 20 minutes of thoughtful search and replace. The tedium was that those files were all over the place: some in the public directory (the pages you see), others in back-end code, like the market notification generation and my personal server stats page.

And then the database replication broke. At midnight. All day, I had resisted the temptation to make the change before running off to an appointment; I'd make it when I had time to check that things were still working well. The code replacement and forced redirect went in around 10:30pm local time, and I looked around at some pages and tools and things were all fine. Around 11:30pm I noticed that my auction database server (hosted at home, acting as a replication slave for the account info) couldn't connect to the master database server (hosted at Linode, acting as a replication master for everything but auction data). This replication setup was born back when I had multiple databases, splitting realms among other shared servers. I kept the arrangement so that I could have the account and static data (items, spells, etc.) on the backed-up server at Linode, and give the auction DB at home local, up-to-date, read-only access to those same tables.

I have this replication between two very distant servers set up through an SSH tunnel. I use autossh to reconnect whenever my cable modem goes down (which isn't often, thankfully, but it's a residential account, so it has occasional hiccups). I've also had this setup since October 2010. Once you get it working, it's actually pretty robust. If the connection breaks, the slave DB doesn't freak out; it just remembers where it was and tries another connection in 60 seconds. Whenever it reconnects, it starts where it left off and quickly catches up to the master. Last night, though, it couldn't hold a connection. My logs showed that it stopped working pretty much when I was fiddling with the HTTPS stuff, but why would that matter? That doesn't touch the database at all. I thought it was just a coincidence, and I would manually restart the database service (both on kezan and on undermine) and try to keep the replication going, but it would quickly abort. I had visions of having to nuke the slave database and reload it from the master, which would not have gone well. It was midnight, I was tired, and I needed the site to work, because it's no longer just a silly project; it has paying customers now. I was a little frazzled.

So I tried stuff (nothing irreversible), and it didn't work, and I'd walk away and think about it. I decided the HTTPS swap had to have something to do with it, somehow. I was watching the SSH tunnel connection, trying to see why it'd fail, and I examined the script I run every minute to keep the tunnel alive. Not only does that SSH tunnel keep database replication going, it's also how I get data from the auction database back to the web server. So I have a script that asks the web server whether it can still see the auction database; if it can't, I assume the tunnel died and restart it. Of course, this check is a (tiny) web page, pulled with wget. And wget on my database server couldn't handle the wildcard certificate on the HTTPS URL it was now being redirected to, so it'd bomb. So, every minute, the server would trash the SSH tunnel and start a new one. And at the same time every minute, it'd try to restart replication, and fail shortly afterwards.
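
The health check itself is tiny. Roughly (a sketch with assumed host, port, and credentials, not the exact script):

```php
<?php
// tunnelcheck.php (sketch). The auction DB is only reachable from the web
// server through the SSH tunnel's forwarded port, so "can I run a query?"
// doubles as "is the tunnel alive?". A cron job on the database server
// fetches this page with wget every minute and restarts the tunnel if it
// doesn't get back "OK".
$db = @new mysqli('127.0.0.1', 'tunnelcheck', 'secret', 'auctions', 3307);
if (!$db->connect_errno && $db->query('SELECT 1')) {
    echo 'OK';
} else {
    header('HTTP/1.1 503 Service Unavailable');
    echo 'FAIL';
}
```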

Of course. So, I told wget to ignore the certificate, the tunnel stopped getting trashed every minute, and replication resumed. Yay. I went to bed. This morning I had a similar issue with the redirects and our CDN. I had to open up another host, just for the CDN to reference, that doesn't require HTTPS and only serves images. And I found that the Wowhead tooltips weren't using my CDN over HTTPS, so I had to fix that, too.

This one little change got its claws into lots of things. I think things are good now, though. And just a few days ago, I changed the method I use to connect to the Battle.net API, and how many PHP instances lighttpd and FastCGI keep around, just so I have enough different things changed at once to make it impossible to debug any issues. ::facepalm::

Category: Technical | Posted by: admin | 2 Comments

February 24, 2012: Ban list, and other things.

Things have been moving along, but nothing terribly exciting. Some details are available in the forum.

One of the more interesting things is XMPP notifications, which I finally got around to implementing. I'm not sold on their usefulness, but it's another option and perhaps some folks will like it. You can see how a lot of that went down in this forum thread.

Yesterday I thought about going HTTPS-only for the whole site, mostly due to this article. What's slowing me down is that I'd need to support the EU subdomain on HTTPS, which requires a wildcard certificate at ~$100/year instead of the current certificate at $10/year. I don't hear any Europeans asking for HTTPS, and US visitors who care can use HTTPS if they so choose, so I might as well save the money.

Lately we've had some "trouble" with visitors from China. It seems like their search engine crawlers don't care to follow our robots.txt, and others unaffiliated with search engines (but still often in China) like to try to crawl our item pages.

We've had bot protection on the item pages for a while. It's there because it takes considerable resources to build the item pages (as you can see from the usual progress bar during loads) and it's easy to script a crawler to hit every URL. So, right now, the site records the visitor's IP every time an item page is loaded. It keeps that information for 24 hours. If you load 100 item pages (the "loading" page counts as one, so it's more like ~50 items) in a 24-hour period, you get a screen with a captcha. Complete the captcha, and your record is wiped, and you get another 100 requests until it happens again. Hopefully most folks don't see that page, but if they do, they can continue on without much interruption.

Some really dumb bots just start loading all the item pages they can, then when they start getting captchas, they just bounce on the captcha page until it starts returning item data again. This caused real problems once or twice with aggressive bots and how I recorded requesting IPs, but that recording is a bit more streamlined now. (This isn't a challenge: it's still relatively easy to overwhelm... it's just a little more robust, so a strong wind doesn't topple things.)

The trouble is that these bots may request pages every few seconds, endlessly, and that drives up server load even though I'm not bothering to pull the info from the database. So I end up manually adding them to iptables to drop their packets. Sometimes this happens in the middle of the night and I don't notice, and the next day when I check, I find the site was slow for hours. Not cool.

Today I implemented another layer of that bot protection on item pages. Once you hit 100 requests in 24 hours, as described above, it starts asking you to solve captchas. However, it keeps counting the requests. If you hit 200 requests in 24 hours (meaning you've blown past 100 captcha pages), you're obviously a bot, and your packets get ignored completely for 24 hours. This protects the web server from a persistent attack from one IP without any manual intervention. And it drops server load, since iptables drops the packets before they ever hit the web server and PHP code.
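
Put together, the whole ladder looks roughly like this (a sketch with invented table names; the iptables call is illustrative, and the real thing also wipes the count when a captcha is solved):

```php
<?php
// Sketch of the item-page throttle. The firewall rule would also need to be
// removed again after 24 hours; that part is left out here.
function itemPageThrottle(mysqli $db, $ip) {
    $esc = $db->real_escape_string($ip);
    $db->query("INSERT INTO item_page_hits (ip, hit_time) VALUES ('$esc', NOW())");
    $hits = $db->query("SELECT COUNT(*) FROM item_page_hits
                        WHERE ip = '$esc'
                        AND hit_time > NOW() - INTERVAL 24 HOUR")->fetch_row();
    $hits = intval($hits[0]);

    if ($hits >= 200) {
        // Obviously a bot: drop its packets at the firewall so it never
        // reaches lighttpd or PHP again.
        exec('iptables -A INPUT -s ' . escapeshellarg($ip) . ' -j DROP');
        return 'banned';
    }
    if ($hits >= 100) {
        return 'captcha';  // serve the captcha page instead of item data
    }
    return 'ok';           // serve the item page normally
}
```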

If I had to describe my focus with The Undermine Journal in the past few months in one word, it's this: "automation." The site does so much now that managing it without all these scripts would be madness.

Category: Technical | Posted by: admin |

January 27, 2012: WoWInterface Comes Through

Yesterday I mentioned I'd remove the GE addon from WoWInterface. I sent the admins a PM asking for the addon to be removed (the method they request). We had a brief chat about why I wanted it removed, and they pointed me to this Java app. It's a command-line utility that can build zip files from version control and push them to WoWI, or push pre-built zips to the site.

A quick glance at the source code suggests it pretends to be a web browser rather than using some upload API from WoWI, but whatever; it works. The Undermine Journal GE will remain on WoW Interface and get updated automatically from now on, yay.

Category: Technical | Posted by: admin | 3 Comments

January 26, 2012: The Addon

We've had the GE ("General Edition") addon for quite a while. I update it once a week, and it's a chore. It's one of the few manual processes I have left on the site: I have to run the script to generate the average data, then go to both Curse and WoWI, log in, navigate to the update section, and upload the zip file through the web interface... it's a drag.

My project is different from most addons because I have weekly updates, not to the source code, but to the data. I don't need revision control because I don't have other authors, nor is the addon itself so complex that I'd benefit much from having old versions available.

A few months ago I looked into automating updates. Curse supports revision control as well as uploading through a script. I worked on an upload script but didn't get it working, for one reason or another. I blamed Curse, although now I think it may have been a bug in my code. WoW Interface has revision control for source code, but still forces you to log in and click some buttons to pop out a zip file.

Today I gave it another look, since I've not been timely with the GE updates in the new year. I found an issue with my Curse script, and fixed that. I made a cron job for it, and now it should update every week without any thought from me. WoW Interface still doesn't allow you to post zips automatically (which is no fault of theirs, because my project is a little weird) so I've asked to have my addon removed from there. It's not like I got that many hits from WoWI anyway. It was something like 60 hits per week. I'd get ten or fifteen times that from Curse.

Both the realm-specific addon and the market API have had steady success since release. They usually average around 400 hits a day each. Not bad for services that require a login.

Sidenote: Back when I accepted donations, I also had donation buttons on both the Curse and the WoW Interface addon pages. I got zero donations through those sites; all the donations I got were from the pages here.

Category: Technical | Posted by: admin |

October 04, 2011: Humble Pie, perhaps

I recently whined about antivirus products and their incompatibility with this site. I searched Google to confirm my suspicions, but everything told me that antivirus products that actually sniff traffic strip out the request header that advertises gzip support anyway. So it shouldn't be some weird gzip/chunked issue. I decided to look at my code path for non-gzip clients.

My initial method for sending gzipped content closely followed the "Other Method" listed here. Basically: use the ob_* functions to buffer the output, and when the page is complete, if the browser's headers indicate gzip is supported, gzip the contents of the buffer and push that to the browser. It was more involved than using a handler closer to the web server, but I liked knowing exactly when gzipped content would be sent.

However, reading the notes for ob_end_flush(), I saw that there may be multiple output buffers, and they should all be flushed in a loop. I didn't use nested ob_start()s, so it was probably unnecessary, but this uncertainty about the buffer behavior bothered me. Since zlib was already installed in my PHP environment on lighttpd, I decided to use ini_set to enable zlib.output_compression whenever I wanted output sent gzipped.
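
The change itself is anticlimactic; roughly this (a sketch of the direction, not the exact diff):

```php
<?php
// Before (simplified): buffer the whole page myself, then gzip it by hand
// if the browser said it could cope.
//   ob_start();
//   /* ... build the page ... */
//   if (isset($_SERVER['HTTP_ACCEPT_ENCODING'])
//       && strpos($_SERVER['HTTP_ACCEPT_ENCODING'], 'gzip') !== false) {
//       header('Content-Encoding: gzip');
//       echo gzencode(ob_get_clean());
//   } else {
//       ob_end_flush();
//   }

// After: let PHP's zlib handler negotiate compression with the client.
// This has to run before any output is sent.
ini_set('zlib.output_compression', '1');
```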

Long story short, I stopped hammering the screws in, and decided to use a screwdriver. Gzip compression is turned back on for clients that support it, output buffering (within my code) is disabled, and hopefully those issues won't reappear.

Category: Technical | Posted by: admin |

October 01, 2011: Stop using proxies and anti-virus products that suck

I've been getting more reports of pages getting cut off. The pages look like they're loading fine, then just stop for no reason before the entire page is loaded.

It turns out that shoddy anti-virus programs and proxies are usually to blame. They like to examine content before it hits the browser, and they get confused, probably when the server sends the combination of gzipped content encoding and chunked transfer encoding. They'll unzip the first chunk, approve it, ship it to the browser, and when the second chunk comes along, they forget where they were in the unzip and throw up their hands in despair. It's bullshit, because proper browsers that can handle these encodings (which speed up downloads, and thus response times) send the right headers and display the pages without a problem.

So I turned off gzip, at least until I figure out a way to sniff out these misbehaving tools and send uncompressed content only to them. The site will be slower for everyone because some people have poor software that doesn't play nice.

Category: Technical | Posted by: admin | 1 Comment

August 11, 2011: Curses, InnoDB!

My index "optimizations" to save space last week are causing more issues, despite my research. Decided to reindex the main auction table yet again, to be even more like how it was before I broke it. Hope I have enough space.

This time, I'll leave the site open, since reads will still be supported during reindexing. Updates will be queued until the reindexing is done.

Found out that the InnoDB Plugin is a drop-in replacement for the built-in InnoDB engine in MySQL, and supports fast index creation. Great, so this time I don't have to do that stupid table copy! For some reason, my version of MySQL (5.1.56 on Fedora 13) doesn't include the plugin, even though it should be there. Shiny. Had to track down an appropriate RPM for the MySQL I'm using and extract the plugin file from it. I was smart enough to try this on my dev server first. Worked there. Did the same in production. Worked again. Now I'm using the snazzy InnoDB plugin. Hooray.

So I go about dropping the item index and creating it anew. Oh, what's this? It's doing a table copy anyway. Yay, thanks for wasting 2 hours with promises that can't be kept! Ugh. I cancel that, realizing I don't have enough disk space for an entire table copy. Now I've scripted deleting the auction data older than 9 days. Hopefully that'll give me enough space for the new copy.
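
The pruning script is nothing fancy; something along these lines (table, column, and connection details are guesses, not the actual script):

```php
<?php
// Sketch: prune old auction rows in small batches so the delete doesn't hold
// locks for hours while the site is still serving queries.
$db = new mysqli('localhost', 'tuj', 'secret', 'auctions');
do {
    $db->query('DELETE FROM auction_history
                WHERE last_seen < NOW() - INTERVAL 9 DAY
                LIMIT 10000');
    $deleted = $db->affected_rows;
    usleep(250000); // brief pause so other queries can get a word in
} while ($deleted > 0);
```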

The site should stay up during the change. Updates will be iffy. The end result is that this morning's laggy searches, which were causing max_user_connections overflows, should stop.

I also wrote and am running a script that checks every minute for web queries that last longer than 90 seconds, and kills them. Usually one query will get stuck, and lock enough tables/records (on a read? shrug.) to make everyone else wait. Killing those queries is not the nicest thing to do, but at least it doesn't cause problems for everyone else for hours.
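
That watchdog is only a handful of lines. Roughly (the frontend user name and connection details are assumptions):

```php
<?php
// Sketch of the query killer: find web-frontend queries that have been
// running longer than 90 seconds and kill them. Needs a MySQL account with
// PROCESS and SUPER privileges.
$db = new mysqli('localhost', 'watchdog', 'secret');
$list = $db->query('SHOW FULL PROCESSLIST');
while ($p = $list->fetch_assoc()) {
    if ($p['User'] === 'webuser'      // only the web frontend's connections
        && $p['Command'] === 'Query'
        && $p['Time'] > 90) {
        $db->query('KILL ' . intval($p['Id']));
    }
}
```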

Category: Technical | Posted by: admin |