Sunday, April 17, 2011

The Cost of Small Files in the Cloud

I've been working with a friend on a little project that generates a mound of small files. Specifically, we'd like to serve 12 million files, each under 4 KB, with an average size of 500 bytes. That's only 6 GB in total.

I thought Amazon S3 might be a reasonable way to do it, so I did the math. I'm not expecting the project to get a lot of usage, so initially I only cared about the cost of getting the files out there. Amazon S3 charges $0.01 per 1,000 PUTs, $0.14/GB/month for storage, and $0.10/GB for inbound bandwidth. For one month, that's 12,000,000/1,000 × $0.01 + $0.14 × 6 + $0.10 × 6 = $121.44, of which $120 comes from the cost of PUTs alone. That quickly dissuaded me from considering S3.

Then I remembered that Rackspace has a cloud file storage product, too. Rackspace's Cloud Files pricing is based only on bytes stored and incoming and outgoing bandwidth. My 6 GB would cost $0.15/GB/month to store and $0.08/GB to upload, for a total of $1.38. Now we're talking.
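
Here's a quick sanity check of that arithmetic in JavaScript (prices are the April 2011 rates quoted above; the variable names are just mine):

// Month-one costs for 12 million small files totaling 6 GB.
var files = 12000000;
var gigabytes = 6;

var s3 = (files / 1000) * 0.01   // $0.01 per 1,000 PUTs
       + gigabytes * 0.14        // $0.14/GB/month storage
       + gigabytes * 0.10;       // $0.10/GB inbound bandwidth

var cloudFiles = gigabytes * 0.15    // $0.15/GB/month storage
               + gigabytes * 0.08;   // $0.08/GB inbound bandwidth

console.log("S3: $" + s3.toFixed(2));                  // S3: $121.44
console.log("Cloud Files: $" + cloudFiles.toFixed(2)); // Cloud Files: $1.38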

I was a little dubious about the pricing, so I contacted Rackspace to double-check my numbers. The rep told me I was doing the math right and confirmed that you pay only for the reported file size, not the size rounded up to the nearest block, and not HTTP header overhead.

If you want to store a bunch of small files in the cloud, the clear cost winner is Rackspace.


How HTML5 geolocation works in Firefox, Chrome, and Internet Explorer

HTML5 introduces (will introduce? will have introduced once it's approved?) the Geolocation API, which enables a web page to ask your browser to ask you whether the page can be told where on earth you are. For example, when composing a tweet, Twitter includes an "Add Your Location" link that uses this new API.
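
On the page side, the asking part is simple. Here's a minimal sketch using the standard API (the logging is mine; getCurrentPosition and its success and error callbacks are straight from the spec):

// Ask the browser (which in turn asks the user) for the current position.
if (navigator.geolocation) {
    navigator.geolocation.getCurrentPosition(
        function (position) {
            // Runs only after the user grants permission.
            console.log("lat: " + position.coords.latitude +
                        ", lon: " + position.coords.longitude +
                        ", accuracy (m): " + position.coords.accuracy);
        },
        function (error) {
            // error.code is PERMISSION_DENIED, POSITION_UNAVAILABLE, or TIMEOUT.
            console.log("geolocation failed: " + error.message);
        }
    );
}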

Of course, your web browser doesn't just know where you are; it has to ask something else. On a device like a mobile phone with a GPS receiver, the operating system can tell it. However, on a notebook without GPS, your browser's not so lucky. Enter the mighty power of Google. In this circumstance, both Firefox and Chrome fire off a request to http://google.com/lat/json. If your computer has a WiFi device turned on at the time, they send along the network name (SSID), hardware ID (MAC address), and signal strength of every WiFi access point that your little lappy can see. When I was sitting at work, the data sent looked like this:

{
    "version": "1.1.0",
    "request_address": true,
    "wifi_towers": [{
        "mac_address": "xx-xx-xx-xx-xx-01",
        "ssid": "XXXXXXXXXXXX",
        "signal_strength": -81
    }, {
        "mac_address": "xx-xx-xx-xx-xx-00",
        "ssid": "XXXXXXXX",
        "signal_strength": -83
    }, {
        "mac_address": "xx-xx-xx-xx-xx-02",
        "ssid": "XXXXXXXXXXX",
        "signal_strength": -83
    }, {
        "mac_address": "xx-xx-xx-xx-xx-30",
        "ssid": "XXXXXXXX",
        "signal_strength": -62
    }, {
        "mac_address": "xx-xx-xx-xx-xx-62",
        "ssid": "XXXXXXXXXXX",
        "signal_strength": -82
    }, {
        "mac_address": "xx-xx-xx-xx-xx-a0",
        "ssid": "XXXXXXXX",
        "signal_strength": -75
    }, {
        "mac_address": "xx-xx-xx-xx-xx-a1",
        "ssid": "XXXXXXXXXXXX",
        "signal_strength": -74
    }, {
        "mac_address": "xx-xx-xx-xx-xx-61",
        "ssid": "XXXXXXXXXXXX",
        "signal_strength": -81
    }, {
        "mac_address": "xx-xx-xx-xx-xx-a2",
        "ssid": "XXXXXXXXXXX",
        "signal_strength": -75
    }, {
        "mac_address": "xx-xx-xx-xx-xx-32",
        "ssid": "XXXXXXXXXXX",
        "signal_strength": -57
    }, {
        "mac_address": "xx-xx-xx-xx-xx-31",
        "ssid": "XXXXXXXXXXXX",
        "signal_strength": -65
    }, {
        "mac_address": "xx-xx-xx-xx-xx-58",
        "ssid": "XXXXXXXX",
        "signal_strength": -76
    }, {
        "mac_address": "xx-xx-xx-xx-xx-60",
        "ssid": "XXXXXXXX",
        "signal_strength": -75
    }]
}
Google then ships back coordinates for where they think you are based on that WiFi data. The response also includes city, state, street address, and an estimate of accuracy.
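
A response looks roughly like the following (the coordinates and address values here are invented for illustration; the field names follow the same protocol version as the request above, as far as I can tell):

{
    "location": {
        "latitude": 40.123456,
        "longitude": -88.123456,
        "accuracy": 30.0,
        "address": {
            "country": "United States",
            "country_code": "US",
            "region": "Illinois",
            "city": "Anytown",
            "street": "Main St",
            "street_number": "123",
            "postal_code": "61801"
        }
    }
}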

When I was at work, their guess was spooky accurate. Not only did the Big G put me in the right building, but they also correctly discerned that I was on the south side of that (not terribly large) building.

I also gave it a go from home. There the accuracy was a little less impressive, but still pretty good. Google's guess was off by about 250 meters, but that's probably because I live in a relatively newly built-up part of a subdivision. Google put me on the side of the neighborhood that's been around for a few more years.

If your WiFi adapter is turned off, you're not in range of any networks, or your PC doesn't have WiFi, then the request to google.com/lat/json contains an empty list for the WiFi data, and Google appears to just use your IP address to determine where you are, returning a city-accurate location. The reported accuracy was 22 km when I tried it.
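
In that case, the request body is just the empty shell of the one shown above:

{
    "version": "1.1.0",
    "request_address": true,
    "wifi_towers": []
}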

Internet Explorer 9 uses a similar trick; when asked for geolocation, it sends a request containing WiFi data to https://inference.location.live.net/inferenceservice/v21/Pox/GetLocationUsingFingerprint. Interestingly, IE's request doesn't include network names. Microsoft's API is much uglier, relying on XML written in that pedantic way that gives XML a bad name, filled with disgustingly repetitive xmlns junk. In my very small sample size, Microsoft's data was also much lower quality. From home, my location was misjudged by about 5 km with a reported uncertainty of 16 km. At work, Microsoft just gave up and said it didn't know where I was. I suspect MS simply fell back on a geo-IP lookup for the location it returned when I was at home: when I removed all of the WiFi data from the request and resubmitted it, the returned location was identical.

Tuesday, April 05, 2011

11 Buggy Disappointments in MongoDB

I've been exploring MongoDB lately, primarily for its map-reduce functionality. I've found a few shortcomings that I'm not crazy about.
  1. If you dump (i.e. export to BSON format) a document containing an array with undefined elements, then when you import that data, the undefined elements are lost, changing the indexes of the values. For example, take a document containing an array member created like
    var a = [];
    a[0] = "a";
    a[2] = "c";

    when imported, the array comes back as ["a", "c"], with "c" moved from index 2 to index 1.
  2. If you export data from a sharded collection, only data from one of the shards is actually output.
  3. I've had strange, inconsistent problems where mapReduce would fail with no reasonable explanation after importing a large amount (~10GB) of data into a sharded collection.  Bouncing the cluster's mongods has fixed the problems.
  4. Mongo's db.eval() is much more dangerous than the documentation lets on. If you do something moderately dumb in there that uses a lot of memory, you can cause the server to run out of memory, which causes it to segfault and leaves the data store in a state that requires recovery.
  5. The JavaScript map and reduce functions passed to mapReduce don't pull in variables or functions from their enclosing scope as one would normally expect of JavaScript. Mongo's solution is the scope argument passed to mapReduce. Unfortunately, scope can't be used to pass functions. As far as I can tell, the Mongo approach is to add such functions to the db.system.js collection, which is a pretty poor solution because it hinders code maintenance (see the sketch after this list).
  6. For reasons I can't explain, I've had mapReduce jobs fail a couple of times after running for hours, going entirely through both the map and reduce phases, then reporting that a function defined in db.system.js was undefined. If it was undefined, it should've been reported as undefined before it was called 50 million times. The same mapReduce jobs ran successfully on smaller samples of data in unsharded collections.
  7. Mongo's mapReduce is single threaded. (Shudder.) No, really. No matter how many mapReduce jobs you throw at it at once, CPU utilization will hover in the 100% range. If you want Mongo to use more than one core for mapReduce, you need multiple shards on the same box, but in practice that's a rather ineffective approach; Mongo doesn't always shard when I want it to shard, even if I set the chunk size stupidly low.
  8. If you do shard to try to get multiple cores running mapReduce, you may find that the primary eats all the memory, starving the secondary shard servers.
  9. The VM I was playing with Mongo in had only a 100 GB partition. I ran quite a few different mapReduce jobs which output into different collections, eating up lots of disk space. One of the mapReduce jobs hung. I realized that the partition had less than 2 GB of free space left, so Mongo couldn't allocate another data file. No message was returned to the client explaining this, unfortunately.
  10. I went around removing all the documents from collections and then dropping them, trying to free disk space. For whatever reason, Mongo didn't actually free any disk space after I did that. I tried bouncing the cluster to see if that would allow the servers to reclaim the freed space to no avail. Although frustrating, that makes sense given that the Mongo docs say, "This [deleted] space is reused but never freed to the operating system."
  11. The Mongo docs say that if you want it to return the freed space to the operating system, you should repair the database. So I tried
    db.repairDatabase()
    Unfortunately, if your disk is full-ish, Mongo won't have space to repair. When I tried, I got the following error.
    Cannot repair database X having size: 21415067648 (bytes) because free disk space is: 887955456 (bytes)
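
To illustrate point 5, here's a sketch of the function workaround (the collection, function, and variable names are hypothetical): plain values can ride along in scope, but any helper function has to be stashed in db.system.js first.

// Store the helper server-side, since scope can't pass functions.
db.system.js.save({
    _id: "normalizeCategory",
    value: function (s) { return s.toLowerCase(); }
});

db.purchases.mapReduce(
    function () {
        // normalizeCategory resolves from db.system.js; taxRate comes from scope.
        emit(normalizeCategory(this.category), this.amount * taxRate);
    },
    function (key, values) { return Array.sum(values); },
    {
        out: "totals_by_category",
        scope: { taxRate: 1.07 }  // plain values only; functions don't work here
    }
);
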
There's still plenty to like in Mongo, but at this point, I feel like Mongo's mapReduce functionality is better suited to running queries which are too big to fit in memory, rather than serious data crunching. Perhaps my difficulties have been due to getting sharding involved with mapReduce.  It's also possible I've made a crucial mistake in configuring sharding, but I think I followed the directions pretty closely.


Comments in the PHP source code for calling a user-defined function

A co-worker asked me if it was possible to get PHP 5.3 to search other namespaces when it failed to find a function in the global namespace, so we went off to the spot in the PHP source code where the virtual machine calls user-defined functions. We found the following comments:
In turn, my colleague quipped, "PHP 5.3? More like PHP 0.5."

Monday, April 04, 2011

Proud Moment

The help system on my wife's smartphone uses the product I worked on for a few years. I never would've known it if somebody hadn't pointed it out to me.
