Tuesday, April 05, 2011


11 Buggy Disappointments in MongoDB

Lately I've been exploring MongoDB, primarily for its map-reduce functionality, and I've run into a number of shortcomings that I'm not crazy about.
  1. If you dump (i.e. export to BSON format) a document containing an array with undefined elements, then when you import that data, the undefined elements are lost, shifting the indexes of the values that follow them. For example, a document containing a member created like
    var a = [];
    a[0] = "a";
    a[2] = "c";

    when imported will result in the array ["a", "c"], with "c" moved from index 2 to index 1.
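The index shift is easy to reproduce in plain JavaScript. This is a stand-in simulation: the filter() call below mimics what the dump/import round trip does to the sparse array; it doesn't invoke the actual mongodump/mongorestore tools.

```javascript
// Plain-JavaScript simulation of the lossy round trip described in item 1.
// The filter() call stands in for the dump/restore step, which silently
// drops the undefined (hole) elements of a sparse array.
var a = [];
a[0] = "a";
a[2] = "c";            // a is a sparse array of length 3; "c" is at index 2

var roundTripped = a.filter(function (x) { return x !== undefined; });

console.log(a.length);                   // 3
console.log(roundTripped);               // [ 'a', 'c' ]
console.log(roundTripped.indexOf("c"));  // 1 -- "c" has shifted down a slot
```

Any code that relied on "c" living at index 2 is now quietly wrong after a restore.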
  2. If you export data from a sharded collection, only data from one of the shards is actually output.
  3. I've had strange, inconsistent problems where mapReduce would fail with no reasonable explanation after importing a large amount (~10GB) of data into a sharded collection.  Bouncing the cluster's mongods has fixed the problems.
  4. Mongo's db.eval() is much more dangerous than the documentation lets on. If you do something moderately dumb in there that uses a lot of memory, you can make the server run out of memory, at which point it segfaults and leaves the data store in a state that requires recovery.
  5. The JavaScript map and reduce functions passed to mapReduce don't pull in variables or functions from their enclosing scope, as one would normally expect of JavaScript. Mongo's solution is the scope argument to mapReduce. Unfortunately, scope can't be used to pass functions. As far as I can tell, the Mongo approach is to add such functions to the db.system.js collection, which is a pretty poor solution because it hinders code maintenance.
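The closure loss in item 5 can be reproduced outside Mongo: the server receives only the source text of the map and reduce functions, so anything they captured from the enclosing scope is gone when that text is re-evaluated. A plain-JavaScript sketch — `new Function` stands in for the server-side re-evaluation, and `rebuildWithScope` is my own illustrative helper, not a Mongo API:

```javascript
// A value the map function would like to close over:
const factor = 10;
function timesFactor(x) { return x * factor; }   // works locally via closure

// Mongo ships only the function's *source text* to the server.
// Re-evaluating that text with new Function discards the closure,
// which is roughly what happens server-side:
const rebuilt = new Function("x", "return x * factor;");

let lostClosure = false;
try {
  rebuilt(2);
} catch (e) {
  lostClosure = true;   // ReferenceError: factor is not defined
}

// The scope option injects plain *values* by name. That recovers
// variables, but there is no comparable channel for functions:
function rebuildWithScope(src, scope) {
  const names = Object.keys(scope);
  const values = names.map(function (k) { return scope[k]; });
  return new Function(names.join(","), "return (" + src + ");")
           .apply(null, values);
}
const scoped = rebuildWithScope("function (x) { return x * factor; }",
                                { factor: factor });

console.log(timesFactor(2));  // 20
console.log(lostClosure);     // true -- the rebuilt copy lost its closure
console.log(scoped(2));       // 20 -- the scope-injected value works again
```

This is why passing scope: { factor: 10 } to mapReduce works, while any helper functions have to go into db.system.js instead.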
  6. For reasons I can't explain, I've had mapReduce jobs fail a couple of times after running for hours, getting entirely through both the map and reduce phases, and then reporting that a function defined in db.system.js was undefined. If it was undefined, it should have been reported as undefined before it was called 50 million times. The same mapReduce jobs ran successfully on smaller samples of data in unsharded collections.
  7. Mongo's mapReduce is single threaded. *Shudder.* No, really. No matter how many mapReduce jobs you throw at it at once, CPU utilization will hover around 100% of a single core. If you want Mongo to use more than one core for mapReduce, you need multiple shards on the same box, but in practice that's a rather ineffective approach; Mongo doesn't always shard when I want it to, even if I set the chunk size stupidly low.
  8. If you do shard to try to get multiple cores running mapReduce, you may find that the primary eats all the memory, starving the secondary shard servers.
  9. The VM I was playing with Mongo in had only a 100 GB partition. I ran quite a few different mapReduce jobs that output into different collections, eating up lots of disk space. One of the mapReduce jobs hung. I realized that the partition had less than 2 GB of free space left, so Mongo couldn't preallocate another data file. Unfortunately, no message explaining this was returned to the client.
  10. I went around removing all the documents from collections and then dropping them, trying to free disk space. For whatever reason, Mongo didn't actually free any disk space after I did that. I tried bouncing the cluster to see if that would allow the servers to reclaim the freed space to no avail. Although frustrating, that makes sense given that the Mongo docs say, "This [deleted] space is reused but never freed to the operating system."
  11. The Mongo docs say that if you want freed space returned to the operating system, you should repair the DB. So I tried
    db.repairDatabase()
    Unfortunately, repairing requires enough free disk space to hold a rebuilt copy of the database, so if your disk is nearly full, the repair can't run. When I tried, I got the following error.
    Cannot repair database X having size: 21415067648 (bytes) because free disk space is: 887955456 (bytes)
There's still plenty to like in Mongo, but at this point, I feel like Mongo's mapReduce functionality is better suited to running queries that are too big to fit in memory than to serious data crunching. Perhaps my difficulties have been due to getting sharding involved with mapReduce. It's also possible I've made a crucial mistake in configuring sharding, but I think I followed the directions pretty closely.


