2. What is Wordnik? Project to track language, like GPS for English. Dictionary is a road block to the language. Roughly 200 new words created daily. Language is not static. Capture information about all words. Meaning is often undefined in the traditional sense. Machines can determine meaning through analysis. Needs LOTS of data.
3. Why should you care? Every developer can use a robust language API! Wordnik migrated to MongoDB: > 5 billion documents, > 1.2 TB, zero application downtime. Learn from our experience.
4. Wordnik. Not just a website! (But we have one.) Launched Wordnik entirely on MySQL. Hit road bumps with insert speed at ~4B rows on MyISAM tables. Tables locked for tens of seconds during inserts. But we need more data! Created elaborate update schemes to work around it. Lost lots of sleep babysitting servers while researching a long-term solution.
5. Wordnik + MongoDB. What are our storage needs? Database vs. application logic: no PK/FK constraints, no stored procedures. Consistency? Lots of R&D. Tried most all NoSQL solutions.
6. Migrating Storage Engines. Many parts to this effort: setup & administration, software design, optimization. Many types of data at Wordnik: (1) corpus, (2) structured hierarchical data, (3) user data. Migrated #1 & #2.
7. Server Infrastructure. Wordnik is heavily read-only. Master / slave deployment. Looking at replica pairs. MongoDB loves system resources. Wordnik runs dedicated boxes to avoid other apps being sent to disk (aka time-out). Memory + disk = happy Mongo. Many X the disk space of MySQL. Easy pill to swallow until…
8. Server Infrastructure. Physical hardware: 2 x 4-core CPU, 32 GB RAM, FC SAN. Had bad luck on VMs (you might not). Disk speed => performance.
9. Software Design. Two distinct use cases for MongoDB. (1) Identical structure, different storage engine: same underlying objects, same storage fidelity (largely key/value). (2) Hierarchical data structure: same underlying objects, document-oriented storage.
10. Software Design. Created BasicDBObjects from POJOs and used collection methods: BasicDBObject dbo = new BasicDBObject("sentence", s.getSentence()).append("rating", s.getRating()).append(...); ID generation to manage unique _id values: analogous to MySQL AutoIncrement behavior, compatible with MySQL IDs (more later): dbo.append("_id", getId()); collection.save(dbo); Implemented all CRUD methods in DAO. Swappable between MongoDB and MySQL at runtime.
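The ID generation mentioned on this slide can be pictured as a counter seeded from the highest ID MySQL has already issued, so new IDs never collide with migrated ones. This is a minimal sketch under that assumption, not Wordnik's generator: the class name `SequentialIdGenerator` and its seed parameter are hypothetical, and only `getId()` is a name taken from the slide.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the getId() call shown on the slide.
final class SequentialIdGenerator {
    private final AtomicLong last;

    // lastIssuedId: highest ID already in use (assumed to come from MySQL's AUTO_INCREMENT column).
    SequentialIdGenerator(long lastIssuedId) {
        this.last = new AtomicLong(lastIssuedId);
    }

    // Returns the next unique ID; safe to call from multiple threads in one process.
    long getId() {
        return last.incrementAndGet();
    }
}
```

A single in-process counter only works while one process issues IDs; several writers would need a shared source of IDs, which the slides do not describe.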
11. Software Design. Key-value storage use case. Easy as implementing new DAOs: SentenceHandler h = new MongoDbSentenceHandler(); Save methods construct a BasicDBObject and call save() on the collection. Implement the same interface: same methods against the DAO between MySQL and MongoDB versions. Data Abstraction 101.
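The "same interface, different storage engine" idea can be sketched as below. Only the `SentenceHandler` name and the `sentence` and `rating` fields come from the slides; the method signatures, the `Sentence` class shape, and the in-memory implementation are assumptions made so the example runs without a database.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal POJO; field names follow the "sentence" and "rating" keys on the slide.
final class Sentence {
    final long id;
    final String sentence;
    final int rating;

    Sentence(long id, String sentence, int rating) {
        this.id = id;
        this.sentence = sentence;
        this.rating = rating;
    }
}

// One interface, implemented once per storage engine (method signatures are assumed).
interface SentenceHandler {
    void save(Sentence s);
    Sentence find(long id);
}

// In-memory stand-in. A MongoDB-backed version would instead build a
// BasicDBObject from the POJO and call save() on the collection.
final class InMemorySentenceHandler implements SentenceHandler {
    private final Map<Long, Sentence> store = new HashMap<>();

    @Override
    public void save(Sentence s) {
        store.put(s.id, s);
    }

    @Override
    public Sentence find(long id) {
        return store.get(id);
    }
}
```

Callers that depend only on `SentenceHandler` do not change when the implementation behind it is swapped.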
12. Software Design. What about bulk inserts? Fire-and-forget (FAF) queued approach: add objects to a queue, return to caller. Every X seconds, process the queue. All objects from the same collection are appended to a single List<DBObject>. Call collection.insert(…) before 2M characters. Reduces network overhead. Very fast inserts.
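The queued approach above might look roughly like the following. This is a sketch under stated assumptions, not Wordnik's code: documents are represented as already-serialized strings so the character budget is easy to measure, the `BiConsumer` stands in for `collection.insert(list)`, and `BulkInsertQueue`, `flush`, and `start` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

final class BulkInsertQueue {
    // One queued write: target collection name plus the serialized document.
    private static final class Pending {
        final String collection;
        final String doc;
        Pending(String collection, String doc) {
            this.collection = collection;
            this.doc = doc;
        }
    }

    private final ConcurrentLinkedQueue<Pending> queue = new ConcurrentLinkedQueue<>();
    private final BiConsumer<String, List<String>> inserter; // stand-in for collection.insert(list)
    private final int maxCharsPerBatch;

    BulkInsertQueue(BiConsumer<String, List<String>> inserter, int maxCharsPerBatch) {
        this.inserter = inserter;
        this.maxCharsPerBatch = maxCharsPerBatch;
    }

    // Fire and forget: enqueue and return to the caller immediately.
    void add(String collection, String doc) {
        queue.add(new Pending(collection, doc));
    }

    // Drain the queue, group documents by collection, and insert each group
    // in batches that stay within the character budget.
    synchronized void flush() {
        Map<String, List<String>> byCollection = new LinkedHashMap<>();
        Pending p;
        while ((p = queue.poll()) != null) {
            byCollection.computeIfAbsent(p.collection, k -> new ArrayList<>()).add(p.doc);
        }
        for (Map.Entry<String, List<String>> e : byCollection.entrySet()) {
            List<String> batch = new ArrayList<>();
            int chars = 0;
            for (String doc : e.getValue()) {
                if (!batch.isEmpty() && chars + doc.length() > maxCharsPerBatch) {
                    inserter.accept(e.getKey(), batch);
                    batch = new ArrayList<>();
                    chars = 0;
                }
                batch.add(doc);
                chars += doc.length();
            }
            if (!batch.isEmpty()) {
                inserter.accept(e.getKey(), batch);
            }
        }
    }

    // Process the queue every periodSeconds on a background thread.
    ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleAtFixedRate(this::flush, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return ses;
    }
}
```

The trade-off of fire-and-forget is that the caller never learns whether a write succeeded, and anything still queued is lost if the process dies before the next flush.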
13. Software Design. Hierarchical data done more elegantly. Wordnik dictionary model: Java POJOs already had JAXB annotations; part of public REST API. Used MySQL: 12+ tables, 13 DAOs, 2500 lines of code, 50 requests/second uncached. Memcache needed to maintain reasonable speed.
15. Software Design. MongoDB's document storage let us… Turn the objects into JSON via Jackson Mapper (fasterxml.com). Call save. Support all fetch types, enhanced filters. 1000 requests/second. No explicit caching. No less scary code.
17. Migrating Data. Migrating => existing data logic. Use logic to select DAOs appropriately. Read from old, write with new. Great system test for MongoDB. SentenceHandler mysqlSh = new MySQLSentenceHandler(); SentenceHandler mongoSh = new MongoDbSentenceHandler(); while (hasMoreData) { mongoSh.asyncWrite(mysqlSh.next()); ... }
18. Migrating Data. Wordnik moved 5 billion rows from MySQL. Sustained 100,000 inserts/second. Migration tool was CPU bound (ID generation logic, among others). Wordnik reads MongoDB fast: read + create Java objects @ 250k/second (!)
19. Going Live to Production. Choose your use case carefully if migrating incrementally. Scary no matter what. Test your perf monitoring system first! Use your DAOs from migration. Turn on MongoDB on one server, monitor, tune (rollback, repeat). Full switchover when comfortable.
20. Going Live to Production. Really? SentenceHandler h = null; if (useMongoDb) { h = new MongoDbSentenceHandler(); } else { h = new MySQLSentenceHandler(); } return h.find(...);
21. Optimizing Performance. Home-grown connection pooling. Master only: ConnectionManager.getReadWriteConnection(). Slave only: ConnectionManager.getReadOnlyConnection(). Round-robin all servers, bias on slaves: ConnectionManager.getConnection().
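One way to get "round-robin all servers, bias on slaves" is a rotation list in which each slave appears more often than the master. The sketch below assumes that approach; the slides do not say how Wordnik implemented the bias. Server names stand in for real connections, the methods are instance methods rather than the static calls on the slide, and the `slaveWeight` parameter is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class ConnectionManager {
    private final String master;
    private final List<String> slaves;
    private final List<String> rotation = new ArrayList<>();
    private final AtomicInteger slaveCursor = new AtomicInteger();
    private final AtomicInteger anyCursor = new AtomicInteger();

    // slaveWeight: how many rotation slots each slave gets for every one master slot.
    ConnectionManager(String master, List<String> slaves, int slaveWeight) {
        if (slaves.isEmpty() || slaveWeight < 1) {
            throw new IllegalArgumentException("need at least one slave and slaveWeight >= 1");
        }
        this.master = master;
        this.slaves = new ArrayList<>(slaves);
        rotation.add(master);
        for (String s : slaves) {
            for (int i = 0; i < slaveWeight; i++) {
                rotation.add(s);
            }
        }
    }

    // Master only: the only server that accepts writes.
    String getReadWriteConnection() {
        return master;
    }

    // Slave only: plain round-robin across the slaves.
    String getReadOnlyConnection() {
        return slaves.get(next(slaveCursor, slaves.size()));
    }

    // Any server: round-robin over the weighted rotation, so slaves are picked more often.
    String getConnection() {
        return rotation.get(next(anyCursor, rotation.size()));
    }

    // floorMod keeps the index valid even after the counter overflows.
    private static int next(AtomicInteger cursor, int size) {
        return Math.floorMod(cursor.getAndIncrement(), size);
    }
}
```

With one master, two slaves, and a weight of 2, the master serves one in five `getConnection()` calls.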
22. Optimizing Performance. Caching: had complex logic to handle cache invalidation; out-of-process caches are not free; MongoDB loves your RAM; let it do your LRU cache (it will anyway). Hardware: do not skimp on your disk or RAM. Indexes: schema-less design; even if no values in any document, needs to read document schema to check.
23. Optimizing Performance. Disk space: schemaless => schema per document (row). Choose your mappings wisely: ({veryLongAttributeName:true}) => more disk space than ({vlan:true}).
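The cost of a long attribute name can be estimated with simple arithmetic: BSON stores each field name inside every document as its UTF-8 bytes plus one terminating NUL byte. The sketch below applies that to the two names on the slide. The class and method names are hypothetical, and the 5 billion figure is used only as an illustration of a field that appears in every document; the estimate ignores padding and index overhead.

```java
import java.nio.charset.StandardCharsets;

final class FieldNameCost {
    // Bytes one field name occupies in a single BSON document:
    // its UTF-8 encoding plus one terminating NUL byte.
    static long bytesPerDocument(String fieldName) {
        return fieldName.getBytes(StandardCharsets.UTF_8).length + 1L;
    }

    // Total bytes saved by renaming a field that appears once in every document.
    static long bytesSaved(String longName, String shortName, long documents) {
        return (bytesPerDocument(longName) - bytesPerDocument(shortName)) * documents;
    }

    public static void main(String[] args) {
        long saved = bytesSaved("veryLongAttributeName", "vlan", 5_000_000_000L);
        System.out.println(saved + " bytes"); // 17 bytes per document * 5 billion documents
    }
}
```

At 17 bytes per document, the shorter name saves 85,000,000,000 bytes (about 85 GB) across 5 billion documents, for a single boolean field.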
25. Other Tips. Data types: use caution when changing. DBObject obj = cur.next(); long id = (Long) obj.get("IWasAnIntOnce"); Attribute names: don't change w/o migrating existing data! WTFDMDG????
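The cast on this slide throws a `ClassCastException` for any document where the field was stored as an `Integer`, because an `Integer` cannot be cast to `Long`. A defensive read goes through `Number` instead. The sketch below uses a plain `Map` as a stand-in for `DBObject` so it runs without the driver; `NumericFields` and `getLong` are hypothetical names.

```java
import java.util.Map;

final class NumericFields {
    // Reads a numeric field as a long whether it was stored as an Integer or a Long.
    // The Map parameter stands in for a DBObject, whose get(String) also returns Object.
    static long getLong(Map<String, Object> doc, String field) {
        Object value = doc.get(field);
        if (!(value instanceof Number)) {
            throw new IllegalStateException(
                "Field '" + field + "' is missing or not numeric: " + value);
        }
        return ((Number) value).longValue();
    }
}
```

This tolerates mixed old and new documents during a type change; it does not replace migrating the existing data.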
26. What's next? GridFS: store audio files on disk; requires clustered file system for shared access. Capped collections (rolling out this week). UGC from MySQL => MongoDB. Beg/bribe 10gen for some features.