3. Background
+Traditional Applications
Limited Data
Top priority on consistency
Focus on average latency
Ideal fit for an RDBMS
Made good use of the DB's intrinsic features
A good part of the logic resided in the DB
+Next Gen Applications
Web Scale (~infinite)
ALWAYS available
High performance in ALL cases
Data in the form of key/value pairs
Logic lives in the application layer
4. RDBMS with Nextgen Apps – Failure
+Scale
Upper limit on the data volume supported
Sharding is an option, but then RDBMS features are lost
+Economy
Requires large arrays of fast, expensive disks
Very expensive
+Availability still an issue
5. NoSQL Databases
+Name is confusing
These are not RDBMSs at all
"NoREL" (non-relational) databases would be a better name
+Key Value Store
+Extremely scalable
+High performance
+Always available
+Weak Consistency (CAP Theorem)
+Distributed
Use commodity hardware - Cheap
+Might not hold ACID properties
+Only for specific use cases – not a good fit for everything
6. RDBMS vs NoSQL Databases
+Go for RDBMS when
Small instances of simple, straightforward systems
You need joins, secondary indexing, referential integrity, group by/order by
+Go for NoSQL when
Data volume must scale
Read/write throughput must scale
Data model is
Flexible
Semi-structured
8. Some famous NoSQL Databases
+Open-source
HBase
Cassandra
Voldemort
Dynomite
Hypertable
CouchDB
VPork (a load-testing tool for Voldemort rather than a database)
MongoDB
Riak
+Closed-source
BigTable
Dynamo
PNUTS
9. HBase
+Based on Google BigTable
+Sparse, distributed, persistent, multi-dimensional sorted map
+On top of Hadoop HDFS
+Master Slave Model
Single Master (SPOF)
+Especially good when
Objects are huge
Data production/consumption is distributed and is tunneled through map/reduce jobs
+Loose Data Model
Column Families
+Timestamp based versioning
+Not supported on Windows
+Major Users – Adobe, Twitter, Yahoo, Veoh, Streamy, Trend Micro
10. HBase Architecture & Table Structure
+Partitioned by row-key range (ordered regions rather than a hash ring)
+Table made up of regions
Region specified by startkey and endkey
Each region may live on a different node
+Tables sorted by row key
+Schema defines column families only
Each family consists of any number of columns
Each column consists of any number of versions
Columns within a family are sorted & stored together
+Everything except the table name is a byte[]
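The table structure above (rows sorted by key, columns grouped under families, several timestamped versions per column, regions bounded by a start key and an end key) can be pictured as nested sorted maps. The sketch below is only an illustration of that logical model, not HBase code: the class name `SortedCellMap` and its methods are hypothetical, and it uses `String` keys and values for readability where HBase itself uses byte[].

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy in-memory model of one table: row -> "family:qualifier" -> timestamp -> value. */
class SortedCellMap {
    // Outer map keeps rows sorted by row key; column names are sorted within each row.
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    /** Store one version of a cell. Versions are kept newest-first. */
    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    /** Latest version of a cell, or null if the row or column is absent (the map is sparse). */
    String get(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, String> versions = columns.get(column);
        return (versions == null || versions.isEmpty()) ? null : versions.firstEntry().getValue();
    }

    /** Row keys in [startKey, endKey), in sorted order, mirroring a region's key range. */
    Iterable<String> scan(String startKey, String endKey) {
        return rows.subMap(startKey, true, endKey, false).keySet();
    }
}
```

Because rows are kept in key order, a range scan is a cheap sub-map view, which is the property that lets a table be cut into contiguous regions.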
11. Connecting to HBase
+Java Client API
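import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
// Client API of the HBase release current when these slides were written
HBaseConfiguration config = new HBaseConfiguration();
HTable table = new HTable(config, "table_name");
// Write one cell: row "key", column "columnfamily:columnname"
Put p = new Put(Bytes.toBytes("key"));
p.add(Bytes.toBytes("columnfamily"), Bytes.toBytes("columnname"), Bytes.toBytes("value"));
table.put(p);
// Read the row back
Get g = new Get(Bytes.toBytes("key"));
Result r = table.get(g);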
+HBase Shell
$ ${HBASE_HOME}/bin/hbase shell
hbase> describe 'table_name'
hbase> put 'table_name', 'key', 'columnfamily:columnname', 'value'
hbase> get 'table_name', 'key'
hbase> scan 'table_name'
+Thrift Gateway
+REST Gateway
+Many other non-Java clients
12. Cassandra
+Based on Amazon Dynamo
+Open sourced by Facebook in 2008
+Peer to Peer Model
No Master Node
+Works on Windows as well
+Distributed Key/Value Store
+Configurable parameters for Consistency/Availability
+Especially suited when
The number of objects is huge
Objects are small (<1 MB)
+Major Users: Facebook, Digg, Twitter etc.
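The "configurable parameters for Consistency/Availability" bullet refers to choosing how many replicas must acknowledge each read and write. A minimal sketch of the overlap rule used by Dynamo-style stores follows: with N replicas, a read of R replicas is guaranteed to include the latest acknowledged write of W replicas when R + W > N. The class and method names here are hypothetical helpers for illustration, not part of any Cassandra API.

```java
/** Replica-overlap arithmetic for Dynamo-style stores (illustrative helper only). */
final class QuorumMath {
    private QuorumMath() {}

    /** Smallest majority of n replicas. */
    static int quorum(int n) {
        return n / 2 + 1;
    }

    /** True when every read set of size r must share a replica with every write set of size w. */
    static boolean readsSeeLatestWrite(int n, int r, int w) {
        if (n < 1 || r < 1 || w < 1 || r > n || w > n) {
            throw new IllegalArgumentException("need 1 <= r, w <= n");
        }
        return r + w > n;
    }
}
```

With N = 3, quorum reads and quorum writes (2 + 2 > 3) always overlap, while reading and writing a single replica (1 + 1) does not: that setting gives lower latency and higher availability at the cost of possibly stale reads.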
13. NoSQL Databases – Assumptions
+Data size is huge
System must partition its data across multiple nodes
+Reliable
Data must be safe even when disks and nodes fail
System must replicate data
+Performance
Needs to perform well on cheap hardware and maintain low latency ALWAYS
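One common way to meet the partitioning and replication requirements above is a consistent-hash ring, as described in the Dynamo paper: nodes and keys are hashed onto the same ring, and each key is stored on the next few distinct nodes clockwise. The sketch below is a simplified illustration under stated assumptions: the class `HashRing`, its methods, the use of MD5, and the virtual-node naming scheme are all choices made for this example, not taken from any particular database.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy consistent-hash ring with virtual nodes and successor-list replication. */
class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    HashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    /** Place a node on the ring at several pseudo-random positions. */
    void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    /** Up to `replicas` distinct nodes for the key, walking clockwise from the key's hash. */
    List<String> nodesFor(String key, int replicas) {
        List<String> owners = new ArrayList<>();
        long h = hash(key);
        // Positions at or after the key's hash come first, then wrap around to the start.
        for (String node : ring.tailMap(h, true).values()) {
            if (owners.size() == replicas) return owners;
            if (!owners.contains(node)) owners.add(node);
        }
        for (String node : ring.headMap(h, false).values()) {
            if (owners.size() == replicas) return owners;
            if (!owners.contains(node)) owners.add(node);
        }
        return owners;
    }

    /** First 8 bytes of an MD5 digest, interpreted as a signed 64-bit ring position. */
    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The useful property is that adding or removing a node only moves the keys adjacent to that node's ring positions, so a key's owners are unaffected when an unrelated node leaves.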
14. NoSQL Databases – Design Strategies
+Complex Distributed System
+Partitioning
Consistent Hashing
+Consistency
Eventual Consistency
Vector Clocks
+Data Models
Primary Key -> Value
Value can be semi-structured
Multi-version Storage
+Storage Layouts
Column storage with locality groups
Log-structured merge trees
+Cluster Management
Peer to Peer vs Master/Slave approach
Gossip
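Of the strategies listed, vector clocks are the one that most benefits from a concrete picture. Each version of a value carries one counter per node; comparing two clocks tells whether one version supersedes the other or whether they were written concurrently and need reconciliation. The following is a minimal sketch of that idea as described in the Dynamo paper; the class `VectorClock` and its method names are hypothetical and not drawn from any specific implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Minimal vector clock: one counter per node id. */
class VectorClock {
    enum Order { BEFORE, AFTER, EQUAL, CONCURRENT }

    private final Map<String, Long> counters = new HashMap<>();

    /** Record an event (for example, a write) handled by nodeId. Returns this for chaining. */
    VectorClock increment(String nodeId) {
        counters.merge(nodeId, 1L, Long::sum);
        return this;
    }

    VectorClock copy() {
        VectorClock c = new VectorClock();
        c.counters.putAll(counters);
        return c;
    }

    /** How this clock relates to other; CONCURRENT means neither version supersedes the other. */
    Order compare(VectorClock other) {
        boolean less = false;
        boolean greater = false;
        Set<String> ids = new HashSet<>(counters.keySet());
        ids.addAll(other.counters.keySet());
        for (String id : ids) {
            long a = counters.getOrDefault(id, 0L);
            long b = other.counters.getOrDefault(id, 0L);
            if (a < b) less = true;
            if (a > b) greater = true;
        }
        if (less && greater) return Order.CONCURRENT;
        if (less) return Order.BEFORE;
        if (greater) return Order.AFTER;
        return Order.EQUAL;
    }

    /** Element-wise maximum, the clock of a version that reconciles both inputs. */
    VectorClock merge(VectorClock other) {
        VectorClock m = copy();
        other.counters.forEach((id, n) -> m.counters.merge(id, n, Math::max));
        return m;
    }
}
```

Under eventual consistency, two replicas may accept writes independently; when their clocks compare as CONCURRENT, the store keeps both versions and the merged clock is attached to whatever value the client or application writes back after resolving the conflict.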
15. References
+Bigtable: A Distributed Storage System for Structured Data
http://labs.google.com/papers/bigtable-osdi06.pdf
+Dynamo: Amazon's Highly Available Key-value Store
http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
+NOSQL debrief, June 2009
http://static.last.fm/johan/nosql-20090611/intro_nosql.pdf
http://static.last.fm/johan/nosql-20090611/hbase_nosql.pdf
http://static.last.fm/johan/nosql-20090611/cassandra_nosql.ppt
+NoSQL Databases Official Site
http://nosql-database.org
+HBase – Hadoop Wiki
http://wiki.apache.org/hadoop/Hbase
+Apache Cassandra – Wikipedia
http://en.wikipedia.org/wiki/Apache_Cassandra