Comparing Scalable NOSQL Databases
      Functionalities and Measurements




                Dory Thibault

                    UCL

 Contact : thibault.dory@student.uclouvain.be


             Sponsor : Euranova


      Website : nosqlbenchmarking.com




              February 15, 2011
Motivation
                 Overview of the databases
                              Methodology
                                   Results
                  Summary and conclusion

Clarifications
  Because many people who read these slides did not hear the oral
  explanations that MUST go with them, here are a few words of
  warning :
       All the databases were used with their default configurations; I will
       post them soon on nosqlbenchmarking.com
       No index was set manually; doing so could have a big impact
       on performance
       Don't jump to conclusions too fast: it would be WRONG
       to say that Cassandra is very good and that HBase sucks.
       The Cassandra implementation of MapReduce seems to be
       buggy and does not scale. There must be something wrong with
       my HBase configuration, since HBase is known to run gigantic
       clusters without problems.

Clarifications
  Also keep in mind that a benchmark is always biased by the chosen
  methodology, so :
        The way I store data in each database could have an impact
        on performance
        The summary of the results should not be taken in an
        absolute way, especially the first one. When I say Good or
        Bad, it is in THIS particular case. Moreover, raw results are not
        the most important thing; scalability is very important too. So good
        performance for Cassandra MapReduce without
        scalability is NOT good.
        The data set is too small, so I'm testing cache performance (but
        it is the same for all of the databases)
  I will soon add a written analysis and a self-critique of these
  results on www.nosqlbenchmarking.com

Motivation
  YCSB

  The Yahoo! Cloud Serving Benchmark (YCSB) is the best-known noSQL
  benchmarking application, so why make another one?


         YCSB uses data generated from statistical distributions
         instead of real data

         YCSB focuses only on read/write/update/scan performance

         YCSB results for elasticity are not conclusive

  Idea

         Data and use case inspired by a concrete case : Wikipedia

         Test read/update performance

         Test MapReduce performance by computing an inverted
         search index




Cassandra 0.6.10




  Overview
  Cassandra is a fully distributed column-oriented data store that
  provides a MapReduce implementation using Hadoop.


      All the nodes in the cluster play the same role
      The data (existing and new) are sharded automatically among
      the nodes
      The developer can choose the consistency level for each
      request
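
  To make the per-request consistency level more concrete, here is a
  minimal sketch in Python. It is not Cassandra's client API: it only
  assumes the usual meaning of three levels named ONE, QUORUM and ALL,
  where a quorum is a majority of the replicas. The function name and
  the replication factor of 3 are illustrative assumptions, not
  settings taken from this benchmark.

```python
def required_acks(level, replication_factor):
    """Number of replicas that must answer before a request succeeds.

    Assumption: ONE waits for a single replica, QUORUM for a majority,
    ALL for every replica.
    """
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


# Illustrative replication factor (an assumption, not a benchmark setting).
for level in ("ONE", "QUORUM", "ALL"):
    print(level, required_acks(level, replication_factor=3))
```

  A lower level answers after fewer replicas have responded, while a
  higher level waits for more of them before the request succeeds.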








HBase 0.20.6


  Overview
  HBase is a column oriented database that aims to provide low latency
  requests on top of Hadoop HDFS

      An HBase cluster uses several kinds of servers :
             HDFS needs at least one namenode and several datanodes
             HBase needs a ZooKeeper cluster, a master and several
             regionservers
      The requests must be made to the master(s)
      On the HDFS level, existing data are not sharded
      automatically but new data are
      On the HBase level, the data are divided into regions that are
      sharded automatically across regionservers
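
      As a rough illustration of how rows map to regions, the sketch
      below treats each region as a contiguous range of sorted row keys
      and looks up the regionserver hosting a given key. The start keys,
      server names and the locate function are hypothetical; this is
      plain Python, not the HBase client API.

```python
import bisect

# Hypothetical layout: a region covers the keys from its start key up to
# (but not including) the start key of the next region.
REGION_START_KEYS = ["", "g", "n", "t"]        # sorted; "" is the start of the table
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs1"]  # server hosting each region


def locate(row_key):
    """Return (region index, regionserver) responsible for row_key."""
    index = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return index, REGION_SERVERS[index]


print(locate("apple"))
print(locate("hbase"))
print(locate("zebra"))
```

      When a region grows and is split, only the list of start keys and
      the region-to-server assignment change; the lookup stays the same.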





mongoDB 1.6.5




  Overview

  mongoDB is a document-oriented database that stores JSON
  dictionaries. It provides auto-sharding and a MapReduce
  implementation.


       A mongoDB cluster is made of several kinds of servers :
             The shard servers that store data
             The configuration servers that store the configuration
             The router servers that receive and route the requests
       Existing and new data are sharded automatically

       MapReduce can only use one thread per server








Riak 0.14



  Overview
  Riak is a fully distributed key/bucket store with an implementation
  of MapReduce.


      Buckets can store the data directly or be a link to another
      bucket
      All the nodes in the cluster play the same role
      The data (existing and new) are sharded automatically
      among the nodes
      The developer can choose the consistency level for each
      request




The data

  Wikipedia export

  20,000 pages downloaded from Wikipedia



       Every document is in XML format

       All documents sum up to 620 MB

       Each document is associated with a single integer ID


  Insertions

  Each document is inserted only once during the whole benchmark





The client

  Overview
      Fully random requests
      Acts as a perfect load balancer
      The proportion of updates can be specified
      Specific parts : read/write/update and MapReduce

  Updates
  The updates simply concatenate the string "1" at the end of the
  article.
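
  The client behaviour described above could be sketched as follows.
  This is an assumption-laden illustration, not the actual benchmark
  client: "perfect load balancer" is interpreted here as a round-robin
  over the node addresses, document IDs are assumed to run from 0 to
  doc_count - 1, and the names generate_requests and apply_update are
  hypothetical.

```python
import random


def generate_requests(total, read_percentage, node_ips, doc_count, seed=None):
    """Build a list of (node, operation, document ID) requests."""
    rng = random.Random(seed)
    requests = []
    for i in range(total):
        node = node_ips[i % len(node_ips)]   # round-robin: even load on every node
        doc_id = rng.randrange(doc_count)    # fully random document
        is_read = rng.random() * 100 < read_percentage
        requests.append((node, "read" if is_read else "update", doc_id))
    return requests


def apply_update(article):
    """An update appends the string "1" to the article."""
    return article + "1"


for request in generate_requests(6, 80, ["10.0.0.1", "10.0.0.2", "10.0.0.3"], 20000, seed=42):
    print(request)
```

  Passing a seed makes a run repeatable, which helps when the same
  request mix has to be replayed against several databases.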




MapReduce
 Overview
 MapReduce is used to build a reverse index for a given keyword.
 The reverse index is a list of pairs made of :
      ID : the ID of the article if Count ≠ 0
      Count : the number of occurrences of the keyword in this
      article
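
 The reverse index described above can be sketched in plain Python,
 without any of the four databases. The map step emits an (ID, Count)
 pair only when Count is not 0, and the reduce step simply collects the
 pairs. Counting occurrences as non-overlapping substring matches is an
 assumption, and the function names are hypothetical.

```python
def map_article(article_id, text, keyword):
    """Map step: emit (ID, Count) only for articles containing the keyword."""
    count = text.count(keyword)  # assumption: plain substring matching
    if count != 0:
        yield article_id, count


def reduce_pairs(pairs):
    """Reduce step: gather the emitted pairs into one list ordered by ID."""
    return sorted(pairs)


def reverse_index(articles, keyword):
    """Run the map step over every article, then reduce."""
    pairs = []
    for article_id, text in articles.items():
        pairs.extend(map_article(article_id, text, keyword))
    return reduce_pairs(pairs)


articles = {1: "a b a", 2: "c", 3: "a"}
print(reverse_index(articles, "a"))
```

 Every article has to be scanned once per keyword, so the job touches
 the whole data set.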
 Justi

More Related Content

What's hot

Hbase in action - Chapter 09: Deploying HBase
Hbase in action - Chapter 09: Deploying HBaseHbase in action - Chapter 09: Deploying HBase
Hbase in action - Chapter 09: Deploying HBasephanleson
 
A Study of Performance NoSQL Databases
A Study of Performance NoSQL DatabasesA Study of Performance NoSQL Databases
A Study of Performance NoSQL DatabasesAM Publications
 
Performance Analysis of HBASE and MONGODB
Performance Analysis of HBASE and MONGODBPerformance Analysis of HBASE and MONGODB
Performance Analysis of HBASE and MONGODBKaushik Rajan
 
HBase In Action - Chapter 10 - Operations
HBase In Action - Chapter 10 - OperationsHBase In Action - Chapter 10 - Operations
HBase In Action - Chapter 10 - Operationsphanleson
 
1. introduction to no sql
1. introduction to no sql1. introduction to no sql
1. introduction to no sqlAnuja Gunale
 
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...Simplilearn
 
Sql vs NoSQL-Presentation
 Sql vs NoSQL-Presentation Sql vs NoSQL-Presentation
Sql vs NoSQL-PresentationShubham Tomar
 
Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkBuilding highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkMartin Toshev
 
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?IJCSIS Research Publications
 

What's hot (13)

No SQL introduction
No SQL introductionNo SQL introduction
No SQL introduction
 
Hbase in action - Chapter 09: Deploying HBase
Hbase in action - Chapter 09: Deploying HBaseHbase in action - Chapter 09: Deploying HBase
Hbase in action - Chapter 09: Deploying HBase
 
A Study of Performance NoSQL Databases
A Study of Performance NoSQL DatabasesA Study of Performance NoSQL Databases
A Study of Performance NoSQL Databases
 
Performance Analysis of HBASE and MONGODB
Performance Analysis of HBASE and MONGODBPerformance Analysis of HBASE and MONGODB
Performance Analysis of HBASE and MONGODB
 
HBase In Action - Chapter 10 - Operations
HBase In Action - Chapter 10 - OperationsHBase In Action - Chapter 10 - Operations
HBase In Action - Chapter 10 - Operations
 
1. introduction to no sql
1. introduction to no sql1. introduction to no sql
1. introduction to no sql
 
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
 
No sql database
No sql databaseNo sql database
No sql database
 
paper
paperpaper
paper
 
Sql vs NoSQL-Presentation
 Sql vs NoSQL-Presentation Sql vs NoSQL-Presentation
Sql vs NoSQL-Presentation
 
Nosql databases
Nosql databasesNosql databases
Nosql databases
 
Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkBuilding highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache Spark
 
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
 

Similar to Comparing noSQL databases : benchmark

DSM - Comparison of Hbase and Cassandra
DSM - Comparison of Hbase and CassandraDSM - Comparison of Hbase and Cassandra
DSM - Comparison of Hbase and CassandraShrikant Samarth
 
Data Storage and Management project Report
Data Storage and Management project ReportData Storage and Management project Report
Data Storage and Management project ReportTushar Dalvi
 
Real World NoSQL (by Chris Yuen)
Real World NoSQL (by Chris Yuen)Real World NoSQL (by Chris Yuen)
Real World NoSQL (by Chris Yuen)orcsab
 
Benchmarking Couchbase Server for Interactive Applications
Benchmarking Couchbase Server for Interactive ApplicationsBenchmarking Couchbase Server for Interactive Applications
Benchmarking Couchbase Server for Interactive ApplicationsAltoros
 
Oracle NoSQL Database Compared to Cassandra and HBase
Oracle NoSQL Database Compared to Cassandra and HBaseOracle NoSQL Database Compared to Cassandra and HBase
Oracle NoSQL Database Compared to Cassandra and HBasePaulo Fagundes
 
Integrating dbm ss as a read only execution layer into hadoop
Integrating dbm ss as a read only execution layer into hadoopIntegrating dbm ss as a read only execution layer into hadoop
Integrating dbm ss as a read only execution layer into hadoopJoão Gabriel Lima
 
Performance Comparison of HBase and Cassandra
Performance Comparison of HBase and CassandraPerformance Comparison of HBase and Cassandra
Performance Comparison of HBase and CassandraYashIyengar
 
MongoDB Lab Manual (1).pdf used in data science
MongoDB Lab Manual (1).pdf used in data scienceMongoDB Lab Manual (1).pdf used in data science
MongoDB Lab Manual (1).pdf used in data sciencebitragowthamkumar1
 
Trends in Computer Science and Information Technology
Trends in Computer Science and Information TechnologyTrends in Computer Science and Information Technology
Trends in Computer Science and Information Technologypeertechzpublication
 
Altoros using no sql databases for interactive_applications
Altoros using no sql databases for interactive_applicationsAltoros using no sql databases for interactive_applications
Altoros using no sql databases for interactive_applicationsJeff Harris
 
No sqlpresentation
No sqlpresentationNo sqlpresentation
No sqlpresentationSalma Gouia
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQLbalwinders
 
Comparison between mongo db and cassandra using ycsb
Comparison between mongo db and cassandra using ycsbComparison between mongo db and cassandra using ycsb
Comparison between mongo db and cassandra using ycsbsonalighai
 

Similar to Comparing noSQL databases : benchmark (20)

DSM - Comparison of Hbase and Cassandra
DSM - Comparison of Hbase and CassandraDSM - Comparison of Hbase and Cassandra
DSM - Comparison of Hbase and Cassandra
 
Data Storage and Management project Report
Data Storage and Management project ReportData Storage and Management project Report
Data Storage and Management project Report
 
Real World NoSQL (by Chris Yuen)
Real World NoSQL (by Chris Yuen)Real World NoSQL (by Chris Yuen)
Real World NoSQL (by Chris Yuen)
 
Benchmarking Couchbase Server for Interactive Applications
Benchmarking Couchbase Server for Interactive ApplicationsBenchmarking Couchbase Server for Interactive Applications
Benchmarking Couchbase Server for Interactive Applications
 
Hbase
HbaseHbase
Hbase
 
C1803041317
C1803041317C1803041317
C1803041317
 
Oracle NoSQL Database Compared to Cassandra and HBase
Oracle NoSQL Database Compared to Cassandra and HBaseOracle NoSQL Database Compared to Cassandra and HBase
Oracle NoSQL Database Compared to Cassandra and HBase
 
Dsm project-h base-cassandra
Dsm project-h base-cassandraDsm project-h base-cassandra
Dsm project-h base-cassandra
 
Integrating dbm ss as a read only execution layer into hadoop
Integrating dbm ss as a read only execution layer into hadoopIntegrating dbm ss as a read only execution layer into hadoop
Integrating dbm ss as a read only execution layer into hadoop
 
Performance Comparison of HBase and Cassandra
Performance Comparison of HBase and CassandraPerformance Comparison of HBase and Cassandra
Performance Comparison of HBase and Cassandra
 
No sql
No sqlNo sql
No sql
 
MongoDB Lab Manual (1).pdf used in data science
MongoDB Lab Manual (1).pdf used in data scienceMongoDB Lab Manual (1).pdf used in data science
MongoDB Lab Manual (1).pdf used in data science
 
NoSQL Basics and MongDB
NoSQL Basics and  MongDBNoSQL Basics and  MongDB
NoSQL Basics and MongDB
 
Trends in Computer Science and Information Technology
Trends in Computer Science and Information TechnologyTrends in Computer Science and Information Technology
Trends in Computer Science and Information Technology
 
Altoros using no sql databases for interactive_applications
Altoros using no sql databases for interactive_applicationsAltoros using no sql databases for interactive_applications
Altoros using no sql databases for interactive_applications
 
No sqlpresentation
No sqlpresentationNo sqlpresentation
No sqlpresentation
 
HadoopDB in Action
HadoopDB in ActionHadoopDB in Action
HadoopDB in Action
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
Nosql seminar
Nosql seminarNosql seminar
Nosql seminar
 
Comparison between mongo db and cassandra using ycsb
Comparison between mongo db and cassandra using ycsbComparison between mongo db and cassandra using ycsb
Comparison between mongo db and cassandra using ycsb
 

Recently uploaded

The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 

Recently uploaded (20)

The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 

Comparing noSQL databases : benchmark

  • 1. Comparing Scalable NOSQL Databases Functionalities and Measurements Dory Thibault UCL Contact : thibault.dory@student.uclouvain.be Sponsor : Euranova Website : nosqlbenchmarking.com February 15, 2011
  • 2. Motivation Overview of the databases Methodology Results Summary and conclusion Clari
  • 3. cations As a lot of people who read those slides did not get the oral explanations that MUST go with it, here are a few words of warning : All the databases were used with default con
  • 4. gurations, I will post them soon on nosqlbenchmarking.com No index was set manually, doing so could have a big impact on performances Don't jump too fast on the conclusions, it would be WRONG to say that Cassandra is very good and that HBase sucks. The Cassandra implementation of MapReduce seems to be buggy and do not scale. There must be something wrong with my HBase con
  • 5. guration, HBase is known to run gigantic cluster without problems. 2 / 20
  • 6. Motivation Overview of the databases Methodology Results Summary and conclusion Clari
  • 7. cations Also keep in mind that a benchmark is always biased by the chosen methodology so : The way I store data in each database could have an impact on the performances The summary about the results should not be taken in an absolute way, especially the
  • 8. rst one. When I say Good or Bad it is in THIS particular case. Moreover raw results are not the most important, scalability is very important too. So good performances for Cassandra MapReduce but without scalability is NOT good. The data set is too small, I'm testing cache performances (but it is the same for all of the databases) I will add soon a written analysis and a self critic about those results on www.nosqlbenchmarking.com 3 / 20
  • 9. Motivation Overview of the databases Methodology Results Summary and conclusion Motivation YCSB Yahoo! Cloud Servicing Benchmark is the best known noSQL bench- marking application so why make another one? YCSB uses data generated from statistical distributions instead of real data YCSB only focuses on read/write/update/scan performances YCSB results for elasticity are not conclusive Idea Data and use case inspired by a concrete case : Wikipedia Test read/update performances Test MapReduce performances by computing an inverted search index 4 / 20
  • 10. Motivation Cassandra 0.6.10 Overview of the databases HBase 0.20.6 Methodology mongoDB 1.6.5 Results Riak 0.14 Summary and conclusion Cassandra 0.6.10 Overview Cassandra is a fully distributed column oriented data store that pro- vides a MapReduce implementation using Hadoop. All the nodes in the cluster play the same role The data (existing and new) are sharded automatically among the nodes The developer can choose the consistency level for each request 5 / 20
  • 11. Motivation Cassandra 0.6.10 Overview of the databases HBase 0.20.6 Methodology mongoDB 1.6.5 Results Riak 0.14 Summary and conclusion HBase 0.20.6 Overview HBase is a column oriented database that aims to provide low latency requests on top of Hadoop HDFS An HBase cluster uses several kinds of servers : HDFS needs at least one namenode datanodes and several HBase needs a ZooKeeper cluster master , a and several regionservers The requests must be made to the master(s) On the HDFS level, existing data are not sharded automatically but new data are On the HBase level, the data are divided into regions that are sharded automatically across regionservers 6 / 20
  • 12. Motivation Cassandra 0.6.10 Overview of the databases HBase 0.20.6 Methodology mongoDB 1.6.5 Results Riak 0.14 Summary and conclusion mongoDB 1.6.5 Overview mongoDB is a document oriented database that stores JSON dic- tionnaries. It provides auto sharding and a MapReduce implemen- tation. A mongoDB cluster is made of several kinds of servers : The shard servers that store data The con
  • 13. guration servers that store the con
  • 14. guration The router servers that receive and route the requests Existing and new data are sharded automatically MapReduce can only use one thread by server 7 / 20
  • 15. Motivation Cassandra 0.6.10 Overview of the databases HBase 0.20.6 Methodology mongoDB 1.6.5 Results Riak 0.14 Summary and conclusion Riak 0.14 Overview Riak is a fully distributed key/bucket store with an implementation of MapReduce. Buckets can store the data directly or be a link to another bucket All the nodes in the cluster play the same role The data (existing and new) are sharded automatically amongs the nodes The developer can choose the consistency level for each request 8 / 20
  • 16. The data. Wikipedia export: 20,000 pages downloaded from Wikipedia. Every document is in XML format. All documents sum up to 620 MB. Each document is associated with a single integer ID. Insertions: each document is inserted only once during the whole benchmark.
  • 17. The client. Overview: Fully random requests. Acts as a perfect load balancer. The proportion of updates can be specified. Specific parts: read/write/update and MapReduce. Updates: the updates simply concatenate the string "1" at the end of the article.
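The client behaviour described above can be sketched as follows. This is only an illustration, not the benchmark's actual client: the names `make_requests` and `apply_update` are hypothetical, and "perfect load balancer" is assumed here to mean a strict round-robin over the node addresses.

```python
import random


def make_requests(total, read_percentage, doc_ids, node_addresses, seed=None):
    """Build a list of (kind, document ID, target node) requests.

    Assumption: document IDs are drawn uniformly at random and nodes are
    used in round-robin order so that every node gets the same load.
    """
    rng = random.Random(seed)
    requests = []
    for i in range(total):
        # Draw the request type according to the requested read percentage.
        kind = "read" if rng.random() * 100 < read_percentage else "update"
        doc_id = rng.choice(doc_ids)
        node = node_addresses[i % len(node_addresses)]
        requests.append((kind, doc_id, node))
    return requests


def apply_update(article):
    """An update concatenates the string "1" at the end of the article."""
    return article + "1"
```

With four hypothetical node addresses and 1,000 requests, each node receives exactly 250 requests, whatever the read percentage is.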
  • 20. MapReduce. Overview: MapReduce is used to build a reverse index for a given keyword. The reverse index is a list of pairs made of: ID, the ID of the article if Count ≠ 0; and Count, the number of occurrences of the keyword in this article. Justification: This kind of computation implies that all the documents are crawled and take advantage of the speci
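As an illustration of the reverse index described on this slide, here is a minimal single-process sketch in Python. It is not the MapReduce job used with any of the four databases: the function names are hypothetical, and counting with `str.count` (case-sensitive, non-overlapping substring matches) is an assumption about how occurrences are counted.

```python
def map_article(article_id, text, keyword):
    """Map step: emit (ID, Count) only when the keyword occurs in the article."""
    count = text.count(keyword)
    if count != 0:
        yield (article_id, count)


def build_reverse_index(articles, keyword):
    """Collect the emitted pairs for every article into one sorted list.

    `articles` is assumed to be a dict mapping integer IDs to article text.
    """
    index = []
    for article_id, text in articles.items():
        index.extend(map_article(article_id, text, keyword))
    return sorted(index)
```

Every article has to be scanned to produce the index, which is why this workload touches the whole data set.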
  • 23. The methodology
      1. Start up a clean cluster of size 3 and insert all the documents.
      2. Choose a total number of requests and a read percentage, then start the benchmark.
      3. Wait one minute and start the benchmark again.
      4. Wait five minutes and start the benchmark again.
      5. Start the MapReduce benchmark.
      6. Add a new node to the cluster and wait for it to be ready, then immediately restart the benchmark with the new node's IP in the list.
      7. Jump to step 3 until there are no more computers to add to the cluster.
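The seven steps above can be sketched as a small driver loop. This is a hedged outline, not the author's tooling: every callable passed in (`insert_all`, `run_benchmark`, `run_mapreduce`, `add_node`) is a hypothetical placeholder, `add_node` is assumed to block until the node is ready, and steps 3 to 5 are assumed to run one last time after the final node has been added.

```python
import time


def run_procedure(initial_nodes, spare_nodes, insert_all, run_benchmark,
                  run_mapreduce, add_node, sleep=time.sleep):
    """Drive the benchmark procedure; returns the final list of nodes."""
    nodes = list(initial_nodes)
    spare = list(spare_nodes)

    insert_all(nodes)          # step 1: clean cluster, insert all documents
    run_benchmark(nodes)       # step 2: first run with the chosen parameters
    while True:
        sleep(60)              # step 3: wait one minute, run again
        run_benchmark(nodes)
        sleep(300)             # step 4: wait five minutes, run again
        run_benchmark(nodes)
        run_mapreduce(nodes)   # step 5: MapReduce benchmark
        if not spare:          # step 7: stop when no machine is left to add
            break
        new_node = spare.pop(0)
        add_node(new_node)     # step 6: assumed to return once the node is ready
        nodes.append(new_node)
        run_benchmark(nodes)   # restart immediately with the new node in the list
    return nodes
```

Passing the `sleep` function as a parameter keeps the sketch easy to dry-run without actually waiting.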
  • 25. Read/update results
  • 26. Read/update results without HBase
  • 27. MapReduce performance
  • 28. The HBase case. Verifications made: Checked the logs: nothing seemed problematic. HDFS level: running the balancer with a very low threshold distributed the blocks evenly, but without any impact on performance. HBase level: the regions were always nearly evenly distributed across the regionservers. The number of rows did not change and the content of each row was correct.
  • 30. Summary of raw performances
      DB          read/update performances   MapReduce performances
      Cassandra   Good                       Very Good
      HBase       Bad / N.A.                 Average / N.A.
      mongoDB     Good                       Poor but scalable
      Riak        Poor / unstable            Average but scalable
  • 31. Summary of scalability. Going from 3 to 8 servers is a 266% increase in capacity (8/3 ≈ 2.67); here are the observed increases in performance:
      DB                 read/update   MapReduce
      Cassandra          153%          112%
      HBase              11%           43%
      mongoDB            145%          211%
      Riak               74%           189%
      Riak 7 nodes max   155%          168%
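One way to read these figures is to compare each observed increase with the 266% capacity figure. The sketch below does that arithmetic; the function name `scaling_ratio` is hypothetical, and it assumes the observed percentages are expressed on the same basis as the capacity figure (100 × end nodes / start nodes), which the slides do not state explicitly.

```python
def scaling_ratio(observed_pct, start_nodes=3, end_nodes=8):
    """Ratio of the observed increase to the capacity figure for the cluster growth.

    A value of 1.0 would correspond to linear scaling under the stated assumption.
    """
    capacity_pct = 100 * end_nodes / start_nodes  # 266.67 for 3 -> 8 nodes
    return observed_pct / capacity_pct


# (read/update, MapReduce) percentages copied from the slide, 3 -> 8 servers.
observed = {
    "Cassandra": (153, 112),
    "HBase": (11, 43),
    "mongoDB": (145, 211),
    "Riak": (74, 189),
}

for db, (read_update, mapreduce) in observed.items():
    print(f"{db:10s} read/update {scaling_ratio(read_update):.2f}  "
          f"MapReduce {scaling_ratio(mapreduce):.2f}")
```

For the "Riak 7 nodes max" row, the same function can be called with `end_nodes=7`, since that measurement stops at seven servers.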
  • 32. Conclusion and future work. Conclusion: The elastic gain seems more apparent than with YCSB, but it is not linear either. It is worth testing MapReduce performance, as the results vary a lot between databases for both raw performance and scalability. Future work: This is still a work in progress. Applying this benchmark to other databases (Terrastore, Voldemort, Scalaris ...). Trying with a growing/bigger data set.
  • 33. Questions and remarks: Any questions or remarks?