Sqooping 50 Million Rows a Day from MySQL

Eric Hernandez
Database Administrator
App servers write into a Parent_YearMonth_Merge table that merges ten child tables, Child_YearMonth_0 through Child_YearMonth_9.

+50 Million New Rows a Day
+1.5 Billion Rows a Month
3 Month Rotational Life Cycle

MySQL Active Writer Instance: Current Month, One Month Ago, Two Months Ago
MySQL Archive Long-Term Storage Instance: Two Months Ago, Three Months Ago, and so on ...
Problem: Data analysts have to pull data from two different sources.
One of the goals of our project is to create a single data source for analysts to mine.

MySQL Active Writer Instance: Current Month, One Month Ago
MySQL Archive Long-Term Storage Instance: Two Months Ago, Three Months Ago, and so on ...
Data analysts with Hadoop only have to pull from one data source:
a Hadoop cluster running Hive, with all data current to within the last 24 hours,
fed from the MySQL Active Writer Instance (Current Month, One Month Ago).
Attempt 1.0: Sqooping in Data from MySQL

Sqoop the entire table set into Hive every day at 00:30:
9-node Hadoop cluster with 4 TB of available storage, loading Parent_201108_Merge's child
tables (Child_201108_0 through Child_201108_9) into a single Hive table.

2011-08-01: 5 million rows per table, 2 minutes Sqoop time per table,
20 minutes total time, total 50 million rows into the Hive table.

2011-08-02: 10 million rows per table, 4 minutes Sqoop time per table,
40 minutes total time, total 100 million rows into the Hive table.

2011-08-10: 50 million rows per table, 20 minutes Sqoop time per table,
200 minutes total time, total 500 million rows into the Hive table.
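For reference, a daily full-table import along these lines might look like the sketch below. It is an illustration only, since the deck does not show the actual command: the JDBC URL, credentials, database name, Hive table name, and mapper count are all assumptions.

# Hypothetical Attempt 1.0 job, run from cron at 00:30, e.g.:  30 0 * * *  /opt/scripts/full_import.sh
# Connection details, credentials, and the Hive table name are placeholders, not values from the deck.
# A real job would also have to deal with replacing the previous day's full copy.
for i in $(seq 0 9); do
  sqoop import \
    --connect jdbc:mysql://mysql-active.example.com/appdb \
    --username sqoop_user --password "${SQOOP_PW}" \
    --table Child_201108_${i} \
    --hive-import \
    --hive-table event_logs \
    --num-mappers 4
done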
  
Attempt 2.0: Incremental Sqoop of Data from MySQL

Child_YearMonth schema:
ID BIGINT (auto increment) | MISC column | MISC column | MISC column | Date_Created (timestamp)

Source tables: Parent_201108_Merge merging Child_201108_0 through Child_201108_9.

sqoop import --where "date_created between '${DATE} 00:00:00' and '${DATE} 23:59:59'"
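Filled out with connection details for a single child table, the incremental import might look like the sketch below; the host, database, credentials, and mapper settings are illustrative assumptions, not values from the deck.

# Hypothetical expansion of the incremental import for one child table.
# ID is the auto-increment BIGINT from the Child_YearMonth schema above, so it is a natural --split-by column.
DATE=$(date +%Y-%m-%d -d "1 day ago")

sqoop import \
  --connect jdbc:mysql://mysql-active.example.com/appdb \
  --username sqoop_user --password "${SQOOP_PW}" \
  --table Child_201108_0 \
  --split-by ID \
  --where "date_created between '${DATE} 00:00:00' and '${DATE} 23:59:59'" \
  --num-mappers 4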
  
Attempt 2.0: Incremental Sqoop of Data from MySQL

9-node Hadoop cluster, 4 TB available storage. Sqoop only the last 24 hours from each child
table (Child_201108_0 through Child_201108_9), using a where clause on Date_Created, into the
Hive table.

2011-08-01: 5 million rows per table, 2 minutes Sqoop time per table,
10 minutes total time, total 50 million rows into the Hive table.

2011-08-02: 5 million rows per table, 2 minutes Sqoop time per table,
10 minutes total time, total 50 million rows into the Hive table.

2011-08-10: 5 million rows per table, 2 minutes Sqoop time per table,
10 minutes total time, total 50 million rows into the Hive table.

Consistent run times for Sqoop jobs achieved.
After our 2.0 incremental process we had achieved consistent run times; however, two new
problems surfaced.

1)  Each day 10 new parts would be added to the Hive table, which caused 10 more map tasks
    per Hive query.
2)  Space consumption on the Hadoop cluster.
Too many parts and map tasks per query.

Each daily Sqoop of Child_201108_0 through Child_201108_9 appends ten more part files to the
Hive table: 2011-08-01 contributes Part-0 through Part-9, 2011-08-02 contributes Part-10
through Part-19, 2011-08-03 contributes Part-20 through Part-29, and so on.

For 3 days of data, 30 map tasks must be processed for any Hive query.
For 30 days of data, 300 map tasks must be processed for any Hive query.
To sqoop 10 tables into one partition, I chose to dynamically create a partition based on
date and Sqoop the data into the partition directory with an append. Each day's import from
Child_201108_0 through Child_201108_9 now lands in its own partition (dt=2011-08-01,
dt=2011-08-02, dt=2011-08-03, ...) holding that day's Part-0 through Part-9.

# Set date to yesterday
DATE=`date +%Y-%m-%d -d "1 day ago"`

TABLE_DIR=/user/hive/warehouse/${TABLE}
PARTITION_DIR=$TABLE_DIR/${DATE}

# Create partition
echo "ALTER TABLE ${TABLE} ADD IF NOT EXISTS PARTITION (dt='${DATE}') location '${PARTITION_DIR}'; exit;" | /usr/bin/hive

# Sqoop in event_logs
sqoop import --where "date_created between '${DATE} 00:00:00' and '${DATE} 23:59:59'" --target-dir $PARTITION_DIR --append
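Putting the pieces together, one plausible reading of the full daily job is the sketch below. The Hive table name, JDBC URL, credentials, and the loop over the ten child tables are assumptions added for illustration; only the date handling, partition creation, and the --where/--target-dir/--append import come from the slides.

#!/bin/bash
# Hypothetical end-to-end daily job assembled from the snippets above.
# TABLE, the JDBC URL, and the credentials are placeholders, not values from the deck.
TABLE=event_logs
DATE=$(date +%Y-%m-%d -d "1 day ago")
TABLE_DIR=/user/hive/warehouse/${TABLE}
PARTITION_DIR=${TABLE_DIR}/${DATE}

# Register the daily partition before loading data into its directory.
echo "ALTER TABLE ${TABLE} ADD IF NOT EXISTS PARTITION (dt='${DATE}') location '${PARTITION_DIR}'; exit;" | /usr/bin/hive

# Append yesterday's rows from each of the ten child tables into the same partition directory.
# One mapper per table yields the ten parts per partition shown on the slides.
for i in $(seq 0 9); do
  sqoop import \
    --connect jdbc:mysql://mysql-active.example.com/appdb \
    --username sqoop_user --password "${SQOOP_PW}" \
    --table Child_201108_${i} \
    --where "date_created between '${DATE} 00:00:00' and '${DATE} 23:59:59'" \
    --target-dir "${PARTITION_DIR}" \
    --append \
    --num-mappers 1
done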
As a result of sqooping into Hive partitions, only a minimal number of map tasks has to be
processed: a query restricted to one day reads just that day's partition (Part-0 through
Part-9) instead of every part in the table.

1 Day = 10 Map Tasks
2 Days = 20 Map Tasks
...
30 Days = 300 Map Tasks
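For example, following the deck's own echo-into-hive convention (the table name is the hypothetical event_logs used above), a query that filters on the partition column touches only that day's ten parts:

# Partition pruning: only the dt=2011-08-01 partition's ten parts are scanned.
echo "SELECT COUNT(*) FROM event_logs WHERE dt = '2011-08-01';" | /usr/bin/hive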
  
Space Consumption

One month of data from Parent_201108_Merge (Child_201108_0 through Child_201108_9) is about
30 GB. With an HDFS replication factor of 3, that is 30 GB per replica and 90 GB in HDFS per
month; a year of data therefore takes about 1.08 TB in HDFS (30 GB x 3 replicas x 12 months).
Sqooping with Snappy

sqoop import --compression-codec org.apache.hadoop.io.compress.SnappyCodec -z

With roughly 5:1 Snappy compression, one month of data drops from 30 GB to about 6 GB per
replica, or 18 GB in HDFS at replication factor 3; a year of data takes about 216 GB in HDFS
(6 GB x 3 replicas x 12 months).
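A quick way to sanity-check the savings (the warehouse path is the hypothetical one used earlier) is to compare the table's on-disk size in HDFS before and after switching the import to Snappy:

# Report the space used by the table's files in HDFS.
hadoop fs -du /user/hive/warehouse/event_logs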
  
Summary

1)  Develop some kind of incremental import when sqooping in large active tables. If you do
    not, your Sqoop jobs will take longer and longer as the data in the RDBMS grows.
2)  Limit the number of parts that will be stored in HDFS, since every part translates into a
    time-consuming map task; use partitioning if possible.
3)  Compress data in HDFS. You will save space, since your replication factor makes multiple
    copies of your data, and you may also benefit in processing, as your Map/Reduce jobs have
    less data to transfer and Hadoop becomes less I/O bound.
                                                                      	
  
?	
  
 	
  
