Sqooping 50 Million Rows a Day from MySQL

Eric Hernandez
Database Administrator
App servers write into a Parent_YearMonth_Merge table that merges ten child tables, Child_YearMonth_0 through Child_YearMonth_9.

+50 Million New Rows a Day
+1.5 Billion Rows a Month
3 Month Rotational Life Cycle

MySQL Active Writer Instance: Current Month, One Month Ago, Two Months Ago
MySQL Archive Long-Term Storage Instance: Two Months Ago, Three Months Ago, and so on ...
Problem: Data analysts have to pull data from two different sources.
One of the goals of our project is to create a single data source for analysts to mine.

MySQL Active Writer Instance: Current Month, One Month Ago
MySQL Archive Long-Term Storage Instance: Two Months Ago, Three Months Ago, and so on ...
Data analysts with Hadoop only have to pull from one data source:
a Hadoop cluster running Hive, with all data current to within the last 24 hours,
fed from the MySQL Active Writer Instance (Current Month, One Month Ago).
Attempt 1.0: Sqooping in Data from MySQL

Sqoop the entire table set into Hive every day at 00:30:
9-node Hadoop cluster with 4 TB of available storage, loading Parent_201108_Merge's child
tables (Child_201108_0 through Child_201108_9) into a single Hive table.

2011-08-01: 5 million rows per table, 2 minutes Sqoop time per table,
20 minutes total time, total 50 million rows into the Hive table.

2011-08-02: 10 million rows per table, 4 minutes Sqoop time per table,
40 minutes total time, total 100 million rows into the Hive table.

2011-08-10: 50 million rows per table, 20 minutes Sqoop time per table,
200 minutes total time, total 500 million rows into the Hive table.
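For reference, a daily full-table import along these lines might look like the sketch below. It is an illustration only, since the deck does not show the actual command: the JDBC URL, credentials, database name, Hive table name, and mapper count are all assumptions.

# Hypothetical Attempt 1.0 job, run from cron at 00:30, e.g.:  30 0 * * *  /opt/scripts/full_import.sh
# Connection details, credentials, and the Hive table name are placeholders, not values from the deck.
# A real job would also have to deal with replacing the previous day's full copy.
for i in $(seq 0 9); do
  sqoop import \
    --connect jdbc:mysql://mysql-active.example.com/appdb \
    --username sqoop_user --password "${SQOOP_PW}" \
    --table Child_201108_${i} \
    --hive-import \
    --hive-table event_logs \
    --num-mappers 4
done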
  
Attempt 2.0: Incremental Sqoop of Data from MySQL

Child_YearMonth schema:
ID BIGINT (auto increment) | MISC column | MISC column | MISC column | Date_Created (timestamp)

Source tables: Parent_201108_Merge merging Child_201108_0 through Child_201108_9.

sqoop import --where "date_created between '${DATE} 00:00:00' and '${DATE} 23:59:59'"
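Filled out with connection details for a single child table, the incremental import might look like the sketch below; the host, database, credentials, and mapper settings are illustrative assumptions, not values from the deck.

# Hypothetical expansion of the incremental import for one child table.
# ID is the auto-increment BIGINT from the Child_YearMonth schema above, so it is a natural --split-by column.
DATE=$(date +%Y-%m-%d -d "1 day ago")

sqoop import \
  --connect jdbc:mysql://mysql-active.example.com/appdb \
  --username sqoop_user --password "${SQOOP_PW}" \
  --table Child_201108_0 \
  --split-by ID \
  --where "date_created between '${DATE} 00:00:00' and '${DATE} 23:59:59'" \
  --num-mappers 4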
  
Attempt 2.0: Incremental Sqoop of Data from MySQL

9-node Hadoop cluster, 4 TB available storage. Sqoop only the last 24 hours from each child
table (Child_201108_0 through Child_201108_9), using a where clause on Date_Created, into the
Hive table.

2011-08-01: 5 million rows per table, 2 minutes Sqoop time per table,
10 minutes total time, total 50 million rows into the Hive table.

2011-08-02: 5 million rows per table, 2 minutes Sqoop time per table,
10 minutes total time, total 50 million rows into the Hive table.

2011-08-10: 5 million rows per table, 2 minutes Sqoop time per table,
10 minutes total time, total 50 million rows into the Hive table.

Consistent run times for Sqoop jobs achieved.
After our 2.0 incremental process we had achieved consistent run times; however, two new
problems surfaced.

1)  Each day 10 new parts would be added to the Hive table, which caused 10 more map tasks
    per Hive query.
2)  Space consumption on the Hadoop cluster.
Too many parts and map tasks per query.

Each daily Sqoop of Child_201108_0 through Child_201108_9 appends ten more part files to the
Hive table: 2011-08-01 contributes Part-0 through Part-9, 2011-08-02 contributes Part-10
through Part-19, 2011-08-03 contributes Part-20 through Part-29, and so on.

For 3 days of data, 30 map tasks must be processed for any Hive query.
For 30 days of data, 300 map tasks must be processed for any Hive query.
To sqoop 10 tables into one partition, I chose to dynamically create a partition based on
date and Sqoop the data into the partition directory with an append. Each day's import from
Child_201108_0 through Child_201108_9 now lands in its own partition (dt=2011-08-01,
dt=2011-08-02, dt=2011-08-03, ...) holding that day's Part-0 through Part-9.

# Set date to yesterday
DATE=`date +%Y-%m-%d -d "1 day ago"`

TABLE_DIR=/user/hive/warehouse/${TABLE}
PARTITION_DIR=$TABLE_DIR/${DATE}

# Create partition
echo "ALTER TABLE ${TABLE} ADD IF NOT EXISTS PARTITION (dt='${DATE}') location '${PARTITION_DIR}'; exit;" | /usr/bin/hive

# Sqoop in event_logs
sqoop import --where "date_created between '${DATE} 00:00:00' and '${DATE} 23:59:59'" --target-dir $PARTITION_DIR --append
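Putting the pieces together, one plausible reading of the full daily job is the sketch below. The Hive table name, JDBC URL, credentials, and the loop over the ten child tables are assumptions added for illustration; only the date handling, partition creation, and the --where/--target-dir/--append import come from the slides.

#!/bin/bash
# Hypothetical end-to-end daily job assembled from the snippets above.
# TABLE, the JDBC URL, and the credentials are placeholders, not values from the deck.
TABLE=event_logs
DATE=$(date +%Y-%m-%d -d "1 day ago")
TABLE_DIR=/user/hive/warehouse/${TABLE}
PARTITION_DIR=${TABLE_DIR}/${DATE}

# Register the daily partition before loading data into its directory.
echo "ALTER TABLE ${TABLE} ADD IF NOT EXISTS PARTITION (dt='${DATE}') location '${PARTITION_DIR}'; exit;" | /usr/bin/hive

# Append yesterday's rows from each of the ten child tables into the same partition directory.
# One mapper per table yields the ten parts per partition shown on the slides.
for i in $(seq 0 9); do
  sqoop import \
    --connect jdbc:mysql://mysql-active.example.com/appdb \
    --username sqoop_user --password "${SQOOP_PW}" \
    --table Child_201108_${i} \
    --where "date_created between '${DATE} 00:00:00' and '${DATE} 23:59:59'" \
    --target-dir "${PARTITION_DIR}" \
    --append \
    --num-mappers 1
done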
As a result of sqooping into Hive partitions, only a minimal number of map tasks has to be
processed: a query restricted to one day reads just that day's partition (Part-0 through
Part-9) instead of every part in the table.

1 Day = 10 Map Tasks
2 Days = 20 Map Tasks
...
30 Days = 300 Map Tasks
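For example, following the deck's own echo-into-hive convention (the table name is the hypothetical event_logs used above), a query that filters on the partition column touches only that day's ten parts:

# Partition pruning: only the dt=2011-08-01 partition's ten parts are scanned.
echo "SELECT COUNT(*) FROM event_logs WHERE dt = '2011-08-01';" | /usr/bin/hive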
  
Space Consumption

One month of data from Parent_201108_Merge (Child_201108_0 through Child_201108_9) is about
30 GB. With an HDFS replication factor of 3, that is 30 GB per replica and 90 GB in HDFS per
month; a year of data therefore takes about 1.08 TB in HDFS (30 GB x 3 replicas x 12 months).
Sqooping with Snappy

sqoop import --compression-codec org.apache.hadoop.io.compress.SnappyCodec -z

With roughly 5:1 Snappy compression, one month of data drops from 30 GB to about 6 GB per
replica, or 18 GB in HDFS at replication factor 3; a year of data takes about 216 GB in HDFS
(6 GB x 3 replicas x 12 months).
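A quick way to sanity-check the savings (the warehouse path is the hypothetical one used earlier) is to compare the table's on-disk size in HDFS before and after switching the import to Snappy:

# Report the space used by the table's files in HDFS.
hadoop fs -du /user/hive/warehouse/event_logs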
  
Summary

1)  Develop some kind of incremental import when sqooping in large active tables. If you do
    not, your Sqoop jobs will take longer and longer as the data in the RDBMS grows.
2)  Limit the number of parts that will be stored in HDFS, since every part translates into a
    time-consuming map task; use partitioning if possible.
3)  Compress data in HDFS. You will save space, since your replication factor makes multiple
    copies of your data, and you may also benefit in processing, as your Map/Reduce jobs have
    less data to transfer and Hadoop becomes less I/O bound.
                                                                      	
  
?	
  
 	
  
