Architectural Building Blocks of the Netflix Cloud Platform and lessons learned while implementing the same.
Commandments of Web Scale Cloud Deployments
3. What is Netflix?
With more than 26 million streaming members in the United States, Canada, Latin America, the United Kingdom and Ireland, Netflix, Inc. (NASDAQ: NFLX) is the world's leading internet subscription service for enjoying movies and TV programs. … In all, more than 800 devices that stream from Netflix are available.
(http://ir.netflix.com)
http://bit.ly/LWST5w
4. Who Am I
• In the Movie Business :)
– Manager, Cloud Platform/Infrastructure @ Netflix
– @ Netflix since 2008
– Prior day jobs
• System Architect/Lead @ AOL (Netscape, iPlanet, Sun)
• @stonse
• http://www.linkedin.com/in/sudhirtonse
Important: This talk is a developer community outreach by me as an individual and the content here may or may not reflect Netflix’s official view.
5. Why am I here?
• Share the Story of Netflix and its use of the Amazon Cloud
– Why did Netflix move to the Cloud?
– How did we move?
– What did we learn?
• Share Technical Challenges and Solutions
– Contribute back to the community
• Perhaps Interest you in Helping us Reach the Next Steps
– Yes, I am Hiring!
6. What is in it for You?
• Various Open Source Offerings
• Tech papers
• Blogs & Articles
• Meetups and Talks like this :)
7. What’s in it for Netflix?
• Small bird in a Big Cloud
• Tech Community Engagement
• Open Source Contributions
10. What’s a Cloud?
• Cloud: Cloud computing is the delivery of computing and storage capacity [1] as a service [2] to a heterogeneous community of end-recipients.
Images Courtesy: Wikipedia/Company logos
16. Why Cloud contd …
• Undifferentiated Heavy Lifting
– Multi Region
– On Demand Computing Power
– Tons of Features :)
17. On Demand Auto Scaling!
• Traffic Patterns
• Scale UP & Down based on Demand
– Use CloudWatch
• RPS
• Load Average
• …
[Figure: Compute vs. Time charts for five traffic patterns: Slow Growth, Periodic Jobs, Predictable Bursts, Unpredictable Bursts, Steady State]
18. Scale Up
[Figure: groups of Instance boxes]
19. Scale Down
[Figure: a group of Instance boxes]
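The scale-up and scale-down behaviour on slides 17 to 19 can be sketched as a simple sizing rule. This is only an illustration under stated assumptions, not how AWS Auto Scaling or Netflix implement it: the function name `desired_capacity`, the per-instance RPS target and the group bounds are all made up for the example.

```python
import math

def desired_capacity(total_rps, target_rps_per_instance, min_size, max_size):
    """Return how many instances are needed to serve total_rps.

    All parameters are hypothetical; real policies are driven by
    CloudWatch alarms on metrics such as RPS or load average.
    """
    needed = math.ceil(total_rps / target_rps_per_instance)
    # Clamp to the group's configured bounds.
    return max(min_size, min(max_size, needed))

# A burst of traffic scales the group up, a quiet period scales it down.
burst = desired_capacity(total_rps=4500, target_rps_per_instance=500, min_size=3, max_size=20)
quiet = desired_capacity(total_rps=600, target_rps_per_instance=500, min_size=3, max_size=20)
```

The minimum bound keeps some spare capacity during quiet periods, and the maximum bound caps cost during unpredictable bursts.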
27. What’s in a name?
• Cloud instances are ephemeral
– They have no fixed NAME
– They have a public IP address, a private IP address and can optionally be associated with an Elastic IP Address
– How can you address your services?
• Via Elastic IP (but these are limited per account)
• Route 53 (A DNS service offered by Amazon)
• Netflix uses an in-house app called Discovery Service
– Keeper of addresses and metadata of running instances
Shakespeare
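The idea of a "keeper of addresses and metadata of running instances" can be sketched as a small in-memory registry. This is a toy model, not the Discovery Service itself: the class name `DiscoveryRegistry`, its methods, and the lease-expiry rule (drop instances that stop sending heartbeats) are assumptions introduced for illustration.

```python
import time

class DiscoveryRegistry:
    """Toy in-memory registry; names and fields are illustrative only."""

    def __init__(self, lease_seconds=90.0, clock=time.monotonic):
        self._lease_seconds = lease_seconds
        self._clock = clock
        # (app, instance_id) -> (address, metadata, last_heartbeat)
        self._instances = {}

    def register(self, app, instance_id, address, metadata=None):
        self._instances[(app, instance_id)] = (address, dict(metadata or {}), self._clock())

    def heartbeat(self, app, instance_id):
        address, metadata, _ = self._instances[(app, instance_id)]
        self._instances[(app, instance_id)] = (address, metadata, self._clock())

    def lookup(self, app):
        """Return addresses of instances whose lease has not expired."""
        now = self._clock()
        return sorted(
            address
            for (name, _), (address, _, seen) in self._instances.items()
            if name == app and now - seen <= self._lease_seconds
        )

# A fake clock makes the example deterministic.
fake_now = [0.0]
registry = DiscoveryRegistry(lease_seconds=90.0, clock=lambda: fake_now[0])
registry.register("movies", "i-1", "10.0.0.1:7001", {"zone": "us-east-1a"})
registry.register("movies", "i-2", "10.0.0.2:7001", {"zone": "us-east-1b"})
fake_now[0] = 60.0
registry.heartbeat("movies", "i-1")
fake_now[0] = 120.0  # i-2 last seen 120s ago, i-1 renewed 60s ago
live = registry.lookup("movies")
```

Because instances are ephemeral, callers look services up by application name rather than by a fixed host name, and stale entries simply age out.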
28. Inter Process Communication
• Netflix uses NIWS
– Netflix Internal Web Services
– Common infrastructural library that aids in RPC
• Based on JSR-311 (Jersey)
• Uses Discovery Service to obtain instances of every service
• Has an in-built Mid Tier s/w LoadBalancer
Sudhir Tonse
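A software load balancer that sits in the client and picks among discovered instances could look roughly like the sketch below. It is not NIWS code: the class `RoundRobinBalancer` is hypothetical, round-robin is just one possible selection rule, and the lookup callable stands in for a query to a discovery service.

```python
import itertools

class RoundRobinBalancer:
    """Cycles through server addresses supplied by a lookup function."""

    def __init__(self, lookup):
        self._lookup = lookup              # callable returning a list of addresses
        self._counter = itertools.count()

    def choose(self):
        servers = self._lookup()           # re-read each time: instances come and go
        if not servers:
            raise RuntimeError("no instances available")
        return servers[next(self._counter) % len(servers)]

servers = ["10.0.0.1:7001", "10.0.0.2:7001", "10.0.0.3:7001"]
balancer = RoundRobinBalancer(lambda: list(servers))
picks = [balancer.choose() for _ in range(4)]
servers.remove("10.0.0.2:7001")            # an instance disappears
after_removal = balancer.choose()
```

Re-reading the server list on every call keeps the balancer in step with instances being added or removed, at the cost of one lookup per request.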
29. BiopSys
Danny Yuan
• Search Logs on 1000s of Amazon Instances
– Per Cluster, Apps, Instances, Time Range etc.
31. Metrics
• One cannot fully Understand what One cannot Observe :)
• Netflix Platform has several Metrics/Data Collection components
– Servo (@Monitors)
– Tracers/Counters
– Chukwa (for Log Events and Business Metrics)
– More :)
34. Lessons Learned
• Roman Riding is hard
– e.g. sharing traffic between Datacenter (SQL) and Cloud (NoSQL)
• Plan for Failure
– Test for Failure (Chaos Monkey & Simian Army)
37. Cloud Commandments
1. Thou shalt not have Sticky in-memory sessions
– Hard to Scale
2. Thou shalt not directly use a Central SQL database in the user request path
– At least not one that uses locks and transactions
3. Thou shalt not store important data on ephemeral instances
– These are lost when instances go down. Use EBS volumes, S3 or other persistence stores
4. Thou shalt embrace a homogeneous architecture
– Much easier to achieve operational efficiency
5. Thou shalt understand and embrace the CAP theorem
– Choice between CP and AP. Most web scale deployments choose AP
6. Thou shalt guard all external calls using the Dependency Command Pattern
– Idea is to effectively guard user request processing threads
7. Thou shalt be prepared to scale according to thy needs
– Web traffic can come in bursts; it's important to scale up/down the whole SOA stack based on resources needed
38. Cloud Commandments contd…
8. Thou shalt keep a wary eye on thy cost
– It all adds up eventually. Plenty of low-hanging fruit available to save costs
9. Thou shalt secure thy data and instances
– Encrypt data; secure access to instances. (Pay attention to Security Groups)
10. Thou shalt instrument thy code
– You can't trust what you can't see
11. Thou shalt effectively monitor thy access points
– It's the cloud and things can go wrong or go reaaaal slooow
12. Thou shalt deploy thy instances in multiple regions and zones
– For maximizing SLAs and availability
13. Thou shalt be wary of SPOF
– Mantra of distributed system design
14. Thou shalt always plan for failure
– It's just a question of when, not if. Have a good backup plan
44. Dependency Command
• network timeouts and retries
• separate threads on per-dependency thread pools
• semaphores (via a tryAcquire, not a blocking call)
• circuit breakers
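Two of the guards listed above, the non-blocking semaphore and the circuit breaker, can be combined in a few lines. The sketch below is an assumption-laden illustration, not the Netflix implementation: the class `DependencyCommand`, its thresholds and the fallback convention are hypothetical, and a production breaker would also retry the dependency after a cool-down period, which is omitted here.

```python
import threading

class DependencyCommand:
    """Guards calls to one dependency; thresholds are illustrative."""

    def __init__(self, call, fallback, max_concurrent=10, failure_threshold=3):
        self._call = call
        self._fallback = fallback
        self._permits = threading.BoundedSemaphore(max_concurrent)
        self._failure_threshold = failure_threshold
        self._consecutive_failures = 0
        self._lock = threading.Lock()

    def circuit_open(self):
        with self._lock:
            return self._consecutive_failures >= self._failure_threshold

    def execute(self, *args):
        if self.circuit_open():
            return self._fallback(*args)        # fail fast, skip the dependency
        if not self._permits.acquire(blocking=False):
            return self._fallback(*args)        # too many in-flight calls: shed load
        try:
            result = self._call(*args)
        except Exception:
            with self._lock:
                self._consecutive_failures += 1
            return self._fallback(*args)
        else:
            with self._lock:
                self._consecutive_failures = 0
            return result
        finally:
            self._permits.release()

calls = []

def flaky_ratings(user_id):
    calls.append(user_id)
    raise TimeoutError("ratings service timed out")

command = DependencyCommand(flaky_ratings, fallback=lambda user_id: [], failure_threshold=3)
results = [command.execute(42) for _ in range(5)]
```

After three consecutive failures the circuit opens, so the last two requests return the fallback without touching the failing dependency, which is what protects the user request threads.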
45. Failures
• Failures will happen
– It’s a question of when and how NOT “if”
– Plan
• Regularly Test for possible Failures
– Netflix Simian Army: e.g. Chaos Monkey, Latency Monkey …
• Severity
– Minimize the impact of a failure
• Occurrence
– Minimize the frequency of a failure
• Observability
– Minimize the time to detect and respond
46. Simian Army
Chaos Monkey
• Simulates hard failures in AWS by killing a few instances per ASG (i.e. Auto Scale Group)
• Similar to how EC2 instances can be killed by AWS with little warning
• Tests clients' ability to gracefully deal with broken connections, interrupted calls, etc...
• Verifies that all services are running within the protection of AWS Auto Scale Groups, which reincarnate killed instances
• If not, the Chaos monkey will win!
Conformity Monkey
• Verifies that all services are running within the protection of AWS Auto Scale Groups, which reincarnate killed instances
• If not, app/service team is notified
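The two behaviours on this slide, picking a few instances per group to kill and flagging instances that run outside any group, can be sketched as pure functions over plain data. This is a simulation only and makes no AWS calls; the function names, the group names and the instance ids are hypothetical.

```python
import random

def pick_victims(groups, per_group=1, rng=random):
    """Choose up to per_group instances to terminate from each group.

    groups maps a (hypothetical) ASG name to a list of instance ids.
    Nothing is terminated here; the caller decides what to do.
    """
    victims = {}
    for name, instances in groups.items():
        count = min(per_group, len(instances))
        victims[name] = rng.sample(instances, count)
    return victims

def unprotected(all_instances, groups):
    """Conformity check: instances that belong to no group."""
    grouped = {i for members in groups.values() for i in members}
    return sorted(set(all_instances) - grouped)

groups = {"api-v1": ["i-a1", "i-a2", "i-a3"], "search-v7": ["i-b1"]}
victims = pick_victims(groups, per_group=1, rng=random.Random(7))
strays = unprotected(["i-a1", "i-a2", "i-a3", "i-b1", "i-zz"], groups)
```

Here `strays` lists the one instance that no group would replace if it were killed, which is the case the slide says gets reported to the owning team.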
47. Simian Army …
Latency Monkey
• Simulates soft failures -- i.e. a service gets slower
• Injects random delays in NIWS (client-side) or Server (server-side) of a client-server interaction
• Tests the ability of applications to detect and recover (i.e. Graceful Degradation) from the harder problem of delays, which lead to thundering herds and timeouts
Other Monkeys
• Security Monkey
• Janitor Monkey
• Efficiency Monkey
• .. more
Chaos Gorilla
• Simulates Zone Outage
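Injecting delay and checking that the caller degrades gracefully can be sketched with a wrapper and a timeout. This is an illustration, not Latency Monkey itself: `with_injected_latency`, `call_with_timeout`, the delay bound and the timeout values are all assumptions chosen to keep the example fast.

```python
import concurrent.futures
import random
import time

def with_injected_latency(func, max_delay_seconds, rng=random):
    """Wrap func so every call first sleeps for a random delay."""
    def slowed(*args, **kwargs):
        time.sleep(rng.uniform(0.0, max_delay_seconds))
        return func(*args, **kwargs)
    return slowed

def call_with_timeout(pool, func, timeout_seconds, fallback):
    """Run func on the pool; return fallback if it does not finish in time."""
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        return fallback

def fetch_titles():
    return ["title-1", "title-2"]

class AlwaysMax:
    """Test double that makes the injected delay deterministic."""
    def uniform(self, low, high):
        return high

slow_fetch = with_injected_latency(fetch_titles, 0.3, rng=AlwaysMax())
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    degraded = call_with_timeout(pool, slow_fetch, 0.05, fallback=[])
    healthy = call_with_timeout(pool, fetch_titles, 1.0, fallback=[])
```

The slowed call exceeds its timeout and the caller falls back to an empty list, while the undelayed call returns normally; a real test would also watch for retries piling up behind the slow service.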
48. Building Redundancy and Availability
• Deploy in multiple zones and consider multiple regions
• Pay attention to various modes of failures
49. Three Balanced Availability Zones
[Figure: Load Balancers in front of Zone A, Zone B and Zone C, each zone with its own Persistence Store]
Courtesy @adrianco
50. Triple Replicated Persistence
[Figure: Load Balancers in front of Zone A, Zone B and Zone C, each zone with its own Persistence Store]
51. Isolated Regions
[Figure: US-East and EU-West regions, each with its own Load Balancers in front of Zone A, Zone B and Zone C; the per-zone stores are labeled Persistence Store and Cassandra Replicas]
54. Tips/Guidelines contd …
• Amazon CloudWatch
– Is your friend! Netflix Servo (http://github.com/netflix/servo) helps you publish metrics to CloudWatch
• ELB
– Always keep your Zones Balanced!
– Healthcheck URLs are important
• Auto Scaling Groups
– This is an amazing feature that can really save you $$$s and help you run more efficiently. Read http://bit.ly/NgwS0K
55. Tips/Guidelines contd …
• Keep active track of Usage Costs
– Usage costs can surprise you!
– Netflix has an internal tool which we may open source. Watch @NetflixOSS
• Reserve Instances
– Reservation can save you $$$s (up to 71%!!) (YMMV)
– Guarantees availability when you need it
56. Tips/Guidelines
• S3 Best Practices
– Amazon doc: http://bit.ly/MW93xj
– Know when to use Regional S3 Endpoints
• Important when your dev/test team and deployments are in different regions
– Use Smart Bucket/Key naming
• Use 3 to 63 characters.
• Use only lower case letters (at least one), numbers, '.' and '-'.
• Don't start or end the bucket name with '.' and don't follow or precede a '.' with a '-'.
– Compress Data
– Use TTLs
– Many more …
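The bucket naming rules listed on this slide translate directly into a small validator. The sketch below checks only those rules as stated here; the function name `is_valid_bucket_name` is made up for the example, and Amazon's current naming rules may differ, so treat it as an illustration rather than an authoritative check.

```python
import re

_ALLOWED = re.compile(r"[a-z0-9.-]+")

def is_valid_bucket_name(name):
    """Check a bucket name against the rules listed on the slide only."""
    if not 3 <= len(name) <= 63:
        return False
    if not _ALLOWED.fullmatch(name):
        return False
    if not any(ch.islower() for ch in name):   # at least one lower case letter
        return False
    if name.startswith(".") or name.endswith("."):
        return False
    if ".-" in name or "-." in name:
        return False
    return True
```

For example, `is_valid_bucket_name("netflix-logs.us-east-1")` passes all five checks, while `"logs.-prod"` is rejected because a '.' is followed by a '-'.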
67. Credits
Adrian Cockcroft (@adrianco), Ruslan Meshenberg (@rusmeshenberg), Yury Izrailevsky, Joe Sondow (@joesondow), Ben Christensen (@benchristensen), Jordan Zimmerman (@randgalt), Ariel Tseitlin (@atseitlin), Allen Wang, Eran Landau, Danny Yuan, Pradeep Kamath
And Members of the Netflix Cloud Platform Team
69. Amazon Cloud Terminology Reference
See http://aws.amazon.com/ This is not a full list of Amazon Web Service features
(courtesy @adrianco)
• AWS – Amazon Web Services (common name for Amazon cloud)
• AMI – Amazon Machine Image (archived boot disk, Linux, Windows etc. plus application code)
• EC2 – Elastic Compute Cloud
– Range of virtual machine types m1, m2, c1, cc, cg. Varying memory, CPU and disk configurations.
– Instance – a running computer system. Ephemeral, when it is de-allocated nothing is kept.
– Reserved Instances – pre-paid to reduce cost for long term usage
– Availability Zone – datacenter with own power and cooling hosting cloud instances
– Region – group of Availability Zones – US-East, US-West, EU-Eire, Asia-Singapore, Asia-Japan
• ASG – Auto Scaling Group (instances booting from the same AMI)
• S3 – Simple Storage Service (http access)
• EBS – Elastic Block Store (network disk filesystem can be mounted on an instance)
• RDS – Relational Database Service (managed MySQL master and slaves)
• SDB – Simple Data Base (hosted http based NoSQL data store)
• SQS – Simple Queue Service (http based message queue)
• SNS – Simple Notification Service (http and email based topics and messages)
• EMR – Elastic Map Reduce (automatically managed Hadoop cluster)
• ELB – Elastic Load Balancer
• EIP – Elastic IP (stable IP address mapping assigned to instance or ELB)
• VPC – Virtual Private Cloud (extension of enterprise datacenter network into cloud)
• IAM – Identity and Access Management (fine grain role based security keys)