NextGen BigData Workloads in NextGen Sequencing - 20140402 - Phoenix - TGEN

Biomedical & Advertising Tech Overarching Themes*
Eugenics & Determinism
Free will vs. Determinism
Media Tech & Privacy
*Obligatory movie references… shout-out to my hometown LA

Biomedical Research Goal:
Therapeutics => Diagnostics => Prognostics
• Reverse engineer how genetic variation leads to (un)desired traits
• Therapeutics => traditional medicine
• Diagnostics => personalized medicine
– NextGen public health
– Requires hi-res mechanistic knowledge
• Prognostics => GATTACA (dys/eu)topia
– Managed populations / NextGen eugenics

NextGen BigData Workloads in
NextGen Sequencing
Allen Day, PhD
@MapR @allenday
April 2014

Typical Plan, Phases 1-4
1. Design Experiment / Collect Biosamples
2. Sequencing / Molecular Assays
3. Data Management
4. ? ? ?
5. PROFIT ! ! ! !
http://knowyourmeme.com/memes/profit

The Changing Workload in Underpants Collection
Sboner, et al, 2011. The real cost of sequencing: higher than you think!

Phase 4: Example GWAS/SNP Analysis
• Find me related SNPs…
– From other experiments
• Given a phenotype…
– And an associated SNP from my experiment
• That elucidate genetic basis of phenotype…
• And rank order them by impact/likelihood/etc

SELECT
snp, expEvidence
FROM
myExp, exp1, …
OUTER JOIN expN …
WHERE
myExp.snp = 'mySnp'
ORDER BY
p, freq, conservation, etc
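
The query above is slide pseudocode. A minimal runnable sketch of the same idea follows, using Python's built-in sqlite3. The schema (one table per experiment with columns snp and p), the table exp2, the helper evidence_for, and all data values are assumptions made up for illustration; they are not from the talk.

```python
import sqlite3

# Hypothetical schema: one table per experiment, each holding (snp, p) rows.
conn = sqlite3.connect(":memory:")
for table in ("myExp", "exp1", "exp2"):
    conn.execute(f"CREATE TABLE {table} (snp TEXT PRIMARY KEY, p REAL)")

# Invented example data.
conn.executemany("INSERT INTO myExp VALUES (?, ?)", [("mySnp", 1e-8), ("otherSnp", 0.03)])
conn.executemany("INSERT INTO exp1 VALUES (?, ?)", [("mySnp", 1e-5)])
conn.executemany("INSERT INTO exp2 VALUES (?, ?)", [("otherSnp", 0.2)])

def evidence_for(snp):
    """Gather evidence for one SNP across experiments.

    The outer joins keep the row even when another experiment
    never assayed the SNP; the missing value comes back as None.
    """
    query = """
        SELECT m.snp, m.p, e1.p, e2.p
        FROM myExp AS m
        LEFT OUTER JOIN exp1 AS e1 ON e1.snp = m.snp
        LEFT OUTER JOIN exp2 AS e2 ON e2.snp = m.snp
        WHERE m.snp = ?
        ORDER BY m.p
    """
    return conn.execute(query, (snp,)).fetchall()

rows = evidence_for("mySnp")
print(rows)
```

Every additional experiment adds another join, which is part of why this query shape gets expensive as the number of experiments grows.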

• In context of
– Racial background
– Experimental design-specific concerns (e.g. familial IBD/IBS)
– Environmental factors and penetrance
– Assay-specific biases and noise
SELECT
snp, expEvidence
FROM
myExp, exp1, …
OUTER JOIN expN …
WHERE
ORDER BY

• In context of, e.g.
– ε1: Racial, etc. background
– ε2: Experimental design-specific concerns (e.g. familial IBD/IBS)
– ε3: Environmental factors and penetrance
– ε4: Assay-specific biases and noise
phenotype = α·genotype + β + ε1 + ε2 + ε3 + ε4
At risk of over-simplification for business-level concept…
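
To make the linear model concrete, here is a small sketch that fits α and β by ordinary least squares on toy data. The genotype coding (0/1/2 allele counts), the data values, and the function name fit_line are assumptions for illustration. In this simplified fit the four ε terms are not separated; they all end up lumped into the residuals.

```python
# Toy data, invented for illustration: genotype coded as an allele count (0, 1 or 2).
genotype = [0, 0, 1, 1, 2, 2]
phenotype = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]

def fit_line(x, y):
    """Ordinary least squares for y = alpha * x + beta."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    alpha = sxy / sxx
    beta = mean_y - alpha * mean_x
    return alpha, beta

alpha, beta = fit_line(genotype, phenotype)

# Whatever the line does not explain (ε1 + ε2 + ε3 + ε4 in the slide's notation)
# is left in the residuals.
residuals = [y - (alpha * x + beta) for x, y in zip(genotype, phenotype)]
print(alpha, beta)
```

Separating the ε terms would need extra covariates (ancestry, family structure, environment, assay batch), which is where the cross-experiment context comes in.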

Phase 4: Automated Insights Engine
SELECT
snp, expEvidence
FROM
myExp, exp1, … expN
exps=powerset(all exps)
OUTER JOIN complement(exps)
WHERE
powerset(all SNPs, phenotypes)
ORDER BY
arbitrary models
SNPs,
experimental groupings,
assay technologies,
assayed phenotypes,
annotations/ontologies
C. briggsae inbred strain incompatibility
[supplementary slide]
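
A rough sketch of what "powerset(all exps)" joined against its complement implies for the search space. The experiment names are placeholders. The only fact relied on is that a set of n items has 2**n subsets.

```python
from itertools import chain, combinations

def powerset(items):
    """All subsets of items, from the empty set up to the full set."""
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

# Placeholder experiment names.
experiments = ["myExp", "exp1", "exp2", "exp3"]

# Each candidate comparison pairs a subset of experiments with its complement.
splits = [(set(subset), set(experiments) - set(subset)) for subset in powerset(experiments)]
print(len(splits))  # 2**4 subsets for 4 experiments

# The number of subsets doubles with every added experiment.
for n in (10, 20, 40):
    print(n, 2 ** n)
```

Enumerating every split is only feasible for a handful of experiments, which motivates prioritizing which comparisons to compute at all.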

Right…

Phase 4: Naïve Implementation
Big compute
It’s monolithic
It scales polynomially with data size
This is bad: it takes too long to get a result

Co-expression (10K samples) + Linkage => Gene Annotation / Set Completion
[Figure: co-expression matrix; axis labels are the genes BMP6, BMP2, MMP3, LIF, NOS2A, MMP13, CSPG4, ACAN, COL11A2, COL9A1, MATN1, LECT1, MATN4, HAPLN1, ITGA10, EDIL3, NGF, MAST4, MATN3, EPYC, COL11A1, COL10A1, THBS3, C1QTNF3, WISP1, PDPN, PDLIM4, CHST3, MIA, SOX5, CYTL1, TNMD, AKR1C1, MMP12, ETNK1, RELA, FOSL1, EIF2C2, NUPL1, RLF, RELB, SOD2, RNF24, XYLT1, HAS2, BDKRB1, HSPC159, SLC28A3, FZD10]
Day. 2009. Disease gene characterization through large-scale co-expression analysis.
http://www.ncbi.nlm.nih.gov/pubmed/20046828

Better: Push Logic to Phase 3
SNPs, assay technologies, assayed phenotypes
Denormalize and Percolate
(re)prioritize & (re)process
service queries
drive dashboards
create reports
denormalize for display
buffer
New models

What’s a Percolator?
• Google Percolator
– “Caffeine” update 2010
• Iterative, incremental updates
• No batch processing
• Decouple computation from data size
Peng & Dabek, 2010. Large-scale Incremental Processing Using Distributed Transactions and Notifications
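
A toy, single-process sketch of the notification idea: a write to an observed column queues a notification, and an observer then updates a derived value incrementally. The class TinyPercolator, the column names, and the data are invented for illustration. It omits the distributed transactions that the real system depends on and assumes each sample row is written only once.

```python
from collections import defaultdict, deque

class TinyPercolator:
    """In-memory table with per-column observers (illustrative only)."""

    def __init__(self):
        self.table = {}                     # (row, column) -> value
        self.observers = defaultdict(list)  # column -> list of callbacks
        self.pending = deque()              # queued (row, column) notifications

    def observe(self, column, callback):
        self.observers[column].append(callback)

    def write(self, row, column, value):
        self.table[(row, column)] = value
        self.pending.append((row, column))

    def run(self):
        """Drain notifications; return how many observer calls were made."""
        calls = 0
        while self.pending:
            row, column = self.pending.popleft()
            for callback in self.observers[column]:
                callback(self, row)
                calls += 1
        return calls

def update_mean(store, row):
    """Fold one new raw value into running statistics in constant time."""
    value = store.table[(row, "raw")]
    n = store.table.get(("stats", "n"), 0) + 1
    total = store.table.get(("stats", "total"), 0.0) + value
    store.write("stats", "n", n)
    store.write("stats", "total", total)
    store.write("stats", "mean", total / n)

store = TinyPercolator()
store.observe("raw", update_mean)

store.write("sample1", "raw", 2.0)
store.write("sample2", "raw", 4.0)
first_calls = store.run()

# A later arrival triggers one observer call, regardless of how much data already exists.
store.write("sample3", "raw", 6.0)
second_calls = store.run()
print(store.table[("stats", "mean")], first_calls, second_calls)
```

The work done per new sample stays constant as the table grows, which is the sense in which computation is decoupled from data size.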

What’s in the Percolator?
• Optimize for access patterns, maybe many tables
– Dependency graph of intermediate matrices
– AT = correlate(transpose(A))
• Parallelize table computation
– Twitter Algebird
• Analysis of intermediates triggers downstream action
– Codify business logic (research methods) into data management layer
– Prioritize and minimize unproductive computation
https://github.com/twitter/algebird
http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-d
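
Algebird is a Scala library; the sketch below is not its API. It is a plain-Python illustration of the monoid idea behind parallel table computation: summarize each data partition independently, then merge the summaries with an associative plus. The function names and data are assumptions. The sums-of-products formula is the simple textbook one and is not numerically robust for large values.

```python
from functools import reduce
from math import sqrt

def zero():
    """Identity element: an empty summary (n, sum x, sum y, sum xx, sum yy, sum xy)."""
    return (0, 0.0, 0.0, 0.0, 0.0, 0.0)

def lift(x, y):
    """Turn one observation into a summary."""
    return (1, x, y, x * x, y * y, x * y)

def plus(a, b):
    """Associative merge of two summaries (element-wise addition)."""
    return tuple(ai + bi for ai, bi in zip(a, b))

def summarize(pairs):
    return reduce(plus, (lift(x, y) for x, y in pairs), zero())

def correlation(summary):
    """Pearson correlation recovered from the merged sums."""
    n, sx, sy, sxx, syy, sxy = summary
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))

# Invented data split across two partitions (e.g. two workers).
partition_1 = [(1.0, 2.0), (2.0, 4.1)]
partition_2 = [(3.0, 5.9), (4.0, 8.2)]

merged = plus(summarize(partition_1), summarize(partition_2))
all_at_once = summarize(partition_1 + partition_2)
print(correlation(merged), correlation(all_at_once))
```

Because plus is associative, a new partition can be folded into an existing summary without revisiting old data, which fits the incremental style described above.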

MapR M7 especially suitable – services complex multi-tenant workloads at very large scale, see
http://www.slideshare.net/allenday/20131212-sydney-big-data-analytics

Double Percolator
• Apologies, Google Images yields no SFW images for “Double Percolator”

ENCODE
http://www.nature.com/news/encode-the-human-encyclopaedia-1.11312

Data Generation (e.g. basic research)
Data Analysis (e.g. pharma)
Clinical / Patient consumption
Control channel

Robot Scientist
Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery

Robot (Data?) Scientist
Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery

If they were unlabeled, would you
know which is which?
Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building
NPR. 2011. The Search For Analysts To Make Sense Of 'Big Data'
http://www.npr.org/2011/11/30/142893065

Building
• Identify network structures
• Label them
• Observe stimulus=>response space mapping
• Purposefully target
• PROFIT ! ! ! !
Parallels to Twitter revenue model
social network node labeling => gene annotation
Google Knowledge Graph => bio-ontologies
Ad impressions => small molecule perturbation
Profit => Save lives
http://www.google.com/insidesearch/features/search/knowledge.html
http://www.bioontology.org/

Dendrite on M7 HBase
Denormalize and Percolate
(re)prioritize & (re)process
MapR M7
HBase
Titan
API
Percolation “business logic”
Dendrite
Visualization & ad-hoc queries
detailed view…

Further Reading
MapR M7 especially suitable – services complex multi-tenant workloads at very large scale, see @allenday deck:
http://www.slideshare.net/allenday/20131212-sydney-big-data-analytics
Implementing matrix transforms + business logic workflows, see @ceteri “Enterprise Data Workflows with Cascading”:
http://shop.oreilly.com/product/0636920028536.do
Math and data structure underpinnings, see @ceteri and @allenday “Just Enough Math”:
http://liber118.com/pxn/course/itml/just_enough_math.html

Further Reading
Day, et al. 2007. Celsius: a community resource for Affymetrix microarray data.
http://www.ncbi.nlm.nih.gov/pubmed/17570842
Human Genetics & Big Data
http://www.slideshare.net/allenday/20131212-sydney-garvan-institute-human-genetics-and-big-data

Next Topic: Optimizing 1º Analysis
Sboner, et al, 2011. The real cost of sequencing: higher than you think!
<= We were just here
“future high ROI use cases”
<= We now go here
“current high ROI use cases”

1º seq analysis, in a nutshell

Crossbow
Langmead, et al. 2009. Searching for SNPs with cloud computing

1º seq analysis, format details
.fastq => short read alignment => .bam => genotype calling => .vcf => analysis

1º seq analysis, map-reduce style
.fastq .bam .vcf
short read alignment
genotype calling
MAP
MAP
REDUCE, rotate matrix 90°
ref seq
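
A toy sketch of the map-reduce shape of primary analysis. It is not how Crossbow or any production aligner works: the reference string, the reads, the naive mismatch-counting aligner, and the majority-vote caller are all invented for illustration. Reading "rotate matrix 90°" as the regrouping from read-major to position-major data is an interpretation, not something the slide states.

```python
from collections import Counter, defaultdict

REFERENCE = "ACGTACGGTCA"  # invented reference sequence

def map_align(read, max_mismatches=1):
    """MAP: place a read at the first offset with few mismatches; emit (position, base) pairs."""
    for start in range(len(REFERENCE) - len(read) + 1):
        window = REFERENCE[start:start + len(read)]
        mismatches = sum(1 for r, w in zip(read, window) if r != w)
        if mismatches <= max_mismatches:
            return [(start + i, base) for i, base in enumerate(read)]
    return []  # read could not be placed

def shuffle(pairs):
    """Regroup from per-read records to per-position pileups."""
    pileup = defaultdict(list)
    for position, base in pairs:
        pileup[position].append(base)
    return pileup

def reduce_call(position, bases):
    """REDUCE: majority-vote base at one position; report it only if it differs from the reference."""
    consensus, _count = Counter(bases).most_common(1)[0]
    if consensus != REFERENCE[position]:
        return (position, REFERENCE[position], consensus)
    return None

# Invented reads, each carrying a C->T difference at reference position 5.
reads = ["GTATGG", "TATGGT", "ATGGTC", "ACGTAT"]

pairs = [pair for read in reads for pair in map_align(read)]
pileup = shuffle(pairs)
candidates = (reduce_call(pos, bases) for pos, bases in sorted(pileup.items()))
calls = [call for call in candidates if call is not None]
print(calls)
```

Each read is aligned independently and each position is called independently, so both stages parallelize; only the regrouping step in between needs to move data.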

Ion Flux
• Sequencing workflow in MapReduce (Hadoop, Cascading, Amazon Elastic M/R)
• Integrated with Ion Torrent as a plugin to stream sequence to the cloud
• Emphasis on scalability and latency
– assay->clinical report turnaround in < 24h
• Compare to fast-follower stack ILMN MiSeq+BaseSpace
http://aws.amazon.com/solutions/case-studies/ion-flux/
http://d36cz9buwru1tt.cloudfront.net/Ion-Flux-2011-02-Architecture.pdf

SeqWare / Nimbus Informatics
O’Connor, et al. 2010. SeqWare Query Engine: storing and searching sequence data in the cloud
http://seqware.github.io/

MapR Advantage for R&D
• Home directories on DFS
– NFS. Transparent to user
• Low prototype-cost / research support
– Scale prototype as needed
• Low transition cost / operationalizing research
– Prototype incrementally becomes a product
• Low operational cost / high machine utilization
– Leverage MapR performance

C. briggsae inbred line incompatibility
Ross, et al. 2011. Caenorhabditis briggsae Recombinant Inbred Line Genotypes Reveal Inter-Strain Incompatibility and the Evolution of Recombination

[Figure: eQTLs × Samples matrices combined by self join]

Incidence matrices
A (U*Q) and B (U*V)
A: Users × Query Terms
B: Users × Clicked Videos
Query Term = Clicked Term

Join on dimension U…
QueryTerms
Users

Relate Q to V
QueryTerms
Users

Cross-recommendation
QueryTerms
Clicked Videos
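
The join on the shared user dimension U can be sketched as a matrix product: transpose(A) times B gives a Query Terms × Clicked Videos co-occurrence table. The users, terms, videos, matrix entries, and the helper recommend are made up for illustration.

```python
# Invented labels for the three dimensions.
users = ["u1", "u2", "u3"]
query_terms = ["jazz", "guitar"]
videos = ["v_blues", "v_chords", "v_cats"]

# Incidence matrices: A is Users x Query Terms, B is Users x Clicked Videos.
A = [[1, 0],
     [1, 1],
     [0, 1]]
B = [[1, 0, 0],
     [1, 1, 0],
     [0, 1, 1]]

def transpose(m):
    return [list(column) for column in zip(*m)]

def matmul(x, y):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, column)) for column in zip(*y)] for row in x]

# Join on U: entry [q][v] counts users who both searched term q and clicked video v.
C = matmul(transpose(A), B)

def recommend(term):
    """Video with the highest co-occurrence count for a query term."""
    row = C[query_terms.index(term)]
    return videos[row.index(max(row))]

print(C)
print(recommend("jazz"), recommend("guitar"))
```

The same product shape applies to the eQTL and sample matrices on the earlier slide: any two incidence matrices that share a dimension can be related this way.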
