Biomedical & Advertising
Tech Overarching Themes*
Eugenics & Determinism Free will vs. Determinism Media Tech & Privacy
*Obligatory movie references… shout-out to my hometown LA
Biomedical Research Goal:
Therapeutics => Diagnostics => Prognostics
• Reverse engineer how genetic variation leads to
(un)desired traits
• Therapeutics => traditional medicine
• Diagnostics => personalized medicine
– NextGen public health
– Requires hi-res mechanistic knowledge
• Prognostics => GATTACA (dys/eu)topia
– Managed populations / NextGen eugenics
NextGen BigData Workloads in
NextGen Sequencing
Allen Day, PhD
@MapR @allenday
April 2014
Typical Plan, Phases 1-4
1. Design Experiment /
Collect Biosamples
2. Sequencing /
Molecular Assays
3. Data Management
4. ? ? ?
5. PROFIT ! ! ! !
http://knowyourmeme.com/memes/profit
The Changing Workload in
Underpants Collection
Sboner, et al, 2011. The real cost of sequencing: higher than you think!
Phase 4: Example GWAS/SNP Analysis
• Find me related SNPs…
– From other experiments
• Given a phenotype…
– And an associated SNP
from my experiment
• That elucidate genetic
basis of phenotype…
• And rank order them by
impact/likelihood/etc
Phase 4: Example GWAS/SNP Analysis
SELECT
snp, expEvidence
FROM
myExp, exp1, …
OUTER JOIN expN …
WHERE
myExp.snp = 'mySnp'
ORDER BY
p, freq, conservation, etc
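A minimal runnable sketch of the query above, using Python's built-in sqlite3 module. The single `evidence` table, its columns, the SNP identifiers and the p-values are all hypothetical stand-ins for the per-experiment tables on the slide; ranking is by p-value only.

```python
import sqlite3

# Hypothetical schema: one row per (experiment, SNP) association result.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE evidence (exp TEXT, snp TEXT, phenotype TEXT, p REAL, freq REAL)"
)
conn.executemany(
    "INSERT INTO evidence VALUES (?, ?, ?, ?, ?)",
    [
        ("myExp", "rs0001", "height", 1e-6, 0.21),
        ("exp1", "rs0001", "height", 3e-4, 0.19),
        ("exp1", "rs0002", "height", 5e-8, 0.05),
        ("exp2", "rs0003", "weight", 2e-5, 0.40),
        ("exp2", "rs0004", "height", 0.20, 0.33),
    ],
)


def related_snps(conn, my_exp, my_snp):
    """SNPs from other experiments that share a phenotype with my_snp,
    ranked by ascending p-value (strongest evidence first)."""
    rows = conn.execute(
        """
        SELECT other.snp, other.exp, other.p
        FROM evidence AS mine
        JOIN evidence AS other
          ON other.phenotype = mine.phenotype AND other.exp <> mine.exp
        WHERE mine.exp = ? AND mine.snp = ?
        ORDER BY other.p ASC
        """,
        (my_exp, my_snp),
    )
    return rows.fetchall()


ranked = related_snps(conn, "myExp", "rs0001")
```

A real ranking would also weigh allele frequency and conservation, as the ORDER BY on the slide suggests; those are left out here to keep the sketch short.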
Phase 4: Example GWAS/SNP Analysis
• In context of
– Racial background
– Experimental design-
specific concerns (e.g.
familial IBD/IBS)
– Environmental factors
and penetrance
– Assay-specific biases and
noise
Phase 4: Example GWAS/SNP Analysis
• In context of, e.g.
– ε1: Racial, etc.
background
– ε2: Experimental design-
specific concerns (e.g.
familial IBD/IBS)
– ε3: Environmental factors
and penetrance
– ε4: Assay-specific biases
and noise
phenotype = α·genotype + β + ε1 + ε2 + ε3 + ε4
At risk of over-simplification for
business-level concept…
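As a rough illustration of the simplified model, the sketch below fits α and β by ordinary least squares in plain Python. The genotype coding (0, 1 or 2 copies of an allele) and the phenotype values are made-up assumptions, and the four ε terms are lumped into a single residual per sample rather than modelled separately.

```python
def fit_line(genotypes, phenotypes):
    """Ordinary least squares for phenotype = alpha * genotype + beta."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_p = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (p - mean_p) for g, p in zip(genotypes, phenotypes))
    alpha = sxy / sxx
    beta = mean_p - alpha * mean_g
    return alpha, beta


# Hypothetical data: allele counts and a noisy quantitative trait.
genotypes = [0, 1, 2, 0, 1, 2]
phenotypes = [1.1, 1.5, 1.9, 0.9, 1.5, 2.1]

alpha, beta = fit_line(genotypes, phenotypes)

# Whatever the line does not explain is the combined error term
# (ε1 + ε2 + ε3 + ε4 in the slide's notation).
residuals = [p - (alpha * g + beta) for g, p in zip(genotypes, phenotypes)]
```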
Phase 4: Automated Insights Engine
SELECT
snp, expEvidence
FROM
myExp, exp1, … expN
exps=powerset(all exps)
OUTER JOIN complement(exps)
WHERE
myExp.snp = 'mySnp'
powerset(all SNPs, phenotypes)
ORDER BY
p, freq, conservation, etc
arbitrary models
SNPs,
experimental groupings,
assay technologies,
assayed phenotypes,
annotations/ontologies
C. briggsae inbred strain compatibility
[supplementary slide]
Phase 4: Automated Insights Engine
Right…
SNPs,
experimental groupings,
assay technologies,
assayed phenotypes,
annotations/ontologies
Phase 4: Naïve Implementation
SNPs,
experimental groupings,
assay technologies,
assayed phenotypes,
annotations/ontologies
Big compute
It’s monolithic
It scales polynomially with data size
This is bad, it takes too long to get a result
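To put rough numbers on the scaling problem, the sketch below counts the work in two parts of the naïve approach. The function names are illustrative only. All-pairs comparison of features is quadratic, and enumerating every subset of experiments, as in the powerset step earlier, grows exponentially, which is worse still.

```python
from math import comb


def pairwise_comparisons(n_features):
    """Number of distinct feature pairs, n * (n - 1) / 2."""
    return comb(n_features, 2)


def candidate_groupings(n_experiments):
    """Number of non-empty subsets of experiments, 2**n - 1."""
    return 2 ** n_experiments - 1


for n in (10, 100, 1000):
    print(n, pairwise_comparisons(n), candidate_groupings(n))
```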
Co-expression (10K samples) and Linkage
Gene Annotation / Set Completion
[Figure: gene-by-gene co-expression matrix. Axis labels: BMP6, BMP2, MMP3, LIF, NOS2A, MMP13, CSPG4, ACAN, COL11A2, COL9A1, MATN1, LECT1, MATN4, HAPLN1, ITGA10, EDIL3, NGF, MAST4, MATN3, EPYC, COL11A1, COL10A1, THBS3, C1QTNF3, WISP1, PDPN, PDLIM4, CHST3, MIA, SOX5, CYTL1, TNMD, AKR1C1, MMP12, ETNK1, RELA, FOSL1, EIF2C2, NUPL1, RLF, RELB, SOD2, RNF24, XYLT1, HAS2, BDKRB1, HSPC159, SLC28A3, FZD10]
Day. 2009. Disease gene characterization through large-scale co-expression analysis.
http://www.ncbi.nlm.nih.gov/pubmed/20046828
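A toy version of co-expression ranking, assuming Pearson correlation as the similarity measure (the cited paper's exact method is not reproduced here). The gene names are taken from the figure labels; the expression values are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical expression levels across four samples.
expression = {
    "ACAN": [1.0, 2.0, 3.0, 4.0],
    "COL11A2": [2.0, 4.1, 5.9, 8.0],
    "SOD2": [4.0, 3.0, 2.0, 1.0],
    "RELA": [1.0, 3.0, 2.0, 2.5],
}


def coexpressed_with(query, expression):
    """Other genes ordered from most to least correlated with the query gene."""
    scores = [
        (pearson(expression[query], profile), gene)
        for gene, profile in expression.items()
        if gene != query
    ]
    return [gene for _, gene in sorted(scores, reverse=True)]


neighbours = coexpressed_with("ACAN", expression)
```

Genes at the top of such a list are candidates for "set completion": unannotated genes that behave like an annotated set.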
Better: Push Logic to Phase 3
SNPs,
experimental groupings,
assay technologies,
assayed phenotypes,
annotations/ontologies
Denormalize and
Percolate
(re)prioritize &
(re)process
service queries
drive dashboards
create reports
denormalize for
display
buffer
New
models
What’s a Percolator?
• Google Percolator
– “Caffeine” update 2010
• Iterative, incremental
updates
• No batch processing
• Decouple computation
from data size
Peng & Dabek, 2010. Large-scale Incremental Processing Using Distributed Transactions
and Notifications
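The idea of notification-driven, incremental updates can be sketched in a few lines of Python. This is not Percolator's API: `TinyTable`, the row-key layout and the observer below are hypothetical, and the distributed transactions that the paper describes are omitted entirely.

```python
from collections import defaultdict


class TinyTable:
    """In-memory table that notifies observers when a column is written."""

    def __init__(self):
        self.cells = {}
        self.observers = defaultdict(list)

    def observe(self, column, callback):
        self.observers[column].append(callback)

    def write(self, row, column, value):
        self.cells[(row, column)] = value
        for callback in self.observers[column]:
            callback(self, row, value)


def update_allele_count(table, row, genotype):
    """Observer: touch only the running total for the SNP that changed."""
    snp = row.split(":")[0]  # assumed row key layout: "snp:sample"
    key = (snp, "alt_count")
    table.cells[key] = table.cells.get(key, 0) + genotype


table = TinyTable()
table.observe("genotype", update_allele_count)

# Each write costs the same no matter how much data is already stored.
table.write("rs0001:sampleA", "genotype", 1)
table.write("rs0001:sampleB", "genotype", 2)
table.write("rs0002:sampleA", "genotype", 0)
```

The point of the sketch is the last bullet above: the work per update depends on what changed, not on the total data size. Overwriting an existing genotype would need a delta rather than a plain add, which this toy version does not handle.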
What’s in the Percolator?
• Optimize for access
patterns, maybe many tables
– Dependency graph of
intermediate matrices
– AT = correlate(transpose(A))
• Parallelize table computation
– Twitter Algebird
• Analysis of intermediates
triggers downstream
action
– Codify business logic
(research methods) into
data management layer
– Prioritize and minimize
unproductive computation
https://github.com/twitter/algebird
http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-d
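Algebird is a Scala library, so the following is only a Python analogue of the monoid idea behind it, not its API. The `Moments` class here is a hypothetical illustration: because its merge is associative and has an identity element, partial summaries can be computed per partition in parallel and combined in any grouping.

```python
from functools import reduce


class Moments:
    """Count, sum and sum of squares; merging two summaries is associative."""

    def __init__(self, n=0, total=0.0, total_sq=0.0):
        self.n, self.total, self.total_sq = n, total, total_sq

    @classmethod
    def of(cls, x):
        return cls(1, x, x * x)

    def __add__(self, other):
        return Moments(
            self.n + other.n,
            self.total + other.total,
            self.total_sq + other.total_sq,
        )

    @property
    def mean(self):
        return self.total / self.n

    @property
    def variance(self):
        # Population variance from the accumulated sums.
        return self.total_sq / self.n - self.mean ** 2


def summarize(values):
    """Fold a sequence into one summary, starting from the identity Moments()."""
    return reduce(lambda a, b: a + b, (Moments.of(v) for v in values), Moments())


# Summaries computed per partition, then merged.
partitions = [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]]
partials = [summarize(p) for p in partitions]
merged = reduce(lambda a, b: a + b, partials, Moments())

# Same result as a single pass over all the data.
whole = summarize([v for p in partitions for v in p])
```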
MapR M7 especially suitable – services complex multi-tenant workloads at very large scale, see
http://www.slideshare.net/allenday/20131212-sydney-big-data-analytics
Double Percolator
• Apologies, Google
Images yields no SFW
images for
– “Double Percolator”
ENCODE
http://www.nature.com/news/encode-the-human-encyclopaedia-1.11312
Data Generation
e.g. basic research
Data Analysis
e.g. pharma
Clinical / Patient
consumption
Control
channel
Robot Scientist
Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
Robot (Data?) Scientist
Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
If they were unlabeled, would you
know which is which?
Friend. 2010. The Need for Precompetitive
Integrative Bionetwork Disease Model
Building
NPR. 2011. The Search For Analysts To Make
Sense Of 'Big Data'
http://www.npr.org/2011/11/30/142893065
If they were unlabeled, would you
know which is which?
Friend. 2010. The Need for Precompetitive
Integrative Bionetwork Disease Model
Building
• Identify network
structures
• Label them
• Observe
stimulus=>response
space mapping
• Purposefully target
• PROFIT ! ! ! !
Parallels to Twitter revenue model
social network node labeling
=> gene annotation
Google Knowledge Graph
=> bio-ontologies
Ad impressions
=> small molecule perturbation
Profit
=> Save lives
http://www.google.com/insidesearch/features/search/knowledge.html
http://www.bioontology.org/
Dendrite on M7 HBase
Denormalize and
Percolate
(re)prioritize &
(re)process
MapR M7
HBase
Titan
API
Percolation
“business logic”
Dendrite
Visualization &
ad-hoc queries
detailed view…
Further Reading
MapR M7 especially suitable – services complex multi-tenant
workloads at very large scale, see @allenday deck:
http://www.slideshare.net/allenday/20131212-sydney-big-data-analytics
Implementing matrix transforms + business logic workflows,
see @ceteri “Enterprise Data Workflows with Cascading”:
http://shop.oreilly.com/product/0636920028536.do
Math and data structure underpinnings, see @ceteri and
@allenday “Just Enough Math”:
http://liber118.com/pxn/course/itml/just_enough_math.html
Further Reading
Day, et al. 2007. Celsius: a community resource for
Affymetrix microarray data.
http://www.ncbi.nlm.nih.gov/pubmed/17570842
Human Genetics & Big Data
http://www.slideshare.net/allenday/20131212-sydney-garvan-institute-human-genetics-and-big-data
Next Topic: Optimizing 1º Analysis
Sboner, et al, 2011. The real cost of sequencing: higher than you think!
<= We were just here
“future high ROI use cases”
<= We now go here
“current high ROI use cases”
1º seq analysis, in a nutshell
Crossbow
Langmead, et al. 2009. Searching for SNPs with cloud computing
1º seq analysis, format details
.fastq => short read alignment => .bam => genotype calling => .vcf => analysis
1º seq analysis, map-reduce style
.fastq .bam .vcf
short read
alignment
genotype
calling
MAP
MAP
REDUCE, rotate matrix 90º
ref seq
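A toy, single-process sketch of the map-reduce shape of this pipeline. Everything in it is an assumption for illustration: the reference string, the reads, the brute-force Hamming-distance "aligner" and the majority-vote "caller" stand in for real tools. It reads the "rotate matrix 90º" step as regrouping from read-major records to position-major pileups, which is what the shuffle does here.

```python
from collections import Counter, defaultdict

REFERENCE = "GATTACAGCT"  # hypothetical reference sequence


def map_align(read):
    """MAP: place the read at the offset with the fewest mismatches,
    then emit one (position, base) pair per base."""
    offsets = range(len(REFERENCE) - len(read) + 1)
    best = min(
        offsets,
        key=lambda s: sum(a != b for a, b in zip(read, REFERENCE[s:s + len(read)])),
    )
    return [(best + i, base) for i, base in enumerate(read)]


def shuffle(pairs):
    """Group by key: read-major records become position-major pileups."""
    pileup = defaultdict(list)
    for position, base in pairs:
        pileup[position].append(base)
    return pileup


def reduce_call(bases):
    """REDUCE: call the most common base observed at one position."""
    return Counter(bases).most_common(1)[0][0]


# Hypothetical reads from a sample carrying G instead of A at position 4.
reads = ["GATTG", "ATTGC", "TTGCA", "TGCAG", "GCAGC"]

pairs = [pair for read in reads for pair in map_align(read)]
pileup = shuffle(pairs)
calls = {pos: reduce_call(bases) for pos, bases in pileup.items()}
variants = sorted(
    (pos, REFERENCE[pos], base) for pos, base in calls.items() if base != REFERENCE[pos]
)
```

The map step is independent per read and the reduce step is independent per position, which is why this workload parallelizes well.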
Ion Flux
• Sequencing workflow in MapReduce
(Hadoop, Cascading, Amazon Elastic M/R)
• Integrated with Ion Torrent as a plugin to
stream sequence to the cloud
• Emphasis on scalability and latency
– assay->clinical report turnaround in < 24h
• Compare to fast-follower stack Illumina (ILMN)
MiSeq+BaseSpace
http://aws.amazon.com/solutions/case-studies/ion-flux/
http://d36cz9buwru1tt.cloudfront.net/Ion-Flux-2011-02-Architecture.pdf
SeqWare / Nimbus Informatics
O’Connor, et al. 2010. SeqWare Query Engine: storing and searching sequence data in the cloud
http://seqware.github.io/
MapR Advantage for R&D
• Home directories on DFS
– NFS. Transparent to user
• Low prototype-cost / research support
– Scale prototype as needed
• Low transition cost / operationalizing research
– Prototype incrementally becomes a product
• Low operational cost / high machine utilization
– Leverage MapR performance
THANKS!
C. briggsae inbred line incompatibility
Ross, et al. 2011. Caenorhabditis briggsae Recombinant Inbred Line Genotypes Reveal Inter-Strain Incompatibility and the Evolution of Recombination
[Figure: two self joins over matrices with axes labeled Samples and eQTLs]
Incidence matrices
A (U*Q) and B (U*V)
A: Users × Query Terms
B: Users × Clicked Videos
Query Term = Clicked Term
Join on dimension U…
Query Terms
Users
Relate Q to V
Query Terms
Users
Cross-recommendation
Query Terms
Clicked Videos
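Joining the two incidence matrices on the shared user dimension amounts to the matrix product AᵀB, which gives a Query Terms × Clicked Videos table of co-occurrence counts. The sketch below uses plain Python lists; the users, terms, videos and 0/1 entries are made up for illustration.

```python
def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


users = ["u1", "u2", "u3"]
query_terms = ["genome", "hadoop"]
videos = ["v1", "v2", "v3"]

# A (U*Q): which user issued which query term.
A = [
    [1, 0],
    [1, 1],
    [0, 1],
]
# B (U*V): which user clicked which video.
B = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
]

# Join on dimension U: C[q][v] counts users who both queried q and clicked v.
C = matmul(transpose(A), B)

# Cross-recommendation: the most co-clicked video for each query term.
best_video = {
    term: videos[max(range(len(videos)), key=lambda j: row[j])]
    for term, row in zip(query_terms, C)
}
```

The same construction applies to the Samples and eQTLs matrices on the preceding slide, with samples playing the role of the shared dimension.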

call girls in paharganj DELHI 🔝 >༒9540349809 🔝 genuine Escort Service 🔝✔️✔️call girls in paharganj DELHI 🔝 >༒9540349809 🔝 genuine Escort Service 🔝✔️✔️
call girls in paharganj DELHI 🔝 >༒9540349809 🔝 genuine Escort Service 🔝✔️✔️
 
Glomerular Filtration rate and its determinants.pptx
Glomerular Filtration rate and its determinants.pptxGlomerular Filtration rate and its determinants.pptx
Glomerular Filtration rate and its determinants.pptx
 

NextGen BigData Workloads in NextGen Sequencing - 20140402 - Phoenix - TGEN

  • 1. Biomedical & Advertising Tech Overarching Themes* Eugenics & Determinism Free will vs. Determinism Media Tech & Privacy *Obligatory movie references… shout-out to my hometown LA
  • 2. Biomedical Research Goal: Therapeutics => Diagnostics => Prognostics • Reverse engineer how genetic variation leads to (un)desired traits • Therapeutics => traditional medicine • Diagnostics => personalized medicine – NextGen public health – Requires hi-res mechanical knowledge • Prognostics => GATTACA (dys/eu)topia – Managed populations / NextGen eugenics
  • 3. NextGen BigData Workloads in NextGen Sequencing Allen Day, PhD @MapR @allenday April 2014
  • 4. Typical Plan, Phases 1-4 1. Design Experiment / Collect Biosamples 2. Sequencing / Molecular Assays 3. Data Management 4. ? ? ? 5. PROFIT ! ! ! ! http://knowyourmeme.com/memes/profit
  • 5. The Changing Workload in Underpants Collection Sboner, et al, 2011. The real cost of sequencing: higher than you think!
  • 6. Typical Plan, Phases 1-4 1. Design Experiment / Collect Biosamples 2. Sequencing / Molecular Assays 3. Data Management 4. ? ? ? 5. PROFIT ! ! ! !
  • 7. Phase 4: Example GWAS/SNP Analysis • Find me related SNPs… – From other experiments • Given a phenotype… – And an associated SNP from my experiment • That elucidate genetic basis of phenotype… • And rank order them by impact/likelihood/etc
  • 8. Phase 4: Example GWAS/SNP Analysis SELECT snp, expEvidence FROM myExp, exp1, … OUTER JOIN expN … WHERE myExp.snp = “mySnp” ORDER BY p, freq, conservation, etc
  • 9. Phase 4: Example GWAS/SNP Analysis • In context of – Racial background – Experimental design- specific concerns (e.g. familial IBD/IBS) – Environmental factors and penetrance – Assay-specific biases and noise SELECT snp, expEvidence FROM myExp, exp1, … OUTER JOIN expN … WHERE myExp.snp = “mySnp” ORDER BY p, freq, conservation, etc
  • 10. SELECT snp, expEvidence FROM myExp, exp1, … OUTER JOIN expN … WHERE myExp.snp = “mySnp” ORDER BY p, freq, conservation, etc Phase 4: Example GWAS/SNP Analysis • In context of, e.g. – ε1: Racial, etc. background – ε2: Experimental design-specific concerns (e.g. familial IBD/IBS) – ε3: Environmental factors and penetrance – ε4: Assay-specific biases and noise phenotype = α·genotype + β + ε1 + ε2 + ε3 + ε4 At risk of over-simplification for business-level concept…
  • 11. Phase 4: Automated Insights Engine SELECT snp, expEvidence FROM myExp, exp1, … expN exps=powerset(all exps) OUTER JOIN complement(exps) WHERE myExp.snp = “mySnp” powerset(all SNPs, phenotypes) ORDER BY p, freq, conservation, etc arbitrary models SNPs, experimental groupings, assay technologies, assayed phenotypes, annotations/ontologies C. Briggsae inbred strain compatibility [supplementary slide]
  • 12. Phase 4: Automated Insights Engine SELECT snp, expEvidence FROM myExp, exp1, … expN exps=powerset(all exps) OUTER JOIN complement(exps) WHERE myExp.snp = “mySnp” powerset(all SNPs, phenotypes) ORDER BY p, freq, conservation, etc arbitrary models SNPs, experimental groupings, assay technologies, assayed phenotypes, annotations/ontologies
  • 13. Phase 4: Automated Insights Engine Right… SNPs, experimental groupings, assay technologies, assayed phenotypes, annotations/ontologies
  • 14. Phase 4: Naïve Implementation SNPs, experimental groupings, assay technologies, assayed phenotypes, annotations/ontologies Big compute It’s monolithic It scales polynomially with data size This is bad, it takes too long to get a result
  • 15. Co-expression (10K samples) and Linkage Gene Annotation / Set Completion BMP6 BMP2 MMP3 LIF NOS2A MMP13 CSPG4 ACAN ACAN ACAN COL11A2 COL11A2 COL9A1 MATN1 LECT1 MATN4 HAPLN1 HAPLN1 ITGA10 EDIL3 NGF MAST4 MATN3 EPYC COL11A1 COL11A1 COL10A1 COL10A1 THBS3 C1QTNF3 WISP1 PDPN PDLIM4 CHST3 MIA SOX5 CYTL1 TNMD AKR1C1 MMP12 ETNK1 RELA FOSL1 EIF2C2 NUPL1 RLF RELB SOD2 RNF24 RNF24 XYLT1 HAS2 BDKRB1 HSPC159 SLC28A3 FZD10 SLC28A3 HSPC159 BDKRB1 HAS2 XYLT1 RNF24 RNF24 SOD2 RELB RLF NUPL1 EIF2C2 FOSL1 RELA ETNK1 MMP12 AKR1C1 TNMD CYTL1 SOX5 MIA CHST3 PDLIM4 PDPN FZD10 WISP1 C1QTNF3 THBS3 COL10A1 COL10A1 COL11A1 COL11A1 EPYC MATN3 MAST4 NGF EDIL3 ITGA10 HAPLN1 HAPLN1 MATN4 ACAN ACAN ACAN LECT1 MATN1 COL9A1 COL11A2 COL11A2 CSPG4 MMP13 NOS2A LIF MMP3 BMP2 BMP6 Day. 2009. Disease gene characterization through large-scale co-expression analysis. http://www.ncbi.nlm.nih.gov/pubmed/20046828 + =>
  • 16. Better: Push Logic to Phase 3 SNPs, experimental groupings, assay technologies, assayed phenotypes, annotations/ontologies Denormalize and Percolate (re)prioritize & (re)process service queries drive dashboards create reports denormalize for display buffer New models
  • 17. What’s a Percolator? • Google Percolator – “Caffeine” update 2010 • Iterative, incremental updates • No batch processing • Decouple computation from data size Peng & Dabek, 2010. Large-scale Incremental Processing Using Distributed Transactions and Notifications
  • 18. What’s in the Percolator? • Optimize for access patterns, maybe many tables – Dependency graph of intermediate matrices – AT = correlate(transpose(A)) • Parallelize table computation – Twitter Algebird • Analysis of intermediates triggers downstream action – Codify business logic (research methods) into data management layer – Prioritize and minimize unproductive computation Denormalize and Percolate (re)prioritize & (re)process https://github.com/twitter/algebird http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-d
  • 19. What’s in the Percolator? • Optimize for access patterns, maybe many tables – Dependency graph of intermediate matrices – AT = correlate(transpose(A)) • Parallelize table computation – Twitter Algebird • Analysis of intermediates triggers downstream action – Codify business logic (research methods) into data management layer – Prioritize and minimize unproductive computation Denormalize and Percolate (re)prioritize & (re)process MapR M7 especially suitable – services complex multi-tenant workloads at very large scale, see http://www.slideshare.net/allenday/20131212-sydney-big-data-analytics
  • 20. Double Percolator • Apologies, Google Images yields no SFW images for – “Double Percolator”
  • 22. Data Generation e.g. basic research Data Analysis e.g. pharma Control channel Clinical / Patient consumption
  • 23. Data Generation e.g. basic research Data Analysis e.g. pharma Clinical / Patient consumption Control channel
  • 24. Robot Scientist Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
  • 25. Robot (Data?) Scientist Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
  • 26. If they were unlabeled, would you know which is which? Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building NPR. 2011. The Search For Analysts To Make Sense Of ‘Big Data’ http://www.npr.org/2011/11/30/142893065
  • 27. If they were unlabeled, would you know which is which? Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building • Identify network structures • Label them • Observe stimulus=>response space mapping • Purposefully target • PROFIT ! ! ! !
  • 28. If they were unlabeled, would you know which is which? Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building • Identify network structures • Label them • Observe stimulus=>response space mapping • Purposefully target • PROFIT ! ! ! ! Parallels to Twitter revenue model social network node labeling => gene annotation Google Knowledge Graph => bio-ontologies Ad impressions => small molecule perturbation Profit => Save lives  http://www.google.com/insidesearch/features/search/knowledge.html http://www.bioontology.org/
  • 29.
  • 30. Dendrite on M7 HBase Denormalize and Percolate (re)prioritize & (re)process MapR M7 HBase Titan API Percolation “business logic” Dendrite Visualization & ad-hoc queries detailed view…
  • 31. Further Reading MapR M7 especially suitable – services complex multi-tenant workloads at very large scale, see @allenday deck: http://www.slideshare.net/allenday/20131212-sydney-big- data-analytics Implementing matrix transforms + business logic workflows, see @ceteri “Enterprise Data Workflows with Cascading”: http://shop.oreilly.com/product/0636920028536.do Math and data structure underpinnings, see @ceteri and @allenday “Just Enough Math”: http://liber118.com/pxn/course/itml/just_enough_math.html Denormalize and Percolate (re)prioritize & (re)process
  • 32. Further Reading Day, et al. 2007. Celsius: a community resource for Affymetrix microarray data. http://www.ncbi.nlm.nih.gov/pubmed/17570842 Human Genetics & Big Data http://www.slideshare.net/allenday/20131212-sydney- garvan-institute-human-genetics-and-big-data Denormalize and Percolate (re)prioritize & (re)process
  • 33. Next Topic: Optimizing 1º Analysis Sboner, et al, 2011. The real cost of sequencing: higher than you think! <= We were just here “future high ROI use cases” <= We now go here “current high ROI use cases”
  • 34. 1º seq analysis, in a nutshell
  • 35. 1º seq analysis, in a nutshell
  • 36. Crossbow Langmead, et al. 2009. Searching for SNPs with cloud computing
  • 37. 1º seq analysis, format details .fastq .bam .vcf short read alignment genotype calling analysis
  • 38. 1º seq analysis, map-reduce style .fastq .bam .vcf short read alignment genotype calling MAP MAP REDUCE, rotate matrix 90º ref seq
  • 39. Ion Flux • Sequencing workflow in MapReduce (Hadoop, Cascading, Amazon Elastic M/R) • Integrated with Ion Torrent as a plugin to stream sequence to the cloud • Emphasis on scalability and latency – assay->clinical report turnaround in < 24h • Compare to fast-follower stack ILMN MiSeq+BaseSpace http://aws.amazon.com/solutions/case-studies/ion-flux/ http://d36cz9buwru1tt.cloudfront.net/Ion-Flux-2011-02-Architecture.pdf
  • 40. SeqWare / Nimbus Informatics O’Connor, et al. 2010. SeqWare Query Engine: storing and searching sequence data in the cloud http://seqware.github.io/
  • 41.
  • 42. MapR Advantage for R&D • Home directories on DFS – NFS. Transparent to user • Low prototype-cost / research support – Scale prototype as needed • Low transition cost / operationalizing research – Prototype incrementally becomes a product • Low operational cost / high machine utilization – Leverage MapR performance
  • 44. C. briggsae inbred line incompatibility Ross, et al. 2011. Caenorhabditis briggsae Recombinant Inbred Line Genotypes Reveal Inter- Strain Incompatibility and the Evolution of Recombination
  • 46. Incidence matrices A (U*Q) and B (U*V) UsersQuery Terms Users Clicked Videos Query Term = Clicked Term
  • 47. Join on dimension U… QueryTerms Users
  • 48. Relate Q to V QueryTerms Users
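Slides 46-48 describe joining two incidence matrices on their shared user dimension. A minimal sketch of that step, assuming A is a users × query-terms 0/1 matrix and B is a users × clicked-videos 0/1 matrix: the join on U can be read as the product AᵀB, whose entry (q, v) counts the users who both issued query term q and clicked video v. The function name `relate` and the toy data are hypothetical and are not taken from the slides.

```python
def relate(a, b):
    """Return A^T B for two incidence matrices that share the row (user) dimension."""
    if len(a) != len(b):
        raise ValueError("A and B must have one row per user")
    n_q = len(a[0]) if a else 0
    n_v = len(b[0]) if b else 0
    c = [[0] * n_v for _ in range(n_q)]
    # Each user row contributes to every (query term, video) pair it touches;
    # iterating over users is the "join on dimension U".
    for row_a, row_b in zip(a, b):
        for q, has_q in enumerate(row_a):
            if not has_q:
                continue
            for v, has_v in enumerate(row_b):
                c[q][v] += has_q * has_v
    return c


# Hypothetical toy data: 3 users, 2 query terms, 2 videos.
A = [[1, 0],   # user 0 issued query term 0
     [1, 1],   # user 1 issued query terms 0 and 1
     [0, 1]]   # user 2 issued query term 1
B = [[1, 0],   # user 0 clicked video 0
     [0, 1],   # user 1 clicked video 1
     [0, 1]]   # user 2 clicked video 1

Q_TO_V = relate(A, B)  # query terms x videos co-occurrence counts
```

Because each user's contribution is simply added into the result, the per-user work could in principle be partitioned and summed afterwards, which suits the map-reduce framing used elsewhere in the deck.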

Editor's Notes

  1. The genomic position (x-axis) of probesets within a 6 megabase region centered at the location of TTN, a gene known to be associated with LGMD2, is plotted versus the Pearson correlation coefficient (y-axis) to a list of probesets targeting other genes known to be associated with LGMD2 (excluding TTN) across 11636 HG-U133_Plus_2 microarrays. Solid circles: probesets targeting TTN; [marker image missing]: probesets for genes of unknown function; open circles: probesets for known genes in the interval.