1. Biomedical & Advertising Tech: Overarching Themes*
Eugenics & Determinism / Free Will vs. Determinism / Media Tech & Privacy
*Obligatory movie references… shout-out to my hometown, LA
2. Biomedical Research Goal:
Therapeutics => Diagnostics => Prognostics
• Reverse engineer how genetic variation leads to (un)desired traits
• Therapeutics => traditional medicine
• Diagnostics => personalized medicine
– NextGen public health
– Requires hi-res mechanistic knowledge
• Prognostics => GATTACA (dys/eu)topia
– Managed populations / NextGen eugenics
7. Phase 4: Example GWAS/SNP Analysis
• Find me related SNPs…
– From other experiments
• Given a phenotype…
– And an associated SNP from my experiment
• That elucidate genetic basis of phenotype…
• And rank order them by impact/likelihood/etc
8. Phase 4: Example GWAS/SNP Analysis
SELECT
snp, expEvidence
FROM
myExp, exp1, …
OUTER JOIN expN …
WHERE
myExp.snp = 'mySnp'
ORDER BY
p, freq, conservation, etc
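A minimal runnable sketch of the pseudo-query above, using Python's built-in sqlite3 module. The single `assoc` table, its columns (`exp`, `snp`, `phenotype`, `p`, `freq`), the sample rows, and the helper name `related_snps` are all hypothetical; the slide does not specify a schema. For brevity it uses an inner self-join rather than the slide's outer joins: it finds the phenotype tied to the query SNP in `myExp`, then returns SNPs that other experiments associate with the same phenotype, ranked by p-value and then frequency.

```python
import sqlite3

# Hypothetical data: one association per (experiment, SNP, phenotype).
ROWS = [
    # exp,    snp,   phenotype, p,    freq
    ("myExp", "rs1", "height", 1e-6, 0.30),
    ("exp1",  "rs2", "height", 1e-8, 0.12),
    ("exp1",  "rs3", "weight", 1e-9, 0.40),
    ("exp2",  "rs4", "height", 1e-4, 0.05),
    ("exp2",  "rs2", "height", 1e-7, 0.12),
]

def related_snps(conn, my_exp, my_snp):
    """SNPs from other experiments that share a phenotype with my_snp, best p first."""
    query = """
        SELECT other.snp, other.exp, other.p, other.freq
        FROM assoc AS mine
        JOIN assoc AS other
          ON other.phenotype = mine.phenotype
         AND other.exp <> mine.exp
        WHERE mine.exp = ? AND mine.snp = ?
        ORDER BY other.p ASC, other.freq DESC
    """
    return conn.execute(query, (my_exp, my_snp)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE assoc (exp TEXT, snp TEXT, phenotype TEXT, p REAL, freq REAL)"
)
conn.executemany("INSERT INTO assoc VALUES (?, ?, ?, ?, ?)", ROWS)

hits = related_snps(conn, "myExp", "rs1")
for snp, exp, p, freq in hits:
    print(snp, exp, p, freq)
```

Ranking by conservation or other evidence would mean adding more columns to the `ORDER BY` clause; the context terms discussed on the following slides are not modeled here.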
9. Phase 4: Example GWAS/SNP Analysis
• In context of
– Racial background
– Experimental design-specific concerns (e.g. familial IBD/IBS)
– Environmental factors and penetrance
– Assay-specific biases and noise
10. Phase 4: Example GWAS/SNP Analysis
SELECT
snp, expEvidence
FROM
myExp, exp1, …
OUTER JOIN expN …
WHERE
myExp.snp = 'mySnp'
ORDER BY
p, freq, conservation, etc
• In context of, e.g.
– ε1: Racial, etc. background
– ε2: Experimental design-specific concerns (e.g. familial IBD/IBS)
– ε3: Environmental factors and penetrance
– ε4: Assay-specific biases and noise
phenotype = α × genotype + β + ε1 + ε2 + ε3 + ε4
At risk of over-simplification for business-level concept…
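To make the linear model above concrete, here is a small sketch that simulates phenotypes from it and recovers α and β by ordinary least squares. Everything numeric is an assumption for illustration: the genotype coding (minor-allele count 0, 1, or 2), the "true" values of α and β, and the treatment of ε1 to ε4 as four independent Gaussian noise terms. Real confounders such as population structure are not independent noise, which is part of the over-simplification the slide warns about.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = alpha * x + beta."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    beta = mean_y - alpha * mean_x
    return alpha, beta

rng = random.Random(0)
TRUE_ALPHA, TRUE_BETA = 2.0, 10.0   # hypothetical effect size and baseline

# Genotype coded as minor-allele count; four noise draws stand in for e1..e4.
genotypes = [rng.choice([0, 1, 2]) for _ in range(5000)]
phenotypes = [
    TRUE_ALPHA * g + TRUE_BETA + sum(rng.gauss(0.0, 0.5) for _ in range(4))
    for g in genotypes
]

alpha, beta = fit_line(genotypes, phenotypes)
print(f"alpha ~ {alpha:.2f}, beta ~ {beta:.2f}")
```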
11. Phase 4: Automated Insights Engine
SELECT
snp, expEvidence
FROM
myExp, exp1, … expN
exps=powerset(all exps)
OUTER JOIN complement(exps)
WHERE
myExp.snp = 'mySnp'
powerset(all SNPs, phenotypes)
ORDER BY
p, freq, conservation, etc
arbitrary models
SNPs,
experimental groupings,
assay technologies,
assayed phenotypes,
annotations/ontologies
C. briggsae inbred strain compatibility
[supplementary slide]
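The `powerset(all exps)` and `complement(exps)` steps in the pseudo-query above can be sketched with Python's itertools. The experiment names and the helper `groupings` are hypothetical. The sketch only enumerates the groupings an automated engine would have to consider; it runs no analysis on them. A set of n items has 2^n subsets, which the final loop prints for a few values of n.

```python
from itertools import chain, combinations

def powerset(items):
    """Return an iterator over every subset of items, empty set included."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1)
    )

def groupings(experiments):
    """Pair each non-empty subset of experiments with its complement."""
    all_exps = set(experiments)
    for subset in powerset(experiments):
        if subset:
            yield set(subset), all_exps - set(subset)

exps = ["myExp", "exp1", "exp2", "exp3"]   # hypothetical experiment names
pairs = list(groupings(exps))
print(len(pairs))            # non-empty subsets of 4 experiments

for n in (10, 20, 30):
    print(n, 2 ** n)         # subsets to consider for n experiments
```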
14. Phase 4: Naïve Implementation
SNPs,
experimental groupings,
assay technologies,
assayed phenotypes,
annotations/ontologies
Big compute
It's monolithic
It scales polynomially with data size
This is bad: it takes too long to get a result
16. Better: Push Logic to Phase 3
SNPs,
experimental groupings,
assay technologies,
assayed phenotypes,
annotations/ontologies
Denormalize and Percolate
(re)prioritize & (re)process
service queries
drive dashboards
create reports
denormalize for display
buffer
New models
17. What's a Percolator?
• Google Percolator
– "Caffeine" update 2010
• Iterative, incremental updates
• No batch processing
• Decouple computation from data size
Peng & Dabek, 2010. Large-scale Incremental Processing Using Distributed Transactions and Notifications
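The incremental-update idea can be illustrated with a toy, single-process sketch: a write to a raw table notifies an observer, which adjusts one row of a derived table instead of recomputing it in a batch. This is not Percolator's API; the `Table` class, the observer signature, and the SNP data are all invented for illustration, and the distributed transactions that the paper describes are omitted entirely.

```python
class Table:
    """Toy key-value table whose writes notify registered observers."""

    def __init__(self):
        self.rows = {}
        self.observers = []
        self.writes = 0

    def observe(self, callback):
        self.observers.append(callback)

    def put(self, key, value):
        old = self.rows.get(key)
        self.rows[key] = value
        self.writes += 1
        for callback in self.observers:
            callback(key, old, value)

# Raw table: (experiment, snp) -> 1 if the SNP was significant there, else 0.
calls = Table()
# Derived table: snp -> number of experiments currently supporting it.
support = Table()

def update_support(key, old, new):
    """Adjust only the affected SNP's count; never rescan `calls`."""
    _, snp = key
    delta = new - (old or 0)
    if delta:
        support.put(snp, support.rows.get(snp, 0) + delta)

calls.observe(update_support)
calls.put(("exp1", "rs2"), 1)
calls.put(("exp2", "rs2"), 1)
calls.put(("exp2", "rs4"), 0)   # no change in support, so nothing propagates
calls.put(("exp1", "rs2"), 0)   # a retraction propagates as a -1 delta
print(support.rows)
```

The work per write depends on how many derived rows that write touches, not on the total size of `calls`, which is the sense in which computation is decoupled from data size.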
18. What's in the Percolator?
• Optimize for access patterns, maybe many tables
– Dependency graph of intermediate matrices
– AT = correlate(transpose(A))
• Parallelize table computation
– Twitter Algebird
• Analysis of intermediates triggers downstream action
– Codify business logic (research methods) into data management layer
– Prioritize and minimize unproductive computation
Denormalize and Percolate
(re)prioritize & (re)process
https://github.com/twitter/algebird
http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-d
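One reason a monoid library helps with parallel table computation: if a statistic's intermediate state can be combined associatively, shards can be summarized independently and merged in any grouping. The sketch below shows this for Pearson correlation, in Python rather than Algebird's Scala. The `CorrStats` class and the data are hypothetical, and none of this is Algebird's actual API.

```python
from dataclasses import dataclass
from functools import reduce
from math import sqrt

@dataclass(frozen=True)
class CorrStats:
    """Sufficient statistics for Pearson correlation; `+` is associative."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    syy: float = 0.0
    sxy: float = 0.0

    @staticmethod
    def of(x, y):
        return CorrStats(1, x, y, x * x, y * y, x * y)

    def __add__(self, other):
        return CorrStats(
            self.n + other.n, self.sx + other.sx, self.sy + other.sy,
            self.sxx + other.sxx, self.syy + other.syy, self.sxy + other.sxy,
        )

    def pearson(self):
        cov = self.n * self.sxy - self.sx * self.sy
        var_x = self.n * self.sxx - self.sx ** 2
        var_y = self.n * self.syy - self.sy ** 2
        return cov / sqrt(var_x * var_y)

def summarize(pairs):
    """Fold one shard of (x, y) pairs into a single CorrStats."""
    return reduce(lambda acc, p: acc + CorrStats.of(*p), pairs, CorrStats())

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1), (6.0, 11.9)]
whole = summarize(data)
# Two "workers" each summarize a shard; the partial results merge afterwards.
merged = summarize(data[:2]) + summarize(data[2:])
print(whole.pearson(), merged.pearson())
```

The two results agree up to floating-point rounding. The same pattern extends to a full correlation matrix by keeping one `CorrStats` per pair of columns.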
19. What's in the Percolator?
MapR M7 especially suitable: services complex multi-tenant workloads at very large scale, see
http://www.slideshare.net/allenday/20131212-sydney-big-data-analytics
26. If they were unlabeled, would you know which is which?
Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building
NPR. 2011. The Search For Analysts To Make Sense Of 'Big Data'
http://www.npr.org/2011/11/30/142893065
28. If they were unlabeled, would you know which is which?
Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building
• Identify network structures
• Label them
• Observe stimulus=>response space mapping
• Purposefully target
• PROFIT!!!!
Parallels to Twitter revenue model
social network node labeling => gene annotation
Google Knowledge Graph => bio-ontologies
Ad impressions => small molecule perturbation
Profit => Save lives :)
http://www.google.com/insidesearch/features/search/knowledge.html
http://www.bioontology.org/
30. Dendrite on M7 HBase
Denormalize and Percolate
(re)prioritize & (re)process
MapR M7
HBase
Titan
API
Percolation "business logic"
Dendrite
Visualization & ad-hoc queries
detailed view…
31. Further Reading
MapR M7 especially suitable: services complex multi-tenant workloads at very large scale, see @allenday deck:
http://www.slideshare.net/allenday/20131212-sydney-big-data-analytics
Implementing matrix transforms + business logic workflows, see @ceteri "Enterprise Data Workflows with Cascading":
http://shop.oreilly.com/product/0636920028536.do
Math and data structure underpinnings, see @ceteri and @allenday "Just Enough Math":
http://liber118.com/pxn/course/itml/just_enough_math.html
32. Further Reading
Day, et al. 2007. Celsius: a community resource for Affymetrix microarray data.
http://www.ncbi.nlm.nih.gov/pubmed/17570842
Human Genetics & Big Data
http://www.slideshare.net/allenday/20131212-sydney-garvan-institute-human-genetics-and-big-data
33. Next Topic: Optimizing 1° Analysis
Sboner, et al, 2011. The real cost of sequencing: higher than you think!
<= We were just here
"future high ROI use cases"
<= We now go here
"current high ROI use cases"
39. Ion Flux
• Sequencing workflow in MapReduce (Hadoop, Cascading, Amazon Elastic M/R)
• Integrated with Ion Torrent as a plugin to stream sequence to the cloud
• Emphasis on scalability and latency
– assay->clinical report turnaround in < 24h
• Compare to fast-follower stack ILMN MiSeq+BaseSpace
http://aws.amazon.com/solutions/case-studies/ion-flux/
http://d36cz9buwru1tt.cloudfront.net/Ion-Flux-2011-02-Architecture.pdf
40. SeqWare / Nimbus Informatics
O'Connor, et al. 2010. SeqWare Query Engine: storing and searching sequence data in the cloud
http://seqware.github.io/
42. MapR Advantage for R&D
• Home directories on DFS
– NFS. Transparent to user
• Low prototype-cost / research support
– Scale prototype as needed
• Low transition cost / operationalizing research
– Prototype incrementally becomes a product
• Low operational cost / high machine utilization
– Leverage MapR performance
44. C. briggsae inbred line incompatibility
Ross, et al. 2011. Caenorhabditis briggsae Recombinant Inbred Line Genotypes Reveal Inter-Strain Incompatibility and the Evolution of Recombination
The genomic position (x-axis) of probesets within a 6 megabase region centered at the location of TTN, a gene known to be associated with LGMD2, is plotted versus the Pearson correlation coefficient (y-axis) to a list of probesets targeting other genes known to be associated with LGMD2 (excluding TTN) across 11636 HG-U133_Plus_2 microarrays. Solid circles: probesets targeting TTN; [marker image missing]: probesets for genes of unknown function; open circles: probesets for known genes in the interval.