1. The CERN Approach in the Age of “Big Data”
T. Nicholas Kypreos, Ph.D.
Principal Data Scientist and Project Manager, MAQ Software
Seattle Scalability Meetup, 26 February 2014
2. acknowledgments
• CMS Collaboration and CERN
• UF for sending me there
• FNAL for support in writing my dissertation… and for computing facilities to abuse
• Bradford Stephens for letting me do this…
• Mike Miller and Adam Kocoloski from Cloudant for letting me steal some nice slides on the subject
• Cloudant was just bought by IBM — so clearly all physicists are experts in the cloud!
• everyone here for suffering this diatribe
4. the life of LHC data
• information picked up by detectors
• data are clumped together to create events
• real-time hardware filtering — FIFO buffering
• raw data are sent to the T0 center at CERN for archival (replicated at FNAL)
• transferred to T1 sites, reconstructed, filtered into primary datasets
• transferred to T2 sites, reconstructed, skimmed, and filtered
• accessed at this level by general users across the globe
• moved to laptops for direct analysis
• peer review of analysis, lots of liquor, an eventual paper
5. the reader
• a collection of detectors reading out information (like a camera?)
• lots of ways you can describe this
[Figure 3-3: slice of the CMS detector in the r-φ plane, showing the paths of different particles]
6. luminosity
[plot: integrated luminosity delivered, LHC 2010-2011 vs. LHC 2012+]
• the LHC can produce all SM particles across a range of energies
• production is not equiprobable: it is governed by the cross section, σ
• small cross sections correspond to rarer processes in nature
• event production depends on the integrated luminosity of the machine (worked example below):
  N_events = σ(pp → X) × L_int, i.e. probability × trials
• higher luminosity → more events
• I WANT THE UNICORNS!
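To make the “probability × trials” line concrete, here is a minimal sketch of the event-yield arithmetic; the cross section, luminosity, and efficiency values are illustrative, not official CMS numbers. The only real content is the unit conversion: 1 pb of cross section times 1 fb⁻¹ of integrated luminosity gives 1000 events.

```python
# Expected event yield: N_events = sigma(pp -> X) * L_int.
# Values below are illustrative only, not official CMS numbers.

EVENTS_PER_PB_FB = 1000  # 1 pb x 1 fb^-1 = 1000 events

def expected_events(sigma_pb: float, lumi_fb: float, efficiency: float = 1.0) -> float:
    """Expected number of produced (and, optionally, selected) events."""
    return sigma_pb * lumi_fb * EVENTS_PER_PB_FB * efficiency

# e.g. a ~20 pb process, 20 fb^-1 of data, 10% selection efficiency
print(expected_events(20.0, 20.0, efficiency=0.10))  # -> 40000.0
```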
7. information has to all come together somehow…
• trigger system helps determine what we want to keep and throw away (hardware-based pattern recognition)
• a real-time interface with the detectors keeps the data flowing, unbroken and collected at the right time — constant feedback and data validation
• this still hurts my brain
[Figure 3-16: architecture of the CMS Level 1 trigger; Figure 3-17: overview of the CMS trigger control system]
9. triggering
• collision rate of 40 MHz
• only ~100 Hz can be written to tape ➞ rate reduction is necessary
• rate reduced in two phases (toy sketch below):
  • Level-1 (hardware)
    • collects the full event, 100 kHz out
  • High Level Trigger (HLT) (software)
    • on-site computing farm performs fast partial reconstruction
    • trigger menus are packed into bits; users fight for priority
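A toy sketch of the two-phase cascade, assuming nothing about the real trigger logic: the pass fractions are chosen only to mimic the quoted rates (40 MHz in, ~100 kHz after Level-1, ~100 Hz to tape), and the “decisions” here are coin flips rather than pattern recognition.

```python
import random

# Toy two-stage trigger cascade. Pass fractions mimic the quoted rates:
# 40 MHz collisions -> Level-1 (hardware) ~100 kHz -> HLT (software) ~100 Hz.
L1_PASS = 100e3 / 40e6    # ~1 in 400
HLT_PASS = 100.0 / 100e3  # ~1 in 1000

def level1(event) -> bool:
    """Fast hardware decision on coarse readout (toy: a coin flip)."""
    return random.random() < L1_PASS

def hlt(event) -> bool:
    """Software decision after fast partial reconstruction (toy: a coin flip)."""
    return random.random() < HLT_PASS

# 0.1 s of collisions at 40 MHz; expect ~10 events to survive both stages
kept = sum(1 for e in range(4_000_000) if level1(e) and hlt(e))
print(f"kept {kept} of 4M collisions (~10 expected)")
```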
10. what we’re aiming for…
• data from online monitoring goes to a computer farm (HLT)
• lots of filtering
  • multivariate analysis, machine learning, derp-aderp, Kalman filtering…
  • fitting and clustering
• analysis teams have to fight for their analysis priority — you can’t always get what you want
11. the life of LHC data
• information picked up by detectors
• data are clumped together to create events
• real-time hardware filtering — FIFO buffering
• raw data are sent to the T0 center at CERN for archival (replicated at FNAL)
• transferred to T1 sites, reconstructed, filtered into primary datasets
• transferred to T2 sites, reconstructed, skimmed, and filtered
• accessed at this level by general users across the globe
• moved to laptops for direct analysis
• peer review of analysis, lots of liquor, an eventual paper
12. data distribution
• ~1 PB/week (a giant map reduce)
• operations at each center may be a bit different (think non-standard data centers)
• sites with 100’s of TB of storage and 1000s of worker nodes each; 10000’s of worker nodes across the network
• why so many sites? politics, power budget, physical limitations, budgets
• driven in part by branches of CMS production
(a minimal sketch of the blackboard pattern follows below)

[Figure: CMS event data flow. Detector → Online (HLT), ~10 online streams (RAW) → Tier 0 (first-pass reconstruction, primary tape archive) and CMS-CAF (CERN Analysis Facility) → ~6 Tier-1 centres (off-site): ~50 datasets (RAW+RECO) shared amongst Tier 1’s, an average of ~8 datasets per Tier 1, secondary tape archive, analysis, calibration, re-reconstruction, skim making → skims (RECO, AOD’s) → ~25 Tier-2 centres and small sites]

Background text (excerpt from a CMS data-distribution paper): “The CMS experiment has taken a novel approach to Grid data distribution. Instead of having a central processing component making global decisions on replica allocation, CMS has a data management layer composed of a series of collaborating agents; the agents are persistent, stateless processes which manage specific parts of replica operations at each site in the distribution network. The agents communicate asynchronously through a blackboard architecture; as a result the system is robust in the face of many local failures. CMS’ data distribution network is represented in a similar fashion to the internet. This means that the management of multi-hop transfers at dataset level is simple. This system allows CMS to seamlessly bridge the gap between traditional HEP ‘waterfall’ type data distribution and more dynamic Grid data distribution. Using this system CMS is able to reach transfer rates of 500 Mbps, and sustain transfers for long periods in tape-to-tape transfers. The system has been used to manage data of large scale Service Challenges for LCG this year.”
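The excerpt’s key idea is that agents never call each other; they coordinate only through shared state (the “blackboard”), so any one failure stays local. A minimal sketch of that pattern, with hypothetical site names and a Python list standing in for what is really a central database:

```python
# Minimal blackboard pattern: agents share state only through a common task
# board, never by calling each other. Hypothetical sketch, not the real
# system, which implements the pattern with agents around a central database.

blackboard = [
    {"dataset": "A", "hops": ["T0", "T1_US", "T2_X"], "at": 0},
    {"dataset": "B", "hops": ["T0", "T1_DE"], "at": 0},
]

def transfer_agent(src, dst):
    """Stateless agent for one link: advances any transfer waiting on it."""
    for task in blackboard:
        hops, i = task["hops"], task["at"]
        if i + 1 < len(hops) and hops[i] == src and hops[i + 1] == dst:
            task["at"] += 1  # one hop done; the next agent sees the new state

# Each agent knows only its own link; multi-hop routing emerges from the board.
for _ in range(3):  # agents run independently and repeatedly
    for src, dst in [("T0", "T1_US"), ("T0", "T1_DE"), ("T1_US", "T2_X")]:
        transfer_agent(src, dst)

print(blackboard)  # both datasets delivered to their final hop
```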
13. data access
• users submit jobs through a custom-built interface (think EMR)
• jobs have parameters to take them to blacklisted/whitelisted tier-2 centers, access datasets and simulation sets
• looks a lot like a traditional batch queue (toy sketch below)
• managed from a local or remote location
• returns custom data structures
[Figure 4.3: schematic of the hierarchical task queue architecture. A UI submits jobs to the CMS global Task Queue (Global Policy); sites request and are assigned jobs across the site boundary into a local Task Queue (Local Policy); the Site Batch MGR harvests jobs into Site Batch Slots]
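A toy of the hierarchical task queue in the figure: a global queue applies global policy (whitelists, blacklists, load) and hands jobs to per-site local queues, which site batch managers then drain. All names and policies here are hypothetical, not the actual CMS workload-management code.

```python
# Toy hierarchical task queue: global policy assigns jobs to per-site local
# queues; site batch managers harvest from their own queue. Hypothetical.
from collections import deque

sites = {"T2_US_A": deque(), "T2_DE_B": deque(), "T2_FR_C": deque()}

global_queue = deque([
    {"job": 1, "dataset": "/Higgs/RECO", "whitelist": {"T2_US_A"}},
    {"job": 2, "dataset": "/QCD/AOD", "blacklist": {"T2_DE_B"}},
])

def assign(job):
    """Global policy: pick an allowed site, preferring the least loaded."""
    allowed = job.get("whitelist") or set(sites) - job.get("blacklist", set())
    site = min(allowed, key=lambda s: len(sites[s]))
    sites[site].append(job)

while global_queue:
    assign(global_queue.popleft())

for site, local_queue in sites.items():  # each site batch manager harvests
    while local_queue:
        print(site, "runs job", local_queue.popleft()["job"])
```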
14. outputs and maintenance
• data reprocessing happens as software improves
• data-format consistency is enforced in major versions
  NO BROKEN CODE IN PRODUCTION… even though it is crowdsourced among 100s of physicists who are awful at coding!
• incremental improvements are proposed, reviewed, and systematically merged
• read back data on remote storage areas on T2s and T3s or to scratch areas
• T0 and T1 centers are kept for privileged users: those people doing validation and data monitoring, not private analysis
• failed jobs can be re-submitted or continued as more data become available
15. the life of LHC data
• information picked up by detectors
• data are clumped together to create events
• real-time hardware filtering — FIFO buffering
• raw data are sent to the T0 center at CERN for archival (replicated at FNAL)
• transferred to T1 sites, reconstructed, filtered into primary datasets
• transferred to T2 sites, reconstructed, skimmed, and filtered
• accessed at this level by general users across the globe
• moved to laptops for direct analysis
• peer review of analysis, lots of liquor, an eventual paper
16. each analysis is sacred
• there are losses, systematic uncertainties, statistical uncertainties, biases
http://www.slac.stanford.edu/econf/C030908/papers/TUIT001.pdf
17. data structuring and the framework that binds
• not databases!
• the CMS software framework is written as C++ libraries and bound together in a Python framework (minimal config sketch below)
• data are put together in events and stored in custom C-based data structures (streamed as nTuples)
• compile C++ code, load in specific libraries for the analysis from the framework as needed
• all parts are visible and configurable (no black boxes)
• processes, filters, and new data constructions are controlled in Python and read out per “event”
• compiled code is transported on-site when submitting jobs via the GRID
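For flavor, here is what “controlled in Python” looks like: a minimal CMSSW-style configuration, assuming the standard FWCore Python API. The input file name, the “DemoAnalyzer” plugin label, and its muonPtMin parameter are placeholders for whatever compiled C++ module an analysis actually loads.

```python
# Minimal CMSSW-style configuration: Python wires together compiled C++
# plugins and runs them once per event. The file name, "DemoAnalyzer" label,
# and muonPtMin parameter are placeholders; the analysis logic lives in C++.
import FWCore.ParameterSet.Config as cms

process = cms.Process("ANALYSIS")

# where events come from (custom ROOT-based event files)
process.source = cms.Source("PoolSource",
    fileNames=cms.untracked.vstring("file:events.root"))
process.maxEvents = cms.untracked.PSet(input=cms.untracked.int32(1000))

# a compiled C++ EDAnalyzer, loaded from the framework and configured here;
# every parameter is visible and configurable, no black boxes
process.demo = cms.EDAnalyzer("DemoAnalyzer",
    muonPtMin=cms.double(20.0))

process.p = cms.Path(process.demo)  # run the module on each event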
18. signal on the noise
Background text (excerpt from the H → ZZ → 4ℓ analysis): the reducible and instrumental backgrounds are estimated from data by selecting events in background control regions, well separated from the signal region, and extrapolating to the signal region using the measured probability for a reconstructed lepton to pass the isolation and identification requirements. The extended likelihood function is

  L(m | R_σ, M, Γ, σ_g, a, b, μ_B) = (e^{−μ} μ^N / N!) ∏_{i=1}^{N} [ (μ_S/μ) f_S(m_i) + (μ_B/μ) f_B(m_i) ],  with μ = μ_S + μ_B  (6-7)

where the observable m is the four-lepton invariant mass m_4ℓ, N is the total number of observed events, and f_S and f_B are the signal and background probability density functions (not to be confused with PDF, the Parton Density Function). The m_4ℓ distribution shows a clear peak at the Z boson mass, where the decay Z → 4ℓ is reconstructed, and an excess of events above the expected background around 125 GeV.

[Fig. 4: distribution of the four-lepton invariant mass for the ZZ → 4ℓ analysis; points are data, filled histograms the background, and the open histogram the signal expectation for a Higgs boson of m_H = 125 GeV added to the background expectation]

• sometimes we measure to calibrate, discover new physics, find nothing!
• at the end of the day, there will be a multivariate analysis (toy fit below)
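A toy extended maximum-likelihood fit with the same structure as equation (6-7): a Poisson term for the total count times a per-event mixture of signal and background densities. The shapes (a Gaussian peak on a flat background) and all numbers are illustrative, not the CMS analysis.

```python
# Toy extended ML fit with the structure of the likelihood above:
# L = e^{-mu} / N! * prod_i [ mu_S f_S(m_i) + mu_B f_B(m_i) ], mu = mu_S + mu_B.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(42)
# pseudo-data: flat background on [100, 160] GeV plus a narrow peak near 125
m = np.concatenate([rng.uniform(100, 160, 300), rng.normal(125, 2.0, 40)])

def nll(params):
    """Negative log of the extended likelihood, up to a constant."""
    mu_s, mu_b, mass, width = params
    if mu_s <= 0 or mu_b <= 0 or width <= 0:
        return np.inf
    dens = mu_s * norm.pdf(m, mass, width) + mu_b / 60.0  # f_B flat over 60 GeV
    return (mu_s + mu_b) - np.sum(np.log(dens))

fit = minimize(nll, x0=[30.0, 300.0, 124.0, 3.0], method="Nelder-Mead")
mu_s, mu_b, mass, width = fit.x
print(f"mu_S={mu_s:.1f}, mu_B={mu_b:.1f}, peak at {mass:.1f} GeV")
```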
19. workflow
The workflow ladder (number of users grows toward the bottom of the ladder; a code paraphrase follows below):
• Large datasets (100 TB), complex computation → use Grid compute and storage exclusively
• Large datasets (100 TB), simple computation → use Grid compute and storage exclusively
• Shared datasets (500 GB), complex computation → use Grid compute and storage exclusively
• Shared datasets (10-500 GB), complex computation → work on departmental resources, store resulting datasets to Grid storage
• Shared datasets (10-100 GB), simple computation → work on departmental resources, store resulting datasets to Grid storage
• Shared datasets (0.1-10 GB), simple computation → work on laptop/desktop machine, store resulting datasets to Grid storage
• Private datasets (0.1-10 GB), simple computation → work on laptop/desktop machine, store resulting datasets to Grid storage
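Read as code, the ladder is a routing rule from (dataset size, computation complexity) to a compute venue, with results always landing in Grid storage. A paraphrase, with thresholds copied from the ladder above; the function itself is mine, not anything CMS ships.

```python
# The workflow ladder as a routing rule: dataset size and computation
# complexity pick the compute venue; results always go to Grid storage.
# Thresholds are read off the ladder above; the function is a paraphrase.
def venue(dataset_gb: float, complex_computation: bool) -> str:
    if dataset_gb >= 100_000 or (complex_computation and dataset_gb >= 500):
        return "Grid compute and storage exclusively"
    if complex_computation or dataset_gb > 10:
        return "departmental resources, results to Grid storage"
    return "laptop/desktop, results to Grid storage"

print(venue(100_000, True))  # large dataset -> Grid
print(venue(50, True))       # shared 10-500 GB, complex -> departmental
print(venue(5, False))       # small, simple -> laptop
```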
20. the take away…
• particle physics has a special place in scale
• physicists are awful programmers — but no worse than most
• all machine learning gets reduced to being a “fit” of some sort
• a lot has been done since this was designed 15 years ago
• keep only what you can use — “skim data for fun and profit”
• humans are biased to find patterns — be careful and don’t trust yourself
• long history of real-time analysis; developed the proto-cloud
• exercise due diligence, test your algorithms, clean your data
• take time for Data Quality Monitoring (DQM)
• parallelize when you can