An overview of the state of the Biodiversity Heritage Library's first storage cluster. It covers the basics of building clustered and distributed storage with commodity hardware and open source software, and details the working software used to maintain synchronization with other global partners. Presented to the Biodiversity Heritage Library Europe's Technical Architecture board at the Natural History Museum, London, on August 25, 2010.
Clustered and distributed storage with commodity hardware and open source software
1. Clustered and distributed storage with commodity hardware and open source software
Phil Cryer
BHL Developer, Systems Analyst
BHL Europe Technical Board Meeting
25-27 August 2010, NHM London
4. BHL data, on our cluster
http://whbhl01.ubio.org/ganglia
5. BHL data, on our cluster
BHL’s first cluster in Woods Hole
• Hardware - commodity servers
o six (6) 4U-sized cabinets
o twenty-four (24) 1.5TB hard drives in each cabinet
• Software - open source software
o operating system - Debian GNU/Linux (squeeze)
o filesystem - ext4
supports filesystems up to 1 EB (1000 PB) and max file size of 16 TB
o clustered file system - GlusterFS (3.0.4); a setup sketch follows this slide
all drives run in a networked, RAID-1 style setup
all files are replicated across the cluster for redundancy
New: Acquia is using GlusterFS for their Drupal SaaS implementation
o monitoring - Monit, Ganglia for alerts and reporting
• Capacity - cluster has 97TB of replicated/distributed storage
o currently storing 66TB of data for 78,492 books
o a full record for a book can range from 24MB to 3GB
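As a rough illustration of how a replicated GlusterFS 3.0.x volume of this shape can be assembled (a sketch only; the hostnames, device names, and paths below are placeholders, not the actual Woods Hole configuration):

    # On each server: format a data drive with ext4 and mount it as an export
    mkfs.ext4 /dev/sdb
    mkdir -p /export/brick1
    mount /dev/sdb /export/brick1

    # Generate replicated (RAID-1 style) volume files with glusterfs-volgen,
    # pairing exports across servers so every file lives on two machines
    glusterfs-volgen --name bhl-store --raid 1 \
        server1:/export/brick1 server2:/export/brick1

    # Mount the clustered volume on a client using the generated volfile
    glusterfs -f bhl-store-tcp.vol /mnt/bhl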
9. Initial file population
Populating a cluster with our data at the Internet Archive
• Looked at many options
o ship a pre-populated server (a Sun Thumper with 48TB capacity)
o ship individual external hard drives
o download the files ourselves
• Path of least resistance: we wrote a script (sketched after this slide) and used the Internet2 connection at the
Marine Biological Laboratory (Woods Hole) to download directly to the first cluster
o knew it would take forever to download (but it took longer)
o needed space to download files (cluster buildout)
o networking issues in Woods Hole (overloaded local router)
o file verification (checksums that don’t...)
• Lessons learned - would we do it again? Probably not.
• Current propagation method
o initial distribution - mailing external drives (1, 5)
o syncing changes for future content (smaller bites)
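The core of such a download-and-verify pass can be sketched in a few lines of shell (a sketch only, not the original script; identifiers.txt, checksums.md5, and the paths are placeholder names):

    # Fetch every Internet Archive item named in a local identifier list,
    # one IA identifier per line
    while read -r id; do
        wget -r -np -nH --cut-dirs=2 -P "/mnt/bhl/$id" \
            "http://www.archive.org/download/$id/"
        # Verify what arrived; checksums.md5 stands in for whatever per-item
        # manifest is available (IA publishes per-file MD5 checksums)
        (cd "/mnt/bhl/$id" && md5sum -c checksums.md5) \
            || echo "$id failed verification" >> download-errors.log
    done < identifiers.txt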
10. Code: grabbyd
[Diagram: (1) Internet Archive, San Francisco → BHL Global, Woods Hole]
Automated process to continuously download the latest BHL data
• Uses subversion to get an updated list of new BHL content as IA identifiers
http://code.google.com/p/bhl-bits/source/browse/#svn/trunk/iaidentifiers
• An enhanced version of the original download script transfers the data
o grabbyd - a script that parses the latest iaidentifiers list, determines the IDs of the new data, and downloads that data to the cluster (a rough sketch follows this slide)
o will provide detailed reporting via status pages and/or other methods (webapp, email, RSS, XML, etc.)
Code available (open sourced, BSD licensed):
[1] http://code.google.com/p/bhl-bits/source/browse/trunk/utilities/grabby/grabbyd
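As a rough sketch of the loop grabbyd implements (the real script is at the URL above; the state file seen.txt, the list file name, and the paths here are assumptions for illustration):

    # Pull the latest identifier list from subversion
    cd /srv/bhl-bits/iaidentifiers && svn update

    # Work out which identifiers have not been fetched yet
    # (comm needs both inputs sorted)
    comm -13 <(sort seen.txt) <(sort iaidentifiers.txt) > new-ids.txt

    # Download each new item to the cluster, then record it as seen
    while read -r id; do
        wget -r -np -nH --cut-dirs=2 -P "/mnt/bhl/$id" \
            "http://www.archive.org/download/$id/" \
            && echo "$id" >> seen.txt
    done < new-ids.txt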
13. Replication|Replication
Why do we need replication?
• At first, BHL stored everything at the Internet Archive in San Francisco
o no backup or safety net
o limited in what we could do with our data, and in how we could serve it
• Now with our first BHL cluster, we gain
o redundancy - will be able to serve from the cluster and fall back to IA if needed
o analytics - the files are ‘local’ to parse through, discover new relationships
o serving options - geo-location, eventually will be able to serve from closest server
• Next - share the data with everyone
o Europe
o Australia
o China
o etc...
• Provide safe harbor
o lots of copies...
15. Code: bhl-sync
Open source Dropbox model
• uses and implements many open source projects
o inotify - a Linux kernel subsystem that notices changes to the filesystem and reports them to applications (in the kernel since 2.6.13, released in 2005)
o lsyncd - an open source project that provides a wrapper around inotify
o OpenSSH - secure file transfer
o rsync - a proven, long-established syncing tool
What does bhl-sync do?
• runs lsyncd as a daemon that notices kernel events and kicks off rsync over OpenSSH to mirror data to designated remote servers (a conceptual sketch follows this slide)
• the only requirement on the remote system is a secure login for a normal user (via key-based OpenSSH), keeping the process neutral: no specific OS, applications, or filesystem is required on the remote side (cross-platform)
• want to mirror BHL? it’s now possible (you just need a lot of storage)
Code available (open sourced, BSD licensed):
http://code.google.com/p/bhl-bits/source/browse/trunk/utilities/bhl-sync.sh
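Conceptually, the pipeline that bhl-sync drives through lsyncd can be sketched with inotifywait from inotify-tools (a sketch only; the host, user, key, and paths are placeholders):

    # Watch the cluster mount for finished writes, creations, deletions and
    # moves, and mirror the tree to a remote node with rsync over OpenSSH
    inotifywait -m -r -e close_write,create,delete,move /mnt/bhl |
    while read -r event; do
        rsync -az --delete -e "ssh -i /home/bhl/.ssh/sync_key" \
            /mnt/bhl/ bhl@mirror.example.org:/mnt/bhl/
    done

In practice lsyncd adds the important refinement of batching: a burst of filesystem events triggers one delayed rsync run rather than one run per event.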
17. BHL content distribution
[Diagram: (1) Internet Archive, San Francisco → BHL Global, Woods Hole; (2) BHL Global → BHL, St. Louis and BHL Europe, London]
Code available (open sourced, BSD licensed):
[1] http://code.google.com/p/bhl-bits/source/browse/trunk/utilities/grabby/grabbyd
[2] http://code.google.com/p/bhl-bits/source/browse/trunk/utilities/bhl-sync.sh
18. BHL content distribution
[Diagram: (1) Internet Archive, San Francisco → BHL Global, Woods Hole; (2) BHL Global → BHL, St. Louis and BHL Europe, London; proposed (?) links to BHL China, Beijing and BHL Australia, Melbourne]
Code available (open sourced, BSD licensed):
[1] http://code.google.com/p/bhl-bits/source/browse/trunk/utilities/grabby/grabbyd
[2] http://code.google.com/p/bhl-bits/source/browse/trunk/utilities/bhl-sync.sh
19. Other replication challenges
• Deleting content - "going dark"
o this can be data that is removed from search indexes, but still retrievable via URI
o or data deleted outright and no longer available (requires a separate sync process; one possible shape is sketched after this slide)
• New content coming in from other sources
o localization of content - maybe not all of it can be shared?
o national node considerations
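One possible shape for that separate deletion pass, as a sketch (dark-ids.txt and the paths are placeholder names, not an agreed mechanism):

    # On a mirror: remove "dark" items named in an agreed deletion list,
    # independently of the normal additive sync, and log each removal
    while read -r id; do
        rm -rf "/mnt/bhl/$id" \
            && echo "$(date -u) removed $id" >> bhl-dark.log
    done < dark-ids.txt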
20. BHL content + local data
[Diagram: BHL China, Beijing → Internet Archive, San Francisco → BHL Global, Woods Hole]
Content sourced from China, scanned by the Internet Archive, replicated into BHL Global
21. BHL content + regional data
[Diagram: BHL Europe, Paris / BHL Europe, London / BHL Europe, Berlin, with an uncertain (?) link back to Internet Archive, San Francisco and BHL Global, Woods Hole]
Content sourced from BHL Europe partners may, or may not, be passed back to the Internet Archive and BHL Global
23. Fedora-commons integration
Integrated digital repository-centered platform
• enables storage, access and management of virtually any kind of digital content
• can be a base for software developers to build tools and front ends for sharing, reusing and displaying data online
• is free, community-supported, open source software
• Creates and maintains a persistent, stable, digital archive
o provides backup, redundancy and disaster recovery
o complements (doesn’t replace or put any demands upon) existing architecture by
incorporating open standards
o stores data in a neutral manner, allowing for an independent disaster recovery
option
o shares data via OAI and a REST-based interface (an example request follows this slide)
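For example, a partner node could harvest records from a Fedora-backed repository with standard OAI-PMH requests (the host and provider path below are placeholders):

    # Identify the repository, then harvest Dublin Core records;
    # the verbs and metadataPrefix are standard OAI-PMH 2.0
    curl 'http://bhl.example.org/oaiprovider?verb=Identify'
    curl 'http://bhl.example.org/oaiprovider?verb=ListRecords&metadataPrefix=oai_dc'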
24. BHL content distribution
[Diagram: Internet Archive, San Francisco → BHL Global, Woods Hole (Fedora-commons) → BHL, St. Louis and BHL Europe, London]
25. BHL content distribution
[Diagram: Internet Archive, San Francisco → BHL Global, Woods Hole (Fedora-commons) → OAI → BHL node (Fedora-commons)]
27. Thanks + questions
Thanks to Adrian Smales, Chris Sleep (NHM), Chris Freeland, Tom Garnett (BHL) and Cathy Norton, Anthony Goddard, and the Woods Hole networking admins (MBL) for their work and support of this project.
email phil.cryer@mobot.org
skype phil.cryer
twitter @fak3r
slides available on slideshare