Clustered and distributed storage with commodity hardware and open source software

Phil Cryer
BHL Developer, Systems Analyst
BHL Europe Technical Board Meeting
25-27 August 2010, NHM London
BHL data, on our cluster
BHL’s first cluster in Woods Hole
 • Hardware - commodity servers
    o (6) six 4U sized cabinets
    o (24) twenty-four 1.5TB hard drives in each cabinet
 • Software - open source software
    o operating system is Debian GNU/Linux (squeeze)
    o filesystem - ext4
          supports filesystems up to 1 EiB (1024 PiB) and a maximum file size of 16 TiB
    o clustered file system - GlusterFS (3.0.4)
          all drives run in a networked/RAID1 setup
          all files are replicated and redundantly copied across the cluster
          New: Acquia is using GlusterFS for their Drupal SaaS implementation
    o monitoring - Monit, Ganglia for alerts and reporting
BHL data, on our cluster
http://whbhl01.ubio.org/ganglia
 • Capacity - cluster has 97TB of replicated/distributed storage
    o currently using 66TB of data for 78492 books
    o a full record for a book can be 24MB - 3GB
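
The capacity figures above can be sanity-checked with a little arithmetic. The sketch below is only a rough check: the replica count of 2 is an assumption (suggested by the "networked/RAID1" description but not stated on the slide), and so is the idea that the 97TB figure is reported in binary units after filesystem overhead.

```python
# Back-of-the-envelope capacity check using the figures from the slides.
CABINETS = 6
DRIVES_PER_CABINET = 24
DRIVE_TB = 1.5        # decimal terabytes per drive, as sold
REPLICA_COUNT = 2     # ASSUMPTION: every file is stored twice (RAID1-style)

raw_tb = CABINETS * DRIVES_PER_CABINET * DRIVE_TB   # total raw disk space
usable_tb = raw_tb / REPLICA_COUNT                  # space left after replication
usable_tib = usable_tb * 10**12 / 2**40             # same amount in binary units

USED_TB = 66
BOOKS = 78492
avg_mb_per_book = USED_TB * 10**6 / BOOKS           # decimal MB per book record

print(f"raw: {raw_tb:.0f} TB, usable: {usable_tb:.0f} TB (~{usable_tib:.1f} TiB)")
print(f"average record size: ~{avg_mb_per_book:.0f} MB")
```

Under these assumptions the usable space comes out just above the 97TB on the slide, and the average record size sits well inside the quoted 24MB - 3GB range.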
Files from a record

# ls -lh /mnt/glusterfs/www/a/actasocietatissc26suom
total 649M
-rwxr-xr-x 1 www-data www-data 19M 2009-07-10 01:55    actasocietatissc26suom_abbyy.gz
-rwxr-xr-x 1 www-data www-data 28M 2009-07-10 06:53    actasocietatissc26suom_bw.pdf
-rwxr-xr-x 1 www-data www-data 1.3K 2009-06-12 10:21   actasocietatissc26suom_dc.xml
-rwxr-xr-x 1 www-data www-data 18M 2009-07-10 03:05    actasocietatissc26suom.djvu
-rwxr-xr-x 1 www-data www-data 1.3M 2009-07-10 06:54   actasocietatissc26suom_djvu.txt
-rwxr-xr-x 1 www-data www-data 14M 2009-07-10 02:08    actasocietatissc26suom_djvu.xml
-rwxr-xr-x 1 www-data www-data 4.4K 2009-12-14 04:42   actasocietatissc26suom_files.xml
-rwxr-xr-x 1 www-data www-data 20M 2009-07-09 18:57    actasocietatissc26suom_flippy.zip
-rwxr-xr-x 1 www-data www-data 285K 2009-07-09 18:52   actasocietatissc26suom.gif
-rwxr-xr-x 1 www-data www-data 193M 2009-07-09 18:51   actasocietatissc26suom_jp2.zip
-rwxr-xr-x 1 www-data www-data 5.7K 2009-06-12 10:21   actasocietatissc26suom_marc.xml
-rwxr-xr-x 1 www-data www-data 2.0K 2009-06-12 10:21   actasocietatissc26suom_meta.mrc
-rwxr-xr-x 1 www-data www-data 416 2009-06-12 10:21    actasocietatissc26suom_metasource.xml
-rwxr-xr-x 1 www-data www-data 2.2K 2009-12-01 12:20   actasocietatissc26suom_meta.xml
-rwxr-xr-x 1 www-data www-data 279K 2009-12-14 04:42   actasocietatissc26suom_names.xml
-rwxr-xr-x 1 www-data www-data 324M 2009-07-09 13:28   actasocietatissc26suom_orig_jp2.tar
-rwxr-xr-x 1 www-data www-data 34M 2009-07-10 04:35    actasocietatissc26suom.pdf
-rwxr-xr-x 1 www-data www-data 365K 2009-07-09 13:28   actasocietatissc26suom_scandata.xml
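
The listing suggests a simple layout: one directory per Internet Archive identifier, grouped under the identifier's first character. A minimal sketch of that mapping follows; the grouping rule is inferred from this single example, and the function names are hypothetical.

```python
from pathlib import PurePosixPath

# Mount point taken from the listing above.
CLUSTER_ROOT = PurePosixPath("/mnt/glusterfs/www")


def record_dir(identifier: str) -> PurePosixPath:
    """Return the directory assumed to hold all files of one record."""
    if not identifier:
        raise ValueError("identifier must not be empty")
    # ASSUMPTION: records are grouped by the first character of the identifier.
    return CLUSTER_ROOT / identifier[0] / identifier


def record_file(identifier: str, suffix: str) -> PurePosixPath:
    """Return the path of one file of a record, e.g. suffix '_djvu.txt'."""
    return record_dir(identifier) / f"{identifier}{suffix}"


print(record_file("actasocietatissc26suom", "_djvu.txt"))
```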
Initial file population
Populating a cluster with our data at the Internet Archive
 • Looked at many options
    o ship a pre-populated server (Sun Thumper with 48TB capacity)
    o ship individual external hard drives
    o download the files on our own

• Path of least resistance: we wrote a script and used the Internet2 connection at the
  Marine Biological Laboratory (Woods Hole) to download directly to the first cluster
   o knew it would take forever to download (but it took longer)
   o needed space to download files (cluster buildout)
   o networking issues in Woods Hole (overloaded local router)
   o file verification (checksums that don’t...)


• Lessons learned - would we do it again? Probably not.

• Current propagation method
   o initial distribution - mailing external drives (1, 5)
   o syncing of the changes for future content (smaller bites)
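
File verification after a bulk transfer can be sketched as follows. This is not the script used for the original download; it assumes a manifest shaped like `<files><file name="..."><md5>...</md5></file></files>`, which is modeled on the `_files.xml` file each record carries, and the exact manifest layout should be treated as an assumption.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte archives never sit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_record(record_dir: Path, manifest_xml: str) -> list[str]:
    """Return the names of files that are missing or fail their MD5 check."""
    failures = []
    for entry in ET.fromstring(manifest_xml).iter("file"):
        name = entry.get("name")
        expected = entry.findtext("md5")
        if name is None or expected is None:
            continue  # this entry carries no checksum to compare against
        target = record_dir / name
        if not target.is_file() or md5_of(target) != expected.strip().lower():
            failures.append(name)
    return failures
```

An empty result means every listed file is present and matches; anything returned is a candidate for re-download.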
Code: grabbyd
[Diagram: link 1 between Internet Archive, San Francisco and BHL Global, Woods Hole]

Automated process to continuously download the latest BHL data
 • Uses Subversion to get an updated list of new BHL content as IA identifiers
   http://code.google.com/p/bhl-bits/source/browse/#svn/trunk/iaidentifiers
 • An enhanced version of the original download script to transfer the data
    o grabbyd - a script that parses the latest iaidentifiers list, determines the IDs of the
      new data and downloads the data to the cluster
    o Will provide detailed reporting with status pages and/or another method (webapp,
      email, RSS, XML, etc)

   Code available (open sourced, BSD licensed):
   [1] http://code.google.com/p/bhl-bits/source/browse/trunk/utilities/grabby/grabbyd
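
The core of the process described above, working out which identifiers are new, can be sketched in a few lines. This is an illustration, not the actual grabbyd logic: it assumes the identifier list holds one identifier per line, and it assumes files are fetched through the Internet Archive's `https://archive.org/download/<identifier>/<filename>` URL pattern. All function names are hypothetical.

```python
from urllib.parse import quote


def parse_identifiers(text: str) -> list[str]:
    """Return identifiers from a list file, one per line, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def new_identifiers(latest: list[str], already_fetched: set[str]) -> list[str]:
    """Identifiers from the latest list that are not on the cluster yet."""
    seen = set(already_fetched)
    fresh = []
    for identifier in latest:
        if identifier not in seen:
            seen.add(identifier)  # also drops repeats within the list itself
            fresh.append(identifier)
    return fresh


def download_url(identifier: str, filename: str) -> str:
    """Build the download URL for one file of a record (assumed URL pattern)."""
    return f"https://archive.org/download/{quote(identifier)}/{quote(filename)}"
```

Keeping the result in list order means records are downloaded in the order they were published, which makes an interrupted run easier to resume.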
Code: grabbyd + reporting
http://cluster.biodiversitylibrary.org/
Replication
Why do we need replication?
• First BHL stored everything at the Internet Archive in San Francisco
    o no backup or safety net
    o limited in what we could do with, and serve, our data
• Now with our first BHL cluster, we gain
    o redundancy - will be able to serve from the cluster and fall back to IA if needed
    o analytics - the files are ‘local’ to parse through, discover new relationships
    o serving options - geo-location, eventually will be able to serve from closest server
• Next - share the data with everyone
    o Europe
    o Australia
    o China
    o etc...
• Provide safe harbor
    o lots of copies...
Code: bhl-sync
Open source Dropbox model
 • uses and implements many open source projects
    o inotify - a subsystem within the Linux kernel that extends the filesystem to notice
       changes to the filesystem and report them to applications (in the kernel since
       2.6.13 (2005))
    o lsyncd - an open source project that provides a wrapper around inotify
    o OpenSSH - secure file transfer
    o rsync - long-standing, proven syncing tool
What does bhl-sync do?
• runs lsyncd as a daemon that notices kernel events and kicks off rsync over OpenSSH
   to mirror data to designated remote servers
• the only requirement on the remote system is a secure login for a normal user (using
   key-based OpenSSH authentication), keeping the process neutral and not requiring any other
   specific technologies (OS, applications, filesystem) on the remote system (cross-platform)
• want to mirror BHL? it’s now possible (you just need a lot of storage)
           Code available (open sourced, BSD licensed):
           http://code.google.com/p/bhl-bits/source/browse/trunk/utilities/bhl-sync.sh
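
The transfer step that lsyncd triggers boils down to one rsync invocation tunnelled over OpenSSH. The sketch below only builds that command. The flag selection, host, user and paths are illustrative assumptions, and the real bhl-sync.sh may well use different options.

```python
import shlex


def rsync_command(source_dir: str, remote_user: str, remote_host: str,
                  remote_dir: str, identity_file: str) -> list[str]:
    """Build an rsync-over-ssh command that mirrors source_dir to a remote host."""
    # Key-based, non-interactive login: BatchMode makes ssh fail rather than prompt.
    ssh = f"ssh -i {shlex.quote(identity_file)} -o BatchMode=yes"
    return [
        "rsync",
        "--archive",    # recurse and preserve permissions, times and symlinks
        "--compress",   # compress data in transit
        "--partial",    # keep partial files so large transfers can resume
        "-e", ssh,      # run the transfer over OpenSSH
        source_dir.rstrip("/") + "/",  # trailing slash: copy contents, not the directory
        f"{remote_user}@{remote_host}:{remote_dir}",
    ]


# Hypothetical mirror; nothing is executed here, the command is only printed.
print(shlex.join(rsync_command("/mnt/glusterfs/www", "bhl", "mirror.example.org",
                               "/data/bhl", "/home/bhl/.ssh/id_rsa")))
```

Returning an argument list rather than a shell string means the command can be handed to `subprocess.run(cmd, check=True)` without further quoting.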
Code: bhl-sync + status
http://bit.ly/09-bhl-sync
BHL content distribution
[Diagram: link 1 between Internet Archive, San Francisco and BHL Global, Woods Hole; links 2 to BHL, St. Louis and BHL Europe, London]

      Code available (open sourced, BSD licensed):
      [1] http://code.google.com/p/bhl-bits/source/browse/trunk/utilities/grabby/grabbyd
      [2] http://code.google.com/p/bhl-bits/source/browse/trunk/utilities/bhl-sync.sh
BHL content distribution
[Diagram: link 1 between Internet Archive, San Francisco and BHL Global, Woods Hole; links 2 to BHL, St. Louis and BHL Europe, London; links marked ? to BHL China, Beijing and BHL Australia, Melbourne]

Other replication challenges

• Deleting content - "going dark"
  o this can be data that is removed from search indexes but still
    retrievable via URI
  o or data that is deleted and no longer available (requires a separate sync process)
• New content coming in from other sources
  o Localization of content - perhaps not all of it can be shared?
  o National nodes consideration
BHL content + local data
[Diagram: Internet Archive, San Francisco; BHL Global, Woods Hole; BHL China, Beijing]
Content sourced from China, scanned by Internet Archive, replicated into BHL Global
BHL content + regional data
[Diagram: Internet Archive, San Francisco; BHL Global, Woods Hole; a link marked ? to BHL Europe nodes in Paris, London and Berlin]
Content sourced from BHL Europe partners may, or may not, be passed back to Internet Archive and BHL Global
Fedora-commons integration
Integrated digital repository-centered platform
 • Enables storage, access and management of virtually any kind of digital content
 • Can be a base for software developers to build tools and front ends for sharing,
    reusing and displaying data online
 • Is free, community supported, open source software
 • Creates and maintains a persistent, stable, digital archive
    o provides backup, redundancy and disaster recovery
    o complements (doesn’t replace or put any demands upon) existing architecture by
      incorporating open standards
    o stores data in a neutral manner, allowing for an independent disaster recovery
      option
    o shares data via OAI, REST based interface
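
Harvesting over OAI means issuing OAI-PMH requests against the repository. A minimal sketch of building a `ListRecords` request follows; the base URL is a placeholder, since the endpoint of an actual Fedora installation depends on how its OAI provider is deployed.

```python
from typing import Optional
from urllib.parse import urlencode


def oai_list_records_url(base_url: str, metadata_prefix: str = "oai_dc",
                         from_date: Optional[str] = None,
                         set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # only records changed on or after this date
    if set_spec:
        params["set"] = set_spec    # restrict the harvest to one set
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint, for illustration only.
print(oai_list_records_url("http://repo.example.org/oai", from_date="2010-08-01"))
```

The `from` argument is what makes incremental harvesting possible: a node can ask only for records changed since its last run.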
BHL content distribution
[Diagram: Internet Archive, San Francisco; BHL Global, Woods Hole; Fedora-commons; BHL, St. Louis; BHL Europe, London]
BHL content distribution
[Diagram: Internet Archive, San Francisco; BHL Global, Woods Hole; Fedora-commons; an OAI link to a BHL node with its own Fedora-commons]
Thanks + questions

Thanks to Adrian Smales, Chris Sleep (NHM), Chris Freeland, Tom Garnett (BHL) and
Cathy Norton, Anthony Goddard, Woods Hole networking admins (MBL) for their work
and support of this project.

email phil.cryer@mobot.org
skype phil.cryer
twitter @fak3r

slides available on slideshare
More Related Content

What's hot

Node.js Interactive
Node.js InteractiveNode.js Interactive
Node.js InteractiveDavid Dias
 
RDM#2- The Distributed Web
RDM#2- The Distributed WebRDM#2- The Distributed Web
RDM#2- The Distributed WebDavid Dias
 
basic linux command (questions)
basic linux command (questions)basic linux command (questions)
basic linux command (questions)Sukhraj Singh
 
Linux Memory Analysis with Volatility
Linux Memory Analysis with VolatilityLinux Memory Analysis with Volatility
Linux Memory Analysis with VolatilityAndrew Case
 
(120513) #fitalk an introduction to linux memory forensics
(120513) #fitalk   an introduction to linux memory forensics(120513) #fitalk   an introduction to linux memory forensics
(120513) #fitalk an introduction to linux memory forensicsINSIGHT FORENSIC
 
The basic concept of Linux FIleSystem
The basic concept of Linux FIleSystemThe basic concept of Linux FIleSystem
The basic concept of Linux FIleSystemHungWei Chiu
 
Linux admin interview questions
Linux admin interview questionsLinux admin interview questions
Linux admin interview questionsKavya Sri
 
Course 102: Lecture 27: FileSystems in Linux (Part 2)
Course 102: Lecture 27: FileSystems in Linux (Part 2)Course 102: Lecture 27: FileSystems in Linux (Part 2)
Course 102: Lecture 27: FileSystems in Linux (Part 2)Ahmed El-Arabawy
 
Workshop - Linux Memory Analysis with Volatility
Workshop - Linux Memory Analysis with VolatilityWorkshop - Linux Memory Analysis with Volatility
Workshop - Linux Memory Analysis with VolatilityAndrew Case
 
Compression
CompressionCompression
Compressionaswathyu
 
Compression Commands in Linux
Compression Commands in LinuxCompression Commands in Linux
Compression Commands in LinuxPegah Taheri
 
AOS Lab 9: File system -- Of buffers, logs, and blocks
AOS Lab 9: File system -- Of buffers, logs, and blocksAOS Lab 9: File system -- Of buffers, logs, and blocks
AOS Lab 9: File system -- Of buffers, logs, and blocksZubair Nabi
 
The TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux KernelThe TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux KernelDivye Kapoor
 
101 3.3 perform basic file management
101 3.3 perform basic file management101 3.3 perform basic file management
101 3.3 perform basic file managementAcácio Oliveira
 
101 2.4 use debian package management
101 2.4 use debian package management101 2.4 use debian package management
101 2.4 use debian package managementAcácio Oliveira
 
101 2.1 design hard disk layout
101 2.1 design hard disk layout101 2.1 design hard disk layout
101 2.1 design hard disk layoutAcácio Oliveira
 
12 linux archiving tools
12 linux archiving tools12 linux archiving tools
12 linux archiving toolsShay Cohen
 
Memory forensics
Memory forensicsMemory forensics
Memory forensicsSunil Kumar
 

What's hot (20)

Node.js Interactive
Node.js InteractiveNode.js Interactive
Node.js Interactive
 
RDM#2- The Distributed Web
RDM#2- The Distributed WebRDM#2- The Distributed Web
RDM#2- The Distributed Web
 
basic linux command (questions)
basic linux command (questions)basic linux command (questions)
basic linux command (questions)
 
Linux Memory Analysis with Volatility
Linux Memory Analysis with VolatilityLinux Memory Analysis with Volatility
Linux Memory Analysis with Volatility
 
4. linux file systems
4. linux file systems4. linux file systems
4. linux file systems
 
(120513) #fitalk an introduction to linux memory forensics
(120513) #fitalk   an introduction to linux memory forensics(120513) #fitalk   an introduction to linux memory forensics
(120513) #fitalk an introduction to linux memory forensics
 
The basic concept of Linux FIleSystem
The basic concept of Linux FIleSystemThe basic concept of Linux FIleSystem
The basic concept of Linux FIleSystem
 
Linux admin interview questions
Linux admin interview questionsLinux admin interview questions
Linux admin interview questions
 
Course 102: Lecture 27: FileSystems in Linux (Part 2)
Course 102: Lecture 27: FileSystems in Linux (Part 2)Course 102: Lecture 27: FileSystems in Linux (Part 2)
Course 102: Lecture 27: FileSystems in Linux (Part 2)
 
Workshop - Linux Memory Analysis with Volatility
Workshop - Linux Memory Analysis with VolatilityWorkshop - Linux Memory Analysis with Volatility
Workshop - Linux Memory Analysis with Volatility
 
Compression
CompressionCompression
Compression
 
Compression Commands in Linux
Compression Commands in LinuxCompression Commands in Linux
Compression Commands in Linux
 
AOS Lab 9: File system -- Of buffers, logs, and blocks
AOS Lab 9: File system -- Of buffers, logs, and blocksAOS Lab 9: File system -- Of buffers, logs, and blocks
AOS Lab 9: File system -- Of buffers, logs, and blocks
 
The TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux KernelThe TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux Kernel
 
101 3.3 perform basic file management
101 3.3 perform basic file management101 3.3 perform basic file management
101 3.3 perform basic file management
 
101 2.4 use debian package management
101 2.4 use debian package management101 2.4 use debian package management
101 2.4 use debian package management
 
Registry
RegistryRegistry
Registry
 
101 2.1 design hard disk layout
101 2.1 design hard disk layout101 2.1 design hard disk layout
101 2.1 design hard disk layout
 
12 linux archiving tools
12 linux archiving tools12 linux archiving tools
12 linux archiving tools
 
Memory forensics
Memory forensicsMemory forensics
Memory forensics
 

Viewers also liked

Getting started with Mantl
Getting started with MantlGetting started with Mantl
Getting started with MantlPhil Cryer
 
ICDE2015 Research 3: Distributed Storage and Processing
ICDE2015 Research 3: Distributed Storage and ProcessingICDE2015 Research 3: Distributed Storage and Processing
ICDE2015 Research 3: Distributed Storage and ProcessingTakuma Wakamori
 
Survey of distributed storage system
Survey of distributed storage systemSurvey of distributed storage system
Survey of distributed storage systemZhichao Liang
 
7 distributed storage_open_stack
7 distributed storage_open_stack7 distributed storage_open_stack
7 distributed storage_open_stackopenstackindia
 
DumpFS - A Distributed Storage Solution
DumpFS - A Distributed Storage SolutionDumpFS - A Distributed Storage Solution
DumpFS - A Distributed Storage SolutionNuno Loureiro
 
Identity Based Secure Distributed Storage Scheme
Identity Based Secure Distributed Storage SchemeIdentity Based Secure Distributed Storage Scheme
Identity Based Secure Distributed Storage SchemeVenkatesh Devam ☁
 
Use Distributed Filesystem as a Storage Tier
Use Distributed Filesystem as a Storage TierUse Distributed Filesystem as a Storage Tier
Use Distributed Filesystem as a Storage TierManfred Furuholmen
 
Deploying pNFS over Distributed File Storage w/ Jiffin Tony Thottan and Niels...
Deploying pNFS over Distributed File Storage w/ Jiffin Tony Thottan and Niels...Deploying pNFS over Distributed File Storage w/ Jiffin Tony Thottan and Niels...
Deploying pNFS over Distributed File Storage w/ Jiffin Tony Thottan and Niels...Gluster.org
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Alluxio (formerly Tachyon)...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Alluxio (formerly Tachyon)...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Alluxio (formerly Tachyon)...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Alluxio (formerly Tachyon)...Data Con LA
 
Strategies for Distributed Data Storage
Strategies for Distributed Data StorageStrategies for Distributed Data Storage
Strategies for Distributed Data Storagekakugawa
 
Tachyon: An Open Source Memory-Centric Distributed Storage System
Tachyon: An Open Source Memory-Centric Distributed Storage SystemTachyon: An Open Source Memory-Centric Distributed Storage System
Tachyon: An Open Source Memory-Centric Distributed Storage SystemTachyon Nexus, Inc.
 

Viewers also liked (13)

Getting started with Mantl
Getting started with MantlGetting started with Mantl
Getting started with Mantl
 
ICDE2015 Research 3: Distributed Storage and Processing
ICDE2015 Research 3: Distributed Storage and ProcessingICDE2015 Research 3: Distributed Storage and Processing
ICDE2015 Research 3: Distributed Storage and Processing
 
Survey of distributed storage system
Survey of distributed storage systemSurvey of distributed storage system
Survey of distributed storage system
 
7 distributed storage_open_stack
7 distributed storage_open_stack7 distributed storage_open_stack
7 distributed storage_open_stack
 
DumpFS - A Distributed Storage Solution
DumpFS - A Distributed Storage SolutionDumpFS - A Distributed Storage Solution
DumpFS - A Distributed Storage Solution
 
Distributed storage system
Distributed storage systemDistributed storage system
Distributed storage system
 
Integrated Distributed Solar and Storage
Integrated Distributed Solar and StorageIntegrated Distributed Solar and Storage
Integrated Distributed Solar and Storage
 
Identity Based Secure Distributed Storage Scheme
Identity Based Secure Distributed Storage SchemeIdentity Based Secure Distributed Storage Scheme
Identity Based Secure Distributed Storage Scheme
 
Use Distributed Filesystem as a Storage Tier
Use Distributed Filesystem as a Storage TierUse Distributed Filesystem as a Storage Tier
Use Distributed Filesystem as a Storage Tier
 
Deploying pNFS over Distributed File Storage w/ Jiffin Tony Thottan and Niels...
Deploying pNFS over Distributed File Storage w/ Jiffin Tony Thottan and Niels...Deploying pNFS over Distributed File Storage w/ Jiffin Tony Thottan and Niels...
Deploying pNFS over Distributed File Storage w/ Jiffin Tony Thottan and Niels...
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Alluxio (formerly Tachyon)...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Alluxio (formerly Tachyon)...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Alluxio (formerly Tachyon)...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Alluxio (formerly Tachyon)...
 
Strategies for Distributed Data Storage
Strategies for Distributed Data StorageStrategies for Distributed Data Storage
Strategies for Distributed Data Storage
 
Tachyon: An Open Source Memory-Centric Distributed Storage System
Tachyon: An Open Source Memory-Centric Distributed Storage SystemTachyon: An Open Source Memory-Centric Distributed Storage System
Tachyon: An Open Source Memory-Centric Distributed Storage System
 

Similar to Clustered and distributed
 storage with
 commodity hardware 
and open source software


Storing and distributing data
Storing and distributing dataStoring and distributing data
Storing and distributing dataPhil Cryer
 
BHL hardware architecture - storage and clusters
BHL hardware architecture - storage and clustersBHL hardware architecture - storage and clusters
BHL hardware architecture - storage and clustersPhil Cryer
 
Root file system
Root file systemRoot file system
Root file systemBindu U
 
Reproducible bioinformatics pipelines with Docker and Anduril
Reproducible bioinformatics pipelines with Docker and AndurilReproducible bioinformatics pipelines with Docker and Anduril
Reproducible bioinformatics pipelines with Docker and AndurilChristian Frech
 
How to Make a Honeypot Stickier (SSH*)
How to Make a Honeypot Stickier (SSH*)How to Make a Honeypot Stickier (SSH*)
How to Make a Honeypot Stickier (SSH*)Jose Hernandez
 
How to Make a Honeypot Stickier (SSH*)
How to Make a Honeypot Stickier (SSH*)How to Make a Honeypot Stickier (SSH*)
How to Make a Honeypot Stickier (SSH*)Jose Hernandez
 
Desktop as a Service supporting Environmental ‘omics
Desktop as a Service supporting Environmental ‘omicsDesktop as a Service supporting Environmental ‘omics
Desktop as a Service supporting Environmental ‘omicsDavid Wallom
 
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...BertrandDrouvot
 
Introduction to Globus: Research Data Management Software at the ALCF
Introduction to Globus: Research Data Management Software at the ALCFIntroduction to Globus: Research Data Management Software at the ALCF
Introduction to Globus: Research Data Management Software at the ALCFGlobus
 
Ganesh naik linux_kernel_internals
Ganesh naik linux_kernel_internalsGanesh naik linux_kernel_internals
Ganesh naik linux_kernel_internalsnullowaspmumbai
 
Introduction to linux at Introductory Bioinformatics Workshop
Introduction to linux at Introductory Bioinformatics WorkshopIntroduction to linux at Introductory Bioinformatics Workshop
Introduction to linux at Introductory Bioinformatics WorkshopSetor Amuzu
 
Containerization Is More than the New Virtualization
Containerization Is More than the New VirtualizationContainerization Is More than the New Virtualization
Containerization Is More than the New VirtualizationC4Media
 
Tutorial: What's New with Globus
Tutorial: What's New with GlobusTutorial: What's New with Globus
Tutorial: What's New with GlobusGlobus
 
Swift extensions for Tape Storage or other High-Latency Media
Swift extensions for Tape Storage or other High-Latency MediaSwift extensions for Tape Storage or other High-Latency Media
Swift extensions for Tape Storage or other High-Latency MediaSlavisa Sarafijanovic
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorialvinayiqbusiness
 
Automating Research Data Management at Scale with Globus
Automating Research Data Management at Scale with GlobusAutomating Research Data Management at Scale with Globus
Automating Research Data Management at Scale with GlobusGlobus
 

Similar to Clustered and distributed
 storage with
 commodity hardware 
and open source software
 (20)

Storing and distributing data
Storing and distributing dataStoring and distributing data
Storing and distributing data
 
BHL hardware architecture - storage and clusters
BHL hardware architecture - storage and clustersBHL hardware architecture - storage and clusters
BHL hardware architecture - storage and clusters
 
Root file system
Root file systemRoot file system
Root file system
 
Reproducible bioinformatics pipelines with Docker and Anduril
Reproducible bioinformatics pipelines with Docker and AndurilReproducible bioinformatics pipelines with Docker and Anduril
Reproducible bioinformatics pipelines with Docker and Anduril
 
Kfs presentation
Kfs presentationKfs presentation
Kfs presentation
 
How to Make a Honeypot Stickier (SSH*)
How to Make a Honeypot Stickier (SSH*)How to Make a Honeypot Stickier (SSH*)
How to Make a Honeypot Stickier (SSH*)
 
How to Make a Honeypot Stickier (SSH*)
How to Make a Honeypot Stickier (SSH*)How to Make a Honeypot Stickier (SSH*)
How to Make a Honeypot Stickier (SSH*)
 
Desktop as a Service supporting Environmental ‘omics
Desktop as a Service supporting Environmental ‘omicsDesktop as a Service supporting Environmental ‘omics
Desktop as a Service supporting Environmental ‘omics
 
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
 
Introduction to Globus: Research Data Management Software at the ALCF
Introduction to Globus: Research Data Management Software at the ALCFIntroduction to Globus: Research Data Management Software at the ALCF
Introduction to Globus: Research Data Management Software at the ALCF
 
Linux internals v4
Linux internals v4Linux internals v4
Linux internals v4
 
Libra Library OS
Libra Library OSLibra Library OS
Libra Library OS
 
First steps on CentOs7
First steps on CentOs7First steps on CentOs7
First steps on CentOs7
 
Ganesh naik linux_kernel_internals
Ganesh naik linux_kernel_internalsGanesh naik linux_kernel_internals
Ganesh naik linux_kernel_internals
 
Introduction to linux at Introductory Bioinformatics Workshop
Introduction to linux at Introductory Bioinformatics WorkshopIntroduction to linux at Introductory Bioinformatics Workshop
Introduction to linux at Introductory Bioinformatics Workshop
 
Containerization Is More than the New Virtualization
Containerization Is More than the New VirtualizationContainerization Is More than the New Virtualization
Containerization Is More than the New Virtualization
 
Tutorial: What's New with Globus
Tutorial: What's New with GlobusTutorial: What's New with Globus
Tutorial: What's New with Globus
 
Swift extensions for Tape Storage or other High-Latency Media
Swift extensions for Tape Storage or other High-Latency MediaSwift extensions for Tape Storage or other High-Latency Media
Swift extensions for Tape Storage or other High-Latency Media
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorial
 
Automating Research Data Management at Scale with Globus
Automating Research Data Management at Scale with GlobusAutomating Research Data Management at Scale with Globus
Automating Research Data Management at Scale with Globus
 

Clustered and distributed storage with commodity hardware and open source software


  • 1. Clustered and distributed storage with commodity hardware and open source software
       Phil Cryer, BHL Developer, Systems Analyst
       BHL Europe Technical Board Meeting, 25-27 August 2010, NHM London
  • 4. BHL data, on our cluster http://whbhl01.ubio.org/ganglia
  • 5. BHL data, on our cluster
       BHL’s first cluster in Woods Hole
        • Hardware - commodity servers
           o (6) six 4U sized cabinets
           o (24) twenty-four 1.5TB hard drives in each cabinet
        • Software - open source software
           o operating system is Debian GNU/Linux (squeeze)
           o filesystem - ext4
              ▪ supports filesystems up to 1 EB (1000 PB) and max file size of 16 TB
           o clustered file system - GlusterFS (3.0.4)
              ▪ all drives run in a networked/RAID1 setup
              ▪ all files are replicated and redundantly copied across the cluster
              ▪ New: Acquia is using GlusterFS for their Drupal SaaS implementation
           o monitoring - Monit, Ganglia for alerts and reporting
        • Capacity - cluster has 97TB of replicated/distributed storage
           o currently using 66TB of data for 78492 books
           o a full record for a book can be 24MB - 3GB
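The capacity figure can be sanity-checked from the hardware list on this slide. The following is a minimal sketch, assuming decimal terabytes per drive and a two-way mirror (the slide only says "networked/RAID1"); the function and constant names are illustrative. The gap between the mirrored figure it computes and the reported 97TB is presumably filesystem overhead and unit conversion, which is a guess rather than something the slides state.

```python
# Rough capacity check for the Woods Hole cluster, using figures from the slide.
CABINETS = 6
DRIVES_PER_CABINET = 24
DRIVE_TB = 1.5   # decimal terabytes per drive
REPLICAS = 2     # assumption: RAID1-style two-way mirroring


def raw_capacity_tb(cabinets: int, drives: int, drive_tb: float) -> float:
    """Total capacity of all drives, before any replication."""
    return cabinets * drives * drive_tb


def mirrored_capacity_tb(raw_tb: float, replicas: int) -> float:
    """Capacity left once every file is stored `replicas` times."""
    return raw_tb / replicas


raw = raw_capacity_tb(CABINETS, DRIVES_PER_CABINET, DRIVE_TB)
usable = mirrored_capacity_tb(raw, REPLICAS)
print(f"raw: {raw:.0f} TB, mirrored: {usable:.0f} TB")
```

By the same kind of arithmetic, 66TB spread over 78492 books works out to roughly 840MB per book on average, which sits inside the 24MB - 3GB range quoted above.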
  • 6. Files from a record
       # ls -lh /mnt/glusterfs/www/a/actasocietatissc26suom
       total 649M
       -rwxr-xr-x 1 www-data www-data  19M 2009-07-10 01:55 actasocietatissc26suom_abbyy.gz
       -rwxr-xr-x 1 www-data www-data  28M 2009-07-10 06:53 actasocietatissc26suom_bw.pdf
       -rwxr-xr-x 1 www-data www-data 1.3K 2009-06-12 10:21 actasocietatissc26suom_dc.xml
       -rwxr-xr-x 1 www-data www-data  18M 2009-07-10 03:05 actasocietatissc26suom.djvu
       -rwxr-xr-x 1 www-data www-data 1.3M 2009-07-10 06:54 actasocietatissc26suom_djvu.txt
       -rwxr-xr-x 1 www-data www-data  14M 2009-07-10 02:08 actasocietatissc26suom_djvu.xml
       -rwxr-xr-x 1 www-data www-data 4.4K 2009-12-14 04:42 actasocietatissc26suom_files.xml
       -rwxr-xr-x 1 www-data www-data  20M 2009-07-09 18:57 actasocietatissc26suom_flippy.zip
       -rwxr-xr-x 1 www-data www-data 285K 2009-07-09 18:52 actasocietatissc26suom.gif
       -rwxr-xr-x 1 www-data www-data 193M 2009-07-09 18:51 actasocietatissc26suom_jp2.zip
       -rwxr-xr-x 1 www-data www-data 5.7K 2009-06-12 10:21 actasocietatissc26suom_marc.xml
       -rwxr-xr-x 1 www-data www-data 2.0K 2009-06-12 10:21 actasocietatissc26suom_meta.mrc
       -rwxr-xr-x 1 www-data www-data  416 2009-06-12 10:21 actasocietatissc26suom_metasource.xml
       -rwxr-xr-x 1 www-data www-data 2.2K 2009-12-01 12:20 actasocietatissc26suom_meta.xml
       -rwxr-xr-x 1 www-data www-data 279K 2009-12-14 04:42 actasocietatissc26suom_names.xml
       -rwxr-xr-x 1 www-data www-data 324M 2009-07-09 13:28 actasocietatissc26suom_orig_jp2.tar
       -rwxr-xr-x 1 www-data www-data  34M 2009-07-10 04:35 actasocietatissc26suom.pdf
       -rwxr-xr-x 1 www-data www-data 365K 2009-07-09 13:28 actasocietatissc26suom_scandata.xml
  • 9. Initial file population
       Populating a cluster with our data at the Internet Archive
        • Looked at many options
           o ship a pre-populated server (Sun Thumper with 48TB capacity)
           o ship individual external hard drives
           o download the files on our own
        • Path of least resistance: we wrote a script and used the Internet2 connection at the Marine Biological Laboratory (Woods Hole) to download directly to the first cluster
           o knew it would take forever to download (but it took longer)
           o needed space to download files (cluster buildout)
           o networking issues in Woods Hole (overloaded local router)
           o file verification (checksums that don’t...)
        • Lessons learned - would we do it again? Probably not.
        • Current propagation method
           o initial distribution - mailing external drives (1, 5)
           o syncing of the changes for future content (smaller bites)
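The file verification step mentioned above could look something like the following. This is a minimal sketch, not the script BHL used: it assumes each record's `_files.xml` lists `<file name="...">` elements with an `<md5>` child, and the helper names (`md5_of`, `expected_md5s`, `failed_files`) are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-GB archives do not need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5s(files_xml: str) -> dict[str, str]:
    """Map file name -> md5 for every <file> entry that carries an <md5> child.

    Assumption: the manifest layout is <files><file name="..."><md5>...</md5></file></files>.
    """
    root = ET.fromstring(files_xml)
    result = {}
    for entry in root.iter("file"):
        name = entry.get("name")
        md5 = entry.findtext("md5")
        if name and md5:
            result[name] = md5.strip().lower()
    return result


def failed_files(record_dir: Path, files_xml: str) -> list[str]:
    """Return the names of files that are missing or whose checksum differs."""
    bad = []
    for name, expected in sorted(expected_md5s(files_xml).items()):
        target = record_dir / name
        if not target.is_file() or md5_of(target) != expected:
            bad.append(name)
    return bad
```

A download script could re-queue any record for which `failed_files` returns a non-empty list, rather than trusting that a transfer which finished also finished correctly.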
  • 10. Code: grabbyd
        [diagram: Internet Archive, San Francisco -> (1) -> BHL Global, Woods Hole]
        Automated process to continuously download the latest BHL data
         • Uses Subversion to get an updated list of new BHL content as IA identifiers
           http://code.google.com/p/bhl-bits/source/browse/#svn/trunk/iaidentifiers
         • An enhanced version of the original download script to transfer the data
            o grabbyd - a script that parses the latest iaidentifiers list, determines the IDs of the new data and downloads the data to the cluster
            o Will provide detailed reporting with status pages and/or another method (webapp, email, RSS, XML, etc)
        Code available (open sourced, BSD licensed):
        [1] http://code.google.com/p/bhl-bits/source/browse/trunk/utilities/grabby/grabbyd
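The core of what grabbyd is described as doing (parse the identifier list, work out which IDs are new, decide where each record lands) can be sketched as below. This is an illustration under assumptions, not the grabbyd source: the list format (one identifier per line), the function names, and the first-letter directory layout (inferred from the `/mnt/glusterfs/www/a/actasocietatissc26suom` path on the earlier slide) are all guesses.

```python
from pathlib import Path


def parse_identifiers(text: str) -> list[str]:
    """Assumed list format: one IA identifier per line; blanks and '#' lines skipped."""
    identifiers = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            identifiers.append(line)
    return identifiers


def new_identifiers(listed: list[str], already_have: set[str]) -> list[str]:
    """Identifiers not yet on the cluster, in list order, without repeats."""
    seen = set(already_have)
    fresh = []
    for identifier in listed:
        if identifier not in seen:
            seen.add(identifier)
            fresh.append(identifier)
    return fresh


def record_dir(root: Path, identifier: str) -> Path:
    """Assumed layout: records are sharded by the first character of the identifier."""
    return root / identifier[0] / identifier


def download_url(identifier: str, filename: str) -> str:
    """Internet Archive's public download URL pattern for one file of an item."""
    return f"https://archive.org/download/{identifier}/{filename}"
```

Keeping the "what is new" decision separate from the transfer itself means the same list can drive status reporting as well as downloads.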
  • 11. Code: grabbyd + reporting http://cluster.biodiversitylibrary.org/
  • 13. Replication|Replication
        Why do we need replication?
         • At first, BHL stored everything at the Internet Archive in San Francisco
            o no backup or safety net
            o limited in what we could do with, and serve, our data
         • Now with our first BHL cluster, we gain
            o redundancy - will be able to serve from the cluster and fall back to IA if needed
            o analytics - the files are ‘local’ to parse through, discover new relationships
            o serving options - geo-location, eventually will be able to serve from closest server
         • Next - share the data with everyone
            o Europe
            o Australia
            o China
            o etc...
         • Provide safe harbor
            o lots of copies...
  • 15. Code: bhl-sync
        Open source Dropbox model
         • uses and implements many open source projects
            o inotify - a subsystem within the Linux kernel that extends the filesystem to notice changes to the filesystem and report them to applications (in the kernel since 2.6.13 (2005))
            o lsyncd - an open source project that provides a wrapper into inotify
            o OpenSSH - secure file transfer
            o rsync - long term, proven syncing subsystem
        What does bhl-sync do?
         • runs lsyncd as a daemon that notices kernel events and kicks off rsync over OpenSSH to mirror data to designated remote servers
         • the only requirement on the remote system is a secure login for a normal user (using key-based OpenSSH authentication), keeping the process neutral and not requiring any other specific technologies (OS, applications, filesystem) on the remote system (cross-platform)
         • want to mirror BHL? it’s now possible (you just need a lot of storage)
        Code available (open sourced, BSD licensed):
        http://code.google.com/p/bhl-bits/source/browse/trunk/utilities/bhl-sync.sh
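The mirroring step (rsync over OpenSSH, triggered when a change is noticed) can be illustrated with a small command builder. This is a sketch under assumptions and not the contents of bhl-sync.sh: the function name, host, user and paths are hypothetical, and the inotify/lsyncd trigger is left out; only the rsync invocation is shown.

```python
import shlex


def build_rsync_command(source_dir, remote_user, remote_host, remote_dir,
                        ssh_key=None, delete=False):
    """Build an rsync-over-ssh argv list that mirrors source_dir to a remote host.

    All names here are illustrative; only standard rsync/ssh options are used.
    """
    # -e tells rsync which remote shell to use; -i selects an ssh identity file.
    remote_shell = "ssh" if ssh_key is None else f"ssh -i {shlex.quote(ssh_key)}"
    command = ["rsync", "--archive", "--compress", "-e", remote_shell]
    if delete:
        # Off by default: removing remote files is a separate, deliberate step.
        command.append("--delete")
    # A trailing slash makes rsync copy the directory's contents, not the directory.
    command.append(source_dir.rstrip("/") + "/")
    command.append(f"{remote_user}@{remote_host}:{remote_dir}")
    return command


# Hypothetical usage; pass the list to subprocess.run(...) to actually transfer.
print(build_rsync_command("/mnt/glusterfs/www", "bhl", "mirror.example.org", "/data/bhl/"))
```

Because the remote side only needs an SSH login and rsync, nothing in this command depends on the mirror's operating system or filesystem, which matches the cross-platform point made on the slide.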
  • 16. Code: bhl-sync + status http://bit.ly/09-bhl-sync
  • 17. BHL content distribution
        [diagram]
         • Internet Archive, San Francisco -> BHL Global, Woods Hole (link 1)
         • BHL Global, Woods Hole -> BHL, St. Louis (link 2)
         • BHL Global, Woods Hole -> BHL Europe, London (link 2)
        Code available (open sourced, BSD licensed):
        [1] http://code.google.com/p/bhl-bits/source/browse/trunk/utilities/grabby/grabbyd
        [2] http://code.google.com/p/bhl-bits/source/browse/trunk/utilities/bhl-sync.sh
  • 18. BHL content distribution
        [diagram]
         • Internet Archive, San Francisco -> BHL Global, Woods Hole (link 1)
         • BHL Global, Woods Hole -> BHL, St. Louis (link 2)
         • BHL Global, Woods Hole -> BHL Europe, London (link 2)
         • BHL China, Beijing (link ?)
         • BHL Australia, Melbourne (link ?)
        Code available (open sourced, BSD licensed):
        [1] http://code.google.com/p/bhl-bits/source/browse/trunk/utilities/grabby/grabbyd
        [2] http://code.google.com/p/bhl-bits/source/browse/trunk/utilities/bhl-sync.sh
  • 19. Other replication challenges
         • Deleting content - "going dark"
            o this can be data that is removed from search indexes, but still retrievable via URI
            o or deleted data not available (requires a separate sync process)
         • New content coming in from other sources
            o Localization of content - maybe it all can't be shared?
            o National nodes consideration
  • 20. BHL content + local data
        [diagram nodes: Internet Archive, San Francisco; BHL Global, Woods Hole; BHL China, Beijing]
        Content sourced from China, scanned by Internet Archive, replicated into BHL Global
  • 21. BHL content + regional data
        [diagram nodes: Internet Archive, San Francisco; BHL Global, Woods Hole; BHL Europe, Paris; BHL Europe, London; BHL Europe, Berlin; one link marked "?"]
        Content sourced from BHL Europe partners may, or may not, be passed back to Internet Archive and BHL Global
  • 23. Fedora-commons integration
        Integrated digital repository-centered platform
         • Enables storage, access and management of virtually any kind of digital content
         • Can be a base for software developers to build tools and front ends on for sharing, reuse and displaying data online
         • Is free, community supported, open source software
         • Creates and maintains a persistent, stable, digital archive
            o provides backup, redundancy and disaster recovery
            o complements (doesn’t replace or put any demands upon) existing architecture by incorporating open standards
            o stores data in a neutral manner, allowing for an independent disaster recovery option
            o shares data via OAI, REST based interface
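Sharing data "via OAI" generally means answering OAI-PMH harvesting requests, which are plain HTTP GETs with a `verb` parameter. The sketch below builds a `ListRecords` request URL; the base URL is a made-up placeholder (the slides do not give the repository's OAI endpoint) and the function name is hypothetical, while the parameter names come from the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix="oai_dc",
                         from_date=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    base_url is whatever endpoint the repository exposes (an assumption here).
    """
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is exclusive: no other arguments except verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date is not None:
            # Selective harvesting: only records changed on or after this date.
            params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint, for illustration only.
print(oai_list_records_url("https://repo.example.org/oai", from_date="2010-08-01"))
```

A harvesting node would fetch this URL, store the returned records, and repeat with the `resumptionToken` from each response until none is returned.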
  • 24. BHL content distribution
        [diagram nodes: Internet Archive, San Francisco; BHL Global, Woods Hole; Fedora-commons; BHL, St. Louis; BHL Europe, London]
  • 25. BHL content distribution
        [diagram nodes: Internet Archive, San Francisco; BHL Global, Woods Hole; Fedora-commons; OAI; BHL node; Fedora-commons]
  • 27. Thanks + questions
        Thanks to Adrian Smales, Chris Sleep (NHM), Chris Freeland, Tom Garnett (BHL) and Cathy Norton, Anthony Goddard, Woods Hole networking admins (MBL) for their work and support of this project.
        email: phil.cryer@mobot.org
        skype: phil.cryer
        twitter: @fak3r
        slides available on slideshare

Editor's Notes