2. Why Cluster?
➢ Service Resilience
○ Failures
○ Server admin and security patches
➢ Performance / scale
○ More hardware: CPU, RAM, system bus
3. Why Cluster?
➢ Alternatives
○ Full clustering
○ Master-slave (load scaling only; update issues)
○ Application visible partitioning
4. Who am I?
➢ Committer on Apache Jena
○ Deploy/operate Jena/TDB in ££job.
➢ W3C
○ Co-editor on SPARQL 1.0 and 1.1 query language
○ RDF 1.1 (on syntax, inc. SPARQL alignment)
○ ASF’s W3C AC representative
5. Acknowledgements
➢ Apache
➢ Partial funding: InnovateUK*
➢ Users
○ For the discussion and encouragement
* Used to be the Technology Strategy Board.
UK Department for Business, Innovation & Skills
6. Outline
➢ TDB Design
➢ SPARQL Execution
➢ Lizard Design
➢ Back to SPARQL
9. TDB : Indexes
➢ Indexes are covering
○ Range scans
○ All key, no value
○ No "triple table"
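A toy sketch of how covering indexes answer a triple pattern with a prefix range scan (illustrative Python stand-in; TDB itself keeps NodeId-encoded triples in B+Trees, here plain sorted lists):

```python
from bisect import bisect_left

# Toy covering-index sketch: the three index orders (SPO, POS, OSP) each
# hold the whole triple as the key -- there is no separate "triple table".
triples = [("s1", "p", 1), ("s1", "q", 2), ("s2", "p", 1)]

SPO = sorted(triples)
POS = sorted((p, o, s) for (s, p, o) in triples)
OSP = sorted((o, s, p) for (s, p, o) in triples)

def prefix_scan(index, prefix):
    """Range scan: all entries starting with `prefix`. Covering means the
    entry itself is the whole triple -- no second lookup is needed."""
    lo = bisect_left(index, prefix)
    out = []
    for entry in index[lo:]:
        if entry[:len(prefix)] != prefix:
            break
        out.append(entry)
    return out

# Pattern { ?x :p 1 }: P and O are known, S is the variable,
# so scan POS with the prefix (p, 1).
matches = prefix_scan(POS, ("p", 1))
print(matches)  # [('p', 1, 's1'), ('p', 1, 's2')]
```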
10. SPARQL Execution
{ ?x :p 123 . }
Convert to NodeIds
Scan POS for the (P, O) prefix; bind each matching S to ?x
123 is an inline constant in TDB.
{ ?x :p 123 .
?x :q ?v . }
A database join
Index join (= loop + substitution):
for each value of ?x, evaluate :x1 :q ?v
where :x1 is the current value of ?x
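The loop-with-substitution join can be sketched as follows (illustrative Python; the data and the brute-force pattern matcher stand in for TDB's NodeId-encoded index scans):

```python
# Loop-with-substitution (index) join sketch for
#   { ?x :p 123 . ?x :q ?v }
# Each binding from the first pattern is substituted into the second,
# turning it into a concrete lookup (e.g. :x1 :q ?v).
data = [("x1", "p", 123), ("x2", "p", 123), ("x1", "q", "a"), ("x2", "q", "b")]

def match_one(pattern, triple, binding):
    """Match one triple against a pattern ('?'-prefixed slots are
    variables), extending `binding`; return None on mismatch."""
    b = dict(binding)
    for p, t in zip(pattern, triple):
        if isinstance(p, str) and p.startswith("?"):
            if p in b and b[p] != t:
                return None
            b[p] = t
        elif p != t:
            return None
    return b

def index_join(pat1, pat2):
    out = []
    for triple in data:                       # scan for pattern 1
        b1 = match_one(pat1, triple, {})
        if b1 is None:
            continue
        for triple2 in data:                  # probe with pattern 2,
            b2 = match_one(pat2, triple2, b1) # now concrete under b1
            if b2 is not None:
                out.append(b2)
    return out

results = index_join(("?x", "p", 123), ("?x", "q", "?v"))
print(results)  # [{'?x': 'x1', '?v': 'a'}, {'?x': 'x2', '?v': 'b'}]
```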
12. Choices
➢ The TDB stack, top to bottom:
○ Query and Update
○ Indexes (B+Trees) / Node table (Objects)
○ Blocks / Key → Value store
➢ Where to introduce distribution?
13. This Does Not Work (very well)
➢ Impedance mismatch
○ Distribute the storage as a K→V store; index access stays on the query processor
○ Too much data moving about
○ Little parallelism
○ Bad cold-start
14. Distribute
➢ Distribute the indexes
○ With modified index access
➢ Distribute the nodes
➢ Comms: Apache Thrift
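A hypothetical Thrift IDL fragment for a remote node-table service (the namespace, struct, and method names here are illustrative only, not Lizard's actual interface):

```thrift
// Illustrative sketch -- not the real Lizard IDL.
namespace java lizard.example

struct NodeId {
  1: required i64 id
}

service NodeTable {
  // Allocate (or find) the NodeId for an RDF term's lexical form.
  NodeId allocate(1: string lex)
  // Return the stored lexical form for a NodeId.
  string lookup(1: NodeId node)
}
```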
15. Clustered Node Table
➢ Node Table
○ N replicas; Read R / Write W
e.g. W=N and R=1 => a complete copy of the node table on each data server
○ Replaceable
Requirement: NodeId for naming
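The N-replica, R-read/W-write scheme can be sketched as follows (illustrative Python; the class and method names are hypothetical, not Lizard's API). The quorum condition R + W > N guarantees that any R replicas overlap with the W that were written:

```python
import random

class ReplicatedNodeTable:
    """Toy quorum sketch: N replicas, writes go to W, reads consult R.
    With W=N and R=1, every replica holds a complete copy of the node
    table, so any single data server can answer a read."""

    def __init__(self, n=3, w=3, r=1):
        assert r + w > n, "quorum overlap requires R + W > N"
        self.replicas = [{} for _ in range(n)]
        self.w, self.r = w, r

    def write(self, node_id, value):
        # Send the entry to W replicas (here simply the first W).
        for rep in self.replicas[:self.w]:
            rep[node_id] = value

    def read(self, node_id):
        # Consult any R replicas; R + W > N means at least one was written.
        for rep in random.sample(self.replicas, self.r):
            if node_id in rep:
                return rep[node_id]
        return None

nt = ReplicatedNodeTable(n=3, w=3, r=1)  # W=N, R=1: full copy everywhere
nt.write(42, "http://example/x")
print(nt.read(42))  # http://example/x
```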