Towards a Vocabulary for Data Quality Management in Semantic Web Architectures
1. Towards a Vocabulary for DQM in Semantic Web Architectures
(Research in Progress)
Christian Fürber and Martin Hepp
christian@fuerber.com, mhepp@computer.org
Presentation @ 1st International Workshop on Linked Web Data Management,
March 25th, 2011, Uppsala, Sweden
2. Part 1: What's the Problem?
C. Fürber, M. Hepp: 2
Towards a Vocabulary for DQM
In SemWeb Architectures
3. Various Data Quality Problems
• Inconsistent duplicates
• Invalid characters
• Missing classification
• Incorrect reference
• Approximate duplicates
• Character alignment violation
• Word transpositions
• Invalid substrings
• Mistyping / Misspelling errors
• Cardinality violation
• Missing values
• Referential integrity violation
• Misfielded values
• Unique value violation
• False values
• Functional dependency violation
• Out of range values
• Imprecise values
• Existence of homonyms
• Meaningless values
• Incorrect classification
• Existence of synonyms
• Contradictory relationships
• Outdated conceptual elements
• Untyped literals
• Outdated values
Reference: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
4. The Problem
• Negative population
• Weird population values
• Invalid URLs
Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql
5. Part 2: What are high quality data?
6. What is Data Quality?
• Data's "fitness for use by data consumers" (Wang, Strong 1996)
• "Conformance to specification" (Kahn et al. 2002)
• "Data are of high quality if they are fit for their intended
uses in operations, decision making, and planning. Data
are fit for use if they are free of defects and possess
desired features." (Redman 2001)
• Requirements as "Benchmark"
7. Perspective-Neutral Data Quality
Data quality is the degree to which
data fulfill quality requirements
... no matter who defines the quality requirements.
8. Quality Requirements vs. The Problem
• Requirement: Population cannot be negative. Problem: negative population
• Requirement: Population is indicated by numeric values. Problem: weird population values
• Requirement: URLs usually start with http://, https://, etc. Problem: invalid URLs
Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql
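
The three requirements on this slide (population is not negative, population is numeric, URLs start with a scheme such as http:// or https://) can be sketched as simple checks. This is only an illustration under stated assumptions: the record layout, the `check_record` helper, the sample records, and the exact set of allowed schemes are hypothetical and are not part of the DQM-Vocabulary.

```python
from urllib.parse import urlparse

# Assumption: the slide only says "http://, https://, etc.", so this set is illustrative.
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def check_record(record):
    """Return the list of quality requirements that one record violates."""
    problems = []

    # Requirement: population is indicated by numeric values and cannot be negative.
    try:
        value = float(record.get("population"))
    except (TypeError, ValueError):
        problems.append("population is not numeric")
    else:
        if value < 0:
            problems.append("population is negative")

    # Requirement: URLs start with an allowed scheme.
    scheme = urlparse(str(record.get("url", ""))).scheme
    if scheme not in ALLOWED_SCHEMES:
        problems.append("URL does not start with an allowed scheme")

    return problems


# Hypothetical sample records, one per problem type plus one clean record.
records = [
    {"name": "A", "population": "-500", "url": "http://example.org/a"},
    {"name": "B", "population": "approx. 3 million", "url": "https://example.org/b"},
    {"name": "C", "population": "12000", "url": "www.example.org/c"},
    {"name": "D", "population": "82000", "url": "http://example.org/d"},
]

report = {r["name"]: check_record(r) for r in records}
for name, problems in report.items():
    print(name, problems)
```

Hard-coding the checks like this is exactly what a shared vocabulary is meant to avoid; the sketch only makes the requirements concrete.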
9. Satisfying Quality Requirements
• Problem 1: Expressing Quality Requirements
• Problem 2: Harmonizing Requirements
• Problem 3: Satisfying Requirements
Diagram: individuals, groups, standards, etc. each define a desired state; goal: status quo = desired state.
10. Part 3: Research Goal
11. Major Research Goal
• Represent Quality-Relevant information for
automated…
– Data Quality Monitoring
– Data Quality Assessment
– Data Cleansing
– Filtering of High Quality Data
…in a standardized vocabulary.
12. Motives for DQM-Vocabulary
• Help people explicitly express data quality
requirements in the "same language" at Web scale
• Support the creation of consensual agreements
on quality requirements
• Reduce effort for DQM activities
• Increase transparency about assumed quality
requirements
• Enable consistency checks among quality
requirements
13. Part 4: Our Approach
14. Basic Architecture
Architecture diagram, top to bottom:
• Problem Classification, Assessment Scores, HQ Data Retrieval, Cleansed Data
• SPARQL-Query-Engine
• DQM-Vocabulary, Knowledgebase
• Data Acquisition from RDB A, RDB B
15. Main Concepts of DQM-Vocabulary
• Classify Quality Problems
• Express Requirements
• Annotate Quality Scores
• Express Cleansing Tasks
• Account for Task-Dependent Requirements
16. Data Quality Problem Types: Source for Potential Requirements
• Inconsistent duplicates
• Invalid characters
• Missing classification
• Incorrect reference
• Character alignment violation
• Approximate duplicates
• Word transpositions
• Invalid substrings
• Mistyping / Misspelling errors
• Cardinality violation
• Missing values
• Referential integrity violation
• Misfielded values
• Unique value violation
• False values
• Functional dependency violation
• Out of range values
• Imprecise values
• Existence of homonyms
• Meaningless values
• Incorrect classification
• Existence of synonyms
• Contradictory relationships
• Outdated conceptual elements
• Outdated values
17. Data Quality Requirements
Syntactical Rules
Semantic Rules
Redundancy Rules
Completeness Rules
Timeliness Rules
18. Quality-Influencing Artifacts
Current focus of DQM-Vocabulary: Data
19. Design Alternatives: Statements about Classes & Properties
(1) Using classes and properties as subjects
(2) Using datatype properties with xsd:anyURI
(3) Mapping class and property URIs to new URIs
20. Part 5: Application Examples
21. Example 1: Legal Value Rule (1/3)
Which instances have illegal values
for property foo:country?
22. Example 1: Legal Value Rule (2/3)
• Class: dqm:LegalValueRule
• Instance: foo:LegalValueRule_1
• Literal values: "tref:Countries", "foo:Countries", "tref:countryName", "foo:countryName"
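
A legal value rule of this kind can be evaluated roughly as follows. This is a minimal sketch, not the authors' implementation: the slide only shows the four literal values, so the field names of `LegalValueRule`, the assignment of the `tref:` literals to the reference side and the `foo:` literals to the tested side, the `illegal_values` helper, and all sample data are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LegalValueRule:
    # Hypothetical field names; the slide gives only the literal values.
    tested_class: str
    tested_property: str
    reference_class: str
    reference_property: str


def illegal_values(triples, types, rule):
    """Return (instance, value) pairs whose tested value is not a reference value."""
    # Collect the legal values from instances of the reference class.
    legal = {
        o for s, p, o in triples
        if p == rule.reference_property and types.get(s) == rule.reference_class
    }
    # Report tested values that do not occur among the legal values.
    return [
        (s, o) for s, p, o in triples
        if p == rule.tested_property
        and types.get(s) == rule.tested_class
        and o not in legal
    ]


# Hypothetical data: instance -> class, and (subject, property, value) triples.
types = {
    "tref:c1": "tref:Countries",
    "tref:c2": "tref:Countries",
    "foo:x1": "foo:Countries",
    "foo:x2": "foo:Countries",
}
triples = [
    ("tref:c1", "tref:countryName", "Germany"),
    ("tref:c2", "tref:countryName", "Sweden"),
    ("foo:x1", "foo:countryName", "Germany"),
    ("foo:x2", "foo:countryName", "Sweeden"),
]

rule = LegalValueRule(
    tested_class="foo:Countries",
    tested_property="foo:countryName",
    reference_class="tref:Countries",
    reference_property="tref:countryName",
)
violations = illegal_values(triples, types, rule)
print(violations)
```

Because the rule is plain data, the same checking function can serve any number of rules; in the architecture of this talk that role is played by a SPARQL query over the rule instances.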
23. Example 1: Legal Value Rule (3/3)
24. Example 2: DQ-Assessment (1/2)
How syntactically accurate are all
properties that are subject to
LegalValueRules?
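
One way to read this question is as a ratio per property. The sketch below assumes that syntactic accuracy means the share of a property's values that occur in its set of legal values; the slides do not state the formula, and the `syntactic_accuracy` helper, the property names, and the sample values are hypothetical.

```python
def syntactic_accuracy(values, legal_values):
    """Share of values found in the legal value set; None if there are no values."""
    if not values:
        return None
    legal = set(legal_values)
    conforming = sum(1 for v in values if v in legal)
    return conforming / len(values)


# Hypothetical input: property -> (observed values, legal values from its rule).
properties_under_rules = {
    "foo:countryName": (["Germany", "Sweeden", "Sweden", "Germny"], ["Germany", "Sweden"]),
    "foo:currency": (["EUR", "SEK"], ["EUR", "SEK", "USD"]),
}

scores = {
    prop: syntactic_accuracy(values, legal)
    for prop, (values, legal) in properties_under_rules.items()
}
for prop, score in scores.items():
    print(prop, score)
```

Returning None for a property without values keeps "no data" distinct from a score of zero.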
25. Example 2: DQ-Assessment (2/2)
26. Part 6: Conclusions & Planned Work
27. Advantages of DQM-Vocabulary
• Minimizes human effort for DQM
• Web-Scale sharing/reuse of data quality
requirements
• Consistency checks among data quality
requirements
• Transparency about applied data quality
rules
28. Limitations
• Representation of complex functional
dependency rules and derivation rules
• Limited experience with real-world data sets
• Currently no dedicated concepts for classes and
properties
• Research still in progress
29. Future Work
• Evaluation of design alternatives
• Development of processing framework
• Representation of more complex
functional dependency rules / derivation
rules
• Extension of DQM-Vocabulary
• Evaluation on real-world data sets
• Publication at http://semwebquality.org
30. Christian Fürber
Researcher
E-Business & Web Science Research Group
Werner-Heisenberg-Weg 39
85577 Neubiberg
Germany
skype c.fuerber
email christian@fuerber.com
web http://www.unibw.de/ebusiness
homepage http://www.fuerber.com
twitter http://www.twitter.com/cfuerber
Paper available at http://bit.ly/gYEDdQ