Towards a Vocabulary for Data Quality Management (DQM) in Semantic Web Architectures
(Research in Progress)

Christian Fürber and Martin Hepp
christian@fuerber.com, mhepp@computer.org

Presentation @ 1st International Workshop on Linked Web Data Management,
March 25th, 2011, Uppsala, Sweden

Part 1: What's the Problem?



C. Fürber, M. Hepp:                         2
Towards a Vocabulary for DQM
In SemWeb Architectures
Various Data Quality Problems
• Inconsistent duplicates
• Approximate duplicates
• Invalid characters
• Invalid substrings
• Character alignment violation
• Word transpositions
• Mistyping / Misspelling errors
• Missing classification
• Incorrect classification
• Incorrect reference
• Referential integrity violation
• Cardinality violation
• Unique value violation
• Functional Dependency Violation
• Missing values
• Misfielded values
• False values
• Out of range values
• Imprecise values
• Meaningless values
• Outdated values
• Untyped literals
• Existence of Homonyms
• Existence of Synonyms
• Contradictory relationships
• Outdated conceptual elements

Reference: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/


The Problem
• Negative population
• Weird population values
• Invalid URLs

Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql



Part 2: What are high quality data?



What is Data Quality?
• Data's "fitness for use by data consumers" (Wang, Strong 1996)
• "Conformance to specification" (Kahn et al. 2002)
• "Data are of high quality if they are fit for their intended
  uses in operations, decision making, and planning. Data
  are fit for use if they are free of defects and possess
  desired features." (Redman 2001)
• Requirements as "Benchmark"
Perspective-Neutral Data Quality


Data quality is the degree to which data fulfills quality requirements,
no matter who makes the quality requirements.



Quality Requirements

The Problem, restated as requirements and their violations:
• Requirement: Population cannot be negative. Problem: Negative population.
• Requirement: Population is indicated by numeric values. Problem: Weird population values.
• Requirement: URLs usually start with http://, https://, etc. Problem: Invalid URLs.

Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql
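
The three requirements on this slide can be read as simple executable checks. The following is a minimal sketch in plain Python, not the authors' implementation: the function names are hypothetical, and the set of accepted URL schemes is an assumption, since the slide only says "http://, https://, etc."

```python
from urllib.parse import urlparse

# Assumption: only these schemes count as valid; the slide leaves "etc." open.
ALLOWED_SCHEMES = {"http", "https"}


def check_population(value):
    """Return the requirements a raw population value violates (empty list if none)."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        # Covers "weird" values such as free text or missing entries.
        return ["population is not numeric"]
    if number < 0:
        return ["population is negative"]
    return []


def check_url(value):
    """Return the requirements a raw URL value violates (empty list if none)."""
    scheme = urlparse(str(value)).scheme
    if scheme in ALLOWED_SCHEMES:
        return []
    return ["URL does not start with an allowed scheme"]


if __name__ == "__main__":
    for raw in ["1200", "-5", "abc"]:
        print(raw, check_population(raw))
    for raw in ["http://example.org", "www.example.org"]:
        print(raw, check_url(raw))
```

Returning a list of violated requirements, rather than a boolean, keeps the reason for each failure visible, which matches the idea of making requirements explicit.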



Satisfying Quality Requirements

[Figure: Status Quo = Desired State. Desired states are defined by Individuals, Groups, Standards, etc.]

• Problem 1: Expressing Quality Requirements
• Problem 2: Harmonizing Requirements
• Problem 3: Satisfying Requirements
Part 3: Research Goal



Major Research Goal
• Represent quality-relevant information for automated…
  – Data Quality Monitoring
  – Data Quality Assessment
  – Data Cleansing
  – Filtering of High Quality Data
  …in a standardized vocabulary.


Motives for DQM-Vocabulary
• Help people explicitly express data quality
  requirements in the "same language" at web scale
• Support the creation of consensual agreements
  on quality requirements
• Reduce effort for DQM activities
• Increase transparency about assumed quality
  requirements
• Enable consistency checks among quality
  requirements
Part 4: Our Approach



Basic Architecture

[Figure: layered architecture, top to bottom]
• Results: Problem Classification, Assessment Scores, HQ Data Retrieval, Cleansed Data
• Processing: SPARQL-Query-Engine, DQM-Vocabulary
• Storage: Knowledgebase
• Data Acquisition: RDB A, RDB B

Main Concepts of DQM-Vocabulary
• Classify Quality Problems
• Express Requirements
• Annotate Quality Scores
• Express Cleansing Tasks
• Account for Task-Dependent Requirements
Data Quality Problem Types: Source for Potential Requirements
• Inconsistent duplicates
• Approximate duplicates
• Invalid characters
• Invalid substrings
• Character alignment violation
• Word transpositions
• Mistyping / Misspelling errors
• Missing classification
• Incorrect classification
• Incorrect reference
• Referential integrity violation
• Cardinality violation
• Unique value violation
• Functional Dependency Violation
• Missing values
• Misfielded values
• False values
• Out of range values
• Imprecise values
• Meaningless values
• Outdated values
• Existence of Homonyms
• Existence of Synonyms
• Contradictory relationships
• Outdated conceptual elements
Data Quality Requirements
• Syntactical Rules
• Semantic Rules
• Redundancy Rules
• Completeness Rules
• Timeliness Rules




Quality-Influencing Artifacts


Current Focus of DQM-Vocabulary: Data




Design Alternatives: Statements about Classes & Properties

(1) Using classes and properties as subjects

(2) Using datatype properties with xsd:anyURI

(3) Mapping class and property URIs to new URIs


Part 5: Application Examples



Example 1: Legal Value Rule (1/3)


Which instances have illegal values for the property foo:country?




Example 1: Legal Value Rule (2/3)

[Figure: foo:LegalValueRule_1, an instance of the class dqm:LegalValueRule, linked to four literal values: "tref:Countries", "tref:countryName", "foo:Countries", "foo:countryName". Legend: Class, Instance, Literal value.]
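
The idea behind the legal value rule can be sketched without a triple store. The sketch below assumes, based only on the names in the figure, that the rule compares values of foo:countryName on instances of foo:Countries against the values of tref:countryName on instances of the trusted reference class tref:Countries. The dictionary keys, the sample triples, and the function names are hypothetical and are not part of the DQM-Vocabulary.

```python
# Triples as (subject, predicate, object) tuples. All sample data is made up.
TRIPLES = [
    ("tref:c1", "rdf:type", "tref:Countries"),
    ("tref:c1", "tref:countryName", "Sweden"),
    ("tref:c2", "rdf:type", "tref:Countries"),
    ("tref:c2", "tref:countryName", "Germany"),
    ("foo:a", "rdf:type", "foo:Countries"),
    ("foo:a", "foo:countryName", "Sweden"),
    ("foo:b", "rdf:type", "foo:Countries"),
    ("foo:b", "foo:countryName", "Swedn"),
]

# Hypothetical in-memory form of foo:LegalValueRule_1; the key names are assumptions.
RULE = {
    "tested_class": "foo:Countries",
    "tested_property": "foo:countryName",
    "reference_class": "tref:Countries",
    "reference_property": "tref:countryName",
}


def values_of(triples, cls, prop):
    """Return (instance, value) pairs of `prop` for all instances of `cls`."""
    members = {s for s, p, o in triples if p == "rdf:type" and o == cls}
    return [(s, o) for s, p, o in triples if p == prop and s in members]


def illegal_values(triples, rule):
    """Return (instance, value) pairs whose value is not in the trusted reference."""
    legal = {
        value
        for _, value in values_of(
            triples, rule["reference_class"], rule["reference_property"]
        )
    }
    tested = values_of(triples, rule["tested_class"], rule["tested_property"])
    return [(s, value) for s, value in tested if value not in legal]


if __name__ == "__main__":
    print(illegal_values(TRIPLES, RULE))
```

Because the rule is data rather than code, the same `illegal_values` function can evaluate any number of rules of this shape, which is the point of expressing requirements in a shared vocabulary.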



Example 1: Legal Value Rule (3/3)




Example 2: DQ-Assessment (1/2)


How syntactically accurate are all properties that are subject to LegalValueRules?
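
One plausible way to turn this question into a number is to report, per property, the share of values that appear in the legal value set. The slide does not define the metric, so the formula below is an assumption, and the property name and sample values are made up for illustration.

```python
def syntactic_accuracy(values, legal_values):
    """Share of values found in the legal set; None when there is nothing to assess."""
    if not values:
        return None
    legal = set(legal_values)
    return sum(1 for value in values if value in legal) / len(values)


# Hypothetical observed values for each property that is subject to a legal value rule.
OBSERVED = {"foo:countryName": ["Sweden", "Swedn", "Germany", "Germany"]}
LEGAL = {"foo:countryName": ["Sweden", "Germany"]}

SCORES = {
    prop: syntactic_accuracy(values, LEGAL[prop])
    for prop, values in OBSERVED.items()
}

if __name__ == "__main__":
    for prop, score in SCORES.items():
        print(prop, score)
```

Returning None for a property without values avoids reporting a misleading score of 0 or 1 when there is no data to judge.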




Example 2: DQ-Assessment (2/2)




Part 6: Conclusions & Planned Work


Advantages of DQM-Vocabulary

• Minimizes human effort for DQM
• Web-Scale sharing/reuse of data quality
  requirements
• Consistency checks among data quality
  requirements
• Transparency about applied data quality
  rules
Limitations
• Representation of complex functional
  dependency rules and derivation rules
• Limited experience with real-world data sets
• Currently no dedicated concepts for classes and
  properties
• Research still in progress


Future Work
• Evaluation of design alternatives
• Development of processing framework
• Representation of more complex
  functional dependency rules / derivation
  rules
• Extension of DQM-Vocabulary
• Evaluation on real-world data sets
• Publication at http://semwebquality.org
Christian Fürber
   Researcher
   E-Business & Web Science Research Group

                 Werner-Heisenberg-Weg 39
                 85577 Neubiberg
                 Germany

                 skype            c.fuerber
                 email            christian@fuerber.com
                 web              http://www.unibw.de/ebusiness
                 homepage         http://www.fuerber.com
                 twitter          http://www.twitter.com/cfuerber




Paper available at http://bit.ly/gYEDdQ

Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

  • 1. Towards a Vocabulary for DQM in Semantic Web Architectures (Research in Progress) Christian Fürber and Martin Hepp christian@fuerber.com, mhepp@computer.org Presentation @ 1st International Workshop on Linked Web Data Management, March 25th, 2011, Uppsala, Sweden
  • 2. Part 1: What‘s the Problem? C. Fürber, M. Hepp: 2 Towards a Vocabulary for DQM In SemWeb Architectures
  • 3. Various Data Quality Problems Inconsistent duplicates Invalid characters Missing classification Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/ Incorrect reference Approximate duplicates Reference: Linking Open Data cloud diagram, by Character alignment violation Word transpositions Invalid substrings Mistyping / Misspelling errors Cardinality violation Missing values Referential integrity violation Misfielded values Unique value violation False values Functional Dependency Out of range values Violation Imprecise values Existence of Homonyms Meaningless values Incorrect classification Existence of Synonyms Contradictory relationships Outdated conceptual elements Untyped literals Outdated values C. Fürber, M. Hepp: 3 Towards a Vocabulary for DQM in SemWeb Architectures
  • 4. The Problem: Negative Population; Weird Population Values; Invalid URLs. Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql
  • 5. Part 2: What are high quality data?
  • 6. What is Data Quality? • Data's "fitness for use by data consumers" (Wang, Strong 1996) • "Conformance to specification" (Kahn et al. 2002) • "Data are of high quality if they are fit for their intended uses in operations, decision making, and planning. Data are fit for use if they are free of defects and possess desired features." (Redman 2001) • Requirements as "Benchmark"
  • 7. Perspective-Neutral Data Quality: Data quality is the degree to which data fulfills quality requirements, no matter who makes the quality requirements.
  • 8. Quality Requirements and The Problem: Population cannot be negative (problem: Negative Population); Population is indicated by numeric values (problem: Weird Population Values); URLs usually start with http://, https://, etc. (problem: Invalid URLs). Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql
  • 9. Satisfying Quality Requirements. Problem 1: Expressing Quality Requirements (the desired state of individuals, groups, standards, etc.). Problem 2: Harmonizing Requirements. Problem 3: Satisfying Requirements (Status Quo = Desired State).
  • 10. Part 3: Research Goal
  • 11. Major Research Goal: Represent quality-relevant information in a standardized vocabulary for automated data quality monitoring, data quality assessment, data cleansing, and filtering of high quality data.
  • 12. Motives for DQM-Vocabulary • Help people explicitly express data quality requirements in the "same language" at Web scale • Support the creation of consensual agreements on quality requirements • Reduce effort for DQM activities • Raise transparency about assumed quality requirements • Enable consistency checks among quality requirements
  • 13. Part 4: Our Approach
  • 14. Basic Architecture: Data Acquisition (RDB A, RDB B); Knowledgebase; DQM-Vocabulary; SPARQL-Query-Engine; Problem Classification; Assessment Scores; HQ Data Retrieval; Cleansed Data
  • 15. Main Concepts of DQM-Vocabulary: Classify Quality Problems; Express Requirements; Annotate Quality Scores; Express Cleansing Tasks; Account for Task-Dependent Requirements
  • 16. Data Quality Problem Types: Source for Potential Requirements. Inconsistent duplicates; Invalid characters; Missing classification; Incorrect reference; Character alignment violation; Approximate duplicates; Word transpositions; Invalid substrings; Mistyping / Misspelling errors; Cardinality violation; Missing values; Referential integrity violation; Misfielded values; Unique value violation; False values; Functional dependency violation; Out of range values; Imprecise values; Existence of homonyms; Meaningless values; Incorrect classification; Existence of synonyms; Contradictory relationships; Outdated conceptual elements; Outdated values
  • 17. Data Quality Requirements: Syntactical Rules; Semantic Rules; Redundancy Rules; Completeness Rules; Timeliness Rules
  • 18. Quality-Influencing Artifacts. Current Focus of DQM-Vocabulary: Data
  • 19. Design Alternatives: Statements about Classes & Properties. (1) Using classes and properties as subjects (2) Using datatype properties with xsd:anyURI (3) Mapping class and property URIs to new URIs
  • 20. Part 5: Application Examples
  • 21. Example 1: Legal Value Rule (1/3): Which instances have illegal values for property foo:country?
  • 22. Example 1: Legal Value Rule (2/3). Diagram legend: Class, Instance, Literal value. Class: dqm:LegalValueRule. Instance: foo:LegalValueRule_1. Literal values: "tref:Countries", "foo:Countries", "tref:countryName", "foo:countryName"
  • 23. Example 1: Legal Value Rule (3/3)
  • 24. Example 2: DQ-Assessment (1/2): How syntactically accurate are all properties that are subject to LegalValueRules?
  • 25. Example 2: DQ-Assessment (2/2)
  • 26. Part 6: Conclusions & Planned Work
  • 27. Advantages of DQM-Vocabulary • Minimizes human effort for DQM • Web-scale sharing/reuse of data quality requirements • Consistency checks among data quality requirements • Transparency about applied data quality rules
  • 28. Limitations • Representation of complex functional dependency rules and derivation rules • Limited experience with real-world data sets • Currently no concepts of its own for classes and properties • Research still in progress
  • 29. Future Work • Evaluation of design alternatives • Development of processing framework • Representation of more complex functional dependency rules / derivation rules • Extension of DQM-Vocabulary • Evaluation on real-world data sets • Publication at http://semwebquality.org
  • 30. Christian Fürber, Researcher, E-Business & Web Science Research Group, Werner-Heisenberg-Weg 39, 85577 Neubiberg, Germany. Skype: c.fuerber; email: christian@fuerber.com; web: http://www.unibw.de/ebusiness; homepage: http://www.fuerber.com; Twitter: http://www.twitter.com/cfuerber. Paper available at http://bit.ly/gYEDdQ
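
The queries shown on slides 23 and 25 are not captured in this transcript. As a rough illustration of the idea behind Example 1 (legal value rule), Example 2 (DQ-assessment) and the quality requirements on slide 8, here is a minimal Python sketch. It is not the authors' implementation: the presentation uses a SPARQL query engine over a knowledgebase, whereas this sketch uses plain in-memory dictionaries. The class `LegalValueRule`, all function names, the sample records, and the definition of syntactic accuracy as "legal values divided by tested values" are assumptions made for illustration only.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LegalValueRule:
    """Hypothetical in-memory stand-in for an instance of dqm:LegalValueRule."""
    tested_property: str     # property under test, e.g. "foo:countryName"
    legal_values: frozenset  # values taken from a trusted reference list


def illegal_instances(rule, records):
    """Return ids of records whose tested value is present but not legal."""
    return [
        record_id
        for record_id, properties in records.items()
        if rule.tested_property in properties
        and properties[rule.tested_property] not in rule.legal_values
    ]


def syntactic_accuracy(rule, records):
    """Assumed metric: share of tested values that are legal.

    Returns None when no record carries the tested property.
    """
    values = [
        properties[rule.tested_property]
        for properties in records.values()
        if rule.tested_property in properties
    ]
    if not values:
        return None
    return sum(value in rule.legal_values for value in values) / len(values)


def population_problems(value):
    """Check the two population requirements from slide 8 on a raw string."""
    if not re.fullmatch(r"[+-]?\d+", value.strip()):
        return ["non-numeric population"]
    return ["negative population"] if int(value) < 0 else []


def has_valid_url_scheme(url):
    """Slide 8 requirement: URLs usually start with http:// or https://."""
    return url.startswith(("http://", "https://"))


if __name__ == "__main__":
    # Invented sample data; identifiers only mimic the foo: prefix on the slides.
    rule = LegalValueRule("foo:countryName", frozenset({"Germany", "Sweden"}))
    records = {
        "foo:City_1": {"foo:countryName": "Germany"},
        "foo:City_2": {"foo:countryName": "Germnay"},  # misspelled on purpose
        "foo:City_3": {"foo:countryName": "Sweden"},
        "foo:City_4": {},  # tested property missing, ignored by both checks
    }
    print(illegal_instances(rule, records))
    print(syntactic_accuracy(rule, records))
    print(population_problems("-42"), has_valid_url_scheme("www.example.org"))
```

Keeping the rule as data (a property name plus a set of legal values) rather than as hard-coded logic mirrors the motivation on slide 12: the same requirement can then be shared, compared, and reused by different checks.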