SlideShare a Scribd company logo
1 of 60
Wrangling Data with
OpenRefine
Tony Hirst
Computing and Communications
The Open University
@psychemedia
“It’s … a great joy to learn a technique,
because as soon as you learn it, you
start thinking in it. When I learn a new
technique my imaginative possibilities
have expanded.”
Grayson Perry,
Playing to the Gallery
Real data is
often dirty
and messy
Loading
Checking
Exploring
Cleaning
Reshaping
Annotating
Saving
This is a hands-on
workshop, so fire
up your laptops…
Loading – import source
Loading - filetype
Loading - encoding
Exploring – text facets
Exploring – text facets
Exploring – text facets
Saving - customisation
Cleaning – text facets
Checking – no blanks
Cleaning – tidying columns
Cleaning – tidying numbers
Cleaning – tidying numbers
value.replace(‘£’, ‘’).replace(‘,’ , ‘’)
Cleaning – tidying numbers
Cleaning – tidying numbers
Exploring – number facets
Exploring – filtering number ranges
Exploring – sorting columns
Cleaning – making dates
value.toDate( ‘d/M/y’ )
Cleaning – making dates
http://bit.ly/javadateformat Cleaning – making dates
Exploring – filtering date ranges
Cleaning - whitespace
Cleaning - whitespace
Cleaning - whitespace
Exploring – filter and facet
Cleaning – “ish-match”
Cleaning – cluster / make alike
Cleaning – good practice
We need a small dataset
for the next example…
Annotating – reconciliation
https://opencorporates.com/reconcile
Annotating – reconciliation
Annotating –
reconciliation
value.replace( / LTD.?/, ‘ LIMITED’)
Cleaning – normalisation
Annotating – reconciled data
cell.recon.match.id
Annotating – reconciled data
cell.recon.match.name
Annotating – reconciled data
Annotating – reconciled data
https://api.opencorporates.com/companies/gb/00102498
Annotating – URL based data
'https://api.opencorporates.com'+value+'?sparse=true'
Annotating – URL based data
Annotating – URL based data
JSON['results’]
JSON['results']['company’]
JSON['results']['company']['registered_address_in_full']
value.parseJson['results']['company']['registered_address_in_full']
Annotating – parsed JSON data
split(value, ‘,’)
Annotating – parsed JSON data
split(value, ‘,’)[-1]
split(value, ‘,’)[-1].strip()
Annotating – parsed JSON data
Saving – annotated data
Loading
Checking
Exploring
Cleaning
Reshaping
Annotating
Saving
Reuse – exporting your action list
Tutorials and walkthroughs
http://schoolofdata.org/handbook/
recipes/cleaning-data-with-refine/
http://blog.ouseful.info/categor
y/syndication/openrefine
Any questions: @psychemedia

More Related Content

Similar to A Quick Tour of OpenRefine

Mahout and Distributed Machine Learning 101
Mahout and Distributed Machine Learning 101Mahout and Distributed Machine Learning 101
Mahout and Distributed Machine Learning 101John Ternent
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Sparkelephantscale
 
A Gentle Introduction to Coding ... with Python
A Gentle Introduction to Coding ... with PythonA Gentle Introduction to Coding ... with Python
A Gentle Introduction to Coding ... with PythonTariq Rashid
 
Python001 training course_mumbai
Python001 training course_mumbaiPython001 training course_mumbai
Python001 training course_mumbaivibrantuser
 
Begin with Data Scientist
Begin with Data ScientistBegin with Data Scientist
Begin with Data ScientistNarong Intiruk
 
Comparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptxComparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptxPremaGanesh1
 
Which library should you choose for data-science? That's the question!
Which library should you choose for data-science? That's the question!Which library should you choose for data-science? That's the question!
Which library should you choose for data-science? That's the question!Anastasia Bobyreva
 
Machine Learning
Machine LearningMachine Learning
Machine Learningbutest
 
Machine Learning
Machine LearningMachine Learning
Machine Learningbutest
 
Data analytics beyond data processing and how it affects Industry 4.0
Data analytics beyond data processing and how it affects Industry 4.0Data analytics beyond data processing and how it affects Industry 4.0
Data analytics beyond data processing and how it affects Industry 4.0Mathieu d'Aquin
 
Machine Learning for (JVM) Developers
Machine Learning for (JVM) DevelopersMachine Learning for (JVM) Developers
Machine Learning for (JVM) DevelopersMateusz Dymczyk
 
CognitiveComputing_CirrusShakeri_final
CognitiveComputing_CirrusShakeri_finalCognitiveComputing_CirrusShakeri_final
CognitiveComputing_CirrusShakeri_finalCirrus Shakeri
 
ML-DecisionTrees.ppt
ML-DecisionTrees.pptML-DecisionTrees.ppt
ML-DecisionTrees.pptssuserec53e73
 
ML-DecisionTrees.ppt
ML-DecisionTrees.pptML-DecisionTrees.ppt
ML-DecisionTrees.pptssuserec53e73
 
ML-DecisionTrees.ppt
ML-DecisionTrees.pptML-DecisionTrees.ppt
ML-DecisionTrees.pptssuser71fb63
 

Similar to A Quick Tour of OpenRefine (20)

Mahout and Distributed Machine Learning 101
Mahout and Distributed Machine Learning 101Mahout and Distributed Machine Learning 101
Mahout and Distributed Machine Learning 101
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Spark
 
Lecture3.pptx
Lecture3.pptxLecture3.pptx
Lecture3.pptx
 
A Gentle Introduction to Coding ... with Python
A Gentle Introduction to Coding ... with PythonA Gentle Introduction to Coding ... with Python
A Gentle Introduction to Coding ... with Python
 
Python001 training course_mumbai
Python001 training course_mumbaiPython001 training course_mumbai
Python001 training course_mumbai
 
Begin with Data Scientist
Begin with Data ScientistBegin with Data Scientist
Begin with Data Scientist
 
Comparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptxComparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptx
 
Which library should you choose for data-science? That's the question!
Which library should you choose for data-science? That's the question!Which library should you choose for data-science? That's the question!
Which library should you choose for data-science? That's the question!
 
OpenML data@Sheffield
OpenML data@SheffieldOpenML data@Sheffield
OpenML data@Sheffield
 
OpenML NeurIPS2018
OpenML NeurIPS2018OpenML NeurIPS2018
OpenML NeurIPS2018
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Data analytics beyond data processing and how it affects Industry 4.0
Data analytics beyond data processing and how it affects Industry 4.0Data analytics beyond data processing and how it affects Industry 4.0
Data analytics beyond data processing and how it affects Industry 4.0
 
Machine Learning for (JVM) Developers
Machine Learning for (JVM) DevelopersMachine Learning for (JVM) Developers
Machine Learning for (JVM) Developers
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
CognitiveComputing_CirrusShakeri_final
CognitiveComputing_CirrusShakeri_finalCognitiveComputing_CirrusShakeri_final
CognitiveComputing_CirrusShakeri_final
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
ML-DecisionTrees.ppt
ML-DecisionTrees.pptML-DecisionTrees.ppt
ML-DecisionTrees.ppt
 
ML-DecisionTrees.ppt
ML-DecisionTrees.pptML-DecisionTrees.ppt
ML-DecisionTrees.ppt
 
ML-DecisionTrees.ppt
ML-DecisionTrees.pptML-DecisionTrees.ppt
ML-DecisionTrees.ppt
 

More from Tony Hirst

15 in 20 research fiesta
15 in 20 research fiesta15 in 20 research fiesta
15 in 20 research fiestaTony Hirst
 
Jupyternotebooks ou.pptx
Jupyternotebooks ou.pptxJupyternotebooks ou.pptx
Jupyternotebooks ou.pptxTony Hirst
 
Virtual computing.pptx
Virtual computing.pptxVirtual computing.pptx
Virtual computing.pptxTony Hirst
 
ouseful-parlihacks
ouseful-parlihacksouseful-parlihacks
ouseful-parlihacksTony Hirst
 
Gors appropriate
Gors appropriateGors appropriate
Gors appropriateTony Hirst
 
Gors appropriate
Gors appropriateGors appropriate
Gors appropriateTony Hirst
 
Robotlab jupyter
Robotlab   jupyterRobotlab   jupyter
Robotlab jupyterTony Hirst
 
Fco open data in half day th-v2
Fco open data in half day  th-v2Fco open data in half day  th-v2
Fco open data in half day th-v2Tony Hirst
 
Notes on the Future - ILI2015 Workshop
Notes on the Future - ILI2015 WorkshopNotes on the Future - ILI2015 Workshop
Notes on the Future - ILI2015 WorkshopTony Hirst
 
Community Journalism Conf - hyperlocal data wire
Community Journalism Conf - hyperlocal data wireCommunity Journalism Conf - hyperlocal data wire
Community Journalism Conf - hyperlocal data wireTony Hirst
 
Residential school 2015_robotics_interest
Residential school 2015_robotics_interestResidential school 2015_robotics_interest
Residential school 2015_robotics_interestTony Hirst
 
Data Mining - Separating Fact From Fiction - NetIKX
Data Mining - Separating Fact From Fiction - NetIKXData Mining - Separating Fact From Fiction - NetIKX
Data Mining - Separating Fact From Fiction - NetIKXTony Hirst
 
Conversations with data
Conversations with dataConversations with data
Conversations with dataTony Hirst
 
Data reuse OU workshop bingo
Data reuse OU workshop bingoData reuse OU workshop bingo
Data reuse OU workshop bingoTony Hirst
 
Inspiring content - You Don't Need Big Data to Tell Good Data Stories
Inspiring content - You Don't Need Big Data to Tell Good Data Stories Inspiring content - You Don't Need Big Data to Tell Good Data Stories
Inspiring content - You Don't Need Big Data to Tell Good Data Stories Tony Hirst
 
Lincoln jun14datajournalism
Lincoln jun14datajournalismLincoln jun14datajournalism
Lincoln jun14datajournalismTony Hirst
 
Lincoln Journalism Research Day - Data Journalism
Lincoln Journalism Research Day - Data JournalismLincoln Journalism Research Day - Data Journalism
Lincoln Journalism Research Day - Data JournalismTony Hirst
 

More from Tony Hirst (20)

15 in 20 research fiesta
15 in 20 research fiesta15 in 20 research fiesta
15 in 20 research fiesta
 
Dev8d jupyter
Dev8d jupyterDev8d jupyter
Dev8d jupyter
 
Ili 16 robot
Ili 16 robotIli 16 robot
Ili 16 robot
 
Jupyternotebooks ou.pptx
Jupyternotebooks ou.pptxJupyternotebooks ou.pptx
Jupyternotebooks ou.pptx
 
Virtual computing.pptx
Virtual computing.pptxVirtual computing.pptx
Virtual computing.pptx
 
ouseful-parlihacks
ouseful-parlihacksouseful-parlihacks
ouseful-parlihacks
 
Gors appropriate
Gors appropriateGors appropriate
Gors appropriate
 
Gors appropriate
Gors appropriateGors appropriate
Gors appropriate
 
Robotlab jupyter
Robotlab   jupyterRobotlab   jupyter
Robotlab jupyter
 
Fco open data in half day th-v2
Fco open data in half day  th-v2Fco open data in half day  th-v2
Fco open data in half day th-v2
 
Notes on the Future - ILI2015 Workshop
Notes on the Future - ILI2015 WorkshopNotes on the Future - ILI2015 Workshop
Notes on the Future - ILI2015 Workshop
 
Community Journalism Conf - hyperlocal data wire
Community Journalism Conf - hyperlocal data wireCommunity Journalism Conf - hyperlocal data wire
Community Journalism Conf - hyperlocal data wire
 
Residential school 2015_robotics_interest
Residential school 2015_robotics_interestResidential school 2015_robotics_interest
Residential school 2015_robotics_interest
 
Data Mining - Separating Fact From Fiction - NetIKX
Data Mining - Separating Fact From Fiction - NetIKXData Mining - Separating Fact From Fiction - NetIKX
Data Mining - Separating Fact From Fiction - NetIKX
 
Week4
Week4Week4
Week4
 
Conversations with data
Conversations with dataConversations with data
Conversations with data
 
Data reuse OU workshop bingo
Data reuse OU workshop bingoData reuse OU workshop bingo
Data reuse OU workshop bingo
 
Inspiring content - You Don't Need Big Data to Tell Good Data Stories
Inspiring content - You Don't Need Big Data to Tell Good Data Stories Inspiring content - You Don't Need Big Data to Tell Good Data Stories
Inspiring content - You Don't Need Big Data to Tell Good Data Stories
 
Lincoln jun14datajournalism
Lincoln jun14datajournalismLincoln jun14datajournalism
Lincoln jun14datajournalism
 
Lincoln Journalism Research Day - Data Journalism
Lincoln Journalism Research Day - Data JournalismLincoln Journalism Research Day - Data Journalism
Lincoln Journalism Research Day - Data Journalism
 

Recently uploaded

Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 

Recently uploaded (20)

Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 

A Quick Tour of OpenRefine

Editor's Notes

  1. "Wrangling Data With OpenRefine" We've all found datasets that hold the promise or riches inside them, but the data itself is in a terrible state. Commas and percentage signs that mess up number formats, dates that we can see are dates but the computer can't, lists of items in single cells, data spanning multiple rows where we want pivot one column's values across multiple columns, and so on. OpenRefine is a powerful browser based tool that helps you work with tabular data and tidy it up as you go along. As well as editing functions (and yes, you will get to see some regular expressions string matching action!) we can also cluster similar words and phrases, spotting that "J. Smith & Sons Ltd" is probably the same as "J Smith and Sons Limited" and being able to group them as such. Once you've seen it in action you'll wonder why working with data ever needed to be so difficult before. 45 mins x 3
  2. Dirty in the sense that the data may contain errors; messy in that it may be in the wrong shape, form or layout. (Not messy in statistical sense – statisticians are well know for the way they appropriate words and make them mean something no-one else means. In stats, messy data is often regarded as data that is not normally distributed). OpenRefine as a tool helps us work with data that can be represented in tabular form, and generally just tidy it up. Open Refine can help us with inconsistently formatted data, or data that is incorrectly represented. Open Refine can help us get an overview of a dataset, as well as splitting it up in a variety of ways. And Open Refine can help us annotate our data with data from an external source.
  3. https://dl.dropboxusercontent.com/u/1156404/466161Payments_over_%C2%A3500_November_2014.csv - http://bit.ly/brumspend
  4. http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html