SWIFT WEB SERVICES
   Verifying and Filtering the Crowd




             An Ushahidi Initiative

        by Neville Newey and Jon Gosier
SWIFT IS THE FILTER
SWIFTRIVER IS FOR...
Improving information findability
Surfacing content you didn't know you were looking for
Understanding media from other parts of the world (translation)
Making urgent data more discoverable (structured, published and accessible)
Verifying eyewitness accounts
Using location as context
Expanding the grassroots reporting network
Preserving information (archiving)
SwiftRiver Web Services
•   SiLCC - NLP for SMS and Twitter
•   SULSa - Location Services
•   SiCDS - Duplication Filtering
•   River ID - Distributed Reputation
•   Reverberations - Measures influence of online content
RIVER ID
Distributed Trust and Reputation
REVERBERATIONS
Measuring Content Influence
SILCC
SwiftRiver Language Computation Core
WHAT IS SILCC?
•Swift Language Computation Component
•One of the SwiftRiver Web Services
•Open Web API
•Semantic Tagging of Short Text
•Multilingual
•Multiple sources (Twitter, email, SMS, blogs, etc.)
•Active Learning capability
•Open Source
•Easy to Deploy, Modify and Run
SWIFTRIVER SILCC DATAFLOW
1) SiSLS (SwiftRiver Source Library Service): Content Items coming from the SiSLS, where SiSLS integration is enabled, have global trust values added to the object model.
2) SiLCC (SwiftRiver Language Computational Core): the text of the content is sent to the SiLCC. An API key is sent along with the text to ensure that the SiLCC is not open to any malicious usage.
3) Using NLP, the SiLCC extracts nouns and other keywords from the text. There is still a bit of ambiguity around what the NLP should extract from the text, but at its most simple, all the nouns would be a good start.
4) The SiLCC sends back a list of tags that are added to the Content Item, along with any that were extracted from the source data by the parser.
5) SLISa (SwiftRiver Language Improvement Service): although the NLP tags have now been applied, the SLISa is now responsible for applying instance-specific tagging corrections.
OUR GOALS
•Simple Tagging of short snippets of text
•Rapid tagging for high volume environments
•Simple API, easy to use
•Learns from user feedback
•Routing of messages to upstream services
•Semantic Classification
•Sorts rapid streams into buckets
•Clusters like messages
•Visual effects
•Cross-referencing
WHAT IT’S NOT
•Does not do deep analysis of text
•Only identifies words within original text
HOW DOES IT WORK?
•Step 1: Lexical Analysis
•Step 2: Parsing into constituent parts
•Step 3: Part of Speech tagging
•Step 4: Feature extraction
•Step 5: Compute using feature weights
•Let's examine each one in turn...
STEP 1: LEXICAL ANALYSIS
•For news headlines and email subjects this is trivial: just split on spaces (see the sketch below)

•For Twitter this is more complex...
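
As a rough illustration of the simple case, here is a minimal whitespace tokenizer for headlines and email subjects (the actual BasicTokenizer may behave differently):

def tokenize_basic(text):
    # Minimal sketch: split on whitespace and strip surrounding punctuation.
    return [token.strip('.,;:!?"\'()') for token in text.split()]

print(tokenize_basic("Aftershock hits Port-au-Prince, hospitals overwhelmed"))
# -> ['Aftershock', 'hits', 'Port-au-Prince', 'hospitals', 'overwhelmed']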
TWEET ANALYSIS
•Tweets are surprisingly complex
•Only 140 characters but many features
•Emergent features from community (e.g. hashtags)
•Let's take a look at a typical tweet...
TWEET ANALYSIS
 The typical Tweet: “RT @directrelief: RT
 @PIH: PBS @NewsHour addresses mental health
 needs in the aftermath of the #Haiti earthquake
 #health #earthquake... http://bit.ly/bNhyK6”

•RT indicates a “re-tweet”
•@name indicates who the original tweeter was
•Multiple embedded retweets
•Hashtags (e.g. #Haiti) can play two roles, as a tag
 and as part of the sentence
TWEET ANALYSIS 2
•Two or more hashtags within a tweet (e.g.
 #health and #earthquake)
•Continuation dots “...” indicate that there was more text that didn't fit into the 140-character limit somewhere in its history
•URLs: many tweets contain one or more URLs
 As we can see, this simple tweet contains no fewer than seven different features, and that's not all!
TWEET ANALYSIS 3
We want to break up the tweet into the following
parts:
{
  'text': ['PBS addresses mental health needs in the aftermath of the Haiti earthquake'],
  'hashtags': ['#Haiti', '#health', '#earthquake'],
  'names': ['@directrelief', '@PIH', '@NewsHour'],
  'urls': ['http://bit.ly/bNhyK6'],
}
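
A rough sketch of how such a structure could be produced with regular expressions (illustrative only, not the actual TweetParser; note that it keeps every hashtag word in the sentence, whereas the real parser must also decide which hashtags belong to the sentence and which are pure tags):

import re

def parse_tweet(tweet):
    # Sketch: pull out URLs, @names and #hashtags, then keep the rest as plain text.
    urls = re.findall(r'https?://\S+', tweet)
    names = re.findall(r'@\w+', tweet)
    hashtags = re.findall(r'#\w+', tweet)
    text = tweet
    for token in urls + names + ['RT', '...']:
        text = text.replace(token, '')    # crude removal, good enough for a sketch
    text = text.replace('#', '')          # hashtag words stay in the sentence
    text = re.sub(r'[:\s]+', ' ', text).strip()
    return {'text': [text], 'hashtags': hashtags, 'names': names, 'urls': urls}

parse_tweet("RT @directrelief: RT @PIH: PBS @NewsHour addresses mental health "
            "needs in the aftermath of the #Haiti earthquake #health #earthquake... "
            "http://bit.ly/bNhyK6")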
TWEET ANALYSIS 4
 Why do we want to break up the tweet into parts
 (parsing)?

•Because we want to further process the grammatically correct English text
•Part of speech tagging would otherwise be
 corrupted by words it cannot recognize (e.g. urls,
 hashtags, @names etc.)
•We want to save the hashtags for later use
•Many of the features are irrelevant to the task of
 identifying tags (e.g. dots, punctuation, @name, RT)
TWEET ANALYSIS 5
•We now take the “text” portion of the tweet and
 perform part of speech tagging on it
•After part of speech tagging, we perform feature
 extraction
•Features are now passed through the keyword
 classifier which returns a list of keywords / tags
•Finally we combine these tags with the hashtags we
 saved earlier to give the complete tag set
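
As a rough illustration of these steps using NLTK (the library SiLCC uses for part-of-speech tagging), with nouns standing in for the keyword classifier; the real BayesTagger weighs the richer feature set described later:

import nltk  # needs the 'punkt' and 'averaged_perceptron_tagger' data packages

def tag_tweet_text(text, hashtags):
    # POS-tag the clean text, keep the nouns as candidate tags,
    # then merge in the hashtags saved by the parser.
    pos_tags = nltk.pos_tag(nltk.word_tokenize(text))   # e.g. ('Haiti', 'NNP')
    keywords = [word for word, pos in pos_tags if pos.startswith('NN')]
    return keywords + [h.lstrip('#') for h in hashtags]

tag_tweet_text("PBS addresses mental health needs in the aftermath of the Haiti earthquake",
               ['#Haiti', '#health', '#earthquake'])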
HEADLINE AND EMAIL
    SUBJECT ANALYSIS
•This is much simpler to do
•It's a subset of the steps in Tweet Analysis
•There is no parsing since there are no hashtags,
 @names etc.
FEATURE EXTRACTION
• For the active learning algorithm we need to extract features to use in classification
• These features should be subject/domain independent
• We therefore never use the actual words as features
• This would for example give artificially high weights to words such as “earthquake”
• We don't want these artificial weights as we can’t foresee future disasters and we
    want to be as generic with classification as possible
•   The use of training sets does allow for domain customization where necessary
FEATURE EXTRACTION
• Capitalization of individual words: Either first caps, or all caps, this is an
    important indicator of proper nouns or other important words that make good tag
    candidates
•   Position in text: Tags tend to occur more often near the beginning of the text
•   Part of Speech: Nouns and proper nouns are particularly important but so are
    some adjectives and adverbs
•   Capitalization of entire text: sometimes the whole text is capitalized and
    this should reduce overall weighting of other features
•   Length of the text: In shorter texts the words are more likely to be tags
•   The parts of speech of the previous and next words (effectively a trigram, or a window of three); a sketch of these features follows
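
A sketch of what per-word feature extraction along these lines could look like (feature names and exact definitions are illustrative, not the actual SiLCC feature set):

def word_features(pos_tagged, index, text):
    # pos_tagged is a list of (word, part_of_speech) pairs for one text.
    word, pos = pos_tagged[index]
    return {
        'capitalized': word[:1].isupper(),                 # first-letter caps
        'all_caps': word.isupper(),                        # whole word capitalized
        'position': index / max(len(pos_tagged) - 1, 1),   # relative position in the text
        'pos': pos,                                        # part of speech of the word
        'prev_pos': pos_tagged[index - 1][1] if index > 0 else 'START',
        'next_pos': pos_tagged[index + 1][1] if index + 1 < len(pos_tagged) else 'END',
        'text_all_caps': text.isupper(),                   # whole text capitalized
        'text_length': len(pos_tagged),                    # shorter texts favour tags
    }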
TRAINING
• Requires user reviewed examples
• Lexical analysis, parsing and feature extraction on the examples
• Multinomial naïve Bayes algorithm
• NB: The granularity we are classifying is at the word level
• For each word in the text, we classify it as either a keyword or not
• This has the pleasant side effect of providing several training examples from each user-reviewed text
• Even with fewer than 50 reviewed texts the results are comparable to the simple approach of using nouns only (see the training sketch below)
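
A minimal training sketch along these lines, using scikit-learn's multinomial naive Bayes purely for illustration (the actual SiLCC training code may be implemented differently):

from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

def train_keyword_classifier(reviewed_examples):
    # reviewed_examples: list of (pos_tagged_text, keyword_set) pairs,
    # so each reviewed text yields one training example per word.
    features, labels = [], []
    for pos_tagged, keywords in reviewed_examples:
        for i, (word, pos) in enumerate(pos_tagged):
            features.append({                 # a subset of the features listed above
                'capitalized': word[:1].isupper(),
                'pos': pos,
                'prev_pos': pos_tagged[i - 1][1] if i > 0 else 'START',
                'next_pos': pos_tagged[i + 1][1] if i + 1 < len(pos_tagged) else 'END',
                'position': i / max(len(pos_tagged) - 1, 1),
            })
            labels.append(word in keywords)   # keyword or not: word-level granularity
    vectorizer = DictVectorizer()
    model = MultinomialNB().fit(vectorizer.fit_transform(features), labels)
    return vectorizer, model

examples = [([('Earthquake', 'NN'), ('hits', 'VBZ'), ('Haiti', 'NNP')],
             {'Earthquake', 'Haiti'})]
vectorizer, model = train_keyword_classifier(examples)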
ACTIVE LEARNING
•The API also provides a method for users to send
 back corrected text
•The corrected text is saved and then used in the
 next iteration of training
•User may optionally specify a corpus for the
 example to go into
•Training can be performed using any combination of
 corpora
DEVELOPER FRIENDLY
•Two levels of API, the web API and the internal
 Python API
•Either one may be used but most users will use the
 web API
•Design is highly modular and maintainable
•For very rapid backend processing the native Python
 API can be used
PYTHON CLASSES
Most of the classes that make up the library are
divided into three types:

   1) Tokenizers
   2) Parsers
   3) Taggers

All three types have consistent APIs and are interchangeable.
PYTHON API
•A tagger calls a parser
•A parser calls a tokenizer
•Output of the tokenizer goes into the parser
•Output of the parser goes into the tagger
•Output of the tagger goes into the user!
CLASSES
• BasicTokenizer – This is used for splitting basic (non-tweet) text into individual
    words
•   TweetTokenizer – This is used to tokenize a tweet, it may also be used to
    tokenize plain text since plain text is a subset of tweets
•   TweetParser – Calls the TweetTokenizer and then parses the output (see previous example)
•   TweetTagger – Calls the TweetTokenizer and then tags the output of the text
    part and adds the hashtags
•   BasicTagger – Calls the BasicTokenizer and then tags the text, should only be
    used for non-tweet text, uses simple Part of Speech to identify tags
•   BayesTagger – Same as BasicTagger but uses weights from the naïve Bayes
    training algorithm
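
A hypothetical usage sketch of the chain described above (the import path, constructors, and the tag() method name are assumptions for illustration and may not match the distribution exactly):

from silcc import BasicTagger, TweetTagger, BayesTagger   # hypothetical import path

basic = BasicTagger()            # plain text: headlines, email subjects
basic.tag("Aftershock hits Port-au-Prince, hospitals overwhelmed")

tweets = TweetTagger()           # tweets: tokenized, parsed, tagged; hashtags merged back in
tweets.tag("RT @PIH: PBS addresses mental health needs #Haiti #health")

bayes = BayesTagger()            # same interface as BasicTagger, but uses trained weights
bayes.tag("PBS addresses mental health needs in Haiti")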
DEPENDENCIES
•Part of speech tagging is currently performed by the
 Python NLTK
•The Web API uses the Pylons web framework
CURRENT STATUS
•The Tag method of the API is ready for use; individual deployments can choose between the BasicTagger and the BayesTagger
•Tell method (for user feedback) will be ready by
 the time you read this!
•Training is possible on corpora of tagged data in .csv
 format (see examples in distribution)
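
A hypothetical sketch of calling the web API's Tag and Tell methods over HTTP (the base URL, endpoint paths, parameter names, and response shape are assumptions; see the distribution for the real interface):

import requests

SILCC_URL = "http://localhost:5000"   # hypothetical local deployment
API_KEY = "your-api-key"              # an API key accompanies every request

# Tag: submit text, receive a list of tags
resp = requests.post(SILCC_URL + "/api/tag",
                     data={"apikey": API_KEY,
                           "text": "PBS addresses mental health needs in #Haiti"})
print(resp.json())

# Tell: send back corrected tags, which feed the next training iteration
requests.post(SILCC_URL + "/api/tell",
              data={"apikey": API_KEY,
                    "text": "PBS addresses mental health needs in #Haiti",
                    "tags": "Haiti, mental health, PBS"})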
CURRENT LIMITATIONS
•Only English text is supported at the moment
•Tags are always one of the words in the supplied text, i.e. they can never be a word that does not appear in the supplied text
•Very few training examples exist at the moment
FUTURE WORK
•Multilingual: use non-English part-of-speech taggers
•UTF-8 compatible
•Experiment with different learning algorithms (e.g.
 neural networks)
•Perform external text analysis (e.g. if there is a url,
 analyze the text in the url as well as in the tweet)
•Allow users to specify required density of tags
SWIFT RIVER
       jon@ushahidi.com
   http://swift.ushahidi.com
http://github.com/appfrica/silcc




         An Ushahidi Initiative

    by Neville Newey and Jon Gosier
