Cleaning plain text books with
Text::Perfide::BookCleaner
           André Santos
         andrefs@cpan.org




        September 23, 2011
Introduction   Per-Fide




1   Introduction
       Per-Fide
       Text alignment
       Books

2   Text::Perfide::BookCleaner

3   Conclusions, wish list and future work



           André Santos andrefs@cpan.org       Cleaning plain text books with Text::Perfide::BookCleaner
Project Per-Fide

     Joint venture between the Computer Science
     Department and the School of Humanities of
     the University of Minho
     Portuguese in parallel with six languages:
     Español, Russian, Français, Italiano, Deutsch,
     English
     Build parallel corpora that will establish a
     relation between Portuguese and the other 6
     languages



[Parallel] Corpora

           Corpus: a collection of natural-language texts
  Parallel corpus: a collection of natural-language bitexts
           Bitext: a pair formed by a text in a given
                   language and its translation in
                   another language, frequently aligned
        Alignment: a mapping between the
                   sentences/paragraphs/words of one
                   text and those of the other


Project Per-Fide

     Original texts in the seven languages and their
     translations
     Two main genres: contemporary fiction
     and non-fiction
     non-fiction: judicial, journalistic, religious,
                technical, ...
        fiction: contemporary novels and short
                stories
     per-fide.di.uminho.pt

Introduction   Text alignment



Text alignment
     Manual or automatic
     Paragraph/sentence/word level
     Automatic alignment tools/algorithms
     generally fall into three categories:
     length based: “when two sentences correspond, the
                 words in them also correspond”
     lexical/dictionary based: relies on lexical
                 information or dictionaries to perform the
                 alignment
     partial similarity (cognates) based: relies on
                 occurrences of tokens graphically or
                 otherwise identical (cognates)
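
The third category can be illustrated with a small sketch. It is written in Python for brevity and is not the aligner used by the project: the function names, the minimum token length and the scoring formula are all assumptions made for this illustration, and only graphically identical tokens are considered.

```python
import re

def candidate_tokens(sentence, min_len=4):
    """Lower-cased tokens that are numbers or at least min_len characters long."""
    words = re.findall(r"\w+", sentence.lower())
    return {w for w in words if w.isdigit() or len(w) >= min_len}

def cognate_score(source, target):
    """Share of candidate tokens that appear identically in both sentences."""
    a, b = candidate_tokens(source), candidate_tokens(target)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))
```

Numbers and proper names (such as "1815" or "Digne") tend to survive translation unchanged, so a higher score suggests that two sentences may correspond.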

Text alignment – Example




  Table: Extract of sentence-level alignment performed using
  Portuguese and Russian subtitles from the movie Tron.


Introduction   Books



Books

    Obtained directly from publishers or, if in
    public domain, from Project Gutenberg and
    similar projects
    Large variety of formats: PDF, MS Word,
    HTML, ebook formats, ...
    If not already in plain text, they need to be
    converted before the alignment
 This is where all the trouble starts!


Book alignment problems
     pagination – page numbers, headers,
     footers, ...
     previous text formatting – sub/superscript,
     bold, italics, ...
     sections
     paragraphs
     translineations and transpaginations
     footnotes
     text encoding
     ...

Book alignment problems – Example
  (...)
  gaiement. Sur le devant s<92>’ouvrait la porte
  d<92>’entrée, donnant accès dans la salle commune.
  Une légère véranda, qui en proté-

               <96>- 86 <96>-
   ^L geait la partie antérieure contre l<92>’action
  des rayons solaires, reposait sur de sveltes bambous.
  Le tout était peint d<92>’une fraîche
  (...)

                                                        La Jangada, Jules Verne

Text::Perfide::BookCleaner




First approach
          RegExp + Find & Replace





  Well-intentioned but:
       Too naïve
       Big mess
       A more sophisticated approach was needed!




Architecture
  Build a pipeline; each step handles a specific set of
  problems.
    1  pages
    2  sections
    3  paragraphs
    4  footnotes
    5  chars
    6  ...
    7   commit
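
The pipeline idea can be sketched as follows. This is an illustration in Python, not the module's Perl code: the two step functions, their names and the report keys are hypothetical placeholders standing in for the real steps (pages, sections, paragraphs, ...).

```python
from typing import Callable, Dict, List, Tuple

# A step receives the text and returns the new text plus its report entries.
Step = Callable[[str], Tuple[str, Dict[str, int]]]

def strip_trailing_spaces(text: str) -> Tuple[str, Dict[str, int]]:
    """Hypothetical step: remove trailing whitespace from every line."""
    lines = text.split("\n")
    cleaned = [line.rstrip() for line in lines]
    changed = sum(1 for a, b in zip(lines, cleaned) if a != b)
    return "\n".join(cleaned), {"trailing_ws_lines": changed}

def collapse_blank_lines(text: str) -> Tuple[str, Dict[str, int]]:
    """Hypothetical step: collapse runs of blank lines into a single one."""
    out, removed, prev_blank = [], 0, False
    for line in text.split("\n"):
        blank = line.strip() == ""
        if blank and prev_blank:
            removed += 1
            continue
        out.append(line)
        prev_blank = blank
    return "\n".join(out), {"blank_lines_removed": removed}

def run_pipeline(text: str, steps: List[Step]) -> Tuple[str, Dict[str, int]]:
    """Apply each step in order, accumulating what every step reports."""
    report: Dict[str, int] = {}
    for step in steps:
        text, info = step(text)
        report.update(info)
    return text, report
```

Because every step has the same shape, steps can be reordered, skipped or tested in isolation, and the accumulated report can be inspected before anything irreversible happens.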

Architecture


     whenever possible, use ontologies and DSLs
     they help organize things
     they make it possible to abstract away from the
     code and discuss details at a higher level (even
     with people from other areas)





Pages

  Goal
  Identify and remove from the text the elements
  related to book pagination:
       page numbers
       headers
       footers
       page breaks
  These elements often degrade the performance of
  the aligner.


Pages – Example

  est vrai qu’il fallait être assez chanceux pour
  rencontrer le nabab, et assez audacieux pour
  s’emparer de sa personne.

                     Page 3
  ^L La maison à vapeur                                 Jules Verne

    Le faquir, - évidemment le seul entre tous
  que ne surexcitât pas l’espoir de gagner la
  prime, - filait au milieu des groupes, s’arrêtant

                                           La Maison à Vapeur, Jules Verne



Pages – Algorithm
   1   identify page breaks (e.g., ^L)
   2   treat the lines next to each page break as
       candidates for headers and footers
   3   count the occurrences of each normalized
       candidate
   4   extract as headers and footers the candidates
       that occur more often than a threshold value
   5   replace everything with a custom mark
   6   move all the necessary information to a
       standoff file
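
The six steps can be sketched roughly as below. This is a Python illustration under several assumptions, not the module's implementation: candidates are limited to the first and last non-blank line of each page, normalization only masks digits and whitespace, the mark is numbered by the page that follows it, and the "standoff file" is just a list of (page, line) pairs.

```python
import re
from collections import Counter

PAGE_BREAK = "\f"  # form feed, displayed as ^L in the examples

def normalize(line):
    """Collapse whitespace and mask digits so 'Page 3' and 'Page 4' compare equal."""
    return re.sub(r"\d+", "#", " ".join(line.split()))

def edge_lines(page):
    """Return the page's lines and the indices of its first and last non-blank lines."""
    lines = page.split("\n")
    nonblank = [i for i, line in enumerate(lines) if line.strip()]
    return lines, ({nonblank[0], nonblank[-1]} if nonblank else set())

def clean_pages(text, threshold=1):
    pages = text.split(PAGE_BREAK)                      # step 1
    counts = Counter()
    for page in pages:                                  # steps 2 and 3
        lines, edges = edge_lines(page)
        counts.update(normalize(lines[i]) for i in edges)
    frequent = {c for c, n in counts.items() if n > threshold}   # step 4

    standoff, cleaned = [], []
    for number, page in enumerate(pages, start=1):
        lines, edges = edge_lines(page)
        kept = []
        for i, line in enumerate(lines):
            if i in edges and normalize(line) in frequent:
                standoff.append((number, line.strip()))  # step 6
            else:
                kept.append(line)
        cleaned.append("\n".join(kept))

    parts = [cleaned[0]]                                # step 5
    for number, page in enumerate(cleaned[1:], start=2):
        parts.append(f"_pb{number}_\n")
        parts.append(page)
    return "".join(parts), standoff
```

Counting normalized candidates is what separates a running header, which repeats on many pages, from an ordinary first line of text, which occurs once.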
Pages – Example

  est vrai qu’il fallait être assez chanceux pour
  rencontrer le nabab, et assez audacieux pour
  s’emparer de sa personne. _pb2_

    Le faquir, - évidemment le seul entre tous
  que ne surexcitât pas l’espoir de gagner la
  prime, - filait au milieu des groupes, s’arrêtant

                                           La Maison à Vapeur, Jules Verne





Sections


  Goal
  Identify and normalize the divisions between the
  several sections of a book (parts, chapters, acts,
  scenes, epilogue, afterword, ...)
  An ontology was created, containing types of
  divisions and subdivisions, in several languages.




Sections – Ontology
  Example
  cap
  PT capítulo, cap, capitulo
  FR chapitre, chap
  EN chapter, chap
  NT sec

  PT   fim
  FR   fin
  EN   the_end
  BT   _alone

  This ontology is used to automatically generate
  part of the code.
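
One way such generation might look is sketched below in Python. The dictionary layout, the function name and the choice of generating a regular expression are assumptions made for this illustration; the slides do not say which code the module actually generates from the ontology.

```python
import re

# Hypothetical in-memory form of the "cap" entry shown above.
ONTOLOGY = {
    "cap": {
        "PT": ["capítulo", "cap", "capitulo"],
        "FR": ["chapitre", "chap"],
        "EN": ["chapter", "chap"],
    },
}

def build_section_regex(ontology, lang):
    """Generate one regex that matches any label of any concept for a language."""
    alternatives = []
    for concept, labels in ontology.items():
        # Longest labels first, so "capítulo" is preferred over "cap".
        words = sorted(labels.get(lang, []), key=len, reverse=True)
        if words:
            group = "|".join(re.escape(w) for w in words)
            alternatives.append(f"(?P<{concept}>{group})")
    return re.compile(r"^\s*(?:" + "|".join(alternatives) + r")\b\.?", re.IGNORECASE)
```

The name of the matched group tells which concept was found, so supporting a new language or a new kind of division only requires editing the ontology, not the matching code.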

Sections – Example
  PRIMEIRA PARTE

  FANTINE


  ^L LIVRO PRIMEIRO

  UM JUSTO

  O abade Myriel

  Em 1815, era bispo de Digne, o reverendo Carlos
  Francisco Bemvindo Myriel, o qual contava setenta e
                                                    Os Miseráveis, Vitor Hugo

Sections – Algorithm

   1   Search for potential section divisions:
            lines with keywords – capítulo, chapter, Chap.,
            Appendix, Table des Matières, ...
            pages or lines containing only numbers
            roman numbering
            ...
   2   Insert a custom mark immediately before the
       section identified
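
A minimal Python sketch of these two steps follows. The keyword list is the short sample from the slide rather than the full ontology, and the mark format `_sec:KIND_` is a placeholder invented here; the marks the module really emits (shown in the next example) carry more information.

```python
import re

# Keywords taken from the slide; a real run would use the whole ontology.
KEYWORD = re.compile(
    r"^\s*(?:cap[ií]tulo|chapter|appendix|table des matières|chap\.)(?!\w)",
    re.IGNORECASE,
)
NUMBER_ONLY = re.compile(r"^\s*\d+\s*$")
# Upper-case roman numeral alone on a line (the lookahead rejects empty lines).
ROMAN_ONLY = re.compile(
    r"^\s*(?=[IVXLCDM])M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})\.?\s*$"
)

def mark_sections(text):
    """Insert a placeholder mark line before every line that looks like a division."""
    out = []
    for line in text.split("\n"):
        if KEYWORD.match(line):
            out.append("_sec:keyword_")
        elif NUMBER_ONLY.match(line):
            out.append("_sec:number_")
        elif ROMAN_ONLY.match(line):
            out.append("_sec:roman_")
        out.append(line)
    return "\n".join(out)
```

These heuristics produce false positives (a line consisting of the pronoun "I", for instance), which is one reason the marks are reviewed in a report before being committed.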



Sections – Example
  _sec+O:PARTE=PRIMEIRA_
  FANTINE

  _sec+O:LIVRO=PRIMEIRO_

  UM JUSTO

  O abade Myriel

  Em 1815, era bispo de Digne, o reverendo Carlos
  Francisco Bemvindo Myriel, o qual contava setenta e

                                                    Os Miseráveis, Vitor Hugo


Sections


  Identifying the different parts within a bitext:
       makes it possible to compare the two
       versions afterwards and remove parts that
       appear in only one of them
       makes it possible to perform a structural
       alignment¹

    ¹ Text::Perfide::BookSync

Paragraphs



  Goal
  Identify and normalize paragraph notation, direct
  speech, etc.





Paragraphs – Example

  L’hôtesse prit la défense de son curé:

  - D’ailleurs, il en plierait quatre comme vous sur
  son genou. Il a, l’année dernière, aidé nos gens à
  rentrer la paille; il en portait jusqu’à six bottes
  à la fois, tant il est fort!

  - Bravo! dit le pharmacien. Envoyez donc vos filles
  en confesse à des gaillards d’un tempérament pareil!
  Moi, si j’étais le gouvernement, je voudrais qu’on
  saignât les prêtres une fois par mois.



Paragraphs – Example

  L’hôtesse prit la défense de son curé:

    "D’ailleurs, il en plierait quatre comme vous sur
  son genou. Il a, l’année dernière, aidé nos gens à
  rentrer la paille; il en portait jusqu’à six bottes
  à la fois, tant il est fort! "

    "Bravo!" dit le pharmacien. "Envoyez donc vos filles
  en confesse à des gaillards d’un tempérament pareil!
  Moi, si j’étais le gouvernement, je voudrais qu’on
  saignât les prêtres une fois par mois."



Paragraphs – Algorithm


     paragraph identification is performed by
     calculating metrics based on the number of
     blank lines and indentation
     identification and normalization of direct
     speech:
         punctuation, paragraph, dash
         text in quotes
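
The first bullet (metrics based on blank lines and indentation) could look like the Python sketch below. The specific metrics and the decision rule (use indentation when indented lines outnumber blank lines) are assumptions made for illustration; the slides do not give the module's actual formula. Direct-speech normalization is not covered here.

```python
def paragraph_metrics(text):
    """Count blank lines and indented non-blank lines."""
    lines = text.split("\n")
    nonblank = [line for line in lines if line.strip()]
    return {
        "blank": sum(1 for line in lines if not line.strip()),
        "indented": sum(1 for line in nonblank if line[0] in " \t"),
        "nonblank": len(nonblank),
    }

def split_paragraphs(text):
    """Split on whichever convention the metrics suggest the document uses."""
    metrics = paragraph_metrics(text)
    by_indent = metrics["indented"] > metrics["blank"]  # hypothetical rule
    paragraphs, current = [], []
    for line in text.split("\n"):
        blank = not line.strip()
        indented = line[:1] in (" ", "\t")
        boundary = (indented and not blank) if by_indent else blank
        if boundary and current:
            paragraphs.append(" ".join(current))
            current = []
        if not blank:
            current.append(line.strip())
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Measuring first and deciding afterwards matters because converted books disagree on notation: some separate paragraphs with blank lines, others only indent the first line.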





Footnotes



  Goal
  Identify and remove footnote callmarks and
  footnote expansions





Footnotes – Example
  On fit un inventaire de son argent comptant, et on
  le mena dans le château que fit construire le roi
  Charles V, fils de Jean II, auprès de la rue
  Saint-Antoine, à la porte des Tournelles[1].

  [1] La Bastille, qui fut prise par le peuple de
  Paris, le 14 juillet 1789, puis démolie. B.

   ^L Quel était en chemin l’étonnement de l’Ingénu!
  je vous le laisse à penser. Il crut d’abord
  que c’était un rêve.

                                              Oeuvres de Voltaire, Voltaire


Footnotes – Algorithm

   1   Search for footnote expansions (lines beginning
       with <<1>>, [2], ^3, ...)
   2   Replace them with a custom mark
   3   Only the footnote call marks are left
   4   Search again for the same patterns in the
       middle of the text
   5   Replace them with a custom mark
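
The two passes can be sketched in Python as follows. This is an illustration only: the rule that an expansion continues until the next blank line, the sequential numbering of the `_fne…_` and `_fnr…_` marks, and keeping the removed expansions in a list are assumptions, not documented behavior of the module.

```python
import re

EXPANSION_START = re.compile(r"^\s*(?:<<\d+>>|\[\d+\]|\^\d+)")
CALL_MARK = re.compile(r"<<\d+>>|\[\d+\]|\^\d+")

def clean_footnotes(text):
    """First pass removes expansions, second pass removes the remaining call marks."""
    expansions, out, in_note = [], [], False
    for line in text.split("\n"):
        if EXPANSION_START.match(line):                 # steps 1 and 2
            expansions.append(line.strip())
            out.append(f"_fne{len(expansions)}_")
            in_note = True
        elif in_note and line.strip():
            expansions[-1] += " " + line.strip()        # continuation line
        else:
            in_note = False
            out.append(line)

    counter = 0
    def replace_call(match):                            # steps 4 and 5
        nonlocal counter
        counter += 1
        return f"_fnr{counter}_"

    body = CALL_MARK.sub(replace_call, "\n".join(out))
    return body, expansions
```

Handling the expansions first is what makes the second pass safe: once they are gone, any remaining occurrence of the same pattern can only be a call mark inside the running text.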



Footnotes – Algorithm

  On fit un inventaire de son argent comptant, et on
  le mena dans le château que fit construire le roi
  Charles V, fils de Jean II, auprès de la rue
  Saint-Antoine, à la porte des Tournelles_fnr29_.
  _fne8_


   ^L Quel était en chemin l’étonnement de l’Ingénu!
  je vous le laisse à penser. Il crut d’abord
  que c’était un rêve.

                                              Oeuvres de Voltaire, Voltaire



Words and characters



     translineations
     text encoding
     ...





Report


     The previous steps produce a report
     It summarizes what was found, what was
     assumed and what was done
     The main goal is to help diagnose the program's
     behavior, so that whatever went wrong can be
     corrected manually





Report

  livros/_FR_15.pdf.txt:
      footers=[’( Page) = 241’]
      headers=[
        "(La maison x{e0} vapeur Jules Verne) = 241"]
      ctrL=1;
      pagnum_ctrL=241;

     sectionsO=2;
     sectionsN=30;

     word_tr=58;
     words=118036;



Commit


    Final and irreversible step which removes all
    the custom marks added by the previous steps
    Outputs a cleaned copy of the document
    This is the last stage before the alignment (or
    any other further processing)
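
A rough Python sketch of the commit step is given below. The pattern covers only the mark shapes that appear in the examples of this talk (`_pb2_`, `_fnr29_`, `_fne8_`, `_sec+O:LIVRO=PRIMEIRO_`); the module's full mark syntax is not described in the slides, so the regular expression is an assumption and would also strip ordinary text that happens to look like a mark.

```python
import re

# Matches the marks seen in the examples, plus any spaces or tabs just before them.
MARK = re.compile(r"[ \t]*_(?:pb\d+|fn[re]\d+|sec[^_\s]*)_")

def commit(text):
    """Remove every custom mark, then collapse the blank-line runs left behind."""
    without_marks = MARK.sub("", text)
    return re.sub(r"\n{3,}", "\n\n", without_marks)
```

Since the marks are the only trace of what the earlier steps decided, nothing can be undone after this point, which is why the report is checked beforehand.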




Conclusions, wish list and future work




Conclusions and wish list


     There is no de facto standard format for plain
     text books (documents?)
     Documents are very heterogeneous
     (provenance, type and quantity, notation
     formats, ...)
     Hurrah to regular expressions!
     20/80 rule applies




Conclusions and wish list


     Ontologies and DSLs lead to a better structure
     Common pattern:
           search text
           calculate metrics
           perform action accordingly
     Report generated at the end should present a
     smart summary of what was found and done



Related ongoing work
  Text::Perfide::BookPairs: finds repeated books and
            pairs of books (the same book in different
            languages) within a collection
  Text::Perfide::BookSync: uses the section
            delimitation made by Text::Perfide::BookCleaner
            to perform a structural alignment
  Text::Perfide::CorporaFlow: uses a DSL to guide the
            corpora preparation workflow (to be
            done)
  Text::Perfide::SciPaperCleaner: a cleaner for scientific
            papers (to be done)

Future work



     Standoff annotation – no changes in the
     original file until commit
     Export to ebook formats – .fb2, .epub, ...
     ...




CPAN


               Is it on CPAN yet?
       No, but it will be really, really soon!

 Missing
      More and better documentation
      More and better tests




Questions



                                          o/

                                    André Santos
                                  andrefs@cpan.org


Cleaning plain text books with Text::Perfide::BookCleaner

  • 1. Cleaning plain text books with Text::Perfide::BookCleaner. André Santos, andrefs@cpan.org. September 23, 2011
  • 2. 1 Introduction: Per-Fide, Text alignment, Books. 2 Text::Perfide::BookCleaner. 3 Conclusions, wish list and future work.
  • 6. Project Per-Fide: Joint venture between the Computer Science Department and the School of Humanities of the University of Minho. Portuguese in parallel with six languages: Español, Russian, Français, Italiano, Deutsch, English. Build parallel corpora that will establish a relation between Portuguese and the other 6 languages.
  • 7. [Parallel] Corpora. Corpora: collection of natural language texts. Parallel corpora: collection of natural language bitexts. Bitext: pair formed by a text in a given language and its translation in another language, frequently aligned. Alignment: mapping between the sentences/paragraphs/words of one text and the other.
  • 12. Project Per-Fide: Original texts in the seven languages and their translations. Two main genres: contemporary fiction and non-fiction. Non-fiction: judicial, journalistic, religious, technical, ... Fiction: contemporary novels and short stories. per-fide.di.uminho.pt
  • 18. Text alignment: Manual or automatic. Paragraph/sentence/word level. Automatic alignment tools/algorithms generally fall into three categories: length based (“when two sentences correspond, the words in them also correspond”); lexical/dictionary based (relies on lexical information or dictionaries to perform the alignment); partial similarity (cognates) based (relies on occurrences of tokens graphically or otherwise identical, i.e. cognates).
  • 19. Text alignment – Example. Table: Extract of sentence-level alignment performed using Portuguese and Russian subtitles from the movie Tron.
  • 23. Books: Obtained directly from publishers or, if in public domain, from Project Gutenberg and similar projects. Large variety of formats: PDF, MS Word, HTML, ebook formats, ... If not already in plain text, they need to be converted before the alignment. This is where all the trouble starts!
  • 24. Book alignment problems: pagination (page numbers, headers, footers, ...); previous text formatting (sub/superscript, bold, italics, ...); sections; paragraphs; translineations and transpaginations; footnotes; text encoding; ...
  • 25. Book alignment problems – Example: (...) gaiement. Sur le devant s<92>’ouvrait la porte d<92>’entrée, donnant accès dans la salle commune. Une légère véranda, qui en proté- <96>- 86 <96>- ^L geait la partie antérieure contre l<92>’action des rayons solaires, reposait sur de sveltes bambous. Le tout était peint d<92>’une fraîche (...) (La Jangada, Jules Verne)
  • 28. First approach: RegExp + Find & Replace
  • 30. First approach: Well-intentioned but too naïve; big mess. A more sophisticated approach was needed!
  • 32. Architecture: Build a pipeline; each step handles a specific set of problems: 1 pages, 2 sections, 3 paragraphs, 4 footnotes, 5 chars, 6 ..., 7 commit.
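The pipeline idea on this slide can be illustrated with a minimal Python sketch. It is not the module's actual Perl code: the step functions, the `_pb_` placeholder and the `run_pipeline` helper are hypothetical names chosen for illustration, and only two toy steps are shown.

```python
import re
from typing import Callable, List, Tuple

# A step is any function that takes the whole text and returns a new text.
Step = Callable[[str], str]


def mark_page_breaks(text: str) -> str:
    """Toy 'pages' step: turn each form feed (^L) into a placeholder mark."""
    return text.replace("\f", "\n_pb_\n")


def commit(text: str) -> str:
    """Toy 'commit' step: drop the placeholder marks and tidy blank lines."""
    text = re.sub(r"^_pb_\n", "", text, flags=re.MULTILINE)
    return re.sub(r"\n{3,}", "\n\n", text)


# Hypothetical pipeline: the real one has more steps (sections, paragraphs, ...).
PIPELINE: List[Tuple[str, Step]] = [
    ("pages", mark_page_breaks),
    ("commit", commit),
]


def run_pipeline(text: str, steps: List[Tuple[str, Step]] = PIPELINE) -> str:
    """Apply every step in order, feeding each one the previous step's output."""
    for _name, step in steps:
        text = step(text)
    return text
```

Keeping each step as a plain text-to-text function means steps can be tested in isolation and reordered or skipped per book.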
  • 33. Architecture
  • 34. Architecture: whenever possible, use ontologies and DSLs. They help organize things, and they make it possible to abstract from the code and discuss details at a higher level (even with people from other areas).
  • 35. Pages. Goal: identify and remove from the text the elements related to book pagination: page numbers, headers, footers, page breaks. These elements often lead to poor aligner performance.
  • 36. Pages – Example: est vrai qu’il fallait être assez chanceux pour rencontrer le nabab, et assez audacieux pour s’emparer de sa personne. Page 3 ^L La maison à vapeur Jules Verne Le faquir, - évidemment le seul entre tous que ne surexcitât pas l’espoir de gagner la prime, - filait au milieu des groupes, s’arrêtant (La Maison à Vapeur, Jules Verne)
  • 37. Pages – Algorithm: 1 identify page breaks (e.g., ^L); 2 lines near each break are candidates for headers and footers; 3 count the occurrences of each normalized candidate; 4 headers and footers are extracted from the candidates which occur more often than a threshold value; 5 replace everything with a custom mark; 6 move all the necessary information to a standoff file.
  • 39. Pages – Example: est vrai qu’il fallait être assez chanceux pour rencontrer le nabab, et assez audacieux pour s’emparer de sa personne. _pb2_ Le faquir, - évidemment le seul entre tous que ne surexcitât pas l’espoir de gagner la prime, - filait au milieu des groupes, s’arrêtant (La Maison à Vapeur, Jules Verne)
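The page-cleaning steps listed above can be sketched roughly as follows. This is a simplified Python illustration, not the module's Perl implementation. It assumes that page breaks are form feeds, that headers and footers are single lines right next to a break, and that marks are numbered `_pbN_` in order; `normalize`, `clean_pages` and the returned standoff list are hypothetical names.

```python
import re
from collections import Counter
from typing import List, Tuple


def normalize(line: str) -> str:
    """Collapse whitespace and digits so 'Page 3' and 'Page 12' compare equal."""
    return re.sub(r"\d+", "#", " ".join(line.split()))


def clean_pages(text: str, threshold: int = 1) -> Tuple[str, List[str]]:
    """Return the text with page furniture replaced by marks, plus a standoff list."""
    # Step 1: split on page breaks (assumed to be form feeds, ^L).
    pages = [page.strip().splitlines() for page in text.split("\f")]
    last = len(pages) - 1

    # Steps 2-3: the line before/after each break is a candidate; count them.
    counts: Counter = Counter()
    for i, lines in enumerate(pages):
        if not lines:
            continue
        if i > 0:
            counts[normalize(lines[0])] += 1   # header candidate
        if i < last:
            counts[normalize(lines[-1])] += 1  # footer candidate

    # Step 4: keep only candidates seen more often than the threshold.
    frequent = {cand for cand, n in counts.items() if n > threshold}

    # Steps 5-6: remove them from the text and remember them in the standoff list.
    standoff: List[str] = []
    bodies: List[str] = []
    for i, lines in enumerate(pages):
        lines = list(lines)
        if i > 0 and lines and normalize(lines[0]) in frequent:
            standoff.append(lines.pop(0))
        if i < last and lines and normalize(lines[-1]) in frequent:
            standoff.append(lines.pop())
        bodies.append("\n".join(lines).strip())

    marked = bodies[0]
    for n, body in enumerate(bodies[1:], start=1):
        marked += f"\n_pb{n}_\n{body}"
    return marked, standoff
```

Counting normalized candidates is what separates a running header such as a book title from an ordinary first line of a page: the former repeats across many breaks, the latter does not.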
  • 40. Sections. Goal: identify and normalize the divisions between the several sections of a book (parts, chapters, acts, scenes, epilogue, afterword, ...). An ontology was created, containing types of divisions and subdivisions, in several languages.
  • 41. Sections – Ontology. Example: cap (PT capítulo, cap, capitulo; FR chapitre, chap; EN chapter, chap; NT sec); fim (PT fim; FR fin; EN the_end; BT _alone). This ontology is used to automatically generate part of the code.
  • 42. Sections – Example: PRIMEIRA PARTE FANTINE ^L LIVRO PRIMEIRO UM JUSTO O abade Myriel Em 1815, era bispo de Digne, o reverendo Carlos Francisco Bemvindo Myriel, o qual contava setenta e (Os Miseráveis, Vitor Hugo)
  • 43. Sections – Algorithm: 1 Search for potential section divisions: lines with keywords (capítulo, chapter, Chap., Appendix, Table des Matières, ...); pages or lines containing only numbers; roman numbering; ... 2 Insert a custom mark immediately before the section identified.
  • 45. Sections – Example: _sec+O:PARTE=PRIMEIRA_ FANTINE _sec+O:LIVRO=PRIMEIRO_ UM JUSTO O abade Myriel Em 1815, era bispo de Digne, o reverendo Carlos Francisco Bemvindo Myriel, o qual contava setenta e (Os Miseráveis, Vitor Hugo)
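A rough Python sketch of the section search described above, driven by a tiny keyword table. Everything here is an assumption made for illustration: the `ONTOLOGY` dictionary is a cut-down stand-in for the real ontology (with "the end" written as two words), the `_sec:TYPE_` mark is simpler than the `_sec+O:...` marks shown on the slide, and the length limit is an invented heuristic.

```python
import re
from typing import Dict, List, Optional

# Hypothetical, heavily reduced ontology: section type -> keywords per language.
ONTOLOGY: Dict[str, Dict[str, List[str]]] = {
    "cap": {"PT": ["capítulo", "capitulo", "cap"],
            "FR": ["chapitre", "chap"],
            "EN": ["chapter", "chap"]},
    "fim": {"PT": ["fim"], "FR": ["fin"], "EN": ["the end"]},
}

ROMAN = re.compile(r"[IVXLCDM]+\.?")
MAX_HEADING_LENGTH = 40  # assumed cut-off: longer lines are treated as prose


def keyword_patterns(ontology: Dict[str, Dict[str, List[str]]]):
    """Generate one regex per keyword, mimicking 'code generated from the ontology'."""
    return [(re.compile(rf"{re.escape(kw)}\b", re.IGNORECASE), sec_type)
            for sec_type, langs in ontology.items()
            for keywords in langs.values()
            for kw in keywords]


PATTERNS = keyword_patterns(ONTOLOGY)


def classify(line: str) -> Optional[str]:
    """Return a section type for heading-like lines, or None for ordinary text."""
    stripped = line.strip()
    if not stripped or len(stripped) > MAX_HEADING_LENGTH:
        return None
    if stripped.isdigit():
        return "num"          # line containing only a number
    if ROMAN.fullmatch(stripped):
        return "roman"        # line containing only a roman numeral
    for pattern, sec_type in PATTERNS:
        if pattern.match(stripped):
            return sec_type   # line starting with an ontology keyword
    return None


def mark_sections(text: str) -> str:
    """Insert a custom mark line immediately before each detected section line."""
    out: List[str] = []
    for line in text.split("\n"):
        sec_type = classify(line)
        if sec_type:
            out.append(f"_sec:{sec_type}_")
        out.append(line)
    return "\n".join(out)
```

The roman-numeral test is deliberately crude (a line reading "MIX" would match), which is one reason a report step for manual review is useful.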
  • 46. Sections: Identifying the different parts within a bitext makes it possible to subsequently compare the two versions and remove parts which can only be found in one of them, and to perform a structural alignment (see Text::Perfide::BookSync).
  • 47. Paragraphs. Goal: identify and normalize paragraph notation, direct speech, etc.
  • 48. Paragraphs – Example: L’hôtesse prit la défense de son curé: - D’ailleurs, il en plierait quatre comme vous sur son genou. Il a, l’année dernière, aidé nos gens à rentrer la paille; il en portait jusqu’à six bottes à la fois, tant il est fort! - Bravo! dit le pharmacien. Envoyez donc vos filles en confesse à des gaillards d’un tempérament pareil! Moi, si j’étais le gouvernement, je voudrais qu’on saignât les prêtres une fois par mois.
  • 49. Paragraphs – Example: L’hôtesse prit la défense de son curé: "D’ailleurs, il en plierait quatre comme vous sur son genou. Il a, l’année dernière, aidé nos gens à rentrer la paille; il en portait jusqu’à six bottes à la fois, tant il est fort!" "Bravo!" dit le pharmacien. "Envoyez donc vos filles en confesse à des gaillards d’un tempérament pareil! Moi, si j’étais le gouvernement, je voudrais qu’on saignât les prêtres une fois par mois."
  • 50. Paragraphs – Algorithm: paragraph identification is performed by calculating metrics based on the number of blank lines and indentation; identification and normalization of direct speech: punctuation, paragraph, dash; text in quotes.
  • 51. Footnotes. Goal: identify and remove footnote call marks and footnote expansions.
  • 52. Footnotes – Example: On fit un inventaire de son argent comptant, et on le mena dans le château que fit construire le roi Charles V, fils de Jean II, auprès de la rue Saint-Antoine, à la porte des Tournelles[1]. [1] La Bastille, qui fut prise par le peuple de Paris, le 14 juillet 1789, puis démolie. B. ^L Quel était en chemin l’étonnement de l’Ingénu! je vous le laisse à penser. Il crut d’abord que c’était un rêve. (Oeuvres de Voltaire, Voltaire)
  • 53. Footnotes – Algorithm: 1 Search for footnote expansions (lines beginning with <<1>>, [2], ^3, ...); 2 Replace with custom mark; 3 Only footnote call marks are left; 4 Search again for the same patterns in the middle of the text; 5 Replace with custom mark.
  • 55. Footnotes – Algorithm: On fit un inventaire de son argent comptant, et on le mena dans le château que fit construire le roi Charles V, fils de Jean II, auprès de la rue Saint-Antoine, à la porte des Tournelles_fnr29_. _fne8_ ^L Quel était en chemin l’étonnement de l’Ingénu! je vous le laisse à penser. Il crut d’abord que c’était un rêve. (Oeuvres de Voltaire, Voltaire)
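The two-pass footnote procedure can be approximated with the Python sketch below, which is not the module's own code. It assumes that an expansion starts at a line beginning with one of the three patterns named on the slide and runs until the next blank line, form feed or end of text. The mark numbering (a running index into a standoff list) is a hypothetical scheme and differs from the `_fnr29_` / `_fne8_` numbers shown above.

```python
import re
from typing import Callable, List, Tuple

# Call-mark shapes named on the slide: <<1>>, [2], ^3
CALL = r"(?:<<\d+>>|\[\d+\]|\^\d+)"

# Assumption: an expansion starts at a line beginning with a call pattern and
# runs until the next blank line, form feed (^L) or end of text.
EXPANSION = re.compile(
    rf"^[ \t]*{CALL}.*?(?=\n[ \t]*\n|\f|\Z)", re.MULTILINE | re.DOTALL
)
CALLMARK = re.compile(CALL)


def mark_footnotes(text: str) -> Tuple[str, List[str]]:
    """Replace footnote expansions, then call marks, with custom marks."""
    standoff: List[str] = []

    def stash(kind: str) -> Callable[[re.Match], str]:
        def replace(match: re.Match) -> str:
            standoff.append(match.group(0))      # keep the original text
            return f"_{kind}{len(standoff)}_"    # hypothetical numbering scheme
        return replace

    # Pass 1: expansions first, so that only call marks remain in the text.
    text = EXPANSION.sub(stash("fne"), text)
    # Pass 2: the same patterns anywhere else are call marks.
    text = CALLMARK.sub(stash("fnr"), text)
    return text, standoff
```

Doing the expansions first is what makes the second pass safe: once they are gone, any remaining `[1]`-style token can be treated as a call mark. A bracketed number that is not a footnote, such as a year, would still be caught, so the result needs review.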
  • 56. Text::Perfide::BookCleaner Words and characters: translineations, text encoding, ...
  • 57. Text::Perfide::BookCleaner Report The previous steps produce a report. It summarizes what was found, what was assumed and what was done. Its main goal is to let the user diagnose what the program did and manually correct whatever went wrong.
  • 58. Text::Perfide::BookCleaner Report livros/_FR_15.pdf.txt: footers=['( Page) = 241'] headers=["(La maison x{e0} vapeur Jules Verne) = 241"] ctrL=1; pagnum_ctrL=241; sectionsO=2; sectionsN=30; word_tr=58; words=118036;
  • 59. Text::Perfide::BookCleaner Commit A final, irreversible step that removes all the custom marks added by the previous steps. It outputs a cleaned copy of the document. This is the last stage before alignment (or any other further processing).
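A minimal sketch of what a commit step could look like, restricted to the two footnote marks shown earlier (`_fnrN_` and `_fneN_`). It is written in Python for illustration and is not the module's implementation: the function name `commit` is hypothetical, and the marks added by the other cleaning steps are not covered because their format is not shown here.

```python
import re

# Assumed mark shapes, taken from the footnote example: _fne8_ and _fnr29_.
EXPANSION_LINE = re.compile(r"(?m)^_fne\d+_[ \t]*\n?")
CALL = re.compile(r"_fnr\d+_")


def commit(marked_text):
    """Strip the footnote marks and return the cleaned text."""
    # Remove whole lines that hold only an expansion mark.
    text = EXPANSION_LINE.sub("", marked_text)
    # Remove call marks embedded in running text.
    return CALL.sub("", text)


print(commit("a la porte des Tournelles_fnr29_.\n_fne8_\nQuel etait en chemin\n"))
```

Because the marks are deleted rather than resolved, nothing in the output says where a footnote used to be, which is why the step is described as irreversible.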
  • 60. Conclusions, wish list and future work
  • 62. Conclusions, wish list and future work Conclusions and wish list There is no de facto standard format for plain text books (documents?). Documents are highly heterogeneous (provenance, type and quantity, notation formats, ...). Hurrah for regular expressions! The 20/80 rule applies.
  • 63. Conclusions, wish list and future work Conclusions and wish list Ontologies and DSLs lead to a better structure. Common pattern: search the text, calculate metrics, perform an action accordingly. The report generated at the end should present a smart summary of what was found and done.
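The "search, calculate metrics, act accordingly" pattern can be illustrated with repeated page headers, such as the book title counted 241 times in the sample report. The sketch below is Python and purely hypothetical: the function name `remove_repeated_lines` and the `min_repeats` threshold are assumptions, not part of Text::Perfide::BookCleaner.

```python
from collections import Counter


def remove_repeated_lines(lines, min_repeats=3):
    """Drop lines that repeat often enough to look like page headers."""
    # Search: every non-empty line is a candidate.
    counts = Counter(ln.strip() for ln in lines if ln.strip())
    # Metrics: how many times does each candidate occur?
    repeated = {ln for ln, n in counts.items() if n >= min_repeats}
    # Action: drop the candidates that crossed the threshold.
    kept = [ln for ln in lines if ln.strip() not in repeated]
    # Keep the counts so a final report can say what was removed.
    report = {ln: counts[ln] for ln in repeated}
    return kept, report


page_lines = ["Jules Verne", "first page", "Jules Verne", "second page",
              "Jules Verne", "third page"]
kept, report = remove_repeated_lines(page_lines)
print(kept, report)
```

Returning the counts alongside the cleaned lines keeps the decision auditable, in the spirit of the report step.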
  • 67. Conclusions, wish list and future work Related ongoing work Text::Perfide::BookPairs: finds repeated books and pairs of books (the same book in different languages) within a collection. Text::Perfide::BookSync: uses the section delimitation made by T::P::BC to make a structural alignment. Text::Perfide::CorporaFlow: uses a DSL to guide the corpora preparation workflow (to be done). Text::Perfide::SciPaperCleaner: cleaner for scientific papers (to be done).
  • 68. Conclusions, wish list and future work Future work Standoff annotation: no changes in the original file until commit. Export to ebook formats: .fb2, .epub, ...
  • 70. Conclusions, wish list and future work CPAN Is it on CPAN yet? No, but it will be really, really soon! Missing: more and better documentation; more and better tests.
  • 71. Conclusions, wish list and future work Questions o/ André Santos andrefs@cpan.org
  • 72. Cleaning plain text books with Text::Perfide::BookCleaner André Santos andrefs@cpan.org September 23, 2011