Cleaning plain text books with
Text::Perfide::BookCleaner
           André Santos
         andrefs@cpan.org




        September 23, 2011
Introduction   Per-Fide




1   Introduction
       Per-Fide
       Text alignment
       Books

2   Text::Perfide::BookCleaner

3   Conclusions, wish list and future work



Project Per-Fide

     Joint venture between the Computer Science
     Department and the School of Humanities of
     the University of Minho
     Portuguese in parallel with six languages:
     Español, Russian, Français, Italiano, Deutsch,
     English
     Build parallel corpora that will establish a
     relation between Portuguese and the other 6
     languages


[Parallel] Corpora

  Corpora: collections of natural language texts
  Parallel corpora: collections of natural language bitexts
  Bitext: a pair formed by a text in a given language and
          its translation into another language,
          frequently aligned
  Alignment: a mapping between the sentences/paragraphs/words
          of one text and those of the other


Project Per-Fide

     Original texts in the seven languages and their
     translations
     Two main genres: contemporary fiction
     and non-fiction
     non-fiction: judicial, journalistic, religious,
                technical, ...
        fiction: contemporary novels and short
                stories
     per-fide.di.uminho.pt

Introduction   Text alignment



Text alignment
     Manual or automatic
     Paragraph/sentence/word level
     Automatic alignment tools/algorithms
     generally fall into three categories:
     length based: “when two sentences correspond, the
                 words in them also correspond”
     lexical/dictionary based: relies on lexical
                 information or dictionaries to perform the
                 alignment
     partial similarity (cognates) based: relies on
                 occurrences of tokens graphically or
                 otherwise identical (cognates)
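
  As an illustration of the third category only, here is a minimal Python
  sketch (the slides show no code, and this is not the method of any
  particular aligner). It assumes that numbers and capitalized words longer
  than three letters are likely to survive translation unchanged; the
  function names and both sample sentences are made up for the example.

```python
import re

def cognate_tokens(sentence):
    """Tokens likely to be graphically identical across languages:
    numbers and capitalized words longer than three letters (an assumption)."""
    tokens = re.findall(r"\w+", sentence)
    return {t for t in tokens if t.isdigit() or (t[0].isupper() and len(t) > 3)}

def cognate_score(src, tgt):
    """Jaccard overlap between the cognate candidates of two sentences."""
    a, b = cognate_tokens(src), cognate_tokens(tgt)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

pt = "Em 1815, era bispo de Digne o reverendo Myriel."
fr = "En 1815, M. Myriel était évêque de Digne."
print(cognate_score(pt, fr))
```

  A real aligner would combine such a score with sentence length and
  position, since many sentence pairs share no such tokens at all.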



Text alignment – Example




  Table: Extract of sentence-level alignment performed using
  Portuguese and Russian subtitles from the movie Tron.


Introduction   Books



Books

    Obtained directly from publishers or, if in
    public domain, from Project Gutenberg and
    similar projects
    Large variety of formats: PDF, MS Word,
    HTML, ebook formats, ...
    If not already in plain text, they need to be
    converted before the alignment
 This is where all the trouble starts!

Book alignment problems
     pagination – page numbers, headers, footers, ...
     previous text formatting – sub/superscript, bold, italics, ...
     sections
     paragraphs
     translineations and transpaginations
     footnotes
     text encoding
     ...



Book alignment problems – Example
  (...)
  gaiement. Sur le devant s<92>’ouvrait la porte
  d<92>’entrée, donnant accès dans la salle commune.
  Une légère véranda, qui en proté-

               <96>- 86 <96>-
   ^L geait la partie antérieure contre l<92>’action
  des rayons solaires, reposait sur de sveltes bambous.
  Le tout était peint d<92>’une fraîche
  (...)

                                                        La Jangada, Jules Verne

Text::Perfide::BookCleaner




First approach
          RegExp + Find & Replace




  Well-intentioned but:
       Too naïve
       Big mess
       A more sophisticated approach was needed!




Architecture
  Build a pipeline; each step handles a specific set of
  problems.
    1  pages
    2  sections
    3  paragraphs
    4  footnotes
    5  chars
    6  ...
    7   commit

     whenever possible, use ontologies and DSLs
     they help organize things
     they make it possible to abstract away from the
     code and discuss details at a higher level (even
     with people from other areas)




Pages

  Goal
  Identify and remove from the text the elements
  related to book pagination:
       page numbers
       headers
       footers
       page breaks
  These elements often degrade the performance of
  the aligner.

Pages – Example

  est vrai qu’il fallait être assez chanceux pour
  rencontrer le nabab, et assez audacieux pour
  s’emparer de sa personne.

                     Page 3
  ^L La maison à vapeur                                        Jules Verne

    Le faquir, - évidemment le seul entre tous
  que ne surexcitât pas l’espoir de gagner la
  prime, - filait au milieu des groupes, s’arrêtant

                                           La Maison à Vapeur, Jules Verne


Pages – Algorithm
   1   identify page breaks (e.g., ^L)
   2   lines near a page break are candidates for
       headers and footers
   3   count the occurrences of each normalized
       candidate
   4   candidates which occur more often than a
       threshold value are taken as headers and footers
   5   replace everything with a custom mark
   6   move all the necessary information to a
       standoff file
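
  The steps above could be sketched roughly as follows. This is a Python
  illustration, not the module's Perl code. It assumes that the candidates
  are only the first and last non-blank lines around each form feed, that
  normalization means masking digits, and that the standoff data is a plain
  list; `clean_pages`, `normalize` and `trimmed_lines` are hypothetical
  names, and the `_pbN_` numbering is a guess based on the `_pb2_` mark
  shown in the example slide.

```python
import re
from collections import Counter

def normalize(line):
    """Collapse whitespace and mask digits, so 'Page 3' and 'Page 4' are one candidate."""
    return re.sub(r"\d+", "#", " ".join(line.split()))

def trimmed_lines(page):
    """Lines of one page, without leading and trailing blank lines."""
    lines = page.splitlines()
    while lines and not lines[0].strip():
        lines.pop(0)
    while lines and not lines[-1].strip():
        lines.pop()
    return lines

def clean_pages(text, threshold=1):
    pages = [trimmed_lines(p) for p in text.split("\f")]     # step 1: split on ^L
    last = len(pages) - 1
    counts = Counter()
    for i, lines in enumerate(pages):                         # steps 2 and 3
        if lines and i > 0:
            counts[normalize(lines[0])] += 1                  # header candidate
        if lines and i < last:
            counts[normalize(lines[-1])] += 1                 # footer candidate
    frequent = {c for c, n in counts.items() if n > threshold}   # step 4
    standoff, cleaned = [], []
    for i, lines in enumerate(pages):
        lines = list(lines)
        if lines and i > 0 and normalize(lines[0]) in frequent:
            standoff.append((i, "header", lines.pop(0)))
        if lines and i < last and normalize(lines[-1]) in frequent:
            standoff.append((i, "footer", lines.pop()))
        cleaned.append("\n".join(lines).strip())
    out = cleaned[0]
    for n, page in enumerate(cleaned[1:], start=1):           # step 5: custom mark
        out += f" _pb{n}_\n{page}"
    return out, standoff                                       # step 6: removed lines

sample = ("first page text\n\nPage 1\n\fBook Title\n\nsecond page text\n\n"
          "Page 2\n\fBook Title\n\nthird page text\n")
cleaned_text, removed = clean_pages(sample)
print(cleaned_text)
```

  Body lines next to a page break are candidates too, but they rarely
  repeat, so the threshold keeps them in the text.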

Pages – Example (after cleaning)

  est vrai qu’il fallait être assez chanceux pour
  rencontrer le nabab, et assez audacieux pour
  s’emparer de sa personne. _pb2_

    Le faquir, - évidemment le seul entre tous
  que ne surexcitât pas l’espoir de gagner la
  prime, - filait au milieu des groupes, s’arrêtant

                                           La Maison à Vapeur, Jules Verne




Sections


  Goal
  Identify and normalize the divisions between the
  several sections of a book (parts, chapters, acts,
  scenes, epilogue, afterword, ...)
  An ontology was created, containing types of
  divisions and subdivisions, in several languages.



Sections – Ontology
  Example
  cap
  PT capítulo, cap, capitulo
  FR chapitre, chap
  EN chapter, chap
  NT sec

  PT   fim
  FR   fin
  EN   the_end
  BT   _alone

  This ontology is used to automatically generate
  part of the code.
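
  One way to picture "generating code from the ontology" is compiling its
  terms into a matcher. The sketch below is in Python and is only an
  illustration: the dictionary layout, `build_matcher` and the regular
  expression are assumptions, not the module's generated Perl code. `NT` is
  treated here as a relation to another concept rather than a language,
  which is also an assumption.

```python
import re

# Hypothetical in-memory form of the "cap" entry shown on the slide.
ONTOLOGY = {
    "cap": {
        "PT": ["capítulo", "cap", "capitulo"],
        "FR": ["chapitre", "chap"],
        "EN": ["chapter", "chap"],
        "NT": ["sec"],   # relation to another concept, not a language
    },
}

def build_matcher(ontology, lang):
    """Compile one regex matching any term of `lang` at the start of a line."""
    term_to_concept = {}
    for concept, entry in ontology.items():
        for term in entry.get(lang, []):
            term_to_concept[term.lower()] = concept
    # Longest terms first, so that 'chapter' is tried before 'chap'.
    terms = sorted(term_to_concept, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in terms)
    pattern = re.compile(rf"^\s*({alternatives})\b\.?\s*(\S*)", re.IGNORECASE)
    return pattern, term_to_concept

pattern, concepts = build_matcher(ONTOLOGY, "EN")
m = pattern.match("Chapter XII")
print(concepts[m.group(1).lower()], m.group(2))   # prints: cap XII
```

  Adding a language or a synonym then means editing the ontology only; the
  matching code is regenerated from it.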



Sections – Example
  PRIMEIRA PARTE

  FANTINE


  ^L LIVRO PRIMEIRO

  UM JUSTO

  O abade Myriel

  Em 1815, era bispo de Digne, o reverendo Carlos
  Francisco Bemvindo Myriel, o qual contava setenta e
                                                    Os Miseráveis, Vitor Hugo



Sections – Algorithm

   1   Search for potential section divisions:
            lines with keywords – capítulo, chapter, Chap.,
            Appendix, Table des Matières, ...
            pages or lines containing only numbers
            Roman numbering
            ...
   2   Insert a custom mark immediately before the
       identified section
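
  A minimal Python sketch of these two steps, under stated assumptions: the
  keyword list is a small hand-written sample rather than the ontology, only
  uppercase Roman numerals are recognized, and the mark is a plain `_sec_`
  placeholder (the marks on the example slide carry more information, such
  as `_sec+O:PARTE=PRIMEIRA_`, whose exact format is not reproduced here).

```python
import re

# Step 1 patterns: keyword lines, number-only lines, Roman-numeral-only lines.
KEYWORD_LINE = re.compile(
    r"^\s*(?:cap[ií]tulo|chapter|chapitre|chap|appendix|table des matières)\b",
    re.IGNORECASE,
)
NUMBER_LINE = re.compile(r"^\s*\d+\s*$")
ROMAN_LINE = re.compile(
    r"^\s*(?=[IVXLCDM])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\.?\s*$"
)

def mark_sections(text, mark="_sec_"):
    """Step 2: insert `mark` on its own line immediately before each candidate line."""
    out = []
    for line in text.splitlines():
        if KEYWORD_LINE.match(line) or NUMBER_LINE.match(line) or ROMAN_LINE.match(line):
            out.append(mark)
        out.append(line)
    return "\n".join(out)

sample = ("CHAPTER I\n\nIt was a dark night.\n\nII\n\n"
          "The storm passed.\n\n3\n\nMorning came.")
marked = mark_sections(sample)
print(marked)
```

  Heuristics like these produce false positives (a page number alone on a
  line, for instance), so running the page-cleaning step first matters.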



Sections – Example (after marking)
  _sec+O:PARTE=PRIMEIRA_
  FANTINE

  _sec+O:LIVRO=PRIMEIRO_

  UM JUSTO

  O abade Myriel

  Em 1815, era bispo de Digne, o reverendo Carlos
  Francisco Bemvindo Myriel, o qual contava setenta e

                                                    Os Miseráveis, Vitor Hugo

Sections


  Identifying the different parts within a bitext:
       makes it possible to subsequently compare the
       two versions and remove parts which can only
       be found in one of them
       makes it possible to perform a structural
       alignment (with Text::Perfide::BookSync)



Paragraphs



  Goal
  Identify and normalize paragraph notation,
  direct speech, etc.




Paragraphs – Example

  L’hôtesse prit la défense de son curé:

  - D’ailleurs, il en plierait quatre comme vous sur
  son genou. Il a, l’année dernière, aidé nos gens à
  rentrer la paille; il en portait jusqu’à six bottes
  à la fois, tant il est fort!

  - Bravo! dit le pharmacien. Envoyez donc vos filles
  en confesse à des gaillards d’un tempérament pareil!
  Moi, si j’étais le gouvernement, je voudrais qu’on
  saignât les prêtres une fois par mois.


Paragraphs – Example (after normalization)

  L’hôtesse prit la défense de son curé:

    "D’ailleurs, il en plierait quatre comme vous sur
  son genou. Il a, l’année dernière, aidé nos gens à
  rentrer la paille; il en portait jusqu’à six bottes
  à la fois, tant il est fort! "

    "Bravo!" dit le pharmacien. "Envoyez donc vos filles
  en confesse à des gaillards d’un tempérament pareil!
  Moi, si j’étais le gouvernement, je voudrais qu’on
  saignât les prêtres une fois par mois."


Paragraphs – Algorithm


     paragraph identification is performed by
     calculating metrics based on the number of
     blank lines and indentation
     identification and normalization of direct
     speech:
         punctuation, paragraph, dash
         text in quotes
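
  The first bullet (metrics based on blank lines and indentation) could be
  sketched as below. This is a Python illustration under assumptions: the
  two counters and the "pick whichever notation is more frequent" rule are
  guesses at what such metrics might look like, and the function names are
  hypothetical. Direct speech normalization is not covered.

```python
def paragraph_metrics(text):
    """Count how often a non-blank line follows a blank line, and how often
    an indented line directly follows a line of text."""
    lines = text.splitlines()
    after_blank = indented = 0
    for prev, cur in zip(lines, lines[1:]):
        if not cur.strip():
            continue
        if not prev.strip():
            after_blank += 1
        elif cur[:1] in (" ", "\t"):
            indented += 1
    return {"blank_line_starts": after_blank, "indent_starts": indented}

def split_paragraphs(text):
    """Split on the notation the metrics suggest the document uses."""
    metrics = paragraph_metrics(text)
    by_indent = metrics["indent_starts"] > metrics["blank_line_starts"]
    paragraphs, current = [], []
    for line in text.splitlines():
        blank = not line.strip()
        indented = line[:1] in (" ", "\t")
        boundary = (indented and not blank) if by_indent else blank
        if boundary and current:
            paragraphs.append(" ".join(current))
            current = []
        if not blank:
            current.append(line.strip())
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs

indented_text = ("  First paragraph starts here\nand continues on this line.\n"
                 "  Second paragraph starts here\nand ends here.\n  Third one.")
print(split_paragraphs(indented_text))
```

  Deciding once per document, instead of line by line, avoids treating a
  stray indented line as a paragraph break in a book that uses blank lines.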




Footnotes



  Goal
  Identify and remove footnote call marks and
  footnote expansions




Footnotes – Example
  On fit un inventaire de son argent comptant, et on
  le mena dans le château que fit construire le roi
  Charles V, fils de Jean II, auprès de la rue
  Saint-Antoine, à la porte des Tournelles[1].

  [1] La Bastille, qui fut prise par le peuple de
  Paris, le 14 juillet 1789, puis démolie. B.

   ^L Quel était en chemin l’étonnement de l’Ingénu!
  je vous le laisse à penser. Il crut d’abord
  que c’était un rêve.

                                              Oeuvres de Voltaire, Voltaire

Footnotes – Algorithm

   1   Search for footnote expansions (lines beginning
       with <<1>>, [2], ^3, ...)
   2   Replace them with a custom mark
   3   Only footnote call marks are left
   4   Search again for the same patterns in the
       middle of the text
   5   Replace them with a custom mark
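
  The two passes could look roughly like this in Python. It is a sketch
  only: the patterns cover just the three notations listed above, an
  expansion is assumed to run until the next blank line, and the marks are
  numbered sequentially (the slides show `_fne8_` and `_fnr29_`, but not how
  those numbers are assigned). `mark_footnotes` is a hypothetical name.

```python
import re

MARKER = r"(?:<<\d+>>|\[\d+\]|\^\d+)"
# Pass 1: a line beginning with a marker, plus continuation lines up to a blank line.
EXPANSION = re.compile(
    rf"^[ \t]*{MARKER}[^\n]*(?:\n[ \t]*\S[^\n]*)*", re.MULTILINE
)
# Pass 2: the same markers anywhere in the remaining text.
CALL = re.compile(MARKER)

def mark_footnotes(text):
    expansions, calls = [], []

    def store_expansion(match):
        expansions.append(match.group(0))
        return f"_fne{len(expansions)}_"

    def store_call(match):
        calls.append(match.group(0))
        return f"_fnr{len(calls)}_"

    text = EXPANSION.sub(store_expansion, text)   # steps 1 and 2
    text = CALL.sub(store_call, text)             # steps 3 to 5
    return text, expansions, calls

sample = ("à la porte des Tournelles[1].\n\n"
          "[1] La Bastille, qui fut prise par le peuple de\n"
          "Paris, le 14 juillet 1789, puis démolie. B.\n\n"
          "Quel était en chemin")
marked, expansions, calls = mark_footnotes(sample)
print(marked)
```

  The order of the passes matters: once the expansions are replaced, any
  remaining marker can only be a call mark.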



Footnotes – Example (after marking)

  On fit un inventaire de son argent comptant, et on
  le mena dans le château que fit construire le roi
  Charles V, fils de Jean II, auprès de la rue
  Saint-Antoine, à la porte des Tournelles_fnr29_.
  _fne8_


   ^L Quel était en chemin l’étonnement de l’Ingénu!
  je vous le laisse à penser. Il crut d’abord
  que c’était un rêve.

                                              Oeuvres de Voltaire, Voltaire


Words and characters



     translineations
     text encoding
     ...
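
  For translineations (words hyphenated across a line break, like
  "proté-" / "geait" in the La Jangada excerpt), a naive Python sketch is
  shown below. It assumes every end-of-line hyphen between two word
  fragments is a translineation; `join_translineations` is a hypothetical
  name, and words split across a page break are not handled.

```python
import re

# A word fragment, a hyphen at the end of the line, and the rest of the
# word at the start of the next line.
HYPHEN_BREAK = re.compile(r"(\w+)-[ \t]*\n[ \t]*(\w+)[ \t]*\n?")

def join_translineations(text):
    """Move the tail of each broken word up to the first line.
    Returns the new text and the number of words rejoined."""
    return HYPHEN_BREAK.subn(lambda m: m.group(1) + m.group(2) + "\n", text)

sample = "Une légère véranda, qui en proté-\ngeait la partie antérieure"
joined, count = join_translineations(sample)
print(joined)
print(count)
```

  A legitimate compound that happens to break at its hyphen would be joined
  wrongly, so a dictionary lookup on the joined word would be a sensible
  extra check.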




Report


     Previous steps produce a report
     It summarizes what was found, what was
     assumed and what was done
     The main goal is to support diagnosing the
     program's behavior, so that whatever is wrong
     can be corrected manually




  livros/_FR_15.pdf.txt:
      footers=[’( Page) = 241’]
      headers=[
        "(La maison x{e0} vapeur Jules Verne) = 241"]
      ctrL=1;
      pagnum_ctrL=241;

     sectionsO=2;
     sectionsN=30;

     word_tr=58;
     words=118036;


Commit


    Final and irreversible step which removes all
    the custom marks added by the previous steps
    Outputs a cleaned copy of the document
    This is the last stage before the alignment (or
    any other further processing)
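
    As a rough Python illustration of the commit step, the sketch below
    deletes the four kinds of marks that appear in the example slides
    (`_pb2_`, `_sec+O:PARTE=PRIMEIRA_`, `_fnr29_`, `_fne8_`). The pattern is
    inferred from those examples only. Whether the real step restores any
    text, such as section titles, is not stated in the slides, so this
    version simply drops the marks.

```python
import re

# Marks as they appear in the example slides; the pattern is an assumption.
MARK = re.compile(r"[ \t]*_(?:pb\d+|sec[^_\s]*|fnr\d+|fne\d+)_")

def commit(text):
    """Remove all custom marks and collapse the blank lines they leave behind."""
    text = MARK.sub("", text)
    return re.sub(r"\n{3,}", "\n\n", text)

marked = ("de sa personne. _pb2_\n\nLe faquir\n"
          "des Tournelles_fnr29_.\n_fne8_\n\n\nQuel était")
print(commit(marked))
```

    Because the marks are gone afterwards, any manual correction based on
    the report has to happen before this step.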




Conclusions, wish list and future work




Conclusions and wish list


     There is no de facto standard format for plain
     text books (documents?)
     Documents are very heterogeneous
     (provenance, type and quantity, notation
     formats, ...)
     Hurrah to regular expressions!
     20/80 rule applies



     Ontologies and DSLs lead to a better structure
     Common pattern:
           search text
           calculate metrics
           perform action accordingly
     Report generated at the end should present a
     smart summary of what was found and done



Related ongoing work
  Text::Perfide::BookPairs: finds repeated books and
            pairs of books (the same book in different
            languages) within a collection
  Text::Perfide::BookSync: uses the section
            delimitation made by Text::Perfide::BookCleaner
            to perform a structural alignment
  Text::Perfide::CorporaFlow: uses a DSL to guide the
            corpora preparation workflow (to be
            done)
  Text::Perfide::SciPaperCleaner: cleaner for scientific
            papers (to be done)



Future work



     Standoff annotation – no changes in the
     original file until commit
     Export to ebook formats – .fb2, .epub, ...
     ...




CPAN


               Is it on CPAN yet?
       No, but it will be really, really soon!

 Missing
      More and better documentation
      More and better tests



Questions



                                          o/

                                    André Santos
                                  andrefs@cpan.org


Cleaning plain text books with Text::Perfide::BookCleaner

  • 1. Cleaning plain text books with Text::Perfide::BookCleaner Andr´ Santos e andrefs@cpan.org September 23, 2011
  • 2. Introduction Per-Fide 1 Introduction Per-Fide Text alignment Books 2 Text::Perfide::BookCleaner 3 Conclusions, wish list and future work Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 3. Introduction Per-Fide 1 Introduction Per-Fide Text alignment Books 2 Text::Perfide::BookCleaner 3 Conclusions, wish list and future work Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 4. Introduction Per-Fide Project Per-Fide Joint venture between the Computer Science Department and the School of Humanities of the University of Minho Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 5. Introduction Per-Fide Project Per-Fide Joint venture between the Computer Science Department and the School of Humanities of the University of Minho Portuguese in parallel with six languages: Espa˜ol, Russian, Fran¸ais, Italiano, Deutsch, n c English Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 6. Introduction Per-Fide Project Per-Fide Joint venture between the Computer Science Department and the School of Humanities of the University of Minho Portuguese in parallel with six languages: Espa˜ol, Russian, Fran¸ais, Italiano, Deutsch, n c English Build parallel corpora that will establish a relation between Portuguese and the other 6 languages Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 7. Introduction Per-Fide [Parallel] Corpora Corpora Collection of natural language texts Parallel corpora Collection of nat. lang. bitexts Bitext Pair formed by a text in a given language and its translation in another language, frequently aligned. Alignment Mapping between the sentences/paragraphs/words of one text and the other. Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 8. Introduction Per-Fide Project Per-Fide Original texts in the seven languages and their translations Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 9. Introduction Per-Fide Project Per-Fide Original texts in the seven languages and their translations Two main genres: contemporary fiction and non-fiction Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 10. Introduction Per-Fide Project Per-Fide Original texts in the seven languages and their translations Two main genres: contemporary fiction and non-fiction non-fiction: judicial, journalistic, religious, technical, ... Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 11. Introduction Per-Fide Project Per-Fide Original texts in the seven languages and their translations Two main genres: contemporary fiction and non-fiction non-fiction: judicial, journalistic, religious, technical, ... fiction: contemporary novels and short stories Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 12. Introduction Per-Fide Project Per-Fide Original texts in the seven languages and their translations Two main genres: contemporary fiction and non-fiction non-fiction: judicial, journalistic, religious, technical, ... fiction: contemporary novels and short stories per-fide.di.uminho.pt Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 13. Introduction Text alignment Text alignment Manual or automatic Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 14. Introduction Text alignment Text alignment Manual or automatic Paragraph/sentence/word level Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 15. Introduction Text alignment Text alignment Manual or automatic Paragraph/sentence/word level Automatic alignment tools/algorithms generally fall into three categories: Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 16. Introduction Text alignment Text alignment Manual or automatic Paragraph/sentence/word level Automatic alignment tools/algorithms generally fall into three categories: length based: “when two sentences correspond, the words in them also correspond” Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 17. Introduction Text alignment Text alignment Manual or automatic Paragraph/sentence/word level Automatic alignment tools/algorithms generally fall into three categories: length based: “when two sentences correspond, the words in them also correspond” lexical/dictionary based: relies on lexical information or dictionaries to perform the alignment Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 18. Introduction Text alignment Text alignment Manual or automatic Paragraph/sentence/word level Automatic alignment tools/algorithms generally fall into three categories: length based: “when two sentences correspond, the words in them also correspond” lexical/dictionary based: relies on lexical information or dictionaries to perform the alignment partial similarity (cognates) based: relies on occurrences of tokens graphically or otherwise identical (cognates) Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 19. Introduction Text alignment Text alignment – Example Table: Extract of sentence-level alignment performed using Portuguese and Russian subtitles from the movie Tron. Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 20. Introduction Books Books Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 21. Introduction Books Books Obtained directly from publishers or, if in public domain, from Project Gutenberg and similar projects Large variety of formats: PDF, MS Word, HTML, ebook formats, ... Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 22. Introduction Books Books Obtained directly from publishers or, if in public domain, from Project Gutenberg and similar projects Large variety of formats: PDF, MS Word, HTML, ebook formats, ... If not already in plain text, they need to be converted before the alignment Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 23. Introduction Books Books Obtained directly from publishers or, if in public domain, from Project Gutenberg and similar projects Large variety of formats: PDF, MS Word, HTML, ebook formats, ... If not already in plain text, they need to be converted before the alignment This is where all the trouble starts! Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 24. Introduction Books Book alignment problems pagination – page numbers, headers, footers, . . . previous text formatting – sub/superscript, bold, italics, . . . sections paragraphs translineations and transpaginations footnotes text encoding ... Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 25. Introduction Books Book alignment problems – Example (. . . ) gaiement. Sur le devant s<92>’ouvrait la porte d<92>’entr´e, donnant acc`s dans la salle commune. e e Une l´g`re v´randa, qui en prot´- e e e e <96>- 86 <96>- ^L geait la partie ant´rieure contre l<92>’action e des rayons solaires, reposait sur de sveltes bambous. Le tout ´tait peint d<92>’une fra^che e ı (. . . ) La Jangada, Jules Verne Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 26. Text::Perfide::BookCleaner 1 Introduction Per-Fide Text alignment Books 2 Text::Perfide::BookCleaner 3 Conclusions, wish list and future work Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 27. Text::Perfide::BookCleaner 1 Introduction Per-Fide Text alignment Books 2 Text::Perfide::BookCleaner 3 Conclusions, wish list and future work Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 28. Text::Perfide::BookCleaner First approach RegExp + Find & Replace Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 29. Text::Perfide::BookCleaner First approach RegExp + Find & Replace Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 30. Text::Perfide::BookCleaner First approach Well-intentioned but: Too na¨ıve Big mess A more sofisticated approach was needed! Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 31. Text::Perfide::BookCleaner Architecture Build a pipeline; each step handles a specific set of problems. 1 pages 2 sections 3 paragraphs 4 footnotes 5 chars 6 ... Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 32. Text::Perfide::BookCleaner Architecture Build a pipeline; each step handles a specific set of problems. 1 pages 2 sections 3 paragraphs 4 footnotes 5 chars 6 ... 7 commit Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 33. Text::Perfide::BookCleaner Architecture Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 34. Text::Perfide::BookCleaner Architecture whenever possible, use ontologies and DSLs they help organizing stuff they allow to abstract from the code and discuss details at a higher level (even with people from other areas) Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 35. Text::Perfide::BookCleaner Pages Goal Identify and remove from text elements related to book pagination: page numbers headers footers page breaks These elements often lead to a bad performance of the aligner. Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 36. Text::Perfide::BookCleaner Pages – Example est vrai qu’il fallait etre assez chanceux pour ^ rencontrer le nabab, et assez audacieux pour s’emparer de sa personne. Page 3 ^L La maison ` vapeur a Jules Verne Le faquir, - evidemment le seul entre tous ´ que ne surexcit^t pas l’espoir de gagner la a prime, - filait au milieu des groupes, s’arr^tant e La Maison ` Vapeur, Jules Verne a Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 37. Text::Perfide::BookCleaner Pages – Algorithm 1 identify page breaks (e.g., ^L ) 2 nearby: candidates to headers and footers 3 count the occurrences of each normalized candidate 4 headers and footers are extracted from candidates which occur more thant a threshold value 5 replace everything with a custom mark 6 move all the necessary information to a standoff file Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 38. Text::Perfide::BookCleaner Pages – Example est vrai qu’il fallait etre assez chanceux pour ^ rencontrer le nabab, et assez audacieux pour s’emparer de sa personne. Page 3 ^L La maison ` vapeur a Jules Verne Le faquir, - evidemment le seul entre tous ´ que ne surexcit^t pas l’espoir de gagner la a prime, - filait au milieu des groupes, s’arr^tant e La Maison ` Vapeur, Jules Verne a Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 39. Text::Perfide::BookCleaner Pages – Example est vrai qu’il fallait etre assez chanceux pour ^ rencontrer le nabab, et assez audacieux pour s’emparer de sa personne. _pb2_ Le faquir, - evidemment le seul entre tous ´ que ne surexcit^t pas l’espoir de gagner la a prime, - filait au milieu des groupes, s’arr^tant e La Maison ` Vapeur, Jules Verne a Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 40. Text::Perfide::BookCleaner Sections Goal Identify and normalize the divisions between the several sections of a book (parts, chapters, acts, scenes, epilogue, afterword, ...) An ontology was created, containing types of divisions and subdivisions, in several languages. Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 41. Text::Perfide::BookCleaner Sections – Ontology Example cap PT cap´tulo, cap, capitulo ı FR chapitre, chap EN chapter, chap NT sec PT fim FR fin EN the_end BT _alone This ontology is used to automatically generate a parte of the code. Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 42. Text::Perfide::BookCleaner Sections – Example PRIMEIRA PARTE FANTINE ^L LIVRO PRIMEIRO UM JUSTO O abade Myriel Em 1815, era bispo de Digne, o reverendo Carlos Francisco Bemvindo Myriel, o qual contava setenta e Os Miser´veis, Vitor Hugo a Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 43. Text::Perfide::BookCleaner Sections – Algorithm 1 Search for potential sections divisions: lines with keywords – cap´ıtulo, chapter, Chap., Appendix, Table des Mati´res, . . . e pages or lines containing only numbers roman numbering ... 2 Insert a custom mark immediately before the section identified Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 44. Text::Perfide::BookCleaner Sections – Example PRIMEIRA PARTE FANTINE ^L LIVRO PRIMEIRO UM JUSTO O abade Myriel Em 1815, era bispo de Digne, o reverendo Carlos Francisco Bemvindo Myriel, o qual contava setenta e Os Miser´veis, Vitor Hugo a Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 45. Text::Perfide::BookCleaner Sections – Example _sec+O:PARTE=PRIMEIRA_ FANTINE _sec+O:LIVRO=PRIMEIRO_ UM JUSTO O abade Myriel Em 1815, era bispo de Digne, o reverendo Carlos Francisco Bemvindo Myriel, o qual contava setenta e Os Miser´veis, Vitor Hugo a Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 46. Text::Perfide::BookCleaner Sections Identifying the different parts within a bitext: allows to subsequently compare the two versions and remove parts which can only be found in one of them allows to perform a structural alignment1 1 Text::Perfide::BookSync Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 47. Text::Perfide::BookCleaner Paragraphs Goal Handles things related with identifying and normalizing paragraph notation, direct speech, etc. Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 48. Text::Perfide::BookCleaner Paragraphs – Example L’h^tesse prit la d´fense de son cur´: o e e - D’ailleurs, il en plierait quatre comme vous sur son genou. Il a, l’ann´e derni`re, aid´ nos gens a e e e ` rentrer la paille; il en portait jusqu’` six bottes a a la fois, tant il est fort! ` - Bravo! dit le pharmacien. Envoyez donc vos filles en confesse a des gaillards d’un temp´rament pareil! ` e Moi, si j’´tais le gouvernement, je voudrais qu’on e saign^t les pr^tres une fois par mois. a e Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 49. Text::Perfide::BookCleaner Paragraphs – Example L’h^tesse prit la d´fense de son cur´: o e e "D’ailleurs, il en plierait quatre comme vous sur son genou. Il a, l’ann´e derni`re, aid´ nos gens a e e e ` rentrer la paille; il en portait jusqu’` six bottes a a la fois, tant il est fort! " ` "Bravo!" dit le pharmacien. "Envoyez donc vos filles en confesse a des gaillards d’un temp´rament pareil! ` e Moi, si j’´tais le gouvernement, je voudrais qu’on e saign^t les pr^tres une fois par mois." a e Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 50. Text::Perfide::BookCleaner Paragraphs – Algorithm paragraph identification is performed by calculating metrics based on the number of blank lines and indentation identification and normalization of direct speech: punctuation, paragraph, dash text in quotes Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 51. Text::Perfide::BookCleaner Footnotes Goal Identify and remove footnote callmarks and footnote expansions Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 52. Text::Perfide::BookCleaner Footnotes – Example On fit un inventaire de son argent comptant, et on le mena dans le ch^teau que fit construire le roi a Charles V, fils de Jean II, aupr`s de la rue e Saint-Antoine, a la porte des Tournelles[1]. ` [1] La Bastille, qui fut prise par le peuple de Paris, le 14 juillet 1789, puis d´molie. B. e ^L Quel etait en chemin l’´tonnement de l’Ing´nu! ´ e e je vous le laisse a penser. Il crut d’abord ` que c’´tait un r^ve. e e Oeuvres de Voltaire, Voltaire Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 53. Text::Perfide::BookCleaner Footnotes – Algorithm 1 Search for footnote expansions (lines beggining with <<1>>, [2], ^3, . . . ) 2 Replace with custom mark 3 Only footnote call marks left 4 Search again for the same patterns in the middle of the text 5 Replace with custom mark Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 54. Text::Perfide::BookCleaner Footnotes – Algorithm On fit un inventaire de son argent comptant, et on le mena dans le ch^teau que fit construire le roi a Charles V, fils de Jean II, aupr`s de la rue e Saint-Antoine, a la porte des Tournelles[1]. ` [1] La Bastille, qui fut prise par le peuple de Paris, le 14 juillet 1789, puis d´molie. B. e (fbox^LQuel ´tait en chemin l’´tonnement de l’Ing´nu! e e e je vous le laisse a penser. Il crut d’abord ` que c’´tait un r^ve. e e Oeuvres de Voltaire, Voltaire Andr´ Santos andrefs@cpan.org e Cleaning plain text books with Text::Perfide::BookCleaner
  • 55. Text::Perfide::BookCleaner Footnotes – Algorithm (after): On fit un inventaire de son argent comptant, et on le mena dans le château que fit construire le roi Charles V, fils de Jean II, auprès de la rue Saint-Antoine, à la porte des Tournelles_fnr29_. _fne8_ ^L Quel était en chemin l’étonnement de l’Ingénu! je vous le laisse à penser. Il crut d’abord que c’était un rêve. (Oeuvres de Voltaire, Voltaire)
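The two-pass footnote marking shown on these slides might look something like the sketch below. It is written in Python for illustration and is not the module's Perl code. It assumes each footnote expansion fits on a single line, and it numbers the marks sequentially; how the real module arrives at numbers such as `_fnr29_` and `_fne8_` is not stated in the slides. The function name `mark_footnotes` is hypothetical.

```python
import re

# The three mark forms listed on the slide: <<1>>, [2], ^3.
MARK = r"(?:<<\d+>>|\[\d+\]|\^\d+)"
# Pass 1: a line that begins with a mark is treated as a footnote expansion.
EXPANSION = re.compile(rf"^[ \t]*{MARK}.*$", re.MULTILINE)
# Pass 2: any mark still present must be a call mark inside running text.
CALL = re.compile(MARK)


def mark_footnotes(text):
    """Replace expansions, then call marks, with custom marks."""
    expansions, calls = [], []

    def stash_expansion(match):
        expansions.append(match.group(0).strip())
        return f"_fne{len(expansions)}_"  # assumed numbering scheme

    def stash_call(match):
        calls.append(match.group(0))
        return f"_fnr{len(calls)}_"  # assumed numbering scheme

    text = EXPANSION.sub(stash_expansion, text)
    text = CALL.sub(stash_call, text)
    return text, calls, expansions
```

Doing expansions first matters: once they are replaced, every remaining match of the same pattern can safely be read as a call mark, which is the point of step 3 in the algorithm.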
  • 56. Text::Perfide::BookCleaner Words and characters: translineations, text encoding, ...
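The slide does not say how translineations (words hyphenated across a line break) are handled. One plausible approach, sketched here in Python under that caveat, is to rejoin the two halves with a regular expression and count how many were found; the report slide shows a `word_tr` field that presumably holds such a count. The name `join_translineations` is hypothetical.

```python
import re

# A word fragment, a hyphen at the end of the line, then the rest of the
# word at the start of the next line (optionally indented).
TRANSLINEATION = re.compile(r"(\w+)-\n[ \t]*(\w+)")


def join_translineations(text):
    """Rejoin split words; return (new_text, number_of_joins)."""
    return TRANSLINEATION.subn(r"\1\2", text)
```

This naive version also removes the hyphen from genuine compounds that happen to break at the line end (for example "well-" followed by "known"), so a dictionary check would be a sensible refinement.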
  • 57. Text::Perfide::BookCleaner Report: Previous steps produce a report. It summarizes what was found, what was assumed and what was done. Its main goal is to let the user diagnose what the program did and manually correct what is wrong.
  • 58. Text::Perfide::BookCleaner Report livros/_FR_15.pdf.txt: footers=[’( Page) = 241’] headers=[ "(La maison x{e0} vapeur Jules Verne) = 241"] ctrL=1; pagnum_ctrL=241; sectionsO=2; sectionsN=30; word_tr=58; words=118036;
  • 59. Text::Perfide::BookCleaner Commit: Final and irreversible step which removes all the custom marks added by the previous steps. Outputs a cleaned copy of the document. This is the last stage before the alignment (or any other further processing).
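A commit step of this kind could be sketched as below, in Python and for illustration only. It handles just the two footnote mark shapes visible on slide 55 (`_fnr29_` and `_fne8_`); the marks used by the other cleaning steps are not shown in the slides and are not covered. The function name `commit` mirrors the slide title but is otherwise an assumption.

```python
import re

# A line holding nothing but a custom mark (e.g. a replaced footnote expansion).
MARK_ONLY_LINE = re.compile(r"^[ \t]*_fn[re]\d+_[ \t]*\n", re.MULTILINE)
# A custom mark embedded in running text (e.g. a replaced call mark).
INLINE_MARK = re.compile(r"_fn[re]\d+_")


def commit(marked_text):
    """Return a cleaned copy with all footnote custom marks removed."""
    text = MARK_ONLY_LINE.sub("", marked_text)  # drop mark-only lines entirely
    return INLINE_MARK.sub("", text)            # then strip inline marks
```

Because the function returns a new string and leaves its input untouched, the marked version can be kept for inspection until the output has been checked.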
  • 60. Conclusions, wish list and future work. Outline: 1. Introduction (Per-Fide, Text alignment, Books); 2. Text::Perfide::BookCleaner; 3. Conclusions, wish list and future work.
  • 62. Conclusions, wish list and future work. Conclusions and wish list: There is no de facto standard format for plain text books (documents?). Documents are highly heterogeneous (provenance, type and quantity, notation formats, ...). Hurrah for regular expressions! The 20/80 rule applies.
  • 63. Conclusions, wish list and future work. Conclusions and wish list: Ontologies and DSLs lead to a better structure. Common pattern: search the text, calculate metrics, perform an action accordingly. The report generated at the end should present a smart summary of what was found and done.
  • 67. Conclusions, wish list and future work. Related ongoing work: Text::Perfide::BookPairs finds repeated books and pairs of books (same book in different languages) within a collection. Text::Perfide::BookSync uses the section delimitation made by T::P::BC to make a structural alignment. Text::Perfide::CorporaFlow uses a DSL to guide the corpora preparation workflow (to be done). Text::Perfide::SciPaperCleaner is a cleaner for scientific papers (to be done).
  • 68. Conclusions, wish list and future work. Future work: Standoff annotation (no changes in the original file until commit). Export to ebook formats (.fb2, .epub, ...). ...
  • 69. Conclusions, wish list and future work. CPAN: Is it on CPAN yet?
  • 70. Conclusions, wish list and future work. CPAN: Is it on CPAN yet? No, but it will be really, really soon! Missing: more and better documentation; more and better tests.
  • 71. Conclusions, wish list and future work. Questions o/ André Santos, andrefs@cpan.org
  • 72. Cleaning plain text books with Text::Perfide::BookCleaner. André Santos, andrefs@cpan.org. September 23, 2011