Text-based Information Retrieval (T-IR)

A workshop organized by Benno Stein.

Content Extraction (from Web pages)

Filter lines based on their content-to-tag ratio.

 a <- ASCII characters per line
 t <- tags per line

-> compute a/t per line (use t = 1 if t = 0) and keep lines whose ratio exceeds a threshold
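
A minimal sketch of this filter in Python; the crude regex used to count tags and the threshold of 10 are my own assumptions, not values from the workshop:

    import re

    TAG_RE = re.compile(r"<[^>]+>")  # crude tag matcher, good enough for a sketch

    def content_tag_ratio(line: str) -> float:
        """a/t for one line: plain-text characters per tag."""
        t = len(TAG_RE.findall(line))          # t: tags on this line
        a = len(TAG_RE.sub("", line).strip())  # a: characters left after stripping tags
        return a / max(t, 1)                   # treat t as 1 if the line has no tags

    def extract_content(html: str, threshold: float = 10.0) -> str:
        """Keep only lines whose content-to-tag ratio reaches the threshold."""
        kept = [TAG_RE.sub("", line) for line in html.splitlines()
                if content_tag_ratio(line) >= threshold]
        return "\n".join(kept)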

points to consider:

  • problem: the algorithm might remove titles, since title lines have a high tag-to-text ratio; in most cases the title's words also appear in the body of the document, so missing the title is not that bad
  • the importance of precision vs. recall depends on the use case
    • database of criminals (=> high recall matters)
    • Google search (=> high precision; a user can only read a small fraction of all available documents)

  • Content Code Blurring (by Thomas Gottron et al.); a rough sketch follows at the end of these notes
    • sliding window: document slope curve (DSC)
    • link quota filters often block too many hyperlinks => they remove blocks that are too big (a problem especially with Wikipedia, ...)
    • idea: find regions with a lot of content and little code (similar formatting)
    • blurring -> apply a Gaussian function, so the sharp content/code (black/white) distinction turns into shades of grey
    • multiple blurring passes => a fuzzier result => increases recall
    • starting point: remove scripts and whitespace (\n, " ", ...) before processing
    • baselines for the comparisons
      • high recall corpus: extract all the plain text => recall = 1; precision: worst
      • gold standard: manually copy and paste the text from the web pages
  • Suggestion from Benno Stein for a final approach: screenshots + OCR, because the displayed page and the source code will diverge even more strongly in the future; extension: render only parts of the page (e.g. without pictures). A sketch of this idea also follows at the end of these notes.
  • Other Approaches
    • Crunch framework: WWW '03 (Gupta, Kaiser, Neistadt, Grimm) -> DOM-based
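
A rough sketch of the Content Code Blurring idea mentioned above, under my own simplifying assumptions: every character is scored 1 (content) or 0 (markup), the score vector is repeatedly smoothed with a Gaussian kernel, and characters whose smoothed score stays above a threshold are kept. Kernel radius, sigma, number of passes and the threshold are arbitrary choices here, not the values from Gottron's paper.

    import math
    import re

    TAG_RE = re.compile(r"<[^>]+>")

    def gaussian_kernel(radius: int = 10, sigma: float = 5.0) -> list[float]:
        """Normalized 1-D Gaussian kernel of width 2*radius + 1."""
        weights = [math.exp(-(i * i) / (2 * sigma * sigma))
                   for i in range(-radius, radius + 1)]
        total = sum(weights)
        return [w / total for w in weights]

    def blur(values: list[float], kernel: list[float], radius: int) -> list[float]:
        """One smoothing pass: convolve the vector with the kernel (edges clipped)."""
        out = []
        for i in range(len(values)):
            acc = 0.0
            for j, w in enumerate(kernel):
                k = i + j - radius
                if 0 <= k < len(values):
                    acc += w * values[k]
            out.append(acc)
        return out

    def content_code_blurring(html: str, passes: int = 3, threshold: float = 0.7) -> str:
        # Mark every character: 1.0 = content, 0.0 = inside a tag ("code").
        in_tag, vector = False, []
        for ch in html:
            if ch == "<":
                in_tag = True
            vector.append(0.0 if in_tag else 1.0)
            if ch == ">":
                in_tag = False
        radius = 10
        kernel = gaussian_kernel(radius)
        for _ in range(passes):  # more blurring passes => fuzzier => higher recall
            vector = blur(vector, kernel, radius)
        kept = "".join(ch for ch, v in zip(html, vector) if v >= threshold)
        return TAG_RE.sub("", kept)  # strip tag fragments that survived the threshold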
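
And a hedged sketch of the screenshots + OCR suggestion: render the page in a headless browser, take a screenshot, and run OCR over the image. Selenium with headless Firefox and pytesseract are my own choice of tooling; the workshop only named the general idea.

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    from PIL import Image
    import pytesseract  # requires a local Tesseract installation

    def extract_via_ocr(url: str, screenshot_path: str = "page.png") -> str:
        """Render a page headlessly, screenshot it, and OCR the visible text."""
        options = Options()
        options.add_argument("--headless")           # render without a visible window
        driver = webdriver.Firefox(options=options)
        try:
            driver.get(url)
            driver.save_screenshot(screenshot_path)  # capture what the user actually sees
        finally:
            driver.quit()
        return pytesseract.image_to_string(Image.open(screenshot_path))

Rendering only parts of the page (e.g. without pictures) could be approximated by disabling image loading in the browser preferences before the screenshot is taken.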