Text-based Information Retrieval (T-IR)

A workshop organized by Benno Stein.

Content Extraction (from Web pages)

Filter lines based on their content-to-tag ratio.

 a <- ASCII characters per line
 t <- tags per line

-> compute a/t per line (use t = 1 if t = 0) and keep lines whose ratio exceeds a threshold
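
A minimal sketch of this filter in Python; the crude regex used to count tags and the threshold of 10 are my own assumptions, not values from the workshop:

    import re

    TAG_RE = re.compile(r"<[^>]+>")  # crude tag matcher, good enough for a sketch

    def content_tag_ratio(line: str) -> float:
        """a/t for one line: plain-text characters per tag."""
        t = len(TAG_RE.findall(line))          # t: tags on this line
        a = len(TAG_RE.sub("", line).strip())  # a: characters left after stripping tags
        return a / max(t, 1)                   # treat t as 1 if the line has no tags

    def extract_content(html: str, threshold: float = 10.0) -> str:
        """Keep only lines whose content-to-tag ratio reaches the threshold."""
        kept = [TAG_RE.sub("", line) for line in html.splitlines()
                if content_tag_ratio(line) >= threshold]
        return "\n".join(kept)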

points to consider:

  • problem: the algorithm might remove titles, since title lines have a high tag-to-text ratio; in most cases the title's words also appear in the body of the document, so missing the title is not that bad
  • the importance of precision vs. recall depends on the use case
    • database of criminals (=> high recall matters)
    • Google search (=> high precision; a user can only read a small fraction of all available documents)

  • Content Code Blurring (by Thomas Gottron et al.); a rough sketch follows at the end of these notes
    • sliding window: document slope curve (DSC)
    • link quota filters often block too many hyperlinks => they remove blocks that are too big (a problem especially with Wikipedia, ...)
    • idea: find regions with a lot of content and little code (similar formatting)
    • blurring -> apply a Gaussian function, so the sharp content/code (black/white) distinction turns into shades of grey
    • multiple blurring passes => a fuzzier result => increases recall
    • starting point: remove scripts and whitespace (\n, " ", ...) before processing
    • baselines for the comparisons
      • high recall corpus: extract all the plain text => recall = 1; precision: worst
      • gold standard: manually copy and paste the text from the web pages
  • Suggestion from Benno Stein for a final approach: screenshots + OCR, because the displayed page and the source code will diverge even more strongly in the future; extension: render only parts of the page (e.g. without pictures). A sketch of this idea also follows at the end of these notes.
  • Other Approaches
    • Crunch framework: WWW '03 (Gupta, Kaiser, Neistadt, Grimm) -> DOM-based
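
A rough sketch of the Content Code Blurring idea mentioned above, under my own simplifying assumptions: every character is scored 1 (content) or 0 (markup), the score vector is repeatedly smoothed with a Gaussian kernel, and characters whose smoothed score stays above a threshold are kept. Kernel radius, sigma, number of passes and the threshold are arbitrary choices here, not the values from Gottron's paper.

    import math
    import re

    TAG_RE = re.compile(r"<[^>]+>")

    def gaussian_kernel(radius: int = 10, sigma: float = 5.0) -> list[float]:
        """Normalized 1-D Gaussian kernel of width 2*radius + 1."""
        weights = [math.exp(-(i * i) / (2 * sigma * sigma))
                   for i in range(-radius, radius + 1)]
        total = sum(weights)
        return [w / total for w in weights]

    def blur(values: list[float], kernel: list[float], radius: int) -> list[float]:
        """One smoothing pass: convolve the vector with the kernel (edges clipped)."""
        out = []
        for i in range(len(values)):
            acc = 0.0
            for j, w in enumerate(kernel):
                k = i + j - radius
                if 0 <= k < len(values):
                    acc += w * values[k]
            out.append(acc)
        return out

    def content_code_blurring(html: str, passes: int = 3, threshold: float = 0.7) -> str:
        # Mark every character: 1.0 = content, 0.0 = inside a tag ("code").
        in_tag, vector = False, []
        for ch in html:
            if ch == "<":
                in_tag = True
            vector.append(0.0 if in_tag else 1.0)
            if ch == ">":
                in_tag = False
        radius = 10
        kernel = gaussian_kernel(radius)
        for _ in range(passes):  # more blurring passes => fuzzier => higher recall
            vector = blur(vector, kernel, radius)
        kept = "".join(ch for ch, v in zip(html, vector) if v >= threshold)
        return TAG_RE.sub("", kept)  # strip tag fragments that survived the threshold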
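
And a hedged sketch of the screenshots + OCR suggestion: render the page in a headless browser, take a screenshot, and run OCR over the image. Selenium with headless Firefox and pytesseract are my own choice of tooling; the workshop only named the general idea.

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    from PIL import Image
    import pytesseract  # requires a local Tesseract installation

    def extract_via_ocr(url: str, screenshot_path: str = "page.png") -> str:
        """Render a page headlessly, screenshot it, and OCR the visible text."""
        options = Options()
        options.add_argument("--headless")           # render without a visible window
        driver = webdriver.Firefox(options=options)
        try:
            driver.get(url)
            driver.save_screenshot(screenshot_path)  # capture what the user actually sees
        finally:
            driver.quit()
        return pytesseract.image_to_string(Image.open(screenshot_path))

Rendering only parts of the page (e.g. without pictures) could be approximated by disabling image loading in the browser preferences before the screenshot is taken.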