Text-based Information Retrieval (T-IR)
A workshop organized by Benno Stein.
Content Extraction (from Web pages)
Filter lines based on the content-to-tags ratio:
  a <- ASCII characters per line
  t <- tags per line
  => compute a/t (with t = 1 if t = 0, to avoid division by zero)
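A minimal sketch of this line filter, assuming line-wise processing of the raw HTML; the tag regex, the threshold value, and counting only characters outside tags are illustrative assumptions, not part of the original description.

```python
import re

TAG = re.compile(r"<[^>]+>")

def filter_lines(html, threshold=10.0):
    """Keep lines whose characters-per-tag ratio a/t exceeds a threshold.

    threshold=10.0 is a hypothetical cut-off chosen for illustration.
    """
    kept = []
    for line in html.splitlines():
        t = len(TAG.findall(line))             # t: tags per line
        text = TAG.sub("", line).strip()       # line with tags stripped
        a = len(text)                          # a: characters per line
        ratio = a / max(t, 1)                  # a/t, with t = 1 when t = 0
        if ratio >= threshold:
            kept.append(text)
    return "\n".join(kept)

if __name__ == "__main__":
    page = ("<div><a href='/'>Home</a> <a href='/about'>About</a></div>\n"
            "<p>This paragraph carries the actual article text of the page.</p>")
    print(filter_lines(page))   # keeps only the paragraph line
```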
points to consider:
- problem: the algorithm might remove titles; however, the title text is usually also present in the body of the paper (=> not too bad if you don't get the title)
- importance of precision vs. recall - depends on the use case
- database of criminals (=> high recall)
- Google search (=> high precision, because I can only read a small fraction of all available documents)
- Content Code Blurring (by Thomas Gottron et al.)
- sliding window: document slope curve (DSC)
- link quota functions often block too many hyperlinks => they remove overly large blocks (a problem especially with Wikipedia, ...)
- idea: find regions with a lot of content and little code (similar formatting); see the sketch below
- blurring -> apply a Gaussian function => the sharp black/white content-vs-code picture turns into shades of grey
- blurring multiple times => fuzzier algorithm => increases recall
- starting point for content removal: scripts and whitespace (\n, " ", ...)
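A rough character-level sketch of the blurring idea under stated assumptions: the 0/1 content-vs-code signal, the Gaussian kernel parameters, the threshold, and the number of blur passes are illustrative choices, not Gottron's exact token-based formulation.

```python
import re
import numpy as np

def content_code_signal(html):
    """1.0 for characters outside tags (content), 0.0 for characters inside tags (code)."""
    signal, in_tag = [], False
    for ch in html:
        if ch == "<":
            in_tag = True
        signal.append(0.0 if in_tag else 1.0)
        if ch == ">":
            in_tag = False
    return np.array(signal)

def gaussian_kernel(sigma=20.0, radius=60):
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def extract_content(html, threshold=0.7, passes=2):
    # starting point: drop scripts and collapse whitespace (\n, " ", ...)
    cleaned = re.sub(r"<script.*?</script>", " ", html, flags=re.S | re.I)
    cleaned = re.sub(r"\s+", " ", cleaned)
    raw = content_code_signal(cleaned)
    smooth = raw.copy()
    for _ in range(passes):                    # blurring multiple times => fuzzier => higher recall
        smooth = np.convolve(smooth, gaussian_kernel(), mode="same")
    # keep text characters that lie in a region dominated by content
    return "".join(ch for ch, r, s in zip(cleaned, raw, smooth) if r == 1.0 and s >= threshold)
```

Increasing sigma or the number of passes smooths over short code interruptions inside article text, which is the recall-increasing effect noted above.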
- baselines for the comparisons
- high-recall corpus (baseline): extract all the plain text => recall = 1; precision: worst case
- gold standard: manually copy and paste the text from the web pages
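To make the baseline comparison concrete, here is a hedged sketch of the high-recall baseline and a bag-of-words precision/recall computation against a manually copy-pasted gold text; the word-level granularity is an assumption, since the notes do not specify how overlap is measured.

```python
import re
from collections import Counter

def high_recall_baseline(html):
    """Extract all plain text: recall = 1 by construction, precision is the worst case."""
    return re.sub(r"<[^>]+>", " ", html)

def precision_recall(extracted, gold):
    """Multiset word overlap against the manually copied gold standard (assumed granularity)."""
    ext, ref = Counter(extracted.lower().split()), Counter(gold.lower().split())
    overlap = sum((ext & ref).values())
    precision = overlap / max(sum(ext.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall

if __name__ == "__main__":
    page = "<div><a href='/'>Home</a></div><p>Actual article text here.</p>"
    gold = "Actual article text here."
    print(precision_recall(high_recall_baseline(page), gold))   # precision < 1, recall = 1
```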
- Suggestion from Benno Stein: final approach: screenshots + OCR (because the displayed page and the source code will diverge even more strongly in the future); extension: render only parts of the page (e.g. without pictures)
- Other Approaches
- Crunch framework: WWW '03 (Gupta, Kaiser, Neistadt, Grimm) -> DOM-based
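The notes mention both link quota functions and DOM-based filtering; the sketch below only illustrates the general idea of a link-quota-style DOM filter (it is not the Crunch framework's actual heuristics) and assumes BeautifulSoup (bs4) is installed; the 0.5 quota threshold and the set of block-level tags are hypothetical choices.

```python
from bs4 import BeautifulSoup

def remove_link_heavy_blocks(html, max_link_quota=0.5):
    """Drop DOM subtrees whose visible text is dominated by anchor text."""
    soup = BeautifulSoup(html, "html.parser")
    for block in soup.find_all(["div", "ul", "ol", "table", "nav"]):
        text = block.get_text(" ", strip=True)
        if not text:
            continue                          # empty block, or already removed with its parent
        link_text = " ".join(a.get_text(" ", strip=True) for a in block.find_all("a"))
        quota = len(link_text) / len(text)    # share of the block's text that sits inside links
        if quota > max_link_quota:
            # removing whole blocks is where recall suffers on link-rich but
            # content-bearing pages (e.g. Wikipedia), as noted above
            block.decompose()
    return soup.get_text(" ", strip=True)
```

On a Wikipedia-style page this would typically drop navigation boxes, but it can also drop link-heavy "See also" sections, which is the over-blocking problem mentioned in the link quota bullet.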