by Panagiotis Ipeirotis et al.
The article presents strategies for choosing optimally between different crawl- and query-based execution strategies (such as scan, filter, ...) for text-centric tasks. It addresses the following trade-offs:
- query-based execution might miss relevant documents
- scan-based strategies take a lot of time
The authors introduce several use cases and define each task as deriving tokens from a large database.
The most interesting points in terms of our current paper projects are:
- definition of the execution time (the model includes training time)
- description of multiple execution strategies which are applied to the model (scan, filter, iterative set expansion, automated query generation)
- query-based strategies outperformed crawling-based approaches for a related data classification task
- statistics are updated at key points during execution to adjust the plan for the rest of the run, following the reoptimization methods described by Kabra and DeWitt
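
To make the last few points concrete for our own work, here is a minimal sketch of how such a choice could look in code. It is not the authors' implementation: the cost formula (training time plus a per-document cost), the class and function names, and all numbers are assumptions made up for illustration. The sketch picks the cheapest strategy whose estimated recall reaches a target, and re-runs the choice when an observed statistic replaces an estimate.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StrategyEstimate:
    """Hypothetical per-strategy statistics (not taken from the paper)."""
    name: str
    training_time: float      # one-off cost, e.g. training a filter or query generator
    time_per_document: float  # assumed cost to retrieve and process one document
    documents_needed: int     # estimated number of documents the strategy touches
    reachable_recall: float   # estimated best recall the strategy can reach

    def execution_time(self) -> float:
        # Simplified cost: training time is counted as part of execution time.
        return self.training_time + self.time_per_document * self.documents_needed


def choose_strategy(estimates, target_recall):
    """Return the cheapest strategy that can reach target_recall, or None."""
    feasible = [e for e in estimates if e.reachable_recall >= target_recall]
    if not feasible:
        return None
    return min(feasible, key=lambda e: e.execution_time())


def reoptimize(estimates, observed_recall, target_recall):
    """Replace estimated recall with observed values, then choose again.

    observed_recall maps a strategy name to the recall measured so far.
    """
    updated = [
        replace(e, reachable_recall=observed_recall.get(e.name, e.reachable_recall))
        for e in estimates
    ]
    return choose_strategy(updated, target_recall)


# Illustrative numbers only.
example_estimates = [
    StrategyEstimate("scan", 0.0, 1.0, 100_000, 1.0),
    StrategyEstimate("filtered_scan", 500.0, 0.4, 100_000, 0.9),
    StrategyEstimate("query", 200.0, 1.0, 5_000, 0.6),
]
```

With these made-up numbers, a low recall target favours the query-based plan, while a high target forces a full scan. If the query-based plan turns out to reach less recall than estimated, calling `reoptimize` with the observed value switches to the next cheapest feasible plan.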