We examine document spanners, a formal framework for information extraction that was introduced by Fagin, Kimelfeld, Reiss, and Vansummeren (PODS 2013, JACM 2015). A document spanner is a function that maps an input string to a relation over spans (intervals of positions of the string). We focus on document spanners that are defined by regex formulas, which are basically regular expressions that map matched subexpressions to corresponding spans, and on core spanners, which extend the former by standard algebraic operators and string equality selection. First, we compare the expressive power of core spanners to three models: patterns, word equations, and a rich and natural subclass of extended regular expressions (regular expressions with a repetition operator). These results are then used to analyze the complexity of query evaluation and various aspects of static analysis of core spanners. Finally, we examine the relative succinctness of different kinds of representations of core spanners and relate this to the simplification of core spanners that are extended with difference operators.
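As an informal illustration of the notions in the abstract, the following minimal Python sketch treats a span as a pair (i, j) of positions and a spanner as a function from a string to a set of spans. It is a brute-force toy, not the formalism or any algorithm from the article. The function names (`all_spans`, `regex_spanner`, `equal_content_pairs`) and the 0-based, half-open span convention are assumptions made here for convenience. `regex_spanner` roughly mimics a one-variable regex formula that binds its variable to any substring matching a given pattern, and `equal_content_pairs` mimics a string equality selection over the product of two such results.

```python
import re


def all_spans(s):
    """Every span (i, j) with 0 <= i <= j <= len(s); its content is s[i:j].

    Assumption: 0-based, half-open spans, chosen to match Python slicing.
    """
    n = len(s)
    return [(i, j) for i in range(n + 1) for j in range(i, n + 1)]


def regex_spanner(s, pattern):
    """Toy one-variable spanner: all spans whose content fully matches `pattern`.

    Brute force over all spans, so it returns every matching span,
    including overlapping ones.
    """
    compiled = re.compile(pattern)
    return {(i, j) for (i, j) in all_spans(s) if compiled.fullmatch(s[i:j])}


def equal_content_pairs(s, spans_x, spans_y):
    """Toy string equality selection: keep pairs of spans with equal content."""
    return {
        (x, y)
        for x in spans_x
        for y in spans_y
        if s[x[0]:x[1]] == s[y[0]:y[1]]
    }


if __name__ == "__main__":
    doc = "abab"
    xs = regex_spanner(doc, "ab")
    print(sorted(xs))
    print(sorted(equal_content_pairs(doc, xs, xs)))
```

The point of the sketch is only that a spanner returns a relation over positions rather than over substrings: both occurrences of "ab" in "abab" are reported as distinct spans, and the equality selection compares the contents those spans refer to.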
History
School
Science
Department
Computer Science
Published in
Theory of Computing Systems
Volume
62
Issue
4
Pages
854 - 898
Citation
FREYDENBERGER, D.D. and HOLLDACK, M., 2018. Document spanners: from expressive power to decision problems. Theory of Computing Systems, 62 (4), pp.854–898.
This work is made available according to the conditions of the Creative Commons Attribution (CC BY 4.0) licence. Full details of this licence are available at: https://creativecommons.org/licenses/by/4.0/
Acceptance date
2017-04-04
Publication date
2017-05-22
Notes
This is an open access article published by Springer and distributed under the terms of the Creative Commons Attribution Licence, https://creativecommons.org/licenses/by/4.0/