main.pdf (810.79 kB)
Download fileComplexity bounds for relational algebra over document spanners
conference contribution
posted on 2019-05-03, 08:55 authored by Liat Peterfreund, Benny Kimelfeld, Dominik FreydenbergerDominik Freydenberger, Markus KrollWe investigate the complexity of evaluating queries in Relational Algebra (RA) over the relations extracted by regex
formulas (i.e., regular expressions with capture variables)
over text documents. Such queries, also known as the regular document spanners, were shown to have an evaluation
with polynomial delay for every positive RA expression (i.e.,
consisting of only natural joins, projections and unions);
here, the RA expression is fixed and the input consists of
both the regex formulas and the document. In this work, we
explore the implication of two fundamental generalizations.
The first is adopting the “schemaless” semantics for spanners, as proposed and studied by Maturana et al. The second
is going beyond the positive RA to allowing the difference
operator.
We show that each of the two generalizations introduces
computational hardness: it is intractable to compute the natural join of two regex formulas under the schemaless semantics, and the difference between two regex formulas under
both the ordinary and schemaless semantics. Nevertheless,
we propose and analyze syntactic constraints, on the RA
expression and the regex formulas at hand, such that the
expressive power is fully preserved and, yet, evaluation can
be done with polynomial delay. Unlike the previous work on
RA over regex formulas, our technique is not (and provably cannot be) based on the static compilation of regex formulas, but rather on an ad-hoc compilation into an automaton
that incorporates both the query and the document. This
approach also allows us to include black-box extractors in
the RA expression.
Funding
The work of Liat Peterfreund was supported by the Technion Hiroshi Fujiwara Cyber Security Research Center and Israel Cyber Bureau. The work of Benny Kimelfeld and Liat Peterfreund was supported by the ISF Grant 1295/15. The work of Markus Kröll was supported by the FWF Grants P30930-N35 and W1255-N23.
History
School
- Science
Department
- Computer Science
Published in
ACM Symposium on Principles of Database Systems (PODS)Citation
PETERFREUND, L. ... et al., 2019. Complexity bounds for relational algebra over document spanners. Presented at the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS 2019), Amsterdam, June 30 - July 05, pp.320-334.Publisher
© ACMVersion
- AM (Accepted Manuscript)
Acceptance date
2019-03-10Publication date
2019Notes
© ACM 2019. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in PODS '19 Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, http://dx.doi.org/10.1145/3294052.3319699.ISBN
9781450362276Publisher version
Language
- en