Thesis-2012-Schmid.pdf (1.54 MB)

# On the membership problem for pattern languages and related topics

thesis
posted on 2012-09-10, 16:17 authored by Markus L. Schmid
In this thesis, we investigate the complexity of the membership problem for pattern languages. A pattern is a string over the union of the alphabets A and X, where X := {x_1, x_2, x_3, ...} is a countable set of variables and A is a finite alphabet containing terminals (e.g., A := {a, b, c, d}). Every pattern, e.g., p := x_1 x_2 a b x_2 b x_1 c x_2, describes a pattern language, i.e., the set of all words that can be obtained by uniformly substituting the variables in the pattern by arbitrary strings over A. Hence, u := cacaaabaabcaccaa is a word of the pattern language of p, since substituting cac for x_1 and aa for x_2 yields u. On the other hand, there is no way to obtain the word u' := bbbababbacaaba by substituting the occurrences of x_1 and x_2 in p by words over A. The problem to decide for a given pattern q and a given word w whether or not w is in the pattern language of q is called the membership problem for pattern languages. Consequently, (p, u) is a positive instance and (p, u') is a negative instance of the membership problem for pattern languages. For the unrestricted case, i.e., for arbitrary patterns and words, the membership problem is NP-complete. In this thesis, we identify classes of patterns for which the membership problem can be solved efficiently. Our first main result in this regard is that the variable distance, i.e., the maximum number of different variables that separate two consecutive occurrences of the same variable, substantially contributes to the complexity of the membership problem for pattern languages. More precisely, for every class of patterns with a bounded variable distance the membership problem can be solved efficiently. The second main result is that the same holds for every class of patterns with a bounded scope coincidence degree, where the scope coincidence degree is the maximum number of intervals that cover a common position in the pattern, where each interval is given by the leftmost and rightmost occurrence of a variable in the pattern. The proof of our first main result is based on automata theory. More precisely, we introduce a new automata model that is used as an algorithmic framework in order to show that the membership problem for pattern languages can be solved in time that is exponential only in the variable distance of the corresponding pattern. We then take a closer look at this automata model and subject it to a sound theoretical analysis. The second main result is obtained in a completely different way. We encode patterns and words as relational structures and we then reduce the membership problem for pattern languages to the homomorphism problem of relational structures, which allows us to exploit the concept of the treewidth. This approach turns out be successful, and we show that it has potential to identify further classes of patterns with a polynomial time membership problem. Furthermore, we take a closer look at two aspects of pattern languages that are indirectly related to the membership problem. Firstly, we investigate the phenomenon that patterns can describe regular or context-free languages in an unexpected way, which implies that their membership problem can be solved efficiently. In this regard, we present several sufficient conditions and necessary conditions for the regularity and context-freeness of pattern languages. Secondly, we compare pattern languages with languages given by so-called extended regular expressions with backreferences (REGEX). The membership problem for REGEX languages is very important in practice and since REGEX are similar to pattern languages, it might be possible to improve algorithms for the membership problem for REGEX languages by investigating their relationship to patterns. In this regard, we investigate how patterns can be extended in order to describe large classes of REGEX languages.

• Science

## Department

• Computer Science

2012

## Notes

A Doctoral Thesis. Submitted in partial fulfilment of the requirements for the award of Doctor of Philosophy of Loughborough University.

## EThOS Persistent ID

uk.bl.ethos.561177

• en