Optimal enumeration of regular pattern matches
Florin Manea (Georg-August Universität Göttingen)
| Wed Apr 8, 13:00-14:00 (7 weeks from now) | |
Abstract: Enumerating all matches of a regular expression with capture variables (variable regex, for short) to a document is a fundamental task in textual information extraction, e.g., in connection to document spanners. State-of-the-art results [Amarilli et al., ACM TODS 2021] show how the matches of a variable regex, representable by a sequential variable-set automaton, to a document can be enumerated with delay constant in the size of the document and polynomial in the size of the query, after a preprocessing requiring time linear in the size of the document and polynomial in the size of the query.
We consider here the matching problem for regular expressions with capture variables which can be represented as regular patterns: $w_0x_0 w_1x_1 \cdots w_{k}x_k w_{k+1}$, where, for $i\in \{0,\ldots,k+1\}$, $w_i$ is a string of terminal letters and, for $i\in \{0,\ldots,k\}$, $x_i$ is a variable. This problem naturally lies in the intersection of information extraction and stringology, as it can also be seen as computing all the ways in which $k+1$ given strings $w_0,\ldots,w_{k+1}$ occur, in this order and without overlaps, in a given text $T$. We provide an optimal enumeration algorithm for this problem. After a preprocessing requiring linear time in the size of the document $T$ and the size of the query pattern $\alpha$, we can enumerate all matches with worst-case constant delay (in combined complexity, i.e., both in the document- and the query-size).
Based on: Paweł Gawrychowski, Florin Manea, Stefan Siemer, Paul Sarnighausen-Cahn. “Optimal Enumeration of Regular Pattern Matches,” to appear in PODS 2026.
formal languages and automata theory
Audience: researchers in the topic
| Organizers: | Luca Prigioniero*, Rogério Reis* |
| *contact for this listing |
