Loughborough University
Browse
1806.02718v1.pdf (250.16 kB)

Alignment-free sequence comparison using absent words

Download (250.16 kB)
journal contribution
posted on 2018-06-29, 12:41 authored by Panagiotis Charalampopoulos, Maxime Crochemore, Gabriele Fici, Robert MercasRobert Mercas, Solon P. Pissis
Sequence comparison is a prerequisite to virtually all comparative genomic analyses. It is often realised by sequence alignment techniques, which are computationally expensive. This has led to increased research into alignment-free techniques, which are based on measures referring to the composition of sequences in terms of their constituent patterns. These measures, such as $q$-gram distance, are usually computed in time linear with respect to the length of the sequences. In this paper, we focus on the complementary idea: how two sequences can be efficiently compared based on information that does not occur in the sequences. A word is an {\em absent word} of some sequence if it does not occur in the sequence. An absent word is {\em minimal} if all its proper factors occur in the sequence. Here we present the first linear-time and linear-space algorithm to compare two sequences by considering {\em all} their minimal absent words. In the process, we present results of combinatorial interest, and also extend the proposed techniques to compare circular sequences. We also present an algorithm that, given a word $x$ of length $n$, computes the largest integer for which all factors of $x$ of that length occur in some minimal absent word of $x$ in time and space $\cO(n)$. Finally, we show that the known asymptotic upper bound on the number of minimal absent words of a word is tight.

History

School

  • Science

Department

  • Computer Science

Published in

Information and Computation

Volume

262

Issue

Part 1

Pages

57 - 68

Citation

CHARALAMPOPOULOS, P. ...et al., 2018. Alignment-free sequence comparison using absent words. Information and Computation, 262(Pt 1), pp. 57-68.

Publisher

© Elsevier

Version

  • AM (Accepted Manuscript)

Publisher statement

This work is made available according to the conditions of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) licence. Full details of this licence are available at: https://creativecommons.org/licenses/by-nc-nd/4.0/

Acceptance date

2018-06-12

Publication date

2018-06-22

Notes

Extended version of "Linear-Time Sequence Comparison Using Minimal Absent Words & Applications" Proc. LATIN 2016, arxiv:1506.04917. This paper was accepted for publication in the journal Information and Computation and the definitive published version is available at https://doi.org/10.1016/j.ic.2018.06.002

ISSN

0890-5401

Language

  • en