Sequence Segmentation Using Semi‑Markov Conditional Random Fields

Sunita Sarawagi

Abstract


Many applications in natural language, speech processing
and data integration require model-based segmentation of sequences.
Semi-Markov conditional random fields (semi-CRFs) are a generalization
of CRFs and provide a full conditional distribution over all possible segmentation of a sequence. Semi-CRFs are particularly suitable for tasks
that entail segment-level features such as match with existing dictionary
of segments. Empirical results on real-life NER tasks show that they yield
higher accuracy than CRFs, but the straightforward foreword–backward
inference algorithm requires 3–10 times the computation cost of CRFs.
This running time can be reduced significantly by exploiting overlapping
features across segments. We present a succinct representation of overlapping features and an efficient training algorithm that can sum over
all possible input segmentation in time that is sub-quadratic in the input
length, even while imposing no bound on the maximum segment length.
Consequently, the running time becomes comparable to CRFs even with
the addition of useful entity-level features on large input segments.


Full Text:

PDF

Refbacks

  • There are currently no refbacks.