Skip to main content
Log in

Minimal consistent DFA from sample strings

  • Original Article
  • Published:
Acta Informatica Aims and scope Submit manuscript

Abstract

We review the grammatical inference problem for regular languages which aims to generate a deterministic finite automaton from a representative set of training sample strings known to be in or not in the language. Although the general problem of producing a minimal DFA consistent with a given sample is known to be NP-hard, it is possible to generate minimal consistent DFA in polynomial time if certain constraints are satisfied by the given samples. In this work we propose a new algorithm which generates minimal DFA if the given training samples satisfy a certain sufficient condition. On the negative side, we also show that this problem is indeed hard, such that even for a more restricted class of training sets, the problem of generating minimal consistent DFA is already intractable.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11

Similar content being viewed by others

Notes

  1. Although the general results seem negative, Gold has also proved that from a given regular language, it is possible to generate a characteristic sample set S, such that learning from that training set yields a minimal DFA in polynomial time.

  2. Recall that Myhill–Nerode theorem defines equivalence classes on members of a language \(L\subseteq {\varSigma }^*\). If \(w_1,w_2\in {\varSigma }^*\) are defined as equivalent, then for all \(w\in {\varSigma }^*\), \(w_1\cdot w\) and \(w_2\cdot w\) are either both accepted by the language or both rejected by the language. The number of such equivalence classes for members of a regular language is always finite, which is equivalent to the number of states in a minimal DFA.

  3. This is similar to [5], in which it is shown that there may be multiple minimal cover-automata that are consistent with a given finite input sample. Also in the original TB algorithm, the construction of \(\delta _M(w,a)\) is nondeterministic as if \(w\cdot a\in S^+\setminus {\texttt {red}}\), the choice of \(w'\) is any compatible state in \({\texttt {red}}\).

  4. Philosophically, one may argue that when studying an automaton it is more reasonable to assume that the acceptability of all prefixes of a word should be observed.

  5. Being maximal means when compared with others there needs to have semantic evidences, either by showing it is larger or smaller by containment, or neither larger nor smaller (i.e., incomparable by an evidence trace). The term semantic-completeness enforces that such a measure always exists at least for the maximal elements.

References

  1. Angluin, D.: On the complexity of minimum inference of regular sets. Inf. Control 39, 337–350 (1978)

    Article  MathSciNet  Google Scholar 

  2. Angluin, D.: Learning regular sets from queries and counterexamples. Inf. Comput. 75, 87–106 (1987)

    Article  MathSciNet  Google Scholar 

  3. Angluin, D., Smith, C.H.: Inductive inference: theory and methods. ACM Comput. Surv. 15(3), 237–269 (1983)

    Article  MathSciNet  Google Scholar 

  4. Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M.K.: Occam’s razor. Inf. Process. Lett. 24(6), 377–380 (1987)

    Article  MathSciNet  Google Scholar 

  5. Câmpeanu, C., Sântean, N., Yu, S.: Minimal cover-automata for finite languages. Theor. Comput. Sci. 267, 3–16 (2001)

    Article  MathSciNet  Google Scholar 

  6. de la Higuera, C.: Characteristic sets for polynomial grammatical inference. Mach. Learn. 27(2), 125–138 (1997)

    Article  Google Scholar 

  7. de la Higuera, C., Oncina, J., Vidal, E.: Identification of DFA: data-dependent versus data-independent algorithms. In: Proceedings of International Colloquium on Grammatical Inference (ICGI’96), pp. 313–325 (1996)

  8. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, New York (1979)

    MATH  Google Scholar 

  9. García, P., López, D., Vázquez de Parga, M.: Polynomial characteristic sets for DFA identification. Theor. Comput. Sci. 448, 41–46 (2012)

    Article  MathSciNet  Google Scholar 

  10. Gold, M.E.: Language identification in the limit. Inf. Control 10, 447–474 (1967)

    Article  MathSciNet  Google Scholar 

  11. Gold, M.E.: Complexity of automaton identificaton from given data. Inf. Control 37, 302–320 (1978)

    Article  Google Scholar 

  12. Gentilini, R., Piazza, C., Policriti, A.: From bisimulation to simulation: coarsest partition problems. J. Autom. Reason. 31(1), 73–103 (2003)

    Article  MathSciNet  Google Scholar 

  13. Juillé, H., Pollack, J.B.: A stochastic search approach to grammar induction. In: Proceedings of International Colloquium on Grammatical Inference (ICGI’98), pp. 126–137 (1998)

  14. Lang, K.J., Pearlmutter, B.A., Price, R.A.: Results of the Abbadingo one DFA learning competition and a new evidence-driven state merging algorithm. In: Proceedings of International Colloquium on Grammatical Inference (ICGI’98), pp. 1–12 (1998)

  15. Oncina, J., García, P.: Inferring regular languages in polynomial update time. In: Paredes, R., Cardoso, J.S., Pardo, X.M. (eds.) Pattern Recognition and Image Analysis, vol. 1, pp. 49–61. World Scientific, Singapore (1992)

    Chapter  Google Scholar 

  16. Parekh, R., Honavar, V.: Learning DFA from simple examples. Mach. Learn. 44(1–2), 9–35 (2001)

    Article  Google Scholar 

  17. Pitt, L., Warmuth, M.K.: The minimum consistent DFA problem cannot be approximated within any polynomial. J. ACM 40(1), 95–142 (1993)

    Article  MathSciNet  Google Scholar 

  18. Trakhtenbrot, B.A., Barzdin, Y.M.: Finite Automata; Behavior and Synthesis (Fundamental Studies in Computer Science). North-Holland Publishing Co., Amsterdam (1973)

    Google Scholar 

  19. Valiant, L.: A theory of the learnable. Commun. ACM 27, 1134–1142 (1984)

    Article  Google Scholar 

  20. van Glabbeek, R.: The linear time-branching time spectrum I. The semantics of concrete, sequential processes. In: Bergstra, J.A., Ponse, A., Smolka, S.A. (eds.) Handbook of Process Algebra, pp. 3–99. Elsevier, Amsterdam (2001)

    Chapter  Google Scholar 

  21. van Glabbeek, R., Ploeger, B.: Correcting a space-efficient simulation algorithm. In: Proceedings of International Conference on Computer Aided Verification (CAV’08), pp. 517–529 (2008)

  22. Vázquez de Parga, M., García, P., López, D.: Minimal consistent DFA revisited. Theor. Comput. Sci. 647, 43–49 (2016)

    Article  MathSciNet  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Chenyi Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Zhang, C. Minimal consistent DFA from sample strings. Acta Informatica 57, 657–670 (2020). https://doi.org/10.1007/s00236-020-00365-8

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00236-020-00365-8

Navigation