TECHNIQUES FOR CONTEXT-FREE GRAMMAR INDUCTION
AND APPLICATIONS








by
FAIZAN JAVED

BARRETT R. BRYANT, COMMITTEE CHAIR
MARJAN MERNIK
JEFFREY G. GRAY
ALAN P. SPRAGUE
ELLIOT J. LEFKOWITZ










A DISSERTATION
Submitted to the graduate faculty of The University of Alabama at Birmingham,
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy

BIRMINGHAM, ALABAMA
2007









































Copyright by
Faizan Javed
2007


TECHNIQUES FOR CONTEXT-FREE GRAMMAR INDUCTION
AND APPLICATIONS
FAIZAN JAVED
COMPUTER AND INFORMATION SCIENCES
ABSTRACT
Grammar inference is the process of learning a grammar from examples, which may be
positive (i.e., strings the grammar generates) or negative (i.e., strings the grammar
does not generate). Although grammar inference has been successfully applied to
many diverse domains such as speech recognition and robotics, its application to software
engineering has been limited. This research investigates the applicability of grammar
inference to software engineering and programming language development challenge
problems, where grammar inference offers an innovative solution to the problem, while
remaining tractable and within the scope of that problem. Specifically, the following
challenges are addressed in this research:

1. Recovery of a metamodel from instance models: Within the area of domain-specific
modeling (DSM), instance models may evolve independently of the original
metamodel resulting in metamodel drift, an inconsistency between the instance model
and the associated metamodel such that the instance model may no longer be loaded
into the modeling tool. Although prior work has focused on the problem of schema
evolution, no previous work addresses the problem of recovering a lost metamodel
from instance models. A contribution of this research is the MetAmodel Recovery
System (MARS) that uses grammar inference in concert with a host of
complementary technologies and tools to address the metamodel drift problem.



2. Recovery of domain-specific language (DSL) specifications from example DSL
programs: An open problem in DSL development is reducing the time required to learn
language development tools, for example by supporting the
description-by-example (DBE) paradigm for language specifications such as syntax. This
part of the dissertation focuses on recovering specifications of imperative, explicitly
Turing-complete and context-free DSLs. A contribution of this research is GenInc, an
unsupervised incremental CFG learning algorithm that allows further progress
towards inferring DSLs and finds a second application in recovery of legacy DSLs.

The research described in this dissertation makes the following contributions: i) A
metamodel recovery tool for DSM environments, ii) Easier development of DSLs for
domain experts, and iii) Advances in grammar inference algorithms that may also have
new applications in other areas of computer science (e.g., bioinformatics).





DEDICATION


To my parents, Javed Masood and Attiya Javed


























ACKNOWLEDGMENTS

I would like to first and foremost thank my advisor, Dr. Barrett Bryant, for
making this graduate research experience one of the most rewarding and defining
moments of my life. His expert guidance and advice on my Ph.D. study and research
coupled with the perfect balance of individual research opportunity as well as guided
mentoring resulted in a research experience that most graduate students wish for. His
encouragement during the formative phases of my research as well as the more
challenging periods motivated me to do my best work which consequently resulted in this
dissertation.
I would also like to thank Dr. Marjan Mernik, Dr. Jeff Gray, Dr. Alan Sprague
and Dr. Elliot Lefkowitz for taking the time to serve on my committee, and who also
played a vital role in guiding the trajectory of my research. Dr. Marjan Mernik introduced
me to domain-specific languages, facilitated early discussion on grammar inference, and
provided valuable feedback and guidance on the research endeavor. I thank him for the
numerous in-person discussions, as well as his prompt responses to my research queries
while in Slovenia. Two of the most challenging and interesting graduate courses that I
took at UAB were offered by Dr. Jeff Gray. These courses expanded my computer
science knowledge repertoire and provided me with the very cutting edge in software
engineering. In addition to this, I thank Dr. Jeff Gray for introducing me to the idea of
investigating the collective synergy of grammar inference and domain-specific modeling,
the result of which forms one-half of this dissertation. I am indebted to Dr. Alan Sprague


for the graduate courses that I took from him, as well as the many fruitful and in-depth
discussions on the algorithmic specifics of grammar inference theory. I thank Dr. Elliot
Lefkowitz for his discussion and insight into grammar inference applications in
bioinformatics.
My research life wouldn’t have been the same without the friendship and
collaboration of fellow present and former Softcom members. I am grateful to Shih-Hsi
“Alex” Liu, Xiaoqing “Carl” Wu, Fei Cao, Suman Roychoudury, Robert Tairas, Hui Wu,
Yuehua “Jane” Lin, and Jing Zhang for the congenial and inviting atmosphere in the lab.
I thank Suman Roychoudury and Jing Zhang for collaborative work on type inference in
Chapter 3. I am also grateful to Damijan Rebernak and Matej Črepinšek at the University
of Maribor for their help with overcoming technical difficulties with the LISA compiler
development tool and research discussions on grammar inference. I would also like to
thank Dr. Frédéric Jouault for introducing me to the topic of Model Transformation, a
portion of which forms a part of this dissertation.
I greatly appreciate the CIS staff of Mrs. Kathy Baier, Ms. Janet Tatum, and Mr.
Fran Fabrizio for handling my administrative and IT needs and keeping the CIS
department functioning smoothly.
Last, but not least, I am grateful to have a supportive and loving family. Special
thanks to my parents, Javed Masood and Attiya Javed. Their encouragement, support and
love have been invaluable, and have made me the person who I am today. I thank my
brothers Zohaib, Shayan, Rameez and Saif for their brotherhood and support. I am
grateful to Mary for her comfort and understanding while I was spending countless hours
writing this dissertation.



TABLE OF CONTENTS
Page
ABSTRACT.......................................................................................................................iii
DEDICATION.....................................................................................................................v
ACKNOWLEDGMENTS.................................................................................................vi
LIST OF TABLES.............................................................................................................xi
LIST OF FIGURES..........................................................................................................xii
LIST OF LISTINGS........................................................................................................xiv
LIST OF ABBREVIATIONS...........................................................................................xv
CHAPTER
1. INTRODUCTION.........................................................................................................1
1.1. Recovery of a Metamodel from an Instance Model..............................................2
1.2. Recovery of Domain-Specific Language Specifications from DSL Programs....7
1.3. Research Goals and Overview..............................................................................8
1.3.1. A Metamodel Recovery System Using Grammar Inference....................9
1.3.2. An Unsupervised Incremental Learning Algorithm for
Domain-Specific Language Development ...............................................9
1.3.3. Experimental Validation.........................................................................10
1.4. The Structure of the Dissertation........................................................................11

2. BACKGROUND.........................................................................................................13
2.1. Grammar Inference.............................................................................................13
2.1.1. GenParse: An Evolutionary Approach to Inferring DSLs .....................18
2.2. Domain-Specific Modeling.................................................................................20
2.3. The Generic Modeling Environment .................................................................23
2.4. The LISA Language Development Environment……………………………...23
2.5. The Design Maintenance System………………………………………………24
2.6. The ATLAS Transformation Language………………………………………..25



3. MARS: A METAMODEL RECOVERY SYSTEM...................................................27
3.1. Challenges and Current Limitations...................................................................28
3.2. The Model Representation Language.................................................................33
3.3. The Metamodel Inference Engine.......................................................................40
3.3.1. Metamodels as Context-Free Grammars................................................41
3.3.2. Inferring the Metamodel from Domain Models......................................44
3.3.2.1 Incrementally Inferring a Metamodel...............................................45
3.3.3. The Induction Process.............................................................................54
3.4. Extensions to MARS...........................................................................................62
3.4.1. Type Inference from Model Compilers..................................................62
3.4.2. A Generic Front-End Using a Model Transformation Tool...................67
3.5. Experimental Studies of Metamodel Inference...................................................75
3.5.1. Inferring the Network Diagram Metamodel...........................................75
3.5.2. Inferring the Petri Net Metamodel..........................................................77
3.5.3. Inferring the AudioVideo System Metamodel........................................80
3.5.4. Evaluation and Results............................................................................85
3.5.4.1. Type Inference Evaluation.............................................................86
3.6. Related Work......................................................................................................88
3.6.1. Grammar Stealing and Model Evolution................................................88
3.6.2. Grammar Inference Applied to XML Schema Extraction......................90
3.6.3. Type Inference Systems..........................................................................91
3.7. Limitations..........................................................................................................93
3.8. Conclusion..........................................................................................................94

4. GENINC: AN UNSUPERVISED INCREMENTAL LEARNING ALGORITHM
FOR DOMAIN-SPECIFIC LANGUAGE DEVELOPMENT...................................97
4.1. Motivation...........................................................................................................97
4.2. GenInc: An Incremental Approach to Learning Grammars...............................98
4.2.1. Three Learning Cases...........................................................................100
4.2.2. The GenInc Algorithm..........................................................................103
4.2.3. Post-processing the Inferred Grammars................................................110
4.3. Experimental Results........................................................................................111
4.3.1. Grammar Metrics..................................................................................113
4.3.2. Discussion.............................................................................................117
4.4. Related Work....................................................................................................122
4.5. Conclusion........................................................................................................124




5. FUTURE WORK.......................................................................................................126
5.1. MARS: Further Extensions and Improvements................................................126
5.1.1. Inferring the Modularization of Large Models.....................................126
5.1.2. Adapting MARS to Other Modeling Tools .........................................127
5.2. GenInc: Extensions and Improvements............................................................127
5.2.1. A Memetic Programming Approach to Grammar Inference................128
5.2.2. Mitigating Order Effects ......................................................................129
5.3. Application: RNA Secondary Structure Prediction..........................................130

6. CONCLUSIONS........................................................................................................134

LIST OF REFERENCES.................................................................................................140

APPENDIX
A EXPERIMENTAL RUN OF THE GENERIC FRONT-END OF MARS.......153
B EXPERIMENTAL RUN OF GENINC ON THE WHILE LANGUAGE........159














LIST OF TABLES
Table Page
3-1 Two MRL Programs as Representations of Models from Figure 3-4.....................40
3-2 Transformation from a Metamodel to a Context-Free Grammar............................42
3-3 Rules for Variability Calculation............................................................................52
3-4 Summary of the Inference Experiments.................................................................86
3-5 Original and Inferred Types for the Petri Net Domain..........................................87
3-6 Original and Inferred Types for the Network Domain..........................................87
4-1 The RHS_Subset Operator...................................................................................109
4-2 IncGeneralizer Operator on the DESK DSL........................................................109
4-3 Original and Inferred Grammars..........................................................................114
4-4 Grammar Size Metrics of the ROBOT, MRL and DESK DSLs.........................120
4-5 Grammar Size Metrics of the {aⁿ, bⁿ}, FDL, KM3 and WHILE DSLs...............121
4-6 Grammar Structure Metrics of the ROBOT, MRL and DESK DSLs..................121
4-7 Grammar Structure Metrics of the {aⁿ, bⁿ}, FDL, KM3 and WHILE DSLs.......121










LIST OF FIGURES
Figure Page
1-1 An Overview of Schema Evolution Dependencies in Model-Driven Engineering...4
2-1 The GenParse Induction Engine .............................................................................19
2-2 The State Machine Metamodel................................................................................22
2-3 An Instance of a State Machine Representing an ATM Machine...........................22
2-4 The ATL Transformation Approach........................................................................26
3-1 A Metamodel for Creating Finite State Machines...................................................29
3-2 An Instance of a Finite State Machine.....................................................................29
3-3 Overview of MARS.................................................................................................32
3-4 Two Instances of the FSM Metamodel....................................................................39
3-5 The First Domain Model and Corresponding MRL Program.................................45
3-6 Inferred Metamodel Based on the First Domain Model..........................................46
3-7 The Second Domain Model and Corresponding MRL Program.............................47
3-8 Inferred Metamodel Based on Second Domain Model...........................................47
3-9 Inferred Metamodel Based on First and Second Domain Models...........................48
3-10 The Third Domain Model and Corresponding MRL Program................................48
3-11 The Fourth Domain Model and Corresponding MRL Program..............................49
3-12 Inferred Metamodel Based on Four Instances.........................................................50
3-13 A Metamodel for Network Diagrams......................................................................62
3-14 An Instance of a Network........................................................................................65

3-15 Symbol Table for the Process Router Method.........................................................66
3-16 Megamodel of the XML-to-MRL Transformation..................................................68
3-17 The MRL Metamodel..............................................................................................70
3-18 Inferred Metamodel for the Network Domain.........................................................77
3-19 Original Metamodel for the Petri Net Domain........................................................78
3-20 Dining Philosophers: An Instance of a Petri Net.....................................................79
3-21 Inferred Metamodel for the Petri Net Domain........................................................80
3-22 Original Metamodel for the AudioVideo System Domain......................................81
3-23 An Instance of the AudioVideo System Domain....................................................82
3-24 Inferred Metamodel for the AudioVideo System Domain......................................82


















LIST OF LISTINGS
Listing Page
3-1 Context-Free Grammar for MRL............................................................................35
3-2 XML Code Fragment from a Network Domain Instance Model............................36
3-3 XSLT Code to Extract the Connection Elements....................................................38
3-4 Context-Free Grammar Representation of the Metamodel in Figure 3-1...............44
3-5 Inferred CFG for the FSM Metamodel....................................................................50
3-6 The Metamodel Inference Algorithm......................................................................61
3-7 An Excerpt from the Model Compiler for Processing Routers in the Network
Domain...................................................................................................................65

3-8 PARLANSE Code Fragment to Determine Attribute Types...................................67
3-9 XML Metamodel in KM3 Format...........................................................................69
3-10 MRL Metamodel in KM3 Format...........................................................................71
3-11 MRL Concrete Syntax in TCS Format....................................................................72
3-12 The XML-to-MRL Transformation Algorithm.......................................................74
4-1 The GenInc Algorithm...........................................................................................107
4-2 The RHS_Subset and IncGeneralizer Operators..................................................108







LIST OF ABBREVIATIONS

ABL Alignment-Based Learning
AMMA ATLAS Model Management Architecture
API Application Program Interface
APTA Augmented Prefix Tree Acceptor
AST Abstract Syntax Tree
ATL ATLAS Transformation Language
ATM Automated Teller Machine
AVS Average Right-Hand Side Size
CFG Context-Free Grammar
CLEV Normalized Counts of Level
DEP Size of Largest Level
DFA Deterministic Finite Automata
DMS Design Maintenance System
DSL Domain-Specific Language
DSM Domain-Specific Modeling
DTD Document Type Definition
EBNF Extended Backus-Naur Form
ECFG Extended Context-Free Grammar
EDSM Evidence Driven State Merging

ESML Embedded Systems Modeling Language
FCO First Class Object
FSM Finite State Machine
GME Generic Modeling Environment
GP Genetic Programming
GPL General-Purpose Language
GSEE Generic Software Exploration Environment
HAL Halstead Effort
HMMs Hidden Markov Models
ILP Inductive Logic Programming
KM3 Kernel Meta Meta Model Language
LHS Left-Hand Side
LOC Lines Of Code
MARS MetAmodel Recovery System
MCC McCabe Cyclomatic Complexity
MDA Model-Driven Architecture
MDE Model-Driven Engineering
MDL Minimum Description Length
MOF Meta Object Facility
MRL Model Representation Language
NOP Number of Productions
NSLEV Number of Non-singleton Levels
NT Non-terminal

OCL Object Constraint Language
PAC Probably Approximately Correct
PACS Probably Approximately Correct learning under Simple distributions

RHS Right-Hand Side
RNA Ribonucleic Acid
SA Software Architecture
SAGE Self-Adaptive Greedy Estimate
TERM Number of terminals
TCS Textual Concrete Syntax
TIMP Tree Impurity
UML Unified Modeling Language
VAR Number of non-terminals
VHEI Varju Height Metric
XML Extensible Markup Language
XPATH XML Path Language
XSLT Extensible Stylesheet Language Transformations











CHAPTER 1
INTRODUCTION

Inductive learning is the process of learning from examples [Baker, 79].
Language learning refers to the problem of acquiring the syntax and semantics of a target
programming or natural language. Most of the current computational approaches to
learning [Klein and Manning, 05] [Sakakibara, 05] focus on learning the syntax of a
language. The related area of grammar inference (or induction) can be defined as a
particular instance of inductive learning where the examples are sets of strings defined on
a specific alphabet. These sets of strings can be further classified into positive samples
(i.e., the set of strings belonging to the target language) and negative samples (i.e., the set
of strings not belonging to the target language). Using these strings generated by a
grammar G₀, the learning algorithm (or learner) infers a grammar G₁ that approximates
G₀ in some way.
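The learning setting just described can be sketched concretely. In the minimal example below (in Python; the hypothesis grammar is written by hand purely for illustration, standing in for what a learner would infer from the samples), the target language is the classic L(G₀) = {aⁿbⁿ : n ≥ 1}:

```python
# A minimal sketch of the grammar-inference setting: positive and negative
# samples drawn from the target language L(G0) = { a^n b^n : n >= 1 }, and
# a candidate grammar G1 (S -> a S b | a b) checked for consistency.
# G1 is hand-written here for illustration; a real learner would infer it.

def in_g1(s: str) -> bool:
    """Membership test for the candidate CFG G1: S -> a S b | a b."""
    if s == "ab":
        return True
    if len(s) >= 4 and s[0] == "a" and s[-1] == "b":
        return in_g1(s[1:-1])
    return False

positive = ["ab", "aabb", "aaabbb"]   # strings G0 generates
negative = ["", "a", "ba", "aab"]     # strings G0 does not generate

# A hypothesis consistent with the samples accepts every positive
# string and rejects every negative one.
assert all(in_g1(s) for s in positive)
assert not any(in_g1(s) for s in negative)
```

Here the candidate happens to generate exactly L(G₀); in general, a learner can only guarantee consistency with the finite set of samples it has seen so far.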
Grammar inference algorithms or techniques based on grammar inference have
found application in many diverse domains. Grammar inference has been applied to map
learning in robotics [Dean et al., 92], structural pattern recognition tasks such as finger
print classification and image recognition [Fu and Booth, 75], computational linguistics
[Adriaans et al., 00] [van Zaanen, 01], and constructing language models for speech
recognition [Wang and Acero, 02]. Grammar inference has also been applied to
variegated domain applications such as schema extraction for document management
systems [Chidlovskii, 01], mining web usage patterns [Borges and Levene, 00], modeling
and classification of music styles and music pieces [Cruz and Vidal, 98] and predicting
secondary structures of RNA sequences for biological sequence analyses [Sakakibara,
05]. While grammar inference has been applied to these diverse domains, its application
to the software engineering and programming language domains has been limited. In the
realm of software engineering, context-free grammars (CFGs) are of paramount
importance for defining the syntactic component of programming languages [Aho et al.,
07]. Grammars are increasingly being used in various software development scenarios,
and recent efforts seek to carve out an engineering discipline for grammars and grammar
dependent software [Klint et al., 05]. Inference of regular grammars has been generally
successful; however, inference of CFGs is thought to be NP-hard relative to input size
[Gold, 78]. Considering the widespread applicability of CFGs in diverse domains and
their substantial use in programming languages and software systems, there is a need to
maintain and infer CFGs. More specifically, two grammar inference application areas
hold great potential: (1) within the purview of Domain-Specific Modeling (DSM) [Gray
et al., 07], recovering a metamodel from a repository of orphaned models, and (2)
facilitating Domain-Specific Language (DSL) [Mernik et al., 05] development for experts
not well versed in language design.
1.1 Recovery of a Metamodel from an Instance Model
During the various phases of the software development process, numerous
artifacts may be created (e.g., documentation, models, source code, testing scripts) and
stored in a repository. Some of the artifacts involved in development, such as source code
and models, depend on a language schema definition that provides the context for
syntactic structure. For example, a programming language is dependent on a grammar,
and a model is defined by a metamodel; as such, both grammars and metamodels
represent a schema that defines the syntax of a language. Over time, evolution of the
schema definition is often required to address new feature requests (e.g., evolution of a
language to provide new language features, or adaptation of a metamodel to
accommodate new stakeholder concerns). If the repository artifacts are not transformed to
conform to the new schema definition, it is possible that the repository may become stale
with obsolete artifacts.
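As a hypothetical illustration of this grammar/metamodel correspondence (the element and rule names below are invented for the example and are not the actual encoding used by this research, which Chapter 3 presents), a toy state-machine metamodel can be written down as CFG productions, with each containment relationship becoming a production that lists the contained parts:

```python
# Hypothetical sketch of "metamodel as schema": a toy state-machine
# metamodel rendered as CFG productions, mapping each nonterminal to a
# list of alternative right-hand sides. Names are illustrative only.
fsm_grammar = {
    "Model":          [["StateList"]],
    "StateList":      [["State"], ["State", "StateList"]],      # one or more states
    "State":          [["name", "TransitionList"]],
    "TransitionList": [[], ["Transition", "TransitionList"]],   # zero or more transitions
    "Transition":     [["event", "targetState"]],
}

# The containment "a state holds zero or more transitions" appears as the
# recursive TransitionList production, just as a metamodel association with
# multiplicity 0..* would in a modeling tool.
assert ["Transition", "TransitionList"] in fsm_grammar["TransitionList"]
assert [] in fsm_grammar["TransitionList"]  # the empty alternative permits 0 transitions
```

Under this view, recovering a lost metamodel from instance models and inferring a grammar from example strings are instances of the same schema-recovery problem.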
In the realm of programming languages, Lämmel and Verhoef [Lämmel and
Verhoef, 01] have motivated the need to fabricate tools for software renovation problems.
With more than 500 general-purpose and proprietary programming languages being used
in the commercial and public domains, the need for a rapid and reliable renovation tool is
very real. Such a tool can be used to solve re-engineering problems, or can be useful in
commercial installations where source implementations either need to be recovered or
translated to a different language dialect. Lämmel and Verhoef [Lämmel and Verhoef,
01] make a strong case for using a grammar-centric solution to solve these problems by
stating that the dominant factor in producing any renovation tool is building a parser.
When a grammar for a language can be obtained, a parser generator can be used to create
a parser for the recovered language.
With the proliferation of modeling tools in the commercial and research arenas
[Schmidt, 06], the number of renovation problems in the modeling community is rising.
Figure 1-1 categorizes several evolution tasks that require automated assistance to
manage the various dependencies among metamodels, domain models, and
corresponding source code in Model-Driven Engineering (MDE) [Schmidt, 06]. Along
the top of Figure 1-1 is a sequence of intermediate metamodel versions that represent the
evolving definition of a specific modeling language in a particular domain. Each new
version of


Figure 1-1 - An Overview of Schema Evolution Dependencies in Model-Driven
Engineering

the metamodel captures some change in the modeling language (represented by ΔMM).
The domain models that are in the middle of Figure 1-1 are dependent on the metamodel
definition. The problem of metamodel evolution as it relates to updating dependent
domain models has already been investigated. As an initial solution to the metamodel
schema evolution problem, Sprinkle and Karsai define a visual language for mapping
between old and new metamodels [Sprinkle and Karsai, 04]. Their approach uses graph-
rewriting techniques to update the domain models accordingly. However, this schema
evolution approach fails when both the metamodels and the intermediate transformation
steps do not exist, or are not accessible. A computationally efficient framework for
managing and deploying incremental data transformations between models is proposed in
[Johann and Egyed, 04]. In [Diskin and Dingel, 06], a formal algebraic framework for
efficient model transformations in a generic metamodel-independent way is proposed.
The approach avoids the laborious element-wise specification of source and target
models encountered in conventional metamodel transformation methodology by
specifying a mapping between source and target metamodels using generic operators.
Graaf, Weber and van Deursen [Graaf et al., 07] formulate the migration of supervisory
machine control systems as a model transformation problem based on the Symphony [van
Deursen et al., 04] view-driven software architecture (SA) reconstruction process.
CacOphoNy [Favre, 04] is a generic metamodel-driven software architecture process
similar to Symphony. While Symphony is confined to SA reconstruction, CacOphoNy
integrates SA with MDE to create an integrated SA reconstruction approach which is
more expansive and broader in scope than Symphony.
A correspondence also exists between the domain models and a legacy source
code base, as shown in the bottom of Figure 1-1. In order to maintain a causal connection
between the domain models and the corresponding source code, a technique to
synchronize model changes to existing source code is needed. An initial inquiry into
model-driven program transformation is presented in [Gray et al., 04-a], which generates
program transformation rules from model changes. The generated transformation rules
are applied to a transformation engine that parses and modifies the corresponding
representation of the legacy source. To keep the illustration simplified, Figure 1-1 does
not consider reverse engineering of source code into a model representation. This
represents another issue outside the scope of this thesis, but acknowledged as an area that
needs further investigation. The work described in [Favre et al., 01] on using the Generic
Software Exploration Environment (GSEE) platform to reverse engineer large scale
object-oriented and component-based systems is an example of reverse engineering
source code into models. In GSEE, a metamodel for component models was designed.
This metamodel is used to view reverse engineered source code as a model. Furthermore,
there are other artifacts that may need to be transformed during metamodel evolution
(e.g., Object Constraint Language (OCL) [Warmer and Kleppe, 03] constraints, model
interpreters, model transformation rules), but are not shown in Figure 1-1.
From our experience in model-driven engineering, it is often the case that a
metamodel undergoes frequent evolution, which may result in previous model instances
being orphaned from the new definition. This has also been observed by others in
practice [GME, 06-a], [GME, 06-b]. We call this phenomenon metamodel drift. When the
metamodel is no longer available, a domain model may not be loaded into the modeling
tool. This is similar in concept to a change in a language definition that invalidates prior
programs and the associated compiler. However, if a metamodel can be inferred from a
set of domain models (as indicated by the “Inference” arrow in Figure 1-1), the design
knowledge contained in the domain models can be recovered. Some examples of
problems that require the need to recover or reverse engineer a metamodel include losing
a metamodel definition due to a hard-drive crash, and encountering versioning conflicts
when trying to load domain models based on obsolete metamodels. A component of the
research in this thesis describes a technique to recover the metamodel schema definition
from instances that have been separated from their original defining metamodel. The
technique is semi-automated and grammar-driven. It uses concepts from the grammar
inference domain, and is applicable even when the only accessible resources are the
domain models.
1.2 Recovery of Domain-Specific Language Specifications from DSL Programs

In the realm of software engineering, context-free grammars are widely used for
defining the syntactic component of programming languages. Domain-Specific
Languages are languages specifically designed for use in a particular domain. The
following apt definition for DSLs is given in [van Deursen et al., 00]:

“A DSL is a programming language or executable specification language that offers,
through appropriate notations and abstractions, expressive power focused on, and
usually restricted to, a particular problem domain.”

DSLs are usually declarative and smaller in size than general-purpose languages
(GPLs). An open problem in DSL development is reducing the time needed to learn
language development tools by supporting the description-by-example paradigm for
language specifications such as syntax [Mernik et al., 05]. Using
grammar induction, language specifications can be generated for DSLs, facilitating
programming language development for domain experts not well-versed in programming
language design.
Grammar induction would enable a domain expert to create a DSL by supplying
sentences from the desired DSL to the grammar induction system, which would then
create a parser for the DSL represented by those samples, thus expediting programming
language development. Such a technique would find application in cases where legacy
DSLs have been running for many years and their specifications no longer exist to assist
with evolution of their implementations (e.g., as was needed to solve the Y2K problem).
As previously mentioned, a survey of programming language usage in commercial and
research environments has shown that more than 500 general-purpose and proprietary
programming languages are in use today [Lämmel and Verhoef, 01]. The Y2K-like
problems notwithstanding, many commercial installations use in-house DSLs, and a
variety of situations can arise (e.g., a software company going bankrupt) in which
source implementations need to be recovered or translated to a different language
dialect. It is also possible that the application domain requires a language far more
complex than a typical DSL.

1.3 Research Goals and Overview
The goal of the research in this dissertation is to investigate the applicability of
grammar inference to software engineering and programming language development
challenge problems, where grammar inference offers an innovative solution to the
problem, while remaining tractable and within the scope of that problem. More
specifically, the dissertation addresses the problem of metamodel drift in the DSM
domain, and investigates a grammar inference approach to the description-by-example
paradigm of DSL development which intends to facilitate language development for
domain experts not well versed in DSL design. The following sections give an overview
of the research.
1.3.1 A Metamodel Recovery System Using Grammar Inference
To address the metamodel drift problem a contribution of this research is MARS
[Javed et al., 07-a], a metamodel recovery system using grammar inference. MARS is a
semi-automatic inference-based system for recovering a metamodel that correctly defines
the mined instance models through application of grammar inference techniques. The
core of the approach involves the use of an intermediate representation of the models in a
textual language. A host of technologies ranging from Extensible Markup Language
(XML) transformation tools, to language development and transformation environments,
program transformation engines and model transformation tools are used to facilitate the
task. MARS is, to the best of our knowledge, the first solution to the metamodel
recovery problem and the first effort to apply grammar inference to the model recovery
problem.
1.3.2 An Unsupervised Incremental Learning Algorithm for Domain-Specific
Language Development
The research in this dissertation also presents GenInc [Javed et al., 06] [Javed et
al., 07-b], an unsupervised incremental learning algorithm for domain-specific language
development. The learning problem underlying GenInc is divided into three cases, each
more involved than the previous, and GenInc’s focus is on learning CFGs. The GenInc
algorithm presented in this research encompasses all three cases. GenInc uses ordered
positive examples under the PACS learning (Probably Approximately Correct learning
under Simple distributions) [Li and Vitanyi, 97] model to infer DSLs to facilitate the
description-by-example DSL development paradigm.
1.3.3 Experimental Validation
This research makes two contributions: MARS and GenInc. Within the
scope of grammar inference applied to software engineering, MARS addresses the
metamodel recovery problem in the DSM domain while GenInc facilitates DSL
development using an incremental learning approach. MARS was tested on various
example domain models and its performance was analyzed under a variety of metamodel
recovery situations that might occur in practice. Examples of such situations include
inferring a metamodel from only one model, and inferring a metamodel from a set of
models that do not exhibit all the properties of the original metamodel. The modeling
artifacts used for experimentation were a mix of pedagogical case studies used to teach
domain-specific modeling as well as artifacts from the ESCHER repository [Escher, 07].
GenInc is evaluated on a set of DSLs that vary in their complexity and size; some
are simple demonstrational DSLs while others are used in research and commercial
applications. The evaluation focuses on whether grammars inferred by GenInc generate
the same language as grammars of the original DSLs. A grammar metrics suite is used to
compare the inferred grammars to the original DSL grammars. The grammar metrics
suite evaluates grammars using size and structural grammar metrics that allow a
similarity measure of the grammars being compared.
1.4 The Structure of the Dissertation
The remainder of this dissertation is structured as follows: Chapter 2 provides
further background information on grammar inference and DSM. The chapter discusses
the learning models of grammar inference, as well as the tools used in this research.
Chapter 3 presents the details of MARS, the research contribution which
addresses the metamodel drift problem. The chapter begins by detailing the need for a
metamodel recovery system and listing the challenges that need to be overcome. This is
followed by a discussion of the design decisions for the Model Representation Language
(MRL), and a detailed presentation of the metamodel inference engine which forms the
heart of MARS and utilizes grammar inference algorithms to perform the inference. Two
extensions to the MARS framework are then presented: the use of a program
transformation engine to infer element types from model compiler sources, and a model
transformation based generic input capability that allows MARS to use as input XML
files that do not have to conform to a standard schema. Several case studies on domain
languages under a variety of practical metamodel recovery situations are detailed, and the
chapter concludes by discussing related work and limitations of the current approach.
Chapter 4 describes the GenInc algorithm, the research contribution on DSL
grammar recovery. The chapter begins by describing the motivation for GenInc, and then
elaborates on the three learning cases that form the theoretical basis for GenInc. This is
followed by a detailed discussion of the GenInc algorithm and the post-processing steps.
An experimental section showcases GenInc’s performance on DSLs of varying sizes, and
a grammar metrics suite is used to analyze the difference between the original and the
inferred DSL grammars. A discussion of related grammar inference algorithms concludes
the chapter. Chapter 5 outlines future work for this dissertation research and Chapter 6
presents concluding comments. Appendix A lists the output of a test run of the generic
input capability extension to MARS, and Appendix B shows a detailed run of GenInc on
an example DSL.
























CHAPTER 2
BACKGROUND

This chapter provides further background on grammar inference, the central topic
of this dissertation, and domain-specific modeling (DSM), the underlying software
methodology of the metamodel recovery component of the research in this dissertation.
The basic paradigm and learning models of grammar inference are discussed, as well as
an overview of recent efforts at learning regular and context-free grammars. An overview
of the tools used in this research such as the Generic Modeling Environment (GME)
metamodeling tool, the Design Maintenance System (DMS) program transformation
engine, and the LISA language development environment is also presented.
2.1 Grammar Inference
As previously mentioned, grammar inference is a special case of inductive learning
where the learning process is focused on acquiring the syntax and semantics of a target
language, although most current work in grammar inference is on learning syntax. A general
machine learning model (of which grammar inference is a component) specifies the
following:
1. Learner: Is a computer program doing the learning or a human?
2. Domain: What is the learner trying to learn? Is the learner being trained how to drive
a car or control a robot? Is the learner trying to infer a function or a concept?

3. Information source: What is the learner using to learn the domain? A learner may be
given positive and/or negative samples, or be given access to a teacher to ask queries. It may
even be allowed to experiment with the domain to learn more about it.
4. Prior knowledge: A learner may be given prior knowledge about the domain to limit
its uncertainty during the learning process. This knowledge could be about the format of the
domain concept to be learned, or that simple theories are preferable to complex ones.
5. Performance criteria: The performance measure depends on what and why a learner is
learning. The learner may be evaluated on the accuracy of its learned theory (error rate or
number of mistakes) and on its efficiency (number of samples needed, or polynomial
computational efficiency).

In [Gold, 67], Gold introduced one of the first models of learning: identification in
the limit. In this model, after each new sample, the inductive machine (learner) returns a
hypothesis; if after reading a sample the learner returns the correct hypothesis and it does not
change its hypothesis output afterwards, the learner is then said to have identified the target
language. Gold further states that learning of a target language is impossible if only positive
samples are used, and that learning is possible only when both positive and negative samples
are provided, and each individual string is presented to the learner at some point during the
learning process [Gold, 78]. Active learning is a learning model proposed by Angluin
[Angluin, 81], where the learner can ask string membership and grammar equivalence
queries to an oracle. Together, the membership and equivalence queries form a Minimal
Adequate Teacher. In this model of learning, the algorithm L* [Angluin, 87] identifies
regular languages with only a polynomial number of queries. Under a set of conditions, it is
shown in [Angluin, 90] that languages are learnable using equivalence queries. To alleviate
the stringent requirement of exact learning in practical situations, the Probably
Approximately Correct (PAC) learning model was introduced by Valiant [Valiant, 84].
Learning is done by sampling from an unknown distribution over all possible samples, and
the learned theory is tested under the same distribution. Since the requirement of being able
to learn under any distribution is too stringent, a user defined error rate (ε) is permitted. The
Simple-PAC (PACS) model of learning relaxes the PAC model’s requirement of learning
under all distributions. The setting for PACS is learning under simple distributions, that is,
learning from simple strings that have a high probability of being sampled. PACS is based
on the Kolmogorov complexity measure [Li and Vitanyi, 97].
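Gold's identification-in-the-limit setting can be illustrated with a toy learner (this sketch is an illustration for this discussion, not drawn from the cited work): the learner enumerates a fixed, finite hypothesis class and, after each sample, outputs the first hypothesis consistent with everything seen so far; it identifies the target once its guess stabilizes. The hypothesis class and target below are invented for the example.

```python
# Minimal sketch of identification in the limit over an invented, finite
# hypothesis class; each hypothesis is a language given as a set of strings.
HYPOTHESES = [
    {"a"},                      # L0
    {"a", "ab"},                # L1
    {"a", "ab", "abb"},         # L2
    {"a", "ab", "abb", "abbb"}, # L3 (the target in this toy run)
]

def learner(samples):
    """Return the first enumerated hypothesis containing all positive samples."""
    for h in HYPOTHESES:
        if all(s in h for s in samples):
            return h
    return None

# Present the target language one string at a time; the learner's guess
# eventually stabilizes on the target and never changes afterwards.
seen = []
guesses = []
for s in ["a", "ab", "abbb", "abb"]:
    seen.append(s)
    guesses.append(learner(seen))

assert guesses[-1] == {"a", "ab", "abb", "abbb"}
```

Note that this works only because the hypothesis class is finite; as Gold showed, positive samples alone do not suffice for richer classes.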
Primarily, grammar learning research has focused on regular and context-free
grammars. A regular grammar is a 4-tuple G=(V, T, P, S) where V is a finite set of non-
terminals (NTs), T is a finite set of terminals, P is a finite set of productions (or rules) and S
is the starting NT. Rules are of the form NT → T, NT → T NT, and NT → ε. It has been
proved that exact identification in the limit of regular, context-free and context-sensitive
languages in the Chomsky hierarchy is NP-complete [Gold, 78]. Subsequent results showed
that the average case was polynomial for regular languages [Freund et al., 93]. Most of the
recent research in the regular language inference domain has focused on inducing
Abbadingo-style Deterministic Finite Automata (DFA) [Abbadingo, 07]. The Abbadingo-
style problems have superseded the Tomita language [Tomita, 82] as the current benchmark
set of grammars used for DFA induction. The advent of this benchmark problem set resulted
in two successful DFA inference algorithms: the Evidence Driven State Merging (EDSM)
[Lang et al., 98] algorithm, and the Self-Adaptive Greedy Estimate (SAGE) [Juille and
Pollack, 98] algorithm. Both algorithms substantially build on the accomplishments of the
Traxbar [Trakhtenbrot and Barzdin, 73] algorithm, which was one of the first attempts at
inducing canonical DFAs in polynomial time using a state-merging process on an
Augmented Prefix Tree Acceptor (APTA). EDSM employs a variant of the state-merging
process in which only the pair of nodes whose subtrees share the most similar labels are
merged. EDSM has a disadvantage in that considering every potential merge pair at each
stage of the inference process is computationally expensive. Although EDSM variants have
managed to decrease the running time, they are still an order of magnitude greater than those
of the Traxbar algorithm. SAGE uses random sampling techniques on search trees to control
the search. However, this random selection process ignores the clues in the training data,
resulting in unnecessary search iterations. As a result, SAGE is more computationally
expensive than EDSM and its derivatives. Ed-Beam [Lang, 98], a SAGE derivative, makes
use of the matching labels heuristic and has a lower computation time.
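The augmented prefix tree acceptor that these state-merging algorithms start from can be sketched as follows (a minimal illustration, not the Traxbar, EDSM, or SAGE implementations): positive and negative samples are folded into a prefix tree whose states carry accept/reject/unknown labels; the merging of compatible states, which performs the actual generalization, is omitted here.

```python
# Minimal APTA sketch: states are dicts; labels are True (accepting),
# False (rejecting), or None (unknown).
def build_apta(positive, negative):
    states = [{"label": None, "edges": {}}]   # state 0 is the root
    def insert(string, label):
        cur = 0
        for ch in string:
            if ch not in states[cur]["edges"]:
                states.append({"label": None, "edges": {}})
                states[cur]["edges"][ch] = len(states) - 1
            cur = states[cur]["edges"][ch]
        states[cur]["label"] = label
    for s in positive:
        insert(s, True)
    for s in negative:
        insert(s, False)
    return states

def accepts(states, string):
    cur = 0
    for ch in string:
        if ch not in states[cur]["edges"]:
            return False
        cur = states[cur]["edges"][ch]
    return states[cur]["label"] is True

apta = build_apta(positive=["a", "ab", "abb"], negative=["b"])
assert accepts(apta, "ab") and not accepts(apta, "b")
```

A state-merging inducer would then repeatedly collapse pairs of states whose labels do not conflict, shrinking this tree into a small DFA.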
Although considerable advances have been made in inferring regular grammars,
learning CFGs has proved to be more formidable. CFGs are more expressive and powerful
than regular grammars, and consequently are more practically important in many areas. A
CFG is a 4-tuple G=(V, T, P, S) where V is a finite set of non-terminals (NTs), T is a finite
set of terminals, P is a finite set of productions (or rules) and S is the starting NT. Rules are
of the form NT → (V ∪ T)*. Given that many CFG decision problems are undecidable (e.g.,
given two CFGs L1 and L2, is L1 ∩ L2 = Ø? is L1 = L2?), and that inference of CFGs is
impeded by the same theoretical issues that limit the inference of regular grammars,
heuristically inferring CFGs is the preferred method of learning. Various
approaches to CFG induction have been investigated, ranging from use of structural data to
Bayesian methods. Sakakibara [Sakakibara, 92] used skeletal parse trees for inducing CFGs
in polynomial time, and genetic algorithms have been used as a heuristic in various
approaches [Sakakibara and Muramatsu, 00]. In [Nakamura and Matsumoto, 05], an
inductive Cocke-Younger-Kasami (CYK) algorithm [Younger, 67] [Kasami, 65] which
learns simple CFGs is discussed. SEQUITUR [Nevill-Manning and Witten, 97] is an
algorithm that provides good compression rates for large text and runs in linear time, but
does not generalize the inferred CFG. SubDueGL [Jonyer et al., 03] uses the Minimum
Description Length (MDL) [Rissanen, 99] principle for inducing context-free graph
grammars. Langley and Stromsen also use the MDL principle and a representation change to
induce CFGs [Langley and Stromsen, 96].
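The CYK procedure underlying the inductive CYK approach mentioned above is easy to state; the following is a textbook membership check for a grammar in Chomsky normal form (the toy grammar is an invented example, not one from the cited work).

```python
# Standard CYK membership check for a grammar in Chomsky normal form.
# Rules: A -> B C stored as (A, (B, C)); A -> 'a' stored as (A, 'a').
def cyk(grammar, start, word):
    n = len(word)
    # table[i][j] = set of non-terminals deriving word[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        for lhs, rhs in grammar:
            if rhs == ch:
                table[i][0].add(lhs)
    for length in range(2, n + 1):            # span length
        for i in range(n - length + 1):       # span start
            for k in range(1, length):        # split point
                for lhs, rhs in grammar:
                    if isinstance(rhs, tuple):
                        b, c = rhs
                        if b in table[i][k - 1] and c in table[i + k][length - k - 1]:
                            table[i][length - 1].add(lhs)
    return start in table[0][n - 1]

# Toy CNF grammar for the language { a^n b^n | n >= 1 }.
G = [("S", ("A", "T")), ("S", ("A", "B")), ("T", ("S", "B")),
     ("A", "a"), ("B", "b")]
assert cyk(G, "S", "aabb") and not cyk(G, "S", "abb")
```

An inductive CYK learner runs this table construction in reverse: when a positive sample fails to parse, it hypothesizes new rules that would complete the table.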
So far, all the models and algorithms described are incapable of dealing with noisy
data, and inefficient at learning from positive samples only. Stochastic (probabilistic)
grammars and automata introduce probabilities in the model to handle these problems.
Inferring stochastic context-free grammars is a hard problem which has been found to be
useful in speech recognition [Wang and Acero, 02] and computational biology [Sakakibara,
05]. The VITERBI algorithm [Forney, 73] finds the most probable parse for a given string.
ALERGIA [Carrasco and Oncina, 94] uses a state merging operation to induce a stochastic
regular grammar from positive samples. The inside-outside algorithm [Baker, 79] estimates
the probabilities of the rules when the grammar is already known. In [Kammeyer and Belew,
96], a Genetic Algorithm is used to infer stochastic grammars, and the inside-outside
algorithm is used as local search. It is important to note here that the inside-outside algorithm
in this work is not used to explore the local search space of grammars, but rather is used to
fine tune the probabilities of the grammar when a new grammar is bred.
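The role probabilities play in a stochastic grammar can be made concrete with a small sketch (an invented toy grammar, not one of the cited systems): each rule carries a probability, the probabilities of rules sharing a left-hand side sum to one, and the probability of a derivation is the product of the probabilities of the rules it applies.

```python
# Minimal stochastic CFG sketch with an invented grammar for a^n b (n >= 0):
#   S -> a S  (0.4)  |  b  (0.6)
from math import isclose

PCFG = {"S": [(("a", "S"), 0.4), (("b",), 0.6)]}

def derivation_probability(rules_applied):
    """rules_applied: list of (lhs, rhs) pairs used in the derivation."""
    p = 1.0
    for lhs, rhs in rules_applied:
        p *= dict(PCFG[lhs])[rhs]   # look up the probability of this rule
    return p

# Derivation of "aab": S => aS => aaS => aab
p = derivation_probability([("S", ("a", "S")), ("S", ("a", "S")), ("S", ("b",))])
assert isclose(p, 0.4 * 0.4 * 0.6)
```

Algorithms such as inside-outside estimate these rule probabilities from data rather than assuming them, and Viterbi-style parsing maximizes this product over all derivations of a string.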

2.1.1 GenParse: An Evolutionary Approach to Inferring DSLs
This section gives an overview of a grammar inference system that uses
evolutionary techniques to infer grammars for DSLs. The GenInc algorithm presented in
Chapter 4 was created to address some of the limitations of GenParse.
Genetic Programming [Banzhaf et al., 98] (GP) is a biologically inspired machine
learning technique that has been successfully used to find solutions to a wide range of
hard problems [Koza et al., 99]. GenParse focuses on investigating an evolutionary
approach for inferring CFGs. Figure 2-1 provides an overview of the GenParse
architecture (more details are available in [Črepinšek et al., 06]). For effective use of an
evolutionary algorithm, there are several requirements: a suitable representation of the
problem, suitable genetic operators and parameters, and an evaluation function to
determine the fitness of chosen chromosomes. For the encoding of a grammar into a
chromosome, a direct encoding as a list of BNF production rules was used because this
encoding has been shown to outperform the bit-string representations [Wyard, 94].
Furthermore, specific one-point crossover, mutation and heuristic operators were
proposed as genetic operators. The heuristic operators used were the option operator, the
iteration+ operator, and the iteration* operator. Chromosomes were evaluated using the
LISA compiler generator (see Section 2.4) at the end of each generation by testing each
grammar on a set of positive and negative samples.
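The chromosome encoding and operators described above can be sketched as follows (a simplified illustration, not the actual GenParse implementation; in particular, the option operator shown here is one simplified reading of the heuristic operator of that name).

```python
# Chromosome = list of BNF productions; each production is a
# (non-terminal, right-hand-side symbol list) pair. Invented toy grammars.
import random

parent1 = [("S", ["E"]), ("E", ["E", "+", "T"]), ("E", ["T"]), ("T", ["id"])]
parent2 = [("S", ["E"]), ("E", ["T"]), ("T", ["T", "*", "F"]), ("F", ["id"])]

def one_point_crossover(a, b, rng):
    """Exchange production tails between two chromosomes at a random cut."""
    cut = rng.randrange(1, min(len(a), len(b)))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def option_operator(chromosome, index):
    """Simplified 'option' heuristic: make the last symbol of a rule optional
    by adding a copy of the rule without that symbol."""
    lhs, rhs = chromosome[index]
    if len(rhs) > 1:
        return chromosome + [(lhs, rhs[:-1])]
    return chromosome

rng = random.Random(0)
child1, child2 = one_point_crossover(parent1, parent2, rng)
assert len(child1) == len(parent1) and len(child2) == len(parent2)
```

In GenParse each resulting chromosome is then scored by generating a parser (via LISA) and counting how many positive and negative samples it classifies correctly.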





Figure 2-1 - The GenParse Induction Engine

Frequent sequences are a concept used in data mining, closely related to the much
more prominent data mining concept called frequent sets [Han and Kamber, 01]. A string
of symbols is called a frequent sequence if it appears at least Θ times, where Θ is some
preset threshold. The GenParse induction engine was augmented with this heuristic to
construct sublanguages of the goal language. Because a frequent sequence contains only a
small number of symbols, its derivation tree can be created using a brute-force
approach [Črepinšek et al., 05-c] in which all possible derivation trees are generated and
checked for suitability. The current implementation of the GP-based inference engine was
able to infer grammars for small imperative DSLs, more examples of which are available
in [Črepinšek et al., 05-a] [Črepinšek et al., 06]. However, GenParse was not able to infer
grammars of DSLs with more than 10 productions because of its inability to efficiently
search through the expansive combinatorial search space.
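The frequent-sequence heuristic itself can be sketched in a few lines (an invented illustration; the token samples are hypothetical):

```python
# A symbol string is a frequent sequence if it occurs at least Θ (theta)
# times across the positive samples.
from collections import Counter

def frequent_sequences(samples, length, theta):
    counts = Counter()
    for tokens in samples:
        for i in range(len(tokens) - length + 1):
            counts[tuple(tokens[i:i + length])] += 1
    return {seq for seq, n in counts.items() if n >= theta}

samples = [["read", "id", ";"], ["write", "id", ";"], ["read", "id", ";"]]
assert frequent_sequences(samples, 2, theta=3) == {("id", ";")}
```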


2.2 Domain Specific Modeling
Throughout the history of programming languages, abstraction was improved
through evolution towards higher levels of specification. Domain-specific modeling
(DSM) has adopted a different approach by raising the level of abstraction, while at the
same time narrowing the design space to a single domain of discourse with visual models
[Gray et al., 04-b]. When applying DSM, models are constructed that follow the domain
abstractions and semantics, allowing developers to perceive themselves as working
directly with domain concepts. The DSM language captures the semantics of the domain
and the production rules of the instantiation environment.
An essential characteristic of DSM is the ability to generate, or synthesize, new
software artifacts from domain models. This is typically accomplished by associating a
model interpreter with a particular modeling language. A model interpreter transforms the
concept structures into physical implementations in code. With the generative [Czarnecki
and Eisenecker, 00] approach available in DSM, there is no longer a need to make error-
prone mappings from domain concepts to design concepts, and on to programming
language concepts.

Example domains where DSM has been applied successfully are the
Saturn automotive factory [Long et al., 98], DuPont chemical factory [Garrett et al., 00],
numerous government projects supported by DARPA and NSF [OMG, 04], electrical
utilities [Moore et al., 00], and even courseware authoring support for educators
[Howard, 02].
An important activity in DSM is the construction of a metamodel that defines the key
elements of the domain. Instances of the metamodel can be created to define specific
configurations of the domain. An example is shown in Figure 2-2, which represents the
metamodel for a simple language for specifying properties of state machines. The
metamodel contains state machine concepts (e.g., start state, end state, state) as well as
the valid connections (transitions) among all entities. There are additional constraints that
further limit domain model configurations. The first constraint states that there can be no
out-going transitions from an end state:
parts("EndState")->forAll(y| y.connectedFCOs("dst")->size()=0)
The second constraint specifies that there can be only one transition leaving the start
state:
parts("StartState")->forAll(x| x.connectedFCOs("dst")->size()=1)
The third constraint states that there can be no incoming transitions into the start state:
parts("State")->forAll(x| x.connectedFCOs("dst")->forAll(y | y.kindName <> "StartState"))
An instance of this metamodel is shown in Figure 2-3, which is a simplified Automated
Teller Machine (ATM) model. The ATM model contains a StartState, an EndState, seven
State elements and the Transition elements between them. The domain-specific nature of
this model is evident from the icons and visualization of domain representations.
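The three OCL constraints above can be paraphrased as ordinary checks over a model instance. The following sketch uses a hypothetical in-memory representation, not GME's actual API: states are tagged with a kind, and transitions are (source, destination) pairs.

```python
# Invented state-machine instance roughly matching the ATM example.
states = {"start": "StartState", "s1": "State", "s2": "State", "end": "EndState"}
transitions = [("start", "s1"), ("s1", "s2"), ("s2", "end")]

def outgoing(transitions, name):
    return [t for t in transitions if t[0] == name]

def incoming(transitions, name):
    return [t for t in transitions if t[1] == name]

def check(states, transitions):
    # 1. No outgoing transitions from an end state.
    ok1 = all(not outgoing(transitions, n)
              for n, k in states.items() if k == "EndState")
    # 2. Exactly one transition leaving the start state.
    ok2 = all(len(outgoing(transitions, n)) == 1
              for n, k in states.items() if k == "StartState")
    # 3. No incoming transitions into the start state.
    ok3 = all(not incoming(transitions, n)
              for n, k in states.items() if k == "StartState")
    return ok1 and ok2 and ok3

assert check(states, transitions)
```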



Figure 2-2 - The State Machine Metamodel






Figure 2-3 - An Instance of a State Machine Representing an ATM Machine


2.3 The Generic Modeling Environment
The Generic Modeling Environment (GME) [Lédeczi et al., 01] is the modeling
tool used in the metamodel recovery research described in Chapter 3. The GME is a
metaconfigurable modeling environment that can be configured and adapted from
metamodels that describe the domain [Balasubramaniam et al., 06]. When using the
GME, an end-user loads a metamodel into the tool to define an environment containing
all the modeling elements and valid relationships that can be constructed in a specific
domain [Karsai et al., 04]. Domain models are stored in the GME as objects in a database
repository. An API is provided by GME for traversing a model. From the API, it is
possible to create model interpreters that traverse the internal representation of the model
and generate new artifacts (e.g., XML configuration files, source code, or even hardware
logic) based on the model properties. In the GME, a metamodel is described with Unified
Modeling Language (UML) class diagrams and constraints that are specified in the
Object Constraint Language (OCL) [Warmer and Kleppe, 03]. The metamodel shown in
Figure 2-2 was created using GME.
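Conceptually, a model interpreter is a traversal that emits a new artifact from the model's internal representation. The following is a hedged sketch using an invented in-memory model, not the actual GME interpreter API:

```python
# Hypothetical in-memory model of a state machine.
model = {"name": "ATM", "atoms": [
    {"kind": "StartState", "name": "Init"},
    {"kind": "State", "name": "EnterPIN"},
    {"kind": "EndState", "name": "Done"},
]}

def interpret(model):
    """Traverse the model and generate an XML configuration fragment."""
    lines = ['<machine name="%s">' % model["name"]]
    for atom in model["atoms"]:                      # traversal step
        lines.append('  <%s name="%s"/>' % (atom["kind"], atom["name"]))
    lines.append("</machine>")
    return "\n".join(lines)

xml = interpret(model)
assert '<StartState name="Init"/>' in xml
```

A real interpreter walks the repository through GME's COM-based API rather than a dictionary, but the generate-from-traversal pattern is the same.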

2.4 The LISA Language Development Environment
LISA [Mernik et al., 02] is an interactive environment for programming language
development where users can specify, generate, compile-on-the-fly and execute programs
in a newly specified language. From the formal language specifications and attribute
grammars, LISA produces a language-specific environment that includes a
compiler/interpreter and various language-based tools (e.g., language knowledgeable
editor, visualizers, and animators) for the specified language. LISA was created to assist
language designers and implementers in incremental language development [Mernik and
Žumer, 05] of DSLs. This was achieved by introducing multiple inheritance into attribute
grammars where the attribute grammar as a whole is subject to inheritance. Using the
LISA approach, the language designer is able to add new features (e.g., syntax constructs
and/or semantics) to the language in a simple manner by extending lexical, syntax and
semantic specifications. LISA was chosen for the metamodel recovery research described
in Chapter 3 because of its benefits in designing DSLs. The Model Representation
Language described in Section 3.2 and the metamodel inference engine described in
Section 3.3 were implemented using LISA.

2.5 The Design Maintenance System
The Design Maintenance System (DMS) [Baxter et al., 04] is a program
transformation system and re-engineering toolkit developed by Semantic Designs
(http://www.semdesigns.com). The core component of DMS is a term rewriting engine
that provides powerful pattern matching and source translation capabilities. In DMS
terminology, a language domain represents all of the tools (e.g., lexer, parser, pretty
printer, rule applier) for performing transformation within a specific programming
language. In addition, DMS defines a specific language called PARLANSE, as well as a
set of APIs (e.g., Abstract Syntax Tree API, Symbol Table API) for writing DMS
applications to perform sophisticated program analysis and transformation tasks. DMS
was chosen for the metamodel recovery research described in Chapter 3 because of its
scalability for parsing and transforming large source files in several dozen languages
(e.g., C++, Java, COBOL, Pascal). Because different modeling tools may use numerous
programming languages to produce model interpreters, different parsers are needed for
handling each language. In Section 3.4.1, DMS is used in a technique to parse model
interpreters to mine the type information of metamodel elements.

2.6 The ATLAS Transformation Language
The ATLAS Transformation Language (ATL) [Jouault and Kurtev, 06] is a
hybrid transformation language with declarative and imperative constructs which
facilitates the MDE software development process where models are the primary artifacts
and model transformations are the primary operations. ATL is a component of the
ATLAS Model Management Architecture (AMMA) [Kurtev et al., 06], a model
management platform which defines an experimental framework based on the principle
of models as first class entities. AMMA provides facilities for managing, weaving and
transforming models, as well the Kernel Meta Meta Model language (KM3) [Jouault and
Bézivin, 06] for specifying metamodels, or abstract syntaxes of DSLs, and the Textual
Concrete Syntax (TCS) language [Jouault et al., 06] for specifying textual concrete
syntaxes of DSLs.




Figure 2-4 – The ATL Transformation Approach (adapted from [Kurtev et al., 06])

Figure 2-4 gives an overview of the ATL transformation approach. The relation
between a model and a metamodel is called Conforms To while the metamodel is
expressed using a metametamodel. An example of a standard metametamodel is the Meta
Object Facility (MOF) [MOF, 07]. Ma and Mb are models which conform to metamodels
MMa and MMb respectively. The three metamodels MMa, ATL and MMb conform to
the metametamodel MOF. MMa2MMb.atl is a transformation program which conforms
to ATL and transforms the source model Ma to the target model Mb. ATL
transformations are unidirectional; bidirectional transformations are implemented as
separate transformations for each direction. In Section 3.4.2, ATL is used to provide a generic
front-end for the metamodel recovery system.
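The rule-based, unidirectional transformation pattern can be illustrated with a toy example (written in Python rather than ATL syntax; the Class-to-Table mapping and element representation are illustrative only): each rule matches source elements of one metamodel type and creates corresponding target elements.

```python
# Invented source model conforming to a class-diagram-like metamodel (MMa).
source_model = [{"type": "Class", "name": "Account"},
                {"type": "Attribute", "name": "balance"}]

# Declarative rules: source element type -> function producing a target
# element of the relational metamodel (MMb).
rules = {
    "Class": lambda e: {"type": "Table", "name": e["name"]},
    "Attribute": lambda e: {"type": "Column", "name": e["name"]},
}

def transform(model, rules):
    """Apply the matching rule to each source element (Ma -> Mb)."""
    return [rules[e["type"]](e) for e in model if e["type"] in rules]

target_model = transform(source_model, rules)
assert target_model[0] == {"type": "Table", "name": "Account"}
```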




CHAPTER 3
MARS: A METAMODEL RECOVERY SYSTEM

Domain-specific modeling (DSM) assists subject matter experts in describing the
essential characteristics of a problem in their domain. When a metamodel is lost,
repositories of instance models can become orphaned from their defining metamodel.
Within the purview of model-driven engineering, the ability to recover the design
knowledge in a repository of legacy models is needed.
This chapter describes the MetAmodel Recovery System (MARS), a semi-
automatic grammar-centric system that leverages grammar inference techniques to solve
the metamodel recovery problem. In addition to the specific details of the metamodel
inference technique used in MARS, a process for recovering name and type information
from the source code of model interpreters is introduced, as well as discussion of an
initial attempt at enabling MARS to accept as input XML files with non-GME specific
schema using a model transformation driven approach. The chapter also contains an
applicative case study, as well as experimental results from the recovery of several
metamodels in diverse domains. Related work and a brief summary serving as a
conclusion are presented at the end of the chapter.







3.1 Challenges and Current Limitations
As stated earlier, the Generic Modeling Environment (GME) [Lédeczi et al., 01]
is the modeling tool used in our research. An important activity in DSM is the
construction of a metamodel that defines the key elements of the domain. Instances of the
metamodel can be created to define specific configurations of the domain. An example is
shown in Figure 3-1, which represents the metamodel for a simple language for
specifying properties of a Finite State Machine (FSM). The metamodel contains FSM
concepts (e.g., start state, end state, and state) as well as the valid connections among all
entities. This metamodel is similar to the State Machine Metamodel in Figure 2-2 in that
it models State Machines but different in that the constraints are translated into the class
diagram. These constraints (“an end state cannot have any outgoing transitions”, “there
can be only one transition leaving the start state”, “there can be no incoming transitions
into the start state”) are translated into the class diagram by splitting the Transition
connection in Figure 2-2 into separate source (StateInheritanceSrc) and
destination (StateInheritanceDst) connections and changing the cardinalities of
the StartState and EndState atoms. An instance of this metamodel is shown in
Figure 3-2, which illustrates a FSM composed of a start state, a state and two end states.





Figure 3-1 - A Metamodel for Creating Finite State Machines



Figure 3-2 - An Instance of a Finite State Machine

The metamodel shown in Figure 3-1 represents the example to be used throughout
this chapter to demonstrate metamodel inference from individual instances. There are
several challenges involved in mining a set of domain models in order to recover the
metamodel. The key challenges and a summary of the solutions presented in this chapter
are as follows:




1. Inference Techniques for Domain Models: The research literature in the modeling
community has not addressed the issue of recovering a metamodel from a set of
domain models. However, a rich body of work has been reported in the grammar
inference community to infer a defining grammar for a programming language from a
repository of example programs. Inductive learning is the process of learning from
examples [Pazzani and Kibler, 92]. The related area of grammar inference can be
defined as a particular instance of inductive learning where the examples are sets of
strings defined on a specific alphabet. These sets of strings can be further classified
into positive samples (i.e., the set of strings belonging to the target language) and
negative samples (i.e., the set of strings not belonging to the target language).
Primarily, grammar learning research has focused on regular and context-free
grammars. It has been proven that exact identification in the limit of any of the four
classes of languages in the Chomsky hierarchy is NP-complete [Gold, 67], and in
[Freund et al., 93] it is shown that the average case is polynomial. A key contribution
of this chapter is the use of grammar inference techniques applied to the metamodel
inference problem.
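The positive/negative classification above can be made concrete with a membership test: given a candidate grammar in Chomsky normal form, the CYK algorithm decides whether a sample string belongs to the language. The following sketch uses a hypothetical toy grammar for a^n b^n, not one of the grammars handled by MARS:

```python
# A hedged sketch of sample classification by grammar membership, using
# the CYK algorithm. The grammar below is a hypothetical toy (Chomsky
# normal form, generating a^n b^n), not a MARS grammar.
def cyk(grammar, start, s):
    """Return True iff `s` is derivable from `start` in the CNF grammar."""
    n = len(s)
    if n == 0:
        return False
    # table[j][i] holds the non-terminals deriving s[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(s):                      # length-1 substrings
        for lhs, rhs in grammar:
            if rhs == (ch,):
                table[0][i].add(lhs)
    for span in range(2, n + 1):                    # longer substrings
        for i in range(n - span + 1):
            for split in range(1, span):
                for lhs, rhs in grammar:
                    if (len(rhs) == 2
                            and rhs[0] in table[split - 1][i]
                            and rhs[1] in table[span - split - 1][i + split]):
                        table[span - 1][i].add(lhs)
    return start in table[n - 1][0]

# S -> A T | A B,  T -> S B,  A -> 'a',  B -> 'b'
G = [("S", ("A", "T")), ("S", ("A", "B")), ("T", ("S", "B")),
     ("A", ("a",)), ("B", ("b",))]

positive = [s for s in ["ab", "aabb", "aaabbb"] if cyk(G, "S", s)]
negative = [s for s in ["ba", "aab", "abab"] if not cyk(G, "S", s)]
print(positive, negative)
```

A sample consistent with the target grammar lands in the positive set; one the grammar cannot generate lands in the negative set, which is exactly the partition the inference literature assumes.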

2. Model Representation Problem: Most modeling tools provide a capability to export a
model as an XML file. However, there is a mismatch between the XML
representation of a model and the syntax expected by the grammar inference tools. To
mitigate the effect of this mismatch, the technique presented in this chapter translates
the XML representation of an instance of a domain model into a textual domain-
specific language that can be analyzed by traditional grammar inference tools.




3. Mining Additional Model Repository Artifacts: In addition to the instance models,
there are other artifacts that can be mined in the modeling repository. For example,
the model interpreters contain type information that cannot be inferred from the
instance models. The key challenge with mining information from a model interpreter
is the difficulty of parsing the model interpreter source (e.g., a complex C++
program), and performing the appropriate analysis to determine the type information.
To address this problem, the chapter describes a technique that uses a program
transformation system to parse the model interpreter code and recover the type
information of metamodel entities.

To address these three challenges, we have created MARS, a metamodel recovery
tool based on the foundational research in grammar inference. An overview of MARS is
illustrated in Figure 3-3. MARS has three primary steps (see steps 1, 2 and 3 in Figure
3-3), with an extension step, labeled TI, representing the type inference step that is
described in Section 3.4.1. The metamodel inference process begins with the translation of various
domain models into a DSL that filters the accidental complexities of the XML
representation in order to capture the essence of the domain models (represented as step 1
in Figure 3-3). The inference is performed within the LISA [Mernik et al., 02] language
description environment (step 2 in Figure 3-3). LISA, as previously described, is an
interactive environment for programming language development where users can specify,
generate, compile-on-the-fly and execute programs in a newly specified language. LISA
was chosen for this project because of its benefits in designing DSLs [Mernik and Žumer,
05]. The LISA system has been used successfully in our evolutionary-based context-free



grammar (CFG) inference engine [Črepinšek, 05-b]. The result of the inference process is
a context-free grammar that is generated concurrently with the XML file containing the
metamodel, which can be used to load the domain models into the modeling tool (step 3
in Figure 3-3). The next section will introduce the generated DSL from the model
instances, as indicated by step 1 in Figure 3-3.

Figure 3-3 - Overview of MARS




3.2 The Model Representation Language
A grammar-based system is defined as, “any system that uses a grammar and/or
sentences produced by this grammar to solve various problems outside the domain of
programming language definition and its implementation. The vital component of such a
system is well structured and expressed with a grammar or with sentences produced by
this grammar in an explicit or implicit manner” [Mernik et al., 04]. One of the identified
benefits of a grammar-based solution is that some problems can be solved simply by
converting the representation at hand to a CFG, because appropriate tools and methods
for working with CFGs already exist.
A difficulty in inferring metamodels is the mismatch in notations. Each GME
domain model is persistently stored as an XML file, but the grammar inference process is
better suited for a more constrained language. The inference process could be applied to
the XML representation, but at a cost of greater complexity. To bridge the gap between
the representations, and to make use of already existing inference tools and techniques, a
DSL was created called the Model Representation Language (MRL).
The defining
properties of DSLs are that they are small, more focused than General-Purpose
Programming Languages (GPLs), and usually declarative [Mernik et al., 05]. Although
DSL development is challenging and requires domain knowledge and language
development expertise, language development toolkits exist to facilitate the DSL
implementation process. LISA was used to develop the MRL and supporting tools.
The primary use of the MRL is to describe the components of the domain models
in a form that can be used by a grammar inference engine. In particular, an MRL program
contains the various metamodel elements (e.g., models, atoms and connections) that are



stated in a specific order. The most useful information contained in the domain models is
the kind (type) of the elements, which can be a model, atom, field or connection. The
inference process is not concerned with the name of the modeling instance, but rather its
type. Thus, an MRL program is a collection of <kind, identifier> implicit bindings. In
terms of declaration order, models and atoms can be declared in any order. However,
there is a particular order to be followed when declaring the composition of models and
atoms.
A model must first declare any constituent atoms and models, followed by field
(attribute) and connection declarations. The complete grammar for the MRL is given in
Listing 3-1. Due to the fact that connections in GME are binary relationships [Lédeczi et
al., 01], the MRL has only binary relationships. Metamodels are more expressive than a
CFG in that they can model references and explicit aggregation and composition
relationships [Alanen and Porres, 04] [Wimmer and Kramler, 05], while CFGs need
additional annotations to be able to express them. Analyzing CFGs along with their
annotations to discover such relationships can result in increased processing time. In the
case of the MRL, all aggregations and compositions are modeled as compositions. The
MRL is able to handle references by utilizing transformation rules that exactly capture
such references and which can be seen as annotations to the base grammar. For example,
rule 2 in Table 3-2 allows exact identification of the source and destination of a
connection. The binary relationship inherent in the MRL results in a concise grammar
that expedites parsing of MRL programs. Thus, MRL is sufficiently succinct in its design
to be feasible for the inference process, but also expressive enough to represent the GME
metamodels accurately. Adopting MARS within a modeling tool other than the GME



would require an MRL grammar definition for that tool, but such a grammar is very
simple to produce if the meta-metamodel of the modeling tool is well-understood. The
MRL grammar would also need to be adapted to handle n-ary relationships if the tool
supports them.




START ::= GME
GME ::= MODEL_OR_ATOM {MODEL_OR_ATOM}
MODEL_OR_ATOM ::= MODEL | ATOM
MODEL ::= model #Id \{ M_BODY \}
M_BODY ::= [MODELS] FIELDS [connection CONNECT]
MODELS ::= #Id \; {#Id \;}
FIELDS ::= fields {#Id \,} \;
CONNECT::= {#Id \: #Id \-> #Id \;}
ATOM ::= atom #Id \{ FIELDS \}
Listing 3-1 - Context-Free Grammar for MRL
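To illustrate how an MRL program reduces to a collection of &lt;kind, identifier&gt; bindings, the following sketch scans a program shaped like Model 1 of Table 3-1. It uses regular expressions as a simplification; the MARS front end itself parses MRL against the grammar of Listing 3-1:

```python
# A rough sketch of reading an MRL program as <kind, identifier>
# bindings. The regular expressions are a simplifying assumption for
# illustration only, not the MARS implementation.
import re

def mrl_bindings(src):
    """Yield (kind, identifier) pairs for models, atoms and connections."""
    for m in re.finditer(r"\b(model|atom)\s+(\w+)\s*\{", src):
        yield (m.group(1), m.group(2))
    for m in re.finditer(r"(\w+)\s*:\s*\w+\s*->\s*\w+\s*;", src):
        yield ("connection", m.group(1))

program = """
model StateDiagram {
  StartState; EndState;
  fields;
  connection Transition : StartState -> EndState;
}
atom StartState { fields ; }
atom EndState { fields ; }
"""
print(sorted(set(mrl_bindings(program))))
```

The inference engine cares only about these kinds, not the instance names, which is why the bindings above carry no reference to any particular state in a concrete FSM diagram.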

A first stage of this work was to design a DSL that could accurately represent the
(visual) GME domain models in a textual form. This was accomplished through a visual-
to-textual-representation transformation process. Because the GME models can be
persistently stored as XML files, the transformation to MRL is done with Extensible
Stylesheet Language Transformations (XSLT) [Clark, 99]. XSLT is a
transformation language that can transform XML documents to any other text-based
format. XSLT uses the XML Path Language (XPath) [Clark and DeRose, 99] to locate
specific nodes in an XML document. XPath is an expression language for addressing
parts of an XML document.



We use XSLT and XPath to parse the XML-based model files and convert the model
information in the XML file into an intermediate but equivalent representation in the
form of MRL. We detail the transformation process using our running example of the
FSM domain. One of the relatively intricate problems was to extract all the connection
sets of a particular model. In a GME model, a connection is described by a connection
name, the source element, and the destination element. Listing 3-2 shows an XML
fragment from a Network domain instance model XML file.

1 <connection id="id-0068-00000070" kind="Transition" role="Transition">
2 <name>Transition</name>
3 <connpoint role="dst" target="id-0066-00000080"/>
4 <connpoint role="src" target="id-0066-00000086"/>
5 </connection>
6 <connection id="id-0068-00000071" kind="Transition" role="Transition">
7 <name>Transition</name>
8 <connpoint role="dst" target="id-0066-00000084"/>
9 <connpoint role="src" target="id-0066-00000081"/>
10 </connection>
11 <connection id="id-0068-0000006c" kind="Transition" role="Transition">
12 <name>Transition</name>
13 <connpoint role="src" target="id-0066-00000083"/>
14 <connpoint role="dst" target="id-0066-00000084"/>
15 </connection>

Listing 3-2 - XML Code Fragment from a Network Domain Instance Model
The fragment describes three XML container elements of kind Transition.
Container elements are composite structures that can contain other elements. The
connection element is composed of name and connpoint elements (see lines 2-4).
The connpoint elements describe the source and destination of the connection. In a
GME model XML file, each element is assigned a unique ID. The connpoint elements
do not contain the name of the source or destination elements. Rather, only their IDs are



mentioned in the target attribute tag. Thus, in order to retrieve the name and type of
the source and destination elements, the instance model XML file had to be searched to
find the element with the requisite ID, and then extract the type and name from that
particular element declaration. This was accomplished by the XSLT transformation in
Listing 3-3. The XPath expression “connpoint[@role='src']/@target” in line
6 translates to “the target attribute of the connpoint element which has a role attribute
value of ‘src’ ”. In lines 6 and 7, these XPath expressions are used in XSLT statements to
assign the source and destination IDs to variables that are then passed as parameters to
the XSLT template scopeX (lines 19-23). The main purpose of the scopeX template is
to extract the value of the kind attribute of the element whose ID is passed to it. The
attribute kind indicates the metamodel element of an instance. Note that we are more
concerned with the kind of the element than its name. The XPath expression
"//parent::node()[@id = $targ]/@kind" selects the kind attribute of the
element whose ID attribute matches the $targ variable. Part of the MRL program
generated after the transformation is shown in Model 2 in Table 3-1, which shows an
instance of the StateDiagram model. Using our XSLT translation engine, the
connection information in Listing 3-2 is transformed into an equivalent MRL
representation.






1 connections <br/>
2 <xsl:for-each select = "connection">
3 <xsl:variable name = "conn2" select = "@id" />
4
5 <xsl:variable name = "connpt2"
6 select = "connpoint[@role='src']/@target" /> <xsl:variable name =
7 "connpt4" select = "connpoint[@role='dst']/@target" />
8
9 <xsl:value-of select="name"/> :
10 <xsl:call-template name = "scopeX">
11 <xsl:with-param name = "targ" select = "$connpt2"> </xsl:with-param>
12 </xsl:call-template>
13 ->
14 <xsl:call-template name = "scopeX">
15 <xsl:with-param name = "targ" select = "$connpt4"> </xsl:with-param>
16 </xsl:call-template> ;
17 <br/>
18
19 <xsl:template name = "scopeX" >
20 <xsl:param name = "targ"> </xsl:param>
21 <xsl:variable name = "parenter" select = "//parent::node()
23 [@id = $targ]/name" />
24 <xsl:variable name = "IDer" select = "//parent::node()
25 [@id = $targ]/@kind" />
26 <xsl:value-of select = "$IDer"/>
27
28 </xsl:template>

Listing 3-3 - XSLT Code to Extract the Connection Elements.
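The ID-to-kind lookup that the scopeX template performs can be mirrored in a few lines of standard-library Python. This sketch is illustrative only (MARS itself does this step in the XSLT above), and the XML fragment is a made-up miniature in the shape of Listing 3-2:

```python
# A stdlib sketch of the scopeX lookup: each connpoint target ID is
# resolved to the `kind` attribute of the element carrying that ID.
# The XML fragment is hypothetical; MARS performs this step in XSLT.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<model>
  <atom id="id-1" kind="StartState"><name>S0</name></atom>
  <atom id="id-2" kind="EndState"><name>S1</name></atom>
  <connection id="id-9" kind="Transition" role="Transition">
    <name>Transition</name>
    <connpoint role="dst" target="id-2"/>
    <connpoint role="src" target="id-1"/>
  </connection>
</model>
""")

# Index every element by its unique GME ID; the stylesheet's
# //parent::node()[@id = $targ] search does this per lookup.
kind_by_id = {e.get("id"): e.get("kind") for e in doc.iter() if e.get("id")}

lines = []
for conn in doc.findall("connection"):
    src = conn.find("connpoint[@role='src']").get("target")
    dst = conn.find("connpoint[@role='dst']").get("target")
    lines.append(f"{conn.findtext('name')} : "
                 f"{kind_by_id[src]} -> {kind_by_id[dst]} ;")
print("\n".join(lines))  # one MRL connection declaration per element
```

Note that the output lines already have the shape of the MRL connection declarations in Table 3-1, which is exactly the target of the XSLT translation.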
However, it is beneficial to prune the generated MRL program before it is used as
an input for the grammar inference phase detailed in Section 3.5. The domain instance
models usually have similar instances of atoms and models in their definitions (note the
duplications in Model 2 in Table 3-1). Multiple declarations of a particular model
occasionally vary in their compositions and can be useful in inferring the correct
cardinality of the model’s composing elements. However, the atom definitions are static
and multiple declarations of the same atom can be removed. Although the generated



program can be used for the next phase, it is desirable to pare down the program for
improved readability and succinctness.
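This pruning can be sketched as follows. The sketch drops duplicate atom declarations while keeping repeated model declarations, whose varying compositions feed the later cardinality inference; the block-splitting approach is a simplifying assumption, not the MARS implementation:

```python
# A sketch of the pruning step: duplicate atom declarations are dropped,
# while repeated model declarations are kept for cardinality inference.
# The regex-based block splitting is an assumption for illustration.
import re

def prune_atoms(mrl):
    seen, out = set(), []
    # Split the program into declarations, keeping each model/atom
    # keyword attached to its block via a zero-width lookahead.
    for block in re.split(r"(?=\bmodel\b|\batom\b)", mrl):
        key = block.strip()
        if not key:
            continue
        if key.startswith("atom"):
            if key in seen:
                continue            # identical atom already emitted
            seen.add(key)
        out.append(key)
    return "\n".join(out)

raw = ("atom State { fields ; }\n"
       "atom State { fields ; }\n"
       "atom EndState { fields ; }")
print(prune_atoms(raw))
```

Applied to Model 2 of Table 3-1, this would collapse the two identical State atoms into one while leaving the StateDiagram model untouched.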
The domain models shown in Figure 3-4, which are based on the metamodel of
Figure 3-1, are represented as the MRL programs in Table 3-1. For example, model 1 is
composed of a StateDiagram model and two State atoms. A StateDiagram
consists of one StartState, EndState and a connection named Transition that
connects both states. The Start states in Figure 3-4 are atoms with no fields and are
transformed to atom StartState { fields; }. Section 3.3.3 describes the
transformation between the two representations in more detail. The primary artifact from
which a metamodel will be inferred is the intermediate textual representation as translated
in the MRL. The MRL serves as input into the next stage of the MARS metamodel
recovery process, as described in Section 3.3.


Figure 3-4 - Two Instances of the FSM Metamodel







Table 3-1
Two MRL Programs as Representations of Models from Figure 3-4
Model 1:

model StateDiagram {
  StartState;
  EndState;
  fields;
  connection
    Transition : StartState -> EndState;
}

atom StartState {
  fields ;
}

atom EndState {
  fields ;
}

Model 2:

model StateDiagram {
  StartState;
  EndState;
  State;
  State;
  fields;
  connection
    Transition : StartState -> State;
    Transition : State -> State;
    Transition : State -> EndState;
}

atom StartState {
  fields ;
}

atom EndState {
  fields ;
}

atom State {
  fields ;
}

atom State {
  fields ;
}


3.3 The Metamodel Inference Engine
As previously discussed in Section 3.2, the correspondence between a
metamodel and its instances parallels that between a programming language and the valid
programs defined by the language. By considering a metamodel as a representation of a CFG, named G, the
corresponding domain models can be delineated as sentences generated by the language
L(G). This section describes the subcomponent of the MARS system that applies CFG
inference methods and techniques to the metamodel inference problem.





3.3.1 Metamodels as Context-Free Grammars
In the next step toward defining a process to infer a metamodel, all of the
relationships and transformation procedures between metamodels and CFGs are
identified. A key part of the process involves the mapping from the metamodel
representation to the non-terminals and terminals of a corresponding grammar.
The role
of non-terminal symbols in a CFG is two-fold. At a higher level of abstraction, non-
terminal symbols are used to describe different concepts in a programming language
(e.g., an expression or a declaration). At a more concrete lower level, non-terminal and
terminal symbols are used to describe the structure of a concept (e.g., a variable
declaration consists of a variable type and a variable name). Language concepts, and the
relationships between them, can be represented by CFGs. This is also true for the GME
metamodels [Karsai et al., 04], which describe
concepts (e.g., model, atom) and the
relationships that hold between them (e.g., connection). Therefore, both formalisms can
be used for the same purpose at differing levels of abstraction, and a two-way
transformation from a metamodel to a CFG can be defined. The transformations relating
a metamodel to a CFG are depicted in Table 3-2. Note that the type information of fields
is not converted into the CFG representation because this information is not available in
the domain models. MARS infers all the fields as generic ‘field’ types. The user can
modify the field information manually after loading the inferred metamodel.







Table 3-2
Transformation from a Metamodel to a Context-Free Grammar

1. NAME → ’atom’ name { FIELDS }
   FIELDS → ’fields’ field1 ... fieldn

2. NAME → ’connection’ name ’:’ SRC -> DST ;
   SRC → SRC_NAME
   DST → DST_NAME

3. NAME → ’model’ name { PARTS }
   PARTS → MODELATOM FIELDS CONNECTIONS
   FIELDS → ’fields’ field1 ... fieldn
   MODELATOM → ...
   CONNECTIONS → ...
   (see transformations 8 and 9)

4. FCO → ’fco’ NAME
   NAME → NAME1 | ... | NAMEn

5. NAME → NAME1S
   NAME1S → NAME1 NAME1S | ε

6. NAME → NAME1S
   NAME1S → NAME1 NAME1S | NAME1

7. NAME → NAME1S
   NAME1S → NAME1 | ε

8. CONNECTIONS → NAME1 ... NAMEn
   (see transformation 3)

9. MODELATOM → NAME1 ... NAMEn
   (see transformation 3)
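Transformations 5-7 are how cardinalities survive the round trip: the shape of the alternatives of the list non-terminal encodes whether an element occurs 0..*, 1..*, or 0..1 times. The helper below is hypothetical (not part of MARS) and sketches reading a cardinality back out of such a rule:

```python
# A sketch of how the list-shaped rules of Table 3-2 encode
# cardinalities. `cardinality` is a hypothetical helper, not MARS code.
def cardinality(lhs, alternatives):
    """`alternatives`: RHS of `lhs` as tuples of symbols; () is epsilon."""
    recursive = any(lhs in alt for alt in alternatives)
    has_epsilon = any(alt == () for alt in alternatives)
    if recursive and has_epsilon:
        return "0..*"       # transformation 5
    if recursive:
        return "1..*"       # transformation 6
    if has_epsilon:
        return "0..1"       # transformation 7
    return "1..1"

assert cardinality("NAME1S", [("NAME1", "NAME1S"), ()]) == "0..*"
assert cardinality("NAME1S", [("NAME1", "NAME1S"), ("NAME1",)]) == "1..*"
assert cardinality("NAME1S", [("NAME1",), ()]) == "0..1"
```

For instance, production 4 of Listing 3-4 (STATES → STATE STATES | ε) is an instance of transformation 5, so a State may occur any number of times in a StateDiagram.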
As an example application of the transformations in Table 3-2, the Finite State
Machine (FSM) metamodel shown in Figure 3-1 is semantically equivalent to the
corresponding CFG represented in Listing 3-4.
The obtained CFG is a rather intuitive
representation of the metamodel in Figure 3-1. Productions 1 and 2 state that a
StateDiagram (or FSM) is a model consisting of models, atoms, fields and
connections. Models and atoms (production 3) that can be used in a FSM are
StartState, EndState, and State. A FSM has no fields (production 6) and at
least one connection called Transition (production 8). The source of the connections
can be StartState or State (productions 10 and 12), and the destination can be
EndState or State (productions 11 and 13), all of which are atoms without fields
(productions 14 – 19). From such a description, a metamodel can be drawn manually
using the GME tool. However, this manual process can be automated. In the final step of
the MARS inference system, the CFG is transformed to the GME metamodel XML
representation that can be loaded by GME. This is accomplished by using transformation
rules in Table 3-2.




1 STATEDIAGRAM → ’model’ StateDiagram { PARTS0 }
2 PARTS0 → MODELATOM0 FIELDS0 CONNECTIONS0
3 MODELATOM0 → STARTSTATE STATES ENDSTATES
4 STATES → STATE STATES | ε
5 ENDSTATES → ENDSTATE ENDSTATES | ENDSTATE
6 FIELDS0 → ε
7 CONNECTIONS0 → TRANSITIONS
8 TRANSITIONS → TRANSITION TRANSITIONS | TRANSITION
9 TRANSITION → ’connection’ transition : SRC0 -> DST0
10 SRC0 → ’fco’ STATEINHERITANCESRC
11 DST0 → ’fco’ STATEINHERITANCEDST
12 STATEINHERITANCESRC → STARTSTATE | STATE
13 STATEINHERITANCEDST → STATE | ENDSTATE
14 STARTSTATE → ’atom’ StartState { FIELDS1 }
15 FIELDS1 → ε
16 STATE → ’atom’ State { FIELDS2 }
17 FIELDS2 → ε
18 ENDSTATE → ’atom’ EndState { FIELDS3 }
19 FIELDS3 → ε

Listing 3-4 - Context-Free Grammar Representation of the Metamodel in Figure 3-1
3.3.2 Inferring the Metamodel from Domain Models
Before formally explaining the metamodel inference engine in the next section,
we first describe the inference process from a less formal viewpoint. The working
example is the FSM metamodel introduced earlier in Figure 3-1. The metamodel contains
FSM concepts (e.g., start state, states, end states) as well as valid connections among all