NLCI: a natural language command interpreter

Natural language interfaces are becoming more and more common, because they are powerful and easy to use. Examples of such interfaces are voice controlled navigation devices, Apple’s personal assistant Siri, Google Voice Search, and translation services. However, such interfaces are extremely difficult to build, to maintain, and to port to new domains. We present an approach for building and porting such interfaces quickly. NLCI is a natural language command interpreter that accepts action commands in English and translates them into executable code. The core component is an ontology that models an API. Once the API is “ontologized”, NLCI translates input sentences into sequences of API calls that implement the intended actions. Two radically different APIs were ontologized: openHAB for home automation and Alice for building 3D animations. Construction of the ontology can be automated if the API uses descriptive names for its components. In that case, the language interface can be generated completely automatically. Recall and precision of NLCI on a benchmark of 50 input scripts are 67 and 78 %, resp. Though not yet acceptable for practical use, the results indicate that the approach is feasible. NLCI accepts typed input only. Future work will use a speech front-end to test spoken input.


Introduction
User interfaces have been growing in complexity for decades. While the command line was sufficient in the seventies, graphical user interfaces appeared in the eighties, web interfaces in the nineties, and touch screen interfaces in the last decade. The next generation of interfaces will handle unrestricted text and speech as input. Examples where talking to computers is a reality today are navigation devices, Apple's Siri (Bellegarda 2014), Google Voice Search (Ortiz 2014), and several translation services. While mobile applications are the trailblazers for such interfaces, users will soon expect them to be available everywhere. As a consequence, software developers will have another demanding task on their hands: speech and text interfaces are extremely hard to build. They involve competence in speech and natural language processing (NLP), natural language (NL) grammars, and inference engines that map the input to whatever the application requires. This paper describes the natural language command interpreter (NLCI), an architecture that simplifies the construction of such interfaces dramatically.
All the developer needs to do is build an ontology for an API. This API can then be controlled with textual commands in unrestricted English. The interface can be used, for instance, for instructing robots, programming home automation systems, manipulating spreadsheets, controlling games, or for working with any API that is suitable for end-user programming. Though NLCI is limited to written input at the moment, future work will use a speech front-end to generate text for processing by NLCI.
The important advance reported in this paper is that interfaces that work with unrestricted English text require only a domain ontology to be built and no other expertise. The ontology can even be generated automatically, if the API has certain properties. Our approach is a first step on the road to simplifying the construction of next-generation user interfaces.
The paper's core is an explanation of how the ontology acts as a bridge between a sentence in natural language and the code that implements the sentence. To demonstrate that NLCI is easily adapted to a new domain, we present natural language interfaces for two radically different application areas: openHAB, a small home automation system, and Alice (Conway 1997), a sophisticated 3D animation software. We report precision and recall for both interfaces.
Section 2 discusses related work and Sect. 3 explains our approach in detail. Section 4 evaluates our approach in two domains: 3D animations in Alice and scripts for openHAB. Section 5 discusses future work and concludes the paper.

Related work
NLCI is essentially a tool for programming in natural language. It is meant for issuing individual commands as well as composing straight-line scripts consisting of such commands, not for sophisticated software such as operating systems or even complex algorithms. Programming in natural language is a sub-area of end-user development. Lieberman et al. define end-user development "as a set of methods […] that allow users of software systems […] to create, modify, or extend a software artifact" (Lieberman et al. 2006).
NLCI's goals and techniques overlap with two research areas. The most obvious one is programming in natural language; the first subsection reviews work from the 1960s to recent approaches. Because NLCI uses an ontology to model the domain and must map the natural language input to elements therein, it also uses techniques similar to those for querying ontologies and databases in natural language. The second subsection reviews work related to this aspect of NLCI.

Programming and scripting in natural language
The idea of programming in natural language was first discussed by Sammet (1966) in the 1960s. She envisioned computers that can be programmed not only with complex mathematical formulas but also in general English. The computer should, according to Sammet, understand natural language and deal with its weaknesses. Computer users (i.e., non-professionals) should be enabled to program computers efficiently, using programming languages or natural language as desired.
An early contribution to programming in natural language is the natural language computer (NLC) introduced by Ballard and Biermann (1979). NLC performs matrix calculations given commands such as "Choose a row in the matrix. Put the average of the first four entries in that row into its last entry". The underlying approach relies on a hand-crafted syntax analysis and a domain-specific dictionary. NLC is also able to perform rudimentary reference resolution. The approach works well within the limited domain (Biermann et al. 1983). The authors note that programming in natural language should allow unconstrained language, but will be domain-specific.
Complementary approaches allow a nearly unlimited domain, but constrain the language. Two popular approaches are AppleScript and NaturalJava. AppleScript can be used to create scripts for Apple's Mac OS. It provides an easy-to-understand syntax because it resembles natural language (Goodman 1998). However, users must learn the syntax and the restrictions of AppleScript. NaturalJava is a tool for dictating Java source code using natural language speech (Price et al. 2000). It only supports a small subset of English. Consequently, users must not only adhere to the limitations of NaturalJava, but also know the details of Java itself. NaturalJava is meant for entering code rather than synthesizing code from user intention.
Pane et al. (2001) investigated how to make future programming languages more user-centered. To this end, they studied how non-programmers describe programming solutions. They report that users tend not to dictate control structures such as loops directly: users do not know the concept of loops but describe actions set-wise (e.g. "A, B, and C do X") and rely on the machine to correctly interpret the input.
Liu and Lieberman (2004) analyzed the expressiveness of natural language for programming. They found that unrestricted natural language can cover all programming concepts and that its inherent ambiguity and expressiveness should be seen as an advantage, not as a drawback (Liu and Lieberman 2004). Their system Metafor transforms stories into program skeletons. It automatically creates classes, variables and method stubs, but leaves the implementation of methods to human developers (Liu and Lieberman 2005). The input for Metafor is restricted to simple sentences that match the "subject, predicate, object" pattern. Even though Metafor produces only stubs in this early stage, the paper indicated that unrestricted natural language could indeed be used as a programming language. Later on, Metafor was extended with a module for loops, conditions, and comments (Mihalcea et al. 2006). The authors also explain how to generate runnable Python code. Still, Metafor is constrained to generate Python whereas NLCI is platform independent by design.
Vadas and Curran (2005) showed that it is possible to create runnable program code from unrestricted natural language. Their approach is inspired by Metafor and uses deep semantics derived by a parser for combinatory categorial grammars (CCG). A set of patterns translates parts of CCG derivations into runnable Python code. As with NaturalJava, the user has to know and dictate the code using concepts of Python.
Pegasus is intended to convert from unrestricted natural language to any programming language in any domain (Knöll and Mezini 2006). Knöll and Mezini propose to use a graph-based internal representation for natural language that resembles an ontology. It can represent both structural (i.e. static) and dynamic elements. The translation from the internal representation to code should be based on patterns. In more recent work they describe methods to increase the "naturality" of programming languages instead of programming in NL (Knöll et al. 2011).
Guzzoni describes an ontology-based approach for creating intelligent assistants called active ontologies (Guzzoni et al. 2007). An intelligent assistant interprets natural language input and triggers actions like reserving a table in a restaurant. Guzzoni uses an ontology to model the target application, e.g. what is needed to perform a restaurant reservation. Processing rules and actions (e.g. scripts) are stored within the ontology's concepts and make the ontology "active", i.e. the natural language processing is domain dependent and strongly coupled with the domain model. If an action command is successfully recognized, the ontology triggers the command through predefined web service connectors. Guzzoni provides an integrated development environment that focuses on efficiently modeling the domain, deploying, and running the active ontologies. Yet Active is intended to recognize single commands from one or more pieces of input.
SmartSynth creates smart phone scripts from natural language (Le et al. 2013). A pattern matcher identifies language features and maps them to smart phone methods or scripts. If appropriate method or script parts cannot be identified in the language features, it derives the missing parts from context. Whenever this resolution fails, SmartSynth asks the user for clarification. SmartSynth is intended for automating smart phone scripts, and Le et al. (2013) do not indicate how to switch to a different domain.
Manshadi et al. show how programming in natural language can be combined with programming by example, another end-user programming technique (Manshadi et al. 2013). Their tool is limited to string modification in spreadsheets. The user is expected to give examples (i.e. input/output pairs) and a high-level description of the intended program behavior in natural language. The tool extracts the desired string modification from the examples using a maximum entropy model. The description of the intended behavior is used to add features to the model and thereby improves its accuracy.
Thummalapenta et al. do not target usual end-users but testers in their 2012 and 2013 publications about automating test automation (Thummalapenta et al. 2012, 2013). They generate test scripts from natural language and target keyword-based testing of web-based systems such as bug trackers. Their technique uses the system under test as a guideline during the translation phase; thus the automated tests are guaranteed to work if the generator succeeds, i.e., the generator executes the system under test to verify that the system can be addressed with the generated test steps. They do not try to understand unrestricted natural language but "rely on the observation that the style in which testers write manual steps have a very predictable structure […] Moreover, the tests have a restricted vocabulary […]" (Thummalapenta et al. 2012). Generating automatic tests is an interesting application for NLCI but we have not yet explored testing. The system under test and the test driver's functionality could be modeled in the NLCI ontology.

Natural language interfaces to databases (NLIDB)
Querying databases with natural language has been researched since the 1970s; Waltz (1975) published one of the first papers in this area. Androutsopoulos et al. (1995) review the field and Martinez-Barco et al. (2013) published an overview of the applications of NLIDB. Both the domain (database tables or ontologies) and the output (e.g., SQL or SPARQL queries) are limited. NLCI, by contrast, targets any end-user API in any domain.
NLDB, a conference on applications of natural language to information systems, is in its 20th iteration in 2015 and covers a wealth of topics, including, for instance, information retrieval. A recent tool for querying ontologies with natural language is PANTO (Wang et al. 2007). It uses Stanford CoreNLP (Manning et al. 2014) to build parse trees and then extracts constituents to build an internal representation called query triples. Query triples can be mapped to ontological structures. If the triple is mapped successfully, PANTO derives a SPARQL query.
FREyA is another NL tool for querying ontologies (Damljanovic et al. 2010). The authors also use Stanford parse trees to extract words from the user input to create SPARQL queries. If mapping the words to ontology concepts fails, the user is presented with ranked alternatives. Initially, the ranking uses string similarity and synonym usage only, but the users' choices are fed back to the system for semi-supervised reinforcement learning.
The semantic web provides a wealth of information but querying is hard. Unger and Cimiano demand that question answering systems "bridge the gap between the user and the data […]" (Unger and Cimiano 2011). Their tool Pythia uses a domain-dependent language model to align the natural language terms in a user query to the terminology of the domain. Their approach works well in small domains but needs non-negligible effort for larger domains.
PowerAqua, the successor of AquaLog, also targets users that query the semantic web (Lopez et al. 2012). It provides a single natural language interface for querying multiple heterogeneous ontologies. Its natural language analyzer constructs query triples from a query in a way similar to PANTO (Lopez et al. 2006). Then different modules determine the ontologies that are likely to answer the query and gather the results. In the last step, the results are ranked and merged (if necessary) to produce the final answer.

Summary
Related work about programming in natural language is either domain-specific or restricts the language usable for input. We place no restrictions on the input syntax. NLCI must also deal with grammatical flaws common in spontaneous speech. We accept for the time being that natural language command interpretation will be domain-specific. For this reason we developed an approach that makes it easy to acquire new domains. It should be interesting to see how large the domains can become, and whether several domains can be handled correctly by a single interpreter. The fact that NLCI uses an ontology as the central database for domain information renders it similar to ontology or database querying approaches. NLCI does not target general ontology querying but the identification of certain concepts (e.g. classes and methods). Its query engine is therefore simpler, yet specialized for the task at hand.

Architecture of NLCI
Processes that translate NL into source code usually either target a specific domain, restrict the input language, or both. Our language analysis is completely domain agnostic. We interpret the term "domain" very strictly: of course, two different APIs form two separate domains; two APIs for the same application but written for different platforms (e.g. Java and C#) also form two different domains. The domain knowledge is stored in an ontology and loaded before processing the input. All information derived is annotated in the input text, and the (necessarily platform-specific) code generation engine can make use of it without knowing anything about natural language.
Roughly speaking, the interpretation process works as follows. Every noun phrase in the input text is mapped to a class in the API; every action, usually expressed by a verb phrase, is mapped to a method, potentially with parameters. Thus, the API must provide a method for every action that users may request. To facilitate the mapping, we use an ontology. The ontology contains entries for every class, object, or method in the domain, including synonyms, plus the usual information regarding inheritance as well as composition hierarchies with meronyms and hypernyms. Given an action verb, for instance, the mapper will look for entries in the ontology that match that verb. In practice, there may be multiple matches. Disambiguation is done by a ranking method described later. The highest ranked match will be used by the code generator to emit a method call on a class, including parameters.
Figure 1 illustrates the overall process: first we populate the ontology that contains the domain specifics (i.e. the API with all classes and their methods); to allow for fuzzy language matching, we enrich the API with synonyms from WordNet. The domain ontology must be built only once per API. Given an input script, we parse it, enrich it with structural information (such as control structures), and identify actors, actions, and (grammatical) objects. All relevant information is annotated in the text. For every sentence we identify the classes and the methods to invoke (including parameters). The final step produces source code for the target platform. The following subsections explain the ontology, the language analysis steps, and the program synthesis in greater detail.

The NLCI ontology
An ontology is a concrete representation of knowledge about a specific domain (Gruber 1993). It is populated by concepts that form a hierarchy. The relations between the concepts can be defined as needed. Alongside the concepts, there are individuals. Concepts and individuals are the ontology's equivalent to classes and instances in the object-oriented world. Ontologies have no members (in the sense that a method is a member of a class) but relationships that link concepts with each other: a Java class would become a concept, as would its methods; relations then link the class and its methods.

Structure
The structure of our ontology is similar to the one that Yang et al. proposed in their 1999 paper (Yang et al. 1999) and to the one Zhang et al. (2006) used for traceability recovery: we designed the ontology to capture the major concepts of object-oriented programming languages (cf. Table 1). API elements are individuals of these concepts. For example, Java's StringBuilder is an individual of the Class concept and its method append(s: String) is an individual of the Method concept. The ontology also contains for each class the path to the file that implements it plus the synonyms of the class' name derived from WordNet. Since inference engines support inheritance, we can use method inheritance when populating the ontology. For example, when modeling a Java API, one needs to specify toString() only once. The structure of the ontology is provided as a template that has to be populated with a specific API.
Listing 1 Pseudocode for adding a method to the ontology.
// owl provides access to the OWL ontology
// methodConcept is the concept for API methods, see

Population
The template of the domain ontology can be populated by hand (for example with Protégé, Knublauch et al. 2004). For large APIs, it is advisable to write a program for this task. The natural language analysis phase assumes that the ontology uses descriptive identifiers, in particular nouns or compound nouns for classes, and verbs or verb phrases for methods. These can be extracted directly from the API, if present. Splitting of identifiers is automatic (the splitter considers camel case, underscores, and other string splitting heuristics). Otherwise, suitable identifiers need to be inserted by hand.
When developing an ontology generator, one can use all the information available, e.g., when ontologizing Java libraries one should use the JavaDocs as well as in-line comments to gather as much useful information as possible. To generate an ontology from a Java API one can use its source code, the compiled classes, or JAR files. The first step is to access the classes of the API, e.g. to parse the API's source code with a Java parsing library such as Javaparser. Then every class is added to the ontology, followed by its members and all data types needed by the methods (as parameters and return types). Listing 1 shows how to add methods to the ontology. A simple generator based on Javaparser can be implemented in 330 lines of Java code; access to the ontology is provided by NLCI. We built ontology generators for two APIs: openHAB and Alice (cf. Sects. 4.1.1 and 4.2.1).
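The identifier splitting mentioned above can be sketched as follows. This is a minimal illustration in Python, not NLCI's splitter: the function name `split_identifier` and the concrete regular expressions are assumptions, and only camel case, acronyms, digits, and underscore-like separators are handled.

```python
import re

def split_identifier(name: str) -> list[str]:
    """Split an API identifier into lower-case words (hypothetical helper)."""
    words = []
    # Underscores, hyphens, and whitespace act as hard separators.
    for part in re.split(r"[_\-\s]+", name):
        # Within a part: acronyms (HTTP), capitalized or lower-case words, digit runs.
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [w.lower() for w in words]

print(split_identifier("StringBuilder"))
print(split_identifier("parseHTTPResponse"))
```

The resulting words are what a later matching step would compare against the words of the input sentence.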
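The population step can be illustrated with a small in-memory stand-in for the ontology. The sketch below is Python rather than the Java/OWL setting of the paper; the `Ontology` and `Individual` types, the relation names `hasMethod`, `hasParameterType`, and `hasReturnType`, and the `DataType` concept are hypothetical names chosen for illustration. Only the Class and Method concepts are taken from the text.

```python
from dataclasses import dataclass, field

@dataclass
class Individual:
    name: str
    concept: str  # e.g. "Class", "Method", "DataType" (the latter is assumed)
    relations: dict = field(default_factory=dict)  # relation name -> target names

class Ontology:
    def __init__(self):
        self.individuals = {}

    def add(self, name, concept):
        # Reuse an existing individual so shared types are stored only once.
        return self.individuals.setdefault(name, Individual(name, concept))

    def relate(self, source, relation, target):
        self.individuals[source].relations.setdefault(relation, []).append(target)

def add_method(onto, class_name, method_name, param_types, return_type):
    """Add one method, the data types it needs, and the links to its class."""
    onto.add(class_name, "Class")
    method_id = f"{class_name}.{method_name}"
    onto.add(method_id, "Method")
    onto.relate(class_name, "hasMethod", method_id)
    for type_name in param_types:
        onto.add(type_name, "DataType")
        onto.relate(method_id, "hasParameterType", type_name)
    onto.add(return_type, "DataType")
    onto.relate(method_id, "hasReturnType", return_type)
    return method_id

onto = Ontology()
add_method(onto, "StringBuilder", "append", ["String"], "StringBuilder")
```

A generator for a real API would call such a function once per method while walking the parsed classes.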

Synonyms and WordNet
Users will routinely use synonyms of the API's concepts in their input scripts. NLCI must therefore detect synonyms when matching the script with the ontology. NLCI provides a preparatory tool that links a given ontology to WordNet to facilitate the natural language analysis. We record the additional linguistic information in the NLCI ontology so that the matching algorithm is decoupled from the linguistic processing of the API.
Synonyms are defined for words and not for phrases, but the names of API elements usually consist of several words. Therefore we identify the head of the element's name and determine its synonyms. The head of a phrase is the word of the phrase that controls it, i.e. determines the sense of the phrase (Chomsky 1995; Miller 2011); the other words of the phrase modify the head. In the example "the old man", "old" is a modifier of the head "man"; one writes "old ˆman". Linguists determine the head of a phrase grammatically: e.g., they use the subject of a sentence as head and all its attributes as modifiers. Since method and class names are not sentences, we rely on a simple heuristic to determine the head. Naming conventions for object-oriented systems state that object names consist of adjectives and a noun and that method names start with a verb. Therefore we analyze the element's name with a POS tagger. We use nouns as heads for class names and verbs as heads for method names. Then we determine the synonyms of the head in WordNet and record the information. Not all synonyms are suitable. We align the hierarchies to rule out unsuitable synonyms.

Natural language processing
Figure 2 shows the steps of our natural language processing phase in more detail. The most important step is the identification of the classes used, the methods to invoke, and the parameters to use. We assume that the subjects of active sentences are the actors, the predicates are the actions, and direct and indirect (grammatical) objects can be parameters (e.g. Somebody.open(door:Class)); we identify passive voice and other linguistic constructs to recognize the respective elements.

Identifying atomic sentences
To abstract from the natural language expression, we introduce atomic sentences. An atomic sentence works like a logical predicate and captures an action with all elements involved. The so-called atomic constituents are actor, action, and object.
We represent this information with the notation action(actor), action(actor, object), or action(object). Atomic sentences abstract from active and passive voice. For example, both "The janitor opens the door." and "The door is opened by the janitor" result in open(janitor, door). NLCI also handles conjunctions and enumerations: it translates "The janitor opens the door and the window" into two atomic sentences open(janitor, door) and open(janitor, window) and "John opens the window and the janitor the door" to open(John, window) and open(janitor, door).
To translate a sentence to one or more atomic sentences, we first process it with Stanford's CoreNLP (Manning et al. 2014). CoreNLP provides an extensible NLP pipeline and comes prepared with a lemmatizer, a part-of-speech tagger, and a parser. Currently we use CoreNLP for English only, but there are configurations for other languages as well.
CoreNLP's most important feature is the construction of so-called typed dependencies. Typed dependencies are the connections between words in a sentence. The dependencies form a graph that consists of word nodes and dependency edges. According to Stanford's NLP group, dependency graphs are the preferable vehicle when processing word connections in a sentence, i.e. dependencies between these words (de Marneffe and Manning 2008). A typed dependency is a predicate reln(gov, dep), where reln is the type of the relation, gov the governing word, and dep the dependent word; for example, nsubj(opens, janitor) means that janitor is the nominal subject of opens. Dependency graphs abstract from the syntactical structure of a sentence. They do not change much when the syntactical structure of a sentence is altered. In contrast, rewriting a sentence from active voice to passive voice greatly changes the way one has to interpret a syntax tree: the noun phrase at the beginning of the active sentence is the actor; in a passive sentence, the first noun phrase is the object. Dependency graphs provide a more direct access to the information in a sentence than a syntax tree. There are 56 different typed dependencies; for a thorough explanation see de Marneffe and Manning (2015).
We explain the identification of atomic sentences with the example in Fig. 3, the dependency graph of the sentence "John turns the doorknob and opens the door". The analysis starts with the root of the graph. Turns is the first predicate of the sentence and its direct neighbors John and doorknob are the subject and the object respectively; they are connected to the predicate with nsubj and dobj edges. These three nodes form the first atomic sentence: turns(John, doorknob). The third neighbor of the predicate, opens, is connected with a conj_and edge; this edge is the path to the next predicate with its subject and object. NLCI translates the second predicate to the atomic sentence opens(John, door). Note that John is the subject in the second part of the sentence as well (indicated by its two incoming nsubj edges). The atomic sentence information is annotated in the text for further processing.
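The traversal described for this example can be sketched in a few lines of Python. The sketch assumes the dependencies are already available as (reln, gov, dep) triples, identifies words by their surface string (a real graph would use token positions), and covers only nsubj, dobj, and conj_and edges; the function name `atomic_sentences` is hypothetical, and none of the special cases for passive voice, modifiers, and so on are handled.

```python
from collections import defaultdict

def atomic_sentences(dependencies):
    """Return (action, actor, object) triples from (reln, gov, dep) triples."""
    edges = defaultdict(list)
    for reln, gov, dep in dependencies:
        edges[gov].append((reln, dep))

    def first(node, reln):
        # First dependent of `node` reached via an edge of type `reln`, if any.
        return next((dep for r, dep in edges[node] if r == reln), None)

    result, seen = [], set()
    pending = [dep for r, dep in edges["ROOT"] if r == "root"]
    while pending:
        predicate = pending.pop(0)
        if predicate in seen:
            continue
        seen.add(predicate)
        result.append((predicate, first(predicate, "nsubj"), first(predicate, "dobj")))
        # A conj_and edge leads to the next predicate of the sentence.
        pending += [dep for r, dep in edges[predicate] if r == "conj_and"]
    return result

deps = [
    ("root", "ROOT", "turns"),
    ("nsubj", "turns", "John"), ("dobj", "turns", "doorknob"),
    ("conj_and", "turns", "opens"),
    ("nsubj", "opens", "John"), ("dobj", "opens", "door"),
]
for action, actor, obj in atomic_sentences(deps):
    print(f"{action}({actor}, {obj})")
```

The triples in `deps` are written by hand from the edges named in the text, not produced by a parser.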
The analysis is performed once for every predicate and includes special treatment for sentences
- in passive voice,
- with direct speech,
- with imperatives,
- with modifiers for subjects and objects (such as appositions, adjectives, and relative clauses),
- containing gerunds, and
- with full-infinitives (also known as to-infinitives).
Therefore NLCI is capable of dealing with complex sentence structures, such as "John repairs the door, which is broken, while the table is cleaned by the old janitor". Note that the sentence contains active and passive voice, a relative clause, and modifiers. NLCI identifies the following atomic sentences in this example: repair(John, broken ˆdoor) and clean(old ˆjanitor, table).

Mapping atomic sentences to API elements
The next step is mapping atomic sentences to API elements: NLCI maps actions to ontology individuals representing API methods; actors and (syntactic) objects are mapped to ontology individuals representing API classes. To this end, NLCI adapts a common information retrieval technique. First, it gathers as much information as possible, before ranking the results according to a scoring function defined below. This approach is also implemented in IBM's successful Deep QA architecture (Ferrucci et al. 2010); Watson explores answer candidates in this manner.
Note that a natural language sentence with only one action can have multiple atomic sentence annotations that point to different actors, actions, and objects in the ontology. That is because the ranking algorithm cannot always identify the correct solution, e.g. if context information is needed for disambiguation. Then other analyses must be used to improve the performance. Incorporating these context-sensitive analyses into the scoring algorithm would make it hard to understand and to maintain. Instead, separate scoring and resolution algorithms should exist independently of each other and add to the information base available to NLCI. A future component should combine the information gathered during such analyses and determine an optimal combination. Such an approach that combines different analyses and NLP techniques is also the fundamental idea behind Deep QA.
The first step is querying the ontology for each word (the head word and modifiers) contained in the atomic constituents. The ontology returns all individuals that contain at least one of the words. We search for API elements this way because they are not always well named; as a consequence, modifiers are sometimes the only possibility to retrieve the correct element. Furthermore, NLCI's ontology search takes synonyms and word lemmas into account. To raise precision, NLCI only looks at the parts of the ontology that are relevant for the current query. That means for actions only methods are returned, while queries for actors and objects return classes. Nevertheless, this kind of ontology search generates a large set of candidates.
NLCI ranks the candidates with the scoring function B(r, q) of Eq. (1), the total score of the retrieved individual r for the query q. We decided to implement a custom scoring function to better adapt to ontology and API characteristics rather than using an off-the-shelf token matching algorithm. The first factor of B(r, q) is the representativeness R(r, q); it describes how well the individual's name matches the query. It is the fraction of matching characters in the individual's name and all characters in the query. The second factor of B(r, q) is the coverage C(r, q); it describes how well the query matches the individual's name. It is the fraction of used characters in the query and all characters in the individual's name. Together, representativeness and coverage determine how well the query (atomic constituent) matches the result (ontology individual). We use substring-wise matching to overcome flaws in API naming: e.g., for a query [old, janitor] and a best matching API element oldJan, a token-wise similarity metric would only match "old", while our scoring considers the overlapping parts of "Jan" and "janitor" as well.
The constant s weakens the score for matches with synonyms. This is because the synonym search introduces noise. If API elements share common synonyms (i.e. both appear in the same Synset in WordNet), we have to distinguish between the actual name and the synonyms. Therefore matches based on the individual's name are not discounted, which means s is set to 1. If the match is based on a synonym, s is set to 0.4.
The factor u^n considers the container/component relationship in the ontology: u is a discount factor, set to 0.4, to decrease the score of components that have been used in the text but without their full containment hierarchy; n is the number of missing components in the query. For example, the sentence "The janitor raises his left arm" is interpreted by NLCI as raise(ˆjanitor arm left). So for [janitor, arm, left] the best matching ontology individual might be OldJanitor→UpperBody→LeftArm. On the one hand, since the query only matches the first and last component of the hierarchy, only those are used to compute R(r, q) and C(r, q), resulting in a better score. On the other hand, u^n degrades the score because of the missing inner component UpperBody; in the example by u^n = 0.4^1 = 0.4.
As discussed above, our ontology structure distinguishes between (self-contained) objects and components. The ranking favors objects over components: o is set to 0 for objects and to 0.2 for components. In our experience, users tend to use objects, not components.
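The interplay of the factors described in this subsection can be made concrete with a small Python sketch. Two points are assumptions and not confirmed by the text: the way the factors are combined (here the product R · C · s · u^n minus the penalty o), and the matching rule (here each word of the name is credited with its longest common prefix with any query word). The constants 0.4, 0.4, and 0.2 are the values stated in the text; all function and parameter names are hypothetical.

```python
import re

SYNONYM_WEIGHT = 0.4      # s for synonym matches (1.0 for name matches)
HIERARCHY_DISCOUNT = 0.4  # u, applied once per missing component
COMPONENT_PENALTY = 0.2   # o for components (0 for self-contained objects)

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def matched_chars(name, query_words):
    """Assumed rule: credit each name word with its longest common prefix."""
    tokens = [t.lower() for t in re.findall(r"[A-Z]?[a-z]+", name)]
    words = [w.lower() for w in query_words]
    return sum(max((common_prefix_len(t, w) for w in words), default=0)
               for t in tokens)

def score(name, query_words, via_synonym=False, missing=0, is_component=False):
    m = matched_chars(name, query_words)
    representativeness = m / sum(len(w) for w in query_words)  # R(r, q)
    coverage = m / len(name)                                   # C(r, q)
    s = SYNONYM_WEIGHT if via_synonym else 1.0
    o = COMPONENT_PENALTY if is_component else 0.0
    # Assumed combination: product of the factors minus the component penalty.
    return representativeness * coverage * s * HIERARCHY_DISCOUNT ** missing - o

for candidate in ("OldJanitor", "OldJan", "OldTimer"):
    print(candidate, round(score(candidate, ["old", "janitor"]), 4))
```

Under these assumptions a full name match outranks the partially matching OldJan, which in turn outranks OldTimer; the numbers printed by the sketch are not the values of the paper's tables.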
Finally, the matches with scores B(r, q) ≥ 0.1 are annotated in the text. At this point, every actor, action, and object in the text has several matched ontology individuals and corresponding scores. Then the matched individuals are combined: for every atomic sentence, NLCI determines viable alternatives and their combination score S. For every matched individual, NLCI computes the intersection between the class's methods, which can be found in the ontology, and all matched methods annotated in the text to produce (class, method) tuples. Then NLCI chooses all matched classes that can be arguments for the respective method. The combined score S for an atomic sentence A and the set of retrieved individuals R = {r_actor, r_action, r_object} is calculated from the individual scores B_actor, B_action, and B_object and the size |A| of the atomic sentence. If no class can be matched that could be used as a parameter for the current method, B_object is zero. If a method does not require an argument, |A| is two; otherwise |A| is three. Again, all viable combinations of individuals (i.e., API calls) and the corresponding scores are annotated in the text.
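One form of S that is consistent with the description above and with the combined scores reported in the detailed example (0.665 and 0.608) is the arithmetic mean of the individual scores. It is inferred from those numbers and should be read as an assumption, not as the original equation.

```latex
S(A, R) = \frac{B_{\text{actor}} + B_{\text{action}} + B_{\text{object}}}{|A|},
\qquad |A| \in \{2, 3\}
```

Under this reading, a method that requires an argument but has no matching class keeps |A| = 3 with B_object = 0, so the missing argument lowers the combined score, whereas a method without arguments is averaged over two scores only.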

A detailed example
We outline NLCI's matching algorithm with a detailed example, using the ontology shown in Table 2. We consider the sentence "The old janitor cleans the table." First, NLCI derives the atomic sentence A = clean(old ˆjanitor, table). Then it identifies the matches for all atomic constituents. NLCI queries the ontology separately for janitor and old. The ontology returns OldJanitor and OldTimer because they match old. Caretaker is also retrieved, because janitor is found as a synonym of Caretaker. Finally, OldJan→ArmL is retrieved, as its first part partially matches the query.
NLCI ranks the candidates according to the scoring function in Eq. (1). Table 3 details the components of the final scores. The first column shows the retrieved ontology individual and the second the synonym that has been used for retrieval, if any. The following columns show the components of Eq. (1), and the last column states the resulting score.
The length of the query old ˆjanitor is ten; therefore R is the number of matching characters divided by ten. Caretaker's R is determined by its synonym janitor. Because of the synonym, s is 0.4 for Caretaker and 1 for the other individuals. C depends on the length of the individuals' names. Because only the first part of OldJan→ArmL was matched, the length of OldJan is used to calculate C, and u^n = 0.4^1 = 0.4. As the left arm is only a component, o is set to 0.2. For the first three candidates, u^n is 1 and o is 0, because they are all container classes. OldJanitor is the expected match for the example query. Caretaker has a low score, as synonyms are regarded as a valid but noisy contribution. The individual OldJan→ArmL scores below 0.1 and is eliminated from the result set. NLCI retrieves cleanUp(class) for clean and Table for table with scores of 0.714 and 1.0, respectively. Then the combined score S for the atomic sentence A and the three matchings is computed. S for Caretaker is 0.665 and S for OldTimer is 0.608; all three combinations are annotated in the text with their scores.
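The numbers in this example can be retraced with a short script. It is a sketch under stated assumptions: B is taken as R · C · s · u^n − o and S as the mean of the three individual scores (both inferred from the reported values), the matched character counts are filled in by hand, and for the synonym match C is computed against the synonym janitor. The class and function names are hypothetical.

```python
from dataclasses import dataclass

S_SYNONYM = 0.4    # s for synonym matches; name matches use 1.0
U = 0.4            # discount u per missing containment level
O_COMPONENT = 0.2  # o for components; objects use 0.0
THRESHOLD = 0.1    # matches scoring below this are dropped


@dataclass
class Candidate:
    name: str
    matched: int             # matched characters (entered by hand)
    target_len: int          # length of the name (or synonym) that was matched
    synonym: bool = False
    missing: int = 0         # n: containment levels missing from the query
    component: bool = False


def score(c: Candidate, query_len: int) -> float:
    """Assumed form of B(r, q): R * C * s * u^n - o."""
    r = c.matched / query_len
    cov = c.matched / c.target_len
    s = S_SYNONYM if c.synonym else 1.0
    o = O_COMPONENT if c.component else 0.0
    return r * cov * s * U ** c.missing - o


def combined(scores) -> float:
    """Assumed form of S: mean of the individual scores."""
    return sum(scores) / len(scores)


QUERY_LEN = len("old") + len("janitor")  # 10
actors = [
    Candidate("OldJanitor", matched=10, target_len=10),
    Candidate("OldTimer", matched=3, target_len=8),
    Candidate("Caretaker", matched=7, target_len=7, synonym=True),
    Candidate("OldJan->ArmL", matched=6, target_len=6, missing=1, component=True),
]
b_action = (5 / 5) * (5 / 7)  # cleanUp for "clean": about 0.714
b_object = 1.0                # Table for "table"

for c in actors:
    b = score(c, QUERY_LEN)
    if b < THRESHOLD:
        print(f"{c.name}: B = {b:.3f} (dropped)")
        continue
    s_total = combined([b, b_action, b_object])
    print(f"{c.name}: B = {b:.3f}, S = {s_total:.3f}")
```

With unrounded inputs the script yields roughly 0.665 for Caretaker and roughly 0.609 for OldTimer; the 0.608 reported above is obtained when the individual scores are rounded to 0.11 and 0.714 first.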

Sequential ordering and control structures
Before handing over the annotated text to the code generator, NLCI employs two further NLP analyses, one to check and correct the sequential order of the script and one to detect linguistic patterns that imply control structures (such as parallelism and loops).
During our work on NLCI, we noticed that non-programmers tend to describe sequences of actions out of order. For example, they write "Do a. Do b. But before that, do c". Generating the API calls in the textual order does not produce the desired script. NLCI uses signal words and linguistic patterns to identify such re-orderings.
When non-programmers describe a sequence of actions, they do not use control structures explicitly as we would expect in a program (Pane and Myers 1996, Section 5-6; Pane et al. 2002). For example, they refer to groups of objects (often without properly defining the group) and simply state that all members of the group do something. NLCI identifies signal phrases in the dependency graphs of the sentences and uses a tailored graph traversal to collect all affected objects and actions. It recognizes different kinds of loops and parallelism.
Both analyses are beyond the scope of this article; further details and a thorough evaluation can be found in Landhäußer et al. (2014) and Landhäußer and Hug (2015). The results are annotated in the text.
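To illustrate the idea of signal-word based reordering, the following toy sketch handles only the single pattern from the example. It is not NLCI's analysis, which works on linguistic patterns; the regular expression, the function name, and the reading that "that" refers to the immediately preceding action are all assumptions.

```python
import re

# Hypothetical signal phrase: an optional "but", then "before that".
SIGNAL = re.compile(r"^(?:but\s+)?before that,?\s*", re.IGNORECASE)


def reorder(sentences):
    """Move an action introduced by 'before that' in front of the preceding action."""
    ordered = []
    for sentence in sentences:
        m = SIGNAL.match(sentence)
        if m and ordered:
            # Insert the action (signal phrase stripped) before the last action.
            ordered.insert(len(ordered) - 1, sentence[m.end():])
        else:
            ordered.append(sentence)
    return ordered


print(reorder(["Do a", "Do b", "But before that, do c"]))
```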

Program synthesis
The last phase generates actual code for the target programming language. As this step is programming language dependent, one must provide a code generator for each programming language one wants to support. Note that the generator depends only on the programming language, not on the API. All information that has been gathered during the analysis phase is annotated in the text. NLCI combines all this information and builds a representation that resembles an abstract syntax tree for a program; reorderings and control structures are explicitly modeled in this tree. NLCI maximizes the combined scores of all transformed combinations. Therefore, the code generator can simply translate the text into code without any knowledge of the API or of NLP. The ontology provides the generator with the information it needs, such as the location of the API source files (e.g., in which JAR one can find a specific Java class).
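The division of labor described above can be sketched as a generator that only picks the best-scored alternative per atomic sentence and prints it in the target syntax. Everything in this sketch is hypothetical: the dictionary layout, the scores, and the emitted call syntax are illustrative and do not reflect NLCI's internal representation or the Alice API.

```python
def generate(script):
    """Emit one call per atomic sentence, choosing the best-scored alternative."""
    lines = []
    for alternatives in script:
        best = max(alternatives, key=lambda alt: alt["score"])
        argument = best["object"] or ""  # methods without arguments get empty parentheses
        lines.append(f'{best["actor"]}.{best["method"]}({argument});')
    return "\n".join(lines)


# One atomic sentence with three annotated alternatives (illustrative scores).
script = [
    [
        {"actor": "OldJanitor", "method": "cleanUp", "object": "Table", "score": 0.9},
        {"actor": "Caretaker", "method": "cleanUp", "object": "Table", "score": 0.665},
        {"actor": "OldTimer", "method": "cleanUp", "object": "Table", "score": 0.608},
    ],
]
print(generate(script))
```

The generator never consults the API itself; it relies entirely on the annotations, which is why only the output syntax has to change for another target language.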

Evaluation
NLCI has been tested on two radically different domains: openHAB, an API for home automation, and Alice, a programming environment for building 3D animations. Natural language interaction in a home automation environment is very desirable. Alice is often used as a pedagogical environment; there, programming in natural language would be an illustrative stepping stone.
Both APIs were ontologized automatically and both were tested with benchmarks of scripts. For each script, a gold standard solution was constructed by hand. This gold standard lists the expected API calls. Comparing the manual solution with NLCI's matches, we determined recall, precision, and the F1 measure. These metrics provide an adequate measure of the quality of NLCI's output.
The following subsections describe the APIs and the work necessary to build the ontologies, and present the evaluation results. For both openHAB and Alice, we provide the ontology generators, NLCI, the ontologies used, the input texts, and the gold standard texts on our website.

openHAB
openHAB is an open-source home automation software; a full list of features can be found on the openHAB website, from which we quote: "openHAB is a software for integrating different home automation systems and technologies into one single solution that allows over-arching automation rules and that offers uniform user interfaces. This means openHAB […] has a powerful rule engine to fulfill all your automation needs […] and is easily extensible to integrate with new systems and devices […]". openHAB integrates different home automation components such as heaters, lights, and switches into a central control. User-defined rules trigger actions (e.g., turning on a light) or action scripts. Trigger events include the push of a button, a calendar event, and the outside temperature falling below a threshold.

Building the openHAB Ontology
openHAB provides nine different component types that are either active (e.g., switches) or passive (e.g., temperature sensors). The actual components in a household configuration inherit their functionality from the component types. An openHAB installation requires a setup that includes all components of the household. Components can be arranged in groups, for example to allow all lights on the ground floor to be turned on and off with a single command.
The ontology must contain all item types and their functionality plus the actual instances of the household that are to be controlled. The item types and their methods can be found in the openHAB documentation. The actual components can be extracted from an openHAB configuration. We have written an automatic ontology generator that reads that configuration and prepares the ontology. This generator consists of 370 lines of Java code and has been written by the first two authors in less than one day. As the household's setup is modeled in the ontology, the input scripts can omit a setup description.
Our evaluation uses the demonstration configuration from openHAB's project web site. The demonstration household has two floors, nine rooms, and a total of 92 components including heaters, a home stereo, and groups of light switches. The floors and rooms act as groups in the demo house. As the functionality of the items in openHAB is limited, the ontology exhaustively models openHAB's functionality with nine methods.
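The extraction step can be pictured with a small parser. The authors' generator is written in Java and is not shown in the text, so the following is only a sketch in Python under assumptions: it supposes item lines of the simple form `Type Name "Label" <icon> (Group1, Group2)`, ignores everything else in a configuration, and represents ontology individuals as plain dictionaries. The item names in the sample are illustrative.

```python
import re

# Simplified pattern for one item line:
#   Type Name "Label" <icon> (Group1, Group2)
ITEM = re.compile(
    r'^(?P<type>\w+)\s+(?P<name>\w+)'
    r'(?:\s+"(?P<label>[^"]*)")?'
    r'(?:\s+<(?P<icon>[^>]+)>)?'
    r'(?:\s+\((?P<groups>[^)]*)\))?'
)


def parse_items(text):
    """Turn item lines into ontology individuals (plain dicts in this sketch)."""
    individuals = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("//"):
            continue  # skip blank lines and comments
        m = ITEM.match(line)
        if m is None:
            continue  # anything the simplified pattern cannot read
        groups = [g.strip() for g in (m.group("groups") or "").split(",") if g.strip()]
        individuals.append({
            "name": m.group("name"),
            "item_type": m.group("type"),
            "is_group": m.group("type") == "Group",
            # Words from the identifier make the individual searchable.
            "words": [w.lower() for w in m.group("name").split("_")],
            "groups": groups,
        })
    return individuals


config = '''
Group GF_Kitchen "Kitchen" <kitchen> (GF)
Switch Light_GF_Kitchen_Table "Table" (GF_Kitchen, Lights)
'''
for individual in parse_items(config):
    print(individual)
```

Splitting identifiers into words is what makes descriptive names pay off: an item called Light_GF_Kitchen_Table can be found through "light", "kitchen", or "table" without any manual annotation.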

Case study
We have written five natural language scripts comprising 15 commands in total. Table 4 summarizes the openHAB case study. The commands resemble the rules available in the openHAB demonstration package. There are commands such as "Turn the light on over the table in the kitchen" and "Turn on the heaters in the living room and increase the volume of the radio". Then we let NLCI identify the actions and objects of our commands and checked whether the results were correct.
We compared the intended API calls with NLCI's matches as follows. For every expected individual annotation (API element), we first check whether the corresponding text elements were annotated at all; then we determine precision, recall, and the F1 score for the actors, the objects, and the actions. Finally, we determine recall, precision, and the F1 score for the combinations (API calls). Because NLCI produces scores for the annotations, we were able to calculate these metrics for the best n results, with n ∈ {1, 2, 3, 5, 10}.
Recall is the number of correct annotations divided by the number of expected annotations, i.e., true positives divided by the sum of true positives and false negatives: R = p_t/(p_t + n_f). Precision is the number of correct annotations divided by the number of all annotations, i.e., true positives divided by the sum of true positives and false positives: P = p_t/(p_t + p_f). One can trade off precision for recall and vice versa. The F1 measure is the harmonic mean of precision and recall and is used as a measure of accuracy: F1 = 2 × (P × R)/(P + R). For all three metrics, the perfect score is 1 and the worst is 0.
Table 5 shows the details of the results. Table 5(a) shows the classification results for the individual actors, actions, and objects; Table 5(b) shows the results for the combinations. Unfortunately, NLCI could not identify six of 39 individuals (actions and objects) in the API. Three of the missing individuals stem from problems in the Stanford parser: for example, if an action verb is incorrectly tagged as a noun, the language analysis of NLCI does not search for methods implementing this action verb. We are experimenting with analyses that use less syntactical information and are thus more robust to parser errors and grammatically incorrect input. The remaining three missing individuals are groups that could not be found in the ontology due to improper naming; this could be fixed directly in the openHAB configuration file, in an (automatic) preprocessing step before or during ontology population, or in the ontology. The 15 commands include 20 API calls, 14 of which were correctly identified and ranked in first position.
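The metrics defined above are straightforward to compute once annotations are ranked by score. The sketch below shows one plausible way to count true positives, false positives, and false negatives for the best n results; the counting scheme, the function names, and the sample data are assumptions for illustration and are not taken from the evaluation.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """P = tp/(tp+fp), R = tp/(tp+fn), F1 = harmonic mean of P and R."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def evaluate_top_n(ranked, gold, n):
    """Evaluate the best n annotations per text element against the gold standard.

    Assumed counting: the expected individual among the top n is a true positive,
    every other annotation in the top n is a false positive, and an expected
    individual that is absent from the top n is a false negative.
    """
    tp = fp = fn = 0
    for element, expected in gold.items():
        top = ranked.get(element, [])[:n]
        if expected in top:
            tp += 1
            fp += len(top) - 1
        else:
            fn += 1
            fp += len(top)
    return precision_recall_f1(tp, fp, fn)


# Hypothetical annotations, best score first.
ranked = {
    "janitor": ["OldJanitor", "Caretaker", "OldTimer"],
    "clean": ["cleanUp"],
    "table": ["Chair", "Table"],
}
gold = {"janitor": "OldJanitor", "clean": "cleanUp", "table": "Table"}

for n in (1, 2):
    print(n, evaluate_top_n(ranked, gold, n))
```

In this toy data, moving from n = 1 to n = 2 raises recall because Table enters the result list, while precision drops because the extra annotations are mostly wrong; this is the trade-off mentioned above.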

3D animations in Alice
Alice is a learning environment for programming novices and teaches object-oriented programming (Conway 1997). In Alice, one can put 3D models (for example, characters or scene objects such as houses, trees, etc.) into a 3D world and let them do things by calling methods on them (for example, move, turn, or manipulate other objects). We chose Alice as target platform because it provides a rich set of objects and the API is self-explanatory. Alice supports the usual control structures (such as loops and conditionals) and parallelism. Also, the models and their methods are documented in plain English.

Building the Alice ontology
Alice comes with a rich collection of 3D models and provides generic methods for all models such as turn() and move(). The collection may be extended by the user in two ways: (a) one can define new 3D models and add them to the collection and (b) one can implement new methods that extend the functionality of a model. To make the models and all their functions available to the ontology, we created a tool that reads Alice models and registers them in the ontology. Parts of models (e.g., the arms of a woman) are recorded as components; the documentation is added to the ontology as well. This ontology generator consists of about 2100 lines of code and has been written by an undergraduate student in 4 weeks. The generator is more complicated than the generators for Java or openHAB because the 3D models are stored in an undocumented proprietary format.
For the evaluation, we generated an NLCI ontology with all the models that were available to us: 909 objects, 8539 components, and 373 methods; additionally, every model inherits 22 predefined methods. The ontology generation is fully automatic and required no corrections.

Evaluation
For the evaluation, we used a corpus of scripts that was developed over a period of several years. The scripts were constructed as follows: we programmed ten different animations and two static sceneries and then asked subjects to describe those in their own words. The subjects were both programmers and non-programmers. Subjects described several animations, but each animation only once. The resulting descriptions were then fed into NLCI. Then we created the API calls that describe the actual plot to produce a gold standard for every script. Note that we did not select the API calls according to taste: the gold standards use the API calls that were in the Alice programs used to generate the animations in the first place. We need separate gold standards because the authors of the descriptions did not describe exactly what was in the animation but changed or left out details.
We collected a corpus of 86 Alice scripts. The 30 scripts for the static sceneries do not contain actions and were therefore excluded from this evaluation. Six further texts are unusable due to lack of focus or bad English. Table 6 gives the statistics for the 50 texts used in the evaluation.
We found that Stanford CoreNLP does not resolve references in the scripts properly. CoreNLP can produce co-reference chains, but they are hardly ever correct in our scripts. Furthermore, references to actions are not resolved at all. Since this is a problem of the parser and not the matcher, we resolved pronouns ourselves before the evaluation.
Table 7 summarizes the results for the texts with resolved references. Table 7(a) shows the classification results for the individual actors, actions, and objects; Table 7(b) shows the results for the combinations. We determined precision, recall, and the F1 measure in the way described above. As one can see, recall in the TOP 10 is lower for the atomic sentences than it is for the individuals. This is expected, as individual matches are more numerous than the combined matches for atomic sentences. Precision, however, is higher for the combinations than it is for the individuals. Also note that the drastically declining scores for precision are expected as well; if the correct annotation is among the TOP n results, TOP n + 1 precision will be lower. In summary, NLCI produces the correct API calls 67 % of the time, with a precision of 78 %. Of course, this is not good enough for practical use. However, note that we provided a generic text-to-API translator that is set up automatically. Even the ontology was generated. In other words, if an API provides descriptive names, a natural language command interpreter for this API can be constructed without any manual work. Naturally, future work will have to improve both precision and recall, which means that both the parser and the matching component need to be improved significantly.
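For orientation, the summary figures can be combined into a single F1 value using the definition given for the openHAB case study. The calculation below uses the rounded values of 78 % precision and 67 % recall, so a value computed from unrounded counts may differ slightly.

```latex
F_1 = \frac{2 \cdot P \cdot R}{P + R}
    = \frac{2 \cdot 0.78 \cdot 0.67}{0.78 + 0.67}
    = \frac{1.0452}{1.45}
    \approx 0.72
```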

Conclusion and future work
This paper presents a new architecture for building natural language (NL) command interfaces for APIs. NLCI models the targeted API in an ontology and uses it as a bridge between the NL input and the API. WordNet enriches the ontology with synonyms to facilitate the processing of the input. NLP analyses employ Stanford's dependency graphs to transform the input into atomic sentences; then these atomic sentences are mapped to corresponding API calls. NLCI derives the desired sequential order and infers control structures.
Generating an ontology for an API is easy; a simple ontology generator for Java can be implemented in approximately 330 lines of Java code. If the API offers well-chosen, descriptive identifiers, no manual work is needed. If not, one merely has to add descriptive names to the automatically constructed ontology.
We evaluated our approach on two different domains, home automation and 3D animation. The results are promising and show that interpreting commands stated in NL in these two domains is feasible, yet we need to improve the accuracy of our matching algorithm. Also, in other work we found that users use nominalization. For instance, they might say "do a turn" rather than simply "turn". Körner and Brumm (2010) address the problem of nominalization, but their solution still awaits incorporation into NLCI.
The APIs that we used so far do not contain preconditions and postconditions. If the user specifies an incomplete sequence of actions, NLCI cannot generate a proper program. Even if every action is translated into an API call, the resulting program does not behave as intended. Given preconditions and postconditions, NLCI could check whether all preconditions hold before generating an API call. If a precondition is not met, NLCI could suggest actions to fill the gap.
Last but not least, we want to address spoken language. At the moment, NLCI uses syntactical analyses that do not perform well on ungrammatical phrases. Currently, we are evaluating whether state-of-the-art POS taggers and chunkers are good enough to be used in a system like NLCI.
In summary, this is a promising approach to an interesting problem. Users of computing devices will soon expect language interfaces for all kinds of applications. Generating NL interfaces would simplify programmers' lives significantly, so that they can concentrate on developing useful APIs.

Fig. 2 A detailed view of the natural language analyzer: arrows indicate processing order, dotted arrows input, and dashed arrows output

Fig. 3 An example for a typed dependency graph

Table 1
The structure of the

Table 3
The retrieved ontology individuals and their scoring for the query old ˆjanitor and the ontology given in Table 2

Table 5
Evaluation results for the openHAB case study

Table 6
Summary

Table 7
Evaluation results for Alice: co-references are manually resolved in the input texts. Four individuals are not among the TOP 10; there are no further combinations