TWIG (“Transportable Word Intension Generator”) is a system that allows a robot to learn compositional meanings for new words that are grounded in its sensory capabilities. The system is novel in its use of logical semantics to infer which entities in the environment are the referents (extensions) of unfamiliar words; its ability to learn the meanings of deictic (“I,” “this”) pronouns in a real sensory environment; its use of decision trees to implicitly contrast new word definitions with existing ones, thereby creating more complex definitions than if each word were treated as a separate learning problem; and its ability to use words learned in an unsupervised manner in complete grammatical sentences for production, comprehension, or referent inference. In an experiment with a physically embodied robot, TWIG learns grounded meanings for the words “I” and “you,” learns that “this” and “that” refer to objects of varying proximity, that “he” is someone talked about in the third person, and that “above” and “below” refer to height differences between objects.
Follow-up experiments demonstrate the system’s ability to learn different conjugations of “to be”; show that removing either the extension inference or implicit contrast components of the system results in worse definitions; and demonstrate how decision trees can be used to model shifts in meaning based on context in the case of color words.
Introduction
Robots that could understand arbitrary sentences in natural language would be incredibly useful—but real vocabularies are huge. American high school graduates have learned, on average, about 60,000 words [29]. Programming a robot to be able to understand each of these words in terms of its own sensors and experience would be a monumental undertaking. To recognize the words in the audio stream is one thing; to recognize their referents in the real world, quite another. The messy, noisy nature of sensory input can render this difficult problem intractable for the programmer. It would be far more useful to have a robot that is able to learn the meanings of new words, grounded in its own perceptual and conceptual capabilities, and ideally with a minimum of user intervention for training and feedback. The present paper describes a system that is built to do exactly that.
According to the developmental psychology literature, children can learn new words rapidly [7] and in a largely unsupervised manner [9]. From the time that word learning begins around the age of one, children’s word learning accelerates from a rate of about one word every 3 days for the first 4 months [14], to 3.6 words a day between the ages of 2½ and 6 [1,4,14], to 12 words a day between the ages of 8 and 10 [1]. This learning would impose quite a demand on the time of parents and educators if children had to be taught all these words explicitly, but instruction does not appear to be necessary for learning language [4,27], and in fact there are communities in which children learn to speak with hardly any adult instruction at all [21]. Thus, in designing a robotic system built to learn the meanings of words quickly and without supervision, it makes sense to study the heuristics that children appear to use in learning the meanings of words.
One strategy that children use is to infer what object or person a new word is referring to from its grammatical context—realizing that “sibbing” must refer to a verb, but “a sib” must refer to a noun [6]. Another strategy children use is to contrast the new word with other words they know in order to narrow down its meaning without explicit feedback [7,9,26]. For instance, a child may believe at first that “dog” refers to all four-legged animals, but on learning the word “cat,” the child realizes that implicitly, the meaning of “dog” does not include cats, and so the definition of “dog” is narrowed.
This paper describes a word learning system called TWIG that employs both of these strategies, using sentence context and implicit contrast to learn the meanings of words from sensory information in the absence of feedback. Until now, research into learning the semantics of words in an unsupervised fashion has fallen primarily into two camps. On the one hand, there have been attempts to learn word semantics in simulation, with the world represented as collections of atomic symbols [12] or statements in predicate logic [41]. These simulations generally assume that the robot already possesses concepts for each of the words to be learned, and thus word learning becomes a simple problem of matching words to concepts. Such systems can deal with arbitrarily abstract concepts, because they never have to identify them “in the wild.”
On the other hand, systems that have dealt with real sensor data have generally assumed that concepts and even word boundaries in speech must be learned at the same time as words. Since learning abstract concepts directly from sensor data is difficult, these systems have focused on learning words for physical objects [37] and sometimes physical actions [48].
The TWIG word-learning system is an attempt to bridge this divide. “TWIG” stands for “Transportable Word Intension Generator.” It is “transportable” in the sense that it does not rely too heavily on the particular sensory setup of our robot (Fig. 1), but could be moved with ease to any robot that can express its sensory input in Prolog-like predicates that can include numeric values. (Note that it is the learning system, and not the definitions learned, that is transportable; each robot using TWIG will learn definitions specific to its own sensory capabilities.) A “word intension” is a function that, when applied to an object or relation, returns true if the word applies. In other words, an intension is a word meaning [8,13]. Though TWIG was originally designed to solve the conundrums posed by pronouns, which themselves straddle the boundary between the abstract and the concrete, its techniques are more generally applicable to other word categories, including verbs, prepositions, and nouns.
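The paper does not spell out a concrete encoding at this point, but as a minimal sketch, sensory input expressed as Prolog-like predicates with numeric values might look like the following in Python. The predicate names (“person,” “distance”) and the 0.5 m proximity threshold are illustrative assumptions, not TWIG’s actual schema:

```python
from typing import NamedTuple

class Fact(NamedTuple):
    pred: str     # predicate name, e.g. "distance"
    args: tuple   # entity ids and/or numeric values

# A tiny snapshot of a scene (names and numbers are illustrative)
facts = {
    Fact("person", ("bob",)),
    Fact("distance", ("ball1", 0.3)),   # metres from the robot
    Fact("distance", ("cup1", 2.0)),
}

# A word intension maps an entity to True when the word applies; e.g. a
# learned intension for "this" might test proximity (hypothetical threshold):
def intension_this(entity, facts):
    return any(f.pred == "distance" and f.args[0] == entity and f.args[1] < 0.5
               for f in facts)

assert intension_this("ball1", facts)       # nearby object: "this" applies
assert not intension_this("cup1", facts)    # distant object: it does not
```

Under this view, learning a word means learning such a Boolean function over the robot’s own predicates, which is why the learned definitions are robot-specific even though the learning system is transportable.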
The first technique that TWIG introduces is extension inference, in which TWIG infers from sentence context what actual object the new word refers to. For example, if the system hears “Foo got the ball,” and it sees Bob holding the ball, it will assume that “Foo” refers to Bob; the entity in the world that is Bob is the extension of “Foo” in this example. Doing this allows the system to be picky about what facts it associates with a word, greatly reducing the impact of irrelevant information on learning. Grammatical context also informs the system as to whether it should be looking for facts about a single object or a relation between objects, as with prepositions and transitive verbs.
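The “Foo got the ball” example can be sketched as a simple query over the scene: the known words fix a relation and one of its arguments, and the unknown word’s extension is whatever fills the remaining slot. The predicate and entity names here are hypothetical:

```python
# Sketch of extension inference: the sentence supplies a relation whose
# known argument constrains the unknown word's referent.

def infer_extension(relation, known_arg, facts):
    """Find entities x such that relation(x, known_arg) holds in the scene."""
    return [args[0] for (pred, args) in facts
            if pred == relation and args[1] == known_arg]

# Scene: Bob is holding the ball.
facts = [("got", ("bob", "ball1"))]

# Hearing "Foo got the ball", the unknown word "Foo" is hypothesized to
# refer to whoever stands in the "got" relation to the ball:
assert infer_extension("got", "ball1", facts) == ["bob"]
```

Only facts about the inferred extension (Bob) then feed into learning “Foo,” which is what shields the learner from the irrelevant clutter elsewhere in the scene.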
Second, to build word definitions, the system creates definition trees (see Section 4.3). Definition trees can be thought of as reconstructing the speaker’s decision process in choosing a word. The interior nodes are simple binary predicates, and the words are stored at the leaves. The meaning of a word can be reconstructed by following a path from the word back to the root; its definition is the conjunction of the true or negated predicates on this path. Alternately, the system can choose a word for a particular object or relation by starting at the root and following the branches that the object satisfies until arriving at a word.
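The two directions of use described above—choosing a word for an object by walking root to leaf, and reading off a word’s definition as the signed predicates on its path—can be sketched as follows. The tree shape and the predicates (“speaker,” “addressee”) are illustrative, not the trees TWIG actually learns:

```python
# Minimal definition-tree sketch: interior nodes hold a predicate,
# leaves hold words.

class Node:
    def __init__(self, pred=None, yes=None, no=None, word=None):
        self.pred, self.yes, self.no, self.word = pred, yes, no, word

# Tiny tree: speaker(x)? -> "I"; else addressee(x)? -> "you"; else "he"
tree = Node("speaker",
            yes=Node(word="I"),
            no=Node("addressee", yes=Node(word="you"), no=Node(word="he")))

def choose_word(node, holds):
    """Pick a word for an entity; `holds` is the set of predicates true of it."""
    while node.word is None:
        node = node.yes if node.pred in holds else node.no
    return node.word

def definition(node, word, path=()):
    """Return the conjunction of (predicate, sign) pairs leading to `word`."""
    if node.word == word:
        return list(path)
    if node.word is not None:
        return None
    return (definition(node.yes, word, path + ((node.pred, True),))
            or definition(node.no, word, path + ((node.pred, False),)))

assert choose_word(tree, {"addressee"}) == "you"
# "he" is defined negatively: not the speaker and not the addressee.
assert definition(tree, "he") == [("speaker", False), ("addressee", False)]
```

Note how the definition of “he” falls out as a conjunction of negated predicates—exactly the kind of implicitly contrastive definition that treating each word as a separate learning problem would miss.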
Definition trees are a type of decision tree, and are created in a manner closely following the ID3 algorithm [32]—but treating word learning as a problem of word decision, with words defined in terms of the choices that lead to them, is an approach novel to TWIG. Previous robotic systems have typically assumed that word concepts can overlap with each other arbitrarily [37,48]. In assuming that words do not overlap, TWIG actually allows greater generalization from few examples, because the boundaries of each concept can extend all the way up to the “borders” of the other words. Though we described an earlier version of the TWIG system in a previous conference paper [17], that version of TWIG did not make use of the definition tree method, which we developed later [15]. (Definition trees were originally simply called “word trees,” but this led to some confusion with parse trees.)
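An ID3-style construction chooses, at each node, the predicate with the greatest information gain over the word labels. A minimal sketch of that split criterion, under the stated assumption that each example carries exactly one word label (so words partition the space rather than overlap):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, pred):
    """examples: list of (set-of-true-predicates, word) pairs."""
    yes = [w for preds, w in examples if pred in preds]
    no = [w for preds, w in examples if pred not in preds]
    labels = [w for _, w in examples]
    n = len(examples)
    return (entropy(labels)
            - (len(yes) / n) * entropy(yes)
            - (len(no) / n) * entropy(no))

# Toy examples with hypothetical predicates:
examples = [
    ({"speaker"}, "I"),
    ({"addressee"}, "you"),
    (set(), "he"),
    ({"speaker"}, "I"),
]
# "speaker" cleanly separates "I" from the rest, so it is the more
# informative first split:
assert info_gain(examples, "speaker") > info_gain(examples, "addressee")
```

Recursing on each side of the best split until every leaf holds a single word yields a definition tree in which the words’ boundaries abut one another, which is where the extra generalization from few examples comes from.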
The latest version of TWIG includes several enhancements that were not present in any previously reported results, including part-of-speech inference, online vocabulary updates that allow new words to be used immediately, and better handling of negated predicates. This paper also presents a new quantitative evaluation of the two halves of the system, and some new experiments that demonstrate the system’s flexibility in building on its previous vocabulary and learning multiple definitions for the same word.
By Kevin Gold, Marek Doniec, Christopher Crick, and Brian Scassellati