specifying structured probability distributions, and it is easy to extend the topic model to
incorporate richer latent structure by adding further steps to the generative process. We will discuss
five extensions to the model: determining the number of topics, learning topics from other kinds of
data, incorporating collocations, inferring topic hierarchies, and including rudimentary syntax.
Learning the number of topics
In the preceding discussion, we assumed that the number of topics, T , in the model was
fixed. This assumption seems inconsistent with the demands of human language processing, where
more topics are introduced with every conversation. Fortunately, this assumption is not necessary.
Using methods from non-parametric Bayesian statistics (Muller & Quintana, 2004; Neal, 2000),
we can assume that our data are generated by a model with an unbounded number of dimensions,
of which only a finite subset have been observed. The basic idea behind these non-parametric
approaches is to define a prior probability distribution on the assignments of words to topics, z,
that does not assume an upper bound on the number of topics. Inferring the topic assignments for
the words that appear in a corpus simultaneously determines the number of topics, as well as their
content. Blei, Griffiths, Jordan, and Tenenbaum (2004) and Teh, Jordan, Beal, and Blei (2004)
have applied this strategy to learn the dimensionality of topic models. These methods are closely
related to the rational model of categorization proposed by Anderson (1990), which represents
categories in terms of a set of clusters, with new clusters being added automatically as more data
becomes available (see Neal, 2000).
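As a concrete illustration of this idea, the sketch below draws topic assignments from a Chinese restaurant process style prior, in which each word joins an existing topic with probability proportional to that topic's current size or starts a new topic with probability proportional to a concentration parameter. This is only a minimal caricature of the nonparametric constructions used in the work cited above; the function name and the parameter alpha are illustrative assumptions.

    import random
    from collections import Counter

    def crp_topic_assignments(num_words, alpha=1.0, seed=0):
        """Draw topic assignments under a Chinese restaurant process prior:
        each word joins an existing topic with probability proportional to
        that topic's current size, or starts a new topic with probability
        proportional to alpha. (Illustrative sketch only.)"""
        rng = random.Random(seed)
        assignments = []
        counts = Counter()               # topic index -> words assigned so far
        for i in range(num_words):
            r = rng.uniform(0, i + alpha)
            cumulative = 0.0
            chosen = len(counts)         # index for a previously unused topic
            for topic, count in counts.items():
                cumulative += count
                if r < cumulative:
                    chosen = topic
                    break
            assignments.append(chosen)
            counts[chosen] += 1
        return assignments

    # The number of distinct topics grows with the data rather than being fixed:
    z = crp_topic_assignments(1000, alpha=5.0)
    print(len(set(z)), "topics used for 1000 words")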
Learning topics from other data
Our formulation of the basic topic model also assumes that words are divided into
documents, or otherwise broken up into units that share the same gist. A similar assumption is
made by LSA, but this is not true of all methods for automatically extracting semantic
representations from text (e.g., Dennis, 2004; Lund & Burgess, 1996; Jones & Mewhort, 2006).
This assumption is not appropriate for all settings in which we make linguistic inferences: while
we might differentiate the documents we read, many forms of linguistic interaction, such as
meetings or conversations, lack clear markers that break them up into sets of words with a common
gist. One approach to this problem is to define a generative model in which the document
boundaries are also latent variables, a strategy pursued by Purver, Koerding, Griffiths, and
Tenenbaum (2006). Alternatively, meetings or conversations might be better modeled by
associating the gist of a set of words with the person who utters those words, rather than words in
temporal proximity. Rosen-Zvi, Griffiths, Steyvers, and Smyth (2004) and Steyvers, Smyth,
Rosen-Zvi, and Griffiths (2004) have extensively investigated models of this form.
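To make the speaker-based alternative concrete, the sketch below shows a generative step in the spirit of an author-topic style model, in which each word's topic is drawn from a distribution associated with the speaker who utters it rather than with a document. The function and parameter names here are illustrative assumptions, not the specification of the published models.

    import numpy as np

    def generate_conversation(speakers, words_per_turn, speaker_topic_dist,
                              topic_word_dist, rng=np.random.default_rng(0)):
        """Sketch of a speaker-based generative process: for each word,
        (1) note which speaker utters it, (2) draw a topic from that
        speaker's distribution over topics, (3) draw the word from that
        topic's distribution over the vocabulary."""
        corpus = []
        for speaker in speakers:
            turn = []
            for _ in range(words_per_turn):
                topic = rng.choice(len(topic_word_dist), p=speaker_topic_dist[speaker])
                word = rng.choice(topic_word_dist.shape[1], p=topic_word_dist[topic])
                turn.append((speaker, topic, word))
            corpus.append(turn)
        return corpus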
Inferring topic hierarchies
We can also use the generative model framework as the basis for defining models that use
richer semantic representations. The topic model assumes that topics are chosen independently
when generating a document. However, people know that topics bear certain relations to one
another, and that words have relationships that go beyond topic membership. For example, some
topics are more general than others, subsuming some of the content of those other topics. The topic
of sport is more general than the topic of tennis, and the word SPORT has a wider set of associates
than TENNIS. These issues can be addressed by developing models in which the latent structure
concerns not just the set of topics that participate in a document, but the relationships among those
topics. Generative models that use topic hierarchies provide one example of this, making it
possible to capture the fact that certain topics are more general than others. Blei, Griffiths, Jordan,
and Tenenbaum (2004) provided an algorithm that simultaneously learns the structure of a topic
hierarchy, and the topics that are contained within that hierarchy. This algorithm can be used to
extract topic hierarchies from large document collections. Figure 14 shows the results of applying
this algorithm to the abstracts of all papers published in Psychological Review since 1967. The
algorithm recognizes that the journal publishes work in cognitive psychology, social psychology,
vision research, and biopsychology, splitting these subjects into separate topics at the second level
of the hierarchy, and finds meaningful subdivisions of those subjects at the third level. Similar
algorithms can be used to explore other representations that assume dependencies among topics
(Blei & Lafferty, 2006).
Insert Figure 14 about here
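The sketch below illustrates the nested clustering intuition behind such hierarchy-learning algorithms: each document selects a path from the root of a topic tree to a leaf, reusing branches that other documents have already chosen or creating new ones, so that topics near the root are shared broadly (and are therefore more general) while topics near the leaves are more specific. This is a schematic rendering under assumed data structures, not the algorithm of Blei, Griffiths, Jordan, and Tenenbaum (2004) itself.

    import random

    def sample_document_path(tree, depth, gamma=1.0, rng=random.Random(0)):
        """Pick one branch at each level of a topic tree for a new document,
        reusing a popular branch with probability proportional to how many
        documents already chose it, or creating a new branch with probability
        proportional to gamma. `tree` maps a node (a tuple path) to a dict of
        child -> count. Names and data structures are illustrative only."""
        path = ()
        for _ in range(depth):
            children = tree.setdefault(path, {})
            r = rng.uniform(0, sum(children.values()) + gamma)
            cumulative = 0.0
            chosen = len(children)            # label for a brand-new child
            for child, count in children.items():
                cumulative += count
                if r < cumulative:
                    chosen = child
                    break
            children[chosen] = children.get(chosen, 0) + 1
            path = path + (chosen,)
        return path   # one topic per level, from most general to most specific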
Collocations and associations based on word order
In the basic topic model, the probability of a sequence of words is not affected by the order
in which they appear. As a consequence, the representation extracted by the model can only
capture coarse-grained contextual information, such as the fact that words tend to appear in the
same sort of conversations or documents. This is reflected in the fact that the input to the topic
model, as with LSA, is a word-document co-occurrence matrix: the order in which the words
appear in the documents does not matter. However, it is clear that word order is important to many
aspects of linguistic processing, including the simple word association task that we discussed
extensively earlier in the paper (Ervin, 1961; Hutchinson, 2003; McNeill, 1966).
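The order-insensitivity of this input can be made explicit with a small example: building the word-document count matrix from a set of documents, and again from the same documents with their words shuffled, yields identical matrices. The helper below is purely illustrative.

    from collections import Counter

    def word_document_matrix(documents, vocabulary):
        """Entry (w, d) is the number of times word w appears in document d.
        Shuffling the words within a document leaves the matrix, and hence
        the model's input, unchanged."""
        return [[Counter(doc)[word] for doc in documents] for word in vocabulary]

    docs = [["money", "bank", "loan", "bank"], ["river", "bank", "stream"]]
    shuffled = [["bank", "loan", "bank", "money"], ["stream", "bank", "river"]]
    vocab = sorted({w for d in docs for w in d})
    assert word_document_matrix(docs, vocab) == word_document_matrix(shuffled, vocab)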
A first step towards relaxing the insensitivity to word order displayed by the topic model is
to extend the model to incorporate collocations: words that tend to follow one another with high
frequency. For example, the basic topic model would treat the phrase UNITED KINGDOM occurring
in a document as one instance of UNITED and one instance of KINGDOM. However, these two words
form a collocation, and should be treated as a single unit rather than as independent draws from a topic.
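One way to picture such an extension is with a latent switch for each word that decides whether the word is drawn from a topic, as in the basic model, or generated from a distribution conditioned on the previous word, allowing collocations like UNITED KINGDOM to be produced as a unit. The sketch below is a rough caricature under assumed parameterizations, not the specification of any published collocation model.

    import numpy as np

    def generate_with_collocations(length, topic_word_dist, bigram_dist,
                                   doc_topic_dist, switch_prob=0.2,
                                   rng=np.random.default_rng(0)):
        """For each word after the first, a latent switch decides whether the
        word comes from a topic (as in the basic model) or is chained to the
        previous word as part of a collocation. All parameter names here are
        illustrative assumptions."""
        topic = rng.choice(len(doc_topic_dist), p=doc_topic_dist)
        prev = rng.choice(topic_word_dist.shape[1], p=topic_word_dist[topic])
        words = [prev]
        for _ in range(length - 1):
            if rng.random() < switch_prob:
                # collocation route: next word depends on the previous word
                word = rng.choice(bigram_dist.shape[1], p=bigram_dist[prev])
            else:
                # topic route: next word depends only on the document's topics
                topic = rng.choice(len(doc_topic_dist), p=doc_topic_dist)
                word = rng.choice(topic_word_dist.shape[1], p=topic_word_dist[topic])
            words.append(word)
            prev = word
        return words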