This paper was presented at
ACH-ALLC '97
Paper
Towards a text benchmark suite
Richard S. Forsyth
University of the West of England
rs-forsyth@csm.uwe.ac.uk
Keywords: benchmarking, stylometry, text categorization
1. Introduction
In many areas of computing, benchmarking is a routine practice.
There is insufficient room here to go into the pros and cons
of benchmarking in any depth, except to acknowledge that
sets of benchmarks do have drawbacks as well as advantages.
Nevertheless benchmarking does have a role to play
in setting objective standards. For example,
in the field of forecasting, the work of Makridakis and
colleagues (e.g. Makridakis & Wheelwright, 1989), who tested a
number of forecasting methods on a wide range of time series,
transformed the field -- leading to both
methodological and practical advances. Likewise, in machine
learning, the general acceptance of the Machine-Learning Database
Repository (Murphy & Aha, 1991) as an agreed standard, and its
employment in extensive comparative tests (e.g.
Michie et al., 1994) has thrown new light on the strengths and
weaknesses of competing algorithms.
Although billion-byte public-domain archives of text exist, e.g.
Project Gutenberg and the Oxford Text Archive, stylometry
currently lacks an equivalent set of accepted test problems.
Therefore we at Bristol have compiled a textual benchmark suite.
The current version of this suite is known as Tbench96.
Despite its deficiencies, it does present a broader
variety of test problems than other workers in stylometry and
allied fields have previously used.
1.1 Selection Criteria
The text-categorization problems in this suite were selected to
fulfil a number of requirements.
- Provenance: the true category of each text should
be well attested.
- Variety: problems other than authorship should be
included.
- Language: not all the texts should be in English.
- Difficulty: both hard and easy problems should be
included.
- Size: the training texts should be of `modest'
size, such as might be expected in practical
applications.
The last point may need amplification. Although some huge text
samples are available, most text-classification tasks in real
life require decisions to be made on the basis of samples in the
order of thousands or tens of thousands of words.
An enormous training sample of undisputed text is, therefore,
something of a luxury.
Subject to these constraints, 13 test problems were chosen:
four authorship problems, three chronology problems,
three content-based problems, and three miscellaneous
problems. As is usual in machine learning, each category of text
was divided into non-overlapping training and test sets.
See section 2.
1.2 Pre-processing
In order to impose uniformity of layout and thus reduce the
effect of factors such as line-length (not usually an authorial
decision) all text samples have been passed through a program called PRETEXT.
This program makes some minor formatting changes, e.g. case-folding and
conversion of tabs into blanks. However, the most important change made
is to break running text into segments that
are then treated as cases to be classified.
Just what constitutes a natural unit of text is by no means
obvious. Different researchers have made different decisions
about the best way of segmenting long texts. Some have used fixed-length
blocks (e.g. Elliott & Valenza, 1991);
others have respected natural subdivisions in the text (e.g. Ule, 1982).
Both approaches have merits as well as disadvantages.
Because linguistic materials have a hierarchical structure there
is no universally correct segmentation scheme. In Tbench96 each
block boundary is taken as the first new-line in the text on or
after the 999th byte in the block being formed. Such units
will be referred to as kilobyte lines.
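The segmentation rule just described can be sketched in a few lines of Python. This is an illustrative reconstruction, not the PRETEXT program itself: the function names are hypothetical, and the decision to join the physical lines of each block with single blanks is an assumption made here so that each block becomes one `line'.

```python
def pretext_like(raw: bytes) -> bytes:
    """Minor normalization mentioned in the text: fold case, tabs to blanks."""
    return raw.lower().replace(b"\t", b" ")


def kilobyte_lines(text: bytes, threshold: int = 999) -> list[bytes]:
    """Cut text at the first new-line on or after the 999th byte of each block."""
    segments = []
    start = 0
    while start < len(text):
        # The 999th byte of the block being formed has index start + 998.
        cut = text.find(b"\n", start + threshold - 1)
        if cut == -1:
            cut = len(text)  # no further new-line: the remainder is the last block
        block = text[start:cut]
        # Assumption: physical lines within a block are joined by single blanks.
        joined = b" ".join(block.split())
        if joined:
            segments.append(joined)
        start = cut + 1
    return segments


# Toy input: 25 physical lines of 99 characters each.
sample = (b"A" * 99 + b"\n") * 25
segments = kilobyte_lines(pretext_like(sample))
words_per_line = [len(segment.split()) for segment in segments]
```

Counting blank-separated tokens per segment, as in the last line, is one simple way to arrive at a words-per-line figure of the kind quoted for the suite.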
The number of words per kilobyte line varies according to the
type of writing. A representative figure for Tbench96 as a whole
is 185 words per line. Thus this is an attempt to work with text
units near the lower limit of what has previously been considered
feasible. Evidence of this is provided by the two quotations
below, made 20 years apart.
"It is clear in the present study that there is
considerable loss in discriminatory power when samples
fall below 500 words". (Baillie, 1974)
"We do not think it likely that authorship
characteristics would be strongly apparent at levels
below say 500 words, or approximately 2500 letters.
Even using 500 word samples we should anticipate a
great deal of unevenness, and that expectation is
confirmed by these results." (Ledger & Merriam, 1994)
Although Felton (1996) has studied 100-word text blocks (in New
Testament Greek) and Simonton (1990) even analyzed word usage in
the final couplets of Shakespeare's 154 sonnets (averaging 17.6
words each), the block size in Tbench96 is small
relative to most previous stylometric studies;
therefore it poses a relatively challenging series of tests.
2. Details of Data Sets
The 13 text-classification problems that constitute Tbench96
(Text Benchmark Suite, 1996 edition) form an enhanced version of
the test suite used by Forsyth (1995). They constitute a
potentially valuable resource for future studies in text
analysis.
Summary information is given below
about the texts used in the benchmark suite. Note: A policy adhered
to throughout was never to split a single work (article, essay,
poem or song) between training and test sets.
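A policy of this kind is easy to verify mechanically once every text segment carries the identifier of the work it came from. The sketch below is a hypothetical checker; the pair layout (work identifier, partition name) is an assumption made for illustration and is not a description of the Tbench96 file format.

```python
from collections import defaultdict


def leaked_works(cases):
    """Return the ids of works whose segments occur in more than one partition.

    cases: iterable of (work_id, partition) pairs, one pair per text segment.
    """
    partitions_seen = defaultdict(set)
    for work_id, partition in cases:
        partitions_seen[work_id].add(partition)
    return sorted(w for w, parts in partitions_seen.items() if len(parts) > 1)


clean = [("poem-1", "training"), ("poem-1", "training"), ("poem-2", "test")]
split = clean + [("poem-1", "test")]
```

An empty result means that no work straddles the training and test sets.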
Authorship / Prose
FEDS (2 classes): A selection of papers by two Federalist
authors, Hamilton and Madison. This difficult authorship problem
-- subject of a ground-breaking analysis by Mosteller & Wallace
(1984 [1964]) -- is possibly the best candidate for an accepted
benchmark in stylometry.
An electronic text of the entire Federalist papers was
obtained by anonymous ftp from Project Gutenberg at
GUTNBERG@vmd.cso.uiuc.edu
For checking purposes the Dent Everyman edition was
used (Hamilton et al., 1992 [1788]). Division
into test and training sets was as follows.
Author    Set       Papers
Hamilton  Training  6, 7, 9, 11, 12, 17, 22, 27, 32, 36, 61, 67, 68, 69, 73, 76, 81
Hamilton  Test      1, 13, 16, 21, 29, 30, 31, 34, 35, 60, 65, 75, 85
Madison   Training  10, 14, 37-48
Madison   Test      49-58, 62, 63
This division implies accepting the view expounded by Martindale &
McKenzie (1995), who state that: "Mosteller and Wallace's
conclusion that Madison wrote the disputed Federalist papers is
so firmly established that we may take it as given."
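For readers who wish to reproduce the FEDS division, it can be written down as a small data structure and checked for consistency. The paper numbers below are those listed for each author and set; the helper names and the dictionary layout are illustrative assumptions.

```python
def expand(spec: str) -> set[int]:
    """Expand a list such as '10, 14, 37-48' into a set of paper numbers."""
    papers = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-")
            papers.update(range(int(first), int(last) + 1))
        else:
            papers.add(int(part))
    return papers


FEDS = {
    ("Hamilton", "training"): expand(
        "6, 7, 9, 11, 12, 17, 22, 27, 32, 36, 61, 67, 68, 69, 73, 76, 81"),
    ("Hamilton", "test"): expand(
        "1, 13, 16, 21, 29, 30, 31, 34, 35, 60, 65, 75, 85"),
    ("Madison", "training"): expand("10, 14, 37-48"),
    ("Madison", "test"): expand("49-58, 62, 63"),
}


def overlapping_papers(division):
    """Return paper numbers that appear in more than one cell of the division."""
    seen, repeated = set(), set()
    for papers in division.values():
        repeated |= seen & papers
        seen |= papers
    return repeated


sizes = {key: len(papers) for key, papers in FEDS.items()}
```

No paper number should appear in more than one of the four cells, so `overlapping_papers(FEDS)` is expected to be empty.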
JOJO (2 classes): Writings by Joseph Smith, the founder of the
Mormon religion, and Joanna Southcott, a religious prophet
contemporary with Smith -- from files kindly donated by Dr
David Holmes of UWE Bristol. Southcott's work was supplied
in four files: one from her diaries, two files of prophetic
meditations, and one file of prophetic verse. Smith's three
files were all extracts from his diaries. These
texts (and others) have been analyzed by Holmes (1992).
Authorship / Poetry
EZRA (3 classes): Poems by Ezra Pound, T.S. Eliot and William
B. Yeats -- three contemporaries who influenced each
other's writings. For example, Pound is known to have given
editorial assistance to Yeats and, famously, Eliot (Kamm, 1993).
A random selection of poems by Ezra Pound written up to
1926 was taken from Selected Poems 1908-1969 (Pound, 1977),
and entered by hand. It was supplemented by a random
selection of 18 pre-1948 Cantos, obtained from the Oxford
Text Archive. Poems by T.S. Eliot were from Collected Poems
1909-1962 (Eliot, 1963). A
random selection of 148 poems by W.B. Yeats was taken from
the Oxford Text Archive. For checking purposes Collected
Poems (Yeats, 1961) was used.
NAMESAKE (2 classes): Poems by Bob Dylan and Dylan Thomas.
Songs by Bob Dylan (born Robert A. Zimmerman) were obtained
from Lyrics 1962-1985 (Dylan, 1994). In addition, two
tracks from the album Knocked Out Loaded (Dylan, 1988) and
the whole A-side of Oh Mercy (Dylan, 1989) were transcribed
by hand and included, to give fuller coverage.
Poems of Dylan Thomas were obtained from Collected Poems
1934-1952 (Thomas, 1952) with four more early works added
from The Notebook Poems 1930-1934 (Maud, 1989).
Chronology
ED (2 classes): Poems by Emily Dickinson, early work being
written up to 1863 and later work being written after 1863.
Emily Dickinson had a great surge of poetic composition in
1862 and a lesser peak in 1864, after which her output
tailed off gradually. The work included is all of A Choice
of Emily Dickinson's Verse selected by Ted Hughes (Hughes,
1993) as well as a random selection of 32 other poems from
the Complete Poems (edited by T.H. Johnson, 1970).
JP (3 classes): Poems by John Pudney, divided into three
classes. The first category came from Selected Poems
(Pudney, 1946) and For Johnny: Poems of World War II
(Pudney, 1976); the second from Spill Out (Pudney, 1967)
and the third from Spandrels (Pudney, 1969). Every distinct
poem in these four books was used.
John Pudney (1909-1977) described his career as follows:
"My poetic life has been a football match. The war poems
were the first half. Then an interval of ten years. Then
another go of poetry from 1967 to the present time"
(Pudney, 1976). Here the task is to distinguish his war
poems (published before 1948) from poems in two other
volumes, published in 1967 and 1969.
WY (2 classes): Early and late poems of W.B. Yeats. Early work
is taken as that written up to 1914, the start of the First World
War, and later work as that written in or after 1916, the
date of the Irish Easter Rising, which had a profound
effect on Yeats's beliefs about poetry.
For these problems the classification objective was to
discriminate between early and late works by the same poet.
Subject-Matter
MAGS (2 classes): This used articles from two academic journals,
Literary and Linguistic Computing (75 articles) and Machine
Learning (69 articles). The task was to classify texts
according to which journal they came from. In fact, each
`article' consisted of the Abstract and first paragraph of
a single paper.
NEWS (4 classes): This data-set consists of news stories
extracted from the Associated Press wire service during
December 1979. A total of about 250,000 words was obtained
from the Oxford Text Archive, where it was deposited by Dr
G. Akers in 1980. Stories in this archive are classified
into at least six mutually exclusive categories. For
Tbench96, four of these story types were extracted: F --
Financial stories; I -- International stories; S -- Sports
stories; and W -- Washington stories. The Washington
category covers US domestic politics. For training data
stories up to 15th December were used.
For test data stories after that date were used.
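The chronological division of the NEWS data can be sketched as follows. The record layout (date, category code, text) is a hypothetical one chosen for illustration; only the four category codes and the cut-off date of 15 December 1979 come from the description above.

```python
from datetime import date

CUTOFF = date(1979, 12, 15)      # stories up to this date are training data
KEPT = {"F", "I", "S", "W"}      # Financial, International, Sports, Washington


def split_news(stories):
    """Split (story_date, category, text) records into training and test lists."""
    train, test = [], []
    for story_date, category, text in stories:
        if category not in KEPT:
            continue             # other story types are not part of the problem
        target = train if story_date <= CUTOFF else test
        target.append((category, text))
    return train, test


stories = [
    (date(1979, 12, 3), "F", "markets ..."),
    (date(1979, 12, 15), "S", "match report ..."),
    (date(1979, 12, 16), "W", "congress ..."),
    (date(1979, 12, 20), "X", "unused category ..."),
]
train, test = split_news(stories)
```

Splitting by date rather than at random means that the test stories postdate every training story, which is closer to the way a classifier of this kind would be used in practice.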
TROY (2 classes): Electronic versions of the complete texts of
Homer's Iliad and Odyssey, both transliterated into the
Roman alphabet in the same manner, were kindly supplied by
Professor Colin Martindale of the University of Maine at
Orono. Traditionally each work is divided into 24 sections
or `books'. For both works the training sample comprises
the odd-numbered books and the test sample consists of the
even-numbered books. The classification task is to tell
which work each kilobyte line comes from. (It is possible
that this task is an authorship discrimination as well
(Griffin, 1980).)
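The odd/even division used for TROY amounts to a split on the parity of the book number. A minimal sketch, with hypothetical names and placeholder strings standing in for the text of each book:

```python
def parity_split(units, train_parity=1):
    """Split units numbered from 1 by parity; train_parity=1 selects odd numbers."""
    train, test = [], []
    for number, unit in enumerate(units, start=1):
        (train if number % 2 == train_parity else test).append(unit)
    return train, test


books = [f"book {n}" for n in range(1, 25)]   # stand-ins for the 24 books
train_books, test_books = parity_split(books, train_parity=1)
```

Calling the same helper with `train_parity=0` would give the even-numbered training and odd-numbered test arrangement described for the GENDERS stories.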
Miscellaneous
GENDERS (2 classes): short stories written by first-year
undergraduate students at the University of Maine on the
subject: boy meets girl (or vice versa). These texts were
kindly supplied by Professor Colin Martindale of the
Psychology Department of the University of Maine at Orono.
These stories arrived in an arbitrary order. Even-numbered
stories were used as training data, odd-numbered stories as
test data. The objective was to distinguish tales written
by males from those written by females.
AUGUSTAN (2 classes): The Augustan Prose Sample donated by
Louis T. Milic to the Oxford Text Archive. For details of
the rationale behind this corpus and its later development,
see Milic (1990). This data consists of extracts by many
English authors during the period 1678 to 1725. It is held
as a sequence of records each of which contains a single
sentence. Sentence boundaries identified by Milic were respected.
RASSELAS (2 classes): The complete text of Rasselas by Samuel
Johnson, written in 1759. This was obtained in electronic
form from the Oxford Text Archive. For checking purposes,
the Clarendon Press edition was used (Johnson, 1927
[1759]). This novel consists of 49 chapters. These were
allocated alternately to four different files.
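One reading of this allocation is a simple rotation: chapter 1 to the first file, chapter 2 to the second, and so on, returning to the first file with chapter 5. The sketch below assumes that reading; the text does not specify the exact order, nor how the four files map onto classes and onto training and test sets, so neither is modelled here.

```python
def round_robin(n_items: int, n_files: int) -> list[list[int]]:
    """Deal item numbers 1..n_items into n_files lists in rotation."""
    files = [[] for _ in range(n_files)]
    for item in range(1, n_items + 1):
        files[(item - 1) % n_files].append(item)
    return files


chapter_files = round_robin(49, 4)
file_sizes = [len(chapters) for chapters in chapter_files]
```

Under this assumption the first file receives 13 chapters and the other three receive 12 each.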
The inclusion of random or quasi-random data may need justification.
The chief objective of doing so here was to
provide an opportunity for what statisticians call overfitting
to manifest itself. The author's view is that some `null' cases
should form part of any benchmark suite: as well as finding what
patterns do exist, a good classifier should avoid finding
patterns that don't exist.
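One way of using a `null' problem is to ask how often a classifier that is merely guessing would score at least as well as the result obtained. The sketch below computes that upper-tail binomial probability. It is offered as an illustration, not as part of the benchmark: it assumes that test cases are independent, which kilobyte lines drawn from the same work may well not be, and the function name is hypothetical.

```python
from math import comb


def chance_tail(correct: int, total: int, p_chance: float = 0.5) -> float:
    """P(X >= correct) for X ~ Binomial(total, p_chance)."""
    return sum(
        comb(total, k) * p_chance**k * (1 - p_chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# Example: 5 or more correct out of 10 on a two-class null problem.
tail = chance_tail(5, 10)
```

A very small tail probability on a problem known to contain no real pattern would suggest that the method, or the evaluation procedure, is finding structure that is not there.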
Acknowledgements
Thanks are due to Dr David Holmes and Professor Colin Martindale
for providing some of the text files used in this benchmarking
suite, as well as for helpful comments. In addition, the
following institutions -- the Oxford Text Archive, Project
Gutenberg, and UWE's Bolland Library -- have also provided
resources without which this collection could not have been compiled.
References
Baillie, W.M. (1974). Authorship Attribution in Jacobean Dramatic
Texts. In: J.L. Mitchell, ed., Computers in the Humanities,
Edinburgh Univ. Press.
Dylan, B. (1988). Knocked Out Loaded. Sony Music Entertainment
Inc.
Dylan, B. (1989). Oh Mercy. CBS Records Inc.
Dylan, B. (1994). Lyrics 1962-1985. Harper Collins Publishers,
London. [Original U.S. edition published 1985.]
Eliot, T.S. (1963). Collected Poems 1909-1962. Faber & Faber
Limited, London.
Elliott, W.E.Y. & Valenza, R.J. (1991). A Touchstone for the
Bard. Computers & the Humanities, 25, 199-209.
Felton, R. (1996). Personal Communication. [From: Manukau
Institute of Technology, Auckland, N.Z.]
Forsyth, R.S. (1995). Stylistic Structures: a Computational
Approach to Text Classification. Unpublished Doctoral Thesis,
Faculty of Science, University of Nottingham.
Griffin, J. (1980). Homer. Oxford University Press, Oxford.
Hamilton, A., Madison, J. & Jay, J. (1992). The Federalist
Papers. Everyman edition, edited by W.R. Brock: Dent, London.
[First edition, 1788.]
Holmes, D.I. (1992). A Stylometric Analysis of Mormon Scripture
and Related Texts. J. Royal Statistical Society (A), 155(1),
91-120.
Hughes, E.J. (1993). A Choice of Emily Dickinson's Verse. Faber
& Faber Limited, London.
Johnson, S. (1927). The History of Rasselas, Prince of Abyssinia.
Clarendon Press, Oxford. [First edition 1759.]
Johnson, T.H. (1970) ed. Emily Dickinson: The Complete Poems. Faber
& Faber Limited, London.
Kamm, A. (1993). Biographical Dictionary of English Literature.
HarperCollins, Glasgow.
Ledger, G.R. & Merriam, T.V.N. (1994). Shakespeare, Fletcher, and
the Two Noble Kinsmen. Literary & Linguistic Computing, 9(3),
235-248.
Makridakis, S. & Wheelwright, S.C. (1989). Forecasting Methods
for Managers, fifth edition. John Wiley & Sons, New York.
Martindale, C. & McKenzie, D.P. (1995). On the Utility of Content
Analysis in Authorship Attribution: the Federalist. Computers &
the Humanities, 29, in press.
Maud, R. (1989) ed. Dylan Thomas: the Notebook Poems 1930-1934.
J.M. Dent & Sons Limited, London.
Michie, D., Spiegelhalter, D.J. & Taylor, C.C. (1994) eds.
Machine Learning, Neural and Statistical Classification. Ellis
Horwood, Chichester.
Milic, L.T. (1990). The Century of Prose Corpus. Literary &
Linguistic Computing, 5(3), 203-208.
Mosteller, F. & Wallace, D.L. (1984). Applied Bayesian and
Classical Inference: the Case of the Federalist Papers.
Springer-Verlag, New York. [Extended edition of: Mosteller &
Wallace (1964). Inference and Disputed Authorship: the
Federalist. Addison-Wesley, Reading, Massachusetts.]
Murphy, P.M. & Aha, D.W. (1991). UCI Repository of Machine
Learning Databases. Dept. Information & Computer Science,
University of California at Irvine, CA. [Machine-readable
depository: http://www.ics.uci.edu/~mlearn/MLRepository/html.]
Pound, E.L. (1977). Selected Poems. Faber & Faber Limited,
London.
Pudney, J.S. (1946). Selected Poems. John Lane The Bodley Head
Ltd., London.
Pudney, J.S. (1967). Spill Out. J.M. Dent & Sons Ltd., London.
Pudney, J.S. (1969). Spandrels. J.M. Dent & Sons Ltd., London.
Pudney, J.S. (1976). For Johnny: Poems of World War II.
Shepheard-Walwyn, London.
Simonton, D.K. (1990). Lexical Choices and Aesthetic Success: a
Computer Content Analysis of 154 Shakespeare Sonnets. Computers
& the Humanities, 24, 251-264.
Thomas, D.M. (1952). Collected Poems 1934-1952. J.M. Dent & Sons
Ltd., London.
Ule, L. (1982). Recent Progress in Computer Methods of Authorship
Determination. ALLC Bulletin, 10(3), 73-89.
Yeats, W.B. (1961). The Collected Poems of W.B. Yeats. Macmillan
& Co. Limited, London.