Association for Computers and the Humanities
Poster Sessions from ALLC/ACH '96
Corpus Methodology for Interlingual Machine Translation (MT)
~ OVERVIEW ~
"Efforts in the development of natural language
processing (NLP) and information technology are
converging on the recognition of the importance
of some sort of corpus-based research as part of
the infrastructure for the development of advanced
language processing applications."
- Atkins, Clear and Ostler 1992
Why?
- Availability of on-line corpora
- Advances in storage capacity and computing power
- General renewal of interest in data-centered approaches
- Renewed Prestige - From Thirty Years Ago
- Data was sparse then
- Linguistic generalities could apply only to observed data
- Intuitions were overlooked
- Characteristics - Study of Language Use
- Historical linguistic events can predict potential linguistic events
- Level of Language regularly employed
- Understood by members of a linguistic community
- Absent from most grammars
- Goals - Now
Broad Coverage through
- Observation-enhanced linguistic intuitions
- Testing, confirming reformulated hypotheses
- More effective computational grammars
~ Machine Translation ~
- Renewed Prestige - From Thirty Years Ago
- ALPAC (1966) discredited MT efforts
- Emphasis on formal linguistic theories
- Major advances in computational power
- Characteristics - Study of Language Use
- Limitations of formal theories: different focus
- Direct and transfer approaches: syntactic level
- Quality is leveling off: call for paradigmatic shift
- Knowledge-based interlingua among new strategies
- Goals - Now
Broad Coverage through
- Robust, text-meaning, domain representations
- Processing of non-literal language
- Construction grammars
- Usefulness and Limitations of Theoretical Models
- Formal Grammars enhance processing
- Rules are broken for communicative reasons
- Much of linguistic expression based on different rules!
- Goals and Challenges of Broad Coverage
- Identification of what is not covered
- Use of morphologies, taggers and stop lists
- Elimination of non-characteristic phenomena
- Language Use
- Characterizing the Level of Language Use
- MT research: non-literal language, constructions
- Corpus linguistics: quantitative studies, characterizations
- May vary according to text type
- Normalcy and frequency of phenomena as criteria
~ Language Use: Traditions in Processing ~
- Interlingual MT Research
- Top-down processing: Identify types then find tokens
- Non-literal language: Types of deviation from literalness
- Constructions: Types of non-compositional meaning
- Corpus Linguistics Research
- Bottom-up processing: Patterns in the text
- Frequency studies: Account for what happens most often
- Revise and refine hypotheses based on data
- Corpus Data to Inform MT Lexical-Semantic Design
- Patterns, frequently occurring phenomena and idiosyncratic expressions as representations of semantic and structural types
- Efficiency demands that MT modules be designed to process phenomena that actually occur in text, not invented examples
~ Language Use: Description ~
- Linguistic Dichotomies
- Grammatical v. ungrammatical structures
- Competence v. performance models
- Lexical v. syntactic representations
- Grammar Studies using the LOB Corpus - an example problem
- NP structure types with one token apiece
- Uncommon NP types in common constituent types
- Drawing the line in a grammar of language use - currency
- Normative, no meta-grammatical or highly stylized language
- Based on language recognition, not production skills
- Solutions
- Lexical meaning can vary with form, e.g., "see" v. "seeing"
- Language Use and Constructions
~ Language Use: Translation ~
- Translation Process
- Distinguishes MT from other NLP applications
- Translation theory emphasis:
- Meaning of source language (SL) text
- Interlingual MT research focus:
- Representation of SL text meaning
- Organization of Present Project
- Exploration of linguistic issue as MT research goal
- Italian predicates of cognition, sensation and emotion
- Selection of SL forms based on aspectual features
- Observation of sense differences at lexical level in TL
- Capturing of generalities
- Consequences
- Some sense differences reflected in form alone
- Morphological info included in semantic processing
- Developing a Special-Purpose Corpus
- Texts to Meet Linguistic Research Goal
- Text types distinguished by internal structure
- Informational texts containing Experiencer verbs
- Texts to Meet MT Application Goal
- Genres distinguished by external function
- Material typically submitted for MT:
- newspapers, journals, technical material
- Issues in Selecting Texts for Inclusion in the Corpus
- Textual-linguistic features: What is being investigated?
- Size: Are there enough tokens of the phenomenon?
- Availability: Where does it reside? Protected?
- Practical Issues: Copyright, format, tagging, etc.
- Balance: Text types (synchronic), linguistic eras (diachronic)
~ Results of Survey of Available Italian Corpora ~
| Source | Topic/Type |
| European Corpus Initiative CD-ROM | fiction, news, technical, legal |
| Dante Project of Dartmouth University | literature and literary criticism |
| Oxford Text Archive | fiction, news |
| Center for Electronic Text in the Humanities of Georgetown University | fiction, correspondence |
| Italian Reference Corpus | fiction, news, magazines, technical, legal |
| Manuzio Project | fiction, technical |
| Freebook Project | fiction, translations, magazines, technical, political |
| Corriere della Sera CD-ROM | news |
| Il Sole 24 Ore CD-ROM | news |
| La Stampa CD-ROM | news |
~ Corpus Development ~
- Solution for Present Study: The Italian Reference Corpus (IRC)
- Linguistic Phenomenon
- Ample tokens of Experiencer verb forms
- Availability
- KWIC runs available via ftp
- Database Testuale (DBT) as interface at ILC
- Restrictions: For research only
- Reference Code tagged to every context
- Permits token extraction from specific sources
- Present study: non-informational texts excluded
- Synchronic linguistic study permitted
- Relevant texts in concise timeframe
- Present study: contemporary language only
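The extraction step outlined above (pulling tokens of selected verb forms and keeping only contexts from chosen sources) can be illustrated with a minimal keyword-in-context sketch. It does not reproduce the DBT interface or the IRC reference-code format: the `(source, text)` pair layout, the placeholder labels `SRC-A` and `SRC-B`, and the function name `kwic` are all assumptions made for illustration.

```python
import re

def kwic(texts, forms, sources=None, width=30):
    """Return keyword-in-context hits for the given verb forms.

    texts   -- iterable of (source_label, text) pairs (hypothetical layout)
    forms   -- word forms to look for, e.g. ["pensava", "pensò"]
    sources -- optional set of source labels to keep; None keeps all
    width   -- number of context characters kept on each side
    """
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(f) for f in forms) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for source, text in texts:
        if sources is not None and source not in sources:
            continue  # skip contexts from excluded sources
        for m in pattern.finditer(text):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            hits.append((source, left, m.group(0), right))
    return hits

# Placeholder labels; SRC-B stands in for a source excluded from the study.
sample = [
    ("SRC-A", "Giacomo Bove pensava a una possibile emigrazione"),
    ("SRC-B", "Un giorno, pensò, finiremo per conoscerci"),
]
for source, left, key, right in kwic(sample, ["pensava", "pensò"], sources={"SRC-A"}):
    print(f"{source}: {left}[{key}]{right}")
```

Keeping the source label on every hit is what makes it possible to restrict a run to particular publications after the fact.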
~ Italian Reference Corpus: A Subcorpus ~
Newspaper Subcorpus: Source Information
| Publication | Topic Type | Number of Words | % of Subcorpus |
| La Repubblica | daily newspaper | 1,890,481 | 23.53 |
| Epoca | weekly magazine | 953,192 | 11.87 |
| Panorama | weekly magazine | 921,625 | 11.47 |
| Zero-Uno | computing | 592,251 | 7.37 |
| Espansione | computational economy | 532,334 | 6.63 |
| Star Bene | medical scientific | 839,325 | 10.45 |
| Storia Illustrata | history | 788,871 | 9.82 |
| Grazia | women's weekly | 607,329 | 7.56 |
| Cento Cose | teenager's weekly | 469,505 | 5.84 |
| Casa Viva | furnishing | 437,754 | 5.45 |
(Adapted from Bindi, et al. 1991:6-7)
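As a quick check on the table, the percentage column can be recomputed from the word counts. The sketch below copies the counts from the table; the names `WORD_COUNTS` and `subcorpus_shares` are illustrative and not part of the IRC documentation.

```python
# Word counts copied from the subcorpus table.
WORD_COUNTS = {
    "La Repubblica": 1_890_481,
    "Epoca": 953_192,
    "Panorama": 921_625,
    "Zero-Uno": 592_251,
    "Espansione": 532_334,
    "Star Bene": 839_325,
    "Storia Illustrata": 788_871,
    "Grazia": 607_329,
    "Cento Cose": 469_505,
    "Casa Viva": 437_754,
}

def subcorpus_shares(counts):
    """Return each publication's share of the total, in percent (2 decimals)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}

total_words = sum(WORD_COUNTS.values())
shares = subcorpus_shares(WORD_COUNTS)
for name, pct in shares.items():
    print(f"{name:<18} {WORD_COUNTS[name]:>10,} {pct:>6.2f}")
print(f"{'Total':<18} {total_words:>10,}")
```

The counts sum to 8,032,667 words, and the recomputed share for La Repubblica agrees with the 23.53 given in the table.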
~ Design of Analysis ~
- Analysis
- Phenomenon to be investigated
- Form/Sense matrix
- Italian SL Aspectual Changes
- Morphologically encoded - other levels not analyzed
- Signalling differences at lexical level in English TL
- Verb Forms Reflect Grammatical Aspect or Viewpoint Type
- Imperfect: continuous action - imperfective aspect
- Preterit: completed action - perfective aspect
- Senses Reflect Inherent or Situational Aspect: e.g., pensare
- cause to have a mental representation: "think up"
- have a mental representation: "think"
- Ha pensato v. pensava
- Complications: complementation, phrasals, constructions
~ Italian Experiencer Verbs: Frequency Study Data ~
Most Frequent Experiencer Verbs in Three Studies: Sciarone '77, Bortolini et al. '71, and Juilland and Traversa '73
| SCIARONE | BORTOLINI | JUILLAND | SYNTHESIS |
| volere | sapere | volere | vedere |
| sapere | vedere | vedere | volere |
| vedere | sentire | sapere | sapere |
| sentire | pensare | parere | parere |
| guardare | credere | guardare | pensare |
| pensare | capire | credere | sembrare |
| credere | guardare | pensare | guardare |
| parere | ricordare | sembrare | credere |
| sembrare | sembrare | sentire | ricordare |
| capire | cercare | conoscere | cercare |
| conoscere | parere | cercare | conoscere |
| cercare | piacere | capire | sentire |
| ricordare | amare | ricordare | apparire |
| piacere | provare | apparire | riconoscere |
| apparire | toccare | rispondere | comprendere |
| amare | accorgersi | riconoscere | mostrare |
| intendere | interessare | intendere | intendere |
| riconoscere | scoprire | considerare | rispondere |
| comprendere | dimenticare | comprendere | capire |
Hypothesis 1:
A verb sense which entails inherent continuous aspect will, in
forms expressing the grammatical aspect of completed action,
indicate the inception of the verb sense.
Hypothesis 2:
A verb sense which entails inherent completed aspect will not
appear in forms expressing the grammatical aspect of
continuous action.
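The two hypotheses can be read as a small decision rule over a sense's inherent aspect and a form's grammatical aspect. The sketch below is one way to write that rule down. The names `Aspect` and `predicted_reading` and the three return labels are illustrative assumptions, not terminology from the study.

```python
from enum import Enum

class Aspect(Enum):
    CONTINUOUS = "continuous"
    COMPLETED = "completed"

def predicted_reading(inherent, grammatical):
    """Return what the two hypotheses predict for a sense/form pairing."""
    if inherent is Aspect.CONTINUOUS and grammatical is Aspect.COMPLETED:
        return "inceptive"      # Hypothesis 1: completed form marks inception
    if inherent is Aspect.COMPLETED and grammatical is Aspect.CONTINUOUS:
        return "not attested"   # Hypothesis 2: pairing should not occur
    return "no prediction"      # pairing not addressed by either hypothesis

# A continuous sense in a preterit form vs. a completed sense in an imperfect form.
print(predicted_reading(Aspect.CONTINUOUS, Aspect.COMPLETED))
print(predicted_reading(Aspect.COMPLETED, Aspect.CONTINUOUS))
```

Pairings where inherent and grammatical aspect agree fall outside both hypotheses, so the sketch deliberately makes no prediction for them.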
~ Verb Senses ~
Pensare:
- Examine with thought or imagination; reason; think about; judge
- Hold a fixed thought; believe; suppose
- Invent; think up
- Take care of; attend to
Sentirsi:
- Be aware of a feeling or an emotive situation
- Experience a psychic sensation
- Be conscious of a physical sensation
Vedere:
- Perceive with the eyes
- Encounter; meet; consult; visit
- Perceive with the mind; grasp; realize; understand; imagine
| Pensare |
| FORMS: | pensa - 189 | pensava - 80 | pensò - 48 |
| Sense 1 - 159 | 96 | 36 | 27 |
| Sense 2 - 131 | 75 | 44 | 11 |
| Sense 3 - 7 | 0 | 0 | 7 |
| Sense 4 - 22 | 19 | 0 | 3 |

| Sentirsi |
| FORMS: | mi sento - 123 | mi sentivo - 28 | mi sentii - 5 |
| Sense 1 - 13 | 10 | 1 | 2 |
| Sense 2 - 134 | 105 | 27 | 2 |
| Sense 3 - 9 | 8 | 0 | 1 |

| Vedere |
| FORMS: | vedevo - 95 | vedevano - 12 | videro - 19 |
| Sense 1 - 35 | 25 | 4 | 6 |
| Sense 2 - 36 | 29 | 3 | 4 |
| Sense 3 - 55 | 41 | 5 | 9 |
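One way to scan a form/sense matrix for Hypothesis 2 candidates is to list the senses that never occur in an imperfect form. The sketch below uses the per-cell counts given above. The data layout, the `IMPERFECT` set, and the function name are assumptions made for illustration, and a zero count in a sample of this size is only suggestive.

```python
# Per-cell counts from the form/sense tables; each tuple follows the order of "forms".
MATRIX = {
    "pensare": {
        "forms": ("pensa", "pensava", "pensò"),
        "counts": {1: (96, 36, 27), 2: (75, 44, 11), 3: (0, 0, 7), 4: (19, 0, 3)},
    },
    "sentirsi": {
        "forms": ("mi sento", "mi sentivo", "mi sentii"),
        "counts": {1: (10, 1, 2), 2: (105, 27, 2), 3: (8, 0, 1)},
    },
    "vedere": {
        "forms": ("vedevo", "vedevano", "videro"),
        "counts": {1: (25, 4, 6), 2: (29, 3, 4), 3: (41, 5, 9)},
    },
}

# Forms expressing continuous action (imperfect tense).
IMPERFECT = {"pensava", "mi sentivo", "vedevo", "vedevano"}

def senses_absent_from_imperfect(verb):
    """List senses with zero tokens in every imperfect form of the verb."""
    entry = MATRIX[verb]
    idx = [i for i, form in enumerate(entry["forms"]) if form in IMPERFECT]
    return [
        sense
        for sense, row in entry["counts"].items()
        if all(row[i] == 0 for i in idx)
    ]

for verb in MATRIX:
    print(verb, senses_absent_from_imperfect(verb))
```

For pensare this flags Sense 3 ("think up"), along with Sense 4, which has no imperfect tokens in the sample either. Flagged senses are candidates for closer inspection, not confirmed cases of inherent completed aspect.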
~ Pensare: Examples ~
H1.
Sense 1 of pensava (durative):
Giacomo Bove pensava a una possibile emigrazione
Sense 1 of pensò (inceptive):
facevano il giro delle corti europee. Quindi, pensò [...] di dimostrare...
H2.
Sense 3 of pensò:
Un giorno, pensò, finiremo per conoscerci [...]
Similar structure with pensava (Sense 2):
«Io andrò più lontano» pensava Giuanin, ambizioso [...]
~ Sentirsi: Observations ~
H1.
Grammatical aspect of completed action is interpreted as temporary duration, often indicated by an accompanying adverbial, rather than inception:
[...] allora sì che mi sentii ricco e felice [...]
H2.
Hypothesis not testable, as no senses entail inherent completed aspect.
General Comments
Verb senses: all are states, so not aspectually significant
States entail duration, permanent or temporary
~ Vedere: Observations ~
H1.
Grammatical aspect of completed action signals the mental act of envisioning.
Sense 3: videro
[...] la fantasia dei governanti che videro nell'iniziativa
Sense 3: vedevano
[...] i conservatori lo vedevano come socialista.
H2.
Hypothesis not testable: All senses present in data entail inherent continuous aspect.
General Comments
Expressions such as non vedere l'ora and vedere rosso: candidates for inclusion in a Construction Lexicon.
- Language Use/Construction Criteria
- Flexible enough for lexical variation?
- Possibility of good structural description?
- High frequency of occurrence? Not too deviant?
- Careful Selection of Material
- Genres relevant to the application
- Text types with similar features
- Material which contains the phenomenon!
- Observe patterns in Form/Sense Matrix
- Grammatical v. inherent aspect
- Anomalies as candidates for unique treatment