
Association for Computers and the Humanities

Poster Sessions from ALLC/ACH '96



COLT Homepage: http://www.hd.uib.no/colt/

COLT on TACT

A demonstration of the TACTweb software as applied to the Bergen Corpus of London Teenage Language

Gisle Andersen and Kristine Hasund
English Dept.
University of Bergen

1 COLT and TACTweb


The Bergen Corpus of London Teenage Language (COLT) is the first large English corpus focusing on the speech of teenagers. It was collected in 1993 and consists of audiotaped recordings of the spoken language of 13- to 17-year-old boys and girls from different boroughs of London.

The aim of the COLT project is to compile a 500,000-word corpus of spoken teenage language and to make it available to students of English at the University of Bergen, as well as to language researchers worldwide.

This poster presents the use of TACTweb on the COLT corpus. TACTweb is experimental software developed as a part of a project by John Bradley and Geoffrey Rockwell. It connects the text-retrieval program TACT to the World Wide Web, enabling the user to search in a database of spoken conversations for the location of words, word combinations and word formation patterns. In the COLT database, TACTweb is applied to give the distribution of an item in relation to certain non-linguistic variables.

Searches in the corpus are made possible through the indexing of the texts in the database. The COLT database has the following indices:

  1. Reference number for each text file (e.g. <REF> B132401)
  2. who= index for speaker identity (e.g. who=1)
  3. id= index for speaker turn number (e.g. id=1). This is the same index as is used in the BNC (the British National Corpus)
  4. speaker's age (e.g. <AGE1> 14)
  5. speaker's gender (e.g. <GEN1> f)
  6. speaker's socioeconomic group (e.g. <SOC1> 2)
  7. speaker's occupation (e.g. <OCC1> student)
  8. location of conversation (e.g. <LOC> Hackney)
  9. setting of conversation (e.g. <SET> classroom)
  10. number of participants (e.g. <AUD> 5)
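
As a rough illustration, the ten indices listed above can be pictured as fields attached to every word token. The following Python sketch is an assumption made for illustration only and is not TACT's internal database format: the class name Token and all field names are hypothetical, and the example token is made up by combining the sample values from the list rather than taken from a real corpus line.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One word token with the COLT reference indices (hypothetical layout)."""
    word: str
    ref: str          # text file reference number, e.g. "B132401"
    who: int          # speaker identity
    id: int           # speaker turn number
    age: int          # speaker's age
    gender: str       # speaker's gender, e.g. "f"
    soc: int          # speaker's socioeconomic group
    occupation: str   # speaker's occupation
    location: str     # location of conversation, e.g. "Hackney"
    setting: str      # setting of conversation, e.g. "classroom"
    audience: int     # number of participants


# Made-up token that reuses the sample values from the index list.
example = Token(
    word="wicked", ref="B132401", who=1, id=1, age=14, gender="f",
    soc=2, occupation="student", location="Hackney",
    setting="classroom", audience=5,
)
print(example.location)
```

With every token carrying these fields, a search restricted by borough, gender or setting amounts to filtering on the corresponding field.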

2 Distribution

In TACTweb, the Distribution display shows how the occurrences of a word are distributed across a number of non-linguistic variables. It can show which speakers use a given linguistic item the most, and how its use varies with the speakers' gender, age, occupation, socioeconomic group and so forth. Conversation-specific features such as location, setting and number of participants are also searchable.

Here is the distribution of the word wicked with respect to the various boroughs where the conversations take place:

TACTweb Results

Title:

Query:

wicked

Hackney 54 ******************************************************
Stoke_Newi 1 *
Tower_Haml 10 **********
Camden 19 *******************
Westminste 1 *
Brent 3 ***
Islington 5 *****
Barnet 28 ****************************
Hertfordsh 12 ************

Total: 133.


The following table gives the distribution of words containing the sequence shit, according to the gender of the speakers:

Query:

.*shit.*

m 298 **************************************************
f 151 **************************
??? 12 **

* = 6, Total: 461.


And finally, here is the distribution of the same sequence according to the various settings of the conversations:

Query:

.*shit.*

school_stu 87 ********************************************
home 72 ************************************
classroom 47 ************************
boarding_s 21 ***********
respondent 23 ************
bus 7 ****
school 38 *******************
??? 2 *
Church_Str 1 *
flat 2 *
outside_ho 10 *****
home_/clas 11 ******
park 47 ************************
games_hall 1 *
outside 6 ***
house 12 ******
home/frien 15 ********
school_din 3 **
pub 1 *
Peter's_ho 3 **
street 2 *
restaurant 11 ******
school_for 3 **
school_pla 2 *
gym_room_- 3 **
playground 7 ****
school,_ou 5 ***
school_six 18 *********
entrance_t 1 *

* = 2, Total: 461.
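
The layout of these displays can be approximated in a few lines of Python. This is a minimal sketch, not TACTweb's own code: the function name render_distribution is hypothetical, and rounding the star count up is an assumption that matches cases such as the 7 hits for bus shown as four stars at a scale of 2.

```python
import math


def render_distribution(counts, scale=1):
    """Return display lines: label, count, and one star per `scale` hits.

    Rounding up is an assumption, chosen so that every non-zero
    count is shown with at least one star.
    """
    lines = []
    for label, n in counts.items():
        stars = "*" * math.ceil(n / scale)
        lines.append(f"{label} {n} {stars}")
    total = sum(counts.values())
    lines.append(f"* = {scale}, Total: {total}.")
    return lines


# Counts copied from the gender display for the query .*shit.*
by_gender = {"m": 298, "f": 151, "???": 12}
for line in render_distribution(by_gender, scale=6):
    print(line)
```

The scale only affects the length of the bars; the counts and the total are always reported exactly.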

3 Word list / morphology

The word list mode is a simple means of extracting information regarding morphology, the formation of new words, the use of certain affixes and so on.

For instance, the following is a list of all words matching the query .*shit.*, that is, all words containing the sequence shit, with their respective frequencies:

Query:

.*shit.*

apeshit (1)
bullshit (5)
bullshitter (1)
shit (405)
shite (17)
shithole (1)
shits (1)
shit's (1)
shitted (2)
shittiest (1)
shittily (1)
shitting (13)
shitty (11)
shity (1)

And here is a list that will indicate the productivity and frequency of the suffix -able:

Query:

.*able

able (77) miserable (3) table (74)
available (6) noticeable (1) timetable (1)
believable (1) portable (2) unable (2)
cable (1) predictable (1) unavailable (1)
capable (2) reasonable (7) unbelievable (4)
changeable (2) rechargeable (1) uncomfortable (2)
comfortable (9) reliable (1) unfuckingtouchable (2)
fashionable (1) renewable (8) unreliable (1)
impressionable (1) reputable (1) unscrewable (1)
inequitable (1) respectable (3) unsociable (1)
inimitable (1) sizable (1) untouchable (1)
irritable (1) sociable (3) up-gradable (2)
malleable (5) suitable (1) vulnerable (4)
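
A word list of this kind can be sketched in Python as a pattern match over word forms followed by a frequency count. The sketch assumes that a TACTweb pattern must match the whole word and that matching ignores case, which is how Python's re.fullmatch is used here; the function name word_list and the token stream are made up for illustration and are not COLT data.

```python
import re
from collections import Counter


def word_list(tokens, pattern):
    """Count every word form that the pattern matches in full."""
    regex = re.compile(pattern, re.IGNORECASE)
    counts = Counter(t.lower() for t in tokens if regex.fullmatch(t))
    return sorted(counts.items())


# Made-up token stream for illustration, not COLT data.
tokens = ["That", "table", "is", "unbelievable", "Table", "tablet", "able"]
for word, n in word_list(tokens, r".*able"):
    print(f"{word} ({n})")
```

Because the pattern is purely orthographic, it also returns words such as able, table and cable in the list above, where -able is not a suffix; these have to be weeded out by hand before drawing conclusions about productivity.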

4 Selection refinements

There are various ways of restricting the search to a certain part of the corpus. For instance, the following search lists the occurrences of the word like in one particular text:

Query:

like;when ref=b140810

B140810 i=22 know she was slagging me off like anything, right and then,
B140810 i=24 | | |w17 And then, and then like, you know and then she
B140810 i=24 let's me sit opposite you | like [hinting] | | |w1 [Mm.]
B140810 i=26 | | |w1 [Mm.] | | |w17 like oh I don't wanna sit
B140810 i=32 | | |w17 I don't like her. It's like you know
B140810 i=32 | |w17 I don't like her. It's like you know when we had that
B140810 i=36 with you I don't want one like | one big fight. | |
B140810 i=42 |w1 Mm. [I mean] | | |w17 [If like,] I hated you I wouldn't
B140810 i=46 | | |w1 [Mm.] | | |w17 I like saying that I didn't like
B140810 i=46 I like saying that I didn't like you know that kind of
B140810 i=46 that kind of bloke | but not like she's such a fucking
B140810 i=52 nice one, I reckon who were | like maybe her expression
B140810 i=60 not that bad but I wouldn't like | to, [I mean] | |
B140810 i=61 I'd forgive</> you kind of like and you know like in a
B140810 i=61 you kind of like and you know like in a while I'd | forgive
B140810 i=67 what she's <unclear>. And you like | the rest of it don't
B140810 i=69 then to Elli | she goes Elli I like you but Sabrina's a
B140810 i=87 | |w17 Well thick in what way like? | | |w1 Well like,
B140810 i=88 what way like? | | |w1 Well like, thick. | | |w17 Not
B140810 i=91 thick. I mean you're not thick like, thick in | what way tell
B140810 i=92 of being thick. | | |w1 Well like, you know eh, don't
B140810 i=95 | |w1 Mm. | | |w17 You don't like <unclear> don't really
B140810 i=111 the helmet. Who else is | like one point | | |w1 I've
B140810 i=117 | | |w1 Mhm. | | |w17 I like that. | | |w1 Are they
B140810 i=122 to watch films. I mean like, it's different | if I
B140810 i=122 it's different | if I was like, you know like if it was
B140810 i=122 | if I was like, you know like if it was the summer
B140810 i=122 summer holidays, and we was | like this erm, thing like erm,
B140810 i=122 we was | like this erm, thing like erm, a magic show that
B140810 i=149 right. | | |w17 And he was like he was like he was like
B140810 i=149 | |w17 And he was like he was like he was like saying well
B140810 i=149 was like he was like he was like saying well you're | not
B140810 i=163 [to I] | | |w17 [Cos he] was like <unclear> | | |w1 Mm.
B140810 i=176 I saw] that. <nv>laugh</nv> I like the clothes she | wears.
B140810 i=178 | |w17 Eh? | | |w1 You know like the last time | | |w17
B140810 i=192 | | |w1 <nv>laugh</nv> Cos like, I can't think of
B140810 i=199 When I said that he sort of | like turned away you know, and
B140810 i=200 out with something | stupid like [that.] | | |w1 [I
B140810 i=202 ... (drinking) They're gonna like the <American accent> |
B140810 i=205 | | |w1 Where? | | |w17 Like, they got, in this
B140810 i=205 at the back it's | like</>], | | |w1
B140810 i=230 Yeah. Actually I quite | like science with Mr <name>
B140810 i=266 Mm. | | |w17 Cos my mum gets like that a bit. | | |w1 Mm.
B140810 i=282 | | |w17 No you just go like this, here you are Charla
B140810 i=297 | | |w1 Alright. Erm, well like, I usually take the train
B140810 i=365 | you lucky lucky thing</>. Like that. | | |w17 Oh yeah.
B140810 i=374 | |w1 Yeah. | | |w17 Hand on like this. [<nv>laugh</nv>] |
B140810 i=377 oh god | | |w1 It's like, Susie good
B140810 i=380 you know, we were singing like, you | know, up to date
B140810 i=384 |w1 laugh | | |w17 Like Barry Manilow. | | |w1
B140810 i=390 my god. | | |w17 Just as she like no all the words every
B140810 i=394 [cos I'm] | | |w17 [It's like,] she has the whole

And the following query modifies the selection even further by restricting the search to a single utterance:

Query:

like;when ref=b140810 & id=122

B140810 i=122 to watch films. I mean like, it's different | if I
B140810 i=122 it's different | if I was like, you know like if it was
B140810 i=122 | if I was like, you know like if it was the summer
B140810 i=122 summer holidays, and we was | like this erm, thing like erm,
B140810 i=122 we was | like this erm, thing like erm, a magic show that
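
The two queries above combine a search word with one or more conditions on the indices. The following Python sketch shows one way such a query could be split up and applied. It is an illustration under stated assumptions, not TACT's query parser: the function names parse_query and select are hypothetical, the comparison is assumed to ignore case (the query writes b140810 while the display shows B140810), and the tokens are made up apart from the ref and id values.

```python
def parse_query(query):
    """Split 'word;when key=value & key=value' into (word, conditions)."""
    word, _, rest = query.partition(";when")
    conditions = {}
    for clause in rest.split("&"):
        if clause.strip():
            key, _, value = clause.partition("=")
            conditions[key.strip()] = value.strip().lower()
    return word.strip().lower(), conditions


def select(tokens, query):
    """Return tokens whose word and index values satisfy the query."""
    word, conditions = parse_query(query)
    return [
        t for t in tokens
        if t["word"].lower() == word
        and all(str(t.get(k, "")).lower() == v for k, v in conditions.items())
    ]


# Made-up tokens; only the ref and id values echo the examples above.
tokens = [
    {"word": "like", "ref": "B140810", "id": 122},
    {"word": "like", "ref": "B140810", "id": 149},
    {"word": "films", "ref": "B140810", "id": 122},
    {"word": "like", "ref": "B132702", "id": 122},
]
print(select(tokens, "like;when ref=b140810 & id=122"))
```

Each additional condition joined with & narrows the selection further, which is why the second query returns a subset of the lines returned by the first.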

5 Selecting certain speakers

The search may be restricted to certain speakers. For instance, the following is a list of all occurrences of the word raga when the speaker has speaker id number 1, i.e. when the person carrying the tape recorder is speaking:

Query:

raga;when who=1

B132702 i=63 <mimicking Jamaican accent> Raga style, boy! </> Now |
B133704 i=258 anyway. Some stupid ugly raga | obviously. Dunno what
B133905 i=64 | people thinking I'm a raga Dan, and hang out with
B134202 i=35 Anthony <name> the biggest raga | in our year. | |
B134202 i=37 |w1 Anthony <name> the biggest raga in our year and there's |
B134202 i=39 erm prettiest female biggest raga in our year. ...(9) | |
B135906 i=1 they got, they got wicked, raga music over there. I |

Restricting the search to a certain sex, social group, age group, etc. is also possible. Here are the occurrences of the word wicked when uttered by speakers who belong to socioeconomic group 2:

Query:

wicked;when class=2

B137103 i=316 ow </>] | | |w1 That is very wicked. | | |w? [That is
B137103 i=318 mean.] | | |w1 That's very wicked. [It's bad] | | |w?
B137104 i=320 Oh, yes <unclear> | | |w1 Wicked | | |w? Who else went
B137104 i=385 | | |w1 Oh, that's wicked Sarah. | | |w4 I know
B137104 i=418 refuse] | | |w1 [That's wicked] at your house ... |
B137201 i=293 <unclear> | | |w1 It had a wicked fairground, all those
B138501 i=218 first one I | ever drew was wicked. | | |w3 Is this
B138502 i=16 what's being | recorded. [It's wicked.] | | |w2 [It's loud]
B139303 i=20 know. Sweet. | | |w1 It's got wicked words. | | |w9 I
B139304 i=38 was the best because he got a wicked hall done | then. |
B139308 i=32 hasn't | got that in it. It's wicked. It comes with three
B139705 i=3 might have. | | |w1 That's a wicked idea I just got.
B139705 i=5 | | |w1 good isn't it? That's wicked! <nv>laugh</nv> | |
B139705 i=64 [yeah.] | | |w1 [That's] wicked. | | |w10 Yeah. |
B139705 i=92 Ah, how did that go? | | |w1 Wicked! | B139706 | |rB139706
B139706 i=70 ... | How is your shop going? Wicked! And you've made a
B141601 i=30 cool, this is [this is a bit wicked.] | | |w4 [Well,
B142704 i=119 for his hat. He had this wicked in | Geneva had this
