Association for Computers and the Humanities
![[COLT logo]](colt-1.gif)
Gisle Andersen and Kristine Hasund
English Dept.
University of Bergen
![[COLT drawing]](colt-2.gif)
The Bergen Corpus of London Teenage Language (COLT) is the first large English corpus focusing on the speech of teenagers. It was collected in 1993 and consists of audiotaped recordings of the spoken language of 13- to 17-year-old boys and girls from different boroughs of London.
The aim of the COLT project is to compile a 500,000-word corpus of spoken teenage language and to make it available to students of English at the University of Bergen, as well as to language researchers worldwide.
This poster presents the use of TACTweb on the COLT corpus. TACTweb is experimental software developed as a part of a project by John Bradley and Geoffrey Rockwell. It connects the text-retrieval program TACT to the World Wide Web, enabling the user to search in a database of spoken conversations for the location of words, word combinations and word formation patterns. In the COLT database, TACTweb is applied to give the distribution of an item in relation to certain non-linguistic variables.
Searches in the corpus are made possible through the indexing of the texts in the database. The COLT database has the following indices:
In TACTweb, the Distribution display allows the user to see how the occurrences of a word are distributed across a number of non-linguistic variables. The corpus can show which speakers use a given linguistic item the most, and how its use relates to the speakers' gender, age, occupation, socioeconomic group affiliation, and so forth. Moreover, conversation-specific features such as location, setting and number of participants are searchable.
Here is the distribution of the word wicked with respect to the various boroughs where the conversations take place:
Title:
Query:
wicked
| Hackney | 54 | ****************************************************** |
| Stoke_Newi | 1 | * |
| Tower_Haml | 10 | ********** |
| Camden | 19 | ******************* |
| Westminste | 1 | * |
| Brent | 3 | *** |
| Islington | 5 | ***** |
| Barnet | 28 | **************************** |
| Hertfordsh | 12 | ************ |
Total: 133.
The following table gives the distribution of words containing the sequence shit, according to the gender of the speakers:
Query:
.*shit.*
| m | 298 | ************************************************** |
| f | 151 | ************************** |
| ??? | 12 | ** |
* = 6, Total: 461.
And finally, the distribution of words containing the sequence shit, according to the various settings of the conversations:
Query:
.*shit.*
| school_stu | 87 | ******************************************** |
| home | 72 | ************************************ |
| classroom | 47 | ************************ |
| boarding_s | 21 | *********** |
| respondent | 23 | ************ |
| bus | 7 | **** |
| school | 38 | ******************* |
| ??? | 2 | * |
| Church_Str | 1 | * |
| flat | 2 | * |
| outside_ho | 10 | ***** |
| home_/clas | 11 | ****** |
| park | 47 | ************************ |
| games_hall | 1 | * |
| outside | 6 | *** |
| house | 12 | ****** |
| home/frien | 15 | ******** |
| school_din | 3 | ** |
| pub | 1 | * |
| Peter's_ho | 3 | ** |
| street | 2 | * |
| restaurant | 11 | ****** |
| school_for | 3 | ** |
| school_pla | 2 | * |
| gym_room_- | 3 | ** |
| playground | 7 | **** |
| school,_ou | 5 | *** |
| school_six | 18 | ********* |
| entrance_t | 1 | * |
* = 2, Total: 461.
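A distribution display of this kind can be approximated in a few lines of Python. The following is a minimal sketch, not TACTweb's implementation: the token records, the field names (`borough`, `sex`) and the functions `distribution` and `render` are all hypothetical. The bars in the tables above appear to be rounded up to a whole asterisk; the sketch imitates that, but this is an inference from the output rather than documented behaviour.

```python
import math
from collections import Counter

# Hypothetical token records: (word form, metadata) pairs.
# The metadata keys are illustrative assumptions, not COLT's actual index names.
tokens = [
    ("wicked", {"borough": "Hackney", "sex": "m"}),
    ("wicked", {"borough": "Hackney", "sex": "f"}),
    ("wicked", {"borough": "Camden", "sex": "f"}),
    ("shit", {"borough": "Camden", "sex": "m"}),
    ("wicked", {"borough": "Barnet", "sex": "m"}),
]


def distribution(tokens, word, variable):
    """Count occurrences of `word` for each value of a non-linguistic variable."""
    counts = Counter()
    for form, meta in tokens:
        if form.lower() == word.lower():
            # Unknown values are grouped under "???", as in the tables above.
            counts[meta.get(variable, "???")] += 1
    return counts


def render(counts, scale=1):
    """Format counts as table rows with one asterisk per `scale` hits, rounded up."""
    lines = []
    for value, n in counts.items():
        stars = "*" * math.ceil(n / scale)
        lines.append(f"| {value} | {n} | {stars} |")
    lines.append(f"* = {scale}, Total: {sum(counts.values())}.")
    return "\n".join(lines)


counts = distribution(tokens, "wicked", "borough")
print(render(counts))
```

Passing a larger `scale` shortens the bars for frequent items, which is what the `* = 6` and `* = 2` notes under the tables indicate.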
The word list mode is a simple means of extracting information regarding morphology, the formation of new words, the use of certain affixes and so on.
For instance, the following is a list of all words matching the pattern .*shit.*, and their respective frequencies:
Query:
.*shit.*
| apeshit (1) |
| bullshit (5) |
| bullshitter (1) |
| shit (405) |
| shite (17) |
| shithole (1) |
| shits (1) |
| shit's (1) |
| shitted (2) |
| shittiest (1) |
| shittily (1) |
| shitting (13) |
| shitty (11) |
| shity (1) |
And here is a list that will indicate the productivity and frequency of the suffix -able:
Query:
.*able
| able (77) | miserable (3) | table (74) |
| available (6) | noticeable (1) | timetable (1) |
| believable (1) | portable (2) | unable (2) |
| cable (1) | predictable (1) | unavailable (1) |
| capable (2) | reasonable (7) | unbelievable (4) |
| changeable (2) | rechargeable (1) | uncomfortable (2) |
| comfortable (9) | reliable (1) | unfuckingtouchable (2) |
| fashionable (1) | renewable (8) | unreliable (1) |
| impressionable (1) | reputable (1) | unscrewable (1) |
| inequitable (1) | respectable (3) | unsociable (1) |
| inimitable (1) | sizable (1) | untouchable (1) |
| irritable (1) | sociable (3) | up-gradable (2) |
| malleable (5) | suitable (1) | vulnerable (4) |
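The word list mode can be imitated with a regular expression applied to each word form. The sketch below is an illustration under assumptions, not TACT's own matcher: it assumes the pattern must match the whole word (which the difference between `.*able` and `.*shit.*` suggests), it ignores case, and the function name `word_list` and the sample data are hypothetical.

```python
import re
from collections import Counter


def word_list(words, pattern):
    """Return (form, frequency) pairs for word forms fully matching `pattern`."""
    regex = re.compile(pattern, re.IGNORECASE)
    # fullmatch anchors the pattern at both ends of the word form.
    counts = Counter(w.lower() for w in words if regex.fullmatch(w))
    return sorted(counts.items())


# Invented sample, not taken from the corpus.
sample = ["able", "Table", "tables", "unable", "table", "cable", "ability"]
for form, n in word_list(sample, r".*able"):
    print(f"{form} ({n})")
```

As the real list above shows, a purely orthographic pattern also returns forms such as table and cable, so the output still has to be sorted by hand before drawing conclusions about the suffix.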
There are various ways of restricting the search to a certain part of the corpus. For instance, the following search lists the occurrences of the word like in one particular text:
Query:
like;when ref=b140810
| B140810 i=22 | know she was slagging me off | like anything, right and then, |
| B140810 i=24 | | | |w17 And then, and then | like, you know and then she |
| B140810 i=24 | let's me sit opposite you | | like [hinting] | | |w1 [Mm.] |
| B140810 i=26 | | | |w1 [Mm.] | | |w17 | like oh I don't wanna sit |
| B140810 i=32 | | | |w17 I don't | like her. It's like you know |
| B140810 i=32 | | |w17 I don't like her. It's | like you know when we had that |
| B140810 i=36 | with you I don't want one | like | one big fight. | | |
| B140810 i=42 | |w1 Mm. [I mean] | | |w17 [If | like,] I hated you I wouldn't |
| B140810 i=46 | | | |w1 [Mm.] | | |w17 I | like saying that I didn't like |
| B140810 i=46 | I like saying that I didn't | like you know that kind of |
| B140810 i=46 | that kind of bloke | but not | like she's such a fucking |
| B140810 i=52 | nice one, I reckon who were | | like maybe her expression |
| B140810 i=60 | not that bad but I wouldn't | like | to, [I mean] | | |
| B140810 i=61 | I'd forgive</> you kind of | like and you know like in a |
| B140810 i=61 | you kind of like and you know | like in a while I'd | forgive |
| B140810 i=67 | what she's <unclear>. And you | like | the rest of it don't |
| B140810 i=69 | then to Elli | she goes Elli I | like you but Sabrina's a |
| B140810 i=87 | | |w17 Well thick in what way | like? | | |w1 Well like, |
| B140810 i=88 | what way like? | | |w1 Well | like, thick. | | |w17 Not |
| B140810 i=91 | thick. I mean you're not thick | like, thick in | what way tell |
| B140810 i=92 | of being thick. | | |w1 Well | like, you know eh, don't |
| B140810 i=95 | | |w1 Mm. | | |w17 You don't | like <unclear> don't really |
| B140810 i=111 | the helmet. Who else is | | like one point | | |w1 I've |
| B140810 i=117 | | | |w1 Mhm. | | |w17 I | like that. | | |w1 Are they |
| B140810 i=122 | to watch films. I mean | like, it's different | if I |
| B140810 i=122 | it's different | if I was | like, you know like if it was |
| B140810 i=122 | | if I was like, you know | like if it was the summer |
| B140810 i=122 | summer holidays, and we was | | like this erm, thing like erm, |
| B140810 i=122 | we was | like this erm, thing | like erm, a magic show that |
| B140810 i=149 | right. | | |w17 And he was | like he was like he was like |
| B140810 i=149 | | |w17 And he was like he was | like he was like saying well |
| B140810 i=149 | was like he was like he was | like saying well you're | not |
| B140810 i=163 | [to I] | | |w17 [Cos he] was | like <unclear> | | |w1 Mm. |
| B140810 i=176 | I saw] that. <nv>laugh</nv> I | like the clothes she | wears. |
| B140810 i=178 | | |w17 Eh? | | |w1 You know | like the last time | | |w17 |
| B140810 i=192 | | | |w1 <nv>laugh</nv> Cos | like, I can't think of |
| B140810 i=199 | When I said that he sort of | | like turned away you know, and |
| B140810 i=200 | out with something | stupid | like [that.] | | |w1 [I |
| B140810 i=202 | ... (drinking) They're gonna | like the <American accent> | |
| B140810 i=205 | | | |w1 Where? | | |w17 | Like, they got, in this |
| B140810 i=205 | at the back it's | | like</>], | | |w1 |
| B140810 i=230 | Yeah. Actually I quite | | like science with Mr <name> |
| B140810 i=266 | Mm. | | |w17 Cos my mum gets | like that a bit. | | |w1 Mm. |
| B140810 i=282 | | | |w17 No you just go | like this, here you are Charla |
| B140810 i=297 | | | |w1 Alright. Erm, well | like, I usually take the train |
| B140810 i=365 | | you lucky lucky thing</>. | Like that. | | |w17 Oh yeah. |
| B140810 i=374 | | |w1 Yeah. | | |w17 Hand on | like this. [<nv>laugh</nv>] | |
| B140810 i=377 | oh god | | |w1 It's | like, |
| B140810 i=380 | you know, we were singing | like, you | know, up to date |
| B140810 i=384 | |w1 | Like Barry Manilow. | | |w1 |
| B140810 i=390 | my god. | | |w17 Just as she | like no all the words every |
| B140810 i=394 | [cos I'm] | | |w17 [It's | like,] she has the whole |
And the following query modifies the selection even further by restricting the search to a single utterance:
Query:
like;when ref=b140810 & id=122
| B140810 i=122 | to watch films. I mean | like, it's different | if I |
| B140810 i=122 | it's different | if I was | like, you know like if it was |
| B140810 i=122 | | if I was like, you know | like if it was the summer |
| B140810 i=122 | summer holidays, and we was | | like this erm, thing like erm, |
| B140810 i=122 | we was | like this erm, thing | like erm, a magic show that |
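A restricted keyword-in-context search of this kind can be sketched as follows. Everything here is hypothetical: the `Utterance` record, its field names and the `kwic` function are assumptions made for illustration, and the sample utterances are invented. Unlike the real display, the sketch does not let the context run across utterance boundaries.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    ref: str   # text identifier (illustrative)
    id: int    # utterance number within the text
    who: str   # speaker id
    text: str


def kwic(utterances, word, width=30, **restrict):
    """Return (ref, id, left context, keyword + right context) rows for `word`.

    Keyword arguments restrict the search, e.g. ref="t001", id=2.
    Values are compared as case-insensitive strings.
    """
    rows = []
    for u in utterances:
        if any(str(getattr(u, k)).lower() != str(v).lower()
               for k, v in restrict.items()):
            continue
        tokens = u.text.split()
        for i, tok in enumerate(tokens):
            # Strip trailing punctuation so that "like," still matches "like".
            if tok.strip(".,!?").lower() == word.lower():
                left = " ".join(tokens[:i])[-width:]
                right = " ".join(tokens[i:])[:width]
                rows.append((u.ref, u.id, left, right))
    return rows


# Invented sample utterances, not taken from the corpus.
corpus = [
    Utterance("T001", 1, "1", "I mean like, it's different"),
    Utterance("T001", 2, "17", "Well like, you know"),
    Utterance("T002", 1, "1", "I like that"),
]

for row in kwic(corpus, "like", ref="t001", id=2):
    print(row)
```

Adding a further keyword argument narrows the selection, in the same way that adding `& id=122` to the query narrows the concordance to one utterance.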
The search may be restricted to certain speakers; for instance, the following is a list of all occurrences of the word raga when the speaker has speaker id number 1, i.e. when the person carrying the tape recorder is speaking:
Query:
raga;when who=1
| B132702 i=63 | <mimicking Jamaican accent> | Raga style, boy! </> Now | |
| B133704 i=258 | anyway. Some stupid ugly | raga | obviously. Dunno what |
| B133905 i=64 | | people thinking I'm a | raga Dan, and hang out with |
| B134202 i=35 | Anthony <name> the biggest | raga | in our year. | | |
| B134202 i=37 | |w1 Anthony <name> the biggest | raga in our year and there's | |
| B134202 i=39 | erm prettiest female biggest | raga in our year. ...(9) | | |
| B135906 i=1 | they got, they got wicked, raga | music over there. I | |
Restricting the search to a certain sex, social group, age group, etc. is also possible; here is the word wicked when uttered by a person who belongs to socioeconomic group 2:
Query:
wicked;when class=2
| B137103 i=316 | ow </>] | | |w1 That is very | wicked. | | |w? [That is |
| B137103 i=318 | mean.] | | |w1 That's very | wicked. [It's bad] | | |w? |
| B137104 i=320 | Oh, yes <unclear> | | |w1 | Wicked | | |w? Who else went |
| B137104 i=385 | | | |w1 Oh, that's | wicked Sarah. | | |w4 I know |
| B137104 i=418 | refuse] | | |w1 [That's | wicked] at your house ... | |
| B137201 i=293 | <unclear> | | |w1 It had a | wicked fairground, all those |
| B138501 i=218 | first one I | ever drew was | wicked. | | |w3 Is this |
| B138502 i=16 | what's being | recorded. [It's | wicked.] | | |w2 [It's loud] |
| B139303 i=20 | know. Sweet. | | |w1 It's got | wicked words. | | |w9 I |
| B139304 i=38 | was the best because he got a | wicked hall done | then. | |
| B139308 i=32 | hasn't | got that in it. It's | wicked. It comes with three |
| B139705 i=3 | might have. | | |w1 That's a | wicked idea I just got. |
| B139705 i=5 | | | |w1 good isn't it? That's | wicked! <nv>laugh</nv> | | |
| B139705 i=64 | [yeah.] | | |w1 [That's] | wicked. | | |w10 Yeah. | |
| B139705 i=92 | Ah, how did that go? | | |w1 | Wicked! | B139706 | |rB139706 |
| B139706 i=70 | ... | How is your shop going? | Wicked! And you've made a |
| B141601 i=30 | cool, this is [this is a bit | wicked.] | | |w4 [Well, |
| B142704 i=119 | for his hat. He had this | wicked in | Geneva had this |
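The query strings shown in the examples share a simple shape: a word or pattern, optionally followed by `;when` and one or more `key=value` restrictions joined by `&`. The toy parser below handles only that shape. It is an assumption based on the examples in this poster, not a description of TACT's actual query grammar, and the function name `parse_query` is hypothetical.

```python
def parse_query(query):
    """Split 'pattern;when key=value & key=value' into (pattern, restrictions)."""
    pattern, _, condition = query.partition(";when")
    restrictions = {}
    for clause in condition.split("&"):
        clause = clause.strip()
        if not clause:
            continue  # no ';when' part, or an empty clause
        key, sep, value = clause.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {clause!r}")
        restrictions[key.strip()] = value.strip()
    return pattern.strip(), restrictions


for q in ["like;when ref=b140810 & id=122", "wicked;when class=2", ".*shit.*"]:
    print(parse_query(q))
```

The restrictions come back as plain strings, so a search routine built on top of this would still need to decide how to compare them with the indexed values, for example ignoring case for text references.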