25.04.2019

Using corpora to keep up with language change

It’s essential that lesson materials are current and relevant. Where can a teacher find information about the features of current spoken English, or about the language currently used to carry out certain language functions (e.g. giving directions)? Where can a teacher look at the wide range of ongoing changes in the language? The answer is simple: use a corpus.

What is a corpus?

A corpus is a very large, systematic collection of naturally occurring written and spoken language, usually stored as an electronic database. It consists of texts that have been produced in ‘natural contexts’ (published books, ordinary conversation, newspapers, lectures, etc.), which means it mirrors natural language.

A well-composed corpus can be used to answer questions about language use, such as:

  • Is the idiom “raining cats and dogs” still used by native speakers?
  • Which words most often go together with the word “sense”?
  • Is “secede out of Russia” a correct pattern for ‘secede’?
  • What’s the difference between “tactic” and “strategy”?
  • Does ‘wicked’ generally mean ‘good’ or ‘bad’? Has this meaning changed over time? Does the use differ between different kinds of text? Do different (kinds of) speakers use the word in the same way?

A corpus is a very useful tool when we want to know about frequent patterns in English, as it can tell us which words and phrases are most frequent (and hence most useful for students to learn).

[Image. Source: www.cambridge.org]

For example, suppose you want to compare how often the phrasal verb “come up with” is used in written and spoken American English. Go to any online corpus you prefer, for example https://www.english-corpora.org/

As I want to know about American English, I choose the COCA corpus, select the ‘chart’ category and type in the verb. The resulting chart shows that “come up with” is used more often in spoken English and that its use has dropped since 2014.
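A comparison like this is only fair if the counts are normalised, because the spoken and written sections of a corpus are rarely the same size. Corpus tools typically report frequency per million words for this reason. Below is a minimal sketch of that arithmetic in Python. The function name `per_million` and all the counts and section sizes are hypothetical, invented for illustration; they are not COCA figures.

```python
def per_million(hits, corpus_size):
    """Return the number of occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return hits / corpus_size * 1_000_000


# Hypothetical (hits, section size in words) pairs, NOT real COCA data.
sections = {
    "spoken": (4_500, 120_000_000),
    "written": (9_000, 440_000_000),
}

for name, (hits, size) in sections.items():
    # The raw hit count is higher for "written", but that section is also
    # much larger, so the normalised figure is the one worth comparing.
    print(f"{name}: {per_million(hits, size):.1f} per million words")
```

With these made-up numbers the written section has twice as many raw hits, yet the phrase is relatively more frequent in the spoken section once section size is taken into account.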

You can check a corpus whenever you want to know how the language is actually used. Just type in a word to see its most typical combinations and examples of it in context.

[Screenshot: the most frequent collocations of the word ‘changes’ according to COCA]

There are several online corpora:

  • The Cambridge English Corpus (CEC) is a large, reliable, multi-billion-word collection of written and spoken language, taken from a huge range of sources, including newspapers, the internet, books, magazines, radio, schools, universities, the workplace, and even everyday conversation, and it is constantly being updated. The Cambridge English Corpus contains a number of specialized corpora: Cambridge Academic English Corpus, Cambridge Business English Corpus, Cambridge Legal English Corpus, Cambridge Financial English Corpus, etc. It also includes the Cambridge Learner Corpus (CLC), which currently contains over 50 million words taken from Cambridge exam scripts submitted by over 220,000 students from 173 countries, and these numbers keep growing each year. The CLC allows researchers to conduct internationally relevant and country-specific research into how learners use English differently to expert speakers, and to analyse the different types of mistakes that learners make as well as what they get right. This research informs Cambridge’s ELT courses.

  • The Corpus of Contemporary American English (COCA) is a corpus of more than 560 million words of American English. The corpus is divided into five genres: spoken, fiction, popular magazines, newspapers, and academic journals. The texts come from a variety of sources: transcripts of unscripted conversation from nearly 150 different TV and radio programs, short stories and plays, first chapters of books from 1990 to the present, movie scripts, magazines, newspapers, and academic journals. The corpus is free to search through its web interface, with a limit on the number of queries per day.

  • The American National Corpus (ANC) is a text corpus of American English containing 22 million words of written and spoken data produced since 1990. Currently, the ANC includes a range of genres, including emerging genres such as email, tweets, and web data. It is annotated for part of speech and lemma, shallow parse, and named entities.
  • The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late 20th century from a wide variety of genres, with the intention that it be a representative sample of spoken and written British English of that time.

The corpora mentioned above are among the biggest, but there are many other online corpora.

There are also a number of sub-corpora, for example the sub-corpora of MICASE (the Michigan Corpus of Academic Spoken English) that contain lectures and seminars.

Use corpora to ensure that the language taught in your lessons is natural, accurate and up-to-date; to select the most useful, common words and phrases for a topic or level; and to analyse spoken language so that you can teach effective speaking and listening strategies.

Our next articles will be on how to use corpora in lessons and how the language is actually changing. Stay tuned!
