Sunday, April 22, 2012

Corpora Critique

Evaluation criteria: 
  1. Design must be principled: "the texts in a corpus need to represent the type of language that the corpus is intending to capture" (Reppen 3)
  2. Corpus must consist of "a large collection of texts" (Reppen 3)
     This corpus, the ANC, is made up of over 20 million words, with ~15 million words being available for free download, of which 3,217,772 words are spoken and 11,406,555 are written. The types of sources appear to be fairly limited: some of the spoken data comes from face-to-face conversations, but the majority comes from phone switchboard conversations. The written texts are more varied; however, they too are much more limited than the sources offered by other corpora. A majority of the texts appear to come from more formal sources, in domains such as "technical," "journal," and "government" (ANC 2009). While this corpus would be useful for tasks requiring more technical data, it would not be appropriate for lessons aimed at introducing students to conversational spoken data or other more colloquial data. 
     The corpus home page does not seem very user-friendly, and I did not like the interface. Another drawback is that users have to either pay for the full corpus or download the free part onto their computers, which is not ideal. 
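     As a quick check on the figures quoted above, here is a minimal Python sketch that adds up the spoken and written counts for the free download and works out each one's share. The variable names are my own, and the only inputs are the two word counts already given in this post.

```python
# Word counts for the free ANC download, as quoted in this post (ANC 2009).
spoken_words = 3_217_772
written_words = 11_406_555

# The free download is described as ~15 million words; sum the two parts.
total_words = spoken_words + written_words

# Share of the free download that is spoken vs. written.
spoken_share = spoken_words / total_words
written_share = written_words / total_words

print(f"Total:   {total_words:,} words")
print(f"Spoken:  {spoken_share:.1%}")
print(f"Written: {written_share:.1%}")
```

     The two counts add up to 14,624,327 words, which matches the "~15 million" figure, and spoken data makes up a little over a fifth of the free download.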

     COCA is an amazing source of data. It is "composed of over 425 million words in more than 175,000 texts" (COCA 2011). Because COCA's data is broken down into various registers, the data can be used to teach multiple types of lessons. These registers include spoken, fiction, magazine, newspaper, and academic (COCA 2011). Compared to the ANC, this corpus offers a much wider variety of data. Additionally, the easy-to-navigate interface opens up the opportunity for students to interact with the corpus in classroom lessons. 

     Dang! The CEC consists of over 1.5 billion words across all of the CEC corpora resources. It is so much bigger than the aforementioned corpora because it pulls texts from multiple resources, ultimately gathering information that covers British English, American English and Learner English (CEC 2012). I am thoroughly impressed by the extremely large range of resources that the CEC offers, and think that the Learner English corpora resources would be an incredibly useful tool for ESL classes. Because the Learner English resources include texts from the Learners' written English and Error coded learner written English corpora, one could choose to incorporate error-coded or non-coded texts in the classroom. These examples could help students identify different types of grammatical points and could supplement lessons on various topics. 
