Exploring variability within and between corpora: some methodological considerations

Cited by: 34
Author
Gries, Stefan Th. [1]
Affiliation
[1] Univ Calif Santa Barbara, Dept Linguist, Santa Barbara, CA 93106 USA
DOI
10.3366/cor.2006.1.2.109
Chinese Library Classification number
H0 [Linguistics]
Discipline classification code
030303; 0501; 050102
Abstract
The results usually reported in corpus-linguistic studies are quantitative: frequencies, percentages, model parameters, etc. However, given that no two corpora are alike, and that different results are sometimes reported for very similar corpora (or even the same corpus), three central issues are: (i) how to identify and quantify the degree of variation that comes with one's results; (ii) how to investigate the source of the observed variation in corpora; and (iii) how homogeneous one's corpus is with respect to a particular phenomenon. In this paper, I shall present a methodology that addresses these issues, providing data from ICE-GB on the frequency of the English present perfect, the alternation of transitive phrasal verbs and the semantics of the English ditransitive. Specifically, I will show how applying resampling methods and exploratory data analysis to corpus data allows for: (i) providing interval estimates for one's findings that show how superficially different results may reflect similar underlying tendencies; (ii) determining communicative dimensions underlying variation in a bottom-up fashion (similar to work by Biber, but based on just the phenomenon one is interested in); and (iii) quantifying the homogeneity of the corpus with respect to the phenomena one is actually interested in (rather than by the standard approach of using word frequencies).
For every parameter we estimate from data, we need to establish an unreliability estimate. We use this to judge the uncertainty associated with any inferences we may want to make about our point estimate, and to establish a confidence interval for the true value of the parameter. Up to now, we have used parametric measures like standard errors that are based on the assumption of normality of errors [...]. If the assumption of normality is wrong, then our unreliability estimates will also be wrong, but it is hard to know how wrong they will be, using standard analytical methods.
An alternative way of establishing unreliability estimates is to resample our data [...]
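The resampling idea described above can be sketched as a percentile bootstrap. This is only a minimal illustration under stated assumptions, not the procedure used in the paper: the per-file frequencies are invented, and the names `bootstrap_ci` and `per_file_freq` are hypothetical.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(values).

    Draws n_resamples samples of len(values) with replacement, computes
    the statistic on each, and returns the alpha/2 and 1 - alpha/2
    percentiles of the resulting estimates.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    estimates = sorted(stat(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Hypothetical data: occurrences of some construction per 1,000 words
# in ten corpus files (invented numbers, for illustration only).
per_file_freq = [3.1, 4.7, 2.9, 5.2, 3.8, 4.1, 6.0, 2.5, 3.3, 4.9]

point = statistics.mean(per_file_freq)
low, high = bootstrap_ci(per_file_freq)
print(f"point estimate: {point:.2f}, 95% interval: [{low:.2f}, {high:.2f}]")
```

Because the interval is read off the resampled estimates themselves, it does not rely on the normality assumption behind a standard error. Treating the corpus file as the resampling unit is one possible design choice; resampling speakers, registers or other corpus parts would follow the same pattern.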
Pages: 109-151
Number of pages: 43
Related papers
50 in total
[1] [Anonymous], 2003, THESIS
[2] Benedetto, D; Caglioti, E; Loreto, V. Language trees and zipping [J]. Physical Review Letters, 2002, 88 (04): 4
[3] Berglund Ylva, 1997, ICAME J, V21, P7
[4] Biber D., 1999, LONGMAN GRAMMAR SPOK
[5] Biber Douglas, 1990, LIT LINGUISTIC COMPU, V5, P257, DOI 10.1093/llc/5.4.257
[6] Biber Douglas, 1993, LIT LINGUIST COMPUT, V8, P243, DOI 10.1093/llc/8.4.243
[7] Bock, JK. Syntactic persistence in language production [J]. Cognitive Psychology, 1986, 18 (03): 355-387
[8] Church Kenneth W, 2000, P 18 C COMP LING, V1, P180
[9] CRAWLEY MJ, 2004, STAT COMPUTING INTRO
[10] Dubois B. L., 1972, THESIS