Tuesday, March 10, 2026

Comparison of COCA and Google for Syntactic Research

Inversion Seminar

In this blog post, I briefly compare two ways of searching for syntactic data: using COCA (the Corpus of Contemporary American English) and using Google to search the internet (for a detailed discussion of the latter, see the appendix of Collins 2024).

Feature: Internet vs. COCA

Size: vast vs. 1 billion words

Note: The difference in size means that many more kinds of interesting examples are accessible on the internet than on COCA.

Punctuation: not sensitive vs. very sensitive

Note: Google ignores all punctuation when searching. COCA matches strings that include punctuation (including periods and quotation marks). For certain syntactic topics, e.g., quotative inversion, this is a very useful feature.
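To make the point about punctuation concrete, here is a minimal Python sketch. It is not COCA's or Google's actual search mechanism; the three sentences are invented, and the regular expressions and the helper names (`matches`, `strip_punct`) are assumptions introduced only for illustration. It contrasts a search that can see the closing quotation mark with one that has had all punctuation stripped away.

```python
import re

# Invented example sentences (not drawn from COCA or the internet).
SENTENCES = [
    '"I am leaving," said John.',         # quotative inversion
    '"I am leaving," John said.',         # no inversion
    'Nobody knew what said John meant.',  # "said John" without a quote
]

# Punctuation-sensitive patterns: a comma plus closing quotation mark,
# followed by verb-subject or subject-verb order.
INVERTED = re.compile(r',"\s+said\s+[A-Z]\w+')
UNINVERTED = re.compile(r',"\s+[A-Z]\w+\s+said\b')


def matches(pattern, sentences):
    """Return the sentences in which the pattern occurs."""
    return [s for s in sentences if pattern.search(s)]


def strip_punct(sentence):
    """Remove all punctuation, imitating a punctuation-blind search."""
    return re.sub(r'[^\w\s]', '', sentence)


inverted_hits = matches(INVERTED, SENTENCES)
uninverted_hits = matches(UNINVERTED, SENTENCES)

# A punctuation-blind search for the bare string "said John".
blind_hits = [s for s in SENTENCES if 'said John' in strip_punct(s)]

print(inverted_hits)
print(uninverted_hits)
print(blind_hits)
```

In this toy data, the punctuation-sensitive patterns separate the two quotative orders cleanly, while the punctuation-blind search also returns the third sentence, which is not a quotative construction at all.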

Statistics: very rough vs. precise

Note: If I search for two variants of a construction (e.g., inversion versus no inversion in a quotative construction), I might want to compare the frequency of the two variants. COCA allows very precise comparison of numbers over the corpus. But for Google, the best one can do is zero versus few versus many. The exact numbers seem to be less meaningful for Google.
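As a rough illustration of the kind of comparison that precise counts make possible, the following Python sketch computes a per-million-word rate and the share of one variant out of both. The counts are hypothetical, not real COCA results; the function names and the `many_threshold` cutoff are assumptions made for this example, and the corpus size is simply the "1 billion words" figure given above.

```python
CORPUS_SIZE = 1_000_000_000  # the "1 billion words" figure given above


def per_million(count, corpus_size=CORPUS_SIZE):
    """Normalized frequency: occurrences per million words."""
    return count * 1_000_000 / corpus_size


def share(count_a, count_b):
    """Proportion of variant A among all occurrences of A and B."""
    total = count_a + count_b
    if total == 0:
        return None  # neither variant is attested
    return count_a / total


def rough_bucket(count, many_threshold=100):
    """Coarse three-way label of the kind a Google search supports.

    The threshold is an arbitrary assumption chosen for illustration.
    """
    if count == 0:
        return "zero"
    return "many" if count >= many_threshold else "few"


# Hypothetical counts for two variants of a quotative construction.
inverted_count = 1200
uninverted_count = 300

print(per_million(inverted_count))             # rate for the inverted variant
print(share(inverted_count, uninverted_count)) # share of the inverted variant
print(rough_bucket(inverted_count), rough_bucket(uninverted_count))
```

With exact counts, the two variants can be distinguished by rate and by share. With only the coarse labels, both of these hypothetical variants would simply count as "many".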

Tagged data: no vs. yes

Note: The COCA corpus has a rich system of tagging for part of speech, so these categories can be used in syntactic searches. The internet is not tagged for part of speech, so no such categories can be used.
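The following Python sketch shows, in a very simplified form, what searching by category rather than by word might look like. It does not use COCA's query syntax or its tagset; the tagged sentence, the tag labels (`VERB`, `PROPN`, `PUNCT`), and the function `find_tag_sequence` are all hypothetical and chosen only for illustration.

```python
# A hypothetical tagged sentence as (word, tag) pairs.
# The tag labels are illustrative and are not COCA's own tagset.
TAGGED = [
    ('"', "PUNCT"),
    ("Go", "VERB"),
    (",", "PUNCT"),
    ('"', "PUNCT"),
    ("said", "VERB"),
    ("Mary", "PROPN"),
    (".", "PUNCT"),
]


def find_tag_sequence(tokens, tags):
    """Return every run of words whose tags match the given sequence."""
    n = len(tags)
    hits = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(tag == wanted for (_, tag), wanted in zip(window, tags)):
            hits.append([word for word, _ in window])
    return hits


# Search for a verb immediately followed by a proper noun,
# without naming any particular verb or name.
verb_subject_hits = find_tag_sequence(TAGGED, ["VERB", "PROPN"])
print(verb_subject_hits)
```

A search of this kind finds verb-subject sequences regardless of which verb or name is involved, which is not possible with a plain string search over untagged text.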
