Saturday, January 28, 2023

The Use of the Internet as a Natural Language Corpus in Syntactic Research

Motivation

Collins and Postal (2012, 2014) contain a large amount of data obtained from Internet searches. Similarly, such data is cited extensively in chapters 2 and 3 of my forthcoming monograph “Principles of Argument Structure: A Merge Based Approach”.

Internet searches have turned out to be a revolutionary tool in syntactic research. There are a few reasons why one might want to use data from such searches. First, if the domain one is looking at is controversial, Internet searches afford a way of finding naturally occurring English examples that might help resolve the issue. For example, chapter 3 claims that secondary predicates can modify the implicit argument in the passive, which is controversial. However, it is quite easy to find relevant naturally occurring examples on the Internet. Second, if one is looking at a relatively unexplored data domain, Internet searches can help to fill out the range of combinatorial possibilities. For example, in chapter 2, such searches helped me to figure out the range of possible phi-feature values of the implicit argument in the short passive.

Searching

The basic technique is to search for phrases that are being investigated. For example, in chapter 2, I investigate the use of anaphoric expressions like on my own in short passives. So I would search the following, where the quotes ensure that a string, not just a set of words, is being searched:

(1) Google: “was done on my own”

In this example, I have included “was done” to make sure that the search includes a passive participle, not an active participle. However, I have left out the subject, because the example would be relevant no matter what the subject is. Some of the results may be completely irrelevant, so you may have to scroll through several screens to find good examples.

One of the results of the search in (1) is:

(2) None of this was done on my own. To me success always takes collaboration.

This example is useful to my research. In my monograph, my hypothesis is that the implicit argument in the passive is syntactically active. (2) supports my hypothesis, since the antecedent of my is the implicit argument. (2) is also helpful because the second sentence clarifies the meaning of the first.

Sentence (2) came from the following URL:

(3) https://voyageminnesota.com/interview/check-out-jess-pratts-story/

Follow-up searches based on (1) can be done, yielding further information. For example, to investigate the phi-features of the implicit argument, my in (1) can be replaced by a whole range of possessive pronouns: your, his, her, our, their. Alternatively, to find more passive data, the main verb could be changed from done to written or created. Lastly, to find further passive data, the copula could be changed from was to were, is, are, be, been. Combining just these new choices would yield 5x2x5 = 50 additional searches (and more if the original my, done, and was are also kept as options), any of which could yield relevant example sentences.
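The combinatorics of these follow-up searches can be sketched in a few lines of Python. This is only an illustration, not a tool used in the monograph: the function name build_queries and the list names are hypothetical, and the possessives are assumed to be the forms that fit the frame "on ___ own" (your, his, her, our, their). The first item in each list is the original choice from (1).

```python
from itertools import product

# Options taken from the text; the first item in each list is the original value in (1).
COPULAS = ["was", "were", "is", "are", "be", "been"]
PARTICIPLES = ["done", "written", "created"]
POSSESSIVES = ["my", "your", "his", "her", "our", "their"]


def build_queries(copulas, participles, possessives):
    """Return one quoted (exact-string) query per combination of the options."""
    return [
        f'"{cop} {part} on {poss} own"'
        for cop, part, poss in product(copulas, participles, possessives)
    ]


# Every combination, including the original search in (1).
queries = build_queries(COPULAS, PARTICIPLES, POSSESSIVES)

# Only the combinations in which all three slots differ from (1): 5 x 2 x 5.
new_only = build_queries(COPULAS[1:], PARTICIPLES[1:], POSSESSIVES[1:])

if __name__ == "__main__":
    print(len(new_only), "queries change all three slots")
    print(len(queries), "queries in total, counting (1) itself")
    for query in queries[:5]:
        print(query)
```

Each generated string keeps the surrounding double quotes, so it can be pasted into the search box as an exact-string search like the one in (1).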

Quality Control

Of course, there is no guarantee that the data on the Internet will be of high quality for syntactic research. So, it is necessary to follow some guidelines.

Control 1:

Immediately after finding an example on the Internet, one should check to see that the sentence actually appears on the website. That is, open the link in (3), and search for the sentence in (2). 
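For anyone who saves many examples, this check can be partly automated. The following is a minimal sketch, assuming the page's HTML has already been downloaded into a string. The names sentence_on_page and normalize are hypothetical. The comparison ignores case, line breaks, and the difference between curly and straight quotes, so a failed match should still be checked by eye.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def normalize(text):
    """Straighten curly quotes, collapse whitespace, and lowercase."""
    for curly, straight in (("\u2018", "'"), ("\u2019", "'"),
                            ("\u201c", '"'), ("\u201d", '"')):
        text = text.replace(curly, straight)
    return re.sub(r"\s+", " ", text).strip().lower()


def sentence_on_page(html, sentence):
    """Return True if the sentence occurs in the visible text of the HTML."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    page_text = normalize(" ".join(parser.chunks))
    return normalize(sentence) in page_text


if __name__ == "__main__":
    # A made-up HTML fragment containing sentence (2), for illustration only.
    sample = "<p>None of <em>this</em> was done on my\n own. To me success always takes collaboration.</p>"
    print(sentence_on_page(sample, "None of this was done on my own."))
```

The sketch joins text fragments with spaces, so a sentence interrupted by markup in the middle of a word would be missed. Opening the page and searching by hand remains the reliable check.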

Control 2:

One should take a close look at the sentence to make sure that it does not represent an error, humorous writing, or poetic license. A related concern is whether the sentence is being used as an example in a linguistics paper online. If so, it might be elicited data, and not naturally occurring Internet data.

Control 3:

While at the website, one can try to see if there is anything that might indicate that it was created by non-native speakers of English. If there are red flags, one should discard the example.

Control 4:

One can look at the context in the preceding and following text. Such context might shed light on the interpretation of the example, and might be useful to include in one’s work.

Control 5:

In doing Internet searches, I am usually looking for data that I myself find acceptable. For example, after doing the search in (1), I found (2). Crucially, (2) is acceptable for me. If I find a relevant example, and I judge the sentence as acceptable, then I save it to a list for future use. If I judge the sentence as unacceptable, I discard it.

If you are looking for a construction that is not part of your idiolect, you might want to find a native speaker of the relevant dialect to check it with.

For further information on Internet searches in syntactic research, see:

Collins, Chris. 2018. Using * in Google Searches for Syntactic Research. Ordinary Working Grammarian [Blog]. (https://ordinaryworkinggrammarian.blogspot.com/2018/10/using-in-google-searches_5.html)

