Tuesday, January 31, 2023

Internet Searches as a Tool in Syntactic Research (second version)

Motivation

Collins and Postal (2012, 2014) contain lots of data obtained from Internet searches. Such data is also cited extensively in chapters 2 and 3 of my forthcoming monograph “Principles of Argument Structure: A Merge Based Approach”. In this blog post, I offer a few guidelines on using such data in syntactic research.

Internet searches have turned out to be a revolutionary tool in syntactic research. Here are a few reasons why. First, if the domain you are looking at is controversial, Internet searches afford a way of finding non-elicited English examples that might help resolve the issue. For example, chapter 3 of my monograph claims that secondary predicates can modify the implicit argument in the passive. That is controversial since various authors have claimed that secondary predicates cannot do so. However, it is quite easy to find relevant examples on the Internet which I find completely acceptable.

More generally, Internet searches are a tool that can give the syntactician confidence in their claimed empirical results. Suppose that I propose a particular generalization which has not been investigated before. If I am able to easily find examples on the Internet conforming to the generalization (and crucially, I find those examples to be acceptable) then I will have increased confidence in my generalization.

Second, if you are looking at a relatively unexplored data domain, Internet searches can help to fill out the range of combinatorial possibilities. For example, in chapter 2 of my monograph, such searches helped me to figure out the range of possible phi-feature values of the implicit argument in the short passive. Once you have dissected the combinatorial nature of any particular problem, you can run searches on all the various possibilities, greatly expanding your knowledge of an empirical domain in a short period of time. Furthermore, tracking down and documenting the combinatorial possibilities in this way will often lead to surprising discoveries.

In the blog post below, I outline a methodology that takes Internet searches to be a tool used by the syntactician in syntactic research.

Searching

The basic technique is to search for the phrase types under investigation. For example, in chapter 2, I investigate the use of anaphoric expressions like “on my own” in short passives. So I would search for the following, where the quotes ensure that the target is an exact string, not just a set of words:

(1) Google: “was done on my own”

In this example, I included “was done” to make sure that the search includes a passive participle, not an active participle. However, I have left out the subject, because examples would be relevant no matter what their subject is. Some of the results may be completely irrelevant, so you may have to scroll through several screens to find good examples.

One of the results of the search in (1) is:

(2) None of this was done on my own. To me success always takes collaboration.

This example is useful to my research. In my monograph, my hypothesis is that the implicit argument in the passive is syntactically active. (2) supports that hypothesis, since the antecedent of “my” is the implicit argument. (2) is also helpful because the second sentence clarifies the meaning of the first.

Follow-up searches based on (1) can be done, yielding further information. For example, to investigate the phi-features of the implicit argument, “my” in (1) can be replaced by a whole range of possessive forms: your, his, her, our, their. Alternatively, the main verb could be changed from “done” to “written” or “created”. Lastly, to find further passive data, the copula could be changed from “was” to were, is, are, be, been. Combining just these alternatives would yield 5x2x5 = 50 additional searches (and more if the original “my”, “done” and “was” are kept as options in each slot), any of which could yield interesting example sentences.
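The follow-up searches can be listed mechanically rather than typed out one by one. The following is a minimal sketch in Python, offered as an illustration and not as part of the method itself. The function name build_queries and the three word lists are assumptions made for the example; the possessives are given in the form that fits the frame “on ___ own” (your, his, her, our, their). The script only builds the quoted search strings. It does not run any searches.

```python
from itertools import product

# Slot fillers taken from the text. The original values from search (1)
# ("was", "done", "my") come first in each list.
COPULAS = ["was", "were", "is", "are", "be", "been"]
PARTICIPLES = ["done", "written", "created"]
POSSESSIVES = ["my", "your", "his", "her", "our", "their"]


def build_queries(copulas, participles, possessives):
    """Return one quoted search string per combination of the three slots."""
    return [
        f'"{cop} {part} on {poss} own"'
        for cop, part, poss in product(copulas, participles, possessives)
    ]


# Every combination, including the original search (1): 6 x 3 x 6 strings.
queries = build_queries(COPULAS, PARTICIPLES, POSSESSIVES)

# Only the alternatives, with all three slots changed: 5 x 2 x 5 strings.
alternatives_only = build_queries(COPULAS[1:], PARTICIPLES[1:], POSSESSIVES[1:])
```

Keeping the original values in each list means that the output also covers searches where only one or two slots are changed, which is why the full list is longer than the alternatives-only list.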

See Collins 2018 for the use of the * operator in Google searches.

When I use sentences like (2) in a paper, I cite them with the URL:

(3) None of this was done on my own. To me success always takes collaboration. (https://voyageminnesota.com/interview/check-out-jess-pratts-story/)

It is also possible to cite the URL in a footnote, but one way or the other, it should be cited. That way a reader can look it up on their own, and verify the data.

Quality Control

Of course, there is no guarantee that the data on the Internet will be of high quality for syntactic research. It is completely uncontrolled, so caution needs to be exercised. I propose the following guidelines.

Control 1:

Immediately after finding an example in the search result list, you should check to see that the sentence actually appears on the page at the accompanying URL. That is, open the link in (3), and search for the sentence in (2).
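For a long list of saved examples, this check can be partly automated. The following is a rough sketch in Python using only the standard library, under several assumptions: the page is plain HTML that can be downloaded without a browser, the text is UTF-8, and the sentence is not split across page elements in unusual ways. The names normalize, html_to_text, sentence_on_page and fetch are hypothetical helpers written for this illustration. A failed match only means that the page should be checked by hand, not that the example is bad.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collect the text of an HTML page, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def normalize(text):
    """Lowercase, straighten curly quotes, and collapse runs of whitespace."""
    pairs = (("\u2018", "'"), ("\u2019", "'"), ("\u201c", '"'), ("\u201d", '"'))
    for curly, straight in pairs:
        text = text.replace(curly, straight)
    return re.sub(r"\s+", " ", text).strip().lower()


def html_to_text(html):
    """Return the concatenated text content of an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def sentence_on_page(sentence, html):
    """True if the normalized sentence occurs in the normalized page text."""
    return normalize(sentence) in normalize(html_to_text(html))


def fetch(url, timeout=10):
    """Download a page as text (assumes UTF-8; bad bytes are replaced)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

A typical call would be sentence_on_page(sentence, fetch(url)), with the sentence in (2) and the URL in (3).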

Control 2:

You should examine the sentence to make sure that it does not represent any of the following: (a) a grammatical error, (b) humorous writing (which plays with language), (c) poetic license (again playing with language), (d) an AI-generated text (not produced by a human), (e) a Google translation (again not produced by a human). A related concern is whether the sentence is being used as an example in a linguistics paper online. If so, it might be elicited data, and not naturally occurring Internet data.

Control 3:

While at the website, you can try to see if there is anything that might indicate that it was created by non-native speakers of English. If there are red flags, you should discard the example.

Control 4:

You can look at the context in the preceding and following text. Such context might shed light on the interpretation of the resulting examples, and might be useful to include in your work. For example, in (2), the second sentence explains the first sentence.

Control 5:

I use the Internet as a tool in looking into the properties of a particular dialect, usually my own I-language. In doing Internet searches, I am looking for data that I myself find acceptable. For example, after doing the search in (1), I found (2). Crucially, (2) is acceptable for me as a native speaker of English. If I find a relevant example, and I judge the sentence as acceptable, then I save it to a list for future use in a paper or book. If I judge the sentence as unacceptable, I put it aside for further consideration (see below).
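The bookkeeping described here can be done in a notebook or spreadsheet, but it can also be sketched in a few lines of Python. The following is only an illustration. The Example record and the helpers sort_examples and cite are hypothetical names, and the second entry in the usage example is a labeled placeholder, not real data. The acceptability judgment is entered by the researcher; nothing in the code makes that judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    sentence: str     # the sentence as it appears on the page
    url: str          # where it was found
    query: str        # the search string that produced it
    acceptable: bool  # the researcher's own judgment


def sort_examples(examples):
    """Split examples into (saved, set_aside) by acceptability judgment."""
    saved = [ex for ex in examples if ex.acceptable]
    set_aside = [ex for ex in examples if not ex.acceptable]
    return saved, set_aside


def cite(example):
    """Format an example followed by its source URL in parentheses."""
    return f"{example.sentence} ({example.url})"


examples = [
    Example(
        sentence="None of this was done on my own. To me success always takes collaboration.",
        url="https://voyageminnesota.com/interview/check-out-jess-pratts-story/",
        query='"was done on my own"',
        acceptable=True,
    ),
    # Placeholder entry, for illustration only (not a real search result).
    Example(
        sentence="<hypothetical sentence judged unacceptable>",
        url="https://example.org/placeholder",
        query='"was done on my own"',
        acceptable=False,
    ),
]

saved, set_aside = sort_examples(examples)
```

Storing the search string alongside each sentence makes it easy to see later which searches have already been run, and the set-aside list keeps the unacceptable examples available for further consideration.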

Control 5’:

If you are looking for a construction that is not from your own dialect, you should find a native speaker of the relevant dialect to verify it with.

There are a few possible misconceptions about this method, which I will discuss here. First, it might be claimed that when using a corpus, the main purpose should be to give some description of the data found in that corpus. On that construal, Control 5 seems anomalous. If I am putting aside data that does not conform to my judgments, then I am in effect filtering the data in a way that does not directly reflect the generalizations that characterize the corpus data.

But my goal is not to give a description of the data found in the corpus (the English data found on the Internet). That task is hopelessly obscure. First, there are different dialects of English represented there (e.g., varieties of American, Canadian, British, South African, Indian, Australian and Ghanaian English, to name just a few). Second, the people using English are of very different ages and backgrounds (socioeconomic and cultural). Third, the English found on the Internet is of all kinds of registers and styles. Fourth, there may be citations from earlier stages of English. Fifth, as noted above, a lot of the English on the Internet has been produced by non-native speakers. There are just way too many dimensions to think that a theory could ever be given of that data.

Rather, my goal is to use Internet searches as a tool for doing syntactic research.

The most one can do is to zoom in and study a particular dialect. This is the reason for Control 5. I am using the Internet as a tool to help me study that dialect. The Internet searches help me to confirm controversial data, to fill out combinatorial possibilities, and even occasionally to help me discover completely new and surprising features of my own dialect. It is in this sense that I am using the Internet as a natural language corpus.

Questions and Answers:

In order to further clarify the method, I pose some possible questions about the method and answer them here.

Q: If you find a construction attested on the Internet, can you immediately conclude that it is part of the dialect of some speaker?

A: No, you cannot. There are all kinds of reasons that a construction may appear on the Internet. In fact, it is not unlikely that it was written by a non-native speaker. This is the reason for Control 5 above. If the construction that you find is relevant to your research, you may consider trying to find native speakers that accept it.

My response here supports Jason Merchant’s dictum: "Beware the fetishization of attestation!" (thanks to Gary Thoms for pointing this out to me). Jason explains the phrase as follows (personal communication): “I’ve used it in various talks and handouts to warn people against taking attested sentences as direct input to theorizing. Particularly when a speaker might actually classify that sentence as unacceptable…The broader point is really about corpus linguistics, and why data from corpora still need to be checked with speakers, and some of it should be rejected.”

Q: If during searches, you find some construction (that you find acceptable) that apparently contradicts your analysis, can you discard it because you found it on the Internet? 

A: Definitely not. You must either show that it does not in fact conflict with your assumptions, or modify or completely reject your analysis. It is not scientifically licit to cherry-pick the data that you find on the Internet. Any data that you find that bears on your analysis must be treated seriously, whether or not it supports your analysis.

Q: If you search for a particular construction, and do not find it, can you use this as evidence that the construction is unacceptable? 

A: No, you cannot. There are endless reasons why a particular sentence may not yield any hits in a search. To determine if a sentence is acceptable or not, you need to judge it for acceptability.

Conclusion

Even though in this blog post I have addressed syntactic research only, exactly the same points carry over to semantics research and syntax/semantics interface research.

References

For further information on Internet searches in syntactic research, see:

Collins, Chris. 2018. Using * in Google Searches for Syntactic Research. Ordinary Working Grammarian [Blog]. (https://ordinaryworkinggrammarian.blogspot.com/2018/10/using-in-google-searches_5.html)

