Introduction
One of the primary goals of syntactic theory is to understand the scope and limits of syntactic variation cross-linguistically. Such research is crucial for syntactic theory: for example, its results could be used to argue for properties of Universal Grammar (UG), the innate mechanisms and principles of human language.
But the task is vast.
The first dimension of complexity is the number of languages, both living and dead, and their syntactically distinguishable dialects. Conservative estimates put the number of languages at around 6,000, but with dialects the number is probably much higher.
The second dimension of complexity is that only a few languages have a corpus of materials. The vast majority of languages have only a few descriptive documents written about them (e.g., a grammar, a dictionary or some texts). Cross-linguistic coverage is therefore uneven.
The third dimension of complexity is the set of properties that would have to be articulated to describe a single language. How many such properties are there? In fact, given a sufficiently large corpus, there may be properties that could be inferred from the corpus that are not part of any linguistic description. These implicit properties should also be investigated.
Lastly, there is the issue of just what kind of variation we would expect to look for. In standard typological sources, people are interested in implicational correlations of the form "If P, then Q": if a language has property P, then it will also have some other property Q.
But such implicational correlations are just the simplest kind of relation that could be envisioned. One could extend these correlations using logical combinations: If P ∨ Q, then R ∧ W. And even this dramatic extension of properties is not enough. One could also introduce subsets of languages as properties: if all the languages of a set S have property P, then they will have Q. These are just the simplest kinds of extensions to the implicational correlation format. It may be that there are many other kinds of connections we have not even conceived of yet. And it may be that an AI model could identify those connections.
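The implicational formats described above can be made concrete with a small sketch. Everything in it is hypothetical: the language names, the property names P, Q, R and W, and the truth values are placeholders rather than real linguistic data, and the helper functions are illustrative names, not part of any existing tool.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical toy sample: each language maps property names to truth values.
Language = Dict[str, bool]
Predicate = Callable[[Language], bool]

sample: Dict[str, Language] = {
    "LangA": {"P": True, "Q": True, "R": True, "W": True},
    "LangB": {"P": False, "Q": True, "R": True, "W": True},
    "LangC": {"P": False, "Q": False, "R": False, "W": True},
    "LangD": {"P": True, "Q": True, "R": True, "W": False},
}


def counterexamples(antecedent: Predicate, consequent: Predicate,
                    languages: Dict[str, Language]) -> List[str]:
    """Names of languages that satisfy the antecedent but not the consequent."""
    return [name for name, lang in languages.items()
            if antecedent(lang) and not consequent(lang)]


def holds(antecedent: Predicate, consequent: Predicate,
          languages: Dict[str, Language]) -> bool:
    """An implication holds in the sample if it has no counterexamples."""
    return not counterexamples(antecedent, consequent, languages)


def holds_for_subset(subset: Iterable[str], p: Predicate, q: Predicate,
                     languages: Dict[str, Language]) -> bool:
    """If all the languages of the subset have P, then they all have Q."""
    members = [languages[name] for name in subset]
    return (not all(p(m) for m in members)) or all(q(m) for m in members)


# Simple correlation: If P, then Q.
simple_ok = holds(lambda l: l["P"], lambda l: l["Q"], sample)

# Logical combination: If P or Q, then R and W.
compound_bad = counterexamples(lambda l: l["P"] or l["Q"],
                               lambda l: l["R"] and l["W"], sample)

# Subset of languages as a property: S = {LangA, LangB}.
subset_ok = holds_for_subset(["LangA", "LangB"],
                             lambda l: l["Q"], lambda l: l["W"], sample)

print(simple_ok, compound_bad, subset_ok)
```

Representing antecedents and consequents as arbitrary predicates means the same check covers simple, compound and subset-based correlations, at the cost of having to enumerate candidate predicates separately.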
Hypothetically, let’s say we could characterize a language fairly accurately with one million properties. Given the amount of information that has been uncovered about English and other languages in the history of generative grammar, this number seems conservative. Then just comparing the properties pairwise in simple correlations would require one trillion combinations (#(P) × #(Q), that is, one million times one million), each of which would then have to be verified over at least 10,000 languages. Calculating all possible correlations in the broader sense (see the preceding paragraph) would require an astronomical number of calculations.
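The arithmetic behind these figures can be checked in a few lines. The property and language counts are the hypothetical ones from the text. The count of compound rules is an added assumption about how one might enumerate rules of the form "If P ∨ Q, then R ∧ W" (one unordered pair for the antecedent, one for the consequent).

```python
from math import comb

# Hypothetical figures taken from the text.
n_properties = 1_000_000
n_languages = 10_000

# Simple correlations "If P, then Q": #(P) x #(Q).
ordered_pairs = n_properties * n_properties

# Verifying every simple correlation against every language.
checks = ordered_pairs * n_languages

# Assumption: compound rules "If P or Q, then R and W" built from one
# unordered pair of properties on each side of the implication.
compound_rules = comb(n_properties, 2) ** 2

print(f"simple correlations: {ordered_pairs:,}")
print(f"language-by-correlation checks: {checks:,}")
print(f"compound rules: {compound_rules:.2e}")
```

Even before subsets of languages are admitted as properties, the compound rules outnumber the simple correlations by many orders of magnitude.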
The issues can be stated in terms of neural networks.
Imagine all properties as a vast network of nodes. Then in principle all the nodes could be connected to one another, such that P is connected to Q iff whenever P is true, Q is true. Possibly such connections could be given different strengths depending on the certainty of the implication (zero is no information). If there are a million properties, there are a trillion connections (a million squared). Of course, if we allow subsets of languages as properties, and logical connectives, there will be many more connections than that. How could we create and search such a vast network of information for possible correlations?
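As a minimal sketch of such a network, the code below stores one directed connection per ordered pair of properties. The strength measure is an assumption, not something the text specifies: here the strength of P → Q is the fraction of languages with P that also have Q, and it is set to zero when no language in the sample has P (no information). In this simple version a pair that never co-occurs also scores zero. The sample data and names are hypothetical placeholders.

```python
from itertools import permutations
from typing import Dict, Tuple

# Hypothetical toy sample: language name -> property name -> truth value.
sample: Dict[str, Dict[str, bool]] = {
    "LangA": {"P": True, "Q": True, "R": False},
    "LangB": {"P": True, "Q": True, "R": True},
    "LangC": {"P": False, "Q": True, "R": True},
    "LangD": {"P": False, "Q": False, "R": True},
}


def implication_strengths(
    languages: Dict[str, Dict[str, bool]]
) -> Dict[Tuple[str, str], float]:
    """Strength of each directed connection P -> Q between distinct properties."""
    properties = sorted({p for values in languages.values() for p in values})
    strengths: Dict[Tuple[str, str], float] = {}
    for p, q in permutations(properties, 2):
        with_p = [v for v in languages.values() if v.get(p, False)]
        if not with_p:
            strengths[(p, q)] = 0.0  # no language has P: no information
        else:
            hits = sum(1 for v in with_p if v.get(q, False))
            strengths[(p, q)] = hits / len(with_p)
    return strengths


network = implication_strengths(sample)

# Connections with strength 1.0 are exceptionless implications in this sample.
exceptionless = sorted(edge for edge, s in network.items() if s == 1.0)
print(exceptionless)
```

With n properties this table has n × (n − 1) entries, so a dictionary of this kind is only workable for small n. At the scale of a million properties, some sparse or learned representation would be needed instead.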
With the advent of AI (“Artificial Intelligence”) and LLMs (“Large Language Models”), we are possibly close to being able to think about such a task.
The general proposal follows Collins and Kayne 2007, but is updated to use Artificial Intelligence, Deep Learning and LLMs.
Collins, Chris and Richard Kayne. 2007. A Proposal for a Database of the Syntactic Structures of the World's Languages. (https://ling.auf.net/lingbuzz/003404)
Path Forward
The idea is to develop an AI program (using neural networks and deep learning) that could search the internet, including linguistic resources, to answer comparative syntax questions.
First, it needs to be understood what kind of syntactic information ChatGPT can give, and how it gets that information. Does the information come directly from texts or does it come from grammars and other descriptions? How accurate is the information?
Some preliminary studies concerning acceptability judgments are available here:
https://ordinaryworkinggrammarian.blogspot.com/2024/10/acceptability-judgments-in-chatgpt.html
https://ordinaryworkinggrammarian.blogspot.com/2024/10/acceptability-judgments-in-chatgpt_20.html
Second, it needs to be understood how much access ChatGPT has to available resources on natural language syntax, such as WALS.
Third, it needs to be investigated whether a new AI interface needs to be developed (over and above ChatGPT) to interact with the online resources.
Fourth, the system should have the ability to recognize the values of linguistic properties for so-called low-density languages, so that those languages can play an integral role in the search for correlations. This means that any such system will have to be able to work on the basis of grammars, dictionaries and texts.
Fifth, the project does not envision the collection or uploading of any actual linguistic data, but rather the combined use of already existing resources to answer questions about comparative syntax. There is no need to get buy-in from language specialists to populate the database, although there may be some need to get access to databases that can be searched.
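One way to approach the accuracy question raised in the first point is to compare the property values a model reports with the values recorded in a reference resource. The sketch below assumes both have already been collected into simple lookup tables. The language names, the property name and the values are hypothetical placeholders, and `agreement` is an illustrative helper, not an existing API.

```python
from typing import Dict, Tuple

Key = Tuple[str, str]  # (language, property)

# Hypothetical reference values, e.g. as they might be read off a typological database.
reference: Dict[Key, str] = {
    ("LangA", "basic_order"): "SVO",
    ("LangB", "basic_order"): "SOV",
    ("LangC", "basic_order"): "VSO",
}

# Hypothetical values reported by a model for the same questions.
model_answers: Dict[Key, str] = {
    ("LangA", "basic_order"): "SVO",
    ("LangB", "basic_order"): "SVO",
}


def agreement(ref: Dict[Key, str], answers: Dict[Key, str]) -> Tuple[float, float]:
    """Return (accuracy on shared items, share of reference items answered)."""
    shared = [key for key in ref if key in answers]
    if not shared:
        return 0.0, 0.0
    accuracy = sum(1 for key in shared if ref[key] == answers[key]) / len(shared)
    coverage = len(shared) / len(ref)
    return accuracy, coverage


accuracy, coverage = agreement(reference, model_answers)
print(f"accuracy={accuracy:.2f} coverage={coverage:.2f}")
```

Reporting coverage separately from accuracy matters for low-density languages, where a model may simply have no answer rather than a wrong one.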
Preliminary Results with ChatGPT
First, as of December 8, 2024, ChatGPT4 seems to have some knowledge of other languages. Here are my queries and the responses:
Chris: Do you speak French?
ChatGPT said:
Oui, je parle français ! 😊 Je peux répondre à tes questions, t'aider avec des traductions, ou simplement discuter en français. Dis-moi ce dont tu as besoin!
Chris: Translate that into English.
ChatGPT said:
Yes, I speak French! 😊 I can answer your questions, help you with translations, or simply chat in French. Let me know what you need!
Chris: How many languages do you speak?
ChatGPT said:
I can communicate effectively in many languages, covering a broad range of linguistic families and regions. While I’m most proficient in widely spoken languages, I have varying levels of expertise in others. Here’s an overview…While I can assist with a variety of languages, my fluency level varies. For highly niche or less commonly spoken languages, my ability may be limited to basic phrases, cultural context, or translations. Let me know which language you have in mind, and I’ll do my best to assist! 😊
Now, consider the following interchange on the order of verb and object pronoun cross-linguistically.
Chris: What is the order of the verb and the object pronoun in English? Just a one line response, please.
ChatGPT said: In English, the object pronoun comes after the verb (e.g., I see her).
Chris: Now, what is the order of the verb and the object pronoun in French? Just a one line response, please.
ChatGPT said: In French, the object pronoun comes before the verb (e.g., Je la vois).
Chris: What is the difference between French and English with respect to the order of the object pronoun?
ChatGPT said: In French, the object pronoun comes before the verb, while in English, it comes after the verb.
Chris: In which languages that you know does the object pronoun come before the verb?
ChatGPT said: The object pronoun comes before the verb in languages such as French, Spanish, Italian, Portuguese, Arabic, Hindi, Russian, Polish, and Hungarian.
Even this small dialogue highlights the potential power of ChatGPT to do comparative syntax work.
The following dialogue shows that ChatGPT has access to information on WALS.
Chris: According to WALS, what is the basic word order of Ewe?
ChatGPT said: According to the World Atlas of Language Structures (WALS), the basic word order of Ewe, a Niger-Congo language spoken in Ghana, Togo, and neighboring countries, is Subject-Verb-Object (SVO).