Friday, July 23, 2021

Archiving with ELAR

In this blog post, I will discuss my trials and tribulations with ELDP and ELAR, both the positives and the negatives. My goal is to make it easier for other fieldworkers to archive there, if they decide to do so.

The main takeaway is that planning ahead and familiarizing yourself with the nature of the required tasks can save you time in the long run. In my case, planning ahead could have saved me hundreds of hours.

In this post, I will mostly discuss metadata and archiving. See my blog for other posts concerning language documentation. Just use the keyword "fieldwork".

Ordinary Working Grammarian

As background, here are links to my application and my contract. You may find these useful in preparing your own application:

Small Grant Application

ELDP Contract (modified to take out NYU banking information)

Here is the link to my ELAR deposit:

Documenting Sasi

My ELAR deposit has 122 sessions, 38 of which have oral texts that are completely transcribed, glossed and translated (for a total of 3.5 hours). This work will be crucial for me in writing a Sasi dictionary, and in creating other materials suitable for the community to use (e.g., storybooks written in Sasi). It will also be very useful for any future work I do on the syntax of Sasi, and also in eventually writing the grammar.

My ELAR deposit does not come close to exhausting my materials on Sasi. Rather it mostly focuses on video documentation work done in 2019-2020 (with 13 sessions from 2016).  It was challenging enough putting the ELAR deposit together without bringing in other materials. I hope that in the future I will be able to add them to my deposit.

Is ELAR for you?

An ELDP grant provides a reasonable sum of money to help you document endangered languages. As I understand it, the main objectives are to create video and then to transcribe, gloss and translate it. These are great objectives, which are useful in all kinds of ways for language documentation, pedagogy and linguistic theory. But an ELDP grant is not for everybody. You need to carefully read the information on their website before making your decision.

ELDP does not work the same way as other major granting organizations, such as NSF, Fulbright, ACLS or Guggenheim, all of which give the grantee more freedom in carrying out their project and allow for a much broader range of intellectual objectives (e.g., writing scientific papers).

Perhaps the best way to find out if ELAR is for you is to browse it. I recommend that you spend time browsing this site before applying for a grant. 

Endangered Languages Archive

After you apply for an ELDP grant and are awarded it, carefully read through the contract before you sign it. Some of the provisions in the contract may not be a good fit for you and your institution. You may want to arrange a call with your institutional OSP program officer to talk over the contract and see if there are any red flags.

The outcome of an ELDP grant is a set of materials for deposit in ELAR. So, the whole grant is focused on the production and archiving of those materials. ELAR will give you a precise list of what you need to record and transcribe (which specifies the exact number of hours that you need). This will be in writing in the contract. If you do not feel you can meet those targets (based on your previous experience), let them know and try to negotiate. 

I personally was not able to meet my contract targets (see my contract above). Part of this had to do with the Coronavirus pandemic which struck the world right in the middle of my grant, but it also has to do with my workflow (translating into two languages) and with my rate of transcription.

Another relevant observation is that it took me quite a bit of correspondence with ELDP/ELAR to figure out the requirements of the grant (around 300 messages back and forth). Part of this has to do with my inexperience with archiving. But part is also due to the very specific technical requirements of the archiving process. For other granting agencies (e.g., NSF), I have had nowhere near this amount of correspondence.

Tip: When I applied for the grant, I had several productive discussions with the ELDP program director who clarified the scope of an ELDP grant and who gave me advice on what to include and what not to include in my proposal. I believe that these discussions helped me get the grant award.

Suggestion for ELAR: When I applied, ELDP required paper copies of all signed documents, which is an antiquated working model. Thus, even if you are in a country such as Botswana, you will have to download the contract, sign it, scan it, and send it in by mail. Since mail from some countries is irregular, this will have to be done by DHL, which is expensive (and time consuming). As far as I know, linguistic publishers and other grant agencies all accept signed documents submitted electronically. So, my suggestion is that ELDP drop the paper copy requirement, and switch to fully electronic submissions.

Suggestion for ELAR: In many cases, you need to be registered to see the materials on ELAR. If you forget your password, there is no way to create a new one online, so you need to write to ELAR to request a new one. If you are working on the weekend, then you will have to wait until Monday to get a new password. Most if not all large websites (for banks, for archives) allow passwords to be reset online, so ELAR is lagging a bit behind here. My suggestion is that ELAR implement an online password reset.

Archiving Steps

In addition to documentation, there are a number of steps associated with archiving. These are given below (copied from a letter sent to me by ELAR). There are various tests and questionnaires in this e-mail message. I highly recommend that you look them over, even before the grant starts.

1) Organise all your files and ensure that they are in the accepted formats (see below). 

2) prepare the metadata for all of the sessions you would like to archive using lameta. See lameta.org for a video tutorial and helpsheet.

3) take the metadata self-test and make amends to your metadata if necessary

4) fill in the pre-loading questionnaire

5) send ELAR all of your IMDI metadata files for checking

6) request a loading bay and upload your materials

7) fill in the post-loading questionnaire

Tip: An important piece of advice for linguists archiving with ELAR is to familiarize yourselves with Lameta and the ELAR archiving requirements before going into the field. This way, you will be able to manage your materials (as you create them) in a way that will make it easier to upload to ELAR later on. 

What are Sessions?

Using Lameta, all of your files need to be organized into sessions. For each session, you specify the session ID, the title, the access level, the people involved, etc. So the question is: What is a session? It actually took me quite a few e-mails of back and forth with ELAR before I understood exactly what they are looking for. 

My original impulse was to make all the videos recorded on one day be a session. For example, one day I recorded ten oral texts, told by three different people. I considered this work to be a ‘session’. What I realized eventually, through e-mail correspondence with ELAR, is that they want sessions to be much more fine grained. Each session should ideally cover one topic (e.g., “Making Bojalwa”), and should consist of the following:

a. one video file (.MP4)

b. one audio file (.wav)

c. one ELAN file (.eaf) (plus accompanying .pfsx file)

d. one PDF export of the FLEx file (.pdf)

A session can contain two video files, if they are somehow closely related (e.g., two parts of the same long video). So, in the case of the example immediately above, I needed ten different sessions, even though all the recording was done in a single day.
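As a rough illustration, the checklist above can be turned into a small script that flags incomplete sessions. This is only a sketch in Python: the function name missing_file_types and the example file names are my own assumptions, not anything supplied by ELAR or Lameta, and the required extensions simply mirror the list above.

```python
from pathlib import Path

# Extensions taken from the session checklist above (compared case-insensitively).
REQUIRED_EXTENSIONS = {".mp4", ".wav", ".eaf", ".pdf"}


def missing_file_types(filenames):
    """Return the required extensions that no file in the session has."""
    present = {Path(name).suffix.lower() for name in filenames}
    return sorted(REQUIRED_EXTENSIONS - present)


if __name__ == "__main__":
    # Hypothetical session contents, for illustration only.
    session_files = ["20200112_01.MP4", "20200112_01.wav", "20200112_01.eaf"]
    print(missing_file_types(session_files))
```

Running a check like this over every session folder before sending anything to ELAR is a cheap way to spot a forgotten audio or PDF file.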

Each session has an ID and a title. The title is something descriptive such as "Making Bojalwa". But the ID is not. Crucially, the file names in your session are determined by the ID of the session (not the descriptive title). Figuring out how to assign IDs to sessions also took some back and forth with ELAR. In part, you can learn about the conventions by looking at ELAR deposits. I ultimately decided on the following system. Typically, several oral texts were recorded on a single day. So, each session ID is of the form: 20200112_01. This means the following: The oral text was recorded on January 12th, 2020. It is the first oral text recorded on that day.
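For readers who like to script their file naming, here is a minimal Python sketch of the ID scheme just described (recording date, underscore, two-digit running number). The function names make_session_id and parse_session_id are hypothetical, and the sketch assumes no more than 99 oral texts per day.

```python
from datetime import date, datetime


def make_session_id(recorded_on, index):
    """Build an ID such as 20200112_01 from a date and a 1-based index."""
    return f"{recorded_on:%Y%m%d}_{index:02d}"


def parse_session_id(session_id):
    """Split an ID back into its recording date and running number."""
    date_part, index_part = session_id.split("_")
    return datetime.strptime(date_part, "%Y%m%d").date(), int(index_part)


if __name__ == "__main__":
    print(make_session_id(date(2020, 1, 12), 1))
    print(parse_session_id("20200112_01"))
```

A useful side effect of putting the date first in year-month-day order is that sorting the IDs alphabetically also sorts the sessions chronologically.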

In addition to these materials, it was clear that ELAR also wanted me to include my FLEx project as a session. This project would have included the Sasi dictionary that I am developing (based on my own elicitation work and also on the contents of the oral texts in FLEx). Ultimately, I declined, because I thought that my dictionary was not in good enough shape to put on a public archive. I hope to be able to add the dictionary to my deposit in the future.

ELAN-->FLEx-->ELAN

In addition to the video file (.MP4) and the audio file (.wav), a session can contain an ELAN file (.eaf). But ELAR requires that this ELAN file be made from the FLEx file, so that it has glosses. Concretely, you first create an ELAN file (with a transcribed video). This ELAN file typically does not contain glossed text, since it is much easier to do glossing using FLEx. So, using that ELAN file, you create a FLEx file which contains the glosses. Next, you feed that FLEx file back into ELAN to create an ELAN file with glossed text. I call this process: ELAN-->FLEx-->ELAN.

I am not a big fan of this requirement, since glossed ELAN files are not really important in my work (glossed FLEx files are sufficient for me). Furthermore, my consultants (and more generally non-linguist users) would not benefit from glossed ELAN files. The YouTube videos that I created for the community only have translations, not glosses. However, I did comply with the requirement.

But I had some difficulties. I was able to figure out how to make ELAN files from FLEx files. See the link below. But I had used the character * in my original ELAN file, and that messed up the whole process. The ELAN files I made from FLEx files contained empty lines whenever I used a * in the original ELAN file. Finally, I was able to figure out the issue and reproduce the ELAN files, but it took me a lot of time. I easily wasted around two weeks on redoing all the ELAN files to fix this error.

Tip: When creating ELAN files, do not use any non-linguistic characters (e.g., *) or punctuation (e.g., ?) in the transcription line. 
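One way to catch such characters early is to scan the ELAN file before sending it to FLEx. The sketch below is only an illustration in Python: it relies on the fact that .eaf files are XML, with annotations stored in ANNOTATION_VALUE elements inside TIER elements. The function name, the set of flagged characters, and the tier name "tx" in the usage note are assumptions for illustration, so substitute whatever your own transcription tier is called.

```python
import xml.etree.ElementTree as ET

# Characters to flag; extend this set to suit your own conventions.
SUSPECT_CHARACTERS = set("*?")


def find_suspect_annotations(eaf_xml, tier_id):
    """Return (annotation text, offending characters) pairs for one tier."""
    root = ET.fromstring(eaf_xml)
    hits = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for value in tier.iter("ANNOTATION_VALUE"):
            text = value.text or ""
            bad = sorted(SUSPECT_CHARACTERS & set(text))
            if bad:
                hits.append((text, bad))
    return hits
```

To use it, read in the contents of an .eaf file and call, for example, find_suspect_annotations(contents, "tx"). An empty list means that none of the flagged characters were found on that tier.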

Also, I made the mistake of creating a large batch of FLEx files (from ELAN files) before transferring them back to ELAN. That is a huge mistake, because you do not see what your errors are as you make them. The best way to work is that when you create a FLEx file, transfer it back to ELAN right away, before creating other FLEx files. That way, you catch your errors as you make them.

ELAR gave me two documents to help me with the ELAN-->FLEx-->ELAN process. The first document was not helpful for me, but the second worked perfectly. Here is a link:

ELAN-->FLEx-->ELAN

An alternative method which I worked out is given below in the blog post section.

Suggestion for ELAR: There are lots of documents relevant to archiving on ELAR. The very helpful staff will provide those documents to you, if you ask for them. But it would be more convenient if all the documents were in one folder and if the grantees were given access to that folder. Having a semi-public place to find documents in would have saved me a lot of time at some points.

Create Metadata in the Field

I recommend creating your metadata with Lameta while in the field. This way you will not have to do it later on. In my case, I have 122 sessions (including hundreds of video files and audio files), and I created the metadata for these after I got back from the field. It is without a doubt easier to create metadata (e.g., the ‘description’ and the ‘keyword’) immediately after shooting the video when everything is fresh in your mind. You may also need to contact your consultants to fill in metadata (e.g., full name, date of birth), and this is much easier to do when you are in the field.

As another example, ELAR requests that you provide a picture of each consultant. Since I waited until I got back from the field to add those pictures to Lameta, it took me a lot of time (in some cases I had to use screenshots from the videos). Had I uploaded the pictures in the field, as they were created, it would have made the task easier.

If the file naming system you use in the field is different from the one you use in Lameta, you will have to relink all your ELAN files after you import them into Lameta. The process of relinking the ELAN files is not difficult, but it is time consuming. It would be much easier just to name all the files in the field according to the Lameta file naming system.

For your final report with ELAR, you are going to have to report such information as: How many videos did you make? How many audio recordings did you make? How many minutes of video do you have? How many minutes of audio do you have? How many people did you record with?

Keeping an Excel spreadsheet in the field with all this information will make things much easier when you file your final report. You can just read the totals off the spreadsheet. I did not make one, and I needed to go through all of my files and count up the totals, which took me lots of time.

The spreadsheet will also make it easier to work with your materials in the field. For example, you will not always remember exactly where (in terms of bundle/session) a particular oral text is. You can easily look it up on the spreadsheet.
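If you export such a spreadsheet as a CSV file, the totals for the final report can also be computed with a few lines of Python. The following is a sketch under my own assumptions: the column names (session_id, video_files, audio_files, video_minutes, audio_minutes, speakers) are hypothetical, and speakers sharing a session are assumed to be separated by semicolons.

```python
import csv
import io


def summarise(csv_text):
    """Total up file counts, minutes and distinct speakers from a session log."""
    totals = {"video_files": 0, "audio_files": 0,
              "video_minutes": 0.0, "audio_minutes": 0.0}
    speakers = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals["video_files"] += int(row["video_files"])
        totals["audio_files"] += int(row["audio_files"])
        totals["video_minutes"] += float(row["video_minutes"])
        totals["audio_minutes"] += float(row["audio_minutes"])
        # Several speakers in one session are separated by semicolons.
        speakers.update(s.strip() for s in row["speakers"].split(";") if s.strip())
    totals["speakers"] = len(speakers)
    return totals
```

Counting speakers as a set rather than adding up a per-session number avoids counting the same person twice when they appear in several sessions.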

Genres, Topics and Keywords

Once you have all your files uploaded to Lameta on your computer, and you are ready to upload your materials to ELAR, you will need to send your IMDI files to ELAR to be checked. One of the big issues they are looking at is the selection of genres, topics and keywords. Obviously, creating good sets of genres, topics and keywords will make your deposit more accessible to ELAR users. Fixing these issues gave rise to a back and forth process that took me about a week to complete. Here is one e-mail that I got from ELAR:

“Please do not repeat the same information across the three categories.

Genres describe the type or structure of a speech event, e.g. a conversation, or a procedural narrative. Please select from the attached list and add if it's not enough. For the genre "Narrative", please specify the kind of narrative, using the attached list.

Topics answer the question "what do the speakers talk about?"

Keywords give more details on the topics, e.g. Topic "Trees" vs. Keyword "Acigaro tree", or "Animals" vs. "Crocodiles" or add additional information, e.g. "Frog story"

If you have multilingual topics or keywords, please put both languages into one topic or keywords,…”

Also, through correspondence with ELAR, and the questions on the Post-Loading Questionnaire, I learned that they do not want the keywords and topics to overlap. For me, this was a very time-consuming requirement to comply with. I ended up going through my topics and keywords at least 10 times to make sure that (a) they were informative and (b) they did not overlap.
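Exact duplicates between topics and keywords are the easy part to check mechanically. Here is a small Python sketch of such a check. The function name overlapping_terms is hypothetical, and the comparison only catches terms that are identical apart from capitalization and spacing. Judging whether a keyword is informative, or merely a near-synonym of a topic, still has to be done by hand.

```python
def overlapping_terms(topics, keywords):
    """Return terms that appear in both lists, ignoring case and extra spaces."""
    def normalise(term):
        return " ".join(term.lower().split())

    topic_set = {normalise(t) for t in topics}
    keyword_set = {normalise(k) for k in keywords}
    return sorted(topic_set & keyword_set)


if __name__ == "__main__":
    # Example terms borrowed from the ELAR e-mail quoted above.
    print(overlapping_terms(["Trees", "Animals"], ["Acigaro tree", "Crocodiles"]))
```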

Suggestion for ELAR: It would be helpful if there were a helpsheet on genres, topics and keywords. Such a helpsheet could contain model examples and guidelines. I do not remember receiving such a document.

Uploading

Once your metadata has been checked by ELAR, you will be given permission to upload your materials and you will be assigned a loading bay. I was assigned Loading Bay 8. For this, you will need a depositor username and password. These are different from your regular username and password for ELAR (which allow you to browse the collection). Because of this, you will need a second e-mail account, so that the two usernames are different. For example, I used my NYU e-mail account and a Gmail account (which I normally never use). Furthermore, in order to upload your materials onto ELAR you will need a PC. I regularly use a MacBook Pro, but I happened to have a PC, so I was OK.

The program you will use to upload your files is called “SIP Creator” and it requires a lot of PC hard drive space. For example, suppose you are going to upload 10 sessions, for a total of 15GB. Then you need to have three times that space or 45GB free space on your PC hard drive (not an external hard drive, the PC hard drive itself). If you do not have enough space, you will get an error message. And if SIP Creator has run for two hours and you get an error message, you have basically wasted two hours. This happened to me several times.

So if your PC hard drive is filled to the brim, you need to take the time to delete some of your files, or transfer them to an external hard drive. Furthermore, you need to make sure that you do not try to upload too many sessions at one time, otherwise you will surpass your PC hard drive space availability. It took me a few days to work these things out (and some correspondence with ELAR) before I could start to upload seriously.
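Before starting a long run, the space arithmetic above can be checked with a short Python script. This is only a sketch: the factor of three is the rule of thumb from my own experience described above, not an official SIP Creator figure, and the function names are hypothetical.

```python
import shutil

# The "three times the upload size" rule of thumb described above.
SPACE_FACTOR = 3


def required_free_bytes(upload_bytes):
    """Free space needed on the internal drive for an upload of this size."""
    return SPACE_FACTOR * upload_bytes


def has_enough_space(drive_path, upload_bytes):
    """Check the free space on the drive that contains drive_path."""
    free = shutil.disk_usage(drive_path).free
    return free >= required_free_bytes(upload_bytes)


if __name__ == "__main__":
    batch = 15 * 10**9  # a 15GB batch of sessions
    print(required_free_bytes(batch) / 10**9, "GB needed")
    print("Enough space:", has_enough_space(".", batch))
```

If the check fails, either free up space or split the batch into smaller groups of sessions before running SIP Creator.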

Once I figured the uploading process out, it took me about a week to upload all of my materials. The process is rather time consuming. For example, uploading 10 sessions with 16GB of video took me about an hour and a half. The time is directly dependent on the number of GB of video you upload (all the other file types, e.g., .wav and .pdf, are relatively small and easy to upload).

Blog Posts (Ordinary Working Grammarian)

I have written a number of blog posts with information directly relevant to my ELDP grant. I list them below for the convenience of the reader. There are many other posts on my blog about fieldwork that you might find useful as well. Just search the keyword “fieldwork”.

Backup Workflow for Linguistic Fieldwork

Burning Subtitles into Video Using Adobe Premiere Pro

Converting .MOV Video Files to .MP4 Video Files

ELAN to FLEx to ELAN

ELAN Training Videos

Grabbing a Frame from a Video with Adobe Premiere Pro

Rate of Transcription in Syntactic Fieldwork

Run-and-Gun for Linguistic Fieldwork

Sound Recording Set-Ups for Video in Linguistics Fieldwork

Tips for Recording Sound when Shooting Video in Linguistic Fieldwork

Transcription Mode in ELAN



