At SignAll, we do not use open-source databases for ASL translation, and here is why.
David Retek, Chief Technology Officer of SignAll Technologies.
My name is David Retek. I supervise all technological processes of creating the first-in-the-world automated sign language translation system. It translates from American Sign Language (ASL) to English. In this article, I will tell you why we refused to use open-source sign language (SL) databases and created our own. Also, I will offer a sneak peek into how we do it and why it is unique (spoiler alert – it’s because of the way we collect non-manuals and annotate data).
When developing the plan to create the translation system (which would later be called “SignAll Chat”), we already knew that open source sign language databases existed. We examined a few decent samples to understand whether we could skip the data collection stage in favor of using already existing options.
The usability of a database depends on many parameters. The first thing that comes to mind - its size - is only one of them. The number of signs is important, but let’s see what other factors determine its usefulness. Sign language databases differ depending on their intended purpose, too.
Several universities have excellent databases developed for linguistic research. There, linguistic considerations (rather than machine-learning needs) were paramount in choosing the method of recording, and even more so in specifying the recorded content. Among the many parameters, we identified the following as most important for us:
- The technical level of the devices used for recording. Namely, whether a depth sensor is used, how many cameras are used, their positions, and their quality.
- The number and variety of participants. The more diverse the participants are in age, gender, and regional distribution, the better the language is represented. Even physical characteristics of the signers matter (height, body shape, skin color, etc.)
- The number of times the given signs were recorded, by how many people, and what strategy was used to record each sign. (For purposes of teaching AI, the number of repetitions matters, and they should follow a number of rules for proper recording. I will come back to this later.)
When making the decision about the database, we were aware of a few big (and seemingly decent) databases. Their technical approach corresponded well with the majority of the parameters named above. However, concerning the last aspect, the data they recorded didn’t fit our purpose.
To give you a better idea, I’d like to illustrate one of the good examples. It is from the University of Hamburg, which focuses on German Sign Language (DGS). The institution had a complex data-recording plan. It was one of the most all-encompassing approaches to collecting signs. The situations, topics, tasks - everything was clearly defined and well-thought-out. But the database was created for linguistic purposes.
The analysis made clear that we couldn’t avoid creating our own database. To make sure it would work for teaching the AI, we had to meet the following conditions:
IF DATA IS KING
Conditions for data collection
Signs should be produced multiple times by many people. We learned that asking a person to repeat the same sign a few times in a row results in a deviation of the movement. Therefore, if we want the same person to repeat a sign, we include it in a sign sequence when it is mixed with other signs.
Signs should be produced in a range of different situations. Our tasks included signing such things as:
- Signing sentences
- Sign sequences - signing from a list of random signs (“random” here means they have no logical connection)
- Describing pictures (which allowed for recording natural non-manuals and transitions between signs)
- Telling a story (which minimized the influence of spoken language rules, like word order, auxiliary verbs, etc.)
Because of this wide variety of content, we were able to collect data on signs’ connections (flow) and non-manual data (body posture, facial expression).
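The mixing strategy described above can be sketched in code. The following is a minimal illustration (the sign names, counts, and shuffling rule are my own assumptions, not SignAll’s actual procedure): target signs are repeated several times, but always interleaved with filler signs so that no sign is produced twice in a row.

```python
import random

def build_recording_sequence(target_signs, fillers, repetitions=3, seed=42):
    """Interleave repeated target signs with fillers so that no sign
    appears twice in a row (repeating back-to-back deviates the movement)."""
    pool = target_signs * repetitions + fillers
    rng = random.Random(seed)
    while True:
        rng.shuffle(pool)
        # accept the shuffle only if no sign repeats back-to-back
        if all(a != b for a, b in zip(pool, pool[1:])):
            return pool

sequence = build_recording_sequence(
    ["MOTHER", "SCHOOL", "BOOK"],          # signs we want repeated
    ["CAT", "RAIN", "HAPPY", "TOMORROW"],  # unrelated filler signs
)
```

A signer recording this sequence produces each target sign three times without ever repeating it consecutively, which is the point of mixing signs into sequences.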
Concerning the recording setup, there are three points:
- Our cameras were placed as close to the signer as possible for the highest relative resolution
- We used several camera perspectives simultaneously: left, right, and top. There was also a Kinect in the center for recording depth
- For further accuracy, we used marked gloves. They allow the system to identify the exact position of each part of the hand (fingertips, middle phalanges, and the internal and external middle parts of the palms)
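To make the glove markers concrete, here is a minimal sketch of the kind of per-frame hand record such a setup could produce. The landmark names and coordinate layout are my illustrative assumptions, not SignAll’s actual schema:

```python
from dataclasses import dataclass, field

# Landmarks named after the glove markers described above:
# fingertips, middle phalanges, and inner/outer palm centers.
LANDMARKS = (
    [f"{finger}_tip" for finger in ("thumb", "index", "middle", "ring", "pinky")]
    + [f"{finger}_mid_phalanx" for finger in ("index", "middle", "ring", "pinky")]
    + ["palm_inner", "palm_outer"]
)

@dataclass
class HandFrame:
    """3D positions of each glove marker in one video frame."""
    timestamp_ms: int
    positions: dict = field(default_factory=dict)  # landmark name -> (x, y, z)

frame = HandFrame(timestamp_ms=0)
for name in LANDMARKS:
    frame.positions[name] = (0.0, 0.0, 0.0)  # placeholder coordinates
```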
All samples should be produced by native signers.
This is important and self-evident. Most databases we know fulfill this requirement.
After gathering the data, we had to validate them and identify modifications of signs. We had to clarify what is personal style and what is the core of the sign itself. We also separated so-called “noise” by comparing each sign with a “basic form.” *
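One plausible way to compare a recorded sign against its “basic form” is a plain frame-by-frame Euclidean distance over time-aligned trajectories. The article does not describe the actual algorithm, so treat this purely as an illustration of the idea:

```python
import math

def distance_to_basic_form(recording, basic_form):
    """Mean per-frame Euclidean distance between a recorded sign
    trajectory and its reference ("basic form") trajectory.
    Both are equal-length lists of (x, y, z) points; a large
    distance suggests personal style or noise worth inspecting."""
    if len(recording) != len(basic_form):
        raise ValueError("trajectories must be time-aligned first")
    total = sum(math.dist(p, q) for p, q in zip(recording, basic_form))
    return total / len(recording)

basic = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
styled = [(0.0, 0.1, 0.0), (1.0, 0.1, 0.0), (2.0, 0.1, 0.0)]
score = distance_to_basic_form(styled, basic)  # small, consistent offset
```

In practice trajectories would first need temporal alignment (e.g. dynamic time warping) before such a comparison makes sense.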
Our own recording program
As mentioned above, using proper data-recording facilities was important for us. Keeping in mind the expected result, we developed our own recording program and integrated it into our framework. The program not only allowed us to record data in the necessary way, but also facilitated the recording process, data processing, and AI training.
Overall, our approach to data gathering was multifaceted. We kept the big picture in mind while preparing smaller tasks in parallel. For example, we also recorded special data for training sub-algorithms and sub-models. At the same time, it was important not only to record miscellaneous signs and match them, but to associate grammar and linguistic structures with them.
We are continuously collecting signs and growing our database. Before starting each recording project, we clearly identify what kind of data we expect to collect. Some of the tasks are tech-oriented. For example, we can make a recording project to complement our data with more body movements and facial expressions for more precise recognition. In some of the recording projects, we focus on signs for specific use cases, or on a set of signs for a given topic.
The process of data gathering
We organized two big recording projects in partnership with Washington D.C.’s Gallaudet University, the first university in the world specifically for Deaf/HoH people.
The fundamental first project secured massive data covering a range of topics. After its success, we organized a second session, for which we recruited 30 native ASL users to collect variations of signs on academic and university-related topics. The participants described situations, stories, pictures, and sequences of signs related to the field of education. In this round, the goals were much more specific: we collected a database for an educational setting, a library that would include an extensive and nuanced vocabulary related to education.
As mentioned before, we continue to collect data and are constantly expanding our database. Currently, it consists of over 300,000 annotated video segments (and counting): hundreds of hours of validated signs, recorded with top accuracy using a multi-camera setup and specialized gloves. But let me get back to why quantity is not the be-all and end-all. Read on to see what is even more important than the sheer number of signs.
ANNOTATION IS QUEEN
Creating a database is just the first step. It’s also the easiest one. Annotation is more complex, challenging, and requires much more effort. A few words about annotation. It is a collective term. The basic definition is “a note by way of explanation or comment added to something” (e.g. a video). For us, it is mostly a set of metadata in which we collect and record what a person who knows sign language sees in a given video. To use the data, we have to annotate each recording manually. Let me repeat this. We describe each and every recording. Manually. Here is how we do that.
As with data recording, we developed our own annotation software. Since we used multiple devices and perspectives for recording, we had to make sure the annotation software would properly synchronize the recordings from all of the cameras. An annotator can see a sign from multiple perspectives simultaneously, annotating the recordings from all angles at once. As a result, we have extremely accurate annotations.
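The core of such synchronization can be reduced to a small calculation: mapping one global annotation timestamp to a frame index in each camera’s recording, given per-camera start offsets. The offsets, names, and frame rate below are hypothetical, chosen only to illustrate the idea:

```python
def frame_index(annotation_time_ms, camera_start_ms, fps):
    """Map a global annotation timestamp to a frame index in one
    camera's recording, given that camera's start offset and rate."""
    local_ms = annotation_time_ms - camera_start_ms
    if local_ms < 0:
        raise ValueError("annotation precedes this camera's recording")
    return round(local_ms * fps / 1000)

# Hypothetical offsets: cameras started a few milliseconds apart.
cameras = {"left": 0, "right": 12, "top": 7}

# The same annotated instant, resolved per camera at 30 fps:
indices = {name: frame_index(5000, start, 30) for name, start in cameras.items()}
```

With small start offsets, all views resolve to the same frame index, which is what lets an annotator mark one timeline and see it land correctly in every camera angle.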
What makes a good annotation?
The annotation includes two major points: timeline (start and end time) and content. While start and end time may sound easy, it is challenging in ASL to identify exactly when each sign starts. Start time is of critical importance for future recognition. One of the prominent features of our database is that it is annotated with so-called “frame-level accuracy,” meaning there is no 5-10-frame (in other words, 150-300-millisecond) uncertainty. It may seem like a negligible error at first, but 150-300 milliseconds is the difference between the core part of the sign and the transition to another sign. A deviation of this magnitude is, in many cases, comparable to the total length of the sign!
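The arithmetic behind those numbers is simple. Assuming roughly 30 milliseconds per frame (about 33 fps; the article does not state the exact frame rate, so this is my assumption), a 5-10-frame annotation error translates directly into the 150-300-millisecond range mentioned above:

```python
def uncertainty_ms(frames, frame_duration_ms=30):
    """Timing error of an annotation that is off by `frames` frames,
    assuming ~30 ms per frame (about 33 fps; an assumed rate)."""
    return frames * frame_duration_ms

low, high = uncertainty_ms(5), uncertainty_ms(10)  # 150 ms, 300 ms
```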
At SignAll, each sign undergoes a two-step annotation procedure. First, one of SignAll’s native ASL team members provides a transcription. This is considered a guideline, in which ASL experts specify the content and the approximate time interval. In the second step, another annotator puts time marks and selects predefined labels on the recordings.
SignAll’s database consists of a great number of commonly used signs and their versions. In ASL, as in any other language, one word may correspond to several signs. Therefore, we record multiple options, which we label with #1, #2, and so on, identifying the sign variants for each word’s different meanings. At the same time, a single sign may itself carry multiple meanings, and we address this nuance as well.
To give an example, the system includes 19 sign versions of the word “view,” 10 for “awesome,” 13 for “all,” and 11 for “love.” These are either region-specific, or their meanings are contextual.
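The #1, #2, … labeling scheme could be organized along these lines. The variant counts for these four words come from the figures above; the storage layout and function are my own illustrative assumptions:

```python
# Variant counts per English word, as reported above.
VARIANT_COUNTS = {"view": 19, "awesome": 10, "all": 13, "love": 11}

def variant_labels(word):
    """Generate labels like 'view#1' ... 'view#19' for each
    sign variant recorded for an English word."""
    return [f"{word}#{i}" for i in range(1, VARIANT_COUNTS[word] + 1)]

labels = variant_labels("love")  # ['love#1', ..., 'love#11']
```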
I love my job. It keeps me focused. There are no 50 lines of mystical code that make everything work. Big achievements consist of many smaller wins. We move forward, making mistakes and advancing, keeping the big picture in mind. At any time, we could realize that an earlier decision was not the right one. Thus, we go back and test alternatives to find a better solution. This makes SignAll unique.
Thanks to our perseverance, our two great solutions are already available on the market and have received very impressive feedback. This continues to fuel the desire of the whole team to make the products and the solution even better.
* A basic form is a different topic. It is no longer a matter of the database, but of the algorithm that uses it. I may touch upon this in my next article.