Gemini: A bilingual e-book reader concept application for Android

Aligning and storing text + audio translation pairs in multiple languages.

Propolis
8 min read · May 19, 2020

For one of my university projects, I was given the opportunity to take on a project of personal significance, so I decided to create a language study application I had been imagining for some time. As an expat and traveler I have frequently used free language courses and software, such as Duolingo, to improve my knowledge. However, the apps I encountered did not have content extensive enough to give significant exposure to the second language. I thought about looking up translated e-books to read in parallel; however, when I tried to read and listen to literature in a second language, I found the burden of continually looking up translations (especially for cultural expressions) too great to allow for an enjoyable experience.

From this problem, the idea eventually occurred to me that I might be able to create an application to synchronize public domain text translation pairs along with audio recordings to create a bilingual, immersive learning experience, wherein users could both listen to and read a given book in both their target and native languages. In the end, I was able to create a prototype of this concept. In this report, the design, logic and remaining challenges of my application “Gemini” will be summarized.

Application Design

The Gemini application has three primary views: a library view, wherein users can select and view metadata about included literary works; a reader view, wherein users can experience the audiobook(s) and text tracks in a synchronized view; and finally an editor, in which users can provide manual alignments to their own book data (recordings and translation corpora):

Library, Reader and Editor views.

The Gemini application is designed to be intuitive, although many core elements remain missing due to time constraints on implementation. The viewer and editor layouts, which are likely where users would spend the most time, are split into three views: text in the first language (TL1), text in the second language (TL2), and audio / playback controls (PB). These three views are chained together such that a change to the position of one view (such as scrolling TL2 or seeking to a new position in the audio) will update the position of the other views in real time.

The user should have the option to toggle between audio languages if files are available for both text sources; similarly, they should be able to maximize either text language as desired for an easier reading experience (with the translation being available at a click’s notice), or use the default dual view. Chaining the various views together may result in unexpected behavior for some users (e.g. if the user wants to scroll one of the views manually without autoscroll, but all views update), therefore controls must be offered to allow precision over what synchronizations are currently enabled.
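The chaining and per-track toggles described above can be sketched as plain Java, outside the Android framework. This is an illustrative model only, not the app's actual code; the class and method names are hypothetical, and a real implementation would map positions through alignment curves rather than copy a raw fraction:

```java
import java.util.ArrayList;
import java.util.List;

// Each track (TL1, TL2, PB) exposes its position as a fraction of total
// length; a coordinator pushes one track's movement to the others.
class TrackPosition {
    final String name;
    double fraction = 0.0;       // 0.0 .. 1.0 through the book/recording
    boolean syncEnabled = true;  // the per-track synchronization toggle

    TrackPosition(String name) { this.name = name; }
}

class SyncCoordinator {
    private final List<TrackPosition> tracks = new ArrayList<>();
    private boolean propagating = false; // guards against feedback loops

    void register(TrackPosition t) { tracks.add(t); }

    /** Called when one track moves (user scroll or audio seek). */
    void onMoved(TrackPosition source, double fraction) {
        if (propagating) return;   // ignore echoes from our own updates
        propagating = true;
        source.fraction = fraction;
        for (TrackPosition t : tracks) {
            if (t != source && t.syncEnabled) {
                t.fraction = fraction; // real app: evaluate alignment curve
            }
        }
        propagating = false;
    }
}
```

Disabling `syncEnabled` on one track gives exactly the precision control mentioned above: that track stops following the others while they remain chained.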

The editor view offers some additional controls versus the reading view, most notably floating buttons to add timing data markers at the current text position offset, for both text tracks, relative to the audio offset. This interface allows users to enable or disable track autoscroll via a toggle to preview the results of their synchronizations or continue adding points; by default, it does not auto-scroll if the current audio position is ahead of the latest existing timing point. The editor view also includes a save option for the entered timing data, which currently dumps it to the system log but should eventually write it to application storage.
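The marker-adding behavior above can be sketched as a small data structure, a hypothetical simplification rather than Gemini's actual editor code. It records (audio offset, text offset) pairs, rejects out-of-order points so the resulting curve stays monotonic, and answers the "is playback past the latest point?" question used by the autoscroll default:

```java
import java.util.ArrayList;
import java.util.List;

// One timing marker: the audio position paired with the text position
// that was on-screen when the user tapped the floating button.
class TimingPoint {
    final double audioMs;
    final int textOffset;

    TimingPoint(double audioMs, int textOffset) {
        this.audioMs = audioMs;
        this.textOffset = textOffset;
    }
}

class TimingTrack {
    private final List<TimingPoint> points = new ArrayList<>();

    /** Adds a marker; rejects points that would make the curve non-monotonic. */
    boolean addPoint(double audioMs, int textOffset) {
        if (!points.isEmpty()) {
            TimingPoint last = points.get(points.size() - 1);
            if (audioMs <= last.audioMs || textOffset < last.textOffset) {
                return false;
            }
        }
        points.add(new TimingPoint(audioMs, textOffset));
        return true;
    }

    /** True if playback has run past the latest marker (suppress autoscroll). */
    boolean isPastLastPoint(double audioMs) {
        return points.isEmpty() || audioMs > points.get(points.size() - 1).audioMs;
    }

    int size() { return points.size(); }
}
```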

Synchronization Logic

The Gemini reader app’s primary feature is the synchronization of multiple text and audio tracks, which also represented the most significant logic challenge of the application’s development. This synchronization requires data, for which many formats exist, each requiring its own style of pre-processing (Table 1):

Table 1. Common multimedia synchronization techniques and tools.

Synchronization data in the formats above can be represented by the following relational diagram (audio-to-audio synchronization is not possible, as audio plays at a fixed rate):

Figure 2. Possible synchronization vectors between formats of a literary work.

The ‘legs’ in this diagram can be implemented by either of the two format types above; for example, AL1 → TL1 could use an SRT file, while TL2 could be mapped to TL1 using a TMX file.

Issues with existing synchronization formats

A problem with both of the discussed synchronization formats is that they require significant effort (or a word-level-precision NLP alignment tool) to create, because the data must be mapped at an extremely high resolution. Subtitle files, for example, break text down into single or even fractional sentences so that each synchronized piece fits on a normal screen; TMX and other parallel corpus formats are similarly constrained and require exact translations down to the sentence or ‘segment’ level. Book synchronization data in these forms does exist, carefully curated by research groups such as OPUS (Tiedemann, 2012), community audiobook efforts, and culture preservation organizations, and Gemini has rudimentary support for importing the most common sync data formats (TMX, SRT). However, sync data availability for specific language pairs and literary works is extremely erratic: it is very rare to find all four data types (TL1, TL2, AL1, AL2) for a given work. It is easier to find complete sets of audio and text translation pairs that are presently unsynchronized, but, as discussed, adding synchronization with existing formats takes substantial work.

A solution using sparse, relative timing data

In designing Gemini, I decided to create a new technique for aligning language sets (‘langsets’) of data that significantly lowers the precision required for synchronization. In essence, my method uses interpolation to create a curve for each audio → text relationship; when these curves are evaluated at a given time offset (expressed as a percentage of total time), the resulting values represent an approximate alignment of the data. To understand and test this concept, I created a demo application which visualizes two text curves relative to one audio ‘curve’:

Figure 3. Alignment of normalized offset by interpolated curve evaluation.

The reason this concept works well for lengthy texts and audio recordings is that it requires much lower precision — in the reader application, a relatively large segment (generally at least four sentences) of text will always be displayed on-screen, so line- or character-level synchronization data is not needed. The use of interpolation (Gemini uses a custom bicubic algorithm) ensures that between mapped ‘exact’ points, the text will scroll smoothly, and because narrators generally speak at the same rate over a recording, few points are required for accurate scrolling. The real strength of this relative alignment format is that it can describe any relationship from Figure 2 — by normalizing the X and Y axes of two curves to a 0–1 range and domain (percent values), a text-and-audio curve pair or text-and-text curve pair can be evaluated at any offset with this method. A Gemini book having all four langset values (TL1, TL2, AL1, AL2) synchronized would require only four curves (two for each audio file, stored as point arrays) to represent all of the Figure 2 relationships.
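The evaluation step can be sketched with simple linear interpolation between normalized control points. Note that Gemini itself uses a custom bicubic algorithm, so this is an illustrative stand-in rather than the app's actual code; the idea of looking up a sparse, monotonic 0–1 curve is the same:

```java
// A sparse alignment curve: control points map a normalized audio
// position (x) to a normalized text position (y), both in 0..1.
class AlignmentCurve {
    private final double[] xs; // normalized audio offsets, ascending
    private final double[] ys; // normalized text offsets, ascending

    AlignmentCurve(double[] xs, double[] ys) {
        this.xs = xs;
        this.ys = ys;
    }

    /** Maps a normalized audio position to a normalized text position. */
    double evaluate(double x) {
        if (x <= xs[0]) return ys[0];
        if (x >= xs[xs.length - 1]) return ys[ys.length - 1];
        int i = 1;
        while (xs[i] < x) i++;              // find the surrounding segment
        double t = (x - xs[i - 1]) / (xs[i] - xs[i - 1]);
        return ys[i - 1] + t * (ys[i] - ys[i - 1]);
    }
}
```

Because both axes are normalized, two such curves sharing an audio axis (AL1 → TL1 and AL1 → TL2) implicitly define the TL1 → TL2 relationship as well, which is why four curves suffice for all of the Figure 2 legs.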

Challenges and areas for improvement

Gemini is still very much in the proof-of-concept stage of development, with many elements not yet accessible through the user interface at all, and others not working as expected. Some areas I can focus on improving and implementing are:

  • UI/UX: Add landscape orientation layouts, dark mode, a single/dual text view toggle, and other style options; grow the dataset to represent a more diverse array of genres and languages.
  • Functionality: Add editor support for segment alignment, and relative timing support for the main reader view (these currently use separate display types: RecyclerView for segmented data and TextView for unsegmented data). Add an option for users to upload their own langset data files, or perhaps work on public/community alignment projects.
  • Porting: Port existing open-source audio ↔ text synchronization software (e.g. Aeneas) to Java/Android to allow in-app auto-syncing of an audiobook and its text; a parallel corpus analyzer could also be implemented to auto-segment and align two texts based on machine translation (MT) and NLP concepts.

One challenging development decision is how I can programmatically improve the intuitiveness of the reader view synchronization. On the Android platform, certain listener methods are missing or do not offer the required functionality by default for this project; for example, NestedScrollView has no method allowing a distinction between user scroll and programmatic scroll events (Android Developers, n.d.). I may have to subclass related view types to implement such functionality myself, or solve the problem with helper methods.
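One common helper-method workaround for this, sketched here outside the Android framework with hypothetical names, is to set a flag around every programmatic scroll so the shared listener can classify each event. An Android version would wrap the `smoothScrollTo()` call on the view:

```java
// Distinguishes user-initiated scrolls from programmatic ones by
// flagging the window during which programmatic scrolling occurs.
class ScrollGuard {
    private boolean programmatic = false;
    boolean lastEventWasUser = false;

    /** Wraps a programmatic scroll so the listener can ignore it. */
    void scrollProgrammatically(Runnable doScroll) {
        programmatic = true;
        try {
            doScroll.run(); // e.g. view.smoothScrollTo(...) on Android
        } finally {
            programmatic = false;
        }
    }

    /** Invoked from the scroll-changed listener. */
    void onScrollChanged() {
        lastEventWasUser = !programmatic;
    }
}
```

A caveat of this pattern on a real device is that smooth scrolls are asynchronous, so the flag would need to persist until the animation settles rather than clearing immediately.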

Conclusion

Of the six applications I developed for my Android development class, this final project, Gemini, was by far the most complex; it also has the most potential to be useful and unique as a published application. Developing the logic and design flows forced me to revisit elements learned from each of the course modules and improve them, while also diving into a multitude of new techniques and mathematical concepts I had previously avoided, such as interpolation and data normalization. While the technical knowledge and situational (code library) knowledge I gained is valuable, it is perhaps less important than the huge boost in confidence I have experienced in my own ability to be a developer and create useful software.

References

Android Developers. (n.d.). NestedScrollView. Retrieved from https://developer.android.com/reference/android/support/v4/widget/NestedScrollView

Aulamo, M., & Tiedemann, J. (2019). OPUS, NLPL, FISKMO, and how it all fits together. Retrieved from https://blogs.helsinki.fi/language-technology/files/2019/02/Tools-and-interfaces.pdf

Tiedemann, J. (2012, May). Parallel Data, Tools and Interfaces in OPUS. In Proceedings of LREC 2012 (pp. 2214–2218).
