Language detection using ML Kit's language detection API

ML Kit Language detection

When you are presented with text that you can't read or understand, detecting its language is the first step towards interpreting it. Apps like iTranslate, Google Translate, SayHi, and iHandy can translate most well-known languages for us. These apps have improved rapidly: they not only translate between the languages of your choice but also detect the language as you start typing.

Language detection is not as easy as it seems unless you are choosing from a very small set of candidate languages, because natural language is dynamic and two or more languages can be very similar.

We need a language detector because user-provided text often arrives without any language information. Google's ML Kit provides an API for language detection, which can be used to determine the language of a string of text. A number of such APIs are offered by leading tech companies, with more or less the same capabilities. ML Kit, specifically, has the following capabilities:

  • As of now, it can detect over a hundred different languages. The complete list is here.
  • It can identify Arabic, Bulgarian, Greek, Hindi, Japanese, Russian, and Chinese text in both native and romanized script.

Now let's build a language detector app using this API. This project requires you to set up a project in the Firebase console. Refer to steps 1 to 4 of ML Kit Tutorial: How to recognize and extract text in images.

First, make sure you have added the following dependencies to the app-level build.gradle file:

dependencies {
  // ...
  implementation 'com.google.firebase:firebase-ml-natural-language:18.1.1'
  implementation 'com.google.firebase:firebase-ml-natural-language-language-id-model:18.0.2'
}

After that, create one EditText for receiving the user input, one Button to trigger detection, and one TextView to display the detected language. A sample is shown below:

ML Kit Language detection sample

The XML layout is shown below:

<?xml version="1.0" encoding="utf-8"?>
<android.support.constraint.ConstraintLayout xmlns:android="http://schemas.android.com/apk/res/android"
    xmlns:app="http://schemas.android.com/apk/res-auto"
    xmlns:tools="http://schemas.android.com/tools"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    tools:context=".MainActivity">

    <EditText
        android:id="@+id/edit_lan"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:layout_marginStart="24dp"
        android:layout_marginLeft="24dp"
        android:layout_marginTop="116dp"
        android:layout_marginEnd="24dp"
        android:layout_marginRight="24dp"
        android:layout_marginBottom="570dp"
        android:ems="10"
        android:hint="Enter a text"
        android:inputType="text"
        app:layout_constraintBottom_toBottomOf="parent"
        app:layout_constraintEnd_toEndOf="parent"
        app:layout_constraintStart_toStartOf="parent"
        app:layout_constraintTop_toTopOf="parent" />

    <Button
        android:id="@+id/view_lan"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:layout_marginStart="160dp"
        android:layout_marginLeft="160dp"
        android:layout_marginTop="16dp"
        android:layout_marginEnd="160dp"
        android:layout_marginRight="160dp"
        android:layout_marginBottom="8dp"
        android:text="Language"
        app:layout_constraintBottom_toTopOf="@+id/lan"
        app:layout_constraintEnd_toEndOf="parent"
        app:layout_constraintStart_toStartOf="parent"
        app:layout_constraintTop_toBottomOf="@+id/edit_lan" />

    <TextView
        android:id="@+id/lan"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:layout_marginStart="100dp"
        android:layout_marginLeft="100dp"
        android:layout_marginTop="8dp"
        android:layout_marginEnd="100dp"
        android:layout_marginRight="100dp"
        android:layout_marginBottom="411dp"
        android:text="TextView"
        app:layout_constraintBottom_toBottomOf="parent"
        app:layout_constraintEnd_toEndOf="parent"
        app:layout_constraintStart_toStartOf="parent"
        app:layout_constraintTop_toBottomOf="@+id/view_lan" />
</android.support.constraint.ConstraintLayout>

Now, to detect the language of a string, get an instance of FirebaseLanguageIdentification, and then pass the string to the identifyLanguage() method as shown below:

FirebaseLanguageIdentification languageIdentifier =
    FirebaseNaturalLanguage.getInstance().getLanguageIdentification();
languageIdentifier.identifyLanguage(text)
    .addOnSuccessListener(
        new OnSuccessListener<String>() {
          @Override
          public void onSuccess(@Nullable String languageCode) {
            // Compare string contents with equals(); != only compares references.
            if (languageCode != null && !"und".equals(languageCode)) {
              Log.i(TAG, "Language: " + languageCode);
            } else {
              Log.i(TAG, "Can't identify language.");
            }
          }
        })
    .addOnFailureListener(
        new OnFailureListener() {
          @Override
          public void onFailure(@NonNull Exception e) {
            // Model couldn't be loaded or other internal error.
            // ...
          }
        });

On success, a BCP-47 language code is passed to the success listener, indicating the language of the text. If no language could be confidently detected, the code und (undetermined) is passed.

By default, ML Kit returns a value other than und only when it identifies the language with a confidence value of at least 0.5. You can change this threshold by passing a FirebaseLanguageIdentificationOptions object to getLanguageIdentification() as shown below:

FirebaseLanguageIdentification languageIdentifier = FirebaseNaturalLanguage
        .getInstance()
        .getLanguageIdentification(
                new FirebaseLanguageIdentificationOptions.Builder()
                        .setIdentifyLanguageConfidenceThreshold(0.34f)
                        .build());

You are now done and can run the project. If you want to get the full language name from the language code, use a Locale object as shown below:

Locale loc = Locale.forLanguageTag(lan_string);
String lan_name = loc.getDisplayLanguage();
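The lookup above can be wrapped in a small helper that also handles the und code. This is only a sketch: the class name LanguageNames, the method displayName, and the "Unknown" fallback label are assumptions made for illustration and are not part of ML Kit. It uses Locale.forLanguageTag because the returned codes are BCP-47 tags, and some of them (for example hi-Latn for romanized Hindi) carry a script subtag that the single-argument Locale constructor would treat as part of the language code.

```java
import java.util.Locale;

// Hypothetical helper (not part of ML Kit): maps a BCP-47 code to a readable name.
final class LanguageNames {
    private LanguageNames() {}

    // Returns the English display name of the language, "Unknown" for a null
    // or "und" code, or the code itself if the JDK has no name for it.
    static String displayName(String languageCode) {
        if (languageCode == null || "und".equals(languageCode)) {
            return "Unknown";
        }
        // forLanguageTag parses language, script, and region subtags.
        Locale locale = Locale.forLanguageTag(languageCode);
        // Pass an explicit locale so the result does not depend on device settings.
        String name = locale.getDisplayLanguage(Locale.ENGLISH);
        return name.isEmpty() ? languageCode : name;
    }
}
```

For example, LanguageNames.displayName("en") returns "English". To show the name in the user's own language instead, Locale.getDefault() could be passed in place of Locale.ENGLISH.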

The full code and the project can be downloaded from here

You can also get a list of all the possible languages of a string.

Getting all the possible languages of a string

To get the list of all possible languages of a string, get an instance of FirebaseLanguageIdentification, and then pass the string to the identifyAllLanguages() method as shown below:

FirebaseLanguageIdentification languageIdentifier =
    FirebaseNaturalLanguage.getInstance().getLanguageIdentification();
languageIdentifier.identifyAllLanguages(text)
    .addOnSuccessListener(
        new OnSuccessListener<List<IdentifiedLanguage>>() {
          @Override
          public void onSuccess(List<IdentifiedLanguage> identifiedLanguages) {
            // Log every candidate language with its confidence score.
            for (IdentifiedLanguage identifiedLanguage : identifiedLanguages) {
              String language = identifiedLanguage.getLanguageCode();
              float confidence = identifiedLanguage.getConfidence();
              Log.i(TAG, language + " (" + confidence + ")");
            }
          }
        })
    .addOnFailureListener(
        new OnFailureListener() {
          @Override
          public void onFailure(@NonNull Exception e) {
            // Model couldn't be loaded or other internal error.
            // ...
          }
        });

If the call succeeds, a list of IdentifiedLanguage objects is passed to the success listener. From each object, you can get the language's BCP-47 code and the confidence that the string is in that language. Note that ML Kit doesn't identify multiple languages in a single string: each value indicates the confidence that the entire string is in the given language.
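If you only need the best guess from such a list, you can pick the entry with the highest confidence. The sketch below runs without the Android SDK, so it uses a hypothetical Candidate class as a stand-in for IdentifiedLanguage; the names TopLanguage, Candidate, and mostLikely are assumptions for illustration, not ML Kit APIs.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: choose the most confident candidate from a result list.
final class TopLanguage {
    private TopLanguage() {}

    // Stand-in for ML Kit's IdentifiedLanguage (code plus confidence).
    static final class Candidate {
        final String languageCode;
        final float confidence;

        Candidate(String languageCode, float confidence) {
            this.languageCode = languageCode;
            this.confidence = confidence;
        }
    }

    // Returns the code with the highest confidence, or "und" for an empty list.
    static String mostLikely(List<Candidate> candidates) {
        return candidates.stream()
                .max(Comparator.comparingDouble((Candidate c) -> c.confidence))
                .map(c -> c.languageCode)
                .orElse("und");
    }
}
```

Inside the real success listener, the same comparison could be applied to getConfidence() and getLanguageCode() on each IdentifiedLanguage.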

By default, ML Kit returns only languages with confidence values of at least 0.01. You can change this threshold by passing a FirebaseLanguageIdentificationOptions object to getLanguageIdentification() as follows:

FirebaseLanguageIdentification languageIdentifier = FirebaseNaturalLanguage
        .getInstance()
        .getLanguageIdentification(
                new FirebaseLanguageIdentificationOptions.Builder()
                        .setIdentifyAllLanguagesConfidenceThreshold(0.5f)
                        .build());

If no language meets this threshold, the list will have one item, with the value und.
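To make the threshold behaviour concrete, here is a small plain-Java sketch of the contract described above: keep every language whose confidence is at least the threshold, and fall back to a single und entry when nothing qualifies. It illustrates the documented behaviour only and is not how ML Kit is implemented; ThresholdSketch and codesAtOrAbove are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the confidence-threshold contract.
final class ThresholdSketch {
    private ThresholdSketch() {}

    // Keeps the codes whose confidence is at least `threshold`;
    // returns a one-item list containing "und" when none qualifies.
    static List<String> codesAtOrAbove(Map<String, Float> confidences, float threshold) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Float> entry : confidences.entrySet()) {
            if (entry.getValue() >= threshold) {
                kept.add(entry.getKey());
            }
        }
        return kept.isEmpty() ? Collections.singletonList("und") : kept;
    }
}
```

A lower threshold yields more candidates to inspect, while a higher one trades coverage for certainty.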

For a quick setup, you may download the project directly from here, or you may refer to this repo for all the source code.

That's it! You have just learned how to use the ML Kit language detection API. Suggestions and questions are welcome.






Author:


Ratul Doley
Entrepreneur and AI researcher. Currently learning and working on Unsupervised learning and Data Clustering. Professional Android app developer and designer. Updated Nov 15, 2018