Wolfram Language


Determine the Language of a Text

Make a function that determines the language that a text is written in.


code

french = Import["http://fr.wikipedia.org/wiki/Main_Page"];
english = Import["http://en.wikipedia.org/wiki/Main_Page"];
german = Import["http://de.wikipedia.org/wiki/Main_Page"];
spanish = Import["http://es.wikipedia.org/wiki/Main_Page"];
language = Classify[{french -> "French", english -> "English", german -> "German", spanish -> "Spanish"}]

how it works

Samples of text in various languages are abundant on the web. Import French, English, German, and Spanish texts to train a classifier:

french = Import["http://fr.wikipedia.org/wiki/Main_Page"];
english = Import["http://en.wikipedia.org/wiki/Main_Page"];
german = Import["http://de.wikipedia.org/wiki/Main_Page"];
spanish = Import["http://es.wikipedia.org/wiki/Main_Page"];
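The four imports above can also be driven from a single association, which may be convenient if more languages are added later. The following is only a sketch: the names urls, texts, and trainingSet are hypothetical and do not appear in the original example, and the "Plaintext" element is requested explicitly on the assumption that each page should come back as one string.

```wolfram
(* Hypothetical names: urls, texts, trainingSet are illustrative only *)
urls = <|
   "French" -> "http://fr.wikipedia.org/wiki/Main_Page",
   "English" -> "http://en.wikipedia.org/wiki/Main_Page",
   "German" -> "http://de.wikipedia.org/wiki/Main_Page",
   "Spanish" -> "http://es.wikipedia.org/wiki/Main_Page"
   |>;

(* Mapping over an association applies Import to each value and keeps the keys *)
texts = Import[#, "Plaintext"] & /@ urls;

(* Convert <|label -> text|> into the {text -> label, ...} form accepted by Classify *)
trainingSet = KeyValueMap[#2 -> #1 &, texts];

(* Sanity check: number of characters imported for each language *)
StringLength /@ texts
```

Passing trainingSet to Classify should then be equivalent to listing the four rules by hand.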

These are the first 150 characters of the French text:

StringTake[french, 150]

Make a classifier function using the training texts:

language = Classify[{french -> "French", english -> "English", german -> "German", spanish -> "Spanish"}]
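Once the classifier exists, it can be worth checking what was actually trained. The lines below are a small sketch that assumes language is the ClassifierFunction defined above; the method reported depends on what Classify selected automatically, so no particular output is shown here.

```wolfram
(* Assumes `language` is the ClassifierFunction created above *)

(* The class labels the classifier can return *)
Information[language, "Classes"]

(* The method that Classify chose automatically *)
Information[language, "Method"]
```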

Test the classifier on texts that were not in the training set:

language[ExampleData[{"Text", #}]] & /@ {"AliceInWonderland", "LesFleursDuMal", "DonQuixoteISpanish", "UNHumanRightsGerman"}
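The same four texts can also be scored against their known languages with ClassifierMeasurements. This is a sketch only: testSet and measurements are hypothetical names, the labels are inferred from the titles of the example texts, and with just four test examples the accuracy figure is a very coarse estimate.

```wolfram
(* Assumes `language` is the ClassifierFunction created above *)
(* Hypothetical names: testSet, measurements *)
testSet = {
   ExampleData[{"Text", "AliceInWonderland"}] -> "English",
   ExampleData[{"Text", "LesFleursDuMal"}] -> "French",
   ExampleData[{"Text", "DonQuixoteISpanish"}] -> "Spanish",
   ExampleData[{"Text", "UNHumanRightsGerman"}] -> "German"
   };

(* Evaluate the classifier on the labeled test set *)
measurements = ClassifierMeasurements[language, testSet];

(* Fraction of test texts classified correctly *)
measurements["Accuracy"]

(* Which languages, if any, get confused with one another *)
measurements["ConfusionMatrixPlot"]
```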

Make a table of classified phrases:

{#, language[#]} & /@ {"the house is blue", "la maison est bleue", "la casa es azul", "das Haus ist blau"} // TableForm
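Short phrases carry little evidence, so it can help to see how confident the classifier is rather than only its top answer. The sketch below assumes language is the ClassifierFunction defined above; the phrase and the 0.9 threshold are arbitrary choices for illustration.

```wolfram
(* Assumes `language` is the ClassifierFunction created above *)

(* Probability assigned to every language for one phrase *)
language["la casa es azul", "Probabilities"]

(* Only the two most likely languages *)
language["la casa es azul", "TopProbabilities" -> 2]

(* Return Indeterminate unless some language reaches 90% probability *)
language["la casa es azul", IndeterminateThreshold -> 0.9]
```

A threshold like this trades coverage for reliability: more inputs come back as Indeterminate, but the labels that are returned have higher estimated probability.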