Husky · (This post was last modified: 02-18-2016, 01:49 PM by Husky.)

At last, for once I'm going to spam in the appropriate thread.

1. opensource.com/life/15/9/open-source-extract-text-images

Quote:Google's Optical Character Recognition (OCR) software works for 248+ languages

Posted 18 Sep 2015 by Subhashish Panigrahi

Image by Kate Ter Haar. Modified by opensource.com. CC BY-SA 2.0.

Google's Optical Character Recognition (OCR) software now works for over 248 world languages (including all the major South Asian languages). It's quite simple and easy to use, and can detect most languages with over 90% accuracy.

The technology extracts text from images, scans of printed text, and even handwriting, which means text can be extracted from pretty much any old books, manuscripts, or images.

Google's OCR is probably using dependencies of Tesseract, an OCR engine released as free software, or OCRopus, a free document analysis and optical character recognition (OCR) system that is primarily used in Google Books. Developed as a community project during 1995-2006 and later taken over by Google, Tesseract is considered one of the most accurate OCR engines and works for over 60 languages. The source code is available on GitHub.

(BTW, to give credit where it's due: Tesseract, IIRC, started its existence as the brilliant OCR software by HP, some decades ago. It may have had an open-source phase before Google. Eventually Google got it, tinkered with it, rebranded it - of course - and the crowd goes wild.

But HP's OCR-ing didn't work on "South Asian" - whatever that is - languages, IIRC.)

The OCR project support page offers additional details on preserving character formatting for things like bold and italics after OCR in the output text:

Quote:When processing your document, we attempt to preserve basic text formatting such as bold and italic text, font size and type, and line breaks. However, detecting these elements is difficult and we may not always succeed. Other text formatting and structuring elements such as bulleted and numbered lists, tables, text columns, and footnotes or endnotes are likely to get lost.

Tamil-language Wikimedian and Wikimedia India's program director Ravishankar Ayyakkannu said this on Facebook after testing: "For some of the languages like Malayalam and Tamil, the OCR works with almost 100% accuracy, along with support in formatting like auto cropping, separating text by discarding images, and ignoring colored backgrounds." Native speakers of the following Indian languages (Bangla, Malayalam, Kannada, Odia, Tamil, and Telugu) also commented on a Facebook post with feedback after testing the OCR.

However, for a few scripts like Gurmukhi (used to write Punjabi), the output after OCR is quite poor and results in gibberish text in different scripts.
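(The article doesn't say how those "90%" and "almost 100%" figures were measured. If you want to check a claim like that on your own scans, one rough way is to hand-type a page and compare it character by character with the OCR output. Below is a minimal Python sketch of that idea; the function name `char_accuracy` is made up for illustration, and the similarity ratio from the standard `difflib` module is only an approximation of a proper character error rate, not whatever Google or the testers used.)

```python
from difflib import SequenceMatcher


def char_accuracy(ocr_text: str, reference: str) -> float:
    """Rough character-level similarity between OCR output and a
    hand-typed reference, from 0.0 (nothing in common) to 1.0 (identical).

    This is an assumption about how one could score OCR output,
    not the metric behind the figures quoted in the article.
    """
    # autojunk=False stops difflib from ignoring very frequent
    # characters (spaces, common vowels) on long texts.
    matcher = SequenceMatcher(None, ocr_text, reference, autojunk=False)
    return matcher.ratio()


if __name__ == "__main__":
    typed = "The quick brown fox"
    scanned = "The qu1ck brovvn fox"  # made-up OCR mistakes
    print(f"similarity: {char_accuracy(scanned, typed):.2f}")
```

Output that comes back in the wrong script altogether, as reported for Gurmukhi, would score close to zero on a check like this.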

A tutorial to convert text in Odia (Indian language) from a scanned image using Google's OCR. Designed by Subhashish Panigrahi. CC BY-SA 4.0

Overall, this is quite a large leap for languages that have old texts that have not yet been digitized. Old and valuable text in many languages can now be digitized and shared over the internet using platforms like Wikisource.

(Make sure not to share Hindoo stuffs with the alien appropriating demons or in such a manner that the latter can get their claws on it. Else you're helping the enemy. Think first. Then plan how to make it available to Hindoos alone - for all time - and then "share" with this restriction. It is NOT and NEVER WAS universal knowledge, contrary to what "Hinduism is open-source" activists try to brainwash you with. For instance, many Hindoo heathen texts state clearly that they are secret knowledge/texts, that they are for Hindu heathens alone. Meaning: not aliens and alienated, not non-Hindoo Indics.

Remember: Everything you put on Google drive can be analysed by Google and probably will be. There's no such thing as "private" right, just that they won't share with others. Well, except their government or even the Chinese govt, as multiple court cases involving Google/Yahoo/etc demonstrated.)

Editor's note: Article has been updated based on community feedback. We changed "Google's OCR partly uses Tesseract, an OCR engine released as free software" to "Google's OCR is probably using dependencies of Tesseract, an OCR engine released as free software, or OCRopus, a free document analysis and optical character recognition (OCR) system that is primarily used in Google Books." If you have additional feedback on the article or technology, please let us know in the comments. -Rikki Endsley

2. support.google.com/drive/answer/176692

Not sure why they don't list scripts in place of languages, but maybe for the west the two usually end up meaning the same thing (?)

Quote:Supported languages

List of OCR supported languages

Acehnese, Acholi, Adangme, Afrikaans, Akan, Albanian, Algonquinian, Amharic, Ancient Greek, Arabic (Modern Standard), Araucanian/Mapuche, Armenian, Assamese, Asturian, Athabaskan, Aymara, Azerbaijani, Azerbaijani (Cyrillic; old orthography), Balinese, Bambara, Bantu, Bashkir, Basque, Batak, Belorussian, Bemba, Bengali, Bikol, Bislama, Bosnian, Breton, Bulgarian, Burmese, Catalan, Cebuano, Chechen, Cherokee, Chinese (Mandarin; Hong Kong), Chinese (Simplified; Mandarin), Chinese (Traditional; Mandarin), Choctaw, Chuvash, Cree, Creek, Crimean Tatar, Croatian, Czech, Dakota, Danish, Dhivehi, Duala, Dutch, Dzonkha, Efik, English (American), English (British), Esperanto, Estonian, Ewe, Faroese, Fijian, Filipino, Finnish, Fon, French (Canadian), French (European), Fulah, Ga, Galician, Ganda, Gayo, Georgian, German, Gilbertese, Gothic, Greek, Guarani, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Herero, Hiligaynon, Hindi, Hungarian, Iban, Icelandic, Igbo, Iloko, Indonesian, Irish, Italian, Japanese, Javanese, Kabyle, Kachin, Kalaallisut, Kamba, Kannada, Kanuri, Kara-Kalpak, Kazakh, Khasi, Khmer, Kikuyu, Kinyarwanda, Kirghiz, Komi, Kongo, Korean, Kosraean, Kuanyama, Lao, Latin, Latvian, Lingala, Lithuanian, Low German, Lozi, Luba-Katanga, Luo, Macedonian, Madurese, Malagasy, Malay, Malayalam, Maltese, Mandingo, Manx, Maori, Marathi, Marshallese, Mende, Middle English, Middle High German, Minangkabau, Mohawk, Mongo, Mongolian, Nahuatl, Navajo, Ndonga, Nepali, Niuean, North Ndebele, Northern Sotho, Norwegian (Bokmål), Nyanja, Nyankole, Nyasa Tonga, Nzima, Occitan, Ojibwa, Old English, Old French, Old High German, Old Norse, Old Provencal, Oriya, Ossetic, Pampanga, Pangasinan, Papiamento, Pashto, Persian, Polish, Portuguese (Brazilian), Portuguese (European), Punjabi (Gurmukhi), Quechua, Romanian, Romansh, Romany, Rundi, Russian, Russian (Old Orthography), Sakha, Samoan, Sango, Sanskrit, Scots, Scottish Gaelic, Serbian (Cyrillic), Serbian (Latin), Shona, Sinhala, Slovak, Slovenian, Songhai, Southern Sotho, Spanish (European), Spanish (Latin American), Sundanese, Swahili, Swati, Swedish, Tahitian, Tajik, Tamil, Tatar, Telugu, Temne, Thai, Tibetan, Tigirinya, Tongan, Tsonga, Tswana, Turkish, Turkmen, Udmurt, Ukrainian, Urdu, Uzbek, Uzbek (Cyrillic; old orthography), Venda, Vietnamese, Votic, Welsh, Western Frisian, Wolof, Xhosa, Yiddish, Yoruba, Zapotec, and Zulu.
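(On the scripts-versus-languages point: many entries in that list share one script, e.g. Hindi, Marathi, Nepali and Sanskrit are all normally written in Devanagari. Here is a rough Python sketch to see which script a piece of text is in. It is my own illustration, not anything Google documents. Python's standard library does not expose the real Unicode Script property, so the sketch takes the first word of each character's Unicode name as a stand-in. That works for the Indic scripts but is crude elsewhere, e.g. Chinese characters come out as "CJK". The function name `scripts_in` is made up.)

```python
import unicodedata
from collections import Counter


def scripts_in(text: str) -> Counter:
    """Count the letters in `text` by the first word of their Unicode
    character name, e.g. 'DEVANAGARI', 'TAMIL', 'LATIN'.

    Assumption: the first word of the name approximates the script.
    This is a heuristic, not the official Unicode Script property.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits, punctuation, combining marks
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts


if __name__ == "__main__":
    # The same word written in three scripts.
    for sample in ("Punjabi", "ਪੰਜਾਬੀ", "पंजाबी"):
        print(sample, dict(scripts_in(sample)))
```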
