Indian Languages
[quote name='Kaushal' date='17 April 2006 - 02:07 AM' timestamp='1145257151' post='50007']

Can somebody educate me on the best way to

1. type text in Telugu script

2. convert English transliterated text into Telugu




Check out this service from Google: [url="http://www.google.com/transliterate/Telugu"]http://www.google.com/transliterate/Telugu[/url]
Great discussion topic.
Software to write in Indian languages.

Baraha software supports Kannada, Konkani, Tulu, Hindi, Marathi, Sanskrit, Tamil, Telugu, Malayalam, Gujarati, Punjabi, Bengali, Assamese and Oriya languages. Runs on Windows XP/Vista and Windows 7.

Baraha software can be effectively used for...

Email, Blog, Twitter, Facebook, Website in Indian languages.

Document, Spreadsheet, Presentation in Indian languages.

Database, Customized application in Indian languages.

[quote name='Amber G.' date='19 April 2006 - 12:05 AM' timestamp='1145384831' post='50071']

Kaushal -

I am sure you and others here know more than I do, and all this may be trivial stuff, but ...

If you are using Windows XP with Word, Arial Unicode (and thus fonts for most Indian scripts) is already there. You can insert Telugu characters; here is what they look like:



I have used "Akshar" for Devanagari script for many years (I bought it almost 10 years ago), starting from pre-Windows days, but I am not what you would call a heavy user.

I also like "aksharmala" (you may have to Google it to find out where to buy it). The program is/was free for personal use (and very modestly priced if you use it for business, etc.). Basically, you type on a standard keyboard using Roman script and it puts the corresponding script characters into any application. I used it for Devanagari script, but since the alphabet is almost the same in Telugu, it should work there too. It's not as good as a professional language-specific keyboard, but if you are willing to type "ka" to get क (I am using Devanagari here because that is what I have installed), you can type fairly easily in any word processing package, web forum (or here), or any application that supports Unicode fonts.
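For anyone curious how the type-"ka"-and-get-क idea works, here is a minimal Python sketch. It is not how aksharmala, Baraha, or ITRANS are actually implemented: the key table, the `match_key` helper, and the `transliterate` name are my own assumptions, and only a handful of letters are covered.

```python
# Illustrative key table only; real schemes cover far more letters.
VIRAMA = "\u094d"  # suppresses the inherent "a" of a consonant

CONSONANTS = {
    "kh": "\u0916", "gh": "\u0918",               # ख घ
    "k": "\u0915", "g": "\u0917",                 # क ग
    "t": "\u0924", "d": "\u0926", "n": "\u0928",  # त द न
    "p": "\u092a", "m": "\u092e",                 # प म
    "r": "\u0930", "l": "\u0932",                 # र ल
    "s": "\u0938", "h": "\u0939",                 # स ह
}

# Roman key -> (independent vowel, dependent vowel sign)
VOWELS = {
    "aa": ("\u0906", "\u093e"), "a": ("\u0905", ""),
    "ii": ("\u0908", "\u0940"), "i": ("\u0907", "\u093f"),
    "uu": ("\u090a", "\u0942"), "u": ("\u0909", "\u0941"),
    "e": ("\u090f", "\u0947"), "o": ("\u0913", "\u094b"),
}


def match_key(text, pos, table):
    """Return the longest key of `table` found at `pos`, or None."""
    for length in (2, 1):
        key = text[pos:pos + length]
        if key in table:
            return key
    return None


def transliterate(text):
    """Convert Roman input to Devanagari using the small tables above."""
    out = []
    pos = 0
    after_consonant = False  # True while a consonant still awaits its vowel
    while pos < len(text):
        key = match_key(text, pos, CONSONANTS)
        if key:
            if after_consonant:
                out.append(VIRAMA)  # two consonants in a row form a cluster
            out.append(CONSONANTS[key])
            after_consonant = True
            pos += len(key)
            continue
        key = match_key(text, pos, VOWELS)
        if key:
            independent, sign = VOWELS[key]
            out.append(sign if after_consonant else independent)
            after_consonant = False
            pos += len(key)
            continue
        if after_consonant:
            out.append(VIRAMA)
            after_consonant = False
        out.append(text[pos])  # pass through anything not in the tables
        pos += 1
    if after_consonant:
        out.append(VIRAMA)
    return "".join(out)


print(transliterate("ka"))       # क
print(transliterate("namaste"))  # नमस्ते
```

The longest-match-first lookup is what lets "kh" and "aa" win over "k" and "a"; a real input method would need a much larger table and rules for nasals, aspirates, and special conjuncts.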

The table looks like:


Rediffmail allows you to type in Telugu (then you cut and paste that into Word).

LaTeX (if you use it for math typesetting) has Telugu packages too.

If you don't want to install anything on your computer but want to use a web-based transliteration program, I find the ITRANS site (http://www.aczoom.com/itrans/) very nice.

Web interface is at many mirror sites eg:


LaTeX examples:


(My son (BTW, he is graduating from Duke in Physics and Math with a minor in classical languages) took Sanskrit for a semester and used that to write his papers, etc.) You can install the package on your computer if you want; if not, type the text on the web, and it will transliterate it and give you the output as GIF, PDF, or Unicode text.

Hope that helps.

BTW, maybe we should have a thread here for Indian fonts, word processors, language tools, etc.


Copy the Telugu above and paste it into the Baraha editing window. You get


But transliterated, it is the same as what you wrote.



On transliteration to Hindi, I found it to be the a, aa, etc. alphabet. You get portable code. And it is free.
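The Telugu-to-Hindi-script step can be roughed out in a few lines, because the Unicode blocks for Devanagari (U+0900 to U+097F) and Telugu (U+0C00 to U+0C7F) are laid out in parallel. This is only a sketch under that assumption, not what Baraha does internally; the `shift_script` name is made up for illustration, and letters that have no counterpart in the target script will come out wrong or as unassigned code points.

```python
DEVANAGARI_START = 0x0900
TELUGU_START = 0x0C00
BLOCK_SIZE = 0x80  # each of these Indic blocks spans 128 code points


def shift_script(text, src_start, dst_start):
    """Move every character of the source block to the same slot in the target block."""
    out = []
    for ch in text:
        cp = ord(ch)
        if src_start <= cp < src_start + BLOCK_SIZE:
            out.append(chr(cp - src_start + dst_start))
        else:
            out.append(ch)  # spaces, digits, Latin text, etc. stay as-is
    return "".join(out)


telugu = "\u0c15\u0c2e\u0c32"  # కమల (kamala)
print(shift_script(telugu, TELUGU_START, DEVANAGARI_START))  # कमल
```

Swapping the two start constants converts in the other direction.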

Akshar too is good, if used with Code2000.
ITRANS etc. depend heavily on diacritic symbols. After all, it was not developed by Indians. Baraha does not use diacritics.
India is like a real-life tapestry: a very complex weave of communities, but equally fascinating as well. That is why some people call India a subcontinent. Its diversity may be considered a big problem by some, but in that very variety lies the lifeblood of India's pride.
At last, for once I'm going to spam in the appropriate thread.

1. opensource.com/life/15/9/open-source-extract-text-images

Quote:Google's Optical Character Recognition (OCR) software works for 248+ languages

Posted 18 Sep 2015 by Subhashish Panigrahi

[Image: Book stack. Image by Kate Ter Haar. Modified by opensource.com. CC BY-SA 2.0.]

Google's Optical Character Recognition (OCR) software now works for over 248 world languages (including all the major South Asian languages). It's quite simple and easy to use, and can detect most languages with over 90% accuracy.

The technology extracts text from images, scans of printed text, and even handwriting, which means text can be extracted from pretty much any old books, manuscripts, or images.

Google's OCR is probably using dependencies of Tesseract, an OCR engine released as free software, or OCRopus, a free document analysis and optical character recognition (OCR) system that is primarily used in Google Books. Developed as a community project during 1995-2006 and later taken over by Google, Tesseract is considered one of the most accurate OCR engines and works for over 60 languages. The source code is available on GitHub.

(BTW, to give credit where it's due: Tesseract, IIRC, started its existence as the brilliant OCR software by HP, some decades ago. It may have had an open-source phase before Google. Eventually Google got it, tinkered with it, rebranded it - of course - and the crowd goes wild.

But HP's OCR-ing didn't work on "South Asian" - whatever that is - languages, IIRC.)

The OCR project support page offers additional details on preserving character formatting for things like bold and italics after OCR in the output text:

Quote:When processing your document, we attempt to preserve basic text formatting such as bold and italic text, font size and type, and line breaks. However, detecting these elements is difficult and we may not always succeed. Other text formatting and structuring elements such as bulleted and numbered lists, tables, text columns, and footnotes or endnotes are likely to get lost.

Tamil-language Wikimedian and Wikimedia India's program director Ravishankar Ayyakkannu said this on Facebook after testing: "For some of the languages like Malayalam and Tamil, the OCR works with almost 100% accuracy, along with support in formatting like auto cropping, separating text by discarding images, and ignoring colored backgrounds." Native speakers of the following Indian languages (Bangla, Malayalam, Kannada, Odia, Tamil, and Telugu) also commented on a Facebook post with feedback after testing the OCR.

However, for a few scripts like Gurmukhi (used to write Punjabi), the output after OCR is quite poor and results in gibberish text in different scripts.


A tutorial to convert text in Odia (Indian language) from a scanned image using Google's OCR. Designed by Subhashish Panigrahi. CC BY-SA 4.0

Overall, this is quite a large leap for languages that have old texts that have not yet been digitized. Old and valuable text in many languages can now be digitized and shared over the internet using platforms like Wikisource.

(Make sure not to share Hindoo stuffs with the alien appropriating demons or in such a manner that the latter can get their claws on it. Else you're helping the enemy. Think first. Then plan how to make it available to Hindoos alone - for all time - and then "share" with this restriction. It is NOT and NEVER WAS universal knowledge, contrary to what "Hinduism is open-source" activists try to brainwash you with. For instance, many Hindoo heathen texts state clearly that they are secret knowledge/texts, that they are for Hindu heathens alone. Meaning: not aliens and alienated, not non-Hindoo Indics.

Remember: Everything you put on Google drive can be analysed by Google and probably will be. There's no such thing as "private" right, just that they won't share with others. Well, except their government or even the Chinese govt, as multiple court cases involving Google/Yahoo/etc demonstrated.)

Editor's note: Article has been updated based on community feedback. We changed "Google's OCR partly uses Tesseract, an OCR engine released as free software" to "Google's OCR is probably using dependencies of Tesseract, an OCR engine released as free software, or OCRopus, a free document analysis and optical character recognition (OCR) system that is primarily used in Google Books." If you have additional feedback on the article or technology, please let us know in the comments. -Rikki Endsley

2. support.google.com/drive/answer/176692

Not sure why they don't list scripts in place of languages, but maybe for the west the two usually end up meaning the same thing (?)

Quote:Supported languages

List of OCR supported languages

Acehnese, Acholi, Adangme, Afrikaans, Akan, Albanian, Algonquinian, Amharic, Ancient Greek, Arabic (Modern Standard), Araucanian/Mapuche, Armenian, Assamese, Asturian, Athabaskan, Aymara, Azerbaijani, Azerbaijani (Cyrillic; old orthography), Balinese, Bambara, Bantu, Bashkir, Basque, Batak, Belorussian, Bemba, Bengali, Bikol, Bislama, Bosnian, Breton, Bulgarian, Burmese, Catalan, Cebuano, Chechen, Cherokee, Chinese (Mandarin; Hong Kong), Chinese (Simplified; Mandarin), Chinese (Traditional; Mandarin), Choctaw, Chuvash, Cree, Creek, Crimean Tatar, Croatian, Czech, Dakota, Danish, Dhivehi, Duala, Dutch, Dzonkha, Efik, English (American), English (British), Esperanto, Estonian, Ewe, Faroese, Fijian, Filipino, Finnish, Fon, French (Canadian), French (European), Fulah, Ga, Galician, Ganda, Gayo, Georgian, German, Gilbertese, Gothic, Greek, Guarani, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Herero, Hiligaynon, Hindi, Hungarian, Iban, Icelandic, Igbo, Iloko, Indonesian, Irish, Italian, Japanese, Javanese, Kabyle, Kachin, Kalaallisut, Kamba, Kannada, Kanuri, Kara-Kalpak, Kazakh, Khasi, Khmer, Kikuyu, Kinyarwanda, Kirghiz, Komi, Kongo, Korean, Kosraean, Kuanyama, Lao, Latin, Latvian, Lingala, Lithuanian, Low German, Lozi, Luba-Katanga, Luo, Macedonian, Madurese, Malagasy, Malay, Malayalam, Maltese, Mandingo, Manx, Maori, Marathi, Marshallese, Mende, Middle English, Middle High German, Minangkabau, Mohawk, Mongo, Mongolian, Nahuatl, Navajo, Ndonga, Nepali, Niuean, North Ndebele, Northern Sotho, Norwegian (Bokmål), Nyanja, Nyankole, Nyasa Tonga, Nzima, Occitan, Ojibwa, Old English, Old French, Old High German, Old Norse, Old Provencal, Oriya, Ossetic, Pampanga, Pangasinan, Papiamento, Pashto, Persian, Polish, Portuguese (Brazilian), Portuguese (European), Punjabi (Gurmukhi), Quechua, Romanian, Romansh, Romany, Rundi, Russian, Russian (Old Orthography), Sakha, Samoan, Sango, Sanskrit, Scots, Scottish Gaelic, Serbian (Cyrillic), Serbian (Latin), Shona, Sinhala, Slovak, 
Slovenian, Songhai, Southern Sotho, Spanish (European), Spanish (Latin American), Sundanese, Swahili, Swati, Swedish, Tahitian, Tajik, Tamil, Tatar, Telugu, Temne, Thai, Tibetan, Tigirinya, Tongan, Tsonga, Tswana, Turkish, Turkmen, Udmurt, Ukrainian, Urdu, Uzbek, Uzbek (Cyrillic; old orthography), Venda, Vietnamese, Votic, Welsh, Western Frisian, Wolof, Xhosa, Yiddish, Yoruba, Zapotec, and Zulu.

