This SLM will have 7 to 13 billion parameters. The genesis of the SLM lies in a paper authored by Microsoft research scientists, titled ‘TinyStories: How Small Can Language Models Be and Still Speak Coherent English?’.
When Chaitanya Chokkareddy, the chief technology officer of Ozonetel, chanced upon this paper, the idea of creating a Telugu SLM took shape in his head. He collaborated with Swecha Telangana and the International Institute of Information Technology (IIIT), Hyderabad, to compile a dataset of Telugu stories to build an SLM.
In all, 40,000 pages of stories were manually proofread and corrected by 8,000 students from 30 colleges, who participated in a ‘datathon’ led by Swecha.
“I reached out to the secretary of the Free Software Movement of India (which Swecha is a part of), Kiran Chandra Yarlagada, and asked if we could build a Telugu SLM,” he recalled.
Ganesh Katrapati, secretary, Swecha, said their aim was to give children of today access to the kind of stories that used to appear in the magazine Chandamama Kathalu, which went out of print in 2012.
“For example, kids can now play around with the characters of Vikram-Betal, which were a mainstay in these magazines,” he said. By the end of November, student volunteers from Swecha across engineering colleges had built a dataset and assessed whether they needed separate tokenisers. Tokens are the basic units of text or code that a language model uses to process and generate language. Tokens can be characters, words, sub-words or other segments of text or code, depending on the chosen tokenisation method or scheme.
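The choice of scheme matters particularly for Telugu, whose script falls outside the ASCII range that English-centric tokenisers are tuned for. A minimal illustrative sketch (not Swecha's actual tokeniser) shows how the same short Telugu phrase splits into very different token counts under three simple schemes:

```python
# Illustrative only: one Telugu phrase, three naive tokenisation schemes.
text = "చందమామ కథలు"  # "Chandamama stories"

# Word-level: split on whitespace.
word_tokens = text.split()

# Character-level: every Unicode code point is a token.
char_tokens = list(text)

# Byte-level (the fallback in many English-centric tokenisers):
# each Telugu code point occupies 3 bytes in UTF-8, inflating sequence length.
byte_tokens = list(text.encode("utf-8"))

print(len(word_tokens), len(char_tokens), len(byte_tokens))
```

The byte-level count being roughly three times the character count is one reason a script-aware tokeniser can represent Telugu far more compactly than a tokeniser built for English.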
“Microsoft released a paper called ‘TinyStories’ where they trained an SLM using 21 million stories, and it was able to generate coherent text. It was able to generate stories, so that gave us a lot of hope. We thought, if they can do it, why can’t we?” Chokkareddy said. A classic Indian monthly magazine for children, Chandamama was a mainstay in Indian homes from the 1940s till 2012. It published long-running mythological and magical Indian stories.
INDIA’S SLM LANDSCAPE
SLMs are built using the same methodology as larger models, but with a smaller neural network, fewer parameters and less training data. Large language models (LLMs) in Indian languages announced recently include Sarvam AI’s OpenHathi, a Hindi LLM built on Meta AI’s architecture and promising GPT-3.5-like performance, and Ola’s Krutrim, which will have generative support for 10 Indian languages and will be able to take inputs in a total of 22 languages; it has been trained on over two trillion tokens of Indian-language data. AI4Bharat’s IndicBERT is a multilingual ALBERT model pre-trained exclusively on 12 major Indian languages, while IndicBART is a multilingual, sequence-to-sequence pre-trained model focusing on Indic languages and English.
IndicBART currently supports 11 Indian languages and is based on the mBART architecture. The Google-funded Project Vaani, by IISc Bengaluru and ARTPARK, is expected to create open-sourced data corpora of over 150,000 hours of speech and text from about one million people across all 773 districts of India.
At Swecha, student clubs called GLUGs (GNU/Linux User Groups) across several colleges started work on optical character recognition (OCR), collecting magazines and stories from the ’50s to the ’70s.
Students also digitised scanned PDFs of Chandamama stories. OCR is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo or from subtitle text superimposed on an image.
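OCR output from decades-old scans is rarely clean, which is why the students still had to proofread everything by hand. A hypothetical post-OCR cleanup step (the article does not describe Swecha's actual pipeline) might strip stray Latin characters that OCR engines emit when reading Telugu pages, keeping only Telugu script, digits, whitespace and basic punctuation:

```python
import re

# Hypothetical filter: keep the Telugu Unicode block (U+0C00-U+0C7F),
# digits, whitespace and common punctuation; drop everything else,
# such as stray Latin letters mis-recognised by the OCR engine.
NOISE = re.compile(r"[^\u0C00-\u0C7F0-9\s.,!?\u2018\u2019\u201C\u201D-]")

def clean_ocr_line(line: str) -> str:
    """Strip non-Telugu OCR noise and collapse runs of whitespace."""
    line = NOISE.sub("", line)
    return re.sub(r"\s+", " ", line).strip()

# Example: "kAtha" is OCR noise wrongly inserted between two Telugu words.
print(clean_ocr_line("చందమామ   kAtha కథలు!!"))  # → "చందమామ కథలు!!"
```

A filter like this can only remove obvious noise; genuinely misread Telugu characters still need the kind of human correction the datathon provided.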
“We helped them with an open-source OCR tool and converted around 70 per cent of the text. The students typed out the remaining 30 per cent in the datathon. Around 8,000 students from 25 engineering colleges participated, and within four hours, we had around 45,000 stories,” Chokkareddy explained.
Moreover, these were long stories, he added.
“Each was around 50-60 lines of Telugu text, so we had around half a million lines of text generated, and then we made it open source,” he said. The tooling, including the OCR, the front end and the backend storage, was built by Swecha volunteers using open-source tools, he said.
“Then we uploaded this on Hugging Face, so companies like Sarvam.AI could theoretically use this dataset. And since they’ve already done it for Hindi, they can generate the same thing for Telugu in a couple of days. Our idea was to open up that dataset,” he said.
Hugging Face is a platform for hosting and sharing machine learning models and datasets; it also lets users create interactive, in-browser demos of models, making them easier to showcase and test.
“We are now doing our own research into what kind of tokeniser is better, and whether we should build an LLM from scratch rather than use Meta’s Llama 2 architecture. For that, we are interacting with IIIT,” he said. “We launched this dataset publicly in Hyderabad, and IIIT Hyderabad professors came to us and asked if we could collaborate, using natural language processing to build our own architecture,” he said.
“We’re also working with multiple startups like Alpes, an AI company that has its own deep-learning algorithm, to build a made-in-India algorithm and not use anything open source,” he said.
It will take four or five months before they have their own LLM, he said. But meanwhile, within the next week or two, they want to train an open-source Llama 2 model so that Telugu stories can be read out or written out.
“That’s our plan for the next week,” he said.