How can we ensure that every language—and the communities that speak them—can fully participate in the digital world? That was the question explored at the CHM Live event Character Building: Bridging Code and Culture through Unicode. With over 7,000 modern languages in use today, it’s a difficult task, but the Unicode Consortium, a nonprofit organization that establishes and maintains standards for representing written language, is trying.
An expert panel explained to the audience how Unicode works. It included Roy Boney, Jr., Cherokee language revitalization manager at Cherokee Film; Mark Davis, cofounder and CTO of the Unicode Consortium; and Anushah Hossain, research director of the Script Encoding Initiative. The moderator was Teresa Marshall, vice president of Globalization & Localization at Salesforce.
In a video clip from a recent CHM oral history interview with Unicode cofounders, Lee Collins and Mark Davis made the point that Unicode aims to enable people everywhere to communicate digitally in their own language. That means Unicode is always evolving. For example, new Chinese ideographs are often added, and additional levels of support like being able to read or type in a particular language are provided.
Hossain added that it’s hard to overstate how important it is that Unicode found a common way to treat the wide variety of writing systems we have in the world.
Anushah Hossain explains the difference between language and script.
Globally, there are close to 350 writing systems, and 170 are currently in Unicode. The CLDR (Common Locale Data Repository) project at Unicode deals with language-specific issues. Its goal is to capture the specifics of each language and region, such as how dates, times, numbers, and currencies are formatted in a particular location. Unicode also produces code libraries that can be built into any product, so that programmers don’t have to manage all the character property data themselves.
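To make the idea of character properties concrete, here is a minimal sketch using Python's standard `unicodedata` module, which ships a copy of the Unicode Character Database. It is not one of the Unicode Consortium's own libraries, and the `describe` helper is a hypothetical name introduced only for this illustration.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Return a few Unicode properties for a single character.

    `describe` is an illustrative helper, not part of any Unicode library.
    """
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),   # general category, e.g. "Lu"
        "bidi": unicodedata.bidirectional(ch),  # bidirectional class, e.g. "AL"
    }


# A Latin letter, a Cherokee letter, and an Arabic letter.
for ch in ["A", "\u13A0", "\u0628"]:
    print(describe(ch))
```

The program never has to store this data itself: the name, category, and text direction of each character come from tables derived from the Unicode standard.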
The first step in getting a new script into Unicode is to submit a proposal to a subcommittee called the Script Encoding Working Group, explained Hossain. The 15 or so experts on the committee have linguistic backgrounds or a deep interest in language, as well as a programming background. They meet once a month to review all the proposals for new characters or scripts and discuss how the script works, whether the proposal adequately explains all the characters, and the reach and legitimacy of the script.
Successful proposals often go back and forth with the authors two or three times before being approved, and then they advance to the Unicode Technical Committee, which meets once a quarter. The ISO (International Organization for Standardization) also has a working group dedicated to a universal character code, and it reviews the same proposals. It’s a complex, multistakeholder process. Davis added that Unicode also hopes to make it easier for individuals and organizations to contribute to fleshing out their own language in Unicode.
Mark Davis explains how Unicode tries to be inclusive.
Like most indigenous languages in the US, Cherokee is endangered, says Roy Boney, so for the last 40 years the tribe has been trying to preserve and revitalize it. But getting people to shift away from the fonts they had created to the Unicode script has not been easy. There’s been a lot of education in the community about what the tools are and what they can be used for. Originally, they needed a font, a keyboard, and operating systems that would support the language, and then they began working with companies in Silicon Valley to make sure the language was supported on all their products.
Sometimes the process of adding a language to Unicode can become controversial if different groups disagree on what the script actually looks like, noted Hossain. Old Hungarian, for instance, went through 13 proposals because social and political tensions around a few characters stalled the process. Boney described how a team that included scholars, font designers, historians of the language, and community members worked together to research and craft a proposal that still required revisions.
Davis noted that occasionally characters are fast-tracked, like when a new Japanese imperial era began, requiring a new era character to be used in dates. Chinese ideographs are the largest part of Unicode, outnumbering all other characters. Tranches come out regularly and involve very large data sets. It’s an involved process because they have to verify that an ideograph is actually new and not a variation on an existing one.
Many of us may take for granted that our language is supported on the devices we use every day, like our computer or smartphone. When it’s not, says Boney, you realize very quickly how limited you are in what you can do.
Roy Boney describes the impact of Unicode Cherokee.
Now that it’s common for people in the Cherokee community to have access to their language on digital devices, more and more people are making their own content. Access gives you confidence to do things in your language and pursue your dreams, says Boney, and he’s thankful for Unicode.
While most people’s languages are in Unicode and it has fairly full support for about 100, many languages don’t have enough support to reach the same level as Cherokee. And there are still a lot of historical works, like those in hieroglyphics, that are not yet able to have digital representation, notes Davis. And, as people find more things they want to do on computers, Unicode has to adapt to meet product requirements.
Hossain says it’s important for Unicode to maintain what’s already there and respond to reported bugs. Arabic is in Unicode, but it doesn’t work well, and there’s a lot still to do to make it functional for people. If there is even a little friction, it’s easy for people to just switch to Latin script or come up with a hack. That’s a problem, because the text won’t be processed properly by search engines or anything else on the internet.
These are big challenges for a small organization that has more work than people. While everyone benefits from Unicode’s vital work, it’s easy to use its tools without contributing to help it survive. But perhaps telling real-life stories about the positive impacts and the challenges of language inclusivity can help inspire and motivate stakeholders to continue to invest in Unicode and our collective digital future.
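A rough illustration of why a Latin-script workaround gets in the way of search follows. It is a minimal sketch, assuming a very simple matching scheme. The `search_key` helper and the sample word are hypothetical, and real search engines do far more than this.

```python
import unicodedata


def search_key(text: str) -> str:
    """Build a simple comparison key: NFC-normalize, then case-fold.

    Hypothetical helper for illustration only.
    """
    return unicodedata.normalize("NFC", text).casefold()


# Two encodings of the same accented letter: precomposed vs. base + combining mark.
composed = "\u00E9"      # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# The same word written in Arabic script and as an ad hoc Latin spelling.
arabic = "\u0643\u062A\u0627\u0628"  # Arabic-script word for "book"
latin_hack = "kitab"                 # informal Latin-script spelling

print(search_key(composed) == search_key(decomposed))  # True
print(search_key(arabic) == search_key(latin_hack))    # False
```

Unicode normalization reconciles different encodings of the same text, but it has no way of knowing that an improvised Latin spelling stands for an Arabic word, so the two never match.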
Character Building | CHM Live, May 13, 2025