Data on Degree of Word Standardness Calculated Using Twitter

In language education, it is extremely important to correctly identify how likely a word is to belong to the standard language and how likely it is to be dialect or slang. This problem has therefore been studied extensively in the language processing field. Until now, however, standard vocabulary has been identified by experts, for example when compiling dictionaries. In today’s fast-changing world, new concepts and the words that describe them appear frequently, so it is very difficult for experts to keep standard vocabulary up to date. In addition, a plain list of standard vocabulary is not always sufficient: it does not help when standard words need to be compared with one another or when the degree of standardness itself is of interest.
In this study, we focused on how frequently SNS users use each word and assigned each word a standardness value on that basis. Quantifying standardness in this way makes it possible to compare the degree of standardness of two or more words. We at the Social Computing Laboratory have named this data “WORD GINI”. The WORD GINI data is available on this page. Please use it freely.


New Version
updated on: 2018/03/10, link for data: WORD GINI (Japanese), file size: 6.2MB
updated on: 2018/03/10, link for data: WORD GINI (English), file size: 9.5MB
Older version
Not available


The WORD GINI file is a CSV file with a word and its associated GINI score on each line. WORD GINI for Japanese contains about 200,000 words, and WORD GINI for English contains about 320,000 words. The following is a screenshot of words from WORD GINI for Japanese.

Examples from WORD GINI
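
As a rough sketch of how such a file might be read in Python: the code below assumes each row holds the word in the first column and its score in the second. The column order, the absence of a header row, the function name load_word_gini, and the sample values are all assumptions made for illustration and are not taken from the actual data.

```python
import csv
import io


def load_word_gini(lines):
    """Parse 'word,score' rows into a dict mapping word -> score.

    `lines` is any iterable of text lines, e.g. an open file object.
    Rows with fewer than two columns or a non-numeric score are skipped.
    """
    scores = {}
    for row in csv.reader(lines):
        if len(row) < 2:
            continue
        word, raw_score = row[0], row[1]
        try:
            scores[word] = float(raw_score)
        except ValueError:
            continue  # e.g. a header row, if the file has one
    return scores


# Made-up sample rows standing in for the real file.
sample = io.StringIO("the,0.12\nhello,0.25\nhaplotype,0.93\n")
scores = load_word_gini(sample)
```

With the real file, the same function could be called on a file object opened with open(path, newline="", encoding="utf-8"); the encoding is also an assumption and may need adjusting.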


Characteristics of WORD GINI

WORD GINI is a dictionary built by applying the Gini coefficient (an economic indicator) to words: it is a list of pairs of a word and its Gini coefficient, calculated from SNS data from Twitter.
Common words used by many users are assigned a low value, and words used only by a few users, such as technical terms, are assigned a higher value.
Please note that WORD GINI cannot be used to compare Japanese and English words, because the Japanese and English versions are constructed from different data.
Please refer to the following paper for more information on the calculation method and original data.
村山太一, 若宮翔子, 荒牧英治. WORD GINI: 語の使用の偏りを捉える指標の提案とその応用. 言語処理学会第24回年次大会, pp. 698–701, 2018.
(Taichi Murayama, Shoko Wakamiya, Eiji Aramaki. WORD GINI: A proposal and application of an index to capture word usage bias. In The 24th Annual Conference of the Association for Natural Language Processing, pp. 698-701, 2018.)
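
To illustrate why evenly used words receive a low value and words concentrated in a few users receive a high one, here is a minimal sketch of the textbook Gini coefficient applied to per-user usage counts of a single word. This is the generic formula only; how WORD GINI actually selects users and counts usage is described in the paper above and may differ. The function name and the example counts are hypothetical.

```python
def gini(values):
    """Textbook Gini coefficient of a list of non-negative numbers.

    For values sorted in ascending order x_1 <= ... <= x_n:
        G = 2 * sum(i * x_i) / (n * sum(x_i)) - (n + 1) / n
    G is 0 when all values are equal and approaches 1 as the total
    becomes concentrated in a single value.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one value and a positive total")
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical usage counts of one word across four users.
even_usage = gini([5, 5, 5, 5])     # every user uses the word equally often
skewed_usage = gini([0, 0, 0, 20])  # one user accounts for all usage
```

In this toy example the evenly used word gets a coefficient of 0, while the word used by a single user gets 0.75, the maximum possible for four users under this formula.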

Usage Example

Research related to the degree of standardness of words
  • Since the assigned values are close to the concept of the degree of standardness of words, they may be used directly as the degree of standardness.
  • Since WORD GINI also makes it possible to compare the degree of standardness of words, it works well as a reference for defining the degree of standardness.
Research related to simplification of words
  • WORD GINI can also be used for simplification by replacing less frequently used words with more frequently used words.
Research related to readability index of texts
  • Since words frequently used on SNS have low scores, WORD GINI can potentially also be used as an index of text readability.
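
The simplification and readability uses listed above could be prototyped roughly as follows. This is only a sketch under stated assumptions: the scores are made up, the candidate replacement words are assumed to come from some other resource (such as a thesaurus), and averaging word scores is just one possible way to turn them into a text-level index. The function names are hypothetical.

```python
def most_standard(candidates, scores):
    """Return the candidate with the lowest score, or None if none is known."""
    known = [word for word in candidates if word in scores]
    return min(known, key=scores.get) if known else None


def mean_score(tokens, scores):
    """Average score of the tokens found in `scores`, or None if none is found."""
    known = [scores[token] for token in tokens if token in scores]
    return sum(known) / len(known) if known else None


# Made-up scores for illustration; real values would come from the WORD GINI file.
scores = {"buy": 0.20, "purchase": 0.45, "procure": 0.80}

# Simplification: among interchangeable candidates, prefer the lowest score.
simplest = most_standard(["purchase", "procure", "buy"], scores)

# Readability: a lower average suggests more commonly used vocabulary.
average = mean_score(["buy", "procure", "unknownword"], scores)
```

Words missing from the dictionary are simply ignored here; a real application would need to decide how to treat them.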

