Data on Degree of Word Standardness Calculated Using Twitter
In language education, it is extremely important to correctly identify how likely a word is to belong to the standard language and how likely it is to be dialect or slang. This question has therefore been studied extensively in the language processing field. However, standard vocabulary has so far been identified by experts, for example when compiling dictionaries. In today’s fast-changing world, new concepts and the words for them appear frequently, so it is difficult for experts to keep standard vocabulary up to date. Moreover, a list of standard vocabulary alone is not always convenient when comparing words or when considering their degree of standardness.
In this study, we focused on how frequently SNS users use each word and assigned each word a standardness value based on that usage. Quantifying standardness in this way allows the value to be used to compare the degree of standardness of two or more words. We at the Social Computing Laboratory have named this data “WORD GINI”. The WORD GINI data is available on this page. Please use it freely.
The WORD GINI file is a CSV file with a word and its associated GINI score on each line. WORD GINI for Japanese contains about 200,000 words, and WORD GINI for English contains about 320,000 words. The following is a screenshot of words from WORD GINI for Japanese.
Examples from WORD GINI
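The CSV layout described above can be read with a few lines of Python. This is a minimal sketch, assuming each row is `word,score` with the word in the first column, comma separation, and UTF-8 encoding; the function name `load_word_gini` and the sample rows are hypothetical, and the scores shown are invented for illustration rather than taken from the actual data.

```python
import csv
import io


def load_word_gini(lines):
    """Parse rows of `word,score` into a dict mapping word -> float score."""
    scores = {}
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        word, score = row[0], row[1]
        try:
            scores[word] = float(score)
        except ValueError:
            continue  # skip a header row or a non-numeric score, if present
    return scores


# Illustrative rows only: these scores are invented, not taken from WORD GINI.
sample = io.StringIO("the,0.10\nlol,0.45\nhyperparameter,0.92\n")
word_gini = load_word_gini(sample)
print(word_gini["the"])
```

For a downloaded file, the same function should work on a file object opened with `open(path, encoding="utf-8", newline="")`, where `path` is wherever the file was saved.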
Characteristics of WORD GINI
WORD GINI is a dictionary built by applying the Gini coefficient (an economic indicator) to words: we created a list of word and Gini coefficient pairs from Twitter (SNS) data. Common words used by many users are assigned a low value, and words used by only a few users, such as technical terms, are assigned a higher value. Please note that WORD GINI cannot be used to compare Japanese and English words, because the Japanese GINI and English GINI are constructed from different data. Please refer to the following paper for more information on the calculation method and original data.
村山太一, 若宮翔子, 荒牧英治. WORD GINI: 語の使用の偏りを捉える指標の提案とその応用. 言語処理学会第24 回年次大会, pp. 698–701, 2018.
(Taichi Murayama, Shoko Wakamiya, Eiji Aramaki. WORD GINI: A proposal and application of an index to capture word usage bias. In The 24th Annual Conference of the Association for Natural Language Processing, pp. 698-701, 2018.)
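To make the idea of applying the Gini coefficient to words more concrete, the sketch below computes the textbook Gini coefficient over per-user usage counts for a single word. It is an illustration under assumptions, not necessarily the exact procedure of the paper: the function name `gini`, the choice of per-user counts as input, and the toy numbers are all hypothetical.

```python
def gini(counts):
    """Textbook Gini coefficient of a list of non-negative counts.

    0.0 means usage is spread evenly; values near 1.0 mean usage is
    concentrated in a few entries.
    """
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # convention chosen for this sketch
    # Rank-weighted sum over the ascending-sorted counts (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Toy per-user counts for two hypothetical words (invented numbers).
even_usage = [5, 5, 5, 5]      # every user uses the word equally often
skewed_usage = [0, 0, 0, 20]   # a single user accounts for all usage

print(gini(even_usage), gini(skewed_usage))
```

In this toy case the evenly used word gets the lower value and the word used by a single user gets the higher one, which matches the direction of the scores described above.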
Research related to the degree of standardness of words
- Since the assigned values are close to the concept of the degree of standardness of words, they may be used directly as the degree of standardness.
- Since WORD GINI also makes it possible to compare the degree of standardness of words, it works well as a reference for defining the degree of standardness.
Research related to simplification of words
- WORD GINI can also be used for simplification by replacing less frequently used words with more frequently used words.
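One way such a replacement could work is sketched below, assuming a synonym list is available from some other resource (WORD GINI itself provides only scores). The function name `pick_simpler`, the synonym table, and the scores are hypothetical and invented for illustration.

```python
def pick_simpler(word, synonyms, scores):
    """Return the lowest-scoring option among `word` and its scored synonyms."""
    if word not in scores:
        return word  # no score available: leave the word unchanged
    best = word
    for candidate in synonyms.get(word, []):
        if candidate in scores and scores[candidate] < scores[best]:
            best = candidate
    return best


# Invented scores and synonym table, for illustration only.
scores = {"utilize": 0.80, "use": 0.15, "employ": 0.50}
synonyms = {"utilize": ["use", "employ"]}

print(pick_simpler("utilize", synonyms, scores))
```

A word is replaced only when a synonym has a strictly lower score, so words without scores or without scored synonyms pass through unchanged.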
Research related to readability index of texts
- Since words frequently used on SNS have low scores, WORD GINI can also potentially be used as an index for the readability of texts.
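As a rough sketch of this idea, a text could be summarized by the mean score of its words, with a lower mean suggesting more commonly used vocabulary. The averaging scheme, the function name `mean_gini`, and the scores are assumptions made for illustration; whitespace tokenization is also an assumption that suits English only, and Japanese text would need a morphological tokenizer.

```python
def mean_gini(text, scores):
    """Average the scores of the words in `text` that have a known score.

    Returns None when no word in the text has a score.
    """
    tokens = text.lower().split()  # naive whitespace tokenization
    known = [scores[t] for t in tokens if t in scores]
    if not known:
        return None
    return sum(known) / len(known)


# Invented scores, for illustration only.
scores = {"we": 0.10, "use": 0.20, "it": 0.30, "hyperparameter": 0.90}

plain = mean_gini("We use it", scores)
technical = mean_gini("We use hyperparameter", scores)
print(plain < technical)
```

Words without a score are ignored here; how to treat them in practice (for example as maximally rare) is a design choice this sketch leaves open.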
The deliverable has been developed with the utmost care and great attention to detail. However, complete reliability and robustness cannot be guaranteed. We do not accept responsibility for any problems that may occur as a result of using this application or data. Please use it at your own risk.