- Open Access
- Authors : Pritish Sinha , Khushi , Siddhant Jain , Vijaya Chauodhary
- Paper ID : IJERTV12IS030111
- Volume & Issue : Volume 12, Issue 03 (March 2023)
- Published (First Online): 29-03-2023
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Sentimental ESG Mining using Machine Learning and AI Techniques
Pritish Sinha
Computer Science and Engineering Galgotias University Greater Noida, India
Siddhant Jain
Computer Science and Engineering Galgotias University Greater Noida, India
Khushi
Computer Science and Engineering Galgotias University Greater Noida, India
Vijaya Chauodhary
Computer Science and Engineering Galgotias University Greater Noida, India
Abstract: Environment, Social and Governance (ESG) are now general terms in the sustainability of businesses and organisations. ESG policies are now tied to favourable financing terms and business development. Many companies have followed sustainable paths, while others are now turning towards them. New investors seek ESG information about an investment in order to keep their capital safe and to keep pace with new regulations from lawmakers. This calls for large committees and for decision making backed by proper documentation of every aspect of a company's operations, from electricity consumption to policies that safeguard human rights. Each organisation therefore maintains documents such as the CSR report, annual report and social-governance documents. However, an investor does not read these documents directly when deciding on an investment. Instead, investors hire business analyst companies, whose ESG analysts manually read each entry and file a compiled factset with reviews according to global standards. Each organisation has a different way of documenting its data. Text sentiment mining can therefore play a role here and reduce the task of compiling and reviewing this large volume of data, scaling each process down to hours of work where the traditional method took days.
Index Terms: ESG, factset, visualisation, sentiment analysis
INTRODUCTION
A number of companies have set up governance structures to measure and rate their sustainability. S&P 500 companies have board-level committees that oversee this aspect of the business, and many hire analyst companies that keep track of their annual reports and of documents related to social and governance matters. In this paper we examine the disclosure of these data to a machine learning and AI model that analyses the sentiment values of the data to rate a business on environmental, social and governance topics.
Problem Statement
An ESG analyst's work takes days or more, depending on how the data is documented by an organisation, the scale of investment in the organisation, the number of factory sites it handles, its other investments, and so on. Additionally, greenwashing is a problem: the misleading use of data and words to trick an analyst into concluding that an organisation is operating on rightful terms. Another disadvantage of this manual labour is that the data is not timestamped and not all real-time data is evaluated.
Problem Solution
After some research, it can be concluded that many data segments of these organisations are published regularly on sites such as GDELT, Yahoo and others. We can make use of machine learning and AI techniques to visualise an organisation's data and to provide a factored filter option. The proposed paper plans to use machine learning and NLP concepts to extract sentiment features from the data and scale them to visualise the whole work. A secondary objective of this work is to provide its user with data on connected or related organisations and with better investment options, using Node2vec embedded connection graphs.
LITERATURE SURVEY
Background Research
There has been an increase in governance regulation requiring many organisations to follow ESG standards with a view to sustainable development. Smart investment is now understood as investment in firms or organisations that follow global standards and UNSDP regulation in three terms, E, S and G, i.e., Environment, Social and Governance. We have examined the disclosure of ESG activity data by organisations to prove their ESG standards and attract investors. [10]
However, there are many cases of malpractice in this area of research and analysis, as it is tricky and based only on data released by the firms. One of these is greenwashing, that is, publishing data that makes a company look as if it is following and aiming to work along the SD path.
In other words, such laws are costly, as pooling the whole disclosure can lead to the destruction of shareholder value. We have also found a number of investments in analyst groups and firms that work on a contract basis to pool and analyse data by manually working through CSR reports, annual reports, and social-governance sheets and policy charts.
Related Work
The field of sentiment analysis and opinion mining has grown over the last few years. The area of research now targets a large set of real-life applications, for example the use of customer reviews to enhance a product or to improve the accessibility of products such as Spotify, Instagram, Facebook and web browsing. All of these applications use advanced artificial intelligence techniques to determine users' feelings and opinions, and natural language processing is at the centre of meeting these needs. Many scientific articles are being published and worked on concerning opinion and emotion mining from text data such as reviews, comments and tweets. [12]
This section of the paper aims to provide a curated list of existing models, implementations and other related work. Dictionary-based resources such as SentiWordNet [13], developed from WordNet, generalise terms related to emotions, but their major drawback is that they cannot implement a hierarchical word net, which undermines a proper assessment of sentiment. Below is a table of the works found, with ML denoting machine learning work, L lexicon-based work and H hybrid work.
TABLE I
RELATED WORKS
| Year | Approach | Domain | Result |
|---|---|---|---|
| 2002 | ML | Movie Review [14] | 82.9% |
| 2009 | ML | Product Review [15] | 83.30% |
| 2013 | L | Word Dictionary [7] | 93.30% |
| 2015 | ML | Tweet [5] | 67.40% |
| 2017 | L | Review [10] | 83.30% |
| 2018 | H | Customer Review [8] | 83.30% |
| 2019 | H | General [4] | 83.30% |
| 2020 | H | ESG - Japanese [13] | 83.30% |
| | ML | ESG - Spanish, English [12] | 91.2% |
VISUALIZATION MODEL
The model summary covers the steps involved in development: preprocessing, word embedding, classification and visualization. Each step is explained below.
Step 1- Preprocessing
Following the usual practice for building a machine learning model, we first collected our data for preprocessing. We gathered more than 120 CSR, annual, social-governance and policy document reports that followed US and French standards, formatted each PDF, and extracted proper nouns, action verbs and selected E, S and G dictionary terms (e.g., sewage, coal, monoxide, accident, crash), manually stemming plurals in a number of cases.
TABLE II
CLASSIFIED WORDS IN E, S AND G CATEGORIES
| Category it falls in | Word Details | Word Count |
|---|---|---|
| E - Environment | global warming, wastage | 1573 |
| S - Social | overtime, violation | 798 |
| G - Governance | committee, succession plan | 1360 |
| Other | profit, data, activity | 24672 |
| Total | | 28403 |
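The dictionary-based part of the preprocessing and labelling can be illustrated with a minimal sketch. It assumes a tiny hypothetical dictionary (`ESG_TERMS`) and a crude plural-stripping rule (`naive_singular`) that stand in for the paper's ESG word stack and manual stemming, neither of which is given in full; it is not the authors' pipeline.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary; the paper's actual ESG word stack is not published here.
ESG_TERMS = {
    "E": {"global warming", "wastage", "sewage", "coal"},
    "S": {"overtime", "violation", "accident"},
    "G": {"committee", "succession plan"},
}

def naive_singular(word):
    """Rough plural stripping, standing in for the manual stemming step."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def lookup(term):
    """Return the category label of a dictionary term, or None if absent."""
    for label, terms in ESG_TERMS.items():
        if term in terms:
            return label
    return None

def count_categories(text):
    """Count dictionary hits per category; every other token counts as 'O'."""
    tokens = [naive_singular(t) for t in re.findall(r"[a-z]+", text.lower())]
    counts = Counter()
    i = 0
    while i < len(tokens):
        # Try a two-word term first, then fall back to the single word.
        label = lookup(" ".join(tokens[i:i + 2])) if i + 1 < len(tokens) else None
        if label:
            counts[label] += 1
            i += 2
            continue
        counts[lookup(tokens[i]) or "O"] += 1
        i += 1
    return counts

if __name__ == "__main__":
    sample = "The committee reviewed global warming risks and overtime violations."
    print(dict(count_categories(sample)))
```

Matching two-word terms before single words keeps phrases such as "global warming" from being split into two unrelated "Other" tokens.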
Step 2- Word Embedding
In the proposed paper we used the word2vec neural mapping model to generate our word embedding vectors. This produced a 50-dimensional word embedding vector; specifically, the skip-gram variant of the word2vec model [7] was employed to structure a hierarchical tree for the word embedding.
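To make the skip-gram idea concrete, the sketch below generates the (centre word, context word) training pairs that a skip-gram model learns from. The window size is an assumption, since the paper does not state it, and the sketch stops short of training the 50-dimensional vectors themselves.

```python
def skipgram_pairs(tokens, window=2):
    """Build (centre, context) pairs within `window` positions of each token.

    `window` is an assumed hyperparameter; the paper does not report it.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

if __name__ == "__main__":
    for pair in skipgram_pairs(["water", "pollution", "control"], window=1):
        print(pair)
```

A skip-gram model is trained to predict the context word from the centre word for each such pair, which is what places words used in similar contexts close together in the embedding space.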
Step 3- Word Classification
Our classification basket consists of four categories, Environment, Social, Governance and Other, embedded in the word vector labels E, S, G and O respectively. Some of the labelling was done manually through a local natural language toolkit directed dictionary for the ESG word stack, with the remaining repetitive words labelled as Other; the rest of the classification is handled by a neural network. The outcome is 1573 E-categorised words, 798 S-categorised words and 1360 G-categorised words; all other words are labelled O. Single words on their own are not informative, so word structuring is used to aid the interpretation of statements. This can be done by adding hierarchical words on the basis of frequency and position in the tree model. For example, in Water Pollution, Pollution is an Environment topic, and Waste Diffuse and Insoluble Impurities also need to be included in the Environment topic.
Fig. 1. Example of word structuring
Word structuring requires indexing the proper nouns of the highlighted ESG topics, i.e., environment, social and governance, which are fixed (k=1); for the diverse words we build word embedding vectors $v_i$ and $v_j$. We define $d_{ij}$ as
$$d_{ij} = \left(1 - \cos(v_i, v_j)\right)^2$$
The value of cos ranges from -1 to 1, so $d_{ij}$ falls in the range 0 to 4. However, a tree structure based on the sum of divergence cannot be constructed, as the heuristic foundation is already established. Tree structure optimization is the key issue, which can be worked on in future for better word structuring, and other constrained solution algorithms, such as a greedy algorithm, can be applied.
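The divergence between two word vectors can be checked numerically with a short sketch. The vectors used here are illustrative two-dimensional values, not embeddings from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def divergence(u, v):
    """d_ij = (1 - cos(v_i, v_j))^2, which lies between 0 and 4."""
    return (1.0 - cosine(u, v)) ** 2

if __name__ == "__main__":
    print(divergence([1.0, 0.0], [1.0, 0.0]))   # same direction
    print(divergence([1.0, 0.0], [0.0, 1.0]))   # orthogonal
    print(divergence([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The three cases give the two ends and the midpoint of the cosine range: identical directions give 0, orthogonal vectors give 1, and opposite directions give the maximum of 4.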
TABLE III
WORD COUNT
| Word | Count |
|---|---|
| Absolute zero target | 73 |
| Cap and trade | 25 |
| CO2 | 14 |
| carbon emissions | 13 |
| DEI | 0 |
| EHS | 0 |
| EPA | 4 |
| Glasgow COP2 | 6 |
| GHG | 4 |
| Scope 1 | 3 |
| Scope 2 | 5 |
| Scope 3 | 7 |
| LCA | 2 |
| Net-Zero carbon emission | 24 |
| Offset | 2 |
| Paris Agreement | 2 |
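A word count table of this kind can be produced with a simple phrase counter. The sketch below is one assumed way of doing it, using case-insensitive whole-phrase matching; the sample text is invented and the paper does not describe its exact counting rules.

```python
import re

def phrase_counts(text, phrases):
    """Count case-insensitive, whole-phrase occurrences of each phrase in text."""
    lowered = text.lower()
    counts = {}
    for phrase in phrases:
        # Lookarounds stop "Scope 1" from matching inside "Scope 12".
        pattern = r"(?<!\w)" + re.escape(phrase.lower()) + r"(?!\w)"
        counts[phrase] = len(re.findall(pattern, lowered))
    return counts

if __name__ == "__main__":
    report = ("Scope 1 and Scope 2 emissions fell; CO2 and carbon emissions "
              "are tracked. Scope 1 was audited.")
    terms = ["Scope 1", "Scope 2", "Scope 3", "CO2", "carbon emissions"]
    for term, count in phrase_counts(report, terms).items():
        print(term, count)
```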
Our approach is not optimised for large sets of words. Each tree can cover hundreds of words, but the presented solution will create similar sets of trees with quantitative word frequency. For example, in the image below the process starts by discarding a branch of the structure shown in Fig. 1.
Fig. 2. Example of word frequency
Step 4- Visualisation
From Table II and Figure 2 we form a hypothesis about the E-categorised words. In Figure 2, the darkness and the diameter of a dot represent word frequency. We can thus discern that the organisation is to some extent related to energy consumption, and that either there is a bad impact on the environment or the report is detailing environmentally friendly techniques that lead to lower air and soil pollution. White dots are those discarded because of too low or zero frequency, as obtained from the natural language toolkit dictionary.
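The mapping from word frequency to dot size and shade can be sketched as follows. The linear scaling, the `min_count` cut-off and the `max_diameter` value are assumptions made for illustration; the paper does not state how its figures were scaled.

```python
def dot_styles(counts, min_count=1, max_diameter=40.0):
    """Map word counts to a dot diameter and a shade in [0, 1].

    Words below `min_count` get diameter 0 and shade 0, i.e. they are drawn
    white and discarded. The linear scaling is an assumed choice.
    """
    kept = {w: c for w, c in counts.items() if c >= min_count}
    top = max(kept.values(), default=0)
    styles = {}
    for word, count in counts.items():
        if word in kept and top > 0:
            share = count / top
            styles[word] = {"diameter": max_diameter * share, "shade": share}
        else:
            styles[word] = {"diameter": 0.0, "shade": 0.0}
    return styles

if __name__ == "__main__":
    sample = {"Cap and trade": 25, "CO2": 14, "DEI": 0}
    for word, style in dot_styles(sample).items():
        print(word, style)
```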
Visualisation Summary: Figure 3 depicts a possible tree result from an annual report that follows S&P global standards in ESG sustainable development and quality drive in its social and governance regulations. A connection within category E is formed here and shows that the firm engages with ecosystem issues involving rare, indigenous species; on the right-hand side there are topics related to global warming and discharge into the air in relation to energy consumption or generation.
Fig. 3. Example of final word frequency
ESG SCORING
The main objective of the proposed paper is to provide a way of rating an organisation on the basis of its ESG activities and a foundation for comparison. The proposed solution is to perform sentiment analysis on the extracted data over selected topics and then set a parameter to score these analyses, using emotion artificial intelligence and natural language processing (NLP) techniques.
1) What is Sentiment Analysis: Sentiment analysis is a natural language processing (NLP) technique that performs either subjective analysis or opinion mining on the basis of patterns identified in large data. It generates a scorecard according to the attitudes and views expressed on the topic fed in. [3] This analysis is structured over three elements of context:
- Opinion/Emotion: Opinion refers to the polarity of the subject, while emotion scales qualitative features such as sadness, joy, anger and happiness.
- Subject: In sentiment analysis it is crucial to provide the subject at which the analysis is targeted, in order to get better and more accurate ratings.
- Organisation/Holder: In sentiment analysis we should provide a scale of passiveness for feature evaluation, for example a reference to the author who expressed the opinion and to whom it is expressed.
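The notion of polarity can be illustrated with a very small lexicon-based scorer. This is a simplified stand-in, not the emotion AI model used in the paper, and the two word lists are hypothetical.

```python
import re

# Hypothetical polarity lexicons, for illustration only.
POSITIVE = {"reduce", "improve", "renewable", "safe"}
NEGATIVE = {"violation", "spill", "accident", "pollution"}

def polarity(text):
    """Return a polarity in [-1, 1]: (positive - negative) / (positive + negative)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

if __name__ == "__main__":
    print(polarity("Renewable energy helped reduce pollution"))
    print(polarity("An accident caused a spill"))
```

A score near 1 indicates mostly positive wording about the subject, a score near -1 mostly negative wording, and 0 either balanced or neutral text.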
Scoring function
To score a firm according to its ESG activities, we cannot use a single function to define the score as a whole. [9] For this we define separate specificity proxies for each category. The quality score $sqnt_{i,t}$ is then the sum of the logarithms of $n_{k,i,t}$, the frequency count for each of the K generalised word topics in the reports of firm i at time t. We add 1 to each $n_{k,i,t}$ so that the argument of the logarithm is never 0.
$$sqnt_{i,t} = \sum_{k=1}^{K} \log_n\left(n_{k,i,t} + 1\right)$$
With the terms $D_t$ for the years 2006 to 2012 added, the score reads
$$sqnt_{i,t} = \sum_{k=1}^{K} \log_n\left(n_{k,i,t} + 1\right) + \sum_{t=2006}^{2012} D_t$$
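The sum-of-logarithms score can be computed with a few lines of code. The sketch below covers only the sum over topic counts, without any year terms, and treats the logarithm base as a parameter because the base is not specified beyond the subscript in the formula; the natural logarithm default is an assumption.

```python
import math

def quantity_score(counts, base=math.e):
    """Sum of log(n_k + 1) over topic frequency counts n_k.

    `base` is an assumed parameter; adding 1 keeps zero counts well defined.
    """
    return sum(math.log(n + 1, base) for n in counts)

if __name__ == "__main__":
    # Frequency counts of a few topics in one firm's reports for one year.
    print(quantity_score([73, 25, 14, 0]))
```

Because of the logarithm, a topic mentioned 100 times adds far less than 100 times the contribution of a topic mentioned once, so the score rewards covering many topics over repeating one.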
The industrial average $sspc_{i,t}$ is the average of the total divergence over the number of establishment transitions in the words of E, S and G. This average is calculated because a firm can have a number of different establishments or investments; these will have different ESG activities and can influence the scoring. For example, Firm A has three plants (I, O and U) in different market sectors: A uses plant I for electricity generation, O is a sewage plant and U is A's manufacturing unit. E-related activities are possible in all three plants, whereas S and G are limited to plant U, as this plant is manually operated; that is, we have to score each plant to account for the ESG activity of A. This requires another feature extraction and an additional layer that finds relations between the segments of a firm, as explained below.
Fig. 4. Polarity – Word frequency
We confirm that industrial average scores are a better parameter for analysing the ESG standard of a firm as the number of variables grows, i.e., with more investments and subdivisions in the target firm or organisation. Moreover, calculating the industrial average for each topic (environmental, social and governance related activity) improves the quantitative ESG performance obtained from CSR reports.
$$sspc_{i,t} = \frac{\sum_{k=1}^{K} \log_n\left(n_{k,i,t}\right) d_{ij}}{\sum_{k=1}^{K} \log_n\left(n_{i,t} + 1\right)}$$
Relationship with ESG performance
With greater quantity and divergence in ESG activities, the ESG performance of a firm or organisation improves. We therefore conclude that there is a positive relation between our scores and the ESG performance of a firm. We cross-verified our hypothesis by comparing these scores against the Thomson Reuters Asset4 score. The KPIs are generalised scores ranging from 0 to 100; our score also ranges from 0 to 100, while the industrial average is bounded at 50. Scaling and scoring sometimes cannot be fitted against empty segments of missing data, so we used the Tokyo Stock Exchange logarithm as a control variable over the extent of year t. The regression is as follows:
TABLE IV
INDICATOR SCORE
| Score | No. of indicators | Example |
|---|---|---|
| Environment | 73 | Pollutant released; Metric Tons of CO2 |
| Social | 42 | Women to Men ratio; Anti-slavery policy |
| Governance | 56 | Board meetings; CEO compensation |
The Tokyo Stock Exchange logarithm data cover the years 2006 to 2012, and the control variable is added over this period.
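As a rough illustration of this kind of cross-check, the sketch below fits a one-predictor ordinary least squares regression of a reference ESG score on a text-based score and reports the adjusted R². It omits the control variable and year terms, and the data points are invented for the example; it is not the paper's regression.

```python
def ols_fit(x, y):
    """Ordinary least squares with one predictor.

    Returns (slope, intercept, r2, adjusted_r2). The adjusted R^2 uses
    1 - (1 - R^2)(n - 1)/(n - p - 1) with p = 1 predictor.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - 2)
    return slope, intercept, r2, adj_r2

if __name__ == "__main__":
    # Invented values: text-based scores (x) and reference ESG scores (y).
    text_scores = [10.0, 20.0, 30.0, 40.0, 50.0]
    reference_scores = [22.0, 35.0, 41.0, 60.0, 66.0]
    print(ols_fit(text_scores, reference_scores))
```

A positive slope with a non-trivial adjusted R² is what a positive relation between the two scores would look like in such a fit.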
Comparative ESG strategy
Our secondary objective in the proposed model is to provide a better investment option while we analyse our target firm or organisation. For example, suppose we have chosen firm A to invest in and we analyse its ESG activities. A traditional analyst will target that firm's data and decide whether the investment is good or bad. In such a case, however, the simple use of Node2vec, specifically word2vec, can ease the search for an alternative investment option, and again the whole analysis is covered by the model.
Fig. 5. Related firm confidence plots
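One simple way to turn embeddings into a list of related firms is to rank the other firms by cosine similarity to the target. The sketch below assumes each firm already has an embedding vector; the firm names and three-dimensional vectors are hypothetical, and this ranking is an illustration rather than the paper's Node2vec procedure.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_related(target, embeddings):
    """Return (firm, similarity) pairs for all other firms, most similar first."""
    base = embeddings[target]
    scored = [(name, cosine(base, vec))
              for name, vec in embeddings.items() if name != target]
    return sorted(scored, key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    # Hypothetical firm embeddings.
    firms = {"A": [1.0, 0.0, 0.0], "B": [0.9, 0.1, 0.0], "C": [0.0, 1.0, 0.0]}
    for name, score in rank_related("A", firms):
        print(name, round(score, 3))
```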
Word2vec implements a uni-gram algorithm to find phrases in the whole data set and create a distributional hypothesis for finding the alternative investment. We used the word embedding output to plot a 3D graph, which is the visual output that makes the feature easier to access. The confidence factor in Figure 5 above is the term frequency, calculated in an inverse document frequency manner:
$$W_{i,j} = tf_{i,j} \cdot \log\left(\frac{n}{df_i}\right)$$
Fig. 6. Final embedded 3D graph
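The term weighting above can be reproduced with a short sketch. It assumes raw counts for the term frequency and the natural logarithm, neither of which the paper specifies, and the two token lists are invented.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute W_ij = tf_ij * log(n / df_i) for tokenised documents.

    tf_ij is the raw count of term i in document j, n is the number of
    documents and df_i is the number of documents containing term i.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in documents:
        tf = Counter(tokens)
        weights.append({term: count * math.log(n_docs / doc_freq[term])
                        for term, count in tf.items()})
    return weights

if __name__ == "__main__":
    docs = [["carbon", "carbon", "board"], ["board", "audit"]]
    for doc_weights in tf_idf(docs):
        print(doc_weights)
```

A term that appears in every document gets weight 0, so the weighting highlights terms that distinguish one firm's reports from the others.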
CONCLUSION
In the coming years ESG will be fundamental to sustainable growth and will factor into all investments in the market. The proposed paper analysed text data of different firms and organisations, specifically on the subject of ESG activity. The traditional method of manual analysis is time consuming and was found to be somewhat inaccurate, as proper timestamping and real-time data are not always available. This calls for a mining model that visualises the ESG activities of firms and organisations by acquiring documented text data published according to global standards. We tested it on more than 100 organisations, and the hypothesis of evaluating ESG scores by automating natural language processing and various machine learning techniques proves to be a better and faster solution against the disadvantages of manual work and greenwashing by organisations. The figure and table summarise the work done in this paper. We found aspects of ESG performance from a firm's CSR report (firm name not disclosed), and it can be concluded that further investigation and proper sustainable methods could be taken to improve ESG ratings.
TABLE V
ESG PERFORMANCE
| Category | ESG Performance |
|---|---|
| Adjusted R2 | E: 0.224; S: 0.233; G: 0.296 |
| E score | 3.8 |
| Specific E score | 0.6608 |
| Average ind. E score | 11.21 |
| S score | 19.46 |
| Specific S score | 0.417 |
| Average ind. S score | 27.45 |
| G score | 10.45 |
| Specific G score | 0.433 |
| Average ind. G score | 19.42 |
Standard errors: *p<0.05; **p<0.01; ***p<0.001
REFERENCES