KNO.E.SIS Research Blog: 2016

Bots in the Election

In the Kno.e.sis Center at Wright State University, we continue to refine our Twitris technology (Licensed by Cognovi Labs LLC) for collective social intelligence to analyze social media (esp. Twitter) in real time. Kno.e.sis and Cognovi Labs teamed up with the Applied Policy Research Institute (APRI) early in the year and created some tools to monitor the debates. See press coverage on TechCrunch. From the time that we first began following the nominees on Twitter, one thing became clear: Donald Trump was considerably more popular than his competition during the primaries as well as the general election. To be honest, I had never considered the possibility that social bots may have been playing a role in this popularity.

After the conclusion of the first debate all parties who had watched our "Debate Dashboard" were shocked not just by the volume of tweets but by the sentiment and emotion that appeared more positive for Trump than Clinton. When we investigated news from major media outlets, we became more and more concerned that our tool had some serious flaws. Due to articles we had seen discussing the large support Trump had on Twitter, we decided to focus on sentiment. Up until a few days before the election, we continued to update and improve our sentiment analysis algorithm.

Notwithstanding the improvements in precision to our sentiment classifier, we continued to see Trump as the clear leader. As the debates came and went, our data remained consistent. We began an urgent quest for reason. We added gender analysis because media outlets were telling us that women were down on Trump and would be a major force in the election. Our analysis did not show this, despite having 96% precision in determining female and male users. We developed a proprietary process to separate users into left-leaning and right-leaning users. We could even say whether a user was strongly or loosely associated with a particular political party. Unfortunately, analyzing the data based on political association didn't help either. Surprisingly many strongly left leaning users were anti-Hillary, just a bit behind the right-leaning users.

After the second debate, we began to see many articles pop up about social bots. Once we began to look more into the issue, we found many articles from early in the year talking about Trump's "Bot Army" (Trumps' Biggest Lie? The Size of His Twitter Following). We had our aha moment. In that article, there is a reference to The Atlantic's use of a tool called BotOrNot. We decided to attempt to use this or some similar tool during the last debate to remove users with bot accounts and analyze the remaining data.

BotOrNot is a tool developed at the University of Indiana, Bloomington with collaboration from the University of Southern California, Marina Del Rey. Their tool creates over one thousand metrics by looking at the user account and analyzing retweets, hashtags, metadata, etc. The tool performed extremely well in the DARPA Twitter Bot Challenge, having correctly identifying all of the known bots (though it did incorrectly mark some additional users as bots). We were excited to learn that they had made their tool available through an API endpoint and decided to run our tweets through the system to test the speed at which we could process users. Twitris at this point was processing nearly 35 tweets per second for the election analysis alone, and it was clear very quickly that their service would not be able to handle the volume of data we would be consuming.

Though the final presidential debate was only several days away, we still had hope that we would find an answer. We saw Prof. Philip Howard, from the University of Oxford, mention in The Washington Post that they considered any user that tweets more than 50 times in one day a bot. It would be relatively simple to create an index of users and simply increment the count per tweet and then check the index quickly as the tweets roll through. We may have done this if our team had been free at the time. Some were working on bug fixes, others on improving sentiment, and still others were working to fix some infrastructure issues that we were experiencing at the time. Our corpus of tweets for the election campaign was on its way to exceeding 60 million tweets. A robust implementation would have required more time than anyone had to offer.

I suppose that now is a better time than never to define what a "bot" is. Phil Howard says, in that Washington Post article mentioned above, that "A Twitter bot is nothing more than a program that automatically posts messages to Twitter and/or auto-retweets the messages of others". I personally like this definition, but it leaves a little wiggle room where tools like BotOrNot, which focus primarily on the user accounts, are considered. A Twitter user can be a real, living, breathing, human being and still exhibit bot-like tweeting habits. I think that this is the reason that Howard's group settled on the 50-tweets-per-day stat instead of relying on a classifier. A user can tweet for themselves part of the time, but still have some automation process that tweets certain things on their behalf. There are many reasons that someone would do this, to increase their Klout score (imagine a LuLaRoe seller, YouTuber, or Blogger), for example. There are many companies that you can pay to automate this kind of activity for you. Some, like Linkis' "Convey" (more on this later) work by finding influential tweets and tweeting them on your behalf. These tweets are fairly easy to spot because they attach "via @c0nvey" to the end of the original tweet.

In the end, what we did was to develop a system that is able to quickly and accurately weed out tweets that were not authored by humans, even if the user account is owned by an actual human. Let's see how well our system stacks up to the BotOrNot service. We collected all of the bots found over one fifteen minute period and ran them through BotOrNot. 67.16% of the accounts from the tweets we labeled were determined by BotOrNot to be “bot” owned accounts. I was a little disappointed by this, so I took a look at the users that BotOrNot dismissed as human. In the first pass, I decided to apply Howard's 50-tweets-per-day rule. Doing this increased the percentage of accurately labeled bot tweets to 73.88%. One of the screen names found among this group actually contained the word "bot" in it. The most important thing for us at this point was to make sure that we weren’t getting a lot of false positives. We looked at each account one by one; we looked at each tweet in our system, both those classified as "likely bot" and "likely human".

Figure: Real-time labeling of bot and human tweets in Twitris

Many of the users with a low tweet-per-day usage had a combination of the two, and looking at the tweets that were labeled "likely bot", nearly all of them contained "via @c0nvey". Digging a bit further, I found that the Convey service has been accused of tweeting on behalf of unwitting users in the past. Here's a thread from one Twitter user talking about his experience:

So, our system is able to accurately detect automated (bot) tweets despite the fact that the user is not a bot. Our average detection rate also looks reasonable compared to other services. We experience bot traffic at a rate of about a 5% per day in our election campaign, though there are days that get to nearly 8%.

Moving Forward (oh, and Fake News)

Elections come and go. Once they are gone, we seek to apply our findings to future work. Being able to detect bots is great; but, what else can we do with that? Well, post-election we have learned the term “fake news” (with content that was entirely made up, not grounded in truth or reality), something that not too many people were concerned about before: see the Google Trend.

Where does all of this fake news come from? There have been troves of news reports blaming Facebook and Twitter for altering the outcome of the election. Obviously, Facebook and Twitter themselves weren’t creating pro-Trump news (Facebook in particular was accused of killing pro-Trump trends); however, some blame these companies for not alerting people that the news that they were “allowing” to spread was fake. I don’t think that that is fair. These companies are reliant on the fact that they don’t publish news (for more, read this article on Facebook, the “News Feed”, and the Communications Decency Act 230) to avoid lawsuits.

During the evaluation of our bot detection system, we noticed something interesting. A large majority of the bot-labeled tweets contained links to dubious looking news stories. Because of our ability to identify these bot tweets, we can exclude them from analysis when considering a brand-centered campaign (like Samsung during the Note 7 battery “situation”). I think that it is important to note that, despite the elimination of “fake news” and “bot tweets” from our analysis, Trump was always winning. We saw the same thing with the Brexit referendum earlier this year, where Twitris helped us to correctly predict the Brexit outcome before the polls closed. There is clear evidence to support the fact that high tweet volume translates to success (except Bernie Sanders, but there may be, uh-hum, other reasons for that). It seems to me that for bots and “fake news” to have swayed the election, they would have needed to be ready to go as soon as Trump announced that he was running, but what we have seen is that he was always ahead.

We will continue to find new ways to leverage everything that we learned from the 2016 Election. If you want to stay up to date on our analysis, please sign up to receive CognoviLabs’ newsletter at www.CognoviLabs.com or join Kno.e.sis on FB, and while you are there check the other post-election analysis.

4:15:00 PM
88 Comments

Subjectivity — Tapping All the Valuable Insights beyond Sentiment for Nextgen Information Extraction

Subjective Information Extraction

The information in text can be generally divided into two categories: objective information and subjective information. Objective information encompasses the facts about something or someone, while subjective information is about someone's personal experiences. For example, the fact that it is raining is objective, while how one feels about the rain is subjective; that you have not had breakfast is objective, while your feeling of hunger is subjective; that you watched “You Can Count on Me” last night is objective, while your emotion of heartwarming because of the movie is subjective.

Subjective information about what people think and how people feel is useful for all parties including individuals, businesses, and government agencies during their decision-making processes. The traditional way of collecting subjective information takes the form of surveys, questionnaires, polls, focus groups, interviews, etc. For example, individuals ask their friends which cell phone carriers they recommend and whether the coverage is good in their area; retailers conduct a focus group to have in-depth discussions with their target customers about how they feel regarding shopping in the stores; governments solicit public opinions on particular policy issues via surveys.

The web and social media have changed the way we communicate and provide new potentially powerful avenues for us to glean useful subjective information from user generated content such as blogs, forum posts, reviews, chats, and microblogs. However, much of the useful subjective information is buried in ever-growing user generated data, which makes it very difficult (if not impossible) to manually capture and process the information needed for various purposes. To address the information overload, it is essential to develop techniques to automatically discover and derive high-quality (i.e., contextually or application relevant and accurate) subjective information from user generated content.

Current subjectivity and sentiment analysis efforts have been focused on classifying the text polarity, specifically, whether the expressed opinion for a specific topic in a given text (e.g., document, sentence, word/phrase) is positive, negative, or neutral. This narrow definition considers subjective information and sentiment as the same object, while other types of subjective information (e.g., emotion, intent, preference, expectation) are either not taken into account, or are handled similarly without sufficient differentiation. This limitation may prevent the exploitation of subjective information from reaching its full potential.

At Kno.e.sis, we extend the definition of subjective information and develop a unified framework that captures the key components of diverse types of subjective information. We define a subjective experience as a quadruple (h, s, e, c), where h is an individual who holds the experiences, s is a stimulus (or target) that elicits the experiences, e.g., an entity or an event, e is a set of expressions that are used to describe the subjective experiences, e.g., the sentiment words/phrases or the opinion claims, and c is a classification or assessment that characterizes or measures the subjectivity. Accordingly, the problem of identifying different types of subjective information can all be formulated as a data mining task that aims to automatically derive the four components of the quadruple from text, as illustrated in Table 1.

Table 1 Components of sentiment, opinion, emotion, intent, preference and expectation.
Subjective Experience	Holder h	Stimulus s	Expression e	Classification c
Sentiment	an individual who holds the sentiment	an object	sentiment words and phrases	positive, negative, neutral
Opinion	an individual who holds the opinion	an object	opinion claims (may or may not contain sentiment words)	positive, negative, neutral
Emotion	an individual who holds the emotion	An event or situation	emotion words and phrases, description of events/situations	anger, disgust, fear, happiness, sadness, surprise, etc.
Intent	an individual who holds the intent	an action	expressions of desires and beliefs	depending on specific tasks
Preference	an individual who holds the preference	a set of alternatives	the expressions of liking, disliking or preferring an alternative	depending on specific tasks
Expectation	an individual who holds the expectation	an object	expressions of beliefs about someone or how something will be	depending on specific tasks

Consider the following example:

“Action and science fiction movies are usually my favorite, but I don't like the new Jurassic World. Mad Max: Fury Road is the best I've seen so far this year. It's a magnificent visual spectacle and the acting is stellar too. I cried, laughed and smiled watching Inside Out. It was so touching. Would like to watch the new Spy movie this weekend. I hope it’s good!''

The traditional sentiment analysis would find positive opinion about action and science fiction movies, “Mad Max: Fury Road,” “Inside Out” and “Spy,” and find negative opinion about the movie “Jurassic World.” However, if we consider different types of subjective information, and handle each particular type based on the framework we proposed, we will be able to derive much richer information from the text, as illustrated in Table 2.

Table 2 Information that can be extracted from the example text.
Subjective Experience	Holder h	Stimulus s	Expression e	Classification c
Preference	the author	movie genres	“favorite”	prefer action and science fiction movies over other types of movies
Sentiment	the author	movie Jurassic World	“don’t like”	negative
Opinion	the author	movie Mad Max: Fury Road, visual effect, performances	“best”, “magnificent”,“spectacle”, “stellar”	positive
Emotion	the author	movie Inside Out	“cried”, “laughed”, “smiled”, “touching”	sadness, joy, touching
Intent	the author	movie Spy	“would like to”	transactional
Expectation	the author	movie Spy	“hope”	optimistic

Figure 1 depicts the process of subjective information extraction. At the beginning, a number of preprocessing steps are needed to handle the raw textual data before the information extraction can take place. Common preprocessing steps include sentence splitting, word tokenization, syntactic parsing or POS tagging, and stop words removal. Afterwards, an optional step is to detect the subjective content from the input text, such as classifying the sentences into subjective or objective categories. The subjective content can be further classified into different types, e.g., sentiment, emotion, intent and expectation. Language resources such as WordNet, Urban Dictionary, and subjectivity lexicons (e.g., MPQA, SentiWordNet) can be used for the subjectivity classification task.

Figure 1 An overview of subjective information extraction.

The next step is to extract the four components of subjective experiences, including the holder, the stimulus or target, the set of expressions, and the classification category or assessment score. Depending on the type of subjective information, specific techniques need to be developed and applied. For example, the target of sentiment is usually an entity, and thus entity recognition is used to extract sentiment target; while the target of intent can be an action, e.g., “to buy a new cell-phone”, thus we need to develop techniques to extract actions from text. In addition, for the same type of subjective information, different classification/assessment schema and techniques may need to be developed according to the purpose of application. For example, many sentiment analysis and opinion mining systems classify the polarity of a text (e.g., a movie review, a tweet) as positive, negative or neutral [1-3], or rate it on a 1-5 stars rating scale [4,5]. Some emotion identification systems focus on classifying emotions into six basic categories: anger, disgust, fear, happiness, sadness, and surprise [6], while some other systems define their own set of emotions based on the application purpose, e.g., understanding emotions in suicide notes [7], identifying emotions that people express using cursing words [8], classifying emotional response to TV shows and movies [9]. Existing work on detecting users' query intent classifies search queries into three categories: navigational, informational, or transactional [10,11]. Studies on identifying purchase intent (PI) for online advertising classify users' posts into PI or Non-PI [12], or information seeking or transactional [13].

Finally, the extracted subjective information can be used for a wide variety of applications, including but not limited to business analytics, Customer Relationship anagement (CRM), marketing, predicting the financial performance of a company, targeting advertisement, recommendation (based on users' interest and preference), monitoring social phenomena (e.g., social tension, subjective well-being), and predicting election results.

At Kno.e.sis, we have developed automatic methods to extract components of different subjective experiences. We have proposed an optimization-based approach that extracts a diverse set of sentiment-bearing expressions, including formal and slang words/phrases, for a given target from an unlabeled corpus [2]. We have developed a clustering approach that identifies opinion targets (product features and aspects) from plain product reviews [14]. The proposed approach identifies features and clusters them into aspects simultaneously. Furthermore, it extracts both explicit and implicit features and does not require seed terms. We have also explored the classification and assessment of different types of subjective information. In particular, we have explored supervised methods for emotion classification [6-9]. We have proposed methods to group opinion holders based on their political preference and participation in the discussion about election candidates on Twitter, and assess their sentiments towards the candidates to predict the election results [15]. In order to understand the effect of religiosity on happiness, we analyzed the tweets and networks of more than 250k U.S. Twitter users who self-declared their religious beliefs, and examined the pleasant/unpleasant emotional expressions in users' tweets to estimate their subjective well-being [16,17].

References:

[1] Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. "Thumbs up?: sentiment classification using machine learning techniques." EMNLP. 2002.

[2] Lu Chen, Wenbo Wang, Meenakshi Nagarajan, Shaojun Wang, Amit Sheth. Extracting Diverse Sentiment Expressions with Target-dependent Polarity from Twitter. ICWSM. 2012.

[3] Cícero Nogueira dos Santos, and Maira Gatti. "Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts." COLING. 2014.

[4] Ganu, Gayatree, Noemie Elhadad, and Amélie Marian. "Beyond the Stars: Improving Rating Predictions using Review Text Content." WebDB. Vol. 9. 2009.

[5] Sharma, Raksha, et al. "Adjective Intensity and Sentiment Analysis." EMNLP. 2015.

[6] Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, Amit Sheth. Harnessing Twitter "Big Data" for Automatic Emotion Identification.SocialCom. 2012.

[7] Wenbo Wang, Lu Chen, Ming Tan, Shaojun Wang, Amit Sheth. Discovering Fine-grained Sentiment in Suicide Notes. Biomedical Informatics Insights (BII). 2012.

[8] Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, Amit Sheth. Cursing in English on Twitter. CSCW. 2014.

[9] Justin Martineau, Lu Chen, Doreen Cheng and Amit Sheth. Active Learning with Efficient Feature Weighting Methods for Improving Data Quality and Classification Accuracy. ACL. 2014.

[10] Jansen, Bernard J., Danielle L. Booth, and Amanda Spink. "Determining the informational, navigational, and transactional intent of Web queries." Information Processing & Management 44.3: 1251-1266. 2008.

[11] Hu, Jian, et al. "Understanding user's query intent with wikipedia." Proceedings of the 18th international conference on World wide web. ACM, 2009.

[12] Gupta, Vineet, et al. "Identifying Purchase Intent from Social Posts." ICWSM. 2014.

[13] Nagarajan, Meenakshi, et al. "Monetizing user activity on social networks-challenges and experiences." Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology-Volume 01. IEEE Computer Society, 2009.

[14] Lu Chen, Justin Martineau, Doreen Cheng and Amit Sheth. Clustering for Simultaneous Extraction of Aspects and Features from Reviews. NAACL. 2016.

[15] Lu Chen, Wenbo Wang, Amit Sheth. Are Twitter Users Equal in Predicting Elections? A Study of User Groups in Predicting 2012 U.S. Republican Presidential Primaries. Proceedings of the 4th International Conference on Social Informatics (SocInfo). 2012.

[16] Lu Chen, Ingmar Weber and Adam Okulicz-Kozaryn. U.S. Religious Landscape on Twitter. Proceedings of the 6th International Conference on Social Informatics (SocInfo), 2014.

[17] Lu Chen. “Mining and Analyzing Subjective Experiences in User Generated Content.” Ph.D. Dissertation. Department of Computer Science & Engineering. [Dayton]: Wright State University; 2016. p. 161.

5:30:00 PM
170 Comments

Implicit Information Extraction

Implicit entity

People communicate their ideas, opinions, and facts using natural language. It is one of the powerful tools that we as humans collectively developed over hundreds of thousands of years and will continue to develop in years to come. While the debate on the evolution of the languages has not reached a consensus, the theories have estimated that it first evolved around 150,000 - 350,000 years ago, which is roughly the time frame accepted for the evolution of modern Homo sapiens.1 English is one of the most used descendants of this evolution and it has evolved over 1,400 years on its own.2 The evolution of the language accounts for the social, cultural, and economic changes which have taken place in society. For example, the industrial revolution took place in the 18th and 19th century and had a great impact on these three dimensions. It added new words to the language such as condenser, vacuum, reservoir, taxonomy, sodium, and platinum [1].

The evolution of the language has enriched it with many features. One such feature is its ability to express facts, ideas, and opinions in an implicit manner. As humans, we seamlessly use implicit constructs in our daily conversations and rarely find it difficult to decode the content of the messages. Consider the following tweet:

Are we not going to talk about how ridiculous the new space movie with Sandra Bullock is going to be?
— ☯Mac (@mfpokesmot) September 15, 2013

This tweet contains an implicit mention of the movie 'Gravity'. A human with up-to-date knowledge on movies would instantly understand that the tweet talks about the movie 'Gravity'. However, the whole field of information extraction, which has the objective of automatically extracting structured information from unstructured and/or semi-structured data, almost exclusively focused on extracting explicit information from the text. Consider the following text snippet extracted from a clinical narrative.

"Bob Smith is a 61-year-old man referred by Dr. Davis for outpatient cardiac catheterization because of a positive exercise tolerance test. Recently, he started to have left shoulder twinges and tingling in his hands. A stress test done on 2013-06-02 revealed that the patient exercised for 6 1/2 minutes, stopped due to fatigue. However, Mr. Smith is comfortably breathing in room air. He also showed accumulation of fluid in his extremities. He does not have any chest pain."

The state-of-the-art information extraction algorithms would extract information like 'Bob Smith' and 'Dr. Davis' are entities of type person, 'cardiac catheterization' and 'chest pain' are the same as entities identified with concept unique identifiers (CUI) C0018795 and C0008031 in unified medical language system (UMLS), and there is a cause-effect relationship between 'fatigue' and exercising. The algorithms developed for named entity recognition, entity linking, and relationship extraction would help to extract this structured information from the above text snippet. However, the two sentences "Mr. Smith is comfortably breathing in room air" and "He also showed accumulation of fluid in his extremities" implicitly indicate that the patient does not have the clinical condition named 'shortness of breath', but he has 'edema.' This is very important information for assessing the health status of the patient and a medical professional reading this snippet would easily decode the mentions of these two clinical conditions in the text. An automatic entity linking technique should identify these mentions as same as the entities identified with CUIs C0013404 ('shortness of breath') and C0013604 ('edema') in UMLS. Commercially, it has applications such as Computer Assisted Coding and Computerized Document Improvement. Unfortunately, current information extraction algorithms would not be able to extract this implicit information.

Implicit constructs are not a rare occurrence. Our studies found that 21% of the movie mentions and 40% of the book mentions are implicit in tweets, and about 35% and 40% of 'edema' and 'shortness of breath' mentions are implicit in clinical narratives. There are genuine reasons why people tend to use implicit mentions in daily conversations. Here are few reasons that we have observed:

To express sentiment and sarcasm : The following tweet has an element of sarcasm and a negative sentiment towards the movie 'Transformers: Age of Extinction.' These feelings were expressed implicitly in this tweet. It is noted that people heavily use implicit constructs to express sarcasm [2].

I'm striving to be positive in what I say on Twitter. So I'll refrain from making a comment about the latest Michael Bay movie.
— Darren Currin (@darrencurrin) July 22, 2014
To provide descriptive information : For example, it is a common practice to describe the features of an entity rather than simply list down its name in clinical narratives. Consider the sentence "small fluid adjacent to the gallbladder with gallstones which may represent inflammation." This sentence contains an implicit mention of the clinical condition 'cholecystitis' and provides important information about the patient's health status that would be missing if the author chose to list down only the name of clinical condition. The condition 'cholecystitis' means "inflammation in gallbladder" with multiple causes and the sentence provides a detailed explanation of 'cholecystitis' along with the possible cause. This descriptive information is critical in understanding the patient's health status and treating the patient.
To emphasize the features of an entity : Sometimes we replace the name of the entity with its special characteristics in order to give importance to those characteristics. For example, the text snippet "Mason Evans 12 year long shoot won big in golden globe" has an implicit mention of the movie 'Boyhood.' There is a difference between this text snippet and its alternative form "Boyhood won big in golden globe." The speaker is interested in emphasizing the distinct feature of the movie, which would have been ignored if he had used the name of the movie as in the second phrase.
To communicate shared understanding : We do not bother spelling out everything when we know that the other person has enough background knowledge to understand the message conveyed. A good example is the fact that clinical narratives rarely mention the relationships between entities explicitly (e.g., relationships between symptoms and disorders, relationships between medications and disorders), rather it is understood that the other professionals reading the document have the expertise to understand such implicit relationships in the document.

The above examples show the value added by the implicit constructs to daily communications. Another important observation is the role of world knowledge in interpreting implicit constructs. A human reading the text with implicit information would only be able to decode implicit information if he/she has relevant knowledge on the domain. A reader who does not know about Michael Bay's movie release would have no clue about the movie mentioned in the tweet with sarcasm; a reader who does not know the characteristics of the clinical conditions 'shortness of breath' and 'edema' would not be able to decode their mentions in the clinical text snippet shown above; a reader who is not a medical expert would not be able to connect the diseases and symptoms mentioned in a clinical narrative.

The implicit information extraction task demands comprehensive and up-to-date world knowledge. Individuals resort to a diverse set of entity characteristics to make implicit references (also see [3]). For example, the implicit references to the movie 'Boyhood' use phrases like "Richard Linklater movie", "Ellar Coltrane on his 12-year movie role", "12-year long movie shoot", "latest movie shot in my city Houston", and "Mason Evan's childhood movie." Hence, it is important to have comprehensive knowledge about the entities to decode their implicit mentions. Another complexity is the temporal relevancy of the knowledge. The same phrase can be used to implicitly refer to different entities at different time intervals. For instance, the phrase "space movie" could refer to the movie 'Gravity' in fall 2013 while the same phrase in fall 2015 would likely refer to the movie 'The Martian.' On the flip side, the most salient characteristics of the movies may change over time, and so will the phrases used to refer to them. The movie 'Furious 7' was frequently referred to with the phrase "Paul Walker's last movie" in November 2014. This was due to the actor's death around that time. However, after the movie release in April 2015 the same entity was often mentioned through the phrase "fastest film to reach the $1 billion."

At Kno.e.sis, we have developed a knowledge-driven solution to perform implicit information extraction. This solution acquires relevant domain knowledge from a diverse set of structured and unstructured knowledge sources, processes acquired knowledge to represent it in a machine readable manner, and contains information extraction techniques that uses these knowledge sources to decode the implicit information in the text. We have successfully applied this solution to extract implicit entities and relationships in clinical narratives [4] [6] and implicit entities in tweets [5].

1https://en.wikipedia.org/wiki/Origin_of_language

2https://en.wikipedia.org/wiki/English_language

References:

[1] Bragg, Melvyn. The adventure of English: The biography of a language. Arcade Publishing, 2006.

[2] Davidov, Dmitry, Oren Tsur, and Ari Rappoport. "Semi-supervised recognition of sarcastic sentences in twitter and amazon." Proceedings of the fourteenth conference on computational natural language learning. Association for Computational Linguistics, 2010.

[3] "Help For HealthCare: Mapping Unstructured Clinical Notes To ICD-10 Coding Schemes." Http://www.dataversity.net/. N.p., 26 Nov. 2013. Web. 19 Aug. 2016.

[4] Sujan Perera, Pablo N. Mendes, Amit Sheth, Krishnaprasad Thirunarayan, Adarsh Alex, Christopher Heid, and Greg Mott. "Implicit entity recognition in clinical documents." In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics (*SEM), pp. 228-238. 2015.

[5] Sujan Perera, Pablo N. Mendes, Adarsh Alex, Amit P. Sheth, and Krishnaprasad Thirunarayan. "Implicit Entity Linking in Tweets." In Extended Semantic Web Conference, pp. 118-132. Springer International Publishing, 2016.

[6] Sujan Perera, Cory Henson, Krishnaprasad Thirunarayan, Amit Sheth, and Suhas Nair. "Semantics driven approach for knowledge acquisition from EMRs." IEEE journal of biomedical and health informatics 18, no. 2 (2014): 515-524.

4:26:00 PM
32 Comments

Bots in the Election

Subjectivity — Tapping All the Valuable Insights beyond Sentiment for Nextgen Information Extraction

Implicit Information Extraction

Labels

Blog Archive

Other Links