NLP Challenges with Tisane: Code Words & Secret Language

Introduction

NLP faces its own unique challenges in law enforcement and intelligence scenarios, and the mainstream NLP frameworks rarely address them. From conversations with our law enforcement users, we learned that one such challenge is the use of code words to subvert detection efforts.

Oftentimes, offenders or intelligence assets replace important words with seemingly unrelated, mutually agreed-upon terms. You never know who’s listening, right? Or where the device ends up. Code words are not the same as jargon or slang; they form a disposable secret language designed to obscure the real meaning.

Once the investigators start guessing what these unrelated terms mean, they can try using keywords. But what if it’s more than a couple of words, or the inventors are particularly creative, or they supplement the code with purposely introduced misspellings?

What if the misspellings are the auto-correct gone rogue?

Tisane has a solution for that.

Peeking Under the Hood

In Tisane, text processing is built around concepts (or word-senses) rather than words. The word-senses are clustered in structures called families, each provided with a unique numeric ID. A family contains a set of synonyms and their inflected forms, complete with a set of features. The family ID is crosslingual, as clusters of synonyms are aligned across all languages supported by Tisane, now or in the future.

By default, Tisane’s decisions are made based on the data coming from its language models. However, since the Darjeeling release, Tisane has a so-called “long-term memory” module. The long-term memory module, among other things, allows making changes to attributes at the level of a call, “redefining” the contents of memory, and preserving the accumulated knowledge, if the calling application chooses to do so. As the word-sense is just one of the attributes, it can be redefined on the fly, too. Doing so assigns a new sense to whole categories. As the family IDs are crosslingual, there is no “family ID for English” or “family ID for Japanese”. It’s exactly the same family ID everywhere.

To assign a new meaning to the code words, the settings parameter in Tisane’s processing methods (POST /parse and POST /transform) must contain the section where the redefinition takes place. The concepts matching the redefinition criteria are then assigned the new attributes. To redefine the word sense, we simply need to supply a new family ID.

In your application, you may allow your users to look up and redefine the word sense on the spot, in order to run experiments with information extraction.

Examples

Mushrooms as Code Words for “Spy”

Let’s say a group agrees among themselves that whenever they mention a mushroom (of any kind), they actually mean a spy. It may be a chanterelle, a truffle, or any other type of mushroom.

We’ll use two methods:

  • GET /lm/senses (in Tisane Embedded, ListSenses) to locate the family IDs for the relevant concepts.
  • the same old POST /parse (in Tisane Embedded, Parse), for which we’ll need to build a redefinition clause.

In Tisane, we need to:

  1. Look up the family ID of fungus. This is accomplished by invoking GET /lm/senses as shown on the screenshot below. The family ID is stored in the Key attribute; here it is 79199. This is the family that will be used for our criteria. (When there are several interpretations, pick the correct one based on the definition.)

  2. Look up the family ID of spy. Again, this is accomplished by invoking GET /lm/senses as shown on the screenshot below:

There are a number of senses; looking at the definition attribute, we see that one of them means a “secret agent”. Its family ID is 67433. This will be our target family.
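The selection step can be sketched in Python. This is only an illustration: the post confirms that each sense carries a Key and a definition attribute, but the overall response shape, the sample records, and the helper name pick_family_id are assumptions made for the example.

```python
from typing import Dict, List, Optional


def pick_family_id(senses: List[Dict], keyword: str) -> Optional[int]:
    """Return the Key of the first sense whose definition mentions `keyword`."""
    needle = keyword.lower()
    for sense in senses:
        if needle in sense.get("definition", "").lower():
            return int(sense["Key"])
    return None  # no sense matched; a human should inspect the list


# Made-up, abbreviated stand-in for a GET /lm/senses result for "spy".
# Only the family ID 67433 comes from the post; Key 0 is a placeholder.
sample_senses = [
    {"Key": 0, "definition": "to watch someone secretly"},
    {"Key": 67433, "definition": "a secret agent"},
]

spy_family = pick_family_id(sample_senses, "secret agent")
```

In a real application, the candidate list would more likely be shown to the user, who then picks the intended sense by reading the definitions.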

  3. Finally, we can build our reassignment clause. We want any kind of fungus to match.

This means that we need to match all word-senses whose hypernym (an umbrella concept or a super-family, if you will) is fungus (79199), and redefine them as family 67433 (spy). The reassignment clause is:

{"if":{"hypernym":79199},"then":{"family":67433}}

The reassignment clauses are kept in the assign array under memory.
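As a rough sketch, the request body for POST /parse could be assembled as follows. The nesting of assign under memory inside settings follows the description above; the top-level language and content field names, the sample sentence, and the helper name build_parse_request are assumptions that should be checked against the Tisane API documentation.

```python
import json
from typing import Dict, List


def build_parse_request(content: str, language: str, assign: List[Dict]) -> Dict:
    """Build a request body whose settings carry the reassignment clauses."""
    return {
        "language": language,  # assumed field name
        "content": content,    # assumed field name
        "settings": {"memory": {"assign": list(assign)}},
    }


# The clause from the post: anything under fungus (79199) becomes spy (67433).
MUSHROOM_AS_SPY = {"if": {"hypernym": 79199}, "then": {"family": 67433}}

body = build_parse_request("The shiitake arrives on Friday.", "en", [MUSHROOM_AS_SPY])
print(json.dumps(body, indent=2))
```

The resulting dictionary would be sent as the JSON body of the call, together with whatever authentication the service requires.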

Let’s test:

Shiitake is a mushroom, and, in this case, a spy

Finding Items Used in Criminal Activities

Suppose a law enforcement agency is looking to locate any record of items that could have been used in a burglary. We can, of course, manually search for hammers, screwdrivers, maybe other tools. We may find something or miss something.

We can, on the other hand, link all the tools (e.g. hammers, drills, etc.) and fasteners (e.g. nails) to the category of “illegal item”. That does not mean that the item is illegal, of course, but it will cause Tisane to generate a criminal_activity alert.

We need to:

  1. Look up the family IDs of tool and fastener. The family IDs are 34876 and 28191, as can be seen from the screenshots below:

Looking up the family ID of “tool”

Looking up the family ID of “fastener”

  2. Look up the family ID of illegal item. The family ID is 123078.

Looking up the family ID of “illegal item”

  3. Build the reassignment clause. We want items that have family ID 34876 or 28191 as a hypernym to be linked to hypernym 123078. This leaves their existing hierarchy intact (they will still be “tools” and “fasteners”) and adds a new link. The assignment array is:

[{"if":{"hypernym":34876},"then":{"hypernym":123078}}, 
{"if":{"hypernym":28191},"then":{"hypernym":123078}}]

Let’s test:

Sledgehammer generated a “criminal activity” alert because it’s a tool
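Since every clause in the assignment array above has the same shape, the array could be generated rather than typed by hand. A minimal sketch, where link_hypernyms is a hypothetical helper name and the family IDs are the ones from this example:

```python
from typing import Dict, Iterable, List


def link_hypernyms(source_families: Iterable[int], target_family: int) -> List[Dict]:
    """Create one clause per source family, adding target_family as a hypernym."""
    return [
        {"if": {"hypernym": source}, "then": {"hypernym": target_family}}
        for source in source_families
    ]


TOOL, FASTENER, ILLEGAL_ITEM = 34876, 28191, 123078
burglary_clauses = link_hypernyms([TOOL, FASTENER], ILLEGAL_ITEM)
```

Adding another category of suspicious items then only requires one more family ID in the list.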

Linking Names of Discussion Participants

Many of Tisane’s patterns that mark personal attacks look for discussion participants. The trouble is, with a name alone, it’s impossible to know whether the person is a participant in the current discussion. On the other hand, the names and the aliases could be available from other sources.

Wouldn’t it be handy if there was a way to tell Tisane that Kewldude1995 and John_Smith are names of the participants, and the attacks on them should be treated as personal attacks?

While it may not be a “code word” problem, the solution is exactly the same: redefining family IDs. This time, based on a string pattern (as these names are not mapped in the language models).

We need to:

  1. Look up the family ID of discussion participant, as shown on the screenshot below:

  2. Unlike in the previous examples, we need to rely on plain string recognition. This is done by defining a regex condition:

{"if":{"regex":"Kewldude1995|John_Smith"},"then":{"hypernym":123887}}

Let’s test:

What if we want to treat all names as discussion participants? Then the condition in the second step is to be replaced by a condition looking for a name:

{"if":{"hypernym":44155},"then":{"hypernym":123887}}
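When the participant names come from another system, both kinds of clauses could be built programmatically. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and since the post does not say which regex dialect Tisane expects, escaping the names with Python’s re.escape is only a guess at compatible behavior.

```python
import re
from typing import Dict, Iterable

DISCUSSION_PARTICIPANT = 123887  # family ID used in the clauses above
NAME = 44155                     # family ID used above to match any name


def participant_clause(names: Iterable[str]) -> Dict:
    """Link the listed names and aliases to the discussion participant family."""
    escaped = [re.escape(name) for name in names]
    if not escaped:
        raise ValueError("at least one name is required")
    return {
        "if": {"regex": "|".join(escaped)},
        "then": {"hypernym": DISCUSSION_PARTICIPANT},
    }


def all_names_clause() -> Dict:
    """Treat every recognized name as a discussion participant."""
    return {"if": {"hypernym": NAME}, "then": {"hypernym": DISCUSSION_PARTICIPANT}}


clause = participant_clause(["Kewldude1995", "John_Smith"])
```

Escaping matters once aliases contain characters such as dots or plus signs, which would otherwise be read as regex operators.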

Conclusion

In this post, we’ve shown how to use Tisane to extract information while redefining the language models on the fly. These techniques can be used to tackle challenges like secret language and auto-correct.

If you have any questions or remarks, please feel free to contact us or connect with us on LinkedIn.

In the next posts, we’ll show how to use the same long-term memory module to detect patterns based on multiple signals, tackling challenges common in applications like detection of child grooming and fraud.

Tisane API Integrated with PubNub

PubNub, the company behind the world’s leading realtime Data Stream Network (DSN), added Tisane API as a supported component in its catalog. The Tisane Labs Natural Language Processing Block runs serverlessly in the PubNub network, joining the blocks released by Microsoft, Amazon, IBM, ESRI, and more. The block fully supports the original Tisane functionality, including:

  • Detection of personal attacks and cyberbullying, hate speech, criminal activities, sexual harassment
  • Topic modelling, compliant with IPTC and IAB standards
  • Sentiment analysis 2.0
  • Entity extraction

And more.

The PubNub Data Stream Network powers thousands of apps, streaming 1.9 trillion messages to over 330 million devices a month, with powerful and extensible frameworks like PubNub ChatEngine™.

Tisane Labs Launches Solution to Detect Hate Speech and Cyberbullying

published on Yahoo Finance via PRNewswire

Affordable API enables developers and businesses to detect hate speech, cyberbullying, unwanted sexual advances, criminal activity, and more

WASHINGTON, Nov. 13, 2018 /PRNewswire/ — Tisane Labs, a supplier of text analytics AI solutions, today announced the launch of Tisane API, the first API to detect and classify abusive textual content in 27 languages. Tisane detects hate speech, personal attacks, unwanted sexual advances, and criminal activity in text, with additional varieties of detected abuse to come.

“Trolls, bigots, harassers, and criminals made the Internet an unpleasant and at times dangerous place. For the users, it often means being unsafe online with possible consequences in real life. For the online communities, it means high user turnover, additional headaches with the moderation, and enormous monetary losses or legal issues,” said Vadim Berman, Chief Executive Officer and Co-founder of Tisane Labs. “Now, with Tisane API, the communities online can automate much of the moderation process and even warn potential offenders before the post is published. Rather than producing a blanket statement and a floating-point figure, Tisane API pinpoints the actual instance of abuse and classifies the type of abuse.”

Tisane API runs in the cloud, with a simple REST interface that can be linked from any popular programming platform today. Tisane Labs provides a range of plans for every pocket with the option of a custom installation on premises and a generous FREE plan.

To try Tisane API, visit https://tisane.ai.

For more information, contact Carla Johnston (email: Carla.Johnston@tisane.ai or call: +1 (703)-628-8827)

Related Links

Tisane Labs website

Evolving the Sentiment Analysis

Introduction

Sentiment analysis, or opinion mining, is the process of finding out the sentiment expressed in a fragment of text. The idea is relatively young: the first papers on sentiment analysis appeared less than two decades ago (Turney, 2002). Its importance in various verticals, coupled with the explosion in the volume of social media, helped to fast-track its commercialization and attract generous R&D investment.

Today, it is no longer enough just to answer whether the author of the content gives “thumbs up” or “thumbs down”. In this paper, we will lay the framework for more advanced applications of sentiment analysis.

World in Black & White: Classic Sentiment Analysis

The proverbial spherical cow in vacuum. Every bit as useful as the classic sentiment analysis

The document-level or classic sentiment analysis is meant to process a piece of text and answer the question, “is it positive, negative, or neutral?” The old naïve approach would just take every bit of negativity and every bit of positivity, sum them up, and calculate the score. Coupled with a bag of words method where the bits of positivity and bits of negativity were words or phrases (so-called “polarity terms”), the approach was hopelessly inaccurate. After a while, the approach was boosted with recognition of negations and other artefacts that modify the outcome. Then, the result was changed to a floating-point value.
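To make the weakness concrete, here is a minimal sketch of the naïve bag-of-words scoring described above. The tiny lexicon and the function name naive_score are invented for the illustration; real polarity lexicons are far larger.

```python
import re

# Toy polarity lexicon, invented for this example.
POLARITY = {"good": 1, "great": 1, "helpful": 1, "bad": -1, "tasteless": -1, "dirty": -1}


def naive_score(text: str) -> int:
    """Sum the polarity of every known word, ignoring word order and context."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(POLARITY.get(word, 0) for word in words)


negated = naive_score("The room was not good")    # negation is ignored
mixed = naive_score("Great location, dirty room")  # opposite opinions cancel out
```

The first sentence comes out positive even though it is a complaint, and the second one collapses into a neutral score that hides both the praise and the problem.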

However, none of that addressed the elephant in the room.

The concept of sentiment analysis was born out of the actual business need to predict whether the customer will buy the product or the service again, and whether they will recommend the members of their social circle to buy it. Straightforward logic, isn’t it? If they liked it, they will opt for the same vendor again given a chance.

Sort of.

In reality, the likelihood of people buying again is more than the mere sum of the grades given to different features. People don’t fly budget airlines because they like small legroom and dirty seats: they are trying to save money. They might put up with the poor service but they will flee at the first sign of pricing being no longer competitive. These customers will also not tolerate cancellations without refund and lost baggage, even if they are OK with the lack of inflight entertainment.

Already in the late 2000s, the online reviews became an established literary genre. They are often much longer than one sentence, and, most importantly, tell exactly what is good and what is not good. It’s all there! Unfortunately, the classic, “black & white” sentiment analysis was completely ignoring it, even with the “shades of grey”, that is, a floating-point score. Some sentiment analysis vendors went further and started assigning a score to every sentence. It did not solve the problem though, because one sentence may very well contain a number of factors in the customer’s decision.

While the simple “black and white” principle was easy to explain to the uninitiated business people holding the purse strings, the same business people needed better actionable intelligence. Why didn’t 67% of the guests like that hotel? If I have to go through these thousands of reviews manually to find this out, what exactly did your software accomplish?

Be More Specific: Aspect-based and Entity-based Sentiment Analysis

One figure could not provide an adequate response. There was a need to find out what exactly the reviewer liked and what they did not like. So-called aspect-based (or facet-based) sentiment analysis is meant to do exactly that.

Instead of determining sentiment for a document or a sentence, every relevant aspect encountered in the text is given a sentiment score. For example, if a review says, “the breakfast was a bit tasteless but I liked the helpfulness of the staff”, in the hospitality domain, the aspect “breakfast” would have negative sentiment while the aspect “staff” would have positive sentiment.

The aspects from different reviews can then be aggregated, drawing a big picture of customer preferences and issues. The screenshot below demonstrates such an application for the hospitality industry.

Sample front-end displaying the results of the aspect-based sentiment analysis
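The aggregation itself is simple once each review has been reduced to aspect and polarity pairs. The sketch below assumes such pairs are already available from an upstream analysis step; the data layout, the sample reviews, and the function name aggregate_aspects are made up for the illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Review = List[Tuple[str, str]]  # (aspect, polarity) pairs found in one review


def aggregate_aspects(reviews: Iterable[Review]) -> Dict[str, Dict[str, int]]:
    """Count how often each aspect was mentioned with each polarity."""
    totals: Dict[str, Counter] = defaultdict(Counter)
    for review in reviews:
        for aspect, polarity in review:
            totals[aspect][polarity] += 1
    return {aspect: dict(counts) for aspect, counts in totals.items()}


reviews = [
    [("breakfast", "negative"), ("staff", "positive")],
    [("staff", "positive"), ("location", "positive")],
    [("breakfast", "negative")],
]
summary = aggregate_aspects(reviews)
```

Keeping the counts per aspect, rather than folding them into one number, preserves exactly the detail that a single overall score would throw away.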

What happens if the review mentions several vendors or suppliers?

Then the sentiment must be determined for a specific named entity, e.g. a company. In a way, it is a variation of the aspect-based sentiment analysis, with the entities treated as “aspects”. However, there are two nuances.

When two entities are compared to each other, the same clause may contain two sentiments. For example, “company X is more innovative than company Y” means positive sentiment for company X and at the same time negative sentiment for company Y.

While the regular aspect-based sentiment may contain a comparison between aspects, it doesn’t necessarily mean negative sentiment towards either (e.g. “I liked their breakfast more than their location” does not mean the location was bad and generally sounds contrived, while in case of the competing entities “less good” means “bad”).

The second nuance is that in both aspect-based and entity-based varieties of sentiment analysis, we may have cases where the sentiment is not generated by the author. Continuing the example above, “according to the analyst Z, company X is more innovative than company Y” does not bear direct sentiment. It merely quotes someone else. Depending on what the application is to accomplish, referenced sentiment may have to be ignored. This means that the sentiment analysis application must be able to detect quotations.

Sample entity-based sentiment analysis with opinion attribution

As aspect-based and entity-based sentiment analysis work with a collection of values, does it make sense to calculate an overall score, giving different weights to different factors?

Clearly not in case of the entity-based variety: it doesn’t make sense to add the sentiment score of Orange S.A. to the sentiment score of Apple Inc.

What about the aspect-based sentiment analysis? We believe that it would still be a bad practice. Does the sentence “the room is small but OK” convey an overall positive sentiment? Maybe, but we should not discard the size aspect if the goal is providing actionable intelligence. Even if we disregard the complexity of coming up with the constituents that work well, different people may have different criteria for the product. Some don’t care about a noisy environment in a hotel; others have to have convenient parking nearby that does not cost too much. Providing one figure may tempt the integrators and the aggregators to discard everything else, and the end-user will only get to see a questionable, one-size-fits-all score.

This is Not a Pipe: Sentiment Analysis of Creative Content

Aspect-based sentiment analysis seems to provide an adequate solution for reviews of goods and services. Praise means positive sentiment; a message of disapproval means negative sentiment. This is not always the case when discussing movies or fiction in general.

“Realistically depicted bad guys” means that by common ethical standards, the characters in the movie would be judged negatively. However, we are not judging these fictional people; we are judging the work of art depicting them. This work of art gets points for realism, quality of acting, good plot, and so on. It does not get demerit points for unethical conduct of the characters it depicts or the dirty and unsafe streets of the imaginary city.

As difficult as aspect-based sentiment analysis is, sentiment analysis of creative content raises the bar even further. It is a sub-type of the aspect-based sentiment analysis, with the distinction that we have to ignore many of the aspects of the review. It’s not enough to merely take into consideration some aspects: a character doing something stupid in the movie can be ignored; a script writer who made a stupid decision, on the other hand, means negative sentiment.

René Magritte’s famous The Treachery of Images warns that a painting of a pipe is not a real pipe; and so the parts of the review describing the imaginary universe of the creative content being reviewed are to be excluded from the sentiment analysis.

Let’s run a sample movie review through a regular aspect-based sentiment analysis:

I didn’t expect a lot from ‘Beowulf’, for lots of reasons, most of which were to do with the casting: incorrigibly cockney Ray Winstone as a warrior from what’s now southern Sweden; wacky John Malkovich as a cynical counselor; loony Crispin Glover as a flesh-rending monster, and weirdest of all, Angelina Jolie as the monster’s mother… Then there was the way they did the whole thing in CGI, running the risk of making it all look a bit rubbery. Finally, Robert Zemeckis presided over the insufferable ‘Forrest Gump’.

 

While the straight aspect-based sentiment analysis did find the necessary snippets, it completely misses the point as demonstrated below:

Aspect-based sentiment analysis applied on the movie review

Being “incorrigible” or “wacky” may be a bad thing in the world of customer service. It is not necessarily bad for an actor. A character being cynical or a monster does not mean the reviewer did not like the movie; it just refers to the imaginary universe. On the other hand, “looking rubbery” may be neutral in customer service but negative when it comes to CGI.

However, once we ignore the imaginary universe, focusing only on specific aspects, it’s not much different from the generic aspect-based sentiment analysis. Fortunately, the tools of Tisane Labs allow configuring exactly what we want to capture, and the solution is to create a special configuration targeting only specific aspects.

Tell Me Who Your Friends Are: Sentiment Analysis for Politics

The sentiment analysis of political content is far more difficult than any other type and adds more moving parts to the equation. One part of this type of sentiment analysis is largely the same as any other aspect-based sentiment analysis: unethical conduct, inadequate skills, etc. are bad things; ethical conduct and being skilled are good things. Nothing special here, just regular entity-based sentiment analysis. The regular reservations about entity-level sentiment analysis apply as well: external allegations and quotations have to be eliminated from the result or returned within a sub-scope.

Things get more interesting, however, when the sentiment is expressed indirectly.

A comparison to, or allegations of an affiliation with a notorious dictator or a criminal is clearly a negative sentiment. However, these are all named entities; how do we know that being linked to a person X is a bad thing? Can’t we just assemble a list of all the bad guys and mark an association with them as a bad thing?

We can’t. These scapegoats may not be universal scapegoats. For example, Democrats in the United States may use prominent Republican figures as “locally negative” entities of this kind, and vice versa; any nation at war or in poor diplomatic relations perceives the other side negatively.

Heated discourse over the 2016 Presidential Elections in the US is a good example. For the sake of simplicity, let’s focus on the main entities as depicted on the diagram below:

Simplified diagram of main actors in the 2016 US Presidential election

The main actors are the Democrats with Hillary Clinton, and the Republicans with Donald Trump. Let’s assume the author of a post does not give away his attitude by calling the Democratic candidate “Killary” or the Republican candidate “Drumpf”. If the author equates one of these politicians with the Nazis as a group (or, for that matter, any prominent member of the Nazi government, like Goebbels or Himmler), the sentiment is most likely negative (unless the author belongs to a small fringe group). It is less clear when the author alleges an association with Vladimir Putin; it became clearly negative in the US as the election drew closer, but was not universally damning earlier. Furthermore, it is not necessarily negative if uttered by commentators outside of the United States.

In other words, it’s all relative to the author of the content. Tell me who your friends are, and I will tell you whether the sentiment is positive or negative.

Considering that in most cases, these links are pointed out in a negative context (if it were positive, it would be known to everyone, and there is no need to point them out), it is tempting to assume that any association is negative. But that is not necessarily the case, as sometimes they are mentioned to demonstrate even-handedness or an affiliation with a friendly group.

This means that in order to resolve whether the association with a group or an actor within this group is mentioned in a positive context, the analysis needs to know where the author, or the group which the author is associated with (e.g. a news agency with a stable general political orientation), stands. Other than that, this aspect can only be returned as a relationship between entities, which may or may not bear sentiment polarity.

In practice, this means either returning a collection of “absolute” sentiment values with a collection of “relative” sentiment values, or working with a knowledge base of political groups to resolve the relative sentiment values.
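The second option can be illustrated with a toy lookup. Everything in the sketch below is hypothetical: the group and entity labels are neutral placeholders, the knowledge base is hand-written, and resolve_association is an invented name. It only shows how a “relative” value turns into a polarity once the author’s affiliation is known.

```python
from typing import Dict, Optional

# Hand-written placeholder knowledge base: how each group tends to perceive an entity.
STANCES: Dict[str, Dict[str, str]] = {
    "group_a": {"figure_x": "negative", "figure_y": "positive"},
    "group_b": {"figure_x": "positive", "figure_y": "negative"},
}


def resolve_association(
    author_group: Optional[str],
    associated_entity: str,
    stances: Dict[str, Dict[str, str]] = STANCES,
) -> str:
    """Polarity of being linked to `associated_entity`, seen from the author's side."""
    if author_group is None:
        return "unresolved"  # return the bare relationship instead of guessing
    return stances.get(author_group, {}).get(associated_entity, "unresolved")


from_a = resolve_association("group_a", "figure_x")
from_b = resolve_association("group_b", "figure_x")
unknown_author = resolve_association(None, "figure_x")
```

The same association yields opposite polarities for the two groups, and stays unresolved when the author’s position is unknown, which corresponds to returning the relationship without sentiment polarity.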

Conclusion

Sentiment analysis is a young and dynamic area. As social media catches up with the traditional mass media in importance, having long exceeded it in volume, the importance of accurate and powerful textual analysis in different scenarios is difficult to overstate.

We believe that this critical review and the suggestions in it will encourage productive discussions and yield new approaches. We are working to bring them to life at Tisane Labs. If you’re curious, do drop by and try out our take on the sentiment analysis 2.0.

Tisane Labs launches Tisane API

Tisane Labs is pleased to announce the release of Tisane API.

Harness the power of next-generation AI to extract more from text in 27 languages: detect hate speech, sexual harassment, cyberbullying, extract topics, and find not only whether, but also why the customer is happy or unhappy with your product or service. Our applications and components are accessible in the cloud on a subscription basis (SaaS), can be installed on premises, or embedded in 3rd party applications for seamless integration and security.

We support: English, Chinese (Simplified and Traditional), Arabic, Danish, German, Spanish, Persian, Finnish, French, Hebrew, Indonesian, Italian, Japanese, Korean, Malay, Dutch, Norwegian, Polish, Pashto, Portuguese, Russian, Swedish, Thai, Turkish, Urdu, Vietnamese.

We offer several ways to use our components, from a generous free plan (not a limited trial) to enterprise-grade plans and on prem installation options. Whether you’re a small startup, an independent developer, or an enterprise, we can work together.

Questions? Browse our knowledge base, chat with us using the real-time chat widget, or email us.

Sign up and start using Tisane API today.