Text Analytics 2009: User Perspectives on Solutions and Providers

Seth Grimes

An Alta Plana research study
Sponsored by

Table of Contents
Executive Summary
Text Analytics Basics
  Discovering Meaning in Text
Software and Solution Market Overview
  Applications and Sources
Demand-Side Perspectives
  Study Context
  About the Survey
Demand-Side Study 2009: Response
  Q1: Length of Experience
  Q2: Application Areas
  Q3: Information Sources
  Q4: Return on Investment
  Q5: Mindshare
  Q6: Spending
  Q8: Satisfaction
  Q9: Overall Experience
  Q12: Like and Dislike
  Q13: Information Types
  Q14: Important Properties & Capabilities
Additional Analysis
  Selected Cross-tabulations
  Interpretive Limitations
About the Study
Solution Profile: Attensity
Solution Profile: Clarabridge
Solution Profile: GATE
Solution Profile: IxReveal
Solution Profile: Nstein
Solution Profile: SAP BusinessObjects
Solution Profile: TEMIS

Published May 31, 2009 under the Creative Commons Attribution 3.0 License.

Executive Summary
The global text-analytics market is growing at a very rapid pace, an estimated 40% in
2008, creating a $350 million market for software and vendor-supplied support and
services. The total business value generated by text-analytics-reliant information
products, in-house development, service providers, applications such as e-discovery,
and research surely multiplies this figure eight-fold. The author projects 2009 market
growth of up to 25% despite the economic downturn.
Market Factors
A number of factors have impelled sustained text-analytics market growth. The
technology – text mining and related visualization and analytical software – continues
to deliver unmatched capabilities both in early-adopter domains such as intelligence
and the life sciences and in business sectors that have embraced text analytics more
recently, in the last 3-5 years. These latter sectors include, notably, media and
publishing, financial services and insurance, travel and hospitality, and consumer
products and retail. Business and technical functions such as customer support and
satisfaction, brand and reputation management, claims processing, human resources,
media monitoring, risk management and fraud, and search have fueled recent growth.
No single organization or approach dominates the market. While existing players
have been very successful, they and new entrants continue to innovate, offering
cutting-edge capabilities, for instance in sentiment analysis, as well as in newer, as-a-
service and mash-up ready delivery models and capabilities targeted to market niches.
Insights into the question, “What do current and prospective text-analytics users really
think of the technology, solutions, and solution providers?” will help providers craft
products and services that better serve users. Insights will guide users seeking to
maximize benefit for their own organizations. Alta Plana conducted a spring 2009
survey to explore the topic. This report, “Text Analytics 2009: User Perspectives on
Solutions and Providers,” presents findings drawn from 116 respondents, the majority of
whom already use text analytics. The study was supported by seven sponsors but is
editorially independent, designed and conducted by industry analyst and consultant
Seth Grimes, a recognized expert in the application of text analytics.
Key Study Stats
The following are key study findings:
• Top business applications of text analytics for respondents are a) Brand /
  product / reputation management (40% of respondents), b) Competitive
  intelligence (37%), c) Voice of the Customer / Customer Experience
  Management (33%), and d) other Research (33%).
• These applications match a focus on on-line sources, a) blogs and other social
  media (47%), b) news articles (44%), and c) on-line forums (35%), as well as
  direct customer feedback in the form of d) e-mail and correspondence (36%)
  and e) customer/market surveys (34%).
• Users with 2 years or more of experience prefer tools that support specialized
  dictionaries, taxonomies, or extraction rules, and they often like open source.
• Prospective users expect to focus their initial text-analytics work on
  inside-the-firewall feedback sources: e-mail, surveys, and contact-center materials.
• Prospective users have high ROI hopes. Use of each of six different measures,
  led by increased sales to existing customers, is favored by over 50% of
  respondents who are not current users. Other measures are not far behind.

Text Analytics Basics
The term text analytics describes software and transformational steps that discover
business value in “unstructured” text. The aim is to improve automated text
processing.
Almost everything people do with electronic documents falls into one of four classes:
1. Compose, publish, manage, and archive.
2. Index and search.
3. Categorize and classify according to metadata & contents.
4. Summarize and extract information.
Text analytics enhances the first and second sets of functions and enables the third
and fourth.
The remainder of this section will look at the technology, and the section after will
look at the market and applications.

Discovering Meaning in Text
Text analytics encompasses applications of the technology in government, science,
and industry and for cross-cutting tasks that range from information retrieval to
text-fueled investigative analyses. Text analytics can be seen as a subspecies of
business intelligence, and its capabilities will be an essential component of the
eventual creation of the Semantic Web.
Structure in Text
Text – news and blog articles, scientific papers, spoken call-center conversations,
survey responses, product reviews posted to on-line forums, this report – is replete
with structure. Humans (relatively easily) learn to use this structure – the
morphology of individual words, the syntax that governs the composition of
expressions, the grammar behind phrases and sentences, and the larger-scale structure
of text as organized and presented in Web pages, e-mail, newspapers, books, and
myriad other forms – to both understand and generate text. We are able to do this
without conscious thought, coupled with a grasp of context, knowledge, and emotion
that allows us to understand often-complex interactions.
Text-analytics software technology – text mining and related visualization and
analytical tools – enables machine treatment of text that replicates, automates, and
extends human capabilities.
Sense-Making through Statistics
The earliest approaches to automated text analysis applied statistical methods to text.
Consider Hans Peter Luhn’s 1958 IBM Journal paper, “The Automatic Creation of
Literature Abstracts” [1], which envisaged application of statistics for sense-making and
summarization. Luhn wrote,
“Statistical information derived from word frequency and distribution is used
by the machine to compute a relative measure of significance, first for
individual words and then for sentences. Sentences scoring highest in
significance are extracted and printed out to become the auto-abstract.”
Luhn illustrated his approach, as shown in the figure below, with the kind of
frequency analysis that is performed today by search-engine optimization (SEO)
tools and software such as Wordle that generates word and tag clouds. Luhn
additionally proposed a Keyword-in-Context (KWIC) indexing system that is at the
root of modern information retrieval methods.

[1] http://www.research.ibm.com/journal/rd/022/luhn.pdf -- paper is behind a “paywall.”

“Statistical information derived from word frequency and distribution is used by the machine
to compute a relative measure of significance”: H.P. Luhn
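
The frequency-based scoring that Luhn describes can be sketched in a few lines of Python. This is a simplified illustration and not Luhn's exact significance measure: the stop-word list, the function names, and the scoring rule (average frequency of a sentence's non-stop words) are all assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption; not taken from Luhn's paper).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that", "for", "on"}


def tokenize(text):
    """Lower-case the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def auto_abstract(text, top_n=1):
    """Return the top_n highest-scoring sentences, kept in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word significance: raw frequency across the whole text, stop words excluded.
    freq = Counter(w for w in tokenize(text) if w not in STOP_WORDS)

    def score(sentence):
        # Sentence significance: summed word frequency, normalized by length.
        words = tokenize(sentence)
        if not words:
            return 0.0
        return sum(freq[w] for w in words if w not in STOP_WORDS) / len(words)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_n])]
```

Sentences that reuse the text's most frequent content words rise to the top, which is the core of the "auto-abstract" idea; a production summarizer would add refinements this sketch omits.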
Vector Space Methods
Vector-space models became the prevailing approach to representing documents for
information retrieval, classification, and other tasks.
The text content of a document is reduced to an
unordered “bag of words” that becomes a point in a
high-dimensional vector space that may embed the
word content of many documents as illustrated in the
diagram that appears to the right [2].
Approaches such as TF-IDF (term frequency–inverse
document frequency) weigh the significance of a term
by how often it appears in a document, discounted by
how common it is across a larger document set.
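
As a rough illustration of TF-IDF weighting, the sketch below uses one common variant, relative term frequency multiplied by log(N / document frequency). The function name `tf_idf` and the choice of variant are assumptions for the example; real toolkits differ in smoothing and normalization.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    documents is a list of token lists. Weight = tf * log(N / df), where tf is
    the term's share of the document's tokens, N is the number of documents,
    and df is the number of documents containing the term.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

A term that appears in every document receives a weight of zero, while a term concentrated in a few documents is weighted up.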
We apply additional analytical methods to make text
tractable, for instance, latent semantic indexing
utilizing singular value decomposition for term
reduction / feature selection to create a new, reduced
concept space. In plain English, such techniques identify
and retain the most important concepts and consolidate or
eliminate lesser concepts.
Text analytics will typically apply one or more of a
number of statistical clustering and classification methods
to documents. These methods include Naive Bayes,
Support Vector Machines, and k-nearest neighbors.
The diagram to the left illustrates the identification of a
hyperplane, the red line in a 2-D picture, that best separates
the dot-/circle-represented documents into distinct sets.
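
To make the classification idea concrete, here is a minimal nearest-neighbor sketch over bag-of-words vectors, using cosine similarity as the closeness measure. It illustrates the k-nearest-neighbor method only (not Naive Bayes or Support Vector Machines); the function names, the whitespace tokenization, and the sample labels in the usage note are assumptions for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(text, labeled_docs, k=3):
    """Label text by majority vote among its k most similar labeled documents."""
    query = Counter(text.lower().split())
    scored = sorted(
        ((cosine(query, Counter(doc.lower().split())), label) for doc, label in labeled_docs),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```

Given a handful of documents labeled, say, "finance" and "sports," a call such as `knn_classify("match final team", training, k=3)` returns the label held by most of the three closest documents. Common words such as "the" can dominate raw counts, which is one reason weighting schemes such as TF-IDF are usually applied first.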
[2] Salton, Wong & Yang, “A Vector Space Model for Automatic Indexing,” November 1975

Linguistic Approaches
Statistical approaches have a hard time making sense of nuanced human language, an
issue that H.P. Luhn foresaw in 1958. Luhn wrote in his visionary paper, cited above,
"This rather unsophisticated argument on ‘significance’ [inferred
from a word’s frequency of use] avoids such linguistic implications as
grammar and syntax. In general, the method does not even propose to
differentiate between word forms. Thus the variants differ,
differentiate, different, differently, difference and differential could
ordinarily be considered identical notions and regarded as the same
word. No attention is paid to the logical and semantic relationships
the author has established. In other words, an inventory is taken and a
word list compiled in descending order of frequency."
Consider the following pair of sentences, proposed by Luca Scagliarini of Expert
System. The two cases produce the same “bag of words,” but their meanings, the data
content of the texts, are very different given the switch of “fell” and “gained.”
The Dow fell 46.58, or 0.42 percent, to 11,002.14. The Standard & Poor's
500 index fell 1.44, or 0.11 percent, to 1,263.85, and the Nasdaq
composite gained 6.84, or 0.32 percent, to 2,162.78.
The Dow gained 46.58, or 0.42 percent, to 11,002.14. The Standard &
Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85, and the Nasdaq
composite fell 6.84, or 0.32 percent, to 2,162.78.
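
A short check makes the point: under a simple bag-of-words reduction the two passages are indistinguishable. The sketch below assumes whitespace tokenization and lower-casing; the function and constant names are illustrative.

```python
from collections import Counter


def bag_of_words(text):
    """Reduce text to an unordered multiset of lower-cased, whitespace-split tokens."""
    return Counter(text.lower().split())


SENTENCE_A = (
    "The Dow fell 46.58, or 0.42 percent, to 11,002.14. The Standard & Poor's "
    "500 index fell 1.44, or 0.11 percent, to 1,263.85, and the Nasdaq "
    "composite gained 6.84, or 0.32 percent, to 2,162.78."
)
SENTENCE_B = (
    "The Dow gained 46.58, or 0.42 percent, to 11,002.14. The Standard & "
    "Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85, and the Nasdaq "
    "composite fell 6.84, or 0.32 percent, to 2,162.78."
)

# The texts differ, yet their bags of words are identical.
SAME_BAG = bag_of_words(SENTENCE_A) == bag_of_words(SENTENCE_B)
```

Word order, and therefore which index rose and which fell, is exactly the information the representation discards.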
Linguistic approaches will, for instance, analyze the parts of speech of a phrase,
detecting the subject-verb-object triple that constitutes a factual (or subjective)
statement as well as additional, modifying elements.
Natural Language Processing
Part-of-speech (POS) analysis is typically one of a sequence or pipeline of resolving
steps applied to text. Other, typically applied steps include:
• Tokenization: Identification of distinct elements within a text, usually words,
  expressions, punctuation marks, white space, etc.
• Stemming: Identifying variants of word bases created by conjugation,
  declension, case, and pluralization, e.g., “act” for “acts,” “actor,” and “acted.”
• Lemmatization: Use of stemming and other techniques, including analysis of
  context and parts of speech, to associate multiple words or terms with a
  canonical term. For example, "better" might have "good" as its lemma.
• Entity Recognition: Look-up in lexicons or gazetteers and use of pattern
  matching to discern items such as names of people, companies, products, and
  places and expressions such as e-mail addresses, phone numbers, and dates.
• Tagging: XML mark-up of distinct elements, a.k.a. text annotation.
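
Three of the steps above (tokenization, stemming, and pattern-based entity recognition) can be sketched as a toy pipeline. Everything here is an assumption made for illustration: the regular expressions, the naive suffix-stripping rule (far cruder than a real stemmer), and the function names. Lexicon look-up, lemmatization, and XML tagging are omitted.

```python
import re

# Words (optionally joined by ., @, ' or -) or single punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[.@'-][A-Za-z0-9]+)*|[^\w\s]")

# Pattern-matching "entity recognition" for two expression types only.
ENTITY_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"),
    "DATE": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def tokenize(text):
    """Split text into word-like tokens and punctuation marks."""
    return TOKEN_RE.findall(text)


def stem(token):
    """Naive stemmer: strip one common suffix if at least 3 letters remain."""
    word = token.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def entity_type(token):
    """Return the label of the first pattern that matches the whole token."""
    for label, pattern in ENTITY_PATTERNS.items():
        if pattern.fullmatch(token):
            return label
    return None


def annotate(text):
    """Run the pipeline: one (token, stem, entity label) triple per token."""
    return [(tok, stem(tok), entity_type(tok)) for tok in tokenize(text)]
```

Each stage consumes the output of the previous one, which is the sense in which these steps form a pipeline; production systems replace each stage with far more capable components.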
Entities are one type of “feature” found in text. Other features of interest include:
• Attributes: A person’s attributes include age, sex, height, and occupation.
• Abstract attributes: Properties such as “expensive” and “comfortable.”
• Concepts: Abstractions of entities, for instance, a category.
• Metadata: In this context, items that describe a document such as the author,
  creation date, and title as well as a topic tag.
• Facts and relationships: These include statements such as “Dow fell 46.58.”
• Subjective data: Covers sentiment, opinions, mood, and other attitudinal data.
The next section of the report looks at how the technology is applied.

Software and Solution Market Overview
What we now see as text analytics was actually, in the late 1950s, put forward as the
foundation for a visionary business intelligence system. This system would focus on
discovering and communicating relationships (and not just data values) and on
business-goal alignment. Knowledge-management questions drove this early
BI conceptualization, with answers to questions such as:
• What is known?
• Who knows what?
• Who needs to know?
to be derived or discovered via text mining. [3]
Such systems are technically very difficult to realize, and BI of course developed in
other directions. Numerical data, drawn from transactional and operational systems
and stored in databases, is far easier to analyze than is information locked in text. BI
and related tools and techniques – spreadsheets, reporting, OLAP, data mining –
generally do an excellent job of creating business value from this data.
In the last few years attention has turned back to text sources. Commercial software
vendors – and open source projects – have responded to the opportunity.

Applications and Sources
Applications of text mining in the life sciences and intelligence date to the 1990s, for
purposes that include pharmaceutical lead generation – mining scientific literature to
accelerate expensive, time-consuming drug-discovery processes – and
counter-terrorism. A number of factors – the huge and growing volume of on-line content,
advances in search and information retrieval, cheap computing power, and better
software – have created a market for application of these same text technologies to a
much broader variety of business, scientific, and research problems.
Application domains
Market awareness has grown immensely in the last 3-5 years, but up-take and
experiences have varied by application domain. To study adoption, survey question 2
asked, “What are your primary applications where text comes into play?” It listed the
following choices, an attempt to capture the most important application domains:
• Brand/product/reputation management
• Competitive intelligence
• Content management or publishing
• Customer service
• E-discovery
• Financial services
• Compliance
• Insurance, risk management, or fraud
• Law enforcement
• Life sciences or clinical medicine
• Product/service design, quality assurance, or warranty claims
• Research (not listed)
• Voice of the Customer / Customer Experience Management

[3] “BI at 50 Turns Back to the Future,” http://www.intelligententerprise.com/showArticle.jhtml?articleID=211900005

Information sources
In each of the application areas listed above, text analytics enhances existing analyses.
It enhances both BI and data mining applied to transactional data and non-automated
review of textual sources, a.k.a. reading. By automating the reading process, text
analytics allows analysts and researchers to tap material that had not previously been
systematically mined. It allows them to work far faster than before and to analyze far
greater volumes of information than ever before. Importantly, text analytics can
make a huge difference in text analysis and processing costs and enable the creation of
new information products and services.
Survey question 3 asked about information sources. These sources may be grouped:
• On-line and social media: blogs and other social media (Twitter, social-network
  sites, etc.); news articles; review sites or forums.
• Enterprise communications and feedback: chat and/or instant messaging text;
  contact-center notes or transcripts; customer/market surveys; e-mail and
  correspondence; employee surveys; point-of-service notes or transcripts;
  SMS/text messages; warranty claims/documentation; Web-site feedback.
• Operational materials (of course varying by business): crime, legal, or judicial
  reports or evidentiary materials; insurance claims or underwriting notes;
  medical records; patent/IP filings; scientific or technical literature.
Application modes
The applications themselves vary widely. They may be classified in several
(overlapping) groups:
• Media and publishing systems – the author includes search engines here – use
  text analytics to generate metadata and enrich and index metadata and
  content in order to support content distribution and retrieval. Semantic Web
  applications would fit in this category.
• Content management systems – and, again, related search tools – use text
  analytics to enhance the findability of content for business processes that
  include compliance, e-discovery, and claims processing.
• Line-of-business systems for functions such as compliance and risk, customer
  experience management (CEM), customer support and service, human
  resources and recruiting.
• Investigative and research systems for functions such as fraud, intelligence
  and law enforcement, competitive intelligence, and life sciences research.
This list is representative and not exhaustive. All listed areas are experiencing strong
growth. In certain cases, text-analytics’ contribution is not at all obvious. Google and
other major search engines top their responses to “map massachusetts” and “34+178”
and “orcl” with a map, the number 212, and Oracle share data, respectively, enabled by
their ability to recognize named entities and expressions. This particular application
of text analytics is shallow but reaches a very, very large audience.
Solution providers
Text-analytics solution providers include a significant cadre of young but mature
pure-play software vendors, software giants that have built or acquired text
technologies, robust open-source projects, and a constant stream of start-ups, many of
which focus on market niches or specialized capabilities such as sentiment analysis.
The provider side is vibrant and doing well despite the adverse economic climate due
to the market’s growing awareness of solution providers’ ability to respond to
business needs and technical challenges alike. [4]

[4] http://www.b-eye-network.com/channels/1394/view/9720

Demand-Side Perspectives
Alta Plana designed a spring 2009 survey, “Text Analytics demand-side perspectives:
users, prospects, and the market,” to collect raw material for an exploration of key text-
analytics market-shaping questions:
• What do customers, prospects, and users think of the technology, solutions,
  and vendors?
• What works, and what needs work?
• How can solution providers better serve the market?
• Will your companies expand their use of text analytics in the coming year?
• Will spending on text analytics grow, decrease, or remain the same?
It is clear that current and prospective text-analytics users wish to learn how others
are using the technology, and solution providers of course need demand-side data to
improve their products, services, and market positioning, to boost sales and better
satisfy customers. The Alta Plana study therefore has two goals:
1. To raise market awareness and educate current and prospective users.
2. To collect information of value to sponsors.
Survey findings, as presented and analyzed in this study report, provide a
measure of the state of the market, a form of benchmark. They are designed to be of
use to everyone who is interested in the commercial text-analytics market.

Study Context
Text-analytics solutions have been applied to a spectrum of business problems.
Provider revenues are booming (for most established providers). Academic and
industrial research continues to expand, and new companies, many of them academic
spin-offs, have emerged in the field at a steady pace. Demand-side views
are, anecdotally, quite positive, judging from published user stories and case studies
and based on inquiries from organizations that are researching solutions.
The author previously explored market questions in a number of papers and articles.
These included white papers created for the Text Analytics Summit in 2005, “The
Developing Text Mining Market,” [5] and 2007, “What's Next for Text.” [6]
Analyst and Provider Analyses
The 2007 paper contains a number of telling quotations:
“Organizations embracing text analytics all report having an epiphany
moment when they suddenly knew more than before.”
– Philip Russom, the Data Warehousing Institute
“Growth is largely driven by the wealth of unstructured information found
on the external web, in corporate intranets, document repositories, call-
centers, and in customer and employee business communications.”
– IBM researcher David Ferrucci
Other analysts and solution providers have had a lot to say about text analytics’
benefits and growth. The article “Perspectives on Text Analytics in 2009” [7] is a
systematic (albeit informal) survey of industry perspectives that reports provider
CEO and CTO and thought-leader responses to the query:
“What do you see as the 3 (or fewer) most important text analytics
technology, solution or market challenges in 2009?”
Responses were informative, based on the respondents’ own research and, especially,
on extensive contact with customers and prospects.

[5] http://altaplana.com/TheDevelopingTextMiningMarket.pdf
[6] http://altaplana.com/WhatsNextForText.pdf
[7] http://www.b-eye-network.com/channels/1394/view/9720

In the current context, a market challenge articulated by Aaron B. Brown, IBM
program director for ECM Discovery, is particularly telling. That challenge is for
text-analytics solution providers to better define business cases. According to Brown,
“In the current economic situation, organizations are clamping down on new
projects and more than ever looking for hard ROI savings to justify
investment. To pass the funding bar, text-analytics solutions, which typically
fall in the category of new projects undertaken for business optimization, need
to come with solid business cases that demonstrate hard-dollar operational
savings based on proven examples. Given the emerging nature of many text-
analytics solution areas, this will be a challenge to growth in 2009.”
Business cases don’t rest solely on solution-provider research and assertions, of
course. Demand-side experiences and perceptions can and should also contribute.
Demand-Side Views
A systematic look at the demand side will provide a good complement to provider-
side views and to vendor- and analyst-published case studies.
Alta Plana’s 2008 study report, “Voice of the Customer: Text Analytics for the Responsive
Enterprise,” [8] published by BeyeNETWORK.com, was our first systematic survey of
demand-side perspectives, albeit focused on a particular set of business problems.
VoC analysis is frequently applied to enhance customer support and satisfaction
initiatives, in support of marketing, product and service quality, brand and reputation
management, and other enterprise feedback initiatives. The listening concept is
extended to other voice applications: Voice of the Patient, Voice of the Market, etc.
Views related in our 2008 study were telling:
“Text Analytics is exciting technology, opening up new applications
and approaches to solving information needs and supporting decision
making for an improved customer experience.”
– Michael House, Maritz Research, Division Vice President
“We've uncovered concepts and relationships in text that would be too
costly – or even impossible – to detect by any other methods. We can
now combine multiple data sources to evaluate customer expectations
and improve customer satisfaction by employing more one-to-one
customer contact and preemptively resolving customer complaints to
keep our retention rates high.”
– Federico Cesconi, Cablecom, head of customer insight and retention

About the Survey
There were 116 responses to the 2009 survey, which ran from April 13 to May 10.
Survey invitations
The author solicited responses via:
• E-mail to the TextAnalytics, Corpora, datamining2, sla-dkm (Special
  Libraries Association, Division for Knowledge Management), sla-dite
  (SLA Information Technology), Asis-l (American Society for
  Information Science), and GATE lists and the author’s personal list.
• Invitations published in electronic newsletters: Intelligent Enterprise,
  KDnuggets, SearchDataManagement.com, TDWI’s BI This Week,
  Text Analytics Summit, and statistics.com.
• Notices posted to LinkedIn forums and Facebook groups and on
  Twitter.
• Messages sent by sponsors to their communities.

[8] http://altaplana.com/BIN-VOCTextAnalyticsReport.pdf

Survey introduction
The survey started with a definition and brief description as follows:
Text Analytics is the use of computer software to automate:
• annotation and information extraction from text – entities, concepts,
  topics, facts, and attitudes,
• analysis of annotated/extracted information, and
• document processing – retrieval, categorization, and classification, and
• derivation of business insight from textual sources.
This is a survey of demand-side perceptions of text technologies, solutions, and
providers. Please respond only if you are a user, prospect, integrator, or
consultant.
There are 20 questions. The survey should take you 5-10 minutes to complete.
For this survey, text mining, text data mining, content analytics, and text
analytics are all synonymous.
I'll be preparing a free report with my findings. Thanks for participating!
Survey response
There is little question that the survey results overweight current text-analytics users
(63% of respondents who answered Q1, “How long have you been using Text Analytics?,”
and 61% of respondents who replied to Q7, “Are you currently using text analytics?”)
relative to their share of the broad set of potential business, government, and
academic users.
BI market comparison
We can infer this overweighting, for example, from market-size figures. The author
estimates a $350 million global market for text-analytics software and vendor-supplied
support and services. By contrast, in March 2009, research firm IDC published a
preliminary, 2008 BI-market estimate. IDC’s sizing “suggests that the business
intelligence tools software market grew 6.4% in 2008 to reach $7.5 billion.” [9] Former
Forrester analyst Merv Adrian estimated $8.4 billion for 2008. A simple, good-enough
heuristic says that if the BI market is 20 times the size of the text-analytics market,
there are likely around 20 times as many BI users as there are text-analytics users.
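
The arithmetic behind the heuristic is easy to check. The sketch below simply divides the two BI-market estimates quoted above by the author's text-analytics estimate; the constant and function names are illustrative, and the step from market-size ratio to user-count ratio assumes roughly equal spending per user, as the heuristic itself does.

```python
# Market-size estimates quoted in the text, in US dollars.
TEXT_ANALYTICS_MARKET = 350e6   # author's estimate
BI_MARKET_IDC = 7.5e9           # IDC preliminary 2008 estimate
BI_MARKET_ADRIAN = 8.4e9        # Merv Adrian's 2008 estimate


def market_ratio(bi_market, text_market=TEXT_ANALYTICS_MARKET):
    """How many times larger the BI market is than the text-analytics market."""
    return bi_market / text_market


IDC_RATIO = market_ratio(BI_MARKET_IDC)        # roughly 21
ADRIAN_RATIO = market_ratio(BI_MARKET_ADRIAN)  # roughly 24
```

Both ratios fall a little above 20, in line with the round figure used in the heuristic.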
Data mining comparison
Another contrasting data point is that 55% of respondents to a March 2009 KDnuggets
poll [10] report currently using text analytics on projects. KDnuggets reaches data
miners, a technically sophisticated audience who are among the most likely of any
market segment to have embraced text analytics. The rate of text-analytics adoption
by data miners surely exceeds the rate of adoption by any other user sector.

[9] http://www.idc.com/getdoc.jsp?containerId=217443
[10] http://www.kdnuggets.com/polls/2009/text-analytics-use.htm

How much did you use text analytics / text mining in 2008?
Response                            Count   Share
Did not use                            45     45%
Used on < 10% of my projects           17     17%
Used on 10-25% of projects             14     14%
Used on 26-50% of my projects          11     11%
Used on over 50% of my projects        14     14%
As an aside, 52% of KDnuggets respondents stated that they would use text analytics
more in 2009 than in 2008, and 42% stated that their use would be about the same,
which strongly suggests growth in the user base.

Demand-Side Study 2009: Response
The subsections that follow tabulate and chart survey responses, which are presented
without unnecessary elaboration.

Q1: Length of Experience

How long have you been using Text Analytics?

Response                               Response %   Cumulative Response
not using, no definite plans to use        16%
currently evaluating                       22%
less than 6 months                          8%              8%
6 months to less than one year              5%             13%
one year to less than two years             7%             20%
two years to less than four years          18%             37%
four years or more                         25%             63%

Q2: Application Areas

What are your primary applications where text
comes into play?
Brand / product / reputation management 40%
Competitive intelligence 37%
Voice of the Customer / Customer Experience … 33%
Research (not listed) 33%
Customer service 22%
Content management or publishing 19%
Life sciences or clinical medicine 18%
Insurance, risk management, or fraud 17%
Financial services 15%
E-discovery 15%
Product/service design, quality assurance, or … 14%
Other 13%
Compliance 8%
Law enforcement 7%



Q3: Information Sources

What textual information are you analyzing or do you
plan to analyze?
blogs and other social media 47%
news articles 44%
e-mail and correspondence 36%
on-line forums 35%
customer/market surveys 34%
scientific or technical literature 27%
contact-center notes or transcripts 25%
Web-site feedback 21%
review sites or forums 21%
medical records 16%
employee surveys 16%
insurance claims or underwriting notes 15%
chat and/or instant messaging text 15%
other 14%
crime, legal, or judicial reports or evidentiary materials 13%
point-of-service notes or transcripts 12%
patent/IP filings 11%
SMS/text messages 8%
warranty claims/documentation 7%



Q4: Return on Investment
Question 4 asked, “How do you measure ROI, Return on Investment? Have you
achieved positive ROI yet?” Results are charted from highest to lowest values of the
sum of “currently measure” and “plan to measure”:
How do you measure ROI, Return on Investment?
(Chart series: Currently Measure, Plan to Measure, Measure or Plan to Measure,
Achieved. The percentages labeled on the chart are listed below.)
increased sales to existing customers 54%
higher satisfaction ratings 51%
improved new-customer acquisition 46%
higher customer retention/lower churn 39%
reduction in required staff/higher staff productivity 38%
more accurate processing of claims/requests/casework 36%
faster processing of claims/requests/casework 36%
ability to create new information products 34%
fewer issues reported and/or service complaints 30%
lower average cost of sales, new & existing customers 30%
higher search ranking, Web traffic, or ad response 28%
other 18%

Q5: Mindshare
A word cloud, generated at Wordle.net, seemed a good way to present responses to the
query, “Please enter the names of text-analytics companies you have heard of.”



Q6: Spending
Question 6 asked, “How much did your organization spend in 2008, and how much do
you expect to spend in 2009, on text-analytics solutions?”

[Two pie charts: 2008 Spending and 2009 Spending. Categories: use open source;
under $50,000; $50,000 to $99,000; $100,000 to $199,999; $200,000 to $499,999;
$500,000 or above.]

Q8: Satisfaction
Question 8 asked, “Please rate your overall experience – your satisfaction – with text
analytics.” Results are as shown:

[Pie chart of satisfaction ratings. Categories: Completely satisfied, Satisfied,
Neutral, Disappointed, Very disappointed. Slice labels: 53%, 23%, 21%, 2%, 2%.]

Q9: Overall Experience
Question 9 asked, “Please describe your overall experience – your satisfaction – with
text analytics.” The following are 32 verbatim responses, lightly edited for spelling
and grammar and to mask the two products that were named:

We are highly satisfied. Costs were lower than expected due to high degree of automation.
Expectations were exceeded. More timely and more fine grained customer insight and market
intelligence and competitive intelligence than ever before.
It's been a fun journey, but still struggling with how to get to root cause and how far text
analytics can get you there vs. need analysts.
No one solution addresses every use case. Some solutions better address the up-front creation of
dictionaries than others.
I would like a more automated system that integrates with our current IS.
Not really neutral but it's sort of a love hate thing. There's a very high learning curve,
sometimes it's seductive to measure things that aren't relevant - to run things just because you
can, not because they tell you anything. But the customers like it - even if they don't understand
it.
I want to see more applications
Pretty good on named entity extraction, fairly good on fact extraction, poor on sentiment
analysis.
Several possibilities, several applications; Emphasis on efficiency enhancing; solutions;
Problems in selling accuracy.
I was satisfied of the effectiveness of the tools - specifically for named-entity recognition.
Good but still have a ways to go with capabilities
OK, it is hard to describe satisfaction of using text analytics tools when we all know how
language is ambiguous and complex - we cannot expect too much from automatic processing yet,
maybe in the time when neural networks can be used, but NLP on its own cannot impress us
yet I think.
Developing part-of-speech tagging for Arabic text, morphological analyzer, to deal with wide
range of text domain, formats and genres.
Frustration with developing custom dictionaries that allow real-time categorization of content.
Pleased with progress in neural analysis of text content.
I'm building this all myself using open source tools. I'm extremely satisfied.
Hard learning curve, but we have it going now.
Excellent.
We have pretty low expectations for the accuracy of automated classification techniques, and
those are fulfilled but not exceeded. We use automated categorization in building demos, but
most of our customers use semi-automated or manual tagging.
It has been extremely valuable in certain situations. We always look at the text and verbatims
with our [product] software
It's great, but most of it is primarily designed for the English language only. As soon as you
need other languages, you need a lot of different providers (= increased implementation costs)
or you have to pay a lot of money.
I have written an entire textbook based upon text analytics and plan to write another.
92% accuracy, 6.7 fold increase in productivity, cut search time by 50%
Hundreds of hours of auditors’ time has been saved by a combination of scanning of hard copy
evidence, electronic evidence collection, and importing into [product], building business rules
from auditors defined keywords to produce first cut analysis classification.
Very satisfied - state-of-the-art in text analytics is advancing at a very rapid pace and text-
analytics based solutions are able to demonstrate business value addition/ROI.
Feedback from our users with the current tools is that they are not meeting their needs, which
is why we are looking at other solutions.
Difficult implementation into our core software, but now works as designed.
We have presented sentiment analysis on a wide range of documents and used the information
to be predictive in nature.
Text analytics allows us to gain new customer and market insights as well as better
competitive intelligence: higher report frequency, automated reporting, lower cost, finer
granularity.
Great hopes.
Long way to go.
Too early to tell.
10 million Voice of Customer can be in real time understood.



Q12: Like and Dislike
Question 12 asked, “What do you like or dislike about your solution or software
provider(s)?” Respondents were allowed to enter up to five points. Twenty-seven
individuals responded, entering a total of 82 points. One respondent entered “cost” in
all five slots.
The following chart normalizes, classifies as positive or negative, and groups the
responses into thematic categories. We take the sum of positive and negative remarks
in a category as indicating the category’s importance, so the chart is sorted in
descending order of the total number of remarks.
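
The tallying described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's actual procedure: the category names and remarks are invented, and each remark is assumed to have already been normalized and classified by hand as "plus" or "minus".

```python
from collections import Counter

# Hypothetical classified remarks as (category, polarity) pairs.
# The categories and counts are invented for illustration only.
remarks = [
    ("cost", "minus"),
    ("cost", "minus"),
    ("accuracy", "plus"),
    ("accuracy", "minus"),
    ("support", "plus"),
    ("cost", "plus"),
]

def tally(remarks):
    """Count plus and minus remarks per category, sorted by total descending."""
    plus = Counter(cat for cat, pol in remarks if pol == "plus")
    minus = Counter(cat for cat, pol in remarks if pol == "minus")
    categories = set(plus) | set(minus)
    rows = [(cat, plus[cat], minus[cat], plus[cat] + minus[cat]) for cat in categories]
    # Sort by the sum of remarks (descending), breaking ties alphabetically.
    return sorted(rows, key=lambda row: (-row[3], row[0]))

for category, n_plus, n_minus, total in tally(remarks):
    print(f"{category:10s} plus={n_plus} minus={n_minus} sum={total}")
```

Sorting on the sum rather than on either polarity alone reflects the stated choice to treat the total volume of remarks as the indicator of a category's importance.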
[Chart: What do you like or dislike about your solution or software provider(s)?
Series: Plus, Minus, Sum. Vertical axis: number of remarks, 0 to 14.]



Q13: Information Types

Do you need (or expect to need) to extract or analyze -
Other 15%
Other entities – phone numbers, e-mail & street addresses 40%
Metadata such as document author, publication date, title, headers, etc. 53%
Events, relationships, and/or facts 55%
Concepts, that is, abstract groups of entities 58%
Sentiment, opinions, attitudes, emotions 60%
Topics and themes 65%
Named entities – people, companies, geographic locations, brands, ticker symbols, etc. 71%

Q19: Comments
There were twelve comments. Several pushing-the-envelope respondent observations
were particularly interesting:
“We were shocked at the lack of appreciation for hosted and/or turnkey
solutions from many vendors we evaluated in 2008. The product capabilities
of many commercial solutions were poorly conceived, leading us to believe
that they didn't really understand the potential of text analytics.”
“As a market research supplier, my clients cross a number of industries.
Thus, lack of scalability is the major obstacle to adopting text analysis for my
purpose.”
“Twitter data requires new text analytic algorithms, because of the presence
of ‘@person’ fields, hashtags, and HTML links that have been shortened. As a
consequence, "traditional" algorithms don't work. I am developing those
algorithms myself, which is yet another reason I use open source tools
exclusively.”
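
To illustrate the last respondent's point, the Python sketch below contrasts a naive word tokenizer with one that keeps @mentions, hashtags, and links intact. It is a minimal sketch and not the respondent's algorithm. The token patterns, the function name, and the sample message are all assumptions chosen for illustration.

```python
import re

# Assumed token patterns, tried in order: URLs, @mentions, #hashtags, plain words.
TOKEN_RE = re.compile(
    r"""
    (?P<url>https?://\S+)
  | (?P<mention>@\w+)
  | (?P<hashtag>\#\w+)
  | (?P<word>\w+(?:'\w+)?)
    """,
    re.VERBOSE,
)

def tokenize_tweet(text):
    """Return (kind, token) pairs, keeping mentions, hashtags, and URLs whole."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

def naive_tokens(text):
    """A 'traditional' tokenizer that keeps only runs of word characters."""
    return re.findall(r"\w+", text)

# Hypothetical message; example.com stands in for a link-shortening service.
sample = "@alice loving the new #phone http://example.com/abc"

print(naive_tokens(sample))
print(tokenize_tweet(sample))
```

The naive tokenizer drops the "@" and "#" markers and splits the link into fragments, so the distinction between a person, a topic tag, and an ordinary word is lost before any analysis begins.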
One other comment is interesting and prompts a response.
“We are building an information retrieval product and wish to embed out-of-
the-box functionality but with the option to plug in other 3rd party analytical
components.”
The response is that several frameworks provide a plug-in architecture for the
construction of IR and other text-analytics applications. These include:
UIMA, the Unstructured Information Management Architecture, an Apache
Incubator project that was recently approved as an OASIS standard.
GATE, the General Architecture for Text Engineering.
Eclipse SMILA, a new SeMantic Information Logistics Architecture project.



Q14: Important Properties & Capabilities
What is important in a solution?

Important Properties & Capabilities
ability to use specialized dictionaries, taxonomies, or extraction rules 62%
broad information extraction capability 59%
deep sentiment/opinion extraction 53%
low cost 51%
support for multiple languages 39%
predictive-analytics integration 37%
BI (business intelligence) integration 35%
open source 24%
ability to create custom workflows 24%
sector adaptation (e.g., hospitality, insurance, retail, health care, communications, financial services) 23%
media monitoring/analysis interface 22%
hosted or "as a service" option 22%
supports data fusion / unified analytics 19%
interface specialized for your line-of-business 17%
vendor's reseller/integrator/OEM relationships with tech or service providers 13%
other 9%



Additional Analysis
The survey was designed so that responses to questions would be easy to interpret and
immediately useful without elaborate cross-tabulation or filtering. The exception was
cross-tabulation of other variables against two respondent attributes: length of time
using text analytics, and whether the respondent is currently using text analytics.
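
A cross-tabulation of this kind can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the respondent records are invented, and the function name is hypothetical. It computes, for each time-using category, the percentage of respondents who rated a given capability as important.

```python
from collections import defaultdict

# Hypothetical respondent records:
# (time using text analytics, rated the capability as important?)
responses = [
    ("less than 6 months", False),
    ("less than 6 months", True),
    ("four years or more", True),
    ("four years or more", True),
    ("four years or more", False),
    ("four years or more", True),
]

def percent_important(responses):
    """Return {time category: percentage of respondents who answered True}."""
    counts = defaultdict(lambda: [0, 0])  # category -> [important, total]
    for category, important in responses:
        counts[category][0] += int(important)
        counts[category][1] += 1
    return {cat: 100.0 * imp / total for cat, (imp, total) in counts.items()}

for category, pct in percent_important(responses).items():
    print(f"{category:20s} {pct:.0f}%")
```

Percentages are computed within each time category rather than over all respondents, so categories with few respondents are not swamped by larger ones. With small groups, as in this survey, each percentage rests on only a handful of answers.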

Selected Cross-tabulations
The author’s interpretation of survey findings generally supports prior notions,
such as the following points:
Length of involvement with text analytics correlates with particularity of
requirements. Each bar represents the percentage of respondents in a time
category who indicated that “ability to…” is important:
[Chart: percentage of respondents, by time using text analytics (less than 6 months;
6 months to less than one year; one year to less than two years; two years to less
than four years; four years or more), who indicated "Ability to use specialized
dictionaries, taxonomies, or extraction rules is important" and "Ability to create
custom workflows is important". Vertical axis: 0% to 100%.]

Length of involvement with text analytics correlates with preference for open
source:

[Chart: Open source is important versus Time using Text Analytics. Categories: less
than 6 months; 6 months to less than one year; one year to less than two years; two
years to less than four years; four years or more. Vertical axis: 0% to 60%.]

Using / Not
Other interesting points come out of contrasting respondents who are already using
text analytics with respondents who are still in planning stages.
Sources
The top responses to “What textual information are you analyzing or do you plan to
analyze?” for current users are:
blogs and other social media (twitter, social-network sites, etc.) 62%
news articles 55%
on-line forums 41%
These are on-line and other feedback-rich sources. Their high rate of selection
suggests that veteran users have found significant benefit in these sources.
By contrast, only three information-type categories were selected by over 26% of
respondents who are not yet using text analytics:
contact-center notes or transcripts 29%
It’s easy to infer that the value of on-line materials (social media, news articles,
forums), which is evident to current users, has not yet become clear to prospective
users. That only a minority chose any particular category suggests some combination
of the following, that
Prospective users are more broadly distributed across application categories.
Prospective users are cautious about how many different sources they tackle
initially.
The particular top selections suggest that the plurality – the largest portion – of
prospective users will focus initially on materials they have on hand that involve
interactions with known stakeholders. Web sources can come later.
Expectations
Prospective users are not similarly guarded in their expectations. When responses to
Question 4 “How do you measure ROI, Return on Investment?” are split out by
current versus prospective use, six measures are each selected by between 50% and
55% of prospective-user respondents. They are:
increased sales to existing customers
improved new-customer acquisition
higher satisfaction ratings
fewer issues reported and/or service complaints
faster processing of claims/requests/casework
reduction in required staff/higher staff productivity
(Of prospective-user respondents, almost a quarter are already using “increased sales
to existing customers” as an ROI measure, which makes sense. Sales are easily tracked
and analyzed by current systems, whereas items such as satisfaction ratings are not.)
“Higher customer retention/lower churn” comes in at just under 50% and three others
top 38%.
These prospective users, and the folks who advise them, would do well to manage and
focus their expectations.

Interpretive Limitations
The number of survey respondents was not large enough to support further useful
cross-tabulation of variables beyond the analyses above.
In interpreting presented findings, do keep in mind that the survey was not designed
or conducted scientifically, that is, with the intention or the actuality of creating a
random sample or a statistically robust characterization of the broad market.
Findings surely reflect selection bias due to 1) the outlets where the survey was
advertised and 2) a likelihood that those individuals who are unaware of text analytics
or the potential for text analytics to help them solve their business problems would
not respond to the survey. Findings therefore over-represent current text-analytics
users and also over-represent, to a lesser extent, the business intelligence and data
warehousing communities.
Finally, responses to several of the survey questions were not especially illuminating
or likely to be of much use to report readers. These questions are, in particular,
Question 10. Who is your provider?
Question 11. How did you identify and choose your provider?
Question 15. What BI (business intelligence) software do you use if any?
Question 16. What social media do/would you look to for text-analytics
contacts, discussions, or information?
Question 17. What industry publications do you receive, on paper or
electronically?
Question 18. What industry/technical conferences do you attend?



About the Study
Text Analytics 2009: User Perspectives on Solutions and Providers reports the findings
of a study conducted by Seth Grimes, president and principal consultant at Alta Plana
Corporation. Findings were drawn from responses to a spring 2009 survey of current
and prospective text-analytics users, consultants, and integrators. The survey asked
respondents to relay their perceptions of text-analytics technology, solutions, and
vendors. It asked users to describe their organizations’ usage of text analytics and
their experiences.

Sponsors
The author is grateful for the support of seven sponsors – Attensity, Clarabridge, the
University of Sheffield (GATE project), IxReveal, Nstein, SAP, and TEMIS – whose
financial contribution enabled him to conduct the current study and publish study
findings. The content of the sponsor solution profiles was provided by the sponsors.
The survey findings and the editorial content of this report do not necessarily
represent the views of the study sponsors. This report, with the exception of the
sponsor solution profiles, was not reviewed by the sponsors prior to publication.

Media Partners
The author acknowledges assistance received from six media partners in
disseminating invitations to participate in the survey. Those media partners are
Intelligent Enterprise, KDnuggets, SearchDataManagement.com, Statistics.com, the
Text Analytics Summit, and The Data Warehousing Institute (TDWI).

Seth Grimes
Author Seth Grimes is an information technology analyst and analytics strategy
consultant. He is contributing editor at Intelligent Enterprise magazine, founding chair
of the Text Analytics Summit, an instructor for The Data Warehousing Institute (TDWI),
KDnuggets contributor, and text analytics channel expert at the Business Intelligence
Network.
Seth founded Washington DC-based Alta Plana Corporation in 1997. He consults,
writes, and speaks on information-systems strategy, data management and analysis
systems, industry trends, and emerging analytical technologies.
Seth can be reached at grimes@altaplana.com, 301-270-0795.



Sponsor Solution Profiles



Solution Profile: Attensity
Business is built on conversations. These customer, partner, and employee conversations are captured in emails, call
notes, letters, surveys, forums and other social media, and more. Attensity enables you to use these conversations
to drive better relationships with your customers – transforming them into loyal advocates of your business.
Attensity delivers the power of sophisticated data and semantic analytics in an integrated suite of easy-to-use
business applications, allowing business leaders, customer support personnel, and customers to get relevant and
actionable answers fast.
An Integrated Suite of Products to Help You Manage the Customer Conversation: Analyze and Respond
Attensity's ability to extract valuable insight from free-form text anywhere and transform it into actionable insights
offers organizations the opportunity to understand their customers and to manage the entire customer conversation
– analyzing and responding to customer needs. Recognized as best-of-breed by leading analysts for more than a
decade, our applications, powered by the industry’s leading natural language processing technologies, are designed
to automate related business processes, and add the rigor and speed necessary to swiftly identify often subtle
relationships and root causes and to respond timely and accurately to customers. Equally important, our easy-to-use
business applications are not only designed for analysts, but also for business leaders, researchers, brand and
category managers, and customer service representatives, while also used directly by customers to efficiently self-
serve.
Attensity Voice of the Customer/Market Voice allows your organization to glean and analyze your customers’ candid
thoughts about your brand and products, rapidly and accurately understanding and analyzing comments in E-Service
records, surveys, and emails, along with the market buzz found in web communities, blogs, product reviews and social
media sites. This delivers the actionable insights - authentic customer sentiments and issues around your brand,
products, services, your competitors and more -- you need to make smarter decisions and deliver better products and
services. Attensity Voice of the Customer/Market Voice features sophisticated integrated reporting and pre-
packaged Voice of the Customer extraction domains for fast-time-to-value, detailed sentiment analysis, and an
extensive partner solutions network to help you
extend the value of your applications.
Attensity’s other products include E-Service Suite,
Automated Response Management, Research and
Discovery and Intelligence Analysis. E-Service Suite
offers an Agent Service Portal and a Self-Service
application that enables your customers to
effectively self-serve while your agents are
empowered to extend informed and efficient service
support in real time. Attensity Automated Response
Management, a part of the E-Service Suite,
optimizes and automates up to 100% of the handling
of all incoming and outbound customer
communications, enabling you to deliver a superior
customer experience while achieving significant
operational efficiency and productivity gains in your
contact center. Research and Discovery provides your organization with sophisticated information extraction,
advanced classification and enterprise-class search of and access to internal and external data, helping you meet
compliance and litigation demands while controlling costs. Intelligence Analysis allows commercial and government
organizations to “connect the dots” by delivering automatic extraction and analytical processing of “relational events”
from unstructured data – not only who or what, but the “why, when, where and how.”
A Relentless Focus on Customer Success

Companies across the full industrial spectrum and around the globe are discovering how our advanced solutions help
them thrive by helping resolve customer support issues more quickly, enable more accurate research and analysis of
customer feedback, and rapidly address and proactively prevent problems while mitigating risk. Across industries,
companies are optimizing customer interaction processes in the contact center, deepening customer relations
through effective and efficient self-serve support, and growing their competitive edge with Attensity solutions
adapted to their industry specific business needs.
Attensity’s team of vertical experts allows us to provide expert advice and specialized applications for areas such as
aerospace, automotive, consumer packaged goods, contact center outsourcing, financial services and insurance,
government and law enforcement, healthcare, hospitality, manufacturing, media and publishing, retail, technology,
and telecommunications. Attensity has a strong record of customer success across all of our products, including Voice
of the Customer, E-Service, and Research and Discovery. Three of our VoC success stories are presented here.
JetBlue Airways | www.jetblue.com
New York-based JetBlue Airways has created a new airline category based on value, service and style. Known for its
award-winning service and free TV as much as its low fares, JetBlue is now pleased to offer customers the most
legroom throughout coach (based on average fleet-wide seat pitch for U.S. airlines). JetBlue is also America’s first and
only airline to offer its own Customer Bill of Rights, with meaningful compensation for customers inconvenienced by
service disruptions within JetBlue’s control.
JetBlue Airways currently uses Attensity’s Voice of the Customer application in its customer service organization to
uncover customer issues, requirements and overall sentiment about the airline. The company’s pilot project
demonstrated a significant ability to find key information about customer sentiment and tangible data around how to
augment its services. JetBlue uses Attensity VoC to proactively manage and analyze all freeform customer feedback to
improve service, marketing, sales and the products they offer.
“From our Customer Bill of Rights to our customer advisory council, JetBlue is dedicated to bringing humanity back to
air travel,” Bryan Jeppsen, Research Analyst Manager said. “One of the best ways to do that is to listen — truly listen
— to our customers. Our commitment with Attensity enables us to mine subtle but important clues from all forms of
customer communications to continue improving all aspects of our company. We’re eager to learn as much as we
can, and we’re excited to have Attensity’s simple to use yet sophisticated software at our service.”
JetBlue customer service analysts use Attensity VoC daily to cull insights and actions from feedback. “Attensity Voice
of the Customer offers us the unprecedented ability to automatically extract customer sentiments, preferences and
requests we simply wouldn’t find any other way,” according to Jeppsen. “Attensity VOC enables us to intelligently
structure, search and integrate the data into our other business intelligence and decision-making systems.”
Charles Schwab | www.schwab.com
For this Global 1000 investment services firm, Attensity is a central part of efforts to understand and act on customer
feedback. With hundreds of thousands of interactions per month, the need to understand customer issues, act on
signs of dissatisfaction and churn and drive sales and service interactions can be the difference between success and
failure. With Attensity they are able to capture these interactions through customer service notes, emails, survey
responses and online discussions and analyze them to power customer retention and growth.
Attensity Voice of the Customer enables Schwab to analyze customer feedback to drive proactive programs and
understand emerging issues and opportunities, communicate key issues and opportunities at the client segment level
on a daily basis, and integrate this valuable customer feedback into their SAS analytics platform on their Teradata
data warehouse to expand the customer signature and to deepen customer loyalty analytics.
Attensity has become integral to Schwab customer program planning and churn identification efforts. The firm has
improved satisfaction and been able to mitigate churn via improved direct broker communications with customers
and marketing programs. Customer satisfaction, specifically reasons customers are not happy, is directly monitored
and specific issues are addressed. Issues can include problems with services, communication, collateral, and specific
individual interactions. Attensity also helps the firm dig deep into Net Promoter™ Program results, uncovering
reasons customers give low scores and identify as “detractors.” Attensity contributed to important changes to
account statements. Most importantly, Attensity reduced the time needed to analyze customer satisfaction issues
from almost 1 year to less than one week!
Whirlpool | www.whirlpool.com
As a $13.2B appliance manufacturer and the #1 appliance manufacturer in the world, Whirlpool focuses on great
products and great customer relationships to maintain and grow its global customer base. As a customer-centered
company, Whirlpool needs to understand the root cause of pain points and brand, product, and service related issues.
With the vast amounts of customer service records, emails, survey responses and online community forums, there is
more than enough data to get and use customer insights to improve customer experiences.
When Whirlpool started with Attensity in 2004, the company wanted to be able to leverage the web and over 8.5
million annual customer and repair visit interactions captured in service notes to drive marketing programs, product
development, and quality initiatives. Whirlpool has done just that and more. With over 300 Attensity VoC users
worldwide, Whirlpool listens and acts on customer data in the service department, its innovation and product
developments groups, and in the market every day.
With Attensity VoC, Whirlpool gets early warning of safety and warranty issues and has been able to mitigate
expensive recalls through rapid change out of defective parts. Whirlpool extrapolates an ~80% reduction in the cost
of recalls due to early detection. In addition to Attensity-fueled product quality improvements, Whirlpool better
understands customers’ needs and wants – and the competition and what they are doing to win over customers.



Solution Profile: Clarabridge
Clarabridge was founded with the
simple premise of enabling companies
to drive business value by
understanding key customer and
prospect experiences. Now more than
ever, consumer-focused companies
turn to Clarabridge to help retain
customers, attract new customers, cut
servicing and operational costs, sell
more products to current customers,
and develop more relevant products
and services. Clarabridge is the
leading provider of text mining
software for Customer Experience
Management (CEM) due to four key
strengths:
Commitment to CEM applications: Clarabridge’s rapid growth is due to a focus on the
value our customers gain from leveraging our VOC solutions. Our staff, technology,
customers, and partners are all 100% focused around delivering VOC applications, and our
entire company is committed to providing business value for our customers.
Speed-to-Value: No other advanced text mining solution can be deployed as efficiently and
powerfully as Clarabridge. Whether an implementation is source specific or enterprise-wide,
no other vendor can compete with the speed with which our customers not only implement but
recognize value.
Market Leadership: We believe that being a market leader is more than market statistics
and sales wins. While Clarabridge dominates these statistics, we believe that being the
market leader also means being a thought leader, an innovator and a standard-setting force
in the marketplace. Clarabridge is the first company in our industry to organize a specific
user group and conference on using text-mining to support VOC.
The Best Technology: There are many great technologies in the text mining world. Some
are proven in academia and government think tanks, others within very controlled
implementations. But no current text mining technology can compete with our ability to deliver
repeatable and tangible business value within the commercial space.
Enabled by text analytics, CEM provides the opportunity to create innovative offerings from the
start while targeting the precise customer segments and later react to customer feedback on
desired improvements and enhancements.
Text Mining to Support Business Improvements
Clarabridge’s content mining
process involves three
integrated components:
1. Collect and Connect:
Clarabridge's pre-built source
connectors allow easy access
to external and internal
customer information,
harvesting content from all of
your listening posts.
Clarabridge’s built-in feedback
module allows the design,
deployment and capture of
surveys, campaigns,
community forums and web
forms.



2. Mine and Refine:
Once all textual content is sourced, Clarabridge extracts meaning through its fully integrated and
automated features, so millions of verbatims transform seamlessly into actionable information.
Clarabridge deep parsing Natural Language Processing technology extracts parts of speech and
linguistic relationships. This output is used for downstream entity & fact extraction, sentiment
extraction, categorization, and root cause analysis.
3. Analyze and Discover:
Clarabridge provides two interfaces with a range of functional and analytic tools: Clarabridge
Reporting and Analysis and Clarabridge Navigator. Analysts and business users can identify key
themes and emerging issues, dynamically investigate results, set up alerts and drill into root
causes with the full discovery functionality integrated into the software.
Case Studies: Technology in Action
Today, leading Fortune 1000 companies across all major markets rely on Clarabridge for
the essential customer experience intelligence they require for strategic insight and pre-
emptive action. Supported by the Clarabridge Content Mining Platform, clients are able to
capture the 360-degree view on current customer attitudes and sentiment shifts, rather than to
settle for a limited understanding of their Voice of the Customer.
Use cases reflect Clarabridge’s successful engagements and their outcomes with clients across a
range of major industries.
AOL uses Clarabridge to manage, capture and analyze over 5 million website feedback forms
for over 150 products in dozens of languages. Clarabridge automatically processes and
reports the now-quantified insights directly to product teams.
A major international airline uses Clarabridge to capture and analyze over 7 million
surveys per year, allowing it to analyze drivers of loyalty and dissatisfaction for all of its
customer segments. The airline can better meet the needs of its passengers through
improved understanding of the drivers of customer satisfaction.
Gaylord Entertainment used Clarabridge to replace its manual guest satisfaction review
process with automatic coding, sentiment extraction and reporting. VOC analysis is available
in near real time, based on the needs of Gaylord employees. By driving more business through
high-value event planners and raising customer satisfaction scores, Gaylord has had
enormous business and customer experience success using Clarabridge.

Vision, Experience, and Strength
Clarabridge’s goal is to help you fully access your customer experience intelligence—and
leverage that information to your advantage. By bridging the gap between your customer’s
experience and your brand’s promise, we provide a unique portal into the human dimension of
your business. With this insight, you gain the strategic edge in serving your customers, controlling
costs and risk, competing resourcefully, and building profitability.
When you work with Clarabridge, you work with the management team that has guided the
company’s growth and innovation from the start. Each member has decades of experience, bolstered
by successful entrepreneurial ventures and strengthened by prior top-level management
experience. Executives, who include a nationally recognized entrepreneur and a multiple patent
holder, are all published authors and frequent speakers at industry conferences.
With a commitment to excellence, a partnership model with clients, and fast-paced development
processes, Clarabridge is strong from the ground up. What’s more, our financial backing, board
advisors, reputation, and partnerships are sound, ensuring our software will evolve to meet your
emerging demands.

Solution Profile: GATE
An Open Source Solution for Full Lifecycle Text Analytics
General Architecture for Text Engineering
http://gate.ac.uk/

FREE
Open source, licensed under LGPL allowing unrestricted commercial use, hosted on SourceForge.
100% JAVA
Runs on any platform supporting Java 5 or later. Developed and tested daily on Linux, Windows, and
Mac OS X.
MATURE AND ACTIVELY SUPPORTED
In development since 1996; now at version 5.0; around 20 active developers.
COMPREHENSIVE
Support for manual annotation, performance evaluation, information extraction, [semi-]automatic
semantic annotation, and many other tasks.
Over 50 plug-ins included with the standard distribution, containing over 70 resource types. Many
others available from independent sources.
INTEGRATION
Leveraging the power of other projects such as:
• Information Retrieval: Lucene (Nutch, Solr), Google and Yahoo search APIs, MG4J;
• Machine Learning: Weka, MaxEnt, SVMLight, etc.;
• Ontology Support: Sesame and OWLIM;
• Parsing: RASP, Minipar, and SUPPLE;
• Other: UIMA, Wordnet, Snowball, etc.
COMMUNITY AND SUPPORT
Friendly and active community of developers and users offers efficient help. Commercial support
available from Ontotext and Matrixware.
STANDARDS BASED
Reference implementation in ISO TC37/SC4 LIRICS project; supports XCES, ACE, TREC etc. formats;
founder member of OASIS/UIMA committee.
EFFICIENT
Optimisations included with the latest version provide a 20 to 40% speed and memory usage
improvement.
Highly efficient finite state text processing engine; many plug-ins with linear execution time.
POPULAR
Assessed as “outstanding” and “internationally leading” by an anonymous EPSRC peer review.
Used at thousands of sites: companies, universities and research laboratories, all over the world.
~35,000 downloads/year.
Rolling funding for more than 15 staff at the University of Sheffield.
DATA MANAGEMENT
Pluggable input filters with out of the box support for XML, HTML, PDF, MS Word, email, plain text, etc.
Common in-memory data model built around stand-off annotation, documents and corpora.
Persistent storage layer with support for XML or Java serialisation. I/O interoperation with many
other systems.
STANDARD ALGORITHMS
Ready made implementations for many typical NLP tasks such as tokenisation, POS tagging, sentence
splitting, named entity recognition, co-reference resolution, machine learning, etc.
USER INTERFACE
Comprehensive tool set for data editing and visualisation, rapid application development, manual
annotation, ontology management.

OVERVIEW
GATE was first released in 1996, then completely re-designed, re-written, and re-released in 2002. The
system is now one of the most widely-used systems of its type and is a comprehensive infrastructure for
language processing software development.
The new UIMA architecture from IBM/Apache has taken inspiration from GATE and IBM have paid the
University of Sheffield to develop an interoperability layer between the two systems.
Key features of GATE are:
• Component-based development reduces the systems integration overhead in collaborative research.
• Automatic performance measurement of Language Engineering (LE) components promotes quantitative
comparative evaluation.
• Distinction between low-level tasks such as data storage, data visualisation, discovery, and loading of
components and the high-level language processing tasks.
• Clean separation between data structures and algorithms that process human language.
• Consistent use of standard mechanisms for components to communicate data about language, and use
of open standards such as Unicode and XML.
• Insulation from idiosyncratic data formats (GATE performs automatic format conversion and enables
uniform access to linguistic data).
• Provision of a baseline set of LE components that can be extended and/or replaced by users as required.
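Two of the ideas in this list, keeping data structures separate from algorithms and measuring component performance automatically, can be pictured with a small sketch. It is written in Python for brevity and does not use GATE's own Java API: the `Annotation` class and the `evaluate` function are hypothetical names introduced only for illustration, and the strict-match scoring shown is one simple evaluation scheme, not necessarily the one GATE applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation: character offsets into the text plus a type."""
    start: int  # index of the first character covered
    end: int    # index one past the last character covered
    kind: str   # annotation type, e.g. "Organization"


def evaluate(gold, system):
    """Strict-match precision, recall and F1 between two annotation sets.

    An annotation counts as correct only if its offsets and type both
    match a gold-standard annotation exactly.
    """
    gold, system = set(gold), set(system)
    correct = len(gold & system)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


text = "GATE is developed at the University of Sheffield."

# The text itself is never modified; annotations only point into it.
gold = [Annotation(0, 4, "Software"), Annotation(25, 48, "Organization")]
# A hypothetical component output with one boundary and type error.
system = [Annotation(0, 4, "Software"), Annotation(39, 48, "Location")]

precision, recall, f1 = evaluate(gold, system)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Because the annotations live apart from the text, the same `evaluate` function can score any component that emits annotations, which is what makes side-by-side comparison of replaceable components practical.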
TEXT ANALYSIS
Text Analysis (TA) is a process which takes unseen texts as input and produces fixed-format,
unambiguous data as output. This data may be used directly for display to users, or may be stored
in a database or spreadsheet for later analysis, or may be used for indexing purposes in
Information Retrieval (IR) applications.
TA covers a family of applications including named entity recognition, relation extraction, and
event detection.
GATE has been used for TA applications in domains including bioinformatics, health and safety, and
17th century court reports.
TA systems built on GATE have been evaluated
among the top ones at international competitions (MUC, ACE, Pascal). A system built by the GATE team
came top in two of three categories in the NTCIR 2007 patent classification competition.
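As a rough illustration of what "unseen text in, fixed-format data out" means, the sketch below performs a toy form of named entity recognition by dictionary lookup. It is an assumption-laden example, not the method used by GATE or by any system mentioned here: the `GAZETTEER` contents and the `extract_entities` function are hypothetical, and real systems combine lookup with linguistic analysis and rules or machine learning.

```python
import re

# Hypothetical gazetteer: surface strings mapped to entity types.
GAZETTEER = {
    "University of Sheffield": "Organization",
    "IBM": "Organization",
    "GATE": "Software",
}


def extract_entities(text, gazetteer=GAZETTEER):
    """Return one fixed-format record per gazetteer match, in text order."""
    records = []
    for surface, kind in gazetteer.items():
        # Word boundaries keep "IBM" from matching inside a longer word.
        pattern = r"\b" + re.escape(surface) + r"\b"
        for match in re.finditer(pattern, text):
            records.append({
                "type": kind,
                "text": match.group(),
                "start": match.start(),
                "end": match.end(),
            })
    return sorted(records, key=lambda record: record["start"])


sentence = "IBM paid the University of Sheffield to link UIMA and GATE."
rows = extract_entities(sentence)
for row in rows:
    print(row)
```

Each record has the same four fields, so the output could be loaded into a database table or spreadsheet, or used to build an index, without further interpretation of the original sentence.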
THE GATE FAMILY
• GATE Developer: an integrated development environment for language processing components
bundled with the most widely used Information Extraction system and a comprehensive set of other
plug-ins
• GATE Embedded: an object library optimised for inclusion in diverse applications giving access to all the
services used by GATE Developer and more
• GATE Teamware: a collaborative annotation environment for high volume factory-style semantic
annotation projects built around a workflow engine and the GATE Cloud backend web services
• GATE Cloud: a parallel distributed processing engine that combines GATE Embedded with a heavily
optimized service infrastructure
FIRST COUSINS: THE ONTOTEXT FAMILY
• Ontotext KIM: UIs demonstrating our multiparadigm approach to information management, navigation
and search
• Ontotext Mimir: (Multi-paradigm Information Management Index and Repository) a massively scaleable
multiparadigm index built on Ontotext's semantic repository family, GATE's annotation structures
database plus full-text indexing from MG4J
Sponsored by: Ontotext.com, Matrixware.com
Research funding: EU, UK Research Councils and JISC
Contact: Prof. Hamish Cunningham, http://www.dcs.shef.ac.uk/~hamish/

Solution Profile: IxReveal
IxReveal is a leading analytics software company that transcends current search and business
intelligence technologies. Our patented platforms transform large volumes of unstructured and
structured data into actionable intelligence, while enabling automatic and collaborative sharing
of concepts, connections, and findings.
Clients include global corporations, financial institutions, health organizations, and major
government agencies with data-intensive needs in areas such as fraud, security, finance, crime,
and intelligence.

uReveal is aimed at helping analysts in organizations solve business problems and make informed
business decisions by leveraging their investment in collecting data. Organizations have spent
millions of dollars collecting and storing information such as crime incidents, claims, customer
calls, and emails. With uReveal, they are able to combine structured and unstructured data to find
meaningful trends and patterns, to fight crime and insurance fraud, and to reshape the organization
to be customer focused.

Law Enforcement: “uReveal made our analysts ridiculously efficient.”
- Crime Analysis Manager

uReveal provides the bottom- or top-line-changing ability to analyze huge volumes of textual data.
It works with various data sources such as existing search infrastructures, databases containing
textual information, emails, and content management systems. uReveal’s powerful decision support
capabilities are finally making it possible to find trends and patterns and zero in on critical
slices of information buried deep within the text.

Insurance: “Level of accuracy of suspicious claims identification increased five-fold and false
positives decreased.”
- Insurance Claims Manager

uReveal is a tool that has been developed for analysts, putting them in control by enabling them to
focus their precious time on value-added analysis instead of having to read all the documents
returned. It is designed for small to mid-sized workgroups that work with vast amounts of free-form
information as part of their jobs.
With an intuitive and highly configurable user interface and patent pending analytics (such as
relationship discovery and integrated charting/graphing capabilities), uReveal users are able to
create a personalized environment to get their job done faster.

Insurance: “This technology not only helps our analysts become very efficient but helps us save on
legal costs as well.”
- Workers Comp Fraud Manager

uReveal is the solution for analytical teams that work with unstructured information and provide decisive
insight as part of a mission critical business process. Using uReveal, they can both find and substantiate
business insights and recommendations, pointing back to the unstructured information as validation.
