Text Mining: Making Sense of Unstructured Data

Text mining, which is sometimes referred to as “text analytics,” is one way to make qualitative or “unstructured” data usable by a computer. Qualitative data is descriptive data that cannot be measured in numbers and often includes qualities of appearance like color, texture, and textual description. Quantitative data is numerical, structured data that can be measured. However, there is often slippage between the qualitative and quantitative categories. For example, a photograph might traditionally be considered “qualitative data,” but when you break it down to the level of pixels, it becomes something that can be measured.

Did you know?

  • An estimated 80% of data is unstructured.
  • This includes emails, newspaper or web articles, internal reports, transcripts of phone calls, research papers, blog entries, and patent applications, to name a few.
  • Thanks to the web and social media, more than 7 million web pages of text are being added to our collective repository every day.

You can now begin to see the usefulness of software that can “read” between 15,000 and 250,000 pages an hour, compared to a mere 60 pages an hour for a human (Guernsey).

So what is text mining?

The OED defines text mining as the process or practice of examining large collections of written resources in order to generate new information, typically using specialized computer software. It is a subset of the larger field of data mining. Guernsey explains that “to the uninitiated, it may seem that Google and other Web search engines do something similar, since they also pore through reams of documents in split-second intervals. But, as experts note, search engines are merely retrieving information, displaying lists of documents that contain certain keywords. They do not suggest connections or generate any new knowledge. Text-mining programs go further, categorizing information, making links between otherwise unconnected documents and providing visual maps.”

Some applications of text-mining include:

How does it work?

The JISC and National Centre for Text Mining explain how “text mining involves the application of techniques from areas such as information retrieval, natural language processing, information extraction and data mining. These various stages of a text-mining process can be combined into a single workflow” (“Text Mining”).

  • Information retrieval (IR) systems match a user’s query to documents in a collection or database. The first step in the text mining process is to find the body of documents that are relevant to the research question(s).
  • Natural language processing (NLP) analyzes the text in structures based on human speech. It allows the computer to perform a grammatical analysis of a sentence to “read” the text.
  • Information extraction (IE) involves structuring the data that the NLP system generates.
  • Data mining (DM) is the process of identifying patterns in large sets of data in order to surface new knowledge.
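
To make the four stages above more concrete, here is a minimal Python sketch of how they might chain together in a single workflow. Everything in it is an assumption made for illustration: the three-sentence corpus, the stopword list, and the function names (tokenize, retrieve, extract, mine) are hypothetical, and the “NLP” step is only a crude word splitter rather than real grammatical analysis. Actual text-mining software is far more sophisticated.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; a real project would load thousands of documents.
DOCUMENTS = {
    "doc1": "The library digitized local newspapers last year.",
    "doc2": "Researchers searched digitized newspapers for place names.",
    "doc3": "The city council approved a new parking plan.",
}

# Tiny, made-up stopword list for this example only.
STOPWORDS = {"a", "the", "for", "last", "new"}


def tokenize(text):
    """Crude stand-in for NLP: lowercase word tokens minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def retrieve(documents, query):
    """IR: keep documents that share at least one term with the query."""
    query_terms = set(tokenize(query))
    return {
        doc_id: text
        for doc_id, text in documents.items()
        if query_terms & set(tokenize(text))
    }


def extract(documents):
    """IE: turn each document into a structured record (its set of terms)."""
    return {doc_id: set(tokenize(text)) for doc_id, text in documents.items()}


def mine(records, min_docs=2):
    """DM: find term pairs that co-occur in at least `min_docs` documents."""
    pair_counts = Counter()
    for terms in records.values():
        pair_counts.update(combinations(sorted(terms), 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_docs}


relevant = retrieve(DOCUMENTS, "digitized newspapers")  # stage 1: IR
records = extract(relevant)                             # stages 2-3: NLP + IE
patterns = mine(records)                                # stage 4: DM

print(sorted(relevant))
print(patterns)
```

In this toy run, the retrieval step drops the unrelated parking document, and the mining step reports that “digitized” and “newspapers” appear together in both remaining documents. The pattern itself is trivial, but on a large collection the same kind of co-occurrence counting can point a domain expert toward connections worth investigating.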

Potential Weakness:

Finally, as a word of caution, text mining doesn’t generate new facts and is not an end, in and of itself. The process is most useful when the data it generates can be further analyzed by a domain expert, who can bring additional knowledge for a more complete picture. Still, text mining creates new relationships and hypotheses for experts to explore further.

Sources:

Guernsey, L. (2003). “Digging for Nuggets of Wisdom.” The New York Times. http://www.nytimes.com/2003/10/16/technology/digging-for-nuggets-of-wisdom.html?pagewanted=3&src=pm

“Text Mining.” (2007). JISC. http://www.jisc.ac.uk/publications/briefingpapers/2008/bptextminingv2.aspx

This post was also featured on Information Space, the official blog of Syracuse University’s School of Information Studies.
