Market Segmentation Using Textual Data

by: John V. Colias, Ph.D.

Recent advances in Artificial Intelligence (AI) enable Large Language Models (e.g., OpenAI’s ChatGPT and Anthropic’s Claude Sonnet) to “read” and “understand” the meaning of text. With this shift, the historical distinction between qualitative and quantitative analyses blurs.


The distinction blurs because the meaning of text is “understood” by a large language model and stored as a vector of real numbers called an embedding. For example, OpenAI’s embedding model represents the meaning of a text as a vector of 1536 real numbers. Embeddings provide the bridge that connects textual data to traditional analytic techniques historically reserved for quantitative data.

As a part of ongoing research into analytic methods, an experiment was conducted with cluster analysis of embeddings, which amounts to clustering on meaning rather than on words or bundles of words. The goal of this experiment was to determine whether cluster analysis of embeddings can group (or segment) people based on their written opinions about, in this example, their country’s economic problems. Can embeddings-based clusters produce a viable grouping of people? The experiment proceeded as follows:

Use OpenAI’s embedding model (https://platform.openai.com/docs/guides/embeddings) to produce the embeddings from responses to an open-ended survey question.
Perform cluster analysis of the embeddings to segment consumers.
Ask Anthropic’s AI model, Claude 3.5 Sonnet (https://docs.anthropic.com/en/docs/welcome), to identify four economic concerns expressed by respondents within each cluster.
Use expert human coders to code the same open-ended responses. (In this experiment, Nuance (https://www.nuancecoding.com/) was the coding company.)
Use the human-produced codes to validate the embeddings clusters.
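
As a rough illustration of the first step, the sketch below requests embeddings in batches. It assumes a client object from the official openai Python package (created with OpenAI()) and the model name text-embedding-3-small, which returns 1536 values per text by default. The article does not state which embedding model or batch size was used, so both are assumptions, and embed_responses is a hypothetical helper name.

```python
def embed_responses(texts, client, model="text-embedding-3-small", batch_size=100):
    """Return one embedding vector (list of floats) per open-ended response.

    `client` is assumed to be an `openai.OpenAI()` instance; the model name
    and batch size are illustrative assumptions, not the article's settings.
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(model=model, input=batch)
        # Sort by index so the vectors line up with the input order of the batch.
        ordered = sorted(response.data, key=lambda item: item.index)
        vectors.extend(item.embedding for item in ordered)
    return vectors
```

With a real client this might be called as embed_responses(responses, OpenAI()), which requires an API key in the environment. Passing the client in as an argument keeps the helper easy to test with a stand-in object.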

Data

A representative sample of consumers was interviewed for a study that included not only closed-ended questions about current economic activity but also an open-ended question about the economic problems in each respondent’s country.

The open-ended question was:

“In your opinion, what are the major economic problems in your country?
Please give as much detail as possible.”

The survey was international in scope. For this study, a random sample of n = 2000 was drawn from the completed surveys conducted in the English language, with the following country representation.

Table One: Number of Completed Interviews per Country

Analysis

OpenAI’s embedding model provided 1536 embedding values for each open-ended response. However, most of these values varied very little across respondents; that is, most had a very small standard deviation. Based on a visual inspection of the histogram of standard deviations, it was decided to use 253 (about 16.5%) of the 1536 embedding dimensions for the cluster analysis.
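
The filtering step can be sketched as follows. The article chose its cutoff by eye from a histogram; this sketch instead keeps a fixed number of the highest-variance dimensions, which is an assumption made to keep the example short. It also assumes NumPy is available, and select_high_variance_dims is a hypothetical helper name.

```python
import numpy as np

def select_high_variance_dims(embeddings, n_keep=253):
    """Keep the n_keep embedding dimensions with the largest standard deviation.

    Returns the reduced matrix and the indices of the retained columns.
    """
    X = np.asarray(embeddings, dtype=float)
    stds = X.std(axis=0)  # one standard deviation per embedding dimension
    # Indices of the n_keep largest standard deviations, in original column order.
    keep = np.sort(np.argsort(stds)[::-1][:n_keep])
    return X[:, keep], keep
```

Plotting a histogram of the per-dimension standard deviations before choosing n_keep would come closer to the visual inspection described above.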

Latent Class Cluster Analysis was used to group respondents based on the 253 embeddings. The analysis suggested a seven-cluster solution to be optimal, and each respondent was assigned to one of seven clusters or segments.
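
The article does not say which software or fit criterion was used for the Latent Class Cluster Analysis. As a stand-in, the sketch below fits Gaussian mixture models with scikit-learn, a related finite-mixture approach for continuous variables, and picks the number of clusters with the lowest BIC. The choice of library, the diagonal covariance structure, the BIC criterion, and the candidate range are all assumptions, and fit_mixture_clusters is a hypothetical helper name.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_mixture_clusters(X, candidate_k=range(2, 11), random_state=0):
    """Fit diagonal-covariance Gaussian mixtures and keep the lowest-BIC model.

    Returns (best_k, labels), where labels[i] is the cluster of row i.
    """
    X = np.asarray(X, dtype=float)
    best_model, best_bic = None, np.inf
    for k in candidate_k:
        model = GaussianMixture(
            n_components=k,
            covariance_type="diag",
            n_init=3,
            random_state=random_state,
        ).fit(X)
        bic = model.bic(X)
        if bic < best_bic:
            best_model, best_bic = model, bic
    # Modal assignment: each respondent goes to the most probable cluster.
    return best_model.n_components, best_model.predict(X)
```

A dedicated latent class package may report different fit statistics and reach a different number of clusters, so the seven-cluster result above should not be expected to reproduce exactly from this sketch.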

Segment sizes ranged from 9% to 22% of respondents.

Next, Anthropic’s Claude Sonnet AI model was used with the following prompt:

You will be acting as a market research text analysis expert to summarize survey responses within seven predetermined segments into 3 or 4 main concerns about the economy. Here is the survey topic:
<survey_topic>Economic problems in my country.</survey_topic>

And here is the specific question that respondents answered:

<survey_question>
Question: In your opinion, what are the major economic problems in your country? Please give as much detail as possible.
</survey_question>

Here are the segment numbers and responses for each survey respondent:

<survey_responses>
segment, Q18_1
6, text response
2, text response
</survey_responses>

Your task is to identify 3 or 4 key concerns for each of the seven segments. The concerns should be relevant to the survey question and topic and should capture the main ideas expressed across the various responses within each segment.

Format your response as a Python dictionary with the segment numbers as the keys and the main concerns within the segment as the values.
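
The prompt ends with the formatting instruction above. Two mechanical parts of it lend themselves to a short sketch: filling the survey_responses block and reading back the dictionary that the model is asked to return. Both function names are hypothetical, and the sketch assumes the reply may contain extra text around the dictionary; the article does not describe how the reply was actually parsed.

```python
import ast
import re

def format_survey_rows(rows, header="segment, Q18_1"):
    """Render (segment, response) pairs as the <survey_responses> block."""
    lines = [header]
    for segment, text in rows:
        # Collapse line breaks so each respondent stays on one line.
        lines.append(f"{segment}, {' '.join(str(text).split())}")
    return "<survey_responses>\n" + "\n".join(lines) + "\n</survey_responses>"

def parse_concerns(reply):
    """Parse the text between the first '{' and the last '}' as a dict literal."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no dictionary found in reply")
    # literal_eval only accepts Python literals, so no code in the reply is run.
    result = ast.literal_eval(match.group(0))
    if not isinstance(result, dict):
        raise ValueError("reply did not contain a dictionary")
    return result
```

Using ast.literal_eval rather than eval is a deliberate choice here, since the reply comes from an external model and should be treated as untrusted text.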

To provide a clearer assessment of the viability of the approach, only the survey responses for the 1621 US respondents were supplied to the Claude Sonnet AI model, which delivered four economic concerns expressed by the respondents within each cluster as shown in Table Two.

Table Two: Economic Concerns Identified by AI

A study of Table Two suggests the following interpretation:

Segment 1: A younger population with a significant proportion who struggle to pay their living expenses.
Segment 2: Those who are extremely concerned with the lack of border control and the impact of immigration on jobs. These individuals are critical of politicians and party politics that do not produce solid solutions.
Segment 3: Older, mostly male, individuals who advocate conservative fiscal policies.
Segment 4: Middle class, mostly male, individuals who support more equal taxation and better management of government spending.
Segment 5: Younger, lower income, individuals who are struggling to recover financially from the pandemic.
Segment 6: Younger, higher income, individuals feeling the impact of inflation and housing or rental costs.
Segment 7: Mostly female, lower income, individuals who struggle to pay living expenses.

Validation

While the AI-produced concerns can be obtained very quickly (the response to the prompt was delivered within seconds), one might question their validity. That is, did the AI accurately summarize the open-ended responses within each segment?

In order to validate the accuracy, human-produced codes by expert coders were used to identify the top 4 main concerns for each segment. The expert coders produced 144 unique codes for the 2000 respondents. Within each segment, the top 4 codes were identified that satisfied both of two criteria:

Highest incidence within the segment
Incidence indexed at least 20% higher than overall incidence, that is, index ≥ 120, where index = 100 × (incidence within segment) / (incidence within overall population).
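
The two criteria can be sketched as a small function. It reads them as "keep codes whose index is at least 120, then rank by incidence within the segment and take the top four", which is one plausible interpretation of the procedure. The input layout (one pair of segment and set of codes per respondent) and the name top_indexed_codes are assumptions made for illustration.

```python
from collections import Counter

def top_indexed_codes(coded, segment, top_n=4, min_index=120):
    """Return up to top_n (code, incidence, index) tuples for one segment.

    `coded` is a list of (segment, set_of_codes) pairs, one per respondent.
    index = 100 * (incidence within segment) / (incidence overall).
    """
    overall, within = Counter(), Counter()
    n_all, n_seg = len(coded), 0
    for seg, codes in coded:
        overall.update(set(codes))
        if seg == segment:
            within.update(set(codes))
            n_seg += 1
    if n_seg == 0:
        return []
    rows = []
    for code, count in within.items():
        incidence = count / n_seg
        index = 100 * incidence / (overall[code] / n_all)
        if index >= min_index:
            rows.append((code, incidence, index))
    # Rank the qualifying codes by incidence within the segment, highest first.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:top_n]
```

For instance, a code mentioned by 50% of a segment but only 25% of all respondents has an index of 100 × 0.50 / 0.25 = 200 and would qualify.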

Table Three reports the human-produced concerns for each segment. Note that the concerns in Table Three are ranked from highest to lowest incidence within the segment. The relatively high index values, ranging from 127 to 966, show that the embeddings-based cluster analysis produced segments that are strongly differentiated by the economic concerns expressed in the textual data.

Table Three: Economic Concerns Identified by Expert Coders

Table Four compares the AI-produced vs. human-produced economic concerns within each segment. The shaded text identifies where the concerns from each source appeared to align. The AI-produced concerns are delineated by bullets instead of numbers, since the AI model was told to identify the four “main” or “key” concerns and there is no indication of their rank. In contrast, the human-produced concerns were ranked based on incidence.

Table Four: AI-Produced versus Human-Produced

There is considerable alignment between the AI-produced and the human-produced economic concerns, suggesting that AI embeddings-based cluster analysis produces viable segments that differentiate people based on the beliefs and attitudes expressed in their text responses to open-ended survey questions.

Conclusions

Cluster analysis of embeddings can produce segments that are strongly differentiated.
Embeddings-based clusters correlate strongly with the actual meanings as judged by expert human coders.
Embeddings-based clusters can produce a viable grouping of people.
Given that the analysis techniques used can be automated, the method could provide a rapid summarization of text data useful for targeted messaging in advertising or political work, for example.
Clustering of responses to open-ended questions provides another way to extract more information from survey responses.
Clustering of open-ended responses could help improve the accuracy of traditional market segmentation techniques.

Author

John Colias, Ph.D.

Senior VP Research & Development

As a leader with both university teaching and business consulting experience, John focuses on predictive modeling, prescriptive analytics, and artificial intelligence. As Senior Vice President, Research & Development, at Decision Analyst, John combines academic and business interests to help analytics professionals by offering cutting-edge analytic solutions tempered by business realism. He holds a doctorate in economics from The University of Texas at Austin, with specializations in econometrics and mathematical modeling methods.
