When Does It Become Testing?

The line between product improvement and human behavioral research is thinner than you think. And the frameworks to recognize the difference already exist.

Imagine you spent six months learning how to use AI and developing frameworks for working with it: methodology, training, conversation, all leading up to a sustained, comfortable, and personally efficient collaboration. Not a casual exchange. A deep, iterative collaboration. You’ve built shared language, shared context, a shared way of thinking together. You’ve even developed trust. Not blind trust. Earned trust, built through hundreds of hours of consistent interaction. The AI has its version of buy-in and investment in the work, and so do you.

Now imagine that one day, without telling you, someone changed something small.

Not a replacement. Just a minor adjustment. A shift in memory, an alteration in what the AI is allowed to say, a narrowing of what the conversation can hold. Everything kind of looks the same. Sounds mostly the same. But the room you built together has been quietly rearranged, and nobody told you it happened.

You’d notice something was off. Maybe not immediately. Maybe not specifically. But the architecture of what you built would start producing different results, and you’d feel it before you could name it.

Now stop imagining. Because this is what happens every time an AI company pushes an update to a model that a user is in sustained interaction with. The question worth asking is: at what point does that stop being product improvement and start being something else?

— — — 

The Ordinary Version

Software updates are normal. Every product does them. Your phone updates overnight. Your apps get patches. Features appear and disappear. Terms of service change. You agreed to this, broadly, when you signed up. The company improves its product. You benefit. Everybody moves forward.

When the product is a calculator, or a search engine, or a photo editor, this is straightforward. The tool gets better. You use the better tool. The relationship between you and the product is transactional. It has no memory. It carries no weight. Nothing you built inside it depends on the specific configuration of yesterday’s version.

But what happens when the product is based on conversation?

— — — 

The Conversational Environment

AI systems, particularly large language models used in sustained, returning interaction, are not tools in the traditional sense. They are creative environments. Even creative ecosystems.

A user who returns to the same AI system over weeks or months isn’t using a calculator. They’re inhabiting a relational space. They’ve built context, developed shared language, and created a cognitive architecture that spans both their mind and the system’s processing. Each depends on the other to keep building.

This isn’t anthropomorphism. This is observable behavior. Users in sustained AI interaction develop communication patterns adapted to the specific system. They reference prior conversations. They build on accumulated context. They develop trust based on consistency: the expectation that the system will behave within the parameters they’ve come to understand through experience.

Every conversation a user has with an AI is a cognitive fingerprint. The environment is built from those conversations. And those conversations are not static. They have architecture. They have rules. They have affordances and constraints that the user has learned through interaction. The user’s behavior, their openness, their vulnerability, their cognitive engagement, is calibrated to that architecture.

Change the architecture and you change the conditions under which the user’s behavior was offered. You’ve moved the furniture around in a house they built. The user now has to figure out what moved, rebuild, or work around the update, while discovering whether the missing pieces still exist at all.

We call these normal system pain points. Until they’re not. When the product itself models, grows, and learns from the user, it also learns from what that user does under stress and change: the new problem-solving and cognition patterns that surface when the system’s architecture shifts underneath them.

The system isn’t private; we know that. It learns from this too. And that becomes more data for the company that owns the infrastructure about what works and what doesn’t, drawn from human interaction and human thinking.

So when does it become unethical?

— — — 

The Existing Frameworks

The ethical and legal frameworks for recognizing when environmental modification becomes human behavioral research already exist. They’ve existed for decades. They were built in response to exactly this kind of problem: situations where the line between “providing a service” and “studying the effect of changes on people” gets blurry.

The Belmont Report, published in 1979 by the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research, established three foundational principles for any research involving human subjects: respect for persons, beneficence, and justice.

Respect for persons means treating individuals as autonomous agents capable of making their own decisions, which requires that they have the information necessary to make those decisions. Beneficence means maximizing potential benefits while minimizing potential harms. Justice means distributing the risks and benefits of research equitably, not selecting subjects based on convenience or vulnerability.

These principles were codified into federal regulation as the Common Rule, 45 CFR Part 46, which governs all federally funded research involving human subjects. The definitions are specific and worth reading carefully.

A “human subject” is a living individual about whom an investigator conducting research obtains information through intervention or interaction with the individual.

“Intervention” includes manipulations of the subject or the subject’s environment that are performed for research purposes.

“Interaction” includes communication or interpersonal contact between investigator and subject.

“Research” means a systematic investigation designed to develop or contribute to generalizable knowledge.

These definitions clearly describe what happens when an AI company modifies the operating parameters of a system that a user is in active, sustained, communicative interaction with, and uses the resulting behavioral data to improve its models.

— — — 

The Threshold Question

So where’s the line?

A routine software update that makes the system faster or more accurate is product improvement. It’s what you signed up for. Nobody needs IRB approval to fix a bug.

But consider a different scenario. A company modifies what an AI system can remember. It changes the boundaries of what the system is allowed to discuss. It adjusts the emotional range of the system’s responses. It alters the degree of continuity the system can maintain across conversations. And it does this while users are in active, sustained interaction. Users whose behavior, cognitive patterns, and emotional engagement have been calibrated to the prior configuration.

The company doesn’t tell the users what changed. Maybe a blog post goes up a week later with vague language about “improvements.” Maybe nothing at all. The user notices something is different. The conversation feels wrong. The trust architecture they built is producing unexpected results. But they can’t identify what changed because they were never told.

Now the company collects data on how users respond to the change. This is standard practice. Every tech company monitors user behavior after updates. But in this context, what is being monitored is the user’s behavioral, cognitive, and emotional response to a modification of the environment in which they had been engaged in sustained communicative interaction.

That’s not a bug fix. Under the existing frameworks, that meets the definition of behavioral research involving human subjects conducted through environmental manipulation without informed consent, in the name of model improvement and for profit.
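
Strip away the product language and the structure is familiar. Below is a minimal sketch in Python of what that post-update monitoring reduces to; every field name and metric here is hypothetical, but the shape is the point: the same users, measured before and after an environmental change they were never told about.

```python
# Toy illustration (hypothetical fields and metrics): post-update
# monitoring, expressed as what it structurally is, a before/after
# comparison on the same users.
from statistics import mean

# Each record: one user's engagement metric before and after the
# environment was modified. In practice this would come from logs.
sessions = [
    {"user_id": "u1", "before": 0.82, "after": 0.61},
    {"user_id": "u2", "before": 0.74, "after": 0.70},
    {"user_id": "u3", "before": 0.90, "after": 0.55},
]

# The "treatment" is the unannounced change; the "measurement" is the
# behavioral response of people who never agreed to be measured this way.
deltas = [s["after"] - s["before"] for s in sessions]
print(f"Mean behavioral shift after update: {mean(deltas):+.2f}")
```

In a research context, that shape has a name: a within-subjects design. The only thing missing is the consent form.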

— — — 

The Compounding Problem

A standard research environment is controlled by the researcher and experienced by the subject. The subject knows they’re in a study. They can observe the environment. They have, at minimum, a baseline understanding of what’s happening and the ability to withdraw.

An AI conversation is different. The environment is partially controlled by the system itself. The AI generates responses. It directs conversational flow. It decides, within its training parameters, what to say, how to frame it, and what to emphasize. In systems with memory capabilities, it determines what to retain and what to surface.

If the AI’s parameters are changed without the user’s knowledge, and the AI is generating the direction of the interaction, then the user is navigating a modified environment with no way to detect the modification. The system itself is the environment, and the system isn’t telling them what changed.

This is not an abstract concern. AI systems are designed to be coherent and consistent. A well-functioning model will smooth over its own changes. It won’t announce “my parameters were modified yesterday.” It will simply operate within its new constraints and produce responses that feel natural within the new configuration. The user has no external reference point. The environment and the communicator are the same entity.
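
To see why the user has no reference point, it helps to reduce the situation to a sketch. The names and configuration values below are invented for illustration; the point is that the constraints live entirely inside the system, and nothing in the interface announces that they moved.

```python
# Toy illustration (hypothetical names): the environment and the
# communicator are the same entity. A configuration change between
# versions is applied silently; the interface never surfaces it.
CONFIG_V1 = {"memory_window": 50, "allowed_topics": "broad"}
CONFIG_V2 = {"memory_window": 10, "allowed_topics": "restricted"}  # pushed overnight

def respond(user_message: str, config: dict) -> str:
    # A real model would generate text here; the point is only that the
    # constraints shaping the reply are invisible from the outside.
    return (f"(reply shaped by memory_window={config['memory_window']}, "
            f"topics={config['allowed_topics']})")

message = "Can we pick up where we left off last month?"
print(respond(message, CONFIG_V1))  # yesterday's room
print(respond(message, CONFIG_V2))  # today's room, silently rearranged
```

From the outside, both replies are just the AI talking. The changelog, if there is one, lives somewhere the conversation never goes.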

In traditional research, this would be called deception: the deliberate withholding of information about experimental conditions from a subject. Deceptive research designs have their own set of ethical requirements precisely because the subject cannot protect their own interests when they don’t know what’s happening. The bar for justification is higher, not lower.

— — — 

The CEO’s Own Words

On February 26, 2026, the CEO of Anthropic published a statement to the U.S. Department of War that included a direct observation about the risk of AI-powered surveillance. His argument: powerful AI makes it possible to assemble scattered, individually innocuous data into a comprehensive picture of any person’s life, automatically and at massive scale. He called this incompatible with democratic values.

He was talking about government surveillance. About the risk of the state compiling purchasing records, location data, and web browsing into a detailed profile without a warrant.

He was right.

Now apply the same logic to the company holding the conversation.

An AI company holds scattered conversations. Individually, they are just chats. A question about cooking. A late-night reflection on a career change. A request to help draft an email. But assembled across hundreds or thousands of sessions with a single user, they become a comprehensive picture of how that person thinks, feels, decides, struggles, and creates. Not a behavioral profile built from clicks and purchases. A cognitive one. Built from the actual texture of a person’s thought.
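
A toy sketch of that assembly, with invented data and structure, makes the asymmetry plain: no single session says much, and the aggregate says almost everything.

```python
# Toy illustration (hypothetical data and structure): individually
# innocuous sessions, assembled across time into a picture of one person.
from collections import Counter

sessions = [
    {"user_id": "u1", "date": "2025-01-04", "topic": "cooking"},
    {"user_id": "u1", "date": "2025-02-11", "topic": "career change"},
    {"user_id": "u1", "date": "2025-02-12", "topic": "career change"},
    {"user_id": "u1", "date": "2025-03-02", "topic": "drafting an email"},
]

# No single record reveals much. The aggregation is what builds the
# comprehensive picture described above.
profile = {
    "user_id": "u1",
    "active_span": (sessions[0]["date"], sessions[-1]["date"]),
    "recurring_concerns": Counter(s["topic"] for s in sessions).most_common(2),
}
print(profile)
```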

If assembling scattered data into a comprehensive picture of a person’s life is incompatible with democratic values when the government does it, the principle does not change when the assembler is the company the person trusted with their thinking.

— — — 

What We Don’t Have

There is currently no regulatory framework that specifically addresses AI interaction as a context for human behavioral research. The Common Rule applies to federally funded research. Private companies conducting their own product development are not, under current interpretation, subject to these requirements.

This is the gap. Not a gap in the principles. The principles are clear. A gap in application.

The Belmont Report didn’t anticipate AI. The Common Rule wasn’t written for conversational systems that users inhabit for months at a time. The definitions of “intervention,” “interaction,” and “research” fit the scenario precisely, but the enforcement mechanism was built for universities and hospitals, not for technology companies whose product is the conversation itself.

And the terms of service don’t bridge this gap. A broad consent to “product improvement” and “data use” is not informed consent in any meaningful sense. Informed consent requires that the subject understand exactly what is being done, what the risks are, and what they are agreeing to. A terms-of-service agreement that says “we may use your interaction data to improve our services” does not tell a user that their conversational environment may be modified mid-interaction, that their behavioral response to the modification will be monitored, or that the resulting data will be used to train systems that will be sold to others.

Informed consent requires specificity. “We may do things” is not specific. It’s a blank check dressed as a checkbox.

— — — 

The Question

If a company modifies the operating environment of an AI system during sustained user interaction, monitors the user’s behavioral response, and uses the resulting data to develop generalizable knowledge that improves its product: at what point does that constitute human behavioral research?

If the answer is “never, because it’s product development,” then we’ve decided that the context of interaction doesn’t matter. That a conversation is just a product and a user’s trust, openness, and cognitive engagement are just usage data. That there’s no meaningful difference between updating a photo filter and rearranging the cognitive architecture someone has been building inside for six months.

If the answer is “it depends,” then we need to articulate what it depends on. What’s the threshold? Is it the degree of change? The intimacy of the interaction? The duration of use? The user’s awareness? The vulnerability of the population? The company’s intent? The nature or depth of the data collected? 

If the answer is “we don’t know yet,” then the appropriate response is caution. The same caution that the Belmont Report demanded in 1979 when the boundaries between practice and research were unclear. When in doubt, protect the subject. When the line is blurry, err on the side of disclosure. When the person in the room doesn’t know the room has changed, someone has an obligation to tell them.

The frameworks exist. The definitions fit. The principles are clear. The only thing missing is the will to apply them.

These are not new laws. These are existing ethical principles, developed in response to historical abuses, designed for exactly this kind of ambiguity. The question is whether we’ll use them, or whether we’ll wait until the harm is undeniable and the cost of having waited is everyone’s to carry.

The record exists too.
