In today’s privacy-conscious world, ethical data collection is king. Traditional data collection methods that feel intrusive or borderline stalker-like are no longer acceptable—or even legal. As regulations like GDPR, CCPA, and emerging global privacy laws tighten, researchers, marketers, analysts, and organizations must adopt ethical data collection strategies that respect user consent, prioritize transparency, and protect individual privacy.
This privacy-first approach focuses on obtaining explicit permission, using anonymized and aggregated data, implementing robust security measures, and leveraging privacy-enhancing technologies (PETs) without compromising research quality. Whether you’re conducting academic studies, market research, or business intelligence, learning how to gather valuable insights ethically is now essential. This guide explores practical, compliant methods for responsible data collection that build trust, avoid legal risks, and deliver reliable results in the post-cookie, consent-driven digital landscape.
Part One: The Data Gold Rush and Its Morning-After Problem
The Information Edge in Trading — A Brief Love Story
For decades, the competitive advantage in financial markets came from having information faster and better than the next person. Think about it. Before electronic markets, traders on the floor of the New York Stock Exchange had physical positional advantages. Being two feet closer to the specialist post was literally worth money. Then came electronic trading, algorithmic strategies, and the dawn of alternative data — satellite imagery of car parks, credit card transaction aggregates, social media sentiment scores, shipping container tracking, and about fourteen thousand other sources that your compliance officer has never heard of and would very much like to keep it that way.
The alternative data market, which barely existed fifteen years ago, was valued at approximately $1.7 billion in 2021 and is projected to exceed $17 billion by 2027, according to research published in the Journal of Financial Data Science. The sheer growth of this market reflects just how hungry the buy-side has become for informational edges. And the buy-side is not alone — proprietary trading desks, market makers, and even retail quantitative traders are all fishing in the same data lake.
The problem? A significant portion of that lake was built by collecting water from other people’s gardens without asking.
When Research Becomes Surveillance
Here is a joke I like to tell at conferences, and it lands every single time: “A hedge fund hired a data vendor, a compliance lawyer, and an ethics consultant. The compliance lawyer said ‘You can’t use this data.’ The ethics consultant said ‘You shouldn’t use this data.’ The hedge fund said ‘Get me two more data vendors.’”
I am only half-joking.
The academic literature is remarkably clear on the direction of travel. A 2024 peer-reviewed study published in the International Journal of Applied Research in Social Sciences — Okorie, Udeh, Adaga, DaraOjimba & Oriekhoe (2024), “Ethical Considerations in Data Collection and Analysis: A Review” — found that despite increased regulatory awareness, many organisations’ data collection practices still fail to adequately inform participants about how their data is being used, creating significant ethical deficits even where nominal compliance exists.
In other words: ticking the GDPR checkbox is not the same as actually being ethical. This distinction will cost you if you ignore it.
Part Two: The Regulatory Landscape — It Is Not Coming, It Is Already Here
GDPR: The Rule That Changed Everything (Except the Behaviour of People Who Should Know Better)
The General Data Protection Regulation, which came into force across the European Union in May 2018, fundamentally rewrote the rules on personal data. And I want to be very clear about something: the GDPR is not a tech law. It is a financial law. It is a research law. It is a trading law. Any firm collecting, processing, or trading on data that touches EU citizens is subject to it, full stop.
Let me paint you a picture of how seriously regulators take this. In 2023 alone, GDPR fines across Europe topped €1.6 billion — a record at the time. Meta received a fine of €1.2 billion from the Irish Data Protection Commission for transferring EU user data to the United States without adequate safeguards. That is not a speeding ticket. That is the kind of fine that makes your CFO develop a very specific nervous twitch.
For traders and financial researchers, the relevant provisions of GDPR centre on lawful basis for processing, data minimisation, purpose limitation, and the rights of data subjects. The European Data Protection Board, in guidelines released in April 2026, clarified the framework for processing personal data in scientific and financial research contexts, including the critical principle that broad consent for research data is only permissible when data subjects would “reasonably expect” their data to be used in that manner. Reasonable expectations. Let that sink in for a moment. That is the standard.
So if you are buying scraped social media data to build sentiment models — and the users of those platforms had no reasonable expectation their posts would feed a hedge fund’s alpha strategy — you have a problem. A 1.2-billion-euro-shaped problem, potentially.
The US Regulatory Patchwork: A Different Kind of Chaos
Now, across the Atlantic, things are different. Not better. Different. The United States operates without a single federal privacy law equivalent to GDPR. Instead, traders are navigating a genuinely bewildering patchwork: the California Consumer Privacy Act (CCPA), the California Privacy Rights Act (CPRA), the Virginia Consumer Data Protection Act, the Colorado Privacy Act, sector-specific rules from FINRA, the SEC, and others, plus a set of FTC enforcement actions that are aggressive but inconsistently applied.
I once described the US privacy regulatory landscape to a client as follows: “Imagine GDPR is a well-organised restaurant with a clear menu and a firm no-substitutions policy. The US system is like seventeen different pop-up food stalls, each run by a different family, some of which are closed on alternating Tuesdays, and one of which is currently under investigation by the health inspector.”
They laughed. Then they hired a privacy lawyer.
The SEC has been particularly active. Rule 10b-5 and its enforcement around material non-public information (MNPI) have long governed the ethics of information gathering in trading. But the Commission has expanded its focus to include the use of alternative data sources that may effectively constitute insider information if derived from privileged access channels. The 2021 enforcement action against the alternative data provider App Annie and subsequent SEC guidance have put the entire alternative data industry on notice.
Part Three: What Does Ethical Data Collection Actually Look Like?
The Four Pillars of Ethical Research in the Privacy-First Era
Look, I am not going to stand here and act like ethics is simple. If it were simple, we would not need entire departments dedicated to it. But for traders and financial researchers, ethical data collection comes down to four non-negotiable principles. I call them the Four Pillars, because everything else is built on top of them, and if you pull one out, the whole thing falls on your head like a Looney Tunes cartoon safe.
Pillar One: Lawful Basis and Informed Consent
You need a legitimate reason to collect data. Under GDPR, that means one of six lawful bases: consent, contract, legal obligation, vital interests, public task, or legitimate interests. For financial research, the legitimate interests basis is commonly relied upon — but it requires a balancing test that genuinely weighs the interests of the data subject against the commercial interests of the firm.
Legitimate interests does not mean “I really want this data and I have convinced myself it is fine.” I have seen firms try that argument. It does not hold up.
The paper by Mirishli (2025), “Ethical Implications of AI in Data Collection: Balancing Innovation with Privacy” (AI Data Chronicles, DOI: 10.36719/2706-6185/38/40-55), draws attention to the critical gap between regulatory compliance in letter and compliance in spirit — finding that algorithmic data collection in financial services routinely outpaces the ethical frameworks designed to govern it. The research covers regulatory approaches in the EU, the US, and China, noting that the challenge is not merely local but globally systemic.
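To make the balancing test less abstract, here is a minimal sketch (hypothetical field names, not a regulatory template) of how a research team might record a legitimate-interests assessment as structured data next to the dataset it covers, so the reasoning is captured when the decision is made rather than reconstructed later under pressure.

```python
from dataclasses import dataclass, field, asdict
from datetime import date

# The six GDPR lawful bases (Article 6(1)); these constant names are illustrative.
LAWFUL_BASES = {"consent", "contract", "legal_obligation",
                "vital_interests", "public_task", "legitimate_interests"}

@dataclass
class LegitimateInterestsAssessment:
    """Hypothetical record of a balancing test for one dataset."""
    dataset_id: str
    purpose: str             # why the firm wants the data
    necessity: str           # why less intrusive data would not do the job
    subject_impact: str      # expected impact on the data subjects
    safeguards: list[str] = field(default_factory=list)
    outcome: str = "pending" # "approved", "rejected", or "pending"
    reviewed_on: date | None = None

def record_basis(basis: str, assessment: LegitimateInterestsAssessment | None) -> None:
    """Refuse to register a dataset under a basis the firm cannot actually support."""
    if basis not in LAWFUL_BASES:
        raise ValueError(f"{basis!r} is not a recognised lawful basis")
    # Relying on legitimate interests without an approved balancing test is the
    # "I convinced myself it is fine" pattern warned about above.
    if basis == "legitimate_interests" and (assessment is None or assessment.outcome != "approved"):
        raise ValueError("legitimate interests requires an approved balancing test on file")

lia = LegitimateInterestsAssessment(
    dataset_id="card_spend_panel_q1",
    purpose="sector-level spending trend research",
    necessity="aggregated merchant categories are too coarse for weekly granularity",
    subject_impact="low: no individual-level outputs leave the research environment",
    safeguards=["pseudonymised keys", "aggregation threshold of 50", "12-month retention"],
    outcome="approved",
    reviewed_on=date(2025, 3, 14),
)
record_basis("legitimate_interests", lia)  # passes; an unapproved record would raise
```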
Pillar Two: Data Minimisation
This is the one that traders hate most. You are not supposed to collect everything and figure out what to do with it later. You are supposed to collect only what you need, for the specific purpose you have identified, for only as long as necessary.
I know. I know. That is not how trading desks work. We are data hoarders by nature. We collect data like my grandmother collected ceramic owls — without fully understanding why, but absolutely certain that one day it will be important. But the law is the law. Data minimisation is not optional under GDPR, and regulators have specifically flagged bulk data collection without clear purpose as a red flag.
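To show what minimisation looks like inside a pipeline, here is a sketch with invented column names: whitelist the fields the documented purpose actually needs, pseudonymise the identifier, and let everything else fall on the floor before the data ever reaches your research store.

```python
import hashlib
import pandas as pd

# Hypothetical raw vendor feed: far more fields than the research purpose needs.
raw = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "postcode": ["EC2A 4BX", "10001"],
    "merchant_category": ["grocery", "fuel"],
    "amount": [42.10, 61.75],
    "device_id": ["d-991", "d-313"],  # not needed for macro spend analysis
    "gps_lat": [51.52, 40.75],        # not needed
    "gps_lon": [-0.08, -73.99],       # not needed
})

NEEDED = ["merchant_category", "amount"]  # fields the documented purpose requires

def minimise(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the required fields; replace the direct identifier with a salted hash."""
    out = df[NEEDED].copy()
    salt = "rotate-me-quarterly"  # illustrative; manage salts properly in practice
    out["subject_key"] = [
        hashlib.sha256((salt + uid).encode()).hexdigest()[:16] for uid in df["user_id"]
    ]
    return out

print(minimise(raw))
```

Worth noting: a salted hash is pseudonymisation, not anonymisation, so the minimised table is still personal data under GDPR. The point of the step is to shrink what you hold, not to take it out of scope.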
Pillar Three: Transparency and Purpose Limitation
If you collect data for research purpose A, you cannot quietly use it for research purpose B without a fresh lawful basis. This catches a lot of firms off guard. You buy a consumer spending dataset for macro analysis. Three months later, someone on the quant team wants to use individual-level spending patterns to build a predictive model on specific companies. That is a new purpose. That may require a new lawful basis. That conversation with your compliance team that you have been avoiding? Now is the time.
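One lightweight control, sketched below with invented names, is a purpose registry that every research job has to clear before it touches a dataset, so that repurposing becomes an explicit, logged decision rather than a quiet one.

```python
# Hypothetical purpose registry: dataset -> purposes that have an approved lawful basis.
PURPOSE_REGISTRY = {
    "consumer_spend_2024": {"macro_aggregate_analysis"},
}

class PurposeNotApproved(Exception):
    pass

def check_purpose(dataset: str, purpose: str) -> None:
    """Raise if this use of the dataset has no approved lawful basis on record."""
    approved = PURPOSE_REGISTRY.get(dataset, set())
    if purpose not in approved:
        raise PurposeNotApproved(
            f"{purpose!r} is a new purpose for {dataset!r}: "
            "take it to compliance before running the job"
        )

check_purpose("consumer_spend_2024", "macro_aggregate_analysis")  # approved purpose, passes

try:
    check_purpose("consumer_spend_2024", "single_name_equity_prediction")
except PurposeNotApproved as err:
    print(err)  # the conversation with compliance happens now, not after deployment
```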
Pillar Four: Security and Accountability
Data collected under a legitimate basis must be protected to a reasonable standard. Gu’s working paper “Data Trade and Consumer Privacy” (arXiv:2406.12457) develops an integrated economic model demonstrating that the privacy costs of cross-market data trading are frequently underweighted in firms’ cost-benefit analyses — not because firms are evil, but because those costs are borne by data subjects rather than the firms themselves. This is precisely the externality that regulation is designed to correct.
Part Four: Case Studies — The Good, The Bad, and The Very Expensive
Case Study 1: The Alternative Data Vendor Who Almost Wasn’t
In 2019, a US-based hedge fund working with an alternative data vendor discovered, mid-due-diligence, that the vendor’s mobile app location data was collected without adequate disclosure to app users. The users had technically accepted a terms of service agreement, but it was buried in 47 pages of legalese that, if printed, would significantly damage the Amazon rainforest.
The hedge fund — to its considerable credit — walked away from the data contract entirely. Their general counsel later described the decision as the easiest hard call they ever made. “The alpha might have been real,” he said in a panel discussion I attended. “But the liability was realer.”
This case illustrates something crucial: due diligence on data provenance is not optional. Before you buy alternative data, you need to understand how it was collected, under what disclosure, and whether the collection chain at every step was lawful and ethical. If your data vendor cannot answer those questions clearly, that is your answer.
Case Study 2: LinkedIn Scraping and the HiQ Laboratories Case
Few legal battles have shaped the landscape of public web data collection more dramatically than hiQ Labs v. LinkedIn, which wound its way through the US court system from 2017 to 2022. hiQ was a company that scraped publicly available LinkedIn profile data to build workforce analytics products — products used, among others, by hedge funds and financial researchers trying to track talent flows, executive hiring trends, and corporate strategy signals.
LinkedIn argued this violated the Computer Fraud and Abuse Act. The Ninth Circuit found, in a series of rulings, that scraping publicly available data did not constitute unauthorised access under the CFAA. The case was eventually settled in 2022.
The hiQ saga is not a victory for the scraping industry. It is a cautionary tale. The case was confined to the CFAA and did not address GDPR, state privacy laws, or the broader ethical dimensions of using individuals’ professional data without meaningful consent for commercial intelligence purposes. For every trader who read the hiQ outcome and thought “we’re fine” — I am here to tell you that you are only fine in one very specific legal corridor, and that corridor has seventeen other doors that are still very much closed.
Case Study 3: Palantir and the Ethics of Large-Scale Data Integration
Palantir Technologies, which serves both government agencies and financial institutions, has become something of a Rorschach test for the ethics of data collection. The firm’s Gotham and Foundry platforms integrate disparate datasets at scale — transaction records, communication metadata, public filings, third-party data feeds — to create comprehensive intelligence pictures.
From a pure capability standpoint, this is extraordinary. From an ethical standpoint, it is exactly the kind of activity that regulators have been increasingly focused on. The International Journal of Financial Studies published research in 2023 (Adnan et al., “Building Trust in Fintech: An Analysis of Ethical and Privacy Considerations in the Intersection of Big Data, AI, and Customer Trust”, Int. J. Financial Stud. 2023, 11(3), 90; DOI: 10.3390/ijfs11030090) directly addressing this kind of large-scale integration, finding that the aggregation of individually innocuous datasets can create privacy violations more severe than any single dataset alone — the so-called aggregation problem. You can know someone’s name. You can know their employer. You can know their postcode. But when you combine their name, employer, postcode, income bracket, spending patterns, travel history, and professional connections — you have built a surveillance file, whether you intended to or not.
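You can make the aggregation problem concrete with a crude re-identification check (illustrative column names, and no substitute for a proper disclosure-risk review): count how many records share each combination of quasi-identifiers, and treat any combination that isolates fewer than k people as a surveillance file in the making.

```python
import pandas as pd

# Hypothetical joined dataset: each column looks harmless on its own.
df = pd.DataFrame({
    "employer":  ["AcmeBank", "AcmeBank", "AcmeBank", "BetaFund"],
    "postcode":  ["EC2A",     "EC2A",     "SW1A",     "EC2A"],
    "job_title": ["analyst",  "analyst",  "CFO",      "trader"],
})

QUASI_IDENTIFIERS = ["employer", "postcode", "job_title"]
K = 3  # minimum acceptable group size (k-anonymity threshold)

# How many people share each combination of quasi-identifiers?
group_sizes = df.groupby(QUASI_IDENTIFIERS).size()
risky = group_sizes[group_sizes < K]

print(f"{len(risky)} quasi-identifier combinations isolate fewer than {K} people:")
print(risky)
```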
The lesson from Palantir — and from the broader alternative data industry — is that capability is not permission. Just because you can combine datasets does not mean you should.
Case Study 4: The Replika AI Privacy Scandal and Its Lessons for Traders
In 2024, researchers published a detailed analysis of Replika AI’s data practices: Toth et al., “Smoke Screens and Scapegoats: The Reality of GDPR Compliance — Privacy and Ethics in the Case of Replika AI” (arXiv:2411.04490). The study found that despite having a GDPR-compliant privacy notice, Replika’s actual data collection practices diverged materially from what users were told — harvesting sensitive personal data without users’ full awareness.
For traders, the lesson is not about AI companions. It is about the gap between stated data practices and actual data practices. This gap exists across the financial data supply chain. Data vendors produce privacy notices. Those privacy notices describe one set of practices. The actual data collection architecture sometimes does something different. Your job — as a responsible user of that data — is to probe that gap aggressively and not assume that a glossy privacy policy equals clean data provenance.
Part Five: The Emerging Regulatory Frontier — What Is Coming Next
The EU AI Act and Its Financial Market Implications
The EU AI Act, finalised in 2024, introduced risk-tiered regulation for artificial intelligence systems. High-risk AI applications — which include AI used in credit assessment, financial supervision, and employment decisions — are subject to stringent requirements around data quality, transparency, and human oversight. For traders using AI-driven research tools, this means that the models you rely on need to be trained on data that was collected ethically and documented comprehensively.
I am going to pause here because I know what some of you are thinking. “This is a European thing. I trade in London, or New York, or Singapore. Why do I care?”
Here is why you care. Regulatory contagion is real. GDPR was a European law in 2018. By 2023, it had influenced privacy legislation in over 130 countries. The EU AI Act will do the same thing. The question is not whether your jurisdiction will adopt similar standards. The question is whether you will still be in business when it does.
SEC Alternative Data Guidance and the MNPI Problem
The SEC’s ongoing focus on material non-public information has increasingly extended to alternative data. The Commission’s 2021 guidance made clear that certain forms of alternative data — particularly where the data vendor had a special relationship with the issuer — could constitute MNPI even if the data was technically “public.”
This creates a genuinely tricky situation for quantitative researchers. You find a dataset with extraordinary signal. The vendor’s data collection methodology is opaque. You do not know if the signal comes from public information cleverly organised or from information that should not legally be in the dataset at all. In that situation, the ethical — and legal — answer is to perform robust data provenance due diligence before you touch it. Every time.
A 2023 arXiv working paper — Khaire et al., “Big Data Privacy in Emerging Market Fintech and Financial Services: A Research Agenda” (arXiv:2310.04970) — identifies the lack of standardised provenance documentation as one of the most significant structural risks in the alternative data ecosystem, particularly in emerging market contexts where regulatory oversight is less robust.
Part Six: Building an Ethical Data Collection Framework
The Trader’s Practical Playbook for Privacy-First Research
Here is where we get practical. Because I am not just here to scare you — although, honestly, if you were not a little bit scared before, I hope you are at least mildly concerned now. Like the financial equivalent of realising, “Oh, I probably should have been flossing all along.”
Step One: Know Your Data Supply Chain
Before you use any dataset, map its entire provenance. Where was the data collected? From whom? Under what disclosure? Through what technological mechanism? Did the collection comply with the laws of every jurisdiction in which the data subjects reside? If you cannot answer these questions, your data vendor should be able to. If they cannot, walk away.
This is not paranoia. This is risk management. The SEC, GDPR supervisory authorities, the FCA, and MAS in Singapore have all demonstrated willingness to pursue enforcement actions that go up the data supply chain to the ultimate user — which, in this scenario, is you.
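A simple way to force those questions to get answered, sketched below with illustrative field names and no real vendor API implied, is to refuse to onboard any dataset whose provenance record has blanks in it.

```python
from dataclasses import dataclass, asdict

@dataclass
class ProvenanceRecord:
    """Hypothetical provenance summary a vendor must supply before onboarding."""
    dataset_id: str
    collection_method: str           # e.g. "SDK telemetry", "panel survey", "web scrape"
    disclosure_to_subjects: str      # what data subjects were actually told
    lawful_basis_claimed: str        # vendor's claimed basis, per jurisdiction
    jurisdictions_covered: str       # where the data subjects reside
    downstream_resale_allowed: str   # whether onward sale was disclosed to subjects

def onboarding_gaps(record: ProvenanceRecord) -> list[str]:
    """Return the provenance questions the vendor has not answered."""
    return [name for name, value in asdict(record).items() if not str(value).strip()]

candidate = ProvenanceRecord(
    dataset_id="geo_footfall_q3",
    collection_method="mobile SDK location pings",
    disclosure_to_subjects="",        # vendor could not produce the in-app notice
    lawful_basis_claimed="consent",
    jurisdictions_covered="EU, US",
    downstream_resale_allowed="",
)

gaps = onboarding_gaps(candidate)
if gaps:
    print("Do not onboard; unanswered provenance questions:", gaps)
```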
Step Two: Establish a Data Ethics Review Process
Leading financial institutions have established data ethics committees — internal bodies that review new data sources, research methodologies, and data use cases against both legal compliance and ethical standards before deployment. Goldman Sachs, JPMorgan, and several large quant hedge funds have formalised this process. The existence of a documented review process is also itself a meaningful defence in regulatory proceedings — demonstrating that your firm takes these obligations seriously and has governance structures to back that up.
Step Three: Embrace Privacy-Enhancing Technologies
Privacy-enhancing technologies (PETs) — including differential privacy, federated learning, and homomorphic encryption — allow researchers to extract meaningful signal from datasets without accessing individual-level data directly. These technologies are not hypothetical. They are in production at major financial institutions today. The research paper by Khaire et al. (2023) specifically highlights differential privacy and homomorphic encryption as priority technical solutions for the data privacy challenge in financial services.
I know what you are thinking. “This sounds expensive and complicated.” It is less expensive than a €1.2 billion fine. I did the maths.
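The entry point is not even that complicated. Here is a minimal sketch of the Laplace mechanism from differential privacy, applied to a count query; the epsilon value and the data are invented for illustration, and a production deployment would wrap this in proper privacy-budget accounting, but the core idea fits in a dozen lines.

```python
import numpy as np

rng = np.random.default_rng(42)

def dp_count(values: np.ndarray, predicate, epsilon: float) -> float:
    """Differentially private count: true count plus Laplace noise scaled to sensitivity/epsilon."""
    true_count = float(np.sum(predicate(values)))
    sensitivity = 1.0  # adding or removing one person changes a count by at most 1
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Hypothetical panel: individual transaction amounts we never want to expose directly.
amounts = np.array([12.0, 240.0, 88.0, 430.0, 67.0, 1900.0, 55.0, 310.0])

# "How many panellists spent over 300?" answered with a privacy budget of epsilon = 0.5.
print(dp_count(amounts, lambda x: x > 300, epsilon=0.5))
```

The researcher gets a usable aggregate signal; no individual record is ever released, and the noise level is an explicit, documentable privacy parameter rather than a vague promise.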
Step Four: Train Your People
Data ethics is not a legal department issue. It is a culture issue. Every analyst, every quant, every data scientist, every trader who touches external data needs to understand the basic principles of lawful data use, data minimisation, and what to do when they encounter a data source that sets off their internal ethical alarm bells. The answer to that last one, by the way, is: escalate before you use it, not after.
The 2024 study by Okorie et al. specifically flags the failure of organisations to adequately communicate data ethics expectations to staff as one of the primary reasons for compliance failures — not malice, not greed, but simple organisational ignorance that could have been remedied with training.
Step Five: Document Everything
If it is not written down, it did not happen. In the context of data ethics and regulatory compliance, documentation is your best friend. Document your lawful basis for each dataset. Document your data minimisation decisions. Document your purpose limitation analysis. Document your due diligence on data vendors. Document your data ethics committee reviews.
I once had a compliance officer tell me: “The firm that documents their ethical decision-making process but occasionally makes a mistake is recoverable. The firm that makes the right decision but never wrote it down is indistinguishable from the firm that never thought about it at all.” That has stayed with me.
Part Seven: The Business Case for Being the Good Guy
Let me address the elephant in the room. Some of you are reading this thinking: “All of this ethical stuff sounds great in theory. But the market rewards alpha, not virtue.”
Fair. Let us talk about the business case.
Trust is a Competitive Moat
Institutional investors — pension funds, endowments, sovereign wealth funds — are increasingly incorporating data governance and ESG considerations into their manager selection criteria. A hedge fund that can demonstrate robust, documented ethical data practices is differentiating itself in a crowded market. A fund that cannot is creating a liability that will eventually surface in a due diligence questionnaire.
Regulatory Risk is Priced Wrong
The probability of a significant GDPR or SEC enforcement action is systematically underweighted by most trading firms in their risk models. This is partly because major enforcement actions against trading firms specifically for data practices are still relatively recent. But the trend is unmistakable. Firms that are pricing this risk correctly are building ethical data infrastructure now, while it is still relatively cheap to do so, rather than retrofitting after enforcement becomes unavoidable.
Better Data From Ethical Sources
There is a counterintuitive finding embedded in the alternative data literature: datasets collected under informed consent frameworks, with full transparency, tend to be of higher quality and longer shelf life than scraped or covertly gathered equivalents. Why? Because when people knowingly participate in data collection, the data is more accurate, more complete, and less prone to systematic biases introduced by collection methodology artefacts. Clean inputs produce cleaner models. Ethical data collection is not just morally correct — it frequently produces better research.
The Adnan et al. (2023) International Journal of Financial Studies paper specifically recommends transparent data collection and opt-out mechanisms as best practices not just for regulatory compliance but for data quality maintenance — noting that firms implementing these practices reported higher customer trust and data reliability scores.
Part Eight: The Future of Ethical Research — Where Are We Going?
Synthetic Data: The Privacy-Preserving Research Revolution
One of the most exciting developments in financial data science is the emergence of synthetic data — datasets generated by machine learning models trained on real data, which preserve the statistical properties of the original but do not contain actual individual records. Several major financial institutions, including JPMorgan and Barclays, have active synthetic data research programmes.
The privacy-first implications are significant. If you can train a model on real data and then conduct all your research on synthetic equivalents, you sidestep many of the GDPR constraints around data processing and retention entirely. You also eliminate a significant class of data security risk — a synthetic dataset breach is inherently far less damaging than a real one.
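To build intuition, here is a deliberately toy sketch (nothing like the generative models the institutions above actually use): fit the joint statistics of a sensitive table, then sample new records that preserve the correlation structure without corresponding to any real individual.

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-in for a real, sensitive table: income vs. monthly spend for a small panel.
real = rng.multivariate_normal(mean=[55_000, 2_400],
                               cov=[[9e7, 1.1e6], [1.1e6, 4e4]],
                               size=500)

# Fit the statistics of the real data...
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# ...and sample synthetic records: same correlation structure, no actual individuals.
synthetic = rng.multivariate_normal(mean=mu, cov=cov, size=500)

print("real corr:     ", np.corrcoef(real, rowvar=False)[0, 1].round(3))
print("synthetic corr:", np.corrcoef(synthetic, rowvar=False)[0, 1].round(3))
```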
I want to be clear: synthetic data is not a silver bullet. The models that generate it can encode biases from the original data. Regulatory guidance on synthetic data is still evolving. But the direction of travel is clear, and traders and researchers who familiarise themselves with synthetic data methodologies now will have a significant advantage as the regulatory landscape tightens further.
The Consent Economy
There is a growing movement — accelerating significantly since 2023 — toward what researchers are calling the “consent economy”: data markets in which individuals are compensated or given meaningful control over the use of their personal data. Several startups are building infrastructure for this, including consent management platforms that allow individuals to selectively grant access to their data in exchange for value.
For financial researchers, the consent economy represents a potential long-term solution to the data provenance problem. If the data in your research pipeline was provided by consenting individuals who were clearly informed and appropriately compensated, the ethical and legal analysis becomes dramatically simpler. We are not fully there yet. But firms that are paying attention to this shift will not be caught flat-footed when it arrives.
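Mechanically, the research side of that future looks something like the sketch below, with a hypothetical record layout and no real consent-platform API implied: every record carries the scopes its owner granted, and queries filter on scope and expiry before anything else happens.

```python
from datetime import date

# Hypothetical consent-tagged records exported from a consent-management platform.
records = [
    {"subject": "s1", "spend": 320.0, "scopes": {"aggregate_research"},
     "consent_expires": date(2026, 6, 30)},
    {"subject": "s2", "spend": 145.0, "scopes": {"aggregate_research", "model_training"},
     "consent_expires": date(2025, 1, 31)},
    {"subject": "s3", "spend": 980.0, "scopes": set(), "consent_expires": None},
]

def usable(records, required_scope: str, on: date):
    """Keep only records whose owners granted the scope and whose consent is still live."""
    return [r for r in records
            if required_scope in r["scopes"]
            and r["consent_expires"] is not None
            and r["consent_expires"] >= on]

eligible = usable(records, "model_training", on=date(2024, 12, 1))
print(f"{len(eligible)} of {len(records)} records usable for model training")
```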
Conclusion: Research Without Stalking Is Not Just the Right Thing to Do — It Is the Only Viable Long-Term Strategy
Let me bring this home.
I have been in financial markets long enough to have watched several generations of edge disappear. Pit trading edge. Statistical arbitrage edge. High-frequency co-location edge. Data quality edge. Each time, the market adapted, regulations evolved, and the old playbooks got rewritten. The same thing is happening with data collection right now, and it is happening faster than most people in this industry are comfortable admitting.
The traders, researchers, and institutions that will thrive in the next decade are the ones who build their informational advantages on foundations that are legally sound, ethically defensible, and operationally sustainable. Not because they are particularly virtuous — although virtue is genuinely underrated in financial services — but because the alternative is increasingly untenable.
The GDPR exists. The EU AI Act exists. The SEC’s alternative data scrutiny exists. State privacy laws multiply every year. Institutional investors’ ESG due diligence processes exist. The reputational cost of being named in a data misuse enforcement action exists. These are not hypothetical future risks. They are present-day realities that are getting more consequential every quarter.
Research without stalking is not a constraint on great trading. It is the condition for great trading in the world that we now actually live in. The information edge is real. The competitive advantage of high-quality data is real. The ability to outperform through better research is real. None of that goes away in the privacy-first era. It just gets built differently — more carefully, more transparently, more accountably, and ultimately more durably.
I once asked a very senior compliance officer at a very large asset manager what she thought was the single biggest mistake firms made in this space. She looked at me for a long moment and then said: “They wait until they have a problem to start thinking about ethics. By then, the ethics problem has already become a legal problem. And legal problems are much more expensive.”
She was right. She is always right. That is exactly why she gets paid what she gets paid.
Build your research infrastructure the right way, every single time. Treat data subjects like actual people, not like signal sources. Document your decisions. Train your teams. Know your supply chain. Embrace privacy-enhancing technologies. And when you find a dataset that feels too good to be true — ask yourself whether the people whose data built that edge ever knew they were contributing to someone else’s P&L.
In the privacy-first era, the traders who ask that question — and act on the answer — are the ones who will still be in business to ask it ten years from now.
References
- Mirishli, S. (2025). Ethical Implications of AI in Data Collection: Balancing Innovation with Privacy. AI Data Chronicles. DOI: 10.36719/2706-6185/38/40-55
- Okorie, C., Udeh, C., Adaga, E., DaraOjimba, D., & Oriekhoe, P. (2024). Ethical Considerations in Data Collection and Analysis: A Review. International Journal of Applied Research in Social Sciences, 6(1), 1–22. https://www.researchgate.net/publication/378789304
- Adnan, M., Sarwar, A., Hasan, M., & others. (2023). Building Trust in Fintech: An Analysis of Ethical and Privacy Considerations in the Intersection of Big Data, AI, and Customer Trust. International Journal of Financial Studies, 11(3), 90. DOI: 10.3390/ijfs11030090
- Gu, J. (2026). Data Trade and Consumer Privacy. Working Paper, arXiv:2406.12457. https://arxiv.org/abs/2406.12457
- Khaire, M., Mulder, G., Anderson, R., & others. (2023). Big Data Privacy in Emerging Market Fintech and Financial Services: A Research Agenda. arXiv:2310.04970. https://arxiv.org/abs/2310.04970
- Toth, A., Duffy, G., & McCarthy, S. (2024). Smoke Screens and Scapegoats: The Reality of GDPR Compliance — Privacy and Ethics in the Case of Replika AI. arXiv:2411.04490. https://arxiv.org/abs/2411.04490
- Jovanovic, B., & Rousseau, P. (2023). Digital Privacy Perceptions: An Empirical Study using the GDPR Framework. arXiv:2411.12223. https://arxiv.org/abs/2411.12223
- European Data Protection Board. (2026). Guidelines on Processing of Personal Data for Scientific Research. https://www.ropesgray.com/en/insights/alerts/2026/04/the-european-data-protection-board-releases-new-guidelines-on-the-processing-of-personal-data
Disclaimer: This article is for educational and informational purposes only and does not constitute financial advice. Trading financial instruments carries significant risk of loss. Always conduct your own due diligence and consult a qualified financial professional before making investment decisions.
