AI builders need fresh, high-quality data:
- Research labs need more data to develop state-of-the-art models and agents.
- Software providers and enterprises need fresh data to fine-tune models and agents.
However, data collection comes with risks. For example, enterprises need to avoid unethical data collection practices and unethically sourced data to minimize reputational risk. Dive in for a comprehensive guide on AI data collection that helps business leaders and developers navigate its challenges:
What are the data sources for AI training or inference?
Web
No state-of-the-art LLM or large multimodal model (LMM) is built without web data. The web includes almost all public data and a significant volume of private data. AI models need web data for:
- Training, including fine-tuning: High-performing models almost always require fresh web data.
- Inference: For retrieval-augmented generation (RAG) and other inference-time compute operations, web data is necessary to feed fresh facts to models (see the sketch after the example below).
Real-life example: While most LLMs do not share their data sources, the first version of LLaMA disclosed its training data, which relied entirely on web data.1
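As a minimal illustration of the inference-time role of web data, the Python sketch below wires a hypothetical retriever into a prompt. Both `search_web` and `llm` are placeholders, not real APIs:

```python
# Minimal RAG sketch. `search_web` and `llm` are hypothetical placeholders;
# a production system would call a search provider and a model endpoint here.
def search_web(query: str, k: int = 3) -> list[str]:
    # Placeholder retriever: a real system would query a web search API.
    return [f"snippet {i} about {query}" for i in range(k)]

def llm(prompt: str) -> str:
    # Placeholder model call: a real system would call an LLM API.
    return "an answer grounded in the retrieved snippets"

def answer_with_rag(question: str) -> str:
    context = "\n".join(search_web(question))  # fetch fresh web facts
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)                         # model answers from the context

print(answer_with_rag("What changed in EU data regulation this year?"))
```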
Private or licensed data
Enterprises are the largest owners of private data, and unlocking it for training can drive further improvements in large language models. Since this data was generated or gathered in the past, it can be pre-packaged and bought off-the-shelf quickly.
Real-life examples:
- OpenAI forged more than 30 media partnerships to fuel its models.2
- Anthropic does not have a content partnership with Reddit and was therefore sued for its use of Reddit’s data.3
Crowdsourced data
Suppose an image recognition system requires image data of road signs. Through public crowdsourcing, its developers can obtain these images by giving instructions to a network of contributors and creating a data-sharing platform.
Working with a third-party crowdsourcing platform or service provider can make this method more cost-effective and improve data quality. However, it cannot be used for projects involving sensitive or confidential data.
Real-world example: Reinforcement learning from human feedback (RLHF) was the major technical improvement between GPT-3 and ChatGPT. RLHF involves collecting human ratings of AI models’ responses. These ratings are then used as training data to align the model’s responses with human preferences.
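To illustrate how such ratings become training data, here is a toy PyTorch sketch of the pairwise preference loss commonly used in reward modeling. The bag-of-words scorer and random token ids are illustrative stand-ins, not any lab’s actual method:

```python
# Toy sketch of the pairwise preference loss behind RLHF reward modeling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    def __init__(self, vocab_size=1000, dim=32):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # crude text encoder
        self.score = nn.Linear(dim, 1)                 # scalar reward head

    def forward(self, token_ids):
        return self.score(self.embed(token_ids)).squeeze(-1)

model = TinyRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One toy batch: raters preferred each `chosen` response over `rejected`.
chosen = torch.randint(0, 1000, (4, 16))
rejected = torch.randint(0, 1000, (4, 16))

# Bradley-Terry style loss: push reward(chosen) above reward(rejected).
loss = -F.logsigmoid(model(chosen) - model(rejected)).mean()
loss.backward()
opt.step()
print(f"preference loss: {loss.item():.3f}")
```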
Synthetic data
Synthetic data is artificially generated data that mimics real-world data. It is a form of automated data generation that can leverage both traditional AI and generative AI to augment existing AI training datasets or create new ones.
Synthetic data is especially useful when labeled data is limited, as it can improve a model’s accuracy and generalization capabilities (see the sketch below).
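As a toy illustration of the idea, the sketch below fits a Gaussian to a small stand-in "real" dataset and samples fresh synthetic rows from it. Production systems typically rely on dedicated generative models or synthesis tools instead:

```python
# Toy sketch: fit a Gaussian to "real" data, sample synthetic rows from it.
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(loc=[50.0, 3.2], scale=[12.0, 0.8], size=(200, 2))  # stand-in dataset

mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mean, cov, size=1000)  # new artificial rows

print(synthetic[:3])  # mimics the real feature distribution, contains no real record
```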
Is your AI data collection ethical and compliant?
If you have collected web data for AI, you can review our ethical web data research to learn more. For other data sources, there are 3 dimensions to ethical data:
Legal data collection
While collecting public data is legal in most cases, there are exceptions.
Collecting private data can be illegal for many reasons. For example, in most jurisdictions, it is illegal to collect private personally identifiable information (PII) without consent.
Real-life illegal private data collection example:
Cambridge Analytica collected private data belonging to 87 million Facebook users in deceptive ways. As a result:
- Meta was fined $5 billion in the US.
- The public outcry led Cambridge Analytica to declare bankruptcy.
Collecting public data that includes copyrighted material can also be illegal.
Real-life public data collection controversy: Meta collected copyrighted books from the LibGen file-sharing project.4 This led to ongoing litigation by the authors.
Ethical data collection
Legal data collection can be unethical.
Real-life ethical data collection controversy:
Location data brokers track and sell users’ locations via apps that legitimately gain access to users’ data after collecting consent. However, the collected data can be used for illicit actions like financial scams. Therefore, regulators have stepped in to regulate location data sales.5
Data supply chain
Whether you collect web data yourself or use web data collected by others, unethical data collection can harm your business. In some jurisdictions, like Germany, businesses are responsible for their suppliers’ legal conduct. And regardless of jurisdiction, enterprises have suffered reputational damage due to their suppliers’ conduct.
Real-life example of an enterprise’s attempt to reduce reputational damage from its data supplier:
Businesses are aware of the reputational damage their data supply chains can create. When one of Equifax’s data suppliers experienced a data leak, Equifax sent the supplier a cease-and-desist letter to have references to Equifax removed from the supplier’s website.6
Your business’ risk
If your business relies on web data, it may be at risk.
Even if your business does not collect web data in an illegal or unethical manner, it is still at risk: your data providers’ unethical or illegal actions can damage your business’ reputation.
If your business relies on web data, it is most likely working with a data collection service provider, since these companies’ services are indispensable for any large-scale data collection.
Your risk of exposure to unethical or noncompliant data collection is high: most data collection operations in our ethical web data benchmark fell short of the transparency necessary to ensure that their services operate legally and ethically.
What is the cost of unethical or noncompliant AI data collection?
Commercial risk
Data collection challenges can limit an AI model’s revenue through 2 mechanisms:
Performance issues
AI models that fail to deliver results due to data issues are unlikely to gain market share.
Compliance issues
Enterprises review AI models rigorously. For example, data provenance audits cover the entire data collection supply chain. Unethical data collection practices or a lack of indemnification can limit enterprise adoption.
Real-life concerns about indemnification: Indemnification shields business users from specific legal risks like copyright infringement. Through private discussions with enterprises, the AIMultiple team identified enterprises avoiding Meta’s Llama since it lacked indemnification.
OpenAI forced to retain user logs: OpenAI has been forced to contradict its existing privacy policy and retain user logs as part of the New York Times lawsuit over its use of public but copyrighted material in model training.7 Though enterprise users were excluded, this brings uncertainty to OpenAI’s commitments regarding user privacy.
Legal risk
The level of legal risk depends on how and where the solution is used:
Internal use
This is the most limited case, as the only sources of litigation are employees or third parties harmed by the model.
Real-world job applicant lawsuit against AI hiring tool:
iTutorGroup settled with the Equal Employment Opportunity Commission (EEOC), agreeing to pay $365,000 to applicants who had been denied an interview because of their age. The company also agreed to improve its anti-discrimination and complaint procedures.
The situation was discovered when an applicant submitted two otherwise identical applications, one of which had a later birth date.8
Use by customers
Customers can sue model developers for lack of performance.
User’s family sued model developer: Character.AI chatbots have been accused of encouraging self-harm. A teenager died by suicide after conversing with one of Character.AI’s chatbots.9
Use by customers that have been provided indemnification
This results in the maximum level of legal liability, as your company’s risk includes the risk of all your clients’ work completed using your models.
Indemnification provided by hyperscalers: Google and Microsoft provide indemnification for their models, and enterprises expect this indemnification.10
Categories of legal risk
Legal risk takes many forms. It could be due to:
- Contractual liabilities, which can cover guarantees of lawful data collection.
- IP risks due to the use of copyrighted or private datasets in training data.
- Product liability due to performance issues (e.g., model bias) caused by unethical data collection.
- Regulatory compliance, especially compliance with data protection regulations like CCPA or GDPR.
Regulatory compliance depends on the jurisdiction:
International differences in regulatory approaches
Globally, data privacy regulations vary significantly, showcasing different approaches to individual rights and data handling.11
Notable examples from the world’s largest economies:
The EU General Data Protection Regulation (GDPR) is a comprehensive regulation emphasizing strong individual rights, broad definitions of personal data, and strict consent requirements. As an early and comprehensive law, it shaped the laws that came after it.
The California Consumer Privacy Act (CCPA) has a narrower scope than GDPR and a less centralized enforcement approach. It provides significant privacy rights to California residents, including the rights to know, to opt out of sale, and to deletion.
The Personal Information Protection Law (PIPL) of China is similar to GDPR. PIPL mandates informed consent for personal data processing, with separate explicit consent required for sensitive data, overseas transfers, and public disclosure. Comprehensive privacy notices detailing processing are also mandatory.
Reputational risk
AI companies frequently face backlash over unethical data practices. So far, boycotts haven’t reached enough scale to slow most companies’ growth, since the market has been expanding quickly. However, reputational risk may matter more as the market matures: brand damage can make the difference between success and failure as critical capabilities become commoditized and product differentiation diminishes.
Stability AI’s use of copyrighted material led senior leaders to leave the company.12 Underpaid knowledge workers in Kenya, working for a supplier of OpenAI, were exposed to disturbing material.13
Other risks
Operational risks, talent churn, and consumer boycotts are all possible due to challenges in data collection.
For example, if an AI research lab needs to remove a certain dataset from its training data, it would need to:
- Retire all models that leveraged that dataset
- Create a new data pipeline to replace the removed dataset
- Retrain its models, which can cost up to hundreds of millions of dollars14
Checklist for AI training data
AI training data needs to fulfill these requirements to minimize risk and create an AI product that is ready for enterprise adoption:
Legal
Data in these categories is legal to use:
- Licensed
- In the public domain
- Owned by parties that allow reuse without attribution in AI model training
In all cases, data needs to be used according to its licensing terms. Since models can leak their training data, data owners are scouring generative AI models for their copyrighted content and initiating legal action when they find it (see the probe sketch below).
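As a rough illustration of how such leakage is probed, the sketch below asks an open model to greedily complete a well-known prefix via the Hugging Face transformers library. The model choice and probe text are arbitrary examples, not a rigorous audit:

```python
# Rough sketch: probe an open model for verbatim memorization by greedily
# completing a known prefix. Model and probe text are arbitrary examples.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prefix = "It was the best of times, it was the worst of times,"
inputs = tok(prefix, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20, do_sample=False)

# If the greedy continuation reproduces the source verbatim, the passage
# was likely memorized during training.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:]))
```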
Ethical
Ethical data collection goes beyond legal data collection, ensuring that the data collection does not harm data owners or other stakeholders.
We identified guidelines for ethical web data which should be followed while using web data for AI.
High quality
For example, LLM hallucinations and AI bias are major impediments to AI adoption. Training data quality is key to reducing hallucinations.
Another component of quality is data freshness. Models trained on outdated data can misinform users.
Real-world example of garbage in, garbage out in LLMs: At its launch, Google’s AI Overviews recommended using glue to bind cheese to pizza, based on an old comment on Reddit.
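A minimal sketch of this kind of data hygiene could look like the following; the field names, thresholds, and cutoff date are assumptions for illustration:

```python
# Minimal data-hygiene sketch: drop duplicates, very short documents, and
# stale records. Field names, thresholds, and cutoff date are assumptions.
from datetime import date

def filter_corpus(docs, min_words=20, cutoff=date(2023, 1, 1)):
    seen, kept = set(), []
    for doc in docs:
        text = doc["text"].strip()
        if text in seen:                   # exact-duplicate removal
            continue
        if len(text.split()) < min_words:  # too short to be informative
            continue
        if doc["collected"] < cutoff:      # stale data can misinform users
            continue
        seen.add(text)
        kept.append(doc)
    return kept

docs = [
    {"text": "old joke comment " * 10, "collected": date(2013, 6, 1)},
    {"text": "recent well-sourced article " * 10, "collected": date(2024, 5, 1)},
]
print(len(filter_corpus(docs)))  # -> 1; the stale comment is dropped
```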
Secure
Security to protect competitive advantage
While aspects like the user interface play a critical role in model adoption, data, compute, and algorithms are the three core ingredients of an AI model. Therefore, data is a source of competitive advantage for AI model builders, and it needs to be secured for that advantage to be retained.
Security to prevent reputational or financial harm
A security issue could:
- Expose licensed data and harm data owners.
- Shut down or compromise services and harm customers.
Therefore, AI model builders need to invest in AI and LLM cybersecurity to protect their data.
6 steps to AI data collection
1. Identifying the need
This is the most crucial step in the data collection process. Without a clear focus, data collection efforts are hard to scope and finalize.
2. Selecting the method
Select the collection method that is most suitable for your project. For example, if conversation data between a patient and a doctor is necessary, AI model builders can contact:
- Hospitals and other healthcare providers to license their data
- Crowdsourcing companies for simulated conversations
3. Quality assurance
The cost of fixing data quality issues increases as the system matures. Therefore, in production systems, quality assurance should be prioritized during data collection (see the sketch below).
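For instance, a minimal validation pass over incoming records might look like the sketch below; the record schema and license categories are assumptions for illustration:

```python
# Minimal sketch: validate records as they are collected so defects are
# caught before they enter production pipelines. The schema is an assumption.
def validate_record(record):
    errors = []
    if not record.get("text"):
        errors.append("missing text")
    if not record.get("source"):
        errors.append("missing provenance")  # needed for later audits
    if record.get("license") not in {"licensed", "public_domain", "permissive"}:
        errors.append("unknown license")     # ties back to the legal checklist
    return errors

record = {"text": "sample", "source": "example.com", "license": "licensed"}
assert validate_record(record) == []  # a clean record passes
print(validate_record({"text": ""}))  # -> all three checks fail
```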
4. Storage
A sound storage plan is essential regardless of your chosen method for collecting data. Consider privacy concerns, storage capacity, frequency of access, post-storage data management, etc.
5. Data annotation
Data annotation refers to the process of assigning labels to data to make it understandable for machines. It is an indispensable part of supervised learning. Although this step doesn’t entail collecting the data itself, it plays a crucial role in preparing the dataset for its ultimate application.
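One common quality check for annotation is inter-annotator agreement. As a toy illustration, the sketch below computes Cohen’s kappa on made-up labels using scikit-learn:

```python
# Toy sketch: inter-annotator agreement via Cohen's kappa with scikit-learn.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["spam", "ham", "spam", "ham", "spam", "spam"]
annotator_b = ["spam", "ham", "ham",  "ham", "spam", "spam"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # ~1 = full agreement, ~0 = chance level
```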
6. Verify
Check collected data against the checklist for AI training data above, using a data review that includes an audit of data pipelines.
FAQ about AI data collection
What is AI data collection?
AI data collection, also known as data harvesting, is the process of gathering data from various sources, such as websites, online surveys, user feedback forms, customers’ social media posts, and ready-made datasets, to be used in training and improving AI and machine learning (ML) models.
This process is foundational to creating AI systems, as the performance of these models heavily depends on the accuracy of the data they are trained on.
External Links
- 1. https://arxiv.org/pdf/2302.13971
- 2. https://www.informationweek.com/machine-learning-ai/is-openai-quietly-building-a-media-content-empire-
- 3. https://www.wsj.com/tech/ai/reddit-lawsuit-anthropic-ai-3b9624dd
- 4. https://storage.courtlistener.com/recap/gov.uscourts.cand.415175/gov.uscourts.cand.415175.373.0.pdf
- 5. Federal Regulators Limit Location Brokers from Selling Your Whereabouts: 2024 in Review | Electronic Frontier Foundation
- 6. Here's What It's Like to Accidentally Expose the Data of 230M People | WIRED
- 7. https://openai.com/index/response-to-nyt-data-demands/
- 8. EEOC Settles First AI-Discrimination Lawsuit | Sullivan & Cromwell LLP
- 9. Florida mom sues Character.AI after her son's suicide | Axios
- 10. Protecting customers with generative AI indemnification | Google Cloud Blog
- 11. https://www.dlapiperdataprotection.com/
- 12. My resignation from Stability AI | Ed Newton-Rex
- 13. OpenAI Used Kenyan Workers on Less Than $2 Per Hour: Exclusive | TIME
- 14. https://x.com/i/broadcasts/1gqGvjeBljOGB