From the Library of Alexandria to the first digital databases, the quest to organize and utilize information has been a reflection of human progress. As the volume of global data soars, from 2 zettabytes in 2010 to an anticipated 181 zettabytes by the end of 2024, we stand on the verge of a new era in data interaction, and it’s being led by generative AI (GenAI).
Vast repositories of information, once gated by the need for specialized knowledge, are now open to anyone, anywhere, ushering in unprecedented access and more natural interaction with the data that shapes our most important decisions.
As GenAI changes the way we perceive data, it also challenges us to rethink how we manage, protect, and benefit from this invaluable resource.
Why Data Interaction Matters
Data storage systems typically involve static databases or data warehouses where information is stored in structured formats. To retrieve information from these systems, users require specific knowledge of database querying languages such as SQL. These systems are passive, meaning they do not actively assist users in discovering or interpreting data. Whoever is looking for the data must know exactly what they are looking for and how to ask for it correctly.
By transforming data repositories from static entities into dynamic systems, GenAI allows users to “have a conversation” with the data. This switch from passive storage to active interaction changes how data can be used. It becomes a responsive asset with which anyone can engage, no specific skills required.
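The contrast can be sketched in code. Traditionally, a user must know both the schema and SQL; a conversational layer accepts the question in plain language and generates the query behind the scenes. The toy translator below stands in for a real language model, and the table and question are purely illustrative:

```python
import sqlite3

# Traditional access: the user must know the schema and write SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (product TEXT, issue TEXT)")
conn.executemany("INSERT INTO tickets VALUES (?, ?)",
                 [("widget", "late delivery"), ("widget", "late delivery"),
                  ("widget", "broken part"), ("gadget", "late delivery")])

sql_result = conn.execute(
    "SELECT issue, COUNT(*) FROM tickets WHERE product = 'widget' "
    "GROUP BY issue ORDER BY COUNT(*) DESC").fetchall()

# Conversational access: a GenAI layer would translate the question into
# the query above. This toy keyword matcher stands in for the model.
def ask(question: str):
    if "widget" in question and "issues" in question:
        return sql_result  # a real system would generate the SQL itself
    raise ValueError("question not understood")

answer = ask("What are the common issues customers faced with the widget?")
```

The point of the sketch is the interface, not the matching logic: the same result becomes reachable through a question rather than a query.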
Here are two hypothetical examples of how GenAI changes the rules of the game when it comes to data.
Customer Service in Retail
A customer service manager at an online retail company uses GenAI to query customer interaction data by simply asking, “What are the common issues customers faced with our new product launched last month?” The system analyzes customer feedback, reviews, and support tickets to provide a comprehensive report, identifying major thematic issues and even suggesting possible solutions based on commonalities found in the data.
Healthcare Monitoring
In a hospital, a healthcare provider queries a GenAI-enhanced database about a patient’s history by asking: “Show me the treatment history of patients with conditions similar to John Doe’s.” The system instantly gathers data from thousands of patient records to highlight effective treatments, noting any complications or successes.
Increased Complexity in Managing Data Accuracy
Traditional data systems are designed around predefined queries, which allow for specific data cleaning processes tailored to those queries. This means that data admins can anticipate and address common errors or inconsistencies before they affect output. But GenAI operates across an undefined range of inquiries. Pre-cleaning data for every potential question is impractical, if not impossible, because the nature of many questions remains unknown until they are asked.
With GenAI, the scope of potential queries is vast and continually expanding. As users start to realize they can ask any question, data once considered peripheral may suddenly be thrust into the spotlight, as it’s required for new, unforeseen queries. This necessitates on-the-fly verification and validation of data—challenging tasks that are also resource-intensive.
GenAI systems require robust, adaptive data governance mechanisms to ensure reliable outputs. Real-time data governance systems make use of sophisticated algorithms to assess the quality and integrity of data as soon as it is queried. By applying necessary corrections and context automatically, such governance mechanisms ensure that the outputs remain reliable and accurate, regardless of the broad flexibility in user querying.
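A minimal sketch of what query-time governance might look like: each record is validated and corrected the moment it is pulled into an answer, rather than pre-cleaned for anticipated queries. The field names, rules, and corrections here are all illustrative, not a reference to any particular governance product:

```python
# Query-time governance sketch: (field, validity test, correction) rules
# are applied as records are retrieved, with an audit trail attached.
RULES = [
    ("email", lambda v: "@" in v,       lambda v: None),       # drop invalid emails
    ("age",   lambda v: 0 <= v <= 120,  lambda v: None),       # null impossible ages
    ("name",  lambda v: v == v.strip(), lambda v: v.strip()),  # trim stray whitespace
]

def govern(record: dict) -> dict:
    """Return a corrected copy of the record, noting which fields were fixed."""
    out, audit = dict(record), []
    for field, test, fix in RULES:
        if field in out and not test(out[field]):
            out[field] = fix(out[field])
            audit.append(field)
    out["_corrected"] = audit  # context passed along to downstream consumers
    return out

clean = govern({"name": " Ada ", "email": "not-an-email", "age": 130})
```

Attaching the audit trail to the record itself is one way to give downstream consumers the context the article describes, so a corrected value is never mistaken for an original one.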
Faster Data Consumption by GenAI
Unlike traditional data processing tools, which typically scan data without regard for underlying context or nuance, language models interpret data more like a human would, but with the capacity to process information at a scale and speed unattainable by human analysts. For example, a PC powered by a GeForce RTX 4080 GPU running a quantized 7B model can process data at an impressive rate of 78.1 tokens per second, or about 3,480 words per minute. The average human, by comparison, processes about 120-150 words per minute.
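The words-per-minute figure is a back-of-the-envelope conversion. Assuming roughly 0.75 words per token, a common rule of thumb for English text rather than a property of any specific model, the arithmetic lands in the same ballpark as the figure above:

```python
# Converting token throughput to a rough words-per-minute figure.
tokens_per_second = 78.1
words_per_token = 0.75   # assumption: rule-of-thumb ratio for English text

words_per_minute = tokens_per_second * 60 * words_per_token  # ~3,500
human_wpm = 150                                              # upper end of the human range
speedup = words_per_minute / human_wpm                       # >20x a human reader
```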
This remarkable processing power enables GenAI to perform complex analytical tasks across large datasets faster than ever. More than just number crunchers, these models excel in understanding and contextualizing the information they process. This is particularly evident in their ability to handle multimodal data—from textual content and images to audio and video—creating a holistic understanding of the data instead of the fragmented view that is offered by conventional tools.
Challenges of Managing Unstructured Data
Unstructured data poses unique challenges for GenAI systems due to its inherent variability and lack of predefined schemas.
Data Consistency Issues
With data coming from various sources, in all shapes and sizes, ensuring consistency across datasets becomes challenging. GenAI must be able to interpret and align disparate data forms—such as different formats of dates, text slang, and image resolutions.
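One small, concrete instance of this alignment problem is the same date arriving in several formats from different sources. A sketch of how a pipeline might normalize them, with the format list as an assumption that would grow with the sources:

```python
from datetime import datetime

# Formats observed across sources; extend this list as new sources appear.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"]

def normalize_date(raw: str) -> str:
    """Parse a date written in any known format into ISO 8601."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # not this format; try the next one
    raise ValueError(f"unrecognized date format: {raw!r}")

dates = [normalize_date(d) for d in ["2024-05-01", "01/05/2024", "May 1, 2024"]]
```

Slang normalization and image-resolution alignment follow the same pattern at much larger scale: map many source representations onto one canonical form before the model consumes them.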
The Problem of Data Volume vs. Data Quality
As data volume increases, maintaining high data quality becomes more important, and more difficult. Higher volumes often lead to diluted quality unless rigorous processes are in place to clean and validate data continuously. GenAI requires high-quality data to produce reliable and accurate outputs, making the management of these volumes a tough balancing act.
Integrating Diverse Data Types into Cohesive Datasets
Each data type—text, images, audio, and video—requires different processing techniques. For instance, textual data might need sentiment analysis, while images require object recognition algorithms. The ultimate challenge is data fusion, where all these preprocessed data types are combined into a single coherent model that can perform holistic analysis.
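A minimal sketch of one common fusion strategy, late fusion: each modality is reduced to a fixed-length feature vector by its own encoder, and the vectors are concatenated into a single representation for joint analysis. The encoders below are trivial stand-ins for the real models (sentiment analysis, object recognition) named above:

```python
# Late-fusion sketch: per-modality encoders, then simple concatenation.
def encode_text(text: str) -> list[float]:
    # stand-in for a sentiment/embedding model: word count and crude polarity
    positive = sum(w in {"good", "great"} for w in text.lower().split())
    return [float(len(text.split())), float(positive)]

def encode_image(pixels: list[int]) -> list[float]:
    # stand-in for a vision backbone: mean brightness of the pixels
    return [sum(pixels) / len(pixels)]

def fuse(text: str, pixels: list[int]) -> list[float]:
    return encode_text(text) + encode_image(pixels)  # one joint feature vector

vec = fuse("great product photo", [200, 180, 220])
```

In a production system the stand-ins would be learned encoders producing high-dimensional embeddings, but the structural idea, separate preprocessing followed by a single combined representation, is the same.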
New Security and Privacy Challenges
The introduction of the automobile led to the first car crashes. These, in turn, led to the development of traffic laws and safety belts. Similarly, GenAI brings with it a new set of challenges, calling for the development of advanced security measures tailor-made to account for its operational complexity and potential threats.
Unauthorized Data Exposure
GenAI systems need access to vast datasets, which are not always free of sensitive or confidential information. These systems are engineered to extract and synthesize information to respond dynamically to user queries. However, this feature also introduces significant risks of unauthorized data exposure.
A particularly new and concerning threat in this domain is prompt hacking, where malicious actors manipulate the queries made to the GenAI system in such a way as to “convince” it to disclose information that should remain confidential. Through carefully crafted prompts, an attacker could coax a GenAI-driven customer service chatbot into revealing personal details about another user. That is, of course, assuming the chatbot was not properly secured.
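One layer of such securing is output filtering: scanning a reply before it leaves the system for data that belongs to anyone other than the requesting user. The sketch below shows only that single layer, with made-up user records; real defenses combine it with access control, prompt hardening, and red-teaming:

```python
# Naive output-filtering guardrail: block replies that contain another
# user's personal data. USER_PII records are illustrative.
USER_PII = {
    "alice": {"email": "alice@example.com", "phone": "555-0101"},
    "bob":   {"email": "bob@example.com",  "phone": "555-0202"},
}

def filter_reply(reply: str, requesting_user: str) -> str:
    for user, pii in USER_PII.items():
        if user == requesting_user:
            continue  # a user may see their own data
        for value in pii.values():
            if value in reply:
                return "[withheld: response referenced another user's data]"
    return reply

safe = filter_reply("Your phone on file is 555-0101.", "alice")
blocked = filter_reply("Bob's email is bob@example.com.", "alice")
```

Substring matching like this is easy to evade (an attacker can ask the model to spell the address out), which is precisely why output filtering alone is not considered proper securing.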
Manipulation of AI Outputs
The integrity of outputs from GenAI systems is a priority, given that these often inform significant decisions in healthcare, finance, and other critical areas. The nature of GenAI, which learns from vast amounts of data to make inferences and generate outputs, opens up new avenues for attacks that weren’t as prevalent in traditional systems. If the data that a GenAI system learns from is tampered with, or if the model’s parameters are altered maliciously, the outputs can be skewed in ways that serve the attacker’s goals.
Developing Security Measures
Traditional encryption methods may fall short for GenAI systems, which need to compute over sensitive data in order to answer queries. Techniques such as homomorphic encryption address this by allowing computation on data that remains encrypted throughout its lifecycle in the GenAI system. Where that is not enough, procedures such as secure multi-party computation (SMPC) enable collaborative data analysis without ever revealing the underlying data to any of the parties involved.
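A toy illustration of the idea behind SMPC, using additive secret sharing: three parties learn the sum of their inputs without any single party ever seeing another party's value. Production SMPC protocols are far more involved; this shows only the core trick of splitting each value into random shares that are individually meaningless:

```python
import random

P = 2**61 - 1  # modulus; each share alone is uniformly random

def share(value: int, n_parties: int = 3) -> list[int]:
    """Split a value into additive shares modulo P."""
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)  # shares sum to value mod P
    return shares

inputs = [42, 7, 100]                    # each party's private value
all_shares = [share(v) for v in inputs]

# Party i sums the i-th share of every input -- never a raw value.
partial_sums = [sum(col) % P for col in zip(*all_shares)]
total = sum(partial_sums) % P            # equals 42 + 7 + 100, revealed jointly
```

Only the final combination step reveals anything, and what it reveals is exactly the agreed-upon result, nothing about the individual inputs.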
Real-time monitoring systems employ advanced analytics to continuously inspect the behavior and outputs of AI systems. By doing so, they can detect patterns and anomalies indicative of security breaches, unauthorized access, or unintended data disclosures. For example, an unexpected spike in data retrieval by a GenAI system could trigger an alert that prompts further investigation.
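The retrieval-spike example can be sketched with a simple statistical check: flag any reading that deviates sharply from the recent baseline. The window, threshold, and numbers below are illustrative; real monitoring systems use far richer models, but the shape of the check is the same:

```python
import statistics

def is_anomalous(history: list[int], latest: int, z_threshold: float = 3.0) -> bool:
    """Flag a reading more than z_threshold standard deviations from the mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(latest - mean) > z_threshold * stdev

baseline = [100, 110, 95, 105, 98, 102, 107, 99]  # records retrieved per minute
alert = is_anomalous(baseline, 450)    # sudden spike -> trigger investigation
normal = is_anomalous(baseline, 104)   # within the usual range -> no alert
```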
Advanced authentication protocols are essential for securing access to GenAI systems and the data they process. These protocols include biometric verification (such as fingerprint or facial recognition), which ties access to unique human characteristics; multi-factor authentication (MFA), which requires two or more verification methods, offering an additional layer of security that significantly reduces the risk of unauthorized access; and behavior-based authentication, which analyzes patterns of user behavior (such as keystroke dynamics, mouse movements, etc.), to detect anomalies that may indicate a credential compromise.
Despite the development and implementation of these security measures, concerns persist about their adequacy. In the future, we may see the rise of security measures dedicated to safeguarding the data used by GenAI systems. For now, though, at least according to Brett Kahnke and Michele Goetz from Forrester: “Data privacy and security concerns are seen as the greatest single barrier to adoption of GenAI by B2B enterprises.”
Garbage In, Garbage Out
During the training phase, GenAI models are fed large volumes of data from which they learn to recognize patterns and make decisions. If the training data is comprehensive, accurate, and well-curated, the model will more effectively perform its tasks during the inference phase—when it applies what it has learned to new data. Conversely, poor-quality data during training can lead to inaccuracies that propagate every time the model makes a decision, leading to unreliable or even misleading results.
This concept, summarized by the phrase “Garbage In, Garbage Out,” highlights a fundamental truth in computational sciences and information technology: the quality of output is determined by the quality of input.
Data Quality: GenAI vs. Traditional Data Solutions
Traditional data systems typically focus on structured data stored in fixed formats, such as databases and spreadsheets. The data types managed in these systems are generally numerical or categorical, adhering to strict schemas that dictate the data’s structure. A traditional customer database might store information in predefined fields such as name, address, phone number, and transaction history.
The primary data quality concerns in these environments involve ensuring accuracy (e.g., no typographical errors in entries), completeness (e.g., no missing values in required fields), and consistency (e.g., standardized formats across data entries).
GenAI, on the other hand, handles a much broader spectrum of data types, often referred to as multimodal data. Each type brings its own set of quality issues. Textual data may suffer from inconsistencies in formatting or language use, while image data could be compromised by poor resolution or irrelevant content. Audio and video data require checks for clarity and relevance, as background noise or unrelated visual elements could skew the model’s learning process. Therefore, data quality assessment for GenAI must be comprehensive, covering not just the accuracy and completeness typically evaluated in traditional datasets, but also factors like contextual relevance and media integrity.
Differences in Quality Assessment Approaches
Data quality tools and techniques in traditional systems usually involve using data validation rules, such as regular expression patterns for format validation and check constraints in databases. Manual auditing processes are suitable for structured data but not for the variability and nuances of the unstructured data that GenAI deals with.
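The traditional rule-based approach can be sketched in a few lines: fixed regular-expression rules per field, which work well for structured records precisely because every valid value fits a known pattern. The field names and patterns are illustrative:

```python
import re

# Fixed format-validation rules, one regex per structured field.
RULES = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "phone": re.compile(r"^\d{3}-\d{4}$"),
    "date":  re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}

def validate(record: dict) -> list[str]:
    """Return the names of fields that fail their format rule."""
    return [field for field, rx in RULES.items()
            if field in record and not rx.match(record[field])]

bad_fields = validate({"email": "user@example.com",
                       "phone": "55-0101",        # one digit short
                       "date":  "2024-05-01"})
```

The limitation the text describes falls straight out of the sketch: there is no regex for “this paragraph is off-topic” or “this image is too blurry,” which is why unstructured data needs the adaptive, model-driven checks discussed next.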
For GenAI systems, data quality tools must be intelligent and adaptable. Machine learning algorithms detect anomalies automatically and learn from continuous data interactions. Tools such as natural language processing models interpret and rectify text data, while computer vision algorithms assess and enhance image quality.
In terms of governance, traditional data systems often operate with static policies that change only through manual updates, focusing on access controls, data integrity, and compliance. However, GenAI requires a more dynamic and adaptive approach to governance. Continuous learning mechanisms are essential, allowing GenAI systems to evolve their data understanding, and adapt to new data types, or emerging business needs autonomously. This includes the ongoing monitoring of data quality and the automatic adjustment of data handling procedures based on real-time feedback.
Consequences of Unreliable Data
In traditional data systems, the impact of unreliable data is typically localized and can be corrected with manual interventions. In most cases, these corrections involve straightforward validations and adjustments, addressing the immediate effects of errors on specific business operations, such as rectifying inaccuracies in financial reports or adjusting inventory records. However, the scope of these impacts is generally confined to specific areas, allowing organizations to control and correct issues without widespread consequences.
The implications of unreliable data in GenAI systems are more extensive. Here, unreliable data does not just lead to isolated incidents of inefficiency but can systematically influence the entire model’s behavior. Since GenAI continuously learns from data, inaccuracies can propagate erroneous patterns across the system, amplifying the impact and making corrections complex and resource-intensive. This not only affects the integrity of decision-making processes (particularly in sensitive areas like healthcare or finance), but can also erode customer trust when public-facing applications deliver subpar interactions.
Future Outlook
The transformation from static databases to dynamic, conversational interfaces marks an evolution in data interaction.
Our approach to GenAI needs to include strict data quality measures guarding against the risks of misinformation and data breaches. By prioritizing the refinement of data handling and verification processes, we can ensure the reliability and efficiency of AI technologies. As GenAI continues to evolve, proactive efforts to strengthen data quality and security protocols will be key to maximizing the benefits of AI, making every interaction with data not just smoother, but also more secure.