AI Glossary

This AI Glossary is your quick reference guide to 80 common AI terms. Read on to upgrade your AI lexicon.

Artificial General Intelligence (AGI)

Artificial general intelligence refers to a form of artificial intelligence that is able to understand, learn, and solve problems at or above human levels of intelligence. In contrast to the specialized AI systems currently in use, AGI aims to exhibit broad, human-like cognitive abilities across a wide array of tasks and domains.

AI Architecture

AI architecture refers to the overall design and structure of an artificial intelligence system. This includes AI models as well as non-model components such as data ingestion and processing, REST APIs that expose the underlying AI models, and the training pipeline.

AI Explainability

AI explainability refers to the ability to understand and interpret the decisions taken by an artificial intelligence system or model. This is part of the larger theme of transparency in AI, which focuses on understanding and explaining the outputs and predictions made by complex models that, unlike more classical models like linear regression, are not trivially understandable.

Autonomous Agents

Autonomous agents are sets of software components capable of executing a series of tasks within a complex environment. They are able to make decisions regarding how to use their available components to best achieve a desired outcome. For example, an autonomous agent built to source job candidates would be able to search LinkedIn, download resumes, parse those resumes, assess each resume’s relevance for the position, and send an email to selected candidates.

Chunking

Chunking is the process of dividing a document into smaller, more manageable parts for efficient computational processing. A chunk can be characterized by its size (number of characters) or identified by analyzing the chunked text for natural segments, such as complete sentences or paragraphs.
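
The two approaches described above can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and the sentence splitter is a simple regular expression that assumes sentences end in ".", "!" or "?" followed by whitespace.

```python
import re

def chunk_by_size(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]

def chunk_by_sentence(text: str) -> list[str]:
    """Split text at whitespace that follows sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

doc = "Chunking splits text. Each chunk is small! Is that useful? Yes."
print(chunk_by_size(doc, 20))    # fixed-size chunks may cut words in half
print(chunk_by_sentence(doc))    # natural segments keep sentences intact
```

Fixed-size chunks are predictable in length, while sentence-based chunks preserve natural boundaries at the cost of uneven sizes.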

Chunking vs Sections

In the context of Retrieval-Augmented Generation (RAG), the terms “content chunk” and “content section” are often used interchangeably to refer to small, manageable divisions of text for efficient computational processing and analysis.

Content Analytics

Content analytics refers to the use of algorithms and tools to analyze data, extract insights and derive meaningful conclusions from content, such as text, images, or video. Content analytics are used to facilitate data-driven decision making about internal and external knowledge bases.

Content Quality Filters

Content quality filters are mechanisms, criteria, and tools used to assess the quality of content and enforce minimum quality standards within a collection of content. Content quality filters can be used to filter out irrelevant, inappropriate, or low-quality content as part of content moderation for search engines and knowledge bases.

Context Length

Context length refers to the maximum number of tokens that can be processed by a Large Language Model (LLM) or other text processing model.
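
As a rough sketch of what this limit means in practice, the snippet below trims an input so it fits within a maximum number of tokens. It assumes whitespace-separated words stand in for tokens (real models use their own tokenizers) and that the most recent tokens are the ones worth keeping; `fit_to_context` is a hypothetical helper name.

```python
def fit_to_context(tokens: list[str], max_tokens: int) -> list[str]:
    """Keep only the most recent `max_tokens` tokens."""
    if max_tokens <= 0:
        return []
    return tokens[-max_tokens:]

# Whitespace splitting is only a stand-in for a real tokenizer.
tokens = "the quick brown fox jumps over the lazy dog".split()
print(fit_to_context(tokens, 4))  # ['over', 'the', 'lazy', 'dog']
```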

Data Catalog

A data catalog is a searchable inventory of available data assets, sources, datasets, and associated metadata.

Data Ingestion

Data ingestion refers to the process of collecting, importing, transferring, and storing raw data from various sources for further use or archival.

Data Labeling

Data labeling is the process of assigning labels or tags to data in order to enrich that data. Labeled data is used to train neural networks and to evaluate existing AI systems, such as spam detection or sentiment analysis systems.

Data Lake

A data lake is a centralized repository for the storage of vast amounts of raw, unprocessed data, typically kept in its original format. It is a flexible, scalable infrastructure that facilitates analytics, data processing, and data exploration across an organization.

Data Pipeline

A data pipeline is a set of processes and tools for collecting, transforming, transporting, and enriching data from various sources. Data pipelines control the flow of data from source through transformation and processing components to the data’s final storage location.

Data Sanitization

Data sanitization is the process of ensuring that sensitive information is removed or masked in datasets to protect privacy and comply with data protection regulations. Sanitizing data involves identifying and appropriately handling personally identifiable information (PII) or other sensitive data to prevent unauthorized access or disclosure.
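
A minimal sketch of masking PII is shown below. It assumes that only email addresses and phone numbers need to be masked and uses simplified regular expressions that will not catch every real-world format; the `sanitize` function and the placeholder labels are hypothetical.

```python
import re

# Simplified patterns for illustration; production systems need broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def sanitize(text: str) -> str:
    """Replace email addresses and phone numbers with placeholder labels."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

print(sanitize("Contact jane.doe@example.com or +1 (555) 010-9999."))
# Contact [EMAIL] or [PHONE].
```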

Data Validation

Data validation is the process of ensuring data is accurate, consistent and meets pre-defined criteria or standards. Data validation can be completed in a range of ways, including the use of neural networks, algorithms, or human assessment.
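
The algorithmic approach can be sketched as a set of rule checks. The record fields (`id`, `title`, `score`) and the rules themselves are assumptions chosen for illustration, not a prescribed schema.

```python
def validate(record: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the record is valid."""
    errors = []
    if not isinstance(record.get("id"), int):
        errors.append("id must be an integer")
    title = record.get("title")
    if not isinstance(title, str) or not title.strip():
        errors.append("title must be a non-empty string")
    score = record.get("score")
    if not isinstance(score, (int, float)) or not 0 <= score <= 1:
        errors.append("score must be a number between 0 and 1")
    return errors

print(validate({"id": 1, "title": "Guide", "score": 0.8}))  # []
print(validate({"id": "1", "title": " ", "score": 2}))      # three violations
```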

Data Warehouse

A data warehouse is a centralized repository for the storage of large amounts of structured, processed data organized under a pre-defined schema. It is designed to support reporting, analytics, and data exploration across an organization.

Dataset

A dataset is a structured collection of data, usually with some sort of schema, that can be used for analysis, research, and the training and evaluation of machine learning models, including large language models.

Embeddings

Embeddings are numerical representations of content, such as text, images or video, that are encoded with semantic relationships in a dense multi-dimensional space. Embeddings are one example of vectorization.

Feature Engineering

Feature engineering is the process of selecting, transforming, and creating relevant features in a dataset for the purpose of improving the downstream task performance of a system, such as a trained neural network.

Fine-tuning LLM

Fine-tuning an LLM refers to further training that LLM on a task-specific dataset, adjusting its parameters to enhance its performance on a particular task or domain.

Foundation Model

A foundation model is a pre-trained model, usually focusing on a single modality of content (e.g., image or text), that serves as a starting point for further training on more specific tasks and task-oriented data.

Generative Pre-trained Transformers (GPT)

Generative Pre-trained Transformers (GPT) denote a family of artificial intelligence models. These models, primarily Large Language Models such as those developed by OpenAI, are based on a transformer architecture capable of understanding and generating human-like language.

Grounding LLMs

Grounding LLMs refers to a method that leverages information relevant to a specific use-case, not available as part of the training data, when interacting with an LLM.

Guardrails

Guardrails are constraints, guidelines, and measures designed and implemented to ensure high-quality, ethical, responsible, accurate and non-harmful content generation using Large Language Models.

Hallucinations

Hallucinations refer to content generated by an LLM that is fabricated, meaning that the information isn’t grounded in reality or supported by the model’s training data. Hallucinations can be very deceptive and hard to spot because LLMs are so good at generating grammatically and semantically coherent text.

Human in the Loop (HitL)

Human in the Loop (HitL) refers to a system, model, or pipeline in which human involvement is tightly integrated into the workflow. Typically, human involvement serves as a means to evaluate the accuracy of outputs or monitor other non-human components, such as generated text from a Large Language Model.

Hybrid Search

Hybrid Search refers to a technique that merges dense vector-based search with sparse vector-based search. This approach enables simultaneous semantic and literal matching during a search operation.
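
One way to picture this merging is a weighted sum of a dense score and a sparse score. The sketch below makes several assumptions: the two-dimensional embeddings are invented, a simple keyword-overlap ratio stands in for a real sparse scoring method, and the weight `alpha` and all function names are hypothetical.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Dense score: cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def keyword_overlap(query: str, text: str) -> float:
    """Sparse stand-in: fraction of query words that literally appear in the text."""
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0

def hybrid_score(query, query_vec, doc_text, doc_vec, alpha=0.5):
    """Weighted sum of semantic (dense) and literal (sparse) scores."""
    dense = cosine(query_vec, doc_vec)
    sparse = keyword_overlap(query, doc_text)
    return alpha * dense + (1 - alpha) * sparse

# Made-up documents paired with made-up 2-dimensional embeddings.
docs = [
    ("blue trousers on sale", [0.9, 0.1]),
    ("pants size guide", [0.8, 0.2]),
    ("garden hose repair", [0.1, 0.9]),
]
query, query_vec = "trousers size", [1.0, 0.0]
ranked = sorted(docs, key=lambda d: hybrid_score(query, query_vec, d[0], d[1]), reverse=True)
for text, _ in ranked:
    print(text)
```

Raising `alpha` favors semantic similarity, while lowering it favors literal keyword matches.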

Inaccessible Data

Inaccessible Data pertains to information that is not readily available, retrievable, or searchable due to reasons such as lack of structure or absence of the requisite tools.

Indexing

Indexing is the process of organizing content into a data structure (an index) so that it can be searched and retrieved efficiently, rather than scanning every item for each query.

Inference

Inference involves processing a piece of content through a trained model to obtain a prediction, score, or other types of output, depending on the type of model used.

Knowledge Engineering

Knowledge Engineering is a field of study focused on the design and development of knowledge-based systems that leverage explicit knowledge in the form of knowledge bases, rules, facts, ontologies, taxonomies, and databases. These systems facilitate decision-making and problem-solving in a human-like manner.

Knowledge Graph

A Knowledge Graph is a structured representation of knowledge that organizes data in terms of entities and their interrelationships. This promotes better semantic understanding of content and complex retrieval.
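
A small sketch of this idea stores knowledge as (subject, predicate, object) triples and looks up relationships by entity. The triple format, predicate names, and helper functions are assumptions for illustration; real knowledge graphs typically live in dedicated graph stores.

```python
from collections import defaultdict

# Each triple links two entities through a named relationship.
TRIPLES = [
    ("Ada Lovelace", "collaborated_with", "Charles Babbage"),
    ("Charles Babbage", "designed", "Analytical Engine"),
    ("Ada Lovelace", "wrote_about", "Analytical Engine"),
]

def build_index(triples):
    """Group (predicate, object) pairs by their subject entity."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return index

def related(index, subject, predicate):
    """Return all objects linked to `subject` through `predicate`."""
    return [obj for pred, obj in index.get(subject, []) if pred == predicate]

index = build_index(TRIPLES)
print(related(index, "Charles Babbage", "designed"))  # ['Analytical Engine']
```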

Large Language Model

A Large Language Model (LLM) is an artificial intelligence model, usually based on a Transformer architecture, trained on vast amounts of data. LLMs are able to understand and emulate human language, as well as solve a wide range of tasks.

LLM Bias

LLM bias refers to the existence of systemic preferences within the outputs produced by an LLM. These biases typically mirror some underlying bias within the training data used.

LLM Connector

An LLM connector is a module or a component that facilitates integration with an LLM service.

LLM Hyperparameters

LLM Hyperparameters refer to predetermined parameters that are set before an LLM training process. These parameters, including learning rate, batch size, epochs, optimizer choice, and sequence lengths, significantly influence the behavior and performance of an LLM.

LLM Jailbreaking

LLM Jailbreaking involves techniques aimed at circumventing the guidelines and constraints imposed on a Large Language Model (LLM). These techniques are used to accomplish things such as extracting hidden information, generating inappropriate LLM outputs, and reverse engineering prompts to solve specific tasks.

LLM Quality Assurance

LLM Quality Assurance involves a set of processes and tools intended to ensure the accuracy, reliability, and usefulness of content generated by LLMs. These QA processes systematically test, validate, and monitor generated content based on determined quality and performance standards.

LLM Settings

LLM settings are the configurable options that control how an LLM generates a response at inference time, such as temperature and the maximum length of the output.

LLMOps

LLMOps encompasses DevOps and MLOps procedures specific to Large Language Models (LLMs). These include the deployment, operation, monitoring, and management of these models.

Metadata

Metadata is a category of data that provides information about other data. This may include characteristics and properties of specific content, such as the date a piece of content was last modified.

MLOps

Machine Learning Operations (MLOps) is a set of procedures for the deployment, management, monitoring, and maintenance of Machine Learning systems throughout their lifecycle. MLOps is a subset of broader DevOps practices.

Multimodal RAG

Multimodal Retrieval-Augmented Generation (RAG) is a specific type of RAG pipeline capable of retrieving multiple modalities of content and/or generating multiple modalities of content.

Multivector RAG

Multivector RAG is a specific variant of RAG, where a document is represented by multiple vectors instead of one, which can improve the search process. Additional vectors are created by embedding (vectorizing) information related to the document’s contents, such as a document summary, description, or hypothetical questions that the document can answer.
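
The sketch below illustrates the idea with invented three-dimensional vectors. Two details are assumptions rather than part of the definition: similarity is measured with a plain dot product, and a document's score is taken as the best match among its vectors. The document ids and function names are hypothetical.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot-product similarity between two vectors."""
    return sum(x * y for x, y in zip(a, b))

def document_score(query_vec, vectors) -> float:
    """Score a document by its best-matching vector (an assumed strategy)."""
    return max(dot(query_vec, v) for v in vectors)

# Each document has several made-up vectors: e.g. body, summary, hypothetical question.
documents = {
    "returns-policy": [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0], [0.2, 0.1, 0.9]],
    "shipping-times": [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]],
}

query_vec = [0.1, 0.0, 1.0]
best = max(documents, key=lambda doc_id: document_score(query_vec, documents[doc_id]))
print(best)  # returns-policy
```

Here the query matches the third vector of "returns-policy" even though it is far from that document's other vectors, which is the benefit multiple vectors are meant to provide.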

Named Entity Recognition (NER)

Named Entity Recognition (NER) is a text-enrichment technique to identify and classify entities within a text. These entities could include individuals, organizations, locations, dates, or countries.

Neural Network (NN)

A Neural Network (NN) is a computational model consisting of connected neurons, grouped into layers. Each neuron contains weights (learned during the model’s training process) and activation functions that govern the flow of data through the model.
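
To make the structure concrete, here is a minimal forward pass through a tiny network with one hidden layer. The weights and biases are hand-picked for illustration rather than learned, and the choice of ReLU and sigmoid activations is an assumption.

```python
import math

def relu(x: float) -> float:
    return max(0.0, x)

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases, activation):
    """Fully connected layer: each neuron computes activation(w . x + b)."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]

# Two inputs -> two hidden neurons (ReLU) -> one output neuron (sigmoid).
hidden = layer([1.0, 2.0], [[0.5, 0.25], [-1.0, 0.25]], [0.0, 0.1], relu)
output = layer(hidden, [[1.0, 1.0]], [0.0], sigmoid)
print(hidden)   # [1.0, 0.0]
print(output)
```

The second hidden neuron outputs 0.0 because its weighted sum is negative and ReLU blocks it, which shows how activation functions govern the flow of data.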

Ontology

Ontology denotes a formal representation of knowledge that defines concepts and the relationships between them. It serves as a structured framework for modeling and organizing information, and can be utilized to transform unstructured data into structured data.

Optical Character Recognition (OCR)

Optical Character Recognition (OCR) refers to a technique employed to identify characters within content types such as images or PDF documents, and convert that content into machine-encoded text. This process helps turn non-searchable and non-indexable textual data into searchable and indexable text.

Orchestration

Orchestration involves the coordination and management of various components toward the achievement of a specific goal. It’s generally applied to describe the management and coordination of multiple components that make up a larger piece of software, such as a full RAG pipeline.

Parsing

Parsing refers to the process of analyzing data, typically in the form of text, in order to extract meaningful information and understand the data’s structure. An example would be parsing an invoice, which entails extracting seller and buyer information, product details, and costs, and presenting this extracted data in a structured manner.
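
The invoice example can be sketched as below. The invoice layout ("Seller:", "Buyer:", and "Item: name xN @ price" lines) is entirely hypothetical, as are the field names in the result; real invoices vary widely and usually need more robust parsing.

```python
import re

INVOICE = """\
Seller: Acme Tools Ltd
Buyer: Jane Doe
Item: Cordless drill x2 @ 49.50
Item: Drill bits x1 @ 12.00
"""

ITEM = re.compile(r"^Item:\s*(.+?)\s+x(\d+)\s+@\s+(\d+(?:\.\d+)?)$", re.MULTILINE)

def parse_invoice(text: str) -> dict:
    """Extract seller, buyer, line items, and total into a structured dict."""
    seller = re.search(r"^Seller:\s*(.+)$", text, re.MULTILINE)
    buyer = re.search(r"^Buyer:\s*(.+)$", text, re.MULTILINE)
    items = [
        {"name": name, "quantity": int(qty), "unit_price": float(price)}
        for name, qty, price in ITEM.findall(text)
    ]
    return {
        "seller": seller.group(1) if seller else None,
        "buyer": buyer.group(1) if buyer else None,
        "items": items,
        "total": sum(i["quantity"] * i["unit_price"] for i in items),
    }

print(parse_invoice(INVOICE))
```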

Pre-training LLM

The pre-training of a Large Language Model (LLM) involves training the LLM on a large dataset so that it can learn language patterns and the underlying structure of language. This process is typically carried out in an autoregressive manner via a “next token prediction” method.

Predefined Q&A Lists

Predefined Q&A Lists are curated sets of questions with corresponding answers, which are prepared in advance for use in automated responses or information retrieval systems.

Prompt Engineering

Prompt Engineering refers to the process of creating and refining specialized input prompts to optimize the generation of answers using Large Language Models for a specific objective or task.

Retrieval Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) refers to the combination of information retrieval with language generation. Retrieved content is supplied to a Large Language Model (LLM), such as those offered by OpenAI, in order to enhance and guide the generation of accurate, relevant text.

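Re-ranking

Re-ranking is the process of reorganizing or reprioritizing a set of search results produced by an initial retrieval system. The retrieval system might focus entirely on matching the query, while re-ranking can account for additional factors such as user profile information.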

Recipe

A Recipe refers to a predefined set of rules and instructions that guide the ingestion, processing, scoring, enrichment, and filtering of content. A recipe can also guide the subsequent extraction of insights from the content data.

Reinforcement Learning

Reinforcement Learning is a learning paradigm where an agent (such as a model) develops decision-making skills by interacting with its environment. The agent learns from the positive or negative rewards it receives for the actions it takes.

Responsible AI

Responsible AI denotes an ethical and socially conscious approach to the development and deployment of AI systems. Responsible AI includes key concepts such as fairness, transparency, and inclusivity.

Retrieval

Retrieval refers to the process of obtaining and fetching indexed or stored data from a storage system, such as a database, based on an input query or related criteria and conditions.

Segmenting

Segmenting is the process of dividing a Knowledge Base into distinct sections or segments. This process particularly emphasizes creating clear demarcations between topics, themes, and concepts.

Self-hosted LLM

A Self-hosted Large Language Model (LLM) refers to an instance of an LLM that is deployed and operated on an organization’s own infrastructure and servers, as opposed to being provided as a service by a third-party provider.

Self-hosted LLM vs Third-Party Service

Self-hosted Large Language Models (LLMs) provide full control over the data sent to an LLM, and can potentially avoid high costs from extensive usage, as hosting costs are fixed and more predictable. Third-Party-hosted LLMs spare organizations the need to deploy and manage infrastructure. They operate on a “pay as much as you use” pricing model, which can be more cost-effective when overall traffic is lower, albeit at the expense of sending data to a third party.

Semantic Database

A Semantic Database is a database that stores data in a manner that easily allows for semantic searching – matching the underlying meanings of a set of content (for instance, of a sentence) as opposed to merely identifying literal matches.

Semantic Search

Semantic Search is a search technique centered on matching a query by its meaning, or semantics, rather than depending only on exact matches or keywords. This makes Semantic Search robust to synonyms: it can match a query with correct results based on the query’s underlying meaning, even when there is no vocabulary overlap. For example, a semantic search is able to retrieve content containing the term “pants” when given the query “trousers.”
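
The following sketch mirrors that example using cosine similarity over embeddings. The three-dimensional vectors are invented for illustration; in practice they would come from an embedding model, and the `search` helper is hypothetical.

```python
import math

# Invented embeddings: similar meanings are given nearby vectors.
TOY_EMBEDDINGS = {
    "pants": [0.90, 0.10, 0.00],
    "trousers": [0.85, 0.15, 0.05],
    "banana": [0.00, 0.20, 0.95],
}

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query: str, candidates: list[str]) -> str:
    """Return the candidate whose embedding is closest to the query's."""
    query_vec = TOY_EMBEDDINGS[query]
    return max(candidates, key=lambda c: cosine(query_vec, TOY_EMBEDDINGS[c]))

print(search("trousers", ["pants", "banana"]))  # pants
```

The query and the best match share no characters in common; the match comes purely from the closeness of their vectors.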

Structured data

Structured data is organized and formatted data that is easily searchable, indexable and parseable, and usually represented in tabular form along with a pre-defined schema.

Structured vs. Unstructured Data

Structured Data refers to organized data which follows a pre-defined schema, which facilitates easier searching, indexing, and parsing. It is easier to process and analyze than unstructured data (which lacks a specific format or organization), but much less flexible.

Taxonomy

Taxonomy represents a hierarchical system for classifying and organizing items, concepts, themes, and topics based on shared characteristics and other criteria. Taxonomies enhance understanding, organization, and communication while improving search and retrieval capabilities.

Tech Stack

A tech stack, in the context of AI architecture, refers to the combination of tools, software, programming languages and libraries, and infrastructure employed to construct a specific piece of software or maintain all IT-related services and products within an organization.

Temperature (LLM)

Temperature, in the context of Large Language Models (LLMs), is a hyperparameter that controls the degree of randomness in a model’s output. A higher temperature introduces more “diversity” and “randomness” in the generated LLM output, essentially serving as a “creativity” slider. Conversely, a lower temperature results in more deterministic responses.
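
One common way temperature is applied is by dividing the model's raw scores (logits) by the temperature before converting them to probabilities with a softmax. The sketch below assumes that approach and uses made-up logits for three candidate tokens.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after scaling them by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
print(softmax_with_temperature(logits, 0.2))  # sharply peaked: near-deterministic
print(softmax_with_temperature(logits, 1.0))  # unscaled distribution
print(softmax_with_temperature(logits, 2.0))  # flatter: more random sampling
```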

Third-Party Service LLM

A Third-Party Service LLM represents a Large Language Model that’s deployed, hosted, and offered by a third party. It’s typically accessible via a REST API for use by other organizations and priced according to usage.

Token

A Token is a basic unit of text that a language model processes. It typically takes the form of an individual word or a subpart of a word.

Topic Modeling

Topic Modeling is a technique used to identify underlying topics, themes, and concepts within a specified collection of documents or content pieces. It serves as a method for identifying patterns and key terminology within a given set of content, assisting in content analysis and organization.

Transformer

A Transformer refers to a type of deep learning model/neural network architecture that excels in modeling language and solving sequence-to-sequence tasks, such as machine translation or generating text based on a given input prompt.

Unstructured Data

Unstructured Data is data that lacks a predefined schema or data model. It usually exists in non-tabular, free-form formats, such as text documents or images.

Vector Database

A Vector Database is a database optimized for the storage and retrieval of vectors, which represent objects or content in a multi-dimensional space.

Vectorize

The process of Vectorization involves converting content (such as text or images) into numerical vectors that serve as representations of the given content.
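
A very simple form of vectorization is a word-count ("bag-of-words") vector, sketched below. This is only one of many possible schemes, chosen here because it needs no trained model; the function names and sample texts are hypothetical.

```python
from collections import Counter

def build_vocabulary(texts: list[str]) -> list[str]:
    """Collect every distinct lowercase word, sorted for a stable ordering."""
    return sorted({word for text in texts for word in text.lower().split()})

def vectorize(text: str, vocabulary: list[str]) -> list[int]:
    """Represent text as word counts, one position per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

texts = ["the cat sat", "the dog sat down"]
vocabulary = build_vocabulary(texts)
print(vocabulary)                          # ['cat', 'dog', 'down', 'sat', 'the']
print(vectorize("the cat sat", vocabulary))  # [1, 0, 0, 1, 1]
```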

Vectorize vs Embeddings

Vectorization represents the concept of transforming content into numerical vectors. In contrast, embeddings denote a specific type of vectorization, involving learned numerical representations in a dense, multi-dimensional space encoding the semantic meaning and relationships for a particular set of content.

Weights

Weights are parameters that a neural network learns during its training process. Within a neural network, each connection into a node or neuron carries a weight, which determines how strongly that input influences the neuron’s output.
