SQL Troubles: clustering

Showing posts with label clustering. Show all posts

11 June 2026

🤖〽️Prompt Engineering: Copilot Unabridged (Part 202: How Pushing AI Models Into Out‑of‑Distribution Inputs Generates Novel Concepts)

Prompt Engineering Series

Prompt: "write a post of 600 words on how to push models into out‑of‑distribution iInputs to generate novel concepts in AI models"

Introduction

Artificial Intelligence (AI) models excel at interpolation - filling in the gaps between patterns they’ve already seen. But the frontier of creativity, innovation, and conceptual discovery lies outside those familiar boundaries. To reach that frontier, researchers use out‑of‑distribution (OOD) inputs: prompts, structures, or data patterns that sit beyond the model’s training distribution. When done intentionally and safely, this technique can reveal how models generalize, how they stretch their internal representations, and how they generate novel concepts that do not simply remix the past.

Pushing a model into OOD territory is not about confusing it. It’s about stress‑testing its conceptual elasticity. Models trained on massive datasets develop dense clusters of meaning - regions where concepts are richly represented - and sparse regions where the model has little experience. OOD inputs target those sparse regions. They force the model to navigate conceptual space without the usual statistical anchors, revealing how it constructs meaning when familiar patterns disappear. This connects directly to rare‑event blind‑spot analysis, where unusual inputs expose hidden weaknesses.

One powerful method for generating OOD conditions is structural perturbation. Instead of changing the content of a prompt, researchers alter its structure - using unusual syntax, hybrid formats, or nested instructions. For example, combining mathematical notation with poetic metaphor, or embedding code inside rhetorical questions. These hybrid structures push the model into regions where its learned representations overlap in unexpected ways. The model must reconcile incompatible patterns, often producing emergent conceptual blends that would not appear in standard prompting. This technique aligns with insights from uncommon linguistic structure testing.

Another approach involves semantic displacement - asking the model to apply concepts from one domain to another where they do not naturally belong. For example: 'Describe quantum entanglement using the logic of medieval guild economics.' This forces the model to map distant conceptual regions together, creating novel analogies or frameworks. These mappings are not random; they reveal how the model organizes knowledge internally. When the model is pushed far enough, it begins to generate new conceptual hybrids, not because it has seen them before, but because its internal geometry allows it to interpolate across distant domains.

A more advanced technique uses contradictory task layering, where the model must satisfy overlapping constraints that do not naturally coexist. For example: 'Invent a biological organism that obeys thermodynamic laws but violates known evolutionary principles.' These prompts push the model into conceptual dead zones - regions where no training example exists. The model must synthesize new structures to satisfy the constraints, often producing novel theoretical constructs. This method parallels the logic of boundary‑stress evaluation, where conflicting instructions reveal the model’s reasoning hierarchy.

OOD prompting also benefits from recursive abstraction, where the model is asked to generalize beyond its own generalizations. For instance: 'Generate a concept that is to machine learning what machine learning is to statistics.' This forces the model to climb the abstraction ladder, leaving the comfort of known categories. The resulting concepts often reflect the model’s latent ability to extrapolate beyond its training distribution.

Finally, OOD exploration can involve synthetic anomalies - inputs that deliberately violate the statistical norms of the training data. These anomalies act as conceptual shockwaves, pushing the model to reorganize its internal representations. When guided carefully, they can reveal new conceptual pathways, much like how scientific breakthroughs often emerge from anomalies that challenge existing theories.

Ultimately, pushing models into OOD inputs is not about breaking them. It is about discovering the edges of their conceptual space. By exploring those edges, researchers can uncover how models generalize, how they innovate, and how they generate ideas that go beyond the sum of their training data. OOD prompting is a tool for expanding the frontier of machine creativity - one carefully engineered anomaly at a time.

Disclaimer: The whole text was generated by Copilot (under Windows 11) at the first attempt. This is just an experiment to evaluate feature's ability to answer standard general questions, independently on whether they are correctly or incorrectly posed. Moreover, the answers may reflect hallucinations and other types of inconsistent or incorrect reasoning.

Previous Post <<||>> Next Post

14 July 2019

🧱IT: Cluster/Clustering (Definitions)

"Technology that enables you to create a hot spare. That is a server that is actually running and can take over immediately. This technology enables you to mirror an entire server to another computer." (Owen Williams, "MCSE TestPrep: SQL Server 6.5 Design and Implementation", 1998)

"The use of multiple computers to provide increased reliability, capacity, and management capabilities." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"Grouping data together on the same disk page to improve retrieval performance." (Jan L Harrington, "Relational Database Dessign: Clearly Explained" 2nd Ed., 2002)

"Any collection of distinct computers that are connected and used as a parallel computer, or to form a redundant system for higher availability. The computers in a cluster are not specialized to cluster computing and could, in principle, be used in isolation as standalone computers. In other words, the components making up the cluster, both the computers and the networks connecting them, are not custom-built for use in the cluster." (Beverly A Sanders, "Patterns for Parallel Programming", 2004)

"Connecting two or more computers in such a way that they behave like a single computer to an application or client. Clustering is used for parallel processing, load balancing, and fault tolerance." (Allan Hirt et al, "Microsoft SQL Server 2000 High Availability", 2004)

[server cluster:] "The type of Windows Clustering that provides availability only. It is a collection of nodes that allow resources to be failed over to another node in the event of a problem. SQL Server 2000 failover clustering is installed and configured on top of a server cluster." (Allan Hirt et al, "Microsoft SQL Server 2000 High Availability", 2004)

"A group of computers linked via a LAN and working together to form the equivalent of a single computer." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed., 2011)

[failover cluster:] "A group of servers that are in one location and that are networked together for the purpose of providing live backup in case one of the servers fails." (Microsoft, "SQL Server 2012 Glossary", 2012)

"A set of computers with distributed memory communicating over a high-speed interconnect. The individual computers are often called nodes." (Michael McCool et al, "Structured Parallel Programming", 2012)

"A group of servers connected by a network and configured in such a way that if the primary server fails, a secondary server takes over." (IBM, "Informix Servers 12.1", 2014)

"A cluster is a set of servers configured to function together. Servers sometimes have differentiated functions and sometimes they do not." (Dan Sullivan, "NoSQL for Mere Mortals®", 2015)

"A collection of complete systems that work together to provide a single, unified computing capability." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

[failover clustering:] "A network of two or more interconnected servers that redirects an overloaded or failed resource to maintain system availability." (Microsoft Technet)

08 July 2019

🧱IT: Grid Computing (Definitions)

"A grid is an architecture for distributed computing and resource sharing. A grid system is composed of a heterogeneous collection of resources connected by local-area and/or wide-area networks (often the Internet). These individual resources are general and include compute servers, storage, application servers, information services, or even scientific instruments. Grids are often implemented in terms of Web services and integrated middleware components that provide a consistent interface to the grid. A grid is different from a cluster in that the resources in a grid are not controlled through a single point of administration; the grid middleware manages the system so control of resources on the grid and the policies governing use of the resources remain with the resource owners." (Beverly A Sanders, "Patterns for Parallel Programming", 2004)

"Clusters of cheap computers, perhaps distributed on a global basis, connected using even something as loosely connected as the Internet." (Gavin Powell, "Beginning Database Design", 2006)

"A step beyond distributed processing. Grid computing involves large numbers of networked computers, often geographically dispersed and possibly of different types and capabilities, that are harnessed together to solve a common problem." (Judith Hurwitz et al, "Service Oriented Architecture For Dummies" 2nd Ed., 2009)

"A web-based operation allowing companies to share computing resources on demand." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The use of networks to harness the unused processing cycles of all computers in a given network to create powerful computing capabilities." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed., 2011)

"A distributed set of computers that can be allocated dynamically and accessed remotely. A grid is distinguished from a cloud in that a grid may be supported by multiple organizations and is usually more heterogeneous and physically distributed." (Michael McCool et al, "Structured Parallel Programming", 2012)

"the use of multiple computing resources to leverage combined processing power. Usually associated with scientific applications." (Bill Holtsnider & Brian D Jaffe, "IT Manager's Handbook" 3rd Ed., 2012)

"A step beyond distributed processing, involving large numbers of networked computers (often geographically dispersed and possibly of different types and capabilities) that are harnessed to solve a common problem. A grid computing model can be used instead of virtualization in situations that require real time where latency is unacceptable." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"A named set of interconnected replication servers for propagating commands from an authorized server to the rest of the servers in the set." (IBM, "Informix Servers 12.1", 2014)

"A type of computing in which large computing tasks are distributed among multiple computers on a network." (Jim Davis & Aiman Zeid, "Business Transformation: A Roadmap for Maximizing Organizational Insights", 2014)

"Connecting many computer system locations, often via the cloud, working together for the same purpose." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)

"A computer network that enables distributed resource management and on-demand services." (Forrester)

"A computing architecture that coordinates large numbers of servers and storage to act as a single large computer." (Oracle, "Oracle Database Concepts")

"connecting different computer systems from various location, often via the cloud, to reach a common goal." (Analytics Insight)

26 November 2018

🔭Data Science: Clustering (Just the Quotes)

"To the untrained eye, randomness appears as regularity or tendency to cluster." (William Feller, "An Introduction to Probability Theory and its Applications", 1950)

"In scientific information, then, we find that subjects – the themes and topics on which books and articles are written – cluster into fields, each of which can be analysed into its characteristic set of facets of terms." (Brian C Vickery, "Classification and indexing in science", 1958)

"In comparison with Predicate Calculus encoding is of factual knowledge, semantic nets seem more natural and understandable. This is due to the one-to-one correspondence between nodes and the concepts they denote, to the clustering about a particular node of propositions about a particular thing, and to the visual immediacy of 'interrelationships' between concepts, i.e., their connections via sequences of propositional links." (Lenhart K Schubert, "Extending the Expressive Power of Semantic Networks", Artificial Intelligence 7, 1976)

"Cyberspace. A consensual hallucination experienced daily by billions of legitimate operators, in every nation, by children being taught mathematical concepts. [...] A graphic representation of data abstracted from banks of every computer in the human system. Unthinkable complexity. Lines of light ranged in the nonspace of the mind, clusters and constellations of data." (William Gibson, "Neuromancer", 1984)

"While a small domain (consisting of fifty or fewer objects) can generally be analyzed as a unit, large domains must be partitioned to make the analysis a manageable task. To make such a partitioning, we take advantage of the fact that objects on an information model tend to fall into clusters: groups of objects that are interconnected with one another by many relationships. By contrast, relatively few relationships connect objects in different clusters." (Stephen J. Mellor, "Object-Oriented Systems Analysis: Modeling the World In Data", 1988)

"Randomness is a difficult notion for people to accept. When events come in clusters and streaks, people look for explanations and patterns. They refuse to believe that such patterns - which frequently occur in random data - could equally well be derived from tossing a coin. So it is in the stock market as well." (Burton G Malkiel, "A Random Walk Down Wall Street", 1989)

"Many of the basic functions performed by neural networks are mirrored by human abilities. These include making distinctions between items (classification), dividing similar things into groups (clustering), associating two or more things (associative memory), learning to predict outcomes based on examples (modeling), being able to predict into the future (time-series forecasting), and finally juggling multiple goals and coming up with a good- enough solution (constraint satisfaction)." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"While classification is important, it can certainly be overdone. Making too fine a distinction between things can be as serious a problem as not being able to decide at all. Because we have limited storage capacity in our brain (we still haven't figured out how to add an extender card), it is important for us to be able to cluster similar items or things together. Not only is clustering useful from an efficiency standpoint, but the ability to group like things together (called chunking by artificial intelligence practitioners) is a very important reasoning tool. It is through clustering that we can think in terms of higher abstractions, solving broader problems by getting above all of the nitty-gritty details." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"Random events often come like the raisins in a box of cereal - in groups, streaks, and clusters. And although Fortune is fair in potentialities, she is not fair in outcomes." (Leonard Mlodinow, "The Drunkard’s Walk: How Randomness Rules Our Lives", 2008)

"Granular computing is a general computation theory for using granules such as subsets, classes, objects, clusters, and elements of a universe to build an efficient computational model for complex applications with huge amounts of data, information, and knowledge. Granulation of an object a leads to a collection of granules, with a granule being a clump of points (objects) drawn together by indiscernibility, similarity, proximity, or functionality. In human reasoning and concept formulation, the granules and the values of their attributes are fuzzy rather than crisp. In this perspective, fuzzy information granulation may be viewed as a mode of generalization, which can be applied to any concept, method, or theory." (Salvatore Greco et al, "Granular Computing and Data Mining for Ordered Data: The Dominance-Based Rough Set Approach", 2009)

"With the ever increasing amount of empirical information that scientists from all disciplines are dealing with, there exists a great need for robust, scalable and easy to use clustering techniques for data abstraction, dimensionality reduction or visualization to cope with and manage this avalanche of data." (Jörg Reichardt, "Structure in Complex Networks", 2009)

"Data clusters are everywhere, even in random data. Someone who looks for an explanation will inevitably find one, but a theory that fits a data cluster is not persuasive evidence. The found explanation needs to make sense and it needs to be tested with uncontaminated data." (Gary Smith, "Standard Deviations", 2014)

"Your goal when designing a scattr plot is to make the relationship between two variables as clear as possible, including the overall level of association but also revealing clusters and outliers. This is easier said than done. The data and a few bad design choices can make reading a scatter plot too complex or misleading." (Jorge Camões, "Data at Work: Best practices for creating effective charts and information graphics in Microsoft Excel", 2016)

"Cluster analysis refers to the grouping of observations so that the objects within each cluster share similar properties, and properties of all clusters are independent of each other. Cluster algorithms usually optimize by maximizing the distance among clusters and minimizing the distance between objects in a cluster. Cluster analysis does not complete in a single iteration but goes through several iterations until the model converges. Model convergence means that the cluster memberships of all objects converge and don’t change with every new iteration." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"Hierarchical clustering is summarised by a dendrogram, which sequentially shows points being joined to form a cluster, with the corresponding distances. Breaking the data into clusters is done by cutting the dendrogram at the long edges. [...] Plotting the dendrogram in the data space can help you understand how the hierarchical clustering has collected the points together into clusters. You can learn if the algorithm has been confused by nuisance patterns in the data, and how different choices of linkage method affect the result." (Dianne Cook & Ursula Laa, "Interactively Exploring High-Dimensional Data and Models in R", 2026)

"PCA (Principal Component Analysis) is very broadly useful for summarising linear association by using combinations of the variables that are highly correlated. However, high correlation can also occur when there are outliers or clustering. PCA is commonly used to detect these patterns also, although this might NOT be a reliable way to do so. To detect clustering or anomalies, using a different approach that is specifically focused on these types of patterns is advisable. To some extent capturing clustering or anomalies using PCA is actually finding problematic patterns that adversely affect conducting appropriate dimension reduction." (Dianne Cook & Ursula Laa, "Interactively Exploring High-Dimensional Data and Models in R", 2026)

"Unsupervised classification, or cluster analysis, organizes observations into similar groups. Clusteranalysis is a commonly used, appealing, and conceptually intuitive statistical method. Some of its uses include market segmentation, where customers are grouped into clusters with similar attributes for targeted marketing; gene expression analysis, where genes with similar expression patterns are grouped together; and the creation of taxonomies for animals, insects, or plants. Clustering can be used as a way of reducing a massive amount of data because observations within a cluster can be summarised by its centre. Also, clustering effectively subsets the data thus simplifying analysis because observations in each cluster can be analysed separately." (Dianne Cook & Ursula Laa, "Interactively Exploring High-Dimensional Data and Models in R", 2026)

13 May 2018

🔬Data Science: Self-Organizing Map (Definitions)

"A clustering neural net, with topological structure among cluster units." (Laurene V Fausett, "Fundamentals of Neural Networks: Architectures, Algorithms, and Applications", 1994)

"A self organizing map is a form of Kohonen network that arranges its clusters in a (usually) two-dimensional grid so that the codebook vectors (the cluster centers) that are close to each other on the grid are also close in the k-dimensional feature space. The converse is not necessarily true, as codebook vectors that are close in feature-space might not be close on the grid. The map is similar in concept to the maps produced by descriptive techniques such as multi-dimensional scaling (MDS)." (William J Raynor Jr., "The International Dictionary of Artificial Intelligence", 1999)

"result of a nonparametric regression process that is mainly used to represent high-dimensional, nonlinearly related data items in an illustrative, often two-dimensional display, and to perform unsupervised classification and clustering." (Teuvo Kohonen, "Self-Organizing Maps" 3rd Ed., 2001)

"a method of organizing and displaying textual information according to the frequency of occurrence of text and the relationship of text from one document to another." (William H Inmon, "Building the Data Warehouse", 2005)

"A type of unsupervised neural network used to group similar cases in a sample. SOMs are unsupervised (see supervised network) in that they do not require a known dependent variable. They are typically used for exploratory analysis and to reduce dimensionality as an aid to interpretation of complex data. SOMs are similar in purpose to Ic-means clustering and factor analysis." (David Scarborough & Mark J Somers, "Neural Networks in Organizational Research: Applying Pattern Recognition to the Analysis of Organizational Behavior", 2006)

"A method to learn to cluster input vectors according to how they are naturally grouped in the input space. In its simplest form, the map consists of a regular grid of units and the units learn to represent statistical data described by model vectors. Each map unit contains a vector used to represent the data. During the training process, the model vectors are changed gradually and then the map forms an ordered non-linear regression of the model vectors into the data space." (Atiq Islam et al, "CNS Tumor Prediction Using Gene Expression Data Part II", Encyclopedia of Artificial Intelligence, 2009)

"A neural-network method that reduces the dimensions of data while preserving the topological properties of the input data. SOM is suitable for visualizing high-dimensional data such as microarray data." (Emmanuel Udoh & Salim Bhuiyan, "C-MICRA: A Tool for Clustering Microarray Data", 2009)

"A neural network unsupervised method of vector quantization widely used in classification. Self-Organizing Maps are a much appreciated for their topology preservation property and their associated data representation system. These two additive properties come from a pre-defined organization of the network that is at the same time a support for the topology learning and its representation. (Patrick Rousset & Jean-Francois Giret, "A Longitudinal Analysis of Labour Market Data with SOM" Encyclopedia of Artificial Intelligence, 2009)

"A simulated neural network based on a grid of artificial neurons by means of prototype vectors. In an unsupervised training the prototype vectors are adapted to match input vectors in a training set. After completing this training the SOM provides a generalized K-means clustering as well as topological order of neurons." (Laurence Mukankusi et al, "Relationships between Wireless Technology Investment and Organizational Performance", 2009)

"A subtype of artificial neural network. It is trained using unsupervised learning to produce low dimensional representation of the training samples while preserving the topological properties of the input space." (Soledad Delgado et al, "Growing Self-Organizing Maps for Data Analysis", 2009)

"An unsupervised neural network providing a topology-preserving mapping from a high-dimensional input space onto a two-dimensional output space." (Thomas Lidy & Andreas Rauber, "Music Information Retrieval", 2009)

"Category of algorithms based on artificial neural networks that searches, by means of self-organization, to create a map of characteristics that represents the involved samples in a determined problem." (Paulo E Ambrósio, "Artificial Intelligence in Computer-Aided Diagnosis", 2009)

"Self-organizing maps (SOMs) are a data visualization technique which reduce the dimensions of data through the use of self-organizing neural networks." (Lluís Formiga & Francesc Alías, "GTM User Modeling for aIGA Weight Tuning in TTS Synthesis", Encyclopedia of Artificial Intelligence, 2009)

"SOFM [self-organizing feature map] is a data mining method used for unsupervised learning. The architecture consists of an input layer and an output layer. By adjusting the weights of the connections between input and output layer nodes, this method identifies clusters in the data." (Indranil Bose, "Data Mining in Tourism", 2009)

"The self-organizing map is a subtype of artificial neural networks. It is trained using unsupervised learning to produce low dimensional representation of the training samples while preserving the topological properties of the input space. The self-organizing map is a single layer feed-forward network where the output syntaxes are arranged in low dimensional (usually 2D or 3D) grid. Each input is connected to all output neurons. Attached to every neuron there is a weight vector with the same dimensionality as the input vectors. The number of input dimensions is usually a lot higher than the output grid dimension. SOMs are mainly used for dimensionality reduction rather than expansion." (Larbi Esmahi et al, "Adaptive Neuro-Fuzzy Systems", Encyclopedia of Artificial Intelligence, 2009)

"A type of neural network that uses unsupervised learning to produce two-dimensional representations of an input space." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The Self-organizing map is a non-parametric and non-linear neural network that explores data using unsupervised learning. The SOM can produce output that maps multidimensional data onto a two-dimensional topological map. Moreover, since the SOM requires little a priori knowledge of the data, it is an extremely useful tool for exploratory analyses. Thus, the SOM is an ideal visualization tool for analyzing complex time-series data." (Peter Sarlin, "Visualizing Indicators of Debt Crises in a Lower Dimension: A Self-Organizing Maps Approach", 2012)

"SOMs or Kohonen networks have a grid topology, with unequal grid weights. The topology of the grid provides a low dimensional visualization of the data distribution." (Siddhartha Bhattacharjee et al, "Quantum Backpropagation Neural Network Approach for Modeling of Phenol Adsorption from Aqueous Solution by Orange Peel Ash", 2013)

"An unsupervised neural network widely used in exploratory data analysis and to visualize multivariate object relationships." (Manuel Martín-Merino, "Semi-Supervised Dimension Reduction Techniques to Discover Term Relationships", 2015)

"ANN used for visualizing low-dimensional views of high-dimensional data." (Pablo Escandell-Montero et al, "Artificial Neural Networks in Physical Therapy", 2015)

"Is a unsupervised learning ANN, which means that no human intervention is needed during the learning and that little needs to be known about the characteristics of the input data." (Nuno Pombo et al, "Machine Learning Approaches to Automated Medical Decision Support Systems", 2015)

"A kind of artificial neural network which attempts to mimic brain functions to provide learning and pattern recognition techniques. SOM have the ability to extract patterns from large datasets without explicitly understanding the underlying relationships. They transform nonlinear relations among high dimensional data into simple geometric connections among their image points on a low-dimensional display." (Felix Lopez-Iturriaga & Iván Pastor-Sanz, "Using Self Organizing Maps for Banking Oversight: The Case of Spanish Savings Banks", 2016)

"Neural network which simulated some cerebral functions in elaborating visual information. It is usually used to classify a large amount of data." (Gaetano B Ronsivalle & Arianna Boldi, "Artificial Intelligence Applied: Six Actual Projects in Big Organizations", 2019)

"Classification technique based on unsupervised-learning artificial neural networks allowing to group data into clusters." Julián Sierra-Pérez & Joham Alvarez-Montoya, "Strain Field Pattern Recognition for Structural Health Monitoring Applications", 2020)

"It is a type of artificial neural network (ANN) trained using unsupervised learning for dimensionality reduction by discretized representation of the input space of the training samples called as map." (Dinesh Bhatia et al, "A Novel Artificial Intelligence Technique for Analysis of Real-Time Electro-Cardiogram Signal for the Prediction of Early Cardiac Ailment Onset", 2020)

"Being a particular type of ANNs, the Self Organizing Map is a simple mapping from inputs: attributes directly to outputs: clusters by the algorithm of unsupervised learning. SOM is a clustering and visualization technique in exploratory data analysis." (Yuh-Wen Chen, "Social Network Analysis: Self-Organizing Map and WINGS by Multiple-Criteria Decision Making", 2021)

11 May 2018

🔬Data Science: K-Means Algorithm (Definitions)

"A top-down grouping method where the number of clusters is defined prior to grouping." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"An algorithm used to assign K centers to represent the clustering of N points (K< N). The points are iteratively adjusted so that each of the N points is assigned to one of the K clusters, and each of the K clusters is the mean of its assigned points." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"The k-means algorithm is an algorithm to cluster n objects based on attributes into k partitions, k = n. The algorithm minimizes the total intra-cluster variance or the squared error function." (Dimitrios G Tsalikakis et al, "Segmentation of Cardiac Magnetic Resonance Images", 2009)

"The k-means algorithm assigns any number of data objects to one of k clusters." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"The clustering algorithm that divides a dataset into k groups such that the members in each group are as similar as possible, that is, closest to one another." (David Natingga, "Data Science Algorithms in a Week" 2nd Ed., 2018)

"K-Means is a technique for clustering. It works by randomly placing K points, called centroids, and iteratively moving them to minimize the squared distance of elements of a cluster to their centroid." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"It is an iterative algorithm that partition the hole data set into K non overlaping subsets (Clusters). Each data point belongs to only one subset." (Aman Tyagi, "Healthcare-Internet of Things and Its Components: Technologies, Benefits, Algorithms, Security, and Challenges", 2021)

[Non-scalable K-means:] "A Microsoft Clustering algorithm method that uses a distance measure to assign a data point to its closest cluster." (Microsoft Technet)

"An algorithm that places each value in the cluster with the nearest mean, and in which clusters are formed by minimizing the within-cluster deviation from the mean." (Microsoft, "SSAS Glossary")

08 May 2018

🔬Data Science: Cluster Analysis (Definitions)

"Generally, cluster analysis, or clustering, comprises a wide array of mathematical methods and algorithms for grouping similar items in a sample to create classifications and hierarchies through statistical manipulation of given measures of samples from the population being clustered. (Hannu Kivijärvi et al, "A Support System for the Strategic Scenario Process", 2008)

"Defining groups based on the 'degree' to which an item belongs in a category. The degree may be determined by indicating a percentage amount." (Mary J Lenard & Pervaiz Alam, "Application of Fuzzy Logic to Fraud Detection", 2009)

"A technique that identifies homogenous subgroups or clusters of subjects or study objects." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"A statistical technique for finding natural groupings in data; it can also be used to assign new cases to groupings or categories." (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)

"Techniques for organizing data into groups of similar cases." (Meta S Brown, "Data Mining For Dummies", 2014)

"A statistical technique whereby data or objects are classified into groups (clusters) that are similar to one another but different from data or objects in other clusters." (Soraya Sedkaoui, "Big Data Analytics for Entrepreneurial Success", 2018)

"Clustering or cluster analysis is a set of techniques of multivariate data analysis aimed at selecting and grouping homogeneous elements in a data set. Clustering techniques are based on measures relating to the similarity between the elements. In many approaches this similarity, or better, dissimilarity, is designed in terms of distance in a multidimensional space. Clustering algorithms group items on the basis of their mutual distance, and then the belonging to a set or not depends on how the element under consideration is distant from the collection itself." (Crescenzio Gallo, "Building Gene Networks by Analyzing Gene Expression Profiles", 2018)

"A type of an unsupervised learning that aims to partition a set of objects in such a way that objects in the same group (called a cluster) are more similar, whereas characteristics of objects assigned into different clusters are quite distinct." (Timofei Bogomolov et al, "Identifying Patterns in Fresh Produce Purchases: The Application of Machine Learning Techniques", 2020)

"Cluster analysis is the process of identifying objects that are similar to each other and cluster them in order to understand the differences as well as the similarities within the data." (Analytics Insight)

05 May 2018

🔬Data Science: Clustering (Definitions)

"Grouping of similar patterns together. In this text the term 'clustering' is used only for unsupervised learning problems in which the desired groupings are not known in advance." (Laurene V Fausett, "Fundamentals of Neural Networks: Architectures, Algorithms, and Applications", 1994)

"The process of grouping similar input patterns together using an unsupervised training algorithm." (Joseph P Bigus, "Data Mining with Neural Networks: Solving Business Problems from Application Development to Decision Support", 1996)

"Clustering attempts to identify groups of observations with similar characteristics." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"The process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects, which are 'similar' between them and are 'dissimilar' to the objects belonging to other clusters." (Juan R González et al, "Nature-Inspired Cooperative Strategies for Optimization", 2008)

"Grouping the nodes of an ad hoc network such that each group is a self-organized entity having a cluster-head which is responsible for formation and management of its cluster." (Prayag Narula, "Evolutionary Computing Approach for Ad-Hoc Networks", 2009)

"The process of assigning individual data items into groups (called clusters) so that items from the same cluster are more similar to each other than items from different clusters. Often similarity is assessed according to a distance measure." (Alfredo Vellido & Iván Olie, "Clustering and Visualization of Multivariate Time Series", 2010)

"Verb. To output a smaller data set based on grouping criteria of common attributes." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The process of partitioning the data attributes of an entity or table into subsets or clusters of similar attributes, based on subject matter or characteristic (domain)." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A data mining technique that analyzes data to group records together according to their location within the multidimensional attribute space." (SQL Server 2012 Glossary, "Microsoft", 2012)

"Clustering aims to partition data into groups called clusters. Clustering is usually unsupervised in the sense that the training data is not labeled. Some clustering algorithms require a guess for the number of clusters, while other algorithms don't." (Ivan Idris, "Python Data Analysis", 2014)

"Form of data analysis that groups observations to clusters. Similar observations are grouped in the same cluster, whereas dissimilar observations are grouped in different clusters. As opposed to classification, there is not a class attribute and no predefined classes exist." (Efstathios Kirkos, "Composite Classifiers for Bankruptcy Prediction", 2014)

"Organization of data in some semantically meaningful way such that each cluster contains related data while the unrelated data are assigned to different clusters. The clusters may not be predefined." (Sanjiv K Bhatia & Jitender S Deogun, "Data Mining Tools: Association Rules", 2014)

"Techniques for organizing data into groups of similar cases." (Meta S Brown, "Data Mining For Dummies", 2014)

[cluster analysis:] "A technique that identifies homogenous subgroups or clusters of subjects or study objects." (K N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"Clustering is a classification technique where similar kinds of objects are grouped together. The similarity between the objects maybe determined in different ways depending upon the use case. Therefore, clustering in measurement space may be an indicator of similarity of image regions, and may be used for segmentation purposes." (Shiwangi Chhawchharia, "Improved Lymphocyte Image Segmentation Using Near Sets for ALL Detection", 2016)

"Clustering techniques share the goal of creating meaningful categories from a collection of items whose properties are hard to directly perceive and evaluate, which implies that category membership cannot easily be reduced to specific property tests and instead must be based on similarity. The end result of clustering is a statistically optimal set of categories in which the similarity of all the items within a category is larger than the similarity of items that belong to different categories." (Robert J Glushko, "The Discipline of Organizing: Professional Edition" 4th Ed., 2016)

[cluster analysis:]"A statistical technique for finding natural groupings in data; it can also be used to assign new cases to groupings or categories." (Jonathan Ferrar et al, "The Power of People", 2017)

"Unsupervised learning or clustering is a way of discovering hidden structure in unlabeled data. Clustering algorithms aim to discover latent patterns in unlabeled data using features to organize instances into meaningfully dissimilar groups." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"The term clustering refers to the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters." (Satyadhyan Chickerur et al, "Forecasting the Demand of Agricultural Crops/Commodity Using Business Intelligence Framework", 2019)

"In the machine learning context, clustering is the task of grouping examples into related groups. This is generally an unsupervised task, that is, the algorithm does not use preexisting labels, though there do exist some supervised clustering algorithms." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"A cluster is a group of data objects which have similarities among them. It's a group of the same or similar elements gathered or occurring closely together." (Hari K Kondaveeti et al, "Deep Learning Applications in Agriculture: The Role of Deep Learning in Smart Agriculture", 2021)

"Clustering describes an unsupervised machine learning technique for identifying structures among unstructured data. Clustering algorithms group sets of similar objects into clusters, and are widely used in areas including image analysis, information retrieval, and bioinformatics." (Accenture)

"Describes an unsupervised machine learning technique for identifying structures among unstructured data. Clustering algorithms group sets of similar objects into clusters, and are widely used in areas including image analysis, information retrieval, and bioinformatics." (Accenture)

"The process of identifying objects that are similar to each other and cluster them in order to understand the differences as well as the similarities within the data." (Analytics Insight)

26 November 2011

📉Graphical Representation: Cluster (Just the Quotes)

"To the untrained eye, randomness appears as regularity or tendency to cluster." (William Feller, "An Introduction to Probability Theory and its Applications", 1950)

"Sometimes clusters of variables tend to vary together in the normal course of events, thereby rendering it difficult to discover the magnitude of the independent effects of the different variables in the cluster. And yet it may be most desirable, from a practical as well as scientific point of view, to disentangle correlated describing variables in order to discover more effective policies to improve conditions. Many economic indicators tend to move together in response to underlying economic and political events." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"The logarithmic transformation serves several purposes: (1) The resulting regression coefficients sometimes have a more useful theoretical interpretation compared to a regression based on unlogged variables. (2) Badly skewed distributions - in which many of the observations are clustered together combined with a few outlying values on the scale of measurement - are transformed by taking the logarithm of the measurements so that the clustered values are spread out and the large values pulled in more toward the middle of the distribution. (3) Some of the assumptions underlying the regression model and the associated significance tests are better met when the logarithm of the measured variables is taken." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"The scatterplot is a useful exploratory method for providing a first look at bivariate data to see how they are distributed throughout the plane, for example, to see clusters of points, outliers, and so forth." (William S Cleveland, "Visualizing Data", 1993)

"Multivariate techniques often summarize or classify many variables to only a few groups or factors (e.g., cluster analysis or multi-dimensional scaling). Parallel coordinate plots can help to investigate the influence of a single variable or a group of variables on the result of a multivariate procedure. Plotting the input variables in a parallel coordinate plot and selecting the features of interest of the multivariate procedure will show the influence of different input variables." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009)

"Parallel coordinate plots are often overrated concerning their ability to depict multivariate features. Scatterplots are clearly superior in investigating the relationship between two continuous variables and multivariate outliers do not necessarily stick out in a parallel coordinate plot. Nonetheless, parallel coordinate plots can help to find and understand features such as groups/clusters, outliers and multivariate structures in their multivariate context. The key feature is the ability to select and highlight individual cases or groups in the data, and compare them to other groups or the rest of the data." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009)

"Be careful not to confuse clustering and stratification. Even though both of these sampling strategies involve dividing the population into subgroups, both the way in which the subgroups are sampled and the optimal strategy for creating the subgroups are different. In stratified sampling, we sample from every stratum, whereas in cluster sampling, we include only selected whole clusters in the sample. Because of this difference, to increase the chance of obtaining a sample that is representative of the population, we want to create homogeneous groups for strata and heterogeneous (reflecting the variability in the population) groups for clusters." (Roxy Peck et al, "Introduction to Statistics and Data Analysis" 4th Ed., 2012)

"Linking is a powerful dynamic interactive graphics technique that can help us better understand high-dimensional data. This technique works in the following way: When several plots are linked, selecting an observation's point in a plot will do more than highlight the observation in the plot we are interacting with - it will also highlight points in other plots with which it is linked, giving us a more complete idea of its value across all the variables. Selecting is done interactively with a pointing device. The point selected, and corresponding points in the other linked plots, are highlighted simultaneously. Thus, we can select a cluster of points in one plot and see if it corresponds to a cluster in any other plot, enabling us to investigate the high-dimensional shape and density of the cluster of points, and permitting us to investigate the structure of the disease space." (Forrest W Young et al, "Visual Statistics: Seeing data with dynamic interactive graphics", 2016)

"Dimensionality reduction is a way of reducing a large number of different measures into a smaller set of metrics. The intent is that the reduced metrics are a simpler description of the complex space that retains most of the meaning. […] Clustering techniques are similarly useful for reducing a large number of items into a smaller set of groups. A clustering technique finds groups of items that are logically near each other and gathers them together." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)

"Important features to look for in a scatter plot are whether there is one cloud of dots or several clusters, whether there is an upward or downward slope to the cloud of dots, and whether there is any curvature to the slope." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"[...] scatterplots had advantages over earlier graphic forms: the ability to see clusters, patterns, trends, and relations in a cloud of points. Perhaps most importantly, it allowed the addition of visual annotations (point symbols, lines, curves, enclosing contours, etc.) to make those relationships more coherent and tell more nuanced stories." (Michael Friendly & Howard Wainer, "A History of Data Visualization and Graphic Communication", 2021)

20 August 2009

🛢DBMS: Clustering (Definitions)

"The use of multiple computers to provide increased reliability, capacity, and management capabilities." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

[federated cluster:] "A grouping of SQL servers used together to achieve scalability by employing a distributed partition view. A federated cluster is not used for availability, only for achieving scalability through scale out." (Allan Hirt et al, "Microsoft SQL Server 2000 High Availability", 2004)

"A method of keeping database files physically close to one another on the storage media for improving performance through sequential pre-fetch operations." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"(1) The condition whereby data is physically ordered contiguously by a specified key (usually implemented by means of an index). (2) The use of multiple, 'independent' computing systems working together to form what appears to users as a single highly available system." (Craig S Mullins, "Database Administration: The Complete Guide to DBA Practices and Procedures" 2nd Ed, 2012)

"The tendency of elements to become unevenly distributed in the hash table, with many adjacent locations containing elements" (Nell Dale et al, "Object-Oriented Data Structures Using Java 4th Ed.", 2016)

21 January 2009

🛢DBMS: Clustered Index [CI[ (Definitions)

"An index in which the physical order and the logical (indexed) order is the same. The leaf level of a clustered index represents the data pages themselves." (Karen Paulsell et al, "Sybase SQL Server: Performance and Tuning Guide", 1996)

"A clustered index forces the rows in a table to be physically stored in sorted order, using one or more columns from the table to sort the rows. A table may have only one clustered index, such as a dictionary." (Owen Williams, "MCSE TestPrep: SQL Server 6.5 Design and Implementation", 1998)

"An index that the DBMS uses to determine the order of data rows, according to values in one or more columns, called the cluster key. With a strong-clustered index, the data pages are the index's leaves and are thus always in order. With a weak-clustered index, data pages are separate from index leaf pages and the rows need not be 100% in order. The terms weak clustered index and strong-clustered index are not common usage; they appear only in this book." (Peter Gulutzan & Trudy Pelzer, "SQL Performance Tuning", 2002)

"An index in which the logical order of the key values determines the physical order of the corresponding rows in a table." (Anthony Sequeira & Brian Alderman, "The SQL Server 2000 Book", 2003)

"A clustered index in SQL Server is a type of index in which the logical order of key values determines the actual data rows; thereby the data rows are kept sorted. Using a clustered index causes the actual data rows to move into the leaf level of the index." (Thomas Moore, "EXAM CRAM™ 2: Designing and Implementing Databases with SQL Server 2000 Enterprise Edition", 2005)

"This is an index that physically rearranges the data that is inserted into your tables." (Joseph L Jorden & Dandy Weyn, "MCTS Microsoft SQL Server 2005: Implementation and Maintenance Study Guide - Exam 70-431", 2006)

"An index whose leaf level is the actual data page of the table." (Sara Morganand & Tobias Thernstrom , "MCITP Self-Paced Training Kit : Designing and Optimizing Data Access by Using Microsoft SQL Server 2005 - Exam 70-442", 2007)

"An index in which the logical order of the key values determines the physical order of the corresponding rows in a table." (Microsoft, "SQL Server 2012 Glossary", 2012)

"This is an index that contains a table’s row data in its leaf-level nodes." (Jay Natarajan et al, "Pro T-SQL 2012 Programmer's Guide" 3rd Ed, 2012)

"An index that contains a table’s row data in its leaf-level nodes." (Miguel Cebollero et al, "Pro T-SQL Programmer’s Guide" 4th Ed, 2015)

"An index whose sequence of key values closely corresponds to the sequence of rows stored in a table. The degree of correspondence is measured by statistics that are used by the optimizer." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

SQL Troubles

Pages