SQL Troubles: groups


08 March 2025

🏭🎗️🗒️Microsoft Fabric: Eventstreams [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 8-Mar-2025

Real-Time Intelligence architecture [4]

[Microsoft Fabric] Eventstream(s)

{def} feature in Microsoft Fabric's Real-Time Intelligence experience that allows users to bring real-time events into Fabric, transform them, and then route them to various destinations without writing any code

⇐ aka no-code solution
{feature} drag and drop experience

gives users an intuitive and easy way to create their event data processing, transformation, and routing logic without writing any code

works by creating a pipeline of events from multiple internal and external sources to different destinations

a conveyor belt that moves data from one place to another [1]
transformations to the data can be added along the way [1]

filtering, aggregating, or enriching

{def} eventstream

an instance of the Eventstream item in Fabric [2]
{feature} end-to-end data flow diagram

provide a comprehensive understanding of the data flow and organization [2].

{feature} eventstream visual editor

used to design pipelines by dragging and dropping different nodes [1]
sources

where event data comes from
one can choose

the source type
the data format
the consumer group

Azure Event Hubs

allows getting event data from an Azure event hub [1]
allows creating a cloud connection with the appropriate authentication and privacy level [1]

Azure IoT Hub

SaaS service used to connect, monitor, and manage IoT assets with a no-code experience [1]

CDC-enabled databases

Change Data Capture (CDC): software process that identifies and tracks changes to data in a database, enabling real-time or near-real-time data movement [1]
Azure SQL Database
PostgreSQL Database
MySQL Database
Azure Cosmos DB

Google Cloud Pub/Sub

messaging service for exchanging event data among applications and services [1]

Amazon Kinesis Data Streams

collect, process, and analyze real-time, streaming data [1]

Confluent Cloud Kafka

fully managed service based on Apache Kafka for stream processing [1]

Fabric workspace events

events triggered by changes in a Fabric workspace

e.g. creating, updating, or deleting items

allows capturing, transforming, and routing events for in-depth analysis and monitoring within Fabric [1]
the integration offers enhanced flexibility in tracking and understanding workspace activities [1]

Azure blob storage events

system triggers for actions like creating, replacing, or deleting a blob [1]

these actions are linked to Fabric events

allowing Blob Storage events to be processed as continuous data streams for routing and analysis within Fabric [1]

support streamed or unstreamed events [1]

custom endpoint

the REST API or SDKs can be used to send event data from a custom app to the eventstream [1]
allows specifying the data format and the consumer group of the custom app [1]

sample data

out-of-box sample data

destinations

where transformed event data is stored.

in a table in an eventhouse or a lakehouse [1]
redirect data to

another eventstream for further processing [1]
an activator to trigger an action [1]

Eventhouse

offers the capability to funnel your real-time event data into a KQL database [1]

Lakehouse

allows preprocessing real-time events before their ingestion into the lakehouse

the events are transformed into Delta Lake format and later stored in specific lakehouse tables [1]

facilitating the data warehousing needs [1]

custom endpoint

directs real-time event traffic to a bespoke application [1]
enables the integration of proprietary applications with the event stream, allowing for the immediate consumption of event data [1]
{scenario} transferring real-time data to an independent system not hosted on Microsoft Fabric [1]

Derived Stream

specialized destination created after applying stream operations like Filter or Manage Fields to an eventstream
represents the altered default stream after processing, which can be routed to various destinations within Fabric and monitored in the Real-Time hub [1]

Fabric Activator

enables triggering automated actions with Fabric Activator based on values in streaming data [1]

transformations

filter or aggregate the data as it is processed from the stream [1]
include common data operations

filtering

filter events based on the value of a field in the input
depending on the data type (number or text), the transformation keeps the values that match the selected condition, such as is null or is not null [1]
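The filter behaviour described above can be sketched in plain Python. This is only an illustration of the idea, not how Eventstream implements it; the helper names (`is_null`, `greater_than`, `filter_events`) and the sample events are assumptions made for the example.

```python
from typing import Any, Callable, Iterable, Iterator

Event = dict[str, Any]

# Hypothetical condition builders mirroring the kinds of conditions named above.
def is_null(field: str) -> Callable[[Event], bool]:
    return lambda e: e.get(field) is None

def is_not_null(field: str) -> Callable[[Event], bool]:
    return lambda e: e.get(field) is not None

def greater_than(field: str, threshold: float) -> Callable[[Event], bool]:
    # Null values never satisfy a numeric comparison.
    return lambda e: e.get(field) is not None and e[field] > threshold

def filter_events(events: Iterable[Event],
                  condition: Callable[[Event], bool]) -> Iterator[Event]:
    """Keep only the events for which the condition holds."""
    return (e for e in events if condition(e))

readings = [
    {"device": "a", "temperature": 21.5},
    {"device": "b", "temperature": None},
    {"device": "c", "temperature": 30.1},
]
hot = list(filter_events(readings, greater_than("temperature", 25)))
missing = list(filter_events(readings, is_null("temperature")))
```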

joining

transformation that combines data from two streams based on a matching condition between them [1]
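As a rough sketch of the matching condition, the snippet below performs an inner join of two in-memory lists of events on a shared key. It deliberately ignores any time-window constraint a streaming join would normally have; the function name `inner_join` and the sample fields are assumptions for illustration.

```python
from collections import defaultdict

def inner_join(left, right, key):
    """Pair every left event with every right event that has the same key value."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    joined = []
    for l in left:
        for r in index.get(l[key], []):
            # On field-name clashes the left event's value wins.
            joined.append({**r, **l})
    return joined

orders = [{"order_id": 1, "customer_id": 10}, {"order_id": 2, "customer_id": 11}]
customers = [{"customer_id": 10, "name": "Ada"}]
result = inner_join(orders, customers, "customer_id")
# result == [{"customer_id": 10, "name": "Ada", "order_id": 1}]
```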

aggregating

calculates an aggregation every time a new event occurs over a period of time [1]

Sum, Minimum, Maximum, or Average

allows renaming calculated columns, and filtering or slicing the aggregation based on other dimensions in your data [1]
one can have one or more aggregations in the same transformation [1]
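One way to picture "an aggregation every time a new event occurs over a period of time" is a trailing window that is re-evaluated on each arrival. The sketch below assumes timestamps in seconds, events arriving in order, and a window of the form (ts - period, ts]; the function name `rolling_aggregate` is made up for the example.

```python
from collections import deque

def rolling_aggregate(events, period_seconds, field):
    """For each (timestamp, event) pair, aggregate `field` over the trailing period."""
    window = deque()
    results = []
    for ts, event in events:  # assumed to arrive in ascending timestamp order
        window.append((ts, event[field]))
        # Evict values that fell out of the trailing period.
        while window and window[0][0] <= ts - period_seconds:
            window.popleft()
        values = [v for _, v in window]
        results.append({
            "ts": ts,
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        })
    return results

stream = [(0, {"v": 1}), (5, {"v": 3}), (12, {"v": 5})]
out = rolling_aggregate(stream, period_seconds=10, field="v")
# sums per arrival: 1, 4, 8 (the event at ts=0 has expired by ts=12)
```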

grouping

allows calculating aggregations across all events within a certain time window [1]

one can group by the values in one or more fields [1]

allows for the renaming of columns

similar to the Aggregate transformation
⇐ provides more options for aggregation and includes more complex options for time windows [1]

allows adding more than one aggregation per transformation [1]

allows defining the logic needed for processing, transforming, and routing event data [1]
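A minimal sketch of grouping with several aggregations, assuming the events passed in already belong to a single time window. The function name `group_and_aggregate` and the field names are hypothetical.

```python
from collections import defaultdict

def group_and_aggregate(window_events, key_fields, value_field):
    """Group events by one or more fields and compute several aggregations per group."""
    groups = defaultdict(list)
    for e in window_events:
        key = tuple(e[f] for f in key_fields)
        groups[key].append(e[value_field])
    return {
        key: {"count": len(vals), "sum": sum(vals), "avg": sum(vals) / len(vals)}
        for key, vals in groups.items()
    }

events = [
    {"city": "Oslo", "sensor": "t1", "value": 2},
    {"city": "Oslo", "sensor": "t1", "value": 4},
    {"city": "Rome", "sensor": "t2", "value": 10},
]
summary = group_and_aggregate(events, key_fields=["city", "sensor"], value_field="value")
# summary[("Oslo", "t1")] == {"count": 2, "sum": 6, "avg": 3.0}
```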

union

allows connecting two or more nodes and adding events with shared fields (with the same name and data type) into one table [1]

fields that don't match are dropped and not included in the output [1]
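The rule "keep shared fields, drop the rest" can be illustrated as follows. The sketch assumes each stream is a non-empty list of dictionaries and infers the schema from the first event of each stream; `union_streams` is a name invented for the example.

```python
def union_streams(*streams):
    """Merge streams, keeping only fields that every stream shares by name and type."""
    def schema(stream):
        # Assumption: the first event is representative of the whole stream.
        return {name: type(value) for name, value in stream[0].items()}

    schemas = [schema(s) for s in streams]
    shared = [
        name for name, typ in schemas[0].items()
        if all(s.get(name) is typ for s in schemas[1:])
    ]
    return [{name: e[name] for name in shared} for s in streams for e in s]

north = [{"id": 1, "temp": 20.5, "site": "north"}]
south = [{"id": 2, "temp": 19.0, "humidity": 40}]
merged = union_streams(north, south)
# merged == [{"id": 1, "temp": 20.5}, {"id": 2, "temp": 19.0}]
```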

expand

array transformation that allows creating a new row for each value within an array [1]
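A small sketch of the expand idea: one output row per array element, with the other fields copied. It assumes that events with a missing or empty array produce no rows, which may differ from the options the actual transformation offers; `expand` is a hypothetical helper.

```python
def expand(events, array_field):
    """Create one row per element of `array_field`, copying the remaining fields."""
    rows = []
    for e in events:
        for value in e.get(array_field) or []:
            row = dict(e)
            row[array_field] = value
            rows.append(row)
    return rows

orders = [{"order": 1, "items": ["a", "b"]}, {"order": 2, "items": []}]
rows = expand(orders, "items")
# rows == [{"order": 1, "items": "a"}, {"order": 1, "items": "b"}]
```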

manage fields

allows adding, removing, changing the data type of, or renaming fields coming in from an input or another transformation [1]
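The four field operations can be sketched on a single event as below. The function `manage_fields` and its parameters are assumptions for illustration only and do not correspond to an Eventstream API.

```python
def manage_fields(event, rename=None, cast=None, remove=(), add=None):
    """Apply remove, change-type, rename, and add operations to one event."""
    rename = rename or {}
    cast = cast or {}
    add = add or {}
    out = {}
    for name, value in event.items():
        if name in remove:
            continue                      # remove the field
        if name in cast:
            value = cast[name](value)     # change the data type
        out[rename.get(name, name)] = value  # rename if requested
    for name, fn in add.items():
        out[name] = fn(event)             # add a new field
    return out

raw = {"temp": "21.5", "dev": "a1", "debug": True}
clean = manage_fields(
    raw,
    rename={"dev": "device"},
    cast={"temp": float},
    remove={"debug"},
    add={"source": lambda e: "sensor-feed"},
)
# clean == {"temp": 21.5, "device": "a1", "source": "sensor-feed"}
```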

temporal windowing functions

enable analyzing data events within discrete time periods [1]
a way to perform operations on the data contained in temporal windows [1]

e.g. aggregating, filtering, or transforming streaming events that occur within a specified time period [1]
allow analyzing streaming data that changes over time [1]

e.g. sensor readings, web-clicks, on-line transactions, etc.
provide great flexibility to keep an accurate record of events as they occur [1]

{type} tumbling windows

divide incoming events into fixed, nonoverlapping intervals based on arrival time [1]

{type} sliding windows

divide events into fixed, overlapping intervals based on time [1]

{type} session windows

divide events into variable-length, nonoverlapping intervals that are separated by gaps of inactivity [1]
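A session window can be sketched as "start a new group whenever the gap since the previous event exceeds a timeout". The example works on bare timestamps and omits any maximum session duration a real engine may enforce; `session_windows` is a hypothetical helper.

```python
def session_windows(timestamps, timeout):
    """Split timestamps into sessions separated by gaps longer than `timeout`."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # gap too long: open a new session
    return sessions

sessions = session_windows([1, 2, 3, 10, 11, 30], timeout=5)
# sessions == [[1, 2, 3], [10, 11], [30]]
```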

{type} hopping windows

differ from tumbling windows in that they model scheduled, overlapping windows [1]
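Because hopping windows overlap, a single event can belong to several of them. The sketch below lists the start times of all windows containing a timestamp, assuming windows start at multiples of the hop size and are half-open, [start, start + size); both conventions are assumptions for illustration.

```python
def hopping_windows(ts, size, hop):
    """Return the start times of every window [start, start + size) that contains ts."""
    start = (ts // hop) * hop   # latest window start at or before ts
    starts = []
    while start > ts - size:    # window still covers ts
        starts.append(start)
        start -= hop
    return sorted(starts)

starts = hopping_windows(ts=12, size=10, hop=5)
# starts == [5, 10], i.e. windows [5, 15) and [10, 20)
```

When the hop equals the size, each event falls into exactly one window, which is the tumbling case.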

{type} snapshot windows

group events that have the same timestamp; unlike the other windowing functions, they don't require a named window function [1]
one can add the System.Timestamp() to the GROUP BY clause [1]
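Grouping by timestamp alone can be illustrated in a few lines. The event shape (`ts`, `value`) and the helper name `snapshot_windows` are assumptions for the example.

```python
from itertools import groupby

def snapshot_windows(events):
    """Group events that share exactly the same timestamp."""
    ordered = sorted(events, key=lambda e: e["ts"])  # stable sort keeps arrival order
    return {
        ts: [e["value"] for e in group]
        for ts, group in groupby(ordered, key=lambda e: e["ts"])
    }

events = [{"ts": 1, "value": "a"}, {"ts": 2, "value": "b"}, {"ts": 1, "value": "c"}]
snapshots = snapshot_windows(events)
# snapshots == {1: ["a", "c"], 2: ["b"]}
```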

{parameter} window duration

the length of each window interval [1]
can be in seconds, minutes, hours, and even days [1]

{parameter} window offset

optional parameter that shifts the start and end of each window interval by a specified amount of time [1]
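How duration and offset interact can be seen in a small calculation for a tumbling window. The sketch assumes integer timestamps in seconds and half-open intervals [start, end); `tumbling_window` is a hypothetical helper.

```python
def tumbling_window(ts, duration, offset=0):
    """Return the (start, end) of the fixed, nonoverlapping window containing ts."""
    start = ((ts - offset) // duration) * duration + offset
    return start, start + duration

plain = tumbling_window(125, duration=60)               # (120, 180)
shifted = tumbling_window(125, duration=60, offset=15)  # (75, 135)
```

With a 15-second offset the window boundaries move from 0, 60, 120, ... to 15, 75, 135, ..., so the same event lands in a different interval.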

{concept} grouping key

one or more columns in the event data used to group the data [1]

aggregation function

one or more of the functions applied to each group of events in each window [1]

where the counts, sums, averages, min/max, and even custom functions become useful [1]

see the event data flowing through the pipeline in real-time [1]
handles the scaling, reliability, and security of the eventstream automatically [1]

no need to write any code or manage any infrastructure [1]

{feature} eventstream editing canvas

used to

add and manage sources and destinations [1]
see the event data [1]
check the data insights [1]
view logs for each source or destination [1]

{feature} Apache Kafka endpoint on the Eventstream item

{benefit} enables users to connect and consume streaming events through the Kafka protocol [2]

applications using the protocol can send or receive streaming events with specific topics [2]
requires updating the connection settings to use the Kafka endpoint provided in the Eventstream [2]
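As an illustration of what "updating the connection settings" might look like, the sketch below assembles client settings in the librdkafka-style key format. It assumes the endpoint follows the Event Hubs-style convention (SASL_SSL with the PLAIN mechanism and the literal user name `$ConnectionString`); the placeholder values are not real, and the actual bootstrap server, topic, and key should be copied from the Eventstream item.

```python
def kafka_client_settings(bootstrap_server: str, connection_string: str) -> dict[str, str]:
    """Build Kafka client settings (assumed Event Hubs-style authentication)."""
    return {
        "bootstrap.servers": bootstrap_server,
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        "sasl.username": "$ConnectionString",  # literal value, not a variable
        "sasl.password": connection_string,
    }

# Placeholder values; replace with the ones shown for the Kafka endpoint.
settings = kafka_client_settings("<bootstrap-server>:9093", "<connection-string>")
```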

{feature} support runtime logs and data insights for the connector sources in Live View mode [3]

allows examining detailed logs generated by the connector engines for the specific connector [3]

help with identifying failure causes or warnings [3]
⇐ accessible in the bottom pane of an eventstream by selecting the relevant connector source node on the canvas in Live View mode [3]

{feature} support data insights for the connector sources in Live View mode [3]
{feature} integrates eventstreams with CI/CD tools

{benefit} developers can efficiently build and maintain eventstreams from end-to-end in a web-based environment, while ensuring source control and smooth versioning across projects [3]

{feature} REST APIs

allow automating and managing eventstreams programmatically

{benefit} simplifies CI/CD workflows and makes it easier to integrate eventstreams with external applications [3]

{recommendation} use the eventstreams feature with at least an F4 SKU [2]
{limitation} maximum message size: 1 MB [2]
{limitation} maximum retention period of event data: 90 days [2]
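A producer can guard against the message size limit before sending. The sketch below assumes that 1 MB means 1,048,576 bytes and that the limit applies to the serialized payload; both are assumptions, and `encode_event` is a hypothetical helper.

```python
import json

MAX_MESSAGE_BYTES = 1024 * 1024  # assumption: 1 MB interpreted as 1,048,576 bytes

def encode_event(event: dict) -> bytes:
    """Serialize an event to compact JSON and reject payloads above the limit."""
    payload = json.dumps(event, separators=(",", ":")).encode("utf-8")
    if len(payload) > MAX_MESSAGE_BYTES:
        raise ValueError(f"event is {len(payload)} bytes, above the {MAX_MESSAGE_BYTES}-byte limit")
    return payload

small = encode_event({"device": "a1", "temperature": 21.5})
```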


References:

[1] Microsoft Learn (2024) Microsoft Fabric: Use real-time eventstreams in Microsoft Fabric [link]
[2] Microsoft Learn (2025) Microsoft Fabric: Fabric Eventstream - overview [link]
[3] Microsoft Learn (2024) Microsoft Fabric: What's new in Fabric event streams? [link]

[4] Microsoft Learn (2025) Real Time Intelligence L200 Pitch Deck [link]

[5] Microsoft Learn (2025) Use real-time eventstreams in Microsoft Fabric [link]

Resources:

[R1] Microsoft Learn (2024) Microsoft Fabric exercises [link]
[R2] Microsoft Fabric Updates Blog (2024) CI/CD – Git Integration and Deployment Pipeline [link]

[R3] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:

API - Application Programming Interface
CDC - Change Data Capture
CI/CD - Continuous Integration/Continuous Delivery
DB - database
IoT - Internet of Things
KQL - Kusto Query Language
RTI - Real-Time Intelligence

SaaS - Software-as-a-Service
SDK - Software Development Kit
SKU - Stock Keeping Unit

13 October 2018

🔭Data Science: Groups (Just the Quotes)

"The object of statistical science is to discover methods of condensing information concerning large groups of allied facts into brief and compendious expressions suitable for discussion. The possibility of doing this is based on the constancy and continuity with which objects of the same species are found to vary." (Sir Francis Galton, "Inquiries into Human Faculty and Its Development, Statistical Methods", 1883)

"Some of the common ways of producing a false statistical argument are to quote figures without their context, omitting the cautions as to their incompleteness, or to apply them to a group of phenomena quite different to that to which they in reality relate; to take these estimates referring to only part of a group as complete; to enumerate the events favorable to an argument, omitting the other side; and to argue hastily from effect to cause, this last error being the one most often fathered on to statistics. For all these elementary mistakes in logic, statistics is held responsible." (Sir Arthur L Bowley, "Elements of Statistics", 1901)

"Statistics may be defined as numerical statements of facts by means of which large aggregates are analyzed, the relations of individual units to their groups are ascertained, comparisons are made between groups, and continuous records are maintained for comparative purposes." (Melvin T Copeland, "Statistical Methods" [in: Harvard Business Studies, Vol. III, Ed. by Melvin T Copeland, 1917])

"'Correlation' is a term used to express the relation which exists between two series or groups of data where there is a causal connection. In order to have correlation it is not enough that the two sets of data should both increase or decrease simultaneously. For correlation it is necessary that one set of facts should have some definite causal dependence upon the other set [...]" (Willard C Brinton, "Graphic Methods for Presenting Facts", 1919)

"The conception of statistics as the study of variation is the natural outcome of viewing the subject as the study of populations; for a population of individuals in all respects identical is completely described by a description of any one individual, together with the number in the group. The populations which are the object of statistical study always display variations in one or more respects. To speak of statistics as the study of variation also serves to emphasise the contrast between the aims of modern statisticians and those of their predecessors." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"An average is a single value which is taken to represent a group of values. Such a representative value may be obtained in several ways, for there are several types of averages. […] Probably the most commonly used average is the arithmetic average, or arithmetic mean." (John R Riggleman & Ira N Frisbee, "Business Statistics", 1938)

"The fact that index numbers attempt to measure changes of items gives rise to some knotty problems. The dispersion of a group of products increases with the passage of time, principally because some items have a long-run tendency to fall while others tend to rise. Basic changes in the demand is fundamentally responsible. The averages become less and less representative as the distance from the period increases." (Anna C Rogers, "Graphic Charts Handbook", 1961)

"Pencil and paper for construction of distributions, scatter diagrams, and run-charts to compare small groups and to detect trends are more efficient methods of estimation than statistical inference that depends on variances and standard errors, as the simple techniques preserve the information in the original data." (William E Deming, "On Probability as Basis for Action" American Statistician Vol. 29 (4), 1975)

"When the distributions of two or more groups of univariate data are skewed, it is common to have the spread increase monotonically with location. This behavior is monotone spread. Strictly speaking, monotone spread includes the case where the spread decreases monotonically with location, but such a decrease is much less common for raw data. Monotone spread, as with skewness, adds to the difficulty of data analysis. For example, it means that we cannot fit just location estimates to produce homogeneous residuals; we must fit spread estimates as well. Furthermore, the distributions cannot be compared by a number of standard methods of probabilistic inference that are based on an assumption of equal spreads; the standard t-test is one example. Fortunately, remedies for skewness can cure monotone spread as well." (William S Cleveland, "Visualizing Data", 1993)

"The central limit theorem […] states that regardless of the shape of the curve of the original population, if you repeatedly randomly sample a large segment of your group of interest and take the average result, the set of averages will follow a normal curve." (Charles Livingston & Paul Voakes, "Working with Numbers and Statistics: A handbook for journalists", 2005)

"Random events often come like the raisins in a box of cereal - in groups, streaks, and clusters. And although Fortune is fair in potentialities, she is not fair in outcomes." (Leonard Mlodinow, "The Drunkard’s Walk: How Randomness Rules Our Lives", 2008)

"[...] statisticians are constantly looking out for missed nuances: a statistical average for all groups may well hide vital differences that exist between these groups. Ignoring group differences when they are present frequently portends inequitable treatment." (Kaiser Fung, "Numbers Rule the World", 2010)

"The issue of group differences is fundamental to statistical thinking. The heart of this matter concerns which groups should be aggregated and which shouldn’t." (Kaiser Fung, "Numbers Rule the World", 2010)

"Be careful not to confuse clustering and stratification. Even though both of these sampling strategies involve dividing the population into subgroups, both the way in which the subgroups are sampled and the optimal strategy for creating the subgroups are different. In stratified sampling, we sample from every stratum, whereas in cluster sampling, we include only selected whole clusters in the sample. Because of this difference, to increase the chance of obtaining a sample that is representative of the population, we want to create homogeneous groups for strata and heterogeneous (reflecting the variability in the population) groups for clusters." (Roxy Peck et al, "Introduction to Statistics and Data Analysis" 4th Ed., 2012)

"If the group is large enough, even very small differences can become statistically significant." (Victor Cohn & Lewis Cope, "News & Numbers: A writer’s guide to statistics" 3rd Ed, 2012)

"Self-selection bias occurs when people choose to be in the data - for example, when people choose to go to college, marry, or have children. […] Self-selection bias is pervasive in 'observational data', where we collect data by observing what people do. Because these people chose to do what they are doing, their choices may reflect who they are. This self-selection bias could be avoided with a controlled experiment in which people are randomly assigned to groups and told what to do." (Gary Smith, "Standard Deviations", 2014)

"Often when people relate essentially the same variable in two different groups, or at two different times, they see this same phenomenon - the tendency of the response variable to be closer to the mean than the predicted value. Unfortunately, people try to interpret this by thinking that the performance of those far from the mean is deteriorating, but it’s just a mathematical fact about the correlation. So, today we try to be less judgmental about this phenomenon and we call it regression to the mean. We managed to get rid of the term 'mediocrity', but the name regression stuck as a name for the whole least squares fitting procedure - and that’s where we get the term regression line." (Richard D De Veaux et al, "Stats: Data and Models", 2016)

"Bias is error from incorrect assumptions built into the model, such as restricting an interpolating function to be linear instead of a higher-order curve. [...] Errors of bias produce underfit models. They do not fit the training data as tightly as possible, were they allowed the freedom to do so. In popular discourse, I associate the word 'bias' with prejudice, and the correspondence is fairly apt: an apriori assumption that one group is inferior to another will result in less accurate predictions than an unbiased one. Models that perform lousy on both training and testing data are underfit." (Steven S Skiena, "The Data Science Design Manual", 2017)

"To be any good, a sample has to be representative. A sample is representative if every person or thing in the group you’re studying has an equally likely chance of being chosen. If not, your sample is biased. […] The job of the statistician is to formulate an inventory of all those things that matter in order to obtain a representative sample. Researchers have to avoid the tendency to capture variables that are easy to identify or collect data on - sometimes the things that matter are not obvious or are difficult to measure." (Daniel J Levitin, "Weaponized Lies", 2017)

"If you study one group and assume that your results apply to other groups, this is extrapolation. If you think you are studying one group, but do not manage to obtain a representative sample of that group, this is a different problem. It is a problem so important in statistics that it has a special name: selection bias. Selection bias arises when the individuals that you sample for your study differ systematically from the population of individuals eligible for your study." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

03 November 2011

📉Graphical Representation: Groups (Just the Quotes)

"Bar-charts are most flexible and can be varied to suit the individual whims of the maker. In general, however, there is one style or form which will be found most satisfactory. It consists of a horizontal grouping of bars alongside of the data. The chart is arranged in tabular form, with items or stubs in a column to the left, with figures in a column beside the stubs and with bars in a column beside the figures. Several columns of figures are sometimes desirable, just as in the table of data, to show sources or original figures from which the charted figures are obtained. In any case, the bars should represent the most important set or column of figures, and there should be normally but one column of bars." (Karl G Karsten, "Charts and Graphs", 1925)

"The basic principle which should be observed in designing tables is that of grouping related data, either by the use of space or, if necessary, rules. Items which are close together will be seen as being more closely related than items which are farther apart, and the judicious use of space is therefore vitally important. Similarly, ruled lines can be used to relate and divide information, and it is important to be sure which function is required. Rules should not be used to create closed compartments; this is time-wasting and it interferes with scanning." (Linda Reynolds & Doig Simmonds, "Presentation of Data in Science" 4th Ed, 1984)

"The space between columns, on the other hand, should be just sufficient to separate them clearly, but no more. The columns should not, under any circumstances, be spread out merely to fill the width of the type area. […] Sometimes, however, it is difficult to avoid undesirably large gaps between columns, particularly where the data within any given column vary considerably in length. This problem can sometimes be solved by reversing the order of the columns […]. In other instances the insertion of additional space after every fifth entry or row can be helpful, […] but care must be taken not to imply that the grouping has any special meaning." (Linda Reynolds & Doig Simmonds, "Presentation of Data in Science" 4th Ed, 1984)

"Scatter charts show the relationships between information, plotted as points on a grid. These groupings can portray general features of the source data, and are useful for showing where correlationships occur frequently. Some scatter charts connect points of equal value to produce areas within the grid which consist of similar features." (Bruce Robertson, "How to Draw Charts & Diagrams", 1988)

"A good chart delineates and organizes information. It communicates complex ideas, procedures, and lists of facts by simplifying, grouping, and setting and marking priorities. By spatial organization, it should lead the eye through information smoothly and efficiently." (Mary H Briscoe, "Preparing Scientific Illustrations: A guide to better posters, presentations, and publications" 2nd ed., 1995)

"Grouped area graphs sometimes cause confusion because the viewer cannot determine whether the areas for the data series extend down to the zero axis. […] Grouped area graphs can handle negative values somewhat better than stacked area graphs but they still have the problem of all or portions of data curves being hidden by the data series towards the front." (Robert L Harris, "Information Graphics: A Comprehensive Illustrated Reference", 1996)

"When analyzing data it is many times advantageous to generate a variety of graphs using the same data. This is true whether there is little or lots of data. Reasons for this are: (1) Frequently, all aspects of a group of data can not be displayed on a single graph. (2) Multiple graphs generally result in a more in-depth understanding of the information. (3) Different aspects of the same data often become apparent. (4) Some types of graphs cause certain features of the data to stand out better (5) Some people relate better to one type of graph than another." (Robert L Harris, "Information Graphics: A Comprehensive Illustrated Reference", 1996)

"If you want to hide data, try putting it into a larger group and then use the average of the group for the chart. The basis of the deceit is the endearingly innocent assumption on the part of your readers that you have been scrupulous in using a representative average: one from which individual values do not deviate all that much. In scientific or statistical circles, where audiences tend to take less on trust, the 'quality' of the average (in terms of the scatter of the underlying individual figures) is described by the standard deviation, although this figure is itself an average." (Nicholas Strange, "Smoke and Mirrors: How to bend facts and figures to your advantage", 2007)

"We tend automatically to think of all the categories represented on the horizontal axis of a column Chart as being equally important. They vary of course on the value axis. Otherwise, there would be little point in the chart, but there is somehow this feeling that they are in other respects similar members of a group. This convention can be put to good use to manipulate the message of the most boring bar or column chart." (Nicholas Strange, "Smoke and Mirrors: How to bend facts and figures to your advantage", 2007)

"Where there is no natural ordering to the categories it can be helpful to order them by size, as this can help you to pick out any patterns or compare the relative frequencies across groups. As it can be difficult to discern immediately the numbers represented in each of the categories it is good practice to include the number of observations on which the chart is based, together with the percentages in each category." (Jenny Freeman et al, "How to Display Data", 2008)

"Grouping charts according to a theme and in sequence with the message and putting them all on the same sheet or slide helps you find the thread of the message (even if the charts are separated again later)." (Jorge Camões, "Data at Work: Best practices for creating effective charts and information graphics in Microsoft Excel", 2016)

"The law of connectivity tells us that objects connected to other objects tend to be seen as a group. […] The law of common fate tells us that objects moving in the same direction are seen as a group." (Jorge Camões, "Data at Work: Best practices for creating effective charts and information graphics in Microsoft Excel", 2016)

"The law of continuity states that we interpret images so as not to generate abrupt transitions or otherwise create images that are more complex. […] we can arbitrarily fill in the missing elements to complete a pattern. It’s also the case of time series, in which we assume that data points in the future will be a smooth continuation of the past. […] In a line chart, those series with a similar slope (that is, they appear to follow the same direction) are understood as belonging to the same group." (Jorge Camões, "Data at Work: Best practices for creating effective charts and information graphics in Microsoft Excel", 2016)

"The law of segregation tells us that objects within a closed shape are seen as a group. A frame around objects (charts or legends, for example) has this function, but it’s also useful for adding visual annotations." (Jorge Camões, "Data at Work: Best practices for creating effective charts and information graphics in Microsoft Excel", 2016)

"A histogram represents the frequency distribution of the data. Histograms are similar to bar charts but group numbers into ranges. Also, a histogram lets you show the frequency distribution of continuous data. This helps in analyzing the distribution (for example, normal or Gaussian), any outliers present in the data, and skewness." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"Another problem is that while data visualizations may appear to be objective, the designer has a great deal of control over the message a graphic conveys. Even using accurate data, a designer can manipulate how those data make us feel. She can create the illusion of a correlation where none exists, or make a small difference between groups look big." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)
