SQL Troubles: optimization

Showing posts with label optimization. Show all posts

16 April 2025

🧮ERP: Implementations (Part XIV: A Never-Ending Story)

ERP Implementations Series

An ERP implementation is occasionally considered as a one-time endeavor after which an organization will live happily ever after. In an ideal world that would be true, though the work never stops – things that were carved out from the implementation, optimizations, new features, new regulations, new requirements, integration with other systems, etc. An implementation is thus just the beginning from what it comes and it's essential to get the foundation right – and that’s the purpose of the ERP implementation – provide a foundation on which something bigger and solid can be erected.

No matter how well an ERP implementation is managed and executed, respectively how well people work towards the same goals, there’s always something forgotten or carved out from the initial project. Usually, the casual suspects are the integrations with other systems, though there can be also minor or even bigger features that are planned to be addressed later, if the implementation hasn’t consumed already all the financial resources available, as it's usually the case. Some of the topics can be addressed as Change Requests or consolidated on projects of their own.

Even simple integrations can become complex when the processes are poorly designed, and that typically happens more often than people think. It’s not necessarily about the lack of skillset or about the technologies used, but about the degree to which the processes can work in a loosely coupled interconnected manner. Even unidirectional integrations can raise challenges, though everything increases in complexity when the flow of data is bidirectional. Moreover, the complexity increases with each system added to the overall architecture.

Like a sculpture’s manual creation, processes in an ERP implementation form a skeleton that needs chiseling and smoothing until the form reaches the desired optimized shape. However, optimization is not a one-time attempt but a continuous work of exploring what is achievable, what works, what is optimal. Sometimes optimization is an exact science, while other times it’s about (scientifical) experimentation in which theory, ideas and investments are put to good use. However, experimentation tends to be expensive at least in terms of time and effort, and probably these are the main reasons why some organizations don’t even attempt that – or maybe it’s just laziness, pure indifference or self-preservation. In fact, why change something that already works?

Typically, software manufacturers make available new releases on a periodic basis as part of their planning for growth and of attracting more businesses. Each release that touches used functionality typically needs proper evaluation, testing and whatever organizations consider as important as part of the release management process. Ideally, everything should go smoothly though life never ceases to surprise and even a minor release can have an important impact when earlier critical functionality stopped working. Test automation and other practices can make an important difference for organizations, though these require additional effort and investments that usually pay off when done right.

Regulations and other similar requirements must be addressed as they can involve penalties or other risks that are usually worth avoiding. Ideally such requirements should be supported by design, though even then a certain volume of work is involved. Moreover, the business context can change unexpectedly, and further requirements need to be considered eventually.

The work on an ERP system and the infrastructure built around it is a never-ending story. Therefore, organizations must have not only the resources for the initial project, but also what comes after that. Of course, some work can be performed manually, some requirements can be delayed, some risks can be assumed, though the value of an ERP system increases with its extended usage, at least in theory.

Previous Post <<||>> Next Post

17 March 2025

🏭🗒️Microsoft Fabric: V-Order [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 17-Mar-2024

[Microsoft Fabric] V-Order

{def} write time optimization to the parquet file format that enables fast reads under the MF compute engine [2]

all parquet engines can read the files as regular parquet files [2]
results in a smaller and therefore faster files to read [5]

{benefit} improves read performance
{benefit} decreases storage requirements
{benefit} optimizes resources' usage

reduces the compute resources required for reading data

e.g. network bandwidth, disk I/O, CPU usage

still conforms to the open-source Parquet file format [5]

they can be read by non-Fabric tools [5]

delta tables created and loaded by Fabric items automatically apply V-Order

e.g. data pipelines, dataflows, notebooks [5]

delta tables and its features are orthogonal to V-Order [2]

e.g. Z-Order, compaction, vacuum, time travel
table properties and optimization commands can be used to control the v-order of the partitions [2]

compatible with Z-Order [2]
not all files have this optimization applied [5]

e.g. Parquet files uploaded to a Fabric lakehouse, or that are referenced by a shortcut
the files can still be read, the read performance likely won't be as fast as an equivalent Parquet file that's had V-Order applied [5]

required by certain features

[hash encoding] to assign a numeric identifier to each unique value contained in the column [5]

{command} OPTIMIZE

optimizes a Delta table to coalesce smaller files into larger ones [5]
can apply V-Order to compact and rewrite the Parquet files [5]

[warehouse]

works by applying certain operations on Parquet files

special sorting
row group distribution
dictionary encoding
compression

enabled by default
⇒ compute engines require less network, disk, and CPU resources to read data from storage [1]

provides cost efficiency and performance [1]

the effect of V-Order on performance can vary depending on tables' schemas, data volumes, query, and ingestion patterns [1]

fully-compliant to the open-source parquet format [1]

⇐ all parquet engines can read it as regular parquet files [1]

required by certain features

[Direct Lake mode] depends on V-Order

{operation} disable V-Order

causes any new Parquet files produced by the warehouse engine to be created without V-Order optimization [3]
irreversible operation

once disabled, it cannot be enabled again [3]

{scenario} write-intensive warehouses

warehouses dedicated to staging data as part of a data ingestion process [1]

{warning} consider the effect of V-Order on performance before deciding to disable it [1]

{recommendation} test how V-Order affects the performance of data ingestion and queries before deciding to disable it [1]

via ALTER DATABASE CURRENT SET VORDER = OFF; [3]

{operation} check current status

via SELECT name, is_vorder_enabled FROM sys.databases; [post]

{feature} [lakehouse] Load to Table

allows to load a single file or a folder of files to a table [6]
tables are always loaded using the Delta Lake table format with V-Order optimization enabled [6]

[Direct Lake semantic model]

data is prepared for fast loading into memory [5]

makes less demands on capacity resources [5]
results in faster query performance [5]

because less memory needs to be scanned [5]

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) Fabric: Understand V-Order for Microsoft Fabric Warehouse [link]
[2] Microsoft Learn (2024) Delta Lake table optimization and V-Order [link]
[3] Microsoft Learn (2024) Disable V-Order on Warehouse in Microsoft Fabric [link]
[4] Miles Cole (2024) To V-Order or Not: Making the Case for Selective Use of V-Order in Fabric Spark [link]
[5] Microsoft Learn (2024) Understand storage for Direct Lake semantic models [link]

[6] Microsoft Learn (2025] Fabric: Load to Delta Lake table [link]

Resources:
[R1] Serverless.SQL (2024) Performance Analysis of V-Ordering in Fabric Warehouse: On or Off?, by Andy Cutler [link]
[R2] Redgate (2023 Microsoft Fabric: Checking and Fixing Tables V-Order Optimization, by Dennes Torres [link]
[R3] Sandeep Pawar (2023) Checking If Delta Table in Fabric is V-order Optimized [link]

[R4] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:
MF - Microsoft Fabric

15 January 2025

🧭Business Intelligence: Perspectives (Part 23: In between the Many Destinations)

Business Intelligence Series

In too many cases the development of queries, respectively reports or data visualizations (aka artifacts) becomes a succession of drag & drops, formatting, (re)ordering things around, a bit of makeup, configuring a set of parameters, and the desired product is good to go! There seems nothing wrong with this approach as long as the outcomes meet users’ requirements, though it also gives the impression that’s all what the process is about.

Given a set of data entities, usually there are at least as many perspectives into the data as entities’ number. Further perspectives can be found in exceptions and gaps in data, process variations, and the further aspects that can influence an artifact’s logic. All these aspects increase the overall complexity of the artifact, respectively of the development process. One guideline in handling all this is to keep the process in focus, and this starts with requirements’ elicitation and ends with the quality assurance and actual use.

Sometimes, the two words, the processes and their projection into the data and (data) models don’t reflect the reality adequately and one needs to compromise, at least until the gaps are better addressed. Process redesign, data harmonization and further steps need to be upon case considered in multiple iterations that should converge to optimal solutions, at least in theory.

Therefore, in the development process there should be a continuous exploration of the various aspects until an optimum solution is reached. Often, there can be a couple of competing forces that can pull the solution in two or more directions and then compromising is necessary. Especially as part of continuous improvement initiatives there’s the tendency of optimizing locally processes in the detriment of the overall process, with all the consequences resulting from this.

Unfortunately, many of the problems existing in organizations are ill-posed and misunderstood to the degree that in extremis more effort is wasted than the actual benefits. Optimization is a process of putting in balance all the important aspects, respectively of answering with agility to the changing nature of the business and environment. Ignoring the changing nature of the problems and their contexts is a recipe for failure on the long term.

This implies that people in particular and organizations in general need to become and remain aware of the micro and macro changes occurring in organizations. Continuous learning is the key to cope with change. Organizations must learn to compromise and focus on what’s important, achievable and/or probable. Identifying, defining and following the value should be in an organization’s ADN. It also requires pragmatism (as opposed to idealism). Upon case, it may even require to say “no”, at least until the changes in the landscape offer a reevaluation of the various aspects.

One requires a lot from organizations when addressing optimization topics, especially when misalignment or important constraints or challenges may exist. Unfortunately, process related problems don’t always admit linear solutions. The nonlinear aspects are reflected especially when changing the scale, perspective or translating the issues or solutions from one are area to another.

There are probably answers available in the afferent literature or in the approaches followed by other organizations. Reinventing the wheel is part of the game, though invention may require explorations outside of the optimal paths. Conversely, an organization that knows itself has more chances to cope with the challenges and opportunities altogether.

A lot from what organizations do in a consistent manner looks occasionally like inertia, self-occupation, suboptimal or random behavior, in opposition to being self-driven, self-aware, or in self-control. It’s also true that these are ideal qualities or aspects of what organizations should become in time.

Previous Post <<||>> Next Post

07 December 2024

🏭 💠Data Warehousing: Microsoft Fabric (Part VI: SQL Databases for OLTP scenarios) [new feature]

Data Warehousing Series

One interesting announcements at Ignite is the availability in public preview of SQL databases in Microsoft Fabric, "a versatile and developer-friendly transactional database built on the foundation of Azure SQL database". With this Fabric can address besides OLAP also OLTP scenarios, evolving thus from analytics to a data platform [1]. According to the announcement, besides the AI-optimized architectural aspects, the feature makes the SQL Azure simple, autonomous and secure by design [1], and these latest aspects are considered in this post.

Simplicity revolves around the deployment and configuration of databases, the creation of a new database requiring giving a name and the database is created in seconds [1]. It’s a considerable improvement compared with the relatively complex setup needed for on-premise configurations, though sometimes more flexibility in configuration is needed upfront or over database’s lifetime. To get a database ready for testing one can import a sample database or get specific data via data flows and/or pipelines [1]. As development tools one can use Visual Studio Code or SSMS [1], and probably more tools will be available in time.

The integration with both GitHub and Azure DevOps allows to configure each database under source control, which is needed for many scenarios especially when multiple resources make changes to the database objects [1]. Frankly, that’s mainly important during the development phase, respectively in scenarios in which multiple people make in parallel changes to the logic. It will be interesting to see how much overhead or challenges the feature adds to development and how smoothly everything works together!

The most important aspect for many solutions is the replication of data in near-real time to the (open-source) delta parquet format in OneLake and thus making the data available for analytics almost immediately [1]. Probably, from this aspect many cloud-based applications can benefit, even if the performance might not be as good as in other well-established architectures. However, there are many other scenarios in which one needs to maintain and use data for OLTP/OLAP purposes. This invites adequate testing and a good weighting of the advantages and disadvantages involved.

A SQL database is a native item in Fabric, and therefore it utilizes Fabric capacity units like other Fabric workloads [1]. One can use the Fabric SKU estimator (still in private preview) to estimate the costs [2], though it will be interesting to see how cost-effective the solutions are. Probably, especially when the infrastructure is already available outside of Fabric, it will be easier and cost-effective to use the mirroring functionality. One should test and have a better estimator before moving blindly from the existing infrastructure to Fabric.

SQL databases in Fabric are autonomous by design, while allowing to get the best performance and availability by default [1]. High availability is reached through zone redundancy, while performance is achieved by scaling automatically the storage and compute to accommodate the workloads [1]. The auto-optimization capability is achieved with the help of the latest Intelligent Query Processing (IQP) enhancements, respectively the creation of missing indexes to improve query performance [1]. It will be interesting to see how the whole process works, given that the maintenance of indexes usually involves some challenges (e.g. identifying covering indexes, indexes needed only for temporary workloads, duplicated indexes).

SQL databases in Fabric are automatically configured for high availability with zone redundancy, while storage and compute scale automatically to accommodate the user workload [1]. The database is auto-optimized through the latest IQP enhancements while the system creates any missing indexes to improve query performance. All data is replicated to OneLake by default [1]. Finally, the database always receives the latest security updates with auto-patching, while automatic backups help in disaster recovery scenarios [1], which can be of real help for database administrators.

References:
[1] Microsoft Fabric Updates Blog (2024) Announcing SQL database in Microsoft Fabric Public Preview [link]
[2] Microsoft Fabric Updates Blog (2024) Announcing New Recruitment for the Private Preview of Microsoft Fabric SKU Estimator [link]

01 February 2024

🏭🗒️Microsoft Fabric: Delta Tables [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 18-Apr-2024

Delta Table Structure

[Delta Lake] delta table

{definition} table that stores data as a directory of files in the delta lake (DL) and registers table metadata to the metastore within a catalog and schema [1]

⇐ represents a schema abstraction over data files that are stored in Delta format [2]
for each table, the lakehouse stores

a folder containing its Parquet data files
a _delta_Log folder in which transaction details are logged in JSON format

all Microsoft Fabric (MF) experiences generate and consume delta tables [10]

provides interoperability and a unified product experience [10]
some experiences can only write to delta tables, while others can read from it [10]

⇒ all data files for a table in a database are grouped under a common data path [12]

{feature} support ACID transactions

modifications made to a table are logged in its transaction log

enforces serializable isolation for concurrent operations [2]
the logged transactions can be used to retrieve the history of changes made [2]
each transaction creates new Parquet files
deleting a row, doesn't physically delete in the parquet file [9]

{feature} [DL] Delete Vectors

are read as part of the table and indicate which rows to ignore [9]
make it faster to perform deletions because there's no need to re-write the existing parquet files [9]
many deleted rows take more resources to read that file [9]

{feature} support DML

only tables created while the future is available will have all DML published [..]

{feature} support data versioning

multiple versions of each table row can be retrieved from the transaction log [2]

{feature} support time travel
{feature} support for batch and streaming data

delta tables can be used as both sources and destinations for streaming data [2]

{feature} support standard formats and interoperability

via the Parquet format

table consumption

[lakehouse] SQL Endpoint

provides a read-only experience [4]
can be used to query only delta tables via T-SQL [4]

other file formats can not be queried using the SQL endpoint [4]

⇐ the files need to be converted to the delta format [4]

{limitation} doesn't support the full T-SQL surface area of a transactional data warehouse [4]

[Spark] managed table

the table definition in the metastore and the underlying data files are both managed by the Spark runtime for the Fabric lakehouse [4]

[Spark] external table

the relational table definition in the metastore is mapped to an alternative file storage location [4]

the Parquet data files and JSON log files for the table are stored in the Files storage location [4]

[Spark] allows greater control over the creation and management of delta tables [4]
{operation} create (aka create delta table)

defines the table in the metastore for the lakehouse
its data is stored in the underlying Parquet files for the table

the details of mapping the table definition in the metastore to the underlying files are abstracted [4]
⇐ internally, there is also a log file that keeps track of which parquet files, when combined, make up the data that is in the table [4]

the log files are internal and cannot be used directly by other engines [4]

⇐ DF publishes automatically the right log files so that other engines can directly access the right parquet files [..]

after every 10 transactions, a new log file (aka checkpoint) is created automatically and asynchronously [4]

⇐ the file is a summary of all the previous log files [4]
when querying the table, the system needs to read the latest checkpoint and any log files that were created after*

⇐ instead of having to read 105,120 log files, 10 or less files will be read*

the Delta Lake Logs are automatically so that other engines can directly access the right parquet files *

[Apache Spark in a lakehouse] allows greater control of the creation and management of delta tables [4]

via saving a dataframe

{method}save a dataframe as a delta table [4]

⇐ the easiest way to create a delta table
creates both the table schema definition in the metastore and the data files in delta format [4]

{method} create the table definition [4]

creates the table schema in the metastore without saving any data files [4]

{method} save data in delta format without creating a table definition in the metastore [4]

{scenario} persist the results of data transformations performed in Spark in a file format over which a table definition is overlayed later or processed directly by using the delta lake API [4]
modifications made to the data through the delta lake API or in an external table that is subsequently created on the folder will be recorded in the transaction logs [4]
{mode} overwrite

replace the contents of an existing folder with the data in a dataframe [4]

{mode} append

adds rows from a dataframe to an existing folder [4]

Fabric uses an automatic table discovery capability to create the corresponding table metadata in the metastore [4]

via DeltaTableBuilder API

enables to write Spark code to create a table based on specifications

via Spark SQL

[managed table] via CREATE TABLE <table_definition> USING DELTA
[external table] via CREATE TABLE <table_name> USING DELTA LOCATION

the schema of the table is determined by the Parquet files containing the data in the specified location
{scenario} create a table definition

that references data that has already been saved in delta format [4]
based on a folder where data are ingested in the delta format [4]

{operation} update (aka update delta table)
{operation} delete (aka delete delta table)

[managed table] deleting the table deletes the underlying files from the Tables storage location for the lakehouse [4]
[external table] deleting a table from the lakehouse metastore does not delete the associated data files [4]

performance and storage cost efficiency tend to degrade over time

{reason} new data added to the table might skew the data [3]
{reason} batch and streaming data ingestion rates might bring in many small files
{reason} update and delete operations eventually create read overhead

parquet files are immutable by design, so Delta tables adds new parquet files which the changeset, further amplifying the issues imposed by the first two items [3]

{reason} no longer needed data files and log files available in the storage

{recommendation} don’t allow special characters in column names (incl. spaces)
{recommendation} make table and column names business-friendly
{feature} table partitions

{recommendation} use a partitioned folder structure wherever applicable

helps to improve data manageability and query performance
results in faster search for specific data entries thanks to partition pruning/elimination
{best practice} partition data to align with the query patterns [15]
it can dramatically speed up query performance, especially when combined with other performance optimizations [15]

{command} MERGE

allows updating a delta table with advanced conditions [3]

from a source table, view or DataFrame [3]

{limitation} the current algorithm in the open source distribution of Delta Lake isn't fully optimized for handling unmodified rows [3]
[Microsoft Spark Delta] implemented a custom Low Shuffle Merge optimization

unmodified rows are excluded from an expensive shuffling operation that is needed for updating matched rows [3]

{command} OPTIMIZE

consolidates multiple small Parquet files into large file [8]
should be run whenever there are enough small files to justify running the compaction operation [6]

{best practice} run optimization after loading large tables [8]

benefits greatly from the ACID transactions supported [6]
[Delta Lake] predicate filtering

specify predicates to only compact a subset of your data [6]
{scenario} running a compaction job on the same dataset daily [6]

{command} VACUUM

removes old files no longer referenced by a Delta table log [8]

files need to be older than the retention threshold [8]

the default file retention threshold is seven days [8]

shorter retention period impacts Delta's time travel capabilities [8]
{default} historical data can't be delete within the retention threshold [2]

⇐ that's to maintain the consistency in data [2]

{best practice} set a retention interval to at least seven days [8]

⇐ because old snapshots and uncommitted files can still be in use by the concurrent table readers and writers [8]

important to optimize storage cost [8]

{warning} leaning up active files might lead to reader failures or even table corruption if the uncommitted files are removed [8]

{issue} small files

create large metadata transaction logs which cause planning time slowness [6]
result from

big repartition value [6]
if the dataset is partitioned on a high-cardinality column or if there are deeply nested partitions, then more small files will be created [6]
tables that are incrementally updated frequently [6]

files of sizes above 128 MB, and optimally close to 1 GB, improve compression and data distribution across the cluster nodes [8]

{feature} auto compaction

combines small files within Delta table partitions to automatically reduce small file problems [7]
occurs after a write to a table has succeeded and runs synchronously on the cluster that has performed the write [7]
only compacts files that haven’t been compacted previously [7]
only triggered for partitions or tables that have at least a certain number of small files [7]
enabled at the table or session level [7]

{feature}[Delta Lake 1.2] data skipping

the engine takes advantage of minimum and maximum values metadata to provide faster queries

⇐ requires the respective metadata

its effectiveness depends on data's layout [7]

{feature} [Delta Lake 3.0] z-ordering (aka multi-dimensional clustering)

technique to collocate related information in the same set of files [7]

{feature} checkpointing

allows read queries to quickly reconstruct the current state of the table without reading too many files having incremental updates [7]
{default} each checkpoint is written as a single Parquet file [7]
{alternative} [Delta Lake 2.0] multi-part checkpointing

allows splitting the checkpoint into multiple Parquet files [7]

⇒ parallelizes and speeds up writing the checkpoint [7]

{feature} [Delta Lake 3.0] log compaction

reduces the need for frequent checkpoints and minimize the latency spikes caused by them [7]
allows new log compaction files with the format <x>.<y>.compact.json

the files contain the aggregated actions for commit range [x, y]

read support is enabled by default [7]
write support not available yet [7]

will be added in a future version of Delta [7]

{feature} Delta Lake transaction log (aka DeltaLog)

a sequential record of every transaction performed on a table since its creation [15]
central to DL functionality because it is at the core of its important features [15]

incl. ACID transactions, scalable metadata handling, time travel

{goal} enable multiple readers and writers to operate on a given version of a dataset file simultaneously [15]
{goal} provide additional information, to the execution engine for more performant operations [15]

e.g. data skipping indexes

always shows the user a consistent view of the data

⇒ serves as a single source of truth

for each write operation, the data file is always written first, and only when that operation succeeds, a transaction log file is added to the _delta_log folder

⇐ the transaction is only considered complete when the transaction log entry is written successfully [15]

{feature} [Lakehouse] table maintenance

manages efficiently delta tables and keeps them always ready for analytics [8]
performs ad-hoc table maintenance using contextual right-click actions in a delta table within the Lakehouse explorer [8]
applies bin-compaction, v-order, and unreferenced old files cleanup [8]

via Lakehouse >> Tables >> (select table) >>Maintenance >> (select options) >> Run now

a Spark maintenance job is submitted for execution [8]

uses the user identity and table privileges

consumes Fabric capacity of the workspace/user that submitted the job

{constraint} only one maintenance job on a table can be run at any time

if there's a running job on the table, the new one is rejected [8]
jobs on different tables can execute in parallel [8]

running jobs are available in the Monitoring Hub

see "TableMaintenance" text within the activity name column [8]

{best practice} properly designing the table physical structure based on the ingestion frequency and expected read patterns is likely more important than running the optimization commands [3]

Previous Post <<||>> Next Post

Resources:

[1] Microsoft Learn (2023) Data objects in the Databricks lakehouse (link)

[2] Microsoft Learn (2023) Implement medallion lakehouse architecture in Microsoft Fabric (link)

[3] Microsoft Learn (2023) Delta Lake table optimization and V-Order (link)

[4] Microsoft Learn (2023) Work with Delta Lake tables in Microsoft Fabric (link)

[5] Delta Lake (2023) Quickstart (link)

[6] Delta Lake (2023) Delta Lake Small File Compaction with OPTIMIZE (link)
[7] Delta Lake (2023) Optimizations [link]
[8] Microsoft Learn (2023) Use table maintenance feature to manage delta tables in Fabric (link)
[9] Microsoft Fabric Updates Blog (2023) Announcing: Automatic Data Compaction for Fabric Warehouse, by Kevin Conan (link)
[10] Microsoft Learn (2023) Delta Lake table format interoperability (link)

[11] Josep Aguilar-Saborit et al (2020) POLARIS: The Distributed SQL Engine in Azure Synapse, Proceedings of the VLDB Endowment PVLDB 13(12) (link)
[12] Josep Aguilar-Saborit et al (2024), Extending Polaris to Support Transactions (link)
[13] Michael Armbrust et al (2020) Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores, Proceedings of the VLDB Endowment13(12) (link)
[14] Jesús Camacho-Rodríguez et al (2023) LST-Bench: Benchmarking Log-Structured Tables in the Cloud, Proceedings of the ACM on Management of Data (2024), 2 (1) (link)

[15] Bennie Haelen & Dan Davis (2024) Delta Lake: Up and Running Modern Data Lakehouse Architectures with Delta Lake, 2024]

Resources:

[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:

ACID - atomicity, consistency, isolation, durability
API - Application Programming Interface
CRUD - create, read, update, and delete

DL - Delta lake
MF - Microsoft Fabric
JSON - JavaScript Object Notation

27 January 2024

Data Science: Back to the Future I (About Beginnings)

Data Science Series

I've attended again, after several years, a webcast on performance improvement in SQL Server with Claudio Silva, “Writing T-SQL code for the engine, not for you”. The session was great and I really enjoyed it! I recommend it to any data(base) professional, even if some of the scenarios presented should be known already.

It's strange to see the same topics from 20-25 years ago reappearing over and over again despite the advancements made in the area of database engines. Each version of SQL Server brought something new in what concerns the performance, though without some good experience and understanding of the basic optimization and troubleshooting techniques there's little overall improvement for the average data professional in terms of writing and tuning queries!

Especially with the boom of Data Science topics, the volume of material on SQL increased considerably and many discover how easy is to write queries, even if the start might be challenging for some. Writing a query is easy indeed, though writing a performant query requires besides the language itself also some knowledge about the database engine and the various techniques used for troubleshooting and optimization. It's not about knowing in advance what the engine will do - the engine will often surprise you - but about knowing what techniques work, in what cases, which are their advantages and disadvantages, respectively on how they might impact the processing.

Making a parable with writing literature, it's not enough to speak a language; one needs more for becoming a writer, and there are so many levels of mastery! However, in database world even if creativity is welcomed, its role is considerable diminished by the constraints existing in the database engine, the problems to be solved, the time and the resources available. More important, one needs to understand some of the rules and know how to use the building blocks to solve problems and build reliable solutions.

The learning process for newbies focuses mainly on the language itself, while the exposure to complexity is kept to a minimum. For some learners the problems start when writing queries based on multiple tables - what joins to use, in what order, how to structure the queries, what database objects to use for encapsulating the code, etc. Even if there are some guidelines and best practices, the learner must walk the path and experiment alone or in an organized setup.

In university courses the focus is on operators algebras, algorithms, on general database technologies and architectures without much hand on experience. All is too theoretical and abstract, which is acceptable for research purposes, but not for the contact with the real world out there! Probably some labs offer exposure to real life scenarios, though what to cover first in the few hours scheduled for them?

This was the state of art when I started to learn SQL a quarter century ago, and besides the current tendency of cutting corners, the increased confidence from doing some tests, and the eagerness of shouting one’s shaking knowledge and more or less orthodox ideas on the various social networks, nothing seems to have changed! Something did change – the increased complexity of the problems to solve, and, considering the recent technological advances, one can afford now an AI learn buddy to write some code for us based on the information provided in the prompt.

This opens opportunities for learning and growth. AI can be used in the learning process by providing additional curricula for learners to dive deeper in some topics. Moreover, it can help us in time to address the challenges of the ever-increase complexity of the problems.

04 March 2021

💼Project Management: Project Execution (Part IV: Projects' Dynamics II - Motion)

Motion is the action or process of moving or being moved between an initial and a final or intermediate point. From the tinniest endeavors to the movement of the planets and beyond, everything is governed by motion. If the laws of nature seem to reveal an inner structural perfection, the activities people perform are quite often far from perfect, which is acceptable if we consider that (almost) everything is a learning process. What is probably less acceptable is the volume of inefficient motion we can easily categorize sometimes as waste.

The waste associated with motion can take many forms: sorting through a pile of tools to find the right one, searching for information, moving back and forth to reach a destination or achieve a goal, etc. Suboptimal motion can have important effects for an organization resulting in reduced productivity, respectively higher costs.

If for repetitive activities that involve a certain degree of similarity can be found typically a way to optimize the motion, the higher the uncertainty of the steps involved, the more difficult it becomes to optimize it. It’s the case of discovery endeavors in which the path between start and destination can’t be traced beforehand, respectively when the destination or path in between can’t be depicted to the needed level of detail. A strategy’s implementation, ERP implementations and other complex projects, especially the ones dealing with new technologies and/or incomplete knowledge, tend to be exploratory in nature and thus fall under this latter type a motion.

In other words, one must know at minimum the starting point, the destination, how to reach it and what it takes to reach it – resources, knowledge, skillset. When one has all this information one can go on and estimate how long it will take to reach the destination, though the estimate reflects the information available as well estimator’s skills in translating the information into a realistic roadmap. Each new information has the potential of impacting considerably the whole process, in extremis to the degree that one must start the journey anew. The complexity of such projects and the volume of uncertainty can make estimation difficult if not impossible, no matter how good estimators' skills are. At best an estimator can come with a best- and worst-case estimation, both however dependent on the assumptions made.

Moreover, complex projects are sensitive to the initial conditions or auspices under which they start. This sensitivity can turn a project in a totally different direction or pace, that can be reinforced positively or negatively as the project progresses. It’s a continuous interplay between internal and external factors and components that can create synergies or have adverse effects with the potential of reaching tipping points.

Related to the initial conditions, as the praxis sometimes shows, for entities found in continuous movement (like organizations) it’s also important to know from where one’s coming (and at what speed), as the previous impulse (driving force) can be further used or stirred as needed. Metaphorically, a project will need a certain time to find the right pace if it lacks the proper impulse.

Unless the team is trained to play and plays like an orchestra, the impact of deviations from expectations can be hardly quantified. To minimize the waste, ideally a project’s journey should minimally deviate from the optimal path, which can be challenging to achieve as a project’s mass can pull the project in one direction or the other. The more the project advances the bigger the mass, fact which can make a project unstoppable. When such high-mass projects are stopped, their impulse can continue to haunt the organization years after.

Previous Post <<||>> Next Post

03 February 2021

📦Data Migrations (DM): Conceptualization (Part III: Heuristics)

Probably one of the most difficult things to learn as a technical person is using the right technology for a given purpose, this mainly because one’s inclined using the tools one knows best. Moreover, technologies’ overlapping makes the task more and more challenging, the difference between competing technologies often residing in the details. Thus, identifying the gaps resumes in understanding the details of the problem(s) or need(s), respectively the advantages or disadvantages of a technology over the other. This is true especially about competing technologies, including the ones that replace other technologies.

There are simple heuristics, that can allow approaching such challenges. For example, heavy data processing belongs usually in databases, while import/export functionality belongs in an ETL tool. Therefore, one can start looking at the problems from these two perspectives. Would the solution benefit from these two approaches or are there more appropriate technologies (e.g. data streaming, ELT, non-relational databases)? How much effort would involve building the solution?

Commercial Off-The-Shelf (COTS) tools provided by third-party vendors usually offer specialized functionality in each area. Gartner and Forrester provide regular analyses of the main players in the important areas, analyses which can be used in theory as basis for further research. Even if COTS tend to be more expensive and can have some important functionality gaps, as long they are extensible, they can prove a good starting point for developing a solution.

Sometimes it helps researching on the web what other people or organizations did, how they approached the same aspects, what technologies, techniques and best practices they used to overcome the challenges. One doesn’t need to reinvent the wheel even if it’s sometimes fun to do so. Moreover, a few hours of research can give one a basis of useful information and a better understanding over the work ahead.

On the other side sometimes it’s advisable to use the tools one knows best, however this can lead also to unusable and less performant solutions. For example, MS Excel and Access have been for years the tools of choice for building personal solutions that later grew into maintenance nightmares for the IT team. Ideally, they can still be used for data entry or data cleaning, though building solutions exclusively based on (one of) them can prove to be far than optimal.

When one doesn’t know whether a technology or mix of technologies can be used to provide a solution, it’s recommended to start a proof-of-concept (PoC) that would allow addressing most important aspects of the needed solution. One can start small by focusing on the minimal functionality needed to check the main aspects and evolve the PoC during several iterations as needed.

For example, in the case of a Data Migration (DM) this would involve building the data extraction layer for an entity, implement several data transformations based on the defined mappings, consider building a few integrity rules for validation, respectively attempt importing the data into the target system. Once this accomplished, one can start increasing the volume of data to check how the solution behaves under stress. The volume of data can be increased incrementally or by considering all the data available.

As soon the skeleton was built one can consider all the mappings, respectively add several entities to build the dependencies existing between them and other functionality. The prototype might not address all the requirements from the beginning, therefore consider the problems as they arise. For example, if the volume of data seems to cause problems then attempt splitting the data during processing in batches or considering specific optimization techniques like indexing or scaling techniques like increasing computing resources.

Previous Post <<||>> Next Post

28 June 2020

𖣯Strategic Management: Strategy Design (Part II: A System's View)

Each time one discusses in IT about software and hardware components interacting with each other, one talks about a composite referred to as a system. Even if the term Information System (IS) is related to it, a system is defined as a set of interrelated and interconnected components that can be considered together for specific purposes or simple convenience.

A component can be a piece of software or hardware, as well persons or groups if we extend the definition. The consideration of people becomes relevant especially in the context of ecologies, in which systems are placed in a broader context that considers people’s interaction with them, as this raises to important behavior that impacts system’s functioning.

Within a system each part has a role or function determined in respect to the whole as well as to the other parts. The role or function of the component is typically fixed, predefined, though there are also exceptions especially when the scope of a component is enlarged, respectively reduced to the degree that the component can be removed or ignored. What one considers or not considers as part of system defines a system’s boundaries; it’s what distinguishes it from other systems within the environment(s) considered.

The interaction between the components resumes in the exchange, transmission and processing of data found in different aggregations ranging from signals to complex data structures. If in non-IT-based systems the changes are determined by inflow, respectively outflow of energy, in IT the flow is considered in terms of data in its various aggregations (information, knowledge). The data flow (also information flow) represents the ‘fluid’ that nourishes a system’s ‘organism’.

One can grasp the complexity in the moment one attempts to describe a system in terms of components, respectively the dependencies existing between them in term of data and processes. If in nature the processes are extrapolated, in IT they are predefined (even if the knowledge about them is not available). In addition, the less knowledge one has about the infrastructure, the higher the apparent complexity. Even if the system is not necessarily complex, the lack of knowledge and certainty about it makes it complex. The more one needs to dig for information and knowledge to get an acceptable level of knowledge and logical depth, the more time is needed for designing a solution.

Saint Exupéry’s definition of simplicity applies from a system’s functional point of view, though it doesn’t address the relative knowledge about the system, which often is implicit (in people’s heads). People have only fragmented knowledge about the system which makes it difficult to create the whole picture. It’s typically the role of system or process operational manuals, respectively of data descriptions, to make that knowledge explicit, also establishing a fundament for common knowledge and further communication and understanding.

Between the apparent (perceived) and real complexity of a system there’s an important gap that needs to be addressed if one wants to manage the systems adequately, respectively to simplify the systems. Often simplification happens when components or whole systems are replaced, consolidated, or migrated, a mix between these approaches existing as well. Simplifications at data level (aka data harmonization) or process level (aka process optimization and redesign) can have an important impact, being inherent to the good (optimal) functioning of systems.

Whether these changes occur in big-bang or gradual iterations it’s a question of available resources, organizational capabilities, including the ability to handle such projects, respectively the impact, opportunities and risks associated with such endeavors. Beyond this, it’s important to regard the problems from a systemic and systematic point of view, in which ecology’s role is important.

Previous Post <<||>> Next Post

Written: Jun-2020, Last Reviewed: Mar-2024

21 July 2019

🧱IT: Search Engine Optimization [SEO] (Definitions)

"The set of techniques and methodologies devoted to improving organic search rankings (not paid search) for a Web site." (Mike Moran & Bill Hunt , "Search Engine Marketing, Inc", 2005)

"The process and strategy of presenting a business on the web to improve the ability of potential customers finding it through natural searches on search engines such as Google, Yahoo!, and Bing." (Gina Abudi & Brandon Toropov, "The Complete Idiot's Guide to Best Practices for Small Business", 2011)

"The process of improving the volume or quality of traffic to a Web site from search engines via unpaid search results." (Linda Volonino & Efraim Turban, "Information Technology for Management 8th Ed", 2011)

"techniques to help ensure that a web site appears as close to the first position on a web search results page as possible." (Bill Holtsnider & Brian D Jaffe, "IT Manager's Handbook" 3rd Ed., 2012)

"Search engine optimization, the set of techniques and methodologies devoted to improving organic search rankings (not paid search) for a Web site." (Mike Moran & Bill Hunt , "Search Engine Marketing, Inc", 2005)

"The process of writing web content so as to increase a page's ranking in online search results." (Faithe Wempen, "Computing Fundamentals: Introduction to Computers", 2015)

"its main function is to increase website visibility. The main search engines use algorithms to rank a website’s position and hence its overall position in the search results. In some instances it can be as simple as structuring the words on a website in a way the search engine operates. " (BCS Learning & Development Limited, "CEdMA Europe", 2019)

15 July 2019

🧱IT: Optimization (Definitions)

"A state in which all inputs, components, elements, and processes are working together to produce the most desirable, viable, and sustainable outcome possible given the current operating conditions. As referenced in this book, products and product lines should be optimized within the context of the portfolio of products so that the best combinations of product investments produce the most desirable, sustainable market, financial, and strategic outcomes." (Steven Haines, "The Product Manager's Desk Reference", 2008)

"the process of improving the performance of some element with respect to a set of optimization criteria." (Bruce P Douglass, "Real-Time Agility: The Harmony/ESW Method for Real-Time and Embedded Systems Development", 2009)

"Finding the best possible solution." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed., 2011)

"Optimization is the design and operation of a system or process to make it as good as possible in some defined sense, it is the action of making the best or most effective use of a situation or resource." (Hari K Kondaveeti et al, "Deep Learning Applications in Agriculture: The Role of Deep Learning in Smart Agriculture", 2021)

19 April 2019

🌡Performance Management: Mastery (Part I: The Need for Perfection vs. Excellence)

A recurring theme occurring in various contexts over the years seemed to be corroborated with the need for perfection, need going sometimes in extremis beyond common sense. The simplest theory attempting to explain at least some of these situations is that people tend to confuse excellence with perfection, from this confusion deriving false beliefs, false expectations and unhealthy behavior.

Beyond the fact that each individual has an illusory image of what perfection is about, perfection is in certain situations a limiting force rooted in the idealistic way of looking at life. Primarily, perfection denotes that we will never be good enough to reach it as we are striving to something that doesn’t exist. From this appears the external and internal criticism, criticism that instead of helping us to build something it drains out our energy to the extent that it destroys all we have built over the years with a considerable effort. Secondarily, on the long run, perfection has the tendency to steal our inner peace and balance, letting fear take over – the fear of not making mistakes, of losing the acceptance and trust of the others. It focuses on our faults, errors and failures instead of driving us to our goals. In extremis it relieves the worst in people, actors and spectators altogether.

In its proximate semantics though at diametral side through its implications, excellence focuses on our goals, on the aspiration of aiming higher without implying a limit to it. It’s a shift of attention from failure to possibilities, on what matters, on reaching our potential, on acknowledging the long way covered. It allows us building upon former successes and failures. Excellence is what we need to aim at in personal and professional life. Will Durant explaining Aristotle said that: “We are what we repeatedly do. Excellence, then, is not an act, but a habit.”

People who attempt giving 100% of their best to achieve a (positive) goal are to admire, however the proximity of 100% is only occasionally achievable, hopefully when needed the most. 100% is another illusory limit we force upon ourselves as it’s correlated to the degree of achievement, completeness or quality an artefact or result can ideally have. We rightly define quality as the degree to which something is fit for purpose. Again, a moving target that needs to be made explicit before we attempt to reach it otherwise quality envisions perfection rather than excellence and effort is wasted.

Considering the volume of effort needed to achieve a goal, Pareto’s principles (aka the 80/20 rule) seems to explain the best its underlying forces. The rule states that roughly 80% of the effects come from 20% of the causes. A corollary is that we can achieve 80% of a goal with 20% of the effort needed altogether to achieve it fully. This means that to achieve the remaining 20% toward the goal we need to put four times more of the effort already spent. This rule seems to govern the elaboration of concepts, designs and other types of documents, and I suppose it can be easily extended to other activities like writing code, cleaning data, improving performance, etc.

Given the complexity, urgency and dependencies of the tasks or goals before us probably it's beneficial sometimes to focus first on the 80% of their extent, so we can make progress, and focus on the remaining 20% if needed, when needed. This concurrent approach can allow us making progress faster in incremental steps. Also, in time, through excellence, we can bridge the gap between the two numbers as is needed less time and effort in the process.

14 January 2019

🔬Data Science: Evolutionary Algorithm (Definitions)

"An Evolutionary Algorithm (EA) is a general class of fitting or maximization techniques. They all maintain a pool of structures or models that can be mutated and evolve. At every stage in the algorithm, each model is graded and the better models are allowed to reproduce or mutate for the next round. Some techniques allow the successful models to crossbreed. They are all motivated by the biologic process of evolution. Some techniques are asexual (so, there is no crossbreeding between techniques) while others are bisexual, allowing successful models to swap ''genetic' information. The asexual models allow a wide variety of different models to compete, while sexual methods require that the models share a common 'genetic' code." (William J Raynor Jr., "The International Dictionary of Artificial Intelligence", 1999)

"Meta-heuristic optimization approach inspired by natural evolution, which begins with potential solution models, then iteratively applies algorithms to find the fittest models from the set to serve as inputs to the next iteration, ultimately leading to a sub-optimal solution which is close to the optimal one." (Gilles Lebrun et al, "EA Multi-Model Selection for SVM", 2009)

"Evolutionary algorithms are search methods that can be used for solving optimization problems. They mimic working principles from natural evolution by employing a population–based approach, labeling each individual of the population with a fitness and including elements of random, albeit the random is directed through a selection process." (Ivan Zelinka & Hendrik Richter, "Evolutionary Algorithms for Chaos Researchers", Studies in Computational Intelligence Vol. 267, 2010)

"Population-based optimization algorithms in which each member of the population represents a candidate solution. In an iterative process the population members evolve and are then evaluated by a fitness function. Genetic Algorithms and Particle Swarm Optimization are examples of evolutionary algorithms." (Efstathios Kirkos, "Composite Classifiers for Bankruptcy Prediction", 2014)

"A collective term for all variants of (probabilistic) optimization and approximation algorithms that are inspired by Darwinian evolution. Optimal states are approximated by successive improvements based on the variation-selection paradigm. Thereby, the variation operators produce genetic diversity and the selection directs the evolutionary search." (Harish Garg, "A Hybrid GA-GSA Algorithm for Optimizing the Performance of an Industrial System by Utilizing Uncertain Data", 2015)

15 November 2018

🔭Data Science: Optimization (Just the Quotes)

"[...] any hope that we are smart enough to find even transiently optimum solutions to our data analysis problems is doomed to failure, and, indeed, if taken seriously, will mislead us in the allocation of effort, thus wasting both intellectual and computational effort." (John W Tukey, "Choosing Techniques for the Analysis of Data", 1981)

"In constructing a model, we always attempt to maximize its usefulness. This aim is closely connected with the relationship among three key characteristics of every systems model: complexity, credibility, and uncertainty. This relationship is not as yet fully understood. We only know that uncertainty (predictive, prescriptive, etc.) has a pivotal role in any efforts to maximize the usefulness of systems models. Although usually (but not always) undesirable when considered alone, uncertainty becomes very valuable when considered in connection to the other characteristics of systems models: in general, allowing more uncertainty tends to reduce complexity and increase credibility of the resulting model. Our challenge in systems modelling is to develop methods by which an optimal level of allowable uncertainty can be estimated for each modelling problem." (George J Klir & Bo Yuan, "Fuzzy Sets and Fuzzy Logic: Theory and Applications", 1995)

"[...] an algorithm’s average performance is determined by how 'aligned' it is with the underlying probability distribution over optimization problems on which it is run." (David H Wolpert & William G Macready, "No free lunch theorems for optimization", IEEE Transactions on Evolutionary Computation 1 (1), 1997)

"[...] despite the NFL theorems, algorithms can have a priori distinctions that hold even if nothing is specified concerning the optimization problems. In particular, we show that there can be 'head-to-head' minimax distinctions between a pair of algorithms, i.e., that when considering one function at a time ,a pair of algorithms may be distinguishable, even if they are not when one looks over all functions." (David H Wolpert & William G Macready, "No free lunch theorems for optimization", IEEE Transactions on Evolutionary Computation 1 (1), 1997)

"[...] if you have a general optimization involving uncertainty and very little prior knowledge, the situation is rather hopeless. Due to the NFL theorem, you cannot do any better than a blind search. Each blind search evaluation will be very expensive, with no hope of future improvement, theoretical or otherwise. And the number of performance searches required to get anywhere is simply too large. Neither time nor theoretical or technological progress are on your side. No grand optimization algorithm to end all algorithms is possible." (Yu-Chi Ho, "The no free lunch theorem and the human-machine interface", IEEE Control Systems Magazine, 1999)

"The No Free Lunch (NFL) theorem […] tells us that without any structural assumptions on an optimization problem, no algorithm can perform better on average than blind search." (Yu-Chi Ho, "The no free lunch theorem and the human-machine interface", IEEE Control Systems Magazine, 1999)

"A model is an imitation of reality and a mathematical model is a particular form of representation. We should never forget this and get so distracted by the model that we forget the real application which is driving the modelling. In the process of model building we are translating our real world problem into an equivalent mathematical problem which we solve and then attempt to interpret. We do this to gain insight into the original real world situation or to use the model for control, optimization or possibly safety studies." (Ian T Cameron & Katalin Hangos, "Process Modelling and Model Analysis", 2001)

"Because No Free Lunch theorems dictate that no optimization algorithm can be considered more efficient than any other when considering all possible functions, the desired function class plays a prominent role in the model. In particular, this provides a tractable way to answer the traditionally difficult question of what algorithm is best matched to a particular class of functions. Among the benefits of the model are the ability to specify the function class in a straightforward manner, a natural way to specify noisy or dynamic functions, and a new source of insight into No Free Lunch theorems for optimization." (Christopher K Monson, "No Free Lunch, Bayesian Inference, and Utility: A Decision-Theoretic Approach to Optimization", [thesis] 2006)

"There may be no significant difference between the point of view of inferring the true structure and that of making a prediction if an infinitely large quantity of data is available or if the data are noiseless. However, in modeling based on a finite quantity of real data, there is a significant gap between these two points of view, because an optimal model for prediction purposes may be different from one obtained by estimating the 'true model'." (Genshiro Kitagawa & Sadanori Konis, "Information Criteria and Statistical Modeling", 2007)

"A priori, it is clear that no method will always be the best [...]. However, it is reasonable to argue that each method will have a set of functions, a type of data, and a range of sample sizes for which it is optimal – a sort of catchment region for each procedure. Ideally, one could partition a space of regression problems into catchment regions, depending on which methods were under consideration, and determine which catchment region seemed most appropriate for each method. This ideal solution would amount to a selection principle for nonparametric methods. Unfortunately, it is unclear how to do this, not least because the catchment regions are unknown." (Bertrand Clarke et al, "Principles and Theory for Data Mining and Machine Learning", 2009)

"When generating trees, it is usually optimal to grow a larger tree than is justifiable and then prune it back. The main reason this works well is that stop splitting rules do not look far enough forward. That is, stop splitting rules tend to underfit, meaning that even if a rule stops at a split for which the next candidate splits give little improvement, it may be that splitting them one layer further will give a large improvement in accuracy." (Bertrand Clarke et al, "Principles and Theory for Data Mining and Machine Learning", 2009)

"The problem of comparing classifiers is not at all an easy task. There is no single classifier that works best on all given problems, phenomenon related to the 'No-free-lunch' metaphor, i.e., each classifier (’restaurant’) provides a specific technique associated with the corresponding costs (’menu’ and ’price’ for it). It is hence up to us, using the information and knowledge at hand, to find the optimal trade-off." (Florin Gorunescu, "Data Mining Concepts, Models and Techniques", 2011)

"In an emergency, a data product that just produces more data is of little use. Data scientists now have the predictive tools to build products that increase the common good, but they need to be aware that building the models is not enough if they do not also produce optimized, implementable outcomes." (Jeremy Howard et al, "Designing Great Data Products", 2012)

"Briefly speaking, to solve a Machine Learning problem means you optimize a model to fit all the data from your training set, and then you use the model to predict the results you want. Therefore, evaluating a model need to see how well it can be used to predict the data out of the training set. Usually there are three types of the models: underfitting, fair and overfitting model [...]. If we want to predict a value, both (a) and (c) in this figure cannot work well. The underfitting model does not capture the structure of the problem at all, and we say it has high bias. The overfitting model tries to fit every sample in the training set and it did it, but we say it is of high variance. In other words, it fails to generalize new data." (Shudong Hao, "A Beginner’s Tutorial for Machine Learning Beginners", 2014)

"Deep learning is an area of machine learning that emerged from the intersection of neural networks, artificial intelligence, graphical modeling, optimization, pattern recognition and signal processing." (N D Lewis, "Deep Learning Made Easy with R: A Gentle Introduction for Data Science", 2016)

"Optimization is more than finding the best simulation results. It is itself a complex and evolving field that, subject to certain information constraints, allows data scientists, statisticians, engineers, and traders alike to perform reality checks on modeling results." (Chris Conlan, "Automated Trading with R: Quantitative Research and Platform Development", 2016)

"Data scientists should have some domain expertise. Most data science projects begin with a real-world, domain-specific problem and the need to design a data-driven solution to this problem. As a result, it is important for a data scientist to have enough domain expertise that they understand the problem, why it is important, and how a data science solution to the problem might fit into an organization’s processes. This domain expertise guides the data scientist as she works toward identifying an optimized solution." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"Optimization is the process of finding the maximum or minimum of a given function (also known as a fitness function), by calculating the best values for its variables (also known as a 'solution'). Despite the simplicity of this definition, it is not an easy process; often involves restrictions, as well as complex relationships among the various variables. Even though some functions can be optimized using some mathematical process, most functions we encounter in data science are not as simple, requiring a more advanced technique." (Yunus E Bulut & Zacharias Voulgaris, "AI for Data Science: Artificial Intelligence Frameworks and Functionality for Deep Learning, Optimization, and Beyond", 2018)

"Optimization systems (or optimizers, as they are often referred to) aim to optimize in a systematic way, oftentimes using a heuristics-based approach. Such an approach enables the AI system to use a macro level concept as part of its low-level calculations, accelerating the whole process and making it more light-weight. After all, most of these systems are designed with scalability in mind, so the heuristic approach is most practical." (Yunus E Bulut & Zacharias Voulgaris, "AI for Data Science: Artificial Intelligence Frameworks and Functionality for Deep Learning, Optimization, and Beyond", 2018)

"The no free lunch theorems set limits on the range of optimality of any method. That is, each methodology has a ‘catchment area’ where it is optimal or nearly so. Often, intuitively, if the optimality is particularly strong then the effectiveness of the methodology falls off more quickly outside its catchment area than if its optimality were not so strong. Boosting is a case in point: it seems so well suited to binary classification that efforts to date to extend it to give effective classification (or regression) more generally have not been very successful. Overall, it remains to characterize the catchment areas where each class of predictors performs optimally, performs generally well, or breaks down." (Bertrand S Clarke & Jennifer L. Clarke, "Predictive Statistics: Analysis and Inference beyond Models", 2018)

"Cross-validation is a useful tool for finding optimal predictive models, and it also works well in visualization. The concept is simple: split the data at random into a 'training' and a 'test' set, fit the model to the training data, then see how well it predicts the test data. As the model gets more complex, it will always fit the training data better and better. It will also start off getting better results on the test data, but there comes a point where the test data predictions start going wrong." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"Random forests are essentially an ensemble of trees. They use many short trees, fitted to multiple samples of the data, and the predictions are averaged for each observation. This helps to get around a problem that trees, and many other machine learning techniques, are not guaranteed to find optimal models, in the way that linear regression is. They do a very challenging job of fitting non-linear predictions over many variables, even sometimes when there are more variables than there are observations. To do that, they have to employ 'greedy algorithms', which find a reasonably good model but not necessarily the very best model possible." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

SQL Troubles

Pages

16 April 2025

🧮ERP: Implementations (Part XIV: A Never-Ending Story)

17 March 2025

🏭🗒️Microsoft Fabric: V-Order [Notes]

15 January 2025

🧭Business Intelligence: Perspectives (Part 23: In between the Many Destinations)

07 December 2024

🏭 💠Data Warehousing: Microsoft Fabric (Part VI: SQL Databases for OLTP scenarios) [new feature]

01 February 2024

🏭🗒️Microsoft Fabric: Delta Tables [Notes]

27 January 2024

Data Science: Back to the Future I (About Beginnings)

04 March 2021

💼Project Management: Project Execution (Part IV: Projects' Dynamics II - Motion)

03 February 2021

📦Data Migrations (DM): Conceptualization (Part III: Heuristics)

28 June 2020

𖣯Strategic Management: Strategy Design (Part II: A System's View)

21 July 2019

🧱IT: Search Engine Optimization [SEO] (Definitions)

15 July 2019

🧱IT: Optimization (Definitions)

19 April 2019

🌡Performance Management: Mastery (Part I: The Need for Perfection vs. Excellence)

14 January 2019

🔬Data Science: Evolutionary Algorithm (Definitions)

15 November 2018

🔭Data Science: Optimization (Just the Quotes)

About Me