
01 February 2024

Microsoft Fabric: Delta Tables (Notes)

Disclaimer: This is a work in progress intended to consolidate information from various sources.
Last updated: 18-Apr-2024

Delta Table Structure

[Delta Lake] delta table
  • {definition} table that stores data as a directory of files in the delta lake (DL) and registers table metadata to the metastore within a catalog and schema [1]
    • ⇐ represents a schema abstraction over data files that are stored in Delta format [2]
    • for each table, the lakehouse stores 
      • a folder containing its Parquet data files
      • a _delta_log folder in which transaction details are logged in JSON format
    • all Microsoft Fabric (MF) experiences generate and consume delta tables [10]
      • provides interoperability and a unified product experience [10]
      • some experiences can only write to delta tables, while others can read from them [10]
    • ⇒ all data files for a table in a database are grouped under a common data path [12]
  • {feature} support ACID transactions
    • modifications made to a table are logged in its transaction log
      • enforces serializable isolation for concurrent operations [2]
      • the logged transactions can be used to retrieve the history of changes made [2]
      • each transaction creates new Parquet files
      • deleting a row doesn't physically remove it from the Parquet file [9]
        • {feature} [DL] Delete Vectors 
          • are read as part of the table and indicate which rows to ignore [9]
          • make it faster to perform deletions because there's no need to re-write the existing parquet files [9]
          • a file with many deleted rows takes more resources to read [9]
  • {feature} support DML
    • only tables created while the feature is available will have all DML published [..]
  • {feature} support data versioning 
    • multiple versions of each table row can be retrieved from the transaction log [2]
  • {feature} support time travel (see the time-travel sketch after these notes)
  • {feature} support for batch and streaming data
    • delta tables can be used as both sources and destinations for streaming data [2]
  • {feature} support standard formats and interoperability
  • table consumption
    • [lakehouse] SQL Endpoint 
      • provides a read-only experience [4]
      • can be used to query only delta tables via T-SQL [4]
        • other file formats cannot be queried using the SQL endpoint [4]
          • ⇐  the files need to be converted to the delta format [4]
      • {limitation} doesn't support the full T-SQL surface area of a transactional data warehouse [4]
  • [Spark] managed table
    •  the table definition in the metastore and the underlying data files are both managed by the Spark runtime for the Fabric lakehouse [4] 
  • [Spark] external table
    • the relational table definition in the metastore is mapped to an alternative file storage location [4]
      • the Parquet data files and JSON log files for the table are stored in the Files storage location [4]
  • [Spark] allows greater control over the creation and management of delta tables [4]
  • {operation} create (aka create delta table)
    • defines the table in the metastore for the lakehouse
    • its data is stored in the underlying Parquet files for the table
      • the details of mapping the table definition in the metastore to the underlying files are abstracted [4]
      • ⇐ internally, there is also a log file that keeps track of which parquet files, when combined, make up the data that is in the table [4]
        • the log files are internal and cannot be used directly by other engines [4]
          • ⇐ MF automatically publishes the right log files so that other engines can directly access the right parquet files [..]
        • after every 10 transactions, a new log file (aka checkpoint) is created automatically and asynchronously [4]
          • ⇐ the file is a summary of all the previous log files [4]
          • when querying the table, the system needs to read the latest checkpoint and any log files that were created after*
            • ⇐ instead of having to read 105,120 log files, 10 or fewer files will be read*
        • the Delta Lake logs are published automatically so that other engines can directly access the right parquet files *
    • [Apache Spark in a lakehouse] allows greater control over the creation and management of delta tables [4] (see the creation sketch after these notes)
      • via saving a dataframe 
        • {method}save a dataframe as a delta table [4]
          • ⇐ the easiest way to create a delta table 
          • creates both the table schema definition in the metastore and the data files in delta format [4]
        • {method} create the table definition [4]
          • creates the table schema in the metastore without saving any data files [4]
        • {method} save data in delta format without creating a table definition in the metastore [4]
          • {scenario} persist the results of data transformations performed in Spark in a file format over which a table definition is overlaid later or processed directly by using the delta lake API [4]
          • modifications made to the data through the delta lake API or in an external table that is subsequently created on the folder will be recorded in the transaction logs [4]
          • {mode} overwrite
            • replace the contents of an existing folder with the data in a dataframe [4]
          • {mode} append
            • adds rows from a dataframe to an existing folder [4]
          • Fabric uses an automatic table discovery capability to create the corresponding table metadata in the metastore [4]
      • via DeltaTableBuilder API
        • enables writing Spark code to create a table based on specifications
      • via Spark SQL
        • [managed table] via CREATE TABLE <table_definition> USING DELTA
        • [external table] via CREATE TABLE <table_name> USING DELTA LOCATION
          • the schema of the table is determined by the Parquet files containing the data in the specified location
          • {scenario} create a table definition 
            • that references data that has already been saved in delta format [4]
            • based on a folder where data are ingested in the delta format [4]
  • {operation} update (aka update delta table)
  • {operation} delete (aka delete delta table)
    • [managed table] deleting the table deletes the underlying files from the Tables storage location for the lakehouse [4]
    • [external table] deleting a table from the lakehouse metastore does not delete the associated data files [4]
  • performance and storage cost efficiency tend to degrade over time
    • {reason} new data added to the table might skew the data [3]
    • {reason} batch and streaming data ingestion rates might bring in many small files
    • {reason} update and delete operations eventually create read overhead
      • parquet files are immutable by design, so Delta tables add new parquet files containing the changeset, further amplifying the issues imposed by the first two items [3]
    • {reason} data files and log files that are no longer needed remain available in the storage
  • {recommendation} don’t allow special characters in column names (incl. spaces)
  • {recommendation} make table and column names business-friendly
  • {feature} table partitions
    • {recommendation} use a partitioned folder structure wherever applicable
      • helps to improve data manageability and query performance
      • results in faster search for specific data entries thanks to partition pruning/elimination
      • {best practice} partition data to align with the query patterns [15]
      • it can dramatically speed up query performance, especially when combined with other performance optimizations [15]
  • {command} MERGE 
    • allows updating a delta table with advanced conditions [3] (see the MERGE sketch after these notes)
      • from a source table, view or DataFrame [3]
    • {limitation} the current algorithm in the open source distribution of Delta Lake isn't fully optimized for handling unmodified rows [3]
    • [Microsoft Spark Delta] implemented a custom Low Shuffle Merge optimization
      • unmodified rows are excluded from an expensive shuffling operation that is needed for updating matched rows [3]
  • {command} OPTIMIZE
    • consolidates multiple small Parquet files into larger files [8] (see the maintenance sketch after these notes)
    • should be run whenever there are enough small files to justify running the compaction operation [6]
      • {best practice} run optimization after loading large tables [8]
    • benefits greatly from the ACID transactions supported [6]
    • [Delta Lake] predicate filtering
      • specify predicates to only compact a subset of your data [6]
      • {scenario} running a compaction job on the same dataset daily [6]
  • {command} VACUUM
    • removes old files no longer referenced by a Delta table log [8]
      • files need to be older than the retention threshold [8]
        • the default file retention threshold is seven days [8]
          • shorter retention period impacts Delta's time travel capabilities [8]
          • {default} historical data can't be deleted within the retention threshold [2]
            • ⇐ that's to maintain data consistency [2]
        • {best practice} set a retention interval to at least seven days [8]
          • ⇐ because old snapshots and uncommitted files can still be in use by the concurrent table readers and writers [8]
        • important to optimize storage cost [8]
    • {warning} cleaning up active files might lead to reader failures or even table corruption if the uncommitted files are removed [8]
  • {issue} small files
    • create large metadata transaction logs which cause planning time slowness [6]
    • result from
      • big repartition value [6]
      • partitioning the dataset on a high-cardinality column or using deeply nested partitions, which creates more small files [6]
      • tables that are incrementally updated frequently [6]
    • files of sizes above 128 MB, and optimally close to 1 GB, improve compression and data distribution across the cluster nodes [8]
  • {feature} auto compaction 
    • combines small files within Delta table partitions to automatically reduce small file problems [7]
    • occurs after a write to a table has succeeded and runs synchronously on the cluster that has performed the write [7]
    • only compacts files that haven’t been compacted previously [7]
    • only triggered for partitions or tables that have at least a certain number of small files [7]
    • enabled at the table or session level [7]
  • {feature}[Delta Lake 1.2] data skipping 
    • the engine takes advantage of minimum and maximum values metadata to provide faster queries
      • ⇐ requires the respective metadata
    •  its effectiveness depends on data's layout [7]
  • {feature} [Delta Lake 3.0] Z-Ordering (aka multi-dimensional clustering)
    • technique to collocate related information in the same set of files [7]
    • automatically used in data-skipping algorithms [7]
    • dramatically reduces the amount of data to read [7]
    • aims to produce evenly-balanced data files with respect to the number of tuples
      • ⇐ but not necessarily data size on disk [7]
        • ⇐ the two measures are most often correlated [7]
          • ⇐ but there can be situations when that is not the case, leading to skew in optimize task times [7]
    • via ZORDER BY clause 
      • applicable to columns with high cardinality commonly used in query predicates [7]
      • multiple columns can be specified as a comma-separated list
        • {warning} the effectiveness of the locality drops with each extra column [7]
        • {warning} using columns that do not have statistics collected on them is  ineffective and wastes resources [7] 
          • statistics collection can be configured on certain columns by reordering columns in the schema, or by increasing the number of columns to collect statistics on [7]
    • {characteristic} not idempotent
      • every time it is executed, it will try to create a new clustering of data in all files in a partition [7]
        • it includes new and existing files that were part of previous Z-Ordering [7]
  • {feature} checkpointing
    • allows read queries to quickly reconstruct the current state of the table without reading too many small incremental log files [7]
    • {default} each checkpoint is written as a single Parquet file [7]
    • {alternative} [Delta Lake 2.0] multi-part checkpointing
      • allows splitting the checkpoint into multiple Parquet files [7]
        • ⇒ parallelizes and speeds up writing the checkpoint [7]
  • {feature} [Delta Lake 3.0] log compaction
    • reduces the need for frequent checkpoints and minimizes the latency spikes caused by them [7]
    • allows new log compaction files with the format <x>.<y>.compact.json
      • the files contain the aggregated actions for commit range [x, y] 
    • read support is enabled by default [7]
    • write support not available yet [7]
      •  will be added in a future version of Delta [7]
  • Delta Lake transaction log (aka DeltaLog)
    • a sequential record of every transaction performed on a table since its creation [15]
    • central to DL functionality because it is at the core of its important features [15]
      • incl. ACID transactions, scalable metadata handling, time travel
    • {goal} enable multiple readers and writers to operate on a given version of a dataset file simultaneously [15]
    • {goal} provide additional information to the execution engine for more performant operations [15]
      • e.g. data skipping indexes
    • always shows the user a consistent view of the data
      • ⇒ serves as a single source of truth
    • for each write operation, the data file is always written first, and only when that operation succeeds, a transaction log file is added to the _delta_log folder
      • ⇐ the transaction is only considered complete when the transaction log entry is written successfully [15] 
  • {feature} [Lakehouse] table maintenance 
    • efficiently manages delta tables and keeps them always ready for analytics [8]
    • performs ad-hoc table maintenance using contextual right-click actions in a delta table within the Lakehouse explorer [8]
    • applies bin-compaction, V-Order, and unreferenced old files cleanup [8]
      • via Lakehouse >> Tables >> (select table) >> Maintenance >> (select options) >> Run now
        •  a Spark maintenance job is submitted for execution [8]
          • uses the user identity and table privileges
            • consumes Fabric capacity of the workspace/user that submitted the job
          • {constraint} only one maintenance job on a table can be run at any time
            • if there's a running job on the table, the new one is rejected [8]
            • jobs on different tables can execute in parallel [8]
          • running jobs are available in the Monitoring Hub 
            • see "TableMaintenance" text within the activity name column [8]
    • {best practice} properly designing the table physical structure based on the ingestion frequency and expected read patterns is likely more important than running the optimization commands [3]
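
Creation sketch: a minimal PySpark sketch of the creation methods noted above, assuming a Fabric notebook attached to a lakehouse (where a SparkSession is already available as spark); table, column and file paths are illustrative, not taken from the sources.

  from pyspark.sql import SparkSession
  from delta.tables import DeltaTable

  spark = SparkSession.builder.getOrCreate()  # already provided in a Fabric notebook

  # illustrative source data
  df = spark.read.format("csv").option("header", "true").load("Files/raw/products.csv")

  # save a dataframe as a managed delta table (creates the schema definition and the data files)
  df.write.format("delta").mode("overwrite").saveAsTable("products")

  # save data in delta format without creating a table definition in the metastore;
  # partitionBy illustrates the partitioned folder structure recommended above
  df.write.format("delta").mode("append").partitionBy("Category").save("Files/delta/products_files")

  # create only the table definition in the metastore via the DeltaTableBuilder API
  DeltaTable.createIfNotExists(spark) \
      .tableName("products_schema_only") \
      .addColumn("ProductID", "INT") \
      .addColumn("ProductName", "STRING") \
      .execute()

  # create an external table over data already saved in delta format (Spark SQL)
  spark.sql("CREATE TABLE IF NOT EXISTS products_ext USING DELTA LOCATION 'Files/delta/products_files'")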
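
MERGE sketch: a minimal upsert through the Delta Lake Python API, reusing the spark session from the creation sketch; the Low Shuffle Merge optimization mentioned above is applied by the runtime, not by user code. Table and column names are illustrative.

  from delta.tables import DeltaTable

  target = DeltaTable.forName(spark, "products")
  updates = spark.read.format("delta").load("Files/delta/product_updates")  # illustrative source

  (target.alias("t")
      .merge(updates.alias("s"), "t.ProductID = s.ProductID")
      .whenMatchedUpdateAll()       # update rows that match on the key
      .whenNotMatchedInsertAll()    # insert rows missing from the target
      .execute())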
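
Maintenance sketch: the OPTIMIZE, Z-Ordering and VACUUM commands discussed above, issued as Spark SQL from the same notebook session; the Lakehouse table maintenance UI triggers the same kind of operations. Table, column and retention values are illustrative.

  # compact small files and collocate rows on a high-cardinality column used in predicates
  spark.sql("OPTIMIZE products ZORDER BY (ProductID)")

  # Fabric-specific: apply V-Order while compacting
  spark.sql("OPTIMIZE products VORDER")

  # on a table partitioned by Category, compact only that partition
  # (OPTIMIZE predicates must reference partition columns)
  spark.sql("OPTIMIZE products WHERE Category = 'Bikes'")

  # remove unreferenced files older than the retention threshold (default 7 days = 168 hours)
  spark.sql("VACUUM products RETAIN 168 HOURS")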
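
Time-travel sketch: reading a table's logged history and querying earlier versions from the same notebook session, assuming a recent Delta Lake runtime; table names and version numbers are illustrative.

  # list the transactions logged for the table (history of changes)
  spark.sql("DESCRIBE HISTORY products").show(truncate=False)

  # query an earlier version of the table by version number
  previous = spark.sql("SELECT * FROM products VERSION AS OF 1")

  # time travel against a folder in delta format, by version (or by timestampAsOf)
  snapshot = (spark.read.format("delta")
      .option("versionAsOf", 1)
      .load("Files/delta/products_files"))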

Acronyms:
ACID - atomicity, consistency, isolation, durability
API - Application Programming Interface
CRUD - create, read, update, and delete
DL - Delta lake
MF - Microsoft Fabric
JSON - JavaScript Object Notation

Resources:
[1] Microsoft Learn (2023) Data objects in the Databricks lakehouse (link)
[2] Microsoft Learn (2023) Implement medallion lakehouse architecture in Microsoft Fabric (link)
[3] Microsoft Learn (2023) Delta Lake table optimization and V-Order (link)
[4] Microsoft Learn (2023) Work with Delta Lake tables in Microsoft Fabric (link)
[5] Delta Lake (2023) Quickstart (link)
[6] Delta Lake (2023) Delta Lake Small File Compaction with OPTIMIZE (link)
[7] Delta Lake (2023) Optimizations (link)
[8] Microsoft Learn (2023) Use table maintenance feature to manage delta tables in Fabric (link)
[9] Microsoft Fabric Updates Blog (2023) Announcing: Automatic Data Compaction for Fabric Warehouse, by Kevin Conan (link)
[10] Microsoft Learn (2023) Delta Lake table format interoperability (link)
[11] Josep Aguilar-Saborit et al (2020) POLARIS: The Distributed SQL Engine in Azure Synapse, Proceedings of the VLDB Endowment 13(12) (link)
[12] Josep Aguilar-Saborit et al (2024) Extending Polaris to Support Transactions (link)
[13] Michael Armbrust et al (2020) Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores, Proceedings of the VLDB Endowment 13(12) (link)
[14] Jesús Camacho-Rodríguez et al (2023) LST-Bench: Benchmarking Log-Structured Tables in the Cloud, Proceedings of the ACM on Management of Data 2(1), 2024 (link)
[15] Bennie Haelen & Dan Davis (2024) Delta Lake: Up and Running: Modern Data Lakehouse Architectures with Delta Lake

27 January 2024

Data Science: Back to the Future I (About Beginnings)

Data Science
Data Science Series

I've attended again, after several years, a webcast on performance improvement in SQL Server with Claudio Silva, “Writing T-SQL code for the engine, not for you”. The session was great and I really enjoyed it! I recommend it to any data(base) professional, even if some of the scenarios presented should be known already.

It's strange to see the same topics from 20-25 years ago reappearing over and over again despite the advancements made in the area of database engines. Each version of SQL Server brought something new in terms of performance, though without some good experience and understanding of the basic optimization and troubleshooting techniques there's little overall improvement for the average data professional in terms of writing and tuning queries!

Especially with the boom of Data Science topics, the volume of material on SQL increased considerably and many discover how easy it is to write queries, even if the start might be challenging for some. Writing a query is easy indeed, though writing a performant query requires, besides the language itself, some knowledge about the database engine and the various techniques used for troubleshooting and optimization. It's not about knowing in advance what the engine will do - the engine will often surprise you - but about knowing what techniques work, in what cases, what their advantages and disadvantages are, and how they might impact the processing.

Making a parable with writing literature, it's not enough to speak a language; one needs more to become a writer, and there are so many levels of mastery! However, in the database world, even if creativity is welcomed, its role is considerably diminished by the constraints existing in the database engine, the problems to be solved, and the time and resources available. More importantly, one needs to understand some of the rules and know how to use the building blocks to solve problems and build reliable solutions.

The learning process for newbies focuses mainly on the language itself, while the exposure to complexity is kept to a minimum. For some learners the problems start when writing queries based on multiple tables -  what joins to use, in what order, how to structure the queries, what database objects to use for encapsulating the code, etc. Even if there are some guidelines and best practices, the learner must walk the path and experiment alone or in an organized setup.

In university courses the focus is on operator algebras, algorithms, and general database technologies and architectures, without much hands-on experience. All is too theoretical and abstract, which is acceptable for research purposes, but not for the contact with the real world out there! Probably some labs offer exposure to real-life scenarios, though what to cover first in the few hours scheduled for them?

This was the state of the art when I started to learn SQL a quarter century ago, and besides the current tendency of cutting corners, the increased confidence from doing some tests, and the eagerness of shouting one's shaky knowledge and more or less orthodox ideas on the various social networks, nothing seems to have changed! Something did change - the increased complexity of the problems to solve, and, considering the recent technological advances, one can now afford an AI learning buddy to write some code for us based on the information provided in the prompt.

This opens opportunities for learning and growth. AI can be used in the learning process by providing additional curricula for learners to dive deeper into some topics. Moreover, it can help us in time to address the challenges of the ever-increasing complexity of the problems.

04 March 2021

Project Management: Projects' Dynamics II (Motion)

Project Management

Motion is the action or process of moving or being moved between an initial and a final or intermediate point. From the tiniest endeavors to the movement of the planets and beyond, everything is governed by motion. If the laws of nature seem to reveal an inner structural perfection, the activities people perform are quite often far from perfect, which is acceptable if we consider that (almost) everything is a learning process. What is probably less acceptable is the volume of inefficient motion we can easily categorize sometimes as waste.

The waste associated with motion can take many forms: sorting through a pile of tools to find the right one, searching for information, moving back and forth to reach a destination or achieve a goal, etc. Suboptimal motion can have important effects for an organization, resulting in reduced productivity and higher costs.

While for repetitive activities that involve a certain degree of similarity a way to optimize the motion can typically be found, the higher the uncertainty of the steps involved, the more difficult it becomes to optimize it. It's the case of discovery endeavors in which the path between start and destination can't be traced beforehand, respectively when the destination or the path in between can't be depicted to the needed level of detail. A strategy's implementation, ERP implementations and other complex projects, especially the ones dealing with new technologies and/or incomplete knowledge, tend to be exploratory in nature and thus fall under this latter type of motion.

In other words, one must know at minimum the starting point, the destination, how to reach it and what it takes to reach it – resources, knowledge, skillset. When one has all this information one can go on and estimate how long it will take to reach the destination, though the estimate reflects the information available as well as the estimator's skills in translating the information into a realistic roadmap. Each new piece of information has the potential of impacting considerably the whole process, in extremis to the degree that one must start the journey anew. The complexity of such projects and the volume of uncertainty can make estimation difficult if not impossible, no matter how good the estimators' skills are. At best an estimator can come up with a best- and worst-case estimation, both however dependent on the assumptions made.

Moreover, complex projects are sensitive to the initial conditions or auspices under which they start. This sensitivity can turn a project in a totally different direction or pace, that can be reinforced positively or negatively as the project progresses. It’s a continuous interplay between internal and external factors and components that can create synergies or have adverse effects with the potential of reaching tipping points.

Related to the initial conditions, as the praxis sometimes shows, for entities found in continuous movement (like organizations) it's also important to know from where one's coming (and at what speed), as the previous impulse (driving force) can be further used or steered as needed. Metaphorically, a project will need a certain time to find the right pace if it lacks the proper impulse.

Unless the team is trained to play and plays like an orchestra, the impact of deviations from expectations can hardly be quantified. To minimize the waste, ideally a project's journey should deviate minimally from the optimal path, which can be challenging to achieve as a project's mass can pull the project in one direction or the other. The more the project advances, the bigger the mass, which can make a project unstoppable. When such high-mass projects are stopped, their impulse can continue to haunt the organization years after.


03 February 2021

Data Migrations (DM): Conceptualization III (Heuristics)

Data Migration

Probably one of the most difficult things to learn as a technical person is using the right technology for a given purpose, mainly because one's inclined to use the tools one knows best. Moreover, technologies' overlapping makes the task more and more challenging, the difference between competing technologies often residing in the details. Thus, identifying the gaps comes down to understanding the details of the problem(s) or need(s), respectively the advantages or disadvantages of one technology over the other. This is true especially about competing technologies, including the ones that replace other technologies.

There are simple heuristics that can help in approaching such challenges. For example, heavy data processing belongs usually in databases, while import/export functionality belongs in an ETL tool. Therefore, one can start looking at the problems from these two perspectives. Would the solution benefit from these two approaches or are there more appropriate technologies (e.g. data streaming, ELT, non-relational databases)? How much effort would building the solution involve?

Commercial Off-The-Shelf (COTS) tools provided by third-party vendors usually offer specialized functionality in each area. Gartner and Forrester provide regular analyses of the main players in the important areas, analyses which can be used in theory as a basis for further research. Even if COTS tools tend to be more expensive and can have some important functionality gaps, as long as they are extensible, they can prove a good starting point for developing a solution.

Sometimes it helps researching on the web what other people or organizations did, how they approached the same aspects, what technologies, techniques and best practices they used to overcome the challenges. One doesn’t need to reinvent the wheel even if it’s sometimes fun to do so. Moreover, a few hours of research can give one a basis of useful information and a better understanding over the work ahead.

On the other side, sometimes it's advisable to use the tools one knows best, however this can also lead to unusable and less performant solutions. For example, MS Excel and Access have been for years the tools of choice for building personal solutions that later grew into maintenance nightmares for the IT team. Ideally, they can still be used for data entry or data cleaning, though building solutions exclusively based on (one of) them can prove to be far from optimal.

When one doesn't know whether a technology or mix of technologies can be used to provide a solution, it's recommended to start a proof-of-concept (PoC) that would allow addressing the most important aspects of the needed solution. One can start small by focusing on the minimal functionality needed to check the main aspects and evolve the PoC during several iterations as needed.

For example, in the case of a Data Migration (DM) this would involve building the data extraction layer for an entity, implementing several data transformations based on the defined mappings, considering building a few integrity rules for validation, respectively attempting to import the data into the target system. Once this is accomplished, one can start increasing the volume of data to check how the solution behaves under stress. The volume of data can be increased incrementally or by considering all the data available.

As soon as the skeleton is built, one can consider all the mappings, respectively add several entities to build the dependencies existing between them and other functionality. The prototype might not address all the requirements from the beginning, therefore consider the problems as they arise. For example, if the volume of data seems to cause problems, then attempt splitting the data during processing into batches or consider specific optimization techniques like indexing or scaling techniques like increasing computing resources.


28 June 2020

Strategic Management: Simplicity II (A System's View)

Strategic Management

Each time one discusses in IT how software and hardware components interact with each other, one talks about a composite referred to as a system. Even if the term Information System (IS) is related to it, a system is defined as a set of interrelated and interconnected components that can be considered together for specific purposes or simple convenience.

A component can be a piece of software or hardware, as well as persons or groups if we extend the definition. The consideration of people becomes relevant especially in the context of ecologies, in which systems are placed in a broader context that considers people's interaction with them, as this gives rise to important behavior that impacts a system's functioning.

Within a system each part has a role or function determined in respect to the whole as well as to the other parts. The role or function of a component is typically fixed, predefined, though there are also exceptions, especially when the scope of a component is enlarged, respectively reduced to the degree that the component can be removed or ignored. What one considers or doesn't consider as part of a system defines the system's boundaries; it's what distinguishes it from other systems within the environment(s) considered.

The interaction between the components comes down to the exchange, transmission and processing of data found in different aggregations ranging from signals to complex data structures. If in non-IT-based systems the changes are determined by the inflow, respectively outflow of energy, in IT the flow is considered in terms of data in its various aggregations (information, knowledge). The data flow (also information flow) represents the 'fluid' that nourishes a system's 'organism'.

One can grasp the complexity the moment one attempts to describe a system in terms of components, respectively the dependencies existing between them in terms of data and processes. If in nature the processes are extrapolated, in IT they are predefined (even if the knowledge about them is not available). In addition, the less knowledge one has about the infrastructure, the higher the apparent complexity. Even if the system is not necessarily complex, the lack of knowledge and certainty about it makes it complex. The more one needs to dig for information and knowledge to get an acceptable level of knowledge and logical depth, the more time is needed for designing a solution.

Saint Exupéry’s definition of simplicity applies from a system’s functional point of view, though it doesn’t address the relative knowledge about the system, which often is implicit (in people’s heads). People have only fragmented knowledge about the system which makes it difficult to create the whole picture. It’s typically the role of system or process operational manuals, respectively of data descriptions, to make that knowledge explicit, also establishing a fundament for common knowledge and further communication and understanding.

Between the apparent (perceived) and real complexity of a system there’s an important gap that needs to be addressed if one wants to manage the systems adequately, respectively to simplify the systems. Often simplification happens when components or whole systems are replaced, consolidated, or migrated, a mix between these approaches existing as well. Simplifications at data level (aka data harmonization) or process level (aka process optimization and redesign) can have an important impact, being inherent to the good (optimal) functioning of systems.

Whether these changes occur in big-bang or gradual iterations it’s a question of available resources, organizational capabilities, including the ability to handle such projects, respectively the impact, opportunities and risks associated with such endeavors. Beyond this, it’s important to regard the problems from a systemic and systematic point of view, in which ecology’s role is important.


Written: Jun-2020, Last Reviewed: Mar-2024

15 July 2019

IT: Search Engine Optimisation (Definitions)

"The set of techniques and methodologies devoted to improving organic search rankings (not paid search) for a Web site." (Mike Moran & Bill Hunt , "Search Engine Marketing, Inc", 2005)

"The process and strategy of presenting a business on the web to improve the ability of potential customers finding it through natural searches on search engines such as Google, Yahoo!, and Bing." (Gina Abudi & Brandon Toropov, "The Complete Idiot's Guide to Best Practices for Small Business", 2011)

"The process of improving the volume or quality of traffic to a Web site from search engines via unpaid search results." (Linda Volonino & Efraim Turban, "Information Technology for Management 8th Ed", 2011)

"techniques to help ensure that a web site appears as close to the first position on a web search results page as possible." (Bill Holtsnider & Brian D Jaffe, "IT Manager's Handbook" 3rd Ed., 2012)

"Search engine optimization, the set of techniques and methodologies devoted to improving organic search rankings (not paid search) for a Web site." (Mike Moran & Bill Hunt , "Search Engine Marketing, Inc", 2005)

"The process of writing web content so as to increase a page's ranking in online search results." (Faithe Wempen, "Computing Fundamentals: Introduction to Computers", 2015)

"its main function is to increase website visibility. The main search engines use algorithms to rank a website’s position and hence its overall position in the search results. In some instances it can be as simple as structuring the words on a website in a way the search engine operates. " (BCS Learning & Development Limited, "CEdMA Europe", 2019)

19 April 2019

Performance Management: The Need for Perfection vs. Excellence

Performance Management

A recurring theme occurring in various contexts over the years seemed to be connected with the need for perfection, a need that sometimes goes in extremis beyond common sense. The simplest theory attempting to explain at least some of these situations is that people tend to confuse excellence with perfection, and from this confusion derive false beliefs, false expectations and unhealthy behavior.

Beyond the fact that each individual has an illusory image of what perfection is about, perfection is in certain situations a limiting force rooted in the idealistic way of looking at life. Primarily, perfection denotes that we will never be good enough to reach it, as we are striving for something that doesn't exist. From this arises the external and internal criticism, criticism that instead of helping us build something drains our energy to the extent that it destroys all we have built over the years with considerable effort. Secondarily, in the long run, perfection has the tendency to steal our inner peace and balance, letting fear take over – the fear of making mistakes, of losing the acceptance and trust of the others. It focuses on our faults, errors and failures instead of driving us toward our goals. In extremis it brings out the worst in people, actors and spectators altogether.

Though proximate in semantics, excellence sits at the diametrically opposite side through its implications: it focuses on our goals, on the aspiration of aiming higher without implying a limit to it. It's a shift of attention from failure to possibilities, to what matters, to reaching our potential, to acknowledging the long way covered. It allows us to build upon former successes and failures. Excellence is what we need to aim at in personal and professional life. Will Durant, explaining Aristotle, said that: “We are what we repeatedly do. Excellence, then, is not an act, but a habit.”

People who attempt to give 100% of their best to achieve a (positive) goal are to be admired, however the proximity of 100% is only occasionally achievable, hopefully when needed the most. 100% is another illusory limit we force upon ourselves, as it's correlated to the degree of achievement, completeness or quality an artefact or result can ideally have. We rightly define quality as the degree to which something is fit for purpose. Again, a moving target that needs to be made explicit before we attempt to reach it, otherwise quality envisions perfection rather than excellence and effort is wasted.

Considering the volume of effort needed to achieve a goal, Pareto's principle (aka the 80/20 rule) seems to explain best its underlying forces. The rule states that roughly 80% of the effects come from 20% of the causes. A corollary is that we can achieve 80% of a goal with 20% of the effort needed altogether to achieve it fully. This means that to achieve the remaining 20% of the goal we need to put in four times the effort already spent, as the remaining 80% of the total effort is four times the initial 20%. This rule seems to govern the elaboration of concepts, designs and other types of documents, and I suppose it can be easily extended to other activities like writing code, cleaning data, improving performance, etc.

Given the complexity, urgency and dependencies of the tasks or goals before us, it's probably beneficial sometimes to focus first on the 80% of their extent, so we can make progress, and focus on the remaining 20% if needed, when needed. This approach can allow us to make progress faster, in incremental steps. Also, in time, through excellence, we can bridge the gap between the two numbers as less time and effort is needed in the process.


15 November 2018

Data Science: Optimization (Just the Quotes)

"[...] any hope that we are smart enough to find even transiently optimum solutions to our data analysis problems is doomed to failure, and, indeed, if taken seriously, will mislead us in the allocation of effort, thus wasting both intellectual and computational effort." (John W Tukey, "Choosing Techniques for the Analysis of Data", 1981)

"In constructing a model, we always attempt to maximize its usefulness. This aim is closely connected with the relationship among three key characteristics of every systems model: complexity, credibility, and uncertainty. This relationship is not as yet fully understood. We only know that uncertainty (predictive, prescriptive, etc.) has a pivotal role in any efforts to maximize the usefulness of systems models. Although usually (but not always) undesirable when considered alone, uncertainty becomes very valuable when considered in connection to the other characteristics of systems models: in general, allowing more uncertainty tends to reduce complexity and increase credibility of the resulting model. Our challenge in systems modelling is to develop methods by which an optimal level of allowable uncertainty can be estimated for each modelling problem." (George J Klir & Bo Yuan, "Fuzzy Sets and Fuzzy Logic: Theory and Applications", 1995)

"[...] an algorithm’s average performance is determined by how 'aligned' it is with the underlying probability distribution over optimization problems on which it is run." (David H Wolpert & William G Macready, "No free lunch theorems for optimization", IEEE Transactions on Evolutionary Computation 1 (1), 1997)

"[...] despite the NFL theorems, algorithms can have a priori distinctions that hold even if nothing is specified concerning the optimization problems. In particular, we show that there can be 'head-to-head' minimax distinctions between a pair of algorithms, i.e., that when considering one function at a time ,a pair of algorithms may be distinguishable, even if they are not when one looks over all functions." (David H Wolpert & William G Macready, "No free lunch theorems for optimization", IEEE Transactions on Evolutionary Computation 1 (1), 1997)

"[...] if you have a general optimization involving uncertainty and very little prior knowledge, the situation is rather hopeless. Due to the NFL theorem, you cannot do any better than a blind search. Each blind search evaluation will be very expensive, with no hope of future improvement, theoretical or otherwise. And the number of performance searches required to get anywhere is simply too large. Neither time nor theoretical or technological progress are on your side. No grand optimization algorithm to end all algorithms is possible." (Yu-Chi Ho, "The no free lunch theorem and the human-machine interface", IEEE Control Systems Magazine, 1999)

"The No Free Lunch (NFL) theorem […] tells us that without any structural assumptions on an optimization problem, no algorithm can perform better on average than blind search." (Yu-Chi Ho, "The no free lunch theorem and the human-machine interface", IEEE Control Systems Magazine, 1999)

"A model is an imitation of reality and a mathematical model is a particular form of representation. We should never forget this and get so distracted by the model that we forget the real application which is driving the modelling. In the process of model building we are translating our real world problem into an equivalent mathematical problem which we solve and then attempt to interpret. We do this to gain insight into the original real world situation or to use the model for control, optimization or possibly safety studies." (Ian T Cameron & Katalin Hangos, "Process Modelling and Model Analysis", 2001)

"Because No Free Lunch theorems dictate that no optimization algorithm can be considered more efficient than any other when considering all possible functions, the desired function class plays a prominent role in the model. In particular, this provides a tractable way to answer the traditionally difficult question of what algorithm is best matched to a particular class of functions. Among the benefits of the model are the ability to specify the function class in a straightforward manner, a natural way to specify noisy or dynamic functions, and a new source of insight into No Free Lunch theorems for optimization." (Christopher K Monson, "No Free Lunch, Bayesian Inference, and Utility: A Decision-Theoretic Approach to Optimization", [thesis] 2006)

"There may be no significant difference between the point of view of inferring the true structure and that of making a prediction if an infinitely large quantity of data is available or if the data are noiseless. However, in modeling based on a finite quantity of real data, there is a significant gap between these two points of view, because an optimal model for prediction purposes may be different from one obtained by estimating the 'true model'." (Genshiro Kitagawa & Sadanori Konis, "Information Criteria and Statistical Modeling", 2007)

"A priori, it is clear that no method will always be the best [...]. However, it is reasonable to argue that each method will have a set of functions, a type of data, and a range of sample sizes for which it is optimal – a sort of catchment region for each procedure. Ideally, one could partition a space of regression problems into catchment regions, depending on which methods were under consideration, and determine which catchment region seemed most appropriate for each method. This ideal solution would amount to a selection principle for nonparametric methods. Unfortunately, it is unclear how to do this, not least because the catchment regions are unknown." (Bertrand Clarke et al, "Principles and Theory for Data Mining and Machine Learning", 2009)

"When generating trees, it is usually optimal to grow a larger tree than is justifiable and then prune it back. The main reason this works well is that stop splitting rules do not look far enough forward. That is, stop splitting rules tend to underfit, meaning that even if a rule stops at a split for which the next candidate splits give little improvement, it may be that splitting them one layer further will give a large improvement in accuracy." (Bertrand Clarke et al, "Principles and Theory for Data Mining and Machine Learning", 2009)

"The problem of comparing classifiers is not at all an easy task. There is no single classifier that works best on all given problems, phenomenon related to the 'No-free-lunch' metaphor, i.e., each classifier (’restaurant’) provides a specific technique associated with the corresponding costs (’menu’ and ’price’ for it). It is hence up to us, using the information and knowledge at hand, to find the optimal trade-off." (Florin Gorunescu, "Data Mining Concepts, Models and Techniques", 2011)

"In an emergency, a data product that just produces more data is of little use. Data scientists now have the predictive tools to build products that increase the common good, but they need to be aware that building the models is not enough if they do not also produce optimized, implementable outcomes." (Jeremy Howard et al, "Designing Great Data Products", 2012)

"Briefly speaking, to solve a Machine Learning problem means you optimize a model to fit all the data from your training set, and then you use the model to predict the results you want. Therefore, evaluating a model need to see how well it can be used to predict the data out of the training set. Usually there are three types of the models: underfitting, fair and overfitting model [...]. If we want to predict a value, both (a) and (c) in this figure cannot work well. The underfitting model does not capture the structure of the problem at all, and we say it has high bias. The overfitting model tries to fit every sample in the training set and it did it, but we say it is of high variance. In other words, it fails to generalize new data." (Shudong Hao, "A Beginner’s Tutorial for Machine Learning Beginners", 2014)

"Deep learning is an area of machine learning that emerged from the intersection of neural networks, artificial intelligence, graphical modeling, optimization, pattern recognition and signal processing." (N D Lewis, "Deep Learning Made Easy with R: A Gentle Introduction for Data Science", 2016)

"Optimization is more than finding the best simulation results. It is itself a complex and evolving field that, subject to certain information constraints, allows data scientists, statisticians, engineers, and traders alike to perform reality checks on modeling results." (Chris Conlan, "Automated Trading with R: Quantitative Research and Platform Development", 2016)

"Data scientists should have some domain expertise. Most data science projects begin with a real-world, domain-specific problem and the need to design a data-driven solution to this problem. As a result, it is important for a data scientist to have enough domain expertise that they understand the problem, why it is important, and how a data science solution to the problem might fit into an organization’s processes. This domain expertise guides the data scientist as she works toward identifying an optimized solution." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"Optimization is the process of finding the maximum or minimum of a given function (also known as a fitness function), by calculating the best values for its variables (also known as a 'solution'). Despite the simplicity of this definition, it is not an easy process; often involves restrictions, as well as complex relationships among the various variables. Even though some functions can be optimized using some mathematical process, most functions we encounter in data science are not as simple, requiring a more advanced technique." (Yunus E Bulut & Zacharias Voulgaris, "AI for Data Science: Artificial Intelligence Frameworks and Functionality for Deep Learning, Optimization, and Beyond", 2018)

"Optimization systems (or optimizers, as they are often referred to) aim to optimize in a systematic way, oftentimes using a heuristics-based approach. Such an approach enables the AI system to use a macro level concept as part of its low-level calculations, accelerating the whole process and making it more light-weight. After all, most of these systems are designed with scalability in mind, so the heuristic approach is most practical." (Yunus E Bulut & Zacharias Voulgaris, "AI for Data Science: Artificial Intelligence Frameworks and Functionality for Deep Learning, Optimization, and Beyond", 2018)

"The no free lunch theorems set limits on the range of optimality of any method. That is, each methodology has a ‘catchment area’ where it is optimal or nearly so. Often, intuitively, if the optimality is particularly strong then the effectiveness of the methodology falls off more quickly outside its catchment area than if its optimality were not so strong. Boosting is a case in point: it seems so well suited to binary classification that efforts to date to extend it to give effective classification (or regression) more generally have not been very successful. Overall, it remains to characterize the catchment areas where each class of predictors performs optimally, performs generally well, or breaks down." (Bertrand S Clarke & Jennifer L. Clarke, "Predictive Statistics: Analysis and Inference beyond Models", 2018)

"Cross-validation is a useful tool for finding optimal predictive models, and it also works well in visualization. The concept is simple: split the data at random into a 'training' and a 'test' set, fit the model to the training data, then see how well it predicts the test data. As the model gets more complex, it will always fit the training data better and better. It will also start off getting better results on the test data, but there comes a point where the test data predictions start going wrong." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"Random forests are essentially an ensemble of trees. They use many short trees, fitted to multiple samples of the data, and the predictions are averaged for each observation. This helps to get around a problem that trees, and many other machine learning techniques, are not guaranteed to find optimal models, in the way that linear regression is. They do a very challenging job of fitting non-linear predictions over many variables, even sometimes when there are more variables than there are observations. To do that, they have to employ 'greedy algorithms', which find a reasonably good model but not necessarily the very best model possible." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

12 May 2018

Data Science: Backpropagation (Definitions)

"A learning algorithm for multilayer neural nets based on minimizing the mean, or total, squared error." (Laurene V Fausett, "Fundamentals of Neural Networks: Architectures, Algorithms, and Applications", 1994)

"A learning scheme by which a multi-layer feedforward network is organized for pattern recognition or classification utilizing an external teacher, and error feedback (or propagation)." (Guido J Deboeck and Teuvo Kohonen, "Visual explorations in finance with self-organizing maps", 2000)

"weight-vector optimization method used in multilayered feed-forward networks. The corrective steps arc made starting at the output layer and proceeding toward the input layer." (Teuvo Kohonen, "Self-Organizing Maps" 3rd Ed., 2001)

"A class of feed-forward neural networks used for classification, forecasting, and estimation. Backpropagation is the process by which connection weights between neurons are modified using a backward pass of error derivatives." (David Scarborough & Mark J Somers, "Neural Networks in Organizational Research: Applying Pattern Recognition to the Analysis of Organizational Behavior", 2006)

"A method for training a neural network by adjusting the weights using errors between the current prediction and the training set." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"A supervised learning algorithm used to train artificial neural networks, where the network learns from many inputs, similar to the way a child learns to identify a bird from examples of birds and birds attributes." (Eitan Gross, "Stochastic Neural Network Classifiers", 2015)

"A learning algorithm for artificial neural networks used for supervised learning, where connection weights are iteratively updated to decrease the approximation error at the output units." (Ethem Alpaydın, "Machine learning: the new AI", 2016)

"Learning algorithm that optimizes a neural network by gradient descent to minimize a cost function and improve performance." (Terrence J Sejnowski, "The Deep Learning Revolution", 2018)

"The backpropagation algorithm is an ML algorithm used to train neural networks. The algorithm calculates for each neuron in a network the contribution |  the neuron makes to the error of the network. Using this error calculation for each neuron it is possible to update the weights on the inputs to each neuron so as to reduce the overall error of the network. The backpropagation algorithm is so named because it works in a two stage process. In the first stage an instance is input to the network and the information flows forward through the network until the network generates a prediction for that instance. In the second stage the error of the network on that instance is calculated by comparing the network's prediction to the correct output for that instance (as specified by the training data) and then this error is then shared back (or backpropagated) through the neurons in the network on a layer by layer basis beginning at the output layer." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"Backpropagation is short for 'backward propagation of errors'. Backpropagation in convolutional neural networks is a way of training these networks based on a known, desired output for a specific sample case." (Accenture)

06 May 2018

Data Science: Swarm Intelligence (Definitions)

"Swarm systems generate novelty for three reasons: (1) They are 'sensitive to initial conditions' - a scientific shorthand for saying that the size of the effect is not proportional to the size of the cause - so they can make a surprising mountain out of a molehill. (2) They hide countless novel possibilities in the exponential combinations of many interlinked individuals. (3) They don’t reckon individuals, so therefore individual variation and imperfection can be allowed. In swarm systems with heritability, individual variation and imperfection will lead to perpetual novelty, or what we call evolution." (Kevin Kelly, "Out of Control: The New Biology of Machines, Social Systems and the Economic World", 1995)

"Dumb parts, properly connected into a swarm, yield smart results." (Kevin Kelly, "New Rules for the New Economy", 1999)

"It is, however, fair to say that very few applications of swarm intelligence have been developed. One of the main reasons for this relative lack of success resides in the fact that swarm-intelligent systems are hard to 'program', because the paths to problem solving are not predefined but emergent in these systems and result from interactions among individuals and between individuals and their environment as much as from the behaviors of the individuals themselves. Therefore, using a swarm-intelligent system to solve a problem requires a thorough knowledge not only of what individual behaviors must be implemented but also of what interactions are needed to produce such or such global behavior." (Eric Bonabeau et al, "Swarm Intelligence: From Natural to Artificial Systems", 1999)

"Just what valuable insights do ants, bees, and other social insects hold? Consider termites. Individually, they have meager intelligence. And they work with no supervision. Yet collectively they build mounds that are engineering marvels, able to maintain ambient temperature and comfortable levels of oxygen and carbon dioxide even as the nest grows. Indeed, for social insects teamwork is largely self-organized, coordinated primarily through the interactions of individual colony members. Together they can solve difficult problems (like choosing the shortest route to a food source from myriad possible pathways) even though each interaction might be very simple (one ant merely following the trail left by another). The collective behavior that emerges from a group of social insects has been dubbed 'swarm intelligence'." (Eric Bonabeau & Christopher Meyer, Swarm Intelligence: A Whole New Way to Think About Business, Harvard Business Review, 2001)

"[…] swarm intelligence is becoming a valuable tool for optimizing the operations of various businesses. Whether similar gains will be made in helping companies better organize themselves and develop more effective strategies remains to be seen. At the very least, though, the field provides a fresh new framework for solving such problems, and it questions the wisdom of certain assumptions regarding the need for employee supervision through command-and-control management. In the future, some companies could build their entire businesses from the ground up using the principles of swarm intelligence, integrating the approach throughout their operations, organization, and strategy. The result: the ultimate self-organizing enterprise that could adapt quickly - and instinctively - to fast-changing markets." (Eric Bonabeau & Christopher Meyer, "Swarm Intelligence: A Whole New Way to Think About Business", Harvard Business Review, 2001)

"Swarm Intelligence can be defined more precisely as: Any attempt to design algorithms or distributed problem-solving methods inspired by the collective behavior of the social insect colonies or other animal societies. The main properties of such systems are flexibility, robustness, decentralization and self-organization." (Ajith Abraham et al, "Swarm Intelligence in Data Mining", 2006)

"Swarm intelligence can be effective when applied to highly complicated problems with many nonlinear factors, although it is often less effective than the genetic algorithm approach discussed later in this chapter. Swarm intelligence is related to swarm optimization […]. As with swarm intelligence, there is some evidence that at least some of the time swarm optimization can produce solutions that are more robust than genetic algorithms. Robustness here is defined as a solution’s resistance to performance degradation when the underlying variables are changed." (Michael J North & Charles M Macal, "Managing Business Complexity: Discovering Strategic Solutions with Agent-Based Modeling and Simulation", 2007)

[swarm intelligence] "Refers to a class of algorithms inspired by the collective behaviour of insect swarms, ant colonies, the flocking behaviour of some bird species, or the herding behaviour of some mammals, such that the behaviour of the whole can be considered as exhibiting a rudimentary form of 'intelligence'." (John Fulcher, "Intelligent Information Systems", 2009)

"The property of a system whereby the collective behaviors of unsophisticated agents interacting locally with their environment cause coherent functional global patterns to emerge." (M L Gavrilova, "Adaptive Algorithms for Intelligent Geometric Computing", 2009) 

[swarm intelligence] "Is a discipline that deals with natural and artificial systems composed of many individuals that coordinate using decentralized control and self-organization. In particular, SI focuses on the collective behaviors that result from the local interactions of the individuals with each other and with their environment." (Elina Pacini et al, "Schedulers Based on Ant Colony Optimization for Parameter Sweep Experiments in Distributed Environments", 2013). 

"Swarm intelligence (SI) is a branch of computational intelligence that discusses the collective behavior emerging within self-organizing societies of agents. SI was inspired by the observation of the collective behavior in societies in nature such as the movement of birds and fish. The collective behavior of such ecosystems, and their artificial counterpart of SI, is not encoded within the set of rules that determines the movement of each isolated agent, but it emerges through the interaction of multiple agents." (Maximos A Kaliakatsos-Papakostas et al, "Intelligent Music Composition", 2013)

"Collective intelligence of societies of biological (social animals) or artificial (robots, computer agents) individuals. In artificial intelligence, it gave rise to a computational paradigm based on decentralisation, self-organisation, local interactions, and collective emergent behaviours." (D T Pham & M Castellani, "The Bees Algorithm as a Biologically Inspired Optimisation Method", 2015)

"It is the field of artificial intelligence in which the population is in the form of agents which search in a parallel fashion with multiple initialization points. The swarm intelligence-based algorithms mimic the physical and natural processes for mathematical modeling of the optimization algorithm. They have the properties of information interchange and non-centralized control structure." (Sajad A Rather & P Shanthi Bala, "Analysis of Gravitation-Based Optimization Algorithms for Clustering and Classification", 2020)

"It [swarm intelligence] is the discipline dealing with natural and artificial systems consisting of many individuals who coordinate through decentralized monitoring and self-organization." (Mehmet A Cifci, "Optimizing WSNs for CPS Using Machine Learning Techniques", 2021)

Resources:
More quotes on "Swarm Intelligence" at the-web-of-knowledge.blogspot.com.

05 April 2018

Data Science: Genetic Algorithms (Definitions)

"A method for solving optimization problems using parallel search, based on the biological paradigm of natural selection and 'survival of the fittest'." (Joseph P Bigus, "Data Mining with Neural Networks: Solving Business Problems from Application Development to Decision Support", 1996)

"Algorithms for solving complex combinatorial and organizational problems with many variants, by employing analogy with nature's evolution. The general steps a genetic algorithm cycles through are: generate a new population (crossover) starting at the beginning with initial one; select the best individuals; mutate, if necessary; repeat the same until a satisfactory solution is found according to a goodness (fitness) function." (Nikola K Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"The type of algorithm that locates optimal binary strings by processing an initially random population of strings using artificial mutation, crossover, and selection operators, in an analogy with the process of natural selection." (David E Goldberg, "Genetic Algorithms", 1989)

"A technique for estimating computer models (e.g., Machine Learning) based on methods adapted from the field of genetics in biology. To use this technique, one encodes possible model behaviors into a 'genes'. After each generation, the current models are rated and allowed to mate and breed based on their fitness. In the process of mating, the genes are exchanged, and crossovers and mutations can occur. The current population is discarded and its offspring forms the next generation." (William J Raynor Jr., "The International Dictionary of Artificial Intelligence", 1999)

"Genetic algorithms are problem-solving techniques that solve problems by evolving solutions as nature does, rather than by looking for solutions in a more principled way. Genetic algorithms, sometimes hybridized with other optimization algorithms, are the best optimization algorithms available across a wide range of problem types." (Guido Deboeck & Teuvo Kohonen (Eds), "Visual Explorations in Finance with Self-Organizing Maps" 2nd Ed., 2000)

"learning principle, in which learning results are foully from generations of solutions by crossing and eliminating their members. An improved behavior usually ensues from selective stochastic replacements in subsets of system parameters." (Teuvo Kohonen, "Self-Organizing Maps 3rd Ed.", 2001)

"A genetic algorithm is a search method used in computational intelligence to find true or approximate solutions to optimization and search problems." (Omar F El-Gayar et al, "Current Issues and Future Trends of Clinical Decision Support Systems", 2008)

"A method of evolutionary computation for problem solving. There are states also called sequences and a set of possibility final states. Methods of mutation are used on genetic sequences to achieve better sequences." (Attila Benko & Cecília S Lányi, "History of Artificial Intelligence", 2009) 

"Genetic algorithms are derivative free, stochastic optimization methods based on the concepts of natural selection and evolutionary processes." (Yorgos Goletsis et al, Bankruptcy Prediction through Artificial Intelligence, 2009)

"Genetic Algorithms (GAs) are algorithms that use operations found in natural genetics to guide their way through a search space and are increasingly being used in the field of optimisation. The robust nature and simple mechanics of genetic algorithms make them inviting tools for search learning and optimization. Genetic algorithms are based on computational models of fundamental evolutionary processes such as selection, recombination and mutation." (Masoud Mohammadian, Supervised Learning of Fuzzy Logic Systems, 2009)

"The algorithms that are modelled on the natural process of evolution. These algorithms employ methods such as crossover, mutation and natural selection and provide the best possible solutions after analyzing a group of sub-optimal solutions which are provided as inputs." (Prayag Narula, "Evolutionary Computing Approach for Ad-Hoc Networks", 2009)

"The type of algorithm that locates optimal binary strings by processing an initially random population of strings using artificial mutation, crossover, and selection operators, in an analogy with the process of natural selection." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"These algorithms mimic the process of natural evolution and perform explorative search. The main component of this method is chromosomes that represent solutions to the problem. It uses selection, crossover, and mutation to obtain chromosomes of highest quality." (Indranil Bose, "Data Mining in Tourism", 2009)

"Search algorithms used in machine learning which involve iteratively generating new candidate solutions by combining two high scoring earlier (or parent) solutions in a search for a better solution." (Radian Belu, "Artificial Intelligence Techniques for Solar Energy and Photovoltaic Applications", 2013)

"Genetic algorithms (GAs) is a stochastic search methodology belonging to the larger family of artificial intelligence procedures and evolutionary algorithms (EA). They are used to generate useful solutions to optimization and search problems mimicking Darwinian evolution." (Niccolò Gordini, "Genetic Algorithms for Small Enterprises Default Prediction: Empirical Evidence from Italy", 2014)

"Genetic algorithms are based on the biological theory of evolution. This type of algorithms is useful for searching and optimization." (Ivan Idris, "Python Data Analysis", 2014)

"A Stochastic optimization algorithms based on the principles of natural evolution." (Harish Garg, "A Hybrid GA-GSA Algorithm for Optimizing the Performance of an Industrial System by Utilizing Uncertain Data", 2015)

"It is a stochastic but not random method of search used for optimization or learning. Genetic algorithm is basically a search technique that simulates biological evolution during optimization process." (Salim Lahmir, "Prediction of International Stock Markets Based on Hybrid Intelligent Systems", 2016)

"Machine learning algorithms inspired by genetic processes, for example, an evolution where classifiers with the greatest accuracy are trained further." (David Natingga, "Data Science Algorithms in a Week" 2nd Ed., 2018)

28 February 2017

Data Warehousing: Data Load Optimization – A Success Story

Data Warehousing
Data Warehousing Series

Introduction

This topic has been waiting in the queue for almost two years already - since I finished optimizing an existing relational data warehouse in a SQL Server 2012 Enterprise Edition environment. Through various simple techniques I managed to reduce the running time of the load process by more than 65%, from 9 to 3 hours. It’s a considerable performance gain, especially as I didn’t have to refactor any of the business logic implemented in the queries.

The ETL (Extract, Transform, Load) solution made use of SSIS (SQL Server Integration Services) packages to load data sequentially from several sources into staging tables, and from staging further into the base tables. Each package was responsible for deleting the data from the staging tables via TRUNCATE, extracting the data 1:1 from the source into the staging tables, then loading the data 1:1 from the staging tables into the base tables. It’s the simplest and a relatively effective ETL design, one I have also used, with small alterations, for other data warehouse solutions. For months the data load ran smoothly, until data growth and eventually other problems increased the loading time from 5 to 9 hours.

Using TABLOCK Hint

Using SSIS to bulk load data into SQL Server provides a good balance of performance and flexibility. Within a Data Flow, when the “Table Lock” property on the destination is checked, the inserted records can be minimally logged, speeding up the load by a factor of two. The TABLOCK hint can also be used for insert operations performed outside of SSIS packages. In this case the movement of data from staging into the base tables was performed in plain T-SQL, outside of the SSIS packages, so the hint was applied there as well. Further data processing benefited from this change too. This optimization step alone provided a 30-40% performance gain.
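
For the T-SQL part of the load, a minimal sketch could look as follows; the table and column names are hypothetical, and it assumes the database runs under the simple or bulk-logged recovery model, one of the prerequisites for minimal logging:

-- Load the base table with a table lock so the insert can be minimally logged
-- (hypothetical table and column names).
INSERT INTO dbo.SalesOrders WITH (TABLOCK)
       (OrderId, CustomerId, OrderDate, Amount)
SELECT OrderId, CustomerId, OrderDate, Amount
FROM   dbo.stg_SalesOrders;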

Drop/Recreating the Indexes on Big Tables

As the base tables each had several indexes, it proved beneficial to drop the indexes on the big tables (e.g. those with more than 1,000,000 records) before loading the data into them, and to recreate the indexes afterwards. This was done within SSIS and provided an additional 20-30% performance gain over the previous step.
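
A minimal sketch of the pattern, with hypothetical table and index names (e.g. run from Execute SQL Tasks placed before and after the Data Flow):

-- Drop the nonclustered index before the bulk load (hypothetical names).
DROP INDEX IX_SalesOrders_CustomerId ON dbo.SalesOrders;

-- ... bulk load into dbo.SalesOrders ...

-- Recreate the index once the load has finished.
CREATE NONCLUSTERED INDEX IX_SalesOrders_CustomerId
    ON dbo.SalesOrders (CustomerId)
    INCLUDE (OrderDate, Amount);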

Consolidating the Indexes

Adding missing indexes and removing or consolidating (overlapping) indexes are typical index maintenance tasks, which are apparently occasionally ignored. This doesn’t always bring as much performance gain as the previous methods, though dropping and consolidating some indexes proved beneficial, as less data had to be maintained. The data processing logic benefited from the creation of new indexes as well.
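
To spot candidates for consolidation one can, for example, list the key columns of the existing indexes per table and look for indexes that share the same leading key columns. A sketch based on the system catalog views:

-- List index key columns per user table; indexes sharing the same leading
-- key columns are candidates for consolidation.
SELECT  OBJECT_NAME(i.object_id) AS table_name,
        i.name                   AS index_name,
        ic.key_ordinal,
        c.name                   AS column_name
FROM    sys.indexes i
JOIN    sys.index_columns ic ON ic.object_id = i.object_id AND ic.index_id = i.index_id
JOIN    sys.columns c        ON c.object_id  = ic.object_id AND c.column_id = ic.column_id
WHERE   i.type_desc IN ('CLUSTERED', 'NONCLUSTERED')
  AND   ic.is_included_column = 0
  AND   OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
ORDER BY table_name, index_name, ic.key_ordinal;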

Running Packages in Parallel

As the packages were run sequentially (one package at a time), the data load was hardly taking advantage of the processing power available on the server. Even if individual queries could use parallelism, the benefit was minimal. Enabling the packages to run in parallel added a further performance gain, however it reduced the processing resources available for other tasks. When the data load is performed overnight this causes minimal overhead, however it should be avoided when the data are loaded during business hours.
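
If the project deployment model and the SSISDB catalog are used, one possible way to run packages in parallel is to start their executions asynchronously from T-SQL. The sketch below uses hypothetical folder, project and package names; with SYNCHRONIZED set to 0 (the default), catalog.start_execution returns immediately, so several packages can be started one after another and run side by side:

DECLARE @execution_id BIGINT;

-- Create an execution for the package (hypothetical folder/project/package names).
EXEC SSISDB.catalog.create_execution
     @folder_name  = N'DWH',
     @project_name = N'DataLoad',
     @package_name = N'LoadSales.dtsx',
     @execution_id = @execution_id OUTPUT;

-- SYNCHRONIZED = 0: start_execution does not wait for the package to finish.
EXEC SSISDB.catalog.set_execution_parameter_value
     @execution_id,
     @object_type     = 50,
     @parameter_name  = N'SYNCHRONIZED',
     @parameter_value = 0;

EXEC SSISDB.catalog.start_execution @execution_id;

Within a single package, the MaxConcurrentExecutables property additionally controls how many tasks may run at the same time.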

Using Clustered Indexes

In my analysis I found out that many tables, especially the ones storing prepared data, were lacking a clustered index, even if further indexes were built on them. I remember that years back there was a (false) myth that fact and/or dimension tables don’t need clustered indexes in SQL Server. Of course clustered indexes have downsides (e.g. fragmentation, excessive key lookups), though their benefits by far outweigh them. Besides the missing clustered indexes, there were cases in which tables would have benefited from a narrow clustered index instead of a wide multicolumn one. Where needed, such cases were addressed as well.
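
A minimal sketch, with hypothetical names, of clustering a prepared-data table on a narrow surrogate key instead of a wide multicolumn key (assuming the existing wide clustered index, if any, has been dropped first):

-- Hypothetical example: a narrow clustered index on a surrogate key;
-- the wide business key can remain available as a nonclustered index.
CREATE CLUSTERED INDEX CIX_FactSales_SalesKey
    ON dbo.FactSales (SalesKey);

CREATE NONCLUSTERED INDEX IX_FactSales_Business
    ON dbo.FactSales (OrderDate, CustomerId, ProductId);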

Removing the Staging Tables

Given that the source and target systems are in the same virtual environment, and the data are loaded 1:1 between the various layers, without further transformations and conversions, one could load the data directly into the base tables. After some tests I came to the conclusion that the load from the source tables into the staging tables and the load from the staging tables into the base tables (with the TABLOCK hint) were taking almost the same amount of time. This means that the base tables would be unavailable for roughly the same amount of time if the data were loaded from the sources directly into them. Therefore one could in theory remove the staging tables from the architecture. Frankly, one should think twice before making such a change, as there can be further implications over time. Even if today the data are imported 1:1, in the future this could change.

Reducing the Data Volume

Reducing the data volume was identified as a further possible technique to reduce the time needed for data loading. A data warehouse is built based on a set of requirements and presumptions that change over time. It can happen, for example, that even if the reports need only 1-2 years’ worth of data, the data load considers a much bigger timeframe. Some systems can hold up to 5-10 years’ worth of data. Loading all data without a specific requirement leads to wasted resources and longer load times. Limiting the transactional data to a given timeframe can make a considerable difference. Additionally, there are historical data that have the potential to be archived.
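
A minimal sketch, with hypothetical names, of restricting a source extract to the timeframe the reports actually need (here the last two years, purely as an illustration):

-- Extract only the timeframe required by the reports instead of the full history.
DECLARE @CutoffDate DATE = DATEADD(YEAR, -2, CAST(GETDATE() AS DATE));

SELECT OrderId, CustomerId, OrderDate, Amount
FROM   dbo.SalesOrders
WHERE  OrderDate >= @CutoffDate;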

There are also tables for which a weekly or monthly refresh would suffice. Some tables or even data sources can become obsolete, yet they continue to be loaded into the data warehouse. Such cases occur seldom, though they do occur. Some unused or redundant columns could also have been removed from the packages.

Further Thoughts

There are further techniques to optimize the data load within a data warehouse, like partitioning large tables, using columnstore indexes, or optimizing the storage, however my target was to provide sufficient performance gain with a minimum of effort and design changes. Therefore I stopped when I considered that the amount of effort would be considerably higher than the performance gain.
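
For illustration only, a sketch of a columnstore index on a hypothetical fact table; note that in SQL Server 2012 a nonclustered columnstore index makes the table read-only, so it would have to be dropped or disabled before each load and rebuilt afterwards:

-- Hypothetical example: a nonclustered columnstore index to speed up analytical queries.
-- In SQL Server 2012 the table becomes read-only while this index exists.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactSales
    ON dbo.FactSales (OrderDate, CustomerId, ProductId, Amount);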

Further Reading:
[1] TechNet (2009) The Data Loading Performance Guide, by Thomas Kejser, Peter Carlin & Stuart Ozer (link)
[2] MSDN (2010) Best Practices for Data Warehousing with SQL Server 2008 R2, by Mark Whitehorn, Keith Burns & Eric N Hanson (link)
[3] MSDN (2012) Whitepaper: Fast Track Data Warehouse Reference Guide for SQL Server 2012, by Eric Kraemer, Mike Bassett, Eric Lemoine & Dave Withers (link)
[4] MSDN (2008) Best Practices for Data Warehousing with SQL Server 2008, by Mark Whitehorn & Keith Burns (link)
[5] TechNet (2005) Strategies for Partitioning Relational Data Warehouses in Microsoft SQL Server, by Gandhi Swaminathan (link)
[6] SQL Server Customer Advisory Team (2013) Top 10 Best Practices for Building a Large Scale Relational Data Warehouse (link)

18 December 2014

Systems Engineering: Equilibrium (Just the Quotes)

"We cannot prevent equilibrium from producing its effects. We may brave human laws, but we cannot resist natural ones." (Jules Verne, "Twenty Thousand Leagues Under the Sea", 1870)

"Every situation is an equilibrium of forces; every life is a struggle between opposing forces working within the limits of a certain equilibrium." (Henri-Frédéric Amiel, "Amiel's Journal", 1885)

"Plasticity, then, in the wide sense of the word, means the possession of a structure weak enough to yield to an influence, but strong enough not to yield all at once. Each relatively stable phase of equilibrium in such a structure is marked by what we may call a new set of habits." (William James, "The Laws of Habit", 1887)

"Every change of one of the factors of an equilibrium occasions a rearrangement of the system in such a direction that the factor in question experiences a change in a sense opposite to the original change." (Henri L Le Chatelier, "Recherches Experimentales et Theoriques sur les Equilibres Chimiques" ["Experimental and Theoretical Research on Chemical Equilibria"], Annales des Mines 8, 1888)

"In every symmetrical system every deformation that tends to destroy the symmetry is complemented by an equal and opposite deformation that tends to restore it. […] One condition, therefore, though not an absolutely sufficient one, that a maximum or minimum of work corresponds to the form of equilibrium, is thus applied by symmetry." (Ernst Mach, "The Science of Mechanics: A Critical and Historical Account of Its Development", 1893)

"We rise from the conception of form to an understanding of the forces which gave rise to it [...] in the representation of form we see a diagram of forces in equilibrium, and in the comparison of kindred forms we discern the magnitude and the direction of the forces which have sufficed to convert the one form into the other." (D'Arcy Wentworth Thompson, "On Growth and Form" Vol. 2, 1917)

"What in the whole denotes a causal equilibrium process, appears for the part as a teleological event." (Ludwig von Bertalanffy, 1929)

"True equilibria can occur only in closed systems and that, in open systems, disequilibria called ‘steady states’, or ‘flow equilibria’ are the predominant and characteristic feature. According to the second law of thermodynamics a closed system must eventually attain a time-independent equilibrium state, with maximum entropy and minimum free energy. An open system may, under certain conditions, attain a time-independent state where the system remains constant as a whole and in its phases, though there is a continuous flow of component materials. This is called a steady state. Steady states are irreversible as a whole. […] A closed system in equilibrium does not need energy for its preservation, nor can energy be obtained from it. In order to perform work, a system must be in disequilibrium, tending toward equilibrium and maintaining a steady state, Therefore the character of an open system is the necessary condition for the continuous working capacity of the organism." (Ludwig on Bertalanffy, "Theoretische Biologie: Band 1: Allgemeine Theorie, Physikochemie, Aufbau und Entwicklung des Organismus", 1932)

"A state of equilibrium in a system does not mean, further, that the system is without tension. Systems can, on the contrary, also come to equilibrium in a state of tension (e.g., a spring under tension or a container with gas under pressure).The occurrence of this sort of system, however, presupposes a certain firmness of boundaries and actual segregation of the system from its environment (both of these in a functional, not a spatial, sense). If the different parts of the system are insufficiently cohesive to withstand the forces working toward displacement (i.e., if the system shows insufficient internal firmness, if it is fluid), or if the system is not segregated from its environment by sufficiently firm walls but is open to its neighboring systems, stationary tensions cannot occur. Instead, there occurs a process in the direction of the forces, which encroaches upon the neighboring regions with diffusion of energy and which goes in the direction of an equilibrium at a lower level of tension in the total region. The presupposition for the existence of a stationary state of tension is thus a certain firmness of the system in question, whether this be its own inner firmness or the firmness of its walls." (Kurt Lewin, "A Dynamic Theory of Personality", 1935)

"The process moves in the direction of a state of equilibrium only for the system as a whole. Part processes may at the same time go on in opposed directions, a circumstance which is of the greatest significance for, for example, the theory of detour behavior. It is hence important to take the system whole which is dominant at the moment as basis." (Kurt Lewin, "A Dynamic Theory of Personality", 1935)

"One may generalize upon these processes in terms of group equilibrium. The group may be said to be in equilibrium when the interactions of its members fall into the customary pattern through which group activities are and have been organized. The pattern of interactions may undergo certain modifications without upsetting the group equilibrium, but abrupt and drastic changes destroy the equilibrium." (William F Whyte, "Street Corner Society", 1943)

"The behavior of two individuals, consisting of effort which results in output, is considered to be determined by a satisfaction function which depends on remuneration (receiving part of the output) and on the effort expended. The total output of the two individuals is not additive, that is, together they produce in general more than separately. Each individual behaves in a way which he considers will maximize his satisfaction function. Conditions are deduced for a certain relative equilibrium and for the stability of this equilibrium, i.e., conditions under which it will not pay the individual to decrease his efforts. In the absence of such conditions ‘exploitation’ occurs which may or may not lead to total parasitism." (Anatol Rapoport, "Mathematical theory of motivation interactions of two individuals," The Bulletin of Mathematical Biophysics 9, 1947)

"The study of the conditions for change begins appropriately with an analysis of the conditions for no change, that is, for the state of equilibrium." (Kurt Lewin, "Quasi-Stationary Social Equilibria and the Problem of Permanent Change", 1947)

"Now a system is said to be at equilibrium when it has no further tendency to change its properties." (Walter J Moore, "Physical chemistry", 1950)

"Physical irreversibility manifests itself in the fact that, whenever the system is in a state far removed from equilibrium, it is much more likely to move toward equilibrium, than in the opposite direction." (William Feller, "An Introduction To Probability Theory And Its Applications", 1950)

"Equilibrium requires that the whole of the structure, the form of its elements, and the means of interconnection be so combined that at the supports there will automatically be produced passive forces or reactions that are able to balance the forces acting upon the structures, including the force of its own weight."  (Eduardo Torroja, "Philosophy of Structure", 1951) 

"[…] there are three different but interconnected conceptions to be considered in every structure, and in every structural element involved: equilibrium, resistance, and stability." (Eduardo Torroja, "Philosophy of Structure" , 1951) 

"Every stable system has the property that if displaced from a state of equilibrium and released, the subsequent movement is so matched to the initial displacement that the system is brought back to the state of equilibrium. A variety of disturbances will therefore evoke a variety of matched reactions." (W Ross Ashby, "Design for a Brain: The Origin of Adaptive Behavior", 1952)

"The primary fact is that all isolated state-determined dynamic systems are selective: from whatever state they have initially, they go towards states of equilibrium. These states of equilibrium are always characterised, in their relation to the change-inducing laws of the system, by being exceptionally resistant." (W Ross Ashby, "Design for a Brain: The Origin of Adaptive Behavior", 1952)

"As shorthand, when the phenomena are suitably simple, words such as equilibrium and stability are of great value and convenience. Nevertheless, it should be always borne in mind that they are mere shorthand, and that the phenomena will not always have the simplicity that these words presuppose." (W Ross Ashby, "An Introduction to Cybernetics", 1956)

"Reversible processes are not, in fact, processes at all, they are sequences of states of equilibrium. The processes which we encounter in real life are always irreversible processes." (Arnold Sommerfeld, "Thermodynamics and Statistical Mechanics", Lectures on Theoretical - Physics Vol. V, 1956)

"The static stability of a system is defined by the initial tendency to return to equilibrium conditions following some disturbance from equilibrium. […] If the object has a tendency to continue in the direction of disturbance, negative static stability or static instability exists. […] If the object subject to disturbance has neither the tendency to return nor the tendency to continue in the displacement direction, neutral static stability exists." (Hugh H Hurt, "Aerodynamics for Naval Aviators", 1960)

"While static stability is concerned with the tendency of a displaced body to return to equilibrium, dynamic stability is concerned with the resulting motion with time. If an object is disturbed from equilibrium, the time history of the resulting motion indicates the dynamic stability of the system. In general, the system will demonstrate positive dynamic stability if the amplitude of the motion decreases with time." (Hugh H Hurt, "Aerodynamics for Naval Aviators", 1960)

"[The equilibrium model describes systems] which, in moving to an equilibrium point, typically lose organization, and then tend to hold that minimum level within relatively narrow conditions of disturbance." (Walter F Buckley, "Sociology and modern systems theory", 1967)

"A system is in equilibrium when the forces constituting it are arranged in such a way as to compensate each other, like the two weights pulling at the arms of a pair of scales." (Rudolf Arnheim, "Entropy and Art: An Essay on Disorder and Order", 1971) 

"When matter is becoming disturbed by non-equilibrium conditions it organizes itself, it wakes up. It happens that our world is a non-equilibrium system." (Ilya Prigogine, "Thermodynamics of Evolution", 1972) 

"In an isolated system, which cannot exchange energy and matter with the surroundings, this tendency is expressed in terms of a function of the macroscopic state of the system: the entropy." (Ilya Prigogine, "Thermodynamics of Evolution", 1972) 

"Chance is commonly viewed as a self-correcting process in which a deviation in one direction induces a deviation in the opposite direction to restore the equilibrium. In fact, deviations are not 'corrected' as a chance process unfolds, they are merely diluted." (Amos Tversky, "Judgment Under Uncertainty: Heuristics and Biases", 1974)

"In any system governed by a potential, and in which the system's behavior is determined by no more than four different factors, only seven qualitatively different types of discontinuity are possible. In other words, while there are an infinite number of ways for such a system to change continuously (staying at or near equilibrium), there are only seven structurally stable ways for it to change discontinuously (passing through non-equilibrium states)." (Alexander Woodcock & Monte Davis, "Catastrophe Theory", 1978)

"All environmental areas, from the primeval forest to the large city, can be regarded as ecosystems and investigated accordingly, most of the attention being given to the lasting existence and functioning or 'equilibrium' of these systems." (Wolfgang Haber, Universitas: A Quarterly German Review of the Arts and Sciences Vol.26 (2), 1984)


"If a system is in a state of equilibrium (a steady state), then all sub-systems must be in equilibrium. If all sub-systems are in a state of equilibrium, then the system must be in equilibrium." (Barry Clemson, "Cybernetics: A New Management Tool", 1984)

"When loops are present, the network is no longer singly connected and local propagation schemes will invariably run into trouble. [...] If we ignore the existence of loops and permit the nodes to continue communicating with each other as if the network were singly connected, messages may circulate indefinitely around the loops and process may not converges to a stable equilibrium. […] Such oscillations do not normally occur in probabilistic networks […] which tend to bring all messages to some stable equilibrium as time goes on. However, this asymptotic equilibrium is not coherent, in the sense that it does not represent the posterior probabilities of all nodes of the network." (Judea Pearl, "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference", 1988)

"Regarding stability, the state trajectories of a system tend to equilibrium. In the simplest case they converge to one point (or different points from different initial states), more commonly to one (or several, according to initial state) fixed point or limit cycle(s) or even torus(es) of characteristic equilibrial behaviour. All this is, in a rigorous sense, contingent upon describing a potential, as a special summation of the multitude of forces acting upon the state in question, and finding the fixed points, cycles, etc., to be minima of the potential function. It is often more convenient to use the equivalent jargon of 'attractors' so that the state of a system is 'attracted' to an equilibrial behaviour. In any case, once in equilibrial conditions, the system returns to its limit, equilibrial behaviour after small, arbitrary, and random perturbations." (Gordon Pask, "Different Kinds of Cybernetics", 1992)

"Systems, acting dynamically, produce (and incidentally, reproduce) their own boundaries, as structures which are complementary (necessarily so) to their motion and dynamics. They are liable, for all that, to instabilities chaos, as commonly interpreted of chaotic form, where nowadays, is remote from the random. Chaos is a peculiar situation in which the trajectories of a system, taken in the traditional sense, fail to converge as they approach their limit cycles or 'attractors' or 'equilibria'. Instead, they diverge, due to an increase, of indefinite magnitude, in amplification or gain.(Gordon Pask, "Different Kinds of Cybernetics", 1992)

"The model of competitive equilibrium which has been discussed so far is set in a timeless environment. People and companies all operate in a world in which there is no future and hence no uncertainty." (Paul Ormerod, "The Death of Economics", 1994)

"Self-organization refers to the spontaneous formation of patterns and pattern change in open, nonequilibrium systems. […] Self-organization provides a paradigm for behavior and cognition, as well as the structure and function of the nervous system. In contrast to a computer, which requires particular programs to produce particular results, the tendency for self-organization is intrinsic to natural systems under certain conditions." (J A Scott Kelso, "Dynamic Patterns : The Self-organization of Brain and Behavior", 1995)

"Contrary to what happens at equilibrium, or near equilibrium, systems far from equilibrium do not conform to any minimum principle that is valid for functions of free energy or entropy production." (Ilya Prigogine, "The End of Certainty: Time, Chaos, and the New Laws of Nature", 1996) 

"[…] self-organization is the spontaneous emergence of new structures and new forms of behavior in open systems far from equilibrium, characterized by internal feedback loops and described mathematically by nonlinear equations.” (Fritjof Capra, “The web of life: a new scientific understanding of living systems”, 1996)

"Complex systems operate under conditions far from equilibrium. Complex systems need a constant flow of energy to change, evolve and survive as complex entities. Equilibrium, symmetry and complete stability mean death. Just as the flow, of energy is necessary to fight entropy and maintain the complex structure of the system, society can only survive as a process. It is defined not by its origins or its goals, but by what it is doing." (Paul Cilliers,"Complexity and Postmodernism: Understanding Complex Systems", 1998)

"An equilibrium is not always an optimum; it might not even be good. This may be the most important discovery of game theory." (Ivar Ekeland, "Le meilleur des mondes possibles" ["The Best of All Possible Worlds"], 2000)

"Positive feedbacks, when unchecked, can produce runaways until the deviation from equilibrium is so large that other effects can be abruptly triggered and lead to ruptures and crashes." (Didier Sornette, "Why Stock Markets Crash - Critical Events in Complex Systems", 2003)

"The players in a game are said to be in strategic equilibrium (or simply equilibrium) when their play is mutually optimal: when the actions and plans of each player are rational in the given strategic environment - i. e., when each knows the actions and plans of the others." (Robert Aumann, "War and Peace", 2005)

"The second law of thermodynamics states that in an isolated system, entropy can only increase, not decrease. Such systems evolve to their state of maximum entropy, or thermodynamic equilibrium. Therefore, physical self-organizing systems cannot be isolated: they require a constant input of matter or energy with low entropy, getting rid of the internally generated entropy through the output of heat ('dissipation'). This allows them to produce ‘dissipative structures’ which maintain far from thermodynamic equilibrium. Life is a clear example of order far from thermodynamic equilibrium." (Carlos Gershenson, "Design and Control of Self-organizing Systems", 2007)
