SQL Troubles: benchmarking


06 October 2025

🤖〽️Prompt Engineering: Copilot Unabridged (Part 104: Formalizing Intelligence in Artificial Intelligence)

Prompt Engineering Series

Prompt: "write a post of 600 words on how intelligence in AI can be formalized"

Introduction

Artificial Intelligence (AI) has rapidly evolved from rule-based systems to deep learning models capable of complex reasoning, perception, and decision-making. Yet, despite these advances, the concept of 'intelligence' in AI remains elusive and often loosely defined. Formalizing intelligence is essential not only for benchmarking progress but also for designing systems that exhibit predictable and interpretable behavior. This post explores how intelligence in AI can be formalized, drawing from computational theory, cognitive science, and mathematical frameworks.

1. Defining Intelligence

At its core, intelligence can be described as the ability to achieve goals in a wide range of environments. This definition, proposed by Shane Legg and Marcus Hutter, is one of the most widely accepted in the AI research community. It emphasizes adaptability, generalization, and goal-oriented behavior - traits that distinguish intelligent agents from narrow, task-specific systems.
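As a sketch, Legg and Hutter's definition is usually written as the universal intelligence measure below. The notation is the standard one from their work, not something defined in this post: \pi is the agent, E is the set of computable environments with bounded total reward, K(\mu) is the Kolmogorov complexity of environment \mu, and V_\mu^\pi is the expected total reward the agent collects in \mu.

```latex
\Upsilon(\pi) \;:=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V_{\mu}^{\pi}
```

The weight 2^{-K(\mu)} makes simple environments count more than complex ones, while the sum over all of E is what captures the "wide range of environments" part of the definition.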

2. The AIXI Model

One of the most ambitious attempts to formalize intelligence is the AIXI model, developed by Hutter. AIXI combines Solomonoff induction (a formal theory of prediction) with sequential decision theory. It defines an agent that maximizes expected reward in any computable environment. AIXI itself is incomputable, so it can only be approximated in practice, but it serves as a theoretical ideal for general intelligence. It provides a mathematical framework that captures learning, planning, and decision-making in a unified model.
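For reference, the AIXI action rule is commonly stated as follows. The symbols are taken from Hutter's formulation rather than from this post: a, o and r are actions, observations and rewards, k is the current step, m is the horizon, U is a universal Turing machine, q ranges over programs (candidate environments), and \ell(q) is the length of q in bits.

```latex
a_k \;:=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m}
\bigl[\, r_k + \cdots + r_m \,\bigr]
\sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```

The innermost sum is the Solomonoff-style prior over all programs consistent with the history, and the alternating max and sum operators are the sequential decision part (expectimax planning up to the horizon).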

3. Computational Rationality

Another approach to formalizing intelligence is through computational rationality, which models intelligent behavior as the outcome of optimizing decisions under resource constraints. This framework acknowledges that real-world agents (including humans and machines) operate with limited time, memory, and computational power. By incorporating these constraints, computational rationality bridges the gap between idealized models and practical AI systems.

4. Information-Theoretic Measures

Intelligence can also be quantified using information theory. Concepts like entropy, mutual information, and Kolmogorov complexity help measure the efficiency and generality of learning algorithms. For example, an intelligent system might be one that can compress data effectively, discover patterns with minimal prior knowledge, or adapt to new tasks with minimal retraining. These metrics provide objective ways to compare different AI systems.
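As a rough illustration of two of these measures, the sketch below computes the empirical Shannon entropy of a byte sequence and uses a general-purpose compressor as a stand-in for Kolmogorov complexity. This is only a proxy under stated assumptions: Kolmogorov complexity itself is uncomputable, and compressed length gives at best a crude upper bound. The function names are hypothetical and chosen for this example.

```python
import math
import os
import zlib
from collections import Counter


def shannon_entropy_bits(data: bytes) -> float:
    """Empirical Shannon entropy of the byte distribution, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (zlib, maximum level).

    Assumption: used here only as a computable proxy for Kolmogorov
    complexity, which cannot be computed exactly.
    """
    if not data:
        return 1.0
    return len(zlib.compress(data, 9)) / len(data)


regular = b"ab" * 5000       # highly regular data: a short description exists
noisy = os.urandom(10000)    # random data: essentially incompressible

for name, sample in (("regular", regular), ("noisy", noisy)):
    print(name,
          f"entropy={shannon_entropy_bits(sample):.2f} bits/byte",
          f"ratio={compression_ratio(sample):.3f}")
```

The regular sample should show low entropy and a very small compression ratio, while the random sample stays close to 8 bits per byte and barely compresses at all. Note that byte-level entropy ignores ordering, which is why the compressor is the better (though still imperfect) indicator of structure.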

5. Benchmarking and Evaluation

Formalization also involves creating standardized benchmarks. Datasets like ImageNet, GLUE, and SuperGLUE have helped quantify progress in specific domains like vision and language. More recently, multi-task and generalization benchmarks (e.g., BIG-bench, ARC) aim to evaluate broader cognitive capabilities. These benchmarks are crucial for testing whether AI systems exhibit traits of general intelligence, such as transfer learning, abstraction, and reasoning.

6. Ethical and Interpretability Considerations

Formalizing intelligence isn't just a technical challenge - it has ethical implications. A well-defined notion of intelligence can help ensure that AI systems behave safely and transparently. For instance, interpretability frameworks like SHAP or LIME aim to explain model decisions, which is essential for trust and accountability. Formal models also support value alignment, ensuring that intelligent agents act in accordance with human values.

7. Toward Artificial General Intelligence (AGI)

The ultimate goal of formalizing intelligence is to guide the development of Artificial General Intelligence (AGI) - systems that can perform any intellectual task a human can. While current AI excels in narrow domains, formal models like AIXI, computational rationality, and information-theoretic approaches provide blueprints for building more general, adaptable agents.

Conclusion

Formalizing intelligence in AI is a multidisciplinary endeavor that blends theory with practice. It involves defining what intelligence means, modeling it mathematically, and evaluating it empirically. As AI systems become more capable and autonomous, having a rigorous understanding of intelligence will be key to ensuring they are safe, reliable, and aligned with human goals.

Just try the prompt on Copilot or your favorite AI-powered assistant! Have you got a different/similar result? How big or important is the difference? Any other thoughts?
Just share the link to the post with me and I'll add it to this post as a resource!

Disclaimer: The whole text was generated by Copilot (under Windows 11) at the first attempt. This is just an experiment to evaluate the feature's ability to answer standard general questions, independently of whether they are correctly or incorrectly posed. Moreover, the answers may reflect hallucinations and other types of inconsistent or incorrect reasoning.


04 August 2020

💼Project Management: Project Execution (Part I: Redefining Projects' Success I)

A project is typically considered successful if it has met the previously defined objectives within the allocated budget, timeframe and expected quality levels. Any negative deviation from any of these equates to a project failure. In other words, the success or failure of a project is judged as black or white with no grays in between, which is utopian, especially for mid-sized to big software projects, which typically involve a lot of uncertainty. According to this definition, a project that was delayed by a few months, or overran its budget by 10%, or delivered only 90% of the planned functionality, or showed any combination of these negative deviations, can be considered failed.

If a small project needed 6 instead of 3 months to complete, which is normal for projects with reduced priority, then the increase in duration can be ignored as long as the project costs haven’t changed. Similarly, a delay of 3 months for a 2-year project is normal, especially when the project is complex. Even if additional costs are incurred within this timeframe, as long as they are a small percentage of the overall project costs, the impact can be acceptable for the business. On the other hand, when the delays grow exponentially and have further implications, the problem changes dramatically.
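To put the two examples above side by side, here is a minimal sketch that expresses a delay as a fraction of the planned duration. The helper name `relative_overrun` is hypothetical, and the durations are simply the figures from the paragraph above converted to months.

```python
def relative_overrun(planned: float, actual: float) -> float:
    """Return the overrun as a fraction of the planned value (0.25 means 25%)."""
    if planned <= 0:
        raise ValueError("planned must be positive")
    return (actual - planned) / planned


# Small project: planned 3 months, needed 6 months.
small = relative_overrun(planned=3, actual=6)      # 1.0
# 2-year project (24 months) delayed by 3 months.
large = relative_overrun(planned=24, actual=27)    # 0.125

print(f"small project: {small:.1%}")   # small project: 100.0%
print(f"large project: {large:.1%}")   # large project: 12.5%
```

In relative terms the small project doubled its duration while the large one slipped by an eighth, yet both can be acceptable. This suggests that a percentage alone, without costs and priorities, says little about success.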

Big projects typically have strategic importance. This is the case for ERP implementations, which, besides the technology changes, have in theory the potential to transform an organization, pushing it to reach higher performance levels. Such projects are estimated to take on average one to two years for a medium-sized organization; however, the delays can easily reach 50% to 100% of the initial estimate. Independently of what caused the delay, as long as the organization achieved the intended goals and can cover the project’s costs, one can say that the project made a (positive) difference.

Independently of the project’s size, if 90% of the important functionality is available, then most likely the remaining 10% can be covered at first with manual work, with further investment in the system following over time as part of a continuous improvement process. It’s maybe not ideal for the users; however, the approach also incorporates the learning curve of working with the system and understanding its possibilities and limitations. Of course, when the percentage of available functionality falls below a given limit, the system’s acceptance is endangered, and users eventually start looking for alternatives.

There are also projects that opened the door to new possibilities and that require more investment to leverage their full capabilities. Some ERP implementations have this potential, despite overruns. Some of these investments are justified while others are not. Related to this last category, there are projects that are on time and on budget, and whose deliverables satisfy the quality criteria and objectives, yet they make no difference for the organization despite the significant investments made. Sure, some of the projects in this category are a must (e.g. updates, upgrades, technology changes); however, there are also projects that can be considered a self-occupational hazard. In extremis, such projects run in the background and cost organizations a lot of energy and resources, while their effects are questionable.

At least from these examples, the definition of a project's success needs to be changed, or maybe standardized, to consider not only intrinsic but also extrinsic aspects. In theory, that is the role of a Project Management Office (PMO); however, it’s challenging to find an evaluation methodology that fits all needs. Furthermore, for the same reasons, benchmarking projects across organizations and industries can prove to be a foolhardy attempt.

13 February 2016

♜Strategic Management: Benchmarking (Definitions)

"The process of comparison in which one set of metrics comes from the entity being measured and the other set of metrics comes from averages for an industry, specific configuration, or other common attributes." (Janice M Roehl-Anderson, IT Best Practices for Financial Managers, 2010)

Benchmarks: "Objective measures of performance, often available from industry trade associations." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed., 2011)

"A systematic process of comparing an organization to other organizations for the purposes of identifying better work methods and determining best practices." (Joan C Dessinger, "Fundamentals of Performance Improvement" 3rd Ed., 2012)

"Benchmarking uses external and internal comparisons to plan for future improvements." (John R Schermerhorn Jr, "Management" 12th Ed., 2012)

"A point of reference for measurement." (Information Management)

"A technique in which an organization measures its performance against that of best-in-class organizations, determines how those organizations achieved their performance levels and uses the information to improve its own performance. Subjects that can be benchmarked include strategies, operations and processes." (American Society for Quality)
