Showing posts with label Learning. Show all posts
Showing posts with label Learning. Show all posts

16 September 2024

🧭Business Intelligence: Mea Culpa (Part IV: Generalist or Specialist in an AI Era?)

Business Intelligence Series
Business Intelligence Series

Except the early professional years when I did mainly programming for web or desktop applications in the context of n-tier architectures, over the past 20 years my professional life was a mix between BI, Data Analytics, Data Warehousing, Data Migrations and other topics (ERP implementations and support, Project Management, IT Service Management, IT, Data and Applications Management), though the BI topics covered probably on average at least 60% of my time, either as internal or external consultant. 

I can consider myself thus a generalist who had the chance to cover most of the important aspects of a business from an IT perspective, and it was thus a great experience, at least until now! It’s a great opportunity to have the chance to look at problems, solutions, processes and the various challenges and opportunities from different perspectives. Technical people should have this opportunity directly in their jobs through the communication occurring in projects or IT services, though that’s more of a wish! Unfortunately, the dialogue between IT and business occurs almost only over the tickets and documents, which might be transparent but isn’t necessarily effective or efficient! 

Does working only part time in an area make one person less experienced or knowledgeable than other people? In theory, a full-time employee should get more exposure in depth and/or breadth, but that’s relative! It depends on the challenges one faces, the variation of the tasks, the implemented solutions, their depth and other technical and nontechnical factors like training, one’s experience in working with the various tools, the variety of the tasks and problem faced, professionalism, etc. A richer exposure can but not necessarily involve more technical and nontechnical knowledge, and this shouldn’t be taken as given! There’s no right or wrong answer even if people tend to take sides and argue over details.

Independently of job's effective time, one is forced to use his/her time to keep current with technologies or extend one’s horizon. In IT, a professional seldom can rely on what is learned on the job. Fortunately, nowadays one has more and more ways of learning, while the challenge shifts toward what to ignore, respectively better management of one’s time while learning. The topics increase in complexity and with this blogging becomes even more difficult, especially when one competes with AI content!

Talking about IT, it will be interesting to see how much AI can help or replace some of the professions or professionals. Anyway, some jobs will become obsolete or shift the focus to prompt engineering and technical reviews. AI still needs explicit descriptions of how to address tasks, at least until it learns to create and use better recipes for problem definition and solving. The bottom line, AI and its use can’t be ignored, and it can and should be used also in learning new things. It’s amazing what one can do nowadays with prompt engineering! 

Another aspect on which AI can help is to tailor the content to one’s needs. A high percentage in the learning process is spent on fishing in a sea of information for content that is worth knowing, respectively for a solution to one’s needs. AI must be able to address also some of the context without prompters being forced to give information explicitly!

AI opens many doors but can close many others. How much of one’s experience will remain relevant over the next years? Will AI have more success in addressing some of the challenges existing in people’s understanding or people will just trust AI blindly? Anyway, somebody must be smarter than AI, and here people’s collective intelligence probably can prove to be a real match. 

13 June 2024

🧭🏭Business Intelligence: Microsoft Fabric (Part V: One Person Can’t Learn or Do Everything)

Business Intelligence Series
Business Intelligence Series

Today’s Explicit Measures webcast [1] considered an article written by Kurt Buhler (The Data Goblins): [Microsoft] "Fabric is a Team Sport: One Person Can’t Learn or Do Everything" [2]. It’s a well-written article that deserves some thought as there are several important points made. I can’t say I agree with the full extent of some statements, even if some disagreements are probably just a matter of semantics.

My main disagreement starts with the title “One Person Can’t Learn or Do Everything”. As clarified in webcast's chat, the author defines “everything" as an umbrella for “all the capabilities and experiences that comprise Fabric including both technical (like Power BI) or non-technical (like adoption data literacy) and everything in between” [1].

For me “everything” is relative and considers a domain's core set of knowledge, while "expertise" (≠ "mastery") refers to the degree to which a person can use the respective knowledge to build back-to-back solutions for a given area. I’d say that it becomes more and more challenging for beginners or average data professionals to cover the core features. Moreover, I’d separate the non-technical skills because then one will also need to consider topics like Data, Project, Information or Knowledge Management.

There are different levels of expertise, and they can vary in depth (specialization) or breadth (covering multiple areas), respectively depend on previous experience (whether one worked with similar technologies). Usually, there’s a minimum of requirements that need to be covered for being considered as expert (e.g. certification, building a solution from beginning to the end, troubleshooting, performance optimization, etc.). It’s also challenging to roughly define when one’s expertise starts (or ends), as there are different perspectives on the topics. 

Conversely, the term expert is in general misused extensively, sometimes even with a mischievous intent. As “expert” is usually considered an external consultant or a person who got certified in an area, even if the person may not be able to build solutions that address a customer’s needs. 

Even data professionals with many years of experience can be overwhelmed by the volume of knowledge, especially when one considers the different experiences available in MF, respectively the volume of new features released monthly. Conversely, expertise can be considered in respect to only one or more MF experiences or for one area within a certain layer. Lot of the knowledge can be transported from other areas – writing SQL and complex database objects, modelling (enterprise) semantic layers, programming in Python, R or Power Query, building data pipelines, managing SQL databases, etc. 

Besides the standard documentation, training sessions, and some reference architectures, Microsoft made available also some labs and other material, which helps discovering the features available, though it doesn’t teach people how to build complete solutions. I find more important than declaring explicitly the role-based audience, the creation of learning paths for the various roles.

During the past 6-7 months I've spent on average 2 days per week learning MF topics. My problem is not the documentation but the lack of maturity of some features, the gaps in functionality, identifying the respective gaps, knowing what and when new features will be made available. The fact that features are made available or changed while learning makes the process more challenging. 

My goal is to be able to provide back-to-back solutions and I believe that’s possible, even if I might not consider all the experiences available. During the past 22 years, at least until MF, I could build complete BI solutions starting from requirements elicitation, data extraction, modeling and processing for data consumption, respectively data consumption for the various purposes. At least this was the journey of a Software Engineer into the world of data. 

References:
[1] Explicit Measures (2024) Power BI tips Ep.328: Microsoft Fabric is a Team Sport (link)
[2] Data Goblins (2024) Fabric is a Team Sport: One Person Can’t Learn or Do Everything (link)

13 February 2024

🧭🏭Business Intelligence: A One-Man Show III (The Microsoft Fabric)

Business Intelligence Series
Business Intelligence Series

Announced at the end of the last year, Microsoft Fabric (MF) become a reality for the data professional, even if there are still many gaps in the overall architecture and some things don't work as they should. The Delta Lake and the various data consumption experiences seem to bring more flexibility but also raise questions on how one can use them adequately in building solutions for Data Analytics and/or Data Science. 

Currently, as it happens with new technologies, data professionals seem to try to explore the functionality, see what's possible, what's missing, and that's a considerable effort as everybody is more or less on his own. The material released by Microsoft and other professionals should facilitate in theory this effort, though the considerable number of features and the effort needed to review them do the opposite. Some professionals do this as part of their jobs, and exploring the feature seems to be a full job in each area, while others, like myself, do it in their own time. 

There are organizations that demand from their employees to regularly actualize their knowledge in their field of activity, respectively explore how new technologies can be integrated in organization's architecture. Having a few hours or even a day a weak for this can go a long way! Occasionally, I could take 1-2 hours a week during the program and take maybe a few many more hours from my own time. Unfortunately, most of the significant progress I made in a certain area (SQL Server, Dynamics 365, Software Engineering, Power BI, and now MF) it was done in my own time, which became in time more and more challenging to do given the pace with which new features and technologies develop.

By comparison, it was relatively easy to locally install SQL Server in its various CTP or community versions, deploy one of the readily-available databases, and start learning. I'm still doing it, playing with a SQL Server 2022 instance whenever I find the time. Similarly, I can use Power BI and a few other tools, depending again on the time available to make progress. However, with MF things start slowly to get blurry. The 60 days of trial won't cut it anymore as there are so many things to learn - Spark SQL, PySpark, Delta Lake, KQL, Dataflows, etc. Probably, there will be ways for learning any of this standalone, though not together in an integrated manner. 

The complexity of the tools demands more time, a proper infrastructure and a good project to accommodate them. This doesn't mean that the complexity of the solutions need to increase as well! Azure Synapse allowed me to reuse many of the techniques I used in the past to build a modern Data Analytics solution, while in other areas I had to accommodate the new. The solution wasn't perfect (only time will tell), though it provided the minimum of what was needed. I expect the same to happen in Microsoft Fabric, even if the number of choices is bigger. 

There's a considerable difference between building a minimal viable solution and exploring, respectively harnessing MF's capabilities. The challenge for many organizations is to determine what that minimum is about, how to build that knowledge into the team, especially when starting from zero. 

Conversely, this doesn't mean that the skillset and effort can't be covered by one person. It might be more challenging though achievable if the foundation is there, respectively if certain conditions are met. This depends also on organization's expectations, infrastructure and other characteristics. A whole team is more likely to succeed than one person, but not certainty! 

Previous Post <<||>> Next Post

27 January 2024

Data Science: Back to the Future I (About Beginnings)

Data Science
Data Science Series

I've attended again, after several years, a webcast on performance improvement in SQL Server with Claudio Silva, “Writing T-SQL code for the engine, not for you”. The session was great and I really enjoyed it! I recommend it to any data(base) professional, even if some of the scenarios presented should be known already.

It's strange to see the same topics from 20-25 years ago reappearing over and over again despite the advancements made in the area of database engines. Each version of SQL Server brought something new in what concerns the performance, though without some good experience and understanding of the basic optimization and troubleshooting techniques there's little overall improvement for the average data professional in terms of writing and tuning queries!

Especially with the boom of Data Science topics, the volume of material on SQL increased considerably and many discover how easy is to write queries, even if the start might be challenging for some. Writing a query is easy indeed, though writing a performant query requires besides the language itself also some knowledge about the database engine and the various techniques used for troubleshooting and optimization. It's not about knowing in advance what the engine will do - the engine will often surprise you - but about knowing what techniques work, in what cases, which are their advantages and disadvantages, respectively on how they might impact the processing.

Making a parable with writing literature, it's not enough to speak a language; one needs more for becoming a writer, and there are so many levels of mastery! However, in database world even if creativity is welcomed, its role is considerable diminished by the constraints existing in the database engine, the problems to be solved, the time and the resources available. More important, one needs to understand some of the rules and know how to use the building blocks to solve problems and build reliable solutions.

The learning process for newbies focuses mainly on the language itself, while the exposure to complexity is kept to a minimum. For some learners the problems start when writing queries based on multiple tables -  what joins to use, in what order, how to structure the queries, what database objects to use for encapsulating the code, etc. Even if there are some guidelines and best practices, the learner must walk the path and experiment alone or in an organized setup.

In university courses the focus is on operators algebras, algorithms, on general database technologies and architectures without much hand on experience. All is too theoretical and abstract, which is acceptable for research purposes,  but not for the contact with the real world out there! Probably some labs offer exposure to real life scenarios, though what to cover first in the few hours scheduled for them?

This was the state of art when I started to learn SQL a quarter century ago, and besides the current tendency of cutting corners, the increased confidence from doing some tests, and the eagerness of shouting one’s shaking knowledge and more or less orthodox ideas on the various social networks, nothing seems to have changed! Something did change – the increased complexity of the problems to solve, and, considering the recent technological advances, one can afford now an AI learn buddy to write some code for us based on the information provided in the prompt.

This opens opportunities for learning and growth. AI can be used in the learning process by providing additional curricula for learners to dive deeper in some topics. Moreover, it can help us in time to address the challenges of the ever-increase complexity of the problems.

19 October 2022

🌡Performance Management: Mastery (Part II: First Time Right - The Aim toward Operational Excellence)

 

Performance Management Series

Rooted in Six Sigma methodology as a step toward operational excellence, First Time Right (FTR) implies that any procedure is performed in the right manner the first time and every time. It equates to minimizing the waste in its various forms (inventory, motion, overprocessing, overproduction, waiting, transportation, defects). Like many quality concepts from the manufacturing industry, the concept was transported in the software development process as principle, process, goal and/or metric. Thus, it became part of Software Engineering, Project Management, Data Science, and any other similar endeavors whose outcome results in software products. 

Besides the quality aspect, FTR is rooted also in the economic imperative – the need to achieve something in the minimum amount of time with the minimum of effort. It’s about being efficient in delivering a product or achieving a given target. It can be associated with continuous improvement, learning and mastery, the aim being to encompass FTR as part of organization’s culture. 

Even if not explicitly declared, FTR lurks in each task planned. It seems that it became common practice to plan with the FTR in mind, however between this theoretical aim and practice there’s as usual an important gap. Unfortunately, planners, managers and even tasks' performers often forget that mistakes are made, that several iterations are needed to get the job done. It starts with the communication between people in clarifying the requirements and ends with the formal sign off. All the deviations from the FTR add up in the deviations between expected and actual effort, though probably more important are the deviations from the plan and all the consequences deriving from it. Especially in complex projects this adds up into a spiral of issues that can easily reinforce themselves. 

Many of the jobs that imply creativity, innovation, research or exploration require at least several iterations to get the job done and this is independent of participants’ professionalism and experience. Moreover, the more quality one needs, the higher the effort, the 80/20 being sometimes a good approximation of the effort needed. In extremis, aiming for perfection instead of excellence can make certain tasks a never-ending story. 

Achieving FTR requires practice - the more novelty, the higher the complexity, the communication or the synchronization needs, the more practice is needed. It starts with the individual to master the individual tasks and ends with the team, where communication, synchronization and other aspects need to be considered. The practice is usually achieved on hands-on work as part of the daily duties, project work, and so on. Unfortunately, it’s based primarily on individual experience, and seldom groomed in advance, as preparation for future tasks. That’s why sometimes when efficiency is needed in performing critical complex tasks, one also needs to consider the learning curve in achieving the required quality. 

Of course, many organizations demand from job applicants experience and, when possible, they hire people with experience, however the diversity, complexity and changing nature of tasks require further practice. This aspect is somehow recognized in the implementation in organizations of the various forms of DevOps, though how many organizations adopt it and enforce it on a regular basis? Moreover, a major requirement of nowadays businesses is to be agile, and besides the mere application of methodologies, being agile means to have also a FTR mindset. 

FTR starts with the wish for mastery at individual and team level and, with the right management attention, by allocating time for learning, self-development in the important areas, providing relevant feedback and building an infrastructure for knowledge sharing and harnessing, FTR can become part of organization’s culture. It’s up to each of us to do it!

09 August 2022

🧭🪄Business Intelligence: Power BI (Part I: Power BI's Learning Curve I)

A learning curve attempts depicting the (average) time it takes a person to learn how to use a method, tool, or technique, tracing the path from newbie to mastery. A common definition of the learning curve is based on the correlation between a learner’s performance on a task or activity and the number of attempts or amount of time required to complete it.

There are several diagrams in circulation which depict the correlation between the difficulty of Power BI concepts and probably their implementation as functionality. Even if they reflect to some degree the rate of learning, their simplicity and fuzziness can easily make one question their accuracy in reflecting the reality.

Researchers tend to categorize the curves associated with the learning process in simple idealized patterns like S-curve (aka sigmoid), exponential growth, exponential rise and fall to limit, or power law, however the learning process in IT-based endeavors is seldom characterized by a linear or exponential curve, given that the tasks seldom allow a steady path. The jumps of knowledge between tasks can be wide enough to appear insurmountable, and they can prove to be quite of a challenge without some help.

Like a baby’s first steps, we, as learners, must learn first to crawl, before making some unsteady steps, and it can take long time until visible progress is made. It’s a slow progress until we suddenly hit a (tipping) point from which everything seems easy, fact that increases our confidence in us. On the other side, when we find that we make no visible progress for a long period, it’s easy to arrive to the opposite, a critical zone, which in extremis could make one lose interest.

As beginners, after the first tipping point on the learning journey, it’s easy to arrive at a plateau in which there seem no need to learn new things, the current knowledge allowing to handle a range of tasks of small to average complexity. This can last for a long time, and then, a big thing comes our way – a hard problem to solve or a concept hard to understand. It’s the point where we stagnate, and the deeper we go, and the more such challenges are thrown in our way, the more difficult the learning seems to be. However, with new understanding, small steps are made, one step after the other, the pace makes us to evolve faster until we reach again a critical point from which the process increases smoothly until we seem to stagnate again. We meet again a hard limit to growth, which seems to be more solid than the previous one.

Power BI's learning curve

Both limits to growth can appear to be hard, however, considering that the knowledge in the field expands, more opportunity for growth appear, thus, the limits are apparent. Even if knowledge tends to increase ‘indefinitely’, the limits are there in terms of complexity, time available, knowledge quality (incl. availability) or any other dimension of the learning process. Moreover, these successions of tipping points, growth limits, plateaus, critical, steady and fast progress zones can occur in several iterations in the learning path. Thus, the path seems to resemble a snakelike curve with many ups and downs.

For the learner is important to be aware of this last aspect, there are always ups and downs, taking effort, patience and maybe an expert’s help to bridge the gaps in between. The chances are high that the gap between what we think we know and what we know is considerable, therefore a reality check is useful from time to time. A new problem to tackle will provide that occasion!

Previous Post <<||>> Next Post

20 March 2021

🧭Business Intelligence: New Technologies, Old Challenges (Part I: An Introduction)

Business Intelligence

Each important technology has the potential of creating divides between the specialists from a given field. This aspect is more suggestive in the data-driven fields like BI/Analytics or Data Warehousing. The data professionals (engineers, scientists, analysts, developers) skilled only in the new wave of technologies tend to disregard the role played by the former technologies and their role in the data landscape. The argumentation for such behavior is rooted in the belief that a new technology is better and can solve any problem better than previous technologies did. It’s a kind of mirage professionals and customers can easily fall under.

Being bigger, faster, having new functionality, doesn’t make a tool the best choice by default. The choice must be rooted in the problem to be solved and the set of requirements it comes with. Just because a vibratory rammer is a new technology, is faster and has more power in applying pressure, this doesn’t mean that it will replace a hammer. Where a certain type of power is needed the vibratory rammer might be the best tool, while for situations in which a minimum of power and probably more precision is needed, like driving in a nail, then an adequately sized hammer will prove to be a better choice.

A technology is to be used in certain (business/technological) contexts, and even if contexts often overlap, the further details (aka requirements) should lead to the proper use of tools. It’s in a professional’s duties to be able to differentiate between contexts, requirements and the capabilities of the tools appropriate for each context. In this resides partially a professional’s mastery over its field of work and of providing adequate solutions for customers’ needs. Especially in IT, it’s not enough to master the new tools but also have an understanding about preceding tools, usage contexts, capabilities and challenges.

From an historical perspective each tool appeared to fill a demand, and even if maybe it didn’t manage to fill it adequately, the experience obtained can prove to be valuable in one way or another. Otherwise, one risks reinventing the wheel, or more dangerously, repeating the failures of the past. Each new technology seems to provide a deja-vu from this perspective.

Moreover, a new technology provides new opportunities and requires maybe to change our way of thinking in respect to how the technology is used and the processes or techniques associated with it. Knowledge of the past technologies help identifying such opportunities easier. How a tool is used is also a matter of skills, while its appropriate use and adoption implies an inherent learning curve. Having previous experience with similar tools tends to reduce the learning curve considerably, though hands-on learning is still necessary, and appropriate learning materials or tutoring is upon case needed for a smoother transition.

In what concerns the implementation of mature technologies, most of the challenges were seldom the technologies themselves but of non-technical nature, ranging from the poor understanding/knowledge about the tools, their role and the implications they have for an organization, to an organization’s maturity in leading projects. Even the most-advanced technology can fail in the hands of non-experts. Experience can’t be judged based only on the years spent in the field or the number of projects one worked on, but on the understanding acquired about implementation and usage’s challenges. These latter aspects seem to be widely ignored, even if it can make the difference between success and failure in a technology’s implementation.

Ultimately, each technology is appropriate in certain contexts and a new technology doesn’t necessarily make another obsolete, at least not until the old contexts become obsolete.

Previous Post <<||>>Next Post

04 March 2021

💼Project Management: Project Execution (Part IV: Projects' Dynamics II - Motion)

Project Management

Motion is the action or process of moving or being moved between an initial and a final or intermediate point. From the tinniest endeavors to the movement of the planets and beyond, everything is governed by motion. If the laws of nature seem to reveal an inner structural perfection, the activities people perform are quite often far from perfect, which is acceptable if we consider that (almost) everything is a learning process. What is probably less acceptable is the volume of inefficient motion we can easily categorize sometimes as waste.

The waste associated with motion can take many forms: sorting through a pile of tools to find the right one, searching for information, moving back and forth to reach a destination or achieve a goal, etc. Suboptimal motion can have important effects for an organization resulting in reduced productivity, respectively higher costs.

If for repetitive activities that involve a certain degree of similarity can be found typically a way to optimize the motion, the higher the uncertainty of the steps involved, the more difficult it becomes to optimize it. It’s the case of discovery endeavors in which the path between start and destination can’t be traced beforehand, respectively when the destination or path in between can’t be depicted to the needed level of detail. A strategy’s implementation, ERP implementations and other complex projects, especially the ones dealing with new technologies and/or incomplete knowledge, tend to be exploratory in nature and thus fall under this latter type a motion.

In other words, one must know at minimum the starting point, the destination, how to reach it and what it takes to reach it – resources, knowledge, skillset. When one has all this information one can go on and estimate how long it will take to reach the destination, though the estimate reflects the information available as well estimator’s skills in translating the information into a realistic roadmap. Each new information has the potential of impacting considerably the whole process, in extremis to the degree that one must start the journey anew. The complexity of such projects and the volume of uncertainty can make estimation difficult if not impossible, no matter how good estimators' skills are. At best an estimator can come with a best- and worst-case estimation, both however dependent on the assumptions made.

Moreover, complex projects are sensitive to the initial conditions or auspices under which they start. This sensitivity can turn a project in a totally different direction or pace, that can be reinforced positively or negatively as the project progresses. It’s a continuous interplay between internal and external factors and components that can create synergies or have adverse effects with the potential of reaching tipping points.

Related to the initial conditions, as the praxis sometimes shows, for entities found in continuous movement (like organizations) it’s also important to know from where one’s coming (and at what speed), as the previous impulse (driving force) can be further used or stirred as needed. Metaphorically, a project will need a certain time to find the right pace if it lacks the proper impulse.

Unless the team is trained to play and plays like an orchestra, the impact of deviations from expectations can be hardly quantified. To minimize the waste, ideally a project’s journey should minimally deviate from the optimal path, which can be challenging to achieve as a project’s mass can pull the project in one direction or the other. The more the project advances the bigger the mass, fact which can make a project unstoppable. When such high-mass projects are stopped, their impulse can continue to haunt the organization years after.

Previous Post <<||>> Next Post

💼Project Management: Project Execution (Part III: Projects' Dynamics - An Introduction)

Despite the considerable collection of books on Project Management (PM) and related methodologies, and the fact that projects are inherent endeavors in professional as well personal life (setups that would give in theory people the environment and exposure to different project types), people’s understanding on what it takes to plan and execute a project seems to be narrow and questionable sometimes. Moreover, their understanding diverges considerably from common sense. It’s also true that knowledge and common sense are relative when considering any human endeavor in which there are multiple roads to the same destination, or when learning requires time, effort, skills, and implies certain prerequisites, however the lack of such knowledge can hurt when endeavor’s success is a must and a team effort. 

Even if the lack of understanding about PM can be considered as minor when compared with other challenges/problems faced by a project, when one’s running fast to finish a race, even a small pebble in one’s running shoes can hurt a lot, especially when one doesn’t have the luxury to stop and remove the stone, as it would make sense to do.

It resides in the human nature to resist change, to seek for information that only confirm own opinions, to follow the same approach in handling challenges, even if the attempts are far from optimal, even if people who walked the same path tell you that there’s a better way and even sketch the path and provide information about what it takes to reach there. As it seems, there’s the predisposition to learn on the hard way, if there’s significant learning involved at all. Unfortunately, such situations occur in projects and the solutions often overrun the boundaries of PM, where social and communication skills must be brought into play. 

On the other side, there’s still hope that change can be managed optimally once the facts are explained to a certain level that facilitates understanding. However, such an attempt can prove to be quite a challenge, given the various setups in which PM takes place. The intersection between technologies and organizational setups lead to complex scenarios which make such work more difficult, even if projects’ challenges are of organizational rather than technological nature. 

When the knowledge we have about the world doesn’t fit our expectation, a simple heuristic is to return to the basics. A solid edifice can be built only on a solid foundation and the best foundation in coping with reality is to establish common ground with other people. One can achieve this by identifying their suppositions and expectations, by closing the gap in perception and understanding, by establishing a basis for communication, in which feedback is a must if one wants to make significant progress.

Despite of being explorative and time-consuming, establishing common ground can be challenging when addressing to an imaginary audience, which is quite often the situation. The practice shows however that progress can be made by starting with a set of well-formulated definitions, simple models, principles, and heuristics that have the potential of helping in sense-making.

The goal is thus to identify first the definitions that reflect the basic concepts that need to be considered. Once the concepts defined, they can be related to each other with the help of a few models. Even if fictitious, as simplifications of the reality, the models should allow playing with the concepts, facilitating concepts’ understanding. Principles (set of rules for reasoning) can be used together with heuristics (rules of thumb methods or techniques) for explaining the ‘known’ and approaching the ‘unknown’. Even maybe not perfect, these tools can help building theories or explanatory constructs.

||>>Next Post

01 February 2021

📦Data Migrations (DM): Quality Assurance (Part V: Quality Acceptance Criteria V)

Data Migration

Efficiency 

Efficiency is the degree to which a solution uses the hardware (storage, network) and other organizational resources to fulfill a given task. Data characterized by high volume, velocity, variety and veracity can be challenging to process, requiring upon case more processing power. Therefore, the DM solutions need to consider these aspects as well. However, efficiency refers on whether the available resources are used efficiently – the waste in terms of resource utilization is minimal. 

On the other side the waste of resources can be acceptable when there are other benefits or requirements that need to be considered, respectively when the ratio between resources utilization and effort to built more efficient processes is acceptable.

A DM solution involves iterative and exploratory processes in which knowledge and feedback is integrated in each iteration, therefore it might look like resources are not used efficiently. However, this is a way to handle complexity and uncertainty by breaking the effort in manageable chunks.

Learnability

Learnability is the degree to which a person can become familiar with a solution’s use, the data and the processes associated with it. A DM can be challenging for many technical and non-technical resources as it requires a certain level of skillset and understanding of the requirements, needs and deliverables. The complexity of the data and requirements can be overwhelming, however with appropriate communication and awareness established, the challenges can be overcome. 

Stability

Stability is the degree to which a solution is sensitive to environment changes (e.g. overuse of resource, hardware or software failures, updates), respectively on whether it performs with no performance defects or it does not crash under defined levels of stress. Stability can be monitored during the various phases and countermeasures need to be considered in case the solution is not stable enough (e.g. redesigning the solution, breaking the data in smaller chunks)

Suitability 

Suitability is the degree to which a solutions provides functions that meet the stated and implied needs. No matter how performant and technologically advanced a solution is, it brings less value as long it doesn’t perform what it was intended to do.

Transparency 

Transparency is the degree to which a solution’s stakeholders have access to the requirements, processes, data, documentation, or other information required by them. In a DM transparency is important especially important in respect to the data, logic and rules used in data processing, respectively the number of records processed. 

Trustability

Trustability is the degree to which a solution can be trusted to provide the expected results. Even if the technical team assures that the solution can deliver what was indented, the success of a DM is a matter of perception from stakeholders’ perspective. Providing transparency into the data, rules and processes can improve the level of trust however, special attention need to be given to the issues raised by stakeholders during and after Go-Live, as differences need to be mitigated. 

Understandability 

Understandability is the degree to which the requirements of a solution were understood by the resources involved in terms of what needs to be performed. For the average project resource it might be challenging to understand the implications of a DM, and this can apply to technical as well non-technical resources. Making people aware of the implications is probably one of the most important criteria for success, as the success of a migration is often a matter of perception. 

Usability 

Usability is the degree to which a solution can be used by the targeted users within the agreed context of usage. Ideally DM solutions need to be easy to use, though there are always trade-offs. In the end, a DM must fit the purpose it was built for. 

Previous Post <

05 January 2021

🧮ERP: Planning (Part II: It’s all about Scope - Nonfunctional Requirements & MVP))

ERP Implementation

Nonfunctional Requirements

In contrast to functional requirements (FRs), nonfunctional requirements (NFRs) have no direct impact on system’s behavior, affecting end-users’ experience with the system, resuming thus to topics like performance, usability, reliability, compatibility, security, monitoring, maintainability, testability, respectively other constraints and quality attributes. Even if these requirements are in general addressed by design, the changes made to the system have the potential of impacting users’ experience negatively.  

Moreover, the NFRs are usually difficult to quantify, and probably that’s why they are seldom made explicit in a formal document or are considered eventually only at high level. However, one can still find a basis for comparison against compliance requirements, general guidelines, standards, best practices or the legacy system(s) (e.g. the performance should not be worse than in the legacy system, the volume of effort for carrying the various activities should not increase). Even if they can’t be adequately described, it’s recommended to list the NFRs in general terms in a formal document (e.g. implementation contract). Failing to do so can open or widen the risk exposure one has, especially when the system lacks important support in the respective areas. In addition, these requirements need to be considered during testing and sign-off as well. 

Minimum Viable Product (MVP)

Besides gaps’ consideration in respect to FRs, it’s important to consider sometimes on whether the whole functionality is mandatory, especially when considering the various activities that need to be carried out (parametrization, Data Migration).

For example, one can target to implement a minimum viable product (MVP) - a version of the product which has just enough features to cover the mandatory or the most important FRs. The MVP is based on the idea that implementing about 80% of the needed functionality has in theory the potential of providing earlier a usable product with a minimum of effort (quick wins), assure that project’s goals and objectives were met, respectively assure a basis for further development. In case of cost overruns, the MVP assures that the business has a workable product and has the opportunity of deciding whether it’s worth of investing more into the project now or later. 

The MVP allows also to get early users’ feedback and integrate it into further enhancements and developments. Often the users understand the capabilities of a system, respectively implementation, only when they are able using the system. As this is a learning process, the learning period can take up to a few months until adequate feedback is available. Therefore, postponing implementation’s continuation with a few months can have in theory a positive impact, however it can come also with drawbacks (e.g. the resources are not available anymore). 

A sketch of the MVP usually results from requirements’ prioritization, however then requirements need to be regarded holistically, as there can be different levels of dependencies existing between them. In addition, different costs can incur if the requirements will be handled later, and other constrains may apply as well. Considering an MVP approach can be a sword with two edges. In the worst-case scenario, the business will get only the MVP, with its good and bad characteristics. The business will be forced then to fill the gaps by working outside the system, which can lead to further effort and, in extremis, with poor acceptance of the system. In general, users expect having their processes fully implemented in the system, expectation which is not always economically grounded.

After establishing an MVP one can consider the further requirements (including improvement suggestions) based on a cost-benefit basis and implement them accordingly as part of a continuous improvement initiative, even if more time will be maybe required for implementing the same.

Previous Post <<||>> Next Post

30 October 2020

Data Science: Data Strategy (Part II: Generalists vs Specialists in the Field)

Data Science

Division of labor favorizes the tasks done repeatedly, where knowledge of the broader processes is not needed, where aspects as creativity are needed only at a small scale. Division invaded the IT domains as tools, methodologies and demands increased in complexity, and therefore Data Science and BI/Analytics make no exception from this.

The scale of this development gains sometimes humorous expectations or misbelieves when one hears headhunters asking potential candidates whether they are upfront or backend experts when a good understanding of both aspects is needed for providing adequate results. The development gains tragicomical implications when one is limited in action only to a given area despite the extended expertise, or when a generalist seems to step on the feet of specialists, sometimes from the right entitled reasons. 

Headhunters’ behavior is rooted maybe in the poor understanding of the domain of expertise and implications of the job descriptions. It’s hard to understand how people sustain of having knowledge about a domain just because they heard the words flying around and got some glimpse of the connotations associated with the words. Unfortunately, this is extended to management and further in the business environment, with all the implications deriving from it. 

As Data Science finds itself at the intersection between Artificial Intelligence, Data Mining, Machine Learning, Neurocomputing, Pattern Recognition, Statistics and Data Processing, the center of gravity is hard to determine. One way of dealing with the unknown is requiring candidates to have a few years of trackable experience in the respective fields or in the use of a few tools considered as important in the respective domains. Of course, the usage of tools and techniques is important, though it’s a big difference between using a tool and understanding the how, when, why, where, in which ways and by what means a tool can be used effectively to create value. This can be gained only when one’s exposed to different business scenarios across industries and is a tough thing to demand from a profession found in its baby steps. 

Moreover, being a good data scientist involves having a deep insight into the businesses, being able to understand data and the demands associated with data – the various qualitative and quantitative aspects. Seeing the big picture is important in defining, approaching and solving problems. The more one is exposed to different techniques and business scenarios, with right understanding and some problem-solving skillset one can transpose and solve problems across domains. However, the generalist will find his limitations as soon a certain depth is reached, and the collaboration with a specialist is then required. A good collaboration between generalists and specialists is important in complex projects which overreach the boundaries of one person’s knowledge and skillset. 

Complexity is addressed when one can focus on the important characteristic of the problem, respectively when the models built can reflect the demands. The most important skillset besides the use of technical tools is the ability to model problems and root the respective problems into data, to elaborate theories and check them against reality. 

Complex problems can require specialization in certain fields, though seldom one problem is dependent only on one aspect of the business, as problems occur in overreaching contexts that span sometimes the borders of an organization. In addition, the ability to solve problems seem to be impacted by the diversity of the people involved into the task, sometimes even with backgrounds not directly related to organization’s activity. As in evolution, a team’s diversity is an important factor in achievement and learning, most gain being obtained when knowledge gets shared and harnessed beyond the borders of teams.

Note:
Written as answer to a Medium post on Data Science generalists vs specialists.

25 December 2019

#️⃣Software Engineering: Mea Culpa (Part II: The Beginnings)

Software Engineering
Software Engineering Series

I started programming at 14-15 years old with logical schemas made on paper, based mainly on simple mathematical algorithms like solving equations of second degree, finding prime or special numbers, and other simple tricks from the mathematical world available for a student at that age. It was challenging to learn programming based only on schemas, though, looking back, I think it was the best learning basis a programmer could have, because it allowed me thinking logically and it was also a good exercise, as one was forced to validate mentally or on paper the outputs.

Then I moved to learning Basic and later Pascal on old generation Spectrum computers, mainly having a keyboard with 64K memory and an improvised monitor. It felt almost like a holiday when one had the chance to work 45 minutes or so on an IBM computer with just 640K memory. It was also a motivation to stay long after hours to write a few more lines of code. Even if it made no big difference in what concerns the speed, the simple idea of using a more advanced computer was a big deal.

The jump from logical schemas to actual programming was huge, as we moved from static formulas to exploratory methods like the ones of finding the roots of equations of upper degrees by using approximation methods, working with permutations and a few other combinatoric tools, interpolation methods, and so on. Once I got my own 64K Spectrum keyboard, a new world opened, having more time to play with 2- and 3-dimensional figures, location problems and so on. It was probably the time I got most interesting exposure to things not found in the curricula.  

Further on, during the university years I moved to Fortran, back to Pascal and dBASE, and later to C and C++, the focus being further on mathematical and sorting algorithms, working with matrices, and so on. I have to admit that it was a big difference between the students who came from 2-3 hours of Informatics per week (like I did) and the ones coming from lyceums specialized on Informatics, this especially during years in which learning materials were almost inexistent. In the end all went well.

The jumping through so many programming languages, some quite old for the respective times, even if allowed acquiring different perspectives, it felt sometimes like  a waste of time, especially when one was limited to using the campus computers, and that only during lab hours. That was the reality of those times. Fortunately, the university years went faster than they came. Almost one year after graduation, with a little help, some effort and benevolence, I managed to land a job as web developer, jumping from an interlude with Java to ASP, JavaScript, HTML, ColdFusion, ActionScript, SQL, XML and a few other programming languages ‘en vogue’ during the 2000.

Somewhere between graduation and my first job, my life changed when I was able to buy my own PC (a Pentium). It was the best investment I could make, mainly because it allowed me to be independent of what I was doing at work. It allowed me learning the basics of OOP programming based on Visual Basic and occasionally on Visual C++ and C#. Most of the meaningful learning happened after work, from the few books available, full of mistakes and other challenges.

That was my beginning. It is not my intent to brag about how much or how many programming languages I learned - knowledge is anyway relative - but to differentiate between the realities of then and today, as a bridge over time.

Previous Post <<||>>  Next Post

12 May 2019

#️⃣Software Engineering: Programming (Part XII: Misconceptions about Programming - Part I)

Software Engineering
Software Engineering Series

Besides equating the programming process with a programmer’s capabilities, minimizing the importance of programming and programmers’ skills in the whole process (see previous post), there are several other misconceptions about programming that influence process' outcomes.


Having a deep knowledge of a programming language allows programmers to easily approach other programming languages, however each language has its own learning curve ranging from a few weeks to half of year or more. The learning curve is dependent on the complexity of the languages known and the language to be learned, same applying to frameworks and architectures, the scenarios in which the languages are used, etc. One unrealistic expectation is that the programmers are capablle of learning a new programming language or framework overnight, this expectation pushing more pressure on programmers’ shoulders as they need to compensate in a short time for the knowledge gap. No, the programming languages are not the same even if there’s high resemblance between them!

There’s lot of code available online, many of the programming tasks involve writing similar code. This makes people assume that programming can resume to copy-paste activities and, in extremis, that there’s no creativity into the act of programming. Beside the fact that using others’ code comes with certain copyright limitations, copy-pasting code is in general a way of introducing bugs in software. One can learn a lot from others’ code, though programmers' challenge resides in writing better code, in reusing code while finding the right the level of abstraction.  
 
There’s the tendency on the market to build whole applications using wizard-like functionality and of generating source-code based on data or ontological models. Such approaches work in a range of (limited) scenarios, and even if the trend is to automate as much in the process, is not what programming is about. Each such tool comes with its own limitations that sooner or later will push back. Changing the code in order to build new functionality or to optimize the code is often not a feasible solution as it imposes further limitations.

Programming is not only about writing code. It involves also problem-solving abilities, having a certain understanding about the business processes, in which the conceptual creativity and ingenuity of design can prove to be a good asset. Modelling and implementing processes help programmers gain a unique perspective within a business.

For a programmer the learning process never stops. The release cycle for the known tools becomes smaller, each release bringing a new set of functionalities. Moreover, there are always new frameworks, environments, architectures and methodologies to learn. There’s a considerable amount of effort in expanding one's (necessary) knowledge, effort usually not planned in projects or outside of them. Trainings help in the process, though they hardly scratch the surface. Often the programmer is forced to fill the knowledge gap in his free time. This adds up to the volume of overtime one must do on projects. On the long run it becomes challenging to find the needed time for learning.

In resource planning there’s the tendency to add or replace resources on projects, while neglecting the influence this might have on a project and its timeline. Each new resource needs some time to accommodate himself on the role, to understand project requirements, to take over the work of another. Moreover, resources are replaced on project with a minimal or even without the knowledge transfer necessary for the job ahead. Unfortunately, same behavior occurs in consultancy as well, consultants being moved from one known functional area into another unknown area, changing the resources like the engines of different types of car, expecting that everything will work as magic.



18 December 2018

🔭Data Science: Discovery (Just the Quotes)

"A good method of discovery is to imagine certain members of a system removed and then see how what is left would behave: for example, where would we be if iron were absent from the world: this is an old example." (Georg C Lichtenberg, Notebook J, 1789-1793)

"Every science has for its basis a system of principles as fixed and unalterable as those by which the universe is regulated and governed. Man cannot make principles; he can only discover them." (Thomas Paine, "The Age of Reason", 1794)

"We learn wisdom from failure much more than from success. We often discover what will do, by finding out what will not do; and probably he who never made a mistake never made a discovery." (Samuel Smiles, "Facilities and Difficulties", 1859)

"The process of discovery is very simple. An unwearied and systematic application of known laws to nature, causes the unknown to reveal themselves. Almost any mode of observation will be successful at last, for what is most wanted is method." (Henry Thoreau, "A Week on the Concord and Merrimack Rivers", 1862)

"Every process has laws, known or unknown, according to which it must take place. A consciousness of them is so far from being necessary to the process, that we cannot discover what they are, except by analyzing the results it has left us." (Lord William T Kelvin , "An Outline of the Necessary Laws of Thought", 1866)

"Accurate and minute measurement seems to the nonscientific imagination a less lofty and dignified work than looking for something new. But nearly all the grandest discoveries of science have been but the rewards of accurate measurement and patient long contained labor in the minute sifting of numerical results." (William T Kelvin, "Report of the British Association For the Advancement of Science" Vol. 41, 1871)

"Modern discoveries have not been made by large collections of facts, with subsequent discussion, separation, and resulting deduction of a truth thus rendered perceptible. A few facts have suggested an hypothesis, which means a supposition, proper to explain them. The necessary results of this supposition are worked out, and then, and not till then, other facts are examined to see if their ulterior results are found in Nature." (Augustus de Morgan, "A Budget of Paradoxes", 1872)

"Science arises from the discovery of Identity amid Diversity." (William S Jevons, "The Principles of Science: A Treatise on Logic and Scientific Method", 1874)

"It would be an error to suppose that the great discoverer seizes at once upon the truth, or has any unerring method of divining it. In all probability the errors of the great mind exceed in number those of the less vigorous one. Fertility of imagination and abundance of guesses at truth are among the first requisites of discovery; but the erroneous guesses must be many times as numerous as those that prove well founded. The weakest analogies, the most whimsical notions, the most apparently absurd theories, may pass through the teeming brain, and no record remain of more than the hundredth part. […] The truest theories involve suppositions which are inconceivable, and no limit can really be placed to the freedom of hypotheses." (W Stanley Jevons, "The Principles of Science: A Treatise on Logic and Scientific Method", 1877)

"A discoverer is a tester of scientific ideas; he must not only be able to imagine likely hypotheses, and to select suitable ones for investigation, but, as hypotheses may be true or untrue, he must also be competent to invent appropriate experiments for testing them, and to devise the requisite apparatus and arrangements." (George Gore, "The Art of Scientific Discovery", 1878)

"All great scientists have, in a certain sense, been great artists; the man with no imagination may collect facts, but he cannot make great discoveries." (Karl Pearson, "The Grammar of Science", 1892)

"The folly of mistaking a paradox for a discovery, a metaphor for a proof, a torrent of verbiage for a spring of capital truths, and oneself for an oracle, is inborn in us." (Paul Valéry, "Introduction to the Method of Leonardo da Vinci", 1895)

"If we study the history of science we see happen two inverse phenomena […] Sometimes simplicity hides under complex appearances; sometimes it is the simplicity which is apparent, and which disguises extremely complicated realities. […] No doubt, if our means of investigation should become more and more penetrating, we should discover the simple under the complex, then the complex under the simple, then again the simple under the complex, and so on, without our being able to foresee what will be the last term. We must stop somewhere, and that science may be possible, we must stop when we have found simplicity. This is the only ground on which we can rear the edifice of our generalizations." (Henri Poincaré, "Science and Hypothesis", 1901)

"The only true voyage of discovery […] would be not to visit new landscapes, but to possess other eyes, to see the universe through the eyes of another, of a hundred others, to see the hundred universes that each of them sees." (Marcel Proust, "À la recherche du temps perdu", 1913)

"To come very near to a true theory, and to grasp its precise application, are two very different things, as the history of a science teaches us. Everything of importance has been said before by somebody who did not discover it." (Alfred N Whitehead, "The Organization of Thought", 1917)

"Great scientific discoveries have been made by men seeking to verify quite erroneous theories about the nature of things." (Aldous L Huxley, "Life and Letters and the London Mercury" Vol. 1, 1928)

"The art of discovery is confused with the logic of proof and an artificial simplification of the deeper movements of thought results. We forget that we invent by intuition though we prove by logic." (Sarvepalli Radhakrishnan, "An Idealist View of Life", 1929)

"Science is the attempt to discover, by means of observation, and reasoning based upon it, first, particular facts about the world, and then laws connecting facts with one another and (in fortunate cases) making it possible to predict future occurrences." (Bertrand Russell, "Religion and Science, Grounds of Conflict", 1935) 

"A great discovery solves a great problem but there is a grain of discovery in the solution of any problem. Your problem may be modest; but if it challenges your curiosity and brings into play your inventive faculties, and if you solve it by your own means, you may experience the tension and enjoy the triumph of discovery." (George Polya, "How to solve it", 1944)

"It is always more easy to discover and proclaim general principles than it is to apply them." (Winston Churchill, "The Second World War: The gathering storm", 1948)

"Scientific discovery consists in the interpretation for our own convenience of a system of existence which has been made with no eye to our convenience at all." (Norbert Wiener, "The Human Use of Human Beings", 1949)

"The scientist who discovers a theory is usually guided to his discovery by guesses; he cannot name a method by means of which he found the theory and can only say that it appeared plausible to him, that he had the right hunch or that he saw intuitively which assumption would fit the facts." (Hans Reichenbach, "The Rise of Scientific Philosophy", 1951)

"It is important for him who wants to discover not to confine himself to one chapter of science, but to keep in touch with various others." (Jacques S Hadamard, "An Essay on the Psychology of Invention in the Mathematical Field", 1954)

"The true aim of science is to discover a simple theory which is necessary and sufficient to cover the facts, when they have been purified of traditional prejudices." (Lancelot L Whyte, "Accent on Form", 1954)

"There comes a point where the mind takes a leap - call it intuition or what you will - and comes out upon a higher plane of knowledge, but can never prove how it got there. All great discoveries have involved such a leap." (Albert Einstein, [interview in Life, "Death of a Genius"] 1955)

"Discovery follows discovery, each both raising and answering questions, each ending a long search, and each providing the new instruments for a new search." (J Robert Oppenheimer, "Prospects in the Arts and Sciences", 1964)

"Discovery always carries an honorific connotation. It is the stamp of approval on a finding of lasting value. Many laws and theories have come and gone in the history of science, but they are not spoken of as discoveries. […] Theories are especially precarious, as this century profoundly testifies. World views can and do often change. Despite these difficulties, it is still true that to count as a discovery a finding must be of at least relatively permanent value, as shown by its inclusion in the generally accepted body of scientific knowledge." (Richard J. Blackwell, "Discovery in the Physical Sciences", 1969)

"A discovery must be, by definition, at variance with existing knowledge." (Albert Szent-Gyorgyi, "Dionysians and Apollonians", Science 176, 1972)

"It is one of our most exciting discoveries that local discovery leads to a complex of further discoveries. Corollary to this we find that we no sooner get a problem solved than we are overwhelmed with a multiplicity of additional problems in a most beautiful payoff of heretofore unknown, previously unrecognized, and as-yet unsolved problems." (Buckminster Fuller, "Synergetics: Explorations in the Geometry of Thinking", 1975)

"You cannot learn, through common sense, how things are you can only discover where they fit into the existing scheme of things." (Stuart Hall, 1977)

"Every discovery, every enlargement of the understanding, begins as an imaginative preconception of what the truth might be. The imaginative preconception - a ‘hypothesis’ - arises by a process as easy or as difficult to understand as any other creative act of mind; it is a brainwave, an inspired guess, a product of a blaze of insight. It comes anyway from within and cannot be achieved by the exercise of any known calculus of discovery. " (Sir Peter B Medawar, "Advice to a Young Scientist", 1979)

"Metaphors can have profound significance because, as images or figures, they allow the mind to grasp or discover unsuspected ideal and material relationships between objects." (Giuseppe Del Re, "Cosmic Dance", 1999)

"Alternative models are neither right nor wrong, just more or less useful in allowing us to operate in the world and discover more and better options for solving problems." (Andrew Weil," The Natural Mind: A Revolutionary Approach to the Drug Problem", 2004)

"We tackle a multifaceted universe one face at a time, tailoring our models and equations to fit the facts at hand. Whatever mechanical conception proves appropriate, that is the one to use. Discovering worlds within worlds, a practical observer will deal with each realm on its own terms. It is the only sensible approach to take." (Michael Munowitz, "Knowing: The Nature of Physical Law", 2005)

"Equations seem like treasures, spotted in the rough by some discerning individual, plucked and examined, placed in the grand storehouse of knowledge, passed on from generation to generation. This is so convenient a way to present scientific discovery, and so useful for textbooks, that it can be called the treasure-hunt picture of knowledge." (Robert P Crease, "The Great Equations", 2009)

"Models do not only describe reality, they are also instruments for exploring reality. They are not only involved in the integration of known data, but also in the discovery of new data." (Andreas Bartels, "The Standard Model of Cosmology as a Tool for Interpretation and Discovery", 2013)

More quotes on "Discovery" at the-web-of-knowledge.blogspot.com.

16 December 2018

🔭Data Science: Data Collection (Just the Quotes)

"There are two aspects of statistics that are continually mixed, the method and the science. Statistics are used as a method, whenever we measure something, for example, the size of a district, the number of inhabitants of a country, the quantity or price of certain commodities, etc. […] There is, moreover, a science of statistics. It consists of knowing how to gather numbers, combine them and calculate them, in the best way to lead to certain results. But this is, strictly speaking, a branch of mathematics." (Alphonse P de Candolle, "Considerations on Crime Statistics", 1833)

"Just as data gathered by an incompetent observer are worthless - or by a biased observer, unless the bias can be measured and eliminated from the result - so also conclusions obtained from even the best data by one unacquainted with the principles of statistics must be of doubtful value." (William F White, "A Scrap-Book of Elementary Mathematics: Notes, Recreations, Essays", 1908)

"[...] scientists are not a select few intelligent enough to think in terms of 'broad sweeping theoretical laws and principles'. Instead, scientists are people specifically trained to build models that incorporate theoretical assumptions and empirical evidence. Working with models is essential to the performance of their daily work; it allows them to construct arguments and to collect data." (Peter Imhof, Science Vol. 287, 1935–1936)

"Statistics is a scientific discipline concerned with collection, analysis, and interpretation of data obtained from observation or experiment. The subject has a coherent structure based on the theory of Probability and includes many different procedures which contribute to research and development throughout the whole of Science and Technology." (Egon Pearson, 1936)

"Scientific data are not taken for museum purposes; they are taken as a basis for doing something. If nothing is to be done with the data, then there is no use in collecting any. The ultimate purpose of taking data is to provide a basis for action or a recommendation for action. The step intermediate between the collection of data and the action is prediction." (William E Deming, "On a Classification of the Problems of Statistical Inference", Journal of the American Statistical Association Vol. 37 (218), 1942)

"Data should be collected with a clear purpose in mind. Not only a clear purpose, but a clear idea as to the precise way in which they will be analysed so as to yield the desired information." (Michael J Moroney, "Facts from Figures", 1951)

"The technical analysis of any large collection of data is a task for a highly trained and expensive man who knows the mathematical theory of statistics inside and out. Otherwise the outcome is likely to be a collection of drawings - quartered pies, cute little battleships, and tapering rows of sturdy soldiers in diversified uniforms - interesting enough in the colored Sunday supplement, but hardly the sort of thing from which to draw reliable inferences." (Eric T Bell, "Mathematics: Queen and Servant of Science", 1951)

"Null hypotheses of no difference are usually known to be false before the data are collected [...] when they are, their rejection or acceptance simply reflects the size of the sample and the power of the test, and is not a contribution to science." (I Richard Savage, "Nonparametric statistics", Journal of the American Statistical Association 52, 1957)

"Philosophers of science have repeatedly demonstrated that more than one theoretical construction can always be placed upon a given collection of data." (Thomas Kuhn, "The Structure of Scientific Revolutions", 1962) 

"It has been said that data collection is like garbage collection: before you collect it you should have in mind what you are going to do with it." (Russell Fox et al, "The Science of Science", 1964)

"Typically, data analysis is messy, and little details clutter it. Not only confounding factors, but also deviant cases, minor problems in measurement, and ambiguous results lead to frustration and discouragement, so that more data are collected than analyzed. Neglecting or hiding the messy details of the data reduces the researcher's chances of discovering something new." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"If we gather more and more data and establish more and more associations, however, we will not finally find that we know something. We will simply end up having more and more data and larger sets of correlations." (Kenneth N Waltz, "Theory of International Politics Source: Theory of International Politics", 1979)

"The systematic collection of data about people has affected not only the ways in which we conceive of a society, but also the ways in which we describe our neighbour. It has profoundly transformed what we choose to do, who we try to be, and what we think of ourselves." (Ian Hacking, "The Taming of Chance", 1990)

"When looking at the end result of any statistical analysis, one must be very cautious not to over interpret the data. Care must be taken to know the size of the sample, and to be certain the method for gathering information is consistent with other samples gathered. […] No one should ever base conclusions without knowing the size of the sample and how random a sample it was. But all too often such data is not mentioned when the statistics are given - perhaps it is overlooked or even intentionally omitted." (Theoni Pappas, "More Joy of Mathematics: Exploring mathematical insights & concepts", 1991)

"We have found that some of the hardest errors to detect by traditional methods are unsuspected gaps in the data collection (we usually discovered them serendipitously in the course of graphical checking)." (Peter Huber, "Huge data sets", Compstat '94: Proceedings, 1994)

"We do not realize how deeply our starting assumptions affect the way we go about looking for and interpreting the data we collect." (Roger A Lewin, "Kanzi: The Ape at the Brink of the Human Mind", 1994)

"The science of statistics may be described as exploring, analyzing and summarizing data; designing or choosing appropriate ways of collecting data and extracting information from them; and communicating that information. Statistics also involves constructing and testing models for describing chance phenomena. These models can be used as a basis for making inferences and drawing conclusions and, finally, perhaps for making decisions." (Fergus Daly et al, "Elements of Statistics", 1995)

"Unfortunately, just collecting the data in one place and making it easily available isn’t enough. When operational data from transactions is loaded into the data warehouse, it often contains missing or inaccurate data. How good or bad the data is a function of the amount of input checking done in the application that generates the transaction. Unfortunately, many deployed applications are less than stellar when it comes to validating the inputs. To overcome this problem, the operational data must go through a 'cleansing' process, which takes care of missing or out-of-range values. If this cleansing step is not done before the data is loaded into the data warehouse, it will have to be performed repeatedly whenever that data is used in a data mining operation." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"Consideration needs to be given to the most appropriate data to be collected. Often the temptation is to collect too much data and not give appropriate attention to the most important. Filing cabinets and computer files world-wide are filled with data that have been collected because they may be of interest to someone in future. Most is never of interest to anyone and if it is, its existence is unknown to those seeking the information, who will set out to collect the data again, probably in a trial better designed for the purpose. In general, it is best to collect only the data required to answer the questions posed, when setting up the trial, and plan another trial for other data in the future, if necessary." (P Portmann & H Ketata, "Statistical Methods for Plant Variety Evaluation", 1997)

"Data are collected as a basis for action. Yet before anyone can use data as a basis for action the data have to be interpreted. The proper interpretation of data will require that the data be presented in context, and that the analysis technique used will filter out the noise."  (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Data are generally collected as a basis for action. However, unless potential signals are separated from probable noise, the actions taken may be totally inconsistent with the data. Thus, the proper use of data requires that you have simple and effective methods of analysis which will properly separate potential signals from probable noise." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Just as dynamics arise from feedback, so too all learning depends on feedback. We make decisions that alter the real world; we gather information feedback about the real world, and using the new information we revise our understanding of the world and the decisions we make to bring our perception of the state of the system closer to our goals." (John D Sterman, "Business dynamics: Systems thinking and modeling for a complex world", 2000) 

"Without meaningful data there can be no meaningful analysis. The interpretation of any data set must be based upon the context of those data. Unfortunately, much of the data reported to executives today are aggregated and summed over so many different operating units and processes that they cannot be said to have any context except a historical one - they were all collected during the same time period. While this may be rational with monetary figures, it can be devastating to other types of data." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Data is a fact of life. As time goes by, we collect more and more data, making our original reason for collecting the data harder to accomplish. We don't collect data just to waste time or keep busy; we collect data so that we can gain knowledge, which can be used to improve the efficiency of our organization, improve profit margins, and on and on. The problem is that as we collect more data, it becomes harder for us to use the data to derive this knowledge. We are being suffocated by this raw data, yet we need to find a way to use it." (Seth Paul et al. "Preparing and Mining Data with Microsoft SQL Server 2000 and Analysis", 2002)

"Statistics depend on collecting information. If questions go unasked, or if they are asked in ways that limit responses, or if measures count some cases but exclude others, information goes ungathered, and missing numbers result. Nevertheless, choices regarding which data to collect and how to go about collecting the information are inevitable." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"Put simply, statistics is a range of procedures for gathering, organizing, analyzing and presenting quantitative data. […] Essentially […], statistics is a scientific approach to analyzing numerical data in order to enable us to maximize our interpretation, understanding and use. This means that statistics helps us turn data into information; that is, data that have been interpreted, understood and are useful to the recipient. Put formally, for your project, statistics is the systematic collection and analysis of numerical data, in order to investigate or discover relationships among phenomena so as to explain, predict and control their occurrence." (Reva B Brown & Mark Saunders, "Dealing with Statistics: What You Need to Know", 2008)

"Statistics is the art of learning from data. It is concerned with the collection of data, their subsequent description, and their analysis, which often leads to the drawing of conclusions." (Sheldon M Ross, "Introductory Statistics" 3rd Ed., 2009)

"Statistics is the science of collecting, organizing, analyzing, and interpreting data in order to make decisions." (Ron Larson & Betsy Farber, "Elementary Statistics: Picturing the World" 5th Ed., 2011)

"The discrepancy between our mental models and the real world may be a major problem of our times; especially in view of the difficulty of collecting, analyzing, and making sense of the unbelievable amount of data to which we have access today." (Ugo Bardi, "The Limits to Growth Revisited", 2011)

"In order to be effective a descriptive statistic has to make sense - it has to distill some essential characteristic of the data into a value that is both appropriate and understandable. […] the justification for computing any given statistic must come from the nature of the data themselves - it cannot come from the arithmetic, nor can it come from the statistic. If the data are a meaningless collection of values, then the summary statistics will also be meaningless - no arithmetic operation can magically create meaning out of nonsense. Therefore, the meaning of any statistic has to come from the context for the data, while the appropriateness of any statistic will depend upon the use we intend to make of that statistic." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"Each systems archetype embodies a particular theory about dynamic behavior that can serve as a starting point for selecting and formulating raw data into a coherent set of interrelationships. Once those relationships are made explicit and precise, the 'theory' of the archetype can then further guide us in our data-gathering process to test the causal relationships through direct observation, data analysis, or group deliberation." (Daniel H Kim, "Systems Archetypes as Dynamic Theories", The Systems Thinker Vol. 24 (1), 2013)

"Statistics is an integral part of the quantitative approach to knowledge. The field of statistics is concerned with the scientific study of collecting, organizing, analyzing, and drawing conclusions from data." (Kandethody M Ramachandran & Chris P Tsokos, "Mathematical Statistics with Applications in R" 2nd Ed., 2015)

"The term data, unlike the related terms facts and evidence, does not connote truth. Data is descriptive, but data can be erroneous. We tend to distinguish data from information. Data is a primitive or atomic state (as in ‘raw data’). It becomes information only when it is presented in context, in a way that informs. This progression from data to information is not the only direction in which the relationship flows, however; information can also be broken down into pieces, stripped of context, and stored as data. This is the case with most of the data that’s stored in computer systems. Data that’s collected and stored directly by machines, such as sensors, becomes information only when it’s reconnected to its context."  (Stephen Few, "Signal: Understanding What Matters in a World of Noise", 2015)

"Big data is, in a nutshell, large amounts of data that can be gathered up and analyzed to determine whether any patterns emerge and to make better decisions." (Daniel Covington, Analytics: Data Science, Data Analysis and Predictive Analytics for Business, 2016)

"Statistics can be defined as a collection of techniques used when planning a data collection, and when subsequently analyzing and presenting data." (Birger S Madsen, "Statistics for Non-Statisticians", 2016)

"Statistics is the science of collecting, organizing, and interpreting numerical facts, which we call data. […] Statistics is the science of learning from data." (Moore McCabe & Alwan Craig, "The Practice of Statistics for Business and Economics" 4th Ed., 2016)

"Collecting data through sampling therefore becomes a never-ending battle to avoid sources of bias. [...] While trying to obtain a random sample, researchers sometimes make errors in judgment about whether every person or thing is equally likely to be sampled." (Daniel J Levitin, "Weaponized Lies", 2017)

"Just because there’s a number on it, it doesn’t mean that the number was arrived at properly. […] There are a host of errors and biases that can enter into the collection process, and these can lead millions of people to draw the wrong conclusions. Although most of us won’t ever participate in the collection process, thinking about it, critically, is easy to learn and within the reach of all of us." (Daniel J Levitin, "Weaponized Lies", 2017)

"Measurements must be standardized. There must be clear, replicable, and precise procedures for collecting data so that each person who collects it does it in the same way." (Daniel J Levitin, "Weaponized Lies", 2017)

"To be any good, a sample has to be representative. A sample is representative if every person or thing in the group you’re studying has an equally likely chance of being chosen. If not, your sample is biased. […] The job of the statistician is to formulate an inventory of all those things that matter in order to obtain a representative sample. Researchers have to avoid the tendency to capture variables that are easy to identify or collect data on - sometimes the things that matter are not obvious or are difficult to measure." (Daniel J Levitin, "Weaponized Lies", 2017)

"The desire to collect as much data as possible must be balanced with an approximation of which data sources are useful to address a business issue. It is worth mentioning that often the value of internal data is high. Most internal data has been cleansed and transformed to suit the mission. It should not be overlooked simply because of the excitement of so much other available data." (Mike Fleckenstein & Lorraine Fellows, "Modern Data Strategy", 2018)

"A random collection of interesting but disconnected facts will lack the unifying theme to become a data story - it may be informative, but it won’t be insightful." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"Are your insights based on data that is accurate and reliable? Trustworthy data is correct or valid, free from significant defects and gaps. The trustworthiness of your data begins with the proper collection, processing, and maintenance of the data at its source. However, the reliability of your numbers can also be influenced by how they are handled during the analysis process. Clean data can inadvertently lose its integrity and true meaning depending on how it is analyzed and interpreted." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019

"Each decision about what data to gather and how to analyze them is akin to standing on a pathway as it forks left and right and deciding which way to go. What seems like a few simple choices can quickly multiply into a labyrinth of different possibilities. Make one combination of choices and you’ll reach one conclusion; make another, equally reasonable, and you might find a very different pattern in the data." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"It’d be nice to fondly imagine that high-quality statistics simply appear in a spreadsheet somewhere, divine providence from the numerical heavens. Yet any dataset begins with somebody deciding to collect the numbers. What numbers are and aren’t collected, what is and isn’t measured, and who is included or excluded are the result of all-too-human assumptions, preconceptions, and oversights." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"Unless we’re collecting data ourselves, there’s a limit to how much we can do to combat the problem of missing data. But we can and should remember to ask who or what might be missing from the data we’re being told about. Some missing numbers are obvious […]. Other omissions show up only when we take a close look at the claim in question." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"What is the purpose of collecting data? People gather and store data for at least three different reasons that I can discern. One reason is that they want to build an arsenal of evidence with which to prove a point or defend an agenda that they already had to begin with. This path is problematic for obvious reasons, and yet we all find ourselves traveling on it from time to time. Another reason people collect data is that they want to feed it into an artificial intelligence algorithm to automate some process or carry out some task. […] A third reason is that they might be collecting data in order to compile information to help them better understand their situation, to answer questions they have in their mind, and to unearth new questions that they didn't think to ask." (Ben Jones, "Avoiding Data Pitfalls: How to Steer Clear of Common Blunders When Working with Data and Presenting Analysis and Visualizations", 2020)

[Murphy’s Laws of Analysis:] "(1) In any collection of data, the figures that are obviously correct contain errors. (2) It is customary for a decimal to be misplaced. (3) An error that can creep into a calculation, will. Also, it will always be in the direction that will cause the most damage to the calculation." (G C Deakly)

"[…] numerous samples collected without a clear idea of what is to be done with the data are commonly less useful than a moderate number of samples collected in accordance with a specific design." (William C Krumbein)

More quotes on " Data Collection" at the-web-of-knowledge.blogspot.com
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.