Friday, October 18, 2024

The Power of Databases: Leveraging Data and Statistical Knowledge to Understand Science and Solve Problems

 

     This is a rather contemplative essay, but one I thought was important to ponder. Numbers allow us to analyze the world.

     Acquiring and analyzing data in order to arrive at conclusions and answer questions has long been the method of science. We often need to know what the data shows, and to do that we need to analyze the data through analogy, quantification, comparison to other data, and other statistical processes. Wikipedia defines statistics as follows:

“Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data.”

In line with that definition, statisticians are now often known as data scientists. In data analysis, there are two main methods: descriptive statistics, where data is summarized via indexes such as a mean or a standard deviation; and inferential statistics, where conclusions are drawn from data that are subject to random variation, such as observational errors and sampling variation. Descriptive statistics are concerned with characterizing the sample and the population distribution, while inferential statistics are mathematical and based on probability theory. According to Wikipedia: “Descriptive statistics is distinguished from inferential statistics (or inductive statistics), in that descriptive statistics aims to summarize a sample, rather than use the data to learn about the population that the sample of data is thought to represent.”
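
     To make the distinction concrete, here is a minimal sketch in Python, using only the standard library and a made-up set of porosity measurements as a hypothetical sample. The first two numbers merely describe the sample; the confidence interval is an inference about the larger population the sample is thought to represent.

```python
import math
import statistics

# Hypothetical sample: porosity measurements (%) from 12 core plugs
sample = [11.2, 12.5, 9.8, 13.1, 10.4, 12.0, 11.7, 10.9, 12.8, 11.5, 10.1, 12.3]

# Descriptive statistics: summarize the sample itself
mean = statistics.mean(sample)
stdev = statistics.stdev(sample)  # sample standard deviation
print(f"Sample mean:  {mean:.2f} %")
print(f"Sample stdev: {stdev:.2f} %")

# Inferential statistics: estimate the population mean while acknowledging
# sampling variation. The multiplier 2.20 is the t-value for 11 degrees of
# freedom at the 95% level (1.96 would be the large-sample value).
n = len(sample)
margin = 2.20 * stdev / math.sqrt(n)
print(f"95% confidence interval for the population mean: "
      f"{mean - margin:.2f} % to {mean + margin:.2f} %")
```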

Forms of data are often complementary. Numerical data tied to spatial data, or geography, in Geographic Information Systems (GIS) is often very valuable and useful. Data analysis in science often involves the discovery of formulas and recipes that hold true, which can then guide the best courses of action.

     I have seen colleagues in the oil & gas business develop large databases that have proved very useful in scientific analysis, and I have even helped to develop some. Federal, state, and local departments and regulatory agencies also develop large databases that are utilized in scientific papers and policy proposals, and for practical comparisons and high-grading/prioritization. Making government data available to citizens, academics, and industry is also a common and very useful practice.

     Geospatial databases allow information to be mapped in space, typically in 2D space, but also in 3D space where applicable. The added dimension of place allows for high-grading/low-grading and prioritization of project needs or of resource development.
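
     As a small illustration of how the added dimension of place supports high-grading, here is a sketch in plain Python (no GIS package, hypothetical prospect locations and scores) that ranks prospects by a resource score down-weighted by distance from an assumed pipeline tie-in point:

```python
import math

# Hypothetical prospects: name, x/y map coordinates (km), and a resource score
prospects = [
    ("Prospect A", 2.0, 3.5, 80),
    ("Prospect B", 10.0, 1.0, 95),
    ("Prospect C", 4.5, 8.0, 60),
    ("Prospect D", 1.0, 1.5, 70),
]

TIE_IN = (0.0, 0.0)  # assumed location of existing infrastructure (km)

def priority(prospect):
    """High-grading metric: resource score penalized by distance to the tie-in."""
    _, x, y, score = prospect
    distance = math.hypot(x - TIE_IN[0], y - TIE_IN[1])
    return score / (1.0 + distance)  # farther prospects are down-weighted

for name, x, y, score in sorted(prospects, key=priority, reverse=True):
    print(f"{name}: score={score}, "
          f"distance={math.hypot(x - TIE_IN[0], y - TIE_IN[1]):.1f} km, "
          f"priority={priority((name, x, y, score)):.1f}")
```

In a real workflow the same idea would run against a geospatial database with true coordinates, acreage, and many more attributes, but the principle of letting place re-rank purely numerical values is the same.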

     Data can be scanned for patterns and signals. Both humans and machines can do that scanning; humans do it better in certain aspects and machines in others. Thus, one could hypothesize that humans and machines collaborating on data analytics is currently the optimal arrangement. As long as we are still teaching machines what we figure out about data, and they are revealing hidden patterns in data that we cannot readily see, the partnership will remain productive. Humans teach. Machines learn. Then machines teach, in a kind of feedback loop. Humans have been scanning their environments for patterns since deep in our evolutionary past; doing so has aided our survival, our search for food, and our search for mates. Machines, however, have their own advantages in scanning data for patterns. They can work very fast and scan vast amounts of data without getting tired; they have speed and stamina. For the most part, they only make the errors that the humans who directed them built in. The ability of machines to analyze large datasets quickly makes them very valuable assistants to humans. However, it is still up to humans to fully interpret the results of data analysis.
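
     As a sketch of the kind of fast, tireless first pass a machine can make before a human reviews the results, here is a simple z-score scan over a hypothetical production series; the 2.5-standard-deviation threshold is an arbitrary assumption:

```python
import statistics

# Hypothetical daily production figures (barrels); one value is anomalous
production = [1020, 1005, 998, 1012, 1030, 640, 1008, 1015, 990, 1025]

mean = statistics.mean(production)
stdev = statistics.stdev(production)

# Machine pass: flag anything more than 2.5 standard deviations from the mean
flags = [(day, value) for day, value in enumerate(production, start=1)
         if abs(value - mean) / stdev > 2.5]

for day, value in flags:
    print(f"Day {day}: {value} bbl flagged for human review "
          f"(mean {mean:.0f}, stdev {stdev:.0f})")
```

The machine does the exhaustive scanning; the human decides whether the flagged day reflects a real operational problem or just a data-entry error.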

     Knowing a subject involves knowing its numbers and its statistics. We need to know how much of this and how much of that, and we need to pick out any trends that show up in the data. We use data to interpret the past, assess the present, and predict the future. We use it to support our decisions in policy and action, and we use data and data analysis to communicate our proposed actions and to back them up. We follow the data, which usually doesn’t lie. The data suggests courses of action, and we must decide on them.
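
     As one small example of picking out a trend and projecting it forward, here is a sketch that fits a least-squares line to hypothetical annual figures using Python's standard library (statistics.linear_regression requires Python 3.10 or later):

```python
import statistics

# Hypothetical annual water production (thousand barrels) for 2018-2024
years = [2018, 2019, 2020, 2021, 2022, 2023, 2024]
values = [410, 432, 455, 470, 489, 510, 528]

# Fit a least-squares trend line
fit = statistics.linear_regression(years, values)
print(f"Trend: about {fit.slope:.1f} thousand bbl per year")

# Use the trend to project the near future
for year in (2025, 2026):
    projected = fit.intercept + fit.slope * year
    print(f"Projected {year}: {projected:.0f} thousand bbl")
```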

     Data can tell stories. Those stories can support business intelligence, or BI. Microsoft and other companies sell tools to enhance BI. According to Microsoft:

 

What is business intelligence?

 

“Business intelligence (BI) uncovers insights for making strategic decisions. Business intelligence tools analyze historical and current data and present findings in intuitive visual formats.”

 

What is data storytelling?

 

“Data storytelling is the concept of building a compelling narrative based on complex data and analytics that help tell your story and influence and inform a particular audience.”

 

     Microsoft gives four steps to utilizing business intelligence:

Step 1: Collect and transform data from multiple sources

Business intelligence tools typically use the extract, transform, and load (ETL) method to aggregate structured and unstructured data from multiple sources. This data is then transformed and remodeled before being stored in a central location, so applications can easily analyze and query it as one comprehensive data set.

Step 2: Uncover trends and inconsistencies

Data mining, or data discovery, typically uses automation to quickly analyze data to find patterns and outliers which provide insight into the current state of business. BI tools often feature several types of data modeling and analytics—including exploratory, descriptive, statistical, and predictive—that further explore data, predict trends, and make recommendations.

Step 3: Use data visualization to present findings

Business intelligence reporting uses data visualizations to make findings easier to understand and share. Reporting methods include interactive data dashboards, charts, graphs, and maps that help users see what’s going on in the business right now.

Step 4: Take action on insights in real time

Viewing current and historical data in context with business activities gives companies the ability to quickly move from insights to action. Business intelligence enables real time adjustments and long-term strategic changes that eliminate inefficiencies, adapt to market shifts, correct supply problems, and solve customer issues.

      This can be summarized as 1) collecting data from multiple sources; 2) finding meaningful patterns in that data; 3) communicating that data to others through visualization; and 4) taking action based on findings.
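
     To make those four steps more concrete, here is a minimal sketch in Python (standard library only, with hypothetical well data written to stand-in CSV files and loaded into a throwaway SQLite database) of collecting data from two sources, uncovering a simple pattern, and surfacing a finding that someone could act on:

```python
import csv
import sqlite3
from pathlib import Path

# Stand-in for real source systems: two tiny CSV files with hypothetical data
Path("field_a.csv").write_text(
    "well,month,oil_bbl\nSMITH 1,2024-08,1500\nSMITH 1,2024-09,400\n")
Path("field_b.csv").write_text(
    "well,month,oil_bbl\njones 2,2024-08,900\njones 2,2024-09,880\n")

# Step 1: extract, transform, and load (ETL) into one queryable store
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE production (well TEXT, month TEXT, oil_bbl REAL)")
for source in ("field_a.csv", "field_b.csv"):
    with open(source, newline="") as f:
        for row in csv.DictReader(f):
            # Normalize well names and coerce types before loading
            con.execute("INSERT INTO production VALUES (?, ?, ?)",
                        (row["well"].strip().upper(), row["month"],
                         float(row["oil_bbl"])))

# Step 2: uncover a pattern -- wells whose worst month fell well below their average
query = """
    SELECT well, AVG(oil_bbl) AS avg_oil, MIN(oil_bbl) AS worst_month
    FROM production
    GROUP BY well
    HAVING MIN(oil_bbl) < 0.5 * AVG(oil_bbl)
"""

# Step 3: present the finding (a real BI tool would use dashboards and charts)
for well, avg_oil, worst_month in con.execute(query):
    print(f"{well}: worst month {worst_month:.0f} bbl vs. average {avg_oil:.0f} bbl")

# Step 4: acting on the insight (scheduling a workover, checking the meter)
# remains a human decision.
```

A real BI tool would replace the print statements with interactive dashboards, but the flow from multiple sources to a single queryable store to a finding is the same.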

     Microsoft gives three key elements of data storytelling: 1) build your narrative; 2) use visuals to enlighten; and 3) show data to support. Of course, we can also use data in biased ways, “cherry-picking” it to fit our preconceived narratives.

    As a geologist, I remember when computerized mapping was new(ish) and many maps and cross-sections were still done by hand. My supervisor at the time was rightly excited by what he called “the power of the computer.” It allowed us to analyze more information faster and to test ideas quickly to see if the data supported them. I always enjoyed making maps by hand because I could skew and orient them the way I thought they should be skewed and oriented. However, it would not be long before the computer could do that as well and so much more. We could now analyze large datasets instantly. Sure, we still needed to weed out bad data but now even that could be done faster and more effectively. The advantages were massive. Later we could add GIS layers to our maps and pick and choose different layers as needed. We used to plot out large maps and lay them out on large tables for analysis. Now we can use computers, often with two or even three large monitors synced as one in order to better scan and analyze.

     The final parts of data analysis are communicating the results to other decision-makers and acting on them. Thus, the data supports the action. That is why cherry-picking is problematic: the data is used selectively to tell a story that is often predetermined, fitted to a certain narrative instead of being freshly interpreted. As a scientist, I know that the goal is the truth of the situation, whether it fits preconceived ideas or not. That is not always easy, but it must be the way we do science. I have seen many media stories where data analysis conclusions are interpreted differently by those with biases than by those without them. We need to be aware of biased data analysis and to call out cherry-picking when it occurs.

    Just hours after I first published this post, a new article in Big Think came out describing some examples of the McNamara fallacy, named after Robert McNamara, the U.S. Secretary of Defense during the Vietnam War, who made several strategic miscalculations using quantitative data. The article points out that quantitative data can be misleading if qualitative information is missing or misinterpreted. McNamara over-relied “on measurable data leading to several misguided strategies where considering certain human and contextual elements would have been successful.

“The McNamara fallacy is what occurs when decision-makers rely solely on quantitative metrics while ignoring qualitative factors. In other words, it’s when you look at raw numbers rather than the nuances that matter in the decision-making process.”

     The article goes on to suggest that over-reliance on numerical data comes at the expense of stories and subtle details. Perhaps the following quote sums up the article best:

“Data is a great starting point, and a great many idiotic and dangerous things are done when we ignore data, but it doesn’t always make for the best decisions.”

     They give three suggestions for avoiding the McNamara fallacy: 1) Dig into the details – look at the numbers behind the numbers; 2) Consider the root causes – look into the reasons behind the numbers; and 3) Pick up the phone – apparently, getting the details is easier in person, including by phone.


References:

 

Statistics. Wikipedia.

What is data storytelling? Microsoft Power BI.

What is business intelligence? Microsoft Power BI.

The “McNamara fallacy”: When data leads to the worst decision. Jonny Thomson. Big Think. October 18, 2024.

