investigating the analytics landscape

BUS5PB – Principles of Business Analytics
Topic 1 – Introduction to Business Analytics
Learning Objectives
Defining and positioning business analytics
The analytics landscape
An analytics methodology
Analytics best practices
1
Three Principles of Real Estate
“location, location, and location”
2
Three Principles of Business Analytics
“business problem/opportunity,
business problem/opportunity, and
business problem/opportunity”
3
To Serve or Not to Serve?
• As an example, Fidelity Investments once considered discontinuing its bill-paying service
because this service consistently lost money.
• Some last-minute analysis saved it, by showing that Fidelity’s most loyal and most
profitable customers used the service.
• Although the service lost money, Fidelity made much more money on these customers’ other
accounts.
4
To Serve or Not to Serve?
• After all, customers who trust their financial
institution to pay their bills place a very high
level of trust in that institution.
Cutting such value-added services might
inadvertently exacerbate the profitability
problem by causing the best customers to look
elsewhere for better service.
5
What Is the Business Problem/Opportunity?
• Should Fidelity Investments consider discontinuing its bill-paying service because this
service consistently lost money?
• Should the investment company encourage customers to switch to alternative methods
of bill-paying?
6
The Business Analytics Landscape
7
The Business Analytics Landscape: data vs maturity
Business Analytics:
– Real-time
– Value-generating
– Actionable
8
Another perspective..
9
The analytics ecosystem
Descriptive (and diagnostic) analytics
What and why?
Predictive analytics
What is likely to occur?
Prescriptive analytics
What actions can be taken?
Exploratory analytics
Exposing the ‘unknown unknowns’
Information Builders, 2014
10
Positioning analytics
11
Positioning analytics
12
Positioning analytics
13
Positioning analytics
14
Positioning analytics
15
Positioning analytics
16
• EXPLORE
Ongoing search for answers,
understanding, insight and foresight
17
• VALUE
Real value is derived from the ideas
generated through conversations by
looking at the data
18
To generate Value, Business Analytics must…
• Be purposeful
• Deliver new insights
• Be actionable

• Be dynamic
19
Purpose
• Align with business functions
• Align with knowledge of person taking action
• Address management objectives and issues
(strategy, performance, compliance, risk)
• Derive real value from the ideas generated through
conversations by closely investigating information
20
Insight
• Discover new facts or information
• Create awareness of facts previously unknown
• Surface cause-effect information
• Enable future-looking decision-making
21
Action
• Support decision-making and action-taking
• Improve discovery and insight, determination and resolve,
innovation and creativity
• Lead to actions that integrate well with organisational
processes
• Initiate actions that improve or generate revenue
22
Business intelligence
“The ability to apprehend the interrelationships of presented facts in such a way as to guide
action towards a desired goal.”
Hans Peter Luhn (1958) A Business Intelligence System
“Concepts and methods to improve business decision making by fact-based support
systems.”
Howard Dresner (1989) A Brief History of Decision Support Systems
23
Business analytics
“The extensive use of data, statistical and quantitative analysis, explanatory and predictive
models, and fact-based management to drive decisions and actions.”
Davenport and Harris (2007)
Competing on Analytics: The New Science of Winning
24
Business intelligence versus business analytics
At least three views exist:
Business analytics is an integral part of business intelligence.
“I think of analytics as a subset of BI based on statistics, prediction and optimization. The
great bulk of BI is much more focused on reporting capabilities. Analytics has become a
sexier term to use — and it certainly is a sexier term than reporting — so it’s slowly
replacing BI in many instances.”
Thomas Davenport (2010)
Analytics at Work: Q&A with Tom Davenport
25
Business intelligence versus business analytics
• Business intelligence and business analytics are synonymous.
• “The term business intelligence is used by the information technology community,
whereas business analytics is preferred by the business community. The two terms are
synonymous and will henceforth be referred to as BI/BA.”
Sumit Sircar (2010)
Business Intelligence in the Business Curriculum
26
Business intelligence versus business analytics
• Business intelligence and business analytics have key differences.
• Business intelligence describes: “What happened?”
• Business analytics describes:
• “Why did it happen?”
• “What will happen?”
• “What is the best that can happen?”
SearchBusinessAnalytics.com (2011)
Bill Chamberlin (2011) A Primer on Advanced Business Analytics
27
Achieving success with business analytics
28
Data deluge
29
Three consequences of the data deluge
• Every problem will generate data eventually.
• Proactively defining a data collection protocol will result in more
useful information, leading to more useful analytics.
• Every company will need analytics eventually.
• Proactively analytical companies will compete more effectively.
• Everyone will need analytics eventually.
• Proactively analytical people will be more marketable and more
successful in their work.
30
The business analytics challenge
31
Data deluge
32
Q&A
• Describe a data system you work with that
generates a large amount of information.
33
Changes in the analytical landscape
34
Changes in the analytical landscape
• Historical Changes
• Executive dashboarding – Static reports about business processes
• Total quality management (TQM) – Customer focused
• Six Sigma – Voice of the process, voice of the customer
• Customer relationship management (CRM) – The
right offer to the right person at the right time
• Forecasting and predicting – 360-degree customer view
35
Changes in the analytical landscape
• Relational databases
• Enterprise resource planning (ERP) systems
• Point of sale (POS) systems
• Data warehousing
• Decision support systems
• Reporting and ad hoc queries
• Online analytical processing (OLAP)
• Performance management systems
• Executive information systems (EIS)
• Balanced scorecard
• Dashboard
• Business intelligence
36
CRM Evolution
• Total quality management (TQM)
• Product-centric
• Quality: Six Sigma
• Total customer satisfaction
• Mass marketing
• One-to-one marketing
• Customer relationship
• Wallet share of customer
• Customer retention
• Customer relationship management (CRM)
• Customer-centric
• Strategy
• Process
• Technology
37
Changes in the analytical landscape
38
Idiosyncrasies of business analytics
• 1. The Data
• Massive, operational, and opportunistic
• 2. The Users and Sponsors
• Business decision support
• 3. The Methodology
• Computer-intensive ad hoc-ery
• Multidisciplinary lineage
Data mining can be defined as advanced methods for exploring and modeling relationships
in large amounts of data.
Data mining is an essential component of business analytics.
39
The data
             Experimental           Opportunistic
Purpose      Research               Operational
Value        Scientific             Commercial
Generation   Actively controlled    Passively observed
Size         Small                  Massive
Hygiene      Clean                  Dirty
State        Static                 Dynamic
40
The data: disparate business units
41
Opportunistic data
• Operational data is typically not collected with data analysis in mind.
• Multiple business units produce a silo-based data system.
• This makes business analytics different from experimental statistics and especially
challenging.
42
The methodology: what we learned not to do
• Prediction is more important than inference.
• Metrics are used “because they work,” not based on theory.
• p-values are rough guides rather than firm decision cutoffs.
• Interpretation of a model might be irrelevant.
• The preliminary value of a model is determined by its ability to predict a holdout sample.
• The long-term value of a model is determined by its ability to continue to perform well on
new data over time.
• Models are retired as customer behavior shifts, market trends emerge, and so on.
43
Using analytics intelligently
• Intelligent use of analytics results in the following:
• better understanding of how technological, economic, and marketplace shifts affect
business performance
• ability to consistently and reliably distinguish between effective and ineffective
interventions
• efficient use of assets, reduced waste in supplies, and better management of time and
resources
• risk reduction via measurable outcomes and reproducible findings
• early detection of market trends hidden in massive data
• continuous improvement in decision making over time
44
Simple reporting
• Examples: OLAP, RFM, QC, descriptive statistics, extrapolation
• Answer questions such as
• Where are my key indicators now?
• Where were my key indicators last week?
• Is the current process behaving like normal?
• What is likely to happen tomorrow?
45
Proactive analytical investigation
• Examples: inferential statistics, experimentation, empirical validation, forecasting,
optimization
• Answer questions such as
• What does a change in the market mean for my targets?
• What do other factors tell me about what I can expect from my target?
• What is the best combination of factors to give me the most efficient use of resources and
maximum profitability?
• What is the highest price the market will tolerate?
• What will happen in six months if I do nothing? What if I implement an alternative strategy?
46
Data stalemate
• Many companies have data that they do
not use or that is used by third parties.
These third parties might even resell the
data and any derived metrics back to the
original company!
• Example: retail grocery POS card
47
Every little bit…
• Taking an analytical approach to even a few key business problems with reliable metrics
yields tangible benefit.
• The benefits and savings derived from early analytical successes build managerial support for
further analytical efforts.
• Everyone has data.
• Analytics can connect data to smart decisions.
• Proactively analytical companies outpace competition.
48
Areas where analytics are often used
• New customer acquisition
• Customer loyalty
• Cross-sell / up-sell
• Pricing tolerance
• Supply optimization
• Staffing optimization
• Financial forecasting
• Product placement
• Churn
• Insurance rate setting
• Fraud detection
Which residents in a ZIP code
should receive a coupon in the
mail for a new store location?
49
Areas where analytics are often used
• New customer acquisition
• Customer loyalty
• Cross-sell / up-sell
• Pricing tolerance
• Supply optimization
• Staffing optimization
• Financial forecasting
• Product placement
• Churn
• Insurance rate setting
• Fraud detection
What advertising strategy best
elicits positive sentiment toward
the brand?
50
Areas where analytics are often used
• New customer acquisition
• Customer loyalty
• Cross-sell / up-sell
• Pricing tolerance
• Supply optimization
• Staffing optimization
• Financial forecasting
• Product placement
• Churn
• Insurance rate setting
• Fraud detection
What is the best next product for
this customer?
What other product is this
customer likely to purchase?
51
Areas where analytics are often used
• New customer acquisition
• Customer loyalty
• Cross-sell / up-sell
• Pricing tolerance
• Supply optimization
• Staffing optimization
• Financial forecasting
• Product placement
• Churn
• Insurance rate setting
• Fraud detection
What is the highest price that the
market will bear without
substantial loss of demand?
52
Areas where analytics are often used
• New customer acquisition
• Customer loyalty
• Cross-sell / up-sell
• Pricing tolerance
• Supply optimization
• Staffing optimization
• Financial forecasting
• Product placement
• Churn
• Insurance rate setting
• Fraud detection
How many 60-inch HDTVs
should be in stock? (Too many is
expensive; too few is lost
revenue.)
53
Areas where analytics are often used
• New customer acquisition
• Customer loyalty
• Cross-sell / up-sell
• Pricing tolerance
• Supply optimization
• Staffing optimization
• Financial forecasting
• Product placement
• Churn
• Insurance rate setting
• Fraud detection
What are the best times and best
days to have technical experts on
the showroom floor?
54
Areas where analytics are often used
• New customer acquisition
• Customer loyalty
• Cross-sell / up-sell
• Pricing tolerance
• Supply optimization
• Staffing optimization
• Financial forecasting
• Product placement
• Churn
• Insurance rate setting
• Fraud detection
What weekly revenue increase
can be expected after the
Mother’s Day sale?
55
Areas where analytics are often used
• New customer acquisition
• Customer loyalty
• Cross-sell / up-sell
• Pricing tolerance
• Supply optimization
• Staffing optimization
• Financial forecasting
• Product placement
• Churn
• Insurance rate setting
• Fraud detection
Will oatmeal sell better near
granola bars or near baby food?
56
Areas where analytics are often used
• New customer acquisition
• Customer loyalty
• Cross-sell / up-sell
• Pricing tolerance
• Supply optimization
• Staffing optimization
• Financial forecasting
• Product placement
• Churn
• Insurance rate setting
• Fraud detection
Which customers are most likely
to switch to a different wireless
provider in the next six months?
57
Areas where analytics are often used
• New customer acquisition
• Customer loyalty
• Cross-sell / up-sell
• Pricing tolerance
• Supply optimization
• Staffing optimization
• Financial forecasting
• Product placement
• Churn
• Insurance rate setting
• Fraud detection
How likely is it that this individual
will have a claim?
58
Areas where analytics are often used
• New customer acquisition
• Customer loyalty
• Cross-sell / up-sell
• Pricing tolerance
• Supply optimization
• Staffing optimization
• Financial forecasting
• Product placement
• Churn
• Insurance rate setting
• Fraud detection
How can I identify a fraudulent
purchase?
59
When analytics are not helpful
• Snap decisions required
• Novel approach (no previous data possible)
• Most salient factors are rare (making
decisions to work around unlikely obstacles
or miracles)
• Expert analysis suggests a particular path
• Metrics are inappropriate
• Naïve implementation of analytics
• Confirming what you already know
Deciding when to run
from danger
60
When analytics are not helpful
• Snap decisions required
• Novel approach (no previous data possible)
• Most salient factors are rare (making
decisions to work around unlikely obstacles or
miracles)
• Expert analysis suggests a particular path
• Metrics are inappropriate
• Naïve implementation of analytics
• Confirming what you already know
Predicting the adoption
of a new technology
61
When analytics are not helpful
• Snap decisions required
• Novel approach (no previous data possible)
• Most salient factors are rare (making
decisions to work around unlikely obstacles
or miracles)
• Expert analysis suggests a particular path
• Metrics are inappropriate
• Naïve implementation of analytics
• Confirming what you already know
Planning contingencies
for employees winning
the lottery
62
When analytics are not helpful
• Snap decisions required
• Novel approach (no previous data possible)
• Most salient factors are rare (making
decisions to work around unlikely obstacles
or miracles)
• Expert analysis suggests a particular path
• Metrics are inappropriate
• Naïve implementation of analytics
• Confirming what you already know
The seasoned art critic
can recognize a fake
63
When analytics are not helpful
• Snap decisions required
• Novel approach (no previous data possible)
• Most salient factors are rare (making
decisions to work around unlikely obstacles
or miracles)
• Expert analysis suggests a particular path
• Metrics are inappropriate
• Naïve implementation of analytics
• Confirming what you already know
Predicting athletes’
salaries or quantifying
love
64
When analytics are not helpful
• Snap decisions required
• Novel approach (no previous data possible)
• Most salient factors are rare (making
decisions to work around unlikely obstacles
or miracles)
• Expert analysis suggests a particular path
• Metrics are inappropriate
• Naïve implementation of analytics
• Confirming what you already know
Only looking at one
variable at a time
65
When analytics are not helpful
• Snap decisions required
• Novel approach (no previous data possible)
• Most salient factors are rare (making
decisions to work around unlikely obstacles
or miracles)
• Expert analysis suggests a particular path
• Metrics are inappropriate
• Naïve implementation of analytics
• Confirming what you already know
Ignoring variables that
might be important
66
Summary
• Defining and positioning business analytics
• The analytics landscape
• An analytics methodology
• Analytics best practices
67
References
• Data Science for Business, Foster Provost and Tom Fawcett, 1st ed.
• Competing on Analytics: The New Science of Winning, Thomas H. Davenport, 1st ed.
• Analytics at Work: Smarter Decisions, Better Results, Thomas H. Davenport, 1st ed.
• The Value of Business Analytics: Identifying the Path to Profitability, Evan Stubbs, 1st ed.
68
BUS5PB – Principles of Business Analytics
Topic 2 – Analytics-driven Organisations
Learning Objectives
Anatomy of decisions
Business value of analytics outcomes
Analytics components
Analytics technologies
1
Decision-making
• Organizations make decisions concerning all aspects of their business – from sales and marketing
to procurement and finance.
• Decisions are made affecting revenues, profit margins, reputation, goodwill and
sustainability.
2
Decision-making examples
3
Changing context of decision-making
4
Importance of the customer
• According to Peter Drucker, there is only one valid definition of a business purpose: ‘to
create a customer’
• For any business to thrive it must pursue a strategy that
• Gains more customers
• Keeps existing customers
• Increases the frequency and value of transactions
Peter Drucker (1909 – 2005), writer, business
thinker, and founder of modern management.
5
Customer as an information source
• In the current business environment the individual customer has become an information
triggering and transmitting device
• E.g. During an average day many individuals will use an ATM card, credit card, a loyalty
card, a call centre, the internet etc.
• They will also use their smart device to read online news, comment and network on
social media.
• What consumers do, when they do it and how they do it are recorded.
• Why they do it requires further dialogue with the customer.
6
Customer-centric analytics
• Modern customers are better informed and much more demanding than ever – loyalty of a
customer cannot be taken for granted.
• A key (expected) outcome of BA is to optimise the lifetime value of customers
• To do this,
• Get to know the customers better and
• Interact appropriately with the customers
• Organizations need to think about finding products for the customers rather than
customers for their products!
7
Understanding the customer
• A business with thousands or millions of customers cannot be expected to have a
personal relationship with each and every one of them.
• However, careful interpretation of routinely accumulated information can be used to
drive the organization’s ‘behaviour’ so that customers feel that the business
‘understands’ their needs.
• What is needed is not blanket marketing campaigns, but targeted marketing
campaigns.
• Decisions need to be made when the customer is understood.
8
Cross-channel marketing approach
“with customers, not at customers”
Responsys Inc. 2012
9
Effective decisions
• Definition: Effective Decisions
• Effective decisions are choices that move an organization closer to an agreed set of
goals in a timely manner
• First key: a set of goals to work towards.
• Second key: means to measure if a chosen course is moving toward or away from those
goals.
• Third key: information based on those measures must be provided to the decision maker
in a timely manner.
10
Who are the decision-makers?
• Is it only the CEO, President, Chairperson?
• Absolutely brilliant strategic plans can go wrong because of poor decisions
made by those responsible for their implementation.
• Important to have effective decision makers throughout an organization.
• A change from the past where lower level managers worked to only implement
high level decisions.
• Effective decision making at every level leads to success
11
Three keys at many levels..
• The three keys to effective decisions are applicable to any level of an
organizational hierarchy.
12
Operational decisions
• Generally more concerned with day-to-day running of a business.
• Examples:
• How many unfulfilled orders are there?
• On which items are we out of stock?
• What is the position of a particular order?
• Questions about the situation right now.
13
Strategic decisions
• Strategic decisions deal with planning and policy making.
• Examples:
• When a telecommunication company decides to introduce very cheap off-peak tariffs to
attract callers away from the peak times, rather than install extra equipment to cope with
increasing demand
• A large supermarket chain deciding to open its stores on Sundays
• A general 20 percent price reduction for one month in order to increase market share
• Operations can be considered as the implementation of the organization’s strategy
14
Analytics for strategic decisions
• Analytics is generally used for strategic decisions, though it could be applicable at lower
levels of management.
• And how do we use BI/BA at the strategic level?
• When we know what we are looking for
• Layout-led Discovery
• Data-led Discovery
• Discovering new questions and answers
• Analytics
15
Layout-led discovery
• When the question we want answered is known and we have a good idea where that
answer is going to be found
• E.g. A printed report of sales volumes by region
• Will only provide answers that the report designer has included.
• Highly-dependable and credible information
• No ability to further explore or drill down
16
Data-led discovery
• When the question is known, but we don’t know exactly where to look for our answer
• E.g. We want to look up sales by product in an unusually low sales region X
• The information we find determines where we want to go next.
• The developer of this type of solution cannot know everywhere the report user may want to
go.
• Instead, the developer must provide an interactive environment that enables the user to
navigate at will.
17
Analytics
• When the data contain trends, correlations, and dependencies at a level of detail that
would be impossible for a human being to notice using either layout-led or data-led
discovery.
• E.g. Which N products would sell best together?
• These relationships can be discovered by a machine using data mining techniques.
• Where layout-led discovery and data-led discovery usually start with summarized data,
data mining works at the highest level of detail (transactional data).
18
Analytics in business organizations: purpose
• Align with business functions
• Align with knowledge of person taking action
• Address management objectives and issues
(strategy, performance, compliance, risk)
• Derive real value from the ideas generated through conversations by closely investigating
information
19
Analytics in business organizations: insight
• Discover new facts or information
• Create awareness of facts previously unknown
• Surface cause-effect information
• Enable future-looking decision-making
20
Analytics in business organizations: action
• Support decision-making and action-taking
• Improve discovery and insight, determination and resolve,
innovation and creativity
• Lead to actions that integrate well with organisational processes
• Initiate actions that improve or generate revenue
21
Business analytics – value adds
Optimise prices
Manage inventory
Engineer finances /
taxes
Forecast sales
(accurately)
Detect / avoid fraud
Retain customers
Analyse campaigns
Predict recruitment
success
Optimise locations /
routes
22
Business analytics implementation
23
Business analytics implementation
24
Business analytics focus areas
25
Business analytics focus areas
26
Analytics components and technology
27
Distilling the components I
• BI/BA is an umbrella term that includes architectures, tools, databases, applications and
methodologies.
• These can be grouped into:
• Data warehouses and data marts
• Online analytical processing (OLAP)
• Querying and reporting
• Analytics and data mining
• Business performance management
• Dashboards and benchmarking
28
Distilling the components II
• Data warehouses and data marts
• The cornerstone of medium-to-large BI/BA systems
• Eradicates organizational silos to present a single version of the truth.
• In the past, included only historical data that was organized and summarized for
producing end-user reporting
• Today, include access to current data for real-time decision support
• Structured on the dimensional model of a business process
• A warehouse is also the source for querying, reporting and OLAP.
29
Distilling the components III
• Analytics and data mining
• An umbrella term to refer to a suite of computer algorithms from fields such as statistics,
machine learning, data mining, AI.
• Transform data into insights
• Identify causes, make predictions and optimise decisions
• Emerging field of big data analytics where unstructured data is also introduced into the
decision-making process.
• Unstructured data include text, images, audio/video.
• Overcome the silos created by the nature of data.
30
Distilling the components IV
• Business performance management (BPM)
• Also referred to as corporate PM or enterprise PM
• encompasses all the processes, information, and systems used by managers to set
strategy, develop plans, monitor execution, forecast performance, and report results with
a view to achieving sustainable success no matter how success may be defined.
• Loop activity: measurement – evaluation – adjustments
• Methods – Six Sigma, balanced scorecard, activity-based costing
31
Distilling the components V
• Dashboards and benchmarking
• A dashboard provides a comprehensive graphical/pictorial view of corporate
performance measures, trends, and exceptions.
• This can extend from organizational level to department and all the way down to
employee.
• Dashboards facilitate benchmarking across the hierarchy
• A dashboard will retrieve data in real-time from a warehouse or reporting interface.
• Many categories of dashboards
• Strategic, analytical, operational
• Quantitative, qualitative
32
Another take…
33
From a technology perspective
• Three complementary technologies:
• Data warehouses (storage)
• Reporting, OLAP and dashboards (query and present)
• Analytics (knowledge discovery in data)
• Data warehouses (technology take)
• Thematic storing of aggregated, derived, transformed data
• Achieved through ETL (Extract, Transform and Load) tools that move data from
operational databases to the data warehouse
34
Transaction systems vs analytics systems
Attribute | Transactional systems (e.g. operational DBMS) | BI/BA systems
Business purpose | Support the operational business activities in an efficient manner | Support strategic and tactical activities by giving the right information and new insights to the business
Characteristic | Operational processing | Informational processing
Orientation | Transaction | Analysis
Function | Day-to-day operations | Long-term information requirements, decision support
Business use | Mandatory – the user often has no choice and must use the system in order to conduct business | Voluntary – analyses and reports can often be done with other tools (such as spreadsheets), even though it may be less efficient
User type | Frontline worker, operational staff | Knowledge worker, managerial staff
IT support | Relatively easy to plan, unless the system suffers from quality problems | Demands flexibility
Focus | Data in | Information out
Unit of work | Short, simple transaction | Complex query
View | Detailed, flat relational | Summarised, multidimensional
Data content | Current values | Archived, derived, summarised
Data structure | Optimised for transactions (integrity) | Optimised for complex queries (redundancy)
Database design | Entity-relationship based, application oriented | Star/snowflake schema, subject oriented
Access frequency | High | Medium to low
Access type | Read/write, update, delete | Mostly read
35
Analytics projects versus IT projects
36
Evolution of analytics technology
• http://www.businesscomputingworld.co.uk/wp-content/uploads/2012/03/BI.jpg
Classification by era:
• Ad hoc reports
• Special extract programs
• Purpose-built applications
• Decision support systems
• Enterprise information systems
• Business analytics platforms
37
Major capabilities of BI/BA platforms
Category | Capabilities
Information delivery | Reporting; dashboards; ad hoc query; Office integration; search-based BI
Integration | BI infrastructure; metadata management; application development; workflow and collaboration
Analysis | Online analytical processing; advanced visualisation; predictive modelling; data mining
Measurement | Balanced scorecards; activity-based costing; total quality management
38
Analytics vendors
39
Selecting a product
• In a highly competitive analytics
market, we need to know what
to look for in a product
• One way is to assess the five
characteristics of analytics (as
a product) and decide if it fits
your organisation’s needs.
Characteristic | Description
Integrated | A single, enterprise-wide view of the data
Data integrity | Information must be accurate and must conform to business rules
Accessible | Easily accessible, with intuitive access paths, and responsive for analysis
Credible | Every business factor must have one and only one value
Timely | Information must be available within the stipulated time frame
40
Performance vs. deployment
41
A 2014 survey
• Survey: InformationWeek 2014 Analytics, Business Intelligence, and Information
Management Survey
• InformationWeek surveyed 312 business technology decision-makers at North
American companies in 2013.
• The InformationWeek media platform (subscribers, event attendees and online
users) reaches more than two million business technology buyers.
42
Summary
• Anatomy of decisions
• Business value of analytics outcomes
• Analytics components
• Analytics technologies
43
References
• Data Science for Business, Foster Provost and Tom Fawcett, 1st ed.
• Competing on Analytics: The New Science of Winning, Thomas H. Davenport, 1st ed.
• Analytics at Work: Smarter Decisions, Better Results, Thomas H. Davenport, 1st ed.
• The Value of Business Analytics: Identifying the Path to Profitability, Evan Stubbs, 1st ed.
44
BUS5PB – Principles of Business Analytics
Topic 3 – Descriptive Analytics 1
Learning Objectives
Defining descriptive analytics
The nature of variables in analytics
Univariate analysis techniques
Bivariate analysis techniques
1
The analytics ecosystem
• Analytics derives insights from large volumes
of data. Essentially a reduction/summarisation
process.
• The first phase is descriptive analytics (also
called diagnostic analytics or descriptive statistics).
• The next phase is predictive analytics to
forecast what might happen in the future.
• Finally prescriptive analytics identifies
potential actions and consequences.
• More recently, exploratory analytics aims to
address the big data and social media
challenges.
Datafloq, 2014
2
The analytics ecosystem
Descriptive (and diagnostic) analytics
What and why?
Predictive analytics
What is likely to occur?
Prescriptive analytics
What actions can be taken?
Exploratory analytics
Exposing the ‘unknown unknowns’
Information Builders, 2014
3
Defining descriptive analytics
• What occurred and why it occurred?
• It is always a historical view of the organisation and its business activities.
• Textbook definitions:
• “Consists of organizing, summarising and presenting data in an informative manner ”
• “techniques for summarizing data (sample or a population) using graphs and numbers”
• There’s also inferential statistics, which draws conclusions about populations based on
samples.
• Textbook definitions:
• “Understand the summaries and draw conclusions or estimates about the population”
• “Techniques that use samples to make generalizations about populations”
4
Defining descriptive analytics
• An old school definition:
• “Statistics is the science of collecting, organizing, summarizing, and analyzing information
to draw conclusions or answer questions. In addition, statistics is about providing a
measure of confidence in any conclusions”
• “Methods for organizing, summarising, presenting data using graphical and
numerical techniques in order to draw conclusions about business activities.”
• Examples:
• Sales revenue
• Customer satisfaction
• Market share
• Employee turnover
Sum, mean, median, mode, range – which one, and then what?
5
Hindsight
What’s our approach?
• Structure the problem: define the problem, identify the variables, collect the data
• Analyse the problem: quantitative analysis, qualitative analysis, summarise and evaluate
• Draw conclusions and decide
6
It all starts with a variable…
• A variable is an observed characteristic or measure of business interest.
• Sales revenue, customer satisfaction, economic health, sentiment, attitude, colour etc.
• The actual data (or the values) observed for a variable are called observations.
• Observations vary from one entity to another.
• Types of variables:
• Qualitative and quantitative
• Qualitative (categorical): each observation belongs to one of a set of well-defined
categories
• Quantitative (numerical): each observation takes numerical values that represent
different magnitudes of the variable.
• Quantitative variables can be either discrete or continuous
• Discrete: values form a set of separate numbers (the value results from counting)
• Continuous: values form an interval (an infinite number of possible values that are not
countable)
7
It all starts with a variable…
• A variable is discrete if its value results from counting; it is continuous if its value is
measured.
• Examples: sales revenue, customer satisfaction, gender, temperature, postcode, units sold
• There’s also the number of variables being investigated;
• Univariate – observations on a single variable (customer satisfaction)
• Bivariate – observations on the relationship between two variables (customer satisfaction
and sales revenue)
• Multivariate – observations on the relationships between many variables (customer
satisfaction, demographics, purchasing power and sales revenue)
8
Level of measurement of a variable
• We can also assign a level of measurement to variables,
• Nominal level – values are solely used to name, label, or categorize.
• gender, nationality
• Ordinal level – demonstrates properties of the nominal level AND values
can be arranged in a ranked or specific order.
• grades, level of satisfaction
• Interval level – demonstrates properties of the ordinal level AND
differences in the values have meaning.
• temperature, date
• Ratio level – demonstrates properties of the interval level AND ratios of the
values have meaning.
• height, weight, price, time
9
More definitions…
• Variable is an observed characteristic or measure.
• And variables vary! (If they didn’t, we would have constants and statistical methods would not be
required – wouldn’t that be nice.)
• So we observe the variation of variables in groups.
• A population is the entire group to be studied.
• An individual is a person, entity or object that is a member of the population being
studied.
• A sample is a subset of the population that is being studied.
• A statistic is a numerical summary of a sample.
• A parameter is a numerical summary of a population.
• So variables are characteristics of individuals within a population.
• Since we cannot observe entire populations, we take samples and derive statistics to
draw conclusions on parameters.
10
And now the techniques
• “Techniques for summarising data using graphs and numbers”
• Graphs – graphical techniques
• Categorical: bar graph, pie chart, component bar graphs
• Quantitative: dot plots, stem and leaf plots, histograms, boxplots
• Increasing popularity of data visualisation has led to visual analytics (BUS5VA)
• Numbers – numerical techniques (mostly for quantitative variables)
• Measures of central tendency (mean, median, and mode (the mode is also used for categorical data))
• Measures of dispersion (range, standard deviation, variance)
• Measures of position (z-score, percentiles, quartiles, interquartile range)
11
Let’s start with numbers..
• Measures of central tendency: the mean
• Computed by adding all values of the variable and dividing by the number of observations
• An example: Average revenue per unit (ARPU) – revenue generated per customer or unit.
Enables analysis of revenue generation and growth at the per-unit level.
• We can investigate the population by taking total sales revenue and total customer base
OR take a sample from the customer base and the corresponding sales revenue.
• Population mean (μ) and sample mean (x̄)
• [total revenue $1,000,000, total customers 2000] population mean (ARPU) = $500
• Taking a sample of customers last month,
• [total revenue $100,000, customers 10] sample mean (ARPU) = $10,000
12
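To make the ARPU arithmetic above concrete, a minimal Python sketch; the dollar figures are the hypothetical ones quoted on this slide.

```python
# Population vs sample mean (ARPU), using the hypothetical figures above
total_revenue = 1_000_000          # total revenue across the whole customer base ($)
total_customers = 2_000
population_arpu = total_revenue / total_customers      # population mean = $500

sample_revenue = 100_000           # revenue for a sample of customers last month ($)
sample_customers = 10
sample_arpu = sample_revenue / sample_customers        # sample mean = $10,000

print(f"Population ARPU: ${population_arpu:,.2f}")
print(f"Sample ARPU:     ${sample_arpu:,.2f}")
```

The large gap between the two means illustrates how a small or unrepresentative sample can give a misleading picture of the population parameter.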
Measures of central tendency: the median
• Measures of central tendency: the median
• The median of a variable is the value that lies in the middle of the data when arranged in
ascending order.
M represents the median.
• How to find the median:
• Arrange the data in ascending order.
• Determine the number of observations, n.
• Determine the observation in the middle of the data set.
• If the number of observations is odd, the median is the data value exactly in the middle
of the data set; that is, the observation at position (n + 1)/2.
• If the number of observations is even, the median is the mean of the two middle
observations; that is, the mean of the observations at positions n/2 and (n/2) + 1.
13
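A short Python sketch of the procedure described above, using only the standard library; the input values are made up for illustration.

```python
def median(values):
    """Median as defined above: sort, then take the middle observation (odd n)
    or the mean of the two middle observations (even n)."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:                            # odd n: the observation at position (n + 1)/2
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2    # even n: mean of positions n/2 and n/2 + 1

print(median([7, 3, 9, 1, 5]))       # -> 5
print(median([7, 3, 9, 1, 5, 11]))   # -> 6.0
```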
Measures of central tendency: the mode
• Measures of central tendency: the mode
• The mode of a variable is the most frequent observation of the variable that occurs in the
dataset.
• A dataset can have no mode, one mode, or more than one mode.
• If no observation occurs more than once, the data have no mode.
• Bimodal: two modes
• Multimodal: three or more modes
• The mode is usually not reported for multimodal data because it is not representative of a
typical value
• Example: in a list of injury locations, the mode is ‘back’ (12 instances)
14
Measures of central tendency: when to use what?
• Measures of central tendency: when to use what?
• Mean: when data are quantitative and the frequency distribution is roughly symmetric
• Median: when the data are quantitative and the frequency distribution is skewed left or right
• Mode: When the data are qualitative or the most frequent observation is the desired
measure of central tendency (so it’s not widely used).
• Skewness: the degree to which a graph differs from a symmetric graph. Histograms are
used to visualize skewness.
• Skewed left (negative skew) – long tail on the left or negative side of the peak.
• Skewed right (positive skew) – long tail on the right or positive side of the peak.
15
Histograms
• A bar graph that portrays the frequency (or relative frequency) of occurrence of observations.
• Histograms can be plotted for discrete data (classes) and continuous data (ranges or
intervals).
• X-axis: class, range or interval. Y-axis: frequency
• The height of each bar reflects the respective frequency. The width of each rectangle is the
same and the rectangles touch each other.
• Relative frequency: absolute frequency divided by total observations.
• Example: number of customers per day (discrete variable)
16
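A minimal matplotlib sketch of the example above; the daily customer counts are invented for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical number of customers per day over two weeks (a discrete variable)
customers_per_day = [23, 25, 25, 27, 23, 30, 25, 27, 24, 25, 28, 30, 27, 25]

# Width-1 bins so each integer count forms its own class and the bars touch
plt.hist(customers_per_day, bins=range(22, 32), edgecolor="black")
plt.xlabel("Customers per day (class)")
plt.ylabel("Frequency")
plt.title("Histogram of customers per day")
plt.show()
```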
Histograms for continuous data
• For a continuous variable, the interval of possible values is divided into smaller intervals or
ranges formed with values grouped together.
• This can also be useful when a discrete variable has a large number of possible values,
such as an exam mark.
• So the data are categorized, or grouped, by intervals of numbers.
• Each interval represents a class. (also called discretisation)
• Start the intervals with the smallest number or a number slightly smaller than this.
• The goal in constructing a frequency distribution is to reveal interesting features, but we
also typically want the number of classes to be between 5 and 20.
• A rule of thumb for determining the size of the interval (or class width or size of range):
17
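The rule of thumb itself is not reproduced on the slide; a commonly used version (an assumption here, since the slide’s exact formula is not shown) is:

```latex
\text{class width} \approx \frac{\text{largest value} - \text{smallest value}}{\text{number of classes}}
\qquad \text{(rounded up to a convenient value)}
```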
Histograms for continuous data
• An example: frequency distribution of the five-year rates of return of 40 mutual funds,
plotted with class widths of 0.5, 1 and 3
• too few classes cause a bunching effect
• too many classes disperse the data
• an in-between class width exposes the patterns
18
Shape of a histogram
• Also referred to as the shape of a (frequency) distribution. (Note that ‘distribution’ or ‘frequency
distribution’ usually implies a histogram.)
• Also keep in mind that data will not always exhibit behaviour that perfectly matches a well-defined shape.
• Some flexibility is required to identify the shape and there can be disagreements, since identifying
shape is subjective.
• Uniform distribution: the frequency of each value is evenly spread out across the values of the variable.
• Bell-shaped (normal) distribution: the highest frequency occurs in the middle and frequencies tail off
to the left and right of the middle.
• Skewed right: the tail to the right of the peak is longer than the tail to the left.
• Skewed left: the tail to the left of the peak is longer than the tail to the right.
19
Skew examples
• IQ of a well-represented sample of humans
• Life span of humans
• Annual income of adults in Australia
20
Measures of dispersion
• Measures of central tendency describe the typical value of a variable.
• It is important to also know the amount of dispersion in the variable, so we have measures
of dispersion.
• Dispersion is the degree to which the data are spread out.
• Range (R): the difference between the largest and the smallest data values of a variable.
Range is used less as it only uses the largest and smallest observations.
• Standard deviation (s): represents a typical distance or a type of average distance of an
observation from the mean, also the square root of variance.
• Variance (s2): average of the squares of the deviations from the mean.
21
Measures of dispersion: standard deviation
• Population standard deviation (σ): the square root of the sum of squared deviations about the
population mean divided by the number of observations in the population, N.
• The square root of the mean of the squared deviations about the population mean.
• Sample standard deviation (s): the square root of the sum of squared deviations about the
sample mean divided by n – 1, where n is the sample size.
• Why n – 1? We already know that the sum of the deviations about the mean must equal
zero. If the sample mean is known and the first n – 1 observations are known, then the
n-th observation must be the value that causes the sum of the deviations to equal zero.
22
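Written out in standard notation (with μ the population mean and x̄ the sample mean), the definitions above are:

```latex
\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}
\qquad
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
\qquad
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
```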
Empirical Rule
• If the observations are normally distributed (bell curve), the Empirical Rule can be
used to determine the percentage of data that will lie within k-standard deviations of
the mean.
• Approximately 68% of the data lie within 1
standard deviation of the mean.
• Approximately 95% of the data will lie within
2 standard deviations of the mean.
• Approximately 99.7% of the data will lie
within 3 standard deviations of the mean.
• The Empirical Rule can also be applied to a sample,
using the sample mean (x̄) and sample standard deviation (s).
23
Empirical rule – example
• Satisfaction scores (0-160) for 100 customers
• Frequencies plotted as a histogram
• Sample mean is 100 and sample std dev is 16.1
• We can draw a bell-shaped curve as shown below.
• According to the Empirical Rule, determine;
• percentage of customers with scores within 3 standard deviations of the mean
• percentage of customers with scores between 67.8 and 132.2
• percentage of customers with scores above 132.2
The empirical rule is named so because
many distributions of data observed in practice
(empirically) are approximately bell shaped. 24
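A worked sketch of the three questions, using the stated sample mean of 100 and standard deviation of 16.1:

```latex
\begin{aligned}
\text{Within 3 std devs: } & 100 \pm 3(16.1) = [51.7,\ 148.3] && \Rightarrow \approx 99.7\% \text{ of customers} \\
\text{Between 67.8 and 132.2: } & 100 \pm 2(16.1) && \Rightarrow \approx 95\% \\
\text{Above 132.2: } & \tfrac{1}{2}\,(100\% - 95\%) && \Rightarrow \approx 2.5\%
\end{aligned}
```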
Measures of position
• Measures of central tendency describe the typical value of the dataset
• Measures of dispersion describe the amount of spread of the dataset
• Measures of position describe the relative position of a specific value within the dataset.
• Z-score represents the distance between an observation and the mean in terms of the
number of standard deviations. Subtract the mean from the value and divide the result
by the standard deviation. There is both a population z-score and a sample z-score.
• If the value is larger than the mean, the z-score is positive.
• If the value is smaller than the mean, the z-score is negative.
• If the value equals the mean, the z-score is zero.
• A z-score of 1.24: the value is 1.24 standard deviations above the mean.
• A z-score of -2.31: the value is 2.31 standard deviations below the mean. 25
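In standard notation the z-score formulas are:

```latex
z = \frac{x - \mu}{\sigma} \quad \text{(population)}
\qquad
z = \frac{x - \bar{x}}{s} \quad \text{(sample)}
```

For example, a satisfaction score of 120 against the earlier sample mean of 100 and s = 16.1 gives z = (120 − 100)/16.1 ≈ 1.24, i.e. 1.24 standard deviations above the mean.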
Measures of position: percentiles and quartiles
• Recall that the median divides the lower 50% of a set of data from the upper 50%.
• Median is a special case of a general concept called the percentile.
• The k-th percentile, denoted Pk, of a set of data is a value such that k percent of the
observations are less than or equal to that value.
• Percentiles divide a set of data that is written in ascending order into 100 parts; thus 99
percentiles can be determined.
• An organization’s customer satisfaction score of 116 is at the 74th percentile.
• A percentile rank of 74% means that 74% of scores are less than or equal to 116 and
26% of the scores are greater. So 26% of other organisations scored better.
26
Measures of position: percentiles and quartiles
• The most common percentiles are quartiles.
• Quartiles divide data sets into fourths, or four equal parts
• First quartile = 25th percentile, second quartile = 50th percentile, third = 75th percentile
• The interquartile range, IQR, is the range of the middle 50% of the observations. It is the
difference between the third and first quartiles.
• Which measures are more reliable? (The median and IQR are more resistant to skewness.)
27
The five-number summary
• Recall how we aim to ‘summarise’ with descriptive analytics
• Mean and standard deviation are appropriate for normal distributions (bell curved).
• The five-number summary is necessary when the distribution is skewed.
• The five-number summary of a set of data consists of the smallest data value, Q1, the
median, Q3, and the largest data value.
• The five-number summary is useful to draw boxplots.
• Example: minimum 25, Q1 30, median 35, Q3 40, maximum 50
28
Outliers
• ‘An outlier is an observation which deviates so much from the other observations as to
arouse suspicions that it was generated by a different mechanism’ – Hawkins.
• We should always check for outliers in any dataset/observations.
• Outliers could be genuine or due to errors (errors in the measurement of a variable, during
data entry, from errors in sampling)
• Checking for outliers using quartiles:
• Determine the first and third quartiles of the data.
• Compute the interquartile range (IQR).
• Determine fences: Fences serve as cutoff points for determining outliers.
• Lower fence = Q1 – 1.5(IQR)
• Upper fence = Q3 + 1.5(IQR)
• If a data value is less than the lower fence or greater than the upper fence, it is
considered an outlier.
29
Outliers and boxplots
• Example: profit per unit for a set of products in an online store.
• 19.95, 23.25, 23.32, 25.55, 25.83, 26.28, 42.47, 28.58, 28.72, 30.18, 30.35, 30.95, 32.13,
49.17, 33.23, 33.53, 36.68, 37.05, 37.43, 41.42, 54.63
• In ascending order: 19.95, 23.25, 23.32, 25.55, 25.83, 26.28, 28.58, 28.72, 30.18, 30.35,
30.95, 32.13, 33.23, 33.53, 36.68, 37.05, 37.43, 41.42, 42.47, 49.17, 54.63
• smallest:19.95, Q1: 26.06, M: 30.95, Q3: 37.24, largest: 54.63
• The five-number summary is 19.95 26.06 30.95 37.24 54.63
• Lower fence = Q1 – 1.5(IQR) = 26.06 – 1.5(11.18) = 9.29
• Upper fence = Q3 + 1.5(IQR) = 37.24 + 1.5(11.18) = 54.01
• Do we have an outlier?
30
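A small Python sketch of the fence check above; it reuses the quartiles already computed on the slide (Q1 = 26.06, Q3 = 37.24) rather than re-deriving them, since software packages use slightly different quartile conventions.

```python
profits = [19.95, 23.25, 23.32, 25.55, 25.83, 26.28, 28.58, 28.72, 30.18, 30.35,
           30.95, 32.13, 33.23, 33.53, 36.68, 37.05, 37.43, 41.42, 42.47, 49.17, 54.63]

q1, q3 = 26.06, 37.24                 # quartiles as given on the slide
iqr = q3 - q1                         # 11.18
lower_fence = q1 - 1.5 * iqr          # 9.29
upper_fence = q3 + 1.5 * iqr          # 54.01

outliers = [x for x in profits if x < lower_fence or x > upper_fence]
print(f"IQR = {iqr:.2f}, fences = ({lower_fence:.2f}, {upper_fence:.2f}), outliers = {outliers}")
# -> outliers = [54.63]: yes, one value lies above the upper fence
```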
Outliers and boxplots
• Start with a horizontal number line and mark Q1, M and Q3.
• The box extends from Q1 to Q3 (the median M sits inside the box).
• Draw temporary brackets at the lower and upper fences.
• Whiskers: a horizontal line from Q1 to 19.95, the smallest data value larger than the
lower fence (9.29), and from Q3 to 49.17, the largest data value smaller than the
upper fence (54.01).
• 54.63 (> 54.01) is an outlier – denoted using an asterisk (*).
31
Boxplots and shape of distribution
• Skewed right: median is left of center in the box, the right whisker is longer than the left and
the distance from the median to the minimum value in the data set is less than to the max.
32
Applicability (so what?)
• Given any dataset (observations), it is pragmatic to start with the descriptive techniques.
• Numerical methods and graphical methods will expose the underlying nature of the dataset
• Best practices
• Identify the type of data and applicable techniques.
• Start with the measures and work towards frequency distribution and/or boxplots.
• All measures (central tendency and dispersion) should be compared with common
knowledge/past experiences with same (or similar) variables.
• An extreme skew or excessive outliers imply a strong bias; in most cases a new sample
should be collected.
• If outcomes are inconclusive or do not reflect common knowledge, consider revising the
sample or augmenting the current dataset with external data.
• Input datasets are not limited to observations; they could also be intermediate or final
analytics outcomes (e.g. predictive analytics outcomes such as decision trees,
association rules, clusters, etc. – ‘finding the dispersion of association rules’)
33
Applicability
• A normal distribution of expectations
(figure: a normal curve segmented into “What Was I Thinking?” consumers, average consumers and target consumers)
• Low incidence but high impact
• Focus resources on effectiveness
34
Now how about bivariate analysis?
• Recall we spoke of univariate vs bivariate variables.
• All techniques thus far focused on univariate analysis – a single variable was measured for
each observation/individual.
• Bivariate analysis explores two variables on each observation/individual.
• www.gapminder.org.
• Mission of Gapminder Foundation: to fight devastating ignorance with a fact-based worldview that
everyone can understand.
• As before, the type of variables determines the technique to be used:
1. relationship between two quantitative variables (scatterplot, correlation coefficient,
least-squares (linear) regression, coefficient of determination)
2. relationship between two qualitative variables (contingency tables, bar graphs)
3. relationship between a quantitative variable and a qualitative variable (logistic regression)
35
Bivariate analysis
• In order to represent bivariate data, we need to decide which variable will explain (or
predict) the value (or responses) of the other variable.
• The response variable is the variable whose values can be explained by the values of the
explanatory or predictor variable.
• Explanatory variable is the input and response variable the output.
• Does an increase in customer satisfaction explain or predict an increase in sales revenue?
• Does vice versa make sense?
• We can regard either or both variables as response variables (reduced cost vs increased
sales). Analysts decide this based on requirements.
• So what’s really the purpose?
• To investigate if there is an association and to describe the nature of that association.
• An association exists between two variables if a particular value for one variable is
more/less likely to occur with certain values of the other variable.
36
Bivariate: two quantitative variables
• First technique: the scatterplot
• A graph that shows the relationship between two quantitative variables measured on the
same observation.
• Each observation is represented by a point in the scatter diagram (points are not connected).
• The explanatory variable is plotted on the x-axis (horizontal) and the response variable on the
y-axis (vertical).
• Once plotted, we try to interpret the scatter diagram.
• The goal in interpreting is to distinguish scatter diagrams that imply a linear relation, a
nonlinear relation, or no relation.
37
Scatterplot
• In panel (a) the data follow a linear pattern that slants upward to the right; in (b), downward to
the right; (c) and (d) show nonlinear relations; in (e) there is no relation.
• Two quantitative variables (x,y) have a positive association when high values of x tend to
occur with high values of y, and when low values of x tend to occur with low values of y.
(As x goes up, y tends to go up & vice versa)
• Two quantitative variables have a negative association when high values of one variable
tend to pair with low values of the other variable, and low values of one pair with high
values of the other. (As x goes up, y tends to go down & vice versa).
38
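A minimal matplotlib sketch of a scatterplot with a positive association; the (x, y) pairs are invented for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical explanatory (x) and response (y) observations with a positive association
x = [10, 15, 22, 28, 35, 41, 50, 58, 63, 70]
y = [4, 8, 10, 14, 16, 22, 27, 33, 35, 41]

plt.scatter(x, y)   # one point per observation; points are not connected
plt.xlabel("Explanatory variable (x)")
plt.ylabel("Response variable (y)")
plt.title("Scatterplot: positive association")
plt.show()
```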
Scatterplot: example
Investigate the relationship between internet penetration and social media penetration in 33
countries
Country            Internet penetration    Social media penetration
Peru 26.20% 13.34%
Philippines 21.50% 19.68%
Poland 52.00% 11.79%
Russia 27.00% 2.99%
Saudi Arabia 22.70% 11.65%
South Africa 10.50% 7.83%
Spain 66.80% 30.24%
Sweden 80.70% 44.72%
Taiwan 66.10% 38.21%
Thailand 20.50% 10.29%
Turkey 35.00% 31.91%
USA 77.33% 46.98%
UK 70.18% 45.97%
Venezuela 25.50% 28.64%
………….. ………….. ……………
Calculating numerical summaries:
              N    Mean    Std dev   Min   Q1   M    Q3     Max
Internet      33   47      24.4      7     24   49   68.5   83
Social media  33   24.73   16.49     0     11   26   38     52
Frequency distributions (internet use, social media use):
39
Scatterplot: example
• The histograms portray each variable
separately, how about a scatterplot?
• What is the explanatory variable and what
is the response variable?
• There is a clear trend.
• Countries with larger percentages of Internet use generally have larger
percentages of Social Media use.
• For countries with relatively low Internet use (below 20%), there is little
variability in Social Media use. Social Media use ranges from about
2% to 13% for each such country.
• For countries with high Internet use (above 20%), there is high
variability in Social Media use. Social Media use ranges from about
2% to 52% for these countries.
• The point for Japan seems unusual. Its Internet use is among the
highest of all countries (74%), while its Social Media use is among the
lowest (2%).
• Based on values for other countries with similarly high Internet use, we
might expect Social Media use to be between 25% and 50% rather than
2%.
• Although not as unusual as Japan, Social Media use for the
Netherlands (21%) is a little lower than we’d expect for a country with
such high Internet use (83%).
• Which is the data point for the Netherlands?
40
Correlation coefficient
• The scale of graphs can be manipulated. So, numerical summaries of bivariate data should
be used in addition to graphs to determine any relation that exists between two variables.
• Types: Pearson product-moment correlation coefficient, intraclass correlation, rank
correlation, goodness of fit
• We will focus just on Pearson product-moment correlation coefficient or simply linear
correlation coefficient
• It is a measure of the strength and direction of the linear relationship (association) between
two quantitative variables. It takes values between -1 and +1, inclusive.
The Greek letter ρ (rho) represents the population correlation
coefficient, and r represents the sample correlation coefficient. 41
Correlation coefficient
• The linear correlation coefficient is always between -1 and 1, inclusive.
• If r = +1, then a perfect positive linear relation exists between the two variables.
• If r = -1, then a perfect negative linear relation exists between the two variables.
• The closer r is to +1, stronger evidence of positive association between the two
variables.
• The closer r is to -1, stronger evidence of negative association between the two
variables.
• If r is close to 0, then little or no evidence exists of a linear relation between the two
variables. So r close to 0 does not imply no relation, just no linear relation.
• The linear correlation coefficient is a unit-less measure of association. So the unit of
measure for x and y plays no role in the interpretation of r.
• Two variables have the same correlation regardless of which variable is treated as the
response and which as the explanatory variable.
42
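As a rough illustration (not from the slides), the sample linear correlation coefficient r can be computed in Python; the data values below are illustrative only.
```python
# Pearson (linear) correlation coefficient r for two quantitative variables.
import numpy as np

x = np.array([26.2, 21.5, 52.0, 27.0, 66.8, 80.7, 66.1, 77.3, 70.2])  # illustrative values
y = np.array([13.3, 19.7, 11.8,  3.0, 30.2, 44.7, 38.2, 47.0, 46.0])

r = np.corrcoef(x, y)[0, 1]     # symmetric: corrcoef(y, x) gives the same value
print(round(r, 3))              # r is unit-less and always between -1 and +1
```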
Correlation coefficient
43
Least-squares regression
• If the scatterplot and correlation coefficient establish a linear relation, we can then find a
linear equation that describes this relation.
• One approach is to select two points from the data that appear to provide a good fit and to
find the equation of the line through these points.
• But this can be ambiguous – is there a line that fits the data better? Is there a line that fits
the data best?
• The least-squares regression line minimizes the sum of the squared errors (or residuals).
This line minimizes the sum of the squared vertical distances between the observed values
of y and those predicted by the line, ŷ (read “y-hat”).
44
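A minimal sketch (not from the slides) of fitting a least-squares line in Python; the x and y values are made up for illustration.
```python
# Fitting the least-squares regression line y-hat = b0 + b1*x.
# np.polyfit minimises the sum of squared vertical residuals, as described above.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])          # explanatory variable (illustrative)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])          # response variable (illustrative)

b1, b0 = np.polyfit(x, y, deg=1)                 # slope and intercept of the best-fit line
y_hat = b0 + b1 * x                              # predicted values
residuals = y - y_hat                            # observed minus predicted
print(f"y-hat = {b0:.3f} + {b1:.3f} x, SSE = {np.sum(residuals**2):.4f}")
```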
Least-squares regression
• Interpreting slope
• Interpreting slope for least-squares regression lines has a minor twist in comparison to
algebra.
• Least-squares regression equations are probabilistic.
• So two interpretations of slope are acceptable:
• If x increases by p, then y increases by q, on average.
• If x increases by p, the expected increase of y is q.
• Interpreting the y-intercept
• The y-intercept of any line is the point where the graph intersects the vertical axis.
• Is 0 a reasonable value for the explanatory variable?
• Do any observations near x = 0 exist in the data set? 45
Other features of the least-squares regression line
• The coefficient of determination, R² = r², measures the proportion of total variation in the
response variable that is explained by the least-squares regression line.
• Squaring the linear correlation coefficient to obtain the coefficient of determination works
only for the least-squares linear regression model. The method does not work in general.
• It is a measure of how well the least-squares regression line describes the relation
between the explanatory and response variables.
• The closer R² is to 1, the better the line describes how changes in the explanatory variable
affect the value of the response variable.
• A residual analysis can be used to determine adequacy of the linear model.
• Determine whether a linear model is appropriate to describe the relation between the
explanatory and response variables
• Determine whether the variance of the residuals is constant
• Check for outliers
46
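A short sketch (not from the slides, data made up) showing that R² from the least-squares line equals r² and how residuals are obtained for a residual analysis.
```python
# Coefficient of determination R^2 = r^2 for a least-squares line, plus residuals.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])     # illustrative data only
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9, 12.1])

r = np.corrcoef(x, y)[0, 1]
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x
residuals = y - y_hat                            # used to check the adequacy of the model

ss_res = np.sum(residuals**2)                    # unexplained variation
ss_tot = np.sum((y - y.mean())**2)               # total variation in the response
r_squared = 1 - ss_res / ss_tot
print(round(r**2, 4), round(r_squared, 4))       # the two values agree for this model
```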
Other features of the least-squares regression line
• An influential observation is an observation that significantly affects the least-squares
regression line’s slope and/or y-intercept, or the value of the correlation coefficient.
• How do we identify influential observations? We first remove the point that is believed to
be influential from the data set, and then re-compute the correlation or regression line.
• If the correlation coefficient, slope, or y-intercept changes significantly, the removed point
is influential.
• Influential observations typically exist when the point is an outlier relative to the values of
the explanatory variable.
• As with outliers, influential observations should be removed only if there is justification to
do so.
47
Correlation does not imply causation!
• Simply observing an association (or correlation) between two variables is not enough to
imply a causal connection.
• Whenever two variables are associated, other variables may have influenced that
association.
• A lurking variable is an unobserved variable that influences the association between the
two variables of primary interest.
• Confounding is when two explanatory variables are both associated with a response
variable but are also associated with each other.
• A lurking variable is not measured in the study. It has the potential for confounding.
• If it were included in the study and if it were associated both with the response variable
and the explanatory variable, it would become a confounding variable.
48
Correlation does not imply causation
• The correlation between teenage birthrate and homicide rate since 1993 is 0.9987
• Lurking variable: time
• As air-conditioning bills increase, so does the crime rate
• Lurking variable: temperature
• A positive correlation between the number of drownings and ice cream sales
• Lurking variable: beachgoers
• Over a 20-year study period, smokers had a greater survival rate than nonsmokers.
• Confounding variable: age.
• Older subjects were less likely to be smokers, and older subjects were more likely to die.
• Within each age group, smokers had a lower survival rate than nonsmokers. Age had a
dramatic influence on the association between smoking and survival status.
49
Contingency tables
• Now let’s look at examining relationships between two qualitative variables.
• A contingency table (or two-way table) is a display for two categorical variables. Its rows
list the categories of one variable and its columns list the categories of the other variable.
Each entry in the table is the number of observations in the sample at a particular
combination of categories of the two categorical variables.
• Example: does level of education explain employment status?
• Explanatory: level of education, response: employment status
50
Contingency tables
• First step is to determine the distribution of each variable separately
• For this we create marginal distributions, which is a relative frequency distribution of either
the row or column variable in the contingency table.
• A marginal distribution removes the effect of either the row variable or the column variable
in the contingency table.
• To create a marginal distribution for a variable, we calculate the row and column totals and
use these to calculate the relative frequency marginal distribution.
• The row totals represent the distribution of the row variable. The column totals represent
the distribution of the column variable.
51
Contingency tables: marginal distribution
• The relative frequency marginal distribution for the row variable, employment status, is
found by dividing the row total for each employment status by the table total.
• If level of education does not play any role, we would expect the relative frequencies for
employment status at each level of education to be close to the relative frequency marginal
distribution for employment status given in blue.
• So we would expect 56.0% who did not finish high school, 56.0% who finished high school,
56.0% with some college, and 56.0% with at least a bachelor’s degree to be employed.
52
Contingency tables: conditional distribution
• Now we calculate conditional distribution, which is the relative frequency of each category of the
response variable, given a specific value of the explanatory variable in the contingency table.
Observations:
• As the amount of schooling (the explanatory variable) increases, the proportion employed within
each category also increases.
• As the amount of schooling increases, the proportion not in the labour force decreases.
• The proportion unemployed with a bachelor’s degree is much lower than those unemployed in
the other three levels of education.
53
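A minimal pandas sketch (not from the slides; the counts are invented) of how marginal and conditional distributions can be computed from a contingency table like the education/employment example.
```python
# Marginal and conditional distributions from a contingency table.
# Rows = employment status (response), columns = level of education (explanatory).
import pandas as pd

table = pd.DataFrame(
    {"No HS": [40, 10, 50], "HS": [140, 15, 95], "Some college": [160, 10, 60],
     "Bachelor+": [220, 5, 45]},
    index=["Employed", "Unemployed", "Not in labour force"])

marginal_rows = table.sum(axis=1) / table.values.sum()   # distribution of employment status
conditional = table / table.sum(axis=0)                  # P(status | education), column-wise
print(marginal_rows.round(3))
print(conditional.round(3))
```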
Contingency table: bar graph
• We can draw a bar graph of the conditional distributions.
• Label the values of the response variable (employee status) on the horizontal axis.
• Use different colored bars for each value of the explanatory variable (level of school).
• In this case, draw four bars, side by side, for each level of education.
• Let the horizontal axis represent employment status and the vertical axis represent the
relative frequency.
54
Simpson’s paradox
• Just as we spoke of lurking variables causing two unrelated quantitative variables to be
correlated, the same exists when exploring the relation between two qualitative variables.
• Simpson’s Paradox, describes a situation in which an association between two qualitative
variables inverts or goes away when a third variable is introduced to the analysis.
• A famous example: admission status and gender of students who applied to the University
of California, Berkeley in 1973.
• Total accepted: 0.395, men: 0.46 and women: 0.304
55
Simpson’s paradox
• Admission status (Accepted A or Not Accepted NA), for six programs of study (A, B, C, D,
E, F) by gender
• Conditional distribution
56
Simpson’s paradox
• Four of the six programs actually had a higher proportion of women accepted.
• The initial analysis did not account for the lurking variable, program of study.
• There were many more male applicants in programs A and B than female applicants, and
these two programs happen to have higher acceptance rates.
• The higher acceptance rates in these programs led to the false conclusion of a gender
bias.
57
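A toy numeric illustration (made-up counts, not the real Berkeley figures) of how Simpson's paradox can arise: within each program women have the higher acceptance rate, yet the aggregated rate favours men.
```python
# Simpson's paradox with invented counts: the lurking variable is program of study.
applied_m  = {"A": 800, "B": 200}      # men apply mostly to the easy program A
applied_f  = {"A": 100, "B": 900}      # women apply mostly to the hard program B
accepted_m = {"A": 480, "B": 20}       # program A accepts ~60%, program B ~10%
accepted_f = {"A": 65,  "B": 99}

for prog in ("A", "B"):
    print(prog, accepted_m[prog] / applied_m[prog], accepted_f[prog] / applied_f[prog])

overall_m = sum(accepted_m.values()) / sum(applied_m.values())   # 500/1000 = 0.50
overall_f = sum(accepted_f.values()) / sum(applied_f.values())   # 164/1000 = 0.164
print(overall_m, overall_f)   # aggregation reverses the within-program comparison
```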
To summarise (the summaries)
“Methods for organizing, summarising, presenting data using graphical and numerical
techniques in order to draw conclusions about business activities”
• Defining descriptive analytics
• Descriptive – predictive – prescriptive
• The nature of variables in analytics
• Quantitative vs qualitative, discrete vs continuous, univariate vs bivariate
• Levels of measurement – nominal, ordinal, interval, ratio
• Univariate analytics techniques (central tendency, dispersion, position)
• Bivariate analytics techniques (scatterplots, correlation, linear regression, contingency)
Contemplate:
How do we draw conclusions about a population?
How will you explore associations between multiple variables (multivariate)?
58
References
• Data Science for Business, Foster Provost and Tom Fawcett, 1st ed.
• Business statistics: Australia / New Zealand, Eliyathamby A. Selvanathan, Saroja
Selvanathan, Gerald Keller, 5th ed.
• Statistics: The Art and Science of Learning from Data, Alan Agresti and Christine Franklin,
3rd ed.
• Statistics Informed Decisions Using Data, Michael Sullivan, III, 4th ed.
59
BUS5PB – Principles of Business Analytics
Topic 4 – Descriptive Analytics 2
Learning Objectives
Probability and probability distributions
Sampling techniques
Estimation
Hypothesis testing
1
Defining descriptive analytics
• What occurred and why it occurred?
• It is always a historical view of the organisation and its business activities.
• Textbook definitions:
• “Consists of organizing, summarising and presenting data in an informative manner ”
• “techniques for summarizing data (sample or a population) using graphs and numbers”
• There’s also inferential statistics, which draws conclusions about populations based on
samples.
• Textbook definitions:
• “Understand the summaries and draw conclusions or estimates about the
population”
• “Techniques that use samples to make generalizations about populations”
2
Why probability?
• Inferential statistics use methods that generalize results obtained from a sample to the
population and measures their reliability.
• The methods used to generalize results from a sample to a population are based on
probability and probability models.
• Probability is a measure of the likelihood of an event occurring.
• Think of the probability of an outcome as the likelihood of observing that outcome.
• If an event has a high likelihood of occurring, it has a high probability (close to 1). If low
likelihood then a low probability (close to 0).
• In order to get to inferences we have to start by focusing on methods for determining
probabilities.
3
Defining a probability model
• The sample space, S, of a probability experiment is the collection of all possible outcomes.
• An event is any collection of outcomes from a probability experiment. An event consists of
one or more outcomes.
• The probability of any event E, P(E), must be greater than or equal to 0 and less than or
equal to 1.
• The sum of the probabilities of all outcomes must equal 1.
• If an event is impossible, the probability of the event is 0. If an event is a certainty, the
probability of the event is 1.
• An unusual event is an event that has a low probability of occurring. Typically less than
0.05 (or 5%).
• A probability model lists the possible outcomes of a probability experiment and each
outcome’s probability.
• Three methods for determining the probability of an event:
• empirical method, classical method, subjective method.
4
Three methods
• Empirical method
• Relies on evidence based on the outcomes of a probability experiment.
• Is always an approximation because different runs of the probability experiment lead to
different outcomes.
• The probability of an event E is approximately the number of times event E is observed divided by
the number of repetitions of the experiment.
• Example: conducting a survey to find the most popular product
• Classical method
• Relies on counting techniques.
• Assumes all outcomes are equally likely and their probabilities are known in advance.
• Difficult to satisfy this condition, therefore less real-world applicability.
• Example: rolling a fair die or brand of milk purchased.
5
Three methods
• Subjective probability
• Probability of an outcome obtained on the basis of personal judgment.
• Legitimate and are often the only method of assigning likelihood when an
experiment cannot be conducted.
• Example: asking an economist the likelihood of a recession next year.
• Comparing the classical and empirical methods
• A survey of 500 consumers (families with three children) asked to state the gender of
their children and found that 180 of the families had two boys and one girl.
• Estimate the probability of having two boys and one girl using the empirical method.
• Compute and interpret probability of the same using the classical method.
• probability of the event E = “two boys and one girl”
6
Comparing the classical and empirical methods
• Empirical: P(E) ≈ relative frequency of E = 180/500 = 0.36 = 36%
• Classical:
• To determine the sample space, a tree diagram is useful to list equally likely outcomes
of the experiment.
• Sample space S = {BBB, BBG, BGB, BGG, GBB, GBG, GGB, GGG}, N(S) = 8.
• Event E = “two boys and a girl” = {BBG, BGB, GBB}, N(E) = 3
• P(E) = 3/8 = 0.375 = 37.5%
• Notice that the two probabilities are slightly different.
• But as the number of repetitions of a probability experiment increases, the empirical
probability should get closer to the classical probability.
7
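A short simulation sketch (not from the slides) showing the empirical probability of "two boys and one girl" approaching the classical value of 3/8 = 0.375 as the number of repetitions grows; it assumes boys and girls are equally likely.
```python
# Empirical vs classical probability: simulate families of three children.
import random

random.seed(1)
for n in (500, 5_000, 500_000):
    hits = sum(
        1 for _ in range(n)
        if [random.choice("BG") for _ in range(3)].count("B") == 2)
    print(n, hits / n)          # empirical estimates drift towards 0.375
```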
Tree diagram for classical method
8
Rules for computing probabilities
• Addition rule
• Two events are disjoint (mutually exclusive) if they have no outcomes in common.
• Addition rule for disjoint events E, F: P(E or F) = P(E) + P(F)
• General addition rule (non-disjoint and disjoint): P(E or F) = P(E) + P(F) – P(E and F)
• Complement rule
• The complement of event E, denoted Ec, is all outcomes in the sample space S that are not
outcomes of the event E.
• P(Ec) = 1 – P(E)
9
Rules for computing probabilities
• Independent events
• Two events E and F are independent if the occurrence of event E in a probability
experiment does not affect the probability of event F.
• Two events are dependent if the occurrence of event E in a probability experiment
affects the probability of event F.
• Disjoint events and independent events are different concepts
• Disjoint events are not independent, and independent events cannot be disjoint.
• Multiplication rule
• For independent events: P(E and F) = P(E) * P(F)
• General rule (both independent and dependent): P(E and F) = P(E) * P(F | E)
• The probability of E and F is the probability of event E occurring times the
probability of event F occurring, given the occurrence of event E.
10
Conditional probability
• We cannot always assume that two events will be independent.
• Conditional probability
• The notation P (F | E) “the probability of event F given event E”
• Probability that event F occurs, given event E has occurred.
• Conditional probability rule: P(F | E) = P(E and F) / P(E) = N(E and F) / N(E)
• The probability of event F occurring, given the occurrence of event E, is found by dividing
the probability of E and F by the probability of E, or by dividing the number of outcomes in
E and F by the number of outcomes in E.
11
Counting techniques
• Used in the classical method of determining the probability of an event.
• The number of outcomes in a sample space can be too large to list and count.
• If there are many stages to an experiment and several possibilities at each stage, the tree
diagram approach would become unmanageable.
• So algebraic counting techniques are required.
• Example: Victorian vehicle number plates, how many combinations with 3 letters followed
by 3 digits?
• Letter 1 > Letter 2 > Letter 3 > Number 1 > Number 2 > Number 3
• Every step can be done in a number of ways that does not depend on previous choices
• And we have 26 · 26 · 26 · 10 · 10 · 10 = 17,576,000 possibilities.
• More complex possibilities – no repetitions, permutations, combinations, non-distinct.
• Don’t worry each has a formula!
12
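A small sketch (not from the slides) of the multiplication rule for the number-plate example, plus Python's standard permutation and combination helpers.
```python
# Counting techniques: multiplication rule, permutations and combinations.
from math import comb, perm

plates = 26 ** 3 * 10 ** 3        # 3 letters then 3 digits: 17,576,000 possibilities
print(plates)

print(perm(5, 3))   # ordered arrangements of 3 items chosen from 5 (= 60)
print(comb(5, 3))   # unordered selections of 3 items from 5 (= 10)
```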
Summary of counting techniques
13
Summary flowchart for probability rules
14
Summary flowchart for counting techniques
15
Probability thus far..
• What we know,
• Empirical probability improves in accuracy the more times the experiment is conducted.
• Classical probability uses counting techniques to obtain theoretical probabilities when all
outcomes are equally likely.
• A probability model lists the possible outcomes of a probability experiment and each
outcome’s probability.
• A probability model must satisfy the rules of probability, 0≤P(E)≤1 and sum of P(E) = 1.
• Now let’s look at probability models for random variables followed by probability
distributions.
16
Random variables
• A random variable is a numerical measure of the outcome of a probability experiment, so
its value is determined by chance.
• Random variables are typically denoted using capital letters such as X.
• Example:
• Flip a coin two times. The outcomes of the experiment are {HH, HT, TH, TT}.
• Instead of a particular outcome E, we might be interested in the number of heads.
• Let’s say random variable X represents the number of heads, so the possible values of
X are x = 0, 1, or 2.
• Instead of P(E) we have P(x)
• Uppercase X identifies the random variable while lowercase lists possible values (the
sample space).
17
Random variables: discrete vs continuous
• A discrete random variable has either a finite or countable number of values.
• The values of a discrete random variable can be plotted on a number line
with a space between each point.
• Example: number of cars that enter the tollway is a discrete random variable
because its value results from counting. If the random variable X represents
the number of cars, the possible values of X are x = 0, 1, 2, …..
• A continuous random variable has infinitely many values.
• The values of a continuous random variable can be plotted on a line,
uninterrupted.
• Example: speed of the next car entering the tollway is a continuous random
variable because speed is measured. If the random variable S represents
the speed, the possible values of S are all positive real numbers; s > 0.
18
Probability distributions
• Because the value of a random variable is determined by chance, we may assign
probabilities to the possible values of the random variable.
• And this leads to probability distributions.
• Probability distribution of a discrete random variable X provides the possible values of the
random variable and their corresponding probabilities. A probability distribution can be in
the form of a table, graph, or mathematical formula.
• Two rules for discrete probability distribution
• Sum of P(x) = 1, i.e. ∑P(x) = 1
• 0 ≤ P(x) ≤ 1
• A discrete probability distribution is typically presented graphically with a probability
histogram.
• The horizontal axis corresponds to the value of the random variable and the vertical axis
represents the probability of each value of the random variable.
• Recall frequency distributions? 19
Probability histogram
• Probability histograms are similar to frequency histograms, the
vertical axis represents the probability of the random variable,
instead of its frequency.
• Center each rectangle at the value of the discrete random variable.
• The area of each rectangle in the probability histogram equals the
probability that the random variable assumes the particular value.
• For example, the area of the rectangle corresponding to the value
x = 2 is 1 * (0.38) = 0.38, where 1 represents the width and 0.38
represents the height of the rectangle.
• Probability histograms help determine the shape of the distribution.
• Recall distributions: skewed left, skewed right, or symmetric.
• The probability histogram shown here is skewed left.
20
Mean of a discrete random variable
• Computing the mean
• Example: number of people living in households (2, 4, 6, 6, 4, 4, 2, 3, 5, 5)
• What’s the mean number of people in these 10 households?
• We could add the observations and divide by 10
• Instead multiply the value of each random variable by its probability and sum
them up.
• μX = ∑[x · P(x)]; x is the value of the random variable and P(x) is the probability of
observing the value.
• Interpreting the mean
• The mean of a discrete random variable can be thought of as the mean
outcome of the probability experiment if we repeated the experiment many
times.
• Because the mean of a random variable represents what we expect to
happen in the long run, it is also called the expected value, E(X ). The
interpretation of expected value is the same as the interpretation of the mean
of a discrete random variable.
21
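A minimal sketch (not from the slides) computing the mean of a discrete random variable from the 10 households listed above; the probability of each value is its relative frequency.
```python
# Mean (expected value) of a discrete random variable: mu_X = sum of x * P(x).
from collections import Counter

data = [2, 4, 6, 6, 4, 4, 2, 3, 5, 5]                  # household sizes from the slide
counts = Counter(data)
dist = {x: c / len(data) for x, c in counts.items()}   # P(x) for each value x

mu = sum(x * p for x, p in dist.items())
print(dist, mu)          # mu equals the ordinary mean of the raw data, 4.1
```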
Standard deviation of a discrete random variable
• The standard deviation of a random variable describes the spread of the distribution.
• Computing the standard deviation
• σX = √( ∑ (x – μX)² · P(x) ), where x is the value of the random variable, μX is the mean of
the random variable, and P(x) is the probability of observing a value of the random variable.
22
Binomial probability distribution
• Binomial probability distribution is a discrete probability distribution that describes probabilities for
experiments in which there are two mutually exclusive (disjoint) outcomes.
• These two outcomes are generally referred to as success and failure.
• Criteria for a binomial probability experiment:
• The experiment is performed a fixed number of times. Each repetition of the experiment is
called a trial.
• The trials are independent. (outcome of one trial will not affect the other).
• For each trial, there are two mutually exclusive (disjoint) outcomes: success or failure.
• The probability of success is the same for each trial of the experiment.
• Notation used:
• There are n independent trials of the experiment.
• Let p denote the probability of success for each trial so that 1 – p is the probability of failure.
• Let X denote the number of successes in n independent trials of the experiment (0 ≤ x ≤ n).
23
Binomial probability distribution
• Binomial probability distribution function: P(x) = nCx * p^x * (1 – p)^(n–x)
• nCx, the number of ways of obtaining x successes in n trials
• p^x, the probability of success raised to the number of successes x
• (1 – p)^(n–x), the probability of failure raised to the number of failures n – x
• Mean (or expected value) and standard deviation of a binomial random variable:
μX = n*p and σX = √(n*p*(1 – p))
24
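A small sketch (not from the slides) evaluating the binomial formula directly; n, p and x are chosen only for illustration.
```python
# Binomial probability P(X = x) = nCx * p**x * (1-p)**(n-x), with mean np
# and standard deviation sqrt(np(1-p)).
from math import comb, sqrt

n, p = 10, 0.2

def binom_pmf(x):
    return comb(n, x) * p**x * (1 - p)**(n - x)

print(round(binom_pmf(3), 4))                    # P(exactly 3 successes in 10 trials)
print(n * p, round(sqrt(n * p * (1 - p)), 4))    # mean and standard deviation
```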
Binomial probability distribution
• Constructing binomial probability histograms
• Three cases: n = 10 and p = 0.2; n = 10 and p = 0.5; n = 10 and p = 0.8.
• The binomial probability distribution is: skewed right if p < 0.5, symmetric and
approximately bell shaped if p = 0.5, and skewed left if p > 0.5.
25
From discrete to continuous…
• To compute probabilities for discrete random variables, we usually substitute the value of
the random variable x into a formula.
• This does not apply very well to continuous random variables, since an infinite number of
outcomes are possible for continuous random variables.
• The probability of observing one particular value is zero.
• For example, the probability of the speed of a car entering the tollway being exactly 85.12
km/h is zero.
• This is because classical probability is found by dividing the number of ways an event can
occur by the total number of possibilities: there is only one way to observe exactly 85.12,
and there are an infinite number of possible values between 1 and 100.
• This is resolved by computing probabilities of continuous random variables over an interval
of values.
• So instead we compute the probability the speed is between 70 and 90 kmh.
• To find probabilities for continuous random variables, we use probability density functions.
26
Probability density functions
• Probability distribution functions were used for discrete random variables.
• A probability density function (pdf) is used to compute probabilities of continuous random
variables.
• The word density is used because it refers to the number of individuals per unit of area.
• It must satisfy the following two properties:
• The total area under the graph of the equation over all possible values of the random
variable must equal 1.
• The height of the graph of the equation must be greater than or equal to 0 for all
possible values of the random variable.
• Two popular pdfs: uniform and normal.
27
Normal probability distribution
• A continuous random variable is normally distributed, or has a normal probability
distribution, if its relative frequency histogram has the shape of a normal curve.
• The points at x = μ – σ and x = μ + σ are the inflection points on the normal curve, the
points on the curve where the curvature of the graph changes.
• To the left of x = μ – σ and to the right of x = μ + σ, the curve is drawn upwards
(concave up).
• Between x = μ – σ and x = μ + σ, the curve is drawn downwards (concave down).
28
Normal probability distribution
• Note how changes to μ and σ change the position or shape of a normal curve.
29
Normal probability distribution
• Properties of the normal density curve:
1. It is symmetric about its mean, μ.
2. Because mean = median = mode, the curve has a single peak and the highest point
occurs at x = μ.
3. It has inflection points at μ – σ and μ + σ.
4. The area under the curve is 1.
5. The area under the curve to the right of μ equals the area under the curve to the left of
μ, which equals 0.5.
6. As x increases (or decreases) without bound, the graph approaches, but never reaches,
the horizontal axis.
7. The empirical rule.
30
Area under a normal curve
• Suppose that a random variable X is normally distributed with mean μ and standard deviation σ.
• The area under the normal curve for any interval of values of the random variable X represents
either:
• the proportion of the population with the characteristic described by the interval of values
• the probability that a randomly selected individual from the population will have the
characteristic described by the interval of values.
• So how would you find the area under the normal curve?
• z-score: Z = (X – μ) / σ transforms a random variable X with mean μ and standard deviation σ
into a random variable Z with mean 0 and standard deviation 1. The random variable Z is said
to have the standard normal distribution.
• And then find the area to the left of a specified z-score using standard normal distribution tables.
(standard normal curve)
31
Example: finding the area under a normal curve
• A production unit measures the lengths of 300 metal rods. The lengths
are approximately normally distributed, with mean 38.72 cm and
standard deviation 3.17 cm.
• Let’s use the normal curve to determine the proportion of rods that have
a length less than 35 cm.
• Visualise the desired area on a normal curve
• Convert the value of x to a z-score: z = (35 – 38.72) / 3.17 ≈ -1.17
• Look up z = -1.17 in the tables
• The area to the left of z = -1.17 is 0.1210.
• Therefore, the area to the left of x = 35 is 0.1210.
Proportion of rods less than 35 cm in length is 0.1210.
Probability a randomly selected rod is less than 35 cm in length is 0.1210.
32
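A minimal sketch (not from the slides) of the same calculation with scipy; scipy uses the unrounded z-score, so it gives about 0.120 rather than the table value 0.1210 obtained from z rounded to -1.17.
```python
# Area to the left of x = 35 under a normal curve with mean 38.72 and sd 3.17.
from scipy.stats import norm

mu, sigma = 38.72, 3.17
z = (35 - mu) / sigma                 # about -1.17
print(round(z, 2), round(norm.cdf(35, loc=mu, scale=sigma), 4))
```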
Example: finding probability of a normal random variable
• Find the probability that a randomly selected rod is between 35 and 40 cm
long, inclusive.
• Visualise the desired area on a normal curve
• Convert the values x1 = 35 and x2 = 40 to z-scores
• Look up z = -1.17 and z = 0.40 in the tables
• The area to the left of z2 = 0.40 (or x2 = 40) is 0.6554 and the area to the
left of z1 = -1.17 (or x1 = 35) is 0.1210,
• so the area between z1 = -1.17 and z2 = 0.40 is 0.6554 – 0.1210 = 0.5344.
Therefore, P(35 ≤ X ≤ 40) = P(-1.17 ≤ Z ≤ 0.40) = 0.5344
33
Example: finding the value of a normal random variable
• Find the length of a rod at the 20th percentile.
• Visualise the desired area on a normal curve
• We want to find the z-score such that the area to the left of the z-score is 0.20.
• Refer to the tables and find the area closest to 0.20; it is 0.2005.
• This corresponds to z-score = -0.84
• Substitute this into x = μ + z·σ = 38.72 + (-0.84)(3.17) ≈ 36.1 cm.
• Length of a rod at the 20th percentile is 36.1 cm.
34
Sampling distributions
• Sampling distribution is the bridge between probability and statistical inference techniques.
• Statistics are random variables because the value of a statistic varies from sample to
sample.
• Statistics have probability distributions associated with them; probability distribution for the
sample mean, sample proportion etc.
• We can define sampling distribution of a statistic as the probability distribution for all
possible values of the statistic computed from a sample of size n.
• The sampling distribution of the sample mean x̄ is the probability distribution of all possible
values of the random variable x̄ computed from a sample of size n from a population with
mean μ and standard deviation σ.
• The idea behind this is:
1. Obtain a simple random sample of size n.
2. Compute the sample mean.
3. Assuming that we are sampling from a finite population, repeat Steps 1 and 2 until all distinct simple random samples of size
n have been obtained.
35
Sampling distribution of the sample mean
• Regardless of the distribution of the population, the sampling distribution of x̄ will have a
mean equal to the mean of the population and a standard deviation equal to the standard
deviation of the population divided by the square root of the sample size.
• In other words, the mean of the sampling distribution of the sample mean is equal to the
mean of the underlying population, and the standard deviation of the sampling distribution
of the sample mean is σ/√n, regardless of the size of the sample.
• Central limit theorem: regardless of the shape of the underlying population, the sampling
distribution of x̄ becomes approximately normal as the sample size, n, increases.
36
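A quick simulation sketch (not from the slides) of the central limit theorem: sample means drawn from a very skewed (exponential) population behave as described above; the population and sample sizes are chosen only for illustration.
```python
# Central limit theorem by simulation: means of many samples from a skewed population.
import numpy as np

rng = np.random.default_rng(0)
population_mean, n, repetitions = 1.0, 40, 10_000

sample_means = rng.exponential(scale=population_mean, size=(repetitions, n)).mean(axis=1)
print(round(sample_means.mean(), 3))         # close to the population mean, 1.0
print(round(sample_means.std(ddof=1), 3))    # close to sigma/sqrt(n) = 1/sqrt(40) ≈ 0.158
```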
Sampling distribution of the sample proportion
• The sample proportion, p̂, is a statistic that estimates the population proportion, p.
• A random sample of size n is obtained from a population in which each individual either
does or does not have a certain characteristic. The sample proportion, denoted p̂ (read
“p-hat”), is p̂ = x / n, where x is the number of individuals in the sample with the specified
characteristic.
• Properties of the sampling distribution of the sample proportion
• For a simple random sample of size n with a population proportion p,
• The shape of the sampling distribution of p̂ is approximately normal provided
np(1 – p) ≥ 10.
• The mean of the sampling distribution of p̂ is μp̂ = p.
• The standard deviation of the sampling distribution of p̂ is √(p(1 – p)/n).
37
What do we know now?
• Sample mean x̄ is a random variable and has a distribution associated with it
• This distribution is called the sampling distribution of the sample mean.
• The mean of this distribution is equal to the mean of the population, μ.
• The standard deviation of this distribution is σ/√n
• The shape of the distribution of the sample mean is normal if the population is normal
• It is approximately normal if the sample size is large (n ≥ 30)
• Sample proportion p̂ is also a random variable with mean p and standard deviation √(p(1 – p)/n)
• If np(1 – p) ≥ 10, the distribution of p̂ is approximately normal.
38
Statistical inferences
We study two techniques: estimation and
hypothesis testing
Estimation is when we have no idea about the
value of the population parameter being
investigated.
Hypothesis testing is when we have some idea of
the value of the population parameter being
investigated, or if we have some hypothesised
value against which we can compare our sample
results.
“Techniques that use samples to make generalizations about populations”
39
Estimation
The objective of estimation is to determine the value of a population parameter on the basis
of a sample statistic.
• sample mean x̄ is an estimator of the population mean, μ.
• sample variance s² is an estimator of the population variance σ².
• sample proportion p̂ is an estimator of the population proportion p.
• Some examples;
• A bank conducts a survey to estimate the number of times customers will actually use
ATM machines.
• A random sample of processing times used to estimate the mean production time and
the variance of production time on a production line.
• A survey of eligible voters to gauge support for the federal government’s new carbon
emissions reforms
40
Estimation types
• Two types: point estimator and interval estimator
• A point estimator estimates the value of an unknown (population) parameter using a single
value calculated from the sample data.
• An interval estimator draws inferences about a population by estimating the value of an
unknown (population) parameter using a confidence interval.
41
Point vs Interval estimators
• A confidence interval for an unknown parameter consists of an interval of numbers based
on a point estimate.
• The level of confidence represents the expected proportion of intervals that will contain the
parameter if a large number of different samples is obtained.
• The level of confidence is denoted (1 –
α) * 100%.
• Interval estimators are far more useful because,
• Point estimator does not take into account the probability distribution of the sample.
• Point estimator provides no information on how close it is to the population parameter.
• Approach to estimation:
1. Identify the parameter to be estimated
2. Specify the parameter’s estimator and its sampling distribution
3. Construct an interval estimator.
42
Point estimate for population proportion
• Point estimate for the population proportion is p̂ = x / n,
• where x is the number of individuals in the sample with the specified characteristic and n is
the sample size.
• Example: 1020 adults were asked if they feel taxes are too high, 490 said yes.
• So the point estimate for the population proportion is 490/1020 = 0.48 = 48%
43
Interval estimate for the population proportion
• What do we know about the sampling distribution of the sample proportion p̂?
• The shape of the sampling distribution of p̂ is approximately normal provided
np(1 – p) ≥ 10.
• The mean of the sampling distribution of p̂ is μp̂ = p.
• The standard deviation of the sampling distribution of p̂ is √(p(1 – p)/n).
• Because the distribution of the sample proportion is approximately normal, we know 95%
of all sample proportions will lie within 1.96 standard deviations of the population
proportion, p, and 2.5% of the sample proportions will lie in each tail.
44
Interval estimate for the population proportion
• We can now deduce that 95% of all sample proportions satisfy the inequality
p – 1.96·√(p(1 – p)/n) < p̂ < p + 1.96·√(p(1 – p)/n)
• We can rewrite this as p̂ – 1.96·√(p(1 – p)/n) < p < p̂ + 1.96·√(p(1 – p)/n)
• So the margin of error for a 95% confidence interval is 1.96·√(p(1 – p)/n)
45
Interval estimate for the population proportion
• When α = 0.05, we constructed a (1 – 0.05) * 100% = 95% confidence interval.
• Now we need a method for constructing any (1 – α) * 100% confidence interval.
• We generalize our current formula by first noting that (1 – α) * 100% of all sample
proportions are in the interval
p – z(α/2)·√(p(1 – p)/n) < p̂ < p + z(α/2)·√(p(1 – p)/n)
• Rewritten as p̂ – z(α/2)·√(p(1 – p)/n) < p < p̂ + z(α/2)·√(p(1 – p)/n)
46
Critical value
• The value z(α/2) is the critical value of the distribution.
• It represents the number of standard deviations the sample statistic can be from the
parameter and still result in an interval that includes the parameter.
• Notice that higher levels of confidence correspond to higher critical values.
• Common critical values: 90% confidence → z(0.05) = 1.645; 95% → z(0.025) = 1.96;
99% → z(0.005) = 2.575.
47
Interval estimate for the population proportion
• Computing the bounds of a (1 – α) * 100% confidence interval for a population proportion:
lower bound = p̂ – z(α/2)·√(p̂(1 – p̂)/n), upper bound = p̂ + z(α/2)·√(p̂(1 – p̂)/n)
• Note p̂ in place of p in the standard deviation. This is because p is unknown, and p̂ is the
best point estimate of p.
• Example: 800 randomly sampled young adults were asked whether they used a mobile
phone while driving, 272 indicated that they text while driving. Obtain a 95% confidence
interval for the proportion who text while driving.
48
Interval estimate for the population proportion
1. Compute the value of p̂.
2. Verify that np̂(1 – p̂) ≥ 10 (the normality condition) and that the sample size is no more than
5% of the population size (the independence condition).
3. Determine the critical value.
4. Determine the lower and upper bounds of the confidence interval.
5. Interpret the result.
There are certainly more than 1,000,000 young adults, so our sample size is definitely less
than 5% of the population.
Because we want a 95% confidence interval, we have α = 1 – 0.95 = 0.05.
49
Interval estimate for the population proportion
We are 95% confident that the proportion of young adults
who text while driving is between 0.307 and 0.373
50
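A minimal sketch (not from the slides) reproducing the texting-while-driving interval in Python, using the same counts (272 of 800) and the 95% critical value.
```python
# 95% confidence interval for a population proportion.
from math import sqrt

x, n, z_crit = 272, 800, 1.96          # z_crit is the critical value for 95% confidence
p_hat = x / n                          # point estimate, 0.34
se = sqrt(p_hat * (1 - p_hat) / n)     # estimated standard deviation of p-hat
lower, upper = p_hat - z_crit * se, p_hat + z_crit * se
print(round(lower, 3), round(upper, 3))    # about (0.307, 0.373), as on the slide
```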
Other estimation techniques
• Determine the sample size necessary for estimating a population proportion within a
specified margin of error.
• Obtain a point estimate for the population mean
• Construct and interpret a confidence interval for a population mean
• Determine the sample size needed to estimate the population mean within a given margin
of error
• Find critical values for the chi-square distribution
• Construct and interpret confidence intervals for the population variance and standard
deviation.
• Estimate a parameter using the bootstrap method.
51
Hypothesis testing
• Hypothesis testing aims to determine whether there is enough statistical evidence in favour of a
certain belief about a population parameter.
• Examples:
• In a criminal trial, a jury must decide whether the defendant is innocent or guilty based on the
evidence presented at the court.
• Is there statistical evidence in a random sample of potential customers, that supports the
hypothesis that more than 20% of potential customers will purchase a new product?
• Five components of a hypothesis test
• Null hypothesis (H0)
• Alternative hypothesis (HA)
• Test statistic
• Rejection region
• Decision rule
52
In a criminal trial…
• The presumption of innocence – innocent until proven guilty
• “Ei incumbit probatio qui dicit, non qui negat”
• The burden of proof is on he who declares, not on he who denies.
• What verdict does the jury pass?
53
In a criminal trial…
A criminal trial is an example of hypothesis testing without the statistics.
In a criminal trial, a jury must decide whether the defendant is innocent or guilty based on the
evidence presented at the court.
In a trial a jury must decide between two hypotheses, the null hypothesis H0 is
H0: The defendant is innocent.
The alternative hypothesis HA is
HA: The defendant is guilty.
• The jury does not know which hypothesis is true. They must make a decision on the basis
of evidence presented.
• Two outcomes – 1. Guilty 2. Not guilty
54
In a criminal trial…
Two outcomes – 1. Guilty 2. Not guilty
1. In the language of statistics, guilty or convicting the defendant is called,
rejecting the null hypothesis (the defendant is innocent) in favor of the alternative
hypothesis (the defendant is guilty).
That is, the jury is saying that there is enough evidence to conclude that the defendant is
guilty (i.e., there is enough evidence to support the alternative hypothesis).
2. In the language of statistics, not guilty or acquitting the defendant is called,
not rejecting the null hypothesis as there is not enough evidence to support the
alternative hypothesis.
Notice that the jury is not saying that the defendant is innocent, only that there is not enough
evidence to support the alternative hypothesis.
That is why we never say that ‘we accept the null hypothesis’ (that the defendant is
innocent).
55
Outcomes from a hypothesis test
Notice how the outcome is always about H0
Four possible outcomes and the resulting decision:
• Do not reject H0 when H0 is true – Correct
• Reject H0 when H0 is true – Incorrect
• Do not reject H0 when H0 is false – Incorrect
• Reject H0 when H0 is false – Correct
56
Type I and Type II errors
Two possible errors can be made in any hypothesis test.
A Type I error occurs when we reject a true null hypothesis (i.e. reject H0 when H0 is true).
In the criminal trial, a Type I error occurs when the jury convicts an innocent person.
The probability of a Type I error is denoted α. It is also called the level of significance.
P(making a Type I error) = α
A Type II error occurs when we don’t reject a false null hypothesis (i.e. do not reject H0
when H0 is false). In a criminal trial, a Type II error occurs when a guilty defendant is
acquitted.
The probability of a Type II error is denoted β.
P(making a Type II error) = β
The two probabilities are inversely related; decreasing one increases the other.
57
Components of a hypothesis test
The two hypotheses: null hypothesis and alternative hypothesis. The usual notation is:
H0: — the ‘null’ hypothesis (pronounced h-nought)
HA: — the ‘alternative’ hypothesis
The null hypothesis (H0) will always state that the parameter equals the value specified in the
alternative hypothesis (HA).
Test on population means:
H0: μ = μ0 (μ0 is a given value for μ)
HA: μ ≠ μ0 or HA: μ < μ0 or HA: μ > μ0
Test on population proportions:
H0: p = p0 (p0 is a given value for p)
HA: p ≠ p0 or HA: p < p0 or HA: p > p0
58
Components of a hypothesis test
Test statistics
We need to use a sample statistic to test a hypothesis.
Test on population mean, μ:
a) If population variance σ² is known
Test statistic: X̄; standardised test statistic: Z = (X̄ – μ) / (σ/√n) ~ N(0,1)
b) If population variance σ² is unknown
Test statistic: X̄; standardised test statistic: t = (X̄ – μ) / (s/√n) ~ t(n-1)
Test on population proportion, p:
If np ≥ 5 and nq ≥ 5,
Test statistic: p̂; standardised test statistic: Z = (p̂ – p0) / √(p0·q0/n) ~ N(0,1)
59
Components of a hypothesis test
A rejection region of a test consists of all values of the test statistic for which H0 is
rejected.
An acceptance region of a test consists of all values of the test statistic for which H0
is not rejected.
The critical value is the value that separates the acceptance and rejection regions.
The decision rule defines the range of values of the test statistic for which H0 is
rejected in favour of HA.
60
One and two tail tests
61
Example: population proportion – two-tailed test
• We are told 46% of all Australians are unhappy with the new metadata legislation. In order
to test this, a survey was conducted on 1267 Australians with at least a bachelor’s degree
and found that 559 were unhappy about the new legislation. Does this result suggest the
proportion of Australians with at least a bachelor’s degree feel differently than the general
population on the new legislation? Use level of significance of 0.1.
• We want to know whether the proportion is different from 0.46, which can be written
p ≠ 0.46, so this is a two-tailed test (H0: p = 0.46 versus HA: p ≠ 0.46).
• We assume the sample comes from a population with p0 = 0.46. The sample proportion is
559/1267 = 0.441.
• Test statistic: z = (0.441 – 0.46) / √(0.46(1 – 0.46)/1267) ≈ -1.34
62
Example: population proportion – two-tailed test
• Because this is a two-tailed test, we determine the critical values at the α = 0.10 level of
significance to be –z(0.1/2) = –z(0.05) = -1.645 and z(0.1/2) = z(0.05) = +1.645.
• Because the test statistic does not lie in the critical (rejection) region, we do not reject the
null hypothesis.
• There is not sufficient evidence at the α = 0.1 level of significance to conclude that
Australians with at least a bachelor’s degree feel differently than the general population
about the new legislation.
63
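A minimal sketch (not from the slides) of this two-tailed test in Python, using the same figures (p0 = 0.46, 559 of 1267, α = 0.10).
```python
# Two-tailed z-test for a population proportion.
from math import sqrt

p0, x, n, z_crit = 0.46, 559, 1267, 1.645         # 1.645 is the critical value at alpha = 0.10
p_hat = x / n                                     # about 0.441
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)        # standardised test statistic, about -1.34
print(round(p_hat, 3), round(z, 2))
print("reject H0" if abs(z) > z_crit else "do not reject H0")
```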
Other techniques on hypotheses
• Hypothesis testing on the population mean.
• Hypothesis testing on the population standard deviation (or variance).
• Classical approach and
P-value approach to testing hypotheses
• Compute the probability of making a Type II error
• Compute the power of the test.
• Inferences on two samples
• Inferences on categorical data
• Non-parametric methods
64
Summary
• “Techniques that use samples to make generalizations about populations”
• Probability and probability distributions
• Because methods used to generalize results from a sample to a population are based on
probability and probability models.
• Sampling techniques
• Because sampling distribution of a statistic is the probability distribution for all possible values.
• Inferential techniques
• Estimation
• When we have no idea about the value of the population parameter being investigated.
• Hypothesis testing
• When we have some idea of the value of the population parameter being investigated.
65
References
• Data Science for Business, Foster Provost and Tom Fawcett, 1st ed.
• Business statistics: Australia / New Zealand, Eliyathamby A. Selvanathan, Saroja
Selvanathan, Gerald Keller, 5th ed.
• Statistics: The Art and Science of Learning from Data, Alan Agresti and Christine Franklin,
3rd ed.
• Statistics Informed Decisions Using Data, Michael Sullivan, III, 4th ed.
66
BUS5PB – Principles of Business Analytics
Topic 5 – Business Performance Management
Learning Objectives
Define BPM and its role in a business
Know the BPM framework
Understand the role of analytics in BPM
Understand the Balanced Scorecard method
Understand the strategy map method
1
What is BPM?
• Many definitions and perspectives,
• a narrow concept that applies to planning, scheduling, and budgeting practices in
business.
• the context of legislation such as the Sarbanes–Oxley Act, 2002.
• On corporate governance and financial disclosure
• Australian equivalent – Corporate Law Economic Reform Program (CLERP)
• USA: Enron, WorldCom Australia: HIH Insurance and OneTel
• business activity monitoring, corporate process management, business activity
management, and business process management
• corporate performance management and enterprise performance management
• the process of assessing progress towards achieving predetermined goals.
• a holistic approach to fully manage performance through informed and proactive
decision making.
2
A formal definition
• BPM standards group,
A set of integrated, closed-loop management and analytic processes, supported by
technologies, that address financial and operational activities.
• Gartner report – BPM researchers,
Methodologies, metrics, processes and systems which are used to monitor and manage
business performance. (Geishecker and Rayner)
3
A formal definition
• Wayne Eckerson – director of research at TDWI,
A series of processes and applications designed to optimize the execution of business
strategy.
• Not just about improving performance, improving it in the right direction (business value,
long-term health).
• David Axson – BPM guru,
Encompasses all the processes, information, and systems used by managers to set strategy,
develop plans, monitor execution, forecast performance, and report results with a view to
achieving sustainable success no matter how success may be defined.
4
BPM, EPM or CPM?
• Corporate performance management and enterprise performance management
• Same purpose but distinguished by a technicality,
• CPM is widely used in industry and by consultants
• EPM is used by vendors, such as Oracle or SAP.
• EPM and CPM exclude public institutions and non-profit organisations.
• BPM maintains general applicability.
5
How BPM helps..
• “A series of business processes and applications designed to optimize both the development
and the execution of business strategy”
• Bridge the gap between strategy and execution
– by improving,
• Communication – executives can communicate strategy and expectations to managers
and staff at all levels
• Collaboration – two-way exchange of ideas and information
• Control – continuously adjust or improve operations.
• Coordination – among business units, sharing resources and information.
6
How BPM helps..
• Bridge the gap between strategy and execution
• Improved visibility into the business
• Identify operational problems before they grow
• Exploit market opportunities as they arise
• Reduce operational costs over time
• Automate the execution of strategy and optimise business management?
7
How BPM works..
• A framework proposed by the BPM Standards Group.
1. Strategize
2. Plan
3. Monitor and analyse
4. Act/adjust – take corrective action
A closed loop that captures business strategy and translates it into execution.
Strategize and Plan – represents formulation of business strategy.
Monitor, analyse and take action – defines how to modify and execute strategy.
8
BPM framework
9
BPM framework – strategize
• Common tasks for the strategic planning process:
• Conduct a current situation analysis.
• Determine the planning horizon.
• Conduct an environment scan.
• Identify critical success factors, e.g., Walmart – logistics and distribution; Dell – JIT; Apple –
design infused in technology
• Complete a gap analysis.
• Create a strategic vision.
• Develop a business strategy.
• Identify strategic objectives and goals.
10
BPM framework – strategize
• Where do we want to go?
• 90 percent of organizations fail to execute their strategies
• Four sources for the gap between strategy and execution
• Enterprise-wide communication
• Alignment of rewards and incentives
• Focus on the core business/elements
• Resources
11
BPM framework – plan
• The operational aspects of the business
• Translates organization’s strategic goals into well-defined,
• Tactics and initiatives
• Resource requirements
• Expected outcomes/results
• Planning involves setting a timeframe to its operations
12
BPM framework – plan
• How do we get there?
• Financial planning and budgeting
• An organization’s strategic objectives and key metrics should serve as top-down drivers
for the allocation of an organization’s tangible and intangible assets
• Resource allocations should be carefully aligned with the organization’s strategic
objectives and tactics in order to achieve strategic success (e.g., new sales channel)
• Budget-centric (financially focused)
13
BPM framework – monitor and analyse
• Monitoring is about ensuring that the execution “sticks to the plan”
• Plan will inform “what to monitor”, e.g.,
• Critical success factors
• Strategic goals and targets
• Plan will detail “how to monitor”, e.g., metrics and KPIs
14
BPM framework – act and adjust
• Act and adjust is an outcome of monitor and analyse
• Act on any deviating indicators during monitoring.
• Adjust any metrics, plans or even strategy, where required, to reflect changing
situations.
• Any act and adjust is done to continue achieving the organization’s goals.
15
BPM framework – act and adjust
• What to do differently?
• Success/survival depends on new projects such as
• Creating new products
• Entering new markets
• Acquiring new customers (or businesses)
• Streamlining some processes
• Many new projects and ventures fail!
• New Hollywood movies: 60% failure
• Mergers and acquisitions: 60%
• IT projects (large-scale): 70%
• New food products: 80%
16
BPM framework – act and adjust
• Benchmarking results indicate that world-class companies,
• Have hybrid sourcing strategies that combine shared services and outsourcing.
• Provide management with the tools and training to leverage corporate information and to
guide strategic planning, budgeting, and forecasting.
• Closely align strategic and tactical plans, enabling functional areas to contribute more
effectively to overall business goals.
• Are significantly more efficient than their peers at managing costs.
• Focus on operational excellence and experience significantly reduced rates of voluntary
employee turnover.
17
BPM framework – expanded
18
BPM framework – expanded
• Splits strategic and operational into separate loops.
• Strategic level defines business objectives and KPIs.
• Starting point is the analysis of the business and subsequent definition of business
objectives.
• Based on the business objectives, strategic KPIs are derived.
• Strategic KPIs influence the process design and the definition of process-oriented
indicators (operational KPIs).
19
The other BPM..
• The operational loop represents the Business Process Management (BPM) lifecycle.
• BPM includes analysing, designing, implementing, executing, monitoring and optimizing of
business processes.
• Process Monitoring forms an essential part within the BPM lifecycle.
• It assesses the performance of process instances being executed in the key dimensions –
quality, time and cost.
• It helps to identify weaknesses and opportunities for improvement.
20
Business activity monitoring
• Traditional process monitoring is time-driven or request-driven, thereby delivers results
with a time-lag.
• This is a disadvantage as unexpected events need to be resolved promptly.
• Business activity monitoring (BAM) as the event-driven complement of traditional
monitoring addresses this shortcoming.
• A definition: “processes and technologies that provide real-time situation awareness, as
well as access to and analysis of critical business performance indicators, based on
event-driven sources of data”
21
BAM benefits
• Continuous and simultaneous real-time monitoring of IT systems and services supporting
business processes.
• Fosters awareness and provides real-time visibility of business processes with KPIs
• Reduces IT blindness
• The IT environment in the enterprise constantly creates a high number of single events
without any semantics
• Helps recognizing significant business events like bottlenecks or missed targets.
• Allows for better understanding the consequences of events and acting adequately by
putting them into their current, predictive and historical context.
22
Process monitoring vs. BAM
23
Back to BP(erformance)M
24
Linking operational to strategic
• It is important to align business processes with strategic KPIs
• How is this accomplished? operational KPIs
• Operational KPIs periodically quantify performance on the operational level.
• Compared to strategic KPIs their aggregation level is lower.
25
BI and BPM
• BI: “Concepts and methods to improve business decision-making by using fact-based
support systems”.
• “BI is the technological solution that enables a company to consolidate and leverage the
vast masses of data in organizations to improve decision making”
• BI provides the IT infrastructure and applications required to implement BPM.
• BPM includes a business process that leverages BI.
• BPM as an extension of BI,
• BI applications focus on the automated collection of data and its analysis (OLAP,
analytics, etc.), while
• BPM focuses on the process of systematic monitoring and on the control of business
objectives on different management levels.
26
BI and BPM
• Differentiators – scope, type of data, type of decision support provided and orientation of
application.
• BI implementations have a narrow scope limited to one or more departments or functional
areas, whereas BPM focuses on the entire enterprise
• BI applications support strategic and tactical decision making whereas BPM supports
operational as well.
• BPM sources real-time data while BI tends to rely on archival data.
• BPM solutions have a proactive orientation, while BI maintains a reactive orientation.
• Overall, BPM extends BI so that operational decision-making becomes more proactive and
timely, and supports a wider range of business users.
27
BPM data sources
• Enterprise data involved in managing business performance includes:
• Event data from business process operations
• Event data from IT infrastructure operations
• Historical business process analytics and metrics
• Business plans, forecasts, and budgets
• Data occurring from external events (e.g. changes in marketplace conditions)
28
And the outcomes
• This data is used by BI applications to create actionable management information that
enables BPM.
• The actionable information produced includes:
• KPIs
• Alerts
• Analytic context reports
• Recommendations for corrective action
29
An example
• A process-driven and closed-loop application environment.
• Business applications execute transactions in support of business processes (receiving
customer orders, managing inventory, shipping products, and billing customers).
• Transaction data and events are captured and integrated in a data warehouse
environment for reporting and analysis by BI applications.
• BI-driven BPM applications convert the integrated data in the warehouse into useful and
actionable business information.
• Business users apply their business expertise and guided analysis to evaluate this
actionable business information to determine what decisions need to be made to
optimise business operations and performance.
• Applying business expertise to business information creates business knowledge.
• Feedback loop – fed back to the business processes
30
An example
31
An example
32
Performance Measurement
33
Why measure?
Inspiration for the balanced scorecard,
“I often say that when you can measure what you are speaking about, and express it in
numbers, you know something about it; but when you cannot measure it, when you cannot
express it in numbers, your knowledge is of a meagre and unsatisfactory kind” – Lord Kelvin
• If you cannot measure it, you cannot improve it.
34
What to measure?
• BPM bridges the gap between strategy and execution.
• Measurement provides a means of evaluating progress toward this goal.
• Assess how well operations are aligned with business strategy.
• A performance measurement system,
• tracks implementations of business strategy by comparing actual results against strategic goals and
objectives.
• comprises systematic comparative methods that indicate progress against goals.
35
Performance measurement system
• An effective performance measurement system should help
• Align top-level strategic objectives and bottom-level initiatives.
• Identify opportunities and problems in a timely fashion.
• Determine priorities and allocate resources accordingly.
• Change measurements when the underlying processes and strategies change.
• Delineate responsibilities, understand actual performance relative to responsibilities, and
reward and recognize accomplishments.
• Take action to improve processes and procedures when the data warrant it.
• Plan and forecast in a more reliable and timely fashion.
36
Types of measures
• Important to have a good collection of performance measures.
• Measures should,
• Contain a mix of past, present, and future activity.
• Balance the needs of shareholders, employees, partners, suppliers, and other
stakeholders.
• Start at the top and flow down to the bottom.
• Have targets that are based on research and reality rather than be arbitrary.
37
What not to measure..
• Not what’s easily accessible and simple, such as existing finance ratios
• These are not always linked to business strategy.
• Not measures derived from bottom-up initiatives that do not consider the organization’s
strategic objectives.
• Not excessive drill-down – e.g. the cost per sale per minute per employee as a percentage
of non-discretionary income.
• Incorrect performance measures used over time may eventually become the company
standard.
38
Balanced Scorecard (BSC)
• A comprehensive framework/tool/methodology that translates strategic objectives into a
coherent set of performance measures.
• Contains a mixture of financial and non-financial measures each compared to a ‘target’
value within a single concise report.
• Art Schneiderman created the first design in 1987.
• In 1992, Kaplan and Norton expanded the primarily financial view of performance metrics
into four perspectives.
39
Balanced Scorecard (BSC)
• Attempts to answer four basic questions,
• How do we look to shareholders?
• How do customers see us?
• What must we excel at?
• Can we continue to improve and create value?
40
Balanced Scorecard
• The four perspectives introduce ‘balance’ to the limitation of being financially focused.
• Financial
• Customer
• Internal business process
• Learning and growth
41
Balanced Scorecard
• A balanced mix of figures to measure quality, finance, process efficiency, customer
satisfaction as well as progress in terms of learning and growth.
• The goal of strategic management is to balance these perspectives.
• By doing so, the focus shifts from traditional management of financial measures towards
integrated approaches which also include non-financial measures.
• Another variant adds a perspective regarding sustainability – the Sustainable Balanced
Scorecard (SBSC).
42
BSC perspectives
• Contains a mixture of financial and non-financial measures each compared to a ‘target’
value within a single concise report.
• Financial – general ‘unbalanced’ focus. Many sources for handling and processing of
financial data
• KPIs: productivity, revenue, growth, usage, and overall shareholder value.
• Customer – analysed in terms of customer types and process types that provide a product
or service to those customer groups
• KPIs: customer acquisition, customer satisfaction rates, market share and brand
strength.
43
BSC perspectives
• Internal business processes – efficient and effective
• KPIs: resource usage, inventory turnover rates, order fulfilment and quality control.
• Learning and growth – employee training and corporate cultural attitudes related to both
individual and corporate self-improvement.
• Crucial in a knowledge-worker organization where people are the main resource.
• Not limited to training but includes communication and mentoring.
• KPIs: employee retention, employee satisfaction and employee training and
development (a minimal data-structure sketch of the four perspectives follows this slide).
44
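A hypothetical sketch of how the four perspectives and their KPIs could be represented as a simple structure, with each measure compared to a target as the slides describe. All KPI names and values are illustrative only.

```python
# Hypothetical balanced scorecard structure: four perspectives,
# each with KPIs compared against a target value.
scorecard = {
    "Financial":         {"revenue growth %":        {"actual": 6.2,  "target": 8.0}},
    "Customer":          {"customer satisfaction %": {"actual": 87.0, "target": 85.0}},
    "Internal process":  {"order fulfilment days":   {"actual": 3.1,  "target": 2.5}},
    "Learning & growth": {"employee retention %":    {"actual": 92.0, "target": 90.0}},
}

for perspective, kpis in scorecard.items():
    for name, v in kpis.items():
        # For the 'days' KPI lower is better; for the percentage KPIs higher is better.
        lower_is_better = "days" in name
        met = v["actual"] <= v["target"] if lower_is_better else v["actual"] >= v["target"]
        print(f"{perspective:18} {name:25} {'MET' if met else 'MISSED'}")
```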
BSC visualisation
45
BSC visualisation
46
BSC visualisation
47
BSC visualisation
48
Strategy map
• An extension to the Balanced Scorecard to describe the causal relationships between
strategic objectives.
• Defines and communicates how strategy should be deployed and implemented in an
organization by describing how to connect strategic objectives to operational initiatives.
• Supplements a BSC with the possibility to describe cause-and-effect relationships between
strategic goals.
• The cause-and-effect relationships between non-financial and/or financial measures can
also be described.
49
Causal relationships – example
• Employees: better trained in quality management tools reduce process cycle times and
process defects
• Internal processes: improved processes lead to shorter customer lead times, improved
on-time delivery, and fewer defects experienced by customers
• Customers: The quality improvements experienced by customers lead to higher
satisfaction, retention, and spending.
• Financial: Thereby, higher revenues and margins.
50
Strategy map method
• Kaplan and Norton provide a generic strategy map which can be customised to the needs
of a specific organisation.
• Start by describing the strategy and its economic consequences.
• Define strategic objectives for financial and customer perspective.
• Next, a description of how the strategy will be accomplished.
• The internal perspective defines objectives necessary to achieve the value proposition by
means of process measures.
• The last perspective, learning and growth, defines objectives concerning intangible assets.
51
Strategy map outcomes
• The strategy map is complete after defining and describing strategy in the context of the
perspectives.
• Next, a BSC is used to translate them into concrete KPIs and a set of targets.
• As a result, it provides a clear line of sight into how individual business unit activities are
linked to the overall objectives of the organization.
• The metrics derived through this methodology represent KPIs that are tied to strategic
objectives.
52
Kaplan and Norton’s generic strategy map
53
Custom strategy map
54
Strategy map with BSC and action plan
55
Implementing a BSC
• Automation is essential in order to manage the vast amount of information related to a
company’s mission and vision, strategic goals, objectives, perspectives, measures, causal
relationships, and initiatives.
• If the software used is intuitive and can be deployed through an organization readily, it can
bring visibility to the BSC process, ease a cultural transition, and enable participation by a
wider audience.
56
Approaches to automation
• Proprietary BI products – vendor identifies BSC as an extension of BI, develops it as an
add-on to their product line.
• Can limit focus to measures derived from available data rather than strategic objectives.
• Lacks capabilities to communicate strategy and manage non-numeric information
(reasons for selecting a measure)
• Can be cost-prohibitive for widespread deployment
• ERP-centric applications – interface with transactional systems and try to address the
common misconception about ERP reporting capabilities
• Silos of transactional data
• Unable to integrate external data
• Lack of unstructured content to educate employees
• Limited visualisations and dashboards (compared to a BI product)
57
Approaches to automation
• BSC-specific applications – Offer complete coverage from data integration to intuitive
visualisations that can relate to all levels within an organisation.
• Adaptive Insights
• Quick Score
• Rocket CorVu
58
Barriers to BPM
• Measures that do not focus on organisational strategy
• An IT department within a bank had identified measures and benchmarks for being a world
class IT department. They did very well but the measures were not tied to the bank’s business
strategy!
• Failure to communicate and educate.
• A scorecard is only effective if it is clearly understood throughout an organization.
• Misconception that BPM implementations are technology driven.
• Although technology supports BPM, strategically aligned business processes drive it.
• Unaccounted impact of organizational resistance.
• BPM introduces new or modified processes which make information transparent.
• The resulting resistance can hamper project implementation and adoption.
• Assuming the BPM project is completed when technology is in place.
• Post-implementation people issues
• Overlooking the importance of training end users.
• Dealing with lack of user confidence in the new systems.
59
Weaknesses of the BSC
• Any missing perspectives?
• Top management may fail to reach consensus on the firm’s strategy (different views on
what the strategy is).
• BSC does not provide a solution for ‘how to measure’.
• Ignores activities and initiatives that lie beyond the original targets.
• Constrained view of stakeholders who interact with an organisation.
• Lack of a deployment system that breaks high level goals down to the sub-process level,
where actual improvement activities reside.
• Additional workload of developing and maintaining a BSC.
60
A potential solution..
• Integrating BSC with Six Sigma
• Allows an organisation to translate the strategy into high-level metrics and provide the
capability to improve high-level metrics through Six Sigma initiatives.
• BSC provides the capability to translate the strategy into relevant organisational metrics.
• Six Sigma is adopted as the vehicle for influencing these metrics.
61
Six Sigma overview
• Organisational approach to operational excellence developed by Motorola in 1986.
• Based on a structured data driven problem solving methodology that is uniquely driven by
a close understanding of customer needs.
• Six Sigma uses statistical analyses to measure and reveal opportunities for process
improvement by uncovering defects
• The term is associated with manufacturing processes.
• A six sigma process is one in which 99.99966% (3.4 defective parts per million) of the
products manufactured are statistically expected to be free of defects (a quick check of this figure is sketched below).
• Follows two project methodologies, each composed of five phases
• DMAIC (improve an existing business process)
• DMADV (creating new product or process designs)
62
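The 3.4 defects-per-million figure can be reproduced from the normal distribution using the conventional 1.5-sigma long-term shift. A minimal check, assuming scipy is available:

```python
# Reproduce the classic Six Sigma defect rate: a "six sigma" process with the
# conventional 1.5-sigma long-term shift leaves 6 - 1.5 = 4.5 sigma of margin.
from scipy.stats import norm

sigma_level = 6.0
long_term_shift = 1.5
p_defect = norm.sf(sigma_level - long_term_shift)   # upper-tail probability beyond 4.5 sigma
dpmo = p_defect * 1_000_000

print(f"Defects per million opportunities: {dpmo:.1f}")   # ~3.4
```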
DMAIC
• Define: select the process that needs improvement.
• Assumes that customer satisfaction with care is critical to success.
• Measure: translate the process into quantifiable forms, collect data, and assess current
performance.
• Define performance measures that encompass inputs, outputs and its users.
• Analyse: identify the root causes of defects and set goals for performance.
• Investigate and verify cause-and-effect relationships.
• Improve: implement and evaluate changes (solutions) to the process to remove root
causes of defects.
• Improve or optimize the current process.
• Control: standardise solutions, and continuously monitor improvement.
• Ensure goals are achieved and acceptable behaviour patterns are maintained
throughout the organization.
63
BSC Six Sigma integration
64
BSC Six Sigma integration
65
BPM critical success factors
• Organisational
• Strong executive sponsorship
• Management of resistance to BPM
• Management support
• Technical
• Consolidation of dispersed silos of data
• Existing data management infrastructure
• Sufficient resources
• User support
• Methodology related
• Effective communication
• Clear link to business strategy
• Team skills
66
Summary
• Define BPM and its role in a business
• Know the BPM framework
• Understand the role of BI in BPM
• Understand the Balanced Scorecard method
• Understand the strategy map
67
References
• Data Science for Business, Foster Provost and Tom Fawcett, 1st ed.
• Competing on Analytics: The New Science of Winning, Thomas H. Davenport, 1st Ed.
• Analytics at Work: Smarter Decisions, Better Results, Thomas H. Davenport, 1st Ed.
• The Value of Business Analytics: Identifying the Path to Profitability, Evan Stubbs, 1st Ed.
68
BUS5PB – Principles of Business Analytics
Topic 6 – Ethics and Emerging Trends
Learning Objectives
Understand the role of ethics in BI/BA
Know the application of ethics
Know the benefits of ethics
Understand the business case for ethics
An appreciation of emerging trends in BI/BA
1
Understanding ethics..
• Right vs Wrong
• Ethical vs Moral vs Legal
• Morals are an individual’s own beliefs and principles about what’s right and wrong.
• Similar but not the same.
• Unlike morals, ethics is not internal; it is bestowed by an external entity.
• Legal refers to mandatory actions imposed by law.
• Law is prescriptive.
• Although law is related to, and overlaps with, ethics, the two are discrete and separate.
• Perception can shape ethics, but does not shape law.
2
Understanding ethics..
• A choice between right vs wrong and good vs bad.
• This highlights the inherent subjective nature of ethics.
• Subjectivity introduces ambiguity and the need for judgement.
• Judgement must not be based on opinion but on values, principles and critical thinking.
• The foundations of such ethical judgements must be resilient across diverse beliefs.
• Remember, ethical conduct is a choice!
3
Defining ethics..
• Richard Hackathorn,
• “A judgment by members of society about what is good or bad behaviour.”
• Not always accurate to hold society responsible for the distinction between right and
wrong.
• Frank Buytendijk,
• “A code of conduct to refer to in judging what is right and what is wrong with the premise
that ethics is based on core concepts of self, good, and other.”
• Ethical behaviour considers what is good for others as well as what is good for myself.
4
Ethics in analytics
• “Making the right choices by others and for ourselves and to prevent doing harm to others
or to ourselves”
• Good for others (stakeholders) does not imply damage to self (organisation) in order to
achieve good for others.
• Legality and morality are dealt as separate issues.
• Ethics in BI/BA applies to three activities,
• How data (input) and intelligence (output) are acquired (acquisition).
• How data and intelligence are used (use).
• How individual and organizational conduct is guided through the use of data and
intelligence (conduct).
5
An example..
• An online game service provider where users of any age must register to play games.
• The ethical issue – is it acceptable to collect name and address information from children
without parents’ consent?
• The moral issue – should children below 18 years be given access to violent games with or
without parental consent?
• The legal issue – are we allowed to collect game data from minors and if we do what
penalties exist if parents bring a lawsuit?
6
Ethics in analytics
• What is right for the customer / what is right for the company –
• when gathering data?
• when using data to drive business results?
• when using data to shape organizational conduct?
• Ambiguity lies at the intersection – What is the appropriate action when customer analytics
without disclosure has substantial business benefit but makes employees uneasy?
7
Ethics of data acquisition
• Gathering data on customers and competitors
• Two rules of thumb – avoid harm and build trust
• What is unethical?
• Deceptive marketing techniques, selling to third parties and identity theft
• Ten principles for ethical acquisition (TDWI) follows.
Overall, it’s about fairness – what is the right balance of “good for others” and “good for self”?
The ten are similar to US-EU Safe Harbor privacy principles.
8
Ten principles for acquisition
1. Informed consent – should the subject know data is being collected and agree to its collection?
2. Anonymity – should all personally identifying information be eliminated from the data? or collect
only in the form of aggregates such that individuals can’t be identified?
3. Confidentiality – should sources and providers of data be protected from disclosure?
4. Security – what level of protection from intrusion, corruption, and unauthorized access?
5. Privacy – should each individual have the ability to control access to personal data about
themselves?
6. Accuracy – what level of exactness and correctness is required of the data?
7. Ownership – is personal data about individuals an asset that belongs to the business or privately
owned information for which the business has stewardship responsibilities?
8. Honesty – to what degree should the business be forthright and visible about data collection
practices?
9. Responsibility – who is accountable and at what level for use and misuse of data?
10. Transparency – between the two extremes of open and stealth data collection, what is the right
level of transparency?
9
Safe Harbor principles
• EU maintains the most rigorous system of privacy legislation.
• Companies in the EU are not allowed to transfer personal data to non-EU locations
without a guarantee of adequate levels of protection.
• The US-EU Safe Harbor principles are the compliance expectations.
10
• Notice – Inform individuals of the purpose for which information is collected.
• Choice – Offer individuals the opportunity to choose or opt out – at many levels.
• Consent – Only disclose to third parties consistent with the principles of notice and choice.
• Security – Protect data from loss, misuse, unauthorized access, disclosure, alteration, and destruction.
• Data Integrity – Assure reliability of personal information for its intended use and ensure information is accurate, complete and current.
• Access – Individuals must be able to access information held about them, and correct or delete it if it is inaccurate.
• Accountability – An organisation must be accountable for following the principles and must include mechanisms for assuring compliance.
What data..
• Data acquired on a variety of subjects, some of which include,
• Market
• Customers
• Economy
• Competitors and own organisation
• Processes
• Products
• Employees
• Finances

…..
11
Which source, who collects..
• A variety of sources for each subject,
• Internal organisational records
• Third party tracking data
• Publicly available data
• Roles and responsibilities of data collectors,
• Level of awareness and accountability for each of the ten principles
• Methods used to collect – manipulation and misrepresentation
• Manipulate consumers into revealing personal information (apps).
• Misrepresent identity (collect survey data in the guise of student projects).
12
Ethics matrices
13
[Ethics matrices: one matrix per data source (internal organisational data, third-party data), with the ten acquisition principles (informed consent, anonymity, confidentiality, security, …) as rows and subject areas (Market – Consumer, Competitor – Products, Own – Employees) as columns.]
Ethics of use
• Analytics supports decision-making processes.
• Complete and accurate actionable information is crucial to make good decisions.
• Ethical considerations for reporting,
• What to report and what not to report?
• Is it ethical to introduce a bias to reports and dashboards?
• For analytics,
• Excluding sensitive attributes from analysis (political, cultural)
• Customer profiling > Persuasion profiling – when does it become manipulation?
• Ethics of use highly-dependent on context – intention of use is the primary factor (good to
others and to yourself)
14
Examples of context
• An unconscious patient rushed into the ED.
• Ethical to use any information about him/her to determine identity, allergies and current
medical conditions.
• Credit card limit increases
• Unethical to issue limit increases to a known gambling addict
• Healthcare industry conducts research on adverse reactions to medicine.
• Can patient data be used for this purpose?
• Healthcare provider was given patient data only to treat a condition and not for research.
• An ancillary purpose – unethical to use without consent and preferably anonymised
before its use.
15
Anonymised data
• Frequently the case with analytics projects.
• Remove or replace personally identifiable information so that the individuals associated
with that data remain anonymous (a minimal sketch follows this slide).
• Useful for segmentation – identifying collective patterns which does not require information
at individual level.
• But the more anonymised the data, the less useful it becomes.
• Removing identity information inevitably removes contextual information.
• A pragmatic solution – be transparent and provide consumers the choice to opt-in/opt-out.
16
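A minimal sketch of the idea, assuming a pandas DataFrame of customer records: direct identifiers are dropped or replaced with a one-way hash so segmentation-level patterns survive while individuals stay anonymous. The column names and data are hypothetical.

```python
# Hypothetical anonymisation step before an analytics project.
import hashlib
import pandas as pd

customers = pd.DataFrame({
    "name":     ["A. Smith", "B. Jones"],
    "email":    ["a@example.com", "b@example.com"],
    "postcode": ["3000", "3086"],
    "spend":    [420.0, 310.5],
})

# Replace a direct identifier with a one-way hash (pseudonymisation),
# then drop the columns that are not needed for the analysis at all.
customers["customer_key"] = customers["email"].apply(
    lambda e: hashlib.sha256(e.encode()).hexdigest()[:12]
)
anonymised = customers.drop(columns=["name", "email"])

print(anonymised)
```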
Opt-in and opt-out
• Seen on many websites and social networks.
• Opt-out – default settings used by the service provider most often for all data collection.
• The user must explicitly choose to opt out of the default into custom settings.
• Opt-in – explicit permission has to be granted to collect and use information in a certain set
of ways before the collection of data begins.
• Tedious but useful as it forces end-users to consider repercussions before making a
choice.
• Less likely to be used as it’s frequently ignored (need to incentivise).
17
Ethics of conduct
• Creating a culture of ethical behaviours within the organisation
• When self-interest conflicts with values.
• How to respond to ambiguity and uncertainty.
• BI/BA is useful to detect unethical conduct
• BI is useful to detect fraud and misconduct both internal and external.
• Record and document unethical conduct, incorporate these into BI architecture.
• BI/BA is useful to detect ethical conduct
• Use of metrics in BPM to monitor compliance and good judgement.
• Use BI/BA to inform consumers of privacy/identity breaches.
18
Benefits of ethics
• Is it only ‘feel good’?
• Further benefits of BI/BA ethics, as a subset of corporate ethics,
• Positive reputation and brand value
• Employee commitment and morale
• Ease of employee recruiting and retention
• Access to investment capital
• Customer loyalty
• Financial performance?
19
Financial performance..
• Trudel conducted an experiment of the benefits achieved through ethics,
• Three groups of consumers were offered coffee.
• Group 1 was told it was produced with high ethical standards related to labour and the
environment.
• Group 2 was told the product was made using unethical methods.
• Group 3 was the control – no information on ethics.
• Results,
• Group 1 was willing to pay 17% more than the control group.
• Group 2 paid nearly 30% less than the control group.
20
Business case for analytics/BI ethics
• Adapted from Walter Maner’s levels of justification for computer ethics
21
[Figure: levels of justification – issues of quality, privacy, security and compliance; professionalism; abuse; policy gaps; new policies; new issues; governance.]
Ethics management
• Awareness – Research – Judgement – Resolution
• Start with a code of ethics
• Does your company have a code of ethics?
• Does your industry have a code of ethics?
• Does your profession have a code of ethics?
• Are ethical positions expressed in governance practices? (corporate governance, data
governance and IT governance)
22
Code of ethics
• If there is a code of ethics,
• Does it address BI/analytics?
• Does it address or support data and information acquisition, use and organisational
conduct?
• Does it provide a structure to recognize and frame ethical questions?
• Does it provide a structure to reason through and resolve ethical questions?
• Does it provide resources to seek ethical guidance?
• Does it include scenarios to understand ethical positions?
23
Amazon.com
• Most successful/popular recommender engine – why?
• Opt-out mode – a user agrees to the privacy policy by default.
• Test runs on the Amazon homepage yielded six different third-party tracking devices (apps):
• Admeld, DoubleClick, Millward Brown, OpenX Limited, Mediaplex and Microsoft adCenter.
• Each of these trackers monitor web activity of the user
• User details, browser settings, page interaction (scrolling, clicks, and mouse-overs) and
transactions.
• Amazon is just one of many Web sites contributing to the tracking of users through these
third-party trackers
• DoubleClick updates trillions of data points every 3-4 hours to provide an up to date view of
advertising performance.
24
Google
• In 2012, Google held about 85% of the search engine market share.
• Its code of conduct starts by declaring: “‘Don’t be evil.’ Googlers generally apply those
words to how we serve our users….”
• In March 2012, merged 60 different privacy policies into one with opt-in opt-out
options.
• Greater transparency to choose what Google services are allowed to collect and maintain
user data
• However, in late 2012, Google placed code that exploited a loophole in Apple’s Safari
browser, tricking the browser into submitting form data without the user’s knowledge
(enabling tracking cookies and targeted ads).
• Led to the largest FTC privacy penalty – $22.5 million
• A good example of unethical practice.
• Google figured out it could collect data from Safari users, but the decision to do so should
have included some consideration of the impact.
25
Cultural factors
• Hofstede developed the Individualism Index (IDV), which measures how collectivist or
individualist a society is.
• USA – individualist, India – collectivist
• Research at CMU applied IDV to privacy.
• Even after controlling for age and gender differences, the distinctions between three
cultures (American, Chinese, and Indian) were significant.
• Users in the US tend to be the most privacy concerned, followed by China.
• Phone number, address, e-mail, photo and employer were considered as privacy
sensitive by more than half of both US and Chinese respondents.
• Only the phone number was considered privacy sensitive by more than half the Indian
respondents.
26
Emerging Trends
27
Key elements
• Infrastructure
• Cloud
• In-memory
• Hadoop (distributed)
• Data
• Big Data (3Vs)
• MDM
• Technology
• Agile BI
• Operational BI
• Mobile BI
• MapReduce
• Columnar databases
• People
• Business analyst
• Data scientist
28
Infrastructure: Cloud
• A stack of services built on top of one another to enable on-demand network access to a
shared pool of configurable computing resources.
• Three categories,
• SaaS (Software as a Service) – applications are designed for end-users, delivered over the
web. (Office tools)
• PaaS (Platform as a Service) – set of tools and services designed to make development of
applications quick and efficient (databases, web servers)
• IaaS (Infrastructure as a Service) – underlying hardware and software (servers, storage,
networks, operating systems).
29
Cloud analytics
• Aims to make BI affordable and accessible to many organisations.
• Fast turnaround time for deployment.
• Cost-effective as it follows a PAYG model, limited capital investments and low TCO.
• Several challenges,
• Data privacy governance and security (Safe Harbor)
• Data integration (cloud with local)
• Limited control
30
In-memory databases (IMDB)
• Price of RAM (main memory) is gradually decreasing, so memory-intensive architectures
are primed to replace slow, mechanical spinning hard disk drives.
• Ideal for applications requiring high speed data-retrieval.
• large-scale online transactions or real-time forecasting and planning.
• An opportunity for the ERP giant but database newcomer:
• SAP Hana
• SAP Hana – four promises:
• Fast performance
• No disruption (integrating existing architecture)
• Simultaneous transactional and analytical applications
• New applications (unlike those on conventional databases)
31
In-memory databases (IMDB)
• RAM latency – 83 nanoseconds; disk latency – 13 milliseconds (1 millisecond =
1,000,000 nanoseconds) – see the quick comparison below.
• IBM, Microsoft, and Oracle are announcing their own in-memory capabilities, atop existing
technologies.
• IBM’s in-memory-based BLU Acceleration for DB2
• Microsoft In-Memory OLTP option for SQL Server 2014
• Oracle Database 12c and Exalytics
• BUT data volumes tend to outpace the growth of memory!
32
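A quick back-of-the-envelope check of the latency gap quoted above (83 ns for RAM vs 13 ms for disk):

```python
# Rough comparison of the latencies quoted on the slide.
ram_latency_ns = 83
disk_latency_ns = 13 * 1_000_000      # 13 ms expressed in nanoseconds

print(f"Disk is roughly {disk_latency_ns / ram_latency_ns:,.0f}x slower than RAM")
# -> roughly 156,627x
```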
Hadoop
• Scale-up vs. Scale-out
• Prevalence and access to cheaper commodity platforms
• Hadoop provides a processing architecture to solve Big Data problems on cheaper
commodity hardware with fast scalability and parallel processing.
• Simply put, a framework that provides a reliable shared storage and analysis system.
• The storage is provided by Hadoop Distributed File System (HDFS) and analysis by
MapReduce programming model.
• Two leading distributors of Hadoop with management tools and professional
services: CloudEra and HortonWorks
• Hadoop-based solutions also from IBM, Teradata, Oracle, Microsoft, HP, SAP, and
DELL in partnership with other providers and distributors.
33
Hadoop ecosystem
34
Data: Big data
• First defined in 2001 by Doug Laney, updated in 2012 by Gartner,
• “Big data is high-volume, -velocity and -variety information assets that demand cost-effective,
innovative forms of information processing for enhanced insight and decision
making”
• Composed of three parts:
• The 3Vs (volume, velocity, variety)
• cost-effective, innovative forms of information processing
• enhanced insight and decision making
35
Big Data volume
• Machine data – includes both usage and behaviour of the owners and detailed machine
activity logs.
• Sensors on building HVAC systems
• Sensors on automobiles
• Assembly line robots
• CT scanners, X-ray machines, body scanners at airports, hospitals
• Clickstream logs – usage statistics of a web page/site are captured in clickstream data.
• behaviour and usability analysis, market research
• Emails – available and auditable on a case-by-case basis
• insider trading, intellectual property, competitor analysis
• Contracts – many types: human resources, legal, vendor, supplier, customer
• Bankruptcy, mergers and acquisitions
36
Big Data velocity
• Traditional data analysis in batches, acquired over time.
• Split into fixed-size chunks and processed through different layers, and the end result is
stored in a warehouse for further use in reporting and analysis.
• Big data streams in a continuous manner and the result sets are useful when the
acquisition and processing delays are short.
• Clickstream data – deliver personalized browsing and shopping experiences.
• Amazon, Facebook, Yahoo and Google
• Mobile networks – performance of the network, each tower, the time of day and associated
geographies/demographics.
• Social media – size of the post, number of times it’s forwarded or shared and follow-on
data gathered.
37
Big Data variety
• Unpredictability of the input data format or the structure of the data.
• The processing complexity associated with a variety of formats depends on the availability
of appropriate metadata identifying what is contained in the actual data.
• Its absence leads to delays in producing outputs/insights.
• Text and images from social media
• CCTV recordings
• Call centre audio files
• A raw feed directly from a sensor source
38
Master data management
• A collection of technologies and processes to create and maintain consistent and accurate
data throughout the organisation.
• Important with the proliferation of data sources both on-site and in the cloud.
• Informatica, SAS and Stibo provide robust MDM solutions.
• Adoption can be challenging,
• Requires cultural buy-in – collaboration between business & IT
• Justifying the business case for MDM
• Managing project scope and risk of failure
39
Master data management
• What is master data?
• Critical nouns of a business; four groupings: people, things, places, and concepts.
• People: customer, employee and salesperson
• Things: product, part, store and asset
• Concepts: contract, warranty and licenses
• Places: office locations and geographic divisions
• Customer may be further segmented – based on incentives and history, normal, premiere
and executive customers.
• Product may be further segmented by sector and industry.
40
Master data management
• Needs to be error-free as master data is used in multiple applications
• How it’s done,
• Identify master data sources
• Identify master data consumers and producers
• Appoint data stewards
• Implement a data governance program
• Develop a model on a chosen toolset
41
Technology: Operational BI/BA
• BI/BA to the masses
• Supported by self-service BI/BA
• Defined by the requirements,
• Lower data latency
• Higher data selectivity
• High load of query concurrency than traditional analytic workloads.
• Opportunities to,
• Manage business activities as they occur
• Improve customer relations
• Increase business efficiency
42
Agile BI/BA
• Addresses the need for accelerated delivery of business value from BI/BA projects.
• Include technology deployment options such as self-service BI, cloud-based BI, and data
discovery dashboards
• Tableau, Qlikview, Spotfire
• Cognos Insight and Excel’s PowerPivot
• Elements of agile BI/BA,
• Agile development methods – scrum and XP
• Agile project management – continuous planning, execution, and feedback loop
• Agile infrastructure – cloud BI/BA and desktop BI/BA
43
Mobile BI/BA
• Microsoft Windows Surface is transforming the mobility factor beyond what the iPad
managed.
• Mobile BI adoption is behind the curve compared with other enterprise mobile applications.
• Growth of vendors entirely on mobile BI – http://roambi.com/
• Key challenge is to replicate desktop experience for navigating dashboards and guided
analytics.
44
MapReduce
• Store the data in a distributed environment
• Send the program to the data (at distributed locations)
45
MapReduce
• A programming model for processing large datasets.
• Runs on the Hadoop ecosystem.
• Splits processing into two phases – Map and Reduce
• Map: initial ingestion and transformation step, in which individual input records can be
processed in parallel.
• Reduce: aggregation or summarization step, in which all associated records must be
processed together by a single entity.
• Each phase has key-value pairs (KVP) as input and output.
• KVP – is simply a set of two linked data items, a key (identifier) and corresponding
value, expressed as tuples.
• (key1,value1) e.g. (colour,red) (age,21)
• Each phase has a function: the map function and the reduce function (see the word-count sketch below).
46
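A minimal in-process sketch of the two phases for the classic word-count example. This only simulates the model (map, shuffle/group by key, reduce) in plain Python; a real job would run over HDFS via the Hadoop framework, and all names and inputs here are illustrative.

```python
# Minimal simulation of the MapReduce model: word count.
from collections import defaultdict

documents = ["big data is big", "data drives decisions"]

def map_fn(doc):
    # Map: each input record is processed independently, emitting (key, value) pairs.
    for word in doc.split():
        yield (word, 1)

# Shuffle: group all values by key (done by the framework in a real Hadoop job).
grouped = defaultdict(list)
for doc in documents:
    for key, value in map_fn(doc):
        grouped[key].append(value)

def reduce_fn(key, values):
    # Reduce: all values for a single key are summarised together.
    return key, sum(values)

counts = dict(reduce_fn(k, v) for k, v in grouped.items())
print(counts)   # {'big': 2, 'data': 2, 'is': 1, 'drives': 1, 'decisions': 1}
```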
MapReduce, diagrammatically
47
Columnar (NoSQL) databases
• Limitations of RDBMS –
• Scalability
• RDBMS can only be effectively scaled-up by running on more powerful and expensive
machines.
• Scale-out, i.e. distributing across multiple servers, is not well handled by relational databases
• Difficulties in joining tables across a distributed system
• Difficulties in maintaining the ACID properties (ACID?)
• Restricted representations
• RDBMS requires all data be converted into tables.
• SQL querying was developed for structured data.
48
Limitations – an example
49
Limitations – an example
• Over time, new attributes – street address and food preferences. Null values for
these attributes in the existing records.
50
Limitations – an example
• New attributes will evolve – need to store each such version
51
3D Excel spreadsheet
CAP theorem
• Applies to any distributed data environment (originally developed for web services)
• States three core requirements need to be considered when designing and deploying
applications in a distributed environment, but can only guarantee two of the three in the
actual system.
• Some minor clarifications in 2012 by Brewer himself.
• CAP (consistency, availability, and partition tolerance)
• Consistency – all data available at all nodes or systems
• Availability – every request will receive a response
• Partition tolerance – the system will operate irrespective of availability or a partition or loss of
data or communication
• Systems architected on this theorem referred to as BASE (basically available soft state
eventually consistent) architecture as opposed to ACID.
52
NoSQL implementations
• Combining the principles of the CAP theorem and the data architecture of BigTable or Dynamo,
many solutions have evolved:
• HBase, MongoDB, Riak, Voldemort, Neo4J, Cassandra, HyperTable, HyperGraphDB,
Memcached, Tokyo Cabinet, Redis, CouchDB
• The more popular,
• HBase, HyperTable, and BigTable, which are architected on CP (from CAP).
• Cassandra, Dynamo, and Voldemort, which are architected on AP (from CAP).
53
Limitations – an example
• New attributes will evolve – need to store each such version
54
3D Excel spreadsheet
Addressing the limitation
• Features unique to a NoSQL database:
• Define column-families and not columns.
• Column-family is a set of columns grouped together into a bundle.
• Columns in a column-family are logically related to each other.
• This example has three column-families, any suggestions?
55
Addressing the limitation
• Each row of a column-oriented database table stores data values in only those columns for
which it has valid values.
• Null values are not stored at all (see the sketch below).
56
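A hypothetical sketch of this sparse, column-family layout: each row stores only the columns it actually has values for, grouped into column-families, and no nulls are stored. The three families used here (identity, contact, preferences) are one possible answer to the earlier question and are purely illustrative.

```python
# Hypothetical column-family layout: row key -> {column_family: {column: value}}.
# Only columns with valid values are stored; missing attributes are simply absent.
users = {
    "user:1001": {
        "identity": {"name": "A. Smith"},
        "contact":  {"email": "a@example.com"},
        # no 'preferences' family yet - nothing is stored for it
    },
    "user:1002": {
        "identity":    {"name": "B. Jones"},
        "contact":     {"email": "b@example.com", "street": "1 Main St"},
        "preferences": {"food": "vegetarian"},   # attribute added later, only where it exists
    },
}

# Reading a column that a row does not have returns nothing, not NULL.
street = users["user:1001"].get("contact", {}).get("street")
print(street)   # None - the value was never stored
```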
Addressing the limitation
• In physical storage, data isn’t stored as a single table but is stored by column-families.
• A single table often spans multiple machines.
57
Types of NoSQL databases
• Key-value pairs – implemented using a hash table, each entry consists of a unique
key and a pointer to a particular item of data creating a key-value pair; Voldemort.
• Column family stores – An extension of the key-value architecture with columns and
column families, the overall goal was to process distributed data over a pool of
infrastructure; HBase and Cassandra.
• Document databases – modelled after Lotus Notes and similar to key value stores.
The data is stored as a document and is represented in JSON or XML formats. The
biggest design feature is the flexibility to list multiple levels of key-value pairs; Riak
and CouchDB.
• Graph databases – based on graph theory, this class of database supports scalability
across a cluster of machines; Neo4J.
58
Business analyst
• Domain experts – know the business inside out.
• Strong appreciation for the business value of data.
• A data modeller – understand how the data best meets decision-making needs.
• An end-user – limited technology skills, most often makes use of pre-built data structures
and templates.
• BICC (Business Intelligence Competency Centre)
59
Data scientist
• Sexiest job of the 21st century!?
• Or ‘business analyst’ in the US?
• Scientists because they ‘discover’ insights from data rather than simply reporting the data.
• See beyond technology limitations – use the cloud/Hadoop instead of waiting for the IT
department.
• A timeless unicorn – array of skills in data management, analytics, computer science,
statistics and business savvy.
• Usually three roles – data admin/modeller, business analyst, technology expert.
60
References
• Data Science for Business, Foster Provost and Tom Fawcett, 1st ed.
• Competing on Analytics: The New Science of Winning, Thomas H. Davenport, 1st Ed.
• Analytics at Work: Smarter Decisions, Better Results, Thomas H. Davenport, 1st Ed.
• The Value of Business Analytics: Identifying the Path to Profitability, Evan Stubbs, 1st Ed.
61
