Executive summary
This project addresses MarginFuel’s lack of a standardised business process for the
acquisition and implementation of new and updated Microsoft Azure features. The end goal of
the overall project is a business process for evaluating Microsoft Azure features, enabling
MarginFuel to reduce consulting costs and to take steps towards becoming a more self-reliant
and proactive organisation. Within this report, we examine the existing analysis methods that
allow the business to run and the steps of the analytics pipeline they could involve.
1.1 History behind project
MarginFuel provides SaaS price optimisation products for vehicle rental companies. These
products combine an automated dynamic market pricing model with proprietary AI forecasting to
deliver critical information that aims to maximise revenue at any given time. The company
uses the Microsoft Azure platform to generate its products via predictive modelling,
forecasting algorithms, and machine learning.
To stay at the cutting edge of technology and ensure that they are able to provide the best
possible results for their customers, MarginFuel seeks a standardised methodology for
assessing and researching new Azure features and upgrades. This will ensure that they are
using the best possible methods to produce the best results for their clients, while minimising
time and cost associated with seeking solutions from outside consultants.
1.2 Goals and benefits
MarginFuel seeks a standardised methodology for researching and assessing new and updated
Microsoft Azure features. The new MS Azure feature evaluation process is a business process
based upon assessment and filtering. This process will capture new-release features, as well
as new-to-business features, and run them through a consistent process to determine their
viability for implementation as part of core business processes. This process will reduce
the need for external consultancy and is a feasible fit with current resource constraints
(time, money, manpower), as only features that are perceived to be beneficial will proceed
through the testing phase. By understanding the specifics of how this may currently happen
in the background, our product will take a more proactive and replicable approach by
ensuring all features are captured and evaluated for compatibility.
1.3 Project sponsor and stakeholders
Andrew Pascoe, CEO
Klaus Borngraber, DBA and SQL Developer
Based on interviews with both key project sponsors, Andrew and Klaus, it is clear that
both are heavily invested in achieving the overall objectives of the project. Andrew is
primarily invested in ensuring that the overall strategic goals are met, especially in
regard to approaching the problem with the disruption of implementation in mind, including
the level of disruption to the current operation and strategy as a whole. Aside from this,
Andrew also spearheads the approvals for the project and ensures that critical success
factors are met.
Klaus is primarily driven toward improvement in the function that he is responsible for,
the management of the Azure platforms. As a project sponsor, Klaus has a clear
understanding of the key requirements that need to be addressed in order to form a clear
picture of the problem at hand, and has first-hand experience of working with new and
unexplored features on Azure with which to critique progress. Klaus is also responsible for
ensuring that the problems identified align with the problems that affect MarginFuel’s
operations on a daily basis, rather than focusing on potential future problems or
opportunities.
The starting point in stakeholder identification was the organisational chart provided by
MarginFuel, which aided us in identifying internal stakeholders. We have identified the
following stakeholders:
• Internal Stakeholders: Investors, Board, Executive Team, BA, Core Team
• External Stakeholders: Clients, Consumers
The analytics pipeline I have chosen to adopt to meet the goals of this project is
knowledge discovery in databases (KDD). This is a process that includes data
selection and preparation, data cleansing, incorporating prior knowledge of the data
sets, and interpreting accurate solutions from the observed results. This is particularly
relevant for MarginFuel as a client because much of the data we will be using comes
from various sources, and the process helps ensure that data quality is high and
reliable for making decisions.
Step 1: Goal identification
The first step of KDD is goal identification from the customer’s perspective. As
part of this project, I have decided to limit the scope to a fairly small scenario with an
unusually small but representative database. This was created in order to test a
hypothesis whose conclusion I already know, so as to better understand the
process and to see whether existing data can be useful for future predictions. The key
areas I have decided to focus on are the changes in the number of travellers entering
New Zealand and the number of nights of motel stays in New Zealand. (I have
selected motels in particular, as they most accurately reflect travellers who travel by
car.) I will compare these two variables with historic data on rental car registrations.
By identifying the strength of the relationships between these three variables, we can
make predictions for rental car prices.
Step 2: Selection
With the goal in mind, the next step of KDD is to understand the application domains
involved and the knowledge that is required, and to select a target data set or subset of
data samples on which discovery is to be performed.
Application domain 1:
Tourism -> International travel and migration.
International travel and migration statistics give the number of overseas visitors, New
Zealand resident travellers, and permanent and long-term migrants entering or leaving
New Zealand.
Variables found: Average number in New Zealand each day by travel purpose and
country. (see appendix 2, 4, 8)
Application domain 2:
Tourism -> Accommodation survey
The Accommodation Survey provides information about short-term commercial
accommodation activity at national, regional, and lower levels. Statistics include
guest night numbers, capacity, and occupancy rates.
Variables found: Number of occupants in different accommodation types by region.
(see appendix 1, 5, 6, 7)
Application domain 3:
Historical data rental car sales. Statistics regarding currently registered rental cars
across New Zealand.
Variables found: Currently licensed rental cars.
(see appendix 3, 11)
Step 3: Preprocessing
Next, I set out to cleanse and preprocess the data by deciding on strategies to handle
missing fields and altering the data as required. The key targets for elimination are
incomplete, noisy, and inconsistent data.
The key steps performed on the datasets at hand are as follows:
Data preprocessing

Dataset: 1 Arrivals
Pre-processing technique: Removal of countries where licence holders cannot drive in NZ.
Reason: Incomplete data. Remove incomplete data. The number of people eligible to drive
rental cars in NZ may not be congruent with the arrivals number if licence holders of
certain countries are not allowed to drive in NZ. This is not the case, as any valid
licence holder is able to drive in NZ.
Assumptions: All arriving tourists with valid licences can rent a car.

Dataset: 1. Historic data rental cars
Pre-processing technique: Rental car prices weekly, change to monthly.
Reason: Inconsistent data. Since all other data that has been found is comparable on a
monthly basis, the rental car prices should be averaged from a weekly basis to a monthly
one.
Assumptions: Impact on [...] is minimal. Any outliers are dealt with [...]

Dataset: Accom.
Pre-processing technique: Choose top 20+ regions to the main [...]
Reason: Inconsistent data. As rental car prices are determined as per three transport
hubs, accommodation providers across the country are sorted by the nearest to these three
transport hubs.
Assumptions: [...] across the entire country/returning cars to another [...]
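The weekly-to-monthly averaging described for the rental car prices can be sketched as follows. This is a minimal illustration in Python, not the procedure actually used in the project: the function name monthly_average, the rule of assigning each week to the month in which it starts, and all price values are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_average(weekly_prices):
    """Average weekly (week_start, price) observations into one value per month."""
    by_month = defaultdict(list)
    for week_start, price in weekly_prices:
        # Assumption: a week belongs to the month in which it starts.
        by_month[(week_start.year, week_start.month)].append(price)
    # Return months in chronological order, each mapped to its mean price.
    return {month: mean(prices) for month, prices in sorted(by_month.items())}


# Hypothetical weekly prices (illustrative values only, not project data).
weekly = [
    (date(2017, 1, 2), 60.0),
    (date(2017, 1, 9), 64.0),
    (date(2017, 1, 16), 62.0),
    (date(2017, 1, 23), 66.0),
    (date(2017, 2, 6), 70.0),
    (date(2017, 2, 13), 74.0),
]
monthly = monthly_average(weekly)
print(monthly)
```

A week that straddles two months is counted entirely in its starting month here; a different convention would shift the monthly averages slightly.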
Step 4: Transformation
Next, the selected and preprocessed data sets are transformed by removing unwanted
variables. The visualisation of the end solution was also considered, and the
transformation focused on the aspects needed to ensure that those goals could be reached
by the end.
Data preprocessing (continued)

Dataset: General
Pre-processing technique: Per month basis
Reason: Inconsistencies. All datasets are processed to be comparable in monthly segments.
All data is drawn from the 2017 calendar year, from January to December.

Data transformation

Dataset: 1 Arrivals
Data transformation technique: Removal of countries outside of top 15 tourism partners
for [...]
Reason: Remove noise. The dataset had an accumulation of over 260 countries, most of which
had meagre numbers of tourists.
Assumptions: Since the primary purpose of the project is to distinguish trends, an accurate
subset of data can be obtained from looking at numbers from the top 15 countries for
arrivals into NZ.

Dataset: 1 Arrivals
Data transformation technique: Purpose of travel refined. [...] education, holiday/vacation,
unspecified, visit family or relatives. Combine [...]
Reason: The purpose of arriving tourists in NZ can be pinpointed to their primary purpose
of travel. In order to garner a more representative dataset, all purposes of travel other
than “holiday/vacation” or “visiting family or relatives” were removed completely from the
dataset.
Assumptions: People travelling for work, conventions, etc. on a short-term visa are
unlikely to rent cars to travel around the country.
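The two arrivals transformations described above (keeping only holiday and family-visit travel, and keeping only the largest source countries) could be expressed along these lines. This is a sketch under stated assumptions: the record layout, the function name filter_arrivals, ranking countries by their total retained arrivals, and the order in which the two filters are applied are all illustrative choices, and the sample uses placeholder countries with top_n=2 only to keep it short.

```python
from collections import Counter

# Purposes of travel retained by the transformation step.
KEPT_PURPOSES = {"holiday/vacation", "visiting family or relatives"}


def filter_arrivals(records, top_n=15):
    """Drop other travel purposes, then keep the top_n countries by total arrivals."""
    kept = [r for r in records if r["purpose"].lower() in KEPT_PURPOSES]
    totals = Counter()
    for r in kept:
        totals[r["country"]] += r["arrivals"]
    # Assumption: "top" countries are ranked by total retained arrivals.
    top_countries = {country for country, _ in totals.most_common(top_n)}
    return [r for r in kept if r["country"] in top_countries]


# Hypothetical records (placeholder countries and counts, not project data).
records = [
    {"country": "Country A", "purpose": "holiday/vacation", "arrivals": 100},
    {"country": "Country A", "purpose": "business", "arrivals": 500},
    {"country": "Country B", "purpose": "visiting family or relatives", "arrivals": 80},
    {"country": "Country C", "purpose": "holiday/vacation", "arrivals": 5},
    {"country": "Country B", "purpose": "holiday/vacation", "arrivals": 20},
]
filtered = filter_arrivals(records, top_n=2)
print(filtered)
```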
Step 5: Data mining
Next in the analytics pipeline, we want to match the KDD goals with data mining
methods to suggest hidden patterns. This process includes deciding which models and
parameters might be appropriate for the overall KDD process. With the goal of
identifying the relationships between the three variables in mind, I first set out to
perform some basic descriptive analysis on the transformed data we have on hand (see
appendix 0).
Data transformation (continued)

Dataset: 1 Arrivals
Data transformation technique: Rental car multiple styles, choose one.
Reason: Prices differ on rental cars based on the style and purpose of the car.
Assumptions: For the purpose of the project, only mid-sized coupes are referred to.

Dataset: Accommodation survey
Data transformation technique: Occupancy Percent. Guest arrivals, Guest nights [...]
Reason: The decision to completely remove incomplete variables that provide no value
to [...]

Dataset: Accommodation survey
Data transformation technique: Creation of number of stays figure. Total nights figure /
average nights stayed.
Reason: The data that we are able to find does not provide a figure for the number of
stays registered, rather a nights booked figure within motels. For the sake of
comparability to the number of rental car sales, the number of individual stays is
more [...]
Assumptions: Few outliers exist.
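The derived "number of stays" figure is the total nights figure divided by the average nights stayed. A minimal sketch of that calculation is shown here; the function name estimated_stays and the input values are hypothetical and are not taken from the project data.

```python
def estimated_stays(total_guest_nights, average_nights_per_stay):
    """Estimate the number of individual stays from total nights and average stay length."""
    if average_nights_per_stay <= 0:
        raise ValueError("average_nights_per_stay must be positive")
    return total_guest_nights / average_nights_per_stay


# Hypothetical month: 300,000 guest nights at an average of 2 nights per stay.
stays = estimated_stays(300_000, 2.0)
print(stays)
```

Because the result depends on an average, a few unusually long stays would distort it, which is why the assumption that few outliers exist matters.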
Descriptive statistics

                      Arrivals into NZ    Number of motel stays    Cars registered
Min                   135396              111003                   33532
Max                   281520              286439                   41347
Median                202720              187290                   34303
Mean                  217953              189576                   35559
Standard deviation    71082               65782
This information already raises questions about the correlation between these three
variables: whether a relationship exists between them, whether it is positive or negative,
and how strong it is.
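Summary figures of this kind (minimum, maximum, median, mean, and standard deviation) can be produced with Python's standard statistics module, as sketched here. The function name describe and the sample values are hypothetical. The sketch uses the sample standard deviation; whether the figures reported above are sample or population values is not stated, so that choice is an assumption.

```python
from statistics import mean, median, stdev


def describe(values):
    """Return the five summary statistics used in the descriptive table."""
    return {
        "min": min(values),
        "max": max(values),
        "median": median(values),
        "mean": mean(values),
        # Sample standard deviation (n - 1 denominator); an assumption here.
        "stdev": stdev(values),
    }


# Hypothetical monthly counts (illustrative only, not project data).
sample = [10, 12, 14, 20, 24]
summary = describe(sample)
print(summary)
```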
The above scatter plot shows the positive linear relationship between the number of arrivals
from the top 15 tourism partners of New Zealand and the number of motel stays, month by
month over a one-year period. This positive relationship is consistent with our hypothesis
that changes in tourist arrivals have an impact on motel stays, which in turn suggests an
increase in demand for rental cars and therefore in their prices. A basic correlation
analysis returned a result of 0.852, which suggests a strong positive relationship between
the two. With an R-squared coefficient of 0.725, the relationship appears to have
reasonable predictive power.
(fig. 2)
With this in mind, I was curious to see whether rental car companies have already caught
on to this trend, based on historic data on the times at which their vehicles were
registered. With the transformed data on the number of registered rental cars in different
months of the year in hand, we are able to generate another scatter plot to show the
relationship between arriving tourists and registered vehicles, again month by month over a
one-year period for consistency. There is a clear positive relationship between the two,
which suggests that rental car companies are already aware of the influx of arriving
tourists and the increase in demand for rental cars in specific months of the year. A basic
correlation analysis returns a result of 0.774, which suggests a strong positive
relationship between the two variables. With an R-squared coefficient of 0.599, some
predictive power exists between the two variables.
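For a relationship between two variables, the R-squared value of a simple linear fit is the square of the Pearson correlation coefficient, which agrees with the reported pairs up to rounding (0.852 and 0.725, 0.774 and 0.599). The sketch below computes both from paired monthly values. The function name pearson_r and the data are hypothetical, and this is not necessarily the tool used to obtain the figures in this report.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical paired observations (illustrative only, not project data).
arrivals = [1, 2, 3, 4, 5]
motel_stays = [2, 4, 5, 4, 5]
r = pearson_r(arrivals, motel_stays)
r_squared = r ** 2  # share of variance explained by a simple linear fit
print(round(r, 3), round(r_squared, 3))
```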
Step 6: Interpretation and evaluation
Finally, we want to search for patterns of interest in a particular representational
form, which includes classification rules or trees, regression, and clustering, and to
interpret essential knowledge from the mined patterns.
The information uncovered so far, which we could not easily have discovered before
running these descriptive statistics, raises the question of which particular variables are
the best predictors of the price changes needed to anticipate demand for rental cars,
especially regarding the proportional increases in the number of days they are required for
and in which months. Rather than following the expected path of performing a regression
analysis, network analysis seems a suitable visualisation method for this particular set of
information, which raises a lot of questions. I decided to look at the average nights stayed
at a motel over a period of 10 years and its relationship to each of the 12 months of the
year. I rounded the average nights to the nearest whole number.
I modelled a network with 15 nodes and 128 edges.
Using Gephi, here is what I uncovered:
(fig 3.)
• (Network) Diameter: 4. This measures the longest shortest-path distance between any
pair of nodes. 4 is a low number for a network of this size, which reflects the high level
of connectedness in the graph.
• Modularity: 0.317. This is an indicator of how strongly the graph divides into individual
sub-networks (communities). At 0.317, it is safe to say that these individual networks have
a moderate independent presence.
• (Graph) Density: 0.219. This shows how close a network is to being complete, with every
node connected to every other node. At 0.219, the current network is fairly well connected.
• Closeness centrality: Highest 0.737, Lowest 0.326. This statistic is based on the
average distance from a given node to all other nodes, or in other words, how quickly
something can spread within the network. For this particular network, the closeness
centrality is collectively fairly high.
• Betweenness centrality: Highest 41, Lowest 0. This statistic measures how
often a node appears on the shortest paths between other nodes in the network.
• Reciprocity: Irrelevant as an undirected network was modelled. However, this
illustrates the ratio of connections that go both ways as a part of all the connections
within the network.
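To make the definitions of density, diameter, and closeness centrality concrete, the sketch below computes them for a tiny undirected graph using breadth-first search. It is an illustration under stated assumptions and not a reproduction of the Gephi analysis: the four-node path graph is hypothetical, the graph is assumed to be connected with at least two nodes, and closeness is normalised as (n - 1) divided by the sum of distances, which is one common convention and may not be the exact one Gephi applied.

```python
from collections import deque


def bfs_distances(graph, source):
    """Shortest-path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def density(graph):
    """Share of possible undirected edges that are present: 2E / (N(N - 1))."""
    n = len(graph)
    edges = sum(len(neighbours) for neighbours in graph.values()) / 2
    return 2 * edges / (n * (n - 1))


def diameter(graph):
    """Longest shortest-path distance between any pair of nodes."""
    return max(max(bfs_distances(graph, s).values()) for s in graph)


def closeness(graph, node):
    """Normalised closeness: (N - 1) / sum of distances to all other nodes."""
    dist = bfs_distances(graph, node)
    return (len(graph) - 1) / sum(dist.values())


# Hypothetical path graph A - B - C - D stored as adjacency sets.
graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
print(density(graph), diameter(graph), closeness(graph, "B"))
```

In this convention a higher closeness value means a node is, on average, fewer steps away from the rest of the network.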
Overall, it was incredibly insightful to explore just how easily improvement within networks
can be seen by visualising the current state. The argument I want to make with the business
case is that the average number of days stayed in a motel should be closely related to the
month of the year and fluctuate accordingly. The days stayed in a motel can reflect the days
for which a customer rents a vehicle and therefore the demand for vehicles. Again, I
specifically chose motels as the primary source of data because they almost always reflect
holidaygoers who are driving. With a close network like this, and clusters around specific
months of the year, it can be seen that certain months of the year attract holidaygoers who
stay for longer periods of time. The previous correlation analysis already shows a positive
relationship between the number of visitors and the number of stays in motels, as well as
the vehicle registrations that rental car companies are paying for. In conclusion, it is
fair to say that the hypothesis is supported: arrivals into NZ, the capacity of and stays in
motels, and registered rental cars combined make a convincing case when predicting demand
for rental cars.
The data at hand has serious quality issues. Three main limitations prevented this project
from acting as an accurate reflection of what could go on in the analysis pipeline of
MarginFuel and from reaching the goals initially set out. Firstly, I refrained from using
any made-up data, which is why historical rental car pricing data is not used in this
project: it is unavailable, and recent prices would not be relevant for this particular
analysis. This eventually became a bottleneck, as it did not allow for actual price
predictions as a part of this project. Given that all the data used in this project is
sourced from StatisticsNZ, it was difficult to get exactly what I was looking for.
Secondly, with my very limited understanding of data crawling/scraping and data mining in
general, I was unsuccessful in extracting useful information that could allow for deeper
analysis, which I deeply regret.
Thirdly, I discovered that huge limitations exist in this field unless personal data is
scraped and pieced together from multiple sources, which I see as a breach of privacy and
unethical. As a result, multiple assumptions are made in the process of transforming and
cleansing data, especially where fields are incomplete. These assumptions can impact the
integrity of the data later on when it is used to make decisions, and decision makers should
be aware of these implications.
    3039 words total
File Name      Description
Appendix 0     Cleansed final data
Appendix 1     Accommodation Survey: International stays by
Appendix 2     Visitor arrivals: All countries by length of stay
Appendix 3     Registered motor vehicles: Rental cars all
Appendix 4     Visitor arrivals: All countries by purpose
Appendix 5     Accommodation Survey: Average length of
Appendix 6     Accommodation Survey: Stays by region
Appendix 7     Accommodation Survey: Capacity by region
Appendix 8     Visitor arrivals: By country and purpose
Appendix 9     Gephi generated network analysis
Appendix 10    Network analysis input table
Appendix 11    Registered motor vehicles: Rental cars by length of registration
