Assignment 4 focuses on the use of information technology to deliver cutting-edge information to users who depend on it and could not do their work effectively without it.
Tasks 1 and 2 focus on identifying possible warning signs of loan delinquency. This analysis would help creditors craft a careful lending technique: deciding to whom loans can be granted more freely and where lending is a riskier investment.
Task 3 uses Tableau to summarize crime statistics for different districts and display them in several ways so as to facilitate analysis by the users. The task focuses on presenting the same information from different perspectives in order to make it more meaningful for the user.
Focusing specifically on people who have not paid back their loans within a period of two years, we find certain patterns that help us understand which customers are at a higher risk of defaulting. The risk of default increases from age 21 to around age 50, after which it declines. Excluding anomalies, the average monthly incomes of clients who default and of those who do not are roughly the same across all ages. Delving further, and again excluding anomalies, we find that the debt ratio of most delinquents is about zero, which suggests they have no additional money to spare towards loan payments. This is a major distinguisher between potential delinquents and customers who will most likely not default on their payments.

No pattern could be found in the number of open credit lines of delinquents versus non-delinquents. Individuals between the ages of 25 and 45 tend to hold more risky unsecured loans than their older counterparts, though on the whole the same age group is also more likely to have a higher number of open credit lines. On average, most individuals do not have many unsecured open lines of credit.

Most delinquents, on analysis, turn out not to have many real estate loans. However, the proportion of real estate loans rises between the ages of 40 and 60, which calls for individual analysis to determine whether the customer will be able to meet future loan payments alongside their present commitments towards their real estate loans. There is also a clear pattern among delinquents who pay late: most who are more than 30 days late on a payment end up more than 90 days late on it. So if 30 days pass and the individual does not fulfill their loan commitment, it is highly likely that the payment will slip to 90 days or more.
Finally, individuals in their late thirties to early fifties tend to have the highest number of dependents, and the analysis shows that the more dependents an individual has, the more likely they are to default: most individuals who defaulted had more dependents than their counterparts.
A data warehouse is a database kept separate from an organization's operational database and built from data the company collects. The warehouse is secured, confined, and receives few or no updates. Warehousing consolidates the historical data the company hopes to analyse for business use. That said, the data contained within a warehouse varies from warehouse to warehouse. At times, as in this instance, the data is highly classified and contains people's personal and private information. Private data that has been given to an organization in trust requires the utmost security and privacy.
Firstly, the data warehouse must be kept separate from the company's operational database. The operational database is built for well-known, day-to-day tasks and workloads, with search, update, and index queries running daily, and it is constantly modified by different users with varying levels of access. It supports concurrent processing of transactions, and this concurrency control and recovery mechanism is absent in a data warehouse. Lastly and most importantly, an operational database allows both read and modify operations, while the OLAP model needs read-only access to stored data. If warehouse data were mingled with operational data, it would be potentially editable, and that editing could corrupt the quality of the data in the warehouse.
The data warehouse will need the features that all good data warehouses share. It must be subject oriented: it provides information about a particular subject rather than the company's ongoing operations, centring on the modelling and analysis of data for decision making rather than on operations. The data must be integrated from heterogeneous sources such as relational databases and flat files; this integration enables effective analysis and shorter query run times. The data must be time variant, identified with particular time periods, with the information provided largely from a historical perspective. Lastly, the data must be non-volatile: new data is added without erasing the old, so the quantity stored grows rapidly over time, eventually reaching big data scale. Operational databases do not offer this, as they have fixed storage and processing capacity and little need for outdated data, holding only a certain number of years' worth at a time.
The system will run the OLAP model as opposed to OLTP, since it runs historical queries and resides in a standalone system separate from the operational database. It will be based on a star or snowflake schema rather than the entity-relationship model and will contain historical client data.
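A star schema of the kind described can be sketched with SQLite's in-memory engine: one fact table of loan events joined to date and client dimensions. The table and column names (fact_loan, dim_date, dim_client) are illustrative assumptions, not the warehouse's actual design.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
-- Dimension tables: descriptive attributes, one row per member.
CREATE TABLE dim_date   (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_client (client_id INTEGER PRIMARY KEY, age INTEGER, region TEXT);

-- Fact table: one row per loan, keyed into the dimensions.
CREATE TABLE fact_loan (
    loan_id   INTEGER PRIMARY KEY,
    date_id   INTEGER REFERENCES dim_date(date_id),
    client_id INTEGER REFERENCES dim_client(client_id),
    amount    REAL,
    defaulted INTEGER
);

INSERT INTO dim_date   VALUES (1, 2013, 1), (2, 2014, 1);
INSERT INTO dim_client VALUES (1, 34, 'north'), (2, 58, 'south');
INSERT INTO fact_loan  VALUES (1, 1, 1, 5000, 1), (2, 1, 2, 9000, 0),
                              (3, 2, 1, 3000, 1);
""")

# A typical OLAP-style rollup: total lending and defaults per year.
rows = cur.execute("""
    SELECT d.year, SUM(f.amount), SUM(f.defaulted)
    FROM fact_loan f JOIN dim_date d ON f.date_id = d.date_id
    GROUP BY d.year ORDER BY d.year
""").fetchall()
print(rows)
```

A snowflake variant would simply normalize the dimensions further (e.g. splitting region out of dim_client into its own table).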
The ROLAP model is superior in the sense that it provides easier access and connectivity to the RDBMS, and it allows more efficient storage of data since no zero facts are stored. The system will not use any precalculated data cubes, and the DSS server of MicroStrategy, which adopts the ROLAP approach, will be integrated. This will lead to slightly slower query performance but significantly better system security.
For the purpose of security and efficiency, the architecture will consist of three tiers. The bottom tier is the database server: its relational system and the back-end tools that feed data into the system, performing the ETL and refresh functions. The middle tier is the OLAP server, either a relational model that maps multidimensional data to standard relational operations, or a multidimensional model that stores the data and operations directly. The top tier is the front-end client layer, where only query-running tools and reporting and analysis systems reside, with data mining available as an option integrated within the system.
The database will have a backup and recovery manager with contingency plans for natural disasters or system malfunction; not only the software side of the data but the hardware itself shall be protected. The system will sit behind a firewall with a dedicated DMZ where data is processed and queries are run, rather than inside the data layers behind the firewall. Behind the firewall, only data shall reside, remaining constant while systems scan continually for anomalies. Lastly, all data shall be encrypted during transfers between the firewall and the DMZ, and the network shall choose the routes data can take based on restrictions designed for better security layering. Encryption and decryption will be optimized so that the overhead on query run times stays minimal.
In 2011 it was estimated that the quantity of data produced globally would surpass 1.8 zettabytes; by 2013 that figure had grown to 4 zettabytes. While it is easy to be seduced by the many wonders of big data, it is not without its pitfalls and potential for harm. A number of new threats and challenges are emerging with big data, ranging from marginalisation, discrimination, and lack of transparency to the constant and blatant breach of privacy. Privacy concerns have led big data analysts to adopt de-identification, a process in which data is anonymized by removing personally identifiable information (PII). This serves as a way of justifying the mass collection and use of personal data, yet no legislation exists to stop such practices or to regulate and govern them. Many scholars remain sceptical of the protection that removing PII brings; according to Tene and Polonetsky, 'once data-such as a clickstream or a cookie number are linked to an individual they become difficult to disentangle.'
Secondly, the privacy and data protection frameworks based on the traditional privacy principles are obsolete. Unlike other research, where data is sampled to identify and target particular types of data sets, the big data model seeks to gather as much data as possible in order to achieve greater resolution of the phenomenon being studied, a task made easier in recent years by the advent of the internet and social media. While everyone signs waivers and disclaimers are published, no one actually reads them, and so a new round of legislation that addresses the problem of privacy needs to happen at the earliest possible date.
If we were using Excel to find the most and least committed crimes, we would first have to calculate the frequency of each offense and then tabulate the results. With Tableau, using the top-five and bottom-five filter functions, we can surface the most and least committed crimes with a click of a button.
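The frequency counting that Tableau's top/bottom filters automate can also be reproduced in a few lines of pandas. The offense names below are placeholders, not drawn from the police department's real extract.

```python
import pandas as pd

# One entry per recorded incident (placeholder categories).
incidents = pd.Series(
    ["theft"] * 6 + ["assault"] * 5 + ["burglary"] * 4 +
    ["fraud"] * 3 + ["vandalism"] * 2 + ["treason"] * 1
)

counts = incidents.value_counts()   # frequency table, sorted descending
top_five = counts.head(5)           # most committed crimes
bottom_five = counts.tail(5)        # least committed crimes

print(top_five.index.tolist())
```

This is the manual "calculate the frequency, then tabulate" workflow the Excel approach would require, collapsed into one call.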
The dashboard in question answers all the queries the police department has, and more. It shows the regions with the highest and lowest crime rates.
Sheet 1 shows the ten most frequent offenses across the city per year, the last of which is suspicious occupation; sheet 2 shows the ten least frequent offenses per year, the last of which is treason.
Sheet 3 shows all offenses across all ten districts per year. Sheet 4, below, lists the number of crimes per year by district, with Bayview having the highest number of crimes among all districts and Tenderloin the lowest.
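The per-district-per-year view of sheet 4 corresponds to a cross-tabulation. A minimal sketch in pandas, assuming illustrative sample rows rather than the full incident data:

```python
import pandas as pd

# One row per incident (illustrative sample, not the real extract).
df = pd.DataFrame({
    "district": ["Bayview", "Bayview", "Bayview", "Bayview",
                 "Mission", "Mission", "Mission", "Tenderloin"],
    "year":     [2013, 2013, 2014, 2014, 2013, 2014, 2014, 2013],
})

# Crimes per district per year, districts as rows and years as columns.
per_district_year = pd.crosstab(df["district"], df["year"])

# District totals across all years, highest first.
totals = per_district_year.sum(axis=1).sort_values(ascending=False)

print(per_district_year)
print(totals)
```

Ranking the totals reproduces the sheet 4 comparison of highest- and lowest-crime districts.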
Sheet 1 (2) is a heat map of crimes in the city, paged by district. It displays the selected district and draws circles sized by the number of crimes. This was accomplished by integrating Mapview with Tableau. The user can select which crime to display on the map, and circles for that crime appear in the selected district accordingly. This is an example of how two applications can be integrated to deliver cutting-edge results, solving problems effectively and efficiently without wasting time and resources, and it gives analysis teams what they need to respond quickly and make fast decisions in a world of rapid change.
Using the data in these sheets, three dashboards were constructed.
The first dashboard shows the most frequent crimes (sheet 1) and the least frequent crimes (sheet 2), with crimes sorted by district in the middle.
The third dashboard is of vital importance and is a compilation of sheets 7, 9, and 4 (shown above). Sheet 7 shows a pie chart of crimes per district across the decade; sheet 9 shows the pattern of crime types per year per district; and sheet 4 lists the number of crimes per year by district. This is shown below:
These tools help police departments analyze the crime situation. The department needs up-to-date information at all times and must be able to analyze it effectively and meaningfully: to decide where to deploy which officers, and to plan the necessary backup for different scenarios according to the likelihood of each occurring, which can be read off the data once the software turns it into meaningful information viewed from various angles.

This is just one example of how technology matters in today's world of constant change and rapid responses. Business intelligence is used in many other ways to aid departments that need statistical analysis to determine policy, find patterns, and plan their operations, so that they can do their jobs more effectively and efficiently without wasting time and resources where it is futile to employ them. Business intelligence today supports smart management techniques and helps an organization figure out what its next steps should be in order to be optimally effective.