Data Expo 2009 - Airline on-time Performance Analysis

Photo by Ethan McArthur on Unsplash

Data File Explanation

Table 1: Flight Data (flight_data.csv)
Table 2: Carrier Data / Airlines (carriers.csv)
Table 3: Airport Data (airports.csv)
Table 4: Aircraft / Plane Data (plane_data.csv)
Figure 1: Entity Relationship

Data Import To Hadoop File System and HIVE database

hdfs dfs -mkdir flight_data
hdfs dfs -put flight_2008.csv flight_data
Figure 2: HIVE UI Query Editor
hdfs dfs -mkdir flight_data
hdfs dfs -put flight_2008.csv flight_data
hdfs dfs -ls flight_data
CREATE EXTERNAL TABLE IF NOT EXISTS Flight_Data (
Year STRING, Month STRING, DayofMonth STRING, DayOfWeek STRING,
DepTime STRING,CRSDepTime STRING,
ArrTime STRING, CRSArrTime STRING,
UniqueCarrier STRING,FlightNum STRING,
TailNum STRING, ActualElapsedTime INT,
CRSElapsedTime INT,AirTime INT,
ArrDelay INT, DepDelay INT,
Origin STRING, Dest STRING,
Distance INT, TaxiIn INT,
TaxiOut INT, Cancelled STRING,
CancellationCode STRING, Diverted STRING,
CarrierDelay INT, WeatherDelay INT,
NASDelay INT, SecurityDelay INT,
LateAircraftDelay INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’
STORED AS TEXTFILE LOCATION ‘hdfs://localhost.localdomain:8020/user/cloudera/flight_data
tblproperties(“skip.header.line.count”=”1");
hdfs dfs -mkdir carriers
hdfs dfs -put carriers.csv carriers
hdfs dfs -ls carriers
CREATE EXTERNAL TABLE IF NOT EXISTS Carriers (
Code STRING, Description STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘,’
STORED AS TEXTFILE LOCATION ‘hdfs://localhost.localdomain:8020/user/cloudera/carriers’
tblproperties(“skip.header.line.count”=”1");
hdfs dfs -mkdir airports
hdfs dfs -put airports.csv airports
hdfs dfs -ls airports
CREATE EXTERNAL TABLE IF NOT EXISTS Airports (
iata STRING, airport STRING, city STRING, state STRING,
country STRING, lat STRING, longs STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘,’
STORED AS TEXTFILE LOCATION ‘hdfs://localhost.localdomain:8020/user/cloudera/airports’
tblproperties(“skip.header.line.count”=”1");
hdfs dfs -mkdir planes
hdfs dfs -put plane-data.csv planes
hdfs dfs -ls planes
CREATE EXTERNAL TABLE IF NOT EXISTS Planes (
tailnum STRING, type STRING, manufacturer STRING,
issue_date STRING, model STRING, status STRING,
aircraft_type STRING, engine_type STRING, year INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY “,” ESCAPED BY ‘\\’
STORED AS TEXTFILE LOCATION ‘hdfs://localhost.localdomain:8020/user/cloudera/planes’
tblproperties(“skip.header.line.count”=”1");
Figure 3: Meta Store
Photo by Carlos Muza on Unsplash

Data Exploration by Tableau

Figure 4: Entity Mapping in Tableau
Figure 5: Group Attribute from Month to Season
Figure 6: Create New Calculated Field for Percentage of Delay Flight
Figure 7: Create Bubble Chart to Show Delay Flight Information
SELECT CASE 
WHEN month in (12,1,2) THEN “Winter” WHEN month in (3,4,5) THEN “Spring”
WHEN month in (6,7,8) THEN “Summer” WHEN month in (9,10,11) THEN “Autumn”
END AS Season,
AVG(CASE WHEN arrdelay IS NULL THEN 0 ELSE arrdelay END) as avg_delay,
FROM flight_records GROUP BY CASE
WHEN month in (12,1,2) THEN “Winter” WHEN month in (3,4,5) THEN “Spring”
WHEN month in (6,7,8) THEN “Summer” WHEN month in (9,10,11) THEN “Autumn”
END
Figure 8: % of Arrival Delay by Day of Month
Figure 9: Heat map
SELECT 
CASE
WHEN dayofmonth <= 7 THEN “Week1”
WHEN dayofmonth >= 8 AND dayofmonth <= 14 THEN “Week2”
WHEN dayofmonth >= 15 AND dayofmonth <= 21 THEN “Week3”
WHEN dayofmonth >= 22 THEN “Week4”
END AS WeekOfTheMonth,
CASE
WHEN month in (12,1,2) THEN “Winter”
WHEN month in (3,4,5) THEN “Spring”
WHEN month in (6,7,8) THEN “Summer”
WHEN month in (9,10,11) THEN “Autumn”
END AS Season,
AVG(CASE WHEN arrdelay IS NULL THEN 0 ELSE arrdelay END) as avg_delay,
(SUM(CASE WHEN arrdelay IS NOT NULL AND arrdelay > 0 THEN 1 ELSE 0 END) / COUNT(*)) as perc_delay,
COUNT(*) AS ttl_flight
FROM flight_records WHERE dayofmonth IS NOT NULL GROUP BY
CASE
WHEN dayofmonth <= 7 THEN “Week1”
WHEN dayofmonth >= 8 AND dayofmonth <= 14 THEN “Week2”
WHEN dayofmonth >= 15 AND dayofmonth <= 21 THEN “Week3”
WHEN dayofmonth >= 22 THEN “Week4”
END,
CASE
WHEN month in (12,1,2) THEN “Winter”
WHEN month in (3,4,5) THEN “Spring”
WHEN month in (6,7,8) THEN “Summer”
WHEN month in (9,10,11) THEN “Autumn”
END
Figure 10: Histogram for Day of Week Delayed Flights
SELECT dayofweek,
CASE
WHEN month in (12,1,2) THEN “Winter”
WHEN month in (3,4,5) THEN “Spring”
WHEN month in (6,7,8) THEN “Summer”
WHEN month in (9,10,11) THEN “Autumn”
END AS Season,
AVG(CASE WHEN arrdelay IS NULL THEN 0 ELSE arrdelay END) as avg_delay,
(SUM(CASE WHEN arrdelay IS NOT NULL AND arrdelay > 0 THEN 1 ELSE 0 END) / COUNT(*)) as perc_delay,
COUNT(*) AS ttl_flight FROM flight_records WHERE dayofmonth IS NOT NULL
GROUP BY dayofweek,
CASE
WHEN month in (12,1,2) THEN “Winter”
WHEN month in (3,4,5) THEN “Spring”
WHEN month in (6,7,8) THEN “Summer”
WHEN month in (9,10,11) THEN “Autumn”
END
Figure 11: Flight Delayed by Time in Autumn
Figure 12: Flight Delayed by Time in Spring
Figure 13: Flight Delayed by Time in Summer
Figure 14: Flight Delayed by Time in Winter
Figure 15: Scatter Plot Diagram for Flights
Figure 16: Show Number of Cancelled Flight on Map

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store