Homework 3: SQL
October 2003
Due: Sunday, November 2, 2003
(7pm)
To be done
individually!!
1.
Introduction
In this assignment, you will write SQL queries that answer questions about a database containing information about players, teams and games from the National Basketball Association (NBA). Yahoo Sports (http://sports.yahoo.com/nba/) and ESPN (http://sports.espn.go.com/nba/) are examples of web sites that use such a database. We have simplified the schema by maintaining information for only last season[1].
Some notes to keep in mind as you go:
From your home directory, run the command:
gtar -zxvf /home/cc/cs186/fa03/Hw3/Hw3.tar.gz
This will create a directory Hw3/ in your home directory
with the following scripts:
Script |
Description |
initdb.sh |
Initialize
database directory at $PGDATA_HW3. To reinitialize, you have to delete
existing $PGDATA_HW3 first. |
startpg.sh |
Starts Postgres master process for database directory
at $PGDATA_HW3 |
loadnba.sh loadbignba.sh |
Creates
database ‘hw3’, and loads in some test data. The test data from loadbignba.sh is based on real data from this
year’s NBA roster. We have also provided a smaller data set that you
can load using loadnba.sh. Feel free to
insert in your own data for your own testing. Read loaddata.sql to see how to use the copy command to bulk load data. |
startpsql.sh |
Starts
psql for hw3 |
runquery.sh <QUERY_FILE> |
Runs
query stored in <QUERY_FILE> using psql
and outputs to screen. We have provided a sample query file query0.sql |
stoppg.sh |
Stops
Postgres master process for the
database directory at $PGDATA_HW3 before you log off. |
To set up the initial
database for the very first time,
you should run:
There are 4 relations in the schema, which are described below along with their integrity constraints. Columns in the primary key are underlined.
Player(playerID: integer, name : varchar(50), position : varchar(10), height : integer, weight : integer, team: varchar(30))
Each Player is assigned a unique playerID. The position of a player can either be Guard, Center or Forward. The height of a player is in inches while the weight is in pounds. Each player plays for only one team. The team field is a foreign key to Team.
Team (name: varchar(30),
city : varchar(20))
Each Team has a unique name associated with it. There can be multiple teams from the same city.
Game (gameID: integer, homeTeam: varchar(30), awayTeam : varchar(30), homeScore : integer, awayScore : integer)
Each Game has a unique gameID. The fields homeTeam and awayTeam are foreign keys to Team. Two teams may play each other multiple times each season. There is an integrity check to ensure homeTeam and awayTeam are different.
GameStats (playerID
: integer, gameID: integer,
points : integer, assists :
integer,
rebounds : integer)
GameStats records the performance statistics of a player within a game. A player may not play in every game, in which case it will not have its statistics recorded for that game. gameID is a foreign key to Game. playerID is a foreign key to Player. Assume that two assertions are in place. The first is to ensure that the player involved belongs to either the involving home or away teams, and the second is to ensure that the total score obtained by a team (in Game) is consistent with the total sum (in GameStats) of individual players in the team playing in the game[2].
3. Queries
Part I
1.
Find distinct names of players who play the “Guard” Position and have
name
containing “Jo”. (ORDER BY Players.name)
select list: Player.name
ordering: Player.name
ascending
2. List cities that have more than 1 team playing there. (ORDER BY Team.city)
select list: Team.city
ordering: Team.city ascending
3. Find the player(s) who has the highest score for this season (highest total score in all games of this season). Output the player’s ID, name, team, and total score.
select list: Player.playerID, Player.name, Player.team, total_score
ordering:
Player.playerID ascending
select list: Player.playerID, average_points
ordering: Player.playerID ascending
select list: Player.team, Player.position, avg_height, avg_weight
ordering: Player.team ascending, Player.position
descending
6. List each player’s playerID, their name, team and the number of “triple doubles” they earned. (A “triple double” is a game in which the player’s number of assists, rebounds, and points are all in the double-digits). Even if a player doesn’t have any triple doubles, he should appear in the output with the corresponding attribute equal to zero. (ORDER BY Player.playerID).
select list: Player.playerID, Player.name, Player.team, num_of_triple_doubles
ordering: Player.playerID ascending
7. Find the number of
games in
which both “Joey Loo” and “Hairy Harry”
play, and Loo scores more points than
Harry.
select
list: num_of_games
8. List the playerIDs and names of players who have played in at least two games during the season. Output the playerID and name followed by the number of games played. (ORDER BY Player.playerID)
select list: Player.playerID, Player.name, num_of_games
ordering: Player.playerID ascending
9.
Find the team(s) with the most wins compared to other teams. Output the
team
name and the number of wins. (ORDER BY team name)
select list: Team.name,
num_of_wins
ordering: Team.name
ascending
10. List the playerIDs, names and teams of players who played in all home games for their team. Do not output players whose teams did not play any home games. (ORDER BY Player.ID)
select list: Player.playerID, Player.name, Player.team
ordering: Player.playerID ascending
Part II
For this part of the assignment you will have to use the EXPLAIN clause. You can observe the execution plan that the optimizer has chosen for a particular query, using the statement:
EXPLAIN
<your_query>
This
command displays the execution plan that the PostgreSQL planner generates for the
supplied
statement. The execution plan shows how the table(s) referenced by the
statement will be scanned - by plain sequential scan, index scan, etc.
- and if
multiple tables are referenced, what join algorithms will be used to
bring
together the required row from each input table.
The
most critical part of the display is the estimated statement execution
cost,
which is the planner's guess at how long it will take to run the
statement
(measured in units of disk page fetches). Actually two numbers are
shown: the
start-up time before the first row can be returned, and the total time
to
return all the rows.
Note: You will need to run the ANALYZE command before using EXPLAIN to collect statistics about the contents of the tables in the database.
For each of the following queries, run the query using EXPLAIN, choose which query is best, and explain why.
Query 1:
List the shortest player(s).
select
* from player p1 where p1.height <= all (select height from player);
select * from player p1 where p1.height in (select min(height) from
player);
select * from player p1 where not exists (select * from player p2 where
p2.height < p1.height);
(select
* from player)
EXCEPT
(select p1.* from player p1, player p2
where p1.height>p2.height AND p1.playerid<>p2.playerid);
Query 2:
For each game return the total points scored in that game.
select gameId, homeScore + awayScore as totalScore from game;
For all of the
queries above,
use the EXPLAIN clause to
observe the
execution plan chosen by the optimizer. Which query would you choose in
each
case and why? Include at least one sentence explaining why one
query might be better or worse than another.
4.
Submission instructions
For Part I, submit each of your queries inside a separate file. Query X should go into a file called ‘queryX.sql’. E.g. query 1 goes into query1.sql, query 2 goes into query2.sql, and so on. You should be able to run ‘./runquery.sh queryX.sql’ to see the output of your query X. You are welcome to use view definitions within your query, as long as you ensure that runquery.sh command generates the correct output. This output will be used to be compared with our solutions to see if your query is written correctly.
For Part II, prepare a separate txt file for each query. The files should be named explain1.txt and explain2.txt respectively. Each file should contain the SQL queries provided for each question, each followed by the output of the EXPLAIN command for each one of them. Finally you should also add your comments about which of the options you would run on a large dataset and why.
You should submit the following files:
query[1,2,3,4,5,6,7,8,9,10].sql for each of your answers in Part I
explain[1,2].txt .sql for each of your answers in Part II
README – Your name and login
Make sure all the
above files are in a directory called
Project3/. cd into that directory and type submit Project3 to submit the
files.