Homework 1b

Overview

In this assignment we will expand our PostgreSQL countries database to capture more information, and learn to use XQuery to extract information from web pages to populate the database.

Due

The project is due at midnight on Monday, March 10, with an ER diagram due in discussion section on Monday, March 3.

Database Design

Please visit https://www.cia.gov/library/publications/the-world-factbook and familiarize yourself with the information provided about the various countries (hint: look at the ‘text, low-bandwidth version’). Select a portion of the available information and create an entity-relationship diagram (on paper) representing that portion. Your diagram should include:

1. Five (or more) simple attributes not covered in the first part of the homework.

2. Three other entity sets, including International Organizations.

3. Relationships as needed, including Borders.

Be sure to show key constraints, weak entities, participation constraints, etc.  Label each relationship, and for each relationship in your diagram explain your choices of participation constraints and key constraints.

Building the Database

Add attributes to your existing country table and add new tables to capture the information in your ER model. 
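
For example, a minimal sketch of what this might look like (the attribute names, types, and the assumption that country is keyed by a two-letter code are all illustrative, not requirements):

    ALTER TABLE country ADD COLUMN population bigint;
    ALTER TABLE country ADD COLUMN capital text;

    -- one possible table for the Borders relationship
    CREATE TABLE borders (
        country1 char(2) REFERENCES country,
        country2 char(2) REFERENCES country,
        PRIMARY KEY (country1, country2)
    );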

Extracting Information

The web page sources for the ‘text, low-bandwidth version’ are available in the directory /home/ff/cs186/factbook/xml. Each file is named AA.xml, where AA is a two-character abbreviation of the country name. An example XQuery that uses catalog files to iterate over multiple documents and extract background information from these files is available in factback.xq. Note that two catalogs are given: cat.xml refers to all the countries, while catsmall.xml covers only five countries and is useful for development and testing.

In addition, a single XML file mondial.xml is available, along with a DTD for this file.  It is also fine to use XQuery to extract data from this file to populate your database.

To run XQuery, create a file ending in ‘.xq’ and then use the Saxon XQuery processor, available in the /home/ff/cs186/Saxon subdirectory.  (The CLASSPATH for this should be set on login, but if necessary add saxon9.jar in the Saxon directory to your CLASSPATH.)
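
For example, assuming the CLASSPATH is set as described, a query file can be run like this (country.xq and country.xml are hypothetical file names):

    % java net.sf.saxon.Query country.xq > country.xml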

The goal is to write one XQuery for each table in your database, producing a new XML file consisting of a set of <record> elements, along the lines of the sketch below.
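
For example, a query extracting country names and areas might produce something like the following (the wrapper element and the field names inside each <record> are illustrative assumptions):

    <table>
      <record>
        <name>Andorra</name>
        <area>468 sq km</area>
      </record>
      <record>
        <name>Belgium</name>
        <area>30,528 sq km</area>
      </record>
    </table>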

A program is provided that will turn an XML file of this form into SQL INSERT statements.
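
For instance, the first <record> in the sketch above would become an INSERT of roughly this form (table and column names again illustrative):

    INSERT INTO country (name, area) VALUES ('Andorra', '468 sq km');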

Challenges

Additional string processing may be required to clean up the data for insertion into the tables, including deciding on a standard format for key fields, etc.  You are free to attack this problem, commonly referred to as the “Extract, Transform, Load” problem, any way you want; for example, by running Perl or Python scripts on either the <record> file above or the file of INSERT statements.  Alternatively, it is fine to insert the data into “staging” tables for processing before the actual production tables are populated, making use of the many string-manipulation functions available in PostgreSQL (http://www.postgresql.org/docs/8.1/interactive/sql-commands.html).
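
A minimal sketch of the staging-table approach, assuming a raw area string such as '336,237 sq km' is to be converted to an integer (table and column names are illustrative):

    CREATE TABLE country_staging (
        code text,
        area text    -- raw string as extracted, e.g. '336,237 sq km'
    );

    -- keep only the digits, then cast; regexp_replace takes a POSIX
    -- regular expression, and the 'g' flag replaces all matches
    INSERT INTO country (code, area)
    SELECT code,
           CAST(regexp_replace(area, '[^0-9]', '', 'g') AS integer)
    FROM country_staging;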

Note that XQuery provides the function tokenize($b/text(), $sep), which takes a string as its first parameter and a separator (interpreted as a regular expression) as its second, and returns the list of strings obtained by breaking the first parameter at each occurrence of the separator.  For example, if $sep is “, ” and $b/text() is “apple, orange, lemon”, the result is a list of three words, one for each fruit, and a for $w in tokenize(…) loop can be used to iterate over them.  You may want to look at w3c.org or w3schools.com for information on additional XPath functions that can help you with this.
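
A small self-contained example (the element name is made up for illustration):

    (: split a comma-separated list into one element per item :)
    let $s := "apple, orange, lemon"
    for $w in tokenize($s, ",\s*")
    return <fruit>{ normalize-space($w) }</fruit>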

Deliverables
  1. ER diagram (hand drawn) with discussion attached.
  2. Turn in the following files by putting them in a directory hw1b and running submit:
    % cd hw1b
    % submit hw1b
    
Required files for submission (createdb.sql, xq.tar, scripts.tar, dump.tar, hw1b.txt):
  1. Create Table statements in a file createdb.sql. You may want to use pg_dump --schema-only to dump this information.
  2. XQuery scripts for each table foo in a file foo.xq. The XQuery should work when run in the directory above the xml directory. Put them all together in a single xq.tar file. (only submit the tar file)
  3. Any scripts used to process the data, in a single scripts.tar tar file. (Create an empty scripts.tar even if you don't have scripts.)
  4. The results of 'select * from foo' for each table foo in your database, in a foo.out file (there should be more than one). Put them all together in a single dump.tar tar file. (only submit the tar file)
  5. Write a short description file hw1b.txt. Describe the purpose of each of your script files, if any (e.g., give a short example of what a script file can parse), and mention whether you extracted data from the per-country files or from mondial.xml. You are welcome to document extra effort you put in to make the assignment better (e.g., special text-processing cases you handled, interesting cases you ran into). Keep it brief; a short description is enough.
Note
  1. As mentioned in the description above, it is fine to use mondial.xml only.
  2. Extra credit will be given to those who put extra effort into text processing.
  3. If you don't do extra text processing, it's OK to leave some fields as text (for example, the area attribute can stay as the string '336,237 km' instead of the number 336237), but again, we'll give extra credit for extra work.