Homework 1b
Overview
In this assignment we will be expanding our PostgreSQL countries database to capture more information, and learn to use XQuery to extract information from web pages to populate the database.
Due
The project is due 12PM midnight on Monday March 10, with an ER diagram due in discussion section on Monday March 3rd.
Database Design
Please visit https://www.cia.gov/library/publications/the-world-factbook, and familiarize yourself with the information provided about the various countries (hint: look at ‘text, low bandwidth version”). Select a portion of the information available and create an entity-relationship diagram (on paper) representing this portion of the available information. Your diagram should include …
1. Five (or more) simple attributes not covered in the first part of the homework example.
2. Three other Entity Sets, including International Organizations.
3. Relationships as needed, including Borders.
Be sure to show key constraints, weak entities, participation constraints, etc. Label each relationship and comment on your choices around participation constraints and key constraints for each relationship in your diagram.
Building the Database
Add attributes to your existing country table and add new tables to capture the information in your ER model.
Extracting Information
The web page sources for the ‘text, low-bandwidth version’ are available in directory /home/ff/cs186/factbook/xml. Each file is named AA.xml, where AA is a two-character abbreviation of the country name. An example XQuery that uses catalog files to iterate over multiple documents and extract background information from these files is available in factback.xq. Note that two catalogs are given, one called cat.xml refers to all the countries, while catsmall.xml covers only 5 countries and is useful for development and testing
In addition, a single XML file mondial.xml is available, along with a DTD for this file. It is also fine to use XQuery to extract data from this file to populate your database.
In order to run XQuery, create a file ending in ‘.xq’ and then use the saxon XQuery processor , available in the /home/ff/cs186/Saxon subdirectory. (The CLASSPATH for this should be set on login, but if necessary add saxon9.jar in the Saxon directory to your CLASSPATH.)
The goal is to write one XQuery for each table in your database, to produce a new XML file consisting of a set of <record> elements, as shown in this figure.
A program is provided that will turn an XML file that looks like the above into SQL INSERT statements:
Challenges
Additional string processing may be required to clean up the data for insert into the table, including deciding on standard format of key fields, etc. You are free to attack this problem, commonly referred to as the “Extract, Transform, Load” problem, any way you want, for example running perl or python scripts on either the <record> file above or the file of insert statements. Alternatively, it is fine to insert data into “staging” tables for processing before the actual production tables are populated, making use of the many string-manipulation functions available in PostgreSQL (http://www.postgresql.org/docs/8.1/interactive/sql-commands.html).
Note that in XQuery the function tokenize($b/text(),sep) is available that takes as a first parameter a string and as the second parameter a separator, and returns a list of strings from the first parameter broken up by the second. For example, sep might be “, “ and the string $b/text() is “apple, orange, lemon”, then the result will be a list of three words, one for each fruit, and a for $w in tokenize(…) can be used to iterate over them. You may want to look at w3c.org or w3schools.com for information on additional XPath Functions that can help you with this.
Deliverables% cd hw1b % submit hw1b