HTML Table Parsing

Introduction

A lot of useful data is stored in HTML tables, but parsing HTML is an arduous task. Built on Python’s standard-library html.parser module, SQLify aims to simplify parsing HTML tables by:
  • Creating a list of separate HTML tables
  • Trying to automatically find column headers
  • Handling different table designs, including
    • <tbody> and <thead> tags
    • rowspan and colspan attributes

Note

Using SQLify’s HTML parser in conjunction with Jupyter notebooks is recommended, since parsed tables are rendered visually there and are easier to review.

Step 1: Reading in HTML

There are two avenues for reading in HTML.

a) Locally Saved HTML Files
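
A minimal sketch of reading a locally saved file. The call html.from_file() is an assumption about SQLify’s API, so check the reference documentation for the exact name:

    from sqlify import html

    # Parse every <table> found in a locally saved HTML file.
    # NOTE: from_file() is an assumed function name.
    tables = html.from_file('nba_stats.html')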

b) From the Web
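
Similarly, a sketch of fetching and parsing a page over HTTP; html.from_url() is again an assumed name:

    from sqlify import html

    # Download a web page and parse the tables it contains.
    # NOTE: from_url() is an assumed function name.
    tables = html.from_url('https://example.com/players.html')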

Step 2: Reviewing the Output

The functions above return TableBrowser objects, which are essentially lists of the HTML tables that were found. If you are viewing the output in a Jupyter notebook, the code above will display every table with an index next to its name, e.g. [5] Players of the week.
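
Outside of Jupyter, you can inspect the browser programmatically; a sketch (the exact output format may differ):

    # TableBrowser behaves like a list of the parsed Table objects.
    print(len(tables))      # how many tables were found
    players = tables[5]     # e.g. the "Players of the week" table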

Step 3: Cleaning the Tables

If the tables you wanted were parsed 100% correctly and don’t require any further processing steps, proceed to step 4. Otherwise, read on.

As seen above, indexing a TableBrowser object returns a Table object. Table objects support a small set of data cleaning methods and contain attributes you may want to inspect or modify.
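
As a sketch of what cleanup can look like, suppose the column headers were not detected correctly. The col_names attribute and delete() method used below are assumptions, so consult the Table documentation for the attributes and methods that actually exist:

    players = tables[5]

    # Rename the columns (col_names is an assumed attribute).
    players.col_names = ['player', 'team', 'week']

    # Drop an unwanted column (delete() is an assumed method).
    players.delete('week')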

Step 4: Saving the Results

After you’ve cleaned the Table to your satisfaction, you can save the results as any of the following (examples below):
  • CSV file
  • JSON file
  • PostgreSQL Table
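
Saving to a file is a single call on the Table; to_csv() and to_json() are assumed method names, so check the API reference:

    # Write the cleaned table to disk (assumed method names).
    players.to_csv('players_of_the_week.csv')
    players.to_json('players_of_the_week.json')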

PostgreSQL

When saving to a new PostgreSQL database, you can either create the database manually, or tell SQLify which default database it should connect to when it needs to create new databases.
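
A sketch of both steps; sqlify.settings() and to_pg() are assumed names for the configuration call and the upload method, so verify them against the API reference:

    import sqlify

    # Point SQLify at an existing database it can connect to when it
    # needs to create a new one (assumed configuration call).
    sqlify.settings(default_db='postgres')

    # Load the cleaned table into PostgreSQL (assumed method name).
    players.to_pg(database='nba')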