- The fetcher function is defined as a method of the `Book` class. You can fetch the HTML files using
from Book import Book
# get an instance of the Book class
book = Book(bookname="name", date="2018-05-29", creator="MF")
# get the htmls from the Han-Ji website, The UR
book.fetch_data('http://hanchi.ihp.sinica.edu.tw/ihpc/hanji?@30^1389784921^802^^^60311004001000010006@@460127924',
    pages_limit=1000, print_bookmark=False,)
- Now, the bs4 objects are stored in the `book.flat_bodies` list. You can write the HTML files into the `data` folder via
# writing htmls into a folder
book.write_htmls(path="data")
# loading files to book
book.load_htmls(path="data")
- To clean the data stored in `book.flat_bodies`, we have set some general rules for cleaning the tree structure. We discard 1) tags that are not relevant to the actual text content, 2) page dividers (and page numbers), and 3) annotations (標註). Just run the method below to remove all irrelevant tags:
# run this method to remove all irrelevant tags from the tree structure in book.flat_bodies
book.strip_all_irrelevant_tags()
- To extract the bookmarks (the dependencies of the poems) from the pages, just type
book.extract_paths()
- To preview the first page of the Han-Ji data in a pretty HTML format, type
book.pretty_print(0) # 0 for the first page in fetched Han-Ji data
- To convert rare char components (構字形) in the `book.flat_bodies` HTML sources, type
# if {name}_rare_char.json exists in your path
book.update_rare_chars()
otherwise
# if the {name}_rare_char.json does not exist
driver_path = '(PATH to your selenium driver)'
book.extract_rare_chars(driver_path) # this line would take a very long time, be careful before you execute it
book.write_rare_chars() # write to name_rare_char.json
book.update_rare_chars()
- To count the occurrences of phrases attached to a certain character, use `book.char_word_counts`. See SongShuOrganizer for further details.
# list the 5 most common words with the last character == `char`,
# consider phrase with length 2 to 6.
# char should be a string
book.char_word_counts(char, limits=(1, 5)).most_common(5)
Note: before using the `char_word_counts` method, make sure you have already extracted the passages into `book.flat_passages`.
- `WenXuan.py` was designed as a wrapper of `Book.py` and has specific methods to organize the text files in WenXuan
from WenXuan import WenXuan
# get an instance of the WenXuan class
wenxuan = WenXuan('2018-05-29', 'MF')
wenxuan.fetch_data(URL="(URL for Han-Ji WenXuan)",
pages_limit=1000, print_bookmark=True,)
# organize the text files
wenxuan.extract_all()
- Writing to CSV: `WenXuan.py` provides a method to write `wenxuan.flat_passages` and `wenxuan.flat_meta` to a series of CSV files in a folder (the default folder is `"/文選"`). Metadata is listed in the comments (`#`) in the headers.
wenxuan.write_passages_ECSV()
- To count the occurrences of phrases attached to a certain character, e.g., '曰':
wenxuan.char_word_counts('曰', limits=(1, 4)).most_common(5)
# [('子曰', 3517), ('書曰', 3495), ('詩曰', 2843), ('善曰', 2029), ('注曰', 2018)]
- `SongShu.py` was also designed as a wrapper of the `Book.py` class. The SongShu organizer separates each piece of work into passages.
from SongShu import SongShu
songshu = SongShu("2018-06-28", "MF")
songshu.fetch_data(URL="(The first page URL of SongShu in Han-Ji)", pages_limit=2000, print_bookmark=True)
songshu.write_htmls()
- To recover the fetched data we downloaded last time, run
songshu = SongShu("2018-06-28", "MF")
songshu.load_htmls()
# [Info] Stop at loading data/ShongShu_0851.html.
# [Info] Total length of the data is 851.
- To extract metadata, bookmarks, and organize the passages:
# preprocessing the songshu data to get metadata and bookmarks
# and separate the passages on every page
songshu.extract_all()
The warnings in the output of this call show that some pages in SongShu do not have a clear rule for separating the passages.
- To count the occurrences of phrases attached to a certain character, e.g., '洲':
songshu.char_word_counts('洲', limits=(1, 5)).most_common(5)
# [('蔡洲', 14), ('鬱洲', 9), ('嶸洲', 6), ('崢嶸洲', 6), ('至蔡洲', 6)]
In this way, it is possible to extract natural geographical names.
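The phrase counting used above can be approximated in plain Python. This is only a minimal sketch, not the actual `char_word_counts` implementation: the function name `count_phrases_ending_with` is hypothetical, and it assumes that `limits=(low, high)` gives the number of characters taken in front of `char` (so `limits=(1, 5)` yields phrases of length 2 to 6). It does not filter punctuation, which the real method may well do.

```python
from collections import Counter


def count_phrases_ending_with(passages, char, limits=(1, 5)):
    """Count every phrase that ends with `char`.

    Hypothetical helper: `limits=(low, high)` is assumed to be the range
    of how many characters are taken in front of `char`.
    """
    low, high = limits
    counts = Counter()
    for text in passages:
        for i, c in enumerate(text):
            if c != char:
                continue
            for n in range(low, high + 1):
                if i - n < 0:
                    # not enough characters in front of `char`
                    break
                counts[text[i - n:i + 1]] += 1
    return counts


# made-up passages, only for illustration
passages = ["至蔡洲", "次蔡洲"]
counts = count_phrases_ending_with(passages, "洲", limits=(1, 2))
print(counts.most_common(1))
```

Because shorter phrases are contained in longer ones, a two-character name such as '蔡洲' is also counted inside every longer phrase that ends with it, which is why the lists above contain overlapping entries.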
Some characters in Han-Ji are rare characters. If we use urllib to fetch the source page, we only get the component fragments (構字形) of these rare characters.
To resolve this, we can use the JavaScript API at http://char.iis.sinica.edu.tw/API/normalization.htm to acquire the character fragments, and then use the fragments to look up the correct Unicode code points of the rare characters.
The following lines show how to fetch a bag of rare-character Unicode code points from a text string:
from rare_char_converter import rare_char_converter
selenium_driver = "(PATH TO YOUR SELENIUM DRIVER)"
text = "(YOUR HAN-JI TEXT)"
rare_char_converter(text, selenium_driver)
# Return: dict, {"(fragments of char)" : (UNICODE, string of the rare char)}
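Once such a dictionary is available, substituting the fragments in a text is a plain string replacement. The sketch below assumes the return shape described above (fragments mapped to a `(unicode, character)` tuple). The helper name `replace_rare_chars` and the sample mapping are made up for illustration, and this is not necessarily how `book.update_rare_chars()` works internally.

```python
def replace_rare_chars(text, rare_chars):
    """Replace each fragment string in `text` with its rare character.

    Hypothetical helper: `rare_chars` is assumed to look like
    {"(fragments)": (UNICODE, "rare char")}.
    """
    # replace longer fragment strings first, so that a short fragment
    # never clobbers part of a longer one
    for fragments in sorted(rare_chars, key=len, reverse=True):
        _codepoint, char = rare_chars[fragments]
        text = text.replace(fragments, char)
    return text


# placeholder mapping, NOT real Han-Ji data
mapping = {
    "[frag-1]": ("U+0041", "A"),
    "[frag-12]": ("U+0042", "B"),
}
print(replace_rare_chars("x[frag-1]y[frag-12]z", mapping))
```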