Choose whichever method suits you:

Method 1: ctrl + c / ctrl + v

Method 2: Google spreadsheet

Method 3: import.io (or some other scraping service)

Method 4: Be the programmer (write your own crawler)

Let's try programming a crawler ourselves.

1. Identify Web page URL

  • First, find a Web page that lists the items you are interested in
  • Let's understand the URL (cf. Online URL decoder/encoder); a small sketch at the end of this step shows how to inspect the query parameters
  • It is best to find a page whose URL has an iterable parameter (ex: page numbers, item IDs)
    • TED.com talks: http://www.ted.com/talks?page=1
      • page is iterable from 1 to 54
    • Amazon.com TVs: http://www.amazon.com/s/?rh=n:172282,n:!493964,n:1266092011,n:172659&page=1
      • page is iterable from 1 to 143
  • Set the variables npages, url_base, and also a file_base to name the downloaded files. Normally, you want to set npages to a smaller number (ex: 3) rather than the actual value (i.e., 54) to test the code before running it in full.
    npages = 54
    url_base = 'http://www.ted.com/talks?page=%s'
    file_base = 'ted_talks_%s.html'
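
  • As promised above, here is a minimal sketch (using only Python's standard urllib.parse; the URL is the TED list URL from above) that splits a URL into its parts and prints its query parameters, which is a quick way to spot the iterable one:

    from urllib.parse import urlparse, parse_qs

    # Split the TED list URL into its components and inspect the query string
    parts = urlparse('http://www.ted.com/talks?page=1')
    print(parts.scheme, parts.netloc, parts.path)   # http www.ted.com /talks
    print(parse_qs(parts.query))                    # {'page': ['1']}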
    

2. Download Web pages

  • First define a function named save_text

    def save_text(text, filename):
        # write as UTF-8 so non-ASCII characters in the page don't break the save
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(text)
    
  • Then, using the list URL found in step 1, download the list pages. We will use the requests package for this task.

    import requests
    for page_num in range(1, npages+1):
        print(page_num)
        r = requests.get(url_base % page_num)
        save_text(r.text, file_base % page_num)
    
  • Check whether the pages have downloaded successfully
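
  • For example (a sketch reusing npages and file_base from step 1), make sure every expected file exists and is not suspiciously small:

    import os

    # Flag files that are missing or too small to contain a real list page
    for page_num in range(1, npages+1):
        filename = file_base % page_num
        if not os.path.exists(filename):
            print('missing:', filename)
        elif os.path.getsize(filename) < 1000:   # arbitrary threshold
            print('suspiciously small:', filename)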

3. Parse downloaded Web pages

  • Before we go any further, let's recall what an HTML document looks like. Take a look at one of the downloaded HTML files as well.

    <!DOCTYPE html>
    <html>
    <head>
        <meta charset="utf-8">
        <link rel="stylesheet" href="...">
        <script type="text/javascript" src="..."></script>
    </head>
    <body>
        <!-- This is a comment -->
        ... and this is where the visible contents go ...
    </body>
    </html>
    
  • Now, let's set page_num to 1 for a detailed look, rather than looping over all pages right away

    page_num = 1
    
  • For parsing, we'll be using lxml.html. Many people also use regular expressions, and there are various other options (ex: bs4), so feel free to Google them.

    from lxml import html
    root = html.parse(file_base % page_num)
    
  • With root, we can easily extract data using XPath expressions, just by identifying the tag, class, or id of an HTML element. Why don't we try extracting all the text from the <body> tag? Try this:

    print(root.xpath('//body//text()'))
    
  • Cool, huh? Now, let's think about what we want to extract from our downloaded HTML, and picture what we want as a result. From our HTML page, we probably want to extract the title, speaker, view count, date, and URL of each TED talk, resulting in a spreadsheet like the following:

    | title                                | speaker            | views | date     | url        |
    | The price of shame                   | Monica Lewinsky    | 83K   | Mar 2015 | http://... |
    | What if 3D printing was 100x faster? | Joseph DeSimone    | 212K  | Mar 2015 | http://... |
    | Can we create new senses for humans? | David Eagleman     | 215K  | Mar 2015 | http://... |
    | The good news about PMS              | Robyn Stein DeLuca | 248K  | Mar 2015 | http://... |
    | ...                                  | ...                | ...   | ...      | ...        |
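
  • In Python, one row of this spreadsheet can be represented as a dictionary, and the full dataset as a list of such dictionaries. The sketch below only illustrates the target shape; the exact key names are pinned down later in parse_item() (where the URL column is stored under 'href'):

    # One talk as a dictionary; the whole dataset will be a list of these
    {
        'title': 'The price of shame',
        'speaker': 'Monica Lewinsky',
        'views': '83K',
        'date': 'Mar 2015',
        'href': 'http://...'
    }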

  • So, in order to find what tag, class, or id we need to extract these elements, let's go back to www.ted.com/talks?page=1, right-click, and choose "Inspect Element".

  • By navigating the DOM, we can see that the div tag with id=browse-results contains the talk items, each wrapped in a div tag with class=col. (When identifying an HTML element, using an id is better than using a class, because ids are normally unique within an HTML page.)

  • Now, let's use xpath to get the talk items.

    items = root.xpath('//div[@id="browse-results"]//div[@class="col"]')
    print(len(items))   # returns 36, the number of talk items in the page
    
  • Dig into one of the <div class="col"> elements to further investigate the identifiers of the talk information.
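
  • For instance, you can pretty-print one item's raw HTML straight from Python (a sketch; html here is the lxml.html module imported above):

    # Dump the first talk item's HTML so we can read off its tags, classes, and ids
    print(html.tostring(items[0], pretty_print=True, encoding='unicode'))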

  • Choose one item and extract relevant data.

    item = items[0]
    print(item.xpath('.//h4[@class="h12 talk-link__speaker"]/text()'))
    print(item.xpath('.//h4[@class="h9 m5"]/a/text()'))
    print(item.xpath('.//h4[@class="h9 m5"]/a/@href'))
    print(item.xpath('.//span[@class="meta__val"]/text()'))
    
  • Each XPath query returns a list, and it would be better if we could strip the newlines (\n) from the strings. Additionally, rather than handling each piece of information separately, let's put the extracted values into one dictionary and wrap that in a function named parse_item().

    def parse_item(item):
        return {
            'speaker': item.xpath('.//h4[@class="h12 talk-link__speaker"]/text()')[0],
            'title': item.xpath('.//h4[@class="h9 m5"]/a/text()')[0].strip('\n'),
            'href': item.xpath('.//h4[@class="h9 m5"]/a/@href')[0],
            'views': item.xpath('.//span[@class="meta__val"]/text()')[0].split('\n')[1],
            'date': item.xpath('.//span[@class="meta__val"]/text()')[1].strip('\n')
        }
    
  • Perfect. Now let's iterate through the items on page 1.

    data = []
    for item in items:
        d = parse_item(item)
        data.append(d)
    print(len(data))    # returns 36
    
  • Great. Now we're ready to iterate through all 54 web pages.

    data = []
    for page_num in range(1, npages+1):
        root = html.parse(file_base % page_num)
        items = root.xpath('//div[@id="browse-results"]//div[@class="col"]')
        for item in items:
            d = parse_item(item)
            data.append(d)
    print(len(data))    # returns 1943 or a similar number
    

4. Save the parsed data to file

  • We currently have our data as a list of dictionaries.
  • Normally, it's enough to save this data directly to a JSON file.

    import json
    with open('data.json', 'w') as f:
        json.dump(data, f)
    
    • JSONView is a nice way to pretty-print your JSON.
  • However, sometimes it's better to convert this data into a spreadsheet. If so, try this:

    import csv
    # newline='' prevents blank lines between rows on some platforms (notably Windows)
    with open('data.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)
    

5. What's next?

  • You can see a refactored, cleaner version of the crawler above here. Try to work through the code yourself.
  • Furthermore, you can crawl the individual talk URLs you have just acquired from the list pages (a rough sketch follows this list).
  • Concurrent crawling may also come in handy; consult this presentation material.
  • Try crawling some other website of interest.
  • If you already have enough data to crunch, go ahead and crunch 'em!
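
  • As a starting point for the first two ideas, here is a rough sketch (not part of the refactored code linked above; it assumes the data list and save_text() from the earlier steps, and that the hrefs on the list pages are relative paths) that downloads a few individual talk pages, first sequentially and then with a small thread pool:

    import requests
    from concurrent.futures import ThreadPoolExecutor

    def fetch_talk(i, d):
        # hrefs on the list pages are relative, so prepend the domain
        r = requests.get('http://www.ted.com' + d['href'])
        save_text(r.text, 'ted_talk_%s.html' % i)

    # Sequential version: only a few talks while testing
    for i, d in enumerate(data[:5]):
        fetch_talk(i, d)

    # Concurrent version: a small thread pool from the standard library
    with ThreadPoolExecutor(max_workers=4) as pool:
        for i, d in enumerate(data[:5]):
            pool.submit(fetch_talk, i, d)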

References

Below are some Web crawler examples. Most are for Python 2, so be careful.