
Working with alternative data

We will illustrate the acquisition of alternative data using web scraping, first targeting OpenTable restaurant data and then moving on to earnings call transcripts hosted by Seeking Alpha.

Scraping OpenTable data

Typical sources of alternative data are review websites such as Glassdoor or Yelp, which convey insider insights using employee comments or guest reviews. Clearly, user-contributed content does not capture a representative view, but rather is subject to severe selection biases. We'll look at Yelp reviews in Chapter 14, Text Data for Trading – Sentiment Analysis, for example, and find many more very positive and negative ratings on the five-star scale than you might expect. Nonetheless, this data can be valuable input for ML models that aim to predict a business's prospects or market value relative to competitors or over time to obtain trading signals.

The data needs to be extracted from the HTML source, barring any legal obstacles. To illustrate the web scraping tools that Python offers, we'll retrieve information on restaurant bookings from OpenTable. Data of this nature can be used to forecast economic activity by geography, real estate prices, or restaurant chain revenues.

Parsing data from HTML with Requests and BeautifulSoup

In this section, we will request and parse HTML source code. We will be using the Requests library to make Hypertext Transfer Protocol (HTTP) requests and retrieve the HTML source code. Then, we'll rely on the Beautiful Soup library, which makes it easy to parse the HTML markup code and extract the text content we are interested in.

We will, however, encounter a common obstacle: websites may request certain information from the server only after initial page-load using JavaScript. As a result, a direct HTTP request will not be successful. To sidestep this type of protection, we will use a headless browser that retrieves the website content as a browser would:

from bs4 import BeautifulSoup
import requests
# set and request url; extract source code
url = 'https://www.opentable.com/new-york-restaurant-listings'
html = requests.get(url)
html.text[:500]
' <!DOCTYPE html><html lang="en"><head><meta charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=9; IE=8; IE=7; IE=EDGE"/> <title>Restaurant Reservation Availability</title> <meta name="robots" content="noindex" > </meta> <link rel="shortcut icon" href="//components.otstatic.com/components/favicon/1.0.4/favicon/favicon.ico" type="image/x-icon"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.4/favicon/favicon-16.jpg" sizes="16x16"/><link rel='

Now, we can use Beautiful Soup to parse the HTML content, and then look for all span tags with the class associated with the restaurant names that we obtain by inspecting the source code, rest-row-name-text (see the GitHub repository for linked instructions to examine website source code):

# parse raw html => soup object
soup = BeautifulSoup(html.text, 'html.parser')
# for each span tag, print out text => restaurant name
for entry in soup.find_all(name='span', attrs={'class':'rest-row-name-text'}):
    print(entry.text)
Wade Coves
Alley
Dolorem Maggio
Islands
...

Once you have identified the page elements of interest, Beautiful Soup makes it easy to retrieve the contained text. If you want to get the price category for each restaurant, for example, you can use:

# get the number of dollar signs for each restaurant
for entry in soup.find_all('p', {'class':'rest-row-pricing'}):
    price = entry.find('i').text.count('$')

When you try to get the number of bookings, however, you just get an empty list because the site uses JavaScript code to request this information after the initial loading is complete:

soup.find_all('p', {'class':'booking'})
[]

This is precisely the challenge we mentioned earlier—rather than sending all content to the browser as a static page that can be easily parsed, JavaScript loads critical pieces dynamically. To obtain this content, we need to execute the JavaScript just like a browser—that's what Selenium is for.

Introducing Selenium – using browser automation

We will use the browser automation tool Selenium to operate a headless Firefox browser that will parse the HTML content for us.

The following code opens the Firefox browser:

from selenium import webdriver
# create a driver called Firefox
driver = webdriver.Firefox()

Let's close the browser:

# close it
driver.close()
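Note that this opens a visible browser window. To run Firefox without a window, as the term headless implies, you can pass an options object when creating the driver. The following is a minimal sketch, assuming a Selenium release that accepts the options argument and a geckodriver executable on your PATH:

from selenium.webdriver.firefox.options import Options

# start Firefox without opening a window
options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)
# ... use the driver exactly as before ...
driver.close()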

Now, we retrieve the HTML source code, including the parts loaded dynamically, with Selenium and Firefox. To this end, we provide the URL to our driver and then use its page_source attribute to get the full-page content, as displayed in the browser.

From here on, we can fall back on Beautiful Soup to parse the HTML, as follows:

import time, re
# visit the opentable listing page
driver = webdriver.Firefox()
driver.get(url)
time.sleep(1) # wait 1 second
# retrieve the html source
html = driver.page_source
html = BeautifulSoup(html, "lxml")
for booking in html.find_all('p', {'class': 'booking'}):
    match = re.search(r'\d+', booking.text)
    if match:
        print(match.group())

Building a dataset of restaurant bookings and ratings

Now, you only need to combine all the interesting elements from the website to create a feature that you could use in a model to predict economic activity in geographic regions, or foot traffic in specific neighborhoods.

With Selenium, you can follow the links to the next pages and quickly build a dataset of over 10,000 restaurants in NYC, which you could then update periodically to track a time series.
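To track such a time series, one simple approach is to tag each scrape with its date and append it to a cumulative file. The following sketch assumes the restaurants DataFrame assembled by the crawling loop further below and a (hypothetical) opentable_snapshots.csv output file:

from pathlib import Path
import pandas as pd

def store_snapshot(restaurants, path='opentable_snapshots.csv'):
    snapshot = restaurants.copy()
    # tag every row with the date of the scrape
    snapshot['date'] = pd.Timestamp('today').normalize()
    # write the header only when the file is first created
    header = not Path(path).exists()
    snapshot.to_csv(path, mode='a', header=header, index=False)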

First, we set up a function that parses the content of the pages that we plan on crawling, using the familiar Beautiful Soup parse syntax:

import pandas as pd

def parse_html(html):
    data, item = pd.DataFrame(), {}
    soup = BeautifulSoup(html, 'lxml')
    for i, resto in enumerate(soup.find_all('p', class_='rest-row-info')):
        item['name'] = resto.find('span', class_='rest-row-name-text').text
        booking = resto.find('p', class_='booking')
        item['bookings'] = re.search(r'\d+', booking.text).group() \
            if booking else 'NA'
        rating = resto.find('p', class_='star-rating-score')
        item['rating'] = float(rating['aria-label'].split()[0]) \
            if rating else 'NA'
        reviews = resto.find('span', class_='underline-hover')
        item['reviews'] = int(re.search(r'\d+', reviews.text).group()) \
            if reviews else 'NA'
        item['price'] = int(resto.find('p', class_='rest-row-pricing')
                            .find('i').text.count('$'))
        cuisine_class = 'rest-row-meta--cuisine rest-row-meta-text sfx1388addContent'
        item['cuisine'] = resto.find('span', class_=cuisine_class).text
        location_class = 'rest-row-meta--location rest-row-meta-text sfx1388addContent'
        item['location'] = resto.find('span', class_=location_class).text
        data[i] = pd.Series(item)
    return data.T

Then, we start a headless browser that continues to click on the Next button for us and captures the results displayed on each page:

restaurants = pd.DataFrame()
driver = webdriver.Firefox()
url = 'https://www.opentable.com/new-york-restaurant-listings'
driver.get(url)
while True:
    time.sleep(1)
    new_data = parse_html(driver.page_source)
    if new_data.empty:
        break
    restaurants = pd.concat([restaurants, new_data], ignore_index=True)
    print(len(restaurants))
    driver.find_element_by_link_text('Next').click()
driver.close()

A sample run in early 2020 yields location, cuisine, and price category information on 10,000 restaurants. Furthermore, there are same-day booking figures for around 1,750 restaurants (on a Monday), as well as ratings and reviews for around 3,500 establishments.

Figure 3.2 shows a quick summary: the left panel displays the breakdown by price category for the top 10 locations with the most restaurants. The central panel suggests that ratings are better, on average, for more expensive restaurants, and the right panel highlights that better-rated restaurants receive more bookings. Tracking this information over time could be informative, for example, with respect to consumer sentiment, location preferences, or specific restaurant chains:

Figure 3.2: OpenTable data summary
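The aggregations behind such a summary are straightforward with pandas. The following sketch assumes the restaurants DataFrame built above, with the parser's 'NA' placeholders first converted to proper missing values:

# convert the 'NA' strings produced by parse_html to numeric values
for col in ['bookings', 'rating', 'reviews', 'price']:
    restaurants[col] = pd.to_numeric(restaurants[col], errors='coerce')

# price-category breakdown for the ten locations with the most restaurants
top10 = restaurants.location.value_counts().head(10).index
price_breakdown = (restaurants[restaurants.location.isin(top10)]
                   .groupby(['location', 'price'])
                   .size()
                   .unstack('price'))

# average rating by price category, and average bookings by (rounded) rating
rating_by_price = restaurants.groupby('price').rating.mean()
bookings_by_rating = restaurants.groupby(restaurants.rating.round()).bookings.mean()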

Websites continue to change, so this code may stop working at some point. To update our bot, we need to identify the changes to the site navigation, such as new class or ID names, and correct the parser accordingly.

Taking automation one step further with Scrapy and Splash

Scrapy is a powerful library used to build bots that follow links, retrieve the content, and store the parsed result in a structured way. In combination with the Splash headless browser, it can also interpret JavaScript and becomes an efficient alternative to Selenium.

You can run the spider using the scrapy crawl opentable command in the 01_opentable directory, where the results are logged to spider.log:

from opentable.items import OpentableItem
from scrapy import Spider
from scrapy_splash import SplashRequest


class OpenTableSpider(Spider):
    name = 'opentable'
    start_urls = ['https://www.opentable.com/new-york-restaurant-listings']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url,
                                callback=self.parse,
                                endpoint='render.html',
                                args={'wait': 1})

    def parse(self, response):
        item = OpentableItem()
        for resto in response.css('p.rest-row-info'):
            item['name'] = resto.css('span.rest-row-name-text::text').extract()
            item['bookings'] = resto.css('p.booking::text').re(r'\d+')
            item['rating'] = resto.css('p.all-stars::attr(style)').re_first(r'\d+')
            item['reviews'] = resto.css('span.star-rating-text--review-text::text').re_first(r'\d+')
            item['price'] = len(resto.css('p.rest-row-pricing > i::text').re(r'\$'))
            item['cuisine'] = resto.css('span.rest-row-meta--cuisine::text').extract()
            item['location'] = resto.css('span.rest-row-meta--location::text').extract()
            yield item
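For the SplashRequest calls to reach a JavaScript-capable renderer, the project's settings.py must point Scrapy at a running Splash instance and register the scrapy-splash middlewares. The following excerpt reflects the setup documented by the scrapy-splash project, assuming Splash runs locally (for example, via Docker) on port 8050:

# settings.py (excerpt)
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeserializeMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'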

There are numerous ways to extract information from this data beyond the reviews and bookings of individual restaurants or chains.

We could further collect and geo-encode the restaurants' addresses, for instance, to link their physical locations to other areas of interest, such as popular retail spots or neighborhoods, and gain insights into particular aspects of economic activity. As mentioned previously, such data will be most valuable in combination with other information.
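As a purely illustrative sketch of the geo-encoding step, the geopy library wraps services such as OpenStreetMap's Nominatim; the address used here is a hypothetical example, since the listing page itself only exposes the neighborhood:

from geopy.geocoders import Nominatim

# Nominatim requires a descriptive user_agent and enforces rate limits
geolocator = Nominatim(user_agent='opentable-alt-data-example')
location = geolocator.geocode('11 Madison Avenue, New York, NY')
if location is not None:
    print(location.latitude, location.longitude)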

Scraping and parsing earnings call transcripts

Textual data is an essential alternative data source. One example of textual information is the transcripts of earnings calls, where executives not only present the latest financial results but also respond to questions from financial analysts. Investors use transcripts to evaluate changes in sentiment, the emphasis on particular topics, or the style of communication.

We will illustrate the scraping and parsing of earnings call transcripts from the popular trading website www.seekingalpha.com. As in the OpenTable example, we'll use Selenium to access the HTML code and Beautiful Soup to parse the content. To this end, we begin by instantiating a Selenium webdriver instance for the Firefox browser:

import re
from pathlib import Path
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from furl import furl
from selenium import webdriver

transcript_path = Path('transcripts')
SA_URL = 'https://seekingalpha.com/'
TRANSCRIPT = re.compile('Earnings Call Transcript')

next_page = True
page = 1
driver = webdriver.Firefox()

Then, we iterate over the transcript pages, creating the URLs based on the navigation logic we obtained from inspecting the website. As long as we find relevant hyperlinks to additional transcripts, we access the webdriver's page_source attribute and call the parse_html function to extract the content:

while next_page:
    url = f'{SA_URL}/earnings/earnings-call-transcripts/{page}'
    driver.get(urljoin(SA_URL, url))
    response = driver.page_source
    page += 1
    soup = BeautifulSoup(response, 'lxml')
    links = soup.find_all(name='a', string=TRANSCRIPT)
    if len(links) == 0:
        next_page = False
    else:
        for link in links:
            transcript_url = link.attrs.get('href')
            article_url = furl(urljoin(SA_URL, transcript_url)).add({'part': 'single'})
            driver.get(article_url.url)
            html = driver.page_source
            meta, participants, content = parse_html(html)
            meta['link'] = link
driver.close()

To collect structured data from the unstructured transcripts, we can use regular expressions in addition to Beautiful Soup.

They allow us to collect detailed information not only about the company and timing of the earnings call, but also about who was present, and to attribute statements to analysts and company representatives:

def parse_html(html):
    date_pattern = re.compile(r'(\d{2})-(\d{2})-(\d{2})')
    quarter_pattern = re.compile(r'(\bQ\d\b)')
    soup = BeautifulSoup(html, 'lxml')
    meta, participants, content = {}, [], []
    h1 = soup.find('h1', itemprop='headline').text
    meta['company'] = h1[:h1.find('(')].strip()
    meta['symbol'] = h1[h1.find('(') + 1:h1.find(')')]
    title = soup.find('p', class_='title').text
    match = date_pattern.search(title)
    if match:
        m, d, y = match.groups()
        meta['month'] = int(m)
        meta['day'] = int(d)
        meta['year'] = int(y)
    match = quarter_pattern.search(title)
    if match:
        meta['quarter'] = match.group(0)
    qa = 0
    speaker_types = ['Executives', 'Analysts']
    for header in [p.parent for p in soup.find_all('strong')]:
        text = header.text.strip()
        if text.lower().startswith('copyright'):
            continue
        elif text.lower().startswith('question-and'):
            qa = 1
            continue
        elif any([type in text for type in speaker_types]):
            for participant in header.find_next_siblings('p'):
                if participant.find('strong'):
                    break
                else:
                    participants.append([text, participant.text])
        else:
            p = []
            for participant in header.find_next_siblings('p'):
                if participant.find('strong'):
                    break
                else:
                    p.append(participant.text)
            content.append([header.text, qa, '\n'.join(p)])
    return meta, participants, content

We'll store the result in several .csv files for easy access when we use ML to process natural language in Chapters 14-16:

def store_result(meta, participants, content):
    path = transcript_path / 'parsed' / meta['symbol']
    pd.DataFrame(content, columns=['speaker', 'q&a', 'content']).to_csv(
        path / 'content.csv', index=False)
    pd.DataFrame(participants, columns=['type', 'name']).to_csv(
        path / 'participants.csv', index=False)
    pd.Series(meta).to_csv(path / 'earnings.csv')
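To tie the pieces together, the scraping loop shown earlier would follow its parse_html call with a call to store_result. The sketch below assumes it runs right after that call and creates the output directory if it does not yet exist:

# inside the transcript loop, after parse_html(html)
(transcript_path / 'parsed' / meta['symbol']).mkdir(parents=True, exist_ok=True)
store_result(meta, participants, content)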

See the README in the GitHub repository for additional details and references for further resources to learn how to develop web scraping applications.