
Criteria for evaluating alternative data

The ultimate objective of alternative data is to provide an informational advantage in the competitive search for trading signals that produce alpha, namely positive, uncorrelated investment returns. In practice, the signals extracted from alternative datasets can be used on a standalone basis or combined with other signals as part of a quantitative strategy. Independent usage is viable if the Sharpe ratio generated by a strategy based on a single dataset is sufficiently high, but that is rare in practice. (See Chapter 4, Financial Feature Engineering – How to Research Alpha Factors, for details on signal measurement and evaluation.)
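
As a minimal sketch of the kind of measurement involved, the following illustrative function computes the annualized Sharpe ratio from a series of daily strategy returns, assuming 252 trading days per year; the function name and inputs are hypothetical:

```python
import numpy as np
import pandas as pd

def annualized_sharpe(returns: pd.Series, risk_free_rate: float = 0.0,
                      periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio of a series of periodic strategy returns."""
    excess_returns = returns - risk_free_rate / periods_per_year
    return np.sqrt(periods_per_year) * excess_returns.mean() / excess_returns.std()

# daily_returns = pd.Series(...)  # hypothetical returns of a single-dataset strategy
# annualized_sharpe(daily_returns)
```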

Quant firms are building libraries of alpha factors that may be weak signals individually but can produce attractive returns in combination. As highlighted in Chapter 1, Machine Learning for Trading – From Idea to Execution, investment factors should be based on a fundamental and economic rationale; otherwise, they are more likely the result of overfitting to historical data than persisting and generating alpha on new data.
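
To illustrate the combination idea, the following sketch standardizes each factor across assets and averages them with equal weights; the input layout and the equal weighting are simplifying assumptions, not a recommendation:

```python
import pandas as pd

def combine_factors(factors: pd.DataFrame) -> pd.Series:
    """Blend several weak alpha factors into one composite signal.

    Assumes one row per asset and one column per factor; each factor is
    z-scored across assets before equal-weight averaging, so no single
    factor's scale dominates the composite.
    """
    zscores = (factors - factors.mean()) / factors.std()
    return zscores.mean(axis=1)
```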

Signal decay due to competition is a serious concern, and as the alternative data ecosystem evolves, it is unlikely that many datasets will retain meaningful Sharpe ratio signals. Effective strategies to extend the half-life of a dataset's signal content include exclusivity agreements or a focus on datasets that pose processing challenges and thus raise the barriers to entry.

An alternative dataset can be evaluated based on the quality of its signal content, qualitative aspects of the data, and various technical aspects.

Quality of the signal content

The signal content can be evaluated with respect to the target asset class, the investment style, the relation to conventional risk premiums, and most importantly, its alpha content.

Asset classes

Most alternative datasets contain information directly relevant to equities and commodities. Interesting datasets targeting investments in real estate have also multiplied since Zillow successfully pioneered price estimates in 2006.

Alternative data on corporate credit is growing as alternative sources for monitoring corporate payments, including for smaller businesses, are being developed. Data on fixed income and interest-rate projections is a more recent phenomenon but continues to increase as more product sales and price information are being harvested at scale.

Investment style

The majority of datasets focus on specific sectors and stocks, and as such, naturally appeal to long-short equity investors. As the scale and scope of alternative data collection continues to rise, alternative data will likely also become relevant to investors in macro themes, such as consumer credit, activity in emerging markets, and commodity trends.

Some alternative datasets that reflect broader economic activity or consumer sentiment can be used as proxies for traditional measures of market risk. In contrast, signals that capture news may be more relevant to high-frequency traders that use quantitative strategies over a brief time horizon.

Risk premiums

Some alternative datasets, such as credit card payments or social media sentiment, have been shown to produce signals that have a low correlation (lower than 5 percent) with traditional risk premiums in equity markets, such as value, momentum, quality, and volatility. As a result, combining signals derived from such alternative data with an algorithmic trading strategy based on traditional risk factors can be an important building block toward a more diversified risk premiums portfolio.
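
A simple check of this property is to correlate the returns of a candidate signal with the returns of traditional factor portfolios. The sketch below assumes index-aligned pandas inputs with hypothetical names:

```python
import pandas as pd

def premium_correlations(signal_returns: pd.Series,
                         factor_returns: pd.DataFrame) -> pd.Series:
    """Correlation of a signal's returns with each traditional risk premium,
    e.g., columns for value, momentum, quality, and volatility factors."""
    return factor_returns.corrwith(signal_returns)
```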

Alpha content and quality

The signal strength required to justify the investment in an alternative dataset naturally depends on its costs, and alternative data prices vary widely. Data that scores social sentiment can be acquired for a few thousand dollars or less, while a comprehensive and timely dataset of credit card payments can cost several million dollars per year.

We will explore in detail how to evaluate trading strategies driven by alternative data using historical data, in so-called backtests, to estimate the amount of alpha a dataset contains. In isolated cases, a dataset may contain enough alpha signal to drive a strategy on a standalone basis, but the combined use of various alternative and other sources of data is more typical. In these cases, a dataset permits the extraction of weak signals that produce a small positive Sharpe ratio that would not receive a capital allocation on its own but can add value at the portfolio level when integrated with other, similar signals. This is not guaranteed, however, as many alternative datasets do not contain any alpha content.
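
One common way to estimate the alpha in a backtest is to regress the strategy's returns on traditional factor returns, so that the intercept captures the average return not explained by known risk premiums. A minimal sketch using statsmodels, assuming aligned pandas inputs:

```python
import pandas as pd
import statsmodels.api as sm

def estimate_alpha(strategy_returns: pd.Series,
                   factor_returns: pd.DataFrame) -> float:
    """Regress backtested strategy returns on traditional factor returns;
    the intercept estimates the per-period alpha not explained by the factors."""
    X = sm.add_constant(factor_returns)
    result = sm.OLS(strategy_returns, X, missing='drop').fit()
    return result.params['const']
```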

Besides evaluating a dataset's alpha content, it is also important to assess to what extent a signal is incremental or orthogonal, that is, unique to a dataset or already captured by other data, and, in the latter case, compare the costs for this type of signal.
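
One way to quantify how incremental a signal is: regress the new signal on the signals already in use, so that the residual represents its orthogonal component. A minimal sketch under the same assumptions as the previous example:

```python
import pandas as pd
import statsmodels.api as sm

def orthogonal_component(new_signal: pd.Series,
                         existing_signals: pd.DataFrame) -> pd.Series:
    """Remove the portion of a new signal already captured by existing
    signals; the regression residual is the incremental, orthogonal part."""
    X = sm.add_constant(existing_signals)
    result = sm.OLS(new_signal, X, missing='drop').fit()
    return result.resid
```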

Finally, it is essential to evaluate the potential capacity of a strategy that relies on a given dataset, that is, the amount of capital that can be allocated without undermining its success. This is because a capacity limit will make it more difficult to recover the cost of the data.
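
A back-of-the-envelope capacity estimate caps trading at a fraction of average daily volume; the sketch below is purely illustrative, and all parameter values are hypothetical:

```python
def capacity_estimate(avg_daily_dollar_volume: float, n_assets: int,
                      max_participation: float = 0.02,
                      turnover_days: float = 5.0) -> float:
    """Rough upper bound on deployable capital if trades stay below a
    participation cap on daily volume and positions are built over
    `turnover_days` trading days."""
    return avg_daily_dollar_volume * max_participation * turnover_days * n_assets

# E.g., 100 stocks with $50M average daily volume each:
# capacity_estimate(50e6, 100) -> $500M, before costs and market impact
```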

Quality of the data

The quality of a dataset is another important criterion because it impacts the effort required to analyze and monetize it, and the reliability of the predictive signal it contains. Quality aspects include the data frequency and the length of its available history, the reliability or accuracy of the information it contains, the extent to which it complies with current or potential future regulations, and how exclusive its use is.

Legal and reputational risks

The use of alternative datasets may carry legal or reputational risks, especially when they include the following items:

  • Material non-public information (MNPI), because it implies an infringement of insider trading regulations
  • Personally identifiable information (PII), primarily since the European Union has enacted the General Data Protection Regulation (GDPR)

Accordingly, legal and compliance requirements need a thorough review. There could also be conflicts of interest when the provider of the data is also a market participant that is actively trading based on the dataset.

Exclusivity

The likelihood that an alternative dataset contains a signal that is sufficiently predictive to drive a strategy on a standalone basis, with a high Sharpe ratio for a meaningful period, is inversely related to its availability and ease of processing. In other words, the more exclusive and harder to process the data, the better the chances that a dataset with alpha content can drive a strategy without suffering rapid signal decay.

Public fundamental data that provides standard financial ratios contains little alpha and is not attractive for a standalone strategy, but it may help diversify a portfolio of risk factors. Large, complex datasets will take more time to be absorbed by the market, and new datasets continue to emerge on a frequent basis. Hence, it is essential to assess how familiar other investors already are with a dataset, and whether the provider is the best source for this type of information.

Additional benefits of exclusivity, or of being an early adopter of a new dataset, may arise when a business has just begun to sell exhaust data that it generated for other purposes. This is because it may be possible to influence how the data is collected or curated, or to negotiate conditions that limit access for competitors, at least for a certain time period.

Time horizon

A more extensive history is highly desirable for testing the predictive power of a dataset in different scenarios. The available history varies greatly, from several months to several decades, and has important implications for the scope of the trading strategy that can be built and tested based on the data. We mentioned some ranges of time horizons for different datasets when introducing the main types of sources.

Frequency

The frequency of the data determines how often new information becomes available and how differentiated a predictive signal can be over a given period. It also impacts the time horizon of the investment strategy and ranges from intra-day to daily, weekly, or an even lower frequency.
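
For example, an intraday signal usually needs to be downsampled before it can feed a lower-frequency strategy; a minimal pandas sketch using a synthetic minute-level series:

```python
import numpy as np
import pandas as pd

# Synthetic minute-level signal standing in for a vendor feed
idx = pd.date_range('2021-01-04 09:30', periods=390, freq='min')
intraday_signal = pd.Series(np.random.randn(390), index=idx)

# Downsample to the daily frequency a lower-turnover strategy consumes;
# whether .last(), .mean(), or .sum() is appropriate depends on how
# the signal is defined
daily_signal = intraday_signal.resample('D').last()
```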

Reliability

Naturally, the degree to which the data accurately reflects what it intends to measure or how well this can be verified is of significant concern and should be validated by means of a thorough audit. This applies to both raw and processed data, where the methodology used to extract or aggregate information needs to be analyzed, taking into account the cost-benefit ratio for the proposed acquisition.

Technical aspects

Technical aspects concern the latency, or delay of reporting, and the format in which the data is made available.

Latency

Data providers often deliver their data in batches, and delays can arise from how the data is collected, from subsequent processing and transmission, and from regulatory or legal constraints.

Format

The data is made available in a broad range of formats, depending on the source. Processed data comes in user-friendly formats that can be easily integrated into existing systems or queries via a robust API. On the other end of the spectrum are voluminous sources, such as video, audio, or image data, or data in a proprietary format, which require more skill to prepare for analysis but also present higher barriers to entry for potential competitors.