By Rick Roche, CAIA, Managing Director of Little Harbor Advisors, LLC
Part 1 of a two-part series on Alternative Investment Data (Alt-Data) in the COVID Era.
In Part 1, the author makes the case for high-frequency, short-interval Alt-Data while discussing three primary drawbacks of interpreting official economic statistics amid a global pandemic.
The profound toll of the Global Coronavirus Crisis is being measured by lives lost and its accompanying economic fallout in terms of staggering unemployment, lost income, bankruptcy, and recession. The novel Coronavirus pandemic has already had an enormous impact on global GDP and the investment climate. It is an important reminder of the humbling and daunting task that an epidemic represents to investment professionals: we are essentially tasked with shaping favorable outcomes and solutions for all of the underlying constituents we serve.
A core belief is that the Coronavirus and its disease, COVID-19, remain – above all else – a human tragedy. The virus knows no borders and reminds us of our existential vulnerability. Sadly, this pandemic also highlights society’s inequalities, as rates of infection and death are often worse for lower income-earners, people of color and older individuals.
The accompanying rapid regime changes in 1Q and 2Q 2020 demonstrate that many quantitative investment models–whether programmed by machines or humans–have run amuck. “If, statistically speaking, something is really an unknown, then all models–human models and quant models–will struggle,” stated Miguel Noguer Alonso, PhD, co-founder of the Artificial Intelligence Finance Institute.
Using real-time, high-frequency alternative data may enable asset managers to rapidly adjust exposures to cope with the consequences of events previously unencountered. Alternative data (Alt-Data) flows continuously, in ways that conventional economic indicators, which appear only periodically and lag on-the-ground reality, do not.
Select alternative datasets are timelier and may be more reliable than “official” statistics. For example, data on air quality, auto congestion, public transit, and foot traffic were used to determine how quickly China’s workers returned after factory closures. It is important to establish “real-time base rates” of economic activity in other regions affected and to establish comparable periods of market dislocations and turbulence.
In the context of this paper, the author uses the following working definition of “alternative investment data” or Alt-Data: “Alternative data refers to data and information outside the usual scope of securities pricing (tick data), company fundamentals, or macroeconomic indicators.” For example, social media is ranked as one of the top categories of digital data used by hedge funds and select discretionary managers (refer to illustration above). Asset managers hire experts or contract with Alt-Data vendors skilled at applying Natural Language Processing (NLP) to financial news and unstructured documents such as analysts’ earnings estimates and surprises, SEC 10-K and 13F filings, patent filings, etc.
Novel digital datasets include web-scraped product reviews, credit card purchases, emailed purchase receipts, satellite imagery in the petroleum industry, and retailer foot traffic via smartphone sensors. Alternative data generated by business includes company exhaust data, credit card purchases, and emailed purchase receipts. Sensor-generated data includes satellite images, foot traffic via smartphone GPS, and the Internet of Things (IoT). The definition of alternative data changes over time. As a data source becomes more widely available, it becomes “mainstream” and is no longer considered “alternative”.
Alternative Data Wrangling
“Data is the new oil? No. Data is like raw land or undeveloped real estate. It needs value-added improvements and enhancements.”
– Rick Roche, CAIA
Financial market data is fundamentally different from the traditional datasets used to train machine learning algorithms. In machine learning applications such as image or text recognition, the underlying data generation is relatively stable over time. But financial market data does not have fixed physical properties like dice (in gambling) or elements in the periodic table. Securities market data is highly unstable, replete with market microstructure noise and low signal content.
The Economist magazine published a headline in 2017, blaring, “Data, not oil, is the world’s most valuable resource”. Although this analogy is widely used, I respectfully disagree. Using the analogy of data as oil is not helpful for asset managers considering alternative datasets. Oil itself is valuable, marketable, and tradeable – the most difficult part is extracting petroleum.
Data has no intrinsic value of its own. Data only generates value or decreases cost when it is being applied to a process or a decision. Data is more akin to raw land or underdeveloped property – there must be value-adding activities to create value. In real estate, the three most important factors are location, Location and LOCATION. The performance value of data is the relative change in Key Performance Indicators (KPIs) over time. Regardless of how advanced a machine learning model may be, the model’s ability to add value is contingent on the quality of datasets used in training, validation, and testing.
In addition to being accurate (accounting for missing or non-existent data) and clean, data must be meaningful and additive in the investment management industry. Certain basic facts or data that arrive every day must meet that test; for example, a properly labeled security issuer and tick data that shows the price and volume of every print. However, the universe of alternative datasets is expanding so quickly that many of them do not go back far enough to reveal patterns or capture signals. With limited historical examples, early adopters of Alt-Data have discovered that back-testing is one of the most challenging obstacles they need to overcome.
Data preparation and validation is a holistic process that requires diligent screening and handling. Biases in data must be acknowledged and addressed so as not to trigger algorithmic bias. Inaccuracies must be corrected, and missing data must be accounted for. Meaningful, value-added data is not achieved without conscientious processing, aggregation, and integration into its respective alpha, position-sizing, or risk management model.
When alternative data is used by asset managers, various KPIs are evaluated, such as uncorrelated alpha, more consistent returns, lower drawdowns and quicker recovery periods. Most investment managers want to determine if adding alternative data to existing models helps to increase predictability. Managers are using Alt-Data not so much to replace existing data sources, but instead to augment them.
Alt-Data is fungible. The value of a dataset can differ significantly between firms. For example, large quant funds may insist that candidate datasets have plentiful tickers in which to trade. Conversely, a boutique discretionary trader may seek datasets with fewer trade candidates but much deeper dives into firm demographics. Asset managers can derive value from alternative data for their specific purposes without eroding the potential value available to other strategists.
Trade-worthy data has an “expiration date” due to somewhat efficient markets and arbitrage. Traditional datasets, such as historical tick data and SEC regulatory filings (10-Qs & 10-Ks), have high signal potential but are over-harvested. The more broadly a dataset is used, the less likely it will exhibit a strong Sharpe ratio. As a result, a growing number of asset managers seek to exploit non-price-based datasets in sophisticated prediction models. This is one of the most compelling reasons to consider using alternative investment data and BIG Datasets.
The unruly digital data of the Web is now called BIG Data. With etymological sleuthing, we discovered that the term and concept of “Big Data” was evidently coined in the 1990s by John R. Mashey. Mashey, a U.S. computer scientist, entrepreneur, and blogger, frequently used the term “Big Data” while working as Chief Scientist at Silicon Graphics (SGI). Mashey himself said his role was to popularize the term with a simple, short phrase that conveyed computing advances. The now-ubiquitous term Big Data is often used interchangeably with alternative data.
Talk about BIG. There are Big Dollars in Big Data. In March 2020, The Economist estimated that “Data” capital is worth between $1.4 trillion and $2 trillion in the United States alone. What about the spend on Alternative Data? Precise numbers are hard to come by. AlternativeData.org says that there are roughly 450 Alt-Data vendors. They estimated that in 2019, the buy-side spend on alternative datasets was $1.1 billion. This figure included money spent on data aggregators plus employees who wrangle alternative data. AlternativeData.org forecasts a $1.7 billion spend in 2020, but the COVID crisis has undoubtedly impacted asset managers’ personnel and data acquisition budgets.
The authors of the “J.P. Morgan 2019 Alternative Data Handbook” state that the “Costs of alternative datasets vary widely”. J.P. Morgan estimates that sentiment analysis can be obtained for a few hundred or a few thousand dollars. Comprehensive credit card exhaust and emailed receipts can cost up to a few million dollars per year. In this report, the authors suggest that most Alt-Datasets have small positive Sharpe ratios, making them unsuitable on a standalone basis. Despite this inferred drawback, they believe alternative dataset signals can be combined with other signals to yield a “viable portfolio level strategy”.
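The intuition behind combining weak signals can be sketched with a textbook identity. This is not the J.P. Morgan authors' method; it is a minimal illustration that assumes every signal has the same standalone Sharpe ratio and volatility, a common pairwise correlation, and equal weights. The function name and the example inputs (a 0.30 Sharpe, nine signals) are hypothetical.

```python
import math


def combined_sharpe(single_sharpe: float, n_signals: int,
                    avg_correlation: float = 0.0) -> float:
    """Sharpe ratio of an equal-weighted blend of n_signals signals.

    Simplifying assumptions (illustrative only): identical standalone
    Sharpe ratio and volatility for every signal, and one common
    pairwise correlation between any two signals.
    """
    if n_signals < 1:
        raise ValueError("n_signals must be at least 1")
    # Variance of the blend scales with (1 + (n - 1) * rho) / n.
    denom = 1.0 + (n_signals - 1) * avg_correlation
    if denom <= 0:
        raise ValueError("correlation too negative for this many signals")
    return single_sharpe * math.sqrt(n_signals / denom)


# Hypothetical inputs: nine signals, each with a standalone Sharpe of 0.30.
uncorrelated = combined_sharpe(0.30, 9, avg_correlation=0.0)
correlated = combined_sharpe(0.30, 9, avg_correlation=0.25)
print(f"Uncorrelated blend: {uncorrelated:.2f}")
print(f"Blend with 0.25 pairwise correlation: {correlated:.2f}")
```

Under these assumptions, the benefit of blending shrinks quickly as the signals become more correlated, which is why uncorrelated datasets are prized.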
Flying Blind: A Cure for Coronavirus-Infected Data
Following the COVID Crash and market recovery, we have learned that traditional indicators on the health of the economy and survey data are grossly inadequate. Traditional indicators are no match for a fast-paced, virus-induced global slowdown. The chaotic, COVID-induced economic coma created an unprecedented global slowdown in speed, scale, and scope. The U.S. equity market fallout was, quite literally, unprecedented.
It took the S&P 500 only 22 trading days to fall 30% from its record high reached on Feb. 19, making it the fastest drop of such magnitude in history. And in mid-August 2020, the S&P 500 experienced the quickest recovery from bear-market territory in its history, according to Dow Jones Market Data.
There are three primary problems in collecting and interpreting macro-economic data during a pandemic. The first is annualization. In the United States, most official economic statistics are “annualized”. Annualizing works by stating that whatever happens in one quarter will continue, exactly in the same way, for a year. This was never a great idea in ‘normal’ times and is an absurd practice now.
For instance, “real” gross domestic product (GDP) decreased at an annual rate of 32.9% in 2Q 2020. What is the estimate for the next quarter (3Q 2020)? The Federal Reserve Bank of Atlanta produces a “nowcast” estimate for quarterly GDP. Their “GDPNow” model estimate for real GDP growth in 3Q 2020 is 32.0%.
Both numbers are ridiculous. At year-end 2019, the United States GDP was an estimated $21.45 trillion. Because GDP data is annualized, the “official statistics” grossly overstated the decline of our $21 trillion+ economy in 2Q 2020, and the Atlanta Fed’s GDPNow forecast of 32% growth in 3Q 2020 is equally silly.
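To see how much annualizing inflates a single quarter's move, the two headline figures above can be converted back to quarter-over-quarter changes. The sketch below assumes the standard compounding convention, in which the annualized rate is the quarterly growth factor raised to the fourth power, minus one. The function names are illustrative.

```python
def deannualize(annual_rate: float, periods_per_year: int = 4) -> float:
    """Convert a compounded annualized rate back to a per-period rate."""
    return (1.0 + annual_rate) ** (1.0 / periods_per_year) - 1.0


def annualize(period_rate: float, periods_per_year: int = 4) -> float:
    """Compound a per-period rate as if it repeated for a full year."""
    return (1.0 + period_rate) ** periods_per_year - 1.0


# Headline annualized figures quoted in the text.
q2_quarterly = deannualize(-0.329)   # 2Q 2020 advance estimate
q3_quarterly = deannualize(0.32)     # GDPNow estimate for 3Q 2020

# Level of output after both quarters, relative to 1Q 2020 = 1.0.
level_after_q3 = (1.0 + q2_quarterly) * (1.0 + q3_quarterly)

print(f"2Q 2020: {q2_quarterly:.1%} quarter over quarter")
print(f"3Q 2020 nowcast: {q3_quarterly:.1%} quarter over quarter")
print(f"Output after 3Q relative to 1Q: {level_after_q3:.1%}")
```

On this convention, the 32.9% annualized decline corresponds to a drop of roughly 9.5% in the quarter itself, and the 32% annualized rebound to a gain of roughly 7.2%, which would still leave output about 3% below its 1Q 2020 level.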
The second principal problem with most economic data is that it is survey-based and is issued with significant lag times. In trader terms, the lag is called latency–the time it takes from the moment a signal is sent to its receipt. Consumer confidence surveys are generally unreliable estimates of consumer sentiment. Surveys are a nuisance. People are surveyed to death. Most folks can’t be relied upon to complete surveys, much less be truthful when filling them out. During a pandemic, you probably don’t want to rely on the opinions of anyone eager enough to give you their opinion. Add in the fact that most surveys are issued with consequential delays and you end up with stale data–or worse still–stale opinions.
Here is what the Philadelphia Fed recently said about their survey. “Given the sudden, extreme impact of the COVID-19 outbreak on initial unemployment claims in recent weeks, our researchers’ standard approach for estimating the 6-month change in coincident indexes is not appropriate. Therefore, the Philadelphia Fed has suspended the release of the state leading indexes indefinitely.”
The third pandemic phenomenon is that economic data is slow to recognize structural breaks in consumer habits. After “stay-at-home” and lockdown orders were issued, quarantined masses flocked online to Amazon, Alibaba and online grocery sites. Since the pandemic’s onset, online grocery buying has soared by 30% globally. Grocery chain operator Kroger Co. reported online sales grew 92% year over year in its fiscal first quarter 2020, ended May 2, 2020. At Walmart Inc., America’s largest grocer, ecommerce sales shot up by 74% in 1Q 2020 (ended May 1) as the coronavirus outbreak drove more shoppers online for home delivery or pick up outside Walmart stores.
At the same time, brick-and-mortar same-store sales tanked in March 2020 and are still tepid as of this writing. Online sales overall were up 55% year-over-year in July 2020 vs. July 2019. And nearly half (49.2 percent) of e-commerce sales in 2020 are made through mobile devices. Online sales via mobile devices represent a structural break from shoppers’ ingrained habit of going to malls. It may be years–if ever–before same-store sales return to 4Q 2019 levels.
During the ongoing coronavirus crisis, traditional economic metrics will be less useful and less reliable. Expect frequent data revisions and updates. Asset managers should be cautious when using “official” economic data in their macro-economic models. Data that is not annualized, but rather reported at higher frequency and shorter intervals, is better suited for analysis and prediction.
Rick Roche is a Managing Director at Little Harbor Advisors, LLC. Little Harbor Advisors (LHA) is a sponsor of alternative investment strategies. Rick is also the Founder of Roche Invest AI, LLC, a consultancy that promotes the use of machine learning and alternative investment data in quantitative models. He holds Series 3 (Commodities), 7, 63 and 65 licenses. He earned his Chartered Alternative Investment Analyst (CAIA) charter designation in 2014.
 McVey, H., “Phase II: The Next Chapter,” KKR, May 19, 2020 https://www.kkr.com/global-perspectives/publications/phase-ii-next-chapter
 Mannix, R., “Alt Data Lends A Different Light to Coronavirus Impact,” Risk.net, March 6, 2020
 Ryll, L. et al., Cambridge Centre for Alternative Finance (CCAF), “Transforming Paradigms A Global AI in Financial Services Survey,” Jan. 2020
 Ross, C., Editor, “The Hidden Data Economy: Companies Need to Get Serious About Managing And Leveraging Data.” The Economist, June 1, 2020
 Noble, L., & Balint, A., “Casting The Net: How Hedge Funds are Using Alternative Data,” AIMA and SS&C, 2020, page 30
 Diebold, F. X., “On the Origin(s) and Development of the Term ‘Big Data’,” Sept. 21, 2012 https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2152421
 Lohr, S., “The Origins of ‘Big Data’: An Etymological Detective Story,” New York Times, Feb. 1, 2013
 Ibid, Ross, C., “The Hidden Data Economy,” The Economist, June 1, 2020
 Kolanovic, PhD, M., & Smith, PhD R., “Big Data and AI Strategies: 2019 Alternative Data Handbook,” Oct. 2019, page 7
 Ibid, Kolanovic, M. & Smith, R.
 Ibid, Kolanovic, M. & Smith, R.
 DeCambre, M., “S&P 500 logs first record close in 6 months and marks fastest recovery in history amid coronavirus: ‘record breaking and heartbreaking’,” MarketWatch, Aug. 18, 2020
 “Gross Domestic Product, 2nd Quarter 2020 (Advance Estimate) and Annual Update,” U. S. Bureau of Economic Analysis, July 30, 2020
 “Gross Domestic Product, Fourth Quarter and Year 2019 (Advance Estimate),” U. S. Bureau of Economic Analysis, Jan. 30, 2020
 “Philadelphia Fed Suspends the Release of the State Leading Indexes,” www.PhiladelphiaFed.org, accessed on 08/10/20
 Melton, J., “The Coronavirus Pandemic Lifts Global Online Grocery Sales,” Digital Commerce 360, July 22, 2020
 Ibid, Melton, J. Digital Commerce 360.
 Ibid, Melton, J. Digital Commerce 360.
 Crets, S., “Online Sales Taper Off in July as Retail sales Reopen,” Adobe Analytics, Aug. 2020, Digital Commerce 360, Aug. 11, 2020
 Kristensen, E., “15 Eye-Opening Online Shopping Statistics for 2020,” Statista 2020, Sleeknote.com, April 15, 2020