Own the Future of AI

Posted on: Jan 15, 2024

Data has become the new oil, powering everything from AI technologies to everyday decision-making. But, much like the oil industry, the data supply chain is mired in complexities and inequalities.

We believe blockchain should be an integral piece of AI. The union fosters a more accountable, democratic, and equitable AI landscape, underpinned by a foundation of trust built on indisputable proof and shared benefits. Starting with data mining, the combined use of blockchain and AI from the outset ensures a system that continuously evolves and improves in a synergistic manner.

Inefficient Markets are Inequitable

Two of the more critical components in the supply chain are sourcing and structuring. Sourcing data is akin to drilling for oil; it involves identifying and extracting information from various sources. And just as not all oil is created equal (different types of crude require different refining processes), not all data is created equal; NASDAQ data, for instance, differs from Reddit data. Structuring is needed because these disparate sets of data serve different purposes. It parallels the refining process in the oil industry, organizing and processing raw material into a format that is useful and meaningful for everyday use. For example, raw NASDAQ metadata is refined into purpose-built models for hedge funds, while raw conversational HTML data from Reddit feeds the large language models behind GPT-style chatbots.

The problem, however, is that markets for sourcing and structuring data are built on asymmetric information. Companies use their competitive advantages to artificially raise prices to increase their profit margins. Consequently, you, the user, are not fairly rewarded for your data, and companies that pay for the data do so at inflated prices.

This is happening in AI. Companies scrape vast amounts of web data to train their algorithms, yet the users who contributed to the process receive little to no compensation. There hasn't been a solution, owing to the opaqueness of sourcing (the monopolistic web scraping industry) and the absence of a reliable system linking the origin of the data directly to the data used in training models.

Efficient markets for abundant resources, like data, should be competitive and fair, featuring low costs and greater accessibility, thus driving economic growth. What is needed is transparency and verticalization of the data stack—an open network that sources and structures data for the customers who use it, and properly rewards those who contribute to it. Hence, a single source of truth for data.

The Constraints of Artificial Intelligence

To train a Large Language Model (LLM), you need two things: data and compute power. The latter is served by companies like Nvidia with their scarce H100 GPUs, and can be consumed at a premium in hyperscale marketplaces such as AWS, Azure, and GCP. Right now, there aren't enough GPUs for everyone who needs them, which is causing problems for AI companies. But in the coming years, supply is expected to improve.

The often overlooked problem in the AI industry involves data collection and refinement. In the relatively near future, the quality and quantity of information, not compute, will be the only competitive advantages. Let's break this down further.

The traditional method for training LLMs, known as "The Kitchen-Sink Approach," advocates using extensive amounts of data to help the model identify patterns. While effective to some degree, this method is inefficient: it requires enormous amounts of data and computational power, and it is prone to overfitting, limiting its real-world applicability.

https://www.telm.ai/blog/demystifying-data-qualitys-impact-on-large-language-models/

To challenge the Kitchen-Sink Approach, there is a school of thought that believes in feeding models "Selective Ingredients," or extremely high-quality data. Using the optimal number of predictors allows LLMs to learn faster while operating more cheaply. Additionally, Selective Ingredients help reduce problems like biases, mistakes, and unwanted noise.

An experiment by TELMAI revealed that when noise (i.e., bad data) was added to the training set, prediction quality fell from 89% precision to 72%. It's clear that high-caliber data is necessary for efficient model improvement.
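The effect is easy to reproduce at toy scale. The sketch below is a minimal illustration, not TELMAI's experiment: the synthetic dataset, noise rate, and model choice are our assumptions, but it shows how flipping a share of training labels drags precision down.

```python
# Toy illustration of label noise degrading model precision.
# Not TELMAI's experiment: dataset, noise level, and model are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def precision_with_noise(noise_rate: float) -> float:
    """Flip a fraction of training labels, then measure precision on clean test data."""
    rng = np.random.default_rng(0)
    y_noisy = y_train.copy()
    flip = rng.random(len(y_noisy)) < noise_rate
    y_noisy[flip] = 1 - y_noisy[flip]          # corrupt a share of the labels
    model = LogisticRegression(max_iter=1000).fit(X_train, y_noisy)
    return precision_score(y_test, model.predict(X_test))

print(f"clean labels: {precision_with_noise(0.0):.2f}")
print(f"20% noise   : {precision_with_noise(0.2):.2f}")
```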

So, the question for AI companies then becomes: how do you source sufficient amounts of high-quality data?

The Problems: Drilling and Refining Data

Overview

AI companies collect data from both internal and external sources. A material reason AI companies have partnered with big tech is its grip on immense amounts of internal data, largely its users'. For instance, OpenAI has partnered with Microsoft, while Anthropic has collaborated with Google and subsequently Amazon. However, this is plutocratic: big tech gets bigger; users are not compensated for their contributions to datasets.

Another route AI giants have taken, particularly for newer material, is scraping the web. And the way to do that is through proxy services.

Proxy services sell companies the internet bandwidth they need to scrape the web. They operate as intermediaries, hiding the user’s real IP address, and often replacing it with an IP from a datacenter. This process helps maintain the user’s online anonymity while also facilitating web scraping for companies.

Proxy servers come in various types, including data center, residential, and mobile proxies. Among these, residential and mobile proxies are the most valuable. Their primary advantage lies in their reduced likelihood of being detected as malicious and their effectiveness in bypassing geo-restrictions. As a result, they are particularly useful tools for web scraping, enabling the collection of more accurate data from websites.
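Mechanically, routing a scrape through a proxy just means pointing the HTTP client at the provider's gateway. Here is a minimal sketch using Python's `requests` library; the gateway hostname, port, and credentials are placeholders, not any specific provider's endpoints.

```python
# Minimal sketch of routing a scraping request through a residential proxy.
# The gateway host, port, and credentials below are hypothetical placeholders.
import requests

PROXY_USER = "customer-123"                       # hypothetical account credentials
PROXY_PASS = "secret"
PROXY_HOST = "residential.example-proxy.com"      # hypothetical gateway
PROXY_PORT = 8000

proxies = {
    "http":  f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}",
    "https": f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}",
}

# The target site sees the residential exit IP, not the scraper's real address.
response = requests.get("https://example.com", proxies=proxies, timeout=30)
print(response.status_code)
```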

Drilling the Data

Proxy service providers get user IP addresses in two ways: through an in-house website or app, or through a software development kit (SDK) sold to other websites and apps.

If a user shares their IP address through the provider's own website or app, they likely get paid a minimal amount for it. For example, right now a user can earn about 25 cents for every gigabyte (GB) of bandwidth shared via a proxy service's consumer app. The average user has about 656GB of unused bandwidth each month, so they could make around $160 a month. But proxy service providers sell this bandwidth for $2 to $40 per GB, depending on whether the traffic is sent through their residential or mobile proxy networks.
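Plugging the figures from this paragraph into a quick back-of-the-envelope calculation makes the gap between what users earn and what providers resell the same bandwidth for explicit.

```python
# Back-of-the-envelope math using the figures cited above.
user_rate = 0.25                       # USD paid to the user per GB shared
unused_gb = 656                        # average unused bandwidth per user per month
resale_low, resale_high = 2.0, 40.0    # USD per GB the provider charges customers

user_earnings = user_rate * unused_gb                # ~ $164/month to the user
provider_revenue_low = resale_low * unused_gb        # ~ $1,312/month at the low end
provider_revenue_high = resale_high * unused_gb      # ~ $26,240/month at the high end

print(f"user earns        ~${user_earnings:,.0f}/month")
print(f"provider resells   ${provider_revenue_low:,.0f}-${provider_revenue_high:,.0f}/month")
print(f"user's share       {user_earnings / provider_revenue_low:.0%} at best")
```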

https://scrapeops.io/web-scraping-playbook/residential-mobile-proxies-economics/

If it's through a third-party app, like an ad blocker or a coupon extension, users get access to the service for free in return for a portion of their unused bandwidth. Proxy services prefer this approach because it's extremely profitable, with profit margins reaching 99.99%. However, this practice raises ethical issues that these services often choose not to disclose to the public.

Refining the Data

The first issue centers on individuals not receiving adequate compensation for their computing resources, which are vital for web scraping. A lesser-known but equally critical problem is the absence of rewards for the data itself: compensation for helping source the data used in the refined models. This ties back to the earlier point that the specificity of data ('Selective Ingredients') can significantly impact the effectiveness of a model compared to a less curated dataset ('Kitchen Sink').

While proxy services facilitate data access, they do not structure these datasets; that responsibility shifts to AI companies and other third parties (e.g., Yipitdata). However, these companies typically overlook compensating the users who helped source the data in the first place.

Blockchain technology presents a more equitable solution. It can incentivize users not only to share their unused bandwidth but also to have the data attributed back to them, thereby rewarding their contributions to the fullest extent. The collected data can then be refined and compiled into datasets for use in LLMs.
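What could that attribution look like in practice? One hypothetical shape (our illustration, not Grass's published design) is a signed record per scraping request, tying the contributing node to the dataset its response ends up in, which an oracle could later anchor on-chain.

```python
# Hypothetical shape of a per-request attribution record, sketched for illustration.
# Field names and the hashing scheme are assumptions, not Grass's actual design.
from dataclasses import dataclass, asdict
from hashlib import sha256
import json, time

@dataclass
class ContributionRecord:
    node_id: str          # the user/node whose bandwidth served the request
    request_url: str      # what was scraped
    response_hash: str    # fingerprint of the returned payload
    dataset_id: str       # which refined dataset the payload was folded into
    timestamp: int        # when the request was served

def record_contribution(node_id: str, url: str, payload: bytes, dataset_id: str) -> ContributionRecord:
    """Build the record an oracle could anchor on-chain for later reward attribution."""
    return ContributionRecord(
        node_id=node_id,
        request_url=url,
        response_hash=sha256(payload).hexdigest(),
        dataset_id=dataset_id,
        timestamp=int(time.time()),
    )

rec = record_contribution("node-abc", "https://example.com/page", b"<html>...</html>", "dataset-news-v1")
print(json.dumps(asdict(rec), indent=2))
```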

At AA, we advocate for a decentralized and verticalized approach – removing walled gardens, promoting collaboration from many parties, and rewarding users in an equitable fashion. Centralization results in asymmetric information and unfair advantages against many parties, including the users. Decentralization means trustlessness; there are no unfair advantages – just an open, fair marketplace.

Grass is the Optimal Approach

Grass is redefining the proxy services industry, altering how networks operate and users are compensated. This decentralized network not only enables transparent accounting of unused bandwidth but also ensures secure and fair compensation for the data generated.

The network operates akin to residential proxy services, renting out users’ unused bandwidth for web scraping purposes. However, instead of earning a fraction of the value, users receive 100% of the rewards for their bandwidth contribution.

From Grass’ core team

But that's not all – uniquely – Grass offers extra rewards determined by the value of the data extracted through the user's unused bandwidth. Selective Ingredients get compensated! Its decentralized design, combined with oracles and on-chain proof of requests, enables Grass to authenticate the origin and use of data. From our research, this marks the first time proper attribution of data extraction and usage has been accomplished.

With the web scraping market projected to grow to $16 billion by 2035, Grass's commitment to pass 100% of rewards to users points to a meaningful increase in user earnings beyond the average market rate of $0.24 per GB. Rewards are expected to rise even higher as data attribution accrues back to users (although specific figures are not yet available). This model highlights the effectiveness of blockchain technology in creating fair marketplaces.

https://twitter.com/getgrass_io/status/1732082508840436062

And the benefits don't stop with users. When users are better compensated, more bandwidth is connected to the network, generating more data. This data is then organized into datasets readily available to AI companies. The best part? These companies can choose either 'Selective Ingredients' or the 'Kitchen Sink,' whichever they prefer! Prices fall and first-rate datasets become abundant, leading to improved AI products and equipping users with more effective tools to foster a more productive and efficient world.

At AA, we're idealists. That's why we're writing this and why we work in this industry. This is the world we want to help build: using technology in transparent marketplaces to turn inefficiencies into efficiencies, rewarding users fairly, and accelerating technology responsibly. Grass is a leader in this realm, and we are excited to see it progress as a standard in the data industry.

Check out Grass for yourself!


Advisors Anonymous (AA)

Advisors Anonymous specializes in supporting early-stage web3 startups with commercialization, growth, strategy, marketing, and community engagement. We want to help you get your first customer, foster strategic partnerships, raise your next round, and find your marketing niche. Our team spent the early years of their careers in Big Tech and on Wall St and are now building, scaling, and investing in Web3 companies. If we see alignment with technical founders interested in our services, we embed ourselves in your team and work with you to help scale your business.

This is our first publication with many more coming!