Effortless Data Scraping from Any Website with Advanced Automation
Introduction
In the digital age, access to data is crucial for businesses, developers, and analysts. Data scraping has become an essential tool for extracting information from websites efficiently. If you've ever wondered how to scrape data from any website using just the URL and specific fields, this article is for you. We will explore an advanced application that automates the scraping process, allowing you to collect data from sites like Hacker News or car listings with ease.
What Is Data Scraping?
Data scraping involves extracting data from websites where the information is displayed in a structured format. This process can be applied to various types of data, including text, images, and links. With the right tools, scraping can be both easy and cost-effective.
How This App Works
The application demonstrated in the transcript below allows users to scrape any website by following these simple steps:
- Input URL: Enter the target website URL.
- Define Fields: Specify the fields you want to extract, such as title, points, creator, date of posting, and number of comments.
- Click Scrape: Initiate the scraping process.
- Receive Data: The app will retrieve the relevant information and display it in a user-friendly table format.
Key Features of the Application
- Versatile Website Compatibility: The app works with various sites, including news platforms, car listings, and more.
- Data Export Options: After scraping, users can export the data in multiple formats like JSON, Excel, or Markdown.
- Cost-Effective: Using lightweight AI models such as GPT-4o mini offers a budget-friendly way to scrape data without writing a custom script for each site.
- Token Management: The app tracks token usage to ensure transparent pricing for the data extraction process.
Detailed Breakdown of the Scraping Process
1. Setting Up the Project Environment
To begin scraping, you’ll need to set up your coding environment with several libraries (a minimal install and import sketch follows this list):
- Beautiful Soup: For parsing HTML and XML documents.
- Pandas: For data manipulation and analysis.
- Selenium: To automate web browser interaction.
- OpenAI Libraries: For leveraging AI models to enhance scraping precision.
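A minimal environment sketch, assuming the usual PyPI package names (adjust to your own setup and API key handling):

```python
# pip install beautifulsoup4 pandas selenium html2text tiktoken pydantic openai streamlit streamlit-tags

from bs4 import BeautifulSoup   # parse and prune HTML
import pandas as pd             # build tables and export to Excel/CSV
from selenium import webdriver  # drive a real browser to avoid blocks
from openai import OpenAI       # structured-output extraction with GPT models

client = OpenAI()  # expects OPENAI_API_KEY in the environment
```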
2. The Scraping Workflow
a. Input and URL Handling
You start by providing the URL of the page you want to scrape. The application is designed to handle URLs smoothly, ensuring it can navigate to the required page without being blocked.
b. Field Definition
Next, you’ll define the fields of interest. This could include:
- Title
- Number of points
- Creator information
- Date of posting
- Comment counts
c. Scraping the Data
Once the URL and fields are set, clicking the "scrape" button starts the extraction process. While scraping, the app shows a notice reading "Please wait, data is being scraped..." and then displays the results in a structured table, making them easy to analyze.
3. Exporting the Data
After scraping, the user can choose any of the following export options (a short pandas-based sketch follows the list):
- Download as JSON: For programmatic access and further manipulation.
- Open in Excel: For data analysis using familiar spreadsheet tools.
- Export to Markdown: For documentation purposes. Each option provides flexibility depending on your needs.
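As a rough illustration, assuming the scraped rows are already a list of dictionaries, the three export paths map onto the standard library and pandas like this:

```python
import json
import pandas as pd

rows = [{"title": "Example post", "points": 120, "creator": "someuser", "comments": 45}]  # sample rows
df = pd.DataFrame(rows)

with open("scraped.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)                  # JSON for programmatic access

df.to_excel("scraped.xlsx", index=False)          # Excel (requires openpyxl)

with open("scraped.md", "w", encoding="utf-8") as f:
    f.write(df.to_markdown(index=False))          # Markdown table (requires tabulate)
```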
4. Token Calculation and Cost Efficiency
During scraping, the application not only collects data but also tracks token usage. For instance, with 3,868 input tokens and 1,500 output tokens, the total cost comes to roughly $0.0015, significantly lower than the cost of traditional scraping approaches.
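As a quick sanity check, assuming GPT-4o mini rates of roughly $0.15 per million input tokens and $0.60 per million output tokens (verify against current OpenAI pricing), the arithmetic works out like this:

```python
INPUT_RATE = 0.15 / 1_000_000    # assumed USD per input token
OUTPUT_RATE = 0.60 / 1_000_000   # assumed USD per output token

cost = 3_868 * INPUT_RATE + 1_500 * OUTPUT_RATE
print(f"${cost:.4f}")  # -> $0.0015
```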
Addressing Common Concerns
Consistency in Data Extraction
Some users raised concerns about getting inconsistent field names from run to run. With OpenAI's structured outputs, the application now enforces uniform field names, which greatly improves reliability.
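The idea, sketched with a fixed Pydantic schema (field names here are only examples): passing a schema like this as the response format pins the output keys, so every run comes back with identical field names.

```python
from pydantic import BaseModel

class Listing(BaseModel):
    title: str
    points: int
    creator: str
    date_posted: str
    comments: int

class ListingsContainer(BaseModel):
    listings: list[Listing]  # one entry per scraped row
```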
Alternatives to Using Libraries
While libraries like Firecrawl are helpful, users can implement scraping without them by reading the raw HTML themselves. This approach gives more control but requires extra code for error handling, captcha avoidance, and data consistency.
The Future of Scraping Technologies
As AI continues to evolve, data scraping is likely to advance with it. Traditional methods may not suffice to keep up with the rapidly changing landscape of data technologies; therefore, incorporating AI into scraping processes can provide an edge.
Conclusion
The ability to scrape data from any website efficiently is no longer a daunting task due to innovative applications. By utilizing the right tools and methods laid out in this article, data collection can be executed with minimal effort and cost. With new AI advancements, the future looks promising for both new and experienced developers. If you have any questions or improvements on the scraping process, feel free to reach out in the comments!
Happy scraping!
this application can scrape any website on the internet using only the URL and the fields that you want to extract for
example if you want to scrape data out of Hacker News all we need to do is to get the URL place it in here then Define
the fields that we want to extract in this case it's going to be the title the number of points the Creator date of
posting and the number of comments and go back here and Define them so the title number of points Creator date of
posting and number of comments then we are going to click on scrape and it's going to start scraping the
data and then it will say please wait data is being scraped and it will basically show me a table in
here as you can see it just scraped all the data give it to me in a table format in a nice table format then I can open
it either in JSON, Excel, or I can even open the markdowns if I want, and here if I open in Excel I will have a file and this file will have exactly the data that I wanted, and another thing is that it will show me exactly how much it cost: 3,868, these are the tokens that we have in the markdowns, then the output tokens were 1,500, this is what we have inside of the JSON, this table representation as a JSON, and the total cost is $0.0015, it is absolutely cheap because we have used GPT-4o mini, and here we can introduce as many models as we want, in this case I have GPT-4o mini and GPT-4o, therefore if I want more precision and more power I can always default to GPT-4o, but in this case GPT-4o mini does the job for an absolutely cheap price, and this works on any website, let's say for example we are
going to take another website this is a website that has listing of cars and we want to scrape this table that we have
here all we need to do as always URL here I am going to define the new Fields so image vehicle name
data and as you can see here it has basically scraped the data these are the urls that will take us to the cars if I
copy one of them and paste it in here it will take me to the URL of the car the first car 8,900 4,800 38,000 let's see
8,000 4,800 3800 and that is basically it and that will work on any website I've tried so many websites and this
application actually works on all of them of course this is going to be a bit more expensive because we have so much
more data, here we have 21,000 tokens and we are still not even at one cent, we are at half of a cent, so it makes perfect sense to use this application to scrape data instead of making more effort to create one script per website, so before
jumping into the code and seeing how this application works and don't worry we are going to see all of this in
detail I want to actually address the comments that I got on my last video that was by the way about the same topic
I have already created a first version of this application and there were a lot of comments about let's say three main
categories so the first one is about how I couldn't get consistent names every time and how I can use something like the Pydantic data validation library in order to make sure that I am going to get the same names, now this is something that I had in mind last time but I didn't want to do it, not to make the video very long, but since then OpenAI have actually introduced structured output, which basically made my life so much easier because now I can define object schemas using Pydantic, meaning that I can basically define the names and OpenAI with 100% accuracy will give me the same names every time, so that is very important and it was actually a great remark from you guys, the second thing is about why the
use of Firecrawl, now there are some people that were genuinely asking why I am using Firecrawl, and there are other people that just thought I was, I don't know, sponsored or doing some kind of sales pitch or something like that, first of all I am not sponsored by anyone, it does not make sense for a tiny channel like mine, therefore if you can subscribe it would be amazing, but the idea is: why do we even need to use Firecrawl, we can just read the whole HTML and extract markdowns from that HTML without going through any kind of library, and you're right about that, and I actually did that, this time we're not going to use Firecrawl, we're not going to use any library, but the use of Firecrawl or Jina AI or ScrapeGraphAI is actually very good because they simplify the process of us getting the markdowns, we only need
three or four lines of code and we have the markdowns ready if you don't have that we will have to go through some
ways in order to make sure that the websites we are scraping are not blocking us from scraping by introducing captchas, we have to make sure that the website is opened on our machine, and a lot of other complexities that are not present with all of these other libraries, but still, going through the process of opening the websites ourselves is going to give us so many more possibilities that we didn't have before, so not using Firecrawl can actually
be beneficial and this is what we are going to do today the third point is about the fact that this will never
replace scraping as we know it today and honestly I don't want to argue about this point cuz I don't know the answer
but what I'm sure of is that the established industry of scraping does not have the same Innovation Pace as the
AI industry today, and we can confidently say that because every two weeks we have a new state-of-the-art model that
outperforms all of the other models and beats all of the benchmarks and introduces a new layer of possibilities
therefore dismissing this way of scraping data is not very wise because someone who is willing to get outside of
the comfort zone and basically try this new method will at least have a way of scraping Data before going into the old
way of getting your XPaths to scrape every element from the websites, and there are actually use cases where this would at
least give you a starting point so let's close this bracket and let's continue with the video now let's jump to our
code and see how I have created this, so the first thing that we start with is some boilerplate imports, so nothing important to see here, after that we are going to use pandas, Beautiful Soup and Pydantic, which is going to be very important, it's what is going to allow us to create the schemas, then we have html2text, this is going to help us create the markdowns, and we have tiktoken, this is what we are going to use in order to calculate the number of tokens and their cost, and finally we will have the Selenium imports and the OpenAI ones, so the first thing that we are going to start with, and we should absolutely pay attention to, is the Selenium setup, because if you're just trying to export data using let's say for example requests.get(URL) without the setup, you will absolutely get the "verify you're not a human" and solve-the-captcha page, so you have to mimic some human behavior in order for the website not to block you
and this is why I have downloaded the Chrome driver which you can basically find in here I will keep the link in the
description below, okay so the first thing that we're going to start with is to create an instance of the Options class in order to add arguments to it, the first argument is --disable-gpu, this is important because it's going to help you
disable the GPU if you're running this on a VM and it will basically make it faster because it will not try to
initialize any kind of integrated GPU that you have inside of your CPU after that we will use this argument this is
to make sure that our Chrome instance that we are going to open is independent and separate because it will have to
access a folder called temp it's a detail you don't really need to know about it but if you are running this
code in a Docker container this will prove to be important, after that we will define the window size, this will help the website think that we are not scrapers, we are actually a human user opening the window, it's really not that important
but it could help here we arrive at the first really important argument which is the user agent argument and here we have
a long text this is quite important because this is what proves to the website that we are not scrapers because
we are basically using artifacts that would normally be there if we were to open that website ourselves so everything in
here means something for example this is Windows 10 this is Chrome and its version and all of the other things mean
something we don't need to go into details but these are just artifacts that usually are present when we are
opening a website ourselves, after that we will open the Service that I already have inside of my project in here, and then we are going to initialize the web driver and we are going to return it. Then we get to fetching the HTML with Selenium: this is where we are going to open the URL, and then we are going to add some sleeps and mimic a scroll action, which is always going to help us not to get blocked, and it is also quite good if we have an infinite-scroll case, this is how we can do it, we can just add three or four scrolls in here just so that all of the data gets loaded, and then we can get the page source and return the HTML, and then we have driver.quit() inside of a finally block just to make sure that if something crashes in here we still close our Chrome instance and don't leave it open, which is quite important.
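A minimal sketch of this setup and fetch logic, assuming a locally downloaded chromedriver at ./chromedriver (flag values, the user-agent string, and sleep times are illustrative, not the exact ones from the video):

```python
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

def setup_selenium() -> webdriver.Chrome:
    options = Options()
    options.add_argument("--disable-gpu")            # skip GPU init, useful on VMs
    options.add_argument("--user-data-dir=/tmp/chrome-profile")  # assumed: isolated profile for Docker-style runs
    options.add_argument("--window-size=1920,1080")  # look like a normal desktop window
    # a realistic user agent so the site sees ordinary browser artifacts (Windows 10, Chrome, etc.)
    options.add_argument(
        "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36"
    )
    service = Service("./chromedriver")              # path to the downloaded driver
    return webdriver.Chrome(service=service, options=options)

def fetch_html_selenium(url: str) -> str:
    driver = setup_selenium()
    try:
        driver.get(url)
        time.sleep(3)  # give the page time to load
        # mimic a few scrolls so lazily loaded / infinite-scroll content appears
        for _ in range(3):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(1)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if something crashes above
```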
We then get to the part where we create the markdowns: inside clean HTML we want to keep only the main content, so what we are going to do is remove the footer and the header using decompose(), unless you want to scrape some information from the header or the footer, in which case you should probably comment this part out. After that, in the HTML-to-markdown-with-readability step, the goal basically is to take the HTML content that we retrieve from clean HTML and get it into a format that is readable, much like the markdowns that we had with Firecrawl, so it's actually the same, I've compared between the two and it's actually the same. So here we are going to initialize a markdown converter with html2text, and then we are going to define ignore_links as False, I've actually tried this with True but still it did not ignore the links, it kept them inside of the markdowns, and then we are going to use the handle function, and this is basically the function that is going to take our HTML text and create markdowns that are kind of readable by human beings, it's not 100% readable, it's not structured data or anything, but it is semi-readable data for us, and of course that semi-readable data is actually very good for large language models because it helps them tremendously to produce structured data.
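Roughly, that cleaning and conversion step looks like the following sketch (BeautifulSoup for decompose(), html2text for the markdown; not the exact code from the video):

```python
from bs4 import BeautifulSoup
import html2text

def clean_html(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    # keep only the main content; comment this out if you need header/footer data
    for tag in soup.find_all(["header", "footer"]):
        tag.decompose()
    return str(soup)

def html_to_markdown(raw_html: str) -> str:
    converter = html2text.HTML2Text()
    converter.ignore_links = False  # True did not reliably drop links in practice
    return converter.handle(clean_html(raw_html))
```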
After that we are going to define our models, here I have GPT-4o mini and the latest GPT-4o version and all of their pricing, we are going to use that in the calculate price function, but for now we just define them, and then we define the model that is going to be used, which is our GPT-4o mini. After that it's basically the same code as I created last time, I have save raw data, which means that we are going to take these markdowns and then save them, after that I have added a function to remove the URLs, this is basically just to offset the ignore links setting that I had before, because it did
not work and we get to the most important part which is creating the dynamic schema that we want and as you
can see here it's not a very complicated function but at the same time it is very very important so here we Define our
return type, which is going to be Type[BaseModel], and if you read the comments you're going to see that I'm going to dynamically create a Pydantic model based on the provided fields. Before actually adopting a list of strings I was using a
dictionary because I have said maybe sometimes if the user basically defines some kind of let's say fields that are
not well written I will read the data and then I will reconstruct these names that I have in the fields so that the
model can do a better job extracting them but then this means that I have to add another part where I am going to
call chat GPT meaning that it's going to be more expensive so this is why I just decided to go with a list instead of
dictionary so this dictionary that was going to basically have field aliases of external names and internal names is
very good but at the same time it is going to be more expensive this is why field name should be a list of the
fields that we are going to extract from the markdown, so here we have a little syntax in order to create the schema
that we want and if you're not really familiar with this all I'm doing here basically is just trying to create these
formats dynamically so instead of just defining one of them I'm just going to read whatever the user is going to give
me and I'm going to try to create this, I'm going to create them in this specific format: let's say for example I have this, then it's going to be a tuple in here with three dots (an ellipsis), for example if the fields are explanation and output, just so that this will be a mandatory field, which is going to prevent ChatGPT from giving me empty fields, because I've tried it without the ellipsis, just with str, and sometimes I get a lot of empty fields where I shouldn't, and this seems to help it give me a more reliable output, so let's delete this, and here we basically have a return of the
dynamic listing with the field definition that we have just given and now we have a model we need to put it
inside of a container, because we don't want just one listing, we want multiple rows, and our input is going to be of type BaseModel, which is what we have here, and our output is going to be a list of these listing models that we just got in the input, so this is going to be our dynamic listings container which is going to contain all of our data.
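A sketch of that dynamic schema using pydantic.create_model (function and model names are illustrative; the ellipsis is what marks each field as required, which is the trick that discourages empty fields):

```python
from typing import List, Type
from pydantic import BaseModel, create_model

def create_dynamic_listing_model(field_names: List[str]) -> Type[BaseModel]:
    # every user-supplied field becomes a required string: {"title": (str, ...), ...}
    field_definitions = {name: (str, ...) for name in field_names}
    return create_model("DynamicListingModel", **field_definitions)

def create_listings_container_model(listing_model: Type[BaseModel]) -> Type[BaseModel]:
    # a container holding many rows of the dynamic listing model
    return create_model("DynamicListingsContainer", listings=(List[listing_model], ...))

ListingModel = create_dynamic_listing_model(["title", "points", "creator", "comments"])
Container = create_listings_container_model(ListingModel)
```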
After this part, the only interesting thing that we need to do is to basically count how many tokens we have used and then the
price that we have paid for the extraction but before that we need to trim our token limits sometimes if a
website is crazy enough to have more than 200,000 tokens I think even Amazon only have like worst case scenario is
going to have 40 or 60,000 tokens per page if our website has more than 200,000 we need to trim it and only take
the first part, which is likely going to have the data, and then we are going to do that basically for GPT-4o, or Gemini if we decide to use it. So this is the function format data, which is basically going to take the data and the dynamic listing, and
then I already have my prompt in here by the way I didn't change it from last time it seemed like it was working all
the time so I didn't really need to change it, and here we have client.beta.chat.completions, so maybe this will change in the future, and if this project proves to be important to you guys I will just come back to this code and basically update this part, but the most important thing is the structured output, which basically means that I only need to add this one line of code and I will be guaranteed 100% to have the same names every time, meaning no postprocessing of my output, which is absolutely amazing, then I will get the data parsed and from there I will use the same functions as I've used last time.
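The structured-output call would look roughly like this sketch against the OpenAI Python SDK's beta parse helper (model name, prompt, and function signature are placeholders):

```python
from openai import OpenAI

client = OpenAI()

def format_data(markdown: str, container_model):
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract the requested fields from the page text."},
            {"role": "user", "content": markdown},
        ],
        response_format=container_model,  # the dynamic Pydantic container defined earlier
    )
    # .parsed is already an instance of container_model, so field names need no postprocessing
    return completion.choices[0].message.parsed
```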
Now we reach save formatted data, and here what we are going to do is basically try to save our data as JSON and then as a table in an Excel sheet. So we are going to take the formatted data and check if it has a dict, which is going to be the case since this is an instance of the dynamic listings container Pydantic schema, so it should have a dict, so we are going to basically take that dictionary and put it inside of a new variable called formatted data dict, and then we are going to save that variable in JSON format using json.dump, and after that we are going to check if that formatted data dict has either one value or multiple values. If it only has one value, that means we have one key, which is going to be the case: we have listings, and inside of it we have all the data, so it has that one value of listings, meaning that inside of it we are going to have all of the data that we want, so we are going to take the values from inside of that one listings value, as you can see here, here we have listings, we're going to check that we have only one value, listings, and then we are going to take all the values that we have inside of that listing, and from there we are going to put them inside of a DataFrame, which we are going to create an Excel file with later on in here.
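A sketch of that save step, assuming the parsed result is the Pydantic container from above (file names are arbitrary):

```python
import json
import pandas as pd

def save_formatted_data(formatted_data, json_path="output.json", excel_path="output.xlsx"):
    # Pydantic v2 exposes model_dump(); fall back to dict() for v1-style models
    data_dict = formatted_data.model_dump() if hasattr(formatted_data, "model_dump") else formatted_data.dict()

    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(data_dict, f, indent=4)

    # a single key (e.g. "listings") means its value already holds all the rows
    rows = next(iter(data_dict.values())) if len(data_dict) == 1 else [data_dict]
    df = pd.DataFrame(rows)
    df.to_excel(excel_path, index=False)  # requires openpyxl
    return df
```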
And the last function that we have is calculate price: basically this calculation is going to use an encoder to take the number of input tokens and then another encoder to take the number of output tokens, and then we are going to calculate the price depending on the pricing model and the input/output rates in the dictionary that we defined before, and then we are going to return all the values.
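A sketch of that cost calculation with tiktoken (the per-million-token rates and the pricing dictionary are assumptions; check current pricing):

```python
import tiktoken

# assumed USD rates per token, keyed by model
PRICING = {
    "gpt-4o-mini": {"input": 0.15 / 1_000_000, "output": 0.60 / 1_000_000},
}

def calculate_price(input_text: str, output_text: str, model: str = "gpt-4o-mini"):
    encoder = tiktoken.get_encoding("o200k_base")  # the GPT-4o family encoding
    input_tokens = len(encoder.encode(input_text))
    output_tokens = len(encoder.encode(output_text))
    cost = input_tokens * PRICING[model]["input"] + output_tokens * PRICING[model]["output"]
    return input_tokens, output_tokens, cost
```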
Here I was testing this model, so you can use another file where I have examples that you can basically just copy-paste in here in order for you to test only this part, but the workflow itself is going to be called inside of the Streamlit
application, so inside of our Streamlit application of course we are going to have streamlit, and then we are going to use the tags component, this is quite important because only this way can we basically have this format for the tag names etc.,
so this is the library that is going to help us to do so we make sure that we are taking it for the sidebar and then
we are going to import all the functions that we already have defined in scraper so we're going to start by some boiler
plate to define the title and then we are going to Define what we are going to have in the sidebar so here we have the
select box I am a bit lazy I should probably take these values from inside of here I should have the same values
that we have inside of the models that we have inside of pricing then we are going to have the tags and these are
basically the configuration of the tags just make sure that you have Max tags as minus one if you want to add some
suggestions if you basically want to make a better user experience and if you already have a prior knowledge to what
people are going to scrape, you can either add them as the value so they are already there when you open the
application or you can add them as a suggestion I left it empty but you can change that if you want and then we will
have a markdown just to separate the two parts and here because I was using a dictionary before that's why I have
Fields equals tags I had a little function in here but it does not exist anymore so do not pay attention to this
and then this is basically my workflow as you can see here perform scrape and it's going to go through all
the functions that I had before and going to return all these values and all of these values are going to be used in
here, so if perform_scrape is not in the session state, if this is the first time I open it, I am going to set it to false, and then only when I click on scrape, this is where I will have my spinner, and then I will call perform_scrape and put the results inside of here, once this is finished it's going to go here and then it will put all of the values inside of these variables from the results that we have in the session state and then it will start displaying them.
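A condensed sketch of that Streamlit flow (the tags widget is assumed to be streamlit-tags' st_tags_sidebar; perform_scrape is the workflow function from the scraper module described above, and the unpacked return values are an assumption):

```python
import json
import streamlit as st
from streamlit_tags import st_tags_sidebar
from scraper import perform_scrape  # the workflow described above

st.title("Universal Web Scraper")

model = st.sidebar.selectbox("Model", ["gpt-4o-mini", "gpt-4o"])
url = st.sidebar.text_input("URL")
fields = st_tags_sidebar(label="Fields to extract", text="Press enter to add", maxtags=-1)

if "perform_scrape" not in st.session_state:
    st.session_state["perform_scrape"] = False

if st.sidebar.button("Scrape"):
    with st.spinner("Please wait, data is being scraped..."):
        st.session_state["results"] = perform_scrape(url, fields, model)
        st.session_state["perform_scrape"] = True

if st.session_state["perform_scrape"]:
    df, data_dict, input_tokens, output_tokens, cost = st.session_state["results"]  # assumed return shape
    st.dataframe(df)  # the built-in table widget already offers a CSV download
    st.download_button("Download JSON", data=json.dumps(data_dict), file_name="scraped.json")
    st.write(f"Tokens: {input_tokens} in / {output_tokens} out, cost ${cost:.4f}")
```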
Here I have created some kind of columns because the buttons were all over the place, and as you can see here this is actually download, not just open, as you can see here we have the download JSON, and in my column two I have data dict where I am going to get the formatted data for my Pydantic schema as we have talked about before, and then I will get the first key and I will access that first key using the data dict, and then I will transform that into a data frame, and then I will basically use that data frame... honestly I don't know why I added this part in here, it does not do anything for me, anyways, so here as you can see I will basically use that data frame and I will transform it into a CSV and then I can just basically download the CSV, and after I did that I actually discovered that, for example, if I want to scrape let's say a scraping demo website, let me take it really quickly, let me paste it in here, let me click on scrape, so I discovered that when I get a table in here I can actually download the CSV, so all of that work that I had to do on the data frame did not mean anything because basically I can just download it in here directly, and we are going to see that now, so as you can see here I will have download, so I can just download a CSV, so this is basically just name and price, and I am scraping data out of this website which
is a website scrapers normally use just to try their scripts, so that's very good. The last thing we need to talk about is these two lines of code, which are very important, so whenever I am changing something inside of here, let's say I am going to change this to GPT-4o mini, it's very important for the user experience to have all of this stay the same until we click on scrape, so these two lines just basically make sure that the session state will always keep these values so that we show them, and unless we click on scrape this will not change, and that's it.
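In spirit, those two session-state lines are a guard like this sketch (key names assumed), which keeps the last results on screen when a sidebar widget such as the model selectbox triggers a rerun:

```python
# keep previously scraped results visible across Streamlit reruns
if "results" not in st.session_state:
    st.session_state["results"] = None
if "perform_scrape" not in st.session_state:
    st.session_state["perform_scrape"] = False
```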
If you guys have any comments, if you guys think I have forgotten something, or if you have suggestions to basically enhance this script, I'm all ears, just drop them in the comments. Thank you guys so much for watching, don't forget to like and