Introduction to Puppeteer and Headless Browsers
What is Puppeteer?
Puppeteer is a Node.js library that provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. It allows developers to automate browser tasks, perform web scraping, and run tests without a visible browser interface.
What is a Headless Browser?
A headless browser is a web browser without a graphical user interface. It can be controlled programmatically, allowing developers to run automated tests or scrape web pages without displaying the browser window.
Getting Started with Puppeteer
Installation
To install Puppeteer, run the command:
npm install puppeteer@19.11.1
This installs version 19.11.1, which is used in the examples.
Basic Setup
Create an npm package and set up your project structure. Use ES6 imports by adding "type": "module" in your package.json.
Key Features of Puppeteer
Launching a Headless Browser
Use await puppeteer.launch({ headless: true }) to start a headless browser. You can also set viewport dimensions and geolocation.
Navigating to URLs
Use await page.goto('https://example.com') to navigate to a specific URL.
Taking Screenshots
Capture screenshots using await page.screenshot({ path: 'screenshot.png', fullPage: true }).
Web Scraping
Extract content from web pages using page.evaluate() to run JavaScript in the context of the page.
Automated Testing
Automate user interactions like filling forms and clicking buttons with commands like await page.type() and await page.click().
Advanced Usage
Handling Asynchronous Operations
Use Promise.all() to wait for multiple asynchronous operations to complete, such as clicking a button and waiting for navigation.
Downloading Images
Scrape images from a webpage by filtering responses based on content type and size, then save them using the fs module.
Using Plugins
Enhance Puppeteer with plugins like puppeteer-extra and puppeteer-extra-plugin-stealth to avoid detection as a headless browser.
Conclusion
Puppeteer is a versatile tool for developers looking to automate browser tasks, perform web scraping, and conduct testing. With its powerful API and support for headless browsing, it opens up a world of possibilities for web automation. For code examples and further details, refer to the video description.
all right so today we're talking about Puppeteer and headless browsers what is Puppeteer what is a headless browser why
should you care and what are the cool things that you as a developer can do with a headless browser
so the website for puppeteer this is the package that we're going to be focused on today
Puppeteer pptr.dev is the website this is where you can come for all the references
there's some basic guides here at the start if you go to the API section this is probably where you're
going to spend most of your time looking up references for things to make sure that you're using them properly so
you've got all the different parts inside of here all the different methods the different parts
lots and lots and lots of stuff and a lot of them have little code samples like this that you can use so we're
going to be using that I'm going to be using a specific version of Puppeteer I'll show you in a minute
how to install that I'm going to be using version 19 of Puppeteer version 20 is a fairly new thing and I've tested
all the examples that I'm going to be showing you today with version 19 so we'll be using that
Puppeteer is this package that uses the Chrome DevTools Protocol so this is a protocol that you can use you can
write a program yourself that uses this protocol to access all the dev tools so when you inspect on a web page and you
go in here all these tools access to this can be given to you through a script so you can write your own program
just like Puppeteer which will access all that information and Chromium and Chrome support this Firefox
also supports the DevTools Protocol so lots of cool stuff we're going to be able to do with this
all right so website I'm going to jump back into the code here and let's do a little bit of
setup so I have some basic pages here just with some comments in them on things that we're going to be talking
about today so basic setup for the project right now we're going to create an npm package
so npm init -y to take the defaults yes for the defaults so here are my four scripts there's my package.json that got
created and then these are empty folders that we're going to be using in our scripts
we're going to be putting stuff inside there automatically so Puppeteer we're also going to npm
install the package puppeteer now if you just do npm i or npm install puppeteer you're
going to get the default version which is the current version 20.3.0 but I'm going to be installing this version
right here it's the version right before 20.0 so we'll say puppeteer@19.11.1
that's the version that I'm going to be using here so if you want to follow along and if you want to use my
code you'll find links to all of this code down in the description so you can
follow along or you can just have a copy of your own to play with there it is and one last thing that we
want to change here is we want to go into the package.json and because I'm going to be doing imports I want to do
this I want to use "type": "module" so I can use ES6 imports instead of the older CommonJS node
require statements so here we go import puppeteer now that we have this inside of here
we want to build a headless browser or we want to use a headless browser so
what is a headless browser basically when you launch the browser yourself so
here you go this is a browser it's got the Chrome it's got the dev tools it's got
everything here you're watching everything going on you are interacting with it yourself
but if you're running it from a script it's really optional whether or not you actually see the web pages if you've got
a JavaScript that's running and the JavaScript is talking to the different Dev tools and it's downloading pages and
running scripts and doing things if it's not showing the visual interface if it's not showing you as the developer the
browser that is a headless browser so that's what we're doing here with Puppeteer we can use headless browsers
we don't need to watch the automated tests we just need to get the results I want to get a JSON file that has the
content from a page so that's scraping or testing I want to test the interface I want to make sure that this works if I
navigate to this page and click on this button and fill in this form field and then click on the login button does it
take me to another page and can I run a search by typing into the search field and clicking the button
then can I get a screenshot of the results of that and I can do all of that without actually having the browser
launch now I will in this video be showing the browser I'll tell it yeah yeah go ahead and show me so I can see
what it's doing but you do not need to actually display the browser while you're doing these tests
all right once we have Puppeteer installed what we're going to be doing is we're going to be running a whole
bunch of commands that are asynchronous so I've got an IIFE here an immediately
invoked function expression and I'm just going to make the function an asynchronous one
and then inside of here this is where I'm going to put my code so this will run my script as soon as
the file loads and then inside of here I'm going to be running a whole bunch of commands and I'm going to be putting
await in front of all these things because I want to make sure that these things are done in a specific order now
I could chain them together with .then .then .then but I'd be adding a lot of extra code here a lot of extra curly
braces and parentheses that I don't want to see I want to be able to do things like this const browser equals await
puppeteer.launch now this is the basic command to start it this is hey Puppeteer can you please
go and launch the default browser every version of Puppeteer has a specific version of
Chrome that is installed and used the first time I run this it's going to take an extra few seconds because it has to
actually install its version of Chrome so puppeteer.launch and then inside of here we're going to say I want a web
page I want to actually create a tab inside of there so we'll say browser.newPage this will create a tab inside
the browser and at the very end every time you do this the last step is always going to be
browser.close when you're done with the browser you want to shut it down you
don't need it running in the background there we go that's the absolute minimum let's launch a browser
and we'll run this with node so basic.js I'm going to be running this script
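The absolute minimum described here can be sketched as one small function: launch, open a tab, close. This is only a sketch. The function name launchAndClose is made up for illustration, and the puppeteer module is passed in as an argument (a choice of this sketch, not something the video does) so the three steps are easy to see in isolation.

```javascript
// Sketch of the bare minimum: launch a browser, open one tab, shut down.
// `puppeteer` is the module you get from `import puppeteer from 'puppeteer'`;
// taking it as a parameter is a convenience of this sketch only.
async function launchAndClose(puppeteer) {
  const browser = await puppeteer.launch(); // no options: all defaults
  try {
    await browser.newPage(); // creates a tab inside the browser
  } finally {
    await browser.close(); // always the last step, even if something threw
  }
}

// In basic.js, with "type": "module" in package.json:
//   import puppeteer from 'puppeteer';
//   (async () => { await launchAndClose(puppeteer); })();
```

Wrapping the work in try/finally is an addition here; it just makes sure the browser is not left running in the background if a step fails.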
and it's giving me a warning because I didn't put any options in here for the launch
there's a new thing where they want us to say new right now with nothing passed inside of
here launch is going to be using all the default values and inside here in these
options headless is one of the options and we can say true to say headless I don't want to see it
there's some new features that are coming and they're going to be deprecating the true value they want new
to be there in place of true so launch it using headless new and that will be headless true so if I run it
okay now it's actually installing Chrome so it's going to take that few seconds to install Chrome once it's ready it
launches it in the background because we're saying that it is headless it means it's not going to actually
display the browser to us there our script is done so it ran it installed Chrome it launched it opened a
tab and then closed it we didn't see anything but if I change this from headless new
to headless false what I'm saying is yeah I actually want to see it I want to follow along with what you are doing I
want to see the browser actually launch so taking a second and there it is that
little flash on the screen that was the browser launching opening a tab and then closing
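The two headless settings tried here, 'new' and false, can be sketched as a single switch. The helper names headlessOption and launchBrowser are hypothetical, and this assumes a Puppeteer release that recognizes the 'new' value (the 19.11.1 install used at this point prints the warning that asks for it).

```javascript
// Pick the `headless` launch option from one flag.
// 'new' is the value the warning asks for; false opens a visible window.
function headlessOption(showBrowser) {
  return showBrowser ? false : 'new';
}

// `puppeteer` is passed in as an argument, an assumption of this sketch.
async function launchBrowser(puppeteer, showBrowser) {
  return puppeteer.launch({ headless: headlessOption(showBrowser) });
}

// Usage inside the async IIFE:
//   const browser = await launchBrowser(puppeteer, true); // watch it run
```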
all right well that wasn't very exciting so let's actually do something let's go somewhere on the web so we'll say
await and on my page I want to goto and here's the URL that I'm going to go to
so https://google.ca now let's try that one now
it will actually launch the browser and then navigate to google.ca there it is and then the browser shuts down
so this is what we're doing with Puppeteer we're using a script to automatically launch a browser so that
we can do tests all right now I've got my page here what are some
other options that we can do well before we actually go to a URL you can set the viewport on the page you can say hey I
want my browser page to be these dimensions so I can say await like I said there's going to be a lot of
awaits here page.setViewport and inside of here we pass in the
options to say okay I'm going to set it at 1600 pixels wide I'm going to set the height at a thousand pixels high I am
going to say that no pretend it's not a mobile browser so isMobile is going to be set to false isLandscape yeah we'll
set that to true so it's a landscape display and these values are going to be used with the CSS when rendering things
hasTouch so is it a touch device I'm going to say false and deviceScaleFactor
this is not the aspect ratio but the pixel density on the screen so I'm going to just set it to one again
something that will be used in the CSS so we can set the dimensions of the screen
if you're going to be doing anything if you're visiting a page that uses geolocation for anything you can set
that there is a page.setGeolocation which we always start with await page and setGeolocation you can set your
latitude and longitude so if I was visiting a page that had a
map these are the coordinates that the browser would feed to the map to say hey this is where I am
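Put together, the settings that get applied before navigating might look like the sketch below. The viewport values are the ones mentioned above; the latitude and longitude are placeholder numbers chosen for illustration because the video does not say which coordinates it uses, and the function name openWithSettings is hypothetical.

```javascript
// Apply viewport and geolocation first, then navigate.
// `page` is a Puppeteer Page, e.g. from `await browser.newPage()`.
async function openWithSettings(page, url) {
  await page.setViewport({
    width: 1600,
    height: 1000,
    isMobile: false,
    isLandscape: true,
    hasTouch: false,
    deviceScaleFactor: 1, // pixel density, not aspect ratio
  });
  // Placeholder coordinates, not taken from the video.
  await page.setGeolocation({ latitude: 45.4, longitude: -75.7 });
  await page.goto(url);
}

// Usage: await openWithSettings(page, 'https://google.ca');
```

A real page may still need the geolocation permission granted before it can read these coordinates; that step is outside this sketch.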
we can look after we've gone to the page so these are the kinds of things that you would typically set before you go to
the page but once the page has launched then we can start looking at the page and doing things like get the content
get the title get the URL so const url equals await page.url this is a method that will give us
the URL of the page that we're on now we already know what we set it to but maybe you did something dynamic in the
page to navigate someplace else you've filled out a form you clicked the submit button you want to know what URL you're
on we can take a look at content
so this will take the source code of the page we can take a look at that that can be displayed
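Reading back the URL, title, and source of the loaded page could be sketched like this. The function name describePage is made up for illustration. One detail: page.url() returns a plain string rather than a promise, so putting await in front of it, as the walkthrough does, is harmless but not required.

```javascript
// Collect basic facts about the page that is currently loaded.
// `page` is a Puppeteer Page that has already navigated somewhere.
async function describePage(page) {
  const url = page.url();            // current URL, after any redirects or form posts
  const title = await page.title();  // contents of the <title> element
  const html = await page.content(); // full serialized HTML of the page
  return { url, title, htmlLength: html.length };
}

// Usage: console.log(await describePage(page));
```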
so we have all of that content for us as well taking a screenshot very common thing to do
and there is a screenshot method now with this we have to pass in some options at the very least what we want
to pass in here is an option for path what is the URL that you want to use for saving it so we're talking about here
not the URL of the page we've already set a page we've loaded a page but let's say I'm going to save it in my
screens folder so inside my screens folder I'm going to take this and say sample
google.jpeg now I'm going to do a second one here just to show the other options that we
can get we can add a clip object with the clip object you can set what is the x and y coordinate and what
is the height and width of the area that you want to take starting at the x and y coordinate
so let's say starting at 200 pixels x 200 pixels y I want a screenshot that has a width
of 500 pixels and a height of 500 pixels additionally after the clip and the path we can set
whether or not it's going to be a full page render actually we should do this on the first
one take this out and put that here fullPage
by default is false but I'm going to set this one to true and that means that it's not just the
screen that you're seeing right here but the screen and the entire page so if the page is longer than your screen it will
capture the entire page and we could do something else here let's go to
chapters.indigo.ca so we'll go there instead and we'll get sample
change this name to make more sense chapters one sample chapters two
there we go all right so path fullPage clip and we've got encoding
with the encoding we can set this to base64 or binary the default is binary meaning you're going to get an image and
then if it is binary you can set the type to say that I want jpeg or I want PNG those are the two options JPEG and
PNG all right so do that we'll run it again so this will launch the browser it'll go
to the chapters website load the page take a screenshot get the URL get the content and take two
screenshots so here we have all the content so we wrote out all the HTML and CSS that were
inside there if you view source on the page this is what you're getting so I'll scroll way up actually there's
more that would fit inside of here so we got all that content the URL was written out before that as well
and here are the two images so here is the full page true so you can see it's more than just what we saw on
the original screen right here it's the entire page and then the other one where we clipped
we went from 200 pixels over 200 pixels down and then it was 500 pixels wide 500 pixels high so the zoomed in portion of
the page we were getting this section right here of the page okay so that is basically how this works
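The two screenshots from this part, one full page and one clipped region, could be sketched as follows. The file names are hypothetical stand-ins for the "sample chapters" names used in the video, and the function name takeTwoScreenshots is made up. The fullPage and clip options are kept on separate calls, since Puppeteer treats them as mutually exclusive.

```javascript
// Take a full-page screenshot and a clipped 500x500 screenshot.
// `page` is a Puppeteer Page; `folder` is an existing directory such as './screens'.
async function takeTwoScreenshots(page, folder) {
  // Whole page, including the part below the visible viewport.
  await page.screenshot({
    path: `${folder}/sample-chapters-1.jpeg`, // hypothetical file name
    fullPage: true,
  });
  // Only the region starting 200px over and 200px down.
  await page.screenshot({
    path: `${folder}/sample-chapters-2.jpeg`, // hypothetical file name
    clip: { x: 200, y: 200, width: 500, height: 500 },
    type: 'jpeg',       // matches the .jpeg extension
    encoding: 'binary', // the default; 'base64' is the other value
  });
}

// Usage: await takeTwoScreenshots(page, './screens');
```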
if you want to click or type in stuff you can do that as well there's commands for things like that so we can say
page.type and then you'd have your CSS selector and the text that you want to type we could do that we can say await
page.waitForSelector now this can be useful let's say that these are CSS selectors
waitForSelector is I've got a page that's loading but it loads a little bit and then it loads a little bit more then
it loads a little bit more maybe it's fetching data from the server or something and it's not until it's
finished and it's actually displayed some part of the page that I want to do the next step so what we're doing here
is we're saying wait until you can find the selector now it does have a default timeout I think it's about 30 seconds
where if it hasn't loaded it'll just create an error and it'll stop running the script
but this is very useful at times in your script to pause before moving on type
we're going to find this so assuming that this is some sort of input element input or text area I want to type this
text inside of here so a couple more options for you now we will be looking at a whole bunch more
when we get into the testing and scraping but this is what we have for the basics
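Waiting for an element and then typing into it might be sketched as below. The function name fillField is hypothetical. The timeout is written out explicitly as 30000 milliseconds, which is Puppeteer's default for waitForSelector, so the "about 30 seconds" mentioned above is visible in the code.

```javascript
// Wait until `selector` exists on the page, then type `text` into it.
// `page` is a Puppeteer Page; `selector` is a CSS selector for an input or textarea.
async function fillField(page, selector, text) {
  // Rejects with a timeout error if the element never shows up.
  await page.waitForSelector(selector, { timeout: 30000 });
  await page.type(selector, text);
}

// Usage (the selector is a placeholder, not one from the video):
//   await fillField(page, 'input[name="q"]', 'hello');
```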
so setting the viewport setting the location navigating to URLs retrieving the current URL getting the
page title getting the page content taking screenshots all these kinds of things all these automated tasks that
you would do to test your page to make sure if it's working to get proof that it's working
to download content from there so we're going to be talking about running some UI tests we're going to scrape some
content out of another website and then we're going to fetch some images so we're going to go to the Unsplash
website and actually download some images from that website and save them locally so we have copies we're going
out fetching copies of that here we're going to generate a JSON file based on content that we have on a website after
we fill out an interface and do something and then the testing will test to see that we can step through some
stuff all right so let's jump into the testing here and I'm just going to hide this
temporarily so we're going to go to YouTube on the home page I want to get two
different kinds of screenshots I want to get a regular screenshot and a blurred screenshot
so assuming that you're doing testing we're sort of wearing a testing hat right now we want to test
our website to make sure it's working we've built this wonderful thing called YouTube and we want to make sure that
it's working the way that we expect so I want to get two screenshots a regular one and then one that's been
blurred slightly and we're going to get the slightly blurred one to make sure that things stand out things are large
enough to read for people with visual disabilities there's a whole bunch of different types of
vision deficiencies that we can emulate with this as well we're going to do just this one with the blur but we'll show
you how they all work we're going to fill in the form so on the YouTube homepage we want to fill in
a search form we want to click the button to do the search we're going to get a screenshot after we get search
results back once we have search results back we want to read the content the title of the
first search results we're going to click on that we're going to navigate to the next page we're going to look at the
content there get a screenshot we want to count the number of comments that have been written there we're going to
also get the title of the first suggested video in the sidebar for our search result
so we're going to step through all these things get a whole bunch of screenshots just so we can make sure that we're
understanding how to do some testing all right puppeteer we've got that imported we're
going to do our async iffy as always just like this and it's always
going to be the same commands here we're always going to be doing the same thing that we did here back on the basic page
these are going to be your first two steps foreign
and then your last one will be the browser.close like that
so I'm going to be doing headless false for all of these because I want to see it running I want to see that it's doing
the steps that I expect now what we're going to do inside of here
is going to the YouTube homepage now once we get here we want to fill out that search form
now I could hard code inside of here what is the value that I want to type into that search form every time that I
test or I can give myself a little bit of flexibility and say Hey what if I took the value that I'm going to search
for and I put it into an environment variable or I put it into one of the arguments on the command line as I run
my script so let's look at the two ways of doing that
first of all we'll create a couple of variables so we'll say my search term CLI and my search term
ENV so from the command line from environment variables we want to be able to get these two things
from the command line and what I mean by that is here if I'm going to say node let's run the script
test.js and then I pass in options so I'm going to say that I want to
search for the term Green Day
this is what I'm going to search for so if I do this this value right here is actually going
to be available to me in Node.js so from the command line we're going to go to process.argv
well this right here this is going to be the array of everything the arguments
on the command line this is number zero this is number one and this is number two
so as long as the length of this array is greater than or equal to 3 then I will
proceed and I want to get process.argv number two this thing whatever I wrote here
or if it wasn't three or longer then here's going to be my
default value so I still have a hard-coded value but the hard-coded value is only if I didn't type something
here now the other option is using environment variables
if you want to create an environment variable so a variable that exists while your script is running we can do
that like this in Unix Linux Mac we use the command export if you're on Windows the command is set
but then the rest of it is the same so set instead of export and we're going to say
search txt is going to be equal to and then we use quotation marks if there's going to be
any spaces in there you don't have to put the quotation marks if there's no space in what you're typing but I know
that I was going to do this I was going to type Green space Day so I need the quotation marks around it
all right so export search txt boom that's it I have now created an environment variable and I can take a
look at it by doing this putting a dollar sign in front of it
there it is so here if I want to access it again it's the node
process.env dot and what was the name of it it was called search txt and if that doesn't exist again here's
my default fallback Volbeat all right I now have two variables I can use either one of these
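Both ways of supplying the search term, with a hard-coded fallback, can be sketched as two small functions. The environment variable name SEARCHTXT is a guess at how "search txt" is spelled on screen, and the function names are made up for illustration. 'Volbeat' is the fallback named above for the environment variable; using it for the command line too is an assumption.

```javascript
// From the command line: argv[0] is node, argv[1] is the script, argv[2] is the first argument.
function searchTermFromCli(argv, fallback) {
  return argv.length >= 3 ? argv[2] : fallback;
}

// From the environment: SEARCHTXT is an assumed spelling of the variable name.
function searchTermFromEnv(env, fallback) {
  return env.SEARCHTXT ?? fallback;
}

// Usage inside the script:
//   const mySearchTermCLI = searchTermFromCli(process.argv, 'Volbeat');
//   const mySearchTermENV = searchTermFromEnv(process.env, 'Volbeat');
```

A term with a space in it needs quotation marks on the command line as well, for example node test.js "Green Day"; without them only the first word lands in process.argv[2].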
down inside my script as what I want to fill in so we've gone to our home page we have
our text that we can fill in we're going to use the type command to do that we'll say await
page.type and inside of the search box element I'm going to put one of these two things
right here I'll use the CLI one it doesn't really matter both of them will work for what we're trying to do here
now I know the name
of the field because I looked this up while I was doing the prep here and up here at the top in the comments
this is the name of the search input this is the name of the button that we're going to click on that search
input so we're telling it that we want to find this thing and type whatever this value
was inside that field now if I just leave it like this it takes JavaScript no time at all to write
that text inside that input if I want to sort of watch and see it be filled in maybe there's something that I
want to watch maybe because I know I'm doing it with headless false I can actually watch it be typed there is an
option we can pass in here called delay and we can say for each character take a hundred milliseconds to do this so a
tenth of a second for each letter that you're typing so you can actually watch it being typed
in the screen maybe there's some validation something that's going on with the input event while you're typing
that might be something that you want to type so this would be something that you could do to slow down the entry of the
data into your web pages all right one other issue now right now I've gone to the page I've typed this
into the input problem is just because I've gone to this page doesn't mean that this input is ready
yet for me I don't know if that's actually been found on the page yet so I'm going to wait
I will say page.waitForSelector so great command where I can put the same thing inside of here and say okay
wait until this thing exists before you start trying to type it and once that's done I'm going to take a
screenshot of this as well we'll save it right here inside of our screens folder we're going to take a screenshot
and remember I want to do those two I wanted to do one that was blurred and one that was not
so I will be calling screenshot twice and we'll call it
screens YouTube home.jpg and I'll just use the default
options oh sorry and this is an options object with path it's not just the string so I'm
going to do this twice home and home blurred I'm going to do the blurred one first
and to get it blurred you have to call the method called emulateVisionDeficiency
there it is and you just pass in a value here so blurredVision that is the option that means from
this point on all screenshots will be blurred until I do the same thing again
but instead of blurredVision I pass in none all right so we can try to run this now
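The home page step (wait for the search box, type slowly, take a blurred and a clear screenshot) might be sketched like this. The selector is left as a parameter because the video reads it from comments that are not shown here, and the file names and the function name homePageScreenshots are hypothetical.

```javascript
// Type into the search box with a visible delay, then take two screenshots:
// one with blurred vision emulated and one with emulation switched off.
async function homePageScreenshots(page, searchSelector, term, folder) {
  await page.waitForSelector(searchSelector);
  // 100 ms per character so the typing can be watched when headless is false.
  await page.type(searchSelector, term, { delay: 100 });

  await page.emulateVisionDeficiency('blurredVision');
  await page.screenshot({ path: `${folder}/youtube-home-blurred.jpg` });

  // 'none' turns the emulation off again for everything that follows.
  await page.emulateVisionDeficiency('none');
  await page.screenshot({ path: `${folder}/youtube-home.jpg` });
}
```

Besides 'blurredVision' and 'none', emulateVisionDeficiency also accepts 'achromatopsia', 'deuteranopia', 'protanopia', and 'tritanopia'.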
we're going to say node test.js and pass in the search string that we want
so we'll say a Green Day American Idiot there we go launch the browser
load the YouTube page and it craps out okay not a problem we can use control C to
kill the browser and I was hoping this wouldn't work I have sometimes had issues with versions of Puppeteer above
19.5 so what we're going to do is we're going to switch back from 19.11 to 19.5 very easy to switch just npm install
puppeteer@19.5.2 there we go
with that done we'll run this command again and that should run the version that we
just installed and there we go so it's loading the page and here up at the top we can see it's
typing in and then it took two screenshots just very rapidly at the end so here we have
the clear version there's the screenshot and here's the blurred version so we have both those screenshots taken so you
can see if you want to find out if the text that you have is big enough to read for somebody with a visual disability
if somebody's got poor vision somebody's old like me and they don't have their glasses on will they be able to read the
content on the page the text on the page this is a good way to find out so like these titles yeah those are fine those
are plenty big but this might be a little bit hard and these labels over here are definitely too small to read if
the text is blurred like that okay so moved over from version 19.11 to 19.5 and now we're back everything's
running fine okay what else do we want to do okay yeah we've got the search field filled
in now I want to click on the button to navigate to the next page so we're going to do that
one thing about navigating though when you click on the button sometimes you click and you also want to wait for
the navigation to finish before you move on to the next thing so we could do it as two steps or a common way the
recommended way from Puppeteer actually is to do this so not sorry not new but Promise.all so
if you've never worked with Promise.all before what you can do is you can pass in a whole series
an array of asynchronous methods inside of here so just a comma separated list of a whole bunch of asynchronous things
and it will not complete and not report back to the await that this is finished until all of those tasks are completed
so what do we want to do well we've typed in this thing then we're going to click on the search
button which we have up here this is it right here that's the button in the search
to navigate to the next page so we're going to do these two things we're going to say page.click
no semicolon there we want to do that but we also want to do page.waitForNavigation
so these two things are going to be inside of our Promise.all array right here and we won't go down to the next
step here until both of those things have finished now
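The click-and-wait pattern described here can be sketched as a small helper. The name clickAndWaitForNavigation is made up for illustration. The wait is listed first in the array, which is the order the Puppeteer documentation shows; both calls start before the await resolves either way.

```javascript
// Click something that triggers a page load and wait for that load to finish.
// `page` is a Puppeteer Page; `selector` is the CSS selector of the button or link.
async function clickAndWaitForNavigation(page, selector) {
  await Promise.all([
    page.waitForNavigation(), // resolves once the new page has loaded
    page.click(selector),     // triggers the navigation
  ]);
}

// Usage: await clickAndWaitForNavigation(page, searchButtonSelector);
```

Starting the wait in the same Promise.all as the click avoids a race where the navigation finishes before the script has begun listening for it.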
wait till next page once we get to the next page what do we want to do I'm going to do another one
of the waitForSelector things I want to wait until something's on the screen
that I want to be there then I'll do another screenshot
so I'm just going to copy these in I have some of these written out already so this is the one that I'm looking for
right here so the YouTube video renderer inside of it there's an H3 with an anchor so this
is the search results once you run a search there's the container and then each one of those elements in the search
results is an HTML custom element if you've never worked with custom elements before
there's a link up there at the top to a series that I've done on custom elements but inside this one we've got an H3
element and then inside that there's an anchor tag with this ID that is the thing that we want to click
on that's the title of the video that we want to see and we're going to get a search result
screenshot before we move on to the next page so we want to find the first one of these this
will give us the first we want to get the first one and move on to the next screen now we can just test this before
we go on just to make sure we're getting that screenshot so we'll do it again we will launch it
running that search we'll see the form get filled in and the button is right here that they're going
to click on I'm not clicking on it the script clicks on it move to the search results it was
only there very briefly but it did give us this search results right here so these items down the side here these are
those YouTube things and this part right here that is the H3 that we're actually clicking on to go to the page for this
video all right so we're going there then once we have it
I want to get the text so it's not just enough to navigate and go from page to page to page I wanted to actually get
the text out of here and display that we've got another method that we can use to do that
so I'm going to say my first match in my search results is page.$eval with a single dollar sign
so there's one dollar sign or two dollar signs very much like querySelector and querySelectorAll
you're going to pass in what is the query selector right here we do that so find the first one of
those on the page that you want if I used two dollar signs this would find everything that matched this so all of
the video titles all the anchors inside of all of those components and when that
happens this function right here is going to run so we have that
I'm going to return from this function the value of that element right here this element that is this
selector right here so we found the element and we're going to return not just the element itself but we're
going to take the inner text that's inside of there that value is going to be put back into
this variable right here so we have access to that which we can use now
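Pulling text out of the first match, or out of every match, could be sketched with the one-dollar and two-dollar variants. The function names are hypothetical, and the selector is left as a parameter since the exact one from the video is not spelled out in the transcript.

```javascript
// Text of the first element matching `selector` (like querySelector).
async function firstMatchText(page, selector) {
  // The callback runs inside the page; only its return value comes back to Node.
  return page.$eval(selector, (el) => el.innerText);
}

// Text of every element matching `selector` (like querySelectorAll).
async function allMatchTexts(page, selector) {
  return page.$$eval(selector, (els) => els.map((el) => el.innerText));
}

// Usage:
//   const firstMatch = await firstMatchText(page, titleSelector);
//   console.log({ firstMatch });
```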
after we have that then we're going to want to write that out to the console and then we're going to click on it and
navigate, so we're going to do the same sort of thing that we did right here with Promise.all.
First we're going to console.log firstMatch, and I'm going to wrap it in curly braces
just so we can see it: it'll say firstMatch and then the label. Once that's done we want to navigate to
the next page, so we will await another Promise.all with an array. Inside that array, what are the things that we
want to do? Well, again we're going to wait for navigation and we're going to click on
this same thing, that H3 anchor element inside there that had the results. We're going to
click inside of that, and let's take a look to see what happens there.
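The click-and-navigate pattern described here can be sketched as a small helper. The name `clickAndWaitForNavigation` is made up for this example; `page.waitForNavigation()` and `page.click()` are the Puppeteer calls the transcript refers to.

```javascript
// Sketch: click something that triggers a page load and wait for that load.
async function clickAndWaitForNavigation(page, selector) {
  // Both promises are created before either is awaited, so the navigation
  // listener is already in place when the click happens.
  const [response] = await Promise.all([
    page.waitForNavigation(),
    page.click(selector),
  ]);
  return response; // the main-frame response of the new page
}
```

Starting the wait before the click matters: if the click were awaited first, a fast navigation could finish before anything was listening for it.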
after that we want to take one final screenshot so let's do an await screenshot here
this is going to be the actual video that we've selected there we go so our first match video
that's what we're going to be saving so right inside of here we should see a first video up here if this all works
correctly there's the results we clicked on it and here's the title right here first match
so there was the title this is the title that we clicked on right here this one and first video
yeah, there we go. Now when we get to the next page, picture YouTube: you've
done a search, you've clicked on some results, you're going to the page with the video.
Now it can take a little bit of time for that video to load, and sometimes there's ads that play
before the video, so you might want to skip over that. Right now we've got wait for navigation, yeah, fine, we're doing
that, clicked on this thing, fine, we can do that. There's another
wait: instead of wait for navigation you can wait for network idle, that's a similar one.
That one is going to wait until the page has stopped asking for things. Basically
you've downloaded everything for the page, there's no more requests that are happening, no cascade of requests
that's happening but if you just want to wait there used to be a wait for
timeout now you can see there's a slash through this and that is because this method has been deprecated it's no
longer one that they want you to use they say okay it's better if you wait for navigation wait for Network idle do
something like that but on the rare occasions that there is a time that you just want to wait I just
need to pause 15 seconds before I do something, we can do that ourselves by creating a new
Promise and wrapping it around a setTimeout. So resolve
and reject, and we're just going to resolve; we don't need to put the reject inside of there for a promise.
I'm going to create a timer wrapped inside of a promise, and what do I want to call? I want to call resolve, which
tells the promise that hey, you're done, you're resolved, after, let's say, 17 seconds.
So there we have it: wait for navigation, click, and then wait 17 seconds after that before moving on to the next step
which was to take the screenshot so let's find out what we get now we run that script to launch the browser
we should be pausing when we get to that final result so we type something in we click
on the button go to the search results there it is we click on that takes us to this page
there we go and we're waiting the 17 seconds and hopefully the ad will be finished by
this point so that we can then get the screenshot and this will shut down right after that there we go so just at the
last moment we got this first video screenshot there we go so just before it started playing so that's an example of
some time that you might want to just pause set timeout like this
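A minimal sketch of the promise-wrapped timer described above. The function name `delay` is my own choice; the technique is simply a `setTimeout` whose callback is the promise's `resolve`.

```javascript
// Sketch: pause for a fixed number of milliseconds.
// Only resolve is needed; a plain timer has nothing to reject.
function delay(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Usage inside an async function, matching the 17 second pause in the demo:
// await delay(17000);
```

A fixed pause like this is a fallback for cases such as pre-roll ads, where there is no reliable element or network event to wait for.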
all right and then on the final thing if you want to do the last couple of steps that I had up here we've got
dismiss button that's the thing and that we saw in the screenshot right here this no thanks
check it out this no thanks has an ID of dismiss button so that's something you can take the source code and play around
with that see if you can add the click event to get rid of this before you take the screenshot
and the last thing that we want to do is we want to check for the number of comments
so I'm going to show you here let's just bring up YouTube
and we'll do a search for American Idiot not Green Day American Idiot but here it is this is the video and then
on this page right here what I want to get is this message right here I want to get the total number of
comments inside this description box up at the top here and I also want to get the title of the first suggested video
in this list over in the side so those two things so the comments
this there's two spans right here so I want to get this element with all of its text and I want to get this right here
this text as well Okay so what we have right here
this is the comments section so when we want to get the inner text for that
down here at the bottom we're going to add our waitForSelector after we've done our screenshot,
and I'm just going to paste this in here to speed this up a little bit. So waitForSelector, this is the one, and inside of
that we're going to use the $eval command again: the H2 that's inside of that block, this is where
the number and the word comments are both written. The H2 that's inside that, get its inner
text and we will see that number show up here. And then
we want to get that first suggested, so that's going to be very much the same as what we just did here. The first
suggested is using $eval: we're going to jump in here to find this element, and once we have that element, get the H3
inside of it. This compact video renderer, that's the little thing off to the side right here.
Each one of these is a custom element with that tag name, and we want
the H3's inner text. So now we should be able to see the video comment count and first suggested when we run those
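The two reads described above (comment count, then first suggested title) might look roughly like this. The helper name, the shape of the `selectors` object, and the selectors in the usage comment are all assumptions; the transcript only names the elements loosely.

```javascript
// Sketch: wait for the comments header, then read two pieces of text.
async function readVideoDetails(page, selectors) {
  // Make sure the comments header has rendered before reading from the page.
  await page.waitForSelector(selectors.commentsHeader);
  // The H2 holds both the number and the word "Comments".
  const videoComments = await page.$eval(selectors.commentsHeading, (h2) => h2.innerText);
  // The H3 inside the first suggested-video element holds its title.
  const firstSuggested = await page.$eval(selectors.suggestedTitle, (h3) => h3.innerText);
  return { videoComments, firstSuggested };
}

// Hypothetical usage (selectors are assumed, verify them in DevTools):
// const details = await readVideoDetails(page, {
//   commentsHeader: 'ytd-comments-header-renderer',
//   commentsHeading: 'ytd-comments-header-renderer h2',
//   suggestedTitle: 'ytd-compact-video-renderer h3',
// });
// console.log(details);
```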
there we go we're up and running typing in automatically I'm not doing the typing that is my script that's doing it
click on it we get to here there speed that up and I didn't save that so I'm going to
fast forward through the rendering of this running it again so you don't have to wait for that
17 second delay and actually what I'll do here one more thing before I run this
again is I will come up here and I'm going to set the viewport size on this. You'll notice how it was quite small
when it was running on the screen here, because I didn't actually set any viewport size, so I will do that.
I'll just copy in the command with some dimensions already in it, there we go.
So setting a width and a height gives the browser context for how big it's going to render. That will sometimes change
the rendering of the page depending on media queries, and elements may or may not be there when you
expect them to be so we've got here we're waiting for the selector for that header renderer we've
got the video comments and then the first suggested so we're waiting for this one we want to make sure that's
there before we do either of these things and I'm setting the viewport size on it just to make sure that it's going
to be the right size so there we go we'll run this again and I'll fast forward through this so
you don't have to watch the whole thing again and there we go and then we're back so
first match video comments there it is first suggested there it is so takeaways from this
wait for selector is a good way to pause and wait for things to be loaded screenshots
they're great but sometimes you want to wait before firing them when you do want to wait a set amount of time you can
still do this even though they've deprecated the wait for timeout command we do have the wait for navigation wait
for network idle, and we can always add our own setTimeouts inside of here. The one and two dollar sign eval commands
find content, and once you're inside of one of those you're really just talking
to the DOM. You're no longer in the context of Node and thinking about writing a Node script, you're
in the web page and you are actually dealing with the DOM, so anything that you would do in a client-side script you
can do inside of here. All right, so that's our test. Scraping content is
similar to doing this but when you want to have larger amounts of content when there's going to be a whole bunch of
things that you want to gather from the page or you want to step through to a certain part of a page to a certain part
of a website and from there gather all this information so here I'm going to go to the Algonquin
College website and I want to go to one of the pages I want to do a search for
um a specific program I'm going to search for the word mobile actually I could do
something a little bit different than what I've got written here but same idea so I'm going to go to the Algonquin
College website I'm going to fill out the form to search for programs that have the word Mobile in them there's
going to be two that come back as results and then I want to take details from that and extract the information
and save it as a Json file so that's what we're going to be doing here so if we look at example here
if I go to the Algonquin website here it is what I want to do is down here I'm going to fill out this form
we're going to search for mobile when we come back with the results from that it'll take a little bit of time
there we go we want to get the results with two records and then we're going to gather the program name the area the
campus and the length so those four bits of information for each one of these rows so we're going to have to figure
out in the HTML what is it that we're looking for so if I was to inspect this and we zoom in we can see
inside the table body for this head there's four rows I only want the ones with class odd and
class even so anytime I get one of those I'm going to find all of the TRS I'm going to Loop through those and then I'm
going to get the ones that are odd and even open them up and then inside the TDS I want to get not number zero but
number one number two number three and number five so those pieces are what I want to extract and
put into a JSON file, just to give you a sense of what web scraping is like.
This is what we're doing with web scraping. Now one other thing
I haven't talked about yet with Puppeteer: we're dealing with the built-in browser
when we're using Puppeteer, but when it's running it does come across as a headless browser, so there's settings
that aren't necessarily going to be there, and there's headers in the requests for the resources that aren't going to be
there, because it's not coming from a normal browser, it's coming from a headless browser, it's coming from a
browser that's controlled by a script, so sometimes websites will have detection for that
and back over here, there's a plugin
called puppeteer-extra, with puppeteer-extra-plugin-stealth, that will actually add in all those headers. When you
use this instead of the default Puppeteer engine it's going to provide all those
additional headers, and if you go through their notes here you'll find that there's actually a couple of
websites that you can use to test, so you can load this URL in Puppeteer to see what
it says: does it detect that you are a headless browser? And this is another great website that'll give you
green or red for each of the different things that it tests, so you know okay, well that one failed but everything else I'm
passing. This is just me visiting the website, but if you go on a headless browser you're going to fail a lot of
these things, so we're going to add these things in here as well. Down here at the bottom we will
install those, so npm install puppeteer-extra and
puppeteer-extra-plugin-stealth. With those two things installed we're
going to be adding a couple of things up at the top here. Instead of the default plain import right here for
puppeteer, we're going to import puppeteer from puppeteer-extra and then
tell puppeteer to use this plugin, so this is the thing that's going to add in all those extra headers
we're also going to want to be able to save this in a JSON file to the system, so we can say import
writeFile from fs. That's the file system module that is part of Node, so we're
just bringing this in from Node; this is the function that we want to be able to save our files.
All right, and I'm just going to create a variable here. I'm just going to hard code this one
instead of going to the command line or environment variables: keyword mobile, that's the thing that we're going to
type. Now let's create our asynchronous IIFE, there we go,
and the same three steps as always: const browser equals puppeteer.launch, and I'm going to say headless false
because I do want to watch this happen, and then we'll say page is browser.newPage. And have you spotted what I
left off? Yes, my await commands, okay.
So everything that we're doing here: await, await, and await browser.close, there we go. Now to set a
viewport we could do that as well here, and if you are going to set a viewport remember to do it before you call the goto
command, so page.goto, okay
there we go so there's the go to after the viewport has been set we can take a screenshot at this point
let's call this one Algonquin home. Now we want to search for the input field, and I need to
wait until that has loaded, so this is the ID for that input that we're going to type the word mobile into.
Now another thing we can use waitForSelector for: it will actually return to us a reference to
the field that you want to type in, or more importantly the field that you want to click. So instead of
doing this one, let's do the one for the button; the button that we're going to click is a better example for
this, there we go. So this is going to be my button,
that's the search button, and once that's on the page then I know I'm good to go: I can type, I can click, I
can do whatever I want. And the keyword variable that we created up here,
this is what we're going to type. There's the input, that's the keyword, and
we're going to slow it down so we can see it being typed. Then I want to click on that button, but instead of doing
page.click I can do btn.click because I've got a reference to that
element. So running this, gotta save our page
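Here is a rough sketch of the type-then-click step, using the element handle that `waitForSelector` returns. The helper name `typeAndClick`, the option names, the 100 ms keystroke delay, and the selectors in the usage comment are assumptions; the transcript does not spell them out.

```javascript
// Sketch: wait for a button, type into an input, then click the button.
async function typeAndClick(page, { inputSelector, buttonSelector, keyword }) {
  // waitForSelector resolves to an ElementHandle for the matched element,
  // so the button can be clicked later without querying for it again.
  const btn = await page.waitForSelector(buttonSelector);
  // delay is the pause between keystrokes in ms; it makes the typing visible
  // when the browser is launched with headless: false.
  await page.type(inputSelector, keyword, { delay: 100 });
  await btn.click();
}

// Hypothetical usage (both selectors are placeholders):
// await typeAndClick(page, {
//   inputSelector: '#search-input',
//   buttonSelector: '#search-button',
//   keyword: 'mobile',
// });
```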
so run it so launching the home page and it should fill out down here at the bottom of the
search there it is okay and we got our screenshot yep there we are we got the screenshot we filled this in we clicked
it, and it is now going to navigate. The previous time I did the click with the
Promise.all method, clicking the button and waiting for navigation together. If you
want you can also do this as two separate things, so I've got the click for the button, and then after I've done
that I can wait for navigation and then wait for an element on the next page, and that's
what I'm doing right here. I'm going to click the button, I'm going to wait for the navigation,
wait until the page load event has fired, then I'm going to wait for this selector, so it's going to wait up to about 30
seconds for this to be found on the page. Then I'm going to take another screenshot which should show me
those two programs showing up there that should be giving me enough time for that so run it one more time
there we go fill in the form down here at the bottom click on the button
goes to the other page waits for the table and then does the screenshot to get program list
there it is and there we have it so we do have it now so the table does exist there it is in the screenshot so now I
want to take this data and I want to turn that into a Json file which was the whole thing that I was building towards
here first part just the setup you're loading the browser you're loading the page
you're filling in a form you're navigating to the place where you want to get the data now we've got that page
we're on the page where the data is so let's let's do this let's
run a function. I'm going to do the two dollar sign eval, $$eval. I want to go to that table that I
was waiting on to see if it existed, the table body TR, and remember there were four TRs when we looked at the source
earlier. Two of them had odd and even as the class names; the first two were just sort
of fillers, they were header ones that were inside the body. So inside of here,
this is what we're evaluating: we're doing basically querySelectorAll on this to get those things out of the web
page, and this variable right here is that collection of rows. Now normally when you
do document.querySelectorAll you get a NodeList, but Puppeteer actually will convert that NodeList into an actual
array, so the data type of this is actually an array. So now inside of here what I'm going to
do is I want to return some data so right here I'm going to return something from this function that will
be assigned to the variable data I'm going to take my rows and I'm going to call the map method on
them now this is going to create a brand new array based on the array of rows I'm
going to be building an array of objects, because that's what I'm going to turn into the JSON.
I don't want to have a null value for the first two. There were four rows
and it's only number three and number four, those are the two that actually have the data, so I'm going to do a
filter on the end of this, and I'm going to say okay, for the row, only return it,
so this will be either null or true depending on what we do right here. So inside of here,
for the individual row, looking at them one at a time, if row.classList
contains odd or it contains even, so these are the class names that I'm looking for, classList contains
even, so if either of those things is
true, there we go, now it will do this part. The else part will be the default; if I don't return anything that's what I'll get
here, but I'm going to hard code it just to make this a little bit more explicit that this is what I'm doing
so we're returning an object and inside the object I'm going to have a name property
I'm going to have an area campus length those are the values that we had inside of here
right here this the name the area the campus and the length those are the four properties that I'm going to be
returning. So row is the TR, and I need to get the TDs, the table data cells inside of there, so
let's go here: const tds equals row.querySelectorAll. I'm going to get all the TDs inside each
row, so if it's got the class odd or even I'm going to get all the TDs inside of it and I'm going to return
tds number one, its inner text, as the name. Area
is number two, and number three and number five are campus and length, so the length of the
program, there we go, and that's it. So we've gone inside the
table body, we have the four TRs, we're going to loop over the four TRs, and we're only going to take the two that actually have
data to return these objects. The other ones will return null, so I'll have an array that has four things, null, null, and
then object, object, but I'm going to filter to get rid of the nulls, so I will end up with an
array that has two objects in it. And right here let's console.log data, run that
and that should write out those two objects the array with those two objects
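The map-then-filter extraction described above can be sketched as a standalone function. The cell indexes (1, 2, 3, 5) and the `odd`/`even` class names come from the walkthrough; the function name `extractPrograms` and the `'table tbody tr'` selector in the usage comment are assumptions.

```javascript
// Sketch: turn table rows into an array of plain objects.
// When passed to page.$$eval this runs inside the browser, so it must not
// depend on any variables from the Node script.
function extractPrograms(rows) {
  return rows
    .map((row) => {
      // Only rows with class "odd" or "even" hold program data.
      if (row.classList.contains('odd') || row.classList.contains('even')) {
        const tds = row.querySelectorAll('td');
        return {
          name: tds[1].innerText,
          area: tds[2].innerText,
          campus: tds[3].innerText,
          length: tds[5].innerText,
        };
      }
      return null; // filler and header rows
    })
    .filter((row) => row); // drop the nulls
}

// Hypothetical usage (the selector is a placeholder for the real table):
// const data = await page.$$eval('table tbody tr', extractPrograms);
// console.log(data);
```

Returning plain strings in plain objects matters here, because whatever the callback returns has to be serialized to get from the page back to Node.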
filling in the form building the table get the data and there we have it there's our array
called data that has two objects inside of it. So now I want to take that array, that object
data, and I want to turn that into JSON and I want to save it here,
so let's save it inside this folder called data. I'm going to call writeFile,
and this is Node.js, just plain Node.js that I'm doing.
Inside my data folder I'm going to create a file called course details dot json,
there we go. What do I want to write inside there? Well, the data variable, but I can't just put data because that is an
object. I need to put JSON.stringify around that, so I'm going to turn
this array into a string and then that string will be written inside of this file.
And the final parameter for writeFile is, sorry, there's two more parameters. One
is the character encoding,
not the data type. The data type is JSON, the character encoding is utf-8. And then a function to run
when this is done. Error is a parameter that could be empty or it
could have an error object if there was a problem in writing this file, so we need to check that. We
say okay, if error happened I'm going to throw the error, so I'll actually have the error show up here on the console,
and if we get past that line, console.log saved the file, there we go
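A hedged sketch of the save step follows. It swaps the callback form of `writeFile` used in the walkthrough for the promise form from `node:fs/promises`, so errors surface through `await` instead of an error callback. The helper name `saveAsJson` and the file path in the usage comment are assumptions.

```javascript
// Sketch: write any JSON-serializable value to a file as UTF-8 text.
async function saveAsJson(filePath, data) {
  // A dynamic import keeps this sketch self-contained; a static
  // `import { writeFile } from 'node:fs/promises'` at the top of the
  // script does the same job.
  const { writeFile } = await import('node:fs/promises');
  // JSON.stringify turns the array into a string; (null, 2) pretty-prints it.
  await writeFile(filePath, JSON.stringify(data, null, 2), 'utf-8');
  console.log(`saved ${filePath}`);
}

// Hypothetical usage (the data folder must already exist):
// await saveAsJson('./data/course-details.json', data);
```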
and write file remember at the very top here we imported that right here from FS that's the file system in node
okay so there it is last time running this one there we are and there's our data file
saved screenshots we've got the Algonquin one we've got the program list so we have all of those
well that was the one here yeah so it's the Algonquin home and the program list those were the two screenshots that we
took and course details was available to us as well okay so that is how you scrape and pull content out of a web
page so you can save that in a text file for Access later on and the very last one last section in
here images how do you take that okay the scraping we saw how we got text Data how we could extract content from a web
page but what about images what if we have images on a web page that we want to download
so we're going to be looking now at the Unsplash website, try that again,
unsplash.com, there we go. So here what I want to do is I want to do a search, so let's say I want to look for
Lakes and I do a search and then I want to get all these images that we've got down
here so all these images that show up as results on the first page so I want to download those I don't just want to
screenshot yeah okay fine I can take a screenshot but I want to get the individual images themselves so how do I
do that all right so inside of here we're going to do the same thing that we always do
for startup I may as well just copy and paste a chunk from here take off these same things that we do at
the start every time so we'll import Puppeteer from puppeteer
and at the very bottom await browser.close, there we go, there's the end of our IIFE
and we're going to unsplash.com all right so launching the browser new page set the viewport go to the website
get a screenshot there's our basic start now what I want to do is I want to be able to
do searches. Now this is one where I probably want to change the term, so I will use the environment
variable or the CLI again to do that, so let's go back here to our test and grab that line.
We'll put this up at the top: search term CLI, and my default will be mountains, there
we go. So if I put something there when I'm typing node images.js and then a word that follows it, that's
what I will use as my search term. If I don't put anything there, mountains will be used. So this will be my search term
that I'm going to be using. All right, so we need to fill those in.
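The command-line search term with a default can be sketched like this. The function name `getSearchTerm` is my own; the `mountains` fallback and the `node images.js forest` invocation come from the walkthrough.

```javascript
// Sketch: take the search term from the command line, with a fallback.
function getSearchTerm(argv = process.argv, fallback = 'mountains') {
  // argv[0] is the node binary and argv[1] is the script path,
  // so the first word typed after the script name is argv[2].
  return argv[2] ?? fallback;
}

// Usage: `node images.js forest` gives "forest"; `node images.js` gives "mountains".
// const searchTermCLI = getSearchTerm();
```

A multi-word term needs quotation marks on the command line, since each space-separated word becomes its own `argv` entry.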
back to the home page here and inspect to see what it is that we're searching for input type equals search
all right so the best value here is probably going to be this data test attribute so input data test this value
so we'll just copy that come back here and we will wait for that so await page wait for selector
and this is an input square bracket around that there we go now it's a CSS selector so this is what
we're looking for and then oh here we'll take these two and just
put them in comments so that's the text field and then we're going to have another one for the button
and the button is going to be the one that we wait for so we can click it
so let's find the button that we click this one right here inspect and okay there we go so it's an SVG
inside of a button and again it's got a data test property perfect okay that makes it easy so we want a
button that has this attribute there we go all right so this is the button that we're going to
click so let's wait to make sure that that's on the page and then we can type
inside of this we want to write
our search term CLI and I'm not going to bother with the delay this time we'll just let it go
right ahead and stick it in there. Then we want to navigate to the next page and btn.click, those are the two things that
we're doing in our array, there we go. So I'm going to type, and then we're going to click and wait for the navigation
to the next page. We'll take a screenshot on the second page to see our results, and
we want to get the full screen we want to see all the images that are on there so we see how many results there are so
page dot screenshot path and we'll put it inside of screens so I'll call it search.jpeg
and I'm going to set full page to true because I want to see all those results
okay we can test this at this point because we're going to have to make some changes
here in a minute but let's try this out so node images dot JS and we'll search for Forest
there we go no quotation marks because there is no space in the search term it's just one
word so there we are unsplash and it'll throw forest in here there it is and jumps to
the next page does the screenshot and unsplash home there it is unsplash search
there we go so Forest yeah sure enough we've navigated we're on the next page Forest is the results
but we haven't waited for anything before doing the screenshot. We just said hey, as
soon as you navigate there, do this. Now we can wait for network idle, meaning
wait for the whole page to load, all the requests to be made, so as the HTML is parsed and the images are read
from the HTML and then requested, we can wait for all that to happen. So it'll wait,
I can't remember the time delay, 15 seconds or something like that, for network idle, so 15 seconds after the
last request is made, but we can wait for that. We can add something inside of here to say yeah,
before doing the screenshot let's wait for something that's going to be on the page, or give it a timeout, or use the
built-in wait for network idle, there we go. So running it now it'll
take a little bit longer but it's going to give us a screenshot that actually has all the pictures loaded on it now
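A small sketch of waiting for network idle before the full-page screenshot. The helper name `screenshotWhenIdle` and the path in the usage comment are assumptions. As far as I know, Puppeteer documents the default idle window for `page.waitForNetworkIdle()` as 500 ms, with a 30 second overall timeout, rather than the 15 seconds guessed at in the walkthrough; the value is passed explicitly here so it is easy to change.

```javascript
// Sketch: take a full-page screenshot once the network has gone quiet.
async function screenshotWhenIdle(page, path, idleTime = 500) {
  // Resolves once no requests have been in flight for `idleTime` ms.
  await page.waitForNetworkIdle({ idleTime });
  // fullPage captures the whole scrollable page, not just the viewport.
  await page.screenshot({ path, fullPage: true });
}

// Hypothetical usage:
// await screenshotWhenIdle(page, './screens/search.jpeg');
```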
so on the home page, fill in the form, navigate, and we saw the pictures before it actually did this, so now we
get the full page with the results there. Okay, so we've gotten to this point; we actually have to step back a little
bit we have to intercede to start getting those images before we do that screenshot before everything closes down
and that is by adding an event listener so inside of our code here right after we've done the new
page before we start navigating or we can do it after we navigate to the first one after the go to but usually you'll
add this event listener up at the top here so I will right up here say page dot on
inside of here I'm going to say what is the event what is the thing that I'm waiting for
and the thing that I'm going to wait for is a response and every response I want to call this function and I will get my
response object here there we go so it's if you've ever worked with
service workers this is very much like a service worker every time there is a response so every time something is
being loaded on the page could be a CSS file a font a script an image anytime anything is loaded this function
runs. So inside of here let's get the headers, say const
headers from the response object, there we go. So that is the headers for one response; remember this is going to
be running again and again and again. So const url equals response.url,
and I will actually put this into a URL object, new URL, there we go. So we're taking
that value, response.url is returning a string, and we're passing the string into a URL object so
that we can extract bits and pieces of it. I'm going to do that just so I can
demonstrate here, so console.log and the URL, there we go. So every time
there's a response I want to see the URL get written out. Let's slide this up, we'll move the
script up here, so inside here every time there's a response
we're going to get something written out in the console you can see a bunch of stuff flying by
over here there we go so here are all the URL objects now there's a lot of data inside
there href is the entire thing what I'm looking for really is just this part right here the path name I want
that so path name it's going to remove
the host from all that the protocol the port number all that stuff is going to be removed it's just going to be
that list of file names basically the folder and file names there we go so here is a list of all the
files now you can see there's CSS there's JavaScript other things API calls Json files fonts
all kinds of stuff most of it we don't care about we just wanted to get some images off this page
so what we're going to do is we're going to look at each one of these things that are
coming back we're going to look at the headers and if
the headers for the request that we just got back if it's got a property called content
length if it has that and the value of that includes
and I know I misspelled length there if that's bothering you I apologize there we go
includes this so now if
it's got a header called content length, which is the size, oh sorry, this would be the content type, it's going to include
this. We'll look at the length in a minute. The length is the size of the file in bytes; content type is stuff like
this is it an image and on the unsplash website this is the format that I'm getting back for all of those images
so if that is the case then what I'll do is I will console log that so I'm only going to
write out the images now there we go so this is all of the images just the images
now there's going to be some images in there that are really really tiny little icon images and things like that so
that's where the content length comes in so we're going to also look at that so inside of here we've got
headers content length exists and it has this and
we're also going to say if the response.url, or yeah, we'll do the response.url, we'll get
the whole thing, dot href, or actually we extracted this already, so just
url.href, there we go. So the href, if you remember, was the entire thing, so if that starts with,
the ones that I want are going to start with, https://images.unsplash.com/photo.
So that's another way that we can filter this list. Like profile, that's not one that we want, placeholder, avatar, those are not
ones that we want. We want the ones that were photo; if they have photo in the front of them
like this, that's one of the ones that we want to get. And the last thing was the content length,
so content length, and if that value is greater than, let's say,
yeah, thirty thousand, so this is 30k, if the image is bigger
than 30k we're going to write this out. So run it one more time to test, and if this works fine, if this is giving us a
small enough list a reasonable list one that we want with those images let's take a look here
yeah so from here down so we're only getting about 20 images that's a reasonable result set so we
want those images that seems like a reasonable size they're big enough images they're all the right format okay
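The three filters described above (content type, URL prefix, size) can be sketched as one predicate. The function name `isWantedImage` is hypothetical, and `image/avif` is an assumption based on the file format mentioned at the end of the walkthrough. Puppeteer's `response.headers()` returns header names in lower case with string values, which is why the size is converted with `Number`.

```javascript
// Sketch: decide whether a response looks like one of the result photos.
// `headers` is the plain object from response.headers(); `url` is a URL object.
function isWantedImage(headers, url) {
  const type = headers['content-type'] ?? '';
  const size = Number(headers['content-length'] ?? 0); // bytes
  return (
    type.includes('image/avif') &&
    url.href.startsWith('https://images.unsplash.com/photo') &&
    size > 30000 // skip icons, avatars and other tiny images
  );
}

// Hypothetical usage inside the response listener:
// page.on('response', (response) => {
//   const url = new URL(response.url());
//   if (isWantedImage(response.headers(), url)) console.log(url.pathname);
// });
```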
so we have the names what we want to do is we want to save those not inside of screens but we've got this
other folder called images this is where we want to put them and so very similar to what you do with
a fetch call or what you would do with a service worker when you're handling fetches
we've got a response object. In the browser, in JavaScript, if you've got a response object you can call the blob
method like this; it's going to be an asynchronous call that will return the binary data to you. There's something
similar in Node, and what we're going to do is we're going to put await in front of this. We made our function
asynchronous so that we could do this, and then await will take our response object. The method that we're going to
call is buffer, so this gives us the array buffer. In the browser the method is called arrayBuffer,
in Node it's just called buffer. Once you have that, then we're going to call another function, which we'll make
asynchronous as well, and here is the buffer, so here is the data that we got back. So the array
buffer, this is all the data inside the image, so it's the binary data of the image. Here you go, here's the data. I want
to take that data and I want to save it as a file, so asynchronously
read the buffer, pass it into this function right here asynchronously, and we call the same
method that we did on our last one to save the JSON, so writeFile
and we're going to call I'm going to save it inside of our images and we'll use these names that we
already have so we'll turn this into a template string with backtick characters
extra one there we go and inside of here this is the URL path name that we're writing out
there we go so that is the data or that's the file name buffer is the data that we're going
to save and then we have that error handling function at the very end or the function that gets called when it's
complete if error throw the error so we write something
out to the console and we don't need to write out anything else we can add a catch on there if we
want to handle it more gracefully so if the first image fails the rest of them might
not fail so I'll say console.log fail to
show you what I'm writing here failed to save image
yeah so there we go so if we throw an error
inside of here the catch that's chained onto the end of this will
write this message out for us all right so clear that out running this for what should be the last time
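The save step just described (await the response's buffer, write it into the images folder, and catch a failure so one bad image does not stop the rest) can be sketched as follows. This is not the code from the video: it swaps the callback-style `writeFile` for the promise-based one in `node:fs/promises`, and the names `saveImage` and `saveImageSafely`, the `dir` parameter, and the rule of using the last URL path segment as the file name are all assumptions. It uses dynamic `import()` only so the snippet runs in both CommonJS and ES module projects; with `"type": "module"` a static import at the top of the file does the same job.

```javascript
// Sketch only: function names, `dir`, and the naming rule are illustrative assumptions.
async function saveImage(response, dir = 'images') {
  // Dynamic import works whether or not the project uses ES modules.
  const { writeFile } = await import('node:fs/promises');
  const path = await import('node:path');

  // Puppeteer's response.buffer() resolves to a Node Buffer with the raw bytes.
  const buffer = await response.buffer();

  // Assumed naming rule: the last segment of the URL path becomes the file name.
  const name = new URL(response.url()).pathname.split('/').pop();
  const file = path.join(dir, name);

  // The target folder must already exist, as it does in the walkthrough.
  await writeFile(file, buffer);
  return file;
}

// Wrapper with a catch so that one failed image does not abort the others.
function saveImageSafely(response, dir) {
  return saveImage(response, dir).catch((err) => {
    console.log(`failed to save image: ${err.message}`);
    return null;
  });
}
```

Returning `null` from the catch is a design choice for this sketch; it lets the caller tell saved files from failures without another try/catch.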
so writeFile is not defined and that's because we didn't import it back up at the top
so import writeFile from fs there we go so we've done the import of
that method so this can now save them launch the browser yeah okay let's strip this off here at
the end I'm not going to worry about the catch now hopefully this will be the last time
that we run this there we go fill in the search run the search there's the results save the
images write them out and our images folder now has a bunch of image files these are
actually image files but VS Code doesn't know what to do with them because there are no file extensions
so right here at the end add .avif because that's what these images are there we go all right so
there's all of our forest images the ones with actual file extensions we have it
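In the walkthrough the `.avif` extension is simply hard-coded because every image on that site happens to be AVIF. A slightly more general variant, offered only as a sketch, picks the extension from the response's content type and falls back to `.avif` otherwise. The function name `fileNameFor`, the `EXTENSIONS` table, and the fallback choice are all assumptions, not part of the original code.

```javascript
// Hypothetical lookup table from MIME type to file extension.
const EXTENSIONS = {
  'image/avif': '.avif',
  'image/webp': '.webp',
  'image/jpeg': '.jpg',
  'image/png': '.png',
};

// Build a file name from the image URL and its content-type header.
function fileNameFor(url, contentType = 'image/avif') {
  // Assumed naming rule: last segment of the URL path.
  const base = new URL(url).pathname.split('/').pop();

  // Drop any parameters such as "; charset=..." before the lookup.
  const mime = contentType.split(';')[0].trim().toLowerCase();

  // Falling back to .avif mirrors the hard-coded choice in the walkthrough.
  const ext = EXTENSIONS[mime] || '.avif';

  // Do not double up if the name already carries the extension.
  return base.endsWith(ext) ? base : `${base}${ext}`;
}
```

The result can be dropped into the template string that builds the path inside the images folder.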
all right so hopefully that will give you a whole bunch of inspiration for all the different kinds of things that you
can do with Puppeteer again if you're looking for a copy of the code it's down in the description if
you've got any questions feel free to put those in the comments I answer whatever I have time for
and as always thanks for watching