Exploring Puppeteer and Headless Browsers: A Comprehensive Guide


Introduction to Puppeteer and Headless Browsers

What is Puppeteer?
Puppeteer is a Node.js library that provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. It allows developers to automate browser tasks, perform web scraping, and run tests without a visible browser interface. For more on web scraping techniques, check out our summary on Effortless Data Scraping from Any Website with Advanced Automation.
What is a Headless Browser?
A headless browser is a web browser without a graphical user interface. It can be controlled programmatically, allowing developers to run automated tests or scrape web pages without displaying the browser window.

Getting Started with Puppeteer

Installation
To install Puppeteer, run the command:
npm install puppeteer@19.11.1
This installs version 19.11.1, the version the examples start with. The walkthrough later falls back to 19.5.2 after a crash during the YouTube test.
Basic Setup
Create an npm package and set up your project structure. Use ES6 imports by adding "type": "module" in your package.json.

Key Features of Puppeteer

Launching a Headless Browser
Use await puppeteer.launch({ headless: true }) to start a headless browser. You can also set viewport dimensions and geolocation.
Navigating to URLs
Use await page.goto('https://example.com') to navigate to a specific URL.
Taking Screenshots
Capture screenshots using await page.screenshot({ path: 'screenshot.png', fullPage: true }).
Web Scraping
Extract content from web pages using page.evaluate() to run JavaScript in the context of the page. For more insights on web scraping, refer to our guide on Understanding Headless, Boneless, and Skinless UI in Modern Development.
Automated Testing
Automate user interactions like filling forms and clicking buttons with commands like await page.type() and await page.click().

Advanced Usage

Handling Asynchronous Operations
Use Promise.all() to wait for multiple asynchronous operations to complete, such as clicking a button and waiting for navigation.
Downloading Images
Scrape images from a webpage by filtering responses based on content type and size, then save them using the fs module.
Using Plugins
Enhance Puppeteer with plugins like puppeteer-extra and puppeteer-extra-plugin-stealth to avoid detection as a headless browser. For more on browser automation tools, see our summary on Unlocking the Unlimited Power of Cursor: Boost Your Productivity!.

Conclusion

Puppeteer is a versatile tool for developers looking to automate browser tasks, perform web scraping, and conduct testing. With its powerful API and support for headless browsing, it opens up a world of possibilities for web automation. For code examples and further details, refer to the video description.

All right, so today we're talking about Puppeteer and headless browsers. What is Puppeteer? What is a headless browser? Why should you care, and what are the cool things that you as a developer can do with a headless browser?

So, the website for Puppeteer, the package that we're going to be focused on today: pptr.dev is the website. This is where you can come for all the references. There are some basic guides here at the start. If you go to the API section, this is probably where you're going to spend most of your time, looking up references for things to make sure that you're using them properly. You've got all the different parts inside of here, all the different methods, lots and lots of stuff, and a lot of them have little code samples like this that you can use. So we're going to be using that.

I'm going to be using a specific version of Puppeteer, and I'll show you in a minute how to install it. I'm going to be using version 19. Version 20 is fairly new, and I've tested all the examples that I'm going to be showing you today with version 19, so we'll be using that.

Puppeteer is a package that uses the Chrome DevTools Protocol. This is a protocol that you can use yourself: you can write a program that uses this protocol to access all the dev tools. So when you inspect a web page and you go in here, access to all these tools can be given to you through a script. You can write your own program, just like Puppeteer, which will access all that information. Chromium and Chrome support this, and Firefox also supports the DevTools protocol, so there's lots of cool stuff we're going to be able to do with this.

All right, so that's the website. I'm going to jump back into the code here, and let's do a little bit of setup. I have some basic pages here, just with some comments in them on the things that we're going to be talking about today. So, basic setup for the project: right now we're going to create an npm package, so npm init -y, to say yes to the defaults. Here are my four scripts, there's my package.json that got created, and then these are empty folders that we're going to be using in our scripts. We're going to be putting stuff inside there automatically.

We're also going to npm install the package puppeteer. Now, if you just do npm i puppeteer or npm install puppeteer, you're going to get the default version, which is the current version, 20.3.0. But I'm going to be installing this version right here, the version right before 20.0, so we'll say puppeteer@19.11.1. That's the version that I'm going to be using here. If you want to follow along and use my code, you'll find links to all of this code down in the description, so you can follow along or just have a copy of your own to play with.

There it is. One last thing that we want to change here: we want to go into the package.json, and because I'm going to be doing imports, I want to set "type": "module" so I can use ES6 imports instead of the older CommonJS Node require statements. So here we go: import puppeteer.

Now we want to use a headless browser. So what is a headless browser? Basically, when you launch the browser yourself, here you go, this is a browser. It's got the chrome, it's got the dev tools, it's got everything here. You're watching everything going on, and you are interacting with it yourself. But if you're running it from a script, it's really optional whether or not you actually see the web pages. If you've got a JavaScript file that's running, and the JavaScript is talking to the different dev tools and it's downloading pages and running scripts and doing things, and it's not showing the visual interface, not showing you as the developer the browser, that is a headless browser.

So that's what we're doing here with Puppeteer: we can use headless browsers. We don't need to watch the automated tests, we just need to get the results. I want to get a JSON file that has the content from a page, so that's scraping. Or testing: I want to test the interface. I want to make sure that this works: if I navigate to this page and click on this button and fill in this form field and then click on the login button, does it take me to another page? Can I run a search by typing into the search field and clicking the button? Then can I get a screenshot of the results of that? And I can do all of that without actually having the browser launch. Now, in this video I will be showing the browser. I'll tell it, yeah, go ahead and show me, so I can see what it's doing. But you do not need to actually display the browser while you're doing these tests.

All right, once we have Puppeteer installed, what we're going to be doing is running a whole bunch of commands that are asynchronous. So I've got an IIFE here, an immediately invoked function expression, and I'm just going to make the function an asynchronous one. Inside of here is where I'm going to put my code, so this will run my script as soon as the file loads. Inside of here I'm going to be running a whole bunch of commands, and I'm going to be putting await in front of all these things, because I want to make sure that they are done in a specific order. Now, I could chain them together with .then().then().then(), but I'd be adding a lot of extra code, a lot of extra curly braces and parentheses that I don't want to see. I want to be able to do things like this: const browser = await puppeteer.launch().

This is the basic command to start it. This is, hey Puppeteer, can you please go and launch the default browser. Every version of Puppeteer has a specific version of Chrome that is installed and used. The first time I run this, it's going to take an extra few seconds because it has to actually install its version of Chrome. So, puppeteer.launch(), and then we're going to say I want a web page, I want to actually create a tab inside of there, so we'll say browser.newPage(). This will create a tab inside the browser. And at the very end, every time you do this, the last step is always going to be browser.close(). When you're done with the browser you want to shut it down; you don't need it running in the background. There we go, that's the absolute minimum.

Let's launch a browser. We'll run this with node, so node basic.js. I'm running this script and it's giving me a warning, because I didn't put any options in here for the launch. There's a new thing where they want us to say "new". Right now, with nothing passed inside of here, launch is going to be using all the default values. In these options, headless is one of the options, and we can say true to say headless, I don't want to see it. There are some new features coming and they're going to be deprecating the true value; they want "new" to be there instead of true. So launch it using headless: "new", and that will behave like headless: true.

So if I run it, okay, now it's actually installing Chrome, so it's going to take a few seconds. Once it's ready, it launches it in the background. Because we're saying that it is headless, it's not going to actually display the browser to us. There, our script is done. It ran, it installed Chrome, it launched it, opened a tab, and then closed it. We didn't see anything. But if I change this from headless: "new" to headless: false, what I'm saying is, yeah, I actually want to see it. I want to follow along with what you are doing, I want to see the browser actually launch. It's taking a second, and there it is. That little flash on the screen was the browser launching, opening a tab, and then closing.
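The launch, open a tab, close sequence described above could be wrapped roughly like this. This is only a sketch: runSession is a hypothetical helper name, and the launcher argument is assumed to be the object imported from the puppeteer package. The "new" headless value is the one mentioned for 19.11.1; older releases may only accept true or false.

```javascript
// Hypothetical helper (not part of Puppeteer): launch a browser, open one tab,
// run `task` against that tab, and always close the browser afterwards.
// `launcher` is assumed to be the default export of the 'puppeteer' package.
async function runSession(launcher, task, headless = 'new') {
  const browser = await launcher.launch({ headless });
  try {
    const page = await browser.newPage(); // creates a tab inside the browser
    return await task(page);
  } finally {
    // The last step is always browser.close(), even if the task throws.
    await browser.close();
  }
}

// Possible usage inside an ES module, assuming puppeteer@19 is installed:
//   import puppeteer from 'puppeteer';
//   await runSession(puppeteer, (page) => page.goto('https://example.com'), false);
```

Passing false as the third argument shows the browser window, which is how the rest of the walkthrough runs.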

All right, well, that wasn't very exciting, so let's actually do something. Let's go somewhere on the web. We'll say await, and on my page I want to call goto, and here's the URL that I'm going to go to: https://google.ca. Now let's try that one. Now it will actually launch the browser and then navigate to google.ca. There it is, and then the browser shuts down.

So this is what we're doing with Puppeteer: we're using a script to automatically launch a browser so that we can do tests. All right, now I've got my page here. What are some other options? Well, before we actually go to a URL, you can set the viewport on the page. You can say, hey, I want my browser page to be these dimensions. So I can say await (like I said, there are going to be a lot of awaits here) page.setViewport(), and inside of here we pass in the options. I'm going to set the width at 1600 pixels and the height at 1000 pixels. I'm going to say pretend it's not a mobile browser, so isMobile is set to false. isLandscape, yeah, we'll set that to true, so it's a landscape display, and these values are going to be used by the CSS to render things. hasTouch, so is it a touch device? I'm going to say false. And deviceScaleFactor: this is not the aspect ratio but the pixel density of the screen, so I'm just going to set it to 1. Again, something that will be used in the CSS. So we can set the dimensions of the screen.

If you're visiting a page that uses geolocation for anything, you can set that too. There is page.setGeolocation(), which as always we start with await, and you can set your latitude and longitude. So if I was visiting a page that had a map, these are the coordinates that the browser would feed to the map to say, hey, this is where I am.
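As a rough sketch of the two settings just described: the viewport values below are the ones read out above, while preparePage is a hypothetical helper name and the coordinates are placeholder values, since the transcript does not give the ones used. A page may also need the geolocation permission granted through the browser context before it can read the position; that step is not shown.

```javascript
// Viewport options as read out in the walkthrough.
const VIEWPORT = {
  width: 1600,
  height: 1000,
  isMobile: false,
  isLandscape: true,
  hasTouch: false,
  deviceScaleFactor: 1, // pixel density, not aspect ratio
};

// Hypothetical helper: apply the settings that are typically made before page.goto().
// `page` is assumed to be a Puppeteer Page; the default coordinates are placeholders.
async function preparePage(page, coords = { latitude: 45.4215, longitude: -75.6972 }) {
  await page.setViewport(VIEWPORT);
  await page.setGeolocation(coords);
}
```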

Those are the kinds of things that you would typically set before you go to the page. Once the page has loaded, we can start looking at the page and doing things like get the content, get the title, get the URL. So const url = await page.url(). This is a method that will give us the URL of the page that we're on. Now, we already know what we set it to, but maybe you did something dynamic in the page to navigate someplace else. You've filled out a form, you've clicked the submit button, and you want to know what URL you're on. We can also take a look at content. This will take the source code of the page, and we can take a look at that; it can be displayed.
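A small sketch of reading the URL, title, and content once a page has loaded. describePage is a hypothetical helper name, and page is assumed to be a Puppeteer Page. In Puppeteer, url() returns a string directly, so awaiting it as in the walkthrough is harmless but not required.

```javascript
// Hypothetical helper: collect basic facts about the page that is currently loaded.
async function describePage(page) {
  const url = page.url();            // current URL, possibly changed by a redirect or form submit
  const title = await page.title();  // contents of the <title> element
  const html = await page.content(); // full HTML source of the page
  return { url, title, htmlLength: html.length };
}
```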

So we have all of that content for us as well. Taking a screenshot is a very common thing to do, and there is a screenshot method. With this we have to pass in some options. At the very least, what we want to pass in is an option for path: where do you want to save it? We're not talking here about the URL of the page, we've already loaded a page. Let's say I'm going to save it in my screens folder, so inside my screens folder I'm going to say sample-google.jpeg.

Now I'm going to do a second one here, just to show the other options that we can set. We can add a clip object. With the clip object you can set the x and y coordinates and the height and width of the area that you want to capture, starting at that x and y coordinate. So let's say, starting at 200 pixels x and 200 pixels y, I want a screenshot that has a width of 500 pixels and a height of 500 pixels. In addition to clip and path, we can set whether or not it's going to be a full-page render. Actually, we should do this on the first one: take this out and put it here. fullPage by default is false, but I'm going to set this one to true, and that means it's not just the screen that you're seeing right here but the entire page. If the page is longer than your screen, it will capture the entire thing.

And we could do something else here. Let's go to chapters.indigo.ca, so we'll go there instead, and change these names to make more sense: sample-chapters-1 and sample-chapters-2. There we go. All right, so path, fullPage, clip, and we've got encoding. We can set the encoding to base64 or binary. The default is binary, meaning you're going to get an image, and if it is binary you can set the type to say I want JPEG or I want PNG. Those are the two options, JPEG and PNG.

All right, so with that done, we'll run it again. This will launch the browser, go to the Chapters website, load the page, get the URL, get the content, and take two screenshots. So here we have all the content: we wrote out all the HTML and CSS that were inside there. If you view source on the page, this is what you're getting. I'll scroll way up; actually, there's more than would fit inside of here. So we got all that content, and the URL was written out before that as well.

And here are the two images. Here is the fullPage: true one, so you can see it's more than just what we saw on the original screen; it's the entire page. And then the other one, where we clipped: we went 200 pixels over and 200 pixels down, and then it was 500 pixels wide and 500 pixels high, so we were getting this zoomed-in section of the page right here. Okay, so that is basically how this works.
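The two screenshots described above, one full-page and one clipped, might look something like the sketch below. captureScreens is a hypothetical helper name, the exact file names are assumptions based on how they are spoken in the walkthrough, and page is assumed to be a Puppeteer Page. The screens folder is assumed to exist already.

```javascript
// Hypothetical helper: take one full-page screenshot and one clipped screenshot.
async function captureScreens(page, dir = 'screens') {
  // Whole page, including the part below the visible viewport.
  await page.screenshot({
    path: `${dir}/sample-chapters-1.jpeg`,
    type: 'jpeg',
    fullPage: true,
  });

  // A 500 x 500 pixel region starting 200 pixels over and 200 pixels down.
  await page.screenshot({
    path: `${dir}/sample-chapters-2.jpeg`,
    type: 'jpeg',
    clip: { x: 200, y: 200, width: 500, height: 500 },
  });
}
```

The clip and fullPage options are kept on separate calls here, since one asks for a region and the other for the whole page.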

If you want to click or type in stuff, you can do that as well. There are commands for things like that. So we can say page.type(), and then you'd have your CSS selector and the text that you want to type. We can also say await page.waitForSelector(). Now this can be useful. Let's say that these are CSS selectors. waitForSelector is for when I've got a page that's loading, but it loads a little bit, then a little bit more, then a little bit more. Maybe it's fetching data from the server or something, and it's not until it's finished and has actually displayed some part of the page that I want to do the next step. So what we're doing here is saying, wait until you can find the selector. It does have a default timeout, I think it's about 30 seconds, where if the element hasn't loaded it'll just throw an error and stop running the script. But this is very useful at times in your script, to pause before moving on. With type, we're going to find this element, so assuming that this is some sort of input element, an input or textarea, I want to type this text inside of it. So, a couple more options for you. We will be looking at a whole bunch more when we get into the testing and scraping, but this is what we have for the basics.

So: setting the viewport, setting the location, navigating to URLs, retrieving the current URL, getting the page title, getting the page content, taking screenshots. All these kinds of things, all these automated tasks that you would do to test your page, to make sure it's working, to get proof that it's working, or to download content from there.

We're going to be talking about running some UI tests, we're going to scrape some content out of another website, and then we're going to fetch some images. We're going to go to the Unsplash website and actually download some images from that website and save them locally, so we have copies. We're going to generate a JSON file based on content that we have on a website after we fill out an interface and do something. And the testing will check that we can step through some stuff.

All right, so let's jump into the testing here, and I'm just going to hide this temporarily. We're going to go to YouTube. On the home page I want to get two different kinds of screenshots: a regular screenshot and a blurred screenshot. So assume that we're wearing a testing hat right now. We want to test our website to make sure it's working. We've built this wonderful thing called YouTube and we want to make sure that it's working the way that we expect. So I want to get two screenshots, a regular one and then one that's been blurred slightly, and we're going to get the slightly blurred one to make sure that things stand out, that things are large enough to read for people with visual disabilities. There's a whole bunch of different types of vision deficiencies that we can emulate with this as well. We're going to do just this one, with the blur, but we'll show you how they all work.

We're going to fill in the form: on the YouTube home page we want to fill in a search form, and we want to click the button to do the search. We're going to get a screenshot after we get search results back. Once we have search results back, we want to read the title of the first search result. We're going to click on that, navigate to the next page, look at the content there, and get a screenshot. We want to count the number of comments that have been written there. We're also going to get the title of the first suggested video in the sidebar for our search result.

So we're going to step through all these things and get a whole bunch of screenshots, just so we can make sure that we understand how to do some testing. All right, puppeteer, we've got that imported. We're going to do our async IIFE as always, just like this, and it's always going to be the same commands here. We're always going to be doing the same thing that we did back on the basic page. These are going to be your first two steps, and then your last one will be browser.close(), like that. I'm going to be using headless: false for all of these, because I want to see it running. I want to see that it's doing the steps that I expect.

Now what we're going to do inside of here is go to the YouTube home page. Once we get there, we want to fill out that search form. Now, I could hard-code the value that I want to type into that search form every time I test, or I can give myself a little bit of flexibility and say, hey, what if I took the value that I'm going to search for and put it into an environment variable, or into one of the arguments on the command line as I run my script? So let's look at the two ways of doing that.

First of all, we'll create a couple of variables: mySearchTermCLI and mySearchTermENV. So, from the command line and from environment variables; we want to be able to get these two things.

From the command line, what I mean is this: if I'm going to say node, let's run the script test.js, and then I pass in options, so I'm going to say that I want to search for the term, uh, "Green Day". This is what I'm going to search for. If I do this, this value right here is actually going to be available to me in Node.js. So from the command line we're going to go to process.argv. This right here is going to be the array of everything, the arguments on the command line. This is number zero, this is number one, and this is number two. So as long as the length of this array is greater than or equal to 3, I will proceed and get process.argv[2], this thing, whatever I wrote here. Or, if it wasn't three or longer, then I'm going to pass in my default value. So I still have a hard-coded value, but the hard-coded value is used only if I didn't type something here.

Now, the other option is using environment variables. If you want to create an environment variable, a variable that exists while your script is running, we can do that like this. In Unix, Linux, or Mac we use the command export. If you're on Windows the command is set, but the rest of it is the same, just set instead of export. We're going to say SEARCHTXT is equal to, and then we use quotation marks if there are going to be any spaces in there. You don't have to put the quotation marks if there's no space in what you're typing, but I knew that I was going to type Green, space, Day, so I need the quotation marks around it. All right, so export SEARCHTXT, boom, that's it. I have now created an environment variable, and I can take a look at it by putting a dollar sign in front of it. There it is. If I want to access it here, again it's the Node process.env, dot, and what was the name of it? It was called SEARCHTXT. And if that doesn't exist, again here's my default fallback: Volbeat.

All right, I now have two variables, and I can use either one of these down inside my script as what I want to fill in. So we've gone to our home page and we have our text that we can fill in. We're going to use the type command to do that. We'll say await page.type(), and inside of the search box element I'm going to put one of these two things right here. I'll use the CLI one. It doesn't really matter; both of them will work for what we're trying to do here.
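The two ways of supplying the search term described above could be sketched as two small functions. The function names are hypothetical, the SEARCHTXT spelling of the environment variable is an assumption based on how the name is spoken, and using Volbeat as the command-line default is also an assumption (the walkthrough only names it as the environment fallback).

```javascript
// Hypothetical helpers for reading the search term.

// From the command line: node test.js "Green Day"
// argv[0] is the node binary, argv[1] is the script, argv[2] is the first argument.
function searchTermFromArgs(argv = process.argv, fallback = 'Volbeat') {
  return argv.length >= 3 ? argv[2] : fallback;
}

// From an environment variable: export SEARCHTXT="Green Day"
function searchTermFromEnv(env = process.env, fallback = 'Volbeat') {
  return env.SEARCHTXT ?? fallback;
}

const mySearchTermCLI = searchTermFromArgs();
const mySearchTermENV = searchTermFromEnv();
```

Taking argv and env as parameters is only a convenience so the functions can be exercised without touching the real process state.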

Now, I know the name of the field because I looked this up while I was doing the prep. Up here at the top, in the comments, this is the name of the search input, and this is the name of the button that we're going to click on beside that search input. So we're telling it that we want to find this thing and type whatever this value was inside that field.

Now, if I just leave it like this, it takes JavaScript no time at all to write that text inside the input. If I want to watch and see it being filled in, and because I'm running with headless: false I can actually watch it being typed, there is an option we can pass in here called delay. We can say, for each character, take a hundred milliseconds, so a tenth of a second for each letter that you're typing, and you can actually watch it being typed on the screen. Maybe there's some validation, something that's going on with the input event while you're typing, that you want to watch. So this is something you could do to slow down the entry of data into your web pages.

All right, one other issue. Right now I've gone to the page and I've typed this into the input. The problem is, just because I've gone to this page doesn't mean that this input is ready for me yet. I don't know if it has actually been found on the page yet, so I'm going to wait. I will say page.waitForSelector(), a great command where I can put the same selector inside of here and say, okay, wait until this thing exists before you start trying to type in it.

Once that's done, I'm going to take a screenshot of this as well. We'll save it right here inside our screens folder. And remember, I wanted to do two: one that was blurred and one that was not. So I will be calling screenshot twice, and we'll call it screens/youtube-home.jpg, and I'll just use the default options. Oh, sorry, this is an options object with path; it's not just a string. So I'm going to do this twice, home and home blurred. I'm going to do the blurred one first, and to get it blurred you have to call the method emulateVisionDeficiency. There it is, and you just pass in a value, so blurredVision. From this point on, all screenshots will be blurred until I do the same thing again, but instead of blurredVision I pass in none. All right, so we can try to run this now.
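Put together, the wait, type, and two-screenshot steps above might look roughly like this. typeAndCapture is a hypothetical helper name, the selector is supplied by the caller because the walkthrough does not spell it out, and the file names are assumptions. page is assumed to be a Puppeteer Page.

```javascript
// Hypothetical helper: wait for an input, type into it slowly, then take a
// blurred and a regular screenshot of the page.
async function typeAndCapture(page, selector, text, dir = 'screens') {
  // Wait until the input exists; by default this gives up after 30 seconds.
  await page.waitForSelector(selector);

  // 100 ms per character so the typing can be watched with headless: false.
  await page.type(selector, text, { delay: 100 });

  // Every screenshot taken after this call is blurred...
  await page.emulateVisionDeficiency('blurredVision');
  await page.screenshot({ path: `${dir}/youtube-home-blurred.jpg` });

  // ...until the emulation is switched off again.
  await page.emulateVisionDeficiency('none');
  await page.screenshot({ path: `${dir}/youtube-home.jpg` });
}
```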

We're going to say node test.js and pass in the search string that we want, so we'll say "Green Day American Idiot". There we go: launch the browser, load the YouTube page, and it craps out. Okay, not a problem. We can use Control-C to kill the browser. I was hoping this wouldn't work. I have sometimes had issues with versions of Puppeteer above 19.5, so what we're going to do is switch back from 19.11 to 19.5. It's very easy to switch: just npm install puppeteer@19.5.2. There we go.

With that done, we'll run this command again, and that should run the version that we just installed. And there we go. It's loading the page, and up at the top we can see it typing, and then it took two screenshots very rapidly at the end. So here we have the clear version, there's the screenshot, and here's the blurred version. We have both of those screenshots taken. If you want to find out whether the text that you have is big enough to read for somebody with a visual disability, if somebody's got poor vision, or somebody's old like me and doesn't have their glasses on, will they be able to read the content on the page, the text on the page? This is a good way to find out. So these titles, yeah, those are fine, those are plenty big, but this might be a little bit hard, and these labels over here are definitely too small to read if the text is blurred like that.

Okay, so we moved from version 19.11 to 19.5, and now we're back and everything's running fine. What else do we want to do? Okay, yeah, we've got the search field filled in. Now I want to click on the button to navigate to the next page, so we're going to do that.

One thing about navigating, though: when you click on the button, sometimes you click and you also want to wait for the navigation to finish before you move on to the next thing. We could do it as two steps, or a common way, actually the recommended way from Puppeteer, is to do this: await, sorry, not new, but Promise.all(). If you've never worked with Promise.all() before, what you can do is pass in an array of asynchronous calls, so just a comma-separated list of a whole bunch of asynchronous things, and it will not complete and report back to the await that it's finished until all of those tasks are completed.

So what do we want to do? Well, we've typed in this thing, then we're going to click on the search button, which we have up here. This is it right here; that's the button in the search form, to navigate to the next page. So we're going to do these two things. We're going to say page.click(), no semicolon there, but we also want to do page.waitForNavigation(). These two things are going to be inside our Promise.all() array right here, and we won't go down to the next step until both of those things have finished.
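The click-and-wait pattern just described can be sketched as below. clickAndWait is a hypothetical helper name and the selector is supplied by the caller; page is assumed to be a Puppeteer Page. The navigation wait is listed first here so that it is already listening when the click fires.

```javascript
// Hypothetical helper: click something that triggers a navigation and wait
// until both the click and the navigation have finished.
async function clickAndWait(page, selector) {
  const [response] = await Promise.all([
    page.waitForNavigation(), // resolves when the new page has loaded
    page.click(selector),     // triggers the navigation
  ]);
  return response; // whatever waitForNavigation resolved with
}
```

Awaiting the click on its own line and then calling waitForNavigation afterwards risks missing a navigation that has already completed, which is why the two calls are started together.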

Now, wait until the next page. Once we get to the next page, what do we want to do? I'm going to do another one of the waitForSelector things. I want to wait until something that I want is on the screen, then I'll do another screenshot. I'm just going to copy these in; I have some of these written out already. This is the one that I'm looking for right here: the YouTube video renderer, and inside of it there's an h3 with an anchor. So this is the search results. Once you run a search, there's the container, and then each one of those elements in the search results is an HTML custom element. If you've never worked with custom elements before, there's a link up there at the top to a series that I've done on custom elements. Inside this one we've got an h3 element, and inside that there's an anchor tag with this ID. That is the thing that we want to click on; that's the title of the video that we want to see. And we're going to get a search-result screenshot before we move on to the next page. So we want to find the first one of these, this will give us the first one, and move on to the next screen.

Now, we can just test this before we go on, to make sure we're getting that screenshot. So we'll do it again. We will launch it, running that search. We'll see the form get filled in, and the button that it's going to click on is right here. I'm not clicking on it; the script clicks on it. It moved to the search results. It was only there very briefly, but it did give us this search-results screenshot right here. These items down the side here, these are those YouTube elements, and this part right here is the h3 that we're actually clicking on to go to the page for this video.

All right, so we're going there. Then, once we have it, I want to get the text. It's not enough just to navigate and go from page to page to page; I want to actually get the text out of here and display it. We've got another method that we can use to do that.

So I'm going to say my first match in my search results is page.$eval(), with a single dollar sign. There's one dollar sign or two dollar signs, very much like querySelector and querySelectorAll. You pass in the query selector right here, so find the first one of those on the page. If I used two dollar signs, this would find everything that matched, so all of the video titles, all the anchors inside all of those components. And when that happens, this function right here is going to run. I'm going to return from this function the value of that element, the element that matches this selector right here. So we found the element, and we're going to return not the element itself but the innerText that's inside of it. That value is going to be put back into this variable right here, so we have access to it and can use it.
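A short sketch of the one-dollar and two-dollar variants just described. firstTitle and allTitles are hypothetical helper names, the selector is supplied by the caller since the exact one is not spelled out in the walkthrough, and page is assumed to be a Puppeteer Page.

```javascript
// Hypothetical helper: text of the first element matching `selector`
// (the single-dollar form, similar to querySelector).
async function firstTitle(page, selector) {
  return page.$eval(selector, (el) => el.innerText);
}

// Hypothetical helper: text of every element matching `selector`
// (the double-dollar form, similar to querySelectorAll).
async function allTitles(page, selector) {
  return page.$$eval(selector, (els) => els.map((el) => el.innerText));
}
```

The callback runs inside the page, so it should return plain values such as strings rather than the DOM elements themselves.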

after we have that then we're going to want to write that out to the console and then we're going to click on it and

navigate so we're going to do the same sort of thing that we did right here the promise all

and first we're going to console.log first match and I'm going to wrap it in curly braces

just so we can see it it'll say first match and then the label once that's done we want to navigate to

the next page so we will await another promise all with an array inside that array what are the things that we

want to do well again we're going to wait for navigation and we're going to click on

this same thing, that H3 anchor element inside there that had the results, we're going to click inside of that

and let's take a look to see what happens there
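The click-and-navigate pattern described here can be sketched as a small helper. It puts the wait for navigation and the click in one Promise.all array so the script only moves on once both are done. The helper name is made up for illustration.

```javascript
// `page` is a Puppeteer Page; `selector` is the element to click.
async function clickAndWaitForNavigation(page, selector) {
  // waitForNavigation is listed first so it is already listening when the click fires.
  const [response] = await Promise.all([
    page.waitForNavigation(),
    page.click(selector),
  ]);
  return response; // the main resource response of the new page
}
```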

after that we want to take one final screenshot so let's do an await screenshot here

this is going to be the actual video that we've selected there we go so our first match video

that's what we're going to be saving so right inside of here we should see a first video up here if this all works

correctly there's the results we clicked on it and here's the title right here first match

so there was the title this is the title that we clicked on right here this one and first video

yeah there we go now when we get to the next page picturing YouTube You've clicked on

you've done a search you've clicked on some results you're going to the page with the video

now it can take a little bit of time for that video to load it can take a bit of time sometimes there's ads that play

before the video so if you wanted to skip over that so right now we've got wait for navigation yeah fine we're doing

that clicked on this thing fine we can do that there's another

wait: instead of waitForNavigation you can use waitForNetworkIdle, that's a similar one

it is going to wait until the page has stopped asking for things basically

you've downloaded everything for the page there's no more requests that are happening no no Cascade of requests

that's happening but if you just want to wait there used to be a waitForTimeout

now you can see there's a slash through this and that is because this method has been deprecated it's no

longer one that they want you to use they say okay it's better if you wait for navigation wait for Network idle do

something like that but on the rare occasions that there is a time that you just want to wait I just

need to pause 15 seconds before I do something we can do that ourselves by creating a new

Promise and wrapping it around a setTimeout so resolve

and reject and we're just going to resolve we don't need to put the reject inside of there for a promise

but I'm going to create a timer wrapped inside of a promise and what do I want to call I want to call resolve which

tells the promise that hey you're done you're resolved after and let's say it's going to be 17 seconds

so there we have it wait for navigation click and then wait 17 seconds after that before moving on to the next step

which was to take the screenshot so let's find out what we get now we run that script to launch the browser

we should be pausing when we get to that final result so we type something in we click

on the button go to the search results there it is we click on that takes us to this page

there we go and we're waiting the 17 seconds and hopefully the ad will be finished by

this point so that we can then get the screenshot and this will shut down right after that there we go so just at the

last moment we got this first video screenshot there we go so just before it started playing so that's an example of

some time that you might want to just pause set timeout like this
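A minimal sketch of that pause, assuming a helper named sleep (the name is not from the video): a Promise wrapped around setTimeout that calls resolve when the timer fires.

```javascript
// Resolves after `ms` milliseconds; reject is never needed here.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Usage inside the async IIFE, after the click and navigation:
// await sleep(17000); // pause 17 seconds before taking the screenshot
```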

all right and then on the final thing if you want to do the last couple of steps that I had up here we've got

dismiss button that's the thing and that we saw in the screenshot right here this no thanks

check it out this no thanks has an ID of dismiss button so that's something you can take the source code and play around

with that see if you can add the click event to get rid of this before you take the screenshot

and the last thing that we want to do is we want to check for the number of comments

so I'm going to show you here let's just bring up YouTube

and we'll do a search for American Idiot not Green Day American Idiot but here it is this is the video and then

on this page right here what I want to get is this message right here I want to get the total number of

comments inside this description box up at the top here and I also want to get the title of the first suggested video

in this list over in the side so those two things so the comments

this there's two spans right here so I want to get this element with all of its text and I want to get this right here

this text as well Okay so what we have right here

this is the comments section so when we want to get the inner text for that

down here at the bottom we're going to add our wait for selector after we've done our screenshot

and I'm just going to paste this in here speed this up a little bit so wait for selector this is the one and inside of

that we're going to use the eval command again the H2 that's inside of that block this is where the comments

the number and the word comments are both written the H2 that's inside that get its inner

text and we will see that number show up here and then

we want to get that first suggested so that's going to be very much the same as what we just did here so the first

suggested is using the eval we're going to jump in here to find this element once we have that element to get the H3

inside of it so this compact video renderer that's the little thing off to the side right here this thing right

here each one of these is a custom element right here with that tag name we want

the H3's innerText so now we should be able to see the video comment count and first suggested when we run those
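Put together, those two reads might look like the sketch below: wait for the comments header to exist, then use $eval twice. The tag names in the usage comment are assumptions based on the description here, and YouTube's markup changes often, so treat them as placeholders.

```javascript
// `page` is a Puppeteer Page; both selectors are supplied by the caller.
async function readVideoDetails(page, commentsSelector, suggestedSelector) {
  // Make sure the comments header has rendered before reading anything.
  await page.waitForSelector(commentsSelector);
  // The H2 inside the header holds the number and the word "Comments".
  const videoComments = await page.$eval(`${commentsSelector} h2`, (el) => el.innerText);
  // The H3 inside the first suggested-video element holds its title.
  const firstSuggested = await page.$eval(`${suggestedSelector} h3`, (el) => el.innerText);
  return { videoComments, firstSuggested };
}

// Hypothetical usage (both tag names are assumptions):
// const details = await readVideoDetails(
//   page, 'ytd-comments-header-renderer', 'ytd-compact-video-renderer');
// console.log(details);
```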

there we go we're up and running typing in automatically I'm not doing the typing that is my script that's doing it

click on it we get to here there speed that up and I didn't save that so I'm going to

fast forward through the rendering of this running it again so you don't have to wait for that

17 second delay and actually what I'll do here one more thing before I run this

again is I will come up here and I'm going to set the viewport size on this you'll notice how it was quite small

when it was running on the screen here because I didn't actually set any viewport size so I will do that

let's um I'll just copy in the command with some Dimensions already in it there we go

so setting a width and a height gives the browser context for how big it's going to render that will change

the rendering of the page sometimes depending on media queries and elements may or may not be there when you

expect them to be so we've got here we're waiting for the selector for that header renderer we've

got the video comments and then the first suggested so we're waiting for this one we want to make sure that's

there before we do either of these things and I'm setting the viewport size on it just to make sure that it's going

to be the right size so there we go we'll run this again and I'll fast forward through this so

you don't have to watch the whole thing again and there we go and then we're back so

first match video comments there it is first suggested there it is so takeaways from this

wait for selector is a good way to pause and wait for things to be loaded screenshots

they're great but sometimes you want to wait before firing them when you do want to wait a set amount of time you can

still do this even though they've deprecated the wait for timeout command we do have the wait for navigation wait

for Network idle and we can always add our own set timeouts inside of here the one and two dollar sign eval commands to

find content once you've done this inside of here you really just talking

to the Dom you're no longer in the context of node and thinking about writing node script it's

in the web page you were actually dealing with the Dom so anything that you would do in a client-side script you

can do inside of here all right so that's our test scraping content

similar to doing this but when you want to have larger amounts of content when there's going to be a whole bunch of

things that you want to gather from the page or you want to step through to a certain part of a page to a certain part

of a website and from there gather all this information so here I'm going to go to the Algonquin

College website and I want to go to one of the pages I want to do a search for

um a specific program I'm going to search for the word mobile actually I could do

something a little bit different than what I've got written here but same idea so I'm going to go to the Algonquin

College website I'm going to fill out the form to search for programs that have the word Mobile in them there's

going to be two that come back as results and then I want to take details from that and extract the information

and save it as a Json file so that's what we're going to be doing here so if we look at example here

if I go to the Algonquin website here it is what I want to do is down here I'm going to fill out this form

we're going to search for mobile when we come back with the results from that it'll take a little bit of time

there we go we want to get the results with two records and then we're going to gather the program name the area the

campus and the length so those four bits of information for each one of these rows so we're going to have to figure

out in the HTML what is it that we're looking for so if I was to inspect this and we zoom in we can see

inside the table body for this head there's four rows I only want the ones with class odd and

class even so anytime I get one of those I'm going to find all of the TRS I'm going to Loop through those and then I'm

going to get the ones that are odd and even open them up and then inside the TDS I want to get not number zero but

number one number two number three and number five so those pieces are what I want to extract and

put into a Json file just to give you a sense of what scraping web scraping is like

this is what we're doing with web scraping now one other thing

I haven't talked about yet with puppeteer we're dealing with the built-in one

we're using Puppeteer but it does come across when it's running as if it is a headless browser so there's settings

that aren't necessarily going to be there there's headers in the request for the resources that aren't going to be

there because it's not coming from a normal browser it's coming from a headless browser it's coming from a

browser that's controlled by a script so sometimes websites will have detection for that

and back over here this is a plug-in there's a plugin

called puppeteer-extra, with puppeteer-extra-plugin-stealth, that will actually add in all those headers when you

use this instead of the default Puppeteer engine it's going to provide all those

additional headers and they've got if you go through their notes here you'll find that there's actually a couple of

websites that you can use to test so you can load this URL in Puppeteer to see what

it says, does it detect that you are a headless browser, and this is another great website that'll give you

green or red for each of these different things that it tests, so you know okay well that one failed but everything else I'm

passing and this is just me visiting on the website but if you go on a headless browser you're going to fail a lot of

these things so we're going to add these things in here as well so down here at the bottom we will

install those so npm install puppeteer-extra and

puppeteer-extra-plugin-stealth with those two things installed we're

going to be adding a couple of things up at the top here instead of the defaults just plain import right here for

puppeteer we're going to use this one we're going to import Puppeteer from Puppeteer extra and then

tell Puppeteer to use this plugin so this is the thing that's going to add in all those extra headers
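Here is a hedged sketch of that swap. In the walkthrough the two packages are brought in with static imports at the top of the file; this version uses dynamic import() inside a helper (the helper name is an assumption) so the snippet only loads the packages when it is actually called.

```javascript
// Assumes `npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth` has been run.
async function launchStealthBrowser() {
  // Same modules as:
  //   import puppeteer from 'puppeteer-extra';
  //   import StealthPlugin from 'puppeteer-extra-plugin-stealth';
  const { default: puppeteer } = await import('puppeteer-extra');
  const { default: StealthPlugin } = await import('puppeteer-extra-plugin-stealth');
  // Register the plugin before launching so every page gets its evasions.
  puppeteer.use(StealthPlugin());
  return puppeteer.launch({ headless: false });
}
```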

we're also going to want to be able to save this in a Json file to the system so we can say import

writeFile from fs, and that's the file system module that is part of Node, so we're

just bringing this in from node this is the function that we want to be able to save our files

all right and I'm just going to create a variable here I'm just going to hard code this one

instead of going to the command line or environment variables but keyword mobile that's the thing that we're going to

type now let's create our asynchronous IIFE there we go

and the same three steps as always const browser equals Puppeteer launch and I'm going to say headless false

because I do want to watch this happen and then we'll say page is browser dot new page and have you spotted what I

left off yes my await commands okay

so everything that we're doing here: await, await, await, and await browser.close(), there we go. Now to set a

viewport we could do that as well here and if you are going to set a viewport remember to do it before you call the go

to command so page dot go to okay

there we go so there's the go to after the viewport has been set we can take a screenshot at this point
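The start-up steps can be sketched as one helper that takes an already launched browser. The viewport dimensions and the helper name are assumptions; the point is the order, with setViewport called before goto.

```javascript
// `browser` is what `await puppeteer.launch({ headless: false })` returns.
async function openPage(browser, url, screenshotPath) {
  const page = await browser.newPage();
  // Set the viewport before goto so the first render already uses these dimensions.
  await page.setViewport({ width: 1600, height: 1000 });
  await page.goto(url);
  await page.screenshot({ path: screenshotPath });
  return page;
}
```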

let's call this one Algonquin home now we want to search for the input field now I need to

wait until that has loaded so this is the ID for that input that we're going to type the word mobile into

now another thing we can use this for, this waitForSelector, this will actually return to us a reference to

the field that you want to type in or the field that you want to click more importantly so instead of

doing this one actually let's do the the one for the button the button that we're going to click is a better example for

this there we go so this is going to be my button

that's the search button and once that's on the page then I know I'm good to go I can type I can click I

can do whatever I want, and the keyword variable that we created up here

this one this is what we're going to type there's the input that's the keyword and

we're going to slow it down so we can see it being typed then I want to click on that button but instead of doing

page.click I can do BTN click because I've got a reference to that

element so running this gotta save our page
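A small sketch of that idea: waitForSelector resolves with a handle to the element, so the button can be clicked through that handle instead of going back through page.click. The helper name and the 100 ms typing delay are assumptions.

```javascript
// `page` is a Puppeteer Page; selectors and keyword come from the caller.
async function fillAndSubmit(page, inputSelector, buttonSelector, keyword) {
  // Resolves with an ElementHandle once the button is on the page.
  const btn = await page.waitForSelector(buttonSelector);
  // delay is the pause in milliseconds between keystrokes, so the typing is visible.
  await page.type(inputSelector, keyword, { delay: 100 });
  // Click through the handle we already hold.
  await btn.click();
}
```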

so run it so launching the home page and it should fill out down here at the bottom of the

search there it is okay and we got our screenshot yep there we are we got the screenshot we filled this in we clicked

it and it is going to navigate now. Previous time I did the click with the

Promise.all method, with clicking of the button and waiting for navigation; if you

want you can also do this as two separate things so I've got the click for the button and then after I've done

that I can wait for navigation and then wait for an element on the next page and that's

what I'm doing right here so I'm going to click the button I'm going to wait for the navigation

wait until the page load event has fired then I'm going to wait for this selector so it's going to wait up to about 30

seconds for this to be found on the page then I'm going to take another screenshot which should show me

those two programs showing up there that should be giving me enough time for that so run it one more time
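The step-by-step version could be sketched like this, with a made-up helper name. One caution worth stating as an assumption about timing: if a page navigates very quickly, the navigation can finish before waitForNavigation starts listening, in which case the Promise.all form used earlier is the safer choice.

```javascript
// `btn` is the ElementHandle for the search button; `page` is the Puppeteer Page.
async function submitAndWaitForResults(page, btn, resultsSelector, screenshotPath) {
  await btn.click();
  // Wait until the load event of the next page has fired.
  await page.waitForNavigation({ waitUntil: 'load' });
  // Then wait (up to the default 30 seconds) for the results to appear.
  await page.waitForSelector(resultsSelector);
  await page.screenshot({ path: screenshotPath });
}
```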

there we go fill in the form down here at the bottom click on the button

goes to the other page waits for the table and then does the screenshot to get program list

there it is and there we have it so we do have it now so the table does exist there it is in the screenshot so now I

want to take this data and I want to turn that into a Json file which was the whole thing that I was building towards

here first part just the setup you're loading the browser you're loading the page

you're filling in a form you're navigating to the place where you want to get the data now we've got that page

we're on the page where the data is so let's let's do this let's

run a function so I'm going to do the two dollar sign one, $$eval, I want to go to that table that I

was waiting to see if it existed the table body TR and remember there was four TRS when we looked at the source

earlier two of them had odd and even as the class names the first two were just sort

of fillers they were header ones that were inside the body so inside of here

this is what we're evaluating so we're doing basically query selector all on this to get those things out of the web

page this variable right here is that collection of rows now normally when you

do document query selector all you get a node list but Puppeteer actually will convert that node list into an actual

array so the type the data type of this is actually an array so now inside of here what I'm going to

do is I want to return some data so right here I'm going to return something from this function that will

be assigned to the variable data I'm going to take my rows and I'm going to call the map method on

them now this is going to create a brand new array based on the array of rows I'm

going to be building an array of objects because that's what I'm going to turn into the Json

I don't want to have a null value for the first two I have to only look there was four rows

and it's only number three and number four those are the two that actually have the data so I'm going to do a

filter on the end of this and I'm going to say okay for the row only return it

so this will be either null or an object depending on what we do right here so inside of here

for the row for the individual row and looking at them one at a time inside this row if row dot class list

contains odd or it contains even so these are going to be my selectors that I'm looking for class list contains

even so if either of those things are true

there we go now it will do this part the else part will be the default if I don't return anything that's what I'll get

here but I'm going to hard code it just to make this a little bit more explicit that this is what I'm doing

so we're returning an object and inside the object I'm going to have a name property

I'm going to have an area campus length those are the values that we had inside of here

right here this the name the area the campus and the length those are the four properties that I'm going to be

returning so row is the TR I need to get the TDS the table data cells inside of there so

let's go here let or const TDS equals row dot query selector all I'm going to get all the TDS inside each

row so if it's got the class odd or even I'm going to get all the TDs inside of it and I'm going to return

TDs number one, its innerText; then area

and this is number two number three and number five campus and length so the length of the

program there we go and that's it so we've gone inside the

table body we have the four TRS we're gonna Loop over the four TRS we're only going to take the two that actually have

data to return these objects the other ones will return null so I'll have an array that has four things null null and

then object object but I'm going to filter to get rid of the nulls so I will end up with a an

array that has two objects in it and right here let's console.log data run that

and that should write out those two objects the array with those two objects
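The map and filter steps just described can be sketched as a standalone function that is handed to page.$$eval. The function name and the selector in the usage comment are assumptions; the cell indexes 1, 2, 3 and 5 are the ones named above.

```javascript
// Runs inside the page: $$eval passes the matching rows as a real array.
function extractPrograms(rows) {
  return rows
    .map((row) => {
      // Only the rows with class "odd" or "even" hold program data.
      if (row.classList.contains('odd') || row.classList.contains('even')) {
        const tds = row.querySelectorAll('td');
        return {
          name: tds[1].innerText,
          area: tds[2].innerText,
          campus: tds[3].innerText,
          length: tds[5].innerText,
        };
      }
      return null; // filler rows, written out explicitly
    })
    .filter((row) => row); // drop the nulls, keep the objects
}

// Hypothetical usage (the selector is an assumption):
// const data = await page.$$eval('table tbody tr', extractPrograms);
// console.log(data);
```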

filling in the form building the table get the data and there we have it there's our array

called Data that has two objects inside of it so now I want to take that array that object

data and I want to turn that into Json and I want to save it here

so let's save it inside this folder called data so I'm going to call writeFile

and this is node.js so just plain node.js that I'm doing

inside my data folder I'm going to create a file called course details dot Json

there we go what do I want to write inside there well the data variable but I can't just put data because that is an

object I need to put json.stringify around that so I'm going to turn

this array into a string and then that string will be written inside of this file

and the final parameter for writeFile is, sorry, there's two more parameters: one

is the character encoding,

not the data type, the data type is JSON, the character encoding is utf-8, and then a function to run

when this is done error is a parameter that could be empty or it

could have an error object if there was a problem in writing this file this is the error so we need to check that we

say okay if error happened I'm going to throw the error so I'll actually have the error show up here on the console

and if we get past that line console.log saved the file there we go
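A sketch of that save step follows. It keeps the callback form of writeFile described here, wrapped in a Promise so it can be awaited; that wrapper, the helper name, the indentation argument to JSON.stringify, and the file path in the usage comment are all assumptions. The dynamic import() stands in for the static import of writeFile at the top of the script.

```javascript
async function saveAsJson(filePath, data) {
  // Equivalent to `import { writeFile } from 'fs'` at the top of an ES module.
  const { writeFile } = await import('node:fs');
  // writeFile needs a string, so the array of objects is stringified first.
  const json = JSON.stringify(data, null, 2);
  return new Promise((resolve, reject) => {
    writeFile(filePath, json, 'utf-8', (err) => {
      if (err) {
        reject(err); // surfaces the error to the caller
        return;
      }
      console.log('saved the file');
      resolve(filePath);
    });
  });
}

// Hypothetical usage (the path is an assumption):
// await saveAsJson('./data/course-details.json', data);
```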

and write file remember at the very top here we imported that right here from FS that's the file system in node

okay so there it is last time running this one there we are and there's our data file

saved screenshots we've got the Algonquin one we've got the program list so we have all of those

well that was the one here yeah so it's the Algonquin home and the program list those were the two screenshots that we

took and course details was available to us as well okay so that is how you scrape and pull content out of a web

page so you can save that in a text file for Access later on and the very last one last section in

here images how do you take that okay the scraping we saw how we got text Data how we could extract content from a web

page but what about images what if we have images on a web page that we want to download

so we're going to be looking now at the Unsplash website

unsplash.com there we go so here what I want to do is I want to do a search so let's say I want to look for

Lakes and I do a search and then I want to get all these images that we've got down

here so all these images that show up as results on the first page so I want to download those I don't just want to

screenshot yeah okay fine I can take a screenshot but I want to get the individual images themselves so how do I

do that all right so inside of here we're going to do the same thing that we always do

for startup I may as well just copy and paste a chunk from here take off these same things that we do at

the start every time so we'll import Puppeteer from puppeteer

and very bottom await browser.close there we go there's the end of our IIFE

and we're going to unsplash.com all right so launching the browser new page set the viewport go to the website

get a screenshot there's our basic start now what I want to do is I want to be able to

do searches now this is one where I probably want to change the term so I will use the environment

variable or the CLI again to do that so let's go back here to our test and grab that line

so we'll put this up at the top search term CLI and my default will be mountains there

we go so if I put something there when I'm typing node images.js and then a word that follows it that's

what I will use as my search term if I don't put anything there mountains will be used so this will be my search term

that I'm going to be using all right so we need to fill those in and let's go

back to the home page here and inspect to see what it is that we're searching for input type equals search

all right so the best value here is probably going to be this data test attribute so input data test this value

so we'll just copy that come back here and we will wait for that so await page wait for selector

and this is an input square bracket around that there we go now it's a CSS selector so this is what

we're looking for and then oh here we'll take these two and just

put them in comments so that's the text field and then we're going to have another one for the button

and the button is going to be the one that we wait for so we can click it

so let's find the button that we click this one right here inspect and okay there we go so it's an SVG

inside of a button and again it's got a data test property perfect okay that makes it easy so we want a

button that has this attribute there we go all right so this is the button that we're going to

click so let's wait to make sure that that's on the page and then we can type

inside of this we want to write

our search term CLI and I'm not going to bother with the delay this time we'll just let it go

right ahead and stick it in there then we want to navigate to the next page and btn.click those are the two things that

we're doing in our array there we go so I'm going to type and then we're going to click wait for the navigation

to the next page we'll take a screenshot on the second page to see our screen our results and

we want to get the full screen we want to see all the images that are on there so we see how many results there are so

page dot screenshot path and we'll put it inside of screens so I'll call it search.jpeg

and I'm going to set full page to true because I want to see all those results
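The search steps so far might be sketched as below. The helper names are assumptions, and the data-test attribute values are not given in the walkthrough, so the usage comment leaves them as elided placeholders.

```javascript
// First CLI word after the script name, or the default when nothing is typed.
function getSearchTerm(argv, fallback = 'mountains') {
  return argv[2] || fallback;
}

// `selectors` holds CSS selectors for the search input and the search button.
async function searchAndCapture(page, selectors, term, screenshotPath) {
  await page.waitForSelector(selectors.input);
  const btn = await page.waitForSelector(selectors.button);
  await page.type(selectors.input, term); // no typing delay this time
  // Wait for the navigation and click the button together.
  await Promise.all([page.waitForNavigation(), btn.click()]);
  // fullPage captures every result, not just the visible viewport.
  await page.screenshot({ path: screenshotPath, fullPage: true });
}

// Hypothetical usage, run as `node images.js forest`:
// const term = getSearchTerm(process.argv);
// await searchAndCapture(page,
//   { input: 'input[data-test="…"]', button: 'button[data-test="…"]' },
//   term, './screens/search.jpeg');
```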

okay we can test this at this point because we're going to have to make some changes

here in a minute but let's try this out so node images dot JS and we'll search for Forest

there we go no quotation marks because there is no space in the search term it's just one

word so there we are unsplash and it'll throw forest in here there it is and jumps to

the next page does the screenshot and unsplash home there it is unsplash search

there we go so Forest yeah sure enough we've navigated we're on the next page Forest is the results

but we haven't waited for anything before doing the screenshot we just said hey as

soon as you navigate there do this now we can wait for Network idle meaning

wait for the whole page to load, all the requests to be made, so as the HTML is parsed and the images are read

from the HTML and then requested we can wait for all that to happen so it'll wait

for a short idle period, by default half a second with no requests in flight, so that is measured from when the

last request finishes but we can wait for that we can add something inside of here to say yeah

before doing the screenshot let's wait for something that's going to be on the page or give it a timeout or use the

built-in wait for Network delay all right Network idle not delay but Network idle there we go so running it now it'll

take a little bit longer but it's going to give us a screenshot that actually has all the pictures loaded on it now
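A minimal sketch of that wait, using a made-up helper name. page.waitForNetworkIdle resolves once the network has been quiet for idleTime milliseconds; 500 is Puppeteer's default and is written out here only to make it visible.

```javascript
async function screenshotWhenIdle(page, screenshotPath) {
  // Resolves once no requests have been in flight for 500 ms.
  await page.waitForNetworkIdle({ idleTime: 500 });
  await page.screenshot({ path: screenshotPath, fullPage: true });
}
```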

so on the home page fill in the form navigate there's and we saw the pictures before it actually did this so now we

get the full page with the results there Okay so we've gotten to this point we actually have to step back a little

bit we have to intercede to start getting those images before we do that screenshot before everything closes down

and that is by adding an event listener so inside of our code here right after we've done the new

page before we start navigating or we can do it after we navigate to the first one after the go to but usually you'll

add this event listener up at the top here so I will right up here say page dot on

inside of here I'm going to say what is the event what is the thing that I'm waiting for

and the thing that I'm going to wait for is a response and every response I want to call this function and I will get my

response object here there we go so it's if you've ever worked with

service workers this is very much like a service worker every time there is a response so every time something is

being loaded on the page could be a CSS file a font a script an image anytime anything is loaded this function

runs so inside of here let's get the the headers say const

headers from the response object there we go so that is the headers for one response remember this is going to

be running again and again and again and again so const URL equals response dot URL

and I will actually put this into a URL object new URL there we go so we're taking

that value URL this is returning a string we're passing the string into a URL object so

that we can extract bits and pieces of it I'm going to do that just so I can

demonstrate here so DOT log and the URL there we go so every time

there's a response I want to see the URL get written out let's slide this up we'll move the

script up here so inside here every time there's a response

we're going to get something written out in the console you can see a bunch of stuff flying by

over here there we go so here are all the URL objects now there's a lot of data inside

there href is the entire thing what I'm looking for really is just this part right here the path name I want

that so path name it's going to remove

the host from all that the protocol the port number all that stuff is going to be removed it's just going to be

that list of file names basically the folder and file names there we go so here is a list of all the

files now you can see there's CSS there's JavaScript other things API calls Json files fonts

all kinds of stuff most of it we don't care about we just wanted to get some images off this page
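The listener described here can be sketched in two parts: a small function that turns a response into its path name, and the page.on('response', ...) registration shown in the usage comment. The function name is an assumption.

```javascript
// `response` is the HTTPResponse object Puppeteer passes to the listener.
function responsePath(response) {
  // response.url() returns a string; the URL object lets us pick it apart.
  const url = new URL(response.url());
  // pathname drops the protocol, host, port and query string.
  return url.pathname;
}

// Usage, registered right after `browser.newPage()`:
// page.on('response', (response) => console.log(responsePath(response)));
```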

so what we're going to do is we're going to look at each one of these things that are

coming back we're going to look at the headers and if

the headers for the request that we just got back if it's got a property called content

length if it has that and the value of that includes

and I know I misspelled length there if that's bothering you I apologize there we go

includes this so now if

it's got a header called content length which is the size oh sorry this would be the content type it's going to include

this the we'll look at the length in a minute the length is the size of the file in bytes content type is stuff like

this is it an image and on the unsplash website this is the format that I'm getting back for all of those images

so if that is the case then what I'll do is I will console log that so I'm only going to

write out the images now there we go so this is all of the images just the images

now there's going to be some images in there that are really really tiny little icon images and things like that so

that's where the content length comes in so we're going to also look at that so inside of here we've got

headers content length exists and it has this and

we're also going to say if the response dot URL or yeah we'll do the response.url we'll get

the whole thing dot href or actually we extracted this already so just

url.href there we go so the href if you remember was the entire thing so if that starts with

the ones that I want are going to start with https images Dot unsplash .com photo

so that's another way that we can filter this list like profile that's not one that we want placeholder Avatar is not

one that we want we want the ones that were photo if they have photo in the front of them

like this that's one of the ones that we want to get and the last thing was the content length

so content length and if that value is greater than let's say

yeah thirty thousand so this is 30k if the image is bigger

than 30k we're going to write this out so run it one more time to test and if this works fine if this is giving us a

small enough list a reasonable list one that we want with those images let's take a look here

yeah so from here down so we're only getting about 20 images that's a reasonable result set so we

want those images that seems like a reasonable size they're big enough images they're all the right format okay
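The three checks can be sketched as one predicate. The exact content-type string Unsplash returns is not stated here, so the sketch assumes any value containing "image/"; the function name is also an assumption. Puppeteer's response.headers() returns header names in lower case, which is why the keys look like this.

```javascript
// `headers` is the object from response.headers(); `href` is the full URL string.
function isLargePhoto(headers, href, minBytes = 30000) {
  const type = headers['content-type'] || '';
  // Header values are strings, so convert the size before comparing.
  const size = Number(headers['content-length'] || 0);
  return (
    type.includes('image/') &&
    href.startsWith('https://images.unsplash.com/photo') &&
    size > minBytes
  );
}

// Usage inside the response listener:
// if (isLargePhoto(response.headers(), url.href)) console.log(url.pathname);
```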

so we have the names what we want to do is we want to save those not inside of screens but we've got this

other folder called images this is where we want to put them and so very similar to what you do with

a fetch call or what you would do with a service worker when you're handling fetches

we've got a response object the browser in JavaScript if you've got a response object you can call the blob

method like this it's going to be an asynchronous call that will return the binary data to you there's something

similar in node and what we're going to do is we're going to put an await in front of this we made our function

asynchronous so that we could do this and then await will take our response object the method that we're going to

call is buffer so this gives us the array buffer; in the browser the method is called arrayBuffer,

in Node, on Puppeteer's response object, it's just called buffer; once you have that then we're going to call another function which will make it

asynchronous as well and here is the buffer so here is the data that we got back so the array

buffer this is all the data inside the image so it's the binary data of the image here you go here's the data I want

to take that data and I want to save it as a file so asynchronously

read the buffer pass it into this function right here asynchronously and we call the same

method that we did on our last one to save the JSON so writeFile

and I'm going to save it inside of our images folder and we'll use these names that we

already have so we'll turn this into a template string with backtick characters

extra one there we go and inside of here this is the url.pathname that we're writing out

there we go so that is the data or that's the file name buffer is the data that we're going

to save and then we have that error handling function at the very end or the function that gets called when it's

complete if error throw the error so we write something

out to the console and we don't need to write out anything else we can add a catch on there if we

want to handle it more gracefully so if the first image fails the rest of them might

not fail so I'll say console.log fail to

show you what I'm writing here failed to save image

yeah so there we go so if we throw an error

inside of here the catch that's chained onto the end of this will

write this message out for us all right so clear that out running this for what should be the last time
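The save step described above (await the buffer, write it to a file, chain a catch) might look roughly like the sketch below. It is an illustration under stated assumptions, not the code from the video: `saveImage`, `fakeResponse`, and the temporary `images-demo` folder are hypothetical names. It also swaps in the promise-based `writeFile` from `node:fs/promises` instead of the callback version, because an error thrown inside a callback would not reach a catch chained onto the surrounding promise.

```javascript
import { mkdir, writeFile } from 'node:fs/promises';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Hypothetical helper: read the body of a response-like object and write it to disk.
// `response` only needs a buffer() method that resolves to a Buffer,
// which is what Puppeteer's HTTPResponse provides.
async function saveImage(response, filePath) {
  const buffer = await response.buffer(); // binary data of the image
  await writeFile(filePath, buffer);      // rejects on failure, so catch() sees it
  return buffer.length;
}

// Stand-in for a real Puppeteer response so the sketch runs without a browser.
const fakeResponse = { buffer: async () => Buffer.from([0x00, 0x01, 0x02]) };

// A temporary folder is used here instead of the project's images folder.
const dir = join(tmpdir(), 'images-demo');
await mkdir(dir, { recursive: true });

// The catch lets one failed image be reported without stopping the others.
await saveImage(fakeResponse, join(dir, 'photo-demo'))
  .then((bytes) => console.log(`saved ${bytes} bytes`))
  .catch((err) => console.log('failed to save image', err.message));
```

In the real script, the response object would come from the response handler and the path would point into the images folder.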

so writeFile is not defined and that's because we didn't import it back up at the top

so import writeFile from fs there we go so we've done the import of

that method so this can now save them launch the browser yeah okay let's strip this off here at

the end I'm not going to worry about the catch now hopefully this will be the last time

that we run this there we go fill in the search run the search there's the results save the

images write them out and our images folder now has a bunch of image files these are

actually image files but VS Code doesn't know what to do with them because there are no file extensions

so right here at the end .avif because that's what these images are there we go all right so

there's all of our forest images the ones with actual file extensions we have it
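Building the file name from the URL path and appending the extension can be sketched as below. The helper name `imagePathFor` and the sample URL are hypothetical, and the default `avif` extension is an assumption that only holds when the server really returned AVIF data; checking the content-type header would be a more reliable way to choose it.

```javascript
// Hypothetical helper: build a file path under images/ from a photo URL.
function imagePathFor(href, extension = 'avif') {
  // pathname keeps only the path, e.g. "/photo-123", and drops the query string
  const { pathname } = new URL(href);
  return `images${pathname}.${extension}`;
}

console.log(imagePathFor('https://images.unsplash.com/photo-123?w=500'));
// prints "images/photo-123.avif"
```

Because the pathname already begins with a slash, the template string needs no separator between the folder name and the file name.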

all right so hopefully that will give you a whole bunch of inspiration for all the different kinds of things that you

can do with puppeteer again if you're looking for a copy of the code it's down in the description if

you've got any questions feel free to put those in the comments I answer whatever I have time for

and as always thanks for watching

Heads up!

This summary and transcript were automatically generated using AI with the Free YouTube Transcript Summary Tool by LunaNotes.

