Self Explore App part 2

Royer Adames
14 min read · Mar 16, 2021


Possible solution: change the download path from the Chrome settings.

Remember to escape the backslashes in the path string and base the path on your own drive.

D:\\Lambda\\projects\\puppeteer_test\\data

The script will make the folder if the folder does not exist.

The puppeteer way

const downloadBrowser = await puppeteer.launch({env: {PUPPETEER_DOWNLOAD_PATH: "D:\\Lambda\\projects\\puppeteer_test"}});

The bound-to-change way

// set download location to local project path
await page._client.send("Page.setDownloadBehavior", {
  behavior: "allow",
  downloadPath: "D:\\Lambda\\projects\\puppeteer_test",
});

The custom way

chrome://settings/downloads

Change button

document.querySelector("settings-ui").shadowRoot.querySelector("#main").shadowRoot.querySelector("settings-basic-page").shadowRoot.querySelector("[section='downloads']").querySelector("settings-downloads-page").shadowRoot.querySelector("#changeDownloadsPath")

Change location

  • select file path
  • paste the desired path: D:\Lambda\projects\puppeteer_test
  • click select path

I can do all of this in another tab before the download starts.

I can create a custom wait for the file download by working with the Chrome downloads page.

chrome://downloads/

  • I can wait for 1 second and until the download progress bar stops.
  • If the download speed reaches 0 and (the downloaded MB is not equal to the total MB, or 0 seconds are left), I download the file again and, after 3 seconds, cancel it.

Progress bar

document.querySelector("downloads-manager").shadowRoot.querySelector("#mainContainer").querySelector("#downloadsList").querySelector("downloads-item").shadowRoot.querySelector("#details").querySelector("#progress")

Not so fast. The div goes invisible.

You can still find the element in the document. I have to find the parent div of #progress and check for a style of display: none.

This could work

$('li[style*="display: none"]')

Working code:

document.querySelector("downloads-manager").shadowRoot.querySelector("#mainContainer").querySelector("#downloadsList").querySelector("downloads-item").shadowRoot.querySelector("#details div:nth-child(4)[style*='display: none']").hidden

Note:

  • When it finishes downloading, the download speed stays at 0 for 2–3 seconds.
It can also stop here and show a where-to-download prompt.

Suppose it fails to download because of lack of space. Then it downloads again.

Select the element with the shadow tree without having to use the shadow root method.

What is Shadow DOM?

Source

https://stenciljs.com/docs/end-to-end-testing#find-an-element-in-the-shadow-dom

part: A space-separated list of the part names of the element. Part names allow CSS to select and style specific elements in a shadow tree via the ::part pseudo-element.

I cannot use it because the part needs to be set.

The :host pseudo-class, when evaluated in the context of a shadow tree, matches the shadow tree’s shadow host. In any other context, it matches nothing.

The :host() function pseudo-class has the syntax:

:host( <compound-selector> )

Using :host got me null values.

A workaround could be to use the page.evaluateHandle to target the DOM element.
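One way to shorten the long shadowRoot chains used throughout these notes is a small helper that walks a list of selectors and steps into a shadow root whenever the matched element has one. This is only a sketch: the name deepQuery is made up for illustration, and it assumes every shadow root on the path is open, which matches the snippets above.

```javascript
// Walk a list of selectors, stepping into a shadow root whenever the
// matched element has one. Returns null if any step fails to match.
// `deepQuery` is a hypothetical helper name, not a Puppeteer API.
function deepQuery(selectors, root = document) {
  let scope = root;
  let element = null;
  for (const selector of selectors) {
    element = scope.querySelector(selector);
    if (element === null || element === undefined) return null;
    // Continue inside the shadow tree when it exists,
    // otherwise inside the element's regular (light DOM) children.
    scope = element.shadowRoot || element;
  }
  return element;
}
```

Because functions passed to page.evaluateHandle are serialized and run inside the page, something like page.evaluateHandle(deepQuery, ["downloads-manager", "#mainContainer", "#downloadsList", "downloads-item", "#details", "#progress"]) should return a handle to the progress element. The helper always prefers the shadow tree, so it cannot reach light DOM children of an element that also has a shadow root.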

page.waitForSelector(selector[, options])

There is no selector option. I cannot use this when evaluating.

await settingPage.waitForNavigation({ timeout: 0, waitUntil: "networkidle0"})

With timeout: 0 the timeout is disabled, so this waits indefinitely until the network goes idle.

setTimeout()

It is best to call clearTimeout() when done using the timeout.

Remember to capture the timer ID by assigning the return value of setTimeout() to a variable.

do…while

setInterval()

You have to use clearInterval() to stop the code from repeating.

Remember to capture the timer ID by assigning the return value of setInterval() to a variable.
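The timer notes above can be combined into one polling helper. The sketch below is an assumption about how the custom file waiter could be structured, not the script's actual code: pollUntil is a made-up name, and check stands in for whatever condition is being watched (for example, the progress bar being hidden).

```javascript
// Poll `check` every `intervalMs` until it returns true or `timeoutMs` passes.
// Resolves to true when the check passed and to false on timeout.
// `pollUntil` and its parameters are hypothetical names for illustration.
function pollUntil(check, intervalMs = 1000, timeoutMs = 60000) {
  return new Promise((resolve) => {
    // Keep both timer IDs so each timer can be cleared when the other fires.
    const intervalId = setInterval(() => {
      if (check()) {
        clearInterval(intervalId);
        clearTimeout(timeoutId);
        resolve(true);
      }
    }, intervalMs);
    const timeoutId = setTimeout(() => {
      clearInterval(intervalId);
      resolve(false);
    }, timeoutMs);
  });
}
```

Wrapping the timers in a promise lets the caller simply await the result, and both timers are always cleared, so nothing keeps repeating after the wait ends.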

await

try…catch

Optimization hints

  • It takes 52 seconds to 1 minute for the download to time out.
  • Is there a setInterval-like function that can wait for an event to be fired? Maybe a Chrome DevTools event listener breakpoint that I can use with Puppeteer to let me know when a file starts downloading, stops downloading, and finishes downloading.

Progress so far

Custom file waiter is working.

  • It needs a way to deal with a Facebook network error.
  • It can be optimized by making the code wait for the file download to finish.
  • It needs to be tested with the other scripts.

Success

Need to do before moving to electron integration

  • does it work headless?
  • Network issues need to be solved automatically.

Network issues need to be solved automatically.

Cancel button

document.querySelector("downloads-manager").shadowRoot.querySelector("#mainContainer").querySelector("#downloadsList").querySelector("downloads-item").shadowRoot.querySelector("#details").querySelector("#safe").querySelectorAll("cr-button")[1]

Picking a download folder

Downloads are ordered by row index. The newest is always number 1, and the oldest has an index equal to the total number of downloaded files (3 in this example). The example below shows how to select the 2nd download.

document.querySelector("downloads-manager").shadowRoot.querySelector("#mainContainer").querySelector("#downloadsList").querySelector("downloads-item[aria-rowindex='2']")

Remove option

document.querySelector("downloads-manager").shadowRoot.querySelector("#mainContainer").querySelector("#downloadsList").querySelector("downloads-item").shadowRoot.querySelector("#remove")

description

document.querySelector("downloads-manager").shadowRoot.querySelector("#mainContainer").querySelector("#downloadsList").querySelector("downloads-item").shadowRoot.querySelector("#details").querySelector("#description").innerText
"0 B/s - 248 MB of 694 MB, Paused"

Download speed

document.querySelector("downloads-manager").shadowRoot.querySelector("#mainContainer").querySelector("#downloadsList").querySelector("downloads-item").shadowRoot.querySelector("#details").querySelector("#description").innerText.split(" ")[0]
"0"

Current Data

document.querySelector("downloads-manager").shadowRoot.querySelector("#mainContainer").querySelector("#downloadsList").querySelector("downloads-item").shadowRoot.querySelector("#details").querySelector("#description").innerText.split(" ")[3]
"112"

Total data

document.querySelector("downloads-manager").shadowRoot.querySelector("#mainContainer").querySelector("#downloadsList").querySelector("downloads-item").shadowRoot.querySelector("#details").querySelector("#description").innerText.split(" ")[6]
"694"

Convert the string to a number

Number( )

Number(document.querySelector("downloads-manager").shadowRoot.querySelector("#mainContainer").querySelector("#downloadsList").querySelector("downloads-item").shadowRoot.querySelector("#details").querySelector("#description").innerText.split(" ")[3])
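The three split(" ") lookups and the Number() conversion can be gathered into one function that takes the description text. This is a sketch with a made-up function name; it assumes the exact word layout of the sample "0 B/s - 248 MB of 694 MB, Paused" and ignores the units, so it would need extra care if Chrome reports speed or size in other units.

```javascript
// Parse a chrome://downloads description such as
// "0 B/s - 248 MB of 694 MB, Paused" into numbers.
// Assumes the word layout of that sample; returns null when it does not fit.
// `parseDownloadDescription` is a hypothetical name for illustration.
function parseDownloadDescription(text) {
  const words = text.trim().split(" ");
  const speed = Number(words[0]);   // download speed
  const current = Number(words[3]); // data downloaded so far
  const total = Number(words[6]);   // total data
  if ([speed, current, total].some(Number.isNaN)) return null;
  return { speed, current, total };
}
```

Returning null for text that does not match keeps a status such as a bare "Paused" from being mistaken for real numbers.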

Contrast between the description of a finished download and one still in progress

document.querySelector("downloads-manager").shadowRoot.querySelector("#mainContainer").querySelector("#downloadsList").querySelector("downloads-item").shadowRoot.querySelector("#details").querySelector("#description").hidden

Interesting

The Location.reload() method reloads the current URL, like the Refresh button.

JavaScript Date objects represent a single moment in time in a platform-independent format.

Date().split(" ")[4]

Can you overwrite files with the same name in Chromium?

Node Debugger does not work with Puppeteer evaluator

Solutions:

  • wrap the evaluator DOM element in a variable and use Puppeteer instead of the evaluator
  • create a callback that loops until the file is downloaded successfully
  • redownload the document from the dev console
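The second idea, looping until the download succeeds, could be sketched as a generic retry helper. The names retryUntilSuccess and task are assumptions made for illustration; task stands in for whatever function starts the download and waits for it.

```javascript
// Call an async `task` until it succeeds or `maxAttempts` is used up.
// `task` stands in for the real download step and is hypothetical here.
async function retryUntilSuccess(task, maxAttempts = 3) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // Return as soon as one attempt succeeds.
      return await task(attempt);
    } catch (error) {
      lastError = error; // remember the failure and try again
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}
```

Capping the number of attempts avoids an endless loop when the failure is permanent, such as a lack of disk space.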

The document can be redownloaded by clicking the download URL

document.querySelector("downloads-manager").shadowRoot.querySelector("#mainContainer").querySelector("#downloadsList").querySelector("downloads-item").shadowRoot.querySelector("#details").querySelector("#url")

Auto network fix is a success.

Start the dev console on the Console tab.

No easy way to do so

Refactor the script

Get all functions into their own file

  • loginToFacebook
  • createData
  • waitForData
  • downloadData

Starting file structure

New function names

  • loginToFacebook
  • createFile
  • waitForFile
  • downloadFile

Others

research

const fs = require("fs");

Twitter Data Archive Automation

Current Script Hierarchy

Testing for a breakthrough:

Fix frame detach

The frame detachment could have happened because, when I reloaded the page, I didn’t wait for the frame to load its elements again.

Simplify waitForFileDownload

The evaluation can be simplified by using it to create a JSHandle of the variables and extract a boolean value.

  • This opens up the reuse of the start file download step if the user is asked to reenter their credentials, and it removes the need for the custom waiting functions.
  • It allows console logs to appear in the script instead of in the headless Chromium browser.

Can waitForNavigation wait for load instead of waiting for all network activity to die down?

Puppeteer works fine in dev mode but breaks in headless mode.

Chrome download is not accessible in headless mode.

chrome://downloads/

chrome://downloads/ may be part of the GUI that headless (command-line) mode does not load.

Solution: I need a command that updates me on the current download status

class: HTTPRequest

Puppeteer's GitHub docs feel more complete than the website docs.

httpRequest.url()

  • returns: <string> URL of the request.

httpRequest.headers()

  • returns: <Object> An object with HTTP headers associated with the request. All header names are lower-case.
  • MDN

Useful headers

  • request URL
  • Status Code

Where can I find the download speed?

page.waitForRequest(urlOrPredicate[, options])

  • capture the request URL of the get 200 status
  • wait for the request URL with a timeout of 0

Research and development

Playwright makes everything I wanted to do with Puppeteer easy. Period.

Wait for the file solution.

Handle authentication gracefully

The major component for fixing the network error.

Useful yarn commands

yarn add <package...> [--dev/-D]

Using --dev or -D will install one or more packages in your devDependencies.

yarn run [script] [<args>]

If you have defined a scripts object in your package, this command will run the specified [script].

Create React app on Yarn

yarn create <starter-kit-package> [<args>]

This command is a shorthand that helps you do two things at once:

  • Install create-<starter-kit-package> globally, or update the package to the latest version if it already exists
  • Run the executable located in the bin field of the starter kit’s package.json, forwarding any <args> to it

For example, yarn create react-app my-app is equivalent to:

$ yarn global add create-react-app
$ create-react-app my-app

Start package.json

yarn init

This command walks you through an interactive session to create a package.json file. Some defaults such as the license and initial version are found in yarn’s init-* config settings.

Rebuilding the download file script on Playwright

document.querySelector("iframe")

Waiting for the iframe element

Sometimes Facebook adds an iframe on the login page

waiting for navigation load

waiting for navigation domcontentloaded

wait for navigation network idle

Console.log await download.createReadStream()

Error

[Error: EPERM: operation not permitted,
unlink 'C:\Users\Owner\AppData\Local\Temp\playwright_chromiumdev_profile-3XZLP0\CrashpadMetrics-active.pma'] {
errno: -4048,
code: 'EPERM',
syscall: 'unlink',
path: 'C:\\Users\\Owner\\AppData\\Local\\Temp\\playwright_chromiumdev_profile-3XZLP0\\CrashpadMetrics-active.pma'

After some time, the network idle became like this.

How to change the name of a download file

const { chromium } = require('playwright');
(async () => {
  const browser = await chromium.launch({
    args: ["--start-maximized", "--disable-notifications", '--disable-extensions', '--mute-audio'],
    devtools: true,
    slowMo: 50,
    defaultViewport: null,
    downloadsPath: "D:\\Lambda\\projects\\puppeteer_test\\data",
  });
  const page = await browser.newPage({
    acceptDownloads: true,
    viewport: null,
  });
  await page.goto("https://www.writeaplay.co.uk/cms/wp-content/uploads/2015/04/The-Progressive-Playwright-1080x380.jpg");
  page.on('download', async download => {
    await download.saveAs("./data/img.png")
  });
  //   await browser.close();
})();

The problem is that it produces 2 downloaded files.

Solution

page.on('download', async download => {
  // save the downloaded file under its suggested file name
  await download.saveAs(`./data/${download.suggestedFilename()}`);
  // delete the copy with the cryptic file name
  await download.delete();
});

If the same file is redownloaded it will be overwritten.

Weird behavior recorded

I noticed that it didn’t work for the first 3 downloads of a large image, but after that, it did. It also works fine for videos.

Error: EPERM: operation not permitted, unlink

[Error: EPERM: operation not permitted, unlink 'C:\Users\Owner\AppData\Local\Temp\playwright_chromiumdev_profile-lE48xW\CrashpadMetrics-active.pma'] {
errno: -4048,
code: 'EPERM',
syscall: 'unlink',
path: 'C:\\Users\\Owner\\AppData\\Local\\Temp\\playwright_chromiumdev_profile-lE48xW\\CrashpadMetrics-active.pma'
}

To solve it, you have to:

1) add your project folder to the Windows Defender exclusions list.

2) remove your node_modules folder

3) run:

npm install

How can I tell the user how the progress is going?

  • What time it started downloading.
  • Current download speed
  • Current data — Max data — current time

I don’t know how to get the above live data, but I can run an interval console.log every half minute to say that the file is still being downloaded.

Half-minute download reports

Downloading more than 1 file will keep the reports coming until the program is stopped.

setInterval()
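A possible fix for the never-ending reports is to give each download its own interval and a function that clears it. The sketch below is an assumption about how that could look; startDownloadReport and its parameters are made-up names, and log is injectable only so the reporter can be exercised without a real download.

```javascript
// Start a periodic "still downloading" report for one file and
// return a function that stops that file's reports.
// `startDownloadReport` is a hypothetical name for illustration.
function startDownloadReport(fileName, intervalMs = 30000, log = console.log) {
  // toTimeString() starts with HH:MM:SS, so keep only the first word.
  const clock = (date) => date.toTimeString().split(" ")[0];
  log(`${fileName}: download started at ${clock(new Date())}`);
  // Keep the interval ID so this report can be cleared later.
  const intervalId = setInterval(() => {
    log(`${fileName}: still downloading at ${clock(new Date())}`);
  }, intervalMs);
  return function stopDownloadReport() {
    clearInterval(intervalId);
    log(`${fileName}: download finished at ${clock(new Date())}`);
  };
}
```

In the Playwright download handler, the returned stop function could be called once await download.saveAs(...) resolves, so each file's report ends with its own download.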

Things to do

- Network fix needs to be implemented

- File report bug needs fixing

Downloading more than 1 file will keep the reports coming until the program is stopped

- Move the whole script to Playwright

- Add Reuse authentication state solution

Convert Puppeteer scripts to Playwright

Issue

That said, it still works on dev tools.

Solution

The error was due to all files expiring. The solution is to get a new file.

Polishing and debugging script for v1 integration

Handle the case when the get file button is disabled

document.querySelector("iframe").contentDocument.querySelector("button[aria-disabled='false']")

The cancel button also appears, so I need a more specific identifier path.

$x("//button//div[text()='Create File']/../..")[0]

When the button should have been true, it was false. Does this mean that the doc variable holds a snapshot of a previous state?

Frame detaching after reloading the page

Solution for download issues

Error

Solution

Speed up wait for download event to fail

I will need to wait for the input selector to fail, then wait until the event times out.

Even with the error, the script continues and stops after the file has been downloaded.

Why do I have a bunch of chromium tasks running when the browser is closing correctly?

Possible cause: when the script fails and does not reach browser.close(), Chromium still takes CPU resources. This only happens in headful mode.

Error: fill("input[type=password]") timeout

The download button broke. You click it, and nothing happens. No download or anything, but if you reload the page, it works again.

Error: not constant error

use this for Twitter

Moving to electron integration.

A button that runs a script

  • script tag
  • on click event

Other resources:
