Quick start

Welcome to Phantombuster!
In just a few minutes you'll be able to navigate to any page and extract and export its data.

To get started, here's a step-by-step guide to extracting the data on the front page of Hacker News and downloading it in various formats (CSV and JSON).

Before we start: Agents, NickJS and Buster

Agents, Buster and NickJS are the core of Phantombuster.

Agents handle all your scraping and automation. They are made of:

  • a script: to define the navigation scenario and the data to retrieve
  • settings: to schedule execution, notifications, etc.
  • a console: to follow all steps of the execution
  • logs of all executions

NickJS and BusterJS are the main libraries you'll use to write your scraping scripts:

  • NickJS is a wrapper around Headless Chrome. It is open-source, so feel free to check it out on GitHub!
  • BusterJS is the interface between your scrapers and all the features available on Phantombuster. Here is a quick overview:
    • Agent scheduler
    • MongoDB database
    • S3 cloud storage
    • CAPTCHA solver
    • Proxy pools

Both libraries are maintained by Phantombuster's developers and always will be.

NickJS vs other headless browser libraries?

We believe NickJS is the easiest and most powerful high-level headless-browser library.

One of the key features of NickJS is its support for async/await.

  • If you're not familiar with async/await, check our blog post here.
  • Writing callback-style code is also possible; the API reference provides examples for each method (see the sketch below).
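
Here is a minimal sketch of the same tab.open() call written both ways. The exact callback arguments vary from method to method and are listed in the API reference; as usual in Node.js, the error comes first.

// async/await style
await tab.open("news.ycombinator.com")

// callback style: the callback is always the last argument
tab.open("news.ycombinator.com", (err) => {
  if (err) {
    console.log(`Could not open the page: ${err}`)
  }
})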

If you have existing PhantomJS, CasperJS or NodeJS scripts, they can be used instead of NickJS.

Create an agent

Head to the Agents page and click "Create".
By default, a new agent contains some sample code. The purpose of this quick start is to explain it step by step.

Note: if you expand line 1, you'll see some basic configuration. You don't need to touch it for the moment.

// Phantombuster configuration {
"phantombuster command: nodejs"
"phantombuster package: 4"
"phantombuster flags: save-folder"
const Buster = require("phantombuster")
const buster = new Buster()
const Nick = require("nickjs")
const nick = new Nick()
// }

Load a URL and check for success

We load URLs by opening new tabs, just like a regular browser.
It's best practice to ensure our page has successfully loaded the HTML elements we need.

API reference: newTab(), open(), untilVisible()

nick.newTab().then(async (tab) => {
  await tab.open("news.ycombinator.com")
  await tab.untilVisible("#hnmain") // We make sure we have loaded the page

Manipulate the page and access content

Say we want the list of posts and their respective URLs. In the browser, we inspect the page and see the data is presented in table rows with the class "athing".

Back in our scraping script, to put that data in an array, we replicate what we would do in the browser inspector/console and evaluate a function in the DOM.
At the end of the evaluation, we simply pass the resulting data to the callback.

NB: We like to use jQuery. Since it's not already loaded on Hacker News, we inject it ourselves.
API reference: inject(), evaluate()

await tab.inject("../injectables/jquery-3.0.0.min.js") 

const hackerNewsLinks = await tab.evaluate((arg, callback) => {
  const data = []
  $(".athing").each((index, element) => {
    data.push({
      title: $(element).find(".storylink").text(),
      url: $(element).find(".storylink").attr("href")
    })
  })
  callback(null, data)
})

How come we can run JavaScript "in the console"?

A headless browser acts just like a regular browser, except it doesn't have a graphical interface (roughly speaking, no screen, mouse, keyboard or buttons to click on).

It is possible to inject any library using a CDN. E.g. for UnderscoreJS: tab.inject("https://cdnjs.cloudflare.com/ajax/libs/underscore.js/1.8.3/underscore-min.js")
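
For instance, here is a small sketch that injects UnderscoreJS from that CDN URL and then uses it inside evaluate(), exactly as you would in the browser console (assuming the CDN is reachable from your agent):

await tab.inject("https://cdnjs.cloudflare.com/ajax/libs/underscore.js/1.8.3/underscore-min.js")

const titles = await tab.evaluate((arg, callback) => {
  // This function runs in the page: _ is now available, just like in the browser console
  callback(null, _.map(document.querySelectorAll(".athing .storylink"), (el) => el.textContent))
})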

Return data as JSON

To send the array of results back to Phantombuster as JSON, simply use setResultObject().

await buster.setResultObject(hackerNewsLinks)
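
The introduction also mentioned CSV. One simple way to additionally save the same results as a CSV file is buster.saveText(), which writes a text file to your agent's storage (see the Buster API reference; like the other Buster methods, it accepts an optional callback or returns a promise). A minimal sketch:

// Build a minimal CSV string from the results and save it alongside the JSON
const csv = ["title,url"]
  .concat(hackerNewsLinks.map((link) => `"${link.title.replace(/"/g, '""')}","${link.url}"`))
  .join("\n")
await buster.saveText(csv, "hacker-news.csv")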

Take a screenshot of the page

Taking a screenshot() is a good debugging tool. If the screenshot doesn't look like what you expect, chances are the data you scraped won't either.

await tab.screenshot("hacker-news.jpg") 

Catch errors and end the script

Catching errors with async/await is easy.

// all scraping code above
})
.then(() => {
  console.log("Job done!")
  nick.exit()
})
.catch((err) => {
  console.log(`Something went wrong: ${err}`)
  nick.exit(1)
})

Launch the script and check the results

If you opened a new agent and are seeing the default sample script, click "Launch" in the top-right corner of the editor. Otherwise, paste the code from this Gist.

You will be taken to the console and see:

  • The success and error messages
  • Your JSON Result object
  • The files saved (in this case the screenshot, but it could be any image, text, or CSV file you saved)
  • The API endpoint to control your script

See all your results and files in the console

That's it! Congratulations on your first scraping job with Phantombuster :)
