Scraping data in 3 minutes with JavaScript

Prerequisites: know a little bit of JavaScript and, of course, understand HTML and CSS.

Today’s goal is to scrape some data out of an HTML page and to structure the output smartly so we can save it right into a hypothetical database.

Companies List Page

We’ve got a list of 2 companies to extract.

<!DOCTYPE html>
<html lang="en">
<head>
</head>

<body>
<!-- 	Data we want to scrape starts here -->
  <div class="list items">
    <div class="item">
      <div class="header">
        <h1 itemprop="name">  <a href="/comp/tessera">Tessera  </a>        </h1>
        <p rel="description"> Proud of our wide range of product
          we developped many project in the past 4 years. <br><br> You can find the company 
          in 14 different countries <p></p> in the world. <br>
          Blablabla. <br>
        </p>
      </div>
      <div class="contact">
        <span itemprop="employeeName">  Mike Layn        </span> <br>
        <span itemprop="employeeJobTitle">      Marketing Assistant</span> <br>
        <span itemprop="telephone">       Phone: (841) 467-168  </span> <br>
        <span itemprop="email">      Email: mike.layn@tessera.io</span> <br>
      </div>
    </div>
    <div class="item">
      <div class="header">
        <h1 itemprop="name">  <a href="/comp/marcox">  Marcox   </a>     </h1>
        <p rel="description"> Lorem ipsum dolor  <p></p> sit amet, consectetur adipisicing elit. Cupid 
          in any actions we  <br> take in the world. <br> <br>
        </p>
      </div>
      <div class="contact">
        <span itemprop="employeeName">  Jake Kannegan      </span> <br>
        <span itemprop="employeeJobTitle">   Owner </span> <br>
        <span itemprop="telephone">       Phone:    +1497 467168  </span> <br>
        <span itemprop="email">      Email:     jakek@marcox.com</span> <br>
      </div>
    </div>
  </div>
	<!-- 	Data we want to scrape ends here -->
	<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.4/jquery.min.js"></script>
	<script src="./js/script.js"></script>
</body>

</html>

 

Unique selectors

We now want to identify which CSS selectors will let us target each element of our structure. Some tools exist to help, like the great SelectorGadget for Chrome.

Here, we’ve got a pretty simple structure. We could do something as follows:

company : .list.items .item 
|_  name : .header [itemprop=name]
|_  description : .header [rel=description]
|_  url : .header [itemprop=name] a
|_  contact : .contact
    |_  telephone : [itemprop=telephone]
    |_  employee
        |_  name : [itemprop=employeeName]
        |_  jobTitle : [itemprop=employeeJobTitle]
        |_  email : [itemprop=email]

As you can see, employee doesn’t have a selector, for example. That’s because we focus on making sense of the data, and employee isn’t represented in the HTML.
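The tree above can also be written down as a plain JavaScript object, which makes the target structure explicit before any scraping code exists. This is only a sketch: the names `companySelectors` and `listSelectorPaths`, and the `selector` / `fields` keys, are hypothetical conventions chosen for illustration and do not come from any library.

```javascript
// Hypothetical description of the tree above.
// A string is a CSS selector for a leaf value; an object is a group that may
// (contact) or may not (employee) have a selector of its own.
const companySelectors = {
  selector: '.list.items .item',
  fields: {
    name: '.header [itemprop=name]',
    description: '.header [rel=description]',
    url: '.header [itemprop=name] a',
    contact: {
      selector: '.contact',
      fields: {
        telephone: '[itemprop=telephone]',
        employee: {
          // No selector here: employee only exists in the output data.
          fields: {
            name: '[itemprop=employeeName]',
            jobTitle: '[itemprop=employeeJobTitle]',
            email: '[itemprop=email]',
          },
        },
      },
    },
  },
};

// Flatten the tree into "dotted.path" -> "full CSS selector" pairs.
function listSelectorPaths(node, path = '', scope = '') {
  if (typeof node === 'string') {
    return { [path]: `${scope} ${node}`.trim() };
  }
  // A group narrows the scope only when it has its own selector.
  const nextScope = node.selector ? `${scope} ${node.selector}`.trim() : scope;
  const result = {};
  for (const [key, child] of Object.entries(node.fields)) {
    const childPath = path ? `${path}.${key}` : key;
    Object.assign(result, listSelectorPaths(child, childPath, nextScope));
  }
  return result;
}

const paths = listSelectorPaths(companySelectors);
console.log(paths);
```

Flattening the object this way is a quick check that every field of the output ends up with a selector, even when its parent group (like employee) has none.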

Get the data

We could do it with vanilla JavaScript, or we could take advantage of an amazing library: cheerio.js. Its API is based on jQuery syntax, so you’ll find it really friendly to use.

Now, if you wanted to do it only with cheerio, you would end up with something looking like this:

let cheerio = require('cheerio')
// cheerio.load() takes the HTML markup as a string, not a URL
let $ = cheerio.load('our html page source here')

var companiesList = [];

// For each .item, we add all the structure of a company to the companiesList array
// Don't try to understand what follows because we will do it differently.
$('.list.items .item').each(function(index, element){
  companiesList[index] = {};
  var header = $(element).find('.header');
  companiesList[index]['name'] = $(header).find('[itemprop=name]').text();
  companiesList[index]['description'] = $(header).find('[rel=description]').text();
  // We are already inside .header, and cheerio objects expose .attr(), not getAttribute()
  companiesList[index]['url'] = $(header).find('[itemprop=name] a').attr('href');
  var contact = $(element).find('.contact');
  companiesList[index]['contact'] = {};
  companiesList[index]['contact']['telephone'] = $(contact).find('[itemprop=telephone]').text();
  companiesList[index]['contact']['employee'] = {};
  companiesList[index]['contact']['employee']['name'] = $(contact).find('[itemprop=employeeName]').text();
  companiesList[index]['contact']['employee']['jobTitle'] = $(contact).find('[itemprop=employeeJobTitle]').text();
  companiesList[index]['contact']['employee']['email'] = $(contact).find('[itemprop=email]').text();
});

console.log(companiesList); // Output the data in the terminal
// Here is the output data:
// [
//     {
//         "name": "  Tessera       ",
//         "description": " Proud of our wide range of product\n\t\t\t\twe developped many project in the past 4 years.  You can find the company \n\t\t\t\tin 14 different countries ",
//         "url": "/comp/tessera",
//         "contact": {
//             "telephone": "       Phone: (841) 467-168  ",
//             "employee": {
//                 "name": "  Mike Layn        ",
//                 "jobTitle": "      Marketing Assistant",
//                 "email": "      Email: mike.layn@tessera.io"
//             }
//         }
//     },
//     {
//         "name": "  Marcox      ",
//         "description": " Lorem ipsum dolor  ",
//         "url": "/comp/marcox",
//         "contact": {
//             "telephone": "       Phone:    +1497 467168  ",
//             "employee": {
//                 "name": "  Jake Kannegan      ",
//                 "jobTitle": "   Owner ",
//                 "email": "      Email:     jakek@marcox.com"
//             }
//         }
//     }
// ]

As you can see, we get our 2 companies in an array. The data is pretty dirty, though: many spaces are still there, the email contains “Email:”, the phone contains “Phone:”, and they’re not rendered very nicely.
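To give an idea of what tidying these values could look like, here is a minimal sketch in plain JavaScript. The helper names `cleanText` and `stripLabel` are hypothetical, and the sketch assumes labels are plain words such as "Phone" or "Email" followed by a colon (no regular expression special characters).

```javascript
// Collapse runs of whitespace (spaces, tabs, newlines) into one space
// and trim both ends.
function cleanText(value) {
  return value.replace(/\s+/g, ' ').trim();
}

// Remove a leading label such as "Email:" or "Phone:", then clean the rest.
// Assumes `label` is a plain word without regex special characters.
function stripLabel(value, label) {
  const pattern = new RegExp(`^\\s*${label}\\s*:\\s*`, 'i');
  return cleanText(value.replace(pattern, ''));
}

// Raw values copied from the second company in the output above.
const raw = {
  name: '  Jake Kannegan      ',
  jobTitle: '   Owner ',
  telephone: '       Phone:    +1497 467168  ',
  email: '      Email:     jakek@marcox.com',
};

const cleaned = {
  name: cleanText(raw.name),
  jobTitle: cleanText(raw.jobTitle),
  telephone: stripLabel(raw.telephone, 'Phone'),
  email: stripLabel(raw.email, 'Email'),
};

console.log(cleaned);
```

Keeping the cleaning in small functions means the same helpers can be called inside the scraping loop or applied to the array afterwards.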

You can then clean this data afterwards, or add more code to the scraping loop to do it on the fly. But never mind, I’ll show you something magic now.