
Lanny's Quick and Dirty Web Scraping Tutorial

  1. #1
    Lanny Bird of Courage
    Somebody asked me for some help with a scraper for mangareader.net recently and I ended up writing a fair amount on it; most of it is applicable to web scraping in general. Thought it might have some use to a wider audience, so here's a lightly edited version:

    For web scraping kinds of things you want to use node, which can run outside of a browser context; the browser security model makes running scraping code in-browser a significant hurdle. You'll need node and npm installed.

    On a very high level web scraping consists of two parts: enumeration and retrieval. Enumeration is finding the list of all the things you want to scrape. Retrieval is fetching the actual content you want (pages of manga in this case) once they've been identified.

    If you want to do a full site scrape, enumeration looks kind of hierarchical: first you'll want to collect all the series hosted on the site, then all the chapters belonging to each series, then each page belonging to each chapter. For the sake of simplicity I suggest starting with scraping all the pages from just one chapter. Once you have code that works on one chapter you can start to generalize: turn it into a parameterized process that fetches all the pages of some chapter, then move on to enumerating the chapters of a series. You can work bottom-up in this fashion; the rough shape is sketched just below. In general this is a pretty good strategy for scraping.
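
    In code, the finished thing tends to end up layered something like this. It's just an illustrative skeleton, none of these functions exist yet; the rest of this post builds the bottom layers for mangareader and leaves the upper ones alone:

    // Rough shape of a bottom-up scraper, just to show the layering.

    // Bottom layer: retrieval. Fetch one page image given its url.
    function fetchPage(pageUrl, filename, done) { /* ... */ }

    // Enumerate the pages of one chapter, then fetch each of them.
    function fetchChapter(chapterUrl, done) { /* ... */ }

    // Enumerate the chapters of one series, then fetch each chapter.
    function fetchSeries(seriesUrl, done) { /* ... */ }

    // Top layer: enumerate every series on the site, then fetch each series.
    function fetchSite(siteUrl, done) { /* ... */ }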

    So to get a little more concrete, let's look at scraping all the pages from a particular chapter on mangareader. Start with the markup for a page like this one: http://www.mangareader.net/bitter-virgin/1

    We're looking for something that will list all the pages in the chapter. The jump-to-page thing looks promising:

    <select id="pageMenu" name="pageMenu"><option value="/bitter-virgin/1" selected="selected">1</option>
    <option value="/bitter-virgin/1/2">2</option>
    <option value="/bitter-virgin/1/3">3</option>
    ...
    </select>


    Awesome, it looks like there's an element that has a bunch of sub-elements with `value` attributes that point to each page in the chapter. Now we need to write some code to grab those. Here's what I came up with on the fly; no error handling or anything, but it's simple. It depends on two libraries, "request" and "jsdom", for making requests and parsing responses respectively. You'll need to install these on your system with npm (`npm install request jsdom`) if you haven't before:

    #!/usr/bin/env node

    var jsdom = require('jsdom');
    var request = require('request');

    var firstPage = 'http://www.mangareader.net/bitter-virgin/1';

    // Make a request to the firstPage url which we know contains urls to each
    // page of the chapter.
    request(firstPage, function(err, response, body) {

      // `body` is just a string, here we parse the content into something we
      // can work with.
      var content = new jsdom.JSDOM(body);

      // Use a CSS selector to get the elements we're interested in. The selector
      // is the '#pageMenu option' part. It says "return the list of all option
      // elements which are descendants of the element with pageMenu as its id".
      // We know each of those elements has a value attribute we're interested in.
      var opts = content.window.document.querySelectorAll('#pageMenu option');

      // Iterate over all the option elements we just collected and print their
      // value attribute.
      opts.forEach(function(opt) {
        console.log(opt.value);
      });
    });


    This works for me; it outputs a list of page urls. So that's enumeration done, now we need to write the fetching logic. Each page has an element with "img" as its id, which points to the image of the page. So we need to fetch the viewer page to get that, and then fetch the image itself and save it. Here's what that looks like:

    #!/usr/bin/env node

    var jsdom = require('jsdom');
    var request = require('request');
    var fs = require('fs');

    // Visit a page url and download and save the page image
    function fetchPage(pageUrl, filename, done) {
      // Grab the page url. Note there's a difference between the asset at the
      // page url (which contains html for ads and navigation and such) and the
      // actual image which we want to save.
      request(pageUrl, function(err, response, body) {
        // Parse content
        var content = new jsdom.JSDOM(body);
        // Identify the img tag pointing to the actual manga page image
        var img = content.window.document.querySelector('#img');

        // Make another request to get the actual image data
        request({
          url: img.src,
          // Must specify null encoding so the response doesn't get interpreted
          // as text
          encoding: null
        }, function(err, response, body) {
          // Write the image data to disk.
          fs.writeFileSync(filename, body);

          // Call done so we can either fetch another page or terminate the program.
          done();
        });
      });
    }

    // Fetch each page image from a list of pages and save them to disk
    function fetchPageList(pageList, timeout) {
      var idx = 0;

      // This recursive style may look strange, and you don't _really_ need to
      // worry about it, but it's important because of the async nature of HTTP
      // requests. If we did a simple for loop here every request would be fired
      // in parallel, so instead we'll process one page at a time, starting the
      // next page's fetch slightly after the previous one finishes.
      function fetchNext() {
        // Wait some amount of time between making each request. We could make
        // all these requests in parallel or one right after the other, but aside
        // from being unkind to the host, many websites will refuse to serve
        // requests if we make too many at once or too close to each other.
        setTimeout(function() {
          // Fetch the actual page
          fetchPage(pageList[idx], 'bitter-virgin-' + idx + '.jpg', function() {
            // Fetch complete. Move onto the next item if there is one.
            idx++;
            if (idx < pageList.length) {
              fetchNext();
            } else {
              console.log('Done!');
            }
          });
        }, timeout);
      }

      fetchNext();
    }

    var firstPage = 'http://www.mangareader.net/bitter-virgin/1';

    // Make a request to the firstPage url which we know contains urls to each
    // page of the chapter.
    request(firstPage, function(err, response, body) {

      // `body` is just a string, here we parse the content into something we
      // can work with.
      var content = new jsdom.JSDOM(body);

      // Use a CSS selector to get the elements we're interested in. The selector
      // is the '#pageMenu option' part. It says "return the list of all option
      // elements which are descendants of the element with pageMenu as its id".
      // We know each of those elements has a value attribute we're interested in.
      var opts = content.window.document.querySelectorAll('#pageMenu option');

      var pageList = [];
      opts.forEach(function(opt) {
        pageList.push('http://www.mangareader.net' + opt.value);
      });

      fetchPageList(pageList, 200);
    });


    As it stands `fetchPage` is parameterized, that is, it's equipped to fetch any page from any manga on mangareader. `fetchPageList` and the logic for fetching a chapter assume one particular series, however. To do a full scrape of the site they need to be generalized so that another process can enumerate the series and chapters and execute the generalized version of this process for each. That generalization is left as an exercise for the reader, but a rough starting point for the chapter-enumeration half is sketched below.
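
    This sketch assumes the series page lives at a url like the one below and lists its chapters under an element with a `listing` id; both of those are guesses you'd want to verify against the actual markup, the same way we dug up `#pageMenu` above. All it does is print chapter urls:

    #!/usr/bin/env node

    var jsdom = require('jsdom');
    var request = require('request');

    // Guessed series url, following the /series-name pattern of the chapter urls.
    var seriesPage = 'http://www.mangareader.net/bitter-virgin';

    request(seriesPage, function(err, response, body) {
      var content = new jsdom.JSDOM(body);

      // Placeholder selector for whatever element lists the chapter links on a
      // series page; check the real markup and adjust.
      var links = content.window.document.querySelectorAll('#listing a');

      links.forEach(function(link) {
        // In the generalized scraper each of these urls would be handed, one at
        // a time, to a parameterized version of the chapter-fetching code above
        // (same setTimeout/recursion pattern as fetchPageList, one level up).
        console.log('http://www.mangareader.net' + link.getAttribute('href'));
      });
    });

    From there, enumerating every series on the site is the same trick one level up again.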
  2. #2
    SBTlauien African Astronaut
    For Android/Java, I just started using the Jsoup library. It seems to do a good job.

    Is web crawling/spidering the same as web scraping? Or does web crawling/spidering implement web scraping? Or neither? Or both? Or something else? Or stfu? Or else? Or else if?
  3. #3
    Lanny Bird of Courage
    Originally posted by SBTlauien Or does web crawling/spidering implement web scraping?

    This one. Typically crawling implies you're indexing some large space by recursive link following, scraping some portion of each page you land on. When you just talk about scraping on its own you usually mean there's some known set of data (like all the manga on some manga hosting website for example) that you're trying to retrieve.
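
    Something like this, very roughly. It's just a toy to show the split between the two parts; the starting url and depth cutoff are arbitrary, and there's no rate limiting or robots.txt handling, so don't point it at anything you care about:

    var jsdom = require('jsdom');
    var request = require('request');

    var seen = {};

    function crawl(url, depth) {
      if (seen[url] || depth <= 0) return;
      seen[url] = true;

      request(url, function(err, response, body) {
        if (err) return;
        var doc = new jsdom.JSDOM(body).window.document;

        // The "scraping" part: pull whatever you're after off the page.
        console.log(url, '-', doc.title);

        // The "crawling" part: recursively follow links to more pages.
        doc.querySelectorAll('a[href^="http"]').forEach(function(a) {
          crawl(a.href, depth - 1);
        });
      });
    }

    crawl('http://www.mangareader.net/', 2);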
  4. #4
    manga sucks but it would probably be pretty dope if you were on acid
  5. #5
    Lanny Bird of Courage
    I like some of it; it's a wider medium with a lower barrier to entry, so there's more artistic freedom than in anime. Also more garbage and unbridled edginess, but you can afford to be picky.

    The point really isn't manga though, web scraping is useful in many other contexts.
  6. #6
    SBTlauien African Astronaut
    I used to just make a raw connection, then search for particular strings, and then split accordingly.
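
    Something like this in node, for comparison; the marker strings are just an example based on the pageMenu markup earlier in the thread:

    // Raw connection with the built-in http module, then plain string splitting
    // instead of a DOM parser.
    var http = require('http');

    http.get('http://www.mangareader.net/bitter-virgin/1', function(res) {
      var html = '';
      res.on('data', function(chunk) { html += chunk; });
      res.on('end', function() {
        // Search for a particular string and split around it.
        html.split('<option value="').slice(1).forEach(function(piece) {
          console.log(piece.split('"')[0]);
        });
      });
    });

    It works until the markup shifts slightly or some other option element shows up on the page, which is the kind of brittleness the DOM-parser approach above avoids.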