{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Automated Data Collection With R\n", "\n", "## **Ivan Hernandez, Ph.D.** (ivan.hernandez@depaul.edu)\n", "\n", "### **DePaul University**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**The Steps of Automatically Collecting Data Online**\n", "\n", "+ Download the HTML Source of a page\n", "+ Extract the content from the HTML\n", "+ Save the Content\n", "+ Repeat the Process on A Different Page\n", "\n", "All of these steps can be automated, running independently of human interaction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 0: Import the needed libraries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before beginning, make sure you have R installed on your computer with the necessary libraries.\n", "\n", "If you do have R installed, a recommended development environment is [**RStudio**](https://www.rstudio.com/products/rstudio/download)\n", "\n", "We are first going to load a key library called `rvest`\n", "+ [rvest](https://cran.r-project.org/web/packages/rvest)\n", "\n", "The `rvest` library allows us to parse the HTML of a webpage and isolate content within HTML tags" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "library(rvest)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Download the HTML Source of a page\n", "\n", "\n", "+ 1.1 Direct R to access the webpage, and save the page's HTML to a variable\n", "\n", "+ 1.2 Pass the downloaded HTML to an HTML parser in R (rvest)\n", "\n", "+ 1.3 Examine the downloaded content\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Direct R to Access the Webpage, and Save the Page's HTML to a variable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first step of scraping information from the web is downloading the 
source code of a specific webpage.\n", "\n", "You should have an idea in mind of the page(s) that contain the information you need.\n", "\n", "When you know what page you need to extract content from, then you can direct R to download it.\n", "\n", "We are going to extract content from the example html page:\n", "\n", "http://ivanhernandez.com/example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code below does the following:\n", " + Creates a variable, called \"url\", that has the address of the webpage we want to scrape\n", " + Downloads the HTML for the webpage and saves it as a variable called \"webpage\"\n", " + Prints a message that the webpage has been downloaded" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"Webpage downloaded\"\n" ] } ], "source": [ "url <- \"http://ivanhernandez.com/webscraping/example.html\"\n", "webpage <- read_html(url)\n", "print(\"Webpage downloaded\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can examine the contents of the source code using the \"paste\" function, which converts the parsed webpage back into text so that R prints its source code."
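, "\n", "\n", "As a minimal sketch (not part of the original walkthrough), the snippet below previews only the beginning of a page's source rather than printing all of it, which may be handy for long pages. It parses a small inline HTML string, so it should run without a network connection. The names `demo_page` and `source_text`, the inline HTML, and the 60-character cutoff are illustrative assumptions.\n", "\n", "```r\n", "library(rvest)\n", "\n", "# Parse a small inline HTML string (a stand-in for a downloaded page)\n", "demo_page <- read_html(\"<html><head><title>Example Page</title></head><body><h1>Title</h1></body></html>\")\n", "\n", "# Convert the parsed page back into one string, as paste() does in this notebook\n", "source_text <- paste(demo_page)\n", "\n", "# Show only the first 60 characters so that long pages do not flood the console\n", "substr(source_text, 1, 60)\n", "```"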
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "'<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\\n<html>\\n<head>\\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\\n<title>Example Page</title>\\n</head>\\n<body>\\n\\t<h1>\\n\\t\\tThis Title is in an H1 tag\\n\\t</h1>\\n\\t<div class=\"box1\">This text is inside a div tag, whose class is equal to box1</div>\\n\\t<br><div class=\"box2\">This text is inside a div tag, whose class is equal to box2</div>\\n\\t<br><span id=\"box3\">This text is inside a span tag, whose id is equal to box3</span>\\n\\t<p id=\"box4\">This text is inside a p tag, whose id is equal to box4</p>\\n\\t<div role=\"extra\">This text is inside a div tag whose role is equal to extra</div>\\n\\t<a href=\"http://google.com\">This is a link to Google</a>\\n\\t<br><br>\\n\\tAdditional Content: This content is not inside any tag\\n\\t<br><img src=\"http://ivanhernandez.com/webscraping/profile.png\">\\n</body>\\n</html>\\n'" ], "text/latex": [ "'<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\\textbackslash{}n<html>\\textbackslash{}n<head>\\textbackslash{}n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\\textbackslash{}n<title>Example Page</title>\\textbackslash{}n</head>\\textbackslash{}n<body>\\textbackslash{}n\\textbackslash{}t<h1>\\textbackslash{}n\\textbackslash{}t\\textbackslash{}tThis Title is in an H1 tag\\textbackslash{}n\\textbackslash{}t</h1>\\textbackslash{}n\\textbackslash{}t<div class=\"box1\">This text is inside a div tag, whose class is equal to box1</div>\\textbackslash{}n\\textbackslash{}t<br><div class=\"box2\">This text is inside a div tag, whose class is equal to box2</div>\\textbackslash{}n\\textbackslash{}t<br><span id=\"box3\">This text is inside a span tag, whose id is equal to box3</span>\\textbackslash{}n\\textbackslash{}t<p id=\"box4\">This text is inside a p tag, whose id is equal to box4</p>\\textbackslash{}n\\textbackslash{}t<div role=\"extra\">This text is inside a div tag whose role is equal to extra</div>\\textbackslash{}n\\textbackslash{}t<a href=\"http://google.com\">This is a link to Google</a>\\textbackslash{}n\\textbackslash{}t<br><br>\\textbackslash{}n\\textbackslash{}tAdditional Content: This content is not inside any tag\\textbackslash{}n\\textbackslash{}t<br><img src=\"http://ivanhernandez.com/webscraping/profile.png\">\\textbackslash{}n</body>\\textbackslash{}n</html>\\textbackslash{}n'" ], "text/markdown": [ "'<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\\n<html>\\n<head>\\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\\n<title>Example Page</title>\\n</head>\\n<body>\\n\\t<h1>\\n\\t\\tThis Title is in an H1 tag\\n\\t</h1>\\n\\t<div class=\"box1\">This text is inside a div tag, whose class is equal to box1</div>\\n\\t<br><div class=\"box2\">This text is inside a div tag, whose class is equal to box2</div>\\n\\t<br><span id=\"box3\">This text is inside a span tag, whose id is equal to box3</span>\\n\\t<p id=\"box4\">This text is inside a p tag, whose id is equal to box4</p>\\n\\t<div role=\"extra\">This text is inside a div tag whose role is equal to extra</div>\\n\\t<a href=\"http://google.com\">This is a link to Google</a>\\n\\t<br><br>\\n\\tAdditional Content: This content is not inside any tag\\n\\t<br><img src=\"http://ivanhernandez.com/webscraping/profile.png\">\\n</body>\\n</html>\\n'" ], "text/plain": [ "[1] \"<!DOCTYPE html PUBLIC \\\"-//W3C//DTD HTML 4.0 Transitional//EN\\\" \\\"http://www.w3.org/TR/REC-html40/loose.dtd\\\">\\n<html>\\n<head>\\n<meta http-equiv=\\\"Content-Type\\\" content=\\\"text/html; charset=UTF-8\\\">\\n<title>Example Page</title>\\n</head>\\n<body>\\n\\t<h1>\\n\\t\\tThis Title is in an H1 tag\\n\\t</h1>\\n\\t<div class=\\\"box1\\\">This text is inside a div tag, whose class is equal to box1</div>\\n\\t<br><div class=\\\"box2\\\">This text is inside a div tag, whose class is equal to box2</div>\\n\\t<br><span id=\\\"box3\\\">This text is inside a span tag, whose id is equal to box3</span>\\n\\t<p id=\\\"box4\\\">This text is inside a p tag, whose id is equal to box4</p>\\n\\t<div role=\\\"extra\\\">This text is inside a div tag whose role is equal to extra</div>\\n\\t<a href=\\\"http://google.com\\\">This is a link to Google</a>\\n\\t<br><br>\\n\\tAdditional Content: This content is not inside any tag\\n\\t<br><img src=\\\"http://ivanhernandez.com/webscraping/profile.png\\\">\\n</body>\\n</html>\\n\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "paste(webpage)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 2: Extracting Content from a Page\n", "\n", "Once you have the webpage source code downloaded, you are ready to tell R to extract the content you want.\n", "\n", "We can parse through the elements of the source code easily by using the `html_nodes()` function.\n", "\n", "In the `html_nodes()` function, we pass along the webpage source code that we already downloaded in Step 1.\n", "\n", "We then pass along a string indicating what we want to match in the content, such as the tag name, class name, or id of the HTML element.\n", "\n", "If we know the type of tag (e.g., div, p, a, etc.), and we know an identifying selector (e.g., class, id, name, etc.), we can grab the content in that specified tag.\n", "\n", "Depending on your project and goals, you will find yourself in one of these possible situations:\n", "\n", "+ 2.1 You want a **Single Piece of Text** Data from a Page\n", " + a. You want content from a specific tag with a specified name\n", " + b. You want content from a specific tag, but the selector does not matter\n", " + c. You want content that has a selector, but the tag does not matter\n", "\n", "\n", "+ 2.2 You want **Multiple Pieces of Text** Data from a Page\n", " + a. You want all the content from a specific tag that occurs many times\n", " + b. You want all the content that could come from many possible tags\n", " + c. You want all the content that could come from many possible id/class names\n", " + d. You want all the content that comes from a specific tag, and partially matches an id/class/name \n", "\n", "\n", "+ 2.3 You want **Information from Links**\n", " + a. You want to get the link text\n", " + b. You want to get the url of a link\n", "\n", "\n", "+ 2.4 You want to Extract **Non-tagged** Content\n", " + a. You want to get a specific piece of text\n", " + b. 
You want to get a file found on the page\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.1 Collecting a Single Piece of Text Data from a Page\n", "\n", "In the following section, we are going to extract a single piece of text found in the webpage.\n", "\n", "Below is an image of the example page source code for reference." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1.a When you know the tag name and identifying information" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In HTML, there are many common tag names, which are used to create different kinds of layouts.\n", "\n", "The tag name is the name that comes immediately after a \"<\" sign in the HTML source code.\n", "\n", "Some commonly used tag names are:\n", "- div\n", "- span\n", "- p\n", "- h1\n", "- a\n", "- strong\n", "\n", "To tell apart tags of the same type or to provide different layout options, tags often have a selector after the tag name that identifies that content. 
Some commonly used selectors are:\n", "- class\n", "- id\n", "- name\n", " \n", "If you know the tag and selector for the element you want to retrieve, then the second argument in the **`html_nodes()`** function should be a string with the tag name first, followed by square brackets that contain the selector name and its value.\n", "\n", "\n", "Let's grab the content from the **div** tag with a selector called **class** that is equal to **box1**\n", "\n", "We need to:\n", "- Specify in quotations the tag name of the element first (\"div\")\n", "- Then indicate in square brackets what selector we want (\"class\")\n", "- Then use an equals sign (\"=\")\n", "- Then indicate what the selector's value is equal to (\"box1\")\n", "- Make sure that the value has single quotes around it\n", "- Result: **\"div[class='box1']\"**\n", "\n", "We save the matching element to the variable called \"content\", and can access the text within that tag using the **`html_text()`** function." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "'This text is inside a div tag, whose class is equal to box1'" ], "text/latex": [ "'This text is inside a div tag, whose class is equal to box1'" ], "text/markdown": [ "'This text is inside a div tag, whose class is equal to box1'" ], "text/plain": [ "[1] \"This text is inside a div tag, whose class is equal to box1\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "content <- html_nodes(webpage, \"div[class='box1']\")\n", "html_text(content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's grab the content from the **div** tag with a class equal to **box2**\n", "\n", "**\"div[class='box2']\"**" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "'This text is inside a div tag, whose class is equal to box2'" ], "text/latex": [ "'This text is inside 
a div tag, whose class is equal to box2'" ], "text/markdown": [ "'This text is inside a div tag, whose class is equal to box2'" ], "text/plain": [ "[1] \"This text is inside a div tag, whose class is equal to box2\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "content <- html_nodes(webpage, \"div[class='box2']\")\n", "html_text(content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's grab the content from the **span** tag with an id equal to **box3**\n", "\n", "**\"span[id='box3']\"**" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "'This text is inside a span tag, whose id is equal to box3'" ], "text/latex": [ "'This text is inside a span tag, whose id is equal to box3'" ], "text/markdown": [ "'This text is inside a span tag, whose id is equal to box3'" ], "text/plain": [ "[1] \"This text is inside a span tag, whose id is equal to box3\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "content <- html_nodes(webpage, \"span[id='box3']\")\n", "html_text(content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1.b When you only know/need the tag name\n", "\n", "If you only know the tag name (div, span, p, h1, etc.), then you only need to pass along the name of it to the `html_nodes()` function.\n", "\n", "Let's grab the content from the **span** tag" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "'This text is inside a span tag, whose id is equal to box3'" ], "text/latex": [ "'This text is inside a span tag, whose id is equal to box3'" ], "text/markdown": [ "'This text is inside a span tag, whose id is equal to box3'" ], "text/plain": [ "[1] \"This text is inside a span tag, whose id is equal to box3\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "content <- html_nodes(webpage, \"span\")\n", 
"html_text(content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's grab the content from the **p** tag" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "'This text is inside a p tag, whose id is equal to box4'" ], "text/latex": [ "'This text is inside a p tag, whose id is equal to box4'" ], "text/markdown": [ "'This text is inside a p tag, whose id is equal to box4'" ], "text/plain": [ "[1] \"This text is inside a p tag, whose id is equal to box4\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "content <- html_nodes(webpage, \"p\")\n", "html_text(content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1.c When you only know/need the selector name\n", "\n", "If you only know the selector name of the element you want to retrieve, then build the second argument in the `html_nodes()` function as follows:\n", "- begin the string with a square bracket, leaving out the tag name\n", "- specify the selector name\n", "- add an equals sign\n", "- give the value that the selector is equal to for the content you want\n", "- close out the square bracket\n", "\n", "Let's grab the content from the element with the **class** equal to **box1**" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "'This text is inside a div tag, whose class is equal to box1'" ], "text/latex": [ "'This text is inside a div tag, whose class is equal to box1'" ], "text/markdown": [ "'This text is inside a div tag, whose class is equal to box1'" ], "text/plain": [ "[1] \"This text is inside a div tag, whose class is equal to box1\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "content <- html_nodes(webpage, \"[class='box1']\")\n", "html_text(content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's grab the content from the element with 
the **id** equal to **box4**" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "'This text is inside a p tag, whose id is equal to box4'" ], "text/latex": [ "'This text is inside a p tag, whose id is equal to box4'" ], "text/markdown": [ "'This text is inside a p tag, whose id is equal to box4'" ], "text/plain": [ "[1] \"This text is inside a p tag, whose id is equal to box4\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "content <- html_nodes(webpage, \"[id='box4']\")\n", "html_text(content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even though \"role\" is not a commonly used identifier, we can still select the content whose **role** is equal to **extra** using the same method as for the commonly used identifiers." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "'This text is inside a div tag whose role is equal to extra'" ], "text/latex": [ "'This text is inside a div tag whose role is equal to extra'" ], "text/markdown": [ "'This text is inside a div tag whose role is equal to extra'" ], "text/plain": [ "[1] \"This text is inside a div tag whose role is equal to extra\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "content <- html_nodes(webpage, \"[role='extra']\")\n", "html_text(content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 1: Collect the Stock Price Information" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**This exercise is based on the html source from https://finance.google.com/finance?q=aapl**\n", "\n", "In the space below, write the code that would extract **ONLY** the stock price for Apple.\n", " \n", "Some starter code is provided.\n", " \n", "Your code should\n", "- specify the url\n", "- read the url's html into a variable called webpage\n", "- indicate which tag and/or selector contains the Apple stock price, and save 
that information to a variable called **content**.\n", "- save the text information from content to a variable called **price**\n", "- remove any extra whitespace by using: `trimws(price)`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Go to this url and right-click on the price, then click \"inspect\" to inspect the source code:**\n", "\n", "**http://ivanhernandez.com/webscraping/GoogleFinance/aapl.html**" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "'174.97'" ], "text/latex": [ "'174.97'" ], "text/markdown": [ "'174.97'" ], "text/plain": [ "[1] \"174.97\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "url <- \"http://ivanhernandez.com/webscraping/GoogleFinance/aapl.html\"\n", "webpage <- read_html(url)\n", "content <- html_nodes(webpage, \"span.pr\")\n", "price <- html_text(content)\n", "trimws(price)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.2 Collecting Multiple Pieces of Text Data from a Page\n", "\n", "If there are multiple elements you want to extract content from, you can use the same **`html_nodes()`** function, which returns every element that matches.\n", "\n", "Note the \"s\" at the end of the function name: `html_nodes()` returns all of the matches, whereas its singular counterpart, `html_node()`, returns only the first match" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's redownload the example webpage, because the variable called \"webpage\" was overwritten in the last exercise" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"Webpage downloaded\"\n" ] } ], "source": [ "url <- \"http://ivanhernandez.com/webscraping/example.html\"\n", "webpage <- read_html(url)\n", "print(\"Webpage downloaded\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2.a When you know the tag name\n", "\n", "For example, there are two div tags in the entire 
HTML that carry a class (the one with class=box1 and the one with class=box2), plus a third div whose role is equal to extra.\n", "\n", "If we wanted to extract all of the content within div tags at once, we can ask rvest to find all the div tags and save their text to a vector called \"tags\"" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"tags are downloaded\"\n" ] } ], "source": [ "content <- html_nodes(webpage, \"div\")\n", "tags <- html_text(content)\n", "print(\"tags are downloaded\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `html_nodes()` function returns the matching elements in order, and `html_text()` turns them into a character vector, which is a data structure that holds a collection of items in order.\n", "\n", "We can iteratively access the specific content in the vector using a \"for loop\"" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"This text is inside a div tag, whose class is equal to box1\"\n", "[1] \"This text is inside a div tag, whose class is equal to box2\"\n", "[1] \"This text is inside a div tag whose role is equal to extra\"\n" ] } ], "source": [ "for (tag in tags){\n", " print(tag)\n", " }" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use indices (1,2,3, etc.) 
to select specific items in that list" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "'This text is inside a div tag, whose class is equal to box2'" ], "text/latex": [ "'This text is inside a div tag, whose class is equal to box2'" ], "text/markdown": [ "'This text is inside a div tag, whose class is equal to box2'" ], "text/plain": [ "[1] \"This text is inside a div tag, whose class is equal to box2\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tags[2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get the last element, you can see how many items were retrieved and ask for the item in that position" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "'This text is inside a div tag whose role is equal to extra'" ], "text/latex": [ "'This text is inside a div tag whose role is equal to extra'" ], "text/markdown": [ "'This text is inside a div tag whose role is equal to extra'" ], "text/plain": [ "[1] \"This text is inside a div tag whose role is equal to extra\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "numberitems <- length(tags)\n", "tags[numberitems]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2.b When there are many different tag names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we wanted to extract content that could be in either a div or a span or a p tag, then we can place them all within one quoted string (separating each one with a comma), and run the `html_nodes()` function using that string of tags.\n", "\n", "Below, we specify that we want returned any content within a div, p, or span tag."
] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"This text is inside a div tag, whose class is equal to box1\"\n", "[2] \"This text is inside a div tag, whose class is equal to box2\"\n", "[3] \"This text is inside a span tag, whose id is equal to box3\" \n", "[4] \"This text is inside a p tag, whose id is equal to box4\" \n", "[5] \"This text is inside a div tag whose role is equal to extra\" \n" ] } ], "source": [ "contents <- html_nodes(webpage, \"div , p, span\")\n", "results <- html_text(contents)\n", "print(results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2.c When there are many different selector names\n", "\n", "If we know precisely the names of the selectors that could match (class, id, etc.), we can specify the tag name, as well as all the id and class names we want to match, by separating each with a comma." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"This text is inside a div tag, whose class is equal to box1\"\n", "[2] \"This text is inside a div tag, whose class is equal to box2\"\n" ] } ], "source": [ "contents <- html_nodes(webpage, \"div[class='box1'], div[class='box2']\")\n", "results <- html_text(contents)\n", "print(results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**You also can exclude certain classes from the results if you use the \"not\" operator** and in parentheses indicate what not to match.\n", "\n", "Write one `:not()` for each thing you do not want to match, each one introduced by a colon."
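, "\n", "\n", "As a self-contained sketch of the same exclusion idea, the snippet below parses a small inline HTML string, so it should run without a network connection. The inline HTML and the names `demo_page` and `remaining` are illustrative assumptions, not part of the example webpage.\n", "\n", "```r\n", "library(rvest)\n", "\n", "# Inline stand-in page with three div tags: two carry a class, one carries a role\n", "demo_page <- read_html(\"<div class='box1'>one</div><div class='box2'>two</div><div role='extra'>three</div>\")\n", "\n", "# Chain one :not() per class to exclude; only the div without those classes is kept\n", "remaining <- html_nodes(demo_page, \"div:not([class='box1']):not([class='box2'])\")\n", "html_text(remaining)\n", "```"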
] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"This text is inside a div tag whose role is equal to extra\"\n" ] } ], "source": [ "contents <- html_nodes(webpage, \"div:not([class='box1']):not([class='box2'])\")\n", "results <- html_text(contents)\n", "print(results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2.d When you only know the tag and part of the class name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We may be in a situation where we know only part of what we want to extract (e.g., we want something that is in between div tags, that has a class with the word \"box\" in it).\n", "\n", "Using the partial-match operators of CSS attribute selectors, we can match only part of a class, id, or other selector value.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you know what the name STARTS with, use the ^ operator" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"This text is inside a div tag, whose class is equal to box1\"\n", "[2] \"This text is inside a div tag, whose class is equal to box2\"\n" ] } ], "source": [ "contents <- html_nodes(webpage, \"div[class^='box']\")\n", "results <- html_text(contents)\n", "print(results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you know what the name ENDS with, use the $ operator (the * operator instead matches the text anywhere in the name)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"This text is inside a div tag, whose class is equal to box1\"\n" ] } ], "source": [ "contents <- html_nodes(webpage, \"div[class$='1']\")\n", "results <- html_text(contents)\n", "print(results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 2: Scrape 
the Headlines from Google News\n", "\n", "**This exercise is based on the html source from https://news.google.com**\n", "\n", "**Go to the following page and right-click and inspect the headline titles.**\n", "\n", "**http://ivanhernandez.com/webscraping/GoogleNews/GoogleNews.html**\n", "\n", "**What is the selector that identifies headlines?**\n", "\n", "When you see what selector identifies headlines, in the space below, write the code that would extract ALL of the headline titles from the Google News homepage.\n", "\n", "Save the list of headlines as a variable called **contents**.\n", "\n", "Then, extract the text from the contents variable, and save that to a variable called **headlines**\n", "\n", "Print the headlines on the last line." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " [1] \"The Real Risks of Trump's Steel and Aluminum Tariffs\" \n", " [2] \"Trump says US to impose steep tariffs on steel, aluminum imports\" \n", " [3] \"Trump announces tariffs on steel, aluminum imports\" \n", " [4] \"GOP meltdown over Trump plan to impose steel, aluminum tariffs\" \n", " [5] \"The Right Import Tax Is Zero\" \n", " [6] \"'Every day is a new adventure': Trump upends Washington and Wall Street with shifts on trade, guns\" \n", " [7] \"Don't Buy Putin's Missile Hype\" \n", " [8] \"Why would Putin want to nuke Florida?\" \n", " [9] \"Putin Vows to Lift Russia's Struggling Middle Class\" \n", " [10] \"What Russia's newly announced nuclear systems actually mean\" \n", " [11] \"Georgia Senate approves tax bill, snubbing Delta in NRA feud\" \n", " [12] \"Casey Cagle on Twitter: \\\"I will kill any tax legislation that benefits @Delta unless the company changes its position and ...\"\n", " [13] \"Georgia lawmakers yank tax break for Delta after airline cuts ties with NRA\" \n", " [14] \"The Latest: Pro-gun lawmakers win victory over Delta\" \n", " [15] \"READERS 
WRITE: MAR. 2\" \n", " [16] \"White House preparing for McMaster exit as early as next month\" \n", " [17] \"Hanson: Is Donald Trump going to fire one of his generals?\" \n", " [18] \"White House preparing for exit of national security advisor HR McMaster: NBC News\" \n", " [19] \"Dangerous nor'easter targets East Coast with snow, rain, wind\" \n", " [20] \"Powerful nor'easter has nearly the entire East Coast under high wind watches, warnings\" \n", " [21] \"Jared Kushner's troubles include an impending $1.2 billion company debt\" \n", " [22] \"Kushner loses access to top-secret intelligence\" \n", " [23] \"Jared Kushner's Future In The White House After A Security Clearance Downgrade\" \n", " [24] \"Jared Kushner Should Be Fired After His Week From Hell, Democrats Say\" \n", " [25] \"Regulator Seeks Kushner Loan Details From Deutsche Bank, Two Others\" \n", " [26] \"'Javanka' should follow Hope Hicks out the door\" \n", " [27] \"A Trump ally is likely to replace a career diplomat as US ambassador, and Mexicans are worried\" \n", " [28] \"US Ambassador to Mexico to Quit Amid Tense Relations Under Trump\" \n", " [29] \"US ambassador to Mexico resigning in another big loss for State Department\" \n", " [30] \"US ambassador to Mexico unexpectedly resigns amid increased tensions under Trump\" \n", " [31] \"Trump's lack of diplomacy threatens to derail US-Mexico relations\" \n", " [32] \"US ambassador to Mexico stepping down\" \n", " [33] \"US ambassador to Mexico quits in latest departure of high power talent from the State Department\" \n", " [34] \"Xi Jinping's latest power grab is bad news for China's economy\" \n", " [35] \"Sensitive Words: Xi to Ascend His Throne (Updated) - China Digital Times (CDT)\" \n", " [36] \"China Blocks Winnie the Pooh, Again\" \n", " [37] \"Why China banned the letter 'N'\" \n", " [38] \"Language Log » The letter * has bee* ba**ed in Chi*a\" \n", " [39] \"China Focus: CPC hears opinions on deepening reform of Party and state 
institutions\" \n", " [40] \"Xi Jinping's authoritarian rise in China has been powered by sexism\" \n", " [41] \"The Latest: UN Imagery Shows New Damage in Syria Suburb\" \n", " [42] \"Syria war: Pakistani civilians leave besieged Eastern Ghouta\" \n", " [43] \"UN official: Pauses in Syria suburb unilateral, 'not enough'\" \n", " [44] \"Dr. Marc Siegel: Doctors in Syria are fighting an unwinnable battle against death -- They need our help\" \n", " [45] \"Syrian civilians unable to evacuate despite pauses in fighting in besieged Ghouta\" \n", " [46] \"A Teen Tried To Shoot Queen Elizabeth In 1981, Intelligence Report Says\" \n", " [47] \"The Snowman and the Queen: Declassified intelligence service documents confirm assassination attempt on Queen ...\" \n", " [48] \"New Zealand teenager tried to kill Queen Elizabeth in 1981\" \n", " [49] \"A teenager allegedly tried to kill Queen Elizabeth in 1981. Police suspect a coverup.\" \n", " [50] \"New Zealand teenager tried to assassinate Queen Elizabeth in 1981: intelligence agency\" \n", " [51] \"Revealed: the day a schoolboy, 17, tried to assassinate the Queen\" \n", " [52] \"Befuddled By Trump, Senate Will Not Vote On Gun Measures Next Week\" \n", " [53] \"Donald Trump's dubious attack on US-Canada trade\" \n", " [54] \"Baffled Republicans distance themselves from Trump on guns\" \n", " [55] \"The Latest: Democrat pleased with Trump support on gun laws\" \n", " [56] \"Why Democrats Are Losing the Gun Debate\" \n", " [57] \"Trump's willingness to consider gun control is welcome. But can we believe him?\" \n", " [58] \"Body of missing mother found, suspect in custody: Authorities\" \n", " [59] \"Man charged with murder after missing Middlesex mom's body found\" \n", " [60] \"TerriLynn St. 
John case: Grim end to search for mom who vanished from driveway\" \n", " [61] \"Virginia mom, 23, who mysteriously disappeared from front yard found dead; suspect in custody\" \n", " [62] \"Missing Virginia mother mysteriously disappeared from front yard\" \n", " [63] \"Man accused of sending abusive letters with 'suspicious white powder' to Trump Jr., Sen. Stabenow\" \n", " [64] \"Massachusetts Man Arrested for Mailing Threatening Letters Containing Suspicious White Powder | USAO-MA ...\" \n", " [65] \"Massachusetts man charged in Donald Trump Jr. white powder hoax\" \n", " [66] \"Powder hoax suspect's bizarre rants revealed\" \n", " [67] \"Daniel Frisiello charged with 'powder' letters to Trump son, others\" \n", " [68] \"Did HUD really need to spend $31000 of taxpayer money on that dining furniture for Ben Carson?\" \n", " [69] \"ttongress of tbe Wntteb tates - Oversight and Government Reform - House.gov\" \n", " [70] \"Ben Carson Tries to Cancel $31000 Dining Furniture Purchase for HUD Office\" \n", " [71] \"$31000 dining set purchased by Ben Carson's HUD came from a Baltimore interior design firm\" \n", " [72] \"Did Ben Carson Purchase a $31000 Dining Set and Charge It to HUD?\" \n", " [73] \"Ben Carson's HUD Spends $31000 on Dining Set for His Office\" \n", " [74] \"Ben Carson needs to check out Ikea\" \n", " [75] \"There are far more major scandals brewing. 
So why is a $31000 dining set in the news?\" \n", " [76] \"Equifax finds its big data breach hit an additional 2.4 million people\" \n", " [77] \"Equifax Releases Updated Information on 2017 Cybersecurity Incident | Equifax\" \n", " [78] \"Equifax Identifies Additional 2.4 Million Customers Hit by Data Breach\" \n", " [79] \"Equifax Identifies Additional 2.4 Million Affected by 2017 Breach\" \n", " [80] \"Equifax Releases Updated Information on 2017 Cybersecurity Incident\" \n", " [81] \"Gunmaker American Outdoor Brands plummets 20% after quarterly sales plunge\" \n", " [82] \"Financial Newsletter - Zacks\" \n", " [83] \"Smith & Wesson parent AOBC off target in most recent earnings\" \n", " [84] \"Smith & Wesson Parent Cuts Forecast, Sees Weak Market Ahead\" \n", " [85] \"American Outdoor Brands Corporation - AOBC - Stock Price Today - Zacks\" \n", " [86] \"Mass shootings have made gun stocks toxic assets on Wall Street\" \n", " [87] \"Drunk man accidentally takes $1600 Uber from West Virginia to New Jersey\" \n", " [88] \"Drunk Man Accidentally Takes $1600 Uber To NJ After Partying With Friends In West Virginia\" \n", " [89] \"Drunk man passes out in Uber, takes $1600 ride\" \n", " [90] \"Drunk bro 'blacked out' and took $1600 Uber ride\" \n", " [91] \"A $1600 Uber ride? Drunk man blacks out, takes trip from W.Va. 
to NJ\" \n", " [92] \"Corporations only break with the gun industry when it's cheap and easy\" \n", " [93] \"Walmart Statement on Firearms Policy\" \n", " [94] \"This Time It's Guns: Retail Activism Goes Mainstream With Dick's Sporting Goods\" \n", " [95] \"Kroger Raises Age Limits on Gun Sales, Joining Walmart and Dick's\" \n", " [96] \"Commentary: Dick's Is Showing Us Why Companies Can't Afford to Be Cowards Anymore\" \n", " [97] \"Android Go is here to fix super cheap phones\" \n", " [98] \"Android Go for Feature Phones: What You Need to Know\" \n", " [99] \"New “Android Go” phones show how much you can get for $100\" \n", "[100] \"Android P Release Date, Name, Features & Expected Android P Phones List\" \n", "[101] \"Twitter: We Know the Platform Is Toxic, Please Help Us Fix\" \n", "[102] \"Measuring The Health of Our Public Conversations — Cortico\" \n", "[103] \"Twitter is asking the public to help measure how toxic it is\" \n", "[104] \"Twitter seeks help measuring 'health' of its world\" \n", "[105] \"Facebook to End News Feed Experiment in 6 Countries That Magnified Fake News\" \n", "[106] \"News Feed FYI: Ending the Explore Feed Test | Facebook Newsroom\" \n", "[107] \"Facebook ends its experiment with the alternative “explore” news feed\" \n", "[108] \"Facebook Ditches Plan for 2 Separate News Feeds\" \n", "[109] \"First Look: GMC goes big with the redesigned 2019 Sierra Denali, but will luxury truck buyers drive one home?\" \n", "[110] \"2019 GMC Sierra 1500 Denali puts a tailgate in your tailgate\" \n", "[111] \"Forget submarine steel: The 2019 GMC Sierra truck is made from carbon fiber\" \n", "[112] \"2019 GMC Sierra 1500 First Look: Distinguishing Itself from Silverado\" \n", "[113] \"4 Reasons Marvel And Disney Moved 'Avengers: Infinity War' To April\" \n", "[114] \"'Avengers: Infinity War' Release Date Moves Up One Week to April\" \n", "[115] \"Avengers: Infinity War toys showcase new villains\" \n", "[116] \"Tori Spelling's Struggles: Overcoming 
Health Scares, Marriage Drama, Family Feuds and Money Troubles\" \n", "[117] \"Police Called to Tori Spelling's Los Angeles Home Over 'Disturbance'\" \n", "[118] \"Tori Spelling's 'breakdown' may be due to demands of motherhood, Corinne Olympios says\" \n", "[119] \"Tori Spelling Visited by LAPD After Mysterious 'Disturbance' 911 Call Insinuating a Mental Breakdown\" \n", "[120] \"Dean McDermott Friendly Chat with Cops After Tori's Apparent Mental Breakdown\" \n", "[121] \"Corinne Olympios Says Tori Spelling Seemed 'Distant' Prior to Police Being Called to Her House (Exclusive)\" \n", "[122] \"On the Red Carpet, Ryan Seacrest Is a Distraction in an Important Year\" \n", "[123] \"Ryan Seacrest Sexual Abuse Allegations: E! Stylist Goes Into Detail – Variety\" \n", "[124] \"Kelly Ripa defends Ryan Seacrest, says she's 'blessed' to work with him\" \n", "[125] \"Kelly Ripa defends co-host Ryan Seacrest amid sex misconduct allegations\" \n", "[126] \"Publicists to steer stars away from Seacrest on Oscars red carpet\" \n", "[127] \"Hollywood's reckoning has transformed the red carpet\" \n", "[128] \"Kylie Jenner Poses in Underwear One Month After Giving Birth to Baby Stormi\" \n", "[129] \"She's snapped back into shape! Kylie Jenner, 20, proudly poses in a thong just one month after giving birth to Stormi\" \n", "[130] \"Kylie Jenner Shows Off Post-Baby Body in Black Underwear 1 Month After Birth of Daughter Stormi\" \n", "[131] \"Jeffree Star Vs. Kylie Jenner Continues: \\\"Who's Ready for Some Hot Tea Today???\\\"\" \n", "[132] \"What did Sixers' Joel Embiid think of Rockets' James Harden's ridiculous crossover vs. Clippers? 
(VIDEO)\" \n", "[133] \"The 13 most disrespectful things from James Harden's crossover on Wesley Johnson, ranked\" \n", "[134] \"Cleveland Cavaliers SF LeBron James: James Harden pulled off move players dream about\" \n", "[135] \"Celebrities spotted at the Rockets' win over the Clippers\" \n", "[136] \"Three Things to Know: James Harden does Wesley Johnson, Clippers wrong\" \n", "[137] \"Sean Miller's Statement Takes the Fight to ESPN: Is a Lawsuit the Next Step?\" \n", "[138] \"Sources: Sean Miller talked payment on wiretap\" \n", "[139] \"Sean Miller stands tall with Arizona's support, but he's hardly in the clear\" \n", "[140] \"Sean Miller Denies Wrongdoing, Will Coach Arizona Amid Deandre Ayton Scandal\" \n", "[141] \"Sources cast doubt on reported timeline of Miller-Dawkins call\" \n", "[142] \"Could Sean Miller wiretap report be wrong? Seeds of doubt arise with ESPN recruiting story\" \n", "[143] \"Vikings emerge as favorites to land Kirk Cousins at NFL Combine\" \n", "[144] \"Minnesota Vikings - NFL - CBSSports.com\" \n", "[145] \"Mike Zimmer: Wrong QB decision for Vikings means I'll 'probably get fired'\" \n", "[146] \"Zimmer: If Vikings don't pick right QB, I'll 'get fired'\" \n", "[147] \"Minnesota Vikings | Bleacher Report\" \n", "[148] \"NFL Combine 2018: Saquon Barkley grew up a Jets fan, but could they actually draft him?\" \n", "[149] \"2018 NFL combine: What we learned from RB and OL weigh-ins and measurements\" \n", "[150] \"Saquon Barkley would love to be drafted by the Cleveland Browns\" \n", "[151] \"Saquon Barkley: It Would Be 'Awesome' to Be Drafted by Browns, Struggling Teams\" \n", "[152] \"Did Dark Matter Make The Early Universe Chill Out?\" \n", "[153] \"An absorption profile centred at 78 megahertz in the sky-averaged spectrum | Nature\" \n", "[154] \"Signal detected from 'cosmic dawn'\" \n", "[155] \"A rare signal from the early universe sends scientists clues about dark matter\" \n", "[156] \"Possible interaction between baryons and 
dark-matter particles revealed by the first stars | Nature\" \n", "[157] \"The birth of the first stars\" \n", "[158] \"Watch a bus-size asteroid buzz Earth to start the weekend\" \n", "[159] \"Asteroid Watch - NASA Jet Propulsion Laboratory\" \n", "[160] \"Watch Live Stream: Bus-Sized 2018 DV1 Asteroid Will Fly By Earth On Friday\" \n", "[161] \"Bus-size asteroid to pass within 70000 miles of Earth Friday, closer than moon\" \n", "[162] \"Virtual Telescope's WebTV - The Virtual Telescope Project 2.0\" \n", "[163] \"A Bus-Size Asteroid Will Whiz by Earth Friday\" \n", "[164] \"Advanced GOES satellites launched to improve weather forecasting\" \n", "[165] \"Watch live as NASA launches the future of weather forecasting\" \n", "[166] \"NASA launches advanced weather satellite for western US\" \n", "[167] \"Watch NOAA's GOES-S Weather Satellite Launch, Live\" \n", "[168] \"Flagship US space telescope facing further delays\" \n", "[169] \"NASA's James Webb Telescope Likely To Be Delayed Yet Again\" \n", "[170] \"NASA's Hubble successor may miss its launch window\" \n", "[171] \"Nuts, Especially Tree Nuts, and Improved CRC Survival\" \n", "[172] \"Eating Nuts Could Lower Colon Cancer Reoccurrence\" \n", "[173] \"Nuts may be key to fighting this common cancer\" \n", "[174] \"Doctors: More People Want Nose Jobs To Make Selfies Look Better\" \n", "[175] \"AAFPRS - Media Resources - Statistics\" \n", "[176] \"Selfies distort faces like a \\\"funhouse mirror,\\\" study finds\" \n", "[177] \"Think Your Nose Is Too Big? Selfies Might Be to Blame\" \n", "[178] \"Selfies make your nose look 30% bigger, study says\" \n", "[179] \"A teen was told he likely had the flu. 
It turned out to be late-stage cancer.\" \n", "[180] \"Helping Hunter Fight Cancer | Medical Expenses - YouCaring\" \n", "[181] \"Florida Teen Initially Diagnosed with the Flu Discovers He Actually Has Stage 4 Cancer\" \n", "[182] \"Teen who was told he had the flu, really had stage 4 cancer\" \n", "[183] \"Tampa teen diagnosed with flu discovers he is battling stage 4 cancer\" \n", "[184] \"FDA Committee Recommends 2018-2019 Influenza Vaccine Strains\" \n", "[185] \"Interim Estimates of 2017–18 Seasonal Influenza Vaccine Effectiveness — United States, February 2018 | MMWR - CDC\" \n", "[186] \"Flu deaths reach modern-day state record of 253; elderly comprise 73 percent of victims\" \n", "[187] \"What's going around? Experts explain why the 2017-2018 flu season was one of the harshest yet\" \n", "[188] \"Flu Articles, Photos, and Videos - Chicago Tribune\" \n", "[189] \"Brutal flu has killed 84 children in the US - but its spread...\" \n", "[190] \"All Children Should Have to Get the Flu Shot\" \n", "[191] \"Five myths about outbreaks\" \n" ] } ], "source": [ "url <- \"http://ivanhernandez.com/webscraping/GoogleNews/GoogleNews.html\"\n", "webpage <- read_html(url)\n", "contents <- html_nodes(webpage, \"[role='heading']\")\n", "headlines <- html_text(contents)\n", "print(headlines)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.3 Extract content from a link" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's re-download the example page to make sure our webpage variable has the correct content" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": true }, "outputs": [], "source": [ "url <- \"http://ivanhernandez.com/webscraping/example.html\"\n", "webpage <- read_html(url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We may have a link that, if clicked, directs the user to a different page. 
\n", "\n", "We can extract several pieces of content from a link, including:\n", "+ The text the user sees for the link\n", "+ The address the link directs the user to when clicked\n", "\n", "In the source code of the example page, the link is written as follows:\n", "\n", "`<a href=\"http://google.com\">This is a link to Google</a>`\n", "\n", "This link has both text: **\"This is a link to Google\"**\n", "as well as a url: http://google.com\n", "\n", "We can use either the html_text() or html_attr() function, depending on what we want to extract." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3.a Extract the text from the link" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"This is a link to Google\"\n" ] } ], "source": [ "content <- html_nodes(webpage, \"a\")\n", "linktext <- html_text(content)\n", "print(linktext)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3.b Extract the url from the link\n", "\n", "Before, we used the html_text function to get the text enclosed by the matched tag/selector.\n", "\n", "We can extract other elements from the matched tag, such as the information contained in the href attribute.\n", "\n", "To get the value of a specific attribute, use the **html_attr** function.\n", "\n", "Below, we will extract the link address, which is contained in the href attribute of the link."
] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"http://google.com\"\n" ] } ], "source": [ "content <- html_nodes(webpage, \"a\")\n", "link <- html_attr(content,'href')\n", "print(link)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 3: Scrape the Links from the White House Press Briefing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**This exercise is based on the html source from https://www.whitehouse.gov/briefing-room/press-briefings*\n", "\n", "Go to the following page and right-click on a press briefing link title (either a remark or statement):\n", "http://ivanhernandez.com/webscraping/whitehouse/breifingstatements.html\n", "\n", "**What is the tag used for the headlines? Is this tag always within another tag?**\n", "\n", "In the space below, write the code that would extract all of the links to **JUST** the press briefings from the White House website. \n", "\n", "Save the list of links as a variable called **content**\n", "\n", "From your content variable, get the link found in the href attribute. 
Save those links to a variable called **links**\n", "\n", "Use the print function to print the links\n", "\n", "*Hint on how to isolate a tag within another tag*:\n", "- *To get tags that are only found within another specific tag, separate them with a space in the html_nodes() function*\n", "- *So if the link tags (a) you want are only found within a paragraph tag (p), you would use `html_nodes(webpage, \"p a\")`*" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " [1] \"https://www.whitehouse.gov/briefings-statements/remarks-president-trump-listening-session-representatives-steel-aluminum-industry/\" \n", " [2] \"https://www.whitehouse.gov/briefings-statements/remarks-first-lady-white-house-opioids-summit/\" \n", " [3] \"https://www.whitehouse.gov/briefings-statements/readout-president-donald-j-trumps-call-president-moon-jae-republic-korea-6/\" \n", " [4] \"https://www.whitehouse.gov/briefings-statements/remarks-vice-president-pence-department-homeland-security-15th-anniversary-event/\" \n", " [5] \"https://www.whitehouse.gov/briefings-statements/national-colorectal-cancer-awareness-month/\" \n", " [6] \"https://www.whitehouse.gov/briefings-statements/president-donald-j-trump-combatting-opioid-crisis/\" \n", " [7] \"https://www.whitehouse.gov/briefings-statements/remarks-president-trump-vice-president-pence-bipartisan-members-congress-meeting-school-community-safety/\"\n", " [8] \"https://www.whitehouse.gov/briefings-statements/president-donald-j-trumps-policy-agenda-annual-report-free-fair-reciprocal-trade/\" \n", " [9] \"https://www.whitehouse.gov/briefings-statements/readout-president-donald-j-trumps-call-emir-tamim-bin-hamad-al-thani-qatar-2/\" \n", "[10] \"https://www.whitehouse.gov/briefings-statements/remarks-president-trump-ceremony-preceding-lying-honor-reverend-billy-graham/\" \n" ] } ], "source": [ "url <- 
\"http://ivanhernandez.com/webscraping/whitehouse/breifingstatements.html\"\n", "webpage <- read_html(url)\n", "content <- html_nodes(webpage, \"h2 a\")\n", "links <- html_attr(content,'href')\n", "print(links)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.4 Extract Everything Else" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's re-download the example page to make sure our webpage variable has the correct content" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": true }, "outputs": [], "source": [ "url <- \"http://ivanhernandez.com/webscraping/example.html\"\n", "webpage <- read_html(url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4.a Extract non-tagged text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes, we have content that is not contained within tags, or has an irregular structure.\n", "\n", "**Notice in the source code above that the phrase, \"This content is not inside any tag\" does not have a tag specific to it. 
It is by itself.**\n", "\n", "**We can still extract this information if we know the text/characters that come immediately before and immediately after the content.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "+ Determine the HTML that goes immediately before the content (precontent)\n", " + In this example, \"Additional Content:\" came before the text we want to extract\n", " + We will split the HTML on where it says \"Additional Content:\" and keep the text after it\n", " \n", " \n", "+ Determine the HTML that goes immediately after the content (postcontent)\n", " + In this example, \"<\" came immediately after the text we want to extract\n", " + We will split the HTML that we kept on where it says \"<\" and keep the text before it\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "The following code takes the html text and splits it at the phrase we indicate, and keeps the second half (the text after the indicated phrase of \"Additional Content:\")." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "' This content is not inside any tag\\n\\t<br><img src=\"http://ivanhernandez.com/webscraping/profile.png\">\\n</body>\\n</html>\\n'" ], "text/latex": [ "' This content is not inside any tag\\textbackslash{}n\\textbackslash{}t
\\textbackslash{}n\\textbackslash{}n\\textbackslash{}n'" ], "text/markdown": [ "' This content is not inside any tag\\n\\t<br><img src=\"http://ivanhernandez.com/webscraping/profile.png\">\\n</body>\\n</html>\\n'" ], "text/plain": [ "[1] \" This content is not inside any tag\\n\\t
\\n\\n\\n\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "webpagetext <- as.character(webpage)\n", "splittext <- strsplit(webpagetext, \"Additional Content:\")\n", "firsthalf <- unlist(splittext)[2]\n", "firsthalf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This code then takes the text that was saved, and keeps the text before the character we indicate (<)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "' This content is not inside any tag\\n\\t'" ], "text/latex": [ "' This content is not inside any tag\\textbackslash{}n\\textbackslash{}t'" ], "text/markdown": [ "' This content is not inside any tag\\n\\t'" ], "text/plain": [ "[1] \" This content is not inside any tag\\n\\t\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "splittext <- strsplit(firsthalf, \"<\")\n", "secondhalf <- unlist(splittext)[1]\n", "secondhalf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then tidy the extracted text by stripping the extra whitespace using the trimws() function" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "'This content is not inside any tag'" ], "text/latex": [ "'This content is not inside any tag'" ], "text/markdown": [ "'This content is not inside any tag'" ], "text/plain": [ "[1] \"This content is not inside any tag\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "trimws(secondhalf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4.b Download Files (Photos, CSVs, Videos)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often, webpages have file content that we want to scrape.\n", "\n", "Some examples of files we might want to scrape are:\n", "- images\n", "- documents\n", "- videos\n", "\n", "If we know the url of the file we want, we can use the 
download.file function to access it and then save it to our hard drive with a filename we provide.\n", "\n", "Below, we will find all the images on the example page. Images are contained within the \"img\" tag\n", "\n", "An example of the html code to hold an image is:\n", "\n", "`<img src=\"http://ivanhernandez.com/webscraping/profile.png\">`\n", "\n", "We can first ask for the img tags.\n", "\n", "We will then extract the link to the image, which is contained in the src attribute.\n", "\n", "We will then create a filename to save the image as on our computer.\n", "\n", "We then use the download.file function to download that url to the filename we specified in write binary (wb) mode " ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"photo downloaded\"\n" ] } ], "source": [ "images <- html_nodes(webpage,\"img\")\n", "imgurl <- html_attr(images,'src')\n", "destinationfilename <- paste0(\"downloaded_image_\", Sys.Date(), \".png\")\n", "download.file(imgurl, destinationfilename, mode = \"wb\")\n", "print(\"photo downloaded\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4.c Download Tables\n", "\n", "Although tables can be downloaded using the method we have learned, rvest has a function specifically to simplify the process of downloading html tables.\n", "\n", "If you go to the following page, you'll see an example of an html table, which already has the information structured in a rectangular, data frame-like, layout.\n", "\n", "[http://ivanhernandez.com/webscraping/Wikipedia/medaltable.html](http://ivanhernandez.com/webscraping/Wikipedia/medaltable.html)\n", "\n", "**Based on the html source code from: https://en.wikipedia.org/wiki/2018_Winter_Olympics_medal_table*" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
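<ol>\n", "\t<li><table>\n", "<thead><tr><th>Rank</th><th>NOC</th><th>Gold</th><th>Silver</th><th>Bronze</th><th>Total</th></tr></thead>\n", "<tbody>\n",
"\t<tr><td>1</td><td>Norway (NOR)</td><td>14</td><td>14</td><td>11</td><td>39</td></tr>\n",
"\t<tr><td>2</td><td>Germany (GER)</td><td>14</td><td>10</td><td>7</td><td>31</td></tr>\n",
"\t<tr><td>3</td><td>Canada (CAN)</td><td>11</td><td>8</td><td>10</td><td>29</td></tr>\n",
"\t<tr><td>4</td><td>United States (USA)</td><td>9</td><td>8</td><td>6</td><td>23</td></tr>\n",
"\t<tr><td>5</td><td>Netherlands (NED)</td><td>8</td><td>6</td><td>6</td><td>20</td></tr>\n",
"\t<tr><td>6</td><td>Sweden (SWE)</td><td>7</td><td>6</td><td>1</td><td>14</td></tr>\n",
"\t<tr><td>7</td><td>South Korea (KOR)</td><td>5</td><td>8</td><td>4</td><td>17</td></tr>\n",
"\t<tr><td>8</td><td>Switzerland (SUI)</td><td>5</td><td>6</td><td>4</td><td>15</td></tr>\n",
"\t<tr><td>9</td><td>France (FRA)</td><td>5</td><td>4</td><td>6</td><td>15</td></tr>\n",
"\t<tr><td>10</td><td>Austria (AUT)</td><td>5</td><td>3</td><td>6</td><td>14</td></tr>\n",
"\t<tr><td>11</td><td>Japan (JPN)</td><td>4</td><td>5</td><td>4</td><td>13</td></tr>\n",
"\t<tr><td>12</td><td>Italy (ITA)</td><td>3</td><td>2</td><td>5</td><td>10</td></tr>\n",
"\t<tr><td>13</td><td>Olympic Athletes from Russia (OAR)</td><td>2</td><td>6</td><td>9</td><td>17</td></tr>\n",
"\t<tr><td>14</td><td>Czech Republic (CZE)</td><td>2</td><td>2</td><td>3</td><td>7</td></tr>\n",
"\t<tr><td>15</td><td>Belarus (BLR)</td><td>2</td><td>1</td><td>0</td><td>3</td></tr>\n",
"\t<tr><td>16</td><td>China (CHN)</td><td>1</td><td>6</td><td>2</td><td>9</td></tr>\n",
"\t<tr><td>17</td><td>Slovakia (SVK)</td><td>1</td><td>2</td><td>0</td><td>3</td></tr>\n",
"\t<tr><td>18</td><td>Finland (FIN)</td><td>1</td><td>1</td><td>4</td><td>6</td></tr>\n",
"\t<tr><td>19</td><td>Great Britain (GBR)</td><td>1</td><td>0</td><td>4</td><td>5</td></tr>\n",
"\t<tr><td>20</td><td>Poland (POL)</td><td>1</td><td>0</td><td>1</td><td>2</td></tr>\n",
"\t<tr><td>21</td><td>Hungary (HUN)</td><td>1</td><td>0</td><td>0</td><td>1</td></tr>\n",
"\t<tr><td>21</td><td>Ukraine (UKR)</td><td>1</td><td>0</td><td>0</td><td>1</td></tr>\n",
"\t<tr><td>23</td><td>Australia (AUS)</td><td>0</td><td>2</td><td>1</td><td>3</td></tr>\n",
"\t<tr><td>24</td><td>Slovenia (SLO)</td><td>0</td><td>1</td><td>1</td><td>2</td></tr>\n",
"\t<tr><td>25</td><td>Belgium (BEL)</td><td>0</td><td>1</td><td>0</td><td>1</td></tr>\n",
"\t<tr><td>26</td><td>Spain (ESP)</td><td>0</td><td>0</td><td>2</td><td>2</td></tr>\n",
"\t<tr><td>26</td><td>New Zealand (NZL)</td><td>0</td><td>0</td><td>2</td><td>2</td></tr>\n",
"\t<tr><td>28</td><td>Kazakhstan (KAZ)</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>\n",
"\t<tr><td>28</td><td>Latvia (LAT)</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>\n",
"\t<tr><td>28</td><td>Liechtenstein (LIE)</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>\n",
"\t<tr><td>Total (30 NOCs)</td><td>Total (30 NOCs)</td><td>103</td><td>102</td><td>102</td><td>307</td></tr>\n",
"</tbody>\n", "</table>\n", "</li>\n", "</ol>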
\n" ], "text/latex": [ "\\begin{enumerate}\n", "\\item \\begin{tabular}{r|llllll}\n", " Rank & NOC & Gold & Silver & Bronze & Total\\\\\n", "\\hline\n", "\t 1 & Norway (NOR) & 14 & 14 & 11 & 39 \\\\\n", "\t 2 & Germany (GER) & 14 & 10 & 7 & 31 \\\\\n", "\t 3 & Canada (CAN) & 11 & 8 & 10 & 29 \\\\\n", "\t 4 & United States (USA) & 9 & 8 & 6 & 23 \\\\\n", "\t 5 & Netherlands (NED) & 8 & 6 & 6 & 20 \\\\\n", "\t 6 & Sweden (SWE) & 7 & 6 & 1 & 14 \\\\\n", "\t 7 & South Korea (KOR) & 5 & 8 & 4 & 17 \\\\\n", "\t 8 & Switzerland (SUI) & 5 & 6 & 4 & 15 \\\\\n", "\t 9 & France (FRA) & 5 & 4 & 6 & 15 \\\\\n", "\t 10 & Austria (AUT) & 5 & 3 & 6 & 14 \\\\\n", "\t 11 & Japan (JPN) & 4 & 5 & 4 & 13 \\\\\n", "\t 12 & Italy (ITA) & 3 & 2 & 5 & 10 \\\\\n", "\t 13 & Olympic Athletes from Russia (OAR) & 2 & 6 & 9 & 17 \\\\\n", "\t 14 & Czech Republic (CZE) & 2 & 2 & 3 & 7 \\\\\n", "\t 15 & Belarus (BLR) & 2 & 1 & 0 & 3 \\\\\n", "\t 16 & China (CHN) & 1 & 6 & 2 & 9 \\\\\n", "\t 17 & Slovakia (SVK) & 1 & 2 & 0 & 3 \\\\\n", "\t 18 & Finland (FIN) & 1 & 1 & 4 & 6 \\\\\n", "\t 19 & Great Britain (GBR) & 1 & 0 & 4 & 5 \\\\\n", "\t 20 & Poland (POL) & 1 & 0 & 1 & 2 \\\\\n", "\t 21 & Hungary (HUN) & 1 & 0 & 0 & 1 \\\\\n", "\t 21 & Ukraine (UKR) & 1 & 0 & 0 & 1 \\\\\n", "\t 23 & Australia (AUS) & 0 & 2 & 1 & 3 \\\\\n", "\t 24 & Slovenia (SLO) & 0 & 1 & 1 & 2 \\\\\n", "\t 25 & Belgium (BEL) & 0 & 1 & 0 & 1 \\\\\n", "\t 26 & Spain (ESP) & 0 & 0 & 2 & 2 \\\\\n", "\t 26 & New Zealand (NZL) & 0 & 0 & 2 & 2 \\\\\n", "\t 28 & Kazakhstan (KAZ) & 0 & 0 & 1 & 1 \\\\\n", "\t 28 & Latvia (LAT) & 0 & 0 & 1 & 1 \\\\\n", "\t 28 & Liechtenstein (LIE) & 0 & 0 & 1 & 1 \\\\\n", "\t Total (30 NOCs) & Total (30 NOCs) & 103 & 102 & 102 & 307 \\\\\n", "\\end{tabular}\n", "\n", "\\end{enumerate}\n" ], "text/markdown": [ "1. 
\n", "Rank | NOC | Gold | Silver | Bronze | Total | \n", "|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n", "| 1 | Norway (NOR) | 14 | 14 | 11 | 39 | \n", "| 2 | Germany (GER) | 14 | 10 | 7 | 31 | \n", "| 3 | Canada (CAN) | 11 | 8 | 10 | 29 | \n", "| 4 | United States (USA) | 9 | 8 | 6 | 23 | \n", "| 5 | Netherlands (NED) | 8 | 6 | 6 | 20 | \n", "| 6 | Sweden (SWE) | 7 | 6 | 1 | 14 | \n", "| 7 | South Korea (KOR) | 5 | 8 | 4 | 17 | \n", "| 8 | Switzerland (SUI) | 5 | 6 | 4 | 15 | \n", "| 9 | France (FRA) | 5 | 4 | 6 | 15 | \n", "| 10 | Austria (AUT) | 5 | 3 | 6 | 14 | \n", "| 11 | Japan (JPN) | 4 | 5 | 4 | 13 | \n", "| 12 | Italy (ITA) | 3 | 2 | 5 | 10 | \n", "| 13 | Olympic Athletes from Russia (OAR) | 2 | 6 | 9 | 17 | \n", "| 14 | Czech Republic (CZE) | 2 | 2 | 3 | 7 | \n", "| 15 | Belarus (BLR) | 2 | 1 | 0 | 3 | \n", "| 16 | China (CHN) | 1 | 6 | 2 | 9 | \n", "| 17 | Slovakia (SVK) | 1 | 2 | 0 | 3 | \n", "| 18 | Finland (FIN) | 1 | 1 | 4 | 6 | \n", "| 19 | Great Britain (GBR) | 1 | 0 | 4 | 5 | \n", "| 20 | Poland (POL) | 1 | 0 | 1 | 2 | \n", "| 21 | Hungary (HUN) | 1 | 0 | 0 | 1 | \n", "| 21 | Ukraine (UKR) | 1 | 0 | 0 | 1 | \n", "| 23 | Australia (AUS) | 0 | 2 | 1 | 3 | \n", "| 24 | Slovenia (SLO) | 0 | 1 | 1 | 2 | \n", "| 25 | Belgium (BEL) | 0 | 1 | 0 | 1 | \n", "| 26 | Spain (ESP) | 0 | 0 | 2 | 2 | \n", "| 26 | New Zealand (NZL) | 0 | 0 | 2 | 2 | \n", "| 28 | Kazakhstan (KAZ) | 0 | 0 | 1 | 1 | \n", "| 28 | Latvia (LAT) | 0 | 0 | 1 | 1 | \n", "| 28 | Liechtenstein (LIE) | 0 | 0 | 1 | 1 | \n", "| Total (30 NOCs) | Total (30 NOCs) | 103 | 102 | 102 | 307 | \n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "[[1]]\n", " Rank NOC Gold Silver Bronze Total\n", "1 1 Norway (NOR) 14 14 11 39\n", "2 2 Germany (GER) 14 10 7 31\n", "3 3 Canada (CAN) 11 8 10 29\n", "4 4 United States (USA) 9 8 6 23\n", "5 5 Netherlands (NED) 8 6 6 20\n", "6 6 Sweden (SWE) 7 6 1 14\n", "7 7 South Korea (KOR) 
5 8 4 17\n", "8 8 Switzerland (SUI) 5 6 4 15\n", "9 9 France (FRA) 5 4 6 15\n", "10 10 Austria (AUT) 5 3 6 14\n", "11 11 Japan (JPN) 4 5 4 13\n", "12 12 Italy (ITA) 3 2 5 10\n", "13 13 Olympic Athletes from Russia (OAR) 2 6 9 17\n", "14 14 Czech Republic (CZE) 2 2 3 7\n", "15 15 Belarus (BLR) 2 1 0 3\n", "16 16 China (CHN) 1 6 2 9\n", "17 17 Slovakia (SVK) 1 2 0 3\n", "18 18 Finland (FIN) 1 1 4 6\n", "19 19 Great Britain (GBR) 1 0 4 5\n", "20 20 Poland (POL) 1 0 1 2\n", "21 21 Hungary (HUN) 1 0 0 1\n", "22 21 Ukraine (UKR) 1 0 0 1\n", "23 23 Australia (AUS) 0 2 1 3\n", "24 24 Slovenia (SLO) 0 1 1 2\n", "25 25 Belgium (BEL) 0 1 0 1\n", "26 26 Spain (ESP) 0 0 2 2\n", "27 26 New Zealand (NZL) 0 0 2 2\n", "28 28 Kazakhstan (KAZ) 0 0 1 1\n", "29 28 Latvia (LAT) 0 0 1 1\n", "30 28 Liechtenstein (LIE) 0 0 1 1\n", "31 Total (30 NOCs) Total (30 NOCs) 103 102 102 307\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "url <- \"http://ivanhernandez.com/webscraping/Wikipedia/medaltable.html\"\n", "webpage <- read_html(url)\n", "tables <- html_nodes(webpage,\"table\")\n", "\n", "# get the second table and fill in any irregularities\n", "html_table(tables[2],header = TRUE,fill=TRUE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Saving the content" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we have extracted the content, we need to preserve the information for subsequent analysis.\n", "\n", "You can save the content you extracted in a\n", "+ list (within R's memory)\n", "+ text file (on your hard drive)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"Webpage downloaded\"\n" ] } ], "source": [ "url <- \"http://ivanhernandez.com/webscraping/example.html\"\n", "webpage <- read_html(url)\n", "print(\"Webpage downloaded\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Saving the 
content to a vector/list" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"This text is inside a div tag, whose class is equal to box1\"\n", "[2] \"This text is inside a div tag, whose class is equal to box2\"\n", "[3] \"This text is inside a span tag, whose id is equal to box3\" \n" ] } ], "source": [ "data <- c()\n", "\n", "content <- html_node(webpage, \"div[class='box1']\")\n", "content_text <- html_text(content)\n", "data <- c(data, content_text)\n", "\n", "content2 <- html_node(webpage, \"div[class='box2']\")\n", "content_text2 <- html_text(content2)\n", "data <- c(data, content_text2)\n", "\n", "content3 <- html_node(webpage, \"span[id='box3']\")\n", "content_text3 <- html_text(content3)\n", "data <- c(data, content_text3)\n", "\n", "print(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Saving the contents to a text file" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"Contents were saved\"\n" ] } ], "source": [ "contents <- html_nodes(webpage, \"div\")\n", "contents_text <- html_text(contents)\n", "write.table(contents_text,\"data.txt\",row.names = F,col.names = F)\n", "print(\"Contents were saved\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Viewing the contents of a text file" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<ol>\n", "\t<li>This text is inside a div tag, whose class is equal to box1</li>\n", "\t<li>This text is inside a div tag, whose class is equal to box2</li>\n", "\t<li>This text is inside a div tag whose role is equal to extra</li>\n", "</ol>
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item This text is inside a div tag, whose class is equal to box1\n", "\\item This text is inside a div tag, whose class is equal to box2\n", "\\item This text is inside a div tag whose role is equal to extra\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. This text is inside a div tag, whose class is equal to box1\n", "2. This text is inside a div tag, whose class is equal to box2\n", "3. This text is inside a div tag whose role is equal to extra\n", "\n", "\n" ], "text/plain": [ "[1] This text is inside a div tag, whose class is equal to box1\n", "[2] This text is inside a div tag, whose class is equal to box2\n", "[3] This text is inside a div tag whose role is equal to extra \n", "3 Levels: This text is inside a div tag, whose class is equal to box1 ..." ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "file_contents <- read.table(\"data.txt\")\n", "file_contents$V1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: Loop through many pages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### When you only want a single piece of information from each page\n", "\n", "\n", "Below, we know that we want to access the health section, the politics section, and the science section of the New York Times. 
If we know ahead of time how the url has to change to reach each page, we can:\n", "\n", "+ Start with a list of what we want to add to the url\n", "+ Make an empty list to hold all of the data we collect\n", "+ Go through each item in the list\n", "+ Make a url by combining a baseline url with the part we want to add\n", "+ Download the page and extract the data\n", "+ Clean/format the text to remove extra whitespace\n", "+ Add the data to the list that holds all of the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code below accesses the following pages:\n", "- [http://ivanhernandez.com/webscraping/NYT/health.html](http://ivanhernandez.com/webscraping/NYT/health.html)\n", "\n", "- [http://ivanhernandez.com/webscraping/NYT/politics.html](http://ivanhernandez.com/webscraping/NYT/politics.html)\n", "\n", "- [http://ivanhernandez.com/webscraping/NYT/science.html](http://ivanhernandez.com/webscraping/NYT/science.html)\n", "\n", "It does so by keeping the url suffixes in a vector.\n", "\n", "It then iterates through the suffixes and joins each one to the url prefix (http://ivanhernandez.com/webscraping/NYT/)\n", "\n", "It then downloads each page and extracts the headline elements.\n", "\n", "The text of those headlines is extracted and trimmed of surrounding whitespace.\n", "\n", "Each cleaned headline is added to the headlines vector\n", "\n", "*Based on the html from:*\n", "- *https://www.nytimes.com/section/health*\n", "- *https://www.nytimes.com/section/politics*\n", "- *https://www.nytimes.com/section/science*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"Loop finished\"\n" ] } ], "source": [ "pages <- c(\"health.html\",\"politics.html\",\"science.html\")\n", "\n", "headlines <- c()\n", "for (topic in pages){\n", " url <- paste(\"http://ivanhernandez.com/webscraping/NYT/\", topic, 
sep=\"\")\n", " webpage <- read_html(url)\n", " titles <- html_nodes(webpage, \"h2[class='headline']\")\n", " titles_text <- html_text(titles)\n", " trimmed_titles <- trimws(titles_text)\n", " headlines <- c(headlines,trimmed_titles)\n", " }\n", "print(\"Loop finished\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "View the result of the loop" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " [1] \"They’re Hosting Parasitic Worms in Their Bodies to Help Treat a Neglected Disease\" \n", " [2] \"Patients Eagerly Awaited a Generic Drug. Then They Saw the Price.\" \n", " [3] \"The Challenge of Doctor-Patient Relations in the Internet Age\" \n", " [4] \"Sneeze Into Your Elbow, Not Your Hand. Please.\" \n", " [5] \"Do You Have What It Takes to Be an Olympian?\" \n", " [6] \"Need a Date? First, Get a Dog\" \n", " [7] \"More Fitness, Less Fatness\" \n", " [8] \"F.D.A. to Expand Medication-Assisted Therapy for Opioid Addicts\" \n", " [9] \"Opening Mental Hospitals Unlikely to Prevent Mass Shootings, Experts Say\" \n", " [10] \"C-Sections and Gut Bacteria May Contribute to Overweight Kids\" \n", " [11] \"You Get Thirsty and Drink. How Does Your Brain Signal You’ve Had Enough?\" \n", " [12] \"‘Obesity Paradox’ Fails to Hold Up in Study\" \n", " [13] \"Arthur J. Moss, Who Pioneered Heart Treatments, Dies at 86\" \n", " [14] \"Bill Banning Circumcision in Iceland Alarms Religious Groups\" \n", " [15] \"Ground Control to Major Mom\" \n", " [16] \"Inside Wounded Flies, Fat Cells Race to the Rescue\" \n", " [17] \"What Happens When You Let Babies Feed Themselves?\" \n", " [18] \"Trump Blames Video Games for Mass Shootings. Researchers Disagree.\" \n", " [19] \"For Executives, Addiction Recovery in the Lap of Luxury\" \n", " [20] \"U.K. 
Moves Toward Making Adults Presumed Organ Donors\" \n", " [21] \"The Weekly Health Quiz: Weight Loss, Memory and a Cancer Cure\" \n", " [22] \"Can Being Cold Make You Sick?\" \n", " [23] \"Olympic Cross-Country Skiers Eat 8,000 Calories a Day. It’s Exhausting.\" \n", " [24] \"Measles Cases in Europe Quadrupled in 2017\" \n", " [25] \"Catherine Wolf, 70, Dies; Studied How People and Computers Interact\" \n", " [26] \"How Our Beliefs Can Shape Our Waistlines\" \n", " [27] \"Eating Fast May Raise Obesity Risk\" \n", " [28] \"Black Lung Disease Comes Storming Back in Coal Country\" \n", " [29] \"Opioids Tied to Risk of Fatal Infections\" \n", " [30] \"C-Sections and Gut Bacteria May Contribute to Overweight Kids\" \n", " [31] \"You Get Thirsty and Drink. How Does Your Brain Signal You’ve Had Enough?\" \n", " [32] \"‘Obesity Paradox’ Fails to Hold Up in Study\" \n", " [33] \"Arthur J. Moss, Who Pioneered Heart Treatments, Dies at 86\" \n", " [34] \"Bill Banning Circumcision in Iceland Alarms Religious Groups\" \n", " [35] \"Ground Control to Major Mom\" \n", " [36] \"Inside Wounded Flies, Fat Cells Race to the Rescue\" \n", " [37] \"What Happens When You Let Babies Feed Themselves?\" \n", " [38] \"Trump Blames Video Games for Mass Shootings. Researchers Disagree.\" \n", " [39] \"For Executives, Addiction Recovery in the Lap of Luxury\" \n", " [40] \"U.K. Moves Toward Making Adults Presumed Organ Donors\" \n", " [41] \"The Weekly Health Quiz: Weight Loss, Memory and a Cancer Cure\" \n", " [42] \"Can Being Cold Make You Sick?\" \n", " [43] \"Olympic Cross-Country Skiers Eat 8,000 Calories a Day. 
It’s Exhausting.\" \n", " [44] \"Measles Cases in Europe Quadrupled in 2017\" \n", " [45] \"Catherine Wolf, 70, Dies; Studied How People and Computers Interact\" \n", " [46] \"How Our Beliefs Can Shape Our Waistlines\" \n", " [47] \"Eating Fast May Raise Obesity Risk\" \n", " [48] \"Black Lung Disease Comes Storming Back in Coal Country\" \n", " [49] \"Opioids Tied to Risk of Fatal Infections\" \n", " [50] \"Senate Intelligence Leaders Say House G.O.P. Leaked a Senator’s Texts\" \n", " [51] \"Mnuchin Blocks U.C.L.A. From Releasing Video of Students Heckling Him\" \n", " [52] \"Trump Stuns Lawmakers With Seeming Embrace of Comprehensive Gun Control\" \n", " [53] \"Once Again, Push for Gun Control Collides With Political Reality\" \n", " [54] \"Happy Hour at the Hay-Adams With Nigel Farage, Brexit’s Bad Boy\" \n", " [55] \"The Real Risks of Trump’s Steel and Aluminum Tariffs\" \n", " [56] \"Jury Begins Deliberating in Percoco Corruption Trial\" \n", " [57] \"Trump Is ‘Losing a Limb’ With the Departure of Hope Hicks\" \n", " [58] \"Trump Wants to Arm Teachers. These Schools Already Do.\" \n", " [59] \"Washington State Is Set to Vote on a Carbon Tax. For the Governor, It’s a Gamble.\" \n", " [60] \"U.S. Ambassador to Mexico to Quit Amid Tense Relations Under Trump\" \n", " [61] \"Trump Targets MS-13, a Violent Menace, if Not the One He Portrays\" \n", " [62] \"Spurned by U.S. and Facing Danger Back Home, Iranian Christians Fear the Worst\" \n", " [63] \"Parts Suppliers Call for Cleaner Cars, Splitting With Their Main Customers: Automakers\" \n", " [64] \"President Trump’s Contradictory, and Sometimes False, Comments About Gun Policy to Lawmakers\"\n", " [65] \"China Envoy Seeks to Defuse Tensions With U.S. as a Trade War Brews\" \n", " [66] \"Exxon Mobil Scraps a Russian Deal, Stymied by Sanctions\" \n", " [67] \"U.S. 
Banks on Diplomacy With North Korea, but Moves Ahead on Military Plans\" \n", " [68] \"Is Bitcoin a Waste of Electricity, or Something Worse?\" \n", " [69] \"Walmart to Raise Age to Buy Guns and Ammunition to 21\" \n", " [70] \"Britain Presses U.S. to Avoid Death Penalty for ISIS Suspects\" \n", " [71] \"Stranded at Guantánamo, a Cooperative Detainee Criticizes Saudi Arabia\" \n", " [72] \"A Gun-Owning Trump Fan’s New Crusade: Clean Energy\" \n", " [73] \"Hope Hicks to Leave Post as White House Communications Director\" \n", " [74] \"Happy Hour at the Hay-Adams With Nigel Farage, Brexit’s Bad Boy\" \n", " [75] \"The Real Risks of Trump’s Steel and Aluminum Tariffs\" \n", " [76] \"Jury Begins Deliberating in Percoco Corruption Trial\" \n", " [77] \"Trump Is ‘Losing a Limb’ With the Departure of Hope Hicks\" \n", " [78] \"Trump Wants to Arm Teachers. These Schools Already Do.\" \n", " [79] \"Washington State Is Set to Vote on a Carbon Tax. For the Governor, It’s a Gamble.\" \n", " [80] \"U.S. Ambassador to Mexico to Quit Amid Tense Relations Under Trump\" \n", " [81] \"Trump Targets MS-13, a Violent Menace, if Not the One He Portrays\" \n", " [82] \"Spurned by U.S. and Facing Danger Back Home, Iranian Christians Fear the Worst\" \n", " [83] \"Parts Suppliers Call for Cleaner Cars, Splitting With Their Main Customers: Automakers\" \n", " [84] \"President Trump’s Contradictory, and Sometimes False, Comments About Gun Policy to Lawmakers\"\n", " [85] \"China Envoy Seeks to Defuse Tensions With U.S. as a Trade War Brews\" \n", " [86] \"Exxon Mobil Scraps a Russian Deal, Stymied by Sanctions\" \n", " [87] \"U.S. Banks on Diplomacy With North Korea, but Moves Ahead on Military Plans\" \n", " [88] \"Is Bitcoin a Waste of Electricity, or Something Worse?\" \n", " [89] \"Walmart to Raise Age to Buy Guns and Ammunition to 21\" \n", " [90] \"Britain Presses U.S. 
to Avoid Death Penalty for ISIS Suspects\" \n", " [91] \"Stranded at Guantánamo, a Cooperative Detainee Criticizes Saudi Arabia\" \n", " [92] \"A Gun-Owning Trump Fan’s New Crusade: Clean Energy\" \n", " [93] \"Hope Hicks to Leave Post as White House Communications Director\" \n", " [94] \"When Did Americans Stop Marrying Their Cousins? Ask the World’s Largest Family Tree\" \n", " [95] \"You Get Thirsty and Drink. How Does Your Brain Signal You’ve Had Enough?\" \n", " [96] \"When Stars Were Born: Earliest Starlight’s Effects Are Detected\" \n", " [97] \"No Sign of Newborn North Atlantic Whales This Breeding Season\" \n", " [98] \"Barbra Streisand Cloned Her Dog. For $50,000, You Can Clone Yours.\" \n", " [99] \"Why Scientists Love to Study Dogs (and Often Ignore Cats)\" \n", "[100] \"King Penguins Are Endangered by Warmer Seas\" \n", "[101] \"For Fiddler Crabs, ‘Size Does Matter’\" \n", "[102] \"Neanderthals, the World’s First Misunderstood Artists\" \n", "[103] \"They’re Hosting Parasitic Worms in Their Bodies to Help Treat a Neglected Disease\" \n", "[104] \"Inside Wounded Flies, Fat Cells Race to the Rescue\" \n", "[105] \"A 3-D Look Inside the Tasmanian Tiger’s Pouch, Long After Extinction\" \n", "[106] \"The Chambered Nautilus Is the Ocean’s Most Efficient Jet Engine\" \n", "[107] \"For Vampire Bats to Live on Blood, It Takes Guts\" \n", "[108] \"Washington State Is Set to Vote on a Carbon Tax. For the Governor, It’s a Gamble.\" \n", "[109] \"Cellphones on the Moon? Not So Fast\" \n", "[110] \"Parts Suppliers Call for Cleaner Cars, Splitting With Their Main Customers: Automakers\" \n", "[111] \"Paul Allen Wants to Teach Machines Common Sense\" \n", "[112] \"Arthur J. Moss, Who Pioneered Heart Treatments, Dies at 86\" \n", "[113] \"Dutch Supermarket Introduces Plastic-Free Aisle\" \n", "[114] \"German Court Rules Cities Can Ban Vehicles to Tackle Air Pollution\" \n", "[115] \"Why Build Kenya’s First Coal Plant? 
Hint: Think China\" \n", "[116] \"Seeds Only a Plant Breeder Could Love, Until Now\" \n", "[117] \"Robot Claw Shows Intricacies of Crab Courtship\" \n", "[118] \"California Scraps Safety Driver Rules for Self-Driving Cars\" \n", "[119] \"Scientists Fear for Colombia’s ‘Melted Rainbow’\" \n", "[120] \"Uncovering the Secrets of the ‘Girl With a Pearl Earring’\" \n", "[121] \"Saving Koalas, and Other Marsupials, With Milk Almost as Good as Mom’s\" \n", "[122] \"After Years of Fighting, Idaho Retains Climate Change in Its Education Guidelines\" \n", "[123] \"Patients Eagerly Awaited a Generic Drug. Then They Saw the Price.\" \n", "[124] \"Geothermal Energy Grows in Kenya\" \n", "[125] \"The Ocean Breathes So We Can, Too\" \n", "[126] \"Measles Cases in Europe Quadrupled in 2017\" \n", "[127] \"Opening Mental Hospitals Unlikely to Prevent Mass Shootings, Experts Say\" \n", "[128] \"Washington State Is Set to Vote on a Carbon Tax. For the Governor, It’s a Gamble.\" \n", "[129] \"Cellphones on the Moon? Not So Fast\" \n", "[130] \"Parts Suppliers Call for Cleaner Cars, Splitting With Their Main Customers: Automakers\" \n", "[131] \"Paul Allen Wants to Teach Machines Common Sense\" \n", "[132] \"Arthur J. Moss, Who Pioneered Heart Treatments, Dies at 86\" \n", "[133] \"Dutch Supermarket Introduces Plastic-Free Aisle\" \n", "[134] \"German Court Rules Cities Can Ban Vehicles to Tackle Air Pollution\" \n", "[135] \"Why Build Kenya’s First Coal Plant? 
Hint: Think China\" \n", "[136] \"Seeds Only a Plant Breeder Could Love, Until Now\" \n", "[137] \"Robot Claw Shows Intricacies of Crab Courtship\" \n", "[138] \"California Scraps Safety Driver Rules for Self-Driving Cars\" \n", "[139] \"Scientists Fear for Colombia’s ‘Melted Rainbow’\" \n", "[140] \"Uncovering the Secrets of the ‘Girl With a Pearl Earring’\" \n", "[141] \"Saving Koalas, and Other Marsupials, With Milk Almost as Good as Mom’s\" \n", "[142] \"After Years of Fighting, Idaho Retains Climate Change in Its Education Guidelines\" \n", "[143] \"Patients Eagerly Awaited a Generic Drug. Then They Saw the Price.\" \n", "[144] \"Geothermal Energy Grows in Kenya\" \n", "[145] \"The Ocean Breathes So We Can, Too\" \n", "[146] \"Measles Cases in Europe Quadrupled in 2017\" \n", "[147] \"Opening Mental Hospitals Unlikely to Prevent Mass Shootings, Experts Say\" \n" ] } ], "source": [ "print(headlines)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Job Listings on Indeed\n", "\n", "For the following exercise, scrape the titles of the job listings returned for specific queries on Indeed.com\n", "\n", "You should\n", "- create a vector called \"pages\" that holds the following pages: spss.html, psychometrics.html, and marketing.html\n", "- make a variable called \"jobtitles\" that holds an empty vector\n", "- create a for-loop that goes through each page and\n", "- creates a dynamic url that combines \"http://ivanhernandez.com/webscraping/Indeed/\" with the specific page address\n", "- then extracts the job title elements from the page (see: http://ivanhernandez.com/webscraping/Indeed/spss.html for an example page)\n", "- extracts the text from the titles\n", "- appends the text to the jobtitles vector\n", "\n", "When the loop finishes running, use the print() function to print the job titles you collected" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " [1] \"They’re Hosting Parasitic Worms in Their 
Bodies to Help Treat a Neglected Disease\" \n", " [2] \"Patients Eagerly Awaited a Generic Drug. Then They Saw the Price.\" \n", " [3] \"The Challenge of Doctor-Patient Relations in the Internet Age\" \n", " [4] \"Sneeze Into Your Elbow, Not Your Hand. Please.\" \n", " [5] \"Do You Have What It Takes to Be an Olympian?\" \n", " [6] \"Need a Date? First, Get a Dog\" \n", " [7] \"More Fitness, Less Fatness\" \n", " [8] \"F.D.A. to Expand Medication-Assisted Therapy for Opioid Addicts\" \n", " [9] \"Opening Mental Hospitals Unlikely to Prevent Mass Shootings, Experts Say\" \n", " [10] \"C-Sections and Gut Bacteria May Contribute to Overweight Kids\" \n", " [11] \"You Get Thirsty and Drink. How Does Your Brain Signal You’ve Had Enough?\" \n", " [12] \"‘Obesity Paradox’ Fails to Hold Up in Study\" \n", " [13] \"Arthur J. Moss, Who Pioneered Heart Treatments, Dies at 86\" \n", " [14] \"Bill Banning Circumcision in Iceland Alarms Religious Groups\" \n", " [15] \"Ground Control to Major Mom\" \n", " [16] \"Inside Wounded Flies, Fat Cells Race to the Rescue\" \n", " [17] \"What Happens When You Let Babies Feed Themselves?\" \n", " [18] \"Trump Blames Video Games for Mass Shootings. Researchers Disagree.\" \n", " [19] \"For Executives, Addiction Recovery in the Lap of Luxury\" \n", " [20] \"U.K. Moves Toward Making Adults Presumed Organ Donors\" \n", " [21] \"The Weekly Health Quiz: Weight Loss, Memory and a Cancer Cure\" \n", " [22] \"Can Being Cold Make You Sick?\" \n", " [23] \"Olympic Cross-Country Skiers Eat 8,000 Calories a Day. 
It’s Exhausting.\" \n", " [24] \"Measles Cases in Europe Quadrupled in 2017\" \n", " [25] \"Catherine Wolf, 70, Dies; Studied How People and Computers Interact\" \n", " [26] \"How Our Beliefs Can Shape Our Waistlines\" \n", " [27] \"Eating Fast May Raise Obesity Risk\" \n", " [28] \"Black Lung Disease Comes Storming Back in Coal Country\" \n", " [29] \"Opioids Tied to Risk of Fatal Infections\" \n", " [30] \"C-Sections and Gut Bacteria May Contribute to Overweight Kids\" \n", " [31] \"You Get Thirsty and Drink. How Does Your Brain Signal You’ve Had Enough?\" \n", " [32] \"‘Obesity Paradox’ Fails to Hold Up in Study\" \n", " [33] \"Arthur J. Moss, Who Pioneered Heart Treatments, Dies at 86\" \n", " [34] \"Bill Banning Circumcision in Iceland Alarms Religious Groups\" \n", " [35] \"Ground Control to Major Mom\" \n", " [36] \"Inside Wounded Flies, Fat Cells Race to the Rescue\" \n", " [37] \"What Happens When You Let Babies Feed Themselves?\" \n", " [38] \"Trump Blames Video Games for Mass Shootings. Researchers Disagree.\" \n", " [39] \"For Executives, Addiction Recovery in the Lap of Luxury\" \n", " [40] \"U.K. Moves Toward Making Adults Presumed Organ Donors\" \n", " [41] \"The Weekly Health Quiz: Weight Loss, Memory and a Cancer Cure\" \n", " [42] \"Can Being Cold Make You Sick?\" \n", " [43] \"Olympic Cross-Country Skiers Eat 8,000 Calories a Day. It’s Exhausting.\" \n", " [44] \"Measles Cases in Europe Quadrupled in 2017\" \n", " [45] \"Catherine Wolf, 70, Dies; Studied How People and Computers Interact\" \n", " [46] \"How Our Beliefs Can Shape Our Waistlines\" \n", " [47] \"Eating Fast May Raise Obesity Risk\" \n", " [48] \"Black Lung Disease Comes Storming Back in Coal Country\" \n", " [49] \"Opioids Tied to Risk of Fatal Infections\" \n", " [50] \"Senate Intelligence Leaders Say House G.O.P. Leaked a Senator’s Texts\" \n", " [51] \"Mnuchin Blocks U.C.L.A. 
From Releasing Video of Students Heckling Him\" \n", " [52] \"Trump Stuns Lawmakers With Seeming Embrace of Comprehensive Gun Control\" \n", " [53] \"Once Again, Push for Gun Control Collides With Political Reality\" \n", " [54] \"Happy Hour at the Hay-Adams With Nigel Farage, Brexit’s Bad Boy\" \n", " [55] \"The Real Risks of Trump’s Steel and Aluminum Tariffs\" \n", " [56] \"Jury Begins Deliberating in Percoco Corruption Trial\" \n", " [57] \"Trump Is ‘Losing a Limb’ With the Departure of Hope Hicks\" \n", " [58] \"Trump Wants to Arm Teachers. These Schools Already Do.\" \n", " [59] \"Washington State Is Set to Vote on a Carbon Tax. For the Governor, It’s a Gamble.\" \n", " [60] \"U.S. Ambassador to Mexico to Quit Amid Tense Relations Under Trump\" \n", " [61] \"Trump Targets MS-13, a Violent Menace, if Not the One He Portrays\" \n", " [62] \"Spurned by U.S. and Facing Danger Back Home, Iranian Christians Fear the Worst\" \n", " [63] \"Parts Suppliers Call for Cleaner Cars, Splitting With Their Main Customers: Automakers\" \n", " [64] \"President Trump’s Contradictory, and Sometimes False, Comments About Gun Policy to Lawmakers\"\n", " [65] \"China Envoy Seeks to Defuse Tensions With U.S. as a Trade War Brews\" \n", " [66] \"Exxon Mobil Scraps a Russian Deal, Stymied by Sanctions\" \n", " [67] \"U.S. Banks on Diplomacy With North Korea, but Moves Ahead on Military Plans\" \n", " [68] \"Is Bitcoin a Waste of Electricity, or Something Worse?\" \n", " [69] \"Walmart to Raise Age to Buy Guns and Ammunition to 21\" \n", " [70] \"Britain Presses U.S. 
to Avoid Death Penalty for ISIS Suspects\" \n", " [71] \"Stranded at Guantánamo, a Cooperative Detainee Criticizes Saudi Arabia\" \n", " [72] \"A Gun-Owning Trump Fan’s New Crusade: Clean Energy\" \n", " [73] \"Hope Hicks to Leave Post as White House Communications Director\" \n", " [74] \"Happy Hour at the Hay-Adams With Nigel Farage, Brexit’s Bad Boy\" \n", " [75] \"The Real Risks of Trump’s Steel and Aluminum Tariffs\" \n", " [76] \"Jury Begins Deliberating in Percoco Corruption Trial\" \n", " [77] \"Trump Is ‘Losing a Limb’ With the Departure of Hope Hicks\" \n", " [78] \"Trump Wants to Arm Teachers. These Schools Already Do.\" \n", " [79] \"Washington State Is Set to Vote on a Carbon Tax. For the Governor, It’s a Gamble.\" \n", " [80] \"U.S. Ambassador to Mexico to Quit Amid Tense Relations Under Trump\" \n", " [81] \"Trump Targets MS-13, a Violent Menace, if Not the One He Portrays\" \n", " [82] \"Spurned by U.S. and Facing Danger Back Home, Iranian Christians Fear the Worst\" \n", " [83] \"Parts Suppliers Call for Cleaner Cars, Splitting With Their Main Customers: Automakers\" \n", " [84] \"President Trump’s Contradictory, and Sometimes False, Comments About Gun Policy to Lawmakers\"\n", " [85] \"China Envoy Seeks to Defuse Tensions With U.S. as a Trade War Brews\" \n", " [86] \"Exxon Mobil Scraps a Russian Deal, Stymied by Sanctions\" \n", " [87] \"U.S. Banks on Diplomacy With North Korea, but Moves Ahead on Military Plans\" \n", " [88] \"Is Bitcoin a Waste of Electricity, or Something Worse?\" \n", " [89] \"Walmart to Raise Age to Buy Guns and Ammunition to 21\" \n", " [90] \"Britain Presses U.S. to Avoid Death Penalty for ISIS Suspects\" \n", " [91] \"Stranded at Guantánamo, a Cooperative Detainee Criticizes Saudi Arabia\" \n", " [92] \"A Gun-Owning Trump Fan’s New Crusade: Clean Energy\" \n", " [93] \"Hope Hicks to Leave Post as White House Communications Director\" \n", " [94] \"When Did Americans Stop Marrying Their Cousins? 
Ask the World’s Largest Family Tree\" \n", " [95] \"You Get Thirsty and Drink. How Does Your Brain Signal You’ve Had Enough?\" \n", " [96] \"When Stars Were Born: Earliest Starlight’s Effects Are Detected\" \n", " [97] \"No Sign of Newborn North Atlantic Whales This Breeding Season\" \n", " [98] \"Barbra Streisand Cloned Her Dog. For $50,000, You Can Clone Yours.\" \n", " [99] \"Why Scientists Love to Study Dogs (and Often Ignore Cats)\" \n", "[100] \"King Penguins Are Endangered by Warmer Seas\" \n", "[101] \"For Fiddler Crabs, ‘Size Does Matter’\" \n", "[102] \"Neanderthals, the World’s First Misunderstood Artists\" \n", "[103] \"They’re Hosting Parasitic Worms in Their Bodies to Help Treat a Neglected Disease\" \n", "[104] \"Inside Wounded Flies, Fat Cells Race to the Rescue\" \n", "[105] \"A 3-D Look Inside the Tasmanian Tiger’s Pouch, Long After Extinction\" \n", "[106] \"The Chambered Nautilus Is the Ocean’s Most Efficient Jet Engine\" \n", "[107] \"For Vampire Bats to Live on Blood, It Takes Guts\" \n", "[108] \"Washington State Is Set to Vote on a Carbon Tax. For the Governor, It’s a Gamble.\" \n", "[109] \"Cellphones on the Moon? Not So Fast\" \n", "[110] \"Parts Suppliers Call for Cleaner Cars, Splitting With Their Main Customers: Automakers\" \n", "[111] \"Paul Allen Wants to Teach Machines Common Sense\" \n", "[112] \"Arthur J. Moss, Who Pioneered Heart Treatments, Dies at 86\" \n", "[113] \"Dutch Supermarket Introduces Plastic-Free Aisle\" \n", "[114] \"German Court Rules Cities Can Ban Vehicles to Tackle Air Pollution\" \n", "[115] \"Why Build Kenya’s First Coal Plant? 
Hint: Think China\" \n", "[116] \"Seeds Only a Plant Breeder Could Love, Until Now\" \n", "[117] \"Robot Claw Shows Intricacies of Crab Courtship\" \n", "[118] \"California Scraps Safety Driver Rules for Self-Driving Cars\" \n", "[119] \"Scientists Fear for Colombia’s ‘Melted Rainbow’\" \n", "[120] \"Uncovering the Secrets of the ‘Girl With a Pearl Earring’\" \n", "[121] \"Saving Koalas, and Other Marsupials, With Milk Almost as Good as Mom’s\" \n", "[122] \"After Years of Fighting, Idaho Retains Climate Change in Its Education Guidelines\" \n", "[123] \"Patients Eagerly Awaited a Generic Drug. Then They Saw the Price.\" \n", "[124] \"Geothermal Energy Grows in Kenya\" \n", "[125] \"The Ocean Breathes So We Can, Too\" \n", "[126] \"Measles Cases in Europe Quadrupled in 2017\" \n", "[127] \"Opening Mental Hospitals Unlikely to Prevent Mass Shootings, Experts Say\" \n", "[128] \"Washington State Is Set to Vote on a Carbon Tax. For the Governor, It’s a Gamble.\" \n", "[129] \"Cellphones on the Moon? Not So Fast\" \n", "[130] \"Parts Suppliers Call for Cleaner Cars, Splitting With Their Main Customers: Automakers\" \n", "[131] \"Paul Allen Wants to Teach Machines Common Sense\" \n", "[132] \"Arthur J. Moss, Who Pioneered Heart Treatments, Dies at 86\" \n", "[133] \"Dutch Supermarket Introduces Plastic-Free Aisle\" \n", "[134] \"German Court Rules Cities Can Ban Vehicles to Tackle Air Pollution\" \n", "[135] \"Why Build Kenya’s First Coal Plant? 
Hint: Think China\" \n", "[136] \"Seeds Only a Plant Breeder Could Love, Until Now\" \n", "[137] \"Robot Claw Shows Intricacies of Crab Courtship\" \n", "[138] \"California Scraps Safety Driver Rules for Self-Driving Cars\" \n", "[139] \"Scientists Fear for Colombia’s ‘Melted Rainbow’\" \n", "[140] \"Uncovering the Secrets of the ‘Girl With a Pearl Earring’\" \n", "[141] \"Saving Koalas, and Other Marsupials, With Milk Almost as Good as Mom’s\" \n", "[142] \"After Years of Fighting, Idaho Retains Climate Change in Its Education Guidelines\" \n", "[143] \"Patients Eagerly Awaited a Generic Drug. Then They Saw the Price.\" \n", "[144] \"Geothermal Energy Grows in Kenya\" \n", "[145] \"The Ocean Breathes So We Can, Too\" \n", "[146] \"Measles Cases in Europe Quadrupled in 2017\" \n", "[147] \"Opening Mental Hospitals Unlikely to Prevent Mass Shootings, Experts Say\" \n", "[148] \"Creative Marketing Specialist - 4 to 5 YEARS EXPERIENCE REQU...\" \n", "[149] \"Marketing Coordinator\" \n", "[150] \"Experienced In-House Timeshare Marketing\" \n", "[151] \"Sales & Marketing Director\" \n", "[152] \"Creative Director, Global Hardware Marketing\" \n", "[153] \"Marketing Associate\" \n", "[154] \"Content Marketing Officer\" \n", "[155] \"Marketing Events Specialist\" \n", "[156] \"AR/VR Marketing strategist - Remote/Freelance/Project\" \n", "[157] \"Strategic Partner Associate, Online Partnerships Group\" \n", "[158] \"Marketing Assistant\" \n", "[159] \"Marketing Coordinator (BMD)- Based in New York City\" \n", "[160] \"Interactive and R&D Lead, Global Retail Marketing\" \n", "[161] \"Marketing Coordinator\" \n", "[162] \"Marketing Creative Specialist\" \n", "[163] \"Online Marketing Specilaist\" \n" ] } ], "source": [ "pages <- c(\"spss.html\",\"psychometrics.html\",\"marketing.html\")\n", "\n", "jobtitles <- c()\n", "for (topic in pages){\n", " url <- paste(\"http://ivanhernandez.com/webscraping/Indeed/\", topic, sep=\"\")\n", " webpage <- read_html(url)\n", " titles <- 
html_nodes(webpage, \"[data-tn-element='jobTitle']\")\n", " titles_text <- html_text(titles)\n", " jobtitles <- c(jobtitles,titles_text)\n", " }\n", "print(jobtitles)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Combining Everything\n", "\n", "There are many possibilities for what you can do. You can combine the different functions together to collect the links from one page and then use those links to collect data from the subsequent pages.\n", "\n", "The code below retrieves all of the links to job-specific pages, and then goes to each of the links and retrieves the job title and the skills listed for that job.\n", "\n", "Finally, it stores both pieces of information in a data frame, with one row per skill for every job." ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n" ], "text/latex": [ "\\begin{tabular}{r|ll}\n", " titletext & skills\\\\\n", "\\hline\n", "\t 27-2011.00 - Actors & Reading Comprehension \\\\\n", "\t 27-2011.00 - Actors & Speaking \\\\\n", "\t 27-2011.00 - Actors & Active Listening \\\\\n", "\t 27-2011.00 - Actors & Social Perceptiveness \\\\\n", "\t 27-2011.00 - Actors & Critical Thinking \\\\\n", "\t 27-2011.00 - Actors & Monitoring \\\\\n", "\t 27-2011.00 - Actors & Time Management \\\\\n", "\t 27-1011.00 - Art Directors & Active Listening \\\\\n", "\t 27-1011.00 - Art Directors & Speaking \\\\\n", "\t 27-1011.00 - Art Directors & Judgment and Decision Making \\\\\n", "\t 27-1011.00 - Art Directors & Time Management \\\\\n", "\t 27-1011.00 - Art Directors & Complex Problem Solving \\\\\n", "\t 27-1011.00 - Art Directors & Critical Thinking \\\\\n", "\t 27-1011.00 - Art Directors & Reading Comprehension \\\\\n", "\t 27-1011.00 - Art Directors & Coordination \\\\\n", "\t 27-1011.00 - Art Directors & Active Learning \\\\\n", "\t 27-1011.00 - Art Directors & Management of Personnel Resources\\\\\n", "\t 27-1011.00 - Art Directors & Persuasion \\\\\n", "\t 27-1011.00 - Art Directors & Social Perceptiveness \\\\\n", "\t 27-1011.00 - Art Directors & Monitoring \\\\\n", "\t 27-1011.00 - Art Directors & Writing \\\\\n", "\t 27-1011.00 - Art Directors & Instructing \\\\\n", "\t 27-1011.00 - Art Directors & Learning Strategies \\\\\n", "\t 27-1011.00 - Art Directors & Management of Financial Resources\\\\\n", "\t 27-1011.00 - Art Directors & Negotiation \\\\\n", "\t 27-1011.00 - Art Directors & Operations Analysis \\\\\n", "\t 27-1011.00 - Art Directors & Service Orientation \\\\\n", "\t 27-1011.00 - Art Directors & Systems Analysis \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "titletext | skills | \n", "|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n", "| 27-2011.00 - Actors | Reading Comprehension | \n", "| 27-2011.00 - Actors | Speaking 
| \n", "| 27-2011.00 - Actors | Active Listening | \n", "| 27-2011.00 - Actors | Social Perceptiveness | \n", "| 27-2011.00 - Actors | Critical Thinking | \n", "| 27-2011.00 - Actors | Monitoring | \n", "| 27-2011.00 - Actors | Time Management | \n", "| 27-1011.00 - Art Directors | Active Listening | \n", "| 27-1011.00 - Art Directors | Speaking | \n", "| 27-1011.00 - Art Directors | Judgment and Decision Making | \n", "| 27-1011.00 - Art Directors | Time Management | \n", "| 27-1011.00 - Art Directors | Complex Problem Solving | \n", "| 27-1011.00 - Art Directors | Critical Thinking | \n", "| 27-1011.00 - Art Directors | Reading Comprehension | \n", "| 27-1011.00 - Art Directors | Coordination | \n", "| 27-1011.00 - Art Directors | Active Learning | \n", "| 27-1011.00 - Art Directors | Management of Personnel Resources | \n", "| 27-1011.00 - Art Directors | Persuasion | \n", "| 27-1011.00 - Art Directors | Social Perceptiveness | \n", "| 27-1011.00 - Art Directors | Monitoring | \n", "| 27-1011.00 - Art Directors | Writing | \n", "| 27-1011.00 - Art Directors | Instructing | \n", "| 27-1011.00 - Art Directors | Learning Strategies | \n", "| 27-1011.00 - Art Directors | Management of Financial Resources | \n", "| 27-1011.00 - Art Directors | Negotiation | \n", "| 27-1011.00 - Art Directors | Operations Analysis | \n", "| 27-1011.00 - Art Directors | Service Orientation | \n", "| 27-1011.00 - Art Directors | Systems Analysis | \n", "\n", "\n" ], "text/plain": [ " titletext skills \n", "1 27-2011.00 - Actors Reading Comprehension \n", "2 27-2011.00 - Actors Speaking \n", "3 27-2011.00 - Actors Active Listening \n", "4 27-2011.00 - Actors Social Perceptiveness \n", "5 27-2011.00 - Actors Critical Thinking \n", "6 27-2011.00 - Actors Monitoring \n", "7 27-2011.00 - Actors Time Management \n", "8 27-1011.00 - Art Directors Active Listening \n", "9 27-1011.00 - Art Directors Speaking \n", "10 27-1011.00 - Art Directors Judgment and Decision Making \n", "11 27-1011.00 - 
Art Directors Time Management \n", "12 27-1011.00 - Art Directors Complex Problem Solving \n", "13 27-1011.00 - Art Directors Critical Thinking \n", "14 27-1011.00 - Art Directors Reading Comprehension \n", "15 27-1011.00 - Art Directors Coordination \n", "16 27-1011.00 - Art Directors Active Learning \n", "17 27-1011.00 - Art Directors Management of Personnel Resources\n", "18 27-1011.00 - Art Directors Persuasion \n", "19 27-1011.00 - Art Directors Social Perceptiveness \n", "20 27-1011.00 - Art Directors Monitoring \n", "21 27-1011.00 - Art Directors Writing \n", "22 27-1011.00 - Art Directors Instructing \n", "23 27-1011.00 - Art Directors Learning Strategies \n", "24 27-1011.00 - Art Directors Management of Financial Resources\n", "25 27-1011.00 - Art Directors Negotiation \n", "26 27-1011.00 - Art Directors Operations Analysis \n", "27 27-1011.00 - Art Directors Service Orientation \n", "28 27-1011.00 - Art Directors Systems Analysis " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Collect the skills listed for the first three jobs in the job family\n", "jobdata <- data.frame()\n", "url <- \"https://www.onetonline.org/find/family?f=27&g=Go\"\n", "webpage <- read_html(url)\n", "joblinks <- html_nodes(webpage, \"a[href^='https://www.onetonline.org/link/summary']\")\n", "links <- html_attr(joblinks,'href')\n", "links <- head(links, 3) # keep only the first three job links\n", "for (link in links){\n", "    jobwebpage <- read_html(link)\n", "    pagetitle <- html_nodes(jobwebpage,\"title\")\n", "    titletext <- html_text(pagetitle)\n", "    skillscontent <- html_nodes(jobwebpage, \"div.section_Skills b\")\n", "    skills <- html_text(skillscontent)\n", "    # Only add rows when the page lists at least one skill\n", "    if (length(skills) > 0){\n", "        jobdata <- rbind(jobdata,data.frame(titletext,skills))\n", "    }\n", "    Sys.sleep(3) # pause between requests so the server is not overloaded\n", "}\n", "jobdata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary: Automated Data Collection\n", "\n", "+ Decide on what content you want to scrape\n", "\n", "\n", "+ Get a list of the page URLs you want to scrape\n", "\n", "\n", "+ Make a for loop 
that goes through every single page\n", "\n", "\n", "+ Scrape the content from each page\n", "\n", "\n", "+ Save the content to a text file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Additional notes\n", "\n", "+ Use `Sys.sleep()` to pause your script between requests (for example, `Sys.sleep(10)` pauses for 10 seconds) so that you do not overload the server you are collecting data from\n", "\n", "\n", "+ Use [regular expression libraries (e.g., stringr)](http://stringr.tidyverse.org/articles/regular-expressions.html) to match patterns in the extracted content\n", "\n", "\n", "+ Respect each site's Terms of Service\n", "\n", "\n", "+ Use the [RSelenium](https://cran.r-project.org/web/packages/RSelenium/index.html) library to fill in forms on websites that require you to sign in or enter information before you can access the content\n", "\n", "\n", "+ Try different approaches: there is often more than one way to parse the HTML and collect the data" ] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }