{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Automated Data Collection With R\n", "\n", "## ** Ivan Hernandez, Ph.D** (ivan.hernandez@depaul.edu)\n", "\n", "### **DePaul University**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** The Steps of Automatically Collecting Data Online **\n", "\n", "+ Download the HTML Source of a page\n", "+ Extract the content from the HTML\n", "+ Save the Content\n", "+ Repeat the Process on A Different Page\n", "\n", "All of these steps can be automated, running indepdent of human interaction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 0: Import the needed libraries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before beginning, make sure you have R installed on your computer with the necessary libraries.\n", "\n", "If you do have R installed, a recommended development environment is [**RStudio**]( [https://www.rstudio.com/products/rstudio/download](https://www.rstudio.com/products/rstudio/download)\n", "\n", "We are first going to load a key library called `rvest`\n", "+ [rvest](https://cran.r-project.org/web/packages/rvest)\n", "\n", "The `rvest` library allows us to parse the HTML of a webpage and isolate content within HTML tags" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "library(rvest)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Download the HTML Source of a page\n", "\n", "\n", "+ 1.1 Direct R to access the webpage, and save the page's HTML to a variable\n", "\n", "+ 1.2 Pass the downloaded HTML to an HTML parser in R (rvest)\n", "\n", "+ 1.3 Examine the downloaded content\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Direct R to Access the Webpage, and Save the Page's HTML to a variable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first step of scraping information from the web is downloading the 
source code of a specific webpage.\n", "\n", "You should have an idea in mind of the page(s) that contain the information you need.\n", "\n", "Once you know which page you need to extract content from, you can direct R to download it.\n", "\n", "We are going to extract content from the example html page:\n", "\n", "http://ivanhernandez.com/webscraping/example.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code below does the following:\n", " + Creates a variable, called \"url\", that has the address of the webpage we want to scrape\n", " + Downloads the html for the webpage and saves it as a variable called \"webpage\"\n", " + Prints that the webpage has been downloaded" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"Webpage downloaded\"\n" ] } ], "source": [ "url <- \"http://ivanhernandez.com/webscraping/example.html\"\n", "webpage <- read_html(url)\n", "print(\"Webpage downloaded\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can examine the contents of the source code using the \"paste\" function, which has R print the source code of the webpage as text." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "'<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\\n<html>\\n<head>\\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\\n<title>Example Page</title>\\n</head>\\n<body>\\n\\t<h1>\\n\\t\\tThis Title is in an H1 tag\\n\\t</h1>\\n\\t<div class=\"box1\">This text is inside a div tag, whose class is equal to box1</div>\\n\\t<br><div class=\"box2\">This text is inside a div tag, whose class is equal to box2</div>\\n\\t<br><span id=\"box3\">This text is inside a span tag, whose id is equal to box3</span>\\n\\t<p id=\"box4\">This text is inside a p tag, whose id is equal to box4</p>\\n\\t<div role=\"extra\">This text is inside a div tag whose role is equal to extra</div>\\n\\t<a href=\"http://google.com\">This is a link to Google</a>\\n\\t<br><br>\\n\\tAdditional Content: This content is not inside any tag\\n\\t<br><img src=\"http://ivanhernandez.com/webscraping/profile.png\">\\n</body>\\n</html>\\n'" ], "text/latex": [ "'\\textbackslash{}n\\textbackslash{}n
\\textbackslash{}n\\textbackslash{}nThis text is inside a p tag, whose id is equal to box4
\\textbackslash{}n\\textbackslash{}tThis text is inside a p tag, whose id is equal to box4
\\n\\t*
*This exercise is based on the html source from https://finance.google.com/finance?q=aapl*\n",
"\n",
"In the space below, write the code that would extract **ONLY** the stock price for Apple.\n",
" \n",
"Some starter code is provided.\n",
" \n",
"Your code should\n",
"- specify the url\n",
"- read the url's html into variable called webpage\n",
"- indicate which tag and/or selector contains the Apple stock price, and save that information to a variable called **content**.\n",
"- save the text information from content to a variable called **price**\n",
"- remove any extra whitespace by using: `trimws(price)`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Go to this url and right-click on the price, then click \"inspect\" to inspect the source code:**\n",
"\n",
"**http://ivanhernandez.com/webscraping/GoogleFinance/aapl.html**"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"'174.97'"
],
"text/latex": [
"'174.97'"
],
"text/markdown": [
"'174.97'"
],
"text/plain": [
"[1] \"174.97\""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"url <- \"http://ivanhernandez.com/webscraping/GoogleFinance/aapl.html\"\n",
"webpage <- read_html(url)\n",
"content <- html_nodes(webpage, \"span.pr\")\n",
"price <- html_text(content)\n",
"trimws(price)"
]
},
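{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `html_text()` returns the price as a character string. If you want to do arithmetic with it, here is a minimal sketch of converting it to a number (assuming the `price` variable from the solution above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Convert the trimmed price text into a number\n",
"numericprice <- as.numeric(trimws(price))\n",
"print(numericprice)"
]
},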
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.2 Collecting Multiple Pieces of Text Data from a Page\n",
"\n",
"If there are multiple elements you want to extract content from, you can use the **`html_nodes`** function.\n",
"\n",
"Note that the name of the function is the same as before (html_node), only with an \"s\" at the end indicating you expect many matches (html_nodes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's re-download the example webpage, since the \"webpage\" variable was changed during the last activity"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"Webpage downloaded\"\n"
]
}
],
"source": [
"url <- \"http://ivanhernandez.com/webscraping/example.html\"\n",
"webpage <- read_html(url)\n",
"print(\"Webpage downloaded\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2.a When you know the tag name\n",
"\n",
"For example, there are three div tags in the HTML: one with class=box1, one with class=box2, and one with role=extra. \n",
"\n",
"If we wanted to extract all of the content within div tags at once, we can ask rvest to find all of the div tags and save their text to a variable called \"tags\""
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"tags are downloaded\"\n"
]
}
],
"source": [
"content <- html_nodes(webpage, \"div\")\n",
"tags <- html_text(content)\n",
"print(\"tags are downloaded\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `html_nodes()` function returns a list of matching nodes, and `html_text()` converts them into a character vector (an ordered collection of strings).\n",
"\n",
"We can access each item in order using a \"for loop\""
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"This text is inside a div tag, whose class is equal to box1\"\n",
"[1] \"This text is inside a div tag, whose class is equal to box2\"\n",
"[1] \"This text is inside a div tag whose role is equal to extra\"\n"
]
}
],
"source": [
"for (tag in tags){\n",
" print(tag)\n",
" }"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also use indices (1,2,3, etc.) to select specific items in that list"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"'This text is inside a div tag, whose class is equal to box2'"
],
"text/latex": [
"'This text is inside a div tag, whose class is equal to box2'"
],
"text/markdown": [
"'This text is inside a div tag, whose class is equal to box2'"
],
"text/plain": [
"[1] \"This text is inside a div tag, whose class is equal to box2\""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tags[2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get the last element, you can check how many items were retrieved with `length()` and ask for the item in that position"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"'This text is inside a div tag whose role is equal to extra'"
],
"text/latex": [
"'This text is inside a div tag whose role is equal to extra'"
],
"text/markdown": [
"'This text is inside a div tag whose role is equal to extra'"
],
"text/plain": [
"[1] \"This text is inside a div tag whose role is equal to extra\""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"numberitems <- length(tags)\n",
"tags[numberitems]"
]
},
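{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an equivalent shorthand for the two-step approach above, base R's `tail()` function can return the last item of a vector directly:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# tail(x, 1) returns the final element of the vector\n",
"tail(tags, 1)"
]
},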
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2.b When there are many different tag names"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we wanted to extract content that could be in a div, a span, or a p tag, we can list all of the tag names inside a single quoted selector (separating each one with a comma), and run the `html_nodes()` function using that selector.\n",
"\n",
"Below, we specify that we want returned any content within a div, p, or span tag."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"This text is inside a div tag, whose class is equal to box1\"\n",
"[2] \"This text is inside a div tag, whose class is equal to box2\"\n",
"[3] \"This text is inside a span tag, whose id is equal to box3\" \n",
"[4] \"This text is inside a p tag, whose id is equal to box4\" \n",
"[5] \"This text is inside a div tag whose role is equal to extra\" \n"
]
}
],
"source": [
"contents <- html_nodes(webpage, \"div , p, span\")\n",
"results <- html_text(contents)\n",
"print(results)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2.c When there are many different selector names\n",
"\n",
"If we know precisely the names of the selectors that could match (class, id, etc.), we can specify the tag name, as well as all of the id and class names we want to match, by separating each with a comma."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"This text is inside a div tag, whose class is equal to box1\"\n",
"[2] \"This text is inside a div tag, whose class is equal to box2\"\n"
]
}
],
"source": [
"contents <- html_nodes(webpage, \"div[class='box1'], div[class='box2']\")\n",
"results <- html_text(contents)\n",
"print(results)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**You can also exclude certain classes from the results if you use the `:not()` operator** and indicate in parentheses what not to match.\n",
"\n",
"Chain the exclusions together, prefixing each `:not()` with a colon."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"This text is inside a div tag whose role is equal to extra\"\n"
]
}
],
"source": [
"contents <- html_nodes(webpage, \"div:not([class='box1']):not([class='box2'])\")\n",
"results <- html_text(contents)\n",
"print(results)"
]
},
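{
"cell_type": "markdown",
"metadata": {},
"source": [
"`:not()` also accepts simpler tests, such as excluding any tag that has a given attribute at all. Below is a sketch (not run here) that keeps only the div tags with no class attribute, which on the example page should be the div whose role is equal to extra:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Select div tags that have no class attribute at all\n",
"contents <- html_nodes(webpage, \"div:not([class])\")\n",
"results <- html_text(contents)\n",
"print(results)"
]
},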
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2.d When you only know the tag and part of the class name"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We may be in a situation where we only know part of what we want to match (e.g., we want something that is in between div tags, with a class that contains the word \"box\").\n",
"\n",
"Using CSS attribute selectors, we can partially match tags or classes/ids.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you know what the name STARTS with, use the ^ operator"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"This text is inside a div tag, whose class is equal to box1\"\n",
"[2] \"This text is inside a div tag, whose class is equal to box2\"\n"
]
}
],
"source": [
"contents <- html_nodes(webpage, \"div[class^='box']\")\n",
"results <- html_text(contents)\n",
"print(results)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you only know part of the name that it CONTAINS, use the * operator"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"This text is inside a div tag, whose class is equal to box1\"\n"
]
}
],
"source": [
"contents <- html_nodes(webpage, \"div[class*='1']\")\n",
"results <- html_text(contents)\n",
"print(results)"
]
},
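{
"cell_type": "markdown",
"metadata": {},
"source": [
"CSS also offers a $ operator for matching what a name ENDS with. Below is a sketch using the same example page (not run here); `div[class$='2']` should match the div whose class is box2:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Select div tags whose class attribute ends with '2'\n",
"contents <- html_nodes(webpage, \"div[class$='2']\")\n",
"results <- html_text(contents)\n",
"print(results)"
]
},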
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 2: Scrape the Headlines from Google News\n",
"\n",
"*This exercise is based on the html source from https://news.google.com*\n",
"\n",
"**Go to the following page and right-click and inspect the headline titles.**\n",
"\n",
"**http://ivanhernandez.com/webscraping/GoogleNews/GoogleNews.html**\n",
"\n",
"**What is the selector that identifies headlines?**\n",
"\n",
"Once you see which selector identifies headlines, write the code in the space below that would extract ALL of the headline titles from the Google News homepage.\n",
"\n",
"Save the list of headlines as a variable called **contents**.\n",
"\n",
"Then, extract the text from the contents variable, and save that to a variable called **headlines**\n",
"\n",
"Print the headlines on the last line."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" [1] \"The Real Risks of Trump's Steel and Aluminum Tariffs\" \n",
" [2] \"Trump says US to impose steep tariffs on steel, aluminum imports\" \n",
" [3] \"Trump announces tariffs on steel, aluminum imports\" \n",
" [4] \"GOP meltdown over Trump plan to impose steel, aluminum tariffs\" \n",
" [5] \"The Right Import Tax Is Zero\" \n",
" [6] \"'Every day is a new adventure': Trump upends Washington and Wall Street with shifts on trade, guns\" \n",
" [7] \"Don't Buy Putin's Missile Hype\" \n",
" [8] \"Why would Putin want to nuke Florida?\" \n",
" [9] \"Putin Vows to Lift Russia's Struggling Middle Class\" \n",
" [10] \"What Russia's newly announced nuclear systems actually mean\" \n",
" [11] \"Georgia Senate approves tax bill, snubbing Delta in NRA feud\" \n",
" [12] \"Casey Cagle on Twitter: \\\"I will kill any tax legislation that benefits @Delta unless the company changes its position and ...\"\n",
" [13] \"Georgia lawmakers yank tax break for Delta after airline cuts ties with NRA\" \n",
" [14] \"The Latest: Pro-gun lawmakers win victory over Delta\" \n",
" [15] \"READERS WRITE: MAR. 2\" \n",
" [16] \"White House preparing for McMaster exit as early as next month\" \n",
" [17] \"Hanson: Is Donald Trump going to fire one of his generals?\" \n",
" [18] \"White House preparing for exit of national security advisor HR McMaster: NBC News\" \n",
" [19] \"Dangerous nor'easter targets East Coast with snow, rain, wind\" \n",
" [20] \"Powerful nor'easter has nearly the entire East Coast under high wind watches, warnings\" \n",
" [21] \"Jared Kushner's troubles include an impending $1.2 billion company debt\" \n",
" [22] \"Kushner loses access to top-secret intelligence\" \n",
" [23] \"Jared Kushner's Future In The White House After A Security Clearance Downgrade\" \n",
" [24] \"Jared Kushner Should Be Fired After His Week From Hell, Democrats Say\" \n",
" [25] \"Regulator Seeks Kushner Loan Details From Deutsche Bank, Two Others\" \n",
" [26] \"'Javanka' should follow Hope Hicks out the door\" \n",
" [27] \"A Trump ally is likely to replace a career diplomat as US ambassador, and Mexicans are worried\" \n",
" [28] \"US Ambassador to Mexico to Quit Amid Tense Relations Under Trump\" \n",
" [29] \"US ambassador to Mexico resigning in another big loss for State Department\" \n",
" [30] \"US ambassador to Mexico unexpectedly resigns amid increased tensions under Trump\" \n",
" [31] \"Trump's lack of diplomacy threatens to derail US-Mexico relations\" \n",
" [32] \"US ambassador to Mexico stepping down\" \n",
" [33] \"US ambassador to Mexico quits in latest departure of high power talent from the State Department\" \n",
" [34] \"Xi Jinping's latest power grab is bad news for China's economy\" \n",
" [35] \"Sensitive Words: Xi to Ascend His Throne (Updated) - China Digital Times (CDT)\" \n",
" [36] \"China Blocks Winnie the Pooh, Again\" \n",
" [37] \"Why China banned the letter 'N'\" \n",
" [38] \"Language Log » The letter * has bee* ba**ed in Chi*a\" \n",
" [39] \"China Focus: CPC hears opinions on deepening reform of Party and state institutions\" \n",
" [40] \"Xi Jinping's authoritarian rise in China has been powered by sexism\" \n",
" [41] \"The Latest: UN Imagery Shows New Damage in Syria Suburb\" \n",
" [42] \"Syria war: Pakistani civilians leave besieged Eastern Ghouta\" \n",
" [43] \"UN official: Pauses in Syria suburb unilateral, 'not enough'\" \n",
" [44] \"Dr. Marc Siegel: Doctors in Syria are fighting an unwinnable battle against death -- They need our help\" \n",
" [45] \"Syrian civilians unable to evacuate despite pauses in fighting in besieged Ghouta\" \n",
" [46] \"A Teen Tried To Shoot Queen Elizabeth In 1981, Intelligence Report Says\" \n",
" [47] \"The Snowman and the Queen: Declassified intelligence service documents confirm assassination attempt on Queen ...\" \n",
" [48] \"New Zealand teenager tried to kill Queen Elizabeth in 1981\" \n",
" [49] \"A teenager allegedly tried to kill Queen Elizabeth in 1981. Police suspect a coverup.\" \n",
" [50] \"New Zealand teenager tried to assassinate Queen Elizabeth in 1981: intelligence agency\" \n",
" [51] \"Revealed: the day a schoolboy, 17, tried to assassinate the Queen\" \n",
" [52] \"Befuddled By Trump, Senate Will Not Vote On Gun Measures Next Week\" \n",
" [53] \"Donald Trump's dubious attack on US-Canada trade\" \n",
" [54] \"Baffled Republicans distance themselves from Trump on guns\" \n",
" [55] \"The Latest: Democrat pleased with Trump support on gun laws\" \n",
" [56] \"Why Democrats Are Losing the Gun Debate\" \n",
" [57] \"Trump's willingness to consider gun control is welcome. But can we believe him?\" \n",
" [58] \"Body of missing mother found, suspect in custody: Authorities\" \n",
" [59] \"Man charged with murder after missing Middlesex mom's body found\" \n",
" [60] \"TerriLynn St. John case: Grim end to search for mom who vanished from driveway\" \n",
" [61] \"Virginia mom, 23, who mysteriously disappeared from front yard found dead; suspect in custody\" \n",
" [62] \"Missing Virginia mother mysteriously disappeared from front yard\" \n",
" [63] \"Man accused of sending abusive letters with 'suspicious white powder' to Trump Jr., Sen. Stabenow\" \n",
" [64] \"Massachusetts Man Arrested for Mailing Threatening Letters Containing Suspicious White Powder | USAO-MA ...\" \n",
" [65] \"Massachusetts man charged in Donald Trump Jr. white powder hoax\" \n",
" [66] \"Powder hoax suspect's bizarre rants revealed\" \n",
" [67] \"Daniel Frisiello charged with 'powder' letters to Trump son, others\" \n",
" [68] \"Did HUD really need to spend $31000 of taxpayer money on that dining furniture for Ben Carson?\" \n",
" [69] \"ttongress of tbe Wntteb tates - Oversight and Government Reform - House.gov\" \n",
" [70] \"Ben Carson Tries to Cancel $31000 Dining Furniture Purchase for HUD Office\" \n",
" [71] \"$31000 dining set purchased by Ben Carson's HUD came from a Baltimore interior design firm\" \n",
" [72] \"Did Ben Carson Purchase a $31000 Dining Set and Charge It to HUD?\" \n",
" [73] \"Ben Carson's HUD Spends $31000 on Dining Set for His Office\" \n",
" [74] \"Ben Carson needs to check out Ikea\" \n",
" [75] \"There are far more major scandals brewing. So why is a $31000 dining set in the news?\" \n",
" [76] \"Equifax finds its big data breach hit an additional 2.4 million people\" \n",
" [77] \"Equifax Releases Updated Information on 2017 Cybersecurity Incident | Equifax\" \n",
" [78] \"Equifax Identifies Additional 2.4 Million Customers Hit by Data Breach\" \n",
" [79] \"Equifax Identifies Additional 2.4 Million Affected by 2017 Breach\" \n",
" [80] \"Equifax Releases Updated Information on 2017 Cybersecurity Incident\" \n",
" [81] \"Gunmaker American Outdoor Brands plummets 20% after quarterly sales plunge\" \n",
" [82] \"Financial Newsletter - Zacks\" \n",
" [83] \"Smith & Wesson parent AOBC off target in most recent earnings\" \n",
" [84] \"Smith & Wesson Parent Cuts Forecast, Sees Weak Market Ahead\" \n",
" [85] \"American Outdoor Brands Corporation - AOBC - Stock Price Today - Zacks\" \n",
" [86] \"Mass shootings have made gun stocks toxic assets on Wall Street\" \n",
" [87] \"Drunk man accidentally takes $1600 Uber from West Virginia to New Jersey\" \n",
" [88] \"Drunk Man Accidentally Takes $1600 Uber To NJ After Partying With Friends In West Virginia\" \n",
" [89] \"Drunk man passes out in Uber, takes $1600 ride\" \n",
" [90] \"Drunk bro 'blacked out' and took $1600 Uber ride\" \n",
" [91] \"A $1600 Uber ride? Drunk man blacks out, takes trip from W.Va. to NJ\" \n",
" [92] \"Corporations only break with the gun industry when it's cheap and easy\" \n",
" [93] \"Walmart Statement on Firearms Policy\" \n",
" [94] \"This Time It's Guns: Retail Activism Goes Mainstream With Dick's Sporting Goods\" \n",
" [95] \"Kroger Raises Age Limits on Gun Sales, Joining Walmart and Dick's\" \n",
" [96] \"Commentary: Dick's Is Showing Us Why Companies Can't Afford to Be Cowards Anymore\" \n",
" [97] \"Android Go is here to fix super cheap phones\" \n",
" [98] \"Android Go for Feature Phones: What You Need to Know\" \n",
" [99] \"New “Android Go” phones show how much you can get for $100\" \n",
"[100] \"Android P Release Date, Name, Features & Expected Android P Phones List\" \n",
"[101] \"Twitter: We Know the Platform Is Toxic, Please Help Us Fix\" \n",
"[102] \"Measuring The Health of Our Public Conversations — Cortico\" \n",
"[103] \"Twitter is asking the public to help measure how toxic it is\" \n",
"[104] \"Twitter seeks help measuring 'health' of its world\" \n",
"[105] \"Facebook to End News Feed Experiment in 6 Countries That Magnified Fake News\" \n",
"[106] \"News Feed FYI: Ending the Explore Feed Test | Facebook Newsroom\" \n",
"[107] \"Facebook ends its experiment with the alternative “explore” news feed\" \n",
"[108] \"Facebook Ditches Plan for 2 Separate News Feeds\" \n",
"[109] \"First Look: GMC goes big with the redesigned 2019 Sierra Denali, but will luxury truck buyers drive one home?\" \n",
"[110] \"2019 GMC Sierra 1500 Denali puts a tailgate in your tailgate\" \n",
"[111] \"Forget submarine steel: The 2019 GMC Sierra truck is made from carbon fiber\" \n",
"[112] \"2019 GMC Sierra 1500 First Look: Distinguishing Itself from Silverado\" \n",
"[113] \"4 Reasons Marvel And Disney Moved 'Avengers: Infinity War' To April\" \n",
"[114] \"'Avengers: Infinity War' Release Date Moves Up One Week to April\" \n",
"[115] \"Avengers: Infinity War toys showcase new villains\" \n",
"[116] \"Tori Spelling's Struggles: Overcoming Health Scares, Marriage Drama, Family Feuds and Money Troubles\" \n",
"[117] \"Police Called to Tori Spelling's Los Angeles Home Over 'Disturbance'\" \n",
"[118] \"Tori Spelling's 'breakdown' may be due to demands of motherhood, Corinne Olympios says\" \n",
"[119] \"Tori Spelling Visited by LAPD After Mysterious 'Disturbance' 911 Call Insinuating a Mental Breakdown\" \n",
"[120] \"Dean McDermott Friendly Chat with Cops After Tori's Apparent Mental Breakdown\" \n",
"[121] \"Corinne Olympios Says Tori Spelling Seemed 'Distant' Prior to Police Being Called to Her House (Exclusive)\" \n",
"[122] \"On the Red Carpet, Ryan Seacrest Is a Distraction in an Important Year\" \n",
"[123] \"Ryan Seacrest Sexual Abuse Allegations: E! Stylist Goes Into Detail – Variety\" \n",
"[124] \"Kelly Ripa defends Ryan Seacrest, says she's 'blessed' to work with him\" \n",
"[125] \"Kelly Ripa defends co-host Ryan Seacrest amid sex misconduct allegations\" \n",
"[126] \"Publicists to steer stars away from Seacrest on Oscars red carpet\" \n",
"[127] \"Hollywood's reckoning has transformed the red carpet\" \n",
"[128] \"Kylie Jenner Poses in Underwear One Month After Giving Birth to Baby Stormi\" \n",
"[129] \"She's snapped back into shape! Kylie Jenner, 20, proudly poses in a thong just one month after giving birth to Stormi\" \n",
"[130] \"Kylie Jenner Shows Off Post-Baby Body in Black Underwear 1 Month After Birth of Daughter Stormi\" \n",
"[131] \"Jeffree Star Vs. Kylie Jenner Continues: \\\"Who's Ready for Some Hot Tea Today???\\\"\" \n",
"[132] \"What did Sixers' Joel Embiid think of Rockets' James Harden's ridiculous crossover vs. Clippers? (VIDEO)\" \n",
"[133] \"The 13 most disrespectful things from James Harden's crossover on Wesley Johnson, ranked\" \n",
"[134] \"Cleveland Cavaliers SF LeBron James: James Harden pulled off move players dream about\" \n",
"[135] \"Celebrities spotted at the Rockets' win over the Clippers\" \n",
"[136] \"Three Things to Know: James Harden does Wesley Johnson, Clippers wrong\" \n",
"[137] \"Sean Miller's Statement Takes the Fight to ESPN: Is a Lawsuit the Next Step?\" \n",
"[138] \"Sources: Sean Miller talked payment on wiretap\" \n",
"[139] \"Sean Miller stands tall with Arizona's support, but he's hardly in the clear\" \n",
"[140] \"Sean Miller Denies Wrongdoing, Will Coach Arizona Amid Deandre Ayton Scandal\" \n",
"[141] \"Sources cast doubt on reported timeline of Miller-Dawkins call\" \n",
"[142] \"Could Sean Miller wiretap report be wrong? Seeds of doubt arise with ESPN recruiting story\" \n",
"[143] \"Vikings emerge as favorites to land Kirk Cousins at NFL Combine\" \n",
"[144] \"Minnesota Vikings - NFL - CBSSports.com\" \n",
"[145] \"Mike Zimmer: Wrong QB decision for Vikings means I'll 'probably get fired'\" \n",
"[146] \"Zimmer: If Vikings don't pick right QB, I'll 'get fired'\" \n",
"[147] \"Minnesota Vikings | Bleacher Report\" \n",
"[148] \"NFL Combine 2018: Saquon Barkley grew up a Jets fan, but could they actually draft him?\" \n",
"[149] \"2018 NFL combine: What we learned from RB and OL weigh-ins and measurements\" \n",
"[150] \"Saquon Barkley would love to be drafted by the Cleveland Browns\" \n",
"[151] \"Saquon Barkley: It Would Be 'Awesome' to Be Drafted by Browns, Struggling Teams\" \n",
"[152] \"Did Dark Matter Make The Early Universe Chill Out?\" \n",
"[153] \"An absorption profile centred at 78 megahertz in the sky-averaged spectrum | Nature\" \n",
"[154] \"Signal detected from 'cosmic dawn'\" \n",
"[155] \"A rare signal from the early universe sends scientists clues about dark matter\" \n",
"[156] \"Possible interaction between baryons and dark-matter particles revealed by the first stars | Nature\" \n",
"[157] \"The birth of the first stars\" \n",
"[158] \"Watch a bus-size asteroid buzz Earth to start the weekend\" \n",
"[159] \"Asteroid Watch - NASA Jet Propulsion Laboratory\" \n",
"[160] \"Watch Live Stream: Bus-Sized 2018 DV1 Asteroid Will Fly By Earth On Friday\" \n",
"[161] \"Bus-size asteroid to pass within 70000 miles of Earth Friday, closer than moon\" \n",
"[162] \"Virtual Telescope's WebTV - The Virtual Telescope Project 2.0\" \n",
"[163] \"A Bus-Size Asteroid Will Whiz by Earth Friday\" \n",
"[164] \"Advanced GOES satellites launched to improve weather forecasting\" \n",
"[165] \"Watch live as NASA launches the future of weather forecasting\" \n",
"[166] \"NASA launches advanced weather satellite for western US\" \n",
"[167] \"Watch NOAA's GOES-S Weather Satellite Launch, Live\" \n",
"[168] \"Flagship US space telescope facing further delays\" \n",
"[169] \"NASA's James Webb Telescope Likely To Be Delayed Yet Again\" \n",
"[170] \"NASA's Hubble successor may miss its launch window\" \n",
"[171] \"Nuts, Especially Tree Nuts, and Improved CRC Survival\" \n",
"[172] \"Eating Nuts Could Lower Colon Cancer Reoccurrence\" \n",
"[173] \"Nuts may be key to fighting this common cancer\" \n",
"[174] \"Doctors: More People Want Nose Jobs To Make Selfies Look Better\" \n",
"[175] \"AAFPRS - Media Resources - Statistics\" \n",
"[176] \"Selfies distort faces like a \\\"funhouse mirror,\\\" study finds\" \n",
"[177] \"Think Your Nose Is Too Big? Selfies Might Be to Blame\" \n",
"[178] \"Selfies make your nose look 30% bigger, study says\" \n",
"[179] \"A teen was told he likely had the flu. It turned out to be late-stage cancer.\" \n",
"[180] \"Helping Hunter Fight Cancer | Medical Expenses - YouCaring\" \n",
"[181] \"Florida Teen Initially Diagnosed with the Flu Discovers He Actually Has Stage 4 Cancer\" \n",
"[182] \"Teen who was told he had the flu, really had stage 4 cancer\" \n",
"[183] \"Tampa teen diagnosed with flu discovers he is battling stage 4 cancer\" \n",
"[184] \"FDA Committee Recommends 2018-2019 Influenza Vaccine Strains\" \n",
"[185] \"Interim Estimates of 2017–18 Seasonal Influenza Vaccine Effectiveness — United States, February 2018 | MMWR - CDC\" \n",
"[186] \"Flu deaths reach modern-day state record of 253; elderly comprise 73 percent of victims\" \n",
"[187] \"What's going around? Experts explain why the 2017-2018 flu season was one of the harshest yet\" \n",
"[188] \"Flu Articles, Photos, and Videos - Chicago Tribune\" \n",
"[189] \"Brutal flu has killed 84 children in the US - but its spread...\" \n",
"[190] \"All Children Should Have to Get the Flu Shot\" \n",
"[191] \"Five myths about outbreaks\" \n"
]
}
],
"source": [
"url <- \"http://ivanhernandez.com/webscraping/GoogleNews/GoogleNews.html\"\n",
"webpage <- read_html(url)\n",
"contents <- html_nodes(webpage, \"[role='heading']\")\n",
"headlines <- html_text(contents)\n",
"print(headlines)"
]
},
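{
"cell_type": "markdown",
"metadata": {},
"source": [
"To check how many headlines were scraped, the `length()` function introduced earlier works on the headlines vector as well:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Count the number of scraped headlines\n",
"print(length(headlines))"
]
},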
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.3 Extract content from a link"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's re-download the example page to make sure our webpage variable has the correct content"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"url <- \"http://ivanhernandez.com/webscraping/example.html\"\n",
"webpage <- read_html(url)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We may have a link that, if clicked, directs the user to a different page. \n",
"\n",
"We can extract various content from a link including:\n",
"+ The text the user sees for the link\n",
"+ The address the link directs the user to when clicked\n",
"\n",
"In the source code of the example page, the link is written as follows:\n",
"\n",
"`<a href=\"http://google.com\">This is a link to Google</a>`\n",
"\n",
"This link has both text: **\"This is a link to Google\"**\n",
"as well as a url: http://google.com\n",
"\n",
"We can use either the html_text() or html_attr() function, depending on what we want to extract."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3.a Extract the text from the link"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"This is a link to Google\"\n"
]
}
],
"source": [
"content <- html_nodes(webpage, \"a\")\n",
"linktext <- html_text(content)\n",
"print(linktext)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3.b Extract the url from the link\n",
"\n",
"Before, we used the html_text function to get the text enclosed by the matched tag.\n",
"\n",
"We can extract other attributes from the matched tag, such as the address contained in the href attribute.\n",
"\n",
"To get the value of a specific attribute, use the **html_attr** function.\n",
"\n",
"Below, we will extract the link address, which is contained in the href attribute of a link."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"http://google.com\"\n"
]
}
],
"source": [
"content <- html_nodes(webpage, \"a\")\n",
"link <- html_attr(content,'href')\n",
"print(link)"
]
},
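{
"cell_type": "markdown",
"metadata": {},
"source": [
"On pages with many links, html_text() and html_attr() each return a vector with one entry per matched tag. As a sketch, we can pair each link's text with its url in a data frame (the variable name \"linktable\" is just an example):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"content <- html_nodes(webpage, \"a\")\n",
"# Pair each link's visible text with the address in its href attribute\n",
"linktable <- data.frame(text = html_text(content), url = html_attr(content, \"href\"))\n",
"print(linktable)"
]
},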
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 3: Scrape the Links from the White House Press Briefing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*This exercise is based on the html source from https://www.whitehouse.gov/briefing-room/press-briefings*\n",
"\n",
"Go to the following page and right-click on a press briefing link title (either a remark or statement):\n",
"http://ivanhernandez.com/webscraping/whitehouse/breifingstatements.html\n",
"\n",
"**What is the tag used for the headlines? Is this tag always within another tag?**\n",
"\n",
"In the space below, write the code that would extract all of the links to **JUST** the press briefings from the White House website. \n",
"\n",
"Save the list of links as a variable called **content**\n",
"\n",
"From your content variable, get the link found in the href attribute. Save those links to a variable called **links**\n",
"\n",
"Use the print function to print the links\n",
"\n",
"*Hint on how to isolate a tag within another tag*:\n",
"- *To get tags that are only found within another specific tag, separate them with a space in the html_nodes() function*\n",
"- *So if the link tag (a) you want is only found within a paragraph tag (p), you would say `html_nodes(webpage, \"p a\")`*"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" [1] \"https://www.whitehouse.gov/briefings-statements/remarks-president-trump-listening-session-representatives-steel-aluminum-industry/\" \n",
" [2] \"https://www.whitehouse.gov/briefings-statements/remarks-first-lady-white-house-opioids-summit/\" \n",
" [3] \"https://www.whitehouse.gov/briefings-statements/readout-president-donald-j-trumps-call-president-moon-jae-republic-korea-6/\" \n",
" [4] \"https://www.whitehouse.gov/briefings-statements/remarks-vice-president-pence-department-homeland-security-15th-anniversary-event/\" \n",
" [5] \"https://www.whitehouse.gov/briefings-statements/national-colorectal-cancer-awareness-month/\" \n",
" [6] \"https://www.whitehouse.gov/briefings-statements/president-donald-j-trump-combatting-opioid-crisis/\" \n",
" [7] \"https://www.whitehouse.gov/briefings-statements/remarks-president-trump-vice-president-pence-bipartisan-members-congress-meeting-school-community-safety/\"\n",
" [8] \"https://www.whitehouse.gov/briefings-statements/president-donald-j-trumps-policy-agenda-annual-report-free-fair-reciprocal-trade/\" \n",
" [9] \"https://www.whitehouse.gov/briefings-statements/readout-president-donald-j-trumps-call-emir-tamim-bin-hamad-al-thani-qatar-2/\" \n",
"[10] \"https://www.whitehouse.gov/briefings-statements/remarks-president-trump-ceremony-preceding-lying-honor-reverend-billy-graham/\" \n"
]
}
],
"source": [
"url <- \"http://ivanhernandez.com/webscraping/whitehouse/breifingstatements.html\"\n",
"webpage <- read_html(url)\n",
"content <- html_nodes(webpage, \"h2 a\")\n",
"links <- html_attr(content,'href')\n",
"print(links)"
]
},
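{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each address in **links** can itself be passed to read_html(), which is the \"Repeat the Process on A Different Page\" step from the introduction. A minimal sketch, assuming the first scraped link is reachable:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Download the page behind the first scraped link and read its title\n",
"firstpage <- read_html(links[1])\n",
"title <- html_text(html_nodes(firstpage, \"title\"))\n",
"print(title)"
]
},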
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.4 Extract Everything Else"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's re-download the example page to make sure our webpage variable has the correct content"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"url <- \"http://ivanhernandez.com/webscraping/example.html\"\n",
"webpage <- read_html(url)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.4.a Extract non-tagged text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sometimes, we have content that is not contained within tags, or has an irregular structure.\n",
"\n",
"**Notice in the source code above that the phrase, \"This content is not inside any tag\", does not have a tag specific to it. It is by itself.**\n",
"\n",
"**We can still extract this information if we know the text/characters that come immediately before and immediately after the content.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ Determine the HTML that goes immediately before the content (precontent)\n",
" + In this example, \"Additional Content:\" came before the text we want to extract\n",
" + We will split the HTML on where it says \"Additional Content:\" and keep the text after it\n",
" \n",
" \n",
"+ Determine the HTML that goes immediately after the content (postcontent)\n",
" + In this example, \"<\" came immediately after the text we want to extract\n",
" + We will split the HTML that we kept on where it says \"<\" and keep the text before it\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"The following code takes the HTML text, splits it at the phrase we indicate, and keeps the second half (the text after the indicated phrase \"Additional Content:\")."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"' This content is not inside any tag\\n\\t<br><img src=\"http://ivanhernandez.com/webscraping/profile.png\">\\n</body>\\n</html>\\n'"
],
"text/latex": [
"' This content is not inside any tag\\textbackslash{}n\\textbackslash{}t