Package 'scrappy'

Title: A Simple Web Scraper
Description: A group of functions to scrape data from different websites, for academic purposes.
Authors: Roberto Villegas-Diaz [aut, cre]
Maintainer: Roberto Villegas-Diaz <[email protected]>
License: MIT + file LICENSE
Version: 0.0.2
Built: 2024-11-12 02:39:23 UTC
Source: https://github.com/villegar/scrappy

Help Index


(Assisted) request of EDINA's Digimap data from the Ordnance Survey

Description

Digimap’s Ordnance Survey collection provides a full range of topographic Ordnance Survey data for the UK. Note that this function only helps you request the data; once your request has been processed, you will have to download the data manually (email instructions will be provided by Digimap).

Usage

digimap_os(
  client,
  area_name = NULL,
  dataset = NULL,
  format = NULL,
  version = NULL,
  org = Sys.getenv("ORG"),
  sleep = 1
)

Arguments

client

RSelenium client.

area_name

String with UK national grid name (e.g., 'SD', 'SD20'). See ordnancesurvey.co.uk/documents/resources/guide-to-nationalgrid.pdf.

dataset

String with the name of the data set to download (e.g., 'NTM' for National Tree Map or 'Terrain-5 DTM' for the OS Terrain 5 Digital Terrain Model). See https://digimap.edina.ac.uk/help/our-maps-and-data/os_products/.

format

String with the data format (e.g., 'SHAPE'). See https://digimap.edina.ac.uk/help/our-maps-and-data/os_products/.

version

String with the version of the data set (e.g., 'July 2023'). See https://digimap.edina.ac.uk/help/our-maps-and-data/os_products/.

org

String with your organisation name, used to log in (this step can also be completed manually). Login is required only once per session.

sleep

Integer with the number of seconds to pause between actions on the web page.

Value

Logical value with the status of the data request.

Source

https://digimap.edina.ac.uk/os
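
Examples

## Not run: 
# A minimal sketch (not run): request the OS National Tree Map ('NTM') for
# grid tile 'SD20'. The data set, format and version values below are
# illustrative only and may need adjusting to the current Digimap catalogue.
# Create RSelenium session
rD <- RSelenium::rsDriver(browser = "firefox", port = 4544L, verbose = FALSE)

# Request the data (download instructions arrive later by email)
status <- scrappy::digimap_os(
  client = rD$client,
  area_name = "SD20",
  dataset = "NTM",
  format = "SHAPE",
  version = "July 2023",
  org = Sys.getenv("ORG")
)

# Stop server
rD$server$stop()

## End(Not run)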


Convert duration to date-time

Description

Convert a string with a duration (e.g., 'an hour ago') to a date-time string, based on a reference time.

Usage

duration2datetime(
  str,
  ref_time = Sys.time(),
  output_format = "%Y-%m-%d %H:%M:%S %Z"
)

Arguments

str

String with a duration (see examples)

ref_time

Reference time (default: Sys.time(), current time)

output_format

String with the format of the output (default: "%Y-%m-%d %H:%M:%S %Z")

Value

Date-time object based on the input string, str, and the reference time ref_time.

Examples

duration2datetime("a minute ago")
duration2datetime("an hour ago")
duration2datetime("a day ago")
duration2datetime("a week ago")
duration2datetime("a month ago")
duration2datetime("a year ago")
duration2datetime("2 minutes ago")
duration2datetime("2 hours ago")
duration2datetime("2 days ago")
duration2datetime("2 weeks ago")
duration2datetime("2 months ago")
duration2datetime("2 years ago")

Scrape GP practices

Description

Scrape GP practices near a given postcode

Usage

find_a_gp(
  client,
  postcode,
  base = "https://www.nhs.uk/service-search/find-a-gp",
  sleep = 1
)

Arguments

client

RSelenium client.

postcode

String with the target postcode.

base

String with the base URL for the NHS 'Find a GP' service.

sleep

Integer with the number of seconds to pause between actions on the web page.

Value

Data frame with GP practices near the given postcode.

Examples

## Not run: 
# Create RSelenium session
rD <- RSelenium::rsDriver(browser = "firefox", port = 4544L, verbose = FALSE)

# Retrieve GP practices near L69 3GL
# (Waterhouse building, University of Liverpool)
wh_gps_tb <- scrappy::find_a_gp(rD$client, postcode = "L69 3GL")

# Stop server
rD$server$stop()

## End(Not run)

Scrape Google Maps' reviews

Description

Scrape Google Maps' reviews

Usage

google_maps(
  client,
  name,
  place_id = NULL,
  base = "https://www.google.com/maps/search/?api=1&query=",
  sleep = 1,
  max_reviews = 100,
  result_id = 1,
  with_text = FALSE
)

Arguments

client

RSelenium client.

name

String with the name of the target place.

place_id

String with the unique ID of the target place, useful when more than one place has the same name.

base

String with the base URL for the Google Maps website.

sleep

Integer with the number of seconds to pause between actions on the web page.

max_reviews

Integer with the maximum number of reviews to scrape. The number of existing reviews will define the actual number of reviews returned.

result_id

Integer with the result position to use, only relevant when multiple matches for the given name are found.

with_text

Boolean value indicating whether max_reviews should only count reviews that include a text comment.

Value

Tibble with the reviews extracted from Google Maps.

Examples

## Not run: 
# Create RSelenium session
rD <- RSelenium::rsDriver(browser = "firefox", port = 4544L, verbose = FALSE)
# Retrieve reviews for Sefton Park in Liverpool
sefton_park_reviews_tb <-
  scrappy::google_maps(
    client = rD$client,
    name = "Sefton Park",
    place_id = "ChIJrTCHJVkge0gRm1LWF0fSPgw",
    max_reviews = 20
  )

sefton_park_reviews_tb_with_text <-
  scrappy::google_maps(
    client = rD$client,
    name = "Sefton Park",
    place_id = "ChIJrTCHJVkge0gRm1LWF0fSPgw",
    max_reviews = 20,
    with_text = TRUE
  )
# Stop server
rD$server$stop()

## End(Not run)

Retrieve data from NEWA at Cornell University

Description

Retrieve weather data from the Network for Environment and Weather Applications (NEWA) at Cornell University.

Usage

newa_nrcc(
  client,
  year,
  month,
  station,
  base = "http://newa.nrcc.cornell.edu/newaLister",
  interval = "hly",
  sleep = 6,
  table_id = "#dtable",
  path = getwd(),
  save_file = TRUE
)

Arguments

client

RSelenium client.

year

Numeric value with the year.

month

Numeric value with the month.

station

String with the station abbreviation. See http://newa.cornell.edu/index.php?page=station-pages (or the bundled scrappy::newa_stations dataset) for a list.

base

Base URL (default: http://newa.nrcc.cornell.edu/newaLister).

interval

String with data interval (default: hly, hourly).

sleep

Numeric value with the number of seconds to wait for the page to load the results (default: 6 seconds).

table_id

String with the unique HTML ID assigned to the table containing the data (default: #dtable).

path

String with path to location where CSV files should be stored (default: getwd()).

save_file

Boolean flag indicating whether the output should be saved as a CSV file.

Value

Tibble with the data retrieved from the server.

Examples

## Not run: 
# Create RSelenium session
rD <- RSelenium::rsDriver(browser = "firefox", port = 4544L, verbose = FALSE)
# Retrieve data for the Geneva (Bejo) station on 2020/12
scrappy::newa_nrcc(rD$client, 2020, 12, "gbe")
# Stop server
rD$server$stop()

## End(Not run)

Retrieve data from NEWA v3.0 at Cornell University

Description

Retrieve weather data from the Network for Environment and Weather Applications (NEWA) version 3.0 at Cornell University.

Usage

newa_nrcc3(
  year,
  month,
  day,
  hour,
  station,
  base = "https://hrly.nrcc.cornell.edu/stnHrly"
)

Arguments

year

Numeric value with the start year.

month

Numeric value with the start month.

day

Numeric value with the start day.

hour

Numeric value with the start hour.

station

String with the station abbreviation. Check scrappy::newa3_stations for a list of stations and abbreviations.

base

Base URL (default: https://hrly.nrcc.cornell.edu/stnHrly).

Value

List of data frames with hourly, daily, hourly forecast (hourly_forecast), and daily forecast (daily_forecast) data.

Examples

scrappy::newa_nrcc3(2021, 12, 01, 00, "gbe")
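
# A minimal sketch of inspecting the returned list, assuming the element
# names match the Value section above (hourly, daily, hourly_forecast,
# daily_forecast):
gbe_dec_2021 <- scrappy::newa_nrcc3(2021, 12, 01, 00, "gbe")
head(gbe_dec_2021$hourly)
head(gbe_dec_2021$daily_forecast)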

NEWA Weather Stations dataset

Description

A dataset containing information on 718 weather stations in the Network for Environment and Weather Applications (NEWA) at Cornell University.

Usage

data(newa_stations)

Format

A data frame with 718 rows and 3 variables:

name

Station's name.

state

State where the station is located.

code

Station's code.

Author(s)

Network for Environment and Weather Applications [email protected]

Source

http://newa.cornell.edu/index.php?page=station-pages
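
Examples

# A minimal sketch: load the bundled station list and preview it
# (only the documented columns name, state and code are assumed here).
data(newa_stations, package = "scrappy")
head(newa_stations)
# Number of stations per state
table(newa_stations$state)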


NEWA v3 Weather Stations dataset

Description

A dataset containing information on 801 weather stations in the Network for Environment and Weather Applications (NEWA) version 3 at Cornell University.

Usage

data(newa3_stations)

Format

A data frame with 801 rows and 10 variables:

name

Station's name.

state

State where the station is located.

code

Station's code.

affiliation

Entity to which the station is affiliated.

affiliation_url

Entity's URL.

latitude

Station's latitude.

longitude

Station's longitude.

elevation

Station's elevation.

start_year

Start year (data available).

is_icao

Boolean flag indicating whether the station is part of the International Civil Aviation Organization (ICAO), e.g., is an airport.

Author(s)

Network for Environment and Weather Applications [email protected]

Source

https://newa.cornell.edu
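
Examples

# A minimal sketch: load the NEWA v3 station list and preview it
# (only the documented columns are assumed here).
data(newa3_stations, package = "scrappy")
head(newa3_stations[, c("name", "state", "code")])
# Stations flagged as ICAO sites, e.g., airports
# (assuming is_icao is stored as a logical column)
head(subset(newa3_stations, is_icao))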


Print Values

Description

Print Google Maps' reviews (objects of class gmaps_reviews).

Usage

## S3 method for class 'gmaps_reviews'
print(x, ...)

Arguments

x

an object used to select a method.

...

further arguments passed to or from other methods.
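
Examples

## Not run: 
# A minimal sketch (not run): print reviews scraped with scrappy::google_maps();
# the place name below is illustrative only.
rD <- RSelenium::rsDriver(browser = "firefox", port = 4544L, verbose = FALSE)
reviews <- scrappy::google_maps(rD$client, name = "Sefton Park", max_reviews = 20)
print(reviews)
rD$server$stop()

## End(Not run)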


Wait until page has finished loading

Description

Wait until the page has finished loading the element identified by value.

Usage

wait_to_load(client, using = "css", value = "body", sleep = 1)

Arguments

client

RSelenium client.

using

String with the property to use to find the element (e.g. "css", "xpath", etc.) (default: "css").

value

String with the tag or selector of the page element to wait for (default: "body").

sleep

Numeric value with the number of seconds to wait for the page to load the results (default: 1 second).
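
Examples

## Not run: 
# A minimal sketch (not run): wait for the page body to render before scraping;
# the URL below is illustrative only.
# Create RSelenium session
rD <- RSelenium::rsDriver(browser = "firefox", port = 4544L, verbose = FALSE)
rD$client$navigate("https://www.nhs.uk/service-search/find-a-gp")
scrappy::wait_to_load(rD$client, using = "css", value = "body", sleep = 1)

# Stop server
rD$server$stop()

## End(Not run)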