A Big Data Glossary

3NF

(Course 1) Third Normal Form (see Third Normal Form)

ACID

(Course 1) An acronym for Atomicity, Consistency, Isolation, and Durability; a condition on database operations in which write operations will not result in an invalid state of the database for concurrent or subsequent reads

Active Database

(Course 2) The database to which a user is connected (see also current database)

Aggregate

(Course 2) The result of aggregation

Aggregate Expression

(Course 2) An expression that is calculated using multiple rows and returns a single value, using an aggregate function (compare with scalar expression)

Aggregate Function

(Course 2) A function that performs aggregation

Aggregation

(Course 2) Reducing multiple rows down to a single row, usually by comparing values (such as finding a maximum or minimum value) or combining values (such as finding a sum or average)
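As a minimal sketch of aggregation, the following uses Python's built-in sqlite3 module as a stand-in for a SQL engine such as Hive or Impala. The orders table and its values are hypothetical, made up only for illustration.

```python
import sqlite3

# In-memory SQLite database with a small, hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 10), ("ann", 30), ("bob", 5)],
)

# Combining values (SUM) and comparing values (MAX): three rows become one.
total, largest = conn.execute(
    "SELECT SUM(amount), MAX(amount) FROM orders"
).fetchone()

# With GROUP BY, each group of rows is reduced to a single row.
per_customer = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()

print(total, largest)
print(per_customer)
```

Here SUM and MAX are aggregate functions, and SUM(amount) is an aggregate expression.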

Analog Data

(Course 1) Information that exists in non-digital form

Analytic Database

(Course 1) A database focused on storing somewhat static data for analytic queries to answer complex questions about the data (see also operational database)

Antipattern

(Course 3) A common but inefficient or ineffective response to an issue

Apache Avro 

(Course 3) An efficient data serialization framework that provides a file format using optimized binary encoding to store data efficiently

Apache Hadoop

(Course 1) An open-source framework for big data storage and processing

Apache Hive

(Course 1, Course 2) An open source SQL query engine for use with big data; it creates a MapReduce or Apache Spark job to run the query

Apache Impala

(Course 1, Course 2) An open source SQL query engine for use with big data; it has its own dedicated processing engine

Apache ORC

(Course 3) A columnar file format used by Apache Hive; ORC stands for Optimized Record Columnar

Apache Parquet

(Course 3) A columnar file format

Apache Spark

(Course 1) An open-source engine for processing data in a distributed system; one of the underlying engines on which Apache Hive can run HiveQL (SQL) statements

Ascending

(Course 2) From least to greatest (as opposed to descending); for strings, alphabetical order (A to Z) is ascending order

Atomic

(Course 1) Generally, indivisible into smaller parts

  1. In databases, an atomic column can contain parts if the individual parts would contribute no purpose to how the database is used
  2. A transaction is atomic if the component DML statements are guaranteed to be made effective in the database at the same time

Avro

(Course 3) See Apache Avro

Batch Mode

(Course 2) See non-interactive

Beeline

(Course 2) The CLI shell for issuing SQL statements to Apache Hive

BI

(Course 2) Business Intelligence; typically an application to create reports related to a business

Big Data

(Course 1) Very large datasets, typically several dozen terabytes, petabytes, or larger

Binary Operator

(Course 2) An operator that takes two operands; most mathematical operators (such as +, –, and *) are binary operators

Binning

(Course 2) Separating a column of continuous numerical values into a limited number of groups called bins; for example, values from 0 to 100 might be binned into intervals [0, 10), [10, 20), [20, 30), and so on
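One way to sketch the binning described above is with integer division. This example uses Python's sqlite3 module as a stand-in engine; the scores table is hypothetical. It relies on SQLite truncating integer division, which is an assumption about this engine only; other engines may treat / differently and need an explicit floor.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?)", [(3,), (12,), (17,), (25,), (99,)]
)

# In SQLite, integer / integer truncates, so (score / 10) * 10 is the lower
# edge of the bin [edge, edge + 10) that contains the score.
bins = conn.execute(
    "SELECT (score / 10) * 10 AS bin_start, COUNT(*) "
    "FROM scores GROUP BY bin_start ORDER BY bin_start"
).fetchall()

for bin_start, count in bins:
    print(f"[{bin_start}, {bin_start + 10}): {count}")
```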

BLOB

(Course 1) Binary Large Object; a binary data object, potentially up to 4GB 

Bottom-N Query

(Course 2) A query that returns some number of results whose values are the least of the values in the data (compare with top-N query)

Bucketing

(Course 1) See table bucketing

Built-In Function

(Course 2) A named process that is predefined by the system or engine, usable without further preparation

Bulk Load

(Course 1) Importing many records of data at once, rather than one record at a time

Business Rules

(Course 1) Constraints to which your business procedures are intended to conform

Cartesian Join

(Course 2) See cross join

Cascading DML

(Course 1) A series of DML statements set to occur along with a particular DML statement

Case 

(Course 2) Usually refers to the capitalization of letters, or lack thereof; for example, lowercase (this is lowercase) and uppercase (THIS IS UPPERCASE)

Casting

(Course 2) See type conversion, implicit casting, and explicit casting

Categorical

(Course 2) Containing a limited number of possible values, which typically represent categories

Clause

(Course 2) A part of a SELECT statement, typically a keyword followed by one or more expressions, literal values, or column references

CLI

(Course 2) Command-line interface; a text-based interface for issuing commands, such as to an operating system or to a shell utility such as Beeline or Impala Shell

CLOB

(Course 1) Character Large Object; a data object consisting of characters, potentially up to 4GB 

Cloning

(Course 3) When applied to a table, creating a new table with the same schema (but no data); typically done with the LIKE keyword

Cloud Storage

(Course 1) Remote storage of data, typically through a storage service such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform (compare with on-premises storage)

Cluster

(Course 1) A network of servers (called nodes or hosts) for storage and processing

Collection Function

(Course 3) A function that operates on a collection type (complex data type)

Collection Type

(Course 3) Another name for a complex data type, because the type collects data into one column

Column

(Course 1) A data field in the records (or rows) of a table

Column Alias

(Course 2) An alternate name given to a column or expression, to allow simpler references in other parts of a SELECT statement or in the results

Column Reference

(Course 2) A name or alias used in a SQL statement to refer to a column in a table or result set

Columnar File Format

(Course 3) A file format that organizes and stores data by column rather than by row

Comparison Operators

(Course 2) Operators that provide a comparison between two values; examples include <, >, and =

Complex Data Types

(Course 1, Course 3) Data types that comprise one or more fields, which in some cases can be of different types; for example, an array of INT values or a structure that contains two STRING values and an INT value

Compound Key

(Course 1) A primary key or foreign key composed of more than one column

Compression

(Course 3) One of many processes for encoding data to take up less storage

Conditional Functions

(Course 2) Functions that test a condition and return potentially different values based on whether the condition is true or not

Consistent

(Course 1) For transactions, keeping the database in agreement with design constraints

Coupled

(Course 1) How the data and the metadata of a database are connected; see loosely coupled and tightly coupled 

Cross Join

(Course 2) A join that uses no join condition; each row of one table is combined with every row from another table (also called Cartesian join)
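A small sketch of a cross join, run through Python's sqlite3 module as a stand-in engine. The sizes and colors tables are hypothetical. With 2 rows in one table and 3 in the other, the result has 2 × 3 = 6 rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sizes (size TEXT)")
conn.execute("CREATE TABLE colors (color TEXT)")
conn.executemany("INSERT INTO sizes VALUES (?)", [("S",), ("L",)])
conn.executemany(
    "INSERT INTO colors VALUES (?)", [("red",), ("green",), ("blue",)]
)

# No join condition: every size is paired with every color.
rows = conn.execute(
    "SELECT size, color FROM sizes CROSS JOIN colors ORDER BY size, color"
).fetchall()

print(len(rows))
for row in rows:
    print(row)
```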

CRUD

(Course 1) An acronym for Create-Retrieve-Update-Delete, the types of queries used in an operational database

Current Database

(Course 2) The database to which a user is connected (see also active database)

Daemon

(Course 1) A process running on a node in a network (pronounced DEE-mon or DAY-mon)

Data

(Course 1) A representation of something, that captures some features and ignores others; in this specialization and this course, data by itself typically means digital data (see also analog data and digital data)

Data Analysis

(Course 2) The process of exploring data or using it to answer questions 

Data Control Language (DCL)

(Course 1) A category of SQL commands for managing user access permissions: GRANT and REVOKE

Data Definition Language (DDL)

(Course 1) A category of SQL commands for defining a database and tables: CREATE, ALTER, and DROP

Data Dictionary

(Course 1) A portion of an RDBMS that stores the database metadata (implementation details about the tables in the database)

Data Lake

(Course 1) A collection of data, which can be of any type or size (also called data reservoir, data store, or enterprise data hub)

Data Locality

(Course 3) A phenomenon in which data processing happens in the same location (on the same machine) where the data is stored

Data Manipulation Language (DML)

(Course 1) A category of SQL commands for loading and updating data in a database: INSERT, UPDATE, and DELETE; also (for transactions) START TRANSACTION, COMMIT, and ROLLBACK

Data Query Language (DQL)

(Course 1) A category of SQL commands (actually only one command) for asking questions to get answers from a database: SELECT

Data Reservoir

(Course 1) A collection of data, which can be of any type or size (also called data lake, data store, or enterprise data hub)

Data Retrieval

(Course 2) The process of accessing and possibly displaying data

Data Store

(Course 1) A collection of data, which can be of any type or size (also called data lake, data reservoir, or enterprise data hub)

Data Type

(Course 1) The category into which a single piece of data might fall, such as integer (INT in SQL) or string (STRING in SQL)

Data Warehouse

(Course 1) A form of analytic database that gathers data from one or more data sources

Database

  1. (Course 1) An organized data store, such as a spreadsheet or other table with rows and columns
  2. (Course 2) A logical container for a collection of tables (see also schema)

Database Management System

(Course 1) Software to organize and manage data in a database

Database Normalization

(Course 1) A strategy for designing a database to conform to particular rules

Database Transaction

(Course 1) A set of DML statements bundled into one indivisible action
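To illustrate the "indivisible action" idea, here is a minimal sketch using Python's sqlite3 module. The accounts table, the CHECK constraint, and the transfer helper are all hypothetical, chosen only to force one of two UPDATE statements to fail so the rollback is visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("ann", 50), ("bob", 0)])
conn.commit()


def transfer(src, dst, amount):
    """Move amount from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances():
    """Return the current balances as a dict keyed by account name."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok_small = transfer("ann", "bob", 20)   # both updates take effect
ok_large = transfer("ann", "bob", 100)  # debit violates CHECK; credit is undone

print(ok_small, ok_large, balances())
```

In the failed transfer, the credit to bob had already run when the debit failed, yet neither change is kept.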

Database Trigger

(Course 1) An activity stored in a database that automatically occurs as part of DML statements

DBMS

(Course 1) Database Management System

DCL

(Course 1) See Data Control Language

DDL

(Course 1) See Data Definition Language

Delimiter

(Course 2) The character in a text-based data file that separates one field or column from the next; such a data file is said to be delimited
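As an illustration, the sketch below reads a small tab-delimited file with Python's csv module. The file contents are invented for the example and held in memory rather than on disk.

```python
import csv
import io

# A tiny tab-delimited "file"; the tab character ("\t") is the delimiter.
raw = "id\tname\tcity\n1\tAnn\tOslo\n2\tBob\tLima\n"

# csv.reader splits each line into fields wherever the delimiter occurs.
rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))
header, records = rows[0], rows[1:]

print(header)
print(records)
```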

Denormalization

(Course 1) A strategy of purposefully breaking rules of normalization

Derived Column

(Course 1) A column whose value is intended to be determined from values in one or more other columns, typically in the same table

Descending

(Course 2) From greatest to least (as opposed to ascending); for strings, reverse alphabetical order (Z to A) is descending order

Deserialize

(Course 3)  The process for decoding a stored file (see also serialize)

Deterministic Order

(Course 2) A predictable order, with no opportunity for the SQL engine to arbitrarily order two or more rows (such as it would in the case of a tie)

Digital Data

(Course 1) Information that can be transmitted, stored, and processed using modern digital technologies like the internet, disk drives, and modern computers; in this specialization and this course, data by itself typically means digital data (see also analog data)

Distributed Processing

(Course 1) Processing data across multiple computers, each processing a portion of the data with some interim reorganization (see shuffle); a characteristic of big data processing

Distributed Storage

(Course 1) Storage of a dataset across multiple disks, rather than on a single disk; a characteristic of big data storage

DML

(Course 1) See Data Manipulation Language

DQL

(Course 1) See Data Query Language

Durable

(Course 1) For transactions, guaranteeing that changes are persistent, that is, safely stored within the database so they will not be lost

Edge Node

(Course 3) See gateway node

Enforcing Business Rules

(Course 1) For databases, the result of good design of tables, including triggers, that allows operational systems to centralize management of business rules

Enterprise Data Hub

(Course 1) A collection of data, which can be of any type or size (also called data lake, data reservoir, or data store)

Equijoin

(Course 2) A join in which the join condition uses equality (compare with non-equijoin)

Escape Sequence

(Course 3) A sequence of characters that does not represent itself, but rather is translated into another character or a sequence of characters that might be difficult or impossible to represent directly; for example, \t is the escape sequence representing the tab character

ETL

(Course 1) Extract, transform, and load; a process used by analytic databases to harvest data from an operational system

Execution Plan

(Course 3) A description of the tasks required for a query, the order in which they'll be executed, and some details about each task

Explicit Casting

(Course 2) Issuing a function to cast a data element into a different data type (see also implicit casting)
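A brief sketch of explicit casting with the standard CAST syntax, run through Python's sqlite3 module as a stand-in engine. The type names used here (TEXT, INTEGER, REAL) are SQLite's; Hive and Impala use names such as STRING and INT instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each CAST converts a literal value to the requested data type.
as_text, as_int, as_real = conn.execute(
    "SELECT CAST(42 AS TEXT), CAST('42' AS INTEGER), CAST('3.5' AS REAL)"
).fetchone()

print(repr(as_text), repr(as_int), repr(as_real))
```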

Exponentiation

(Course 2) The mathematical operation x^y in which y is the exponent (also power function)

Expression

(Course 2) A combination of literal values, column references, operators, and functions

Externally Managed Table

(Course 3) See unmanaged table

Fetch Task

(Course 3) A simple query that does not require processing of data; in Hive, fetch tasks are not sent to the underlying processing engine

File System

(Course 3) A system for storing data in files that can be listed and accessed by different methods (compare with storage engine)

Foreign Key

(Course 1) One or more columns in a table that refer to a primary key in a different table

Foreign Key Constraint

(Course 1) The properties that define a foreign key

Full Outer Join

(Course 2) An outer join in which all rows from each table are included in the results regardless of whether a match for the join condition exists in the other table (compare with inner join, left outer join, and right outer join)

Function

(Course 2) A named process

Gateway Node

(Course 3) A computer that provides an interface between the Hadoop cluster and the outside network; also called edge node

GROUP BY List

(Course 2) A list of column references used in a GROUP BY clause

Grouping

(Course 2) Using GROUP BY to combine multiple rows that share a value (either of an existing column or of an expression), ideally in conjunction with aggregation

Hadoop 

(Course 1) See Apache Hadoop

Hadoop Distributed File System (HDFS)

(Course 1) The file system used for big data in Apache Hadoop

Hadoop File System (HDFS) Shell Commands

(Course 3) Commands issued at the command line that interact with HDFS; these commands start with hdfs dfs or hadoop fs

Hive

(Course 1, Course 2) See Apache Hive

Hive Metastore

(Course 1) The metastore used by both Apache Hive and Apache Impala

Hive on … (MapReduce, Spark, Tez)

(Course 3) The mode of operation for Apache Hive, identifying the underlying engine (MapReduce, Spark, or Tez) that an instance of Hive uses for processing queries

HiveQL

(Course 1) The dialect of SQL used by Apache Hive

Hive Warehouse Directory

(Course 3) The directory in HDFS within which Hive and Impala store table data by default; typically at the path /user/hive/warehouse

Home Directory

(Course 3) In HDFS, your personal user directory, typically /user/username/ where username is your Hadoop username

Hosts

(Course 1) A server in a network or cluster of machines for storage and processing; also called a node

Hue

(Course 2) Hadoop User Experience; an open source, browser-based graphical interface for working with big data

Hybrid Storage

(Course 1) A mix of on-premises and cloud storage

Identifier

(Course 2) A word in SQL used to identify a column, table, or database

Immutable

(Course 1) Unable to be changed 

Impala

(Course 1, Course 2) See Apache Impala

Impala Shell

(Course 2) The CLI for issuing commands to Apache Impala

Impala SQL

(Course 1) The dialect of SQL used by Apache Impala

Implicit Casting

(Course 2) The automatic casting (by Apache Hive, for example) of a data element into a different data type when incompatibilities would otherwise cause an error, with no guidance from the user (see also explicit casting)

Inner Join

(Course 2) A join in which a row from either table is included in the results only if a match for the join condition exists in the other table (compare with outer join)

Intelligent Keys

(Course 1) Keys (primary or foreign) for which the value has meaning other than its use as an identifying value

Interactive

(Course 2) For working with SQL engines at the command line, a mode in which each statement is entered directly and immediately acted upon, then a prompt is given in readiness for a new statement (compare with non-interactive)

Isolated 

(Course 1) For transactions, free of interaction with other transactions so two transactions occurring at the same time will not cause errors

JDBC

(Course 1) Java Database Connectivity, a type of interface for connecting to a database

Join

(Course 2) A combination of two tables, in which a row from one table is put together with a row from another table if the two rows satisfy a join condition (see also inner join and outer join)

Join Key Columns

(Course 2) The columns used in a join condition

Join Condition

(Course 2) A conditional expression, often (but not necessarily) an equality expression, used to determine which rows in a join should be considered a match and therefore be combined into one row in the results

JSON

(Course 1) JavaScript Object Notation; a format for identifying data: curly braces {} enclose a single row, and elements are identified by a key:value format, with the key in quotes and elements separated by commas

Key Constraint

(Course 1) The properties that define a key (see also foreign key constraint and primary key constraint)

Keyword

(Course 1) A word in SQL that has particular meaning (examples: SELECT, AS, and OR)

Lazy Initialization

(Course 3) A technique by which a processor does not instantiate objects until they’re needed

Left Outer Join

(Course 2) An outer join in which a row from the left (first) table is included in the results regardless of whether a match for the join condition exists in the other table, but a row from the right (second) table is only included if a match exists (compare with inner join, right outer join, and full outer join)
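The difference between an inner join and a left outer join can be sketched as follows, using Python's sqlite3 module as a stand-in engine. The customers and orders tables are hypothetical; bob has no matching order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "ann"), (2, "bob")])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10)])

# Same join condition for both queries; only the kind of join changes.
query = (
    "SELECT c.name, o.amount FROM customers c {kind} JOIN orders o "
    "ON c.id = o.customer_id ORDER BY c.name"
)

# Inner join: bob is dropped because no order matches.
inner_rows = conn.execute(query.format(kind="INNER")).fetchall()

# Left outer join: bob is kept, with NULL (None in Python) for the amount.
left_rows = conn.execute(query.format(kind="LEFT OUTER")).fetchall()

print(inner_rows)
print(left_rows)
```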

Left Semi-Join

(Course 2) A query technique using an inner join but returning only columns from the left table; since only rows that match are included, the join condition acts as a filter

Literal Value (Literals)

(Course 2) A specified, static value to be taken verbatim; it will not change

Locality

(Course 3) See data locality

Logical Operators

(Course 2) Operators that work with Boolean values (true or false); binary logical operators are AND and OR, and the unary logical operator is NOT

Loosely Coupled

(Course 1) When the metadata for a database may not contain all information about all the data in the database; the metadata does not govern the data (compare with tightly coupled)

Managed Table

(Course 3) In Hadoop, a table whose data is managed by Hive or Impala—dropping the table will also delete the data if permissions allow (see also unmanaged table)

Manipulate

(Course 2) For data, to make calculations and changes during analysis as a means to uncover more information; this does not make permanent changes to the data, nor is it a means to alter data dishonestly

MapReduce

(Course 1) A programming model and framework for processing data in a distributed system; one of the underlying engines on which Apache Hive can run HiveQL (SQL) statements

Metadata

  1. (Course 1) For a table, the definition of the columns by name and data type (see also schema)
  2. (Course 1) In general, information about data (for example, the artist, song title, and release date of an mp3 data file)

Metastore

(Course 1) A relational database within a big data system that holds metadata such as table definitions; it replaces the data dictionary of a conventional RDBMS (see also Hive metastore)

Modulus 

(Course 2) A binary operation that (put simply) returns the remainder of division, typically using % as the binary operator (for example, if a = bx + c where x is an integer and 0 ≤ c < b, then a % b = c)
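A quick worked check of the relationship between division and remainder, sketched in Python with arbitrarily chosen non-negative values a = 17 and b = 5. Results for negative operands differ between languages and SQL engines, so they are left out here.

```python
a, b = 17, 5

# divmod returns the integer quotient x and the remainder c together.
x, c = divmod(a, b)

# The % operator returns only the remainder.
remainder = a % b

print(f"{a} = {b} * {x} + {c}")
print(f"{a} % {b} = {remainder}")
```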

Namespace

(Course 3) A domain of names, such as databases in a big data system; different objects (such as tables) with the same name can exist in different namespaces and still be uniquely identified 

Nodes

(Course 1) Servers used as part of a storage or processing network called a cluster; also called hosts

Non-Aggregate Expressions

(Course 2) Expressions that operate on a single row and return one value per row; also called scalar expressions (compare with aggregate expressions)

Non-Equijoin

(Course 2) A join in which the join condition uses inequality rather than equality (compare with equijoin)

Non-Interactive

(Course 2) For working with a SQL engine, a mode in which you specify statements to be run when connecting to the engine, after which the engine is disconnected; sometimes called batch mode (compare with interactive)

Normalized

(Course 1) For a database, designed to conform to particular rules

NoSQL

(Course 1) A category of operational systems that do not mandate a schema on records; these systems physically organize records by a specific lookup key

NULL

(Course 1, Course 2) A missing or unknown data value; this is different from 0 for numbers or an empty string; NULL might signal bad data, but it could also mean not applicable or indeterminate

NULL-Safe Operator

(Course 2) An equality operator using the symbol <=> that evaluates NULL <=> x as true instead of NULL if x is NULL, and as false instead of NULL if x is not NULL; this is the same as IS NOT DISTINCT FROM
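The behavior can be sketched with Python's sqlite3 module. SQLite does not provide the <=> symbol, so this example substitutes SQLite's IS operator, which is its NULL-safe equality; treat that swap as an assumption of the sketch rather than Hive or Impala syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Ordinary = yields NULL whenever an operand is NULL.
# SQLite's IS (standing in for <=>) yields true (1) or false (0) instead.
plain_null, plain_value, safe_null, safe_value = conn.execute(
    "SELECT NULL = NULL, NULL = 1, NULL IS NULL, NULL IS 1"
).fetchone()

print(plain_null, plain_value)  # NULL is returned to Python as None
print(safe_null, safe_value)
```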

ODBC

(Course 1) Open Database Connectivity, a type of interface for connecting to a database

On-Premises Storage

(Course 1) An option for data storage in which the data is stored in machines located in the institution's building, rather than remotely (such as "in the cloud"); sometimes shortened to "on-prem"

Online Transaction Processing (OLTP) System

(Course 1) A type of operational database that supports transactions and business rules

Operand

(Course 2) An entity that functions as an argument in an operation (for example, in 2 + 3, the 2 and 3 are operands for the addition operation)

Operational Database

(Course 1) A database focused on storing data that provides information about the current state of a process or system; most queries will be DML statements and simple data lookups (see also analytic database)

Operator

(Course 2) A symbol that represents a process (such as a calculation) acting on one or more entities (called operands) (for example, in 2 + 3, the + is the operator)

Options

(Course 2) Arguments for a CLI command

ORC

(Course 3) See Apache ORC

ORDER BY List

(Course 2) The comma-separated list of columns or expressions in the ORDER BY clause, used to determine how the result rows should be sorted

Outer Join

(Course 2) A join in which a row from either table is included in the results, even if no match for the join condition exists in the other table (compare with inner join and see also left outer join, right outer join, and full outer join)

Pagination (or Paging)

(Course 2) Separating a query into multiple queries to return discrete sections, as if each component query produced one page of rows
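One common way to paginate is with LIMIT and OFFSET on top of a deterministic ORDER BY. The sketch below uses Python's sqlite3 module as a stand-in engine; the items table, PAGE_SIZE, and the fetch_page helper are hypothetical names introduced for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER)")
conn.executemany("INSERT INTO items VALUES (?)", [(i,) for i in range(1, 8)])

PAGE_SIZE = 3


def fetch_page(page_number):
    """Return the ids on the given 1-based page."""
    offset = (page_number - 1) * PAGE_SIZE
    rows = conn.execute(
        # ORDER BY keeps the row order stable from one page query to the next.
        "SELECT id FROM items ORDER BY id LIMIT ? OFFSET ?",
        (PAGE_SIZE, offset),
    ).fetchall()
    return [row[0] for row in rows]


pages = [fetch_page(n) for n in (1, 2, 3)]
print(pages)
```

Without the ORDER BY, the engine is free to return rows in any order, so the same row could appear on two pages or on none.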

Parameterization

(Course 2) Removing hard-coded values and replacing them with set variables or parameters that can be passed through a command

Parquet

(Course 3) See Apache Parquet

Partition Pruning

(Course 1) Skipping table partitions that are irrelevant to a particular query

Partitioning

(Course 1, Course 3) See table partitioning

Petabyte

(Course 1) 1000 terabytes

Power Function

(Course 2) The mathematical operation x^y in which x is raised to the power of y (also exponentiation)

Protocol

(Course 3) A set of rules that allow computers to communicate with each other; a URI (Uniform Resource Identifier) starts with a protocol such as https://, hdfs://, or s3://

Primary Key

(Course 1) One or more columns used to uniquely identify a row

Primary Key Constraint

(Course 1) The properties that define a primary key

Pseudocolumn

(Course 3) An element within a column of the complex data types ARRAY or MAP; in Impala, these complex types are treated as if they were tables with pseudocolumns

Pushdown

(Course 2) An approach used by business intelligence (BI) and analytics applications in which data transformation operations are performed by a SQL engine (they are pushed down to the SQL engine) and the results are loaded back into the application; contrast this with the approach in which the full data is loaded into the application and data transformations are performed by the application

Qualified Path

(Course 3) A path to a file location that includes the protocol (see also unqualified path)

Qualified Table Name

(Course 2) The name of the table, prepended by the database it's from and a dot (.) to separate the database name from the table name
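A small sketch of qualified table names using Python's sqlite3 module. SQLite has no CREATE DATABASE statement, so this example attaches a second in-memory database under the hypothetical name sales. That is an assumption of the sketch; only the database.table dot notation carries over to Hive and Impala.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite names the default database "main"
conn.execute("ATTACH DATABASE ':memory:' AS sales")

# Two tables share the name "orders" but live in different databases.
conn.execute("CREATE TABLE main.orders (amount INTEGER)")
conn.execute("CREATE TABLE sales.orders (amount INTEGER)")
conn.execute("INSERT INTO main.orders VALUES (99)")
conn.execute("INSERT INTO sales.orders VALUES (10)")

# The qualified name (database, dot, table) picks out exactly one of them.
main_amounts = conn.execute("SELECT amount FROM main.orders").fetchall()
sales_amounts = conn.execute("SELECT amount FROM sales.orders").fetchall()

print(main_amounts, sales_amounts)
```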

RCFile

(Course 3) A columnar file format (RC stands for Record Columnar) for Apache Hive; largely replaced by Apache ORC

RDBMS

(Course 1) Relational Database Management System, a traditional database system in which table definitions also define what data can be stored (that is, only data that fits into an existing table can be stored in the system)

Redirection

(Course 2) A way of specifying that the output of a process is to be written to a file rather than to the screen (standard output)

Relation

(Course 1) A formal concept in relational database theory—a table is a representation of this formal concept

Reserved Word

(Course 2) A word in SQL that cannot be used as an identifier without quoting with backticks; these typically include keywords and property names but can change among engines, and among versions of the same engine 

Right Outer Join

(Course 2) An outer join in which a row from the right (second) table is included in the results regardless of whether a match for the join condition exists in the other table, but a row from the left (first) table is only included if a match exists (compare with inner join, left outer join, and full outer join)

Row

(Course 1) A record within a data set

S3 URL

(Course 3) Also S3 URI; the unique path to a file or directory in an Amazon S3 bucket, typically in the form s3a://bucketname/path/to/file.ext (for HDFS shell commands) or s3://bucketname/path/to/file.ext (for the AWS CLI)

Scalar Expressions

(Course 2) Expressions that operate on a single row and return one value per row; also called non-aggregate expressions (compare with aggregate expressions)

Schema

  1. (Course 1) For a table, the definition of the columns by name and data type (see also metadata)
  2. (Course 2) A logical container of a collection of tables (see also database)

Schema Evolution

(Course 3) Automatic transformation of table schema to match changes made to data

Schema-on-Read

(Course 1) A characteristic of databases that store data without validation against schema—structure is imposed and data is validated when read from storage into an active data structure, and how invalid data is handled depends on the database system and configurations

Schema-on-Write

(Course 1) A characteristic of databases that require data to conform to schema before it can be stored—invalid data is rejected and not stored

Search Engines

(Course 1) A category of data stores that specialize in quick searches of undifferentiated documents with imprecise matches (allowing synonyms, misspellings, and different forms of a word)

SELECT Clause

(Course 2) The portion of a SELECT statement starting with SELECT and ending before the next clause (typically FROM)

SELECT List

(Course 2) The list of column references, expressions, and literal values in the SELECT clause that specify what the columns of the result set will be

Self Describing File

(Course 3) A file that embeds information about what's in the file

Self-join

(Course 2) A join that combines a table with itself

Semi-Structured Data

(Course 1) Data in which fields in a record are tagged, but there is no definite schema that all records are guaranteed to meet; also, data with some structure, but without the regularity or schema of structured data (for example, log files)

SequenceFiles

(Course 3) A file format that stores key-value pairs in a binary container

SerDe

(Course 3) Serializer/deserializer; an interface for converting data to and from stored files

Serialize

(Course 3) The process for converting data to bytes so it can be stored (see also deserialize)

Shuffle (or Shuffle and Sort)

(Course 1, Course 3) An interim phase of distributed processing, in which partially processed data is sorted and redistributed across multiple machines before additional processing can be completed

Small Files Problem

(Course 3) A common issue for distributed (big data) systems, which are optimized for working with large files; when there are many small files, the system loses efficiency

Sparse Data

(Course 2) Data in which a large proportion of the values are NULL

SQL

(Course 1) Structured Query Language, the seminal language for working with databases (pronounced either ess-cue-ell or sequel)

SQL Client

(Course 2) A utility or application that allows you to enter SQL statements, run them, and see the results

SQL Query Utility

(Course 2) See SQL Client

Storage Engine

(Course 3) A system that encapsulates data storage; such a system will manage the data using an abstraction that hides the details of how the data is stored and accessed (compare with file system)

Stored Procedure

(Course 1) A routine that can be called by users and programs that performs some sequence of actions in a database

String

(Course 1, Course 2) A value consisting of characters; also a data type (or data type category) using string values

Structured Data

(Course 1) Data that conforms to a set schema

Table

(Course 1) A holding area for different kinds of data, in which a record is a row separated into columns

Table Bucketing

(Course 1) A method used with big data to divide tables into sections in a somewhat arbitrary or unpredictable way, useful for sampling data when the full data is not needed (compare with table partitioning)

Table Partitioning

(Course 1, Course 3) A method used with big data to divide tables into predictable sections that can speed access by allowing sections irrelevant to a particular query to be skipped entirely, which is called partition pruning (compare with table bucketing)

Terabyte

(Course 1) 1000 gigabytes

Terminal

(Course 2) Operating system command line (CLI) and gateway to other CLIs

Third Normal Form (3NF)

(Course 1) A particular set of rules for a normalized database

Tie

(Course 2) A case in which the value (or values) in the column (or columns) in the ORDER BY clause are the same for two or more rows

Tightly Coupled

(Course 1) When the metadata is strictly applied to the data, governing what is allowed to be stored as data within the database (compare with loosely coupled)

Top-N Query

(Course 2) A query that returns some number of results whose values are the greatest of the values in the data (compare with bottom-N query)
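Top-N and bottom-N queries are often written with ORDER BY and LIMIT. The sketch below shows both with N = 2, using Python's sqlite3 module as a stand-in engine and a hypothetical scores table with no tied values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("ann", 70), ("bob", 95), ("cy", 40), ("dee", 85)],
)

# Top-N: sort descending, keep the first N rows.
top_2 = conn.execute(
    "SELECT player, points FROM scores ORDER BY points DESC LIMIT 2"
).fetchall()

# Bottom-N: sort ascending, keep the first N rows.
bottom_2 = conn.execute(
    "SELECT player, points FROM scores ORDER BY points ASC LIMIT 2"
).fetchall()

print(top_2)
print(bottom_2)
```

If the data contained ties at the cutoff, a second ORDER BY column would be needed to make the result deterministic.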

Transaction

(Course 1) See Database Transaction

Transformation

(Course 1, Course 2) A change in data, often a change in format or calculation for storage or query purposes (see also manipulate)

Trigger 

(Course 1) See Database Trigger

Type Conversion

(Course 2) Changing a data entity from one data type to another

Query

(Course 2) A SELECT statement in the SQL language

Result Set (also result)

(Course 2) The data returned by a query

UDF

(Course 1) See user-defined function

Unary Operator

(Course 2) An operator that uses one operand; the negative sign (-) is an example of a unary operator

Unmanaged Table

(Course 3) Also called external or externally managed table, in Hadoop, a table whose data is not managed by Hive or Impala—dropping the table using Hive or Impala will not delete the data (see also managed table)

Unqualified Path

(Course 3) A path to a file location that does not include the protocol; the protocol is assumed depending on the command (see also qualified path)

Unstructured Data

(Course 1) Data without clear, definite structure; examples include natural language text and media files such as photos, audio files, and movies

User-Defined Function

(Course 1) A function that is created in a general programming language and added to the database software

Utility Statement

(Course 2) A SQL statement that provides information about the database and its tables, but not actual data

XML

(Course 1) A hierarchical system of tags to label and organize data

Virtual Machine (VM)

(Course 1)

  1. Generally, an image of a computer that can be run within another host machine, through software such as VMware Workstation Player, VMware Fusion, or VirtualBox
  2. The hands-on environment for this specialization