Understanding Tokens In C/C++


Introduction 

Have you ever wondered how a compiler tells whether a word such as int is being used to declare a variable or is itself the name of a variable?

We all have at some point used <int> to declare a variable with an integral value. But have you ever wondered how the compiler identifies that <int> is being used for this special purpose? It is because the compiler recognises this word as a special reserved word – a keyword.

Keywords belong to the category of the smallest elements of a program that are meaningful to the compiler. These elements are called tokens.

What are Tokens? 

Like every other complex thing in the world, each program we write is made by building up from the smallest and most basic elements. The smallest element of a program that is meaningful to the compiler is called a token.

Every programming language defines its own set of tokens. In this article, we will focus on the tokens in C/C++. Although the two languages have similar types of tokens, C++ groups its literals differently and adds user-defined literals.

We have the following types of tokens in C/C++ programming languages:

(Note that ‘Yes’ indicates that the given token is considered as a token for a particular language.)

Token                                    C      C++
keyword                                  Yes    Yes
identifier                               Yes    Yes
constant                                 Yes    No
Numeric, Boolean and Pointer Literals    No     Yes
String and Character Literals            Yes    Yes
User-Defined Literals                    No     Yes
punctuator                               Yes    Yes

In the following sections, we will discuss in detail each of these tokens along with their examples. 

Keywords

Look at the simple C++ code given below to add two numbers.

#include <iostream>
using namespace std;

int main()
{
    int x, y, sum;

    // taking the value of the two numbers
    cout << "Enter the two integers you want to add: ";
    cin >> x >> y;

    // storing the sum of two integers in sum
    sum = x + y;

    // prints sum
    cout << x << " + " << y << " = " << sum;

    return 0;
}

Output:

Enter the two integers you want to add: 3
2
3 + 2 = 5

If we observe the code, we can identify certain words that appear very frequently in the programs we write. The words <int> and <return> are two such words. These are identified as keywords in C/C++. Keywords are predefined reserved words that have a special meaning to the compiler. These cannot be used as identifiers.

Some of the reserved keywords in C/C++ are given below. Note that private, public, protected and new are keywords only in C++.

auto         break      case
return       int        char
bool         private    public
protected    false      true
if           for        else
float        while      new

For the full list of keywords, please refer to Keywords (C++) and C Keywords.

Identifiers

Identifiers are symbols or words that one supplies to the variables, functions, types, classes, objects and other such components of one’s code. If we look again at the program to add two numbers in C++, we observe that to identify the value of the first number, we use the identifier ‘x’, for the second number, ‘y’ and for the sum of the two, we use ‘sum’. 

There are some rules that must be followed while naming identifiers in C/C++. These are as follows:

  • Keywords cannot be used as identifiers. However, identifiers that contain a keyword are legal. For example, ‘Tint’ is a legal identifier, but ‘int’ is not. 
  • Identifiers are case-sensitive. Thus ‘FileName’ and ‘fileName’ are two different identifiers and refer to different entities.
  • The first character of an identifier must be an alphabetic character, either uppercase or lowercase, or an underscore ( _ ). Therefore, ‘2numbers’ is an illegal identifier. 

Each identifier has a scope or visibility. This scope is the region of the program in which this identifier can be accessed. It may be limited (in order of increasing restrictiveness) to the file, function, block, or function prototype in which it appears. 

Constant

A constant is a token in C that corresponds to a number, character, or character string that can be used as a value in a program. Every constant has a type and a value on the basis of which, constants are categorised into the following types:

  • Floating Point Constants: It is a decimal number that represents a signed real number. The representation of a signed real number includes an integer portion, a fractional portion, and an exponent. 
  • Integer Constants: It is a decimal (base 10), octal (base 8), or hexadecimal (base 16) number that represents an integral value. We use these to represent integer values that cannot be changed.
  • Character Constants: A “character constant” is formed by enclosing a single character from the representable character set within single quotation marks (‘ ‘). 
  • Enumeration Constants: The named integer identifiers that are defined by enumeration types are called enumeration constants. To read more on enumeration, you might want to refer to C enumeration declarations.

//floating point constants
15.75
1.575E1   /* = 15.75   */
1575e-2   /* = 15.75   */
-2.5e-3   /* = -0.0025 */
25E-4     /* =  0.0025 */


//integer constants
28
0x1C   /* = Hexadecimal representation for decimal 28 */
034    /* = Octal representation for decimal 28 */


//character constants
char    schar =  'x';   /* A character constant          */
wchar_t wchar = L'x';   /* A wide-character constant for
                            the same character           */

Numeric, Boolean and Pointer Literals

Numeric, boolean and pointer literals are classified as tokens only in C++. Before looking at each of them, let us understand the term ‘literal’: a literal is a token of a program that directly represents a value.

Take a look at the following:

const int answer = 20;      // integer literal
double d = sin(107.87);     // floating point literal passed to sin func
bool b = false;             // boolean literal
TestClass* mc = nullptr;    // pointer literal

The tokens 20, 107.87, false and nullptr each directly represent a value. Thus, these are literals. Let’s discuss each of these types of literals.

Integer Literal
In the example given above, the value 20 is the integer literal. An integer literal can have an optional prefix and an optional suffix. The prefix indicates the base in which the literal is to be read, while the suffix indicates its type. The following examples make this clearer.


12345678901234LL   /* a long long integer value, because of the suffix LL */
0x10               /* = 16; the prefix 0x indicates the hexadecimal base */

Boolean Literal
In the example, ‘false’ is a boolean literal. Boolean literals represent the values of the bool data type, which can only have two values: true and false.

Pointer Literal
In the example, ‘nullptr’ is referred to as the pointer literal. C++ introduces the nullptr literal to represent a null pointer, that is, a pointer that does not point to any object.

Character and String Literals

String literals are tokens in both C and C++, while C++ also places character literals in this group (in C, character constants belong to the constant category). A character literal represents a single character written within single quotes. To store multiple characters, one must use a character array or a string.

If we put multiple characters in a character literal, most compilers issue a warning. The value of such a literal is implementation-defined, and assigning it to a char commonly keeps only the last character.

A String literal is also similar to a character literal except that it can represent multiple characters written within double-quotes. It can also contain special characters. 

Here is a piece of code that illustrates the two. 

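#include <iostream>
#include <string>
using namespace std;

int main()
{
    const string str = "Welcome to Coding Ninjas.";
    cout << str << '\n';    // newline so that the character prints on its own line
    const char character = 'x';
    cout << character;
    return 0;
}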

Output:

Welcome to Coding Ninjas.
x

User-defined literals

User-defined literals were added to the language in C++11. Recall the six major kinds of literals: integer, floating-point, boolean, string, character and pointer. Building on these, we can also define our own literals. These are called UDLs or User-Defined Literals.

The need for UDLs arises when the built-in literals are insufficient. UDLs are supported only in suffix form. To get a clearer understanding of this, take a look at the following example.

27h                // hours
3.6i                // imaginary

The suffix ‘h’ is used to define an hour literal and ‘i’ is used to define an imaginary number literal. Thus, these literals help us directly represent values in hours and imaginary numbers.

Punctuators 

Punctuators are tokens in C and C++ that have syntactic and semantic meaning to the compiler, but whose exact function depends on the context. Some punctuators, either alone or in combination, can also be C++ operators or be significant to the preprocessor. Following are some examples of punctuators.

! % ^ & * ( ) - + = { } | ~
[ ] \ ; ' : " < > ? , . / #

Frequently Asked Questions

What are the tokens in C++?

The smallest element of a program that is meaningful to the compiler is called a token. Some of the tokens in C++ identified by the compiler are keywords, identifiers, punctuators, literals etc.

Is ++ a token in C?

Yes. ++ is the increment operator, a unary operator, and it is identified as a single token in both C and C++.

What is a C token with an example?

A token is the smallest element that is meaningful to the compiler. For example, keywords like int and return are considered tokens. The tokens identified in C are:

1. Keywords
2. Identifiers
3. Strings
4. Operators
5. Constant
6. Special Characters

How many types of tokens are there in C++?

There are broadly seven types of tokens in C++ and these are as follows:

1. Keywords
2. Identifiers
3. Numeric, Boolean and Pointer Literals
4. String and Character Literals
5. User-Defined Literals
6. Operators
7. Punctuators

Key Takeaways

Every program has certain tokens which are the smallest elements that are meaningful to the compiler. In C/C++ we have keywords, identifiers, constants, literals and punctuators as tokens. In this article, we discussed each of these in detail, along with examples. 

We hope that this blog on tokens in C/C++ has helped you learn more about the concept.

By Khushi Sharma